Hi All,

I'm fairly new to Nutch/Hadoop, but I'm starting to get the hang of it. I followed the Nutch/Hadoop tutorial and got my modified version of Nutch, which downloads Atom feed URLs, working fairly well. The system works perfectly in unit tests in Eclipse, but shows the following strange behavior when I run it on DFS/Hadoop on my Linux deployment machine.

I have a particular URL family (same host/path structure with a different parameter) that points to my company's intranet blog entries. When I bootstrap my crawler with a URL file containing *only* these URLs, the generator running on DFS/Hadoop can't find any URLs to generate. However, if I put a single URL from a different host (an external Atom feed) at the top of the list, the generator quite happily passes all the feeds on to the fetcher.

I've played around quite extensively with all the various conf files that have URL patterns in them and tried to make them as accepting as possible; in particular, I comment out all the (-) patterns and add a (+) catch-all at the end. However, with the same configuration I don't see this behavior in the unit tests, so I hesitate to blame the configuration files themselves.
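For what it's worth, after those edits my regex-urlfilter.txt (and the matching crawl-urlfilter.txt) ends up looking roughly like the trimmed-down sketch below, with the stock (-) rules commented out and a catch-all (+) at the end (from memory, so the exact patterns may differ slightly from what ships with your version):

    # skip file:, ftp:, and mailto: urls
    #-^(file|ftp|mailto):

    # skip image and other suffixes we can't yet parse
    #-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|zip|exe|gz|rpm|tgz|mov|MOV)$

    # skip URLs containing certain characters as probable queries, etc.
    #-[?*!@=]

    # accept everything else
    +.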
Thanks in advance for any help.

- Ben
