Hi All,

I'm fairly new to Nutch/Hadoop, but I'm starting to get the hang of it. I followed the Nutch/Hadoop tutorial and got my modified version of Nutch, which downloads Atom feed URLs, working fairly well. The system works perfectly in unit tests in Eclipse, but shows the following strange behavior when I run it on DFS/Hadoop on my Linux deployment machine.

I have a particular URL family (same host/path structure with a different parameter) that points to my company's intranet blog entries. When I bootstrap my crawler with a URL file containing *only* these URLs, the generator running on DFS/Hadoop can't find any URLs to generate. However, if I put a single URL from a different host (an external Atom feed) at the top of the list, the generator quite happily passes all the feeds on to the fetcher.

I've played around quite extensively with all the various conf files that have URL patterns in them and tried to make them as accepting as possible; in particular, I comment out all the (-) patterns and add a (+) catch-all at the end. However, with the same configuration I don't see this behavior in the unit tests, so I hesitate to blame the configuration files themselves.
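For what it's worth, after those edits my regex-urlfilter.txt (and the matching crawl-urlfilter.txt) ends up looking roughly like the trimmed-down sketch below, with the stock (-) rules commented out and a catch-all (+) at the end (from memory, so the exact patterns may differ slightly from what ships with your version):

    # skip file:, ftp:, and mailto: urls
    #-^(file|ftp|mailto):

    # skip image and other suffixes we can't yet parse
    #-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|zip|exe|gz|rpm|tgz|mov|MOV)$

    # skip URLs containing certain characters as probable queries, etc.
    #-[?*!@=]

    # accept everything else
    +.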
Thanks in advance for any help.

- Ben
