Hi Susam

My urls file is:

[EMAIL PROTECTED] conf]$ hadoop dfs -cat urls/urllist.txt
http://lucene.apache.org
I'm using the crawl-urlfilter.txt suggested in the tutorial, i.e. changing

+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

to read

+^http://([a-z0-9]*\.)*apache.org/

When I run "nutch crawl urls -dir crawled.11 -depth 3" it stops at depth 0.

After changing crawl-urlfilter.txt to match regex-urlfilter.txt and running "nutch crawl urls -dir crawled.12 -depth 1", I see more output, so maybe the crawl has worked. However, I still don't know how to extract the crawled files.

Looking at hadoop.log, I noticed an error message about the robot settings:

2008-01-29 07:54:00,508 FATAL api.RobotRulesParser - Agent we advertise (University of Edinburgh) not listed first in 'http.robots.agents' property!

So I corrected the settings (in nutch-site.xml) and ran the same crawl again. This time it stopped at depth 0.

My hadoop.log is too large to attach, so please let me know if you want me to send it directly.

thanks and regards
Barry

On Tuesday 29 January 2008, Susam Pal wrote:
> If the crawl stops at depth=0, it means there was nothing to fetch in
> the first fetch cycle itself, and therefore there is no data to
> extract.
>
> Also, you mention crawl-urlfilter.xml in your message. I hope this was
> a typo, because there is no such file; the actual filter is
> crawl-urlfilter.txt.
>
> For the crawl that stops at depth=1, you can see what was crawled at
> depth 0 in the logs/hadoop.log file. Check whether anything failed at
> the first depth.
>
> If you are not able to solve the problem, please provide the following
> information along with your query:
>
> 1. The hadoop.log file in the logs directory.
> 2. The command you used to run the crawl.
> 3. The changes you made in conf/crawl-urlfilter.txt.
> 4. Does the site you are crawling have links to other pages?
>
> Regards,
> Susam Pal
>
> On Jan 29, 2008 1:04 AM, Barry Haddow <[EMAIL PROTECTED]> wrote:
> > Hi
> >
> > I'm trying to get the nutch/hadoop example from
> > http://wiki.apache.org/nutch/NutchHadoopTutorial
> > running.
> >
> > I've set up the urllist.txt and the crawl-urlfilter.xml as
> > instructed in the tutorial, but whenever I run the crawl it reports
> > either
> >
> > Generator: 0 records selected for fetching, exiting ...
> > Stopping at depth=1 - no more URLs to fetch.
> >
> > or
> >
> > Generator: 0 records selected for fetching, exiting ...
> > Stopping at depth=0 - no more URLs to fetch.
> >
> > I can't tell whether the crawler has managed to fetch any data. How
> > can I extract whatever data it has downloaded?
> >
> > thanks,
> > Barry
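On the RobotRulesParser FATAL in Barry's log: that message is raised when the advertised agent name (http.agent.name) is not the first entry in the http.robots.agents property. A nutch-site.xml fragment along these lines usually clears it; the property names come from the standard Nutch configuration, and the agent string here is taken from Barry's log:

```xml
<!-- conf/nutch-site.xml: the advertised agent must be listed first in
     http.robots.agents, followed by any fallback agents. -->
<property>
  <name>http.agent.name</name>
  <value>University of Edinburgh</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>University of Edinburgh,*</value>
</property>
```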
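One detail in the exchange above that may explain the depth-0 stop: Barry's filter regex ends with a trailing slash, but the seed URL in urllist.txt has none, so the seed itself may be rejected before anything is fetched. A quick sketch of how that regex behaves, using plain `grep -E` rather than Nutch's actual filter plugin:

```shell
# The regex part of Barry's crawl-urlfilter.txt line (after the leading '+').
PATTERN='^http://([a-z0-9]*\.)*apache.org/'

# The seed exactly as listed in urllist.txt -- no trailing slash:
echo "http://lucene.apache.org" | grep -Eq "$PATTERN" && echo accepted || echo rejected

# The same seed with a trailing slash added:
echo "http://lucene.apache.org/" | grep -Eq "$PATTERN" && echo accepted || echo rejected
```

The first check prints "rejected" and the second "accepted", which suggests adding a trailing slash to the seed URL (or relaxing the pattern) is worth trying.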
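On the still-open question of how to get at the crawled data: assuming the crawl fetched anything at all, the crawl directory (crawled.12 in Barry's runs) can be inspected with Nutch's read tools. A sketch, with SEGMENT_NAME as a placeholder since segment directories are timestamped per fetch cycle:

```shell
# Summary statistics for the crawl database: how many URLs are known,
# fetched, unfetched, etc. Zero fetched entries means nothing was crawled.
bin/nutch readdb crawled.12/crawldb -stats

# List the segments produced by the crawl (one per fetch cycle):
bin/hadoop dfs -ls crawled.12/segments

# Dump the contents of one segment (substitute the real timestamped
# segment directory) as plain text into dump_dir:
bin/nutch readseg -dump crawled.12/segments/SEGMENT_NAME dump_dir
```

These commands require a working Nutch installation and an existing crawl directory, so they are shown here as an untested outline of the usual inspection workflow.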