Hi Susam

My urls file contains a single URL:
[EMAIL PROTECTED] conf]$ hadoop dfs -cat urls/urllist.txt
http://lucene.apache.org

I'm using the crawl-urlfilter.txt suggested in the tutorial, i.e. changing
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
to read
+^http://([a-z0-9]*\.)*apache.org/
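
For context, that accept rule sits just above the catch-all skip rule, so the relevant part of my conf/crawl-urlfilter.txt ends like this (the other default skip rules are unchanged, and I'm quoting the comments from memory):

# accept hosts in apache.org
+^http://([a-z0-9]*\.)*apache.org/

# skip everything else
-.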

When I run
nutch crawl urls -dir crawled.11 -depth 3 
it stops at depth 0. 

After changing crawl-urlfilter.txt to match regex-urlfilter.txt and running
nutch crawl urls -dir crawled.12 -depth 1
I see more output, so the crawl may have worked. However, I still don't know
how to extract the crawled files.
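
My best guess is that I need to dump the fetched segments with the readseg tool, roughly as below (the segment directory name is a placeholder for whatever timestamped directory the crawl created, and dump.12 is just an output directory I've made up):

hadoop dfs -ls crawled.12/segments
nutch readseg -dump crawled.12/segments/<segment-dir> dump.12

Is this the right approach, or is there a better way to get at the downloaded pages?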

Looking at hadoop.log, I noticed an error message about the robots
settings:
2008-01-29 07:54:00,508 FATAL api.RobotRulesParser - Agent we advertise 
(University of Edinburgh) not listed first in 'http.robots.agents' property!
So I corrected the settings (in nutch-site.xml) and ran the same crawl again. 
This time it stopped at depth 0.
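
In case it's relevant, these are the properties I set in conf/nutch-site.xml (the values are my own; in particular I'm not sure whether the http.robots.agents value is formatted the way Nutch expects):

<property>
  <name>http.agent.name</name>
  <value>University of Edinburgh</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>University of Edinburgh,*</value>
</property>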

My hadoop.log is too large to attach, so please let me know if you want me to 
send it directly.

thanks and regards
Barry

On Tuesday 29 January 2008, Susam Pal wrote:
> If the crawl stops at depth=0, it means there is nothing to fetch in
> the first fetch cycle itself. Therefore there is no data to extract.
>
> Also, you mention crawl-urlfilter.xml in your message. I hope
> this was a typo because there is no such file. The actual filter is
> crawl-urlfilter.txt.
>
> For the crawl that stops at depth=1, you can see what was crawled
> at depth 0 in the logs/hadoop.log file. See whether anything
> failed at the first depth.
>
> If you are not able to solve the problem, please provide the following
> information along with your query.
>
> 1. The hadoop.log file in the logs directory.
> 2. The command you used to run the crawl.
> 3. The changes you made in conf/crawl-urlfilter.txt.
> 4. Does the site you are crawling have links to other pages?
>
> Regards,
> Susam Pal
>
> On Jan 29, 2008 1:04 AM, Barry Haddow <[EMAIL PROTECTED]> wrote:
> > Hi
> >
> > I'm trying to get the nutch/hadoop example from
> > http://wiki.apache.org/nutch/NutchHadoopTutorial
> > running.
> >
> > I've set up the urllist.txt and the crawl-urlfilter.xml as instructed in
> > the tutorial, but whenever I run the crawl it either reports
> >
> > Generator: 0 records selected for fetching, exiting ...
> > Stopping at depth=1 - no more URLs to fetch.
> >
> > or
> >
> > Generator: 0 records selected for fetching, exiting ...
> > Stopping at depth=0 - no more URLs to fetch.
> >
> >
> > I can't tell if the crawler has managed to fetch any data. How can I
> > extract whatever data it has downloaded?
> >
> > thanks,
> > Barry
