On Sunday 08 April 2007 10:43:15 class acts wrote:
> Hi All,
>
> First of all, thanks to all the developers working on this project;
> from the looks of it, this project has great potential. I've been
> playing around with version 0.9 for the past couple of days and I have
> a few questions regarding its usage.
>
> At this time I'm particularly interested in doing the following:
>
> 1. Mirroring a complete website like abc.com without leaving its
> confines. I followed the 0.8 tutorial and basically did:
>
> # add one site to the list
> mkdir sites && echo 'www.abc.com' > sites/urls
>
> # inject that one site into the WebDB?
> bin/nutch inject crawl/crawldb sites
>
> # generate segments (whatever this means - I assume it will just add
> # the www.abc.com link)
> bin/nutch generate crawl/crawldb crawl/segments
>
> # put the segment path into s1 (I use csh, hence the `set`)
> set s1=`ls -d crawl/segments/2* | tail -1`
> echo $s1
>
> # seems to fetch 50 or so pages linked from abc.com
> bin/nutch fetch $s1
>
> # put the results in the actual database
> bin/nutch updatedb crawl/crawldb $s1
>
> AFAIK, the above does one round of fetching, but there are still
> plenty of pages yet to be fetched. So I did:
>
> while (1)
>   bin/nutch generate crawl/crawldb crawl/segments -topN 50
>   set s1=`ls -d crawl/segments/2* | tail -1`
>   bin/nutch fetch $s1
>   bin/nutch updatedb crawl/crawldb $s1
> end
>
> but I noticed that it just fetches the same 50 pages over and over
> again. Isn't there a way to tell it to keep fetching pages that
> haven't been fetched yet? I would assume the first run would have
> found enough links to keep going. Is the above procedure meant to be
> a one-time run only? How do I know when the site has been completely
> indexed?

Have you checked your logs? It looks like the last step, bin/nutch
updatedb ..., is failing. It would probably also make sense to set
-topN higher than 50: topN is the number of pages to fetch for a given
segment, so each generate/fetch round is capped at that many pages.
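Once updatedb succeeds, each new generate should pick up the URLs that
were discovered but not yet fetched in the previous round, so a loop
very much like yours should work. A rough sketch in plain sh rather
than csh, using only the commands from your own steps (the 5 rounds and
-topN 1000 are arbitrary example values, not recommendations):

#!/bin/sh
# Sketch only: run a few generate/fetch/updatedb rounds.
for round in 1 2 3 4 5
do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  # pick up the segment that generate just created (newest one sorts last)
  s1=`ls -d crawl/segments/2* | tail -1`
  echo "round $round: fetching segment $s1"
  bin/nutch fetch $s1
  bin/nutch updatedb crawl/crawldb $s1
done

Check the logs after each round; if updatedb keeps failing, the crawldb
never learns about the newly discovered links, and generate will keep
handing you the same pages.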
> - Crawling
>
> I'm also interested in performing a crawl of a rather large intranet
> (and maybe even the Internet?). I ran a crawl yesterday starting at
> www.freebsd.org with depth 30 and topN 500 (the topN number might be
> wrong, it could be more) and I noticed that it stopped after
> downloading only 160MB, saying:
>
> /tmp: write failed, filesystem is full
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:232)
>         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:209)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:131)
>
> Is there any way to override what it uses as the scratch directory? On
> my FreeBSD box my /tmp partition is 512MB, and I'm sure any sane Unix
> distribution out there would be similar (even 512MB is quite high).
> I think the command-line tools shouldn't use /tmp by default if they
> need so much space. Perhaps using the working directory or the crawldb
> directory would be better, since the likelihood of there being more
> free space there is greater.

Set the tmp dir in conf/hadoop-site.xml with:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/path</value>
  <description>A base for other temporary directories.</description>
</property>

> Anyway, back to the crawling question. From reading the posts on this
> mailing list and the Nutch wiki, it seems to me that nutch crawl will
> basically crawl the internet up to the point specified by the depth
> argument (not sure what topN really means). I would like to perform a
> crawl starting at some start point A and then do the indexing for
> Lucene when it's finished so that I can start mining that data. I
> would also like to have Nutch "continue" the crawl from where it left
> off (not re-crawl the same pages) so that it can add more pages and
> find more links to crawl in the future. How can I tell Nutch to "keep
> going" or "re-crawl" the pages it has already visited?

The default setting is to refetch a page 30 days after it was last
fetched. I guess bin/nutch generate will generate empty segments once
there are no more pages to fetch?

You need to run bin/nutch invertlinks and bin/nutch index to be able to
search your fetched pages; you'll find more about this in the wiki (and
a rough sketch of those two commands in the P.S. below).

We have successfully fetched, and will soon have indexed, about 70M
pages on a cluster of 3 nodes plus a master (jobtracker and namenode),
so it does work :) However, we do not store content, only parsed data.

> Thanks in advance for your help

- Espen
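P.S. The invertlinks/index step is roughly the following; I'm writing
this from memory, so double-check the exact arguments against the
tutorial on the wiki. The paths assume the standard crawl/ layout from
your commands above, with the linkdb and indexes created alongside the
crawldb and segments:

# build the link database (incoming anchor text) from all fetched segments
bin/nutch invertlinks crawl/linkdb -dir crawl/segments

# build a Lucene index from the crawldb, linkdb and segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

If I remember right, you can then sanity-check the result from the
command line with something like
bin/nutch org.apache.nutch.searcher.NutchBean yourquery
before setting up the search webapp.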
