Hi All,

   First of all, thanks to all the developers working on this
project; from the looks of it, it has great potential.  I've been
playing around with version 0.9 for the past couple of days and I
have a few questions about its usage.

At this time I'm particularly interested in doing the following:

1. Mirroring a complete website like abc.com without leaving its
confines.  I followed the 0.8 tutorial and basically did:

# add one site to the list
mkdir sites && echo 'www.abc.com' > sites/urls

# inject that one site into the WebDB?
bin/nutch inject crawl/crawldb sites

# generate segments (whatever this means - I assume it will just add
# the www.abc.com link)
bin/nutch generate crawl/crawldb crawl/segments

# put the segment path into s1 (I use csh, hence the `set`)
set s1=`ls -d crawl/segments/2* | tail -1`
echo $s1

# seems to fetch 50 or so pages linked from abc.com
bin/nutch fetch $s1

# put the results in the actual database
bin/nutch updatedb crawl/crawldb $s1
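
To keep the crawl inside abc.com, I also edited
conf/crawl-urlfilter.txt (or should that be conf/regex-urlfilter.txt
when using the step-by-step tools?) along these lines; I'm assuming
this is the right way to confine the crawl to one domain:

# accept only hosts in abc.com
+^http://([a-z0-9]*\.)*abc.com/
# skip everything else
-.

Is that correct?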


AFAIK that sequence does one round of fetching; however, there are
still plenty of pages left to fetch.  So I did:

while (1)
   bin/nutch generate crawl/crawldb crawl/segments -topN 50
   set s1=`ls -d crawl/segments/2* | tail -1`
   bin/nutch fetch $s1
   bin/nutch updatedb crawl/crawldb $s1
end

but I noticed that it just fetches the same 50 pages over and over
again.  Is there a way to tell it to keep fetching only the pages
that haven't been fetched yet?  I would assume the first run found
enough links to keep going.  Is the above procedure meant to be a
one-time run only?  How do I know when the site has been completely
fetched and indexed?
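
The closest thing I've found for checking progress is the crawldb
statistics; I assume the db_unfetched count dropping to zero is the
sign that the site is done:

# show how many pages have been fetched vs. are still unfetched
bin/nutch readdb crawl/crawldb -stats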



2. Crawling

I'm also interested in performing a crawl of a rather large intranet
(and maybe even the Internet?).  Yesterday I ran a crawl starting at
www.freebsd.org with depth 30 and topN 500 (the topN number might be
wrong; it could have been more), and I noticed that it stopped after
downloading only 160MB, saying:

/tmp: write failed, filesystem is full
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:232)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:209)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:131)

Is there any way to override which directory it uses as scratch
space?  On my FreeBSD box the /tmp partition is 512MB, and I'm sure
any sane Unix distribution out there would be similar (even 512MB is
quite generous).  I think the command-line tools shouldn't use /tmp
by default if they need this much space.  Perhaps using the working
directory or the crawldb directory would be better, since the
likelihood of there being more free space there is greater.
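
For now I'm guessing the relevant knob is Hadoop's hadoop.tmp.dir
property (which seems to be where the local job runner puts its
scratch files), so I've added something like the following to
conf/hadoop-site.xml, with /home/nutch/tmp just standing in for a
partition that has more room:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/nutch/tmp</value>
</property>

Is that the right property, or is there a better way?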

Anyway, back to the crawling question.  From reading the posts on
this mailing list and the Nutch wiki, it seems to me that nutch crawl
will basically crawl the internet up to the point specified by the
depth argument (I'm not sure what topN really means).  I would like
to perform a crawl starting at some start point A and then do the
indexing for Lucene when it's finished so that I can start mining
that data.  I would also like to have Nutch "continue" the crawl from
where it left off (without re-crawling the same pages) so that it can
add more pages and find more links to crawl in the future.  How can I
tell Nutch to "keep going", or to re-crawl the pages it has already
visited?

Thanks in advance for your help
