On Sunday 08 April 2007 10:43:15 class acts wrote:
> Hi All,
>
> First of all, thanks to all the developers working on this project;
> from the looks of it, this project has great potential. I've been
> playing around with version 0.9 for the past couple of days and I have
> a few questions regarding its usage.
>
> At this time I'm particularly interested in doing the following:
>
> 1. Mirroring a complete website like abc.com without leaving its
> confines. I followed the 0.8 tutorial and basically did:
>
> # add one site to the list
> mkdir sites && echo 'www.abc.com' > sites/urls
>
> # inject that one site into the WebDB?
> bin/nutch inject crawl/crawldb sites
>
> # generate segments (whatever this means - I assume it will just add
> # the www.abc.com link)
> bin/nutch generate crawl/crawldb crawl/segments
>
> # put the segment path into s1 (I use csh, hence the `set`)
> set s1=`ls -d crawl/segments/2* | tail -1`
> echo $s1
>
> # seems to fetch 50 or so pages linked from abc.com
> bin/nutch fetch $s1
>
> # put the results in the actual database
> bin/nutch updatedb crawl/crawldb $s1
>
> AFAIK, the above does one round of fetching, but there are still
> plenty of pages yet to be fetched. So I did:
>
> while (1)
>   bin/nutch generate crawl/crawldb crawl/segments -topN 50
>   set s1=`ls -d crawl/segments/2* | tail -1`
>   bin/nutch fetch $s1
>   bin/nutch updatedb crawl/crawldb $s1
> end
>
> but I noticed that it just fetches the same 50 pages over and over
> again. Isn't there a way to tell it to keep fetching pages that
> haven't been fetched yet? I would assume the first run would have
> found enough links to keep going. Is the above procedure meant to be
> a one-time run only? How do I know when the site has been completely
> indexed?

Have you checked your logs? It looks like the last step, bin/nutch
updatedb ..., is failing. It would probably also make sense to set
-topN higher than 50: topN is the number of pages to fetch for a given
segment, so each generate/fetch round is capped at that many pages.
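Once updatedb succeeds, each new generate should pick up the URLs that
were discovered but not yet fetched in the previous round, so a loop
very much like yours should work. A rough sketch in plain sh rather
than csh, using only the commands from your own steps (the 5 rounds and
-topN 1000 are arbitrary example values, not recommendations):

#!/bin/sh
# Sketch only: run a few generate/fetch/updatedb rounds.
for round in 1 2 3 4 5
do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  # pick up the segment that generate just created (newest one sorts last)
  s1=`ls -d crawl/segments/2* | tail -1`
  echo "round $round: fetching segment $s1"
  bin/nutch fetch $s1
  bin/nutch updatedb crawl/crawldb $s1
done

Check the logs after each round; if updatedb keeps failing, the crawldb
never learns about the newly discovered links, and generate will keep
handing you the same pages.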
> - Crawling
>
> I'm also interested in performing a crawl of a rather large intranet
> (and maybe even the Internet?). I ran a crawl yesterday starting at
> www.freebsd.org with depth 30 and topN 500 (the topN number might be
> wrong, it could be more) and I noticed that it stopped after
> downloading only 160MB, saying:
>
> /tmp: write failed, filesystem is full
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:232)
>         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:209)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:131)
>
> Is there any way to override what it uses as the scratch directory? On
> my FreeBSD box my /tmp partition is 512MB, and I'm sure any sane Unix
> distribution out there would be similar (even 512MB is quite high).
> I think the command-line tools shouldn't use /tmp by default if they
> need so much space. Perhaps using the working directory or the crawldb
> directory would be better, since the likelihood of there being more
> free space there is greater.

Set the tmp dir in conf/hadoop-site.xml with:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/path</value>
  <description>A base for other temporary directories.</description>
</property>

> Anyway, back to the crawling question. From reading the posts on this
> mailing list and the Nutch wiki, it seems to me that nutch crawl will
> basically crawl the internet up to the point specified by the depth
> argument (not sure what topN really means). I would like to perform a
> crawl starting at some start point A and then do the indexing for
> Lucene when it's finished so that I can start mining that data. I
> would also like to have Nutch "continue" the crawl from where it left
> off (not re-crawl the same pages) so that it can add more pages and
> find more links to crawl in the future. How can I tell Nutch to "keep
> going" or "re-crawl" the pages it has already visited?

The default setting is to refetch a page 30 days after it was last
fetched. I guess bin/nutch generate will generate empty segments once
there are no more pages to fetch?

You need to run bin/nutch invertlinks and bin/nutch index to be able to
search your fetched pages; you'll find more about this in the wiki (and
a rough sketch of those two commands in the P.S. below).

We have successfully fetched, and will soon have indexed, about 70M
pages on a cluster of 3 nodes plus a master (jobtracker and namenode),
so it does work :) However, we do not store content, only parsed data.

> Thanks in advance for your help

- Espen
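P.S. The invertlinks/index step is roughly the following; I'm writing
this from memory, so double-check the exact arguments against the
tutorial on the wiki. The paths assume the standard crawl/ layout from
your commands above, with the linkdb and indexes created alongside the
crawldb and segments:

# build the link database (incoming anchor text) from all fetched segments
bin/nutch invertlinks crawl/linkdb -dir crawl/segments

# build a Lucene index from the crawldb, linkdb and segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

If I remember right, you can then sanity-check the result from the
command line with something like
bin/nutch org.apache.nutch.searcher.NutchBean yourquery
before setting up the search webapp.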
