We also do depth 1 or 2 crawls, so the crawldb is also kept up to date. Be careful with Dmoz, there is a lot of spam out there. The loop is also useful for inverting links etc. whenever it is important to have single segments and not the whole directory.
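A minimal sketch of what such a loop can look like (the crawls/ layout, the -topN value, and the way the newest segment is picked are placeholders, not taken from this thread):

crawl=crawls                       # placeholder: dir holding crawldb, linkdb, segments
for round in 1 2                   # depth 1 or 2, i.e. one or two generate/fetch cycles
do
    bin/nutch generate $crawl/crawldb $crawl/segments -topN 1000
    segment=`ls -d $crawl/segments/* | tail -n 1`    # the segment just generated
    bin/nutch fetch $segment
    # add "bin/nutch parse $segment" here if your fetcher is not configured to parse
    bin/nutch updatedb $crawl/crawldb $segment
    # invert links for this single segment instead of the whole segments directory
    bin/nutch invertlinks $crawl/linkdb $segment
done

Each pass leaves one self-contained segment behind, which is what makes per-segment invertlinks and the later mergesegs/index steps possible.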
-----Original Message-----
From: Bartosz Gadzimski [mailto:bartek...@o2.pl]
Sent: Thursday, 19 February 2009 14:56
To: nutch-user@lucene.apache.org
Subject: Re: AW: How to index while fetcher works

Thanks Nadine, I am a few days ahead thanks to your script :)

Nutch is a really nice piece of software, it just takes time to get to know it better.

Regards,
Bartosz

Höchstötter Nadine wrote:
> Hi. This is my version of an incremental index: I have one working dir for
> all the new segments flying in and a routine every four hours that builds a
> new index for a special webindex folder which is nearly up to date.
> I merge segments into another folder with a YYYYMMDDHH pattern in my working
> segment dir. With this I can always recognize which segments have already
> been indexed. Move or copy the merged segment under the YYYYMMDDHH folder to
> your fresh webindex segment folder, and also everything under $merge_dir (the
> new index) to the index folder in your webindex dir. That dir has the same
> structure as your working crawl dir.
> It is also good for backup reasons. Call the script below from cron and add
> cp, mv, rm, or tar commands wherever you like. I zip my crawldb and linkdb
> with this cron, too, as a backup.
>
> # live index location (target of the cp/mv commands you add yourself)
> index_dir=/nutchcrawl/indexes/$CRAWLNAME/index
> TIMEH=`date +%Y%m%d%H`
> merge_dir=/nutchcrawl/indexes/$CRAWLNAME/indexmerged$TIMEH
>
> # Collect finished segments (14-digit timestamp names); already-merged
> # YYYYMMDDHH folders have only 10 digits and are skipped.
> for segment in `ls -d /nutchcrawl/indexes/$CRAWLNAME/segments/* | grep '[0-9]\{14\}'`
> do
>   if [ -d $segment/_temporary ]; then
>     echo "$segment is temporary"
>   else
>     echo "$segment"
>     segments="$segments $segment"
>   fi
> done
>
> # Merge the collected segments into one segment under a YYYYMMDDHH folder
> mergesegs_dir=/nutchcrawl/indexes/$CRAWLNAME/segments/$TIMEH
> bin/nutch mergesegs $mergesegs_dir $segments
>
> # Index the merged segment ($webdb_dir and $linkdb_dir point to the crawldb
> # and linkdb; they are not set in this excerpt)
> indexes=/nutchcrawl/indexes/$CRAWLNAME/indexes$TIMEH
> NEW=`ls -d /nutchcrawl/indexes/$CRAWLNAME/segments/$TIMEH/*`
> echo "$NEW"
> bin/nutch index $indexes $webdb_dir $linkdb_dir $NEW/
>
> # Merge all partial indexes into the new index under $merge_dir
> for allindex in `ls -d /nutchcrawl/indexes/$CRAWLNAME/indexes*`
> do
>   allindexes="$allindexes $allindex"
> done
> bin/nutch merge $merge_dir $allindexes
>
> cheers, Nadine.
>
> -----Original Message-----
> From: Doğacan Güney [mailto:doga...@gmail.com]
> Sent: Thursday, 19 February 2009 12:35
> To: nutch-user@lucene.apache.org
> Subject: Re: How to index while fetcher works
>
> Hi,
>
> On Thu, Feb 19, 2009 at 13:28, Bartek <bartek...@o2.pl> wrote:
>
>> Hello,
>>
>> I started to crawl a huge amount of websites (dmoz with no limits in
>> crawl-urlfilter.txt) with -depth 10 and -topN 1 mln.
>>
>> My /tmp/hadoop-root/ is more than 18 GB by now (map-reduce jobs).
>>
>> This fetching will not stop soon :) so I would like to process the already
>> made segments (updatedb, invertlinks, index), but there are parts missing
>> in them:
>>
>> [r...@server nutch]# bin/nutch invertlinks crawls/linkdb -dir
>> crawls/segments/20090216142840/
>>
>
> If you use the -dir option then you pass the segments directory, not an
> individual segment, e.g.:
>
> bin/nutch invertlinks crawls/linkdb -dir crawls/segments
>
> which will read every directory under segments.
>
> To pass individual segments, skip the -dir option:
>
> bin/nutch invertlinks crawls/linkdb crawls/segments/20090216142840
>
>> LinkDb: adding segment:
>> file:/usr/local/nutch/crawls/segments/20090216142840/crawl_generate
>>
>> ...
>>
>> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not
>> exist:
>> file:/usr/local/nutch/crawls/segments/20090216142840/crawl_generate/parse_data
>>
>> etc.
>>
>> When manually trying to parse the segments with bin/nutch parse, it says
>> that they are already parsed.
>>
>> So my question is: how do I design the whole process of crawling a large
>> amount of websites without limiting them to specific domains (like a regular
>> search engine, e.g. Google)?
>>
>> Should I make loops over small amounts of links, like -topN 1000 and then
>> updatedb, invertlinks, index?
>>
>> For now I can start crawling, but data will only appear in weeks.
>>
>> I found that in 1.0 (so already done) you are introducing live indexing in
>> Nutch. Are there any docs that I can use?
>>
>> Regards,
>> Bartosz Gadzimski
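For completeness, the publish step that the merge script earlier in this thread leaves to hand-added cp/mv/rm commands could look roughly like the lines below; the /nutchcrawl/webindex/$CRAWLNAME layout is an assumption, not taken from the thread:

# assumed webindex dir that mirrors the working crawl dir
webindex=/nutchcrawl/webindex/$CRAWLNAME
# copy the merged segment created under the YYYYMMDDHH folder
cp -r /nutchcrawl/indexes/$CRAWLNAME/segments/$TIMEH/* $webindex/segments/
# replace the live index with the freshly merged one
rm -rf $webindex/index
cp -r $merge_dir $webindex/index

Scheduled every four hours with cron, e.g. (the script name is hypothetical):

0 */4 * * * /nutchcrawl/bin/incremental_index.sh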