We also do depth-1 or depth-2 crawls, so the crawldb is also kept up to date.
Be careful with DMOZ, there is a lot of spam out there.
The loop is also useful for inverting links etc. whenever it is important to
work on single segments and not on the whole segments directory.
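
Roughly, one round of such a depth-1 loop looks like this (the paths and the
-topN value are only examples, adjust them to your own layout):

bin/nutch generate crawls/crawldb crawls/segments -topN 1000
segment=`ls -d crawls/segments/* | tail -1`   # the segment that was just generated
bin/nutch fetch $segment
bin/nutch parse $segment                      # only needed if fetcher.parse is false
bin/nutch updatedb crawls/crawldb $segment
bin/nutch invertlinks crawls/linkdb $segment  # per-segment invertlinks, as mentioned above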

-----Original Message-----
From: Bartosz Gadzimski [mailto:bartek...@o2.pl] 
Sent: Thursday, 19 February 2009 14:56
To: nutch-user@lucene.apache.org
Subject: Re: AW: How to index while fetcher works

Thanks Nadine, I am a few days ahead thanks to your script :)

Nutch is a really nice piece of software, it just takes time to get to know it better.

Regards,
Bartosz

Höchstötter Nadine wrote:
> Hi. This is my version of an incremental index: I have one working dir into which 
> all the new segments arrive, and a routine that runs every four hours to build a new 
> index for a special webindex folder, which is therefore nearly up to date.
> I merge segments into another folder with a YYYYMMDDHH pattern inside my working 
> segment dir. With this I can always recognize which segments have already 
> been indexed. Move or copy the merged segment under the YYYYMMDDHH folder to your 
> fresh webindex segment folder, and also everything under $merge_dir (the new 
> index) to the index folder in your webindex dir. That dir has the same structure as 
> your working crawl dir.
> It is also good for backup purposes. Call the script below from a cron job and add 
> cp, mv, rm, or tar commands wherever you like. I zip my crawldb and linkdb 
> with this cron, too, as a backup.
>
>
> index_dir=/nutchcrawl/indexes/$CRAWLNAME/index
> TIMEH=`date +%Y%m%d%H`
> merge_dir=/nutchcrawl/indexes/$CRAWLNAME/indexmerged$TIMEH
> # Collect the finished segments (skip any that are still being fetched)
>
> for segment in `ls -d  /nutchcrawl/indexes/$CRAWLNAME/segments/* | grep 
> '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]' `
> do
>   if [ -d $segment/_temporary ]; then 
>     echo "$segment is temporary"
>   else
>     echo "$segment" 
>     segments="$segments $segment"
>   fi
> done
> # the merged segment goes into the working segment dir under a YYYYMMDDHH name
> mergesegs_dir=/nutchcrawl/indexes/$CRAWLNAME/segments/$TIMEH
> bin/nutch mergesegs $mergesegs_dir $segments
>
> indexes=/nutchcrawl/indexes/$CRAWLNAME/indexes$TIMEH
>
> NEW=`ls -d  /nutchcrawl/indexes/$CRAWLNAME/segments/$TIMEH/*`
> echo "$NEW"
> # $webdb_dir and $linkdb_dir must point to your crawldb and linkdb (set elsewhere)
> bin/nutch index $indexes $webdb_dir $linkdb_dir $NEW/
>
> for allindex in `ls -d /nutchcrawl/indexes/$CRAWLNAME/indexes*`
> do
> allindexes="$allindexes $allindex"
> done
>
>
> bin/nutch merge $merge_dir $allindexes
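>
> The cp/mv/tar steps are not shown here; roughly, what I add at the end looks 
> like this (the webindex and backup paths are only examples):
>
> # copy the merged segment and the merged index into the webindex dir
> cp -r $mergesegs_dir /nutchcrawl/webindex/$CRAWLNAME/segments/
> cp -r $merge_dir/* /nutchcrawl/webindex/$CRAWLNAME/index/
> # back up crawldb and linkdb
> tar czf /nutchcrawl/backup/crawldb-$TIMEH.tar.gz $webdb_dir
> tar czf /nutchcrawl/backup/linkdb-$TIMEH.tar.gz $linkdb_dir
>
> The whole script then runs from cron every four hours, e.g.:
>
> 0 */4 * * * /nutchcrawl/bin/incremental_index.sh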
>
> cheers, Nadine.
>
> -----Original Message-----
> From: Doğacan Güney [mailto:doga...@gmail.com] 
> Sent: Thursday, 19 February 2009 12:35
> To: nutch-user@lucene.apache.org
> Subject: Re: How to index while fetcher works
>
> Hi,
>
>
> On Thu, Feb 19, 2009 at 13:28, Bartek <bartek...@o2.pl> wrote:
>   
>> Hello,
>>
>> I started to crawl a huge number of websites (DMOZ with no limits in
>> crawl-urlfilter.txt) with -depth 10 and -topN 1 million.
>>
>> My /tmp/hadoop-root/ is already more than 18 GB (map-reduce jobs).
>>
>>
>> This fetching will not stop any time soon :) so I would like to process the segments
>> that are already done (updatedb, invertlinks, index), but there are parts missing in them:
>>
>> [r...@server nutch]# bin/nutch invertlinks crawls/linkdb -dir
>> crawls/segments/20090216142840/
>>
>>     
>
>
> If you use the -dir option you pass the segments directory, not individual
> segments, e.g.:
>
> bin/nutch invertlinks crawls/linkdb -dir crawls/segments
>
> which will read every directory under segments.
>
> To pass individual segments, skip the -dir option:
>
> bin/nutch invertlinks crawls/linkdb crawls/segments/20090216142840
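>
> If I remember correctly, you can also list several segments at once (the second
> segment name below is just an example):
>
> bin/nutch invertlinks crawls/linkdb crawls/segments/20090216142840 crawls/segments/20090216150312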
>   
>> LinkDb: adding segment:
>> file:/usr/local/nutch/crawls/segments/20090216142840/crawl_generate
>>
>> ...
>>
>> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not
>> exist:
>> file:/usr/local/nutch/crawls/segments/20090216142840/crawl_generate/parse_data
>>
>> etc.
>>
>> When manually trying to parse the segments with bin/nutch parse, it says they are already parsed.
>>
>>
>> So my question is: how should I design the whole process of crawling a large
>> number of websites without limiting them to specific domains (like a regular
>> search engine, e.g. Google)?
>>
>> Should I do loops over small batches of links, like -topN 1000, and then
>> updatedb, invertlinks, index?
>>
>>
>> For now I can start crawling, but it will be weeks before any data shows up.
>>
>> I found that in 1.0 (which is already done) you are introducing live indexing in
>> Nutch. Are there any docs that I can use?
>>
>> Regards,
>> Bartosz Gadzimski
>>
