Hi. This is my version of an incremental index: I have one working dir into which
all the new segments arrive, and a routine that runs every four hours to build a
new index for a separate webindex folder, so that index stays nearly up to date.
Inside my working segment dir I merge the segments into another folder named with a
YYYYMMDDHH pattern. That way I can always tell which segments have already been
indexed. Move or copy the merged segment under the YYYYMMDDHH folder to your fresh
webindex segment folder, and move everything under $merge_dir (the new index) to
the index folder in the webindex dir. That dir has the same structure as your
working crawl dir.
This is also good for backup purposes. Call the script below from cron and add cp,
mv, rm, or tar commands wherever you like. I also zip my crawldb and linkdb in the
same cron job as a backup.
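
For illustration only, the cron entry and the backup commands could look like the
following (the script name, log file, and crawldb/linkdb/backup paths are just
placeholders, adjust them to your own layout):

# crontab -e entry: run the incremental index script every four hours
0 */4 * * * /usr/local/bin/nutch_incremental_index.sh >> /var/log/nutch_index.log 2>&1

# example backup lines to append at the end of the script ($TIMEH is set there)
tar czf /backup/crawldb_$TIMEH.tar.gz /nutchcrawl/$CRAWLNAME/crawldb
tar czf /backup/linkdb_$TIMEH.tar.gz /nutchcrawl/$CRAWLNAME/linkdb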


# $CRAWLNAME is assumed to be set in the environment before this script runs.
# webdb_dir and linkdb_dir must point at your crawldb and linkdb (example paths, adjust them):
webdb_dir=/nutchcrawl/$CRAWLNAME/crawldb
linkdb_dir=/nutchcrawl/$CRAWLNAME/linkdb

index_dir=/nutchcrawl/indexes/$CRAWLNAME/index
TIMEH=`date +%Y%m%d%H`
merge_dir=/nutchcrawl/indexes/$CRAWLNAME/indexmerged$TIMEH

# Collect the finished segments: raw segments have 14-digit (YYYYMMDDHHMMSS) names,
# so already-merged YYYYMMDDHH folders are left out, and segments that still contain
# a _temporary dir are being written and are skipped.
for segment in `ls -d /nutchcrawl/indexes/$CRAWLNAME/segments/* | grep '[0-9]\{14\}'`
do
  if [ -d "$segment/_temporary" ]; then
    echo "$segment is temporary"
  else
    echo "$segment"
    segments="$segments $segment"
  fi
done

# Merge the collected segments into one segment under the YYYYMMDDHH folder
# of the working segment dir.
mergesegs_dir=/nutchcrawl/indexes/$CRAWLNAME/segments/$TIMEH
bin/nutch mergesegs $mergesegs_dir $segments

# Index the freshly merged segment.
indexes=/nutchcrawl/indexes/$CRAWLNAME/indexes$TIMEH
NEW=`ls -d /nutchcrawl/indexes/$CRAWLNAME/segments/$TIMEH/*`
echo "$NEW"
bin/nutch index $indexes $webdb_dir $linkdb_dir $NEW/

# Merge all indexes built so far into one new index under $merge_dir.
for allindex in `ls -d /nutchcrawl/indexes/$CRAWLNAME/indexes*`
do
  allindexes="$allindexes $allindex"
done

bin/nutch merge $merge_dir $allindexes
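
The copy into the webindex dir mentioned above is not part of the script; a minimal
sketch of that step (the /nutchcrawl/webindex/$CRAWLNAME location is only an example,
use whatever your webindex dir actually is) would be something like:

# copy the merged segment (YYYYMMDDHH folder) into the webindex segment folder
cp -r /nutchcrawl/indexes/$CRAWLNAME/segments/$TIMEH /nutchcrawl/webindex/$CRAWLNAME/segments/

# swap in the freshly merged index
rm -rf /nutchcrawl/webindex/$CRAWLNAME/index
cp -r $merge_dir /nutchcrawl/webindex/$CRAWLNAME/index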

cheers, Nadine.

-----Original Message-----
From: Doğacan Güney [mailto:doga...@gmail.com]
Sent: Thursday, 19 February 2009 12:35
To: nutch-user@lucene.apache.org
Subject: Re: How to index while fetcher works

Hi,


On Thu, Feb 19, 2009 at 13:28, Bartek <bartek...@o2.pl> wrote:
> Hello,
>
> I started to crawl a huge number of websites (dmoz with no limits in
> crawl-urlfilter.txt) with -depth 10 and -topN 1 mln
>
> My /tmp/hadoop-root/ is already more than 18 GB (map-reduce jobs)
>
>
> This fetching will not stop soon :) so I would like to process the segments made
> so far (updatedb, invertlinks, index), but there are parts missing in them:
>
> [r...@server nutch]# bin/nutch invertlinks crawls/linkdb -dir
> crawls/segments/20090216142840/
>


If you use the -dir option then you pass the segments directory, not individual
segments, e.g.:

bin/nutch invertlinks crawls/linkdb -dir crawls/segments

which will read every directory under segments.

To pass individual segment directories, skip the -dir option:

bin/nutch invertlinks crawls/linkdb crawls/segments/20090216142840
>
> LinkDb: adding segment:
> file:/usr/local/nutch/crawls/segments/20090216142840/crawl_generate
>
> ...
>
> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not
> exist:
> file:/usr/local/nutch/crawls/segments/20090216142840/crawl_generate/parse_data
>
> etc.
>
> When I manually try to parse the segments with bin/nutch parse, it says that they
> are already parsed.
>
>
> So my question is: how should I design the whole process of crawling a large
> number of websites without limiting them to specific domains (like a regular
> search engine, e.g. Google)?
>
> Should I run loops over small numbers of links, like -topN 1000, and then
> updatedb, invertlinks, index?
>
>
> For now I can start crawling, but data will only start to appear in weeks.
>
> I found that in 1.0 (so it is already done) you are introducing live indexing in
> Nutch. Are there any docs that I can use?
>
> Regards,
> Bartosz Gadzimski
>



-- 
Doğacan Güney
