Actually I have:
+ crawl/segments (where segment data is stored)
+ crawl/indexed-segments (where I store the indexed segments)

Then I merge all indexes in crawl/indexed-segments into the final
crawl/index.
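
For reference, a minimal sketch of that last step using Nutch 1.0's
IndexMerger; the paths match my layout above, so adjust them to yours:

  # Merge every per-segment index under crawl/indexed-segments into the
  # final index at crawl/index (move any existing crawl/index aside first).
  bin/nutch merge crawl/index crawl/indexed-segments/*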

I also wonder why one would merge segments; I guess the real question is
when to ditch segments that are old and whose data you have already
recrawled (after the standard 30 days), but I haven't gone that far myself.
I'm considering removing old segments with a script, something like the
sketch below.
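
A rough sketch of what I have in mind, assuming segment directories
untouched for 30+ days have all been recrawled (verify that before
deleting anything for real):

  #!/bin/sh
  # Hypothetical cleanup: list segment dirs not modified in 30+ days.
  # Drop the "echo" only once you trust the list it prints.
  find crawl/segments -mindepth 1 -maxdepth 1 -type d -mtime +30 \
    -exec echo rm -rf {} \;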

It's not clear why you want to change hadoop-site.xml; it is only required
if you are going to run Hadoop in pseudo-distributed mode (I assume you
have a single server). I tried it and spent some time on Windows but could
not get it to work; since I don't need it for now, I gave up.
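
For what it's worth, a hadoop-site.xml along these lines is what
pseudo-distributed mode would want, plus the "compress option" mentioned
further down in this thread; treat it as a sketch for the Hadoop bundled
with Nutch 1.0, not something I have verified end to end:

  <?xml version="1.0"?>
  <!-- Sketch: single-machine pseudo-distributed mode. -->
  <configuration>
    <property>
      <name>fs.default.name</name>
      <!-- the port number is an arbitrary pick -->
      <value>hdfs://localhost:9000</value>
    </property>
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:9001</value>
    </property>
    <property>
      <!-- compress intermediate map output to save tmp disk space -->
      <name>mapred.compress.map.output</name>
      <value>true</value>
    </property>
  </configuration>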

2009/8/27 <[email protected]>

>
> As I understood, you suggest putting segment files under the segments
> folder and merging the indexes. In that case my question is why we need
> to merge segments at all, if we can go without merging them. In the
> mailing lists the only thing I found was changing settings in
> hadoop-site.xml, but it is empty. Could you please provide some links?
>
>
> Thanks.
> Alex.
>
> -----Original Message-----
> From: MilleBii <[email protected]>
> To: [email protected]
> Sent: Thu, Aug 27, 2009 2:39 am
> Subject: Re: content of hadoop-site.xml
>
> Not strange; look at the mailing list, there have been lots of
> discussions on this issue.
> You may want to use the compress option.
> And/or start using Hadoop in pseudo-distributed mode, so that the reduce
> starts consuming the map data as it is produced; in 'local' mode you get
> the map first and the reduce after, so there can be a lot of data in the
> tmp directory.
>
> Segment merge uses a LOT of space, so much that I don't use it anymore.
> I only merge my indexes, which are much smaller in my case.
>
> 2009/8/27 Fuad Efendi <[email protected]>
>
> > Unfortunately, you can't manage disk space usage via configuration
> > parameters... it is not easy... just try to keep your eyes on
> > services/processes/RAM/swap (disk swapping happens if RAM is not
> > enough) during the merge, even browse files/folders and click the
> > 'refresh' button to get an idea... it is strange that 50G was not
> > enough to merge 2G; maybe the problem is somewhere else (OS X
> > specifics, for instance)... try to play with Nutch with smaller
> > segment sizes and study its behaviour on your OS...
> > -Fuad
> >
> >
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]]
> > Sent: August-26-09 6:41 PM
> > To: [email protected]
> > Subject: Re: content of hadoop-site.xml
> >
> > Thanks for the response.
> >
> > How can I check disk swap?
> > The 50GB of free space was before running the merge command; when it
> > crashed, the available space was 1 KB. The RAM in my MacPro is 2GB. I
> > deleted the tmp folders created by Hadoop during the merge, and after
> > that OS X does not start. I plan to run the merge again and need to
> > reduce the disk space used by the merge. I have read on the net that to
> > reduce space we must use hadoop-site.xml. But there is no
> > hadoop-default.xml file, and the hadoop-site.xml file is empty.
> >
> >
> > Thanks.
> > Alex.
> >
> > -----Original Message-----
> > From: Fuad Efendi <[email protected]>
> > To: [email protected]
> > Sent: Wed, Aug 26, 2009 3:28 pm
> > Subject: RE: content of hadoop-site.xml
> >
> > You can override default settings (nutch-default.xml) in
> > nutch-site.xml, but it won't help with disk space; an empty file is OK.
> >
> > "merge" may generate temporary files, but 50GB against 2GB looks
> > extremely strange; try to empty the recycle bin, for instance... check
> > disk swap... the OS may report 50G available but you may still run out
> > of space, for instance due to heavy disk swapping during the merge
> > caused by low RAM...
> >
> > -Fuad
> > http://www.linkedin.com/in/liferay
> > http://www.tokenizer.org
> >
> >
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]]
> > Sent: August-26-09 5:33 PM
> > To: [email protected]
> > Subject: content of hadoop-site.xml
> >
> > Hello,
> >
> > I have run the merge script to merge two crawl dirs, one 1.6G and the
> > other 120MB. But my MacPro with 50G of free space did not start after
> > the merge crashed with a no-space error. I have been told that OS X got
> > corrupted.
> > I looked inside my nutch-1.0/conf/hadoop-site.xml file and it is empty.
> > Can anyone let me know what must be put inside this file so that the
> > merge does not take too much space?
> >
> > Thanks in advance.
> > Alex.
> >
>
>
> --
> -MilleBii-
>


-- 
-MilleBii-
