Actually I have : + crawl/segments (where segment data is stored) + crawl/indexed-segments (where I store the indexed segments)
Then I merge all indexes in crawl/indexed-segments into the final crawl/index. I also wonder why merging segments, I guess the question is about when to ditch segments that are old and for which you have already recrawled the data (standard 30 days), but I haven't gone so far myself. I'm considering removing old segments by scripting. Not clear why you want to change hadoop-site.xml, but it is only required if you are going to implement pseudo-distributed mode (I assume you have one server). I tried & spend some time on Windows and could not get it to work, since I don't need it for now I gave up. 2009/8/27 <[email protected]> > > > > > As I understood, you suggest to put segment files under segment folder and > merge indexes. In that case my question is that why we need to merge > segments, if we can go without merging them. In the mailing lists the only > thing I found was changing settings in hadoop-site.xml, but it is empty. > Could please provide some links. > > > Thanks. > Alex. > ? > > > > > -----Original Message----- > From: MilleBii <[email protected]> > To: [email protected] > Sent: Thu, Aug 27, 2009 2:39 am > Subject: Re: content of hadoop-site.xml > > > > > > > > > > > Not strange, look at the mailing list, their has been lot's of discussions > on this issue. > You may want to use the compress option. > And/or start using hadoop in pseudo-distributed, so that that reduce starts > consumming the map data, because in 'local' mode you get the map first & > the > reduce after so their can be a lot of data in the tmp directory. > > segment merge uses a LOT of space, so much that I don't use it anymore. I > only merge my indexes which are much smaller in my case. > > > > 2009/8/27 Fuad Efendi <[email protected]> > > > Unfortunately, you can't manage disk space usage via configuration > > parameters... it is not easy... just try to keep your eyes on > > services/processes/ram/swap (disk swapping happens if RAM is not enough) > > during merge, even browse file/folders and click 'refresh' button to get > an > > idea... it is strange that 50G was not enough to merge 2G, may be problem > > is > > somewhere else (OS X specifics for instance)... try to play with Nutch > with > > smaller segment sizes and study it's behaviour on your OS... > > -Fuad > > > > > > -----Original Message----- > > From: [email protected] [mailto:[email protected]] > > Sent: August-26-09 6:41 PM > > To: [email protected] > > Subject: Re: content of hadoop-site.xml > > > > > > > > > > > > Thanks for the response. > > > > How can I check disk swap? > > 50GB was before running merge command. When it crashed available space > was > > 1 > > kb. RAM in my MacPro is 2GB. I deleted tmp folders created by hadoop > during > > merge and after that OS X does not start. I plan to run merge again and > > need > > to reduce disk space usage by merge. I have read on the net that for > > reducing space we must use hadoop-site.xml. But, there is no > > hadoop-default.xml file and hadoop-site.xml file is empty. > > > > > > Thanks. > > Alex. > > > > > > > > > > -----Original Message----- > > From: Fuad Efendi <[email protected]> > > To: [email protected] > > Sent: Wed, Aug 26, 2009 3:28 pm > > Subject: RE: content of hadoop-site.xml > > > > > > > > > > > > > > > > > > > > > > You can override default settings (nutch-default.xml) in nutch-site.xml; > > but > > it won't help with spacing; empty file is Ok. > > > > "merge" may generate temporary files, but 50Gb against 2Gb looks > extremely > > strange; try to empty recycle bin for instance... check disk swap... OS > may > > report 50G available but you may be out of space... for instance heavy > disk > > swap during merge due to low RAM... > > > > > > > > -Fuad > > http://www.linkedin.com/in/liferay > > http://www.tokenizer.org > > > > > > -----Original Message----- > > From: [email protected] [mailto:[email protected]] > > Sent: August-26-09 5:33 PM > > To: [email protected] > > Subject: content of hadoop-site.xml > > > > Hello, > > > > ?I have run merge script? to merge two crawl dirs, one 1.6G another > 120MB. > > But my MacPro with 50G free space did not start, after merge crashed with > > no > > space error. I have been told that OSX got corrupted. > > I looked inside my nutch-1.0/conf/hadoop-site.xml file and it is empty. > Can > > anyone let me know what must be put inside this file in order for merge > not > > to take too much space. > > > > Thanks in advance. > > Alex. > > > > > > > > > > > > > > > > > > > > > > > -- > -MilleBii- > > > > > > -- -MilleBii-
