Doğacan Güney wrote:
> On Tue, Jul 21, 2009 at 21:50, Tomislav Poljak<[email protected]> wrote:
>
>> Hi,
>> thanks for your answers, I've configured compression:
>>
>> mapred.output.compress = true
>> mapred.compress.map.output = true
>> mapred.output.compression.type = BLOCK
>>
>> (in XML format in hadoop-site.xml)
>>
>> and it works (and uses less disk space, no more out-of-disk-space
>> exceptions), but merging now takes a really long time. My next question
>> is simple: is segment merging a necessary step (if I don't need
>> everything in one segment and don't use the optional filtering), or is
>> it OK to proceed with indexing? I ask because many tutorials and most
>> re-crawl scripts have this step.
>>
>
> Not really. But if you recrawl a lot, old versions of pages will stay
> on your disk, taking up unnecessary space.
>
> To improve compression speed, take a look at:
>
> http://code.google.com/p/hadoop-gpl-compression/
>
> LZO (de)compression is *very* fast.
>
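As a side note, my understanding is that these settings go into hadoop-site.xml roughly as below. This is a sketch only: the first three properties are the ones quoted above, while the codec entries (io.compression.codecs, mapred.map.output.compression.codec and the com.hadoop.compression.lzo.LzoCodec class) are my assumption of what hadoop-gpl-compression provides, and they additionally need its jar plus the native LZO libraries on every node:

  <!-- inside the <configuration> element of hadoop-site.xml -->
  <property>
    <name>mapred.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.output.compression.type</name>
    <value>BLOCK</value>
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>

  <!-- assumed LZO wiring; only valid if hadoop-gpl-compression is installed -->
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec</value>
  </property>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>
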
I also find that segment merging is very heavy on CPU and disk, even though the collection crawled so far is quite small: ~25,000 documents, with the segments holding about 650 MB of data. It is really a showstopper for me. A FAQ entry or some documentation on how to improve the performance of the segment merge step would be very helpful.

>
>> Tomislav
>>
>> On Wed, 2009-07-15 at 21:04 +0300, Doğacan Güney wrote:
>>
>>> On Wed, Jul 15, 2009 at 20:45, MilleBii<[email protected]> wrote:
>>>
>>>> Are you on a single-node conf?
>>>> If yes, I have the same problem, and some people have suggested earlier
>>>> to use the Hadoop pseudo-distributed config on a single server.
>>>> Others have also suggested using Hadoop's compression mode.
>>>>
>>> Yes, that's a good point. Playing around with these options may help:
>>>
>>> mapred.output.compress
>>> mapred.output.compression.type (BLOCK may help a lot here)
>>> mapred.compress.map.output
>>>
>>>> But I have not been able to make it work on my PC because I get bogged
>>>> down by some Windows/Hadoop compatibility issues.
>>>> If you are on Linux you may have more luck; I'd be interested in your
>>>> results, by the way, so I know whether moving to Linux would solve
>>>> those problems for me.
>>>>
>>>> 2009/7/15 Doğacan Güney <[email protected]>
>>>>
>>>>> On Wed, Jul 15, 2009 at 19:31, Tomislav Poljak<[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>> I'm trying to merge (using nutch-1.0 mergesegs) about 1.2MM pages
>>>>>> contained in 10 segments on one machine, using:
>>>>>>
>>>>>> bin/nutch mergesegs crawl/merge_seg -dir crawl/segments
>>>>>>
>>>>>> but there is not enough space on a 500G disk to complete this merge
>>>>>> task (I get java.io.IOException: No space left on device in hadoop.log).
>>>>>>
>>>>>> Shouldn't 500G be enough disk space for this merge? Is this a bug? If
>>>>>> it is not a bug, how much disk space does this merge require?
>>>>>>
>>>>> A lot :)
>>>>>
>>>>> Try deleting your Hadoop temporary folders. If that doesn't help, you
>>>>> may try merging segment parts one by one. For example, move your
>>>>> content/ directories aside and try merging again. If that succeeds,
>>>>> you can then merge the contents later and move the resulting content/
>>>>> into your merge_seg dir.
>>>>>
>>>>>> Tomislav
>>>>>>
>>>>> --
>>>>> Doğacan Güney
>>>>>
>>>> --
>>>> -MilleBii-
>>>>
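For what it's worth, below is the rough shell sequence I would try for the "merge segment parts one by one" suggestion above, on a local (non-HDFS) single-node setup. Only the mergesegs invocation comes from the thread; the backup directory name and the loop are my own placeholders:

  # Sketch only: Nutch 1.0, local filesystem, segments under crawl/segments.
  # Step 1: move the bulky content/ parts aside so the first merge pass only
  # has to handle the smaller parts (crawl_generate, crawl_fetch, crawl_parse,
  # parse_data, parse_text).
  mkdir -p content_backup
  for seg in crawl/segments/*; do
    mkdir -p "content_backup/$(basename "$seg")"
    [ -d "$seg/content" ] && mv "$seg/content" "content_backup/$(basename "$seg")/"
  done

  # Step 2: merge the slimmed-down segments.
  bin/nutch mergesegs crawl/merge_seg -dir crawl/segments

  # Step 3 (later): handle content/ the same way -- move the other parts aside,
  # restore the content/ directories, rerun mergesegs, and move the resulting
  # content/ into the merged segment produced in step 2.

  # Clearing Hadoop's temporary directory between runs also frees space; in
  # local mode it defaults to /tmp/hadoop-${user.name} unless hadoop.tmp.dir
  # is set to something else.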
