On Wed, Jul 29, 2009 at 13:11, reinhard schwab <[email protected]> wrote:
> Doğacan Güney wrote:
>> On Tue, Jul 21, 2009 at 21:50, Tomislav Poljak <[email protected]> wrote:
>>> Hi,
>>> thanks for your answers, I've configured compression:
>>>
>>> mapred.output.compress = true
>>> mapred.compress.map.output = true
>>> mapred.output.compression.type = BLOCK
>>>
>>> (in xml format in hadoop-site.xml)
>>>
>>> and it works (and uses less disk space, no more out-of-disk-space
>>> exceptions), but merging now takes a really long time. My next question
>>> is simple: is segment merging a necessary step (if I don't need
>>> everything in one segment and don't do any optional filtering), or is
>>> it OK to proceed with indexing? I ask because many tutorials and most
>>> re-crawl scripts include this step.
>>
>> Not really. But if you recrawl a lot, old versions of pages will stay
>> on your disk, taking up unnecessary space.
>>
>> To improve compression speed, take a look at:
>>
>> http://code.google.com/p/hadoop-gpl-compression/
>>
>> LZO (de)compression is *very* fast.
>
> I also find that segment merging is heavy on resources such as CPU and
> disk, although the document collection crawled so far is very small,
> ~25000 documents; the segments contain about 650 MB of data. It's really
> a showstopper for me. It would be very helpful to have a FAQ entry or
> some documentation on how to improve the performance of the segment
> merge task.
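The three settings quoted above go into hadoop-site.xml in the usual
Hadoop property form. As a sketch, the property names and values are
exactly the ones from the thread; only the surrounding XML boilerplate is
added:

<!-- Sketch: merge these into your existing hadoop-site.xml;
     names/values are the ones quoted above. -->
<configuration>
  <property>
    <name>mapred.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.output.compression.type</name>
    <value>BLOCK</value>
  </property>
</configuration>
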
You may be interested in: http://issues.apache.org/jira/browse/NUTCH-650

With hbase integration, we do away entirely with much of this, including
segment merging. I intend to commit the initial hbase code to a nutch
branch this week (and write a wiki guide about it). Many features are
still missing, but the code should be stable enough to test.

>>> Tomislav
>>>
>>> On Wed, 2009-07-15 at 21:04 +0300, Doğacan Güney wrote:
>>>> On Wed, Jul 15, 2009 at 20:45, MilleBii <[email protected]> wrote:
>>>>> Are you on a single-node conf?
>>>>> If yes, I have the same problem, and some people have suggested
>>>>> earlier to use the hadoop pseudo-distributed config on a single
>>>>> server. Others have also suggested using hadoop's compress mode.
>>>>
>>>> Yes, that's a good point. Playing around with these options may help:
>>>>
>>>> mapred.output.compress
>>>> mapred.output.compression.type (BLOCK may help a lot here)
>>>> mapred.compress.map.output
>>>>
>>>>> But I have not been able to make it work on my PC because I get
>>>>> bogged down by some windows/hadoop compatibility issues. If you are
>>>>> on Linux you may have better luck; I'm interested in your results,
>>>>> by the way, so I know whether those problems go away when I move to
>>>>> Linux.
>>>>>
>>>>> 2009/7/15 Doğacan Güney <[email protected]>
>>>>>> On Wed, Jul 15, 2009 at 19:31, Tomislav Poljak <[email protected]> wrote:
>>>>>>> Hi,
>>>>>>> I'm trying to merge (using nutch-1.0 mergesegs) about 1.2MM pages,
>>>>>>> contained in 10 segments, on one machine, using:
>>>>>>>
>>>>>>> bin/nutch mergesegs crawl/merge_seg -dir crawl/segments
>>>>>>>
>>>>>>> but there is not enough space on a 500G disk to complete the merge
>>>>>>> task (I'm getting java.io.IOException: No space left on device in
>>>>>>> hadoop.log).
>>>>>>>
>>>>>>> Shouldn't 500G be enough disk space for this merge? Is this a bug?
>>>>>>> If it is not a bug, how much disk space is required for this merge?
>>>>>>
>>>>>> A lot :)
>>>>>>
>>>>>> Try deleting your hadoop temporary folders. If that doesn't help, you
>>>>>> can try merging the segment parts one by one. For example, move your
>>>>>> content/ directories out of the way and try merging again. If that
>>>>>> succeeds, you can merge the contents later and move the resulting
>>>>>> content/ into your merge_seg dir.
>>>>>>
>>>>>>> Tomislav
>>>>>>
>>>>>> --
>>>>>> Doğacan Güney
>>>>>
>>>>> --
>>>>> -MilleBii-

--
Doğacan Güney
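For the part-by-part merge workaround described in the quoted reply, a
rough shell sketch (the loop, the /scratch location, and the assumption
that mergesegs will accept content-only segments are illustrative; the
mergesegs command line itself is the one from the thread):

# Sketch only: move the bulky content/ parts aside, merge the rest,
# then merge the content/ parts separately, as suggested above.
# Paths under /scratch are hypothetical; point them at a disk with space.

mkdir -p /scratch/content-parts

# 1. Set each segment's content/ part aside.
for seg in crawl/segments/*; do
  mkdir -p "/scratch/content-parts/$(basename "$seg")"
  mv "$seg/content" "/scratch/content-parts/$(basename "$seg")/"
done

# 2. Merge the remaining parts (crawl_fetch, crawl_parse, parse_data, ...).
bin/nutch mergesegs crawl/merge_seg -dir crawl/segments

# 3. Merge the content/ parts on their own, then move the result into the
#    merged segment produced in step 2.
bin/nutch mergesegs /scratch/content-merged -dir /scratch/content-parts
mv /scratch/content-merged/*/content crawl/merge_seg/*/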
