On Tue, Jul 21, 2009 at 21:50, Tomislav Poljak <[email protected]> wrote:
> Hi,
> thanks for your answers. I've configured compression:
>
> mapred.output.compress = true
> mapred.compress.map.output = true
> mapred.output.compression.type = BLOCK
>
> (in XML format in hadoop-site.xml)
>
> and it works (it uses less disk space, and there are no more
> out-of-disk-space exceptions), but merging now takes a really long
> time. My next question is simple: is segment merging a necessary step
> (if I don't need everything in one segment and don't need the optional
> filtering), or is it ok to proceed straight to indexing? I ask because
> many tutorials and most re-crawl scripts include this step.
>
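For reference, those three settings would look roughly like this in
hadoop-site.xml (a minimal sketch; the property names and values are the
ones quoted above, the surrounding XML is just the usual Hadoop
configuration boilerplate):

  <property>
    <name>mapred.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.output.compression.type</name>
    <value>BLOCK</value>
  </property>
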
Not really. But if you recrawl a lot, old versions of pages will stay on
your disk, taking up unnecessary space.

To improve compression speed, take a look at:
http://code.google.com/p/hadoop-gpl-compression/
Lzo (de)compression is *very* fast.

> Tomislav
>
>
> On Wed, 2009-07-15 at 21:04 +0300, Doğacan Güney wrote:
>> On Wed, Jul 15, 2009 at 20:45, MilleBii <[email protected]> wrote:
>> > Are you on a single-node configuration?
>> > If yes, I have the same problem, and some people have suggested
>> > earlier to use the Hadoop pseudo-distributed config on a single
>> > server. Others have also suggested using Hadoop's compress mode.
>>
>> Yes, that's a good point. Playing around with these options may help:
>>
>> mapred.output.compress
>>
>> mapred.output.compression.type (BLOCK may help a lot here)
>>
>> mapred.compress.map.output
>>
>>
>> > But I have not been able to make it work on my PC because I get
>> > bogged down by some Windows/Hadoop compatibility issues.
>> > If you are on Linux you may have better luck. I'm interested in your
>> > results, by the way, so I know whether those problems go away when I
>> > move to Linux.
>> >
>> >
>> > 2009/7/15 Doğacan Güney <[email protected]>
>> >
>> >> On Wed, Jul 15, 2009 at 19:31, Tomislav Poljak <[email protected]> wrote:
>> >> > Hi,
>> >> > I'm trying to merge (using nutch-1.0 mergesegs) about 1.2MM pages,
>> >> > contained in 10 segments, on one machine, using:
>> >> >
>> >> > bin/nutch mergesegs crawl/merge_seg -dir crawl/segments
>> >> >
>> >> > but there is not enough space on a 500G disk to complete this merge
>> >> > (I get java.io.IOException: No space left on device in hadoop.log).
>> >> >
>> >> > Shouldn't 500G be enough disk space for this merge? Is this a bug? If
>> >> > this is not a bug, how much disk space is required for this merge?
>> >> >
>> >>
>> >> A lot :)
>> >>
>> >> Try deleting your Hadoop temporary folders. If that doesn't help, you
>> >> may try merging segment parts one by one. For example, move your
>> >> content/ directories out of the segments and try merging again. If
>> >> that succeeds, you can merge the content parts later and move the
>> >> resulting content/ into your merge_seg dir.
>> >>
>> >> > Tomislav
>> >> >
>> >>
>> >> --
>> >> Doğacan Güney
>> >>
>> >
>> > --
>> > -MilleBii-
>> >
>>
>

--
Doğacan Güney
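A rough sketch of the part-by-part merge suggested earlier in the thread
(park the bulky content/ parts somewhere, merge the remaining parts, then
deal with content/ separately); the paths below are only examples, so
adjust them to your own layout:

  # Park the content/ part of each segment outside crawl/segments.
  mkdir -p crawl/content_parked
  for seg in crawl/segments/*; do
      mkdir -p "crawl/content_parked/$(basename "$seg")"
      mv "$seg/content" "crawl/content_parked/$(basename "$seg")/"
  done

  # Merge the now much smaller segments.
  bin/nutch mergesegs crawl/merge_seg -dir crawl/segments

  # Later: restore the content/ parts, merge them in a second pass, and
  # move the resulting content/ into the merged segment under
  # crawl/merge_seg.

Whether this helps depends on how much of each segment's size sits in
content/ versus the other parts.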

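If you go the hadoop-gpl-compression route, the setup is roughly: install
the project's native LZO bindings, then point Hadoop at its codec classes.
A sketch of the relevant hadoop-site.xml entries follows; the codec class
and property names are the ones commonly used with that project, so
double-check them against its documentation:

  <!-- Register the LZO codecs (requires the native liblzo bindings). -->
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
  </property>
  <property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>
  <!-- Use LZO for the intermediate map output. -->
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>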