On Tue, Jul 21, 2009 at 21:50, Tomislav Poljak <[email protected]> wrote:
> Hi,
> thanks for your answers. I've configured compression:
>
> mapred.output.compress = true
> mapred.compress.map.output = true
> mapred.output.compression.type = BLOCK
>
> (in XML format in hadoop-site.xml)
>
> and it works (it uses less disk space, and there are no more
> out-of-disk-space exceptions), but merging now takes a really long
> time. My next question is simple: is segment merging a necessary step
> (if I don't need everything in one segment and don't need the optional
> filtering), or is it ok to proceed straight to indexing? I ask because
> many tutorials and most re-crawl scripts include this step.
>
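For reference, those three settings would look roughly like this in
hadoop-site.xml (a minimal sketch; the property names and values are the
ones quoted above, the surrounding XML is just the usual Hadoop
configuration boilerplate):

  <property>
    <name>mapred.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.output.compression.type</name>
    <value>BLOCK</value>
  </property>
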
Not really. But if you recrawl a lot, old versions of pages will stay on
your disk, taking up unnecessary space.

To improve compression speed, take a look at:
http://code.google.com/p/hadoop-gpl-compression/
Lzo (de)compression is *very* fast.

> Tomislav
>
>
> On Wed, 2009-07-15 at 21:04 +0300, Doğacan Güney wrote:
>> On Wed, Jul 15, 2009 at 20:45, MilleBii <[email protected]> wrote:
>> > Are you on a single-node configuration?
>> > If yes, I have the same problem, and some people have suggested
>> > earlier to use the Hadoop pseudo-distributed config on a single
>> > server. Others have also suggested using Hadoop's compress mode.
>>
>> Yes, that's a good point. Playing around with these options may help:
>>
>> mapred.output.compress
>>
>> mapred.output.compression.type (BLOCK may help a lot here)
>>
>> mapred.compress.map.output
>>
>>
>> > But I have not been able to make it work on my PC because I get
>> > bogged down by some Windows/Hadoop compatibility issues.
>> > If you are on Linux you may have better luck. I'm interested in your
>> > results, by the way, so I know whether those problems go away when I
>> > move to Linux.
>> >
>> >
>> > 2009/7/15 Doğacan Güney <[email protected]>
>> >
>> >> On Wed, Jul 15, 2009 at 19:31, Tomislav Poljak <[email protected]> wrote:
>> >> > Hi,
>> >> > I'm trying to merge (using nutch-1.0 mergesegs) about 1.2MM pages,
>> >> > contained in 10 segments, on one machine, using:
>> >> >
>> >> > bin/nutch mergesegs crawl/merge_seg -dir crawl/segments
>> >> >
>> >> > but there is not enough space on a 500G disk to complete this merge
>> >> > (I get java.io.IOException: No space left on device in hadoop.log).
>> >> >
>> >> > Shouldn't 500G be enough disk space for this merge? Is this a bug? If
>> >> > this is not a bug, how much disk space is required for this merge?
>> >> >
>> >>
>> >> A lot :)
>> >>
>> >> Try deleting your Hadoop temporary folders. If that doesn't help, you
>> >> may try merging segment parts one by one. For example, move your
>> >> content/ directories out of the segments and try merging again. If
>> >> that succeeds, you can merge the content parts later and move the
>> >> resulting content/ into your merge_seg dir.
>> >>
>> >> > Tomislav
>> >> >
>> >>
>> >> --
>> >> Doğacan Güney
>> >>
>> >
>> > --
>> > -MilleBii-
>> >
>>
>

--
Doğacan Güney
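A rough sketch of the part-by-part merge suggested earlier in the thread
(park the bulky content/ parts somewhere, merge the remaining parts, then
deal with content/ separately); the paths below are only examples, so
adjust them to your own layout:

  # Park the content/ part of each segment outside crawl/segments.
  mkdir -p crawl/content_parked
  for seg in crawl/segments/*; do
      mkdir -p "crawl/content_parked/$(basename "$seg")"
      mv "$seg/content" "crawl/content_parked/$(basename "$seg")/"
  done

  # Merge the now much smaller segments.
  bin/nutch mergesegs crawl/merge_seg -dir crawl/segments

  # Later: restore the content/ parts, merge them in a second pass, and
  # move the resulting content/ into the merged segment under
  # crawl/merge_seg.

Whether this helps depends on how much of each segment's size sits in
content/ versus the other parts.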

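If you go the hadoop-gpl-compression route, the setup is roughly: install
the project's native LZO bindings, then point Hadoop at its codec classes.
A sketch of the relevant hadoop-site.xml entries follows; the codec class
and property names are the ones commonly used with that project, so
double-check them against its documentation:

  <!-- Register the LZO codecs (requires the native liblzo bindings). -->
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
  </property>
  <property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>
  <!-- Use LZO for the intermediate map output. -->
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>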