Doğacan Güney wrote:
> On Tue, Jul 21, 2009 at 21:50, Tomislav Poljak<[email protected]> wrote:
>
>> Hi,
>> thanks for your answers, I've configured compression:
>>
>> mapred.output.compress = true
>> mapred.compress.map.output = true
>> mapred.output.compression.type = BLOCK
>>
>> (in XML format in hadoop-site.xml)
>>
>> and it works (and uses less disk space, no more out-of-disk-space
>> exceptions), but merging now takes a really long time. My next question
>> is simple: is segment merging a necessary step (if I don't need
>> everything in one segment and don't use the optional filtering), or is
>> it OK to proceed with indexing? I ask because many tutorials and most
>> re-crawl scripts have this step.
>>
>
> Not really. But if you recrawl a lot, old versions of pages will stay
> on your disk, taking up unnecessary space.
>
> To improve compression speed, take a look at:
>
> http://code.google.com/p/hadoop-gpl-compression/
>
> LZO (de)compression is *very* fast.
>
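As a side note, my understanding is that these settings go into hadoop-site.xml roughly as below. This is a sketch only: the first three properties are the ones quoted above, while the codec entries (io.compression.codecs, mapred.map.output.compression.codec and the com.hadoop.compression.lzo.LzoCodec class) are my assumption of what hadoop-gpl-compression provides, and they additionally need its jar plus the native LZO libraries on every node:

  <!-- inside the <configuration> element of hadoop-site.xml -->
  <property>
    <name>mapred.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.output.compression.type</name>
    <value>BLOCK</value>
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>

  <!-- assumed LZO wiring; only valid if hadoop-gpl-compression is installed -->
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec</value>
  </property>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>
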
I also find that segment merging is very heavy on CPU and disk, even though the collection crawled so far is quite small: ~25,000 documents, with the segments holding about 650 MB of data. It is really a showstopper for me. A FAQ entry or some documentation on how to improve the performance of the segment merge step would be very helpful.

>
>> Tomislav
>>
>> On Wed, 2009-07-15 at 21:04 +0300, Doğacan Güney wrote:
>>
>>> On Wed, Jul 15, 2009 at 20:45, MilleBii<[email protected]> wrote:
>>>
>>>> Are you on a single-node conf?
>>>> If yes, I have the same problem, and some people have suggested earlier
>>>> to use the Hadoop pseudo-distributed config on a single server.
>>>> Others have also suggested using Hadoop's compression mode.
>>>>
>>> Yes, that's a good point. Playing around with these options may help:
>>>
>>> mapred.output.compress
>>> mapred.output.compression.type (BLOCK may help a lot here)
>>> mapred.compress.map.output
>>>
>>>> But I have not been able to make it work on my PC because I get bogged
>>>> down by some Windows/Hadoop compatibility issues.
>>>> If you are on Linux you may have more luck; I'd be interested in your
>>>> results, by the way, so I know whether moving to Linux would solve
>>>> those problems for me.
>>>>
>>>> 2009/7/15 Doğacan Güney <[email protected]>
>>>>
>>>>> On Wed, Jul 15, 2009 at 19:31, Tomislav Poljak<[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>> I'm trying to merge (using nutch-1.0 mergesegs) about 1.2MM pages
>>>>>> contained in 10 segments on one machine, using:
>>>>>>
>>>>>> bin/nutch mergesegs crawl/merge_seg -dir crawl/segments
>>>>>>
>>>>>> but there is not enough space on a 500G disk to complete this merge
>>>>>> task (I get java.io.IOException: No space left on device in hadoop.log).
>>>>>>
>>>>>> Shouldn't 500G be enough disk space for this merge? Is this a bug? If
>>>>>> it is not a bug, how much disk space does this merge require?
>>>>>>
>>>>> A lot :)
>>>>>
>>>>> Try deleting your Hadoop temporary folders. If that doesn't help, you
>>>>> may try merging segment parts one by one. For example, move your
>>>>> content/ directories aside and try merging again. If that succeeds,
>>>>> you can then merge the contents later and move the resulting content/
>>>>> into your merge_seg dir.
>>>>>
>>>>>> Tomislav
>>>>>>
>>>>> --
>>>>> Doğacan Güney
>>>>>
>>>> --
>>>> -MilleBii-
>>>>
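For what it's worth, below is the rough shell sequence I would try for the "merge segment parts one by one" suggestion above, on a local (non-HDFS) single-node setup. Only the mergesegs invocation comes from the thread; the backup directory name and the loop are my own placeholders:

  # Sketch only: Nutch 1.0, local filesystem, segments under crawl/segments.
  # Step 1: move the bulky content/ parts aside so the first merge pass only
  # has to handle the smaller parts (crawl_generate, crawl_fetch, crawl_parse,
  # parse_data, parse_text).
  mkdir -p content_backup
  for seg in crawl/segments/*; do
    mkdir -p "content_backup/$(basename "$seg")"
    [ -d "$seg/content" ] && mv "$seg/content" "content_backup/$(basename "$seg")/"
  done

  # Step 2: merge the slimmed-down segments.
  bin/nutch mergesegs crawl/merge_seg -dir crawl/segments

  # Step 3 (later): handle content/ the same way -- move the other parts aside,
  # restore the content/ directories, rerun mergesegs, and move the resulting
  # content/ into the merged segment produced in step 2.

  # Clearing Hadoop's temporary directory between runs also frees space; in
  # local mode it defaults to /tmp/hadoop-${user.name} unless hadoop.tmp.dir
  # is set to something else.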
