On Wed, Jul 29, 2009 at 13:11, reinhard schwab <[email protected]> wrote:
> Doğacan Güney wrote:
>> On Tue, Jul 21, 2009 at 21:50, Tomislav Poljak <[email protected]> wrote:
>>> Hi,
>>> thanks for your answers, I've configured compression:
>>>
>>> mapred.output.compress = true
>>> mapred.compress.map.output = true
>>> mapred.output.compression.type = BLOCK
>>>
>>> (in xml format in hadoop-site.xml)
>>>
>>> and it works (and uses less disk space, no more out-of-disk-space
>>> exceptions), but merging now takes a really long time. My next question
>>> is simple: is segment merging a necessary step (if I don't need
>>> everything in one segment and don't do any optional filtering), or is
>>> it OK to proceed with indexing? I ask because many tutorials and most
>>> re-crawl scripts include this step.
>>
>> Not really. But if you recrawl a lot, old versions of pages will stay
>> on your disk, taking up unnecessary space.
>>
>> To improve compression speed, take a look at:
>>
>> http://code.google.com/p/hadoop-gpl-compression/
>>
>> LZO (de)compression is *very* fast.
>
> I also find that segment merging is heavy on resources such as CPU and
> disk, although the document collection crawled so far is very small,
> ~25000 documents; the segments contain about 650 MB of data. It's really
> a showstopper for me. It would be very helpful to have a FAQ entry or
> some documentation on how to improve the performance of the segment
> merge task.
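The three settings quoted above go into hadoop-site.xml in the usual
Hadoop property form. As a sketch, the property names and values are
exactly the ones from the thread; only the surrounding XML boilerplate is
added:

<!-- Sketch: merge these into your existing hadoop-site.xml;
     names/values are the ones quoted above. -->
<configuration>
  <property>
    <name>mapred.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.output.compression.type</name>
    <value>BLOCK</value>
  </property>
</configuration>
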
You may be interested in: http://issues.apache.org/jira/browse/NUTCH-650

With hbase integration, we do away entirely with much of this, including
segment merging. I intend to commit the initial hbase code to a nutch
branch this week (and write a wiki guide about it). Many features are
still missing, but the code should be stable enough to test.

>>> Tomislav
>>>
>>> On Wed, 2009-07-15 at 21:04 +0300, Doğacan Güney wrote:
>>>> On Wed, Jul 15, 2009 at 20:45, MilleBii <[email protected]> wrote:
>>>>> Are you on a single-node conf?
>>>>> If yes, I have the same problem, and some people have suggested
>>>>> earlier to use the hadoop pseudo-distributed config on a single
>>>>> server. Others have also suggested using hadoop's compress mode.
>>>>
>>>> Yes, that's a good point. Playing around with these options may help:
>>>>
>>>> mapred.output.compress
>>>> mapred.output.compression.type (BLOCK may help a lot here)
>>>> mapred.compress.map.output
>>>>
>>>>> But I have not been able to make it work on my PC because I get
>>>>> bogged down by some windows/hadoop compatibility issues. If you are
>>>>> on Linux you may have better luck; I'm interested in your results,
>>>>> by the way, so I know whether those problems go away when I move to
>>>>> Linux.
>>>>>
>>>>> 2009/7/15 Doğacan Güney <[email protected]>
>>>>>> On Wed, Jul 15, 2009 at 19:31, Tomislav Poljak <[email protected]> wrote:
>>>>>>> Hi,
>>>>>>> I'm trying to merge (using nutch-1.0 mergesegs) about 1.2MM pages,
>>>>>>> contained in 10 segments, on one machine, using:
>>>>>>>
>>>>>>> bin/nutch mergesegs crawl/merge_seg -dir crawl/segments
>>>>>>>
>>>>>>> but there is not enough space on a 500G disk to complete the merge
>>>>>>> task (I'm getting java.io.IOException: No space left on device in
>>>>>>> hadoop.log).
>>>>>>>
>>>>>>> Shouldn't 500G be enough disk space for this merge? Is this a bug?
>>>>>>> If it is not a bug, how much disk space is required for this merge?
>>>>>>
>>>>>> A lot :)
>>>>>>
>>>>>> Try deleting your hadoop temporary folders. If that doesn't help, you
>>>>>> can try merging the segment parts one by one. For example, move your
>>>>>> content/ directories out of the way and try merging again. If that
>>>>>> succeeds, you can merge the contents later and move the resulting
>>>>>> content/ into your merge_seg dir.
>>>>>>
>>>>>>> Tomislav
>>>>>>
>>>>>> --
>>>>>> Doğacan Güney
>>>>>
>>>>> --
>>>>> -MilleBii-

--
Doğacan Güney
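For the part-by-part merge workaround described in the quoted reply, a
rough shell sketch (the loop, the /scratch location, and the assumption
that mergesegs will accept content-only segments are illustrative; the
mergesegs command line itself is the one from the thread):

# Sketch only: move the bulky content/ parts aside, merge the rest,
# then merge the content/ parts separately, as suggested above.
# Paths under /scratch are hypothetical; point them at a disk with space.

mkdir -p /scratch/content-parts

# 1. Set each segment's content/ part aside.
for seg in crawl/segments/*; do
  mkdir -p "/scratch/content-parts/$(basename "$seg")"
  mv "$seg/content" "/scratch/content-parts/$(basename "$seg")/"
done

# 2. Merge the remaining parts (crawl_fetch, crawl_parse, parse_data, ...).
bin/nutch mergesegs crawl/merge_seg -dir crawl/segments

# 3. Merge the content/ parts on their own, then move the result into the
#    merged segment produced in step 2.
bin/nutch mergesegs /scratch/content-merged -dir /scratch/content-parts
mv /scratch/content-merged/*/content crawl/merge_seg/*/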
