Hi John,
I have no idea about that either.
Justin

On Fri, Jun 12, 2009 at 8:05 AM, John Martyniak <
[email protected]> wrote:

> Justin,
>
> Thanks for the response.
>
> I was having a similar issue.  I was trying to merge the segments for crawls
> during the month of May, probably around 13-15 GB, and by the time everything
> had run it had used around 900 GB of tmp space, which doesn't seem very
> efficient.
>
> I will try this out and see if it changes anything.
>
> Do you know if there is any risk in using the following:
> <property>
>   <name>mapred.min.split.size</name>
>   <value>671088640</value>
> </property>
>
> as suggested in the article?
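>
> (For reference, assuming that value is in bytes, 671088640 works out to
> exactly 640 MB, so each map task should get at least that much input:
>
>     671088640 bytes = 640 MB minimum split size
>     15 GB / 640 MB  ~= 24 map tasks for the largest merge
>
> That is just my back-of-the-envelope reading of the setting, so please
> correct me if I have it wrong.)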
>
> -John
>
> On Jun 11, 2009, at 7:25 PM, Justin Yao wrote:
>
>> Hi John,
>>
>> I had the same issue before but never found a solution.
>> Here is a workaround mentioned by someone on this mailing list; you may
>> want to give it a try:
>>
>> Seemingly abnormal temp space use by segment merger
>>
>> http://www.nabble.com/Content(source-code)-of-web-pages-crawled-by-nutch-tt23495506.html#a23522569
>>
>> Regards,
>> Justin
>>
>> On Sat, Jun 6, 2009 at 4:09 PM, John Martyniak <
>> [email protected]> wrote:
>>
>>> Ok.
>>>
>>> So, an update on this item.
>>>
>>> I did start running Nutch with Hadoop; I am trying a single-node config
>>> just to test it out.
>>>
>>> It took forever to get all of the files into the DFS (just over 80 GB),
>>> but they are in there.  So I started the SegmentMerge job, and it is
>>> working flawlessly, though still a little slow.
>>>
>>> Also, looking at the CPU stats, they sometimes go over 20%, but not by
>>> much and not often.  The disk is very lightly taxed; the peak was about
>>> 20 MB/sec, and the drives and interface are rated at 3 GB/sec, so no
>>> issue there.
>>>
>>> I tried to set the map jobs to 7 and the reduce jobs to 3, but after I
>>> restarted everything it is still only using 2 and 1.  Any ideas?  I made
>>> that change in the hadoop-site.xml file, BTW.
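>>>
>>> For reference, here is roughly what I added to hadoop-site.xml (I am
>>> guessing at the property names, so please correct me if they are wrong):
>>>
>>> <property>
>>>   <name>mapred.map.tasks</name>
>>>   <value>7</value>
>>> </property>
>>> <property>
>>>   <name>mapred.reduce.tasks</name>
>>>   <value>3</value>
>>> </property>
>>>
>>> I have not touched mapred.tasktracker.map.tasks.maximum or
>>> mapred.tasktracker.reduce.tasks.maximum, which I gather cap how many
>>> tasks run concurrently on each node (the default is 2), so maybe that
>>> is the limit I am hitting.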
>>>
>>> -John
>>>
>>>
>>> On Jun 4, 2009, at 10:00 AM, Andrzej Bialecki wrote:
>>>
>>>> John Martyniak wrote:
>>>>
>>>>> Andrzej,
>>>>> I am a little embarrassed asking, but is there a setup guide for
>>>>> setting up Hadoop for Nutch 1.0, or is it the same process as setting
>>>>> up for Nutch 0.17 (which I think is the existing guide out there)?
>>>>>
>>>>>
>>>> Basically, yes - but that guide is primarily about setting up a Hadoop
>>>> cluster using the Hadoop pieces distributed with Nutch, and as such the
>>>> instructions are already slightly outdated.  So it's best simply to
>>>> install a clean Hadoop 0.19.1 according to the instructions on the
>>>> Hadoop wiki, and then build the nutch*.job file separately.
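>>>>
>>>> To build the job file, running the "job" ant target in the Nutch source
>>>> tree should be enough (assuming the standard build.xml):
>>>>
>>>>      ant job
>>>>
>>>> which should leave something like build/nutch-1.0.job.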
>>>>
>>>>> Also, I have Hadoop already running for some other applications not
>>>>> associated with Nutch; can I use the same install?  I think it is the
>>>>> same version that Nutch 1.0 uses.  Or is it just easier to set it up
>>>>> using the Nutch config?
>>>>>
>>>>>
>>>> Yes, it's perfectly OK to use Nutch with an existing Hadoop cluster of
>>>> the same vintage (which is 0.19.1 for Nutch 1.0).  In fact, I would
>>>> strongly recommend this approach over the usual "dirty" way of setting
>>>> up Nutch by replicating the local build dir ;)
>>>>
>>>> Just specify the nutch*.job file like this:
>>>>
>>>>      bin/hadoop jar nutch*.job <className> <args ..>
>>>>
>>>> where className is one of the Nutch command-line tool classes and args
>>>> are its arguments.  You can also slightly modify the bin/nutch script so
>>>> that you don't have to specify fully-qualified class names.
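>>>>
>>>> For example, to run the segment merger (if I remember the argument
>>>> order correctly: output directory first, then -dir pointing at the
>>>> existing segments; the paths below are just placeholders):
>>>>
>>>>      bin/hadoop jar nutch-1.0.job org.apache.nutch.segment.SegmentMerger \
>>>>          crawl/merged_segments -dir crawl/segments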
>>>>
>>>> --
>>>> Best regards,
>>>> Andrzej Bialecki     <><
>>>> ___. ___ ___ ___ _ _   __________________________________
>>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>>> http://www.sigram.com  Contact: info at sigram dot com
>>>>
>>>>
>>>>
>>>
> John Martyniak
> President
> Before Dawn Solutions, Inc.
> 9457 S. University Blvd #266
> Highlands Ranch, CO 80126
> o: 877-499-1562 x707
> f: 877-499-1562
> c: 303-522-1756
> e: [email protected]
>
>
