Spoke too soon: the fetch completed in 21 minutes.
On Wed, Jul 3, 2013 at 8:32 AM, h b <hb6...@gmail.com> wrote:

> oh and yes, generate.max.count is set to 5000
>
> On Wed, Jul 3, 2013 at 8:29 AM, h b <hb6...@gmail.com> wrote:
>
>> I dropped my webpage database and restarted with 5 seed URLs. The first
>> fetch completed in a few seconds. The second run still shows 1 reduce
>> running, although it shows as 100% complete, so my thought is that it is
>> writing out to disk, though it has been about 30+ minutes.
>> Again, I had 80 reducers. When I look at the logs of these reducers in
>> the Hadoop JobTracker, I see
>>
>> 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs
>> in 0 queues
>>
>> in all of them, which leads me to think that the 79 completed reducers
>> actually fetched nothing, which might explain why this 1 stuck reducer is
>> working so hard.
>>
>> This may be expected, since I am crawling a single domain. This one
>> reducer's log on the JobTracker, however, is empty. Don't know what to
>> make of that.
>>
>> On Tue, Jul 2, 2013 at 4:15 PM, Lewis John Mcgibbney <
>> lewis.mcgibb...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> On Tue, Jul 2, 2013 at 3:53 PM, h b <hb6...@gmail.com> wrote:
>>>
>>>> So, I tried this with the generate.max.count property set to 5000,
>>>> rebuilt (ant; ant jar; ant job), and reran fetch.
>>>> It still appears the same: the first 79 reducers zip through and the
>>>> last one is crawling, literally...
>>>
>>> Sorry, I should have been more explicit. This property does not directly
>>> affect fetching. It is used when GENERATING fetch lists, meaning that it
>>> needs to be present and acknowledged at the generate phase, before
>>> fetching is executed.
>>> Besides this, is there any progress being made at all on the last
>>> reduce? If you look at the CPU (and heap) for the box this is running
>>> on, it is usual to see high utilization of both. Maybe this output
>>> writer is just taking a good while to write data down to HDFS...
>>> assuming you are using 1.x.
>>>
>>>> As for the logs, I mentioned on one of my earlier threads that when I
>>>> run from the deploy directory, I am not getting any logs generated.
>>>> I looked for the logs directory under local as well as under deploy,
>>>> and, just to make sure, also on the grid. I do not see the logs
>>>> directory. So I created it manually under deploy before starting
>>>> fetch, and still there is nothing in this directory.
>>>
>>> OK, so when you run Nutch as a deployed job, your logs are present
>>> within $HADOOP_LOG_DIR. You can check some logs on the JobTracker
>>> WebApp, e.g. you will be able to see the reduce tasks for the fetch
>>> job, and you will also be able to see varying snippets or all of the
>>> log there.
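For anyone finding this thread later: since generate.max.count is only consulted at the generate phase, it has to be in conf/nutch-site.xml (and rebuilt into the job jar with ant job) before you run generate, not fetch. A sketch of the relevant nutch-site.xml entries, assuming the property names from the stock nutch-default.xml; the value 5000 is just the figure discussed above:

```xml
<!-- conf/nutch-site.xml (sketch; rebuild the job jar after editing) -->
<configuration>
  <!-- Cap the number of URLs per host/domain in each generated fetch list -->
  <property>
    <name>generate.max.count</name>
    <value>5000</value>
  </property>
  <!-- How URLs are counted against generate.max.count: "host" or "domain" -->
  <property>
    <name>generate.count.mode</name>
    <value>host</value>
  </property>
</configuration>
```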
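On the single-domain symptom: Nutch partitions URLs by host by default, so a crawl of one domain puts every URL into one partition and one polite fetch queue, which would explain 79 idle reducers and one busy one. If politeness toward that host is acceptable to relax, the fetcher queue settings below are the usual knobs. This is a sketch assuming the standard nutch-default.xml property names and defaults; tune the values to your own crawl policy:

```xml
<!-- conf/nutch-site.xml (sketch) -->
<configuration>
  <!-- Queue URLs per host (default); one host means one queue -->
  <property>
    <name>fetcher.queue.mode</name>
    <value>byHost</value>
  </property>
  <!-- Allow more than one concurrent fetch thread against a single queue -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>2</value>
  </property>
  <!-- Seconds to wait between successive requests to the same server -->
  <property>
    <name>fetcher.server.delay</name>
    <value>1.0</value>
  </property>
</configuration>
```

Raising fetcher.threads.per.queue above 1 trades crawl politeness for speed against that one host, so check the target site's tolerance (and robots.txt Crawl-delay) first.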