Thanks for your reply; that was what I suspected. I'll find out exactly
which piece is slow and post back my findings.

My primary suspect is Accumulo: the fetch, parse, and index stages all go
pretty fast, and it is only when injecting or updating tables that the
timing gets so long.
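
As a first step I'll probably time raw writes to a scratch table outside of
Nutch/Gora, roughly along these lines (the instance name, ZooKeeper quorum,
credentials, table name, and batch writer settings below are just placeholders
for my setup, not anything Nutch uses):

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class AccumuloWriteCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder instance name, ZooKeeper quorum, and credentials.
        ZooKeeperInstance instance = new ZooKeeperInstance("myInstance", "zk1:2181,zk2:2181");
        Connector conn = instance.getConnector("root", new PasswordToken("secret"));

        // Scratch table, created only for this throughput check.
        if (!conn.tableOperations().exists("write_test")) {
            conn.tableOperations().create("write_test");
        }

        BatchWriterConfig cfg = new BatchWriterConfig();
        cfg.setMaxMemory(50 * 1024 * 1024);   // 50 MB buffer before flushing
        cfg.setMaxWriteThreads(4);            // parallel writes to tablet servers

        long start = System.currentTimeMillis();
        BatchWriter writer = conn.createBatchWriter("write_test", cfg);
        for (int i = 0; i < 10000; i++) {     // roughly the size of my seed list
            Mutation m = new Mutation(new Text(String.format("row_%05d", i)));
            m.put(new Text("f"), new Text("url"),
                  new Value(("http://example.com/" + i).getBytes()));
            writer.addMutation(m);
        }
        writer.close();                       // flushes any remaining mutations
        System.out.println("10000 mutations in "
                + (System.currentTimeMillis() - start) + " ms");
    }
}

If 10,000 mutations go through in seconds, the tablet servers themselves are
fine and I'll look more closely at the gora-accumulo side of the inject job.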

Thanks.

On 03/10/2016 04:22 AM, Markus Jelsma wrote:
> Hello - I am not a Nutch 2.x user, but a seed file that small should not take
> so long. There is nothing wrong with a seed file that large, even if it were
> a few million. Are your mappers/reducers slow? Is your HDFS slow? Or is it
> Accumulo?
> Markus
>  
> -----Original message-----
>> From:Luis Magaña <l...@euphorica.com>
>> Sent: Wednesday 9th March 2016 22:07
>> To: user@nutch.apache.org
>> Subject: Large seed Inject Slow to Accumulo
>>
>> Hello,
>>
>> I've set up a small sample Hadoop cluster of 6 servers with HDFS, ZooKeeper,
>> Solr, and Accumulo.
>>
>> I am running Nutch on top of the Hadoop cluster and injecting 10,000
>> URLs from the seed.txt file.
>>
>> Everything works as it should, nothing breaks, everything indexes, etc.,
>> and the crawl job finishes OK. However, the inject stage for those
>> 10,000 URLs takes up to 50 minutes.
>>
>> I wonder if that is a normal time for an inject, if I should be
>> looking at a possible problem (maybe the gora-accumulo module?), or if I
>> am simply being naive and my seed.txt should not be so large to begin with.
>>
>> A bit more information about my setup:
>>
>> Hadoop 2.7.2
>> Accumulo 1.5.1
>> Solr 4.10.3
>>
>> Currently Accumulo has about 500 tables with some 200 million entries
>> (not sure if that matters). The Accumulo logs show no major errors,
>> warnings, or Java exceptions, and neither do the MapReduce logs in Hadoop.
>>
>> Thank you very much for your help and your excellent crawler.
>>
>>
>> -- 
>> Luis Magaña
>> www.euphorica.com
>>

-- 
Luis Magaña
www.euphorica.com
