Hello - I am not a Nutch 2.x user, but a seed file that small should not take so
long. There is nothing wrong with a large seed file either, even one with a few
million URLs. Are your mappers/reducers slow? Is your HDFS slow? Or is it Accumulo?
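
If you want to take Nutch and Gora out of the picture for a moment, one rough
check is to time raw writes against HDFS and against an Accumulo BatchWriter
directly. The sketch below is untested and only illustrative - the instance
name, zookeepers, credentials, paths and table name are placeholders for your
setup - and it needs the Hadoop and Accumulo client libraries on the classpath.
It simply writes 10,000 small records to each layer and prints how long that took:

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;

public class WriteSpeedCheck {
  public static void main(String[] args) throws Exception {
    // Time 10,000 small line writes straight into HDFS.
    FileSystem fs = FileSystem.get(new Configuration());
    long start = System.currentTimeMillis();
    FSDataOutputStream out = fs.create(new Path("/tmp/write_speed_check"), true);
    for (int i = 0; i < 10000; i++) {
      out.writeBytes("http://example.com/page/" + i + "\n");
    }
    out.close();
    System.out.println("HDFS:     " + (System.currentTimeMillis() - start) + " ms");

    // Time 10,000 small mutations through an Accumulo BatchWriter.
    // Instance name, zookeepers, user, password and table are placeholders.
    ZooKeeperInstance instance = new ZooKeeperInstance("myInstance", "zk1:2181,zk2:2181");
    Connector conn = instance.getConnector("root", new PasswordToken("secret"));
    if (!conn.tableOperations().exists("write_speed_check")) {
      conn.tableOperations().create("write_speed_check");
    }
    start = System.currentTimeMillis();
    BatchWriter writer = conn.createBatchWriter("write_speed_check", new BatchWriterConfig());
    for (int i = 0; i < 10000; i++) {
      Mutation m = new Mutation(new Text("row" + i));
      m.put(new Text("f"), new Text("url"),
          new Value(("http://example.com/page/" + i).getBytes()));
      writer.addMutation(m);
    }
    writer.close(); // flushes any buffered mutations
    System.out.println("Accumulo: " + (System.currentTimeMillis() - start) + " ms");
  }
}

If both of those finish in seconds, the time is probably being spent in the
inject job itself, so look at the map task times in the JobHistory UI; if the
BatchWriter part is slow, look at Accumulo rather than Nutch.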
Markus
-----Original message-----
> From:Luis Magaña <l...@euphorica.com>
> Sent: Wednesday 9th March 2016 22:07
> To: user@nutch.apache.org
> Subject: Large seed Inject Slow to Accumulo
>
> Hello,
>
> I've set up a small sample Hadoop cluster of 6 servers with HDFS, ZooKeeper,
> Solr and Accumulo.
>
> I am running Nutch on top of the Hadoop cluster and injecting 10,000
> URLs from the seed.txt file.
>
> Everything works as it should, nothing breaks, everything indexes, etc.,
> and the crawl job finishes OK. However, the inject stage of those
> 10,000 URLs takes up to 50 minutes.
>
> I wonder whether that is a normal time for an inject, whether I should be
> looking for a possible problem (maybe in the Gora Accumulo module?), or whether
> I am simply being naive and my seed.txt should not be so large to begin with.
>
> A bit more information about my setup:
>
> Hadoop 2.7.2
> Accumulo 1.5.1
> Solr 4.10.3
>
> Accumulo currently has about 500 tables with some 200 million entries
> (not sure whether that matters). The Accumulo logs show no major errors,
> warnings, or Java exceptions, and neither do the MapReduce logs in Hadoop.
>
> Thank you very much for your help and your excellent crawler.
>
>
> --
> Luis Magaña
> www.euphorica.com
>