Hi Julien,

Thank you so much. I will reply inline.

From: Julien Nioche <[email protected]>
To: "[email protected]" <[email protected]>
Date: 07/10/2014 12:17 PM
Subject: Re: Nutch local: large crawls, extremely slow, small solr index

> Hi Craig
>
> See comments below; I will also comment on your other mail separately:
>
> On 9 July 2014 20:58, Craig Leinoff <[email protected]> wrote:
>
>> Hello,
>>
>> I have a handful of questions about Nutch, and it's unclear whether it's
>> considered "impolite" to combine them all into one. As a result, I'm
>> going to start with the most pressing issue first and, I guess, maybe
>> send other ones in separate messages?
>
> Don't worry - it's not impolite at all.

OK! Thanks!

>> Our primary issue is that (at least) one of our crawls is going very
>> slowly. We have a single system running Nutch in local mode, with no
>> external Hadoop (at the moment), indexing to a Solr cluster hosted on
>> other servers internal to our network.
>
> I'd really recommend that you run it in pseudo-distributed mode instead
> of local. Presumably your server has multiple cores, and p-d mode would
> allow you to leverage them. You'd also be able to use the MapReduce web
> GUI to monitor your crawl and check the counters, and you'd have easier
> access to the logs per job. Local mode should be used for testing /
> debugging, but not in production.

OK. Until now, I had never heard of "pseudo-distributed" mode, so that's
really great; thank you for recommending it. The implication, it seems, is
that we create a single-node Hadoop cluster on the same machine as Nutch,
right? I did find this URL:

https://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial

but the "as explained here" link, which is supposed to point to some sort
of "stable" Hadoop documentation, is a 404. :( Presumably this is a
reasonable analogue:

http://hadoop.apache.org/docs/r1.2.1/single_node_setup.html
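If I understand the tutorial correctly, the actual switch from local to
pseudo-distributed comes down to the fs.default.name and mapred.job.tracker
values in the Hadoop configuration on the classpath (the single-node setup
page appears to use hdfs://localhost:9000 and localhost:9001 respectively).
As a sanity check, I wrote myself the little sketch below - entirely my
own, not from any tutorial - to print which values a job client would
actually resolve, so I can tell which mode Nutch will really run in:

    // ModeCheck.java - my own sanity-check sketch, not from the Nutch or
    // Hadoop docs. In Hadoop 1.x the defaults "local" and "file:///" mean
    // local mode; the tutorial's "localhost:9001" and
    // "hdfs://localhost:9000" mean the pseudo-distributed daemons are used.
    import org.apache.hadoop.mapred.JobConf;

    public class ModeCheck {
      public static void main(String[] args) {
        // JobConf layers mapred-site.xml on top of core-site.xml, so it
        // sees the same values a submitted job would.
        JobConf conf = new JobConf();
        System.out.println("mapred.job.tracker = "
            + conf.get("mapred.job.tracker", "local"));
        System.out.println("fs.default.name    = "
            + conf.get("fs.default.name", "file:///"));
      }
    }

Does that sound like the right mental model?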
>> We're restricted, for security (and, in turn, financial) reasons, in our
>> setup. This server is crawling a number of websites, only one of which
>> should be at all "large," at over 100,000 pages. Nutch, its temporary
>> directory, its crawls, and everything else are running on an NFS mount.
>>
>> The crawl we're having an issue with is for a collection that should
>> contain about 30,000 documents. We have Nutch scheduled to crawl it once
>> every day. It is set to go for 10 rounds, with a topN of 50,000. Without
>> a doubt, it does all 10 rounds and, I believe, 50,000 records. For some
>> reason, the whole process takes a little under 48 hours, and when it has
>> imported everything into Solr, there are indeed, as I said, about 30,000
>> documents.
>>
>> =====
>>
>> A few things of note:
>>
>> From the log output of a recent job, I've extracted all the lines
>> related to how long each process took. The fetcher usually takes a solid
>> hour, but the ParseSegment job always seems to take significantly
>> longer. Have a look:
>>
>> --
>> Injector: finished at 2014-07-05 14:56:52, elapsed: 00:01:57
>> Generator: finished at 2014-07-05 15:01:16, elapsed: 00:04:20
>> Fetcher: finished at 2014-07-05 16:09:10, elapsed: 01:07:52
>> ParseSegment: finished at 2014-07-05 18:25:29, elapsed: 02:16:16
>> CrawlDb update: finished at 2014-07-05 18:30:27, elapsed: 00:04:55
>> LinkDb: finished at 2014-07-05 18:31:26, elapsed: 00:00:56
>> Indexer: finished at 2014-07-05 18:58:34, elapsed: 00:24:47
>> Generator: finished at 2014-07-05 20:01:59, elapsed: 00:04:09
>> Fetcher: finished at 2014-07-05 21:14:24, elapsed: 01:12:22
>> ParseSegment: finished at 2014-07-05 23:30:31, elapsed: 02:16:05
>> CrawlDb update: finished at 2014-07-05 23:35:51, elapsed: 00:05:16
>> LinkDb: finished at 2014-07-05 23:36:57, elapsed: 00:01:03
>> Indexer: finished at 2014-07-06 00:05:58, elapsed: 00:25:46
>> Generator: finished at 2014-07-06 01:10:40, elapsed: 00:04:00
>> Fetcher: finished at 2014-07-06 02:19:49, elapsed: 01:09:07
>> ParseSegment: finished at 2014-07-06 04:19:18, elapsed: 01:59:26
>> CrawlDb update: finished at 2014-07-06 04:24:42, elapsed: 00:05:20
>> LinkDb: finished at 2014-07-06 04:25:44, elapsed: 00:00:59
>> Indexer: finished at 2014-07-06 04:54:13, elapsed: 00:25:29
>> Generator: finished at 2014-07-06 05:59:19, elapsed: 00:04:07
>> Fetcher: finished at 2014-07-06 07:07:44, elapsed: 01:08:21
>> ParseSegment: finished at 2014-07-06 09:15:45, elapsed: 02:07:59
>> CrawlDb update: finished at 2014-07-06 09:21:08, elapsed: 00:05:19
>> LinkDb: finished at 2014-07-06 09:22:09, elapsed: 00:00:59
>> Indexer: finished at 2014-07-06 09:49:52, elapsed: 00:25:03
>> Generator: finished at 2014-07-06 10:55:52, elapsed: 00:04:15
>> Fetcher: finished at 2014-07-06 12:05:19, elapsed: 01:09:23
>> ParseSegment: finished at 2014-07-06 13:59:10, elapsed: 01:53:49
>> CrawlDb update: finished at 2014-07-06 14:05:25, elapsed: 00:06:11
>> LinkDb: finished at 2014-07-06 14:06:38, elapsed: 00:01:09
>> Indexer: finished at 2014-07-06 14:34:24, elapsed: 00:24:42
>> Generator: finished at 2014-07-06 15:38:11, elapsed: 00:04:28
>> Fetcher: finished at 2014-07-06 16:22:35, elapsed: 00:44:21
>> ParseSegment: finished at 2014-07-06 17:36:21, elapsed: 01:13:43
>> CrawlDb update: finished at 2014-07-06 17:40:23, elapsed: 00:04:00
>> LinkDb: finished at 2014-07-06 17:41:12, elapsed: 00:00:46
>> Indexer: finished at 2014-07-06 17:59:21, elapsed: 00:15:30
>> Generator: finished at 2014-07-06 19:04:54, elapsed: 00:05:01
>> Fetcher: finished at 2014-07-06 20:09:03, elapsed: 01:04:06
>> ParseSegment: finished at 2014-07-06 22:10:20, elapsed: 02:01:14
>> CrawlDb update: finished at 2014-07-06 22:15:23, elapsed: 00:04:59
>> LinkDb: finished at 2014-07-06 22:16:24, elapsed: 00:00:59
>> Indexer: finished at 2014-07-06 22:42:30, elapsed: 00:23:26
>> Generator: finished at 2014-07-06 23:46:29, elapsed: 00:04:15
>> Fetcher: finished at 2014-07-07 00:57:49, elapsed: 01:11:18
>> ParseSegment: finished at 2014-07-07 03:11:16, elapsed: 02:13:24
>> CrawlDb update: finished at 2014-07-07 03:16:36, elapsed: 00:05:17
>> LinkDb: finished at 2014-07-07 03:17:42, elapsed: 00:01:02
>> Indexer: finished at 2014-07-07 03:44:33, elapsed: 00:24:11
>> Generator: finished at 2014-07-07 04:47:21, elapsed: 00:03:54
>> Fetcher: finished at 2014-07-07 05:56:32, elapsed: 01:09:08
>> ParseSegment: finished at 2014-07-07 08:04:50, elapsed: 02:08:16
>> CrawlDb update: finished at 2014-07-07 08:10:09, elapsed: 00:05:16
>> LinkDb: finished at 2014-07-07 08:11:09, elapsed: 00:00:57
>> Indexer: finished at 2014-07-07 08:38:48, elapsed: 00:24:49
>> Generator: finished at 2014-07-07 09:47:50, elapsed: 00:04:21
>> Fetcher: finished at 2014-07-07 10:54:31, elapsed: 01:06:37
>> ParseSegment: finished at 2014-07-07 12:55:42, elapsed: 02:01:09
>> CrawlDb update: finished at 2014-07-07 13:00:45, elapsed: 00:04:59
>> LinkDb: finished at 2014-07-07 13:01:41, elapsed: 00:00:53
>> Indexer: finished at 2014-07-07 13:28:06, elapsed: 00:23:23
>> --
>> If I separate out just the first "round" and use grep to find all the
>> "Parsed (xms)" lines, there are 48,906 instances. If I extract the
>> number of milliseconds from each, and even assume that 0ms equals 1ms,
>> the total time taken for parsing comes to 49,444 ms... which, correct me
>> if I'm wrong, should be about 49.4 seconds. Unfortunately, as you can
>> see, the first ParseSegment job took 02:16:16.
>
> See
> https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/parse/ParseSegment.java
>
> This is something we should fix: the "Parsed (xms)" times do not reflect
> the time it takes to parse the docs, but the time it takes to do some
> work on the docs afterwards. Probably a bug; if not, it should have a
> different name.

Understood. Thanks for that explanation!
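For anyone who finds this thread later: if I'm reading Julien's
explanation right, the log line comes from a pattern roughly like the
sketch below. This is purely my paraphrase, not the actual ParseSegment
source; the point is that when the stopwatch brackets only the post-parse
bookkeeping, the logged values can sum to seconds while the job runs for
hours:

    // TimingSketch.java - my own paraphrase of what Julien describes;
    // NOT the actual Nutch code. The "parse" happens before the timer
    // starts, so the logged "Parsed (Xms)" value misses it entirely.
    public class TimingSketch {
      public static void main(String[] args) throws InterruptedException {
        Thread.sleep(200);   // stands in for the untimed parsing work
        long start = System.currentTimeMillis();
        Thread.sleep(1);     // stands in for the post-parse bookkeeping
        long end = System.currentTimeMillis();
        // Only this small number ever reaches the log:
        System.out.println("Parsed (" + (end - start)
            + "ms): http://example.com/");
      }
    }

Which would mean my 49-second sum was never measuring the real work at all.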
>> I believe that the thing is hanging after parsing these. As I
>> understood it, this indicates the delay required to reduce the mapped
>> task... but in our case there is no external Hadoop, so does reduction
>> even occur here?
>
> Running in pseudo-distributed mode would give you a better idea of where
> the issue is. The reduce step is done even in local mode, BTW.
>
> I suspect that your problem could have to do with the normalisation of
> the URLs, which is done in
> https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java
>
> Some long URLs can cause carnage with the normalisers and take far longer
> than they should. The best way to identify the problem is to call jstack
> on the process and see what it is doing when it is taking time.

Duly noted, thank you!
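In case it helps anyone reproduce this outside a full crawl, here is a
small harness I put together. It is entirely my own sketch (and the
"outlink" scope being the one ParseOutputFormat uses is an assumption on
my part); it pushes each URL given on the command line through the
configured normalizer chain and prints the wall-clock time, so a
pathological URL should stand out immediately:

    // NormalizeTiming.java - my own test harness, not from the Nutch docs.
    // Times one pass of each command-line URL through the URL normalizer
    // chain; a URL taking more than a few ms would support Julien's theory.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.net.URLNormalizers;
    import org.apache.nutch.util.NutchConfiguration;

    public class NormalizeTiming {
      public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        // SCOPE_OUTLINK: my assumption for the scope applied when
        // outlinks are normalized while writing parse output.
        URLNormalizers normalizers =
            new URLNormalizers(conf, URLNormalizers.SCOPE_OUTLINK);
        for (String url : args) {
          long start = System.currentTimeMillis();
          String normalized =
              normalizers.normalize(url, URLNormalizers.SCOPE_OUTLINK);
          long end = System.currentTimeMillis();
          System.out.println((end - start) + "ms  " + url
              + " -> " + normalized);
        }
      }
    }

And I'll run jstack as suggested: just jstack <pid> (the PID from jps) a
few times while ParseSegment seems to hang, looking for threads stuck in
the normalizer/regex classes.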
>> Should it be considered concerning that we're fetching, I think,
>> roughly 50,000 URLs in each round while the Solr index never goes
>> higher than 40,000 documents?
>
> Nope. If you have redirects, gone URLs, etc...

I don't know - I'd still consider that concerning, no? It works out to
50,000 URLs times 10 rounds, i.e. 500,000 URLs fetched for a total output
of 40,000 documents. Per my other message, I seem to have located the
issue, at least. But that very concern - that we were making far too many
fetches for the amount of output - is what tipped me off to seriously
investigate what was being fetched. Obviously I regret not doing that
sooner... but live and learn! :)

Again, thanks so much! I'll send a few other questions in another email!

Sincerely,
Craig