Hi Craig, see comments below; I'll also comment on your other mail separately:
On 9 July 2014 20:58, Craig Leinoff <lein...@un.org> wrote:

> Hello,
>
> I have a handful of questions about Nutch, and it's unclear whether it's
> considered "impolite" to combine them all into one. As a result, I'm going
> to start with the most pressing issue first and, I guess, maybe send other
> ones in separate messages?

Don't worry - it's not impolite at all.

> Our primary issue is that one (at least one, that is) of our crawls is
> going very slowly. We have a single system running Nutch in local mode, no
> external Hadoop (at the moment), to a Solr cluster hosted on other servers
> internal to our network.

I'd really recommend that you run it in pseudo-distributed mode instead of local. Presumably your server has multiple cores, and pseudo-distributed mode would allow you to leverage them. You'd also be able to use the MapReduce web GUI to monitor your crawl and check the counters, and you'd get easier access to the logs per job. Local mode should be used for testing / debugging but not in production.
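In case it helps, here is a minimal sketch of what the pseudo-distributed configuration looks like, assuming the Hadoop 1.x line that recent Nutch releases build against; the host names, ports and replication factor below are only examples and need adapting to your server:

  conf/core-site.xml:
  <configuration>
    <!-- default filesystem: a single local NameNode -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
    </property>
  </configuration>

  conf/mapred-site.xml:
  <configuration>
    <!-- JobTracker address for the single-node cluster -->
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:9001</value>
    </property>
  </configuration>

  conf/hdfs-site.xml:
  <configuration>
    <!-- only one DataNode, so keep a single replica -->
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
  </configuration>

With the HDFS and MapReduce daemons running you would launch the crawl from the Nutch deploy runtime (which submits the Nutch .job file to Hadoop) rather than in local mode, and the JobTracker web UI (port 50030 by default in Hadoop 1.x) gives you the counters and per-task logs mentioned above.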
> We're restricted by security (and then financial) reasons for our setup.
> This server is crawling a number of websites, only one of which should be
> at all "large" and over 100,000 pages. Nutch, its temporary directory, its
> crawls, and all else are running on an NFS mount.
>
> The crawl we're having an issue with is for a collection that should
> contain about 30,000 documents. We have Nutch scheduled to crawl it once
> every day. It is set to go for 10 rounds, topN of 50,000. Without a doubt,
> it does all 10 rounds and, I believe, 50,000 records. For some reason, the
> whole process takes a little under 48 hours, and when it has imported
> everything into Solr, there are indeed, as I said, about 30,000 documents.
>
> =====
>
> A few things of note:
>
> From the log output of a recent job, I've extracted all the lines related
> to how long each process took. The fetcher usually takes a solid hour, but
> the ParseSegment job always seems to be significantly more. Have a look:
>
> --
> Injector: finished at 2014-07-05 14:56:52, elapsed: 00:01:57
> Generator: finished at 2014-07-05 15:01:16, elapsed: 00:04:20
> Fetcher: finished at 2014-07-05 16:09:10, elapsed: 01:07:52
> ParseSegment: finished at 2014-07-05 18:25:29, elapsed: 02:16:16
> CrawlDb update: finished at 2014-07-05 18:30:27, elapsed: 00:04:55
> LinkDb: finished at 2014-07-05 18:31:26, elapsed: 00:00:56
> Indexer: finished at 2014-07-05 18:58:34, elapsed: 00:24:47
> Generator: finished at 2014-07-05 20:01:59, elapsed: 00:04:09
> Fetcher: finished at 2014-07-05 21:14:24, elapsed: 01:12:22
> ParseSegment: finished at 2014-07-05 23:30:31, elapsed: 02:16:05
> CrawlDb update: finished at 2014-07-05 23:35:51, elapsed: 00:05:16
> LinkDb: finished at 2014-07-05 23:36:57, elapsed: 00:01:03
> Indexer: finished at 2014-07-06 00:05:58, elapsed: 00:25:46
> Generator: finished at 2014-07-06 01:10:40, elapsed: 00:04:00
> Fetcher: finished at 2014-07-06 02:19:49, elapsed: 01:09:07
> ParseSegment: finished at 2014-07-06 04:19:18, elapsed: 01:59:26
> CrawlDb update: finished at 2014-07-06 04:24:42, elapsed: 00:05:20
> LinkDb: finished at 2014-07-06 04:25:44, elapsed: 00:00:59
> Indexer: finished at 2014-07-06 04:54:13, elapsed: 00:25:29
> Generator: finished at 2014-07-06 05:59:19, elapsed: 00:04:07
> Fetcher: finished at 2014-07-06 07:07:44, elapsed: 01:08:21
> ParseSegment: finished at 2014-07-06 09:15:45, elapsed: 02:07:59
> CrawlDb update: finished at 2014-07-06 09:21:08, elapsed: 00:05:19
> LinkDb: finished at 2014-07-06 09:22:09, elapsed: 00:00:59
> Indexer: finished at 2014-07-06 09:49:52, elapsed: 00:25:03
> Generator: finished at 2014-07-06 10:55:52, elapsed: 00:04:15
> Fetcher: finished at 2014-07-06 12:05:19, elapsed: 01:09:23
> ParseSegment: finished at 2014-07-06 13:59:10, elapsed: 01:53:49
> CrawlDb update: finished at 2014-07-06 14:05:25, elapsed: 00:06:11
> LinkDb: finished at 2014-07-06 14:06:38, elapsed: 00:01:09
> Indexer: finished at 2014-07-06 14:34:24, elapsed: 00:24:42
> Generator: finished at 2014-07-06 15:38:11, elapsed: 00:04:28
> Fetcher: finished at 2014-07-06 16:22:35, elapsed: 00:44:21
> ParseSegment: finished at 2014-07-06 17:36:21, elapsed: 01:13:43
> CrawlDb update: finished at 2014-07-06 17:40:23, elapsed: 00:04:00
> LinkDb: finished at 2014-07-06 17:41:12, elapsed: 00:00:46
> Indexer: finished at 2014-07-06 17:59:21, elapsed: 00:15:30
> Generator: finished at 2014-07-06 19:04:54, elapsed: 00:05:01
> Fetcher: finished at 2014-07-06 20:09:03, elapsed: 01:04:06
> ParseSegment: finished at 2014-07-06 22:10:20, elapsed: 02:01:14
> CrawlDb update: finished at 2014-07-06 22:15:23, elapsed: 00:04:59
> LinkDb: finished at 2014-07-06 22:16:24, elapsed: 00:00:59
> Indexer: finished at 2014-07-06 22:42:30, elapsed: 00:23:26
> Generator: finished at 2014-07-06 23:46:29, elapsed: 00:04:15
> Fetcher: finished at 2014-07-07 00:57:49, elapsed: 01:11:18
> ParseSegment: finished at 2014-07-07 03:11:16, elapsed: 02:13:24
> CrawlDb update: finished at 2014-07-07 03:16:36, elapsed: 00:05:17
> LinkDb: finished at 2014-07-07 03:17:42, elapsed: 00:01:02
> Indexer: finished at 2014-07-07 03:44:33, elapsed: 00:24:11
> Generator: finished at 2014-07-07 04:47:21, elapsed: 00:03:54
> Fetcher: finished at 2014-07-07 05:56:32, elapsed: 01:09:08
> ParseSegment: finished at 2014-07-07 08:04:50, elapsed: 02:08:16
> CrawlDb update: finished at 2014-07-07 08:10:09, elapsed: 00:05:16
> LinkDb: finished at 2014-07-07 08:11:09, elapsed: 00:00:57
> Indexer: finished at 2014-07-07 08:38:48, elapsed: 00:24:49
> Generator: finished at 2014-07-07 09:47:50, elapsed: 00:04:21
> Fetcher: finished at 2014-07-07 10:54:31, elapsed: 01:06:37
> ParseSegment: finished at 2014-07-07 12:55:42, elapsed: 02:01:09
> CrawlDb update: finished at 2014-07-07 13:00:45, elapsed: 00:04:59
> LinkDb: finished at 2014-07-07 13:01:41, elapsed: 00:00:53
> Indexer: finished at 2014-07-07 13:28:06, elapsed: 00:23:23
> --
>
> If I actually separate out the first "round" only, and using grep, find
> all the "Parsed (xms)" sections, there's 48,906 instances. If I extract the
> number of milliseconds per each and even assume that 0ms equals 1ms, the
> total time taken for parsing is 49,444 ms ... which, correct me if I'm
> wrong, should be 49.4 seconds. Unfortunately, as you can see, the first
> ParseSegment job took 02:16:16.

See https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/parse/ParseSegment.java - this is something we should fix: the "Parsed (xms)" times do not reflect the time it takes to parse the docs but the time it takes to do some stuff afterwards on the docs. Probably a bug; if not, it should have a different name.

> I believe that the thing is hanging after parsing these. As I understood
> it, this indicates the delay required to reduce the mapped task. But in
> our case, there is no external Hadoop, so does reduction even occur here?

Running in pseudo-distributed mode would give you a better idea of where the issue is. The reduce step is done even in local mode, BTW. I suspect that your problem could have to do with the normalisation of the URLs, which is done in https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java. Some long URLs can cause carnage with the normalisers and take far longer than they should. The best way to identify the problem is to call jstack on the process and see what it is doing when it is taking time.
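As a concrete example (just a sketch - the PID and output paths are made up, and the main class you see will depend on how you launch the crawl), you can find the JVM with jps and take a couple of thread dumps while ParseSegment looks stuck:

  # list running JVMs with their main classes to find the crawl process
  jps -l

  # take a few dumps some seconds apart while the job is "hanging"
  jstack 12345 > /tmp/parse-dump-1.txt
  sleep 10
  jstack 12345 > /tmp/parse-dump-2.txt

If the same normaliser or regex frames sit near the top of the stack in every dump, that is very likely where the time is going.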
> Should it be considered concerning that we're fetching, I think, roughly
> 50,000 URLs per each round and the Solr Index never goes higher than 40,000
> documents?

Nope. If you have redirects, gone URLs, etc. then not every fetched URL ends up as a document in the index.

> Thanks so much in advance for any assistance!
>
> Sincerely,
> Craig

HTH

Julien

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble