Hello, I have a handful of questions about Nutch, and I'm not sure whether it's considered impolite to combine them all into one message. So I'll start with the most pressing issue here and send the others separately.
Our primary issue is that at least one of our crawls is going very slowly. We have a single machine running Nutch in local mode (no external Hadoop at the moment), indexing into a Solr cluster hosted on other servers internal to our network; the setup is constrained by security (and, secondarily, financial) considerations. This machine crawls a number of websites, only one of which should be at all "large", at over 100,000 pages. Nutch, its temporary directory, its crawls, and everything else live on an NFS mount.

The crawl we're having an issue with is for a collection that should contain about 30,000 documents. We have Nutch scheduled to crawl it once a day, set to run for 10 rounds with a topN of 50,000. It does complete all 10 rounds and, I believe, the full 50,000 records. For some reason, the whole process takes a little under 48 hours, and once everything has been imported into Solr there are indeed, as I said, about 30,000 documents.

A few things of note:

From the log output of a recent job, I've extracted all the lines showing how long each phase took. The fetcher usually takes a solid hour, but the ParseSegment job always takes significantly longer.
Have a look:

--
Injector: finished at 2014-07-05 14:56:52, elapsed: 00:01:57
Generator: finished at 2014-07-05 15:01:16, elapsed: 00:04:20
Fetcher: finished at 2014-07-05 16:09:10, elapsed: 01:07:52
ParseSegment: finished at 2014-07-05 18:25:29, elapsed: 02:16:16
CrawlDb update: finished at 2014-07-05 18:30:27, elapsed: 00:04:55
LinkDb: finished at 2014-07-05 18:31:26, elapsed: 00:00:56
Indexer: finished at 2014-07-05 18:58:34, elapsed: 00:24:47
Generator: finished at 2014-07-05 20:01:59, elapsed: 00:04:09
Fetcher: finished at 2014-07-05 21:14:24, elapsed: 01:12:22
ParseSegment: finished at 2014-07-05 23:30:31, elapsed: 02:16:05
CrawlDb update: finished at 2014-07-05 23:35:51, elapsed: 00:05:16
LinkDb: finished at 2014-07-05 23:36:57, elapsed: 00:01:03
Indexer: finished at 2014-07-06 00:05:58, elapsed: 00:25:46
Generator: finished at 2014-07-06 01:10:40, elapsed: 00:04:00
Fetcher: finished at 2014-07-06 02:19:49, elapsed: 01:09:07
ParseSegment: finished at 2014-07-06 04:19:18, elapsed: 01:59:26
CrawlDb update: finished at 2014-07-06 04:24:42, elapsed: 00:05:20
LinkDb: finished at 2014-07-06 04:25:44, elapsed: 00:00:59
Indexer: finished at 2014-07-06 04:54:13, elapsed: 00:25:29
Generator: finished at 2014-07-06 05:59:19, elapsed: 00:04:07
Fetcher: finished at 2014-07-06 07:07:44, elapsed: 01:08:21
ParseSegment: finished at 2014-07-06 09:15:45, elapsed: 02:07:59
CrawlDb update: finished at 2014-07-06 09:21:08, elapsed: 00:05:19
LinkDb: finished at 2014-07-06 09:22:09, elapsed: 00:00:59
Indexer: finished at 2014-07-06 09:49:52, elapsed: 00:25:03
Generator: finished at 2014-07-06 10:55:52, elapsed: 00:04:15
Fetcher: finished at 2014-07-06 12:05:19, elapsed: 01:09:23
ParseSegment: finished at 2014-07-06 13:59:10, elapsed: 01:53:49
CrawlDb update: finished at 2014-07-06 14:05:25, elapsed: 00:06:11
LinkDb: finished at 2014-07-06 14:06:38, elapsed: 00:01:09
Indexer: finished at 2014-07-06 14:34:24, elapsed: 00:24:42
Generator: finished at 2014-07-06 15:38:11, elapsed: 00:04:28
Fetcher: finished at 2014-07-06 16:22:35, elapsed: 00:44:21
ParseSegment: finished at 2014-07-06 17:36:21, elapsed: 01:13:43
CrawlDb update: finished at 2014-07-06 17:40:23, elapsed: 00:04:00
LinkDb: finished at 2014-07-06 17:41:12, elapsed: 00:00:46
Indexer: finished at 2014-07-06 17:59:21, elapsed: 00:15:30
Generator: finished at 2014-07-06 19:04:54, elapsed: 00:05:01
Fetcher: finished at 2014-07-06 20:09:03, elapsed: 01:04:06
ParseSegment: finished at 2014-07-06 22:10:20, elapsed: 02:01:14
CrawlDb update: finished at 2014-07-06 22:15:23, elapsed: 00:04:59
LinkDb: finished at 2014-07-06 22:16:24, elapsed: 00:00:59
Indexer: finished at 2014-07-06 22:42:30, elapsed: 00:23:26
Generator: finished at 2014-07-06 23:46:29, elapsed: 00:04:15
Fetcher: finished at 2014-07-07 00:57:49, elapsed: 01:11:18
ParseSegment: finished at 2014-07-07 03:11:16, elapsed: 02:13:24
CrawlDb update: finished at 2014-07-07 03:16:36, elapsed: 00:05:17
LinkDb: finished at 2014-07-07 03:17:42, elapsed: 00:01:02
Indexer: finished at 2014-07-07 03:44:33, elapsed: 00:24:11
Generator: finished at 2014-07-07 04:47:21, elapsed: 00:03:54
Fetcher: finished at 2014-07-07 05:56:32, elapsed: 01:09:08
ParseSegment: finished at 2014-07-07 08:04:50, elapsed: 02:08:16
CrawlDb update: finished at 2014-07-07 08:10:09, elapsed: 00:05:16
LinkDb: finished at 2014-07-07 08:11:09, elapsed: 00:00:57
Indexer: finished at 2014-07-07 08:38:48, elapsed: 00:24:49
Generator: finished at 2014-07-07 09:47:50, elapsed: 00:04:21
Fetcher: finished at 2014-07-07 10:54:31, elapsed: 01:06:37
ParseSegment: finished at 2014-07-07 12:55:42, elapsed: 02:01:09
CrawlDb update: finished at 2014-07-07 13:00:45, elapsed: 00:04:59
LinkDb: finished at 2014-07-07 13:01:41, elapsed: 00:00:53
Indexer: finished at 2014-07-07 13:28:06, elapsed: 00:23:23
--

If I separate out just the first round and grep for all the "Parsed (Xms)" lines, there are 48,906 instances.
If I extract the number of milliseconds from each, counting even 0 ms as 1 ms, the total time spent parsing is 49,444 ms, which, correct me if I'm wrong, is about 49.4 seconds. Unfortunately, as you can see, the first ParseSegment job took 02:16:16, so I believe the job is hanging somewhere after parsing the documents themselves. As I understood it, that gap would normally be the time needed to reduce the mapped task, but since we have no external Hadoop, does a reduce step even occur here?

On a related note: should it be concerning that we're fetching, I think, roughly 50,000 URLs per round, while the Solr index never goes higher than 40,000 documents?

Thanks so much in advance for any assistance!

Sincerely,
Craig
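P.S. In case it helps anyone check my arithmetic, here is roughly how I totaled the per-document parse times. This is just a sketch: the embedded excerpt is illustrative, and in practice I ran the same pattern over the ParseSegment lines from logs/hadoop.log.

```python
import re

# Illustrative stand-in for the "Parsed (Nms)" lines that ParseSegment logs;
# in practice, read these from logs/hadoop.log.
log_excerpt = """\
Parsed (0ms):http://example.com/a
Parsed (3ms):http://example.com/b
Parsed (12ms):http://example.com/c
"""

# Pull out the millisecond value from every "Parsed (Nms)" occurrence.
times = [int(ms) for ms in re.findall(r"Parsed \((\d+)ms\)", log_excerpt)]

# Count a reported 0 ms as 1 ms so every record contributes at least something.
total_ms = sum(max(t, 1) for t in times)

print(f"{len(times)} records parsed, {total_ms} ms total")
```

Run over the real first-round log, this is where the 48,906 instances and the 49,444 ms total came from.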

