Hello, I have a handful of questions about Nutch, and I'm not sure whether it's considered impolite to combine them all into one message. So I'll start with the most pressing issue here and send the others separately.
Our primary issue is that at least one of our crawls is going very slowly. We have a single machine running Nutch in local mode (no external Hadoop at the moment), indexing into a Solr cluster hosted on other servers internal to our network; the setup is constrained by security (and, secondarily, financial) considerations. This machine crawls a number of websites, only one of which should be at all "large", at over 100,000 pages. Nutch, its temporary directory, its crawls, and everything else live on an NFS mount.

The crawl we're having an issue with is for a collection that should contain about 30,000 documents. We have Nutch scheduled to crawl it once a day, set to run for 10 rounds with a topN of 50,000. It does complete all 10 rounds and, I believe, the full 50,000 records. For some reason, the whole process takes a little under 48 hours, and once everything has been imported into Solr there are indeed, as I said, about 30,000 documents.

A few things of note:

From the log output of a recent job, I've extracted all the lines showing how long each phase took. The fetcher usually takes a solid hour, but the ParseSegment job always takes significantly longer.
Have a look:

--
Injector: finished at 2014-07-05 14:56:52, elapsed: 00:01:57
Generator: finished at 2014-07-05 15:01:16, elapsed: 00:04:20
Fetcher: finished at 2014-07-05 16:09:10, elapsed: 01:07:52
ParseSegment: finished at 2014-07-05 18:25:29, elapsed: 02:16:16
CrawlDb update: finished at 2014-07-05 18:30:27, elapsed: 00:04:55
LinkDb: finished at 2014-07-05 18:31:26, elapsed: 00:00:56
Indexer: finished at 2014-07-05 18:58:34, elapsed: 00:24:47
Generator: finished at 2014-07-05 20:01:59, elapsed: 00:04:09
Fetcher: finished at 2014-07-05 21:14:24, elapsed: 01:12:22
ParseSegment: finished at 2014-07-05 23:30:31, elapsed: 02:16:05
CrawlDb update: finished at 2014-07-05 23:35:51, elapsed: 00:05:16
LinkDb: finished at 2014-07-05 23:36:57, elapsed: 00:01:03
Indexer: finished at 2014-07-06 00:05:58, elapsed: 00:25:46
Generator: finished at 2014-07-06 01:10:40, elapsed: 00:04:00
Fetcher: finished at 2014-07-06 02:19:49, elapsed: 01:09:07
ParseSegment: finished at 2014-07-06 04:19:18, elapsed: 01:59:26
CrawlDb update: finished at 2014-07-06 04:24:42, elapsed: 00:05:20
LinkDb: finished at 2014-07-06 04:25:44, elapsed: 00:00:59
Indexer: finished at 2014-07-06 04:54:13, elapsed: 00:25:29
Generator: finished at 2014-07-06 05:59:19, elapsed: 00:04:07
Fetcher: finished at 2014-07-06 07:07:44, elapsed: 01:08:21
ParseSegment: finished at 2014-07-06 09:15:45, elapsed: 02:07:59
CrawlDb update: finished at 2014-07-06 09:21:08, elapsed: 00:05:19
LinkDb: finished at 2014-07-06 09:22:09, elapsed: 00:00:59
Indexer: finished at 2014-07-06 09:49:52, elapsed: 00:25:03
Generator: finished at 2014-07-06 10:55:52, elapsed: 00:04:15
Fetcher: finished at 2014-07-06 12:05:19, elapsed: 01:09:23
ParseSegment: finished at 2014-07-06 13:59:10, elapsed: 01:53:49
CrawlDb update: finished at 2014-07-06 14:05:25, elapsed: 00:06:11
LinkDb: finished at 2014-07-06 14:06:38, elapsed: 00:01:09
Indexer: finished at 2014-07-06 14:34:24, elapsed: 00:24:42
Generator: finished at 2014-07-06 15:38:11, elapsed: 00:04:28
Fetcher: finished at 2014-07-06 16:22:35, elapsed: 00:44:21
ParseSegment: finished at 2014-07-06 17:36:21, elapsed: 01:13:43
CrawlDb update: finished at 2014-07-06 17:40:23, elapsed: 00:04:00
LinkDb: finished at 2014-07-06 17:41:12, elapsed: 00:00:46
Indexer: finished at 2014-07-06 17:59:21, elapsed: 00:15:30
Generator: finished at 2014-07-06 19:04:54, elapsed: 00:05:01
Fetcher: finished at 2014-07-06 20:09:03, elapsed: 01:04:06
ParseSegment: finished at 2014-07-06 22:10:20, elapsed: 02:01:14
CrawlDb update: finished at 2014-07-06 22:15:23, elapsed: 00:04:59
LinkDb: finished at 2014-07-06 22:16:24, elapsed: 00:00:59
Indexer: finished at 2014-07-06 22:42:30, elapsed: 00:23:26
Generator: finished at 2014-07-06 23:46:29, elapsed: 00:04:15
Fetcher: finished at 2014-07-07 00:57:49, elapsed: 01:11:18
ParseSegment: finished at 2014-07-07 03:11:16, elapsed: 02:13:24
CrawlDb update: finished at 2014-07-07 03:16:36, elapsed: 00:05:17
LinkDb: finished at 2014-07-07 03:17:42, elapsed: 00:01:02
Indexer: finished at 2014-07-07 03:44:33, elapsed: 00:24:11
Generator: finished at 2014-07-07 04:47:21, elapsed: 00:03:54
Fetcher: finished at 2014-07-07 05:56:32, elapsed: 01:09:08
ParseSegment: finished at 2014-07-07 08:04:50, elapsed: 02:08:16
CrawlDb update: finished at 2014-07-07 08:10:09, elapsed: 00:05:16
LinkDb: finished at 2014-07-07 08:11:09, elapsed: 00:00:57
Indexer: finished at 2014-07-07 08:38:48, elapsed: 00:24:49
Generator: finished at 2014-07-07 09:47:50, elapsed: 00:04:21
Fetcher: finished at 2014-07-07 10:54:31, elapsed: 01:06:37
ParseSegment: finished at 2014-07-07 12:55:42, elapsed: 02:01:09
CrawlDb update: finished at 2014-07-07 13:00:45, elapsed: 00:04:59
LinkDb: finished at 2014-07-07 13:01:41, elapsed: 00:00:53
Indexer: finished at 2014-07-07 13:28:06, elapsed: 00:23:23
--

If I separate out just the first round and grep for all the "Parsed (Xms)" lines, there are 48,906 instances.
If I extract the number of milliseconds from each, counting even 0 ms as 1 ms, the total time spent parsing is 49,444 ms, which, correct me if I'm wrong, is about 49.4 seconds. Unfortunately, as you can see, the first ParseSegment job took 02:16:16, so I believe the job is hanging somewhere after parsing the documents themselves. As I understood it, that gap would normally be the time needed to reduce the mapped task, but since we have no external Hadoop, does a reduce step even occur here?

On a related note: should it be concerning that we're fetching, I think, roughly 50,000 URLs per round, while the Solr index never goes higher than 40,000 documents?

Thanks so much in advance for any assistance!

Sincerely,
Craig
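P.S. In case it helps anyone check my arithmetic, here is roughly how I totaled the per-document parse times. This is just a sketch: the embedded excerpt is illustrative, and in practice I ran the same pattern over the ParseSegment lines from logs/hadoop.log.

```python
import re

# Illustrative stand-in for the "Parsed (Nms)" lines that ParseSegment logs;
# in practice, read these from logs/hadoop.log.
log_excerpt = """\
Parsed (0ms):http://example.com/a
Parsed (3ms):http://example.com/b
Parsed (12ms):http://example.com/c
"""

# Pull out the millisecond value from every "Parsed (Nms)" occurrence.
times = [int(ms) for ms in re.findall(r"Parsed \((\d+)ms\)", log_excerpt)]

# Count a reported 0 ms as 1 ms so every record contributes at least something.
total_ms = sum(max(t, 1) for t in times)

print(f"{len(times)} records parsed, {total_ms} ms total")
```

Run over the real first-round log, this is where the 48,906 instances and the 49,444 ms total came from.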

