Hi Julien,

Thank you so much. I will reply inline.

From: Julien Nioche <[email protected]>
To: "[email protected]" <[email protected]>
Date: 07/10/2014 12:17 PM
Subject: Re: Nutch local: large crawls, extremely slow, small solr index

> Hi Craig
>
> See comments below; I will also comment on your other mail separately:
>
> On 9 July 2014 20:58, Craig Leinoff <[email protected]> wrote:
>
>> Hello,
>>
>> I have a handful of questions about Nutch, and it's unclear whether it's
>> considered "impolite" to combine them all into one. As a result, I'm
>> going to start with the most pressing issue first and, I guess, maybe
>> send other ones in separate messages?
>
> Don't worry - it's not impolite at all.

OK! Thanks!

>> Our primary issue is that (at least) one of our crawls is going very
>> slowly. We have a single system running Nutch in local mode, with no
>> external Hadoop (at the moment), indexing to a Solr cluster hosted on
>> other servers internal to our network.
>
> I'd really recommend that you run it in pseudo-distributed mode instead
> of local. Presumably your server has multiple cores, and p-d mode would
> allow you to leverage them. You'd also be able to use the MapReduce web
> GUI to monitor your crawl and check the counters, and you'd have easier
> access to the logs per job. Local mode should be used for testing /
> debugging, but not in production.

OK. Until now, I had never heard of "pseudo-distributed" mode, so that's
really great; thank you for recommending it. The implication, it seems, is
that we create a single-node Hadoop cluster on the same machine as Nutch,
right? I did find this URL:

https://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial

but the "as explained here" link, which is supposed to point to some sort
of "stable" Hadoop documentation, is a 404. :( Presumably this is a
reasonable analogue:

http://hadoop.apache.org/docs/r1.2.1/single_node_setup.html
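If I understand the tutorial correctly, the actual switch from local to
pseudo-distributed comes down to the fs.default.name and mapred.job.tracker
values in the Hadoop configuration on the classpath (the single-node setup
page appears to use hdfs://localhost:9000 and localhost:9001 respectively).
As a sanity check, I wrote myself the little sketch below - entirely my
own, not from any tutorial - to print which values a job client would
actually resolve, so I can tell which mode Nutch will really run in:

    // ModeCheck.java - my own sanity-check sketch, not from the Nutch or
    // Hadoop docs. In Hadoop 1.x the defaults "local" and "file:///" mean
    // local mode; the tutorial's "localhost:9001" and
    // "hdfs://localhost:9000" mean the pseudo-distributed daemons are used.
    import org.apache.hadoop.mapred.JobConf;

    public class ModeCheck {
      public static void main(String[] args) {
        // JobConf layers mapred-site.xml on top of core-site.xml, so it
        // sees the same values a submitted job would.
        JobConf conf = new JobConf();
        System.out.println("mapred.job.tracker = "
            + conf.get("mapred.job.tracker", "local"));
        System.out.println("fs.default.name    = "
            + conf.get("fs.default.name", "file:///"));
      }
    }

Does that sound like the right mental model?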
>> We're restricted, for security (and, in turn, financial) reasons, in our
>> setup. This server is crawling a number of websites, only one of which
>> should be at all "large," at over 100,000 pages. Nutch, its temporary
>> directory, its crawls, and everything else are running on an NFS mount.
>>
>> The crawl we're having an issue with is for a collection that should
>> contain about 30,000 documents. We have Nutch scheduled to crawl it once
>> every day. It is set to go for 10 rounds, with a topN of 50,000. Without
>> a doubt, it does all 10 rounds and, I believe, 50,000 records. For some
>> reason, the whole process takes a little under 48 hours, and when it has
>> imported everything into Solr, there are indeed, as I said, about 30,000
>> documents.
>>
>> =====
>>
>> A few things of note:
>>
>> From the log output of a recent job, I've extracted all the lines
>> related to how long each process took. The fetcher usually takes a solid
>> hour, but the ParseSegment job always seems to take significantly
>> longer. Have a look:
>>
>> --
>> Injector: finished at 2014-07-05 14:56:52, elapsed: 00:01:57
>> Generator: finished at 2014-07-05 15:01:16, elapsed: 00:04:20
>> Fetcher: finished at 2014-07-05 16:09:10, elapsed: 01:07:52
>> ParseSegment: finished at 2014-07-05 18:25:29, elapsed: 02:16:16
>> CrawlDb update: finished at 2014-07-05 18:30:27, elapsed: 00:04:55
>> LinkDb: finished at 2014-07-05 18:31:26, elapsed: 00:00:56
>> Indexer: finished at 2014-07-05 18:58:34, elapsed: 00:24:47
>> Generator: finished at 2014-07-05 20:01:59, elapsed: 00:04:09
>> Fetcher: finished at 2014-07-05 21:14:24, elapsed: 01:12:22
>> ParseSegment: finished at 2014-07-05 23:30:31, elapsed: 02:16:05
>> CrawlDb update: finished at 2014-07-05 23:35:51, elapsed: 00:05:16
>> LinkDb: finished at 2014-07-05 23:36:57, elapsed: 00:01:03
>> Indexer: finished at 2014-07-06 00:05:58, elapsed: 00:25:46
>> Generator: finished at 2014-07-06 01:10:40, elapsed: 00:04:00
>> Fetcher: finished at 2014-07-06 02:19:49, elapsed: 01:09:07
>> ParseSegment: finished at 2014-07-06 04:19:18, elapsed: 01:59:26
>> CrawlDb update: finished at 2014-07-06 04:24:42, elapsed: 00:05:20
>> LinkDb: finished at 2014-07-06 04:25:44, elapsed: 00:00:59
>> Indexer: finished at 2014-07-06 04:54:13, elapsed: 00:25:29
>> Generator: finished at 2014-07-06 05:59:19, elapsed: 00:04:07
>> Fetcher: finished at 2014-07-06 07:07:44, elapsed: 01:08:21
>> ParseSegment: finished at 2014-07-06 09:15:45, elapsed: 02:07:59
>> CrawlDb update: finished at 2014-07-06 09:21:08, elapsed: 00:05:19
>> LinkDb: finished at 2014-07-06 09:22:09, elapsed: 00:00:59
>> Indexer: finished at 2014-07-06 09:49:52, elapsed: 00:25:03
>> Generator: finished at 2014-07-06 10:55:52, elapsed: 00:04:15
>> Fetcher: finished at 2014-07-06 12:05:19, elapsed: 01:09:23
>> ParseSegment: finished at 2014-07-06 13:59:10, elapsed: 01:53:49
>> CrawlDb update: finished at 2014-07-06 14:05:25, elapsed: 00:06:11
>> LinkDb: finished at 2014-07-06 14:06:38, elapsed: 00:01:09
>> Indexer: finished at 2014-07-06 14:34:24, elapsed: 00:24:42
>> Generator: finished at 2014-07-06 15:38:11, elapsed: 00:04:28
>> Fetcher: finished at 2014-07-06 16:22:35, elapsed: 00:44:21
>> ParseSegment: finished at 2014-07-06 17:36:21, elapsed: 01:13:43
>> CrawlDb update: finished at 2014-07-06 17:40:23, elapsed: 00:04:00
>> LinkDb: finished at 2014-07-06 17:41:12, elapsed: 00:00:46
>> Indexer: finished at 2014-07-06 17:59:21, elapsed: 00:15:30
>> Generator: finished at 2014-07-06 19:04:54, elapsed: 00:05:01
>> Fetcher: finished at 2014-07-06 20:09:03, elapsed: 01:04:06
>> ParseSegment: finished at 2014-07-06 22:10:20, elapsed: 02:01:14
>> CrawlDb update: finished at 2014-07-06 22:15:23, elapsed: 00:04:59
>> LinkDb: finished at 2014-07-06 22:16:24, elapsed: 00:00:59
>> Indexer: finished at 2014-07-06 22:42:30, elapsed: 00:23:26
>> Generator: finished at 2014-07-06 23:46:29, elapsed: 00:04:15
>> Fetcher: finished at 2014-07-07 00:57:49, elapsed: 01:11:18
>> ParseSegment: finished at 2014-07-07 03:11:16, elapsed: 02:13:24
>> CrawlDb update: finished at 2014-07-07 03:16:36, elapsed: 00:05:17
>> LinkDb: finished at 2014-07-07 03:17:42, elapsed: 00:01:02
>> Indexer: finished at 2014-07-07 03:44:33, elapsed: 00:24:11
>> Generator: finished at 2014-07-07 04:47:21, elapsed: 00:03:54
>> Fetcher: finished at 2014-07-07 05:56:32, elapsed: 01:09:08
>> ParseSegment: finished at 2014-07-07 08:04:50, elapsed: 02:08:16
>> CrawlDb update: finished at 2014-07-07 08:10:09, elapsed: 00:05:16
>> LinkDb: finished at 2014-07-07 08:11:09, elapsed: 00:00:57
>> Indexer: finished at 2014-07-07 08:38:48, elapsed: 00:24:49
>> Generator: finished at 2014-07-07 09:47:50, elapsed: 00:04:21
>> Fetcher: finished at 2014-07-07 10:54:31, elapsed: 01:06:37
>> ParseSegment: finished at 2014-07-07 12:55:42, elapsed: 02:01:09
>> CrawlDb update: finished at 2014-07-07 13:00:45, elapsed: 00:04:59
>> LinkDb: finished at 2014-07-07 13:01:41, elapsed: 00:00:53
>> Indexer: finished at 2014-07-07 13:28:06, elapsed: 00:23:23
>> --
>> If I separate out just the first "round" and use grep to find all the
>> "Parsed (xms)" lines, there are 48,906 instances. If I extract the
>> number of milliseconds from each, and even assume that 0ms equals 1ms,
>> the total time taken for parsing comes to 49,444 ms... which, correct me
>> if I'm wrong, should be about 49.4 seconds. Unfortunately, as you can
>> see, the first ParseSegment job took 02:16:16.
>
> See
> https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/parse/ParseSegment.java
>
> This is something we should fix: the "Parsed (xms)" times do not reflect
> the time it takes to parse the docs, but the time it takes to do some
> work on the docs afterwards. Probably a bug; if not, it should have a
> different name.

Understood. Thanks for that explanation!
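For anyone who finds this thread later: if I'm reading Julien's
explanation right, the log line comes from a pattern roughly like the
sketch below. This is purely my paraphrase, not the actual ParseSegment
source; the point is that when the stopwatch brackets only the post-parse
bookkeeping, the logged values can sum to seconds while the job runs for
hours:

    // TimingSketch.java - my own paraphrase of what Julien describes;
    // NOT the actual Nutch code. The "parse" happens before the timer
    // starts, so the logged "Parsed (Xms)" value misses it entirely.
    public class TimingSketch {
      public static void main(String[] args) throws InterruptedException {
        Thread.sleep(200);   // stands in for the untimed parsing work
        long start = System.currentTimeMillis();
        Thread.sleep(1);     // stands in for the post-parse bookkeeping
        long end = System.currentTimeMillis();
        // Only this small number ever reaches the log:
        System.out.println("Parsed (" + (end - start)
            + "ms): http://example.com/");
      }
    }

Which would mean my 49-second sum was never measuring the real work at all.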
>> I believe that the thing is hanging after parsing these. As I
>> understood it, this indicates the delay required to reduce the mapped
>> task... but in our case there is no external Hadoop, so does reduction
>> even occur here?
>
> Running in pseudo-distributed mode would give you a better idea of where
> the issue is. The reduce step is done even in local mode, BTW.
>
> I suspect that your problem could have to do with the normalisation of
> the URLs, which is done in
> https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java
>
> Some long URLs can cause carnage with the normalisers and take far longer
> than they should. The best way to identify the problem is to call jstack
> on the process and see what it is doing when it is taking time.

Duly noted, thank you!
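In case it helps anyone reproduce this outside a full crawl, here is a
small harness I put together. It is entirely my own sketch (and the
"outlink" scope being the one ParseOutputFormat uses is an assumption on
my part); it pushes each URL given on the command line through the
configured normalizer chain and prints the wall-clock time, so a
pathological URL should stand out immediately:

    // NormalizeTiming.java - my own test harness, not from the Nutch docs.
    // Times one pass of each command-line URL through the URL normalizer
    // chain; a URL taking more than a few ms would support Julien's theory.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.net.URLNormalizers;
    import org.apache.nutch.util.NutchConfiguration;

    public class NormalizeTiming {
      public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        // SCOPE_OUTLINK: my assumption for the scope applied when
        // outlinks are normalized while writing parse output.
        URLNormalizers normalizers =
            new URLNormalizers(conf, URLNormalizers.SCOPE_OUTLINK);
        for (String url : args) {
          long start = System.currentTimeMillis();
          String normalized =
              normalizers.normalize(url, URLNormalizers.SCOPE_OUTLINK);
          long end = System.currentTimeMillis();
          System.out.println((end - start) + "ms  " + url
              + " -> " + normalized);
        }
      }
    }

And I'll run jstack as suggested: just jstack <pid> (the PID from jps) a
few times while ParseSegment seems to hang, looking for threads stuck in
the normalizer/regex classes.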
>> Should it be considered concerning that we're fetching, I think,
>> roughly 50,000 URLs in each round while the Solr index never goes
>> higher than 40,000 documents?
>
> Nope. If you have redirects, gone URLs, etc...

I don't know - I'd still consider that concerning, no? It works out to
50,000 URLs times 10 rounds, i.e. 500,000 URLs fetched for a total output
of 40,000 documents. Per my other message, I seem to have located the
issue, at least. But that very concern - that we were making far too many
fetches for the amount of output - is what tipped me off to seriously
investigate what was being fetched. Obviously I regret not doing that
sooner... but live and learn! :)

Again, thanks so much! I'll send a few other questions in another email!

Sincerely,
Craig