Hi Lewis,

Thanks for your reply. I'm not sure what you mean by reading the Metadata. As I mentioned, parsechecker shows hundreds of links. Also, deleting the collection seems to fix things.
On Tue, Jun 14, 2016 at 9:25 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
> Hi Jean,
>
> On Mon, Jun 13, 2016 at 1:57 PM, <user-digest-h...@nutch.apache.org> wrote:
>
>> From: Jean Vence <jve...@gmail.com>
>> To: user@nutch.apache.org
>> Cc:
>> Date: Mon, 13 Jun 2016 21:57:30 +0100
>> Subject: Nutch 2.3.1 with MongoDB not generating any URLs
>>
>> I have installed and successfully web crawled thousands of pages using
>> Nutch 2.3.1 with MongoDB.
>>
>> But suddenly, the Nutch 2.3.1 Generator is not generating any URLs. Seed
>> list URLs are accepted (InjectorJob: total number of urls injected
>> after normalization and filtering: 3) and
>> ./bin/nutch parsechecker -dumpText http://xxx.com shows hundreds of URLs.
>>
>> Error as follows:
>>
>> GeneratorJob: starting at 2016-06-09 07:26:15
>> GeneratorJob: Selecting best-scoring urls due for fetch.
>> GeneratorJob: starting
>> GeneratorJob: filtering: false
>> GeneratorJob: normalizing: false
>> GeneratorJob: topN: 50000
>> GeneratorJob: finished at 2016-06-09 07:26:28, time elapsed: 00:00:13
>> GeneratorJob: generated batch id: 1465471572-2463 containing 0 URLs
>>
>> What is interesting is that if I delete the webpage collection in the
>> mongodb nutch database, then the crawler works fine, so I'm assuming
>> there's a record in the collection that is causing the issue. Can
>> anyone recommend how to fix this problem? (I tried deleting any record
>> that doesn't have a status field, but that did not help.)
>
> Can you please read the Metadata of your records, as this will indicate if
> any outlinks have been extracted and are suitable for fetch.
> AFAIK, this is fixed in the Nutch 2.X branch. It would be very helpful if you
> could please verify and get back to us here.
> Thanks
> Lewis
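For anyone following along: since the records live in the webpage collection of the mongodb nutch database (as described above), the metadata Lewis mentions can also be inspected directly from the mongo shell. The sketch below assumes the default database name "nutch", the collection name "webpage", and the stock gora-mongodb field names (status, markers, metadata, outlinks); a customized mapping may use different names, so treat this as a starting point rather than an exact recipe.

```shell
# Hypothetical inspection queries -- database/collection/field names
# are assumptions based on a default Nutch 2.x + gora-mongodb setup.

# Dump the generate-relevant fields of a few records to see whether
# any outlinks were extracted and what markers/metadata they carry:
mongo nutch --eval '
  db.webpage.find({}, {status: 1, markers: 1, metadata: 1, outlinks: 1})
            .limit(5)
            .forEach(printjson);
'

# Count records with no outlinks at all -- possible candidates for the
# stuck entry that leaves GeneratorJob with 0 URLs to generate:
mongo nutch --eval 'print(db.webpage.count({outlinks: {$exists: false}}))'
```

Comparing a record from a working crawl against one from the broken collection should show which field differs and which record is poisoning generation.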