Hi Lewis,

Thanks for your reply. I'm not sure what you mean by reading the Metadata. As I mentioned, parsechecker shows hundreds of links. Also, deleting the collection seems to fix things.
On Tue, Jun 14, 2016 at 9:25 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
> Hi Jean,
>
> On Mon, Jun 13, 2016 at 1:57 PM, <user-digest-h...@nutch.apache.org> wrote:
>
>> From: Jean Vence <jve...@gmail.com>
>> To: user@nutch.apache.org
>> Cc:
>> Date: Mon, 13 Jun 2016 21:57:30 +0100
>> Subject: Nutch 2.3.1 with MongoDB not generating any URLs
>>
>> I have installed and successfully web crawled thousands of pages using
>> Nutch 2.3.1 with MongoDB.
>>
>> But suddenly, the Nutch 2.3.1 Generator is not generating any URLs. Seed
>> list URLs are accepted (InjectorJob: total number of urls injected
>> after normalization and filtering: 3) and
>> ./bin/nutch parsechecker -dumpText http://xxx.com shows hundreds of URLs.
>>
>> Error as follows:
>>
>> GeneratorJob: starting at 2016-06-09 07:26:15
>> GeneratorJob: Selecting best-scoring urls due for fetch.
>> GeneratorJob: starting
>> GeneratorJob: filtering: false
>> GeneratorJob: normalizing: false
>> GeneratorJob: topN: 50000
>> GeneratorJob: finished at 2016-06-09 07:26:28, time elapsed: 00:00:13
>> GeneratorJob: generated batch id: 1465471572-2463 containing 0 URLs
>>
>> What is interesting is that if I delete the webpage collection in the
>> mongodb nutch database, then the crawler works fine, so I'm assuming
>> there's a record in the collection that is causing the issue. Can
>> anyone recommend how to fix this problem? (I tried deleting any record
>> that doesn't have a status field, but that did not help.)
>
> Can you please read the Metadata of your records, as this will indicate if
> any outlinks have been extracted and are suitable for fetch.
> AFAIK, this is fixed in the Nutch 2.X branch. It would be very helpful if you
> could please verify and get back to us here.
> Thanks
> Lewis
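For anyone following along: since the records live in the webpage collection of the mongodb nutch database (as described above), the metadata Lewis mentions can also be inspected directly from the mongo shell. The sketch below assumes the default database name "nutch", the collection name "webpage", and the stock gora-mongodb field names (status, markers, metadata, outlinks); a customized mapping may use different names, so treat this as a starting point rather than an exact recipe.

```shell
# Hypothetical inspection queries -- database/collection/field names
# are assumptions based on a default Nutch 2.x + gora-mongodb setup.

# Dump the generate-relevant fields of a few records to see whether
# any outlinks were extracted and what markers/metadata they carry:
mongo nutch --eval '
  db.webpage.find({}, {status: 1, markers: 1, metadata: 1, outlinks: 1})
            .limit(5)
            .forEach(printjson);
'

# Count records with no outlinks at all -- possible candidates for the
# stuck entry that leaves GeneratorJob with 0 URLs to generate:
mongo nutch --eval 'print(db.webpage.count({outlinks: {$exists: false}}))'
```

Comparing a record from a working crawl against one from the broken collection should show which field differs and which record is poisoning generation.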