ache.org
Subject: crawlcomplete
Hi,
I'm trying to understand some of the design decisions behind the
crawlcomplete tool. I find the concept itself very useful, but there are a
couple of behaviors that I don't understand:
1. URLs that resulted in redirect (even permanent) are counted as
unfetched. That means that if I
OK, I filed an issue https://issues.apache.org/jira/browse/NUTCH-2468
From: Sebastian Nagel <wastl.na...@googlemail.com>
To: user@nutch.apache.org
Sent: Wednesday, November 29, 2017 9:04 AM
Subject: Re: Not valid URLs in Crawldb through crawlcomplete
tutorial, is it mentioned that you need to
> specify "-filter" to updatedb to make it work.
>
>
> From: Sebastian Nagel <wastl.na...@googlemail.com>
> To: user@nutch.apache.org
> Sent: Wednesday, November 29, 2017 2:40 AM
> Subject: Re: Not valid URLs in Crawl
From: Sebastian Nagel <wastl.na...@googlemail.com>
To: user@nutch.apache.org
Sent: Wednesday, November 29, 2017 2:40 AM
Subject: Re: Not valid URLs in Crawldb through crawlcomplete
Hi,
all 8 available urlfilter-* plugins are linked from
maybe advise me on a default filter that filters such
> problematic urls?
>
> Thanks.
>
> Semyon.
>
>
> Sent: Tuesday, November 28, 2017 at 4:17 PM
> From: "Sebastian Nagel" <wastl.na...@googlemail.com>
> To: user@nutch.apache.org
> Subject:
stian Nagel" <wastl.na...@googlemail.com>
To: user@nutch.apache.org
Subject: Re: Not valid URLs in Crawldb through crawlcomplete
Hi Semyon,
> It seems like Nutch takes the anchor name as a URL for crawling and stores
> it in the database with the key equal to the name.
if you look
the CrawlDb.
Best,
Sebastian
On 11/28/2017 02:17 PM, Semyon Semyonov wrote:
> Hello all,
>
> I have launched a crawling process for 100 websites with external links
> set to true.
> After several hours, I ran the crawlcomplete command with the mode set to host.
>
> The cr
Hello all,
I have launched a crawling process for 100 websites with external links set
to true.
After several hours, I ran the crawlcomplete command with the mode set to host.
The crawlcomplete output file contains (apart from the proper host names) the
following lines.
1#Are there any