Re: crawlcomplete

2017-12-14 Thread Semyon Semyonov
ache.org Subject: crawlcomplete Hi, I'm trying to understand some of the design decisions behind the crawlcomplete tool. I find the concept itself very useful, but there are a couple of behaviors that I don't understand: 1. URLs that resulted in redirect (even permanent) are counted as unfet

crawlcomplete

2017-12-04 Thread Yossi Tamari
Hi, I'm trying to understand some of the design decisions behind the crawlcomplete tool. I find the concept itself very useful, but there are a couple of behaviors that I don't understand: 1. URLs that resulted in a redirect (even a permanent one) are counted as unfetched. That means that if I

Re: Not valid URLs in Crawldb through crawlcomplete

2017-11-30 Thread Michael Coffey
OK, I filed an issue https://issues.apache.org/jira/browse/NUTCH-2468 From: Sebastian Nagel <wastl.na...@googlemail.com> To: user@nutch.apache.org Sent: Wednesday, November 29, 2017 9:04 AM Subject: Re: Not valid URLs in Crawldb through crawlcomplete

Re: Not valid URLs in Crawldb through crawlcomplete

2017-11-29 Thread Sebastian Nagel
tutorial, is it mentioned that you need to > specify "-filter" to updatedb to make it work. > > > From: Sebastian Nagel <wastl.na...@googlemail.com> > To: user@nutch.apache.org > Sent: Wednesday, November 29, 2017 2:40 AM > Subject: Re: Not valid URLs in Crawl

Re: Not valid URLs in Crawldb through crawlcomplete

2017-11-29 Thread Michael Coffey
ilter" to updatedb to make it work. From: Sebastian Nagel <wastl.na...@googlemail.com> To: user@nutch.apache.org Sent: Wednesday, November 29, 2017 2:40 AM Subject: Re: Not valid URLs in Crawldb through crawlcomplete Hi, all 8 available urlfilter-* plugins are linked from

Re: Not valid URLs in Crawldb through crawlcomplete

2017-11-29 Thread Sebastian Nagel
maybe advise me on a default filter that filters such > problematic urls? > > Thanks. > > Semyon. > > > Sent: Tuesday, November 28, 2017 at 4:17 PM > From: "Sebastian Nagel" <wastl.na...@googlemail.com> > To: user@nutch.apache.org > Subject:

Re: Not valid URLs in Crawldb through crawlcomplete

2017-11-29 Thread Semyon Semyonov
stian Nagel" <wastl.na...@googlemail.com> To: user@nutch.apache.org Subject: Re: Not valid URLs in Crawldb through crawlcomplete Hi Semyon, > It seems like Nutch takes the anchor name as an URL for the crawling a store > it in database with the key equals to name. if you look

Re: Not valid URLs in Crawldb through crawlcomplete

2017-11-28 Thread Sebastian Nagel
the CrawlDb. Best, Sebastian On 11/28/2017 02:17 PM, Semyon Semyonov wrote: > Hello all, > > I have launched a crawling process for 100 websites with external links > equals to true. > After several hours, I run the crawlcomplete command with mode equals host. > > The cr

Not valid URLs in Crawldb through crawlcomplete

2017-11-28 Thread Semyon Semyonov
Hello all, I have launched a crawling process for 100 websites with external links set to true. After several hours, I ran the crawlcomplete command with mode set to host. The crawlcomplete output file contains (apart from the proper host names) the following lines. 1#Are there any