Re: RE: Reg: URL Near Duplicate Issues with same content

2018-03-15 Thread Semyon Semyonov
In addition, if you crawl a fixed set of URLs with external links = false, case 1 is solved. For example, if you inject http://www.samacharplus.com/, only 1) will be crawled; 2) will be ignored (because of external links = false). 1) http://www.samacharplus.com/~samachar/index.php/en/worlds/11-in
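The effect described above can be illustrated with a small standalone sketch. It is not Nutch code: the class and method names are hypothetical, and it assumes that "external" means "on a different host than the injected seed" and that case 2 is the non-www variant of the same site.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; Nutch applies its own logic internally.
class SameHostFilterSketch {

    /** Keeps only the outlinks whose host equals the host of the seed URL. */
    static List<String> keepSameHost(String seedUrl, List<String> outlinks) {
        String seedHost = URI.create(seedUrl).getHost();
        List<String> kept = new ArrayList<>();
        for (String link : outlinks) {
            String host;
            try {
                host = URI.create(link).getHost();
            } catch (IllegalArgumentException e) {
                continue; // skip links that are not valid URIs
            }
            // equalsIgnoreCase(null) is false, so links without a host are dropped
            if (seedHost != null && seedHost.equalsIgnoreCase(host)) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> outlinks = Arrays.asList(
                "http://www.samacharplus.com/a",   // same host as the seed: kept
                "http://samacharplus.com/a");      // non-www host: treated as external
        System.out.println(keepSameHost("http://www.samacharplus.com/", outlinks));
    }
}
```

Under these assumptions the non-www variant never enters the crawl, so no duplicate is fetched in the first place.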

RE: Reg: URL Near Duplicate Issues with same content

2018-03-15 Thread Markus Jelsma
About URL normalizers, you can use urlnormalizer-host to normalize between www and non-www hosts, and urlnormalizer-slash to normalize trailing or non-trailing slashes per host. There are no committed tools that automate this, but if your set of sites is limited, it is easy to manage by hand.
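As a rough illustration of what these two kinds of normalization do to a URL, here is a minimal standalone sketch. It does not use the Nutch plugin API or the plugins' configuration files; the class, the method, and the two boolean preferences are hypothetical, and the example.org hosts are placeholders.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical illustration only; not the urlnormalizer-host or urlnormalizer-slash code.
class UrlNormalizeSketch {

    /**
     * Rewrites the host to the preferred www/non-www form and the path to the
     * preferred trailing-slash form. Returns the input unchanged if it cannot be parsed.
     */
    static String normalize(String url, boolean useWww, boolean useTrailingSlash) {
        try {
            URI u = new URI(url);
            String host = u.getHost();
            if (host == null) {
                return url; // no host component; nothing to normalize
            }
            boolean hasWww = host.startsWith("www.");
            if (useWww && !hasWww) {
                host = "www." + host;
            } else if (!useWww && hasWww) {
                host = host.substring(4);
            }

            String path = u.getPath();
            if (path == null || path.isEmpty()) {
                path = "/"; // an empty path and "/" name the same resource
            }
            boolean hasSlash = path.endsWith("/");
            if (useTrailingSlash && !hasSlash) {
                path = path + "/";
            } else if (!useTrailingSlash && hasSlash && path.length() > 1) {
                path = path.substring(0, path.length() - 1);
            }

            return new URI(u.getScheme(), u.getUserInfo(), host, u.getPort(),
                    path, u.getQuery(), u.getFragment()).toString();
        } catch (URISyntaxException e) {
            return url;
        }
    }

    public static void main(String[] args) {
        // Both variants map to the same canonical form for this (hypothetical) site.
        System.out.println(normalize("http://example.org/news/", true, false));
        System.out.println(normalize("http://www.example.org/news", true, false));
    }
}
```

The preferences are per site, which is why the choice has to be made by someone who knows which form each site treats as canonical. Blindly adding "www." would be wrong for hosts that are already subdomains.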

Re: Reg: URL Near Duplicate Issues with same content

2018-03-15 Thread Sebastian Nagel
Hi Shiva, 1. You can define URL normalizer rules to rewrite the URLs, but this only works for sites where you know which URL is the canonical form. 2. You can deduplicate (command "nutch dedup") based on the content checksum: the duplicates are still crawled but deleted afterwards. It's
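The idea behind option 2 can be sketched in a few lines of standalone Java. This is not the implementation of "nutch dedup": the class and method names are hypothetical, MD5 is an assumption about the checksum, and keeping the first URL seen is a simplification (the rule the real command uses to pick the survivor is not covered here).

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not the Nutch dedup job.
class ChecksumDedupSketch {

    /** Hex-encoded MD5 digest of the page content. */
    static String checksum(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest).toString(16);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Returns the URLs to keep: the first URL seen for each distinct checksum. */
    static List<String> keepOnePerChecksum(Map<String, String> contentByUrl) {
        Set<String> seen = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, String> page : contentByUrl.entrySet()) {
            // Set.add returns false if this checksum was already present
            if (seen.add(checksum(page.getValue()))) {
                kept.add(page.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("http://www.example.org/a", "same body");
        pages.put("http://example.org/a", "same body");      // duplicate content
        pages.put("http://www.example.org/b", "other body");
        System.out.println(keepOnePerChecksum(pages));
    }
}
```

Because the checksum is computed from fetched content, every duplicate still has to be downloaded before it can be recognized, which matches the caveat in the message above.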

Fwd: Reg: URL Near Duplicate Issues with same content

2018-03-15 Thread ShivaKarthik S
Hi, I am crawling many websites using Nutch 1.11, 1.13, or 1.14. While crawling, I am getting near-duplicate URLs like the following, where the content is exactly the same. *Case1: URLs with and Without WWW* http://www.samacharplus.com/~samachar/index.php/en/worlds/11-india/24151-nine-cr

RE: RE: Dependency between plugins

2018-03-15 Thread Yossi Tamari
If you look at the code of the HTML parser, you'll see that the parameter is passed the variable "root", the same variable that is passed to the methods that extract the outlinks, the title, and the text. So it simply can’t be null. It may be an issue with what toString is printing for this elem
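One possible reading of the toString point, not confirmed in the thread, is that the fragment itself is not null but its string form contains the word "null". The standalone sketch below builds a fragment with the JDK's bundled DOM implementation (an assumption; Nutch's parser may use a different one) and prints it in several ways. With the bundled implementation, printing the fragment typically shows something like `[#document-fragment: null]`, where "null" is the node value, which the DOM specification defines as null for fragments. The class and method names are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;

// Hypothetical illustration only; not taken from the Nutch HTML parser.
class FragmentPrintSketch {

    /** Builds a fragment containing a single <title> element with text "hello". */
    static DocumentFragment buildFragment() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            DocumentFragment fragment = doc.createDocumentFragment();
            Element title = doc.createElement("title");
            title.setTextContent("hello");
            fragment.appendChild(title);
            return fragment;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM parser available", e);
        }
    }

    public static void main(String[] args) {
        DocumentFragment fragment = buildFragment();
        // toString() is implementation-defined and may mention "null"
        System.out.println(fragment);
        // The node value of a fragment is null by specification
        System.out.println(fragment.getNodeValue());
        // The fragment is a real object with children and text
        System.out.println(fragment != null && fragment.hasChildNodes());
        System.out.println(fragment.getTextContent());
    }
}
```

Checking `doc == null` explicitly, or printing `doc.getTextContent()`, distinguishes a genuinely missing fragment from a misleading string representation.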

RE: RE: Dependency between plugins

2018-03-15 Thread Yash Thenuan Thenuan
Yes, I am using the HTML parser, and yes, the document is getting parsed, but the document fragment is printing null. On 15 Mar 2018 13:52, "Yossi Tamari" wrote: > Is your parser the HTML parser? I can say from experience that the > document is passed. > I really recommend debugging in local mode rather th

RE: RE: Dependency between plugins

2018-03-15 Thread Yossi Tamari
Is your parser the HTML parser? I can say from experience that the document is passed. I really recommend debugging in local mode rather than using sysout. > -Original Message- > From: Yash Thenuan Thenuan > Sent: 15 March 2018 10:13 > To: user@nutch.apache.org > Subject: RE: RE: Depende

RE: RE: Dependency between plugins

2018-03-15 Thread Yash Thenuan Thenuan
I tried printing the contents of the document fragment in parsefilter-regex by writing System.out.println(doc), but it's printing null!! And the document is getting parsed!! On 15 Mar 2018 13:15, "Yossi Tamari" wrote: > Parse filters receive a DocumentFragment as their fourth parameter. > > > -Origina

RE: RE: Dependency between plugins

2018-03-15 Thread Yossi Tamari
Parse filters receive a DocumentFragment as their fourth parameter. > -Original Message- > From: Yash Thenuan Thenuan > Sent: 15 March 2018 08:50 > To: user@nutch.apache.org > Subject: Re: RE: Dependency between plugins > > Hi Jorge and Yossi, > The reason why I am trying to do it is exa