In addition, if you crawl a fixed set of URLs with external links = false,
case 1 is solved.
For example, if you inject http://www.samacharplus.com/, only 1) will be
crawled; 2) will be ignored (because of external links = false).
1) http://www.samacharplus.com/~samachar/index.php/en/worlds/11-in
As for URL normalizers, you can use:
urlnormalizer-host to normalize between www and non-www hosts, and
urlnormalizer-slash to normalize trailing or non-trailing slashes per host.
There are no committed tools that automate this, but if your set of sites is
limited, it is easy to manage by hand.
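
To make the idea of hand-managed rules concrete, here is a minimal sketch in Python of per-host www and trailing-slash normalization. It is not the implementation or the configuration format of the urlnormalizer-host and urlnormalizer-slash plugins; the rule tables `CANONICAL_HOST` and `TRAILING_SLASH` and the function `normalize` are hypothetical names, and the rules themselves are assumed for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical hand-maintained rules (NOT the Nutch plugin file format).
# Map each non-canonical host to the host you decided is canonical.
CANONICAL_HOST = {"samacharplus.com": "www.samacharplus.com"}
# Per canonical host: True = always add a trailing slash, False = always strip it.
TRAILING_SLASH = {"www.samacharplus.com": True}


def normalize(url):
    """Rewrite a URL to its canonical host and trailing-slash form."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    host = CANONICAL_HOST.get(host, host)
    # Rebuild the network location; any user:password part is dropped here.
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    path = parts.path or "/"
    want_slash = TRAILING_SLASH.get(host)  # None means: no rule, leave as-is
    last_segment = path.rsplit("/", 1)[-1]
    if want_slash is True and not path.endswith("/") and "." not in last_segment:
        # Only add a slash to directory-like paths, not to file.ext paths.
        path += "/"
    elif want_slash is False and len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/")
    return urlunsplit((parts.scheme, netloc, path, parts.query, parts.fragment))
```

With these assumed rules, the www and non-www variants of a page collapse to a single URL, so only one of them would be fetched. Hosts without a rule pass through unchanged, which keeps the approach safe for sites where the canonical form is not known.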
Hi Shiva,
1. You can define URL normalizer rules to rewrite the URLs,
but this only works for sites where you know which URL is
the canonical form.
2. You can deduplicate (command "nutch dedup") based on the
content checksum: the duplicates are still crawled but deleted
afterwards.
It's
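
The checksum-based option in point 2 can be sketched as follows. This is an illustration of the general technique only, not the code behind "nutch dedup": the function name `find_duplicates`, the use of MD5 over the raw bytes, and the rule for choosing which URL to keep (shortest URL wins) are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def find_duplicates(pages):
    """Return the set of URLs to delete, given a dict of url -> content bytes.

    Pages are grouped by a checksum of their content; within each group
    one URL is kept and the rest are reported as duplicates.
    """
    groups = defaultdict(list)
    for url, content in pages.items():
        groups[hashlib.md5(content).hexdigest()].append(url)

    to_delete = set()
    for urls in groups.values():
        # Assumed tie-break for this sketch: keep the shortest URL,
        # falling back to alphabetical order for equal lengths.
        keep = min(urls, key=lambda u: (len(u), u))
        to_delete.update(u for u in urls if u != keep)
    return to_delete
```

Note that this runs after fetching, so every duplicate has already cost a request; it only cleans up the result, which matches the "still crawled but deleted afterwards" caveat above.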
Hi,
I am crawling many websites using Nutch 1.11, 1.13, or 1.14.
While crawling, I am getting near-duplicate URLs like the following, where the
content is exactly the same.
*Case 1: URLs with and without WWW*
http://www.samacharplus.com/~samachar/index.php/en/worlds/11-india/24151-nine-cr
If you look at the code of the HTML parser, you'll see that the parameter is
passed the variable "root", the same variable that is passed to the methods
that extract the outlinks, the title, and the text. So it simply can’t be null.
It may be an issue with what toString is printing for this elem
Yes, I am using the HTML parser, and yes, the document is getting parsed, but
the document fragment is printing null.
On 15 Mar 2018 13:52, "Yossi Tamari" wrote:
> Is your parser the HTML parser? I can say from experience that the
> document is passed.
> I really recommend debugging in local mode rather th
Is your parser the HTML parser? I can say from experience that the document is
passed.
I really recommend debugging in local mode rather than using sysout.
> -Original Message-
> From: Yash Thenuan Thenuan
> Sent: 15 March 2018 10:13
> To: user@nutch.apache.org
> Subject: RE: RE: Depende
I tried printing the contents of the document fragment in parsefilter-regex by
writing System.out.println(doc), but it's printing null!! And the document is
getting parsed!!
On 15 Mar 2018 13:15, "Yossi Tamari" wrote:
> Parse filters receive a DocumentFragment as their fourth parameter.
>
> > -Origina
Parse filters receive a DocumentFragment as their fourth parameter.
> -Original Message-
> From: Yash Thenuan Thenuan
> Sent: 15 March 2018 08:50
> To: user@nutch.apache.org
> Subject: Re: RE: Dependency between plugins
>
> Hi Jorge and Yossi,
> The reason why I am trying to do it is exa