Re: RE: Internal links appear to be external in Parse. Improvement of the crawling quality

Semyon Semyonov Wed, 21 Feb 2018 01:44:48 -0800

Thanks Yossi, Markus,

I have an issue with the db.ignore.external.links.mode=byDomain solution.


I crawl specific hosts only, so I have a finite number of hosts to crawl,
for example www.somewebsite.com.

I want to stay within this host, in other words to crawl neither
www.art.somewebsite.com nor www.sport.somewebsite.com.
That is why I use db.ignore.external.links.mode=byHost and
db.ignore.external.links=true (no external websites).
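
For reference, a sketch of how this setup would look in conf/nutch-site.xml.
The property names are the ones defined in nutch-default.xml (the full name
of the first one is db.ignore.external.links); the values are the ones
described above, and the rest of the file is omitted.

```xml
<!-- conf/nutch-site.xml: sketch of the host-limited setup described above -->
<configuration>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
  <property>
    <name>db.ignore.external.links.mode</name>
    <value>byHost</value>
  </property>
</configuration>
```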

However, I still want to follow the links that appear to belong to the same
host (www.somewebsite.com -> somewebsite.com/games, without www).
The question is: shouldn't Nutch, by default or as a configurable option,
treat www.somewebsite.com and somewebsite.com as one host?
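
To make the question concrete, here is a minimal sketch in Python (not Nutch
code) of the comparison I have in mind: two URLs count as the same host if
their host names are equal once a leading "www." is dropped. The function
names are made up for illustration, and treating "www." as insignificant is
an assumption that will not hold for every site.

```python
from urllib.parse import urlsplit


def canonical_host(url: str) -> str:
    """Return the lower-cased host of `url` with one leading 'www.' removed."""
    host = (urlsplit(url).hostname or "").lower()
    # Assumption: a leading "www." label carries no meaning of its own.
    if host.startswith("www."):
        host = host[len("www."):]
    return host


def same_host_ignoring_www(from_url: str, to_url: str) -> bool:
    """True if both URLs point at the same host once 'www.' is ignored."""
    a, b = canonical_host(from_url), canonical_host(to_url)
    # An empty host (unparseable URL) never matches anything.
    return bool(a) and a == b


if __name__ == "__main__":
    seed = "http://www.somewebsite.com/"
    for link in ("http://somewebsite.com/games",
                 "http://www.art.somewebsite.com/",
                 "https://www.somewebsite.com/news"):
        print(link, same_host_ignoring_www(seed, link))
```

With this rule somewebsite.com/games is accepted for the seed
www.somewebsite.com, while www.art.somewebsite.com is still rejected,
because only the first "www." label is stripped. The protocol is not part of
the comparison.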



PS. It is not really clear to me how ProtocolResolver works.

Semyon


 

Sent: Tuesday, February 20, 2018 at 9:40 PM
From: "Markus Jelsma" <markus.jel...@openindex.io>
To: "user@nutch.apache.org" <user@nutch.apache.org>
Subject: RE: Internal links appear to be external in Parse. Improvement of the 
crawling quality
Hello Semyon,

Yossi is right, you can use the db.ignore.* set of directives to resolve the 
problem.

Regarding the protocol, you can use urlnormalizer-protocol to set up per-host
rules. This is, of course, a tedious job if you crawl an indefinite number of
hosts, so use the not-yet-committed ProtocolResolver to do it for you.

See: https://issues.apache.org/jira/browse/NUTCH-2247

If I remember it tomorrow afternoon, I can probably schedule some time to work
on it in the coming seven days or so, and commit it.

Regards,
Markus

-----Original message-----
> From:Yossi Tamari <yossi.tam...@pipl.com>
> Sent: Tuesday 20th February 2018 21:06
> To: user@nutch.apache.org
> Subject: RE: Internal links appear to be external in Parse. Improvement of 
> the crawling quality
>
> Hi Semyon,
>
> Wouldn't setting db.ignore.external.links.mode=byDomain solve your wincs.be 
> issue?
> As far as I can see, the protocol (HTTP/HTTPS) does not play any part in the
> decision whether this is the same domain.
>
> Yossi.
>
> > -----Original Message-----
> > From: Semyon Semyonov [mailto:semyon.semyo...@mail.com]
> > Sent: 20 February 2018 20:43
> > To: user@nutch.apache.org
> > Subject: Internal links appear to be external in Parse. Improvement of the
> > crawling quality
> >
> > Dear All,
> >
> > I'm trying to improve the quality of the crawl. A part of my database has
> > DB_FETCHED = 1.
> >
> > For example, http://www.wincs.be/ is in the seed list.
> >
> > The root of the problem is in Parse/ParseOutputFormat.java, lines 364-374.
> >
> > Nutch considers one of the links (http://wincs.be/lakindustrie.html)
> > external and therefore rejects it.
> >
> >
> > If I insert http://wincs.be in the seed file, everything works fine.
> >
> > Do you think this is good behavior? I mean, formally these are indeed two
> > different domains, but from the user's perspective they are exactly the
> > same.
> >
> > And if this is the default behavior, how can I fix it for my case? The
> > same question applies to similar switches such as http -> https.
> >
> > Thanks.
> >
> > Semyon.
>
>
