Thanks Yossi, Markus,

I have an issue with the db.ignore.external.links.mode=byDomain solution.

I crawl specific hosts only, so I have a finite number of hosts to crawl. Let's say the host is www.somewebsite.com, and I want to stay limited to this host. In other words, neither www.art.somewebsite.com nor www.sport.somewebsite.com should be crawled. That's why I use db.ignore.external.links.mode=byHost and db.ignore.external.links=true (no external websites).

However, I still want to get the links that appear to belong to the same host (www.somewebsite.com -> somewebsite.com/games, without www).

The question is: shouldn't we include this as a default (or configurable) behavior in Nutch, and interpret www.somewebsite.com and somewebsite.com as one host?

PS. It is not really clear to me how ProtocolResolver works.

Semyon

Sent: Tuesday, February 20, 2018 at 9:40 PM
From: "Markus Jelsma" <markus.jel...@openindex.io>
To: "user@nutch.apache.org" <user@nutch.apache.org>
Subject: RE: Internal links appear to be external in Parse. Improvement of the crawling quality

Hello Semyon,

Yossi is right, you can use the db.ignore.* set of directives to resolve the problem.

Regarding protocol, you can use urlnormalizer-protocol to set up per-host rules. This is, of course, a tedious job if you operate a crawl on an indefinite number of hosts, so use the uncommitted ProtocolResolver to do it for you. See: https://issues.apache.org/jira/browse/NUTCH-2247

If I remember it tomorrow afternoon, I can probably schedule some time to work on it in the coming seven days or so, and commit.

Regards,
Markus

-----Original message-----
> From:Yossi Tamari <yossi.tam...@pipl.com>
> Sent: Tuesday 20th February 2018 21:06
> To: user@nutch.apache.org
> Subject: RE: Internal links appear to be external in Parse. Improvement of the crawling quality
>
> Hi Semyon,
>
> Wouldn't setting db.ignore.external.links.mode=byDomain solve your wincs.be issue?
> As far as I can see the protocol (HTTP/HTTPS) does not play any part in the decision if this is the same domain.
>
> Yossi.
>
> > -----Original Message-----
> > From: Semyon Semyonov [mailto:semyon.semyo...@mail.com]
> > Sent: 20 February 2018 20:43
> > To: user@nutch.apache.org <user@nutch.apache.org>
> > Subject: Internal links appear to be external in Parse. Improvement of the crawling quality
> >
> > Dear All,
> >
> > I'm trying to increase the quality of the crawling. A part of my database has DB_FETCHED = 1.
> >
> > Example: http://www.wincs.be/ in the seed list.
> >
> > The root of the problem is in Parse/ParseOutputFormat.java, lines 364 - 374.
> >
> > Nutch considers one of the links (http://wincs.be/lakindustrie.html) as external and therefore rejects it.
> >
> > If I insert http://wincs.be in the seed file, everything works fine.
> >
> > Do you think this is good behavior? I mean, formally these are indeed two different domains, but from the user's perspective they are exactly the same.
> >
> > And if it is the default behavior, how can I fix it for my case? The same question applies to similar switches, http -> https etc.
> >
> > Thanks.
> >
> > Semyon.
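
To make the question at the top of this thread concrete, here is a minimal Java sketch of the comparison being asked for: two URLs count as the same host when their host names match after a single leading "www." is dropped. This is an illustration only. It is not how Nutch's ParseOutputFormat decides today and not an existing Nutch option; the class and method names (WwwInsensitiveHost, normalizedHost, sameHostIgnoringWww) are made up for the example.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: compares hosts while ignoring a leading "www.".
class WwwInsensitiveHost {

    // Returns the lower-cased host of the URL with one leading "www." removed.
    static String normalizedHost(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("No host in URL: " + url);
        }
        host = host.toLowerCase(Locale.ROOT);
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    // True when both URLs point at the same host, treating "www.x" and "x" as equal.
    static boolean sameHostIgnoringWww(String fromUrl, String toUrl) {
        return normalizedHost(fromUrl).equals(normalizedHost(toUrl));
    }

    public static void main(String[] args) {
        // The case from the original report: seed with www, outlink without.
        System.out.println(sameHostIgnoringWww(
                "http://www.wincs.be/", "http://wincs.be/lakindustrie.html")); // true
        // Other subdomains stay external, as with byHost.
        System.out.println(sameHostIgnoringWww(
                "http://www.somewebsite.com/", "http://www.art.somewebsite.com/")); // false
    }
}
```

Under this rule www.art.somewebsite.com and www.sport.somewebsite.com are still treated as different hosts, so the crawl stays as narrow as with byHost; only the www/non-www pair is merged.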