Hi again,

Another issue has appeared with the introduction of the bidirectional URL exemption filter.
Given http://www.website.com/page1 and http://website.com/page2: before, as indexer output (let's say a text file), I had one parent/host (www.website.com) with children/pages (http://www.website.com/page1, http://www.website.com/...). Now I have two different hosts and therefore two different parents in my output. I would prefer to have the same hostname/alias for both hosts. I checked the URL exemption filters and they do not allow adding metadata to the parsed data. Therefore, two questions: 1) What is the best way to do it? 2) Should I include it in the Nutch code, or is it not needed there and I should make a quick fix for myself?

Semyon.

Sent: Tuesday, March 06, 2018 at 11:08 AM
From: "Sebastian Nagel" <wastl.na...@googlemail.com>
To: user@nutch.apache.org
Subject: Re: Internal links appear to be external in Parse. Improvement of the crawling quality

Hi Semyon,

> We apply logical AND here, which is not really reasonable here.

Until now there was only a single exemption filter, so it made no difference. But yes, it sounds plausible to change this to an OR, i.e. return true as soon as one of the filters accepts/exempts the URL. Please open an issue to change it.

Thanks,
Sebastian

On 03/06/2018 10:28 AM, Semyon Semyonov wrote:
> I have proposed a solution for this problem:
> https://issues.apache.org/jira/browse/NUTCH-2522
>
> The other question is how the voting mechanism of UrlExemptionFilters should work.
>
> UrlExemptionFilters.java, lines 60-65:
>
> // An URL is exempted when all the filters accept it to pass through
> for (int i = 0; i < this.filters.length && exempted; i++) {
>   exempted = this.filters[i].filter(fromUrl, toUrl);
> }
>
> We apply logical AND here, which is not really reasonable.
>
> I think if one of the filters votes for exempt then we should exempt it,
> therefore logical OR instead.
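The two voting schemes under discussion can be contrasted in a small self-contained sketch. Filters are modeled here as plain `BiPredicate<String, String>` rather than Nutch's `URLExemptionFilter`, and the two demo filters (a strict filter and a www filter) are hypothetical:

```java
import java.util.List;
import java.util.function.BiPredicate;

// Sketch of the two voting schemes for UrlExemptionFilters: the current
// logical AND (all filters must exempt) versus the proposed logical OR
// (one exempting filter suffices).
public class ExemptionVoting {

    // Current behavior (mirrors UrlExemptionFilters.java, lines 60-65):
    // the URL stays exempted only while every filter agrees.
    public static boolean exemptAll(List<BiPredicate<String, String>> filters,
                                    String fromUrl, String toUrl) {
        boolean exempted = true;
        for (int i = 0; i < filters.size() && exempted; i++) {
            exempted = filters.get(i).test(fromUrl, toUrl);
        }
        return exempted;
    }

    // Proposed behavior: return true as soon as one filter exempts the URL.
    public static boolean exemptAny(List<BiPredicate<String, String>> filters,
                                    String fromUrl, String toUrl) {
        for (BiPredicate<String, String> filter : filters) {
            if (filter.test(fromUrl, toUrl)) {
                return true; // short-circuit on the first exempting vote
            }
        }
        return false;
    }

    // Demo filters: a strict filter that never exempts cross-host links,
    // and a (hypothetical) www filter that would exempt the www/non-www pair.
    private static List<BiPredicate<String, String>> demoFilters() {
        BiPredicate<String, String> strict = (from, to) -> false;
        BiPredicate<String, String> www = (from, to) -> true;
        return List.of(strict, www);
    }

    public static boolean demoAnd() {
        return exemptAll(demoFilters(), "http://www.website.com", "http://website.com/about");
    }

    public static boolean demoOr() {
        return exemptAny(demoFilters(), "http://www.website.com", "http://website.com/about");
    }

    public static void main(String[] args) {
        System.out.println("AND voting exempts: " + demoAnd()); // false
        System.out.println("OR voting exempts:  " + demoOr());  // true
    }
}
```

With AND, the strict filter vetoes the exemption even though the www filter accepts it; with OR, the www filter's single vote is enough.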
> For example, with the new filter, links such as
> http://www.website.com -> http://website.com/about can be exempted, but the
> standard filter will not exempt them because they are from different hosts.
> With the current logic (logical AND), the URL will therefore not be exempted.
>
> Any ideas?
>
> Sent: Wednesday, February 21, 2018 at 2:58 PM
> From: "Sebastian Nagel" <wastl.na...@googlemail.com>
> To: user@nutch.apache.org
> Subject: Re: Internal links appear to be external in Parse. Improvement of the crawling quality
>
>> 1) Do we have a config setting that we can use already?
>
> Not out-of-the-box. But there is already an extension point for your use case [1]:
> the filter method takes two arguments (fromURL and toURL).
> Have a look at it; maybe you can fix it by implementing/contributing a plugin.
>
>> 2) ... It looks more like same Host problem rather ...
>
> To determine the host of a URL, Nutch uses java.net.URL.getHost() everywhere,
> which implements RFC 1738 [2]. We cannot change Java, but it would be possible
> to modify URLUtil.getDomainName(...), at least as a work-around.
>
>> 3) Where should this problem be solved? Only in ParseOutputFormat.java or
>> somewhere else as well?
>
> You may also want to fix it in FetcherThread.handleRedirect(...), which also
> affects your use case of following only internal links (if db.ignore.also.redirects == true).
>
> Best,
> Sebastian
>
> [1] https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/net/URLExemptionFilter.html
>     https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.html
> [2] https://tools.ietf.org/html/rfc1738#section-3.1
>
> On 02/21/2018 01:52 PM, Semyon Semyonov wrote:
>> Hi Sebastian,
>>
>> If I
>> - modify the method URLUtil.getDomainName(URL url)
>> doesn't it mean that I don't need
>> - set db.ignore.external.links.mode=byDomain
>> anymore?
>> http://www.somewebsite.com becomes the same host as somewebsite.com.
>>
>> To make this as generic as possible I can create an issue/pull request for
>> it, but I would like to hear your suggestion about the best way to do so.
>> 1) Do we have a config setting that we can use already?
>> 2) The domain discussion [1] is quite wide, though. In my case I cover only
>> one issue: the mapping www -> _ . It looks more like a same-Host problem
>> rather than a same-Domain problem. What do you think about such host
>> resolution?
>> 3) Where should this problem be solved? Only in ParseOutputFormat.java or
>> somewhere else as well?
>>
>> Semyon.
>>
>> Sent: Wednesday, February 21, 2018 at 11:51 AM
>> From: "Sebastian Nagel" <wastl.na...@googlemail.com>
>> To: user@nutch.apache.org
>> Subject: Re: Internal links appear to be external in Parse. Improvement of the crawling quality
>>
>> Hi Semyon,
>>
>>> interpret www.somewebsite.com and somewebsite.com as one host?
>>
>> Yes, that's a common problem. It arises mostly from external links, which must
>> include the host name; well-designed sites would use relative links
>> for internal same-host links.
>>
>> For a quick work-around:
>> - set db.ignore.external.links.mode=byDomain
>> - modify the method URLUtil.getDomainName(URL url)
>>   so that it returns the hostname with "www." stripped
>>
>> For a final solution we could make it configurable
>> which method or class is called. Since the definition of "domain"
>> is somewhat debatable [1], we could even provide alternative
>> implementations.
>>
>>> PS. For me it is not really clear how ProtocolResolver works.
>>
>> It's only a heuristic to avoid duplicates by protocol (http and https).
>> If you care about duplicates and cannot get rid of them afterwards by a
>> deduplication job, you may have a look at urlnormalizer-protocol and NUTCH-2447.
>>
>> Best,
>> Sebastian
>>
>> [1] https://github.com/google/guava/wiki/InternetDomainNameExplained
>>
>> On 02/21/2018 10:44 AM, Semyon Semyonov wrote:
>>> Thanks Yossi, Markus,
>>>
>>> I have an issue with the db.ignore.external.links.mode=byDomain solution.
>>>
>>> I crawl specific hosts only, therefore I have a finite number of hosts to
>>> crawl. Let's say www.somewebsite.com.
>>>
>>> I want to stay limited to this host. In other words, neither
>>> www.art.somewebsite.com nor www.sport.somewebsite.com.
>>> That's why db.ignore.external.links.mode=byHost and db.ignore.external =
>>> true (no external websites).
>>>
>>> However, I want to get the links that seem to belong to the same host
>>> (www.somewebsite.com -> somewebsite.com/games, without www).
>>> The question is: shouldn't we include this as default (or configurable)
>>> behavior in Nutch and interpret www.somewebsite.com and somewebsite.com
>>> as one host?
>>>
>>> PS. For me it is not really clear how ProtocolResolver works.
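A side note on why byDomain alone does not match this requirement: any domain-level comparison maps all subdomains to the same key. A naive sketch makes the granularity visible (it simply takes the last two host labels; the real URLUtil.getDomainName is more careful about public suffixes, so this is only an illustration):

```java
// Illustrates the difference between host-level and domain-level matching:
// a domain comparison treats every subdomain as "the same site", which is
// broader than the www/non-www aliasing Semyon wants.
public class DomainVsHost {

    // Naive "registered domain": the last two dot-separated labels.
    // (Nutch's URLUtil.getDomainName consults a suffix list; this sketch
    // exists only to show the granularity, not to replace it.)
    public static String naiveDomain(String host) {
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }

    public static void main(String[] args) {
        System.out.println(naiveDomain("www.somewebsite.com"));     // somewebsite.com
        System.out.println(naiveDomain("somewebsite.com"));         // somewebsite.com
        System.out.println(naiveDomain("www.art.somewebsite.com")); // somewebsite.com too:
        // byDomain would also follow links into art./sport. subdomains,
        // which is exactly what Semyon wants to avoid with byHost.
    }
}
```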
>>>
>>> Semyon
>>>
>>> Sent: Tuesday, February 20, 2018 at 9:40 PM
>>> From: "Markus Jelsma" <markus.jel...@openindex.io>
>>> To: "user@nutch.apache.org" <user@nutch.apache.org>
>>> Subject: RE: Internal links appear to be external in Parse. Improvement of the crawling quality
>>>
>>> Hello Semyon,
>>>
>>> Yossi is right, you can use the db.ignore.* set of directives to resolve
>>> the problem.
>>>
>>> Regarding protocol, you can use urlnormalizer-protocol to set up per-host
>>> rules. This is, of course, a tedious job if you operate a crawl on an
>>> indefinite number of hosts, so use the uncommitted ProtocolResolver to do
>>> it for you.
>>>
>>> See: https://issues.apache.org/jira/browse/NUTCH-2247
>>>
>>> If I remember it tomorrow afternoon, I can probably schedule some time to
>>> work on it in the coming seven days or so, and commit.
>>>
>>> Regards,
>>> Markus
>>>
>>> -----Original message-----
>>>> From: Yossi Tamari <yossi.tam...@pipl.com>
>>>> Sent: Tuesday 20th February 2018 21:06
>>>> To: user@nutch.apache.org
>>>> Subject: RE: Internal links appear to be external in Parse. Improvement of the crawling quality
>>>>
>>>> Hi Semyon,
>>>>
>>>> Wouldn't setting db.ignore.external.links.mode=byDomain solve your
>>>> wincs.be issue?
>>>> As far as I can see, the protocol (HTTP/HTTPS) does not play any part in
>>>> the decision whether this is the same domain.
>>>>
>>>> Yossi.
>>>>
>>>>> -----Original Message-----
>>>>> From: Semyon Semyonov [mailto:semyon.semyo...@mail.com]
>>>>> Sent: 20 February 2018 20:43
>>>>> To: user@nutch.apache.org
>>>>> Subject: Internal links appear to be external in Parse. Improvement of the crawling quality
>>>>>
>>>>> Dear All,
>>>>>
>>>>> I'm trying to increase the quality of the crawling. A part of my database
>>>>> has DB_FETCHED = 1.
>>>>>
>>>>> For example, http://www.wincs.be/ is in the seed list.
>>>>>
>>>>> The root of the problem is in Parse/ParseOutputFormat.java, lines 364-374.
>>>>>
>>>>> Nutch considers one of the links (http://wincs.be/lakindustrie.html)
>>>>> as external and therefore rejects it.
>>>>>
>>>>> If I put http://wincs.be in the seed file, everything works fine.
>>>>>
>>>>> Do you think this is good behavior? I mean, formally these are indeed two
>>>>> different domains, but from the user's perspective they are exactly the same.
>>>>>
>>>>> And if it is the default behavior, how can I fix it for my case? The same
>>>>> question for similar switches, http -> https etc.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Semyon.
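Pulling the thread together: the www/non-www exemption could be prototyped as a filter in the spirit of the URLExemptionFilter extension point mentioned above. The sketch below deliberately has no Nutch dependency so it compiles on its own; in a real plugin the class would implement org.apache.nutch.net.URLExemptionFilter and be registered via plugin.xml. Class name and method shape here are illustrative, not the committed code:

```java
import java.net.MalformedURLException;
import java.net.URL;

// Standalone sketch of a "www exemption" filter: a link is exempted from the
// external-link check when source and target hosts differ only by a leading
// "www.". In a real Nutch plugin this logic would live in an implementation
// of org.apache.nutch.net.URLExemptionFilter.
public class WwwExemptionFilter {

    public boolean filter(String fromUrl, String toUrl) {
        try {
            String from = stripWww(new URL(fromUrl).getHost());
            String to = stripWww(new URL(toUrl).getHost());
            return from.equalsIgnoreCase(to);
        } catch (MalformedURLException e) {
            return false; // malformed URLs are never exempted
        }
    }

    // Remove a leading "www." (case-insensitively), leaving other subdomains intact.
    private static String stripWww(String host) {
        return host.regionMatches(true, 0, "www.", 0, 4) ? host.substring(4) : host;
    }

    public static void main(String[] args) {
        WwwExemptionFilter f = new WwwExemptionFilter();
        // The wincs.be case from the original message: exempted.
        System.out.println(f.filter("http://www.wincs.be/",
                                    "http://wincs.be/lakindustrie.html"));      // true
        // A genuine subdomain stays external, as Semyon requires.
        System.out.println(f.filter("http://www.somewebsite.com/",
                                    "http://www.sport.somewebsite.com/"));      // false
    }
}
```

Combined with OR-based voting in UrlExemptionFilters, such a filter would let http://www.wincs.be/ follow http://wincs.be/lakindustrie.html without opening the crawl to art.* or sport.* subdomains.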