Re: Internal links appear to be external in Parse. Improvement of the crawling quality

Semyon Semyonov Tue, 20 Mar 2018 08:18:06 -0700

I found out that there is no direct way to do it. I solved the problem by 
calling the regex transformation one more time in IndexerMapReduce, before 
the Indexer gets the doc for writing.


Something like this (IndexerMapReduce.java, line 369):
 doc.add("modifiedId",
     URLUtil.getHost(BidirectionalUrlExemptionFilter.tranform(key.toString())));
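
As a rough illustration of the idea (not the actual patch), the sketch below 
maps both spellings of a host to one alias by stripping a leading "www.". The 
names HostAlias and hostAlias are hypothetical, and the fixed "www." rule is 
an assumption standing in for the configurable regex transformation; only 
java.net.URI from the JDK is used.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Nutch.
class HostAlias {

  // Returns the lower-cased host of the URL with one leading "www." removed,
  // or null if the URL has no host component.
  static String hostAlias(String url) {
    String host = URI.create(url).getHost();
    if (host == null) {
      return null;
    }
    host = host.toLowerCase(Locale.ROOT);
    return host.startsWith("www.") ? host.substring("www.".length()) : host;
  }

  public static void main(String[] args) {
    // Both pages end up under the same parent/host alias.
    System.out.println(hostAlias("http://www.website.com/page1")); // website.com
    System.out.println(hostAlias("http://website.com/page2"));     // website.com
  }
}
```

The value returned here plays the role of the "modifiedId" field above: every 
page of the site is grouped under one parent, whichever spelling it was found 
under.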
 

Sent: Friday, March 16, 2018 at 7:20 PM
From: "Semyon Semyonov" <semyon.semyo...@mail.com>
To: user@nutch.apache.org
Subject: Re: Internal links appear to be external in Parse. Improvement of the 
crawling quality
Hi again,

Another issue has appeared with the introduction of the bidirectional URL 
exemption filter.

Having
http://www.website.com/page1
and
http://website.com/page2

Before, as indexer output (let's say a text file), I had one parent/host 
(www.website.com) with children/pages (http://www.website.com/page1, 
http://www.website.com/, ...).
Now I have two different hosts and therefore two different parents in my 
output. I would prefer to have the same hostname/alias for both hosts.

I checked the URL exemption filters and they don't allow adding metadata to 
the parsed data.

Therefore, two questions:
1) What is the best way to do it?
2) Should I contribute it to the Nutch code, or is it not needed there and I 
should make a quick fix for myself?

Semyon.
 

Sent: Tuesday, March 06, 2018 at 11:08 AM
From: "Sebastian Nagel" <wastl.na...@googlemail.com>
To: user@nutch.apache.org
Subject: Re: Internal links appear to be external in Parse. Improvement of the 
crawling quality
Hi Semyon,

> We apply logical AND here, which is not really reasonable here.

Until now there was only a single exemption filter, so it made no difference.
But yes, it sounds plausible to change this to an OR, i.e. to return true
as soon as one of the filters accepts/exempts the URL. Please open an issue
to change it.
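
To make the difference concrete, here is a minimal sketch of both voting 
schemes. It does not use the Nutch URLExemptionFilter interface: filters are 
modelled as BiPredicate<String, String>, and ExemptionVoting, SAME_HOST and 
WWW_ALIAS are hypothetical names chosen for the example. The initial value of 
"exempted" is also an assumption, since the quoted snippet does not show it.

```java
import java.net.URI;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical stand-in for the voting logic; these are not the Nutch classes.
class ExemptionVoting {

  // Stand-in for a filter that only exempts links within the same host.
  static final BiPredicate<String, String> SAME_HOST =
      (from, to) -> host(from).equals(host(to));

  // Stand-in for a filter that treats "www.example.com" and "example.com" alike.
  static final BiPredicate<String, String> WWW_ALIAS =
      (from, to) -> stripWww(host(from)).equals(stripWww(host(to)));

  static String host(String url) {
    String h = URI.create(url).getHost();
    return h == null ? "" : h;
  }

  static String stripWww(String host) {
    return host.startsWith("www.") ? host.substring("www.".length()) : host;
  }

  // Logical AND: the link is exempted only if every filter accepts it.
  static boolean exemptAll(List<BiPredicate<String, String>> filters,
                           String from, String to) {
    boolean exempted = true; // assumed starting value
    for (int i = 0; i < filters.size() && exempted; i++) {
      exempted = filters.get(i).test(from, to);
    }
    return exempted;
  }

  // Logical OR: the link is exempted as soon as one filter accepts it.
  static boolean exemptAny(List<BiPredicate<String, String>> filters,
                           String from, String to) {
    for (BiPredicate<String, String> filter : filters) {
      if (filter.test(from, to)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<BiPredicate<String, String>> filters = List.of(SAME_HOST, WWW_ALIAS);
    String from = "http://www.website.com";
    String to = "http://website.com/about";
    System.out.println(exemptAll(filters, from, to)); // false
    System.out.println(exemptAny(filters, from, to)); // true
  }
}
```

With AND, the same-host filter vetoes the www alias link even though the 
second filter would accept it; with OR, one accepting filter is enough.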

Thanks,
Sebastian

On 03/06/2018 10:28 AM, Semyon Semyonov wrote:
> I have proposed a solution for this problem: 
> https://issues.apache.org/jira/browse/NUTCH-2522
>
> The other question is how the voting mechanism of UrlExemptionFilters should work.
>
> UrlExemptionFilters.java : lines 60-65
> // A URL is exempted when all the filters accept it to pass through
> for (int i = 0; i < this.filters.length && exempted; i++) {
>   exempted = this.filters[i].filter(fromUrl, toUrl);
> }
> We apply logical AND here, which is not really reasonable here.
>
> I think if one of the filters votes for exemption then we should exempt the 
> URL, so a logical OR should be used instead.
> For example, with the new filter, links such as 
> http://www.website.com
>  -> 
> http://website.com/about
>  can be exempted, but the standard filter will not exempt them because they 
> are from different hosts. With the current logic the URL will not be 
> exempted, because of the logical AND.
>
>
> Any ideas?
>
>  
>  
>
> Sent: Wednesday, February 21, 2018 at 2:58 PM
> From: "Sebastian Nagel" <wastl.na...@googlemail.com>
> To: user@nutch.apache.org
> Subject: Re: Internal links appear to be external in Parse. Improvement of 
> the crawling quality
>> 1) Do we have a config setting that we can use already?
>
> Not out-of-the-box. But there is already an extension point for your use case 
> [1]:
> the filter method takes two arguments (fromURL and toURL).
> Have a look at it, maybe you can fix it by implementing/contributing a plugin.
>
>> 2) ... It looks more like same Host problem rather ...
>
> To determine the host of a URL, Nutch uses java.net.URL.getHost() everywhere,
> which implements RFC 1738 [2]. We cannot change Java, but it would be possible
> to modify URLUtil.getDomainName(...), at least as a work-around.
>
>> 3) Where this problem should be solved? Only in ParseOutputFormat.java or 
>> somewhere else as well?
>
> You may also want to fix it in FetcherThread.handleRedirect(...), which 
> also affects your use case
> of following only internal links (if db.ignore.also.redirects == true).
>
> Best,
> Sebastian
>
>
> [1] 
> https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/net/URLExemptionFilter.html
>
> https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.html
> [2] 
> https://tools.ietf.org/html/rfc1738#section-3.1
>
>
> On 02/21/2018 01:52 PM, Semyon Semyonov wrote:
>> Hi Sebastian,
>>
>> If I
>> - modify the method URLUtil.getDomainName(URL url)
>>
>> doesn't it mean that I don't need
>>  - set db.ignore.external.links.mode=byDomain
>>
>> anymore? 
>> http://www.somewebsite.com
>>  becomes the same host as somewebsite.com.
>>
>>
>> To make it as generic as possible I can create an issue/pull request for 
>> this, but I would like to hear your suggestion about the best way to do so.
>> 1) Do we have a config setting that we can use already?
>> 2) The domain discussion[1] is quite wide though. In my case I cover only 
>> one issue with the mapping www -> _ . It looks more like same Host problem 
>> rather than the same Domain problem. What do you think about such host 
>> resolution?
>> 3) Where this problem should be solved? Only in ParseOutputFormat.java or 
>> somewhere else as well?
>>
>> Semyon.
>>
>>
>>  
>>
>> Sent: Wednesday, February 21, 2018 at 11:51 AM
>> From: "Sebastian Nagel" <wastl.na...@googlemail.com>
>> To: user@nutch.apache.org
>> Subject: Re: Internal links appear to be external in Parse. Improvement of 
>> the crawling quality
>> Hi Semyon,
>>
>>> interpret 
>>> www.somewebsite.com
>>>  and somewebsite.com as one host?
>>
>> Yes, that's a common problem. More because of external links which must
>> include the host name - well-designed sites would use relative links
>> for internal same-host links.
>>
>> For a quick work-around:
>> - set db.ignore.external.links.mode=byDomain
>> - modify the method URLUtil.getDomainName(URL url)
>> so that it returns the hostname with www. stripped
>>
>> For a final solution we could make it configurable
>> which method or class is called. Since the definition of "domain"
>> is somewhat debatable [1], we could even provide alternative
>> implementations.
>>
>>> PS. For me it is not really clear how ProtocolResolver works.
>>
>> It's only a heuristic to avoid duplicates by protocol (http and https).
>> If you care about duplicates and cannot get rid of them afterwards by a 
>> deduplication job,
>> you may have a look at urlnormalizer-protocol and NUTCH-2447.
>>
>> Best,
>> Sebastian
>>
>>
>> [1] 
>> https://github.com/google/guava/wiki/InternetDomainNameExplained
>>
>> On 02/21/2018 10:44 AM, Semyon Semyonov wrote:
>>> Thanks Yossi, Markus,
>>>
>>> I have an issue with the db.ignore.external.links.mode=byDomain solution.
>>>
>>> I crawl specific hosts only, therefore I have a finite number of hosts to 
>>> crawl.
>>> Let's say 
>>> www.somewebsite.com
>>>
>>> I want to stay limited to this host. In other words, neither 
>>> www.art.somewebsite.com
>>>  nor 
>>> www.sport.somewebsite.com.
>>> That's why db.ignore.external.links.mode=byHost and 
>>> db.ignore.external.links=true (no external websites).
>>>
>>> However, I want to get the links that seem to belong to the same host 
>>> (www.somewebsite.com -> somewebsite.com/games, without www).
>>> The question is: shouldn't we include it as a default behavior (or 
>>> configured behavior) in Nutch and interpret 
>>> www.somewebsite.com
>>>  and somewebsite.com as one host?
>>>
>>>
>>>
>>> PS. For me it is not really clear how ProtocolResolver works.
>>>
>>> Semyon
>>>
>>>
>>>  
>>>
>>> Sent: Tuesday, February 20, 2018 at 9:40 PM
>>> From: "Markus Jelsma" <markus.jel...@openindex.io>
>>> To: "user@nutch.apache.org" <user@nutch.apache.org>
>>> Subject: RE: Internal links appear to be external in Parse. Improvement of 
>>> the crawling quality
>>> Hello Semyon,
>>>
>>> Yossi is right, you can use the db.ignore.* set of directives to resolve 
>>> the problem.
>>>
>>> Regarding protocol, you can use urlnormalizer-protocol to set up per-host 
>>> rules. This is, of course, a tedious job if you operate a crawl on an 
>>> indefinite number of hosts, so use the uncommitted ProtocolResolver to 
>>> do it for you.
>>>
>>> See: 
>>> https://issues.apache.org/jira/browse/NUTCH-2247
>>>
>>> If I remember it tomorrow afternoon, I can probably schedule some time to 
>>> work on it in the coming seven days or so, and commit.
>>>
>>> Regards,
>>> Markus
>>>
>>> -----Original message-----
>>>> From:Yossi Tamari <yossi.tam...@pipl.com>
>>>> Sent: Tuesday 20th February 2018 21:06
>>>> To: user@nutch.apache.org
>>>> Subject: RE: Internal links appear to be external in Parse. Improvement of 
>>>> the crawling quality
>>>>
>>>> Hi Semyon,
>>>>
>>>> Wouldn't setting db.ignore.external.links.mode=byDomain solve your 
>>>> wincs.be issue?
>>>> As far as I can see the protocol (HTTP/HTTPS) does not play any part in 
>>>> the decision if this is the same domain.
>>>>
>>>> Yossi.
>>>>
>>>>> -----Original Message-----
>>>>> From: Semyon Semyonov [mailto:semyon.semyo...@mail.com]
>>>>> Sent: 20 February 2018 20:43
>>>>> To: user@nutch.apache.org
>>>>> Subject: Internal links appear to be external in Parse. 
>>>>> Improvement of the
>>>>> crawling quality
>>>>>
>>>>> Dear All,
>>>>>
>>>>> I'm trying to increase the quality of the crawling. A part of my database has
>>>>> DB_FETCHED = 1.
>>>>>
>>>>> Example: 
>>>>> http://www.wincs.be/
>>>>>  in the seed list.
>>>>>
>>>>> The root of the problem is in Parse/ParseOutputFormat.java, lines 364-374.
>>>>>
>>>>> Nutch considers one of the links 
>>>>> (http://wincs.be/lakindustrie.html)
>>>>>  as external
>>>>> and therefore rejects it.
>>>>>
>>>>>
>>>>> If I insert 
>>>>> http://wincs.be
>>>>>  in the seed file, everything works fine.
>>>>>
>>>>> Do you think this is good behavior? I mean, formally these are indeed two 
>>>>> different
>>>>> domains, but from the user's perspective they are exactly the same.
>>>>>
>>>>> And if it is a default behavior, how can I fix it for my case? The same 
>>>>> question for
>>>>> similar switch http -> https etc.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Semyon.
>>>>
>>>>
>>  
>>
>  
>
