Hi Sebastian,

If I
- modify the method URLUtil.getDomainName(URL url)

doesn't that mean I don't need to
- set db.ignore.external.links.mode=byDomain

anymore? http://www.somewebsite.com then becomes the same host as somewebsite.com.


To make it as generic as possible I can create an issue/pull request for this,
but I would like to hear your suggestion about the best way to do so.
1) Do we already have a config setting that we can use?
2) The domain discussion [1] is quite broad, though. In my case I cover only one
issue: the mapping www -> _ . It looks more like a same-Host problem than a
same-Domain problem. What do you think about such host resolution? (See the
small example below.)
3) Where should this problem be solved? Only in ParseOutputFormat.java or
somewhere else as well?
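
To make the Host vs Domain distinction in question 2 concrete, here is a small
example. It uses Guava's InternetDomainName from [1], so the guava dependency is
assumed, and somewebsite.com is of course just a placeholder:

import com.google.common.net.InternetDomainName;
import java.net.URL;

public class HostVsDomainExample {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://www.art.somewebsite.com/paintings");

    // Host as seen by java.net.URL - the subdomain is still there.
    System.out.println(url.getHost());            // www.art.somewebsite.com

    // Host with only a leading "www." stripped - what I propose.
    String host = url.getHost();
    System.out.println(host.startsWith("www.")
        ? host.substring(4) : host);              // art.somewebsite.com

    // Registered domain according to the public suffix list (Guava, see [1]).
    System.out.println(InternetDomainName.from(url.getHost())
        .topPrivateDomain());                     // somewebsite.com
  }
}

The first two views keep art.somewebsite.com separate from somewebsite.com
(which is what I need), while the domain view collapses them.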

Semyon.


 

Sent: Wednesday, February 21, 2018 at 11:51 AM
From: "Sebastian Nagel" <wastl.na...@googlemail.com>
To: user@nutch.apache.org
Subject: Re: Internal links appear to be external in Parse. Improvement of the 
crawling quality
Hi Semyon,

> interpret www.somewebsite.com and
> somewebsite.com as one host?

Yes, that's a common problem, mostly caused by links written as absolute URLs
(which must include the host name) - well-designed sites would use relative
links for internal same-host links.

For a quick work-around:
- set db.ignore.external.links.mode=byDomain
- modify the method URLUtil.getDomainName(URL url)
so that it returns the hostname with www. stripped (see the sketch below)
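
As a rough sketch of what that modification boils down to (this is not the
actual URLUtil code, just the www-stripping idea shown in isolation):

import java.net.MalformedURLException;
import java.net.URL;

// Standalone illustration; the real change would go into URLUtil.getDomainName(URL url).
public class WwwStripSketch {

  // Return the host name of the URL with a leading "www." removed.
  static String stripWww(URL url) {
    String host = url.getHost().toLowerCase();
    return host.startsWith("www.") ? host.substring(4) : host;
  }

  public static void main(String[] args) throws MalformedURLException {
    // Both print "somewebsite.com", so the two URLs compare as the same value.
    System.out.println(stripWww(new URL("http://www.somewebsite.com/games")));
    System.out.println(stripWww(new URL("http://somewebsite.com/")));
  }
}

With db.ignore.external.links.mode=byDomain it is the values returned by
getDomainName() that get compared, so after the change www.somewebsite.com and
somewebsite.com are no longer external to each other.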

For a final solution we could make it configurable
which method or class is called. Since the definition of "domain"
is somewhat debatable [1], we could even provide alternative
implementations.
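
A possible shape for that (only a sketch - the property name
db.domain.resolver.class and the DomainResolver interface below are invented
for illustration and do not exist in Nutch):

import java.net.URL;

// Hypothetical plug-in point: the "domain" definition is chosen by configuration
// instead of being hard-coded. All names here are made up for this sketch.
interface DomainResolver {
  String resolve(URL url);
}

class StripWwwResolver implements DomainResolver {
  public String resolve(URL url) {
    String host = url.getHost().toLowerCase();
    return host.startsWith("www.") ? host.substring(4) : host;
  }
}

public class ConfigurableResolverSketch {
  public static void main(String[] args) throws Exception {
    // In Nutch this value would come from the Hadoop Configuration object;
    // "db.domain.resolver.class" is an invented property name.
    String clazz = System.getProperty("db.domain.resolver.class",
                                      StripWwwResolver.class.getName());
    DomainResolver resolver =
        (DomainResolver) Class.forName(clazz).getDeclaredConstructor().newInstance();
    System.out.println(resolver.resolve(new URL("http://www.somewebsite.com/games")));
  }
}

Alternative implementations (e.g. one based on the public suffix list) would
then just be additional classes selectable via that property.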

> PS. For me it is not really clear how ProtocolResolver works.

It's only a heuristic to avoid duplicates by protocol (http and https).
If you care about duplicates and cannot get rid of them afterwards with a
deduplication job, you may have a look at urlnormalizer-protocol and NUTCH-2447.
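
As a toy illustration of the idea behind such protocol normalization (this is
not the actual urlnormalizer-protocol plugin or the NUTCH-2447 patch): for
hosts known to serve the same content over https, rewrite the http form so
both variants collapse into one URL.

import java.util.Set;

// Toy protocol normalizer: for hosts assumed to prefer https, rewrite http:// URLs
// so the http and https variants become one record instead of two duplicates.
public class ProtocolNormalizeSketch {

  // Hosts we assume should always be fetched via https (example value only).
  private static final Set<String> PREFER_HTTPS = Set.of("www.somewebsite.com");

  static String normalize(String url) {
    for (String host : PREFER_HTTPS) {
      if (url.startsWith("http://" + host + "/") || url.equals("http://" + host)) {
        return "https://" + url.substring("http://".length());
      }
    }
    return url;
  }

  public static void main(String[] args) {
    // Both variants end up as the https form, so they no longer create duplicates.
    System.out.println(normalize("http://www.somewebsite.com/games"));
    System.out.println(normalize("https://www.somewebsite.com/games"));
  }
}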

Best,
Sebastian


[1] 
https://github.com/google/guava/wiki/InternetDomainNameExplained

On 02/21/2018 10:44 AM, Semyon Semyonov wrote:
> Thanks Yossi, Markus,
>
> I have an issue with the db.ignore.external.links.mode=byDomain solution.
>
> I crawl specific hosts only, therefore I have a finite number of hosts to
> crawl.
> Let's say, www.somewebsite.com
>
> I want to stay limited to this host. In other words, I want neither
> www.art.somewebsite.com nor www.sport.somewebsite.com.
> That's why db.ignore.external.links.mode=byHost and db.ignore.external =
> true (no external websites).
>
> At the same time, I want to get the links that seem to belong to the same
> host (www.somewebsite.com -> somewebsite.com/games, without www).
> The question is: shouldn't we include this as a default (or configurable)
> behavior in Nutch and interpret
> www.somewebsite.com and somewebsite.com as one
> host?
>
>
>
> PS. For me it is not really clear how ProtocolResolver works.
>
> Semyon
>
>
>  
>
> Sent: Tuesday, February 20, 2018 at 9:40 PM
> From: "Markus Jelsma" <markus.jel...@openindex.io>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Subject: RE: Internal links appear to be external in Parse. Improvement of 
> the crawling quality
> Hello Semyon,
>
> Yossi is right, you can use the db.ignore.* set of directives to resolve the 
> problem.
>
> Regarding the protocol, you can use urlnormalizer-protocol to set up per-host
> rules. This is, of course, a tedious job if you operate a crawl over an
> indefinite number of hosts, so use the uncommitted ProtocolResolver to do it
> for you.
>
> See: 
> https://issues.apache.org/jira/browse/NUTCH-2247
>
> If I remember it tomorrow afternoon, I can probably schedule some time to
> work on it in the coming seven days or so, and commit.
>
> Regards,
> Markus
>
> -----Original message-----
>> From:Yossi Tamari <yossi.tam...@pipl.com>
>> Sent: Tuesday 20th February 2018 21:06
>> To: user@nutch.apache.org
>> Subject: RE: Internal links appear to be external in Parse. Improvement of 
>> the crawling quality
>>
>> Hi Semyon,
>>
>> Wouldn't setting db.ignore.external.links.mode=byDomain solve your wincs.be 
>> issue?
>> As far as I can see, the protocol (HTTP/HTTPS) does not play any part in the
>> decision whether this is the same domain.
>>
>> Yossi.
>>
>>> -----Original Message-----
>>> From: Semyon Semyonov [mailto:semyon.semyo...@mail.com]
>>> Sent: 20 February 2018 20:43
>>> To: user@nutch.apache.org <user@nutch.apache.org>
>>> Subject: Internal links appear to be external in Parse. Improvement of the
>>> crawling quality
>>>
>>> Dear All,
>>>
>>> I'm trying to increase the quality of the crawling. Part of my database has
>>> DB_FETCHED = 1.
>>>
>>> Example: http://www.wincs.be/ in the seed list.
>>>
>>> The root of the problem is in Parse/ParseOutputFormat.java, lines 364-374.
>>>
>>> Nutch considers one of the links (http://wincs.be/lakindustrie.html) as external
>>> and therefore rejects it.
>>>
>>>
>>> If I insert http://wincs.be in the seed file, everything works fine.
>>>
>>> Do you think this is good behavior? I mean, formally these are indeed two
>>> different hosts, but from the user's perspective they are exactly the same.
>>>
>>> And if it is the default behavior, how can I fix it for my case? The same
>>> question applies to similar switches such as http -> https.
>>>
>>> Thanks.
>>>
>>> Semyon.
>>
>>
 
