[ 
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647467#comment-14647467
 ] 

Julien Nioche commented on NUTCH-2069:
--------------------------------------

Hi [~wastl-nagel] and [~markus17]. BTW, I did not mean to be curt in my previous 
message, but I was typing from my phone ;-)
I know the difficulties of enforcing the code formatting systematically, but I 
thought I might as well fix it while I was working on that part of the code. 
Feel free to remove the bits from the patch that are about the formatting only.

bq. we could define this as two properties `db.ignore.external.links` + 
`db.ignore.external.links.mode`. The latter can be "host" or "domain", similar 
to other properties (partition.url.mode, generator.count.mode, 
fetcher.queue.mode). That would be extensible and can make the code leaner.

Yes, that would be more elegant.
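
For illustration, here is a minimal sketch of how the two-property scheme quoted above might behave. It is not the attached patch or actual Nutch code: the class `ExternalLinkFilter`, its `accept` method and the plain `Map` standing in for the configuration are hypothetical, and the "domain" is approximated naively as the last two host labels (a real implementation would need a public-suffix-aware lookup).

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: decides whether an outlink is kept, given
// db.ignore.external.links and db.ignore.external.links.mode.
class ExternalLinkFilter {
    enum Mode { HOST, DOMAIN }

    private final boolean ignoreExternal;
    private final Mode mode;

    // A plain Map stands in for the real configuration object (assumption).
    ExternalLinkFilter(Map<String, String> conf) {
        this.ignoreExternal = Boolean.parseBoolean(
            conf.getOrDefault("db.ignore.external.links", "false"));
        String m = conf.getOrDefault("db.ignore.external.links.mode", "host");
        // Anything other than "domain" falls back to host mode.
        this.mode = "domain".equalsIgnoreCase(m.trim()) ? Mode.DOMAIN : Mode.HOST;
    }

    // Returns the comparison key for a URL: the full host, or a naive
    // "domain" made of the last two host labels (assumption, not PSL-aware).
    static String key(String url, Mode mode) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return "";
        }
        host = host.toLowerCase(Locale.ROOT);
        if (mode == Mode.HOST) {
            return host;
        }
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }

    // True if the outlink should be kept.
    boolean accept(String fromUrl, String toUrl) {
        if (!ignoreExternal) {
            return true;
        }
        return key(fromUrl, mode).equals(key(toUrl, mode));
    }
}
```

With the mode set to "host", a link from www.example.com to blog.example.com would be dropped; with "domain" it would be kept, since both share example.com under the naive rule above.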

I am on vacation for the next few weeks as of today, and will update the code 
based on your suggestion when I am back, unless one of you beats me to it of 
course.

J.  



> Ignore external links based on domain
> -------------------------------------
>
>                 Key: NUTCH-2069
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2069
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher, parser
>    Affects Versions: 1.10
>            Reporter: Julien Nioche
>             Fix For: 1.11
>
>         Attachments: NUTCH-2069.patch
>
>
> We currently have `db.ignore.external.links` which is a nice way of 
> restricting the crawl based on the hostname. This adds a new parameter 
> `db.ignore.external.links.domain` to do the same based on the domain.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
