[ 
https://issues.apache.org/jira/browse/NUTCH-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16283775#comment-16283775
 ] 

Semyon Semyonov commented on NUTCH-2455:
----------------------------------------

[~wastl-nagel] [~markus17]
Please have a look.

Could you please review two more issues together with this one? They are 
closely related.
https://issues.apache.org/jira/browse/NUTCH-2454
and
https://issues.apache.org/jira/browse/NUTCH-2461

Repeating from the commit:
Three questions/modifications are left open:
1) In several places the Nutch code uses url.getHost(), in others it uses 
url.getHost().toLowerCase(). Why?
2) public static class ScoreHostKeyComparator extends WritableComparator should 
implement a raw comparator. If you know how to do it, you are welcome to.
3) The whole Generator file is too big; it should be split into several files. 
Again, if you know a good way to do it, you are welcome to.
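
To illustrate why question 1 matters, here is a minimal sketch in plain Java 
(no Nutch classes). It shows that java.net.URL reports the host in whatever 
case the URL was written, so two spellings of the same host only compare equal 
after lowercasing. The class and method names (HostCaseDemo, rawHost, 
normalizedHost) are made up for this example and are not part of the patch.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.util.Locale;

class HostCaseDemo {
    /** Returns the host exactly as java.net.URL reports it (case is preserved). */
    static String rawHost(String url) {
        try {
            return URI.create(url).toURL().getHost();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Lowercases the host with a fixed locale so equal hosts give equal keys. */
    static String normalizedHost(String url) {
        return rawHost(url).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String a = "http://Example.COM/index.html";
        String b = "http://example.com/about.html";
        // Same host, different spelling: the raw values differ.
        System.out.println(rawHost(a).equals(rawHost(b)));
        // After lowercasing, both map to the same key.
        System.out.println(normalizedHost(a).equals(normalizedHost(b)));
    }
}
```

If hosts are used as map or sort keys anywhere in the Generator, mixing the two 
forms could split one host into two groups, unless the URLs are already 
normalized before they reach that code.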



> Speed up the merging of HostDb entries for variable fetch delay
> ---------------------------------------------------------------
>
>                 Key: NUTCH-2455
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2455
>             Project: Nutch
>          Issue Type: Improvement
>          Components: generator
>    Affects Versions: 1.13
>            Reporter: Markus Jelsma
>         Attachments: NUTCH-2455.patch
>
>
> Citing Sebastian at NUTCH-2420:
> ??The correct solution would be to use <host,score> pairs as keys in the 
> Selector job, with a partitioner and secondary sorting so that all keys with 
> same host end up in the same call of the reducer. If values can also hold a 
> HostDb entry and the sort comparator guarantees that the HostDb entry 
> (entries if partitioned by domain or IP) comes in front of all CrawlDb 
> entries. But that would be a substantial improvement...??
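
The ordering described in the quote can be sketched in plain Java, without any 
Hadoop or Nutch classes. This is only an illustration under assumptions: the 
Key class, its fields, and the descending score order are hypothetical and are 
not taken from the attached patch. The sketch partitions on host only and sorts 
so that, within a host, the HostDb entry comes before all CrawlDb entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SecondarySortSketch {
    /** Hypothetical composite key: host plus score, flagged if it carries a HostDb entry. */
    static final class Key {
        final String host;
        final float score;
        final boolean hostDbEntry;

        Key(String host, float score, boolean hostDbEntry) {
            this.host = host;
            this.score = score;
            this.hostDbEntry = hostDbEntry;
        }

        @Override
        public String toString() {
            return host + (hostDbEntry ? " [HostDb]" : " score=" + score);
        }
    }

    /** Partition on the host only, so all keys of one host reach the same reducer. */
    static int partition(Key key, int numReducers) {
        return (key.host.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Sort order: by host, then HostDb entry first, then higher scores first (assumed). */
    static final Comparator<Key> SORT = (a, b) -> {
        int c = a.host.compareTo(b.host);
        if (c != 0) return c;
        c = Boolean.compare(b.hostDbEntry, a.hostDbEntry); // true sorts before false
        if (c != 0) return c;
        return Float.compare(b.score, a.score);
    };

    /** Grouping order: keys with the same host belong to the same reduce call. */
    static final Comparator<Key> GROUP = (a, b) -> a.host.compareTo(b.host);

    /** Returns a sorted copy, standing in for the shuffle/sort phase. */
    static List<Key> sorted(List<Key> keys) {
        List<Key> copy = new ArrayList<>(keys);
        copy.sort(SORT);
        return copy;
    }

    public static void main(String[] args) {
        List<Key> keys = Arrays.asList(
            new Key("b.org", 0.5f, false),
            new Key("a.com", 0.2f, false),
            new Key("a.com", 0.0f, true),
            new Key("a.com", 0.9f, false),
            new Key("b.org", 0.0f, true));
        for (Key k : sorted(keys)) {
            System.out.println(k + " -> reducer " + partition(k, 4));
        }
    }
}
```

In a real job the three pieces would correspond to a Partitioner, a sort 
comparator, and a grouping comparator. The reducer could then read the HostDb 
entry from the first value of each group instead of merging it in separately.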



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
