[ 
https://issues.apache.org/jira/browse/NUTCH-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144226#comment-13144226
 ] 

Andrzej Bialecki  commented on NUTCH-1196:
------------------------------------------

Very nicely done and useful patch! A few cosmetic comments:

* a common pattern in Hadoop is to reuse object instances as much as possible, 
so any places where you use the "new" operator should be reviewed (e.g. new 
NutchWritable(...)).
* in UrlScoreComparator.compare(o1, o2) you can just use unary minus instead of 
multiplication by -1.
* in DbUpdateMapper you can assign a score of Float.MAX_VALUE to the web page 
record. That way DbUpdateReducer.reduce won't have to iterate over all 
entries: the web page record will always come first, and we can save some 
time by skipping the remaining entries once the limit is reached (unless you 
really want to tally the number of skipped inlinks).
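
To illustrate the last two points, here is a minimal plain-Java sketch with no 
Hadoop dependency. The names InlinkReviewSketch, ScoredEntry and 
keepTopInlinks are hypothetical and are not taken from the patch; the sketch 
only assumes that entries for one url arrive sorted by descending score.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not the classes from NUTCH-1196.patch.
final class InlinkReviewSketch {

    /** Stand-in for a {url, score} key plus the value it carries. */
    static final class ScoredEntry {
        final String url;
        final float score;
        final String payload;

        ScoredEntry(String url, float score, String payload) {
            this.url = url;
            this.score = score;
            this.payload = payload;
        }
    }

    /** Orders by url ascending, then by score descending. */
    static final Comparator<ScoredEntry> URL_THEN_SCORE_DESC =
            new Comparator<ScoredEntry>() {
        @Override
        public int compare(ScoredEntry o1, ScoredEntry o2) {
            int byUrl = o1.url.compareTo(o2.url);
            if (byUrl != 0) {
                return byUrl;
            }
            // Unary minus flips the order; no multiplication by -1 needed.
            return -Float.compare(o1.score, o2.score);
        }
    };

    /**
     * Walks one sorted group. The page record carries Float.MAX_VALUE, so it
     * is always first; after `limit` inlinks the loop stops early.
     */
    static List<String> keepTopInlinks(List<ScoredEntry> sortedGroup, int limit) {
        List<String> kept = new ArrayList<>();
        for (ScoredEntry e : sortedGroup) {
            if (e.score == Float.MAX_VALUE) {
                continue; // the web page record itself, not an inlink
            }
            if (kept.size() >= limit) {
                break; // remaining entries have lower scores; skip them
            }
            kept.add(e.payload);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<ScoredEntry> group = new ArrayList<>();
        group.add(new ScoredEntry("http://a/", 0.2f, "in1"));
        group.add(new ScoredEntry("http://a/", Float.MAX_VALUE, "page"));
        group.add(new ScoredEntry("http://a/", 0.9f, "in2"));
        group.add(new ScoredEntry("http://a/", 0.5f, "in3"));
        Collections.sort(group, URL_THEN_SCORE_DESC);
        System.out.println(keepTopInlinks(group, 2)); // prints [in2, in3]
    }
}
```

The early break is what saves the time; if the number of skipped inlinks 
should be tallied, the loop has to run to the end instead.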

Overall the patch looks good, +1 for committing.
                
> Update job should impose an upper limit on the number of inlinks (nutchgora)
> ----------------------------------------------------------------------------
>
>                 Key: NUTCH-1196
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1196
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Ferdy Galema
>             Fix For: nutchgora
>
>         Attachments: NUTCH-1196.patch
>
>
> Currently the nutchgora branch does not limit the number of inlinks in the 
> update job. This will result in some nasty out-of-memory exceptions and 
> timeouts when the crawl is getting big. Nutch trunk already has a default 
> limit of 10,000 inlinks. I will implement this in nutchgora too. Nutch trunk 
> uses a sorting mechanism in the reducer itself, but I will implement it using 
> standard Hadoop components instead (should be a bit faster). This means:
> The keys of the reducer will be a {url,score} tuple.
> *Partitioning* will be done by {url}.
> *Sorting* will be done by {url,score}.
> Finally *grouping* will be done by {url} again.
> This ensures all identical urls will be put in the same reducer, but in 
> order of scoring.
> Patch should be ready by tomorrow. Please let me know if you have any 
> comments or suggestions.
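
The partition/sort/group scheme quoted above can be simulated in plain Java 
without Hadoop. This is only a sketch under assumptions: SecondarySortSketch, 
Key and runReducer are hypothetical names, scores are assumed to sort in 
descending order, and the hash-modulo partitioning is one common choice 
rather than something the issue specifies.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of the scheme; not code from the patch.
final class SecondarySortSketch {

    /** The composite {url, score} reducer key. */
    static final class Key {
        final String url;
        final float score;

        Key(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    /** Partitioning uses the url only, so one url never spans reducers. */
    static int partition(Key key, int numReducers) {
        return (key.url.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Sorting uses the full {url, score} key: url ascending, score descending. */
    static final Comparator<Key> SORT = new Comparator<Key>() {
        @Override
        public int compare(Key a, Key b) {
            int byUrl = a.url.compareTo(b.url);
            return byUrl != 0 ? byUrl : -Float.compare(a.score, b.score);
        }
    };

    /**
     * Grouping uses the url only. The input is already sorted, so equal urls
     * are adjacent and each list keeps the descending score order.
     */
    static Map<String, List<Float>> group(List<Key> sortedKeys) {
        Map<String, List<Float>> groups = new LinkedHashMap<>();
        for (Key k : sortedKeys) {
            List<Float> scores = groups.get(k.url);
            if (scores == null) {
                scores = new ArrayList<>();
                groups.put(k.url, scores);
            }
            scores.add(k.score);
        }
        return groups;
    }

    /** Returns what a single reducer would see: one score list per url. */
    static Map<String, List<Float>> runReducer(List<Key> all, int reducer,
                                               int numReducers) {
        List<Key> mine = new ArrayList<>();
        for (Key k : all) {
            if (partition(k, numReducers) == reducer) {
                mine.add(k);
            }
        }
        Collections.sort(mine, SORT);
        return group(mine);
    }
}
```

In an actual Hadoop job these three roles would typically be filled by a 
partitioner, a sort comparator and a grouping comparator configured on the 
job; the sketch only shows why each one looks at the fields it does.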


        
