[ https://issues.apache.org/jira/browse/NUTCH-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema updated NUTCH-1196:
--------------------------------

    Attachment: NUTCH-1196.patch

Patch done. It applies db.update.max.inlinks just like Nutch trunk does. Please 
note that the property was already present in nutch-default.xml (but obviously 
it was not doing anything yet).
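
To make that concrete, here is a minimal sketch of how the property could be 
read and enforced in the update reducer. The property name and the 10000 
default come from nutch-default.xml and Nutch trunk; the class and method 
names below are only illustrative, not the actual patch code.

import org.apache.hadoop.conf.Configuration;

public class InlinkLimitSketch {

  // Property name as defined in nutch-default.xml.
  public static final String MAX_INLINKS_KEY = "db.update.max.inlinks";

  private int maxInlinks;

  public void setConf(Configuration conf) {
    // Nutch trunk uses a default limit of 10000 inlinks per page.
    maxInlinks = conf.getInt(MAX_INLINKS_KEY, 10000);
  }

  // Returns true while more inlinks may still be collected for a page.
  public boolean accept(int inlinkCount) {
    return inlinkCount < maxInlinks;
  }
}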

The patch adds a class named UrlWithScore, along with an extensive test class. 
This class makes it easy to integrate this way of sorting into Nutch trunk too 
(feel free to do so if anyone is up for it). An advantage of implementing it 
with Hadoop sorting is that all keys enter the reducer in order. This opens the 
door to a follow-up improvement: removing the upper limit altogether by using 
Iterator interfaces throughout, thereby eliminating memory-consuming 
collections (the bottleneck). That requires changes to the ScoringFilter 
interface (change lists to iterators) and possibly the Gora classes (allow 
partial puts to avoid big local buffers), but once that is done we can remove 
the limit completely. After all, we don't like limits when talking MapReduce, 
right? :) Anyway, this is just a suggestion for follow-up improvements; this 
issue is simply about adding the inlinks limit.
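
For illustration, a rough sketch of what such a {url,score} composite key 
could look like. The actual UrlWithScore class in the patch may differ; this 
only shows the idea of sorting by url first and by descending score second, 
so the highest-scoring inlinks reach the reducer first.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class UrlWithScoreSketch implements WritableComparable<UrlWithScoreSketch> {

  private Text url = new Text();
  private FloatWritable score = new FloatWritable();

  public UrlWithScoreSketch() {}

  public UrlWithScoreSketch(String url, float score) {
    this.url.set(url);
    this.score.set(score);
  }

  public Text getUrl() { return url; }

  @Override
  public void write(DataOutput out) throws IOException {
    url.write(out);
    score.write(out);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    url.readFields(in);
    score.readFields(in);
  }

  @Override
  public int compareTo(UrlWithScoreSketch other) {
    int cmp = url.compareTo(other.url);
    if (cmp != 0) {
      return cmp;
    }
    // Descending score, so the best inlinks come first within a url group.
    return -score.compareTo(other.score);
  }
}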

Feedback is much appreciated.
                
> Update job should impose an upper limit on the number of inlinks (nutchgora)
> ----------------------------------------------------------------------------
>
>                 Key: NUTCH-1196
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1196
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Ferdy Galema
>             Fix For: nutchgora
>
>         Attachments: NUTCH-1196.patch
>
>
> Currently the nutchgora branch does not limit the number of inlinks in the 
> update job. This results in some nasty out-of-memory exceptions and timeouts 
> when the crawl gets big. Nutch trunk already has a default limit of 10,000 
> inlinks. I will implement this in nutchgora too. Nutch trunk uses a sorting 
> mechanism in the reducer itself, but I will implement it using standard 
> Hadoop components instead (should be a bit faster). This means:
> The keys of the reducer will be a {url,score} tuple.
> *Partitioning* will be done by {url}.
> *Sorting* will be done by {url,score}.
> Finally *grouping* will be done by {url} again.
> This ensures all identical URLs end up in the same reducer call, ordered by 
> score.
> The patch should be ready by tomorrow. Please let me know if you have any 
> comments or suggestions.
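
The partition/sort/group scheme described above could be wired into a Hadoop 
job roughly as follows. This is only a sketch reusing the UrlWithScoreSketch 
class from earlier; everything apart from the standard Hadoop APIs is 
illustrative rather than actual patch code.

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;

public class UrlScoreJobWiringSketch {

  // Partition on the url part only, so all entries for a url hit one reducer.
  public static class UrlPartitioner
      extends Partitioner<UrlWithScoreSketch, Object> {
    @Override
    public int getPartition(UrlWithScoreSketch key, Object value, int numPartitions) {
      return (key.getUrl().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  // Group on the url part only, so one reduce() call sees all scores for a url.
  public static class UrlGroupingComparator extends WritableComparator {
    protected UrlGroupingComparator() {
      super(UrlWithScoreSketch.class, true);
    }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
      return ((UrlWithScoreSketch) a).getUrl()
          .compareTo(((UrlWithScoreSketch) b).getUrl());
    }
  }

  public static void configure(Job job) {
    // No explicit sort comparator is needed: the key's own compareTo()
    // already sorts by url ascending, then score descending.
    job.setPartitionerClass(UrlPartitioner.class);
    job.setGroupingComparatorClass(UrlGroupingComparator.class);
  }
}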

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
