[jira] Commented: (NUTCH-530) Add a combiner to improve performance on updatedb

2008-03-16 Thread Emmanuel Joke (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579285#action_12579285
 ] 

Emmanuel Joke commented on NUTCH-530:
-

OK

 Add a combiner to improve performance on updatedb
 -

 Key: NUTCH-530
 URL: https://issues.apache.org/jira/browse/NUTCH-530
 Project: Nutch
  Issue Type: Improvement
 Environment: java 1.6
Reporter: Emmanuel Joke
Assignee: Emmanuel Joke
 Fix For: 1.0.0

 Attachments: NUTCH-530.patch


 When we update the crawldb from a fetched segment, the map task outputs many 
 similar links with status linked.
 We can use a combiner to improve performance.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-530) Add a combiner to improve performance on updatedb

2007-09-06 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525475
 ] 

Andrzej Bialecki commented on NUTCH-530:
-

I'm still against this patch, exactly because we are not sure how many times 
the ScoringFilters will be executed - it may be once, twice or N times. The 
current contract for ScoringFilters is that they are executed once.

CrawlDbReducer itself does not reduce all inlinked datums to a single 
CrawlDatum; it's up to the scoring filters to do whatever they want with all 
inlinks. It's true that scoring-opic performs an equivalent operation, but 
this may not always be the case.

Second, consider the following scenario (BTW, this is close to one of the 
ScoringFilters that I actually implemented, so it's not far-fetched): say I 
implemented a ScoringFilter that checks for the existence of a flag in 
CrawlDatum (presumably put there by Generator) and, based on the value of this 
flag, counts the score from inlinks differently. Then it clears the flag to 
mark a successful update. If we ran an updatedb that includes your patch, this 
operation would work OK in the first spill from the Combiner (although with 
vastly incomplete information), and then it would fail to do the right thing 
on subsequent runs through the Combiner or Reducer, because the flag would 
already have been reset.
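
The scenario above can be sketched in a few lines. This is a simplified
model, not Nutch code: the Datum class, the updateScore method and the
"count double while flagged" rule are all hypothetical stand-ins for a
flag-based ScoringFilter. It only shows that a step which clears its own
flag gives a different result when it is called once per batch of inlinks
instead of once with all of them.

```java
import java.util.Arrays;
import java.util.List;

public class FlagFilterDemo {
    /** Simplified stand-in for the stored CrawlDatum: a score and a flag. */
    static final class Datum {
        float score;
        boolean flag;

        Datum(float score, boolean flag) {
            this.score = score;
            this.flag = flag;
        }
    }

    /**
     * Hypothetical scoring step: inlink scores count double while the flag
     * is set, and the flag is cleared afterwards to mark the update.
     */
    static void updateScore(Datum old, List<Float> inlinkScores) {
        float weight = old.flag ? 2f : 1f;
        float sum = 0f;
        for (float s : inlinkScores) {
            sum += s;
        }
        old.score += weight * sum;
        old.flag = false;
    }

    /** Contract case: the step sees all inlinks in a single call. */
    static float scoreInOnePass(List<Float> inlinks) {
        Datum d = new Datum(0f, true);
        updateScore(d, inlinks);
        return d.score;
    }

    /** Combiner-like case: the step is called once per batch of inlinks. */
    static float scoreInBatches(List<List<Float>> batches) {
        Datum d = new Datum(0f, true);
        for (List<Float> batch : batches) {
            updateScore(d, batch);
        }
        return d.score;
    }

    public static void main(String[] args) {
        List<Float> all = Arrays.asList(1f, 2f, 3f);
        List<List<Float>> split =
            Arrays.asList(Arrays.asList(1f, 2f), Arrays.asList(3f));
        System.out.println("one pass: " + scoreInOnePass(all));
        System.out.println("batched:  " + scoreInBatches(split));
    }
}
```

With the same three inlinks, the two methods return different scores,
because only the first batch is processed while the flag is still set.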




[jira] Commented: (NUTCH-530) Add a combiner to improve performance on updatedb

2007-07-31 Thread Doğacan Güney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516621
 ] 

Doğacan Güney commented on NUTCH-530:
-

Yeah, you are right.

+1 from me.




[jira] Commented: (NUTCH-530) Add a combiner to improve performance on updatedb

2007-07-31 Thread Emmanuel Joke (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516675
 ] 

Emmanuel Joke commented on NUTCH-530:
-

Actually I don't re-use CrawlDbReducer; I've defined a new class as the 
Combiner. This class only aggregates the scores of all CrawlDatums with the 
status Linked into one CrawlDatum. It's just a part of what CrawlDbReducer 
does. I've run a few tests in different cases and it has no impact on the 
current score.
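
A minimal sketch of the aggregation described above, assuming a simplified
model rather than the actual patch: the Datum class and the status constants
are hypothetical stand-ins for CrawlDatum, and combine() plays the role of
the combiner for the values of a single URL. Linked datums are folded into
one datum with the summed score; everything else passes through untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LinkedScoreCombiner {
    // Hypothetical status codes; the real CrawlDatum constants differ.
    static final int STATUS_FETCHED = 0;
    static final int STATUS_LINKED = 1;

    /** Simplified stand-in for CrawlDatum: a status plus a score. */
    static final class Datum {
        final int status;
        final float score;

        Datum(int status, float score) {
            this.status = status;
            this.score = score;
        }
    }

    /**
     * Combines the values emitted for one URL: every linked datum is folded
     * into a single linked datum carrying the summed score, and all other
     * datums pass through unchanged.
     */
    static List<Datum> combine(List<Datum> values) {
        List<Datum> out = new ArrayList<>();
        float linkedScore = 0f;
        boolean sawLinked = false;
        for (Datum d : values) {
            if (d.status == STATUS_LINKED) {
                linkedScore += d.score;
                sawLinked = true;
            } else {
                out.add(d);
            }
        }
        if (sawLinked) {
            out.add(new Datum(STATUS_LINKED, linkedScore));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Datum> mapOutput = Arrays.asList(
            new Datum(STATUS_LINKED, 0.25f),
            new Datum(STATUS_FETCHED, 1.0f),
            new Datum(STATUS_LINKED, 0.5f),
            new Datum(STATUS_LINKED, 0.25f));
        for (Datum d : combine(mapOutput)) {
            System.out.println(d.status + " " + d.score);
        }
    }
}
```

Because addition is associative, a reducer that also sums linked scores
gets the same total whether or not this step ran first. That holds only for
a plain sum, which is the case the tests above cover.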




[jira] Commented: (NUTCH-530) Add a combiner to improve performance on updatedb

2007-07-30 Thread Doğacan Güney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516357
 ] 

Doğacan Güney commented on NUTCH-530:
-

Ehm, I am not sure about this... After this, we call updateDbScore twice, 
right? Once to 'merge' the linked datums together, and once to pass the big 
merged linked datum to the old datum. This changes ScoringFilter's semantics 
and may not work for ScoringFilters if one is, say, using the number of 
outlinks as a factor in scoring.
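
To illustrate the concern, here is a small sketch under stated assumptions:
the meanScore rule is a hypothetical scoring step that depends on how many
linked datums it receives, and preMerge stands in for a summing combiner.
Neither is taken from Nutch or from the patch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class LinkCountDemo {
    /**
     * Hypothetical scoring rule: the average of the linked scores, so the
     * result depends on how many datums arrive, not only on their total.
     */
    static float meanScore(List<Float> linkedScores) {
        if (linkedScores.isEmpty()) {
            return 0f;
        }
        float sum = 0f;
        for (float s : linkedScores) {
            sum += s;
        }
        return sum / linkedScores.size();
    }

    /** What a summing combiner hands on: one value holding the total. */
    static List<Float> preMerge(List<Float> linkedScores) {
        float sum = 0f;
        for (float s : linkedScores) {
            sum += s;
        }
        return Collections.singletonList(sum);
    }

    public static void main(String[] args) {
        List<Float> linked = Arrays.asList(1f, 2f, 3f);
        System.out.println("direct:     " + meanScore(linked));
        System.out.println("pre-merged: " + meanScore(preMerge(linked)));
    }
}
```

Once the linked datums are merged, the count is lost, so any rule that uses
it sees a single datum and produces a different score.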




[jira] Commented: (NUTCH-530) Add a combiner to improve performance on updatedb

2007-07-30 Thread Emmanuel Joke (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516602
 ] 

Emmanuel Joke commented on NUTCH-530:
-

I'm not sure I follow your point regarding the number of outlinks. 

I don't think it's relevant to take the number of inlinks into account. A URL 
can have inlinks from different segments. If we really wanted to do that, we 
would have to update the db using all segments in one update. So far, updatedb 
is run on only a single segment. 

