[jira] [Commented] (NUTCH-1686) Optimize UpdateDb to load less field from Store

2014-01-02 Thread Tien Nguyen Manh (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861142#comment-13861142
 ] 

Tien Nguyen Manh commented on NUTCH-1686:
-

In this patch I also fixed a bug with fetchTime. Currently, each time we run 
updatedb over the whole table, fetchTime is increased again for all URLs.

 Optimize UpdateDb to load less field from Store
 ---

 Key: NUTCH-1686
 URL: https://issues.apache.org/jira/browse/NUTCH-1686
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.3
Reporter: Tien Nguyen Manh
 Fix For: 2.3

 Attachments: NUTCH-1686.patch


 While running a large crawl I found that updatedb runs very slowly, especially
 the Map task, which loads data from the store.
 We can't filter by batchId to load fewer URLs because of the bug in
 NUTCH-1679, so we must always update the whole table.
 After checking the fields loaded in UpdateDbJob I found that it loads many
 fields from the store (at least 15 of 25 fields), which makes updatedb slow.
 I think UpdateDbJob only needs to load a few fields: SCORE, OUTLINKS and
 METADATA, which are used to compute the link score and distance. I think
 that is the main purpose of this job.
 The other fields are used to compute the URL schedule for the parser and
 fetcher. We can move that code to Parser or Fetcher without loading many new
 fields, because many of the fields are generated by the parser. We can also
 use a Gora filter for Fetcher or Parser, so loading new fields is not a
 problem.
 I also added a new field SCOREMETA to WebPage to store CASH and DISTANCE,
 which are currently stored in METADATA. CASH is used in OPICScoring, which
 is used only in UpdateDb, and DISTANCE is used only in Generator and Updater,
 so moving both to the new metadata field avoids reading METADATA in
 Generator and Updater. METADATA contains a lot of data that is used only by
 Parser and Indexer.
 So with the new change:
 UpdateDb only loads SCORE, SCOREMETA (CASH, DISTANCE), OUTLINKS and MARKERS;
 we don't need to load the big Fetch family and INLINKS.
 Generator only loads SCOREMETA (which is smaller than the current METADATA).
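
 The reduced field sets listed above can be sketched in plain Java. This is
 only an illustration under stated assumptions: the Field enum below is a
 hypothetical stand-in that models just the fields named in this description,
 not the real WebPage field list, and the class and method names are invented
 for the example. It is not the code from the attached patch.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

public class UpdateDbFieldSketch {

    // Hypothetical stand-in for the WebPage field list; only the fields
    // named in the issue description are modelled here.
    enum Field { SCORE, SCOREMETA, OUTLINKS, INLINKS, MARKERS, METADATA }

    // Fields the description proposes UpdateDb should load after the change.
    static Set<Field> updateDbFields() {
        return Collections.unmodifiableSet(
            EnumSet.of(Field.SCORE, Field.SCOREMETA, Field.OUTLINKS, Field.MARKERS));
    }

    // Score-related data the Generator would read instead of METADATA.
    static Set<Field> generatorScoreFields() {
        return Collections.unmodifiableSet(EnumSet.of(Field.SCOREMETA));
    }

    public static void main(String[] args) {
        System.out.println("UpdateDb loads: " + updateDbFields());
        System.out.println("Generator score data: " + generatorScoreFields());
    }
}
```

 Keeping each job's field set in one place makes it easy to see that neither
 job asks the store for INLINKS or the full METADATA map.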



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1686) Optimize UpdateDb to load less field from Store

2013-12-22 Thread Nguyen Manh Tien (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855385#comment-13855385
 ] 

Nguyen Manh Tien commented on NUTCH-1686:
-

There is no backwards compatibility, because I moved CASH and DISTANCE from 
METADATA to the new SCOREMETA field.
We will need to write a job to copy that data from METADATA to the new field.
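
The per-row step of such a copy job might look like the sketch below. It is a
minimal illustration, not the actual migration: the two maps stand in for a
page's METADATA and SCOREMETA fields, and the key names "cash" and "distance"
are placeholders, since the real keys used by Nutch are not given in this
thread.

```java
import java.util.HashMap;
import java.util.Map;

public class ScoreMetaMigrationSketch {

    // Placeholder key names; the real metadata keys may differ.
    static final String[] SCORE_KEYS = { "cash", "distance" };

    // Copies the score-related entries from metadata into scoreMeta.
    // Existing scoreMeta entries are left alone, so the step can be re-run
    // safely. Returns the number of entries copied.
    static <V> int copyScoreKeys(Map<String, V> metadata, Map<String, V> scoreMeta) {
        int copied = 0;
        for (String key : SCORE_KEYS) {
            V value = metadata.get(key);
            if (value != null && !scoreMeta.containsKey(key)) {
                scoreMeta.put(key, value);
                copied++;
            }
        }
        return copied;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("cash", "0.5");
        metadata.put("distance", "2");
        metadata.put("other", "parser data");
        Map<String, String> scoreMeta = new HashMap<>();
        int copied = copyScoreKeys(metadata, scoreMeta);
        System.out.println("copied " + copied + " entries: " + scoreMeta.keySet());
    }
}
```

The sketch copies rather than moves, so the old values remain in METADATA;
whether to delete them afterwards is a separate decision for the real job.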



