Re: [DISCUSS] Release Trunk
Yes, the 2.x release needs more testing and depends on Gora. For Nutch 1.x, though, I see that 29 issues [0] have been resolved since the previous release and 78 issues [1] still need to be resolved, as Julien mentioned. So we have more work to do before a release.

[0] https://issues.apache.org/jira/browse/NUTCH-1520?jql=project%20%3D%20NUTCH%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%20%221.8%22%20ORDER%20BY%20priority%20DESC
[1] https://issues.apache.org/jira/browse/NUTCH-1464?jql=fixVersion%20%3D%20%221.8%22%20AND%20project%20%3D%20NUTCH%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC

On Fri, Nov 29, 2013 at 12:34 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

Hi Folks,

Thread says it all. There are some hot tickets over in Gora right now so I think holding off the next while for a 2.x release would be wise. I can spin the RC for trunk tonight/tomorrow/weekend if we get the thumbs up.

Ta
Lewis

--
*Lewis*

--
Don't Grow Old, Grow Up... :-)
[jira] [Created] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.
Nguyen Manh Tien created NUTCH-1679:
---

Summary: UpdateDb using batchId, link may override crawled page.
Key: NUTCH-1679
URL: https://issues.apache.org/jira/browse/NUTCH-1679
Project: Nutch
Issue Type: Bug
Affects Versions: 2.3
Reporter: Nguyen Manh Tien

The problem occurs with the HBase store; I am not sure about other stores.

Suppose that in the first crawl cycle we crawl link A and get an outlink B. In the second cycle we crawl link B, which also has a link pointing to A. In the second updatedb we load only page B from the store, so A is added as a new link, because updatedb does not know that A already exists in the store, and the new page overrides A.

UpdateDb must be run without a batchId, or we must set additionsAllowed=false.

Here is the code for a new page:

    page = new WebPage();
    schedule.initializeSchedule(url, page);
    page.setStatus(CrawlStatus.STATUS_UNFETCHED);
    try {
      scoringFilters.initialScore(url, page);
    } catch (ScoringFilterException e) {
      page.setScore(0.0f);
    }

The new page will override the old page's status, score, fetchTime, fetchInterval, retries, and metadata[CASH_KEY].

- I think we can change something here so that a new page updates only one column, for example 'link'; if it really is a new page, we can initialize all of the above fields in the generator.
- Or we add a checkAndPut operator to the store, so that when adding a new page we first check whether it already exists.

--
This message was sent by Atlassian JIRA (v6.1#6144)
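The checkAndPut idea in the report above can be illustrated with a small in-memory sketch. This is not Nutch or Gora code: `PageRow`, `blindPut`, and `putIfNew` are hypothetical names, and a `ConcurrentMap` merely stands in for the backing store. It only shows the difference between an unconditional write, which replaces an already-crawled row, and a conditional write, which leaves the existing row alone.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class PutIfAbsentSketch {

    /** Hypothetical stand-in for a stored page row; not Nutch's WebPage. */
    static final class PageRow {
        final String status;
        final float score;

        PageRow(String status, float score) {
            this.status = status;
            this.score = score;
        }
    }

    /** Unconditional write: any existing row for the URL is replaced. */
    static void blindPut(ConcurrentMap<String, PageRow> store, String url, PageRow row) {
        store.put(url, row);
    }

    /**
     * Conditional write: the row is stored only if the URL has no row yet.
     * Returns true when the row was inserted, false when one already existed.
     */
    static boolean putIfNew(ConcurrentMap<String, PageRow> store, String url, PageRow row) {
        return store.putIfAbsent(url, row) == null;
    }

    public static void main(String[] args) {
        // The map plays the role of the store; A was crawled in the first cycle.
        ConcurrentMap<String, PageRow> store = new ConcurrentHashMap<>();
        store.put("A", new PageRow("FETCHED", 1.5f));

        // Second cycle: B links back to A, so A shows up as a "new" page.
        boolean inserted = putIfNew(store, "A", new PageRow("UNFETCHED", 0.0f));
        System.out.println("inserted=" + inserted + " status=" + store.get("A").status);

        // The unconditional write reproduces the reported override.
        blindPut(store, "A", new PageRow("UNFETCHED", 0.0f));
        System.out.println("status after blind put=" + store.get("A").status);
    }
}
```

Whether a real store can offer an equivalent atomic check depends on the backend; the sketch makes no claim about how Gora would expose it.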
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839240#comment-13839240 ]

Otis Gospodnetic commented on NUTCH-1556:
---

[~tiennm] it looks like you added a patch to this issue, but the issue is already marked Resolved and Fixed. [~amuseme.lu], do you want to commit this new patch before the 2.3 release?

enabling updatedb to accept batchId

Key: NUTCH-1556
URL: https://issues.apache.org/jira/browse/NUTCH-1556
Project: Nutch
Issue Type: Improvement
Affects Versions: 2.2
Reporter: kaveh minooie
Fix For: 2.3
Attachments: NUTCH-1556-batchId.patch, NUTCH-1556-v2.patch, NUTCH-1556-v3.patch, NUTCH-1556.patch

The idea here is to be able to run updatedb and fetch for different batchIds simultaneously. I put together a patch. It seems to be working (it does skip the rows that do not match the batchId), but I am worried about whether and how it might affect the sorting in the reduce part. Anyway, check it out. It also changes the command-line usage to this:

    Usage: DbUpdaterJob (batchId | -all) [-crawlId id]

--
This message was sent by Atlassian JIRA (v6.1#6144)
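As a rough illustration of the usage line quoted above, the sketch below parses `(batchId | -all) [-crawlId id]` and decides whether a row should be processed. It is not taken from any of the attached patches: the `Options` class and the `parse` and `accepts` methods are hypothetical, and the batch ID used in `main` is made up.

```java
public class BatchIdFilterSketch {

    static final String ALL = "-all";
    static final String USAGE = "Usage: DbUpdaterJob (batchId | -all) [-crawlId id]";

    /** Hypothetical parsed form of the command line; not from the patch. */
    static final class Options {
        final String batchId; // "-all" means every batch
        final String crawlId; // null when -crawlId was not given

        Options(String batchId, String crawlId) {
            this.batchId = batchId;
            this.crawlId = crawlId;
        }
    }

    /** Parses (batchId | -all) [-crawlId id]; rejects anything else. */
    static Options parse(String[] args) {
        if (args.length < 1) {
            throw new IllegalArgumentException(USAGE);
        }
        String batchId = args[0];
        String crawlId = null;
        for (int i = 1; i < args.length; i++) {
            if ("-crawlId".equals(args[i]) && i + 1 < args.length) {
                crawlId = args[++i]; // consume the value that follows the flag
            } else {
                throw new IllegalArgumentException(USAGE);
            }
        }
        return new Options(batchId, crawlId);
    }

    /** True when a row marked with rowBatchId should be processed. */
    static boolean accepts(Options opts, String rowBatchId) {
        return ALL.equals(opts.batchId) || opts.batchId.equals(rowBatchId);
    }

    public static void main(String[] args) {
        // "batch-1" and "batch-2" are invented batch IDs for the example.
        Options one = parse(new String[] {"batch-1", "-crawlId", "test"});
        System.out.println(accepts(one, "batch-1")); // row from the requested batch
        System.out.println(accepts(one, "batch-2")); // row from another batch is skipped

        Options all = parse(new String[] {ALL});
        System.out.println(accepts(all, "batch-2")); // -all takes every row
    }
}
```

A filter of this shape only decides which rows are read; it does not address the concern raised in the description about sorting in the reduce phase.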
[jira] [Reopened] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic reopened NUTCH-1556:
---

Reopening because this issue has a new patch that should be committed.
[jira] [Updated] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.
[ https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated NUTCH-1679:
---

Priority: Critical (was: Major)

UpdateDb using batchId, link may override crawled page.

Key: NUTCH-1679
URL: https://issues.apache.org/jira/browse/NUTCH-1679
Project: Nutch
Issue Type: Bug
Affects Versions: 2.3
Reporter: Nguyen Manh Tien
Priority: Critical
Fix For: 2.3
[jira] [Comment Edited] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839242#comment-13839242 ]

Otis Gospodnetic edited comment on NUTCH-1556 at 12/4/13 7:23 PM:
---

Reopening because this issue has a new patch that should be committed.

Wait, the patch that was added here is the same as the patch in NUTCH-1667.

was (Author: otis):
Reopening because this issue has a new patch that should be committed.
[jira] [Resolved] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic resolved NUTCH-1556.
---

Resolution: Fixed

Marking as Fixed again because I see that the patch added to this issue after it was closed was also added in NUTCH-1667 (but not removed from this issue, thus causing confusion).