[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892423#comment-13892423 ]
Koen Smets commented on NUTCH-1556:
Hi [~lewismc], I confirmed NUTCH-1679 on the Cassandra store. Although the `bin/crawl` script changed in [NUTCH-1556] differs from the one in 2.2.1 only by the added `$batchId`, this changes behaviour drastically. A lot of pages get refetched sooner than indicated by db.default.fetch.interval, and they outnumber the unfetched pages. Although I noticed a remarkable speedup when focusing only on the pages from the current batch, I changed `$batchId` to `-all` in order to give preference to the pages that are truly unfetched.
> enabling updatedb to accept batchId
>
> Key: NUTCH-1556
> URL: https://issues.apache.org/jira/browse/NUTCH-1556
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 2.2
> Reporter: kaveh minooie
> Fix For: 2.3
> Attachments: NUTCH-1556-batchId.patch, NUTCH-1556-v2.patch, NUTCH-1556-v3.patch, NUTCH-1556.patch
>
> So the idea here is to be able to run updatedb and fetch for different batchIds simultaneously. I put together a patch. It seems to be working (it does skip the rows that do not match the batchId), but I am worried if and how it might affect the sorting in the reduce part. Anyway, check it out.
> It also changes the command line usage to this:
> Usage: DbUpdaterJob ( | -all) [-crawlId ]
--
This message was sent by Atlassian JIRA (v6.1.5#6160)
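For readers of the archive, the row selection described in the issue (skip rows that do not match the batchId, or take everything with `-all`) can be pictured with the rough sketch below. It is not the Nutch code: `rows_to_update` and the `batchId` dictionary field are hypothetical names chosen for illustration, and plain dictionaries stand in for stored pages.

```python
from typing import Dict, Iterable, Iterator

ALL_BATCHES = "-all"  # the "process everything" argument from the usage string


def rows_to_update(rows: Iterable[Dict[str, str]], batch_id: str) -> Iterator[Dict[str, str]]:
    """Yield only the rows that belong to batch_id, or every row for -all."""
    for row in rows:
        # Hypothetical field name, not taken from the Nutch schema.
        if batch_id == ALL_BATCHES or row.get("batchId") == batch_id:
            yield row


# Illustrative data only.
rows = [
    {"url": "http://example.org/a", "batchId": "batch-1"},
    {"url": "http://example.org/b", "batchId": "batch-2"},
    {"url": "http://example.org/c", "batchId": "batch-1"},
]
```

Under this picture, passing a single batch ID touches fewer rows per run, while `-all` lets every row be reconsidered. That appears to be the trade-off weighed in the comment above.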
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892364#comment-13892364 ]
Lewis John McGibbney commented on NUTCH-1556:
Hi [~ksmets], this issue has been addressed and resolved. Do you care to submit your comments or a patch for NUTCH-1679? Also, can you explain in more detail what you mean by
bq. this causes a lot of refetched pages
Do you mean that this causes a lot of fetched pages to be overwritten and refetched?
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892075#comment-13892075 ]
Koen Smets commented on NUTCH-1556:
This should be reconsidered, as it causes a lot of already fetched pages to be refetched, as indicated by NUTCH-1679.
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839240#comment-13839240 ]
Otis Gospodnetic commented on NUTCH-1556:
[~tiennm] it looks like you added a patch to this issue, but the issue is already marked Resolved and Fixed. [~amuseme.lu], want to commit this new patch before 2.3 release?
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765426#comment-13765426 ]
Hudson commented on NUTCH-1556:
FAILURE: Integrated in Nutch-nutchgora #754 (See [https://builds.apache.org/job/Nutch-nutchgora/754/])
NUTCH-1556 enabling updatedb to accept batchId (fenglu: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1522566)
* /nutch/branches/2.x/src/bin/crawl
> Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, NUTCH-1556-v3.patch
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765413#comment-13765413 ]
Julien Nioche commented on NUTCH-1556:
No probs Lufeng. Thanks!
> Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, NUTCH-1556-v3.patch
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765410#comment-13765410 ]
lufeng commented on NUTCH-1556:
Oh, I'm so sorry. I have already fixed this problem: committed revision 1522566 in 2.x HEAD. Thanks Julien.
> Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, NUTCH-1556-v3.patch
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765243#comment-13765243 ]
Julien Nioche commented on NUTCH-1556:
Guys, you have broken the crawl script for 2.x. The usage is
Usage: DbUpdaterJob ( | -all) [-crawlId ]
but you are passing '-batchId $batchId'. Can you please fix this? Thanks
> Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, NUTCH-1556-v3.patch
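To make the mismatch reported above concrete, here is a minimal sketch of argument handling that follows the shape of the quoted usage string: a batch ID or `-all` in first position, optionally followed by `-crawlId` and a value. The function name `parse_updatedb_args` is hypothetical, and rejecting any other dash-prefixed first argument is an assumption about the behaviour, not something the thread confirms.

```python
from typing import List, Optional, Tuple


def parse_updatedb_args(args: List[str]) -> Tuple[str, Optional[str]]:
    """Return (batch_id_or_all, crawl_id) for an argument list."""
    if not args:
        raise ValueError("expected a batch ID or -all as the first argument")
    first = args[0]
    # Assumption: any other dash-prefixed token in first position is rejected.
    if first.startswith("-") and first != "-all":
        raise ValueError(f"unexpected option in first position: {first}")
    crawl_id: Optional[str] = None
    rest = args[1:]
    if rest:
        # Only the optional "-crawlId <value>" pair may follow.
        if len(rest) != 2 or rest[0] != "-crawlId":
            raise ValueError(f"unexpected arguments: {rest}")
        crawl_id = rest[1]
    return first, crawl_id
```

With this reading, a bare batch ID or `-all` is accepted, while an argument list that starts with `-batchId` is refused because the batch ID is positional rather than a named option.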
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759168#comment-13759168 ]
Hudson commented on NUTCH-1556:
SUCCESS: Integrated in Nutch-nutchgora #746 (See [https://builds.apache.org/job/Nutch-nutchgora/746/])
NUTCH-1556 enabling updatedb to accept batchId (fenglu: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1520332)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/src/bin/crawl
* /nutch/branches/2.x/src/java/org/apache/nutch/crawl/DbUpdateMapper.java
* /nutch/branches/2.x/src/java/org/apache/nutch/crawl/DbUpdaterJob.java
> Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, NUTCH-1556-v3.patch
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759123#comment-13759123 ]
lufeng commented on NUTCH-1556:
Committed revision 1520332 in 2.x HEAD. Thanks kaveh.
> Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, NUTCH-1556-v3.patch
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756080#comment-13756080 ]
lufeng commented on NUTCH-1556:
I will commit this unless there are objections
> Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, NUTCH-1556-v3.patch
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13752432#comment-13752432 ]
lufeng commented on NUTCH-1556:
thanks kaveh. +1
> Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, NUTCH-1556-v3.patch
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13750803#comment-13750803 ]
lufeng commented on NUTCH-1556:
Hi Lewis, I'm sorry, I created a duplicate issue. I will merge these two patches into one; can you check it out? Thanks.
> Attachments: NUTCH-1556.patch
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13750394#comment-13750394 ]
Lewis John McGibbney commented on NUTCH-1556:
It would be really nice to merge the proposals on both NUTCH-1556 and NUTCH-1632.
> Attachments: NUTCH-1556.patch
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630657#comment-13630657 ]
Lewis John McGibbney commented on NUTCH-1556:
Nice one Kaveh. I will check this out soon.
> Fix For: 2.2
> Attachments: NUTCH-1556.patch