Re: [DISCUSS] Release Trunk

2013-12-04 Thread feng lu
Yes, the 2.x release needs more testing and depends on Gora. But for Nutch 1.x, I
see that 29 issues [0] have been resolved since the previous release and that
78 issues [1] still need to be resolved, as Julien mentioned.
So before the release, we need to do more work.

[0]
https://issues.apache.org/jira/browse/NUTCH-1520?jql=project%20%3D%20NUTCH%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%20%221.8%22%20ORDER%20BY%20priority%20DESC
 [1]
https://issues.apache.org/jira/browse/NUTCH-1464?jql=fixVersion%20%3D%20%221.8%22%20AND%20project%20%3D%20NUTCH%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC

On Fri, Nov 29, 2013 at 12:34 AM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 Hi Folks,
 Thread says it all.
 There are some hot tickets over in Gora right now so I think holding off
 the next while for a 2.x release would be wise.
 I can spin the RC for trunk tonight/tomorrow/weekend if we get the thumbs
 up.
 Ta
 Lewis

 --
 *Lewis*




-- 
Don't Grow Old, Grow Up... :-)


[jira] [Created] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2013-12-04 Thread Nguyen Manh Tien (JIRA)
Nguyen Manh Tien created NUTCH-1679:
---

 Summary: UpdateDb using batchId, link may override crawled page.
 Key: NUTCH-1679
 URL: https://issues.apache.org/jira/browse/NUTCH-1679
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.3
Reporter: Nguyen Manh Tien


The problem is in the HBase store; I am not sure about other stores.

Suppose that in the first crawl cycle we crawl link A and get an outlink B.
In the second cycle we crawl link B, which also has a link pointing to A.
In the second updatedb we load only page B from the store and add A as a new link,
because updatedb doesn't know that A already exists in the store, so it overrides A.

UpdateDb must be run without batchId, or we must set additionsAllowed=false.

Here is the code for a new page:
  page = new WebPage();
  schedule.initializeSchedule(url, page);
  page.setStatus(CrawlStatus.STATUS_UNFETCHED);
  try {
    scoringFilters.initialScore(url, page);
  } catch (ScoringFilterException e) {
    page.setScore(0.0f);
  }
The new page will override the old page's status, score, fetchTime, fetchInterval,
retries, and metadata[CASH_KEY].
- I think we can change something here so that a new page only updates one
column, for example 'link', and if it is really a new page, we can initialize all
the above fields in the generator.
- Or we add a checkAndPut operation to the store, so that when adding a new page
we first check whether it already exists.
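
The scenario and the second proposal above can be sketched in plain Java. This is a minimal illustration, not Nutch or Gora code: a HashMap stands in for the store, the class and method names are hypothetical, and Map.putIfAbsent plays the role that a checkAndPut-style operation would play in a real store.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: url -> status, standing in for the real page store.
final class UpdateDbOverwriteSketch {

    // State after cycle 1: A was fetched and its outlink B was discovered.
    static Map<String, String> storeAfterCycleOne() {
        Map<String, String> store = new HashMap<>();
        store.put("A", "fetched");
        store.put("B", "unfetched");
        return store;
    }

    // Mimics the reported behaviour: the "new" row is written unconditionally.
    static void blindPut(Map<String, String> store, String url) {
        store.put(url, "unfetched");
    }

    // Mimics the proposed check: the row is written only if the URL is absent.
    static void putIfMissing(Map<String, String> store, String url) {
        store.putIfAbsent(url, "unfetched");
    }

    public static void main(String[] args) {
        // Cycle 2: only B was loaded, so its outlink A looks like a new page.
        Map<String, String> blind = storeAfterCycleOne();
        blindPut(blind, "A");
        System.out.println("blind put:      A is " + blind.get("A"));

        Map<String, String> guarded = storeAfterCycleOne();
        putIfMissing(guarded, "A");
        System.out.println("put if missing: A is " + guarded.get("A"));
    }
}
```

Run as written, the blind put leaves A marked unfetched, while the guarded put leaves it fetched.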



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId

2013-12-04 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839240#comment-13839240
 ] 

Otis Gospodnetic commented on NUTCH-1556:
-

[~tiennm] it looks like you added a patch to this issue, but the issue is 
already marked Resolved and Fixed.

[~amuseme.lu], want to commit this new patch before 2.3 release?

 enabling updatedb to accept batchId 
 

 Key: NUTCH-1556
 URL: https://issues.apache.org/jira/browse/NUTCH-1556
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.2
Reporter: kaveh minooie
 Fix For: 2.3

 Attachments: NUTCH-1556-batchId.patch, NUTCH-1556-v2.patch, 
 NUTCH-1556-v3.patch, NUTCH-1556.patch


 So the idea here is to be able to run updatedb and fetch for different 
 batchIds simultaneously. I put together a patch. It seems to be working (it 
 does skip the rows that do not match the batchId), but I am worried about whether 
 and how it might affect the sorting in the reduce part. Anyway, check it out. 
 It also changes the command line usage to this:
 Usage: DbUpdaterJob (batchId | -all) [-crawlId id]
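
A rough sketch of how the usage line and the row-skipping behaviour described above could fit together. Everything here is an assumption made for illustration (the class name, the parse and shouldProcess helpers, and the returned array layout); it is not taken from the attached patches.

```java
// Hypothetical sketch; not the code from the NUTCH-1556 patches.
final class DbUpdaterArgsSketch {

    static final String USAGE = "Usage: DbUpdaterJob (batchId | -all) [-crawlId id]";

    // Returns {batchIdOrAll, crawlIdOrNull} for arguments that follow USAGE.
    static String[] parse(String[] args) {
        if (args.length < 1) {
            throw new IllegalArgumentException(USAGE);
        }
        String batchId = args[0];
        String crawlId = null;
        for (int i = 1; i < args.length; i++) {
            if ("-crawlId".equals(args[i]) && i + 1 < args.length) {
                crawlId = args[++i];
            } else {
                throw new IllegalArgumentException(USAGE);
            }
        }
        return new String[] { batchId, crawlId };
    }

    // A row is processed when -all was requested or its batchId matches.
    static boolean shouldProcess(String requested, String rowBatchId) {
        return "-all".equals(requested) || requested.equals(rowBatchId);
    }

    public static void main(String[] args) {
        String[] parsed = parse(new String[] { "batch-1", "-crawlId", "demo" });
        System.out.println("batchId=" + parsed[0] + " crawlId=" + parsed[1]);
        System.out.println("matching row processed: " + shouldProcess(parsed[0], "batch-1"));
        System.out.println("other row processed:    " + shouldProcess(parsed[0], "batch-2"));
    }
}
```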





[jira] [Reopened] (NUTCH-1556) enabling updatedb to accept batchId

2013-12-04 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic reopened NUTCH-1556:
-


Reopening because this issue has a new patch that should be committed.






[jira] [Updated] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2013-12-04 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated NUTCH-1679:


Priority: Critical  (was: Major)

 UpdateDb using batchId, link may override crawled page.
 ---

 Key: NUTCH-1679
 URL: https://issues.apache.org/jira/browse/NUTCH-1679
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.3
Reporter: Nguyen Manh Tien
Priority: Critical
 Fix For: 2.3







[jira] [Comment Edited] (NUTCH-1556) enabling updatedb to accept batchId

2013-12-04 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839242#comment-13839242
 ] 

Otis Gospodnetic edited comment on NUTCH-1556 at 12/4/13 7:23 PM:
--

Reopening because this issue has a new patch that should be committed.

Wait, the patch that was added here is the same as the patch in NUTCH-1667.


was (Author: otis):
Reopening because this issue has a new patch that should be committed.






[jira] [Resolved] (NUTCH-1556) enabling updatedb to accept batchId

2013-12-04 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic resolved NUTCH-1556.
-

Resolution: Fixed

Marking as Fixed again because I see the patch that was added to this issue 
after it was closed was also added in NUTCH-1667 (but not removed from this 
issue, thus causing confusion).



