[jira] Created: (NUTCH-532) CrawlDbMerger: wrong computation of last fetch time

2007-07-30 Thread Emmanuel Joke (JIRA)
CrawlDbMerger: wrong computation of last fetch time --- Key: NUTCH-532 URL: https://issues.apache.org/jira/browse/NUTCH-532 Project: Nutch Issue Type: Bug Reporter: Emmanuel Joke

[jira] Updated: (NUTCH-532) CrawlDbMerger: wrong computation of last fetch time

2007-07-30 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-532: Attachment: NUTCH-532.patch Patch provided. CrawlDbMerger: wrong computation of last fetch time

[jira] Updated: (NUTCH-532) CrawlDbMerger: wrong computation of last fetch time

2007-07-30 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-532: Attachment: (was: NUTCH-532.patch) CrawlDbMerger: wrong computation of last fetch time

[jira] Updated: (NUTCH-532) CrawlDbMerger: wrong computation of last fetch time

2007-07-30 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-532: Attachment: NUTCH-532.patch CrawlDbMerger: wrong computation of last fetch time

[jira] Created: (NUTCH-533) LinkDbMerger: url normlaized is not updated in the key and inlinks list

2007-07-30 Thread Emmanuel Joke (JIRA)
LinkDbMerger: url normlaized is not updated in the key and inlinks list --- Key: NUTCH-533 URL: https://issues.apache.org/jira/browse/NUTCH-533 Project: Nutch Issue Type:

[jira] Updated: (NUTCH-533) LinkDbMerger: url normlaized is not updated in the key and inlinks list

2007-07-30 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-533: Attachment: NUTCH-533.patch Patch provided LinkDbMerger: url normlaized is not updated in the key

[jira] Commented: (NUTCH-530) Add a combiner to improve performance on updatedb

2007-07-30 Thread Doğacan Güney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516357 ] Doğacan Güney commented on NUTCH-530: - Ehm, I am not sure about this... After this, we call updateDbScore twice,

[jira] Commented: (NUTCH-526) Use a combiner in LinkDbMerger to improve the performance as in LinkDb

2007-07-30 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516358 ] Emmanuel Joke commented on NUTCH-526: - Could you please wait again few days ? I would like to wait for a

[jira] Updated: (NUTCH-531) Pages with no ContentType cause a Null Pointer exception

2007-07-30 Thread Doğacan Güney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-531: Attachment: NUTCH-531-draft.patch I agree with you. IMHO, a simple change to getContentType should

[jira] Commented: (NUTCH-533) LinkDbMerger: url normlaized is not updated in the key and inlinks list

2007-07-30 Thread Doğacan Güney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516362 ] Doğacan Güney commented on NUTCH-533: - Looks good to me. +1 LinkDbMerger: url normlaized is not updated in the

[jira] Commented: (NUTCH-532) CrawlDbMerger: wrong computation of last fetch time

2007-07-30 Thread Doğacan Güney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516364 ] Doğacan Güney commented on NUTCH-532: - Does this calculation: res.getFetchTime() -

[jira] Commented: (NUTCH-514) Indexer should only index pages with fetch status SUCCESS

2007-07-30 Thread Doğacan Güney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516365 ] Doğacan Güney commented on NUTCH-514: - Since no one commented, I am assuming that no one wants to see 404 and

[jira] Commented: (NUTCH-528) CrawlDbReader: add some new stats + dump into a csv format

2007-07-30 Thread Doğacan Güney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516366 ] Doğacan Güney commented on NUTCH-528: - This is my personal nit, but the cli options look weird. Why not something

[jira] Commented: (NUTCH-529) NodeWalker.skipChildren don't wrok for more than 1 child.

2007-07-30 Thread Doğacan Güney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516367 ] Doğacan Güney commented on NUTCH-529: - Could you also add a junit test case? (actually, since NodeWalker is used

[jira] Commented: (NUTCH-533) LinkDbMerger: url normlaized is not updated in the key and inlinks list

2007-07-30 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516408 ] Andrzej Bialecki commented on NUTCH-533: - +1. Please fix the typo (present also in the original file): empy

[jira] Commented: (NUTCH-514) Indexer should only index pages with fetch status SUCCESS

2007-07-30 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516428 ] Andrzej Bialecki commented on NUTCH-514: - +1 we're only humans with 24 hours in a day .. ;) Actually, this

[jira] Closed: (NUTCH-514) Indexer should only index pages with fetch status SUCCESS

2007-07-30 Thread Doğacan Güney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney closed NUTCH-514. --- Resolved and committed. Indexer should only index pages with fetch status SUCCESS

[jira] Resolved: (NUTCH-514) Indexer should only index pages with fetch status SUCCESS

2007-07-30 Thread Doğacan Güney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney resolved NUTCH-514. - Resolution: Fixed Assignee: Doğacan Güney Committed in rev. 561092. Indexer should only

[jira] Updated: (NUTCH-529) NodeWalker.skipChildren doesn't work for more than 1 child.

2007-07-30 Thread Doğacan Güney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-529: Summary: NodeWalker.skipChildren doesn't work for more than 1 child. (was: NodeWalker.skipChildren

[jira] Updated: (NUTCH-533) LinkDbMerger: url normalized is not updated in the key and inlinks list

2007-07-30 Thread Doğacan Güney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-533: Summary: LinkDbMerger: url normalized is not updated in the key and inlinks list (was:

[jira] Commented: (NUTCH-530) Add a combiner to improve performance on updatedb

2007-07-30 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516602 ] Emmanuel Joke commented on NUTCH-530: - I'm sure to follow your point regarding the outlinks number. I don't

[jira] Commented: (NUTCH-514) Indexer should only index pages with fetch status SUCCESS

2007-07-30 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516613 ] Hudson commented on NUTCH-514: -- Integrated in Nutch-Nightly #166 (See

[jira] Commented: (NUTCH-532) CrawlDbMerger: wrong computation of last fetch time

2007-07-30 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516618 ] Emmanuel Joke commented on NUTCH-532: - res.getFetchTime() - Math.round(res.getFetchInterval() * 1000d); always