[jira] [Commented] (NUTCH-1504) Pluggable url partitioner

2015-06-23 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598726#comment-14598726 ] Lewis John McGibbney commented on NUTCH-1504: - [~mjoyce] scope this out. This

[jira] [Assigned] (NUTCH-1504) Pluggable url partitioner

2015-06-23 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney reassigned NUTCH-1504: --- Assignee: Lewis John McGibbney > Pluggable url partitioner >

[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-23 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598724#comment-14598724 ] Chris A. Mattmann commented on NUTCH-2038: -- Yeah so here's the deal. I think I ca

[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-23 Thread Asitang Mishra (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598715#comment-14598715 ] Asitang Mishra commented on NUTCH-2038: --- Hey [~wastl-nagel], I have decided to impl

[jira] [Commented] (NUTCH-2045) index-basic incorrect assignment of next fetch time (page.getFetchTime()) as page fetch time

2015-06-23 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598511#comment-14598511 ] Hudson commented on NUTCH-2045: --- SUCCESS: Integrated in Nutch-nutchgora #1477 (See [https:/

[jira] [Updated] (NUTCH-2045) index-basic incorrect assignment of next fetch time (page.getFetchTime()) as page fetch time

2015-06-23 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2045: Fix Version/s: (was: 1.11) > index-basic incorrect assignment of next fetch time

[jira] [Updated] (NUTCH-2045) index-basic incorrect assignment of next fetch time (page.getFetchTime()) as page fetch time

2015-06-23 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2045: Affects Version/s: (was: 1.10) > index-basic incorrect assignment of next fetch

[jira] [Resolved] (NUTCH-2045) index-basic incorrect assignment of next fetch time (page.getFetchTime()) as page fetch time

2015-06-23 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2045. - Resolution: Fixed Patch for 2.X Committed revision 1687145. [~wastl-nagel] you ar

[jira] [Comment Edited] (NUTCH-2045) index-basic incorrect assignment of next fetch time (page.getFetchTime()) as page fetch time

2015-06-23 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598479#comment-14598479 ] Sebastian Nagel edited comment on NUTCH-2045 at 6/23/15 10:20 PM: --

[jira] [Commented] (NUTCH-2045) index-basic incorrect assignment of next fetch time (page.getFetchTime()) as page fetch time

2015-06-23 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598479#comment-14598479 ] Sebastian Nagel commented on NUTCH-2045: Is 1.x (1.10) really affected? BasicIndex

[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-23 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598468#comment-14598468 ] Sebastian Nagel commented on NUTCH-2038: bq. From what I understand the problem is

[Nutch Wiki] Trivial Update of "GoogleSummerOfCode/SitemapCrawler/weeklyreport" by LewisJohnMcgibbney

2015-06-23 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification. The "GoogleSummerOfCode/SitemapCrawler/weeklyreport" page has been changed by LewisJohnMcgibbney: https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler/weeklyreport?action=diff

[Nutch Wiki] Update of "GoogleSummerOfCode/SitemapCrawler/weeklyreport" by LewisJohnMcgibbney

2015-06-23 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification. The "GoogleSummerOfCode/SitemapCrawler/weeklyreport" page has been changed by LewisJohnMcgibbney: https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler/weeklyreport?action=diff

[jira] [Updated] (NUTCH-1335) OutlinkDB to collect unique URL's only

2015-06-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1335: - Attachment: NUTCH-1335.patch Updated as well. This increases performance on very large cra

[jira] [Updated] (NUTCH-1980) Jexl expressions for CrawlDbReader

2015-06-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1980: - Attachment: NUTCH-1980.patch Updated patch. It was reviewed, will commit shortly. > Jexl expressi

[jira] [Updated] (NUTCH-1838) Host and domain based regex and automaton filtering

2015-06-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1838: - Attachment: NUTCH-1838.patch Apologies, this is the correct patch > Host and domain based regex a

[jira] [Updated] (NUTCH-1838) Host and domain based regex and automaton filtering

2015-06-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1838: - Attachment: NUTCH-1838.patch Patch for trunk. Please check it out, it provides a huge performance

[jira] [Updated] (NUTCH-1730) Scoring-depth optionally not to increment depth for external hosts

2015-06-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1730: - Attachment: NUTCH-1730.patch Updated patch. Please see unit test. This resolves some issues and ad

[jira] [Updated] (NUTCH-1692) SegmentReader broken in distributed mode

2015-06-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1692: - Attachment: NUTCH-1692.patch Updated patch. It was reviewed, will commit shortly > SegmentReader

[jira] [Updated] (NUTCH-1684) ParseMeta to be added before fetch schedulers are run

2015-06-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1684: - Attachment: NUTCH-1684-trunk.patch Updated patch. Please check out. Without this schedulers do not

[jira] [Updated] (NUTCH-1625) IndexerMapReduce skips FETCH_NOTMODIFIED

2015-06-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1625: - Attachment: NUTCH-1625.patch Patch for 1.10. Please check it out. Without this Nutch will not prop

[jira] [Closed] (NUTCH-1711) Normalizer does not encode exclamation mark

2015-06-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-1711. Resolution: Won't Fix Agreed > Normalizer does not encode exclamation mark > --