[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-07-27 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: Attachment: NUTCH-1932.patch. Updated patch for trunk. This still relies on a LinkDB and a CrawlDB

[jira] [Updated] (NUTCH-2048) parse-tika: fix dependencies in plugin.xml

2015-07-27 Thread Michael Joyce (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce updated NUTCH-2048: Attachment: NUTCH-2048_Joyce_20150727.patch. Updated the patch to set the sync attribute on

[jira] [Updated] (NUTCH-2068) Allow subcollection overrides via metadata

2015-07-27 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2068: Attachment: NUTCH-2068.patch. Patch for trunk. Allow subcollection overrides via metadata

[jira] [Updated] (NUTCH-2064) URLNormalizer basic to properly encode non-ASCII characters

2015-07-27 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2064: Attachment: NUTCH-2064.patch. Quick and dirty patch where [ and ] are also encoded. I just

[jira] [Commented] (NUTCH-2064) URLNormalizer basic to properly encode non-ASCII characters

2015-07-27 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642500#comment-14642500 ] Markus Jelsma commented on NUTCH-2064: Also, spaces are now also escaped in