[jira] [Commented] (NUTCH-1449) Optionally delete documents skipped by IndexingFilters

2016-01-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083109#comment-15083109 ] Markus Jelsma commented on NUTCH-1449: -- We have it nicely running for some years. I will commit this

[jira] [Commented] (NUTCH-2168) Parse-tika fails to retrieve parser

2016-01-05 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083285#comment-15083285 ] Sebastian Nagel commented on NUTCH-2168: Hi [~kalanya], looks like the indexed raw content of the

[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-01-05 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083214#comment-15083214 ] Chris A. Mattmann commented on NUTCH-2191: -- Very nice, Markus! Beat me to implementing this one.

[jira] [Updated] (NUTCH-1321) IDNNormalizer

2016-01-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1321: - Patch Info: Patch Available > IDNNormalizer > - > > Key: NUTCH-1321 >

[jira] [Created] (NUTCH-2196) IndexingFilterChecker to optionally normalize

2016-01-05 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2196: Summary: IndexingFilterChecker to optionally normalize Key: NUTCH-2196 URL: https://issues.apache.org/jira/browse/NUTCH-2196 Project: Nutch Issue Type:

[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-01-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083273#comment-15083273 ] Markus Jelsma commented on NUTCH-2191: -- Hey Chris! An Ajax pattern handler is new to me. Can you

[jira] [Created] (NUTCH-2195) IndexingFilterChecker to optionally follow N redirects

2016-01-05 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2195: Summary: IndexingFilterChecker to optionally follow N redirects Key: NUTCH-2195 URL: https://issues.apache.org/jira/browse/NUTCH-2195 Project: Nutch Issue

[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2016-01-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083092#comment-15083092 ] Markus Jelsma commented on NUTCH-2184: -- Hello Lewis! * it should be no problem. But since

[jira] [Updated] (NUTCH-2191) Add protocol-htmlunit

2016-01-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2191: - Patch Info: Patch Available > Add protocol-htmlunit > - > >

[jira] [Commented] (NUTCH-1838) Host and domain based regex and automaton filtering

2016-01-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083101#comment-15083101 ] Markus Jelsma commented on NUTCH-1838: -- If no objections, I'll get this one in soon > Host and

[jira] [Created] (NUTCH-2194) Run IndexingFilterChecker as simple Telnet server

2016-01-05 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2194: Summary: Run IndexingFilterChecker as simple Telnet server Key: NUTCH-2194 URL: https://issues.apache.org/jira/browse/NUTCH-2194 Project: Nutch Issue Type:

[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-01-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083104#comment-15083104 ] Markus Jelsma commented on NUTCH-2191: -- Does anyone have an idea on how to force the plugin to use

[jira] [Commented] (NUTCH-1257) Support for the x-robots-tag HTTP Header

2016-01-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083114#comment-15083114 ] Markus Jelsma commented on NUTCH-1257: -- Hmm, there is no patch but I remember having had this support

[jira] [Updated] (NUTCH-2178) DeduplicationJob to optionally group on host or domain

2016-01-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2178: - Patch Info: Patch Available > DeduplicationJob to optionally group on host or domain >

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2016-01-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Patch Info: Patch Available > Automatically remove orphaned pages >

[jira] [Commented] (NUTCH-1186) FreeGenerator always normalizes

2016-01-05 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083138#comment-15083138 ] Lewis John McGibbney commented on NUTCH-1186: - Will scope and test [~markus17] >

[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-01-05 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083603#comment-15083603 ] Sebastian Nagel commented on NUTCH-2191: As [~haraldk] mentioned in [this

[jira] [Commented] (NUTCH-2178) DeduplicationJob to optionally group on host or domain

2016-01-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083096#comment-15083096 ] Markus Jelsma commented on NUTCH-2178: -- Will commit in a few if no further objections. >

[jira] [Updated] (NUTCH-1449) Optionally delete documents skipped by IndexingFilters

2016-01-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1449: - Patch Info: Patch Available > Optionally delete documents skipped by IndexingFilters >

[jira] [Updated] (NUTCH-1186) FreeGenerator always normalizes

2016-01-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1186: - Patch Info: Patch Available > FreeGenerator always normalizes > --- >

[jira] [Commented] (NUTCH-2143) GeneratorJob ignores batch id passed as argument

2016-01-05 Thread liuqibj (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085043#comment-15085043 ] liuqibj commented on NUTCH-2143: I have a fix and can deliver it > GeneratorJob ignores batch id passed

[jira] [Commented] (NUTCH-2168) Parse-tika fails to retrieve parser

2016-01-05 Thread Auro Miralles (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082904#comment-15082904 ] Auro Miralles commented on NUTCH-2168: -- Hello. I have no idea which document fails... I can crawl

[jira] [Comment Edited] (NUTCH-2168) Parse-tika fails to retrieve parser

2016-01-05 Thread Auro Miralles (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082904#comment-15082904 ] Auro Miralles edited comment on NUTCH-2168 at 1/5/16 11:50 AM: --- Hello. I

[jira] [Commented] (NUTCH-1946) Upgrade to Gora 0.6.1

2016-01-05 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082951#comment-15082951 ] ASF GitHub Bot commented on NUTCH-1946: --- Github user lewismc commented on the pull request:

[jira] [Commented] (NUTCH-1946) Upgrade to Gora 0.6.1

2016-01-05 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082966#comment-15082966 ] ASF GitHub Bot commented on NUTCH-1946: --- Github user jeroenvlek closed the pull request at: