[jira] [Commented] (NUTCH-2715) WARCExporter fails on large records

2019-05-06 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833858#comment-16833858 ] Sebastian Nagel commented on NUTCH-2715: Hi [~yossi], thanks! Unfortunately both WARC writer

[jira] [Assigned] (NUTCH-2650) -addBinaryContent -base64 flags are causing "String length must be a multiple of four" error in IndexingJob

2019-05-03 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2650: -- Assignee: Sebastian Nagel > -addBinaryContent -base64 flags are causing "String

[jira] [Updated] (NUTCH-2706) -addBinaryContent flag can cause "String length must be a multiple of four" error in IndexingJob

2019-05-03 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2706: --- Fix Version/s: 1.16 > -addBinaryContent flag can cause "String length must be a multiple of

[jira] [Assigned] (NUTCH-2706) -addBinaryContent flag can cause "String length must be a multiple of four" error in IndexingJob

2019-05-03 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2706: -- Assignee: Sebastian Nagel > -addBinaryContent flag can cause "String length must be a

[jira] [Updated] (NUTCH-2650) -addBinaryContent -base64 flags are causing "String length must be a multiple of four" error in IndexingJob

2019-05-03 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2650: --- Fix Version/s: 1.16 > -addBinaryContent -base64 flags are causing "String length must be a

[jira] [Commented] (NUTCH-2706) -addBinaryContent flag can cause "String length must be a multiple of four" error in IndexingJob

2019-05-03 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16832841#comment-16832841 ] Sebastian Nagel commented on NUTCH-2706: Hi [~pemanuel], I was able to reproduce the issue and

[jira] [Commented] (NUTCH-2585) NPE in TrieStringMatcher

2019-05-03 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16832568#comment-16832568 ] Sebastian Nagel commented on NUTCH-2585: PR including fix is open:

[jira] [Commented] (NUTCH-2585) NPE in TrieStringMatcher

2019-05-03 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16832542#comment-16832542 ] Sebastian Nagel commented on NUTCH-2585: Ok, this is reproduced using parallel streams (see

[jira] [Commented] (NUTCH-2706) -addBinaryContent flag can cause "String length must be a multiple of four" error in IndexingJob

2019-04-16 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819060#comment-16819060 ] Sebastian Nagel commented on NUTCH-2706: Thanks for the hint, [~pemanuel]! I'll have a look. >

[jira] [Updated] (NUTCH-2709) Remove unused properties and code related to HTTP protocol

2019-04-16 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2709: --- Component/s: protocol > Remove unused properties and code related to HTTP protocol >

[jira] [Created] (NUTCH-2709) Remove unused properties and code related to HTTP protocol

2019-04-16 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2709: -- Summary: Remove unused properties and code related to HTTP protocol Key: NUTCH-2709 URL: https://issues.apache.org/jira/browse/NUTCH-2709 Project: Nutch

[jira] [Assigned] (NUTCH-2702) Fetcher: suppress stack for frequent exceptions

2019-04-16 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2702: -- Assignee: Sebastian Nagel > Fetcher: suppress stack for frequent exceptions >

[jira] [Resolved] (NUTCH-2704) Upgrade crawler-commons dependency to 1.0

2019-04-12 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2704. Resolution: Implemented > Upgrade crawler-commons dependency to 1.0 >

[jira] [Assigned] (NUTCH-2704) Upgrade crawler-commons dependency to 1.0

2019-04-12 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2704: -- Assignee: Sebastian Nagel > Upgrade crawler-commons dependency to 1.0 >

[jira] [Assigned] (NUTCH-2699) Protocol-okhttp: needless loops to increment requested bytes counter when more content is already buffered

2019-04-12 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2699: -- Assignee: Sebastian Nagel > Protocol-okhttp: needless loops to increment requested

[jira] [Resolved] (NUTCH-2699) Protocol-okhttp: needless loops to increment requested bytes counter when more content is already buffered

2019-04-12 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2699. Resolution: Fixed > Protocol-okhttp: needless loops to increment requested bytes counter

[jira] [Work started] (NUTCH-2699) Protocol-okhttp: needless loops to increment requested bytes counter when more content is already buffered

2019-04-12 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-2699 started by Sebastian Nagel. -- > Protocol-okhttp: needless loops to increment requested bytes counter when

[jira] [Created] (NUTCH-2708) urlfilter-automaton: update library dependency (dk.brics.automaton)

2019-04-11 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2708: -- Summary: urlfilter-automaton: update library dependency (dk.brics.automaton) Key: NUTCH-2708 URL: https://issues.apache.org/jira/browse/NUTCH-2708 Project: Nutch

[jira] [Comment Edited] (NUTCH-2690) Configurable and fast URL filter

2019-04-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815355#comment-16815355 ] Sebastian Nagel edited comment on NUTCH-2690 at 4/11/19 11:53 AM: -- PR

[jira] [Commented] (NUTCH-2690) Configurable and fast URL filter

2019-04-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815355#comment-16815355 ] Sebastian Nagel commented on NUTCH-2690: PR updated, squashed and rebased to current master. I'll

[jira] [Assigned] (NUTCH-2279) LinkRank fails when using Hadoop MR output compression

2019-04-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2279: -- Assignee: Sebastian Nagel > LinkRank fails when using Hadoop MR output compression >

[jira] [Assigned] (NUTCH-2700) Indexchecker: improve command-line help

2019-04-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2700: -- Assignee: Sebastian Nagel > Indexchecker: improve command-line help >

[jira] [Resolved] (NUTCH-2700) Indexchecker: improve command-line help

2019-04-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2700. Resolution: Implemented > Indexchecker: improve command-line help >

[jira] [Work started] (NUTCH-2700) Indexchecker: improve command-line help

2019-04-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-2700 started by Sebastian Nagel. -- > Indexchecker: improve command-line help >

[jira] [Commented] (NUTCH-2703) parse-tika: Boilerpipe should not run for non-(X)HTML pages

2019-04-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815282#comment-16815282 ] Sebastian Nagel commented on NUTCH-2703: +1 But I would opt to make it configurable. I'll open a

[jira] [Resolved] (NUTCH-2701) Fetcher: log dates and times also in human-readable form

2019-04-10 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2701. Resolution: Implemented Merged/committed. Thanks, [~markus17]! > Fetcher: log dates and

[jira] [Resolved] (NUTCH-2666) Increase default value for http.content.limit / ftp.content.limit / file.content.limit

2019-04-10 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2666. Resolution: Implemented Merged into master, will be available in 1.16. Thanks,

[jira] [Assigned] (NUTCH-2666) Increase default value for http.content.limit / ftp.content.limit / file.content.limit

2019-04-10 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2666: -- Assignee: Sebastian Nagel > Increase default value for http.content.limit /

[jira] [Resolved] (NUTCH-2683) DeduplicationJob: add option to prefer https:// over http://

2019-04-10 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2683. Resolution: Implemented > DeduplicationJob: add option to prefer https:// over http:// >

[jira] [Assigned] (NUTCH-2683) DeduplicationJob: add option to prefer https:// over http://

2019-04-10 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2683: -- Assignee: Sebastian Nagel > DeduplicationJob: add option to prefer https:// over

[jira] [Commented] (NUTCH-2688) Unify the licence headers

2019-04-09 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813732#comment-16813732 ] Sebastian Nagel commented on NUTCH-2688: Thanks, [~roannel]! In general, yes it would be

[jira] [Commented] (NUTCH-2706) -addBinaryContent flag can cause "String length must be a multiple of four" error in IndexingJob

2019-04-09 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813724#comment-16813724 ] Sebastian Nagel commented on NUTCH-2706: Hi [~pemanuel], can you share the document which causes

[jira] [Commented] (NUTCH-2334) Extension point for schedulers

2019-04-09 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813569#comment-16813569 ] Sebastian Nagel commented on NUTCH-2334: Hi [~roannel], I think now with the AND and OR voting

[jira] [Commented] (NUTCH-2669) Reliable solution for javax.ws packaging.type

2019-04-08 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812257#comment-16812257 ] Sebastian Nagel commented on NUTCH-2669: At least, IVY-1586 has been confirmed and assigned. Need

[jira] [Commented] (NUTCH-2707) protocol-okhttp fails to decompress content if Content-Encoding header is wrong

2019-04-07 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811949#comment-16811949 ] Sebastian Nagel commented on NUTCH-2707: Turns out that there are a few more servers which do not

[jira] [Updated] (NUTCH-2707) protocol-okhttp fails to decompress content if Content-Encoding header is wrong

2019-04-07 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2707: --- Summary: protocol-okhttp fails to decompress content if Content-Encoding header is wrong

[jira] [Commented] (NUTCH-2707) protocol-okhttp fails to decompress gzip-encoded content

2019-04-05 Thread Sebastian Nagel (JIRA)

[jira] [Created] (NUTCH-2707) protocol-okhttp fails to decompress gzip-encoded content

2019-04-05 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2707: -- Summary: protocol-okhttp fails to decompress gzip-encoded content Key: NUTCH-2707 URL: https://issues.apache.org/jira/browse/NUTCH-2707 Project: Nutch

[jira] [Created] (NUTCH-2705) urlfilter-validator rejects IPv6 URLs

2019-03-26 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2705: -- Summary: urlfilter-validator rejects IPv6 URLs Key: NUTCH-2705 URL: https://issues.apache.org/jira/browse/NUTCH-2705 Project: Nutch Issue Type: Bug

[jira] [Created] (NUTCH-2704) Upgrade crawler-commons dependency to 1.0

2019-03-25 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2704: -- Summary: Upgrade crawler-commons dependency to 1.0 Key: NUTCH-2704 URL: https://issues.apache.org/jira/browse/NUTCH-2704 Project: Nutch Issue Type:

[jira] [Assigned] (NUTCH-2701) Fetcher: log dates and times also in human-readable form

2019-03-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2701: -- Assignee: Sebastian Nagel > Fetcher: log dates and times also in human-readable form

[jira] [Commented] (NUTCH-2701) Fetcher: log dates and times also in human-readable form

2019-03-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16799010#comment-16799010 ] Sebastian Nagel commented on NUTCH-2701: PR which fixes the logging: {noformat} 19/03/22 11:50:48

[jira] [Updated] (NUTCH-2703) parse-tika: Boilerpipe should not run for non-(X)HTML pages

2019-03-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2703: --- Summary: parse-tika: Boilerpipe should not run for non-(X)HTML pages (was: Boilerpipe

[jira] [Updated] (NUTCH-2703) parse-tika: Boilerpipe should not run for non-(X)HTML pages

2019-03-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2703: --- Component/s: plugin > parse-tika: Boilerpipe should not run for non-(X)HTML pages >

[jira] [Created] (NUTCH-2702) Fetcher: suppress stack for frequent exceptions

2019-03-15 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2702: -- Summary: Fetcher: suppress stack for frequent exceptions Key: NUTCH-2702 URL: https://issues.apache.org/jira/browse/NUTCH-2702 Project: Nutch Issue

[jira] [Created] (NUTCH-2701) Fetcher: log dates and times also in human-readable form

2019-03-15 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2701: -- Summary: Fetcher: log dates and times also in human-readable form Key: NUTCH-2701 URL: https://issues.apache.org/jira/browse/NUTCH-2701 Project: Nutch

[jira] [Commented] (NUTCH-2669) Reliable solution for javax.ws packaging.type

2019-03-14 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16792447#comment-16792447 ] Sebastian Nagel commented on NUTCH-2669: Hi [~lewismc], one point is that the work-around to

[jira] [Created] (NUTCH-2700) Indexchecker: improve command-line help

2019-03-13 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2700: -- Summary: Indexchecker: improve command-line help Key: NUTCH-2700 URL: https://issues.apache.org/jira/browse/NUTCH-2700 Project: Nutch Issue Type:

[jira] [Created] (NUTCH-2699) Protocol-okhttp: needless loops to increment requested bytes counter when more content is already buffered

2019-03-13 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2699: -- Summary: Protocol-okhttp: needless loops to increment requested bytes counter when more content is already buffered Key: NUTCH-2699 URL:

[jira] [Assigned] (NUTCH-2696) Nutch SegmentReader does not dump non-ASCII characters with Hadoop 3.x

2019-03-06 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2696: -- Assignee: Sebastian Nagel > Nutch SegmentReader does not dump non-ASCII characters

[jira] [Commented] (NUTCH-2683) DeduplicationJob: add option to prefer https:// over http://

2019-03-06 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16785836#comment-16785836 ] Sebastian Nagel commented on NUTCH-2683: Any comments or objections? Thanks! Otherwise I'll

[jira] [Commented] (NUTCH-2666) Increase default value for http.content.limit / ftp.content.limit / file.content.limit

2019-03-06 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16785819#comment-16785819 ] Sebastian Nagel commented on NUTCH-2666: Any objections? It's a huge jump but it may be

[jira] [Commented] (NUTCH-2292) Mavenize the build for nutch-core and nutch-plugins

2019-03-05 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784851#comment-16784851 ] Sebastian Nagel commented on NUTCH-2292: Yes, we can try to get a GSoC project. At a first

[jira] [Commented] (NUTCH-2292) Mavenize the build for nutch-core and nutch-plugins

2019-03-02 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782415#comment-16782415 ] Sebastian Nagel commented on NUTCH-2292: Hi [~lewismc], hi [~thammegowda], I've tried to rebase

[jira] [Commented] (NUTCH-2697) Upgrade Ivy to fix the issue of an unset packaging.type property.

2019-02-26 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778154#comment-16778154 ] Sebastian Nagel commented on NUTCH-2697: Thanks! I also had no success in getting this sorted

[jira] [Commented] (NUTCH-2697) Upgrade Ivy to fix the issue of an unset packaging.type property.

2019-02-25 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776858#comment-16776858 ] Sebastian Nagel commented on NUTCH-2697: Hi [~chrisgavin], thanks for the patch/PR. Can you

[jira] [Updated] (NUTCH-2697) Upgrade Ivy to fix the issue of an unset packaging.type property.

2019-02-25 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2697: --- Affects Version/s: 1.16 > Upgrade Ivy to fix the issue of an unset packaging.type property.

[jira] [Updated] (NUTCH-2697) Upgrade Ivy to fix the issue of an unset packaging.type property.

2019-02-25 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2697: --- Fix Version/s: 1.16 > Upgrade Ivy to fix the issue of an unset packaging.type property. >

[jira] [Commented] (NUTCH-2695) Fix some alerts raised by LGTM

2019-02-25 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776634#comment-16776634 ] Sebastian Nagel commented on NUTCH-2695: Hi [~malcolmt], the build should succeed with the second

[jira] [Resolved] (NUTCH-2460) use the headless option of firefox and chrome in protocol-selenium

2019-02-23 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2460. Resolution: Implemented Implemented as part of NUTCH-2676. > use the headless option of

[jira] [Resolved] (NUTCH-2676) Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver

2019-02-23 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2676. Resolution: Fixed Successfully tested the chrome and gecko drivers again. Merged PR #430.

[jira] [Updated] (NUTCH-2696) Nutch SegmentReader does not dump non-ASCII characters with Hadoop 3.x

2019-02-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2696: --- Fix Version/s: 1.16 > Nutch SegmentReader does not dump non-ASCII characters with Hadoop 3.x

[jira] [Updated] (NUTCH-2696) Nutch SegmentReader does not dump non-ASCII characters with Hadoop 3.x

2019-02-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2696: --- Affects Version/s: 1.15 > Nutch SegmentReader does not dump non-ASCII characters with Hadoop

[jira] [Commented] (NUTCH-2696) Nutch SegmentReader does not dump non-ASCII characters with Hadoop 3.x

2019-02-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775641#comment-16775641 ] Sebastian Nagel commented on NUTCH-2696: Hi [~lhervaud], thanks for the bug report! This issue is

[jira] [Commented] (NUTCH-2692) Subcollection to support case-insensitive white and black lists

2019-02-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775308#comment-16775308 ] Sebastian Nagel commented on NUTCH-2692: Hi [~markus17], the commit added

[jira] [Resolved] (NUTCH-2684) Add README.md file to all indexer writers plugins

2019-02-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2684. Resolution: Fixed Thanks, [~roannel]! > Add README.md file to all indexer writers plugins

[jira] [Commented] (NUTCH-2692) Subcollection to support case-insensitive white and black lists

2019-02-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775264#comment-16775264 ] Sebastian Nagel commented on NUTCH-2692: +1 Ideally, the new property should also be described

[jira] [Resolved] (NUTCH-2693) Misspelled configuration property names in documentation

2019-02-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2693. Resolution: Fixed Committed/merged. > Misspelled configuration property names in

[jira] [Assigned] (NUTCH-2693) Misspelled configuration property names in documentation

2019-02-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2693: -- Assignee: Sebastian Nagel > Misspelled configuration property names in documentation

[jira] [Resolved] (NUTCH-2627) Fetcher to optionally filter URLs

2019-02-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2627. Resolution: Implemented Assignee: Sebastian Nagel Committed/merged. > Fetcher to

[jira] [Resolved] (NUTCH-2695) Fix some alerts raised by LGTM

2019-02-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2695. Resolution: Fixed Fix Version/s: 1.16 Fixed/merged. Thanks, [~malcolmt]! > Fix

[jira] [Assigned] (NUTCH-2695) Fix some alerts raised by LGTM

2019-02-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2695: -- Assignee: Sebastian Nagel > Fix some alerts raised by LGTM >

[jira] [Commented] (NUTCH-2695) Fix some alerts raised by LGTM

2019-02-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775152#comment-16775152 ] Sebastian Nagel commented on NUTCH-2695: Hi [~malcolmt], thanks! I'll merge the pull request and

[jira] [Commented] (NUTCH-2694) HostDB to aggregate by long instead of integer

2019-02-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775104#comment-16775104 ] Sebastian Nagel commented on NUTCH-2694: +1 but also requires a few changes in ResolverThread, see

[jira] [Updated] (NUTCH-2694) HostDB to aggregate by long instead of integer

2019-02-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2694: --- Attachment: NUTCH-2694-2.patch > HostDB to aggregate by long instead of integer >

[jira] [Commented] (NUTCH-2694) HostDB to aggregate by long instead of integer

2019-02-21 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774512#comment-16774512 ] Sebastian Nagel commented on NUTCH-2694: +1 A HostDatum has no version byte in its serialization

[jira] [Created] (NUTCH-2693) Misspelled configuration property names in documentation

2019-02-07 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2693: -- Summary: Misspelled configuration property names in documentation Key: NUTCH-2693 URL: https://issues.apache.org/jira/browse/NUTCH-2693 Project: Nutch

[jira] [Resolved] (NUTCH-2689) Speed up urlfilter-regex and urlfilter-automaton

2019-01-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2689. Resolution: Implemented Thanks, [~markus17]! Merged. > Speed up urlfilter-regex and

[jira] [Resolved] (NUTCH-2691) Improve logging from scoring-depth plugin

2019-01-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2691. Resolution: Implemented Merged. Thanks, [~yossi]! > Improve logging from scoring-depth

[jira] [Resolved] (NUTCH-2685) Add README.md file to all exchange plugins

2019-01-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2685. Resolution: Fixed Merged. Thanks, [~roannel]! > Add README.md file to all exchange

[jira] [Commented] (NUTCH-2691) Improve logging from scoring-depth plugin

2019-01-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748878#comment-16748878 ] Sebastian Nagel commented on NUTCH-2691: +1 > Improve logging from scoring-depth plugin >

[jira] [Created] (NUTCH-2690) Configurable and fast URL filter

2019-01-22 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2690: -- Summary: Configurable and fast URL filter Key: NUTCH-2690 URL: https://issues.apache.org/jira/browse/NUTCH-2690 Project: Nutch Issue Type: Improvement

[jira] [Resolved] (NUTCH-2686) Separate field for mime types mapped by index-more plugin

2019-01-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2686. Resolution: Fixed Thanks, [~roannel]! Resolving because PR is merged. The failure of unit

[jira] [Commented] (NUTCH-2689) Speed up urlfilter-regex and urlfilter-automaton

2019-01-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748765#comment-16748765 ] Sebastian Nagel commented on NUTCH-2689: The benchmark times before ... {noformat} % grep 'bench

[jira] [Assigned] (NUTCH-2689) Speed up urlfilter-regex and urlfilter-automaton

2019-01-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2689: -- Assignee: Sebastian Nagel > Speed up urlfilter-regex and urlfilter-automaton >

[jira] [Created] (NUTCH-2689) Speed up urlfilter-regex and urlfilter-automaton

2019-01-22 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2689: -- Summary: Speed up urlfilter-regex and urlfilter-automaton Key: NUTCH-2689 URL: https://issues.apache.org/jira/browse/NUTCH-2689 Project: Nutch Issue

[jira] [Resolved] (NUTCH-2682) Upgrade to Tika 1.20

2019-01-21 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2682. Resolution: Fixed Assignee: Sebastian Nagel > Upgrade to Tika 1.20 >

[jira] [Resolved] (NUTCH-2629) Documentation for CSV Index Writer

2019-01-21 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2629. Resolution: Fixed Thanks, [~roannel]! Looks good. I've also added a short note that the

[jira] [Assigned] (NUTCH-2680) Documentation: https supported by multiple protocol plugins not only httpclient

2019-01-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2680: -- Assignee: Sebastian Nagel > Documentation: https supported by multiple protocol

[jira] [Resolved] (NUTCH-2680) Documentation: https supported by multiple protocol plugins not only httpclient

2019-01-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2680. Resolution: Fixed > Documentation: https supported by multiple protocol plugins not only

[jira] [Resolved] (NUTCH-2663) Improve index-jexl-filter syntax for scripts

2019-01-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2663. Resolution: Fixed Merged PR. Thanks, [~jorgelbg]! > Improve index-jexl-filter syntax for

[jira] [Resolved] (NUTCH-2653) ProtocolFactory.getProtocol(url) creates separate plugin instances for http/https

2019-01-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2653. Resolution: Fixed Fix contained in NUTCH-2678. > ProtocolFactory.getProtocol(url)

[jira] [Commented] (NUTCH-2678) Allow for per-host configurable protocol plugin

2019-01-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746230#comment-16746230 ] Sebastian Nagel commented on NUTCH-2678: Yes. It includes the fix for the unit test. > Allow for

[jira] [Commented] (NUTCH-2678) Allow for per-host configurable protocol plugin

2019-01-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16745926#comment-16745926 ] Sebastian Nagel commented on NUTCH-2678: Hi [~markus17], great! I've tested everything again:

[jira] [Commented] (NUTCH-2688) Unify the licence headers

2019-01-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746056#comment-16746056 ] Sebastian Nagel commented on NUTCH-2688: Using block comments sounds reasonable. Also the rules

[jira] [Commented] (NUTCH-2687) Regex for reading title from Content-Disposition is wrong

2019-01-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746050#comment-16746050 ] Sebastian Nagel commented on NUTCH-2687: +1 Just for completion - the HTTP header for the given

[jira] [Commented] (NUTCH-2676) Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver

2019-01-17 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16745191#comment-16745191 ] Sebastian Nagel commented on NUTCH-2676: Thanks, [~virt], the PR looks promising! If done,

[jira] [Commented] (NUTCH-2676) Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver

2019-01-07 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735780#comment-16735780 ] Sebastian Nagel commented on NUTCH-2676: Great! Thanks! > Update to the latest selenium and add

[jira] [Updated] (NUTCH-2666) Increase default value for http.content.limit / ftp.content.limit / file.content.limit

2019-01-07 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2666: --- Fix Version/s: 1.16 > Increase default value for http.content.limit / ftp.content.limit / >

[jira] [Updated] (NUTCH-2666) Increase default value for http.content.limit / ftp.content.limit / file.content.limit

2019-01-07 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2666: --- Summary: Increase default value for http.content.limit / ftp.content.limit /
