[jira] [Closed] (NUTCH-2875) TEST

2022-08-09 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel closed NUTCH-2875. -- Resolution: Invalid > TEST > > > Key: NUTCH-2875 >

[jira] [Closed] (NUTCH-2874) TEST

2022-08-09 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel closed NUTCH-2874. -- Resolution: Invalid > TEST > > > Key: NUTCH-2874 >

[jira] [Updated] (NUTCH-2875) TEST

2022-08-09 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2875: --- Fix Version/s: (was: 1.19) > TEST > > > Key

[jira] [Reopened] (NUTCH-2875) TEST

2022-08-09 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reopened NUTCH-2875: > TEST > > > Key: NUTCH-2875 >

[jira] [Updated] (NUTCH-2874) TEST

2022-08-09 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2874: --- Fix Version/s: (was: 1.19) > TEST > > > Key

[jira] [Reopened] (NUTCH-2874) TEST

2022-08-09 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reopened NUTCH-2874: > TEST > > > Key: NUTCH-2874 >

[jira] [Updated] (NUTCH-2293) Make the unit tests which requires "plugin.folders" as integration tests

2022-08-09 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2293: --- Fix Version/s: (was: 1.19) > Make the unit tests which requires "plugin

[jira] [Updated] (NUTCH-2292) Mavenize the build for nutch-core and nutch-plugins

2022-08-09 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2292: --- Fix Version/s: (was: 1.19) > Mavenize the build for nutch-core and nutch-plug

[jira] [Updated] (NUTCH-2638) Publish plugins in Maven

2022-08-09 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2638: --- Fix Version/s: (was: 1.19) > Publish plugins in Ma

[jira] [Commented] (NUTCH-2953) Indexer Elastic to ignore SSL issues

2022-08-08 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17576804#comment-17576804 ] Sebastian Nagel commented on NUTCH-2953: Hi [~markus17], transformed the patches into a pull

[jira] [Created] (NUTCH-2958) Upgrade to crawler-commons 1.3

2022-08-08 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2958: -- Summary: Upgrade to crawler-commons 1.3 Key: NUTCH-2958 URL: https://issues.apache.org/jira/browse/NUTCH-2958 Project: Nutch Issue Type: Improvement

[jira] [Assigned] (NUTCH-2958) Upgrade to crawler-commons 1.3

2022-08-08 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2958: -- Assignee: Sebastian Nagel > Upgrade to crawler-commons

[jira] [Assigned] (NUTCH-2957) indexer-solr / Solr schema: add fall-back field definitions for unknown index fields

2022-08-06 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2957: -- Assignee: Sebastian Nagel > indexer-solr / Solr schema: add fall-back fi

[jira] [Created] (NUTCH-2957) indexer-solr / Solr schema: add fall-back field definitions for unknown index fields

2022-08-06 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2957: -- Summary: indexer-solr / Solr schema: add fall-back field definitions for unknown index fields Key: NUTCH-2957 URL: https://issues.apache.org/jira/browse/NUTCH-2957

[jira] [Updated] (NUTCH-2957) indexer-solr / Solr schema: add fall-back field definitions for unknown index fields

2022-08-06 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2957: --- Fix Version/s: 1.19 > indexer-solr / Solr schema: add fall-back field definiti

[jira] [Updated] (NUTCH-2957) indexer-solr / Solr schema: add fall-back field definitions for unknown index fields

2022-08-06 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2957: --- Affects Version/s: 1.18 > indexer-solr / Solr schema: add fall-back field definiti

[jira] [Assigned] (NUTCH-2956) index-geoip: dependency upgrades and improvements

2022-08-06 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2956: -- Assignee: Sebastian Nagel > index-geoip: dependency upgrades and improveme

[jira] [Assigned] (NUTCH-2955) indexer-solr: replace deprecated/removed field type solr.LatLonType

2022-08-06 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2955: -- Assignee: Sebastian Nagel > indexer-solr: replace deprecated/removed field t

[jira] [Updated] (NUTCH-2956) index-geoip: dependency upgrades and improvements

2022-08-06 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2956: --- Description: Upgrades and improvements to the index-geoip plugin: - upgrade the geoip2

[jira] [Updated] (NUTCH-2956) index-geoip: dependency upgrades and improvements

2022-08-06 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2956: --- Description: Upgrades and improvements to the index-geoip plugin: - upgrade the geoip2

[jira] [Created] (NUTCH-2956) index-geoip: dependency upgrades and improvements

2022-08-06 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2956: -- Summary: index-geoip: dependency upgrades and improvements Key: NUTCH-2956 URL: https://issues.apache.org/jira/browse/NUTCH-2956 Project: Nutch Issue

[jira] [Created] (NUTCH-2955) indexer-solr: replace deprecated/removed field type solr.LatLonType

2022-08-05 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2955: -- Summary: indexer-solr: replace deprecated/removed field type solr.LatLonType Key: NUTCH-2955 URL: https://issues.apache.org/jira/browse/NUTCH-2955 Project: Nutch

[jira] [Created] (NUTCH-2954) Add unit tests for URLStreamHandlerFactory

2022-06-24 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2954: -- Summary: Add unit tests for URLStreamHandlerFactory Key: NUTCH-2954 URL: https://issues.apache.org/jira/browse/NUTCH-2954 Project: Nutch Issue Type

[jira] [Assigned] (NUTCH-2930) Protocol-okhttp: implement IP filter

2022-06-23 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2930: -- Assignee: Sebastian Nagel > Protocol-okhttp: implement IP fil

[jira] [Updated] (NUTCH-2953) Indexer Elastic to ignore SSL issues

2022-06-23 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2953: --- Labels: PatchAvailable patch-available (was: ) > Indexer Elastic to ignore SSL iss

[jira] [Resolved] (NUTCH-2827) Migrate site repository

2022-06-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2827. Resolution: Implemented > Migrate site reposit

[jira] [Commented] (NUTCH-2953) Indexer Elastic to ignore SSL issues

2022-06-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556856#comment-17556856 ] Sebastian Nagel commented on NUTCH-2953: Hi [~markus17], with some extra imports I was able

[jira] [Updated] (NUTCH-2953) Indexer Elastic to ignore SSL issues

2022-06-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2953: --- Attachment: NUTCH-2953-2.patch > Indexer Elastic to ignore SSL iss

[jira] [Commented] (NUTCH-2953) Indexer Elastic to ignore SSL issues

2022-06-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556829#comment-17556829 ] Sebastian Nagel commented on NUTCH-2953: Ok, but the patch does not cleanly apply to the recent

[jira] [Updated] (NUTCH-2953) Indexer Elastic to ignore SSL issues

2022-06-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2953: --- Fix Version/s: 1.19 > Indexer Elastic to ignore SSL iss

[jira] [Commented] (NUTCH-2953) Indexer Elastic to ignore SSL issues

2022-06-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556827#comment-17556827 ] Sebastian Nagel commented on NUTCH-2953: +1 patch looks good! > Indexer Elastic to ignore

[jira] [Resolved] (NUTCH-2831) Elastic indexer does not support SSL

2022-06-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2831. Resolution: Fixed > Elastic indexer does not support

[jira] [Closed] (NUTCH-2073) Unable to create index on elasticsearch through nutch

2022-06-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel closed NUTCH-2073. -- Resolution: Fixed The 2.x branch is not supported anymore. > Unable to create in

[jira] [Closed] (NUTCH-2806) Nutch can't parse links

2022-06-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel closed NUTCH-2806. -- > Nutch can't parse links > > > Key

[jira] [Resolved] (NUTCH-2806) Nutch can't parse links

2022-06-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2806. Resolution: Won't Fix The 2.x branch is not supported anymore. > Nutch can't parse li

[jira] [Commented] (NUTCH-1611) Elastic Search Indexer Creates field in elastic search "boost" as a string value, so cannot be used in custom boost queries

2022-06-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-1611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556821#comment-17556821 ] Sebastian Nagel commented on NUTCH-1611: Need to clarify whether this should be fixed for 1.x

[jira] [Updated] (NUTCH-1611) Elastic Search Indexer Creates field in elastic search "boost" as a string value, so cannot be used in custom boost queries

2022-06-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-1611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1611: --- Affects Version/s: 1.16 > Elastic Search Indexer Creates field in elastic search "

[jira] [Resolved] (NUTCH-2951) Crawl datum with metadata WRITABLE_GENERATE_TIME_KEY awaits fetching forever

2022-06-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2951. Resolution: Fixed Fixed. Thanks again, [~Lapax]! > Crawl datum with metad

[jira] [Commented] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used

2022-06-15 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554702#comment-17554702 ] Sebastian Nagel commented on NUTCH-2936: Update: the issue is reproducible also in local mode

[jira] [Assigned] (NUTCH-2952) Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2)

2022-06-15 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2952: -- Assignee: Sebastian Nagel > Upgrade core dependencies (Hadoop 3.3.3, log4j 2.1

[jira] [Created] (NUTCH-2952) Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2)

2022-06-15 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2952: -- Summary: Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2) Key: NUTCH-2952 URL: https://issues.apache.org/jira/browse/NUTCH-2952 Project: Nutch

[jira] [Commented] (NUTCH-2949) Tasks of a multi-threaded map runner may fail because of slow creation of URL stream handlers

2022-06-15 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554564#comment-17554564 ] Sebastian Nagel commented on NUTCH-2949: This is addressed in [PR#733|https://github.com/apache

[jira] [Commented] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used

2022-06-15 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554531#comment-17554531 ] Sebastian Nagel commented on NUTCH-2936: After debugging this: the call by the Hadoop MR Job

[jira] [Commented] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used

2022-06-14 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17553996#comment-17553996 ] Sebastian Nagel commented on NUTCH-2936: I was able to reproduce the issue in pseudo-distributed

[jira] [Updated] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used

2022-06-14 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2936: --- Summary: Early registration of URL stream handlers provided by plugins may fail Hadoop jobs

[jira] [Assigned] (NUTCH-2951) Crawl datum with metadata WRITABLE_GENERATE_TIME_KEY awaits fetching forever

2022-06-13 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2951: -- Assignee: Sebastian Nagel > Crawl datum with metadata WRITABLE_GENERATE_TIME_

[jira] [Commented] (NUTCH-2951) Crawl datum with metadata WRITABLE_GENERATE_TIME_KEY awaits fetching forever

2022-06-01 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17544913#comment-17544913 ] Sebastian Nagel commented on NUTCH-2951: Thanks, [~Lapax] - good catch! Feel free to provide

[jira] [Updated] (NUTCH-2951) Crawl datum with metadata WRITABLE_GENERATE_TIME_KEY awaits fetching forever

2022-06-01 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2951: --- Fix Version/s: 1.19 > Crawl datum with metadata WRITABLE_GENERATE_TIME_KEY awaits fetch

[jira] [Resolved] (NUTCH-2950) UpdateHostDb: performance improvements

2022-05-24 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2950. Resolution: Implemented > UpdateHostDb: performance improveme

[jira] [Commented] (NUTCH-2950) UpdateHostDb: performance improvements

2022-05-20 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540148#comment-17540148 ] Sebastian Nagel commented on NUTCH-2950: Thanks, [~markus17]! I'll merge the PR soon - all checks

[jira] [Commented] (NUTCH-2950) UpdateHostDb: performance improvements

2022-05-19 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539617#comment-17539617 ] Sebastian Nagel commented on NUTCH-2950: If desired I could also split the issue/PR into multiple

[jira] [Created] (NUTCH-2950) UpdateHostDb: performance improvements

2022-05-19 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2950: -- Summary: UpdateHostDb: performance improvements Key: NUTCH-2950 URL: https://issues.apache.org/jira/browse/NUTCH-2950 Project: Nutch Issue Type

[jira] [Resolved] (NUTCH-2946) Fetcher: optionally slow down fetching from hosts with repeated exceptions

2022-05-19 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2946. Resolution: Implemented > Fetcher: optionally slow down fetching from hosts with repea

[jira] [Created] (NUTCH-2949) Tasks of a multi-threaded map runner may fail because of slow creation of URL stream handlers

2022-05-19 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2949: -- Summary: Tasks of a multi-threaded map runner may fail because of slow creation of URL stream handlers Key: NUTCH-2949 URL: https://issues.apache.org/jira/browse/NUTCH-2949

[jira] [Resolved] (NUTCH-2948) Upgrade dependencies to Any23 2.7 and Tika 2.3.0

2022-05-12 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2948. Resolution: Implemented > Upgrade dependencies to Any23 2.7 and Tika 2.

[jira] [Assigned] (NUTCH-2948) Upgrade dependencies to Any23 2.7 and Tika 2.3.0

2022-05-12 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2948: -- Assignee: Sebastian Nagel > Upgrade dependencies to Any23 2.7 and Tika 2.

[jira] [Created] (NUTCH-2948) Upgrade dependencies to Any23 2.7 and Tika 2.3.0

2022-05-05 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2948: -- Summary: Upgrade dependencies to Any23 2.7 and Tika 2.3.0 Key: NUTCH-2948 URL: https://issues.apache.org/jira/browse/NUTCH-2948 Project: Nutch Issue

[jira] [Commented] (NUTCH-2946) Fetcher: optionally slow down fetching from hosts with repeated exceptions

2022-05-04 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531662#comment-17531662 ] Sebastian Nagel commented on NUTCH-2946: Hi [~markus17], thanks for your remarks! An [exponential

[jira] [Commented] (NUTCH-2946) Fetcher: optionally slow down fetching from hosts with repeated exceptions

2022-05-03 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531245#comment-17531245 ] Sebastian Nagel commented on NUTCH-2946: > If you'd prefer this to be optional, i would pre

[jira] [Created] (NUTCH-2947) Fetcher: keep state of empty fetch queues unless queue feeder is finished

2022-05-03 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2947: -- Summary: Fetcher: keep state of empty fetch queues unless queue feeder is finished Key: NUTCH-2947 URL: https://issues.apache.org/jira/browse/NUTCH-2947 Project

[jira] [Created] (NUTCH-2946) Fetcher: optionally slow down fetching from hosts with repeated exceptions

2022-05-03 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2946: -- Summary: Fetcher: optionally slow down fetching from hosts with repeated exceptions Key: NUTCH-2946 URL: https://issues.apache.org/jira/browse/NUTCH-2946 Project

[jira] [Commented] (NUTCH-2945) Solr Index Writer pluging schema.xml missing a copyToField

2022-05-03 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531162#comment-17531162 ] Sebastian Nagel commented on NUTCH-2945: Hi [~dfisla], could you explain where the field

[jira] [Updated] (NUTCH-2945) Solr Index Writer pluging schema.xml missing a copyToField

2022-05-03 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2945: --- Fix Version/s: 1.19 > Solr Index Writer pluging schema.xml missing a copyToFi

[jira] [Commented] (NUTCH-2831) Elastic indexer does not support SSL

2022-05-03 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531095#comment-17531095 ] Sebastian Nagel commented on NUTCH-2831: see also https://stackoverflow.com/questions/72085504

[jira] [Deleted] (NUTCH-2942) it is best form of software

2022-04-05 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel deleted NUTCH-2942: --- > it is best form of software > --- > > Key

[jira] [Updated] (NUTCH-2942) it is best form of software

2022-04-05 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2942: --- Labels: spam (was: ) > it is best form of softw

[jira] [Updated] (NUTCH-2923) Add Job Id in Job Failure messages

2022-01-27 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2923: --- Affects Version/s: 1.18 > Add Job Id in Job Failure messa

[jira] [Resolved] (NUTCH-2923) Add Job Id in Job Failure messages

2022-01-27 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2923. Fix Version/s: 1.19 Resolution: Implemented > Add Job Id in Job Failure messa

Re: [jira] [Commented] (NUTCH-122) block numbers need a better random number generator

2022-01-24 Thread Sebastian Nagel
Deletion of these spam comments is addressed in https://issues.apache.org/jira/browse/INFRA-22787 On 1/24/22 15:23, pankaj kumar singh (Jira) wrote: > > [ >

[jira] [Resolved] (NUTCH-2573) Suspend crawling if robots.txt fails to fetch with 5xx status

2022-01-17 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2573. Resolution: Implemented > Suspend crawling if robots.txt fails to fetch with 5xx sta

[jira] [Resolved] (NUTCH-2935) DeduplicationJob: failure on URLs with invalid percent encoding

2022-01-17 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2935. Resolution: Fixed > DeduplicationJob: failure on URLs with invalid percent encod

[jira] [Updated] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode

2022-01-15 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2936: --- Priority: Blocker (was: Major) > Early registration of URL stream handlers provi

[jira] [Commented] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode

2022-01-15 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17476615#comment-17476615 ] Sebastian Nagel commented on NUTCH-2936: Using protocol-okhttp causes parsechecker to raise

[jira] [Created] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode

2022-01-15 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2937: -- Summary: parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode Key: NUTCH-2937 URL: https://issues.apache.org/jira/browse/NUTCH-2937

[jira] [Updated] (NUTCH-2573) Suspend crawling if robots.txt fails to fetch with 5xx status

2022-01-14 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2573: --- Description: Fetcher should optionally (by default) suspend crawling by a configurable

[jira] [Assigned] (NUTCH-2573) Suspend crawling if robots.txt fails to fetch with 5xx status

2022-01-14 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2573: -- Assignee: Sebastian Nagel > Suspend crawling if robots.txt fails to fetch with

[jira] [Created] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode

2022-01-14 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2936: -- Summary: Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode Key: NUTCH-2936 URL: https://issues.apache.org/jira

[jira] [Resolved] (NUTCH-2929) Fetcher: start threads slowly to avoid that resources are temporarily exhausted

2022-01-14 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2929. Resolution: Implemented Thanks for the reviews, [~markus17] and [~lewismc]! > Fetc

[jira] [Created] (NUTCH-2935) DeduplicationJob: failure on URLs with invalid percent encoding

2022-01-14 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2935: -- Summary: DeduplicationJob: failure on URLs with invalid percent encoding Key: NUTCH-2935 URL: https://issues.apache.org/jira/browse/NUTCH-2935 Project: Nutch

[jira] [Commented] (NUTCH-2929) Fetcher: start threads slowly to avoid that resources are temporarily exhausted

2022-01-11 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17473091#comment-17473091 ] Sebastian Nagel commented on NUTCH-2929: It wasn't that many Tika warnings: - (non-parsing

[jira] [Created] (NUTCH-2930) Protocol-okhttp: implement IP filter

2022-01-11 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2930: -- Summary: Protocol-okhttp: implement IP filter Key: NUTCH-2930 URL: https://issues.apache.org/jira/browse/NUTCH-2930 Project: Nutch Issue Type

[jira] [Created] (NUTCH-2929) Fetcher: start threads slowly to avoid that resources are temporarily exhausted

2022-01-11 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2929: -- Summary: Fetcher: start threads slowly to avoid that resources are temporarily exhausted Key: NUTCH-2929 URL: https://issues.apache.org/jira/browse/NUTCH-2929

[jira] [Commented] (NUTCH-2924) Generate maxCount expr evaluated only once

2022-01-11 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17472731#comment-17472731 ] Sebastian Nagel commented on NUTCH-2924: Hi [~markus17], see NUTCH-2455 and [PR #254|https

Re: Addressing Nutch use of CMS WAS: [IMPORTANT] - ci.apache.org and CMS Shutdown end of January 2022

2022-01-09 Thread Sebastian Nagel
Hi Gavin, > https://svn.apache.org/repos/infra/websites/production/nutch/content/ > I assume you no longer use that area and I can safely remove? No, we do not need it anymore. Would be nice if https://svn.apache.org/repos/asf/nutch/cms_site/ could stay for some more time - just in case we

Re: !! Join the #nutch Slack channel !!

2022-01-09 Thread Sebastian Nagel
Thanks, Lewis! I've joined - but still it's quiet there... Sebastian On 12/29/21 21:41, lewis john mcgibbney wrote: > Hi user@, dev@, > I took the liberty of setting up a #nutch channel for our community to > communicate in a lower latency manner. > First join the-asf.slack.com

[jira] [Created] (NUTCH-2928) Fix favicon of content pages

2022-01-09 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2928: -- Summary: Fix favicon of content pages Key: NUTCH-2928 URL: https://issues.apache.org/jira/browse/NUTCH-2928 Project: Nutch Issue Type: Bug

[jira] [Updated] (NUTCH-1999) Add http://nutch.apache.org/robots.txt

2022-01-09 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1999: --- Fix Version/s: (was: 1.19) > Add http://nutch.apache.org/robots.

[jira] [Created] (NUTCH-2927) indexer-elastic: use Java API client

2022-01-09 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2927: -- Summary: indexer-elastic: use Java API client Key: NUTCH-2927 URL: https://issues.apache.org/jira/browse/NUTCH-2927 Project: Nutch Issue Type

[jira] [Resolved] (NUTCH-2903) Unable to Connect to Elasticsearch over HTTPS

2022-01-09 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2903. Resolution: Fixed > Unable to Connect to Elasticsearch over HT

[jira] [Resolved] (NUTCH-2922) Upgrade to log4j 2.17.0

2021-12-22 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2922. Resolution: Fixed > Upgrade to log4j 2.1

[jira] [Resolved] (NUTCH-2917) Remove transitive dependency to log4j 1.x

2021-12-22 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2917. Resolution: Fixed Committed. Thanks for the reviews! > Remove transitive depende

[jira] [Commented] (NUTCH-2921) DepthScoringFilter option to reset max_depth

2021-12-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17463232#comment-17463232 ] Sebastian Nagel commented on NUTCH-2921: Hi [~markus17], yes, that makes sense

[jira] [Commented] (NUTCH-2917) Remove transitive dependency to log4j 1.x

2021-12-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17463210#comment-17463210 ] Sebastian Nagel commented on NUTCH-2917: Thanks, [~markus17]! Removing it should be safe

[jira] [Commented] (NUTCH-2917) Remove transitive dependency to log4j 1.x

2021-12-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17463208#comment-17463208 ] Sebastian Nagel commented on NUTCH-2917: Tested in local mode: logging still works for Hadoop

[jira] [Assigned] (NUTCH-2917) Remove transitive dependency to log4j 1.x

2021-12-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2917: -- Assignee: Sebastian Nagel > Remove transitive dependency to log4j

[jira] [Created] (NUTCH-2922) Upgrade to log4j 2.17.0

2021-12-21 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2922: -- Summary: Upgrade to log4j 2.17.0 Key: NUTCH-2922 URL: https://issues.apache.org/jira/browse/NUTCH-2922 Project: Nutch Issue Type: Bug

[jira] [Commented] (NUTCH-2912) CrawlDatumProcessor to calculate crawl completeness

2021-12-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17463104#comment-17463104 ] Sebastian Nagel commented on NUTCH-2912: Looks good! I've added the new

[jira] [Commented] (NUTCH-2921) DepthScoringFilter option to reset max_depth

2021-12-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17463091#comment-17463091 ] Sebastian Nagel commented on NUTCH-2921: Hi [~markus17], the attached patch is for NUTCH-2912

[jira] [Resolved] (NUTCH-2914) nutch-default.xml: remove obsolete and unused properties

2021-12-17 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2914. Resolution: Implemented > nutch-default.xml: remove obsolete and unused propert

[jira] [Resolved] (NUTCH-2807) SitemapProcessor to warn that ignoring robots.txt affects detection of sitemaps

2021-12-17 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2807. Resolution: Implemented > SitemapProcessor to warn that ignoring robots.txt affe
