[ https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784029#comment-17784029 ]
ASF GitHub Bot commented on NUTCH-3017:
---------------------------------------

sebastian-nagel commented on PR #793:
URL: https://github.com/apache/nutch/pull/793#issuecomment-1801814549

   Thanks, @jnioche!

   Merged into master, adding the lines to make use of Hadoop-provided compression codecs. Successfully tested in local and pseudo-distributed mode with various codecs (gzip / .gz, bzip2, ZStandard / .zst).

   One final note: if the fast-urlfilter is not found, the Nutch job (local mode) or the tasks (distributed mode) fail with an exception. I didn't change this behavior.


> Allow fast-urlfilter to load from HDFS/S3 and support gzipped input
> -------------------------------------------------------------------
>
>                 Key: NUTCH-3017
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3017
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin, urlfilter
>    Affects Versions: 1.19
>            Reporter: Julien Nioche
>            Priority: Minor
>             Fix For: 1.20
>
>
> This provides an easier way to refresh the resources since no rebuild of the
> jar will be needed. The path can point to either HDFS or S3. Additionally,
> .gz files should be handled automatically.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
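
For readers following the thread, the idea of picking a decompressor from the file extension can be sketched as below. This is only an illustration under stated assumptions, not the code merged in PR #793. According to the comment above, the merged change relies on Hadoop-provided compression codecs. This sketch uses only the Java standard library, handles gzip alone, and reads from the local file system rather than HDFS or S3. The class name `RuleFileReader` and its methods are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper: reads a rules file, decompressing it when it is gzipped. */
public class RuleFileReader {

    /** Opens the file and wraps it in a gzip decoder if the name ends in ".gz". */
    public static InputStream open(Path file) throws IOException {
        // Throws an IOException (NoSuchFileException) if the file is missing,
        // so a missing rules file fails loudly instead of being ignored.
        InputStream raw = Files.newInputStream(file);
        if (file.getFileName().toString().endsWith(".gz")) {
            try {
                return new GZIPInputStream(raw);
            } catch (IOException e) {
                raw.close(); // do not leak the underlying stream on a bad header
                throw e;
            }
        }
        return raw;
    }

    /** Returns all non-blank lines of the file, decoded as UTF-8. */
    public static List<String> readLines(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(open(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    lines.add(line);
                }
            }
        }
        return lines;
    }
}
```

In a Hadoop setting the extension check would normally be delegated to the codec lookup of the framework, which is what makes bzip2 and ZStandard work without extra code in the plugin.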