[jira] [Commented] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input

ASF GitHub Bot (Jira) Wed, 08 Nov 2023 04:42:22 -0800


    [ 
https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784029#comment-17784029
 ]


ASF GitHub Bot commented on NUTCH-3017:
---------------------------------------

sebastian-nagel commented on PR #793:
URL: https://github.com/apache/nutch/pull/793#issuecomment-1801814549

   Thanks, @jnioche!
   
   Merged into master, adding the lines to make use of Hadoop-provided 
compression codecs.
   
   Successfully tested in local and pseudo-distributed mode with various codecs 
(gzip / .gz, bzip2, ZStandard / .zst).
   
   One final note: if the fast-urlfilter rule file is not found, the Nutch job 
(local mode) or the tasks (distributed mode) fail with an exception. I didn't 
change this behavior.
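   
   For readers following along, here is a minimal sketch of the two behaviors 
mentioned above: choosing a decompressor from the file extension, and failing 
with an exception when the file is missing (reading the note above as referring 
to the rule file). It is not the code merged in this PR, which relies on the 
Hadoop-provided compression codecs. It uses only the JDK, handles only `.gz`, 
and reads from the local file system; the class name `RuleFileSketch` and its 
methods are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper, not part of Nutch: opens a rule file, decompressing by extension. */
final class RuleFileSketch {

  /** Returns a stream over the file, wrapped in a gzip decoder if the name ends in ".gz". */
  static InputStream open(Path file) throws IOException {
    // Files.newInputStream throws an IOException if the file does not exist,
    // so a missing rule file fails loudly instead of yielding an empty filter.
    InputStream raw = Files.newInputStream(file);
    if (file.getFileName().toString().endsWith(".gz")) {
      try {
        return new GZIPInputStream(raw);
      } catch (IOException e) {
        raw.close(); // do not leak the underlying stream on a bad gzip header
        throw e;
      }
    }
    return raw;
  }

  /** Reads all lines of the (possibly gzipped) file as UTF-8 text. */
  static List<String> readLines(Path file) throws IOException {
    List<String> lines = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(open(file), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        lines.add(line);
      }
    }
    return lines;
  }
}
```

   Selecting the decoder from the extension keeps the caller unaware of the 
compression format; a codec registry would extend the same idea to bzip2 and 
ZStandard.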




> Allow fast-urlfilter to load from HDFS/S3 and support gzipped input
> -------------------------------------------------------------------
>
>                 Key: NUTCH-3017
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3017
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin, urlfilter
>    Affects Versions: 1.19
>            Reporter: Julien Nioche
>            Priority: Minor
>             Fix For: 1.20
>
>
> This provides an easier way to refresh the resources, since no rebuild of the 
> jar is needed. The path can point to either HDFS or S3. Additionally, 
> .gz files should be handled automatically.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
