Sebastian Nagel created NUTCH-3039:
--------------------------------------

Summary: Failure to handle ftp:// URLs
Key: NUTCH-3039
URL: https://issues.apache.org/jira/browse/NUTCH-3039
Project: Nutch
Issue Type: Bug
Components: plugin, protocol
Affects Versions: 1.19
Reporter: Sebastian Nagel
Fix For: 1.21

Nutch fails to handle ftp:// URLs:

- URLNormalizerBasic returns the empty string because creating the URL instance fails with a MalformedURLException:
{noformat}
echo "ftp://ftp.example.com/path/file.txt" \
  | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic
{noformat}
- fetching an ftp:// URL with the protocol-ftp plugin enabled also fails due to a MalformedURLException:
{noformat}
bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \
  "ftp://ftp.example.com/path/file.txt"
...
Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException
  at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113)
...
{noformat}

The issue is caused by NUTCH-2429:
- we do not provide a dedicated URL stream handler for ftp URLs
- but also do not pass ftp:// URLs to the standard JVM handler
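
To illustrate the second option named in the root cause (passing ftp:// URLs to the standard JVM handler), here is a minimal sketch. It is not the Nutch implementation: the class names FtpUrlSketch, DeferringFactory and ParseOnlyHandler and the JVM_PROTOCOLS list are hypothetical and chosen for illustration only. It relies on the documented JDK behaviour that a URLStreamHandlerFactory returning null for a protocol makes the JVM fall back to its built-in handler lookup.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.Set;

public class FtpUrlSketch {

  /** Assumption: protocols this sketch leaves to the JVM's built-in handlers. */
  static final Set<String> JVM_PROTOCOLS =
      Set.of("http", "https", "file", "jar", "ftp");

  /** Hypothetical placeholder: lets URLs parse but cannot open connections. */
  static class ParseOnlyHandler extends URLStreamHandler {
    @Override
    protected URLConnection openConnection(URL u) throws IOException {
      throw new IOException("no connection support for " + u.getProtocol());
    }
  }

  /** Hypothetical factory (not Nutch's): defers listed protocols to the JVM. */
  static class DeferringFactory implements URLStreamHandlerFactory {
    @Override
    public URLStreamHandler createURLStreamHandler(String protocol) {
      if (JVM_PROTOCOLS.contains(protocol)) {
        // null means "this factory has no handler": the JVM then tries
        // its own built-in handlers, which include one for ftp.
        return null;
      }
      return new ParseOnlyHandler();
    }
  }

  /** True if the URL constructor accepts the string, false on MalformedURLException. */
  static boolean isParseable(String spec) {
    try {
      new URL(spec); // deprecated since Java 20, but mirrors "creating the URL instance"
      return true;
    } catch (MalformedURLException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // The factory is deliberately NOT installed via URL.setURLStreamHandlerFactory,
    // because that call is allowed only once per JVM; we only inspect its decision.
    URLStreamHandlerFactory factory = new DeferringFactory();
    System.out.println("factory defers ftp: "
        + (factory.createURLStreamHandler("ftp") == null));
    System.out.println("ftp URL parses: "
        + isParseable("ftp://ftp.example.com/path/file.txt"));
  }
}
```

In a plain JVM with no custom factory installed, isParseable shows whether the built-in handlers accept a given scheme, which is a quick way to check whether a MalformedURLException originates from the JVM or from an installed factory.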