[ https://issues.apache.org/jira/browse/HADOOP-14444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16021012#comment-16021012 ]

Steve Loughran commented on HADOOP-14444:
-----------------------------------------

Hit the Submit Patch button so Yetus can run against the patch.

Given how unloved that code has been, it's good to have the new stuff. But: no 
more dependencies in Hadoop common. 

I propose 

# Have a new {{hadoop-tools/hadoop-ftp}} module for this
# Make all tests which require an FTP server integration tests, prefixed {{ITest}} and run by failsafe. Look at hadoop-aws for this, and note how the parallel execution works, *provided all tests use unique paths*
# Move all existing FTP stuff over there, with tests
# Adopt the S3 policy: [no patches without declaration of test endpoint|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/testing.md]. Here something like "Ubuntu 16 ftpd" would suffice
# Go for full FSContractTest coverage; a sketch of one such test follows this list
# If we are happy, cut the older FTP client out.
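A minimal sketch of one contract test, assuming a hypothetical {{FTPContract}} class (the FTP analogue of hadoop-aws's {{S3AContract}}) binding the shared contract tests from hadoop-common to the new filesystem:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.contract.AbstractContractOpenTest;
import org.apache.hadoop.fs.contract.AbstractFSContract;

// The ITest prefix keeps this out of the unit-test (surefire) run and in
// the integration-test (failsafe) run; FTPContract is an assumed class.
public class ITestFTPContractOpen extends AbstractContractOpenTest {

  @Override
  protected AbstractFSContract createContract(Configuration conf) {
    // binds the generic open() contract tests to the FTP FS under test
    return new FTPContract(conf);
  }
}
{code}

One such subclass per contract test (open, create, delete, rename, and so on) gets you the full coverage of item 5.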


This'll be Hadoop 3.x only, presumably, given the use of Java 8 streams. And given 
the time it invariably takes to get a new FS client to stabilise (one public 
release, generally), I think it'd be good to find some other volunteers to help 
nurture this in. I am not going to do that, as I have too many other 
commitments in the object store space.

Quick look at bits of the code, without going into the real details

{code}
LOGGER.debug(String.format(ErrorStrings.E_FILE_NOTFOUND, f));
{code}
As well as not using the name "LOG", this is very inefficient; it formats 
the string even when debug logging is disabled. Use SLF4J's own mechanism:

{code}
LOG.debug("File not found {}", f);
{code}

Some log statements (e.g. {{LOGGER.info("Finish glob processing")}}) should go to 
debug, as they aren't relevant to users.

h4. {{AbstractFTPFileSystem}}

L557. Don't catch and wrap here. It only loses subclass info for no benefit. If 
you must, use {{NetUtils.wrapException}} and add in the extra diagnostics info 
expected there.
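If wrapping really is needed, a sketch of the pattern ({{connectClient()}} and the host/port parameters are hypothetical; {{NetUtils.wrapException}} is the existing helper):

{code}
import java.io.IOException;
import org.apache.hadoop.net.NetUtils;

  private void connect(String host, int port) throws IOException {
    try {
      connectClient(host, port);  // hypothetical low-level connect call
    } catch (IOException e) {
      // keeps the exception class where it can, and adds source/destination
      // host:port diagnostics to the message
      throw NetUtils.wrapException(host, port,
          NetUtils.getHostname(), 0, e);
    }
  }
{code}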

L600: {{open}} should throw {{FileNotFoundException}} on a directory. 
Similarly, {{create}} raises {{FileAlreadyExistsException}}. These should all be 
documented in the [FS 
spec|http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/filesystem.html].

If you set up the FS contract tests with "Strict" exception handling, they'll 
make sure the exception classes match HDFS's.
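Strict handling is just a contract option; in a hypothetical contract definition resource for the module (say {{src/test/resources/contract/ftp.xml}}) it would look something like:

{code}
<!-- enable strict exception checking in the contract tests -->
<property>
  <name>fs.contract.supports-strict-exceptions</name>
  <value>true</value>
</property>
{code}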


h4. {{ConnectionInfo}}

Just use Guava's {{Preconditions.checkArgument}} to validate things; it raises 
{{IllegalArgumentException}}, and does the string construction for you, only 
when the check fails.
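For example, a sketch against an assumed host/port pair (the fields are illustrative; Guava's message templates use {{%s}} placeholders):

{code}
import com.google.common.base.Preconditions;

public class ConnectionInfo {
  private final String host;
  private final int port;

  ConnectionInfo(String host, int port) {
    // each failed check raises IllegalArgumentException; the message is
    // only formatted on failure
    Preconditions.checkArgument(host != null && !host.isEmpty(),
        "Empty FTP host");
    Preconditions.checkArgument(port > 0 && port <= 65535,
        "Invalid port for %s: %s", host, port);
    this.host = host;
    this.port = port;
  }
}
{code}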

It looks like you are trying to support per-FTP-endpoint config of user & 
password. Take a look at what we have done in S3A with per-bucket config, all 
done in {{S3AUtils}}. We could maybe pull that up and make it more generic for all 
the FS clients: FTP and the object stores, specifically. A rough sketch of the 
generic version follows.
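The {{fs.ftp.host.*}} key layout here is made up, mirroring the shape of S3A's {{fs.s3a.bucket.*}} per-bucket keys:

{code}
import java.util.Map;
import org.apache.hadoop.conf.Configuration;

public final class EndpointConfig {

  /**
   * Copy per-endpoint overrides down onto the base keys: a key such as
   * fs.ftp.host.example.org.user overrides fs.ftp.user when connecting
   * to example.org (hypothetical key names, same pattern as S3A's).
   */
  public static Configuration propagateEndpointOptions(
      Configuration source, String host) {
    String prefix = "fs.ftp.host." + host + ".";
    Configuration dest = new Configuration(source);
    for (Map.Entry<String, String> entry : source) {
      String key = entry.getKey();
      if (key.startsWith(prefix)) {
        dest.set("fs.ftp." + key.substring(prefix.length()),
            entry.getValue());
      }
    }
    return dest;
  }
}
{code}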




> New implementation of ftp and sftp filesystems
> ----------------------------------------------
>
>                 Key: HADOOP-14444
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14444
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Lukas Waldmann
>         Attachments: HADOOP-14444.patch
>
>
> The current implementations of the FTP and SFTP filesystems have severe 
> limitations and performance issues when dealing with a high number of files. 
> My patch solves those issues and integrates both filesystems in such a way 
> that most of the core functionality is common to both, simplifying 
> maintenance.
> The core features:
> * Support for HTTP/SOCKS proxies
> * Support for passive FTP
> * Support for connection pooling - a new connection is not created for every 
> single command but is reused from the pool. For a huge number of files this 
> shows an order-of-magnitude performance improvement over unpooled connections.
> * Caching of directory trees. With FTP you always need to list the whole 
> directory whenever you ask for information about a particular file. Again, 
> for a huge number of files this shows an order-of-magnitude performance 
> improvement over uncached connections.
> * Support for keep-alive (NOOP) messages to avoid connection drops
> * Support for Unix-style or regexp wildcard globs - useful for listing 
> particular files across a whole directory tree
> * Support for re-establishing broken FTP data transfers - which can happen 
> surprisingly often


