[jira] [Updated] (HADOOP-10643) Add NativeS3Fs that delegates calls from FileContext apis to native s3 fs implementation
[ https://issues.apache.org/jira/browse/HADOOP-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sumit Kumar updated HADOOP-10643: - Release Note: (was: This was implemented as part of s3a work in HADOOP-11262) > Add NativeS3Fs that delegates calls from FileContext apis to native s3 fs > implementation > --- > > Key: HADOOP-10643 > URL: https://issues.apache.org/jira/browse/HADOOP-10643 > Project: Hadoop Common > Issue Type: Sub-task > Components: fs/s3 > Affects Versions: 2.4.0 > Reporter: Sumit Kumar > Assignee: Sumit Kumar > Attachments: HADOOP-10643.patch > > > The new set of file system related apis (FileContext/AbstractFileSystem) > already support several filesystems (local filesystem, hdfs, viewfs); however they don't support > s3n. This patch is to add that support using configurations like > fs.AbstractFileSystem.s3n.impl = org.apache.hadoop.fs.s3native.NativeS3Fs > This patch however doesn't provide a new implementation; instead it relies on > the DelegateToFileSystem abstract class to delegate all calls from FileContext > apis for s3n to the NativeS3FileSystem implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
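The delegation approach described above can be sketched in plain Java, independent of Hadoop's actual DelegateToFileSystem class: every name below (Fs, DelegatingFs, Demo) is illustrative, not Hadoop's API.

```java
// Plain-Java sketch of the delegation pattern behind DelegateToFileSystem:
// the new-API class adds no behavior of its own and forwards every call
// to an existing implementation. All names here are illustrative, not
// Hadoop's actual classes.
interface Fs {
    boolean exists(String path);
    String read(String path);
}

class DelegatingFs implements Fs {
    private final Fs delegate; // stands in for e.g. NativeS3FileSystem

    DelegatingFs(Fs delegate) { this.delegate = delegate; }

    @Override public boolean exists(String path) { return delegate.exists(path); }
    @Override public String read(String path) { return delegate.read(path); }
}

public class Demo {
    public static void main(String[] args) {
        Fs backing = new Fs() { // toy stand-in for the legacy filesystem
            public boolean exists(String p) { return p.startsWith("/data"); }
            public String read(String p) { return "contents of " + p; }
        };
        Fs fs = new DelegatingFs(backing);
        System.out.println(fs.exists("/data/part-0000")); // true
        System.out.println(fs.read("/data/part-0000"));
    }
}
```

The point of the pattern is that the FileContext side needs no new filesystem logic at all; wiring the delegating class in via configuration is the whole patch.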
[jira] [Resolved] (HADOOP-10643) Add NativeS3Fs that delegates calls from FileContext apis to native s3 fs implementation
[ https://issues.apache.org/jira/browse/HADOOP-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sumit Kumar resolved HADOOP-10643. -- Resolution: Duplicate Release Note: This was implemented as part of s3a work in HADOOP-11262
[jira] [Commented] (HADOOP-10643) Add NativeS3Fs that delegates calls from FileContext apis to native s3 fs implementation
[ https://issues.apache.org/jira/browse/HADOOP-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682062#comment-14682062 ] Sumit Kumar commented on HADOOP-10643: -- I see, feel free to resolve with the appropriate closure code.
[jira] [Commented] (HADOOP-10643) Add NativeS3Fs that delegates calls from FileContext apis to native s3 fs implementation
[ https://issues.apache.org/jira/browse/HADOOP-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280942#comment-14280942 ] Sumit Kumar commented on HADOOP-10643: -- [~ste...@apache.org] I'm assuming your concern is about this portion of the patch (in AbstractFileSystem.java):
{noformat}
-    // A file system implementation that requires authority must always
-    // specify default port
-    if (defaultPort < 0 && authorityNeeded) {
-      throw new HadoopIllegalArgumentException(
-          "FileSystem implementation error - default port " + defaultPort
-          + " is not valid");
-    }
{noformat}
If so, S3 urls have a specific requirement that they don't contain any port (so defaultPort becomes -1 in this case), and they don't have any authority in the url either. Does this work?
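The check being debated can be isolated into a standalone sketch (not the verbatim Hadoop source; method and class names here are illustrative): a filesystem that needs an authority must declare a valid default port, while S3-style schemes pass defaultPort = -1 together with authorityNeeded = false and must be accepted.

```java
// Standalone sketch of the default-port validation discussed above.
public class PortCheck {
    static boolean isValid(int defaultPort, boolean authorityNeeded) {
        // mirrors: if (defaultPort < 0 && authorityNeeded) -> reject
        return !(defaultPort < 0 && authorityNeeded);
    }

    public static void main(String[] args) {
        System.out.println(isValid(8020, true));  // hdfs-like scheme: true
        System.out.println(isValid(-1, false));   // s3n-like scheme: true
        System.out.println(isValid(-1, true));    // invalid combination: false
    }
}
```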
[jira] [Assigned] (HADOOP-10643) Add NativeS3Fs that delegates calls from FileContext apis to native s3 fs implementation
[ https://issues.apache.org/jira/browse/HADOOP-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sumit Kumar reassigned HADOOP-10643: Assignee: Sumit Kumar
[jira] [Commented] (HADOOP-10400) Incorporate new S3A FileSystem implementation
[ https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041290#comment-14041290 ] Sumit Kumar commented on HADOOP-10400: -- bq. Why not just have the build read in some non-SCM'd file? ... Agree with you. My intentions were exactly the same, i.e. to be able to run tests with credentials at a predetermined location. xml might require extra work because the aws sdk already supports the following out of the box: system properties, environment variables, properties files and the new ProfileCredentialsProvider. I like the idea of skipping tests when credentials are not available; one good addition however would be to report in the test-report that these tests were skipped because of a missing credentials file (along with the expected location).

Incorporate new S3A FileSystem implementation - Key: HADOOP-10400 URL: https://issues.apache.org/jira/browse/HADOOP-10400 Project: Hadoop Common Issue Type: Improvement Components: fs, fs/s3 Affects Versions: 2.4.0 Reporter: Jordan Mendelson Assignee: Jordan Mendelson Attachments: HADOOP-10400-1.patch, HADOOP-10400-2.patch, HADOOP-10400-3.patch, HADOOP-10400-4.patch, HADOOP-10400-5.patch

The s3native filesystem has a number of limitations (some of which were recently fixed by HADOOP-9454). This patch adds an s3a filesystem which uses the aws-sdk instead of the jets3t library. There are a number of improvements over s3native including:
- Parallel copy (rename) support (dramatically speeds up commits on large files)
- AWS S3 explorer compatible empty directories files xyz/ instead of xyz_$folder$ (reduces littering)
- Ignores _$folder$ files created by s3native and other S3 browsing utilities
- Supports multiple output buffer dirs to even out IO when uploading files
- Supports IAM role-based authentication
- Allows setting a default canned ACL for uploads (public, private, etc.)
- Better error recovery handling
- Should handle input seeks without having to download the whole file (used for splits a lot)

This code is a copy of https://github.com/Aloisius/hadoop-s3a with patches to various pom files to get it to build against trunk. I've been using 0.0.1 in production with CDH 4 for several months and CDH 5 for a few days. The version here is 0.0.2 which changes around some keys to hopefully bring the key name style more inline with the rest of hadoop 2.x.

*Tunable parameters:*
fs.s3a.access.key - Your AWS access key ID (omit for role authentication)
fs.s3a.secret.key - Your AWS secret key (omit for role authentication)
fs.s3a.connection.maximum - Controls how many parallel connections HttpClient spawns (default: 15)
fs.s3a.connection.ssl.enabled - Enables or disables SSL connections to S3 (default: true)
fs.s3a.attempts.maximum - How many times we should retry commands on transient errors (default: 10)
fs.s3a.connection.timeout - Socket connect timeout (default: 5000)
fs.s3a.paging.maximum - How many keys to request from S3 at a time when doing directory listings (default: 5000)
fs.s3a.multipart.size - How big (in bytes) to split an upload or copy operation up into (default: 104857600)
fs.s3a.multipart.threshold - Until a file is this large (in bytes), use non-parallel upload (default: 2147483647)
fs.s3a.acl.default - Set a canned ACL on newly created/copied objects (private | public-read | public-read-write | authenticated-read | log-delivery-write | bucket-owner-read | bucket-owner-full-control)
fs.s3a.multipart.purge - True if you want to purge existing multipart uploads that may not have been completed/aborted correctly (default: false)
fs.s3a.multipart.purge.age - Minimum age in seconds of multipart uploads to purge (default: 86400)
fs.s3a.buffer.dir - Comma separated list of directories that will be used to buffer file writes out of (default: uses ${hadoop.tmp.dir}/s3a)

*Caveats*: Hadoop uses a standard output committer which uploads files as filename.COPYING before renaming them. This can cause unnecessary performance issues with S3 because it does not have a rename operation and S3 already verifies uploads against an md5 that the driver sets on the upload request. While this FileSystem should be significantly faster than the built-in s3native driver because of parallel copy support, you may want to consider setting a null output committer on your jobs to further improve performance. Because S3 requires the file length and MD5 to be known before a file is uploaded, all output is buffered out to a temporary file first similar to the s3native driver. Due to the lack of native rename() for S3, renaming extremely large files or
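As a concrete reading of the multipart settings listed above, here is a worked example (arithmetic only; this is not code from the patch, and the method name is illustrative): a file at or below fs.s3a.multipart.threshold is uploaded in one request, and a larger file is split into ceil(size / fs.s3a.multipart.size) parts.

```java
public class MultipartMath {
    // Number of parts for a multipart upload, given the part size from
    // fs.s3a.multipart.size; files at or below fs.s3a.multipart.threshold
    // are uploaded in a single request instead.
    static long partCount(long fileBytes, long partBytes, long threshold) {
        if (fileBytes <= threshold) return 1;           // non-parallel upload
        return (fileBytes + partBytes - 1) / partBytes; // ceiling division
    }

    public static void main(String[] args) {
        long partSize = 104857600L;   // default fs.s3a.multipart.size (100 MB)
        long threshold = 2147483647L; // default fs.s3a.multipart.threshold
        // 1 GiB is under the default threshold, so a single upload...
        System.out.println(partCount(1L << 30, partSize, threshold));  // 1
        // ...but a 10 GiB file is split: ceil(10 GiB / 100 MB) = 103 parts
        System.out.println(partCount(10L << 30, partSize, threshold)); // 103
    }
}
```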
[jira] [Commented] (HADOOP-9565) Add a Blobstore interface to add to blobstore FileSystems
[ https://issues.apache.org/jira/browse/HADOOP-9565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034281#comment-14034281 ] Sumit Kumar commented on HADOOP-9565: - I see an interface like this in the hadoop-azure codebase: https://github.com/apache/hadoop-common/blob/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/StorageInterface.java Add a Blobstore interface to add to blobstore FileSystems - Key: HADOOP-9565 URL: https://issues.apache.org/jira/browse/HADOOP-9565 Project: Hadoop Common Issue Type: Sub-task Components: fs Affects Versions: 2.0.4-alpha Reporter: Steve Loughran Priority: Minor We can make explicit the fact that some {{FileSystem}} implementations are really blobstores, with different atomicity and consistency guarantees, by adding a {{Blobstore}} interface to them. This could also be a place to add a {{Copy(Path,Path)}} method, assuming that all blobstores implement a server-side copy operation as a substitute for rename. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HADOOP-10400) Incorporate new S3A FileSystem implementation
[ https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019299#comment-14019299 ] Sumit Kumar commented on HADOOP-10400: -- A few observations: # Should it include tests to verify behavior when a user tries to write paths that have multiple / such as s3a://foobar/delta///gamma//abc? How this implementation handles this would be interesting, because each occurrence of / would appear to be a directory in the current implementation. # Should it have more logic to consider _$folder$ as a folder marker as well, like it does for / (by creating fake directories)? That way the implementation would be exactly the same as the current s3n. If item #1 fails, I don't see another approach to solve folder representation in s3. # aws-java-sdk provides https://github.com/aws/aws-sdk-java/blob/master/src/main/java/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.java Should we consider adding ProfileCredentialsProvider as described here: https://java.awsblog.com/post/TxRE9V31UFN860/Secure-Local-Development-with-the-ProfileCredentialsProvider This might be a big boon in testing s3 behaviors as unit tests (it's always been really hard keeping code/xmls with access and secret keys checked into code bases). # Should S3AFileStatus be more strict in constructor arguments? For example, if it's a directory constructor, do we need the isdir flag? Should this be a clearer api? # Should it be doing parallel rename/delete operations as well? More specifically, could copy operations (while renaming a folder) leverage parallel threads using the TransferManager apis? # Should it implement the iterative listing api as well for better performance, and build listStatus on top of the same?
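Observation #1 can be made concrete with a small illustration (an assumption about one possible handling, not what the HADOOP-10400 patch necessarily does): an object-store filesystem that maps each "/"-separated component to a directory must decide how to treat repeated slashes, and one common choice is to collapse them before deriving the object key.

```java
// Hypothetical slash normalization for object-store keys: collapse runs
// of '/' and drop a trailing '/', so "/delta///gamma//abc" and
// "/delta/gamma/abc" name the same object instead of implying empty
// intermediate directories.
public class KeyNormalize {
    static String normalizeKey(String path) {
        String collapsed = path.replaceAll("/+", "/");
        if (collapsed.length() > 1 && collapsed.endsWith("/")) {
            collapsed = collapsed.substring(0, collapsed.length() - 1);
        }
        return collapsed;
    }

    public static void main(String[] args) {
        System.out.println(normalizeKey("/delta///gamma//abc")); // /delta/gamma/abc
    }
}
```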
[jira] [Commented] (HADOOP-10634) Add recursive list apis to FileSystem to give implementations an opportunity for optimization
[ https://issues.apache.org/jira/browse/HADOOP-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015522#comment-14015522 ] Sumit Kumar commented on HADOOP-10634: -- That was a great suggestion [~ste...@apache.org], and thanks for clarifying the purpose of the listLocatedStatus apis. It was confusing when I started working on this patch. I've updated the patch for MAPREDUCE-5907 to use these iterator based apis, which should address the memory concerns. I'm still going through HADOOP-10400; at a high level it's a great enhancement, but I have a few notes that I'll share in a day or two (still going through the patch). Add recursive list apis to FileSystem to give implementations an opportunity for optimization - Key: HADOOP-10634 URL: https://issues.apache.org/jira/browse/HADOOP-10634 Project: Hadoop Common Issue Type: Improvement Components: fs/s3 Affects Versions: 2.4.0 Reporter: Sumit Kumar Attachments: HADOOP-10634.patch Currently different code flows in hadoop use recursive listing to discover files/folders in a given path. For example in FileInputFormat (both mapreduce and mapred implementations) this is done while calculating splits. They however do this by listing level by level. That means to discover files in /foo/bar, they do a listing at /foo/bar first to get the immediate children, then make the same call on all immediate children of /foo/bar to discover their immediate children, and so on. This doesn't scale well for fs implementations like s3, because every listStatus call ends up being a webservice call to s3. In cases where a large number of files are considered for input, this makes the getSplits() call slow. This patch adds a new set of recursive list apis that give the s3 fs implementation an opportunity to optimize. The behavior remains the same for other implementations (that is, a default implementation is provided for other fs so they don't have to implement anything new).
However for s3 it provides a simple change (as shown in the patch) to improve listing performance. -- This message was sent by Atlassian JIRA (v6.2#6252)
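The cost argument above can be modeled in a toy sketch (illustrative only; the names and tree structure are invented for the example): with level-by-level listing, every directory visited costs one listStatus() call, i.e. one web request on S3, while a flat recursive listing can be served by a single paged request against the key prefix.

```java
import java.util.*;

// Toy model of the listing-cost argument: count listStatus() calls made
// by level-by-level traversal of a directory tree. On S3 the same result
// could come from one prefix listing.
public class ListingCost {
    // tree maps a directory to its immediate subdirectories
    static int levelByLevelCalls(Map<String, List<String>> tree, String root) {
        int calls = 0;
        Deque<String> pending = new ArrayDeque<>();
        pending.push(root);
        while (!pending.isEmpty()) {
            String dir = pending.pop();
            calls++; // one listStatus() web request per directory visited
            for (String child : tree.getOrDefault(dir, List.of())) {
                pending.push(child);
            }
        }
        return calls;
    }

    public static void main(String[] args) {
        Map<String, List<String>> tree = Map.of(
            "/foo", List.of("/foo/a", "/foo/b"),
            "/foo/a", List.of("/foo/a/x"));
        // 4 directories -> 4 calls level by level, vs 1 prefix listing on S3
        System.out.println(levelByLevelCalls(tree, "/foo")); // 4
    }
}
```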
[jira] [Created] (HADOOP-10655) swift native store returns true even when it doesn't delete the directory
Sumit Kumar created HADOOP-10655: Summary: swift native store returns true even when it doesn't delete the directory Key: HADOOP-10655 URL: https://issues.apache.org/jira/browse/HADOOP-10655 Project: Hadoop Common Issue Type: Bug Components: fs Reporter: Sumit Kumar Wasn't sure if this was desired behavior, but the javadoc comments and the implementation seem to contradict each other; hence this JIRA. See http://tiny.cc/aa6tgx -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HADOOP-10634) Add recursive list apis to FileSystem to give implementations an opportunity for optimization
[ https://issues.apache.org/jira/browse/HADOOP-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sumit Kumar updated HADOOP-10634: - Resolution: Duplicate Status: Resolved (was: Patch Available) Duplicate of MAPREDUCE-5907
[jira] [Created] (HADOOP-10643) Add NativeS3Fs that delegates calls from FileContext apis to native s3 fs implementation
Sumit Kumar created HADOOP-10643: Summary: Add NativeS3Fs that delegates calls from FileContext apis to native s3 fs implementation Key: HADOOP-10643 URL: https://issues.apache.org/jira/browse/HADOOP-10643 Project: Hadoop Common Issue Type: New Feature Components: fs/s3 Affects Versions: 2.4.0 Reporter: Sumit Kumar
[jira] [Updated] (HADOOP-10643) Add NativeS3Fs that delegates calls from FileContext apis to native s3 fs implementation
[ https://issues.apache.org/jira/browse/HADOOP-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sumit Kumar updated HADOOP-10643: - Attachment: HADOOP-10643.patch Added the implementation along with a test case.
[jira] [Created] (HADOOP-10634) Add recursive list apis to FileSystem to give implementations an opportunity for optimization
Sumit Kumar created HADOOP-10634: Summary: Add recursive list apis to FileSystem to give implementations an opportunity for optimization Key: HADOOP-10634 URL: https://issues.apache.org/jira/browse/HADOOP-10634 Project: Hadoop Common Issue Type: Improvement Components: fs/s3 Reporter: Sumit Kumar Fix For: 2.4.0
[jira] [Updated] (HADOOP-10634) Add recursive list apis to FileSystem to give implementations an opportunity for optimization
[ https://issues.apache.org/jira/browse/HADOOP-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sumit Kumar updated HADOOP-10634: - Attachment: HADOOP-10634.patch
[jira] [Updated] (HADOOP-10634) Add recursive list apis to FileSystem to give implementations an opportunity for optimization
[ https://issues.apache.org/jira/browse/HADOOP-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sumit Kumar updated HADOOP-10634: - Fix Version/s: (was: 2.4.0) Affects Version/s: 2.4.0 Status: Patch Available (was: Open) Attached a patch that passes all the tests on top of the hadoop 2.4.0 branch.
[jira] [Commented] (HADOOP-6356) Add a Cache for AbstractFileSystem in the new FileContext/AbstractFileSystem framework.
[ https://issues.apache.org/jira/browse/HADOOP-6356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14006445#comment-14006445 ] Sumit Kumar commented on HADOOP-6356: - @all - trying to bring your attention to this JIRA. I see that parts of the Hive/Hadoop code have already started consuming these apis, but looking at this JIRA, there hasn't been much interest in the last 2 years. Add a Cache for AbstractFileSystem in the new FileContext/AbstractFileSystem framework. --- Key: HADOOP-6356 URL: https://issues.apache.org/jira/browse/HADOOP-6356 Project: Hadoop Common Issue Type: Improvement Components: fs Affects Versions: 0.22.0 Reporter: Sanjay Radia Assignee: Sanjay Radia The new filesystem framework, FileContext and AbstractFileSystem, does not implement a cache for AbstractFileSystem. This Jira proposes to add a cache to the new framework just like with the old FileSystem. -- This message was sent by Atlassian JIRA (v6.2#6252)