[jira] [Updated] (HADOOP-10643) Add NativeS3Fs that delgates calls from FileContext apis to native s3 fs implementation

2016-01-20 Thread Sumit Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumit Kumar updated HADOOP-10643:
-
Release Note:   (was: This was implemented as part of s3a work in 
HADOOP-11262)

> Add NativeS3Fs that delgates calls from FileContext apis to native s3 fs 
> implementation
> ---
>
> Key: HADOOP-10643
> URL: https://issues.apache.org/jira/browse/HADOOP-10643
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.4.0
>Reporter: Sumit Kumar
>Assignee: Sumit Kumar
> Attachments: HADOOP-10643.patch
>
>
> The new set of file system related apis (FileContext/AbstractFileSystem) 
> already support the local filesystem, hdfs, and viewfs; however, they don't 
> support s3n. This patch is to add that support using configurations like
> fs.AbstractFileSystem.s3n.impl = org.apache.hadoop.fs.s3native.NativeS3Fs
> This patch, however, doesn't provide a new implementation; instead it relies 
> on the DelegateToFileSystem abstract class to delegate all calls from the 
> FileContext apis for s3n to the NativeS3FileSystem implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HADOOP-10643) Add NativeS3Fs that delgates calls from FileContext apis to native s3 fs implementation

2016-01-20 Thread Sumit Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumit Kumar resolved HADOOP-10643.
--
  Resolution: Duplicate
Release Note: This was implemented as part of s3a work in HADOOP-11262

> Add NativeS3Fs that delgates calls from FileContext apis to native s3 fs 
> implementation
> ---
>
> Key: HADOOP-10643
> URL: https://issues.apache.org/jira/browse/HADOOP-10643
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 2.4.0
>Reporter: Sumit Kumar
>Assignee: Sumit Kumar
> Attachments: HADOOP-10643.patch
>
>
> The new set of file system related apis (FileContext/AbstractFileSystem) 
> already support the local filesystem, hdfs, and viewfs; however, they don't 
> support s3n. This patch is to add that support using configurations like
> fs.AbstractFileSystem.s3n.impl = org.apache.hadoop.fs.s3native.NativeS3Fs
> This patch, however, doesn't provide a new implementation; instead it relies 
> on the DelegateToFileSystem abstract class to delegate all calls from the 
> FileContext apis for s3n to the NativeS3FileSystem implementation.





[jira] [Commented] (HADOOP-10643) Add NativeS3Fs that delgates calls from FileContext apis to native s3 fs implementation

2015-08-11 Thread Sumit Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682062#comment-14682062
 ] 

Sumit Kumar commented on HADOOP-10643:
--

I see, feel free to resolve with appropriate closure code.

 Add NativeS3Fs that delgates calls from FileContext apis to native s3 fs 
 implementation
 ---

 Key: HADOOP-10643
 URL: https://issues.apache.org/jira/browse/HADOOP-10643
 Project: Hadoop Common
  Issue Type: Sub-task
  Components: fs/s3
Affects Versions: 2.4.0
Reporter: Sumit Kumar
Assignee: Sumit Kumar
 Attachments: HADOOP-10643.patch


 The new set of file system related apis (FileContext/AbstractFileSystem) 
 already support the local filesystem, hdfs, and viewfs; however, they don't 
 support s3n. This patch is to add that support using configurations like
 fs.AbstractFileSystem.s3n.impl = org.apache.hadoop.fs.s3native.NativeS3Fs
 This patch, however, doesn't provide a new implementation; instead it relies 
 on the DelegateToFileSystem abstract class to delegate all calls from the 
 FileContext apis for s3n to the NativeS3FileSystem implementation.





[jira] [Commented] (HADOOP-10643) Add NativeS3Fs that delgates calls from FileContext apis to native s3 fs implementation

2015-01-16 Thread Sumit Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280942#comment-14280942
 ] 

Sumit Kumar commented on HADOOP-10643:
--

[~ste...@apache.org] I'm assuming your concern is about this portion of the 
patch (in AbstractFileSystem.java):
{noformat}
-// A file system implementation that requires authority must always
-// specify default port
-if (defaultPort < 0 && authorityNeeded) {
-  throw new HadoopIllegalArgumentException(
-      "FileSystem implementation error -  default port " + defaultPort
-          + " is not valid");
-}
{noformat}

If so, s3's urls have a specific requirement that they don't contain any port 
(so defaultPort becomes -1 in this case), and they don't have any authority in 
the url either. Does this work?
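
For what it's worth, java.net.URI agrees with this: an s3n-style url simply has 
no port component, so getPort() reports the "undefined" sentinel -1 -- exactly 
the case the removed check rejected. A minimal check (the bucket name below is 
made up):

```java
import java.net.URI;

public class S3nUriCheck {
    public static void main(String[] args) {
        // s3n urls carry no port, so URI.getPort() returns -1,
        // which is the defaultPort value discussed above.
        URI u = URI.create("s3n://mybucket/path/to/key");
        System.out.println(u.getPort()); // -1
    }
}
```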

 Add NativeS3Fs that delgates calls from FileContext apis to native s3 fs 
 implementation
 ---

 Key: HADOOP-10643
 URL: https://issues.apache.org/jira/browse/HADOOP-10643
 Project: Hadoop Common
  Issue Type: New Feature
  Components: fs/s3
Affects Versions: 2.4.0
Reporter: Sumit Kumar
Assignee: Sumit Kumar
 Attachments: HADOOP-10643.patch


 The new set of file system related apis (FileContext/AbstractFileSystem) 
 already support the local filesystem, hdfs, and viewfs; however, they don't 
 support s3n. This patch is to add that support using configurations like
 fs.AbstractFileSystem.s3n.impl = org.apache.hadoop.fs.s3native.NativeS3Fs
 This patch, however, doesn't provide a new implementation; instead it relies 
 on the DelegateToFileSystem abstract class to delegate all calls from the 
 FileContext apis for s3n to the NativeS3FileSystem implementation.





[jira] [Assigned] (HADOOP-10643) Add NativeS3Fs that delgates calls from FileContext apis to native s3 fs implementation

2014-12-17 Thread Sumit Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumit Kumar reassigned HADOOP-10643:


Assignee: Sumit Kumar

 Add NativeS3Fs that delgates calls from FileContext apis to native s3 fs 
 implementation
 ---

 Key: HADOOP-10643
 URL: https://issues.apache.org/jira/browse/HADOOP-10643
 Project: Hadoop Common
  Issue Type: New Feature
  Components: fs/s3
Affects Versions: 2.4.0
Reporter: Sumit Kumar
Assignee: Sumit Kumar
 Attachments: HADOOP-10643.patch


 The new set of file system related apis (FileContext/AbstractFileSystem) 
 already support the local filesystem, hdfs, and viewfs; however, they don't 
 support s3n. This patch is to add that support using configurations like
 fs.AbstractFileSystem.s3n.impl = org.apache.hadoop.fs.s3native.NativeS3Fs
 This patch, however, doesn't provide a new implementation; instead it relies 
 on the DelegateToFileSystem abstract class to delegate all calls from the 
 FileContext apis for s3n to the NativeS3FileSystem implementation.





[jira] [Commented] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2014-06-23 Thread Sumit Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041290#comment-14041290
 ] 

Sumit Kumar commented on HADOOP-10400:
--

bq. Why not just have the build read in some non-SCM'd file? ...

Agree with you. My intentions were exactly the same, i.e. to be able to run 
tests with credentials at a predetermined location. xml might require extra 
work, because the aws sdk supports the following by default: system 
properties, environment variables, properties files, and the new 
ProfileCredentialsProvider. I like the idea of skipping tests when credentials 
are not available; one good addition, however, would be to report in the 
test report that these tests were skipped because of a missing credentials file 
(along with the expected location)
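
The lookup order described above (system properties, then environment 
variables, then a profile file) amounts to a first-match chain; the sketch 
below is only an illustration of that pattern, not the aws sdk's actual 
classes or names:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

public class CredentialsChain {
    // Each supplier is one credential source; the first non-null,
    // non-empty value wins, mirroring the lookup order described above.
    public static Optional<String> resolve(List<Supplier<String>> sources) {
        return sources.stream()
                .map(Supplier::get)
                .filter(v -> v != null && !v.isEmpty())
                .findFirst();
    }

    public static void main(String[] args) {
        Optional<String> key = resolve(Arrays.asList(
                () -> System.getProperty("aws.accessKeyId"),  // 1. system property
                () -> System.getenv("AWS_ACCESS_KEY_ID"),     // 2. environment
                () -> null                                    // 3. profile file (stubbed out)
        ));
        // A test harness could skip (and report) when no source resolves.
        System.out.println(key.isPresent() ? "credentials found"
                                           : "skipping: no credentials");
    }
}
```

This is also where the skip-and-report idea plugs in: when resolve() comes back 
empty, the harness can mark the test skipped with the expected file location in 
the message.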




 Incorporate new S3A FileSystem implementation
 -

 Key: HADOOP-10400
 URL: https://issues.apache.org/jira/browse/HADOOP-10400
 Project: Hadoop Common
  Issue Type: Improvement
  Components: fs, fs/s3
Affects Versions: 2.4.0
Reporter: Jordan Mendelson
Assignee: Jordan Mendelson
 Attachments: HADOOP-10400-1.patch, HADOOP-10400-2.patch, 
 HADOOP-10400-3.patch, HADOOP-10400-4.patch, HADOOP-10400-5.patch


 The s3native filesystem has a number of limitations (some of which were 
 recently fixed by HADOOP-9454). This patch adds an s3a filesystem which uses 
 the aws-sdk instead of the jets3t library. There are a number of improvements 
 over s3native including:
 - Parallel copy (rename) support (dramatically speeds up commits on large 
 files)
 - AWS S3 explorer compatible empty directory files "xyz/" instead of 
 "xyz_$folder$" (reduces littering)
 - Ignores "_$folder$" marker files created by s3native and other S3 
 browsing utilities
 - Supports multiple output buffer dirs to even out IO when uploading files
 - Supports IAM role-based authentication
 - Allows setting a default canned ACL for uploads (public, private, etc.)
 - Better error recovery handling
 - Should handle input seeks without having to download the whole file (used 
 for splits a lot)
 This code is a copy of https://github.com/Aloisius/hadoop-s3a with patches to 
 various pom files to get it to build against trunk. I've been using 0.0.1 in 
 production with CDH 4 for several months and CDH 5 for a few days. The 
 version here is 0.0.2 which changes around some keys to hopefully bring the 
 key name style more inline with the rest of hadoop 2.x.
 *Tunable parameters:*
 fs.s3a.access.key - Your AWS access key ID (omit for role authentication)
 fs.s3a.secret.key - Your AWS secret key (omit for role authentication)
 fs.s3a.connection.maximum - Controls how many parallel connections 
 HttpClient spawns (default: 15)
 fs.s3a.connection.ssl.enabled - Enables or disables SSL connections to S3 
 (default: true)
 fs.s3a.attempts.maximum - How many times we should retry commands on 
 transient errors (default: 10)
 fs.s3a.connection.timeout - Socket connect timeout (default: 5000)
 fs.s3a.paging.maximum - How many keys to request from S3 when doing 
 directory listings at a time (default: 5000)
 fs.s3a.multipart.size - How big (in bytes) to split an upload or copy 
 operation up into (default: 104857600)
 fs.s3a.multipart.threshold - Until a file is this large (in bytes), use 
 non-parallel upload (default: 2147483647)
 fs.s3a.acl.default - Set a canned ACL on newly created/copied objects 
 (private | public-read | public-read-write | authenticated-read | 
 log-delivery-write | bucket-owner-read | bucket-owner-full-control)
 fs.s3a.multipart.purge - True if you want to purge existing multipart 
 uploads that may not have been completed/aborted correctly (default: false)
 fs.s3a.multipart.purge.age - Minimum age in seconds of multipart uploads 
 to purge (default: 86400)
 fs.s3a.buffer.dir - Comma separated list of directories that will be used 
 to buffer file writes out of (default: uses ${hadoop.tmp.dir}/s3a )
 *Caveats*:
 Hadoop uses a standard output committer which uploads files as 
 filename.COPYING before renaming them. This can cause unnecessary performance 
 issues with S3 because it does not have a rename operation and S3 already 
 verifies uploads against an md5 that the driver sets on the upload request. 
 While this FileSystem should be significantly faster than the built-in 
 s3native driver because of parallel copy support, you may want to consider 
 setting a null output committer on your jobs to further improve performance.
 Because S3 requires the file length and MD5 to be known before a file is 
 uploaded, all output is buffered out to a temporary file first similar to the 
 s3native driver.
 Due to the lack of native rename() for S3, renaming extremely large files or 
 

[jira] [Commented] (HADOOP-9565) Add a Blobstore interface to add to blobstore FileSystems

2014-06-17 Thread Sumit Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034281#comment-14034281
 ] 

Sumit Kumar commented on HADOOP-9565:
-

I see an interface like this in the hadoop-azure codebase: 
https://github.com/apache/hadoop-common/blob/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/StorageInterface.java

 Add a Blobstore interface to add to blobstore FileSystems
 -

 Key: HADOOP-9565
 URL: https://issues.apache.org/jira/browse/HADOOP-9565
 Project: Hadoop Common
  Issue Type: Sub-task
  Components: fs
Affects Versions: 2.0.4-alpha
Reporter: Steve Loughran
Priority: Minor

 We can make the fact that some {{FileSystem}} implementations are really 
 blobstores, with different atomicity and consistency guarantees, explicit by 
 adding a {{Blobstore}} interface to them. 
 This could also be a place to add a {{Copy(Path,Path)}} method, assuming that 
 all blobstores implement a server-side copy operation as a substitute for 
 rename.





[jira] [Commented] (HADOOP-10400) Incorporate new S3A FileSystem implementation

2014-06-05 Thread Sumit Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019299#comment-14019299
 ] 

Sumit Kumar commented on HADOOP-10400:
--

A few observations:

# Should it include tests to verify behavior when a user tries to write paths 
that have multiple "/", such as "s3a://foobar/delta///gammma//abc"? How this 
implementation handles that would be interesting, because each occurrence 
of "/" would appear to be a directory to the current implementation.
# Should it have more logic to consider "_$folder$" as a folder marker as well, 
like it does for "/" (by creating fake directories)? That way the 
implementation would be exactly the same as the current s3n. If item #1 fails, I 
don't see another approach to solve folder representation in s3.
# aws-java-sdk provides 
https://github.com/aws/aws-sdk-java/blob/master/src/main/java/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.java
 Should we consider adding ProfileCredentialsProvider as described here: 
https://java.awsblog.com/post/TxRE9V31UFN860/Secure-Local-Development-with-the-ProfileCredentialsProvider
 This might be a big boon in testing s3 behaviors as unit tests (it's always 
been really hard keeping code/xmls with access and secret keys checked into 
code bases). 
# Should S3AFileStatus be more strict about its constructor arguments? For 
example, if it's a directory constructor, do we need the isdir flag? Should 
this be a clearer api?
# Should it be doing parallel rename/delete operations as well? More 
specifically, could copy operations (while renaming a folder) leverage parallel 
threads using the TransferManager apis?
# Should it implement the iterative listing api as well for better performance, 
and build listStatus on top of the same?
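
On item #1, one way to make repeated slashes harmless is to collapse them 
before deriving the object key, so that every spelling of the path maps to the 
same key. The helper below is purely hypothetical and not from the patch:

```java
public class S3KeyNormalizer {
    // Collapse runs of '/' so "delta///gammma//abc" and "delta/gammma/abc"
    // refer to the same object key, instead of each empty segment reading
    // as a zero-length directory name. Illustrative only; the actual
    // implementation may choose to reject such paths instead.
    public static String normalizeKey(String rawPath) {
        return rawPath.replaceAll("/+", "/");
    }

    public static void main(String[] args) {
        System.out.println(normalizeKey("/delta///gammma//abc")); // /delta/gammma/abc
    }
}
```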

 Incorporate new S3A FileSystem implementation
 -

 Key: HADOOP-10400
 URL: https://issues.apache.org/jira/browse/HADOOP-10400
 Project: Hadoop Common
  Issue Type: Improvement
  Components: fs, fs/s3
Affects Versions: 2.4.0
Reporter: Jordan Mendelson
Assignee: Jordan Mendelson
 Attachments: HADOOP-10400-1.patch, HADOOP-10400-2.patch, 
 HADOOP-10400-3.patch, HADOOP-10400-4.patch, HADOOP-10400-5.patch


 The s3native filesystem has a number of limitations (some of which were 
 recently fixed by HADOOP-9454). This patch adds an s3a filesystem which uses 
 the aws-sdk instead of the jets3t library. There are a number of improvements 
 over s3native including:
 - Parallel copy (rename) support (dramatically speeds up commits on large 
 files)
 - AWS S3 explorer compatible empty directory files "xyz/" instead of 
 "xyz_$folder$" (reduces littering)
 - Ignores "_$folder$" marker files created by s3native and other S3 
 browsing utilities
 - Supports multiple output buffer dirs to even out IO when uploading files
 - Supports IAM role-based authentication
 - Allows setting a default canned ACL for uploads (public, private, etc.)
 - Better error recovery handling
 - Should handle input seeks without having to download the whole file (used 
 for splits a lot)
 This code is a copy of https://github.com/Aloisius/hadoop-s3a with patches to 
 various pom files to get it to build against trunk. I've been using 0.0.1 in 
 production with CDH 4 for several months and CDH 5 for a few days. The 
 version here is 0.0.2 which changes around some keys to hopefully bring the 
 key name style more inline with the rest of hadoop 2.x.
 *Tunable parameters:*
 fs.s3a.access.key - Your AWS access key ID (omit for role authentication)
 fs.s3a.secret.key - Your AWS secret key (omit for role authentication)
 fs.s3a.connection.maximum - Controls how many parallel connections 
 HttpClient spawns (default: 15)
 fs.s3a.connection.ssl.enabled - Enables or disables SSL connections to S3 
 (default: true)
 fs.s3a.attempts.maximum - How many times we should retry commands on 
 transient errors (default: 10)
 fs.s3a.connection.timeout - Socket connect timeout (default: 5000)
 fs.s3a.paging.maximum - How many keys to request from S3 when doing 
 directory listings at a time (default: 5000)
 fs.s3a.multipart.size - How big (in bytes) to split an upload or copy 
 operation up into (default: 104857600)
 fs.s3a.multipart.threshold - Until a file is this large (in bytes), use 
 non-parallel upload (default: 2147483647)
 fs.s3a.acl.default - Set a canned ACL on newly created/copied objects 
 (private | public-read | public-read-write | authenticated-read | 
 log-delivery-write | bucket-owner-read | bucket-owner-full-control)
 fs.s3a.multipart.purge - True if you want to purge existing multipart 
 uploads that may not have been completed/aborted correctly (default: false)
 fs.s3a.multipart.purge.age - Minimum age in seconds of multipart uploads 
 to purge (default: 

[jira] [Commented] (HADOOP-10634) Add recursive list apis to FileSystem to give implementations an opportunity for optimization

2014-06-02 Thread Sumit Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015522#comment-14015522
 ] 

Sumit Kumar commented on HADOOP-10634:
--

That was a great suggestion [~ste...@apache.org], and thanks for clarifying the 
purpose of the listLocatedStatus apis. It was confusing when I started working on 
this patch. I've updated the patch for MAPREDUCE-5907 to use these iterator-based 
apis, which should address the memory concerns. 

I'm still going through HADOOP-10400; at a high level it's a great enhancement, 
but I have a few notes that I will share in a day or two (still going through the 
patch).

 Add recursive list apis to FileSystem to give implementations an opportunity 
 for optimization
 -

 Key: HADOOP-10634
 URL: https://issues.apache.org/jira/browse/HADOOP-10634
 Project: Hadoop Common
  Issue Type: Improvement
  Components: fs/s3
Affects Versions: 2.4.0
Reporter: Sumit Kumar
 Attachments: HADOOP-10634.patch


 Currently, different code flows in hadoop use recursive listing to discover 
 files/folders in a given path. For example, in FileInputFormat (both the 
 mapreduce and mapred implementations) this is done while calculating splits. 
 They do this, however, by listing level by level: to discover files in 
 /foo/bar, they do a listing at /foo/bar first to get the immediate children, 
 then make the same call on each immediate child of /foo/bar to discover its 
 immediate children, and so on. This doesn't scale well for fs implementations 
 like s3, because every listStatus call ends up being a webservice call to s3. 
 In cases where a large number of files are considered for input, this makes 
 the getSplits() call slow. 
 This patch adds a new set of recursive list apis that give the s3 fs 
 implementation an opportunity to optimize. The behavior remains the same for 
 other implementations (that is, a default implementation is provided for other 
 fs so they don't have to implement anything new). However, for s3 it provides 
 a simple change (as shown in the patch) to improve listing performance.





[jira] [Created] (HADOOP-10655) swift native store returns true even when it doesn't delete the directory

2014-06-02 Thread Sumit Kumar (JIRA)
Sumit Kumar created HADOOP-10655:


 Summary: swift native store returns true even when it doesn't 
delete the directory
 Key: HADOOP-10655
 URL: https://issues.apache.org/jira/browse/HADOOP-10655
 Project: Hadoop Common
  Issue Type: Bug
  Components: fs
Reporter: Sumit Kumar


Wasn't sure if this was desired behavior, but the javadoc comments and the 
implementation seem to contradict each other, hence this JIRA. See http://tiny.cc/aa6tgx





[jira] [Updated] (HADOOP-10634) Add recursive list apis to FileSystem to give implementations an opportunity for optimization

2014-05-29 Thread Sumit Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumit Kumar updated HADOOP-10634:
-

Resolution: Duplicate
Status: Resolved  (was: Patch Available)

Duplicate of MAPREDUCE-5907

 Add recursive list apis to FileSystem to give implementations an opportunity 
 for optimization
 -

 Key: HADOOP-10634
 URL: https://issues.apache.org/jira/browse/HADOOP-10634
 Project: Hadoop Common
  Issue Type: Improvement
  Components: fs/s3
Affects Versions: 2.4.0
Reporter: Sumit Kumar
 Attachments: HADOOP-10634.patch


 Currently, different code flows in hadoop use recursive listing to discover 
 files/folders in a given path. For example, in FileInputFormat (both the 
 mapreduce and mapred implementations) this is done while calculating splits. 
 They do this, however, by listing level by level: to discover files in 
 /foo/bar, they do a listing at /foo/bar first to get the immediate children, 
 then make the same call on each immediate child of /foo/bar to discover its 
 immediate children, and so on. This doesn't scale well for fs implementations 
 like s3, because every listStatus call ends up being a webservice call to s3. 
 In cases where a large number of files are considered for input, this makes 
 the getSplits() call slow. 
 This patch adds a new set of recursive list apis that give the s3 fs 
 implementation an opportunity to optimize. The behavior remains the same for 
 other implementations (that is, a default implementation is provided for other 
 fs so they don't have to implement anything new). However, for s3 it provides 
 a simple change (as shown in the patch) to improve listing performance.





[jira] [Created] (HADOOP-10643) Add NativeS3Fs that delgates calls from FileContext apis to native s3 fs implementation

2014-05-29 Thread Sumit Kumar (JIRA)
Sumit Kumar created HADOOP-10643:


 Summary: Add NativeS3Fs that delgates calls from FileContext apis 
to native s3 fs implementation
 Key: HADOOP-10643
 URL: https://issues.apache.org/jira/browse/HADOOP-10643
 Project: Hadoop Common
  Issue Type: New Feature
  Components: fs/s3
Affects Versions: 2.4.0
Reporter: Sumit Kumar


The new set of file system related apis (FileContext/AbstractFileSystem) 
already support the local filesystem, hdfs, and viewfs; however, they don't 
support s3n. This patch is to add that support using configurations like

fs.AbstractFileSystem.s3n.impl = org.apache.hadoop.fs.s3native.NativeS3Fs

This patch, however, doesn't provide a new implementation; instead it relies on 
the DelegateToFileSystem abstract class to delegate all calls from the 
FileContext apis for s3n to the NativeS3FileSystem implementation.
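
The delegation described above can be pictured with a toy stand-in (all names 
below are hypothetical; Hadoop's real DelegateToFileSystem forwards the full 
AbstractFileSystem surface to a wrapped FileSystem, not a one-method interface):

```java
// Toy illustration of the delegation pattern the description relies on:
// a FileContext-facing facade that forwards every call to an existing
// FileSystem-style implementation rather than re-implementing it.
interface FsOps {
    boolean exists(String path);
}

class NativeS3FsImpl implements FsOps {      // stands in for NativeS3FileSystem
    public boolean exists(String path) { return path.startsWith("s3n://"); }
}

class DelegatingFs implements FsOps {        // stands in for DelegateToFileSystem
    private final FsOps delegate;
    DelegatingFs(FsOps delegate) { this.delegate = delegate; }
    public boolean exists(String path) { return delegate.exists(path); } // pure forwarding
}

public class DelegationDemo {
    public static void main(String[] args) {
        FsOps fc = new DelegatingFs(new NativeS3FsImpl());
        System.out.println(fc.exists("s3n://bucket/key")); // true
    }
}
```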





[jira] [Updated] (HADOOP-10643) Add NativeS3Fs that delgates calls from FileContext apis to native s3 fs implementation

2014-05-29 Thread Sumit Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumit Kumar updated HADOOP-10643:
-

Attachment: HADOOP-10643.patch

Added the implementation along with a test case

 Add NativeS3Fs that delgates calls from FileContext apis to native s3 fs 
 implementation
 ---

 Key: HADOOP-10643
 URL: https://issues.apache.org/jira/browse/HADOOP-10643
 Project: Hadoop Common
  Issue Type: New Feature
  Components: fs/s3
Affects Versions: 2.4.0
Reporter: Sumit Kumar
 Attachments: HADOOP-10643.patch


 The new set of file system related apis (FileContext/AbstractFileSystem) 
 already support the local filesystem, hdfs, and viewfs; however, they don't 
 support s3n. This patch is to add that support using configurations like
 fs.AbstractFileSystem.s3n.impl = org.apache.hadoop.fs.s3native.NativeS3Fs
 This patch, however, doesn't provide a new implementation; instead it relies 
 on the DelegateToFileSystem abstract class to delegate all calls from the 
 FileContext apis for s3n to the NativeS3FileSystem implementation.





[jira] [Created] (HADOOP-10634) Add recursive list apis to FileSystem to give implementations an opportunity for optimization

2014-05-28 Thread Sumit Kumar (JIRA)
Sumit Kumar created HADOOP-10634:


 Summary: Add recursive list apis to FileSystem to give 
implementations an opportunity for optimization
 Key: HADOOP-10634
 URL: https://issues.apache.org/jira/browse/HADOOP-10634
 Project: Hadoop Common
  Issue Type: Improvement
  Components: fs/s3
Reporter: Sumit Kumar
 Fix For: 2.4.0


Currently, different code flows in hadoop use recursive listing to discover 
files/folders in a given path. For example, in FileInputFormat (both the 
mapreduce and mapred implementations) this is done while calculating splits. 
They do this, however, by listing level by level: to discover files in 
/foo/bar, they do a listing at /foo/bar first to get the immediate children, 
then make the same call on each immediate child of /foo/bar to discover its 
immediate children, and so on. This doesn't scale well for fs implementations 
like s3, because every listStatus call ends up being a webservice call to s3. 
In cases where a large number of files are considered for input, this makes the 
getSplits() call slow. 

This patch adds a new set of recursive list apis that give the s3 fs 
implementation an opportunity to optimize. The behavior remains the same for 
other implementations (that is, a default implementation is provided for other 
fs so they don't have to implement anything new). However, for s3 it provides a 
simple change (as shown in the patch) to improve listing performance.
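
The call-count difference is easy to put numbers on: level-by-level listing 
costs one webservice call per directory, while a recursive prefix listing 
against s3 costs one call per page of keys (s3 returns at most 1000 keys per 
list page). The tree shape in main() is made up for illustration:

```java
public class ListingCallCount {
    // Level-by-level: one listStatus (one webservice round trip) per directory.
    public static int levelByLevelCalls(int directories) {
        return directories;
    }

    // Recursive prefix listing: one paged call per pageSize keys,
    // regardless of how deep the directory tree is.
    public static int recursiveCalls(int totalKeys, int pageSize) {
        return Math.max(1, (totalKeys + pageSize - 1) / pageSize);
    }

    public static void main(String[] args) {
        // A hypothetical input: 1000 directories holding 10000 keys total.
        System.out.println(levelByLevelCalls(1000));     // 1000 calls
        System.out.println(recursiveCalls(10000, 1000)); // 10 calls
    }
}
```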





[jira] [Updated] (HADOOP-10634) Add recursive list apis to FileSystem to give implementations an opportunity for optimization

2014-05-28 Thread Sumit Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumit Kumar updated HADOOP-10634:
-

Attachment: HADOOP-10634.patch

 Add recursive list apis to FileSystem to give implementations an opportunity 
 for optimization
 -

 Key: HADOOP-10634
 URL: https://issues.apache.org/jira/browse/HADOOP-10634
 Project: Hadoop Common
  Issue Type: Improvement
  Components: fs/s3
Reporter: Sumit Kumar
 Fix For: 2.4.0

 Attachments: HADOOP-10634.patch


 Currently, different code flows in hadoop use recursive listing to discover 
 files/folders in a given path. For example, in FileInputFormat (both the 
 mapreduce and mapred implementations) this is done while calculating splits. 
 They do this, however, by listing level by level: to discover files in 
 /foo/bar, they do a listing at /foo/bar first to get the immediate children, 
 then make the same call on each immediate child of /foo/bar to discover its 
 immediate children, and so on. This doesn't scale well for fs implementations 
 like s3, because every listStatus call ends up being a webservice call to s3. 
 In cases where a large number of files are considered for input, this makes 
 the getSplits() call slow. 
 This patch adds a new set of recursive list apis that give the s3 fs 
 implementation an opportunity to optimize. The behavior remains the same for 
 other implementations (that is, a default implementation is provided for other 
 fs so they don't have to implement anything new). However, for s3 it provides 
 a simple change (as shown in the patch) to improve listing performance.





[jira] [Updated] (HADOOP-10634) Add recursive list apis to FileSystem to give implementations an opportunity for optimization

2014-05-28 Thread Sumit Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumit Kumar updated HADOOP-10634:
-

Fix Version/s: (was: 2.4.0)
Affects Version/s: 2.4.0
   Status: Patch Available  (was: Open)

Attached a patch that passes all the tests on top of the hadoop 2.4.0 branch

 Add recursive list apis to FileSystem to give implementations an opportunity 
 for optimization
 -

 Key: HADOOP-10634
 URL: https://issues.apache.org/jira/browse/HADOOP-10634
 Project: Hadoop Common
  Issue Type: Improvement
  Components: fs/s3
Affects Versions: 2.4.0
Reporter: Sumit Kumar
 Attachments: HADOOP-10634.patch


 Currently, different code flows in hadoop use recursive listing to discover 
 files/folders in a given path. For example, in FileInputFormat (both the 
 mapreduce and mapred implementations) this is done while calculating splits. 
 They do this, however, by listing level by level: to discover files in 
 /foo/bar, they do a listing at /foo/bar first to get the immediate children, 
 then make the same call on each immediate child of /foo/bar to discover its 
 immediate children, and so on. This doesn't scale well for fs implementations 
 like s3, because every listStatus call ends up being a webservice call to s3. 
 In cases where a large number of files are considered for input, this makes 
 the getSplits() call slow. 
 This patch adds a new set of recursive list apis that give the s3 fs 
 implementation an opportunity to optimize. The behavior remains the same for 
 other implementations (that is, a default implementation is provided for other 
 fs so they don't have to implement anything new). However, for s3 it provides 
 a simple change (as shown in the patch) to improve listing performance.





[jira] [Commented] (HADOOP-6356) Add a Cache for AbstractFileSystem in the new FileContext/AbstractFileSystem framework.

2014-05-22 Thread Sumit Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14006445#comment-14006445
 ] 

Sumit Kumar commented on HADOOP-6356:
-

@all - trying to bring your attention to this JIRA. I see that parts of the 
Hive/Hadoop code have already started consuming these apis, but looking at this 
JIRA, there hasn't been much interest in the last 2 years

 Add a Cache for AbstractFileSystem in the new FileContext/AbstractFileSystem 
 framework.
 ---

 Key: HADOOP-6356
 URL: https://issues.apache.org/jira/browse/HADOOP-6356
 Project: Hadoop Common
  Issue Type: Improvement
  Components: fs
Affects Versions: 0.22.0
Reporter: Sanjay Radia
Assignee: Sanjay Radia

 The new filesystem framework, FileContext and AbstractFileSystem does not 
 implement a cache for AbstractFileSystem.
 This Jira proposes to add a cache to the new framework just like with the old 
 FileSystem.


