ReceiverInputDStream#saveAsTextFiles with an S3 URL results in double forward slash key names in S3

2014-12-23 Thread Enno Shioji
Is anybody experiencing this? It looks like a bug in JetS3t to me, but
thought I'd sanity check before filing an issue.



I'm writing to S3 using ReceiverInputDStream#saveAsTextFiles with an S3 URL
(s3://fake-test/1234).

The code does write to S3, but with double forward slashes in the key names
(e.g. s3://fake-test//1234/-141933428/).

I did some debugging, and it seems like the culprit is
Jets3tFileSystemStore#pathToKey(path), which returns /fake-test/1234/...
for the input s3://fake-test/1234/ when it should strip off the leading
forward slash. However, I couldn't find any bug report for JetS3t about this.

Am I missing something, or is this likely a JetS3t bug?
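
For reference, here is a minimal sketch with plain java.net.URI (no Hadoop
involved) of where a leading slash can come from; the bucket and key are the
fake names from above:
==
import java.net.URI;

public class LeadingSlashDemo {
  public static void main(String[] args) {
    // For s3://fake-test/1234, the bucket is the URI authority and the
    // path component keeps its leading slash.
    URI uri = URI.create("s3://fake-test/1234");
    System.out.println(uri.getAuthority()); // prints: fake-test
    System.out.println(uri.getPath());      // prints: /1234
    // A store that uses that path verbatim as the object key ends up
    // writing under s3://fake-test//1234, i.e. the double slash above.
  }
}
==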





Re: ReceiverInputDStream#saveAsTextFiles with an S3 URL results in double forward slash key names in S3

2014-12-23 Thread Enno Shioji
I filed a new issue, HADOOP-11444. According to HADOOP-10372, s3 is likely
to be deprecated anyway in favor of s3n.
Also, the comment section notes that Amazon has implemented an EmrFileSystem
for S3, which is built on the AWS SDK rather than JetS3t.




On Tue, Dec 23, 2014 at 2:06 PM, Enno Shioji eshi...@gmail.com wrote:

 Hey Jay :)

 I tried s3n, which uses Jets3tNativeFileSystemStore, and the double slash
 went away. As far as I can see, it does look like a bug in hadoop-common;
 I'll file a ticket for it.

 Hope you are doing well, by the way!

 PS:
  Jets3tNativeFileSystemStore's implementation of pathToKey is:
 ==
   private static String pathToKey(Path path) {
     if (path.toUri().getScheme() != null &&
         path.toUri().getPath().isEmpty()) {
       // allow uris without trailing slash after bucket to refer to root,
       // like s3n://mybucket
       return "";
     }
     if (!path.isAbsolute()) {
       throw new IllegalArgumentException("Path must be absolute: " + path);
     }
     String ret = path.toUri().getPath().substring(1); // remove initial slash
     if (ret.endsWith("/") && (ret.indexOf("/") != ret.length() - 1)) {
       ret = ret.substring(0, ret.length() - 1);
     }
     return ret;
   }
 ==

 whereas Jets3tFileSystemStore uses:
 ==
   private String pathToKey(Path path) {
     if (!path.isAbsolute()) {
       throw new IllegalArgumentException("Path must be absolute: " + path);
     }
     return path.toUri().getPath();
   }
 ==
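
 For illustration, this is the kind of one-line change I have in mind
 (purely hypothetical sketch; the actual patch may well differ, e.g. if
 existing block-store keys depend on the leading slash):
 ==
   // Hypothetical variant of Jets3tFileSystemStore#pathToKey that strips
   // the leading slash the same way the native store does; not necessarily
   // what the final patch will look like.
   private String pathToKey(Path path) {
     if (!path.isAbsolute()) {
       throw new IllegalArgumentException("Path must be absolute: " + path);
     }
     String ret = path.toUri().getPath();
     return ret.startsWith("/") ? ret.substring(1) : ret; // drop leading slash
   }
 ==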






 On Tue, Dec 23, 2014 at 1:07 PM, Jay Vyas jayunit100.apa...@gmail.com
 wrote:

 Hi Enno.  Might be worthwhile to cross-post this on dev@hadoop...
 An obvious, simple test from the Spark side would be to change the URI to
 write to hdfs:// or file://, and confirm that the extra slash goes away
 (there's a sketch of this below).

 - If it's indeed a JetS3t issue, we should add a new unit test for it, since
 the HCFS tests pass for Jets3tFileSystem yet this error still exists.

 - To learn how to run the HCFS tests against any FileSystem, see the wiki
 page: https://wiki.apache.org/hadoop/HCFS/Progress (see the July 14th
 entry on that page).

 - Is there another S3 FileSystem implementation for AbstractFileSystem, or
 is JetS3t the only one?  That would be an easy way to test this, and also a
 good workaround.

 I'm also wondering why Jets3tFileSystem is the AbstractFileSystem
 implementation used by so many - is that the standard impl for the
 AbstractFileSystem interface?
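
 Something like this would compare the schemes side by side (untested
 sketch; the socket source is just a stand-in for whatever receiver the job
 uses, and the usual fs.s3/fs.s3n credential properties would need to be
 set):
 ==
 import org.apache.spark.SparkConf;
 import org.apache.spark.streaming.Durations;
 import org.apache.spark.streaming.api.java.JavaDStream;
 import org.apache.spark.streaming.api.java.JavaStreamingContext;

 public class SchemeComparison {
   public static void main(String[] args) {
     SparkConf conf = new SparkConf()
         .setAppName("scheme-comparison").setMaster("local[2]");
     JavaStreamingContext ssc =
         new JavaStreamingContext(conf, Durations.seconds(10));

     // Any receiver will do; a socket stream keeps the sketch self-contained.
     JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);

     // Same stream, different schemes: compare the resulting directory names.
     lines.dstream().saveAsTextFiles("s3://fake-test/1234", "");        // double slash appears
     lines.dstream().saveAsTextFiles("s3n://fake-test/1234", "");       // no double slash
     lines.dstream().saveAsTextFiles("file:///tmp/fake-test/1234", ""); // local control

     ssc.start();
     ssc.awaitTermination();
   }
 }
 ==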



Re: ReceiverInputDStream#saveAsTextFiles with an S3 URL results in double forward slash key names in S3

2014-12-23 Thread Jon Chase
I've had a lot of difficulty using the s3:// prefix; s3n:// seems to work
much better.  Can't find the link ATM, but I seem to recall that s3://
(Hadoop's original block format for S3) is no longer recommended for use.
Amazon's EMR goes so far as to remap s3:// to s3n:// behind the scenes.
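
For anyone trying the s3n:// route outside EMR, a minimal smoke test looks
roughly like this (fs.s3n.* are the standard Hadoop credential properties;
the values and the bucket name are placeholders):
==
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3nSmokeTest {
  public static void main(String[] args) throws Exception {
    // The native S3 filesystem reads credentials from fs.s3n.*
    // (these can also live in core-site.xml).
    Configuration conf = new Configuration();
    conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
    conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

    FileSystem fs = FileSystem.get(URI.create("s3n://fake-test/"), conf);
    Path p = new Path("s3n://fake-test/1234/smoke");
    FSDataOutputStream out = fs.create(p);
    out.writeBytes("hello");
    out.close();

    // The object should land at key 1234/smoke, with no leading slash.
    System.out.println(fs.getFileStatus(p).getPath());
  }
}
==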
