[ 
https://issues.apache.org/jira/browse/HADOOP-17241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ondrej Kokes updated HADOOP-17241:
----------------------------------
    Description: 
Hi there,
I'm using Spark to read data from S3 and ran into an error when reading from a 
bucket whose name contains a period (e.g. `s3a://okokes-test-v1.1/foo.csv`). I 
have close to zero Java experience, but I've tried to trace this as best I can. 
Apologies for any misunderstanding on my part.

_Edit: the title is a little misleading - bucket names can contain dots and s3a 
will work, but only if the name conforms to hostname restrictions - e.g. 
`s3a://foo.bar/bak.csv` works, while my bucket, `okokes-test-v1.1`, does not, 
because its last label, `1`, does not conform to a top-level domain pattern._
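The hostname restriction can be checked directly with `java.net.URI` (a small sketch; the class name is mine):

```java
import java.net.URI;

public class DotBucketDemo {
    public static void main(String[] args) {
        // Last label "bar" starts with a letter, so the authority parses
        // as a valid hostname and getHost() is non-null.
        URI ok = URI.create("s3a://foo.bar/bak.csv");
        System.out.println(ok.getHost());       // foo.bar

        // Last label "1" starts with a digit, so Java rejects the
        // authority as a hostname: getHost() is null, even though the
        // bucket name is still available via getAuthority().
        URI bad = URI.create("s3a://okokes-test-v1.1/foo.csv");
        System.out.println(bad.getHost());      // null
        System.out.println(bad.getAuthority()); // okokes-test-v1.1
    }
}
```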

Using hadoop-aws:3.2.0, I get the following:
{code:java}
java.lang.NullPointerException: null uri host.
 at java.base/java.util.Objects.requireNonNull(Objects.java:246)
 at 
org.apache.hadoop.fs.s3native.S3xLoginHelper.buildFSURI(S3xLoginHelper.java:71)
 at org.apache.hadoop.fs.s3a.S3AFileSystem.setUri(S3AFileSystem.java:470)
 at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:235)
 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
 at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
 at 
org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
 at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
 at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
 at 
org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:535)
 ... 47 elided{code}
hadoop-aws:2.7.4 leads to a similar outcome:
{code:java}
java.lang.IllegalArgumentException: The bucketName parameter must be specified.
 at 
com.amazonaws.services.s3.AmazonS3Client.assertParameterNotNull(AmazonS3Client.java:2816)
 at 
com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1026)
 at 
com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
 at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
 at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
 at 
org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
 at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
 at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
 at 
org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:535)
 ... 47 elided{code}
I investigated the issue a bit and found that buildFSURI requires the host to 
be non-null - [see 
S3xLoginHelper.java|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3native/S3xLoginHelper.java#L70]
 - but in my case the host is null, and the authority part of the URI should be 
used instead. When I checked how AWS handles this case, their SDK seems to use 
the authority for all s3:// paths - 
[https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-s3/src/main/java/com/amazonaws/services/s3/AmazonS3URI.java#L85].

I verified this behavior in a Scala shell (OpenJDK 1.8.0_252):
{code:java}
scala> (new URI("s3a://okokes-test-v1.1/foo.csv")).getHost()
val res1: String = null
scala> (new URI("s3a://okokes-test-v1.1/foo.csv")).getAuthority()
val res2: String = okokes-test-v1.1
{code}
 

Oh, and this is indeed a valid bucket name. Not only did I create it in the 
console, the naming rules are also well documented - 
[https://docs.aws.amazon.com/AmazonS3/latest/dev/BucketRestrictions.html#bucketnamingrules]
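A possible fix along the lines of the AWS SDK's behavior would be to fall back to the authority when the host is null. A minimal sketch (a hypothetical helper, not the actual S3xLoginHelper code):

```java
import java.net.URI;

public class BucketNameFallback {
    // Sketch of the suggested fallback: prefer getHost(), but when the
    // bucket name fails Java's hostname rules, use getAuthority()
    // instead, similar to what the SDK's AmazonS3URI does for s3:// paths.
    static String bucketName(URI uri) {
        String host = uri.getHost();
        return (host != null) ? host : uri.getAuthority();
    }

    public static void main(String[] args) {
        System.out.println(bucketName(URI.create("s3a://foo.bar/bak.csv")));          // foo.bar
        System.out.println(bucketName(URI.create("s3a://okokes-test-v1.1/foo.csv"))); // okokes-test-v1.1
    }
}
```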



> s3a: dots not allowed in bucket names
> -------------------------------------
>
>                 Key: HADOOP-17241
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17241
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 2.7.4, 3.2.0
>            Reporter: Ondrej Kokes
>            Priority: Minor
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
