[ https://issues.apache.org/jira/browse/HADOOP-17241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17189310#comment-17189310 ]
Ondrej Kokes commented on HADOOP-17241:
---------------------------------------

[~ste...@apache.org] excellent stuff, thanks for taking the time to explain the history. It's not really common that people take the time to outline an issue and its resolution so clearly, so I wanted to stress how much I appreciate your approach. I actually read that AWS blog post some time ago, but didn't connect the dots until just now.

As a result, I have just sent out an internal memo to start moving away from bucket names with dots, to avoid any further issues with this. Interestingly enough, this was only an issue once we started using OSS versions of Hadoop and Spark - as long as we were using EMR or Glue, it all worked fine, which probably comes down to EMRFS, Amazon's own implementation of S3 access.

Take care!

> s3a: bucket names which aren't parseable hostnames unsupported
> --------------------------------------------------------------
>
>                 Key: HADOOP-17241
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17241
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 2.7.4, 3.2.0
>            Reporter: Ondrej Kokes
>            Priority: Minor
>
> Hi there,
> I'm using Spark to read some data from S3 and I encountered an error when reading from a bucket that contains a period (e.g. `s3a://okokes-test-v1.1/foo.csv`). I have close to zero Java experience, but I've tried to trace this as well as I can. Apologies for any misunderstanding on my part.
> _Edit: the title is a little misleading - buckets can contain dots and s3a will work, but only if the bucket name conforms to hostname restrictions - e.g. `s3a://foo.bar/bak.csv` would work, but my case - `okokes-test-v1.1` - does not, because `1` does not conform to a top-level domain pattern._
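> For illustration, that host-vs-authority difference can be checked with plain `java.net.URI` (a minimal, hypothetical standalone snippet - the class name is made up):
> {code:java}
> import java.net.URI;
>
> public class BucketUriCheck {
>   public static void main(String[] args) {
>     // A hostname-conformant bucket name: java.net.URI parses a host.
>     URI ok = URI.create("s3a://foo.bar/bak.csv");
>     System.out.println(ok.getHost());       // foo.bar
>     System.out.println(ok.getAuthority());  // foo.bar
>
>     // The last label "1" starts with a digit, so the authority is not a
>     // valid hostname for java.net.URI: getHost() returns null, while the
>     // bucket name is still available via getAuthority().
>     URI broken = URI.create("s3a://okokes-test-v1.1/foo.csv");
>     System.out.println(broken.getHost());       // null
>     System.out.println(broken.getAuthority());  // okokes-test-v1.1
>   }
> }
> {code}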
> Using hadoop-aws:3.2.0, I get the following:
> {code:java}
> java.lang.NullPointerException: null uri host.
>   at java.base/java.util.Objects.requireNonNull(Objects.java:246)
>   at org.apache.hadoop.fs.s3native.S3xLoginHelper.buildFSURI(S3xLoginHelper.java:71)
>   at org.apache.hadoop.fs.s3a.S3AFileSystem.setUri(S3AFileSystem.java:470)
>   at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:235)
>   at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
>   at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
>   at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
>   at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
>   at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:535)
>   ... 47 elided
> {code}
> hadoop-aws:2.7.4 led to a similar outcome:
> {code:java}
> java.lang.IllegalArgumentException: The bucketName parameter must be specified.
>   at com.amazonaws.services.s3.AmazonS3Client.assertParameterNotNull(AmazonS3Client.java:2816)
>   at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1026)
>   at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
>   at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
>   at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
>   at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>   at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
>   at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
>   at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:535)
>   ... 47 elided
> {code}
> I investigated the issue a little bit and found that buildFSURI requires the host to be non-null - [see S3xLoginHelper.java|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3native/S3xLoginHelper.java#L70] - but in my case the host is null and the authority part of the URL should be used instead. When I checked AWS' handling of this case, they seem to be using the authority for all s3:// paths - [https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-s3/src/main/java/com/amazonaws/services/s3/AmazonS3URI.java#L85].
> I verified this URI in a Scala shell (openjdk 1.8.0_252):
> {code:java}
> scala> (new URI("s3a://okokes-test-v1.1/foo.csv")).getHost()
> val res1: String = null
>
> scala> (new URI("s3a://okokes-test-v1.1/foo.csv")).getAuthority()
> val res2: String = okokes-test-v1.1
> {code}
> Oh, and this is indeed a valid bucket name. Not only did I create it in the console, but there's also enough documentation on the topic - [https://docs.aws.amazon.com/AmazonS3/latest/dev/BucketRestrictions.html#bucketnamingrules]
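> For what it's worth, a rough, hypothetical sketch of the authority-fallback idea mentioned above might look like the following (illustration only - this is not the actual S3xLoginHelper or AmazonS3URI code, and it ignores login details that may be embedded in the authority):
> {code:java}
> import java.net.URI;
>
> // Hypothetical helper: resolve the bucket name from an s3a:// URI,
> // falling back to the authority when java.net.URI cannot parse a host.
> public final class BucketNameResolver {
>   static String bucketOf(URI fsUri) {
>     String host = fsUri.getHost();
>     if (host != null) {
>       return host;               // hostname-conformant bucket, e.g. foo.bar
>     }
>     String authority = fsUri.getAuthority();
>     if (authority != null) {
>       return authority;          // e.g. okokes-test-v1.1
>     }
>     throw new IllegalArgumentException("No bucket name in URI: " + fsUri);
>   }
>
>   public static void main(String[] args) {
>     System.out.println(bucketOf(URI.create("s3a://foo.bar/bak.csv")));          // foo.bar
>     System.out.println(bucketOf(URI.create("s3a://okokes-test-v1.1/foo.csv"))); // okokes-test-v1.1
>   }
> }
> {code}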