[ https://issues.apache.org/jira/browse/SPARK-32766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17189701#comment-17189701 ]
Dongjoon Hyun commented on SPARK-32766:
---------------------------------------

Thank you for the pointer, [~ste...@apache.org].

> s3a: bucket names with dots cannot be used
> ------------------------------------------
>
>                 Key: SPARK-32766
>                 URL: https://issues.apache.org/jira/browse/SPARK-32766
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 3.0.0
>            Reporter: Ondrej Kokes
>            Priority: Minor
>
> Running vanilla Spark with
> {noformat}
> --packages=org.apache.hadoop:hadoop-aws:x.y.z{noformat}
> I cannot read from S3 if the bucket name contains a dot (a valid bucket name).
> A minimal reproducible example looks like this:
> {noformat}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
>
> if __name__ == '__main__':
>     spark = (SparkSession
>              .builder
>              .appName('my_app')
>              .master("local[*]")
>              .getOrCreate())
>     spark.read.csv("s3a://test-bucket-name-v1.0/foo.csv")
> {noformat}
> Or just launch a spark-shell with `--packages=(...)hadoop-aws(...)` and read that CSV. I created the same bucket without the period and it worked fine.
> *Now I'm not sure whether this is a matter of prepping the path names and passing them to the aws-sdk, or whether the fault is within the SDK itself. I am not Java-savvy enough to investigate the issue further, but I tried to make the repro as short as possible.*
> ----
> I get different errors depending on which Hadoop distribution I use. If I use the default PySpark distribution (which includes Hadoop 2), I get the following (using hadoop-aws:2.7.4):
> {noformat}
> scala> spark.read.csv("s3a://okokes-test-v2.5/foo.csv").show()
> java.lang.IllegalArgumentException: The bucketName parameter must be specified.
>   at com.amazonaws.services.s3.AmazonS3Client.assertParameterNotNull(AmazonS3Client.java:2816)
>   at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1026)
>   at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
>   at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
>   at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
>   at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>   at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
>   at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
>   at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:535)
>   ... 47 elided
> {noformat}
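The Hadoop 2 failure above is consistent with the URI parsing going wrong before the SDK is ever involved. Here is a minimal sketch, runnable in a plain Scala REPL with no Spark or AWS dependencies (the object name is hypothetical and the remark about S3A internals below is an assumption; only the java.net.URI behaviour shown is certain): RFC 2396 requires the last dot-separated label of a hostname to start with a letter, so a name like test-bucket-name-v1.0 does not parse as a host at all.

{noformat}
import java.net.URI

// Hypothetical demo object; only the java.net.URI behaviour is certain.
object DottedBucketDemo extends App {
  val plain  = new URI("s3a://test-bucket-name-v1/foo.csv")
  val dotted = new URI("s3a://test-bucket-name-v1.0/foo.csv")

  // A dot-free bucket name parses as a server-based authority.
  println(plain.getHost)        // test-bucket-name-v1

  // The final label "0" starts with a digit, which is not a legal
  // RFC 2396 toplabel, so java.net.URI falls back to a registry-based
  // authority and getHost() returns null.
  println(dotted.getHost)       // null
  println(dotted.getAuthority)  // test-bucket-name-v1.0
}
{noformat}

If S3AFileSystem derives the bucket name from URI.getHost(), as the S3AFileSystem.initialize frame in the trace suggests, the null host would reach the SDK as a missing bucket name and trigger exactly the assertParameterNotNull failure shown above.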
> When I downloaded 3.0.0 with Hadoop 3 and ran a spark-shell there, I got this error instead (with hadoop-aws:3.2.0):
> {noformat}
> java.lang.NullPointerException: null uri host.
>   at java.base/java.util.Objects.requireNonNull(Objects.java:246)
>   at org.apache.hadoop.fs.s3native.S3xLoginHelper.buildFSURI(S3xLoginHelper.java:71)
>   at org.apache.hadoop.fs.s3a.S3AFileSystem.setUri(S3AFileSystem.java:470)
>   at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:235)
>   at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
>   at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
>   at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
>   at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
>   at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:535)
>   ... 47 elided
> {noformat}
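The Hadoop 3 connector appears to fail faster on the same root cause: the message "null uri host." is exactly what a null-check on the parsed host would produce. A hedged reproduction, again needing nothing but a Scala REPL (the object name is hypothetical and the snippet only mimics the Objects.requireNonNull frame shown above; I have not verified the Hadoop source):

{noformat}
import java.net.URI
import java.util.Objects

// Hypothetical repro; mimics the first frame of the Hadoop 3 trace.
object NullHostRepro extends App {
  val dotted = new URI("s3a://okokes-test-v2.5/foo.csv")

  // getHost() is null for the dotted name, so this throws
  // java.lang.NullPointerException: null uri host.
  Objects.requireNonNull(dotted.getHost, "null uri host.")
}
{noformat}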