[ https://issues.apache.org/jira/browse/SPARK-20061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15943008#comment-15943008 ]
Steve Loughran commented on SPARK-20061:
----------------------------------------

":" is one of those implicitly forbidden characters in a path element, along with "/", as [we've tried to write down in the past|https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/introduction.md]. The fact that S3 fails to be so prescriptive is unfortunate, but given that HDFS supports the 0 byte char in a name, it's hard to take the moral high ground here.

This is pretty fundamental, given HADOOP-13 and HADOOP-3257, which means that fixing it is going to be a hard undertaking. It's not just the globbing code: the failure shows up wherever code assumes it can construct a new path {{p2}} from a path {{p1}} by going {{p2 = new Path(p1, child)}}. That happens in a lot of places, so a fix would need to involve the HDFS team who wrote this code, because they are the ones who understand the details. It is all happening in the Hadoop JARs; there is nothing to fix in the Spark codebase, as far as I can see.

Closing as a duplicate.

Workaround: don't have filenames like that. That may sound dismissive, but it works, and it's the only solution likely to be valid in the near future.
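The {{new Path(p1, child)}} failure mode can be reproduced without Hadoop on the classpath at all. The sketch below is hypothetical (the class and method names are mine, not Hadoop's); it mimics the heuristic {{org.apache.hadoop.fs.Path}} uses — anything before the first ":" that precedes the first "/" is treated as a URI scheme — and then hands the pieces to {{java.net.URI}}, which is where the "Relative path in absolute URI" message in the quoted stack trace originates.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class: imitates the scheme-splitting heuristic that
// org.apache.hadoop.fs.Path applies before constructing a java.net.URI.
public class ColonPathDemo {

    /**
     * If a ':' appears before the first '/', treat everything before it as a
     * URI scheme (Path's heuristic), then build the URI. With a scheme set
     * and a relative remainder, URI.checkPath rejects the combination.
     * Returns "ok" on success, or the URISyntaxException reason on failure.
     */
    static String tryAsPath(String name) {
        int colon = name.indexOf(':');
        int slash = name.indexOf('/');
        try {
            if (colon != -1 && (slash == -1 || colon < slash)) {
                String scheme = name.substring(0, colon);
                String rest = name.substring(colon + 1);
                // scheme != null but 'rest' does not start with '/'
                new URI(scheme, null, rest, null, null);
            } else {
                new URI(null, null, name, null, null);
            }
            return "ok";
        } catch (URISyntaxException e) {
            return e.getReason();
        }
    }

    public static void main(String[] args) {
        // A timestamped name like the one in the report blows up:
        System.out.println(tryAsPath("2017-01-06T20:33:45.255-analyticsqa.json"));
        // prints: Relative path in absolute URI
        System.out.println(tryAsPath("subfolder1/part-00000.json"));
        // prints: ok
    }
}
```

Note that a colon *after* the first slash is fine ({{dir/a:b}} parses as a relative URI), which is why the bug only bites on bare filenames like the timestamped key above.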
Sorry

> Reading a file with colon (:) from S3 fails with URISyntaxException
> -------------------------------------------------------------------
>
>                 Key: SPARK-20061
>                 URL: https://issues.apache.org/jira/browse/SPARK-20061
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.1.0
>         Environment: EC2, AWS
>            Reporter: Michel Lemay
>
> When reading a bunch of files from s3 using wildcards, it fails with the following exception:
> {code}
> scala> val fn = "s3a://mybucket/path/*/"
> scala> val ds = spark.readStream.schema(schema).json(fn)
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 2017-01-06T20:33:45.255-analyticsqa-49569270507599054034141623773442922465540524816321216514.json
>   at org.apache.hadoop.fs.Path.initialize(Path.java:205)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:171)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:93)
>   at org.apache.hadoop.fs.Globber.glob(Globber.java:241)
>   at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1657)
>   at org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:237)
>   at org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:243)
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$2.apply(DataSource.scala:131)
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$2.apply(DataSource.scala:127)
>   at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at org.apache.spark.sql.execution.datasources.DataSource.tempFileIndex$lzycompute$1(DataSource.scala:127)
>   at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$tempFileIndex$1(DataSource.scala:124)
>   at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:138)
>   at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:229)
>   at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:87)
>   at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:87)
>   at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
>   at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:124)
>   at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:133)
>   at org.apache.spark.sql.streaming.DataStreamReader.json(DataStreamReader.scala:181)
>   ... 50 elided
> Caused by: java.net.URISyntaxException: Relative path in absolute URI: 2017-01-06T20:33:45.255-analyticsqa-49569270507599054034141623773442922465540524816321216514.json
>   at java.net.URI.checkPath(URI.java:1823)
>   at java.net.URI.<init>(URI.java:745)
>   at org.apache.hadoop.fs.Path.initialize(Path.java:202)
>   ... 73 more
> {code}
> The file in question sits at the root of s3a://mybucket/path/
> {code}
> aws s3 ls s3://mybucket/path/
>                            PRE subfolder1/
>                            PRE subfolder2/
> ...
> 2017-01-06 20:33:46       1383 2017-01-06T20:33:45.255-analyticsqa-49569270507599054034141623773442922465540524816321216514.json
> ...
> {code}
> Removing the wildcard from the path makes it work, but then it obviously misses all the files in subdirectories.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
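Since the fix has to land in the Hadoop JARs, the practical mitigation suggested above ("don't have filenames like that") belongs on the write side: keep ":" out of object names before they reach S3. A minimal, hypothetical sketch of such a rename step (the class name and the "_" replacement character are my choices, not anything from Hadoop or Spark):

```java
// Hypothetical write-side mitigation: rewrite a proposed object name so
// that Hadoop's Path parsing never sees a ':' before the first '/'.
public class SafeKey {

    /** Replace ':' in a filename; '_' is an arbitrary replacement choice. */
    static String safeName(String name) {
        return name.replace(':', '_');
    }

    public static void main(String[] args) {
        // ISO-8601 timestamps are the usual source of colons in key names:
        System.out.println(safeName("2017-01-06T20:33:45.255-analyticsqa.json"));
        // prints: 2017-01-06T20_33_45.255-analyticsqa.json
    }
}
```

Applying this when keys are generated (rather than renaming objects afterwards with {{aws s3 mv}}) avoids the extra copy that S3 renames require.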