[ 
https://issues.apache.org/jira/browse/SPARK-20061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15943008#comment-15943008
 ] 

Steve Loughran commented on SPARK-20061:
----------------------------------------

":" is one of those "implicitly forbidden characters in a path element", along 
with "/", as [we've tried to write down in the 
past|https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/introduction.md].

Now the fact that S3 fails to be so prescriptive is unfortunate, but given 
that HDFS supports the NUL (0x00) byte in a name, it's hard to take the moral 
high ground here.

This is pretty fundamental, given HADOOP-13 and HADOOP-3257, which means that 
fixing it is going to be a hard undertaking. It's not just the globbing change: 
it looks like it happens wherever the code assumes you can construct a new 
path p2 from a path p1 via {{p2 = new Path(p1, child)}}. That happens in a 
lot of places; to fix this you'd have to get the HDFS team who wrote this stuff 
involved, because they are the ones who understand the details.
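To see why {{new Path(p1, child)}} breaks on such names: Hadoop's {{Path(String)}} constructor treats everything before the first ":" as a URI scheme when no "/" comes first. The sketch below mimics that parsing with plain {{java.net.URI}}; it is a simplified illustration of the failure mode, not the actual Hadoop code, and the file name is made up in the style of the report.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class ColonPathDemo {

    // Simplified sketch of the scheme-detection logic in
    // org.apache.hadoop.fs.Path(String): anything before the first ':'
    // is treated as a URI scheme when no '/' appears before it.
    static URI parseLikeHadoopPath(String pathString) throws URISyntaxException {
        int colon = pathString.indexOf(':');
        int slash = pathString.indexOf('/');
        String scheme = null;
        String path = pathString;
        if (colon != -1 && (slash == -1 || colon < slash)) {
            scheme = pathString.substring(0, colon);
            path = pathString.substring(colon + 1);
        }
        // Path.initialize() ends up in a multi-argument URI constructor;
        // a non-null scheme combined with a relative path is rejected by
        // its checkPath() validation.
        return new URI(scheme, null, path, null, null);
    }

    public static void main(String[] args) {
        String child = "2017-01-06T20:33:45.255-sample.json"; // made-up name
        try {
            parseLikeHadoopPath(child);
            System.out.println("parsed OK");
        } catch (URISyntaxException e) {
            // Scheme becomes "2017-01-06T20", and the remaining path
            // "33:45.255-sample.json" is relative, hence the error.
            System.out.println("URISyntaxException: " + e.getReason());
        }
    }
}
```

A name without a colon goes through the same code path unharmed, which is why only these timestamped names trigger the bug.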

This is happening in the Hadoop JARs, so there's nothing you can fix in the 
Spark codebase, as far as I can see. Closing as a duplicate.

Workaround: don't have filenames like that. That may be dismissive, but it 
works, and it's the only solution likely to be valid in the near future. Sorry.
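If you control the job that writes the files, one way to apply that workaround is to strip the offending character from the name before the object is uploaded. The helper below is hypothetical, not part of any Spark or Hadoop API:

```java
public class KeySanitizer {

    // Hypothetical helper: replace the colon, which Hadoop's Path cannot
    // parse, with a hyphen so later glob/list operations on the bucket
    // succeed. Any substitute character that is legal in a path element
    // would do.
    static String sanitizeKey(String name) {
        return name.replace(':', '-');
    }

    public static void main(String[] args) {
        // Made-up file name in the style of the report.
        System.out.println(sanitizeKey("2017-01-06T20:33:45.255-sample.json"));
    }
}
```

This only helps for files you have yet to write; existing objects would need a copy-and-rename pass.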



> Reading a file with colon (:) from S3 fails with URISyntaxException
> -------------------------------------------------------------------
>
>                 Key: SPARK-20061
>                 URL: https://issues.apache.org/jira/browse/SPARK-20061
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.1.0
>         Environment: EC2, AWS
>            Reporter: Michel Lemay
>
> When reading a bunch of files from S3 using wildcards, it fails with the 
> following exception:
> {code}
> scala> val fn = "s3a://mybucket/path/*/"
> scala> val ds = spark.readStream.schema(schema).json(fn)
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: 
> 2017-01-06T20:33:45.255-analyticsqa-49569270507599054034141623773442922465540524816321216514.json
>   at org.apache.hadoop.fs.Path.initialize(Path.java:205)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:171)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:93)
>   at org.apache.hadoop.fs.Globber.glob(Globber.java:241)
>   at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1657)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:237)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:243)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$2.apply(DataSource.scala:131)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$2.apply(DataSource.scala:127)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.tempFileIndex$lzycompute$1(DataSource.scala:127)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$tempFileIndex$1(DataSource.scala:124)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:138)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:229)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:87)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:87)
>   at 
> org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
>   at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:124)
>   at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:133)
>   at 
> org.apache.spark.sql.streaming.DataStreamReader.json(DataStreamReader.scala:181)
>   ... 50 elided
> Caused by: java.net.URISyntaxException: Relative path in absolute URI: 
> 2017-01-06T20:33:45.255-analyticsqa-49569270507599054034141623773442922465540524816321216514.json
>   at java.net.URI.checkPath(URI.java:1823)
>   at java.net.URI.<init>(URI.java:745)
>   at org.apache.hadoop.fs.Path.initialize(Path.java:202)
>   ... 73 more
> {code}
> The file in question sits at the root of s3a://mybucket/path/
> {code}
> aws s3 ls s3://mybucket/path/
>                            PRE subfolder1/
>                            PRE subfolder2/
> ...
> 2017-01-06 20:33:46       1383 
> 2017-01-06T20:33:45.255-analyticsqa-49569270507599054034141623773442922465540524816321216514.json
> ...
> {code}
> Removing the wildcard from the path makes it work, but it obviously misses 
> all files in subdirectories.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
