[jira] [Updated] (SPARK-37111) RDD file loading APIs throw URISyntaxException when there is a colon in the file path

Brady Tello (Jira) Mon, 25 Oct 2021 07:53:08 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-37111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Brady Tello updated SPARK-37111:
--------------------------------
    Description: 
When a file path contains a colon, many of Spark's RDD file loading APIs 
(textFile, wholeTextFiles, possibly others) throw a URISyntaxException.  
The following Scala code and stack trace were generated on my laptop 
running Spark 3.2.0.  I've verified that this issue also affects Python and 
SQL, and I assume it affects Java as well.
{code:java}
scala> val df = sc.wholeTextFiles("/Users/brady.tello/test:me/test.txt").take(1)
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: test:me
  at org.apache.hadoop.fs.Path.initialize(Path.java:259)
  at org.apache.hadoop.fs.Path.<init>(Path.java:217)
  at org.apache.hadoop.fs.Path.<init>(Path.java:125)
  at org.apache.hadoop.fs.Globber.doGlob(Globber.java:229)
  at org.apache.hadoop.fs.Globber.glob(Globber.java:149)
  at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2034)
  at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:303)
  at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:274)
  at org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:52)
  at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:54)
  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
  at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1428)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
  at org.apache.spark.rdd.RDD.take(RDD.scala:1422)
  ... 47 elided
Caused by: java.net.URISyntaxException: Relative path in absolute URI: test:me
  at java.base/java.net.URI.checkPath(URI.java:1990)
  at java.base/java.net.URI.<init>(URI.java:780)
  at org.apache.hadoop.fs.Path.initialize(Path.java:256)
  ... 68 more
{code}
Why not simply avoid colons in paths?  I'm running Spark on top of an S3 
environment in which users are only permitted to read and write data in 
their personal S3 workspace, and the path to that workspace contains a 
colon.  Removing the colon would require a major change to the 
authentication architecture of several apps outside our Spark app, so we 
don't have the flexibility to remove it.  Without a fix for this bug, users 
simply cannot use the RDD APIs.
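
The trace suggests that the failure occurs when the Hadoop globber builds a 
Path from the single path component test:me, where the text before the colon 
appears to be read as a URI scheme.  The sketch below is a simplified 
illustration that uses only java.net.URI.  The class name ColonPathDemo and 
the method describe are hypothetical, and the scheme-splitting rule is an 
assumption about the behavior, not Hadoop's actual code.

```java
import java.net.URI;
import java.net.URISyntaxException;

class ColonPathDemo {
    /**
     * Assumed, simplified splitting rule: a colon that appears before any
     * slash ends a scheme. The result is then passed to java.net.URI.
     */
    public static String describe(String pathString) {
        int colon = pathString.indexOf(':');
        int slash = pathString.indexOf('/');
        String scheme = null;
        String path = pathString;
        if (colon != -1 && (slash == -1 || colon < slash)) {
            scheme = pathString.substring(0, colon);
            path = pathString.substring(colon + 1);
        }
        try {
            // java.net.URI rejects a scheme combined with a path that
            // does not start with a slash.
            return "ok: " + new URI(scheme, null, path, null, null);
        } catch (URISyntaxException e) {
            return "error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // The first slash precedes the colon, so no scheme is extracted.
        System.out.println(describe("/Users/brady.tello/test:me/test.txt"));
        // prints "ok: /Users/brady.tello/test:me/test.txt"

        // An isolated component has no slash, so "test" is taken as a scheme.
        System.out.println(describe("test:me"));
        // prints "error: Relative path in absolute URI: test:me"
    }
}
```

Under these assumptions the full path parses, while the isolated component 
test:me reproduces the message seen in the trace.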

  was:
When a file path contains a colon, many of Spark's RDD file loading APIs 
(textFile, wholeTextFiles, possibly others) throw a URISyntaxException.  
The following code and stack trace were generated on my laptop running 
Spark 3.2.0.  
{code:java}
scala> val df = sc.wholeTextFiles("/Users/brady.tello/test:me/test.txt").take(1)
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: test:me
  at org.apache.hadoop.fs.Path.initialize(Path.java:259)
  at org.apache.hadoop.fs.Path.<init>(Path.java:217)
  at org.apache.hadoop.fs.Path.<init>(Path.java:125)
  at org.apache.hadoop.fs.Globber.doGlob(Globber.java:229)
  at org.apache.hadoop.fs.Globber.glob(Globber.java:149)
  at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2034)
  at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:303)
  at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:274)
  at org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:52)
  at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:54)
  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
  at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1428)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
  at org.apache.spark.rdd.RDD.take(RDD.scala:1422)
  ... 47 elided
Caused by: java.net.URISyntaxException: Relative path in absolute URI: test:me
  at java.base/java.net.URI.checkPath(URI.java:1990)
  at java.base/java.net.URI.<init>(URI.java:780)
  at org.apache.hadoop.fs.Path.initialize(Path.java:256)
  ... 68 more
{code}
Why not simply avoid colons in paths?  I'm running Spark on top of an S3 
environment in which users are only permitted to read and write data in 
their personal S3 workspace, and the path to that workspace contains a 
colon.  Removing the colon would require a major change to the 
authentication architecture of several apps outside our Spark app, so we 
don't have the flexibility to remove it.  Without a fix for this bug, users 
simply cannot use the RDD APIs.


> RDD file loading APIs throw URISyntaxException when there is a colon in the 
> file path
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-37111
>                 URL: https://issues.apache.org/jira/browse/SPARK-37111
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.2.0
>            Reporter: Brady Tello
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
