[ https://issues.apache.org/jira/browse/SPARK-23148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16332999#comment-16332999 ]
Henry Robinson edited comment on SPARK-23148 at 1/19/18 11:25 PM:
------------------------------------------------------------------

It seems the problem is that {{CodecStreams.createInputStreamWithCloseResource}} can't properly handle a {{path}} argument that is URL-encoded. We could add an overload, {{createInputStreamWithCloseResource(Configuration, Path)}}, and then pass {{new Path(new URI(path))}} from {{CSVDataSource.readFile()}}. This has the benefit of being a more localised change (and doesn't change the 'contract' that comes from {{FileScanRDD}} currently using URL-encoded pathnames everywhere). A strawman commit is [here|https://github.com/henryr/spark/commit/b8c51418ee7d4bca18179fd863f7f4885c98c0ef].
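The {{new Path(new URI(path))}} suggestion works because {{java.net.URI}} decodes percent-escapes when the path component is read back. A minimal sketch using only the JDK (no Hadoop dependency; the class name is just for illustration):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriDecodeDemo {
    public static void main(String[] args) throws URISyntaxException {
        // FileScanRDD hands readFile() the URL-encoded form of the path.
        String encoded = "file:/tmp/a%20b%20c/a.csv";

        // Parsing the string as a URI and reading the path back decodes
        // the percent-escapes, so the filesystem sees the real file name.
        URI uri = new URI(encoded);
        System.out.println(uri.getPath()); // prints "/tmp/a b c/a.csv"
    }
}
```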
> spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-23148
>                 URL: https://issues.apache.org/jira/browse/SPARK-23148
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Bogdan Raducanu
>            Priority: Major
>
> Repro code:
> {code:java}
> spark.range(10).write.csv("/tmp/a b c/a.csv")
> spark.read.option("multiLine", false).csv("/tmp/a b c/a.csv").count
> 10
> spark.read.option("multiLine", true).csv("/tmp/a b c/a.csv").count
> java.io.FileNotFoundException: File file:/tmp/a%20b%20c/a.csv/part-00000-cf84f9b2-5fe6-4f54-a130-a1737689db00-c000.csv does not exist
> {code}
> Trying to manually escape fails in a different place:
> {code}
> spark.read.option("multiLine", true).csv("/tmp/a%20b%20c/a.csv").count
> org.apache.spark.sql.AnalysisException: Path does not exist: file:/tmp/a%20b%20c/a.csv;
>   at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:683)
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387)
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387)
>   at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
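For context on where the {{%20}} form in the exception comes from: the multi-argument {{java.net.URI}} constructors (which Hadoop's {{Path}} uses internally) percent-encode characters that are illegal in a URI path, such as spaces. A hedged JDK-only sketch of that round-trip, not Spark code:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriEncodeDemo {
    public static void main(String[] args) throws URISyntaxException {
        // The multi-argument URI constructor quotes illegal characters
        // in the path component, turning each space into %20.
        URI uri = new URI("file", null, "/tmp/a b c/a.csv", null);
        System.out.println(uri.toString()); // prints "file:/tmp/a%20b%20c/a.csv"

        // If this encoded string is later treated as a literal file name,
        // the lookup fails with FileNotFoundException, as in the repro above.
    }
}
```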