[jira] [Commented] (SPARK-34883) Setting CSV reader option "multiLine" to "true" causes URISyntaxException when colon is in file path
[ https://issues.apache.org/jira/browse/SPARK-34883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17373552#comment-17373552 ]

Mike Pieters commented on SPARK-34883:
--

I've got the same error here when I try to run:
{code:java}
spark.read.csv(URL_ABFS_RAW + "/salesforce/Case/timestamp=2021-07-02 00:14:15.129481", header=True, multiLine=True)
{code}
I'm running Spark 3.0.1

> Setting CSV reader option "multiLine" to "true" causes URISyntaxException when colon is in file path
>
> Key: SPARK-34883
> URL: https://issues.apache.org/jira/browse/SPARK-34883
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.0.0, 3.1.1
> Reporter: Brady Tello
> Priority: Major
>
> Setting the CSV reader's "multiLine" option to "True" throws the following
> exception when a ':' character is in the file path.
>
> {code:java}
> java.net.URISyntaxException: Relative path in absolute URI: test:dir
> {code}
> I've tested this in both Spark 3.0.0 and Spark 3.1.1 and I get the same error
> whether I use Scala, Python, or SQL.
> The following code works fine:
>
> {code:java}
> csvFile = "/FileStore/myDir/test:dir/pageviews_by_second.tsv"
> tempDF = (spark.read.option("sep", "\t").csv(csvFile))
> {code}
> While the following code fails:
>
> {code:java}
> csvFile = "/FileStore/myDir/test:dir/pageviews_by_second.tsv"
> tempDF = (spark.read.option("sep", "\t").option("multiLine", "True").csv(csvFile))
> {code}
> Full Stack Trace from Python:
>
> {code:java}
> ---
> IllegalArgumentException Traceback (most recent call last)
> in
> 3 csvFile = "/FileStore/myDir/test:dir/pageviews_by_second.tsv"
> 4
> ----> 5 tempDF = (spark.read.option("sep", "\t").option("multiLine", "True")
>
> /databricks/spark/python/pyspark/sql/readwriter.py in csv(self, path, schema, sep, encoding, quote, escape, comment, header, inferSchema, ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, nullValue, nanValue, positiveInf, negativeInf, dateFormat, timestampFormat, maxColumns, maxCharsPerColumn, maxMalformedLogPerPartition, mode, columnNameOfCorruptRecord, multiLine, charToEscapeQuoteEscaping, samplingRatio, enforceSchema, emptyValue, locale, lineSep, pathGlobFilter, recursiveFileLookup, modifiedBefore, modifiedAfter, unescapedQuoteHandling)
> 735 path = [path]
> 736 if type(path) == list:
> --> 737 return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
> 738 elif isinstance(path, RDD):
> 739 def func(iterator):
>
> /databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
> 1302
> 1303 answer = self.gateway_client.send_command(command)
> -> 1304 return_value = get_return_value(
> 1305 answer, self.gateway_client, self.target_id, self.name)
> 1306
>
> /databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
> 114 # Hide where the exception came from that shows a non-Pythonic
> 115 # JVM exception message.
> --> 116 raise converted from None
> 117 else:
> 118 raise
>
> IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: test:dir
> {code}

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
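
Note on the exception text: the message names only the single path component test:dir, not the full file path. One plausible reading is that the component is parsed on its own, so the text before the colon is taken as a URI scheme and the remainder as a relative path. The sketch below illustrates that reading in plain Python. It is not Hadoop or Spark code; the function names split_scheme and check_path are hypothetical, and the splitting rule is a simplified assumption about how a path string might be divided into scheme and path.

```python
def split_scheme(path_string):
    """Split a path string into (scheme, rest).

    Simplified, assumed rule: a colon is treated as a scheme delimiter
    when it appears before the first slash, or when there is no slash.
    """
    colon = path_string.find(":")
    slash = path_string.find("/")
    if colon != -1 and (slash == -1 or colon < slash):
        return path_string[:colon], path_string[colon + 1:]
    return None, path_string


def check_path(path_string):
    """Reject a string that has a scheme but a non-absolute path part.

    The error text imitates the message quoted in the report; the use of
    ValueError is an illustrative choice, not the real exception type.
    """
    scheme, rest = split_scheme(path_string)
    if scheme is not None and rest and not rest.startswith("/"):
        raise ValueError("Relative path in absolute URI: " + path_string)
    return scheme, rest


if __name__ == "__main__":
    # The full path starts with a slash, so the colon is not read as a scheme.
    print(check_path("/FileStore/myDir/test:dir/pageviews_by_second.tsv"))
    # A single component containing a colon is rejected.
    for component in ("test:dir", "2017-08-01T00:00:00Z.csv.gz"):
        try:
            check_path(component)
        except ValueError as err:
            print(err)
```

Under this assumption the full path is accepted while either colon-bearing component alone is rejected, which would be consistent with the read failing only on the code path that handles path components individually.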
[jira] [Commented] (SPARK-23814) Couldn't read file with colon in name and new line character in one of the field.
[ https://issues.apache.org/jira/browse/SPARK-23814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17373526#comment-17373526 ]

Mike Pieters commented on SPARK-23814:
--

I also got the same error in version 3.0.1.

> Couldn't read file with colon in name and new line character in one of the field.
> -
>
> Key: SPARK-23814
> URL: https://issues.apache.org/jira/browse/SPARK-23814
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, Spark Shell
> Affects Versions: 2.2.0
> Reporter: bharath kumar avusherla
> Priority: Major
>
> Reading fails when the file name contains a colon and the data contains a
> newline character inside a field. Calling
> spark.read.option("multiLine","true").csv("s3n://DirectoryPath/") throws
> *"java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative
> path in absolute URI: 2017-08-01T00:00:00Z.csv.gz"*. Without
> option("multiLine","true"), the read works even though the file name
> contains a colon. With *option("multiLine","true")*, the read also works on
> any file whose name does not contain a colon. It fails only when both are
> present (a colon in the file name and a newline in the data).
> {quote}java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 2017-08-01T00:00:00Z.csv.gz
> at org.apache.hadoop.fs.Path.initialize(Path.java:205)
> at org.apache.hadoop.fs.Path.<init>(Path.java:171)
> at org.apache.hadoop.fs.Path.<init>(Path.java:93)
> at org.apache.hadoop.fs.Globber.glob(Globber.java:253)
> at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1676)
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294)
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
> at org.apache.spark.input.StreamFileInputFormat.setMinPartitions(PortableDataStream.scala:51)
> at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:46)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
> at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
> at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
> at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1333)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
> at org.apache.spark.rdd.RDD.take(RDD.scala:1327)
> at org.apache.spark.sql.execution.datasources.csv.MultiLineCSVDataSource$.infer(CSVDataSource.scala:224)
> at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62)
> at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57)
> at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)
> at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)
> at scala.Option.orElse(Option.scala:289)
> at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:176)
> at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
> ... 48 elided
> Caused by: java.net.URISyntaxException: Relative path in absolute URI: 2017-08-01T00:00:00Z.csv.gz
> at java.net.URI.checkPath(URI.java:1823)
> at java.net.URI.<init>(URI.java:745)
> at org.apache.hadoop.fs.Path.initialize(Path.java:202)
> ... 86 more
> {quote}