[jira] [Commented] (SPARK-34883) Setting CSV reader option "multiLine" to "true" causes URISyntaxException when colon is in file path

2021-07-02 Thread Mike Pieters (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17373552#comment-17373552
 ] 

Mike Pieters commented on SPARK-34883:
--

I've got the same error here when I try to run:
{code:java}
spark.read.csv(URL_ABFS_RAW + "/salesforce/Case/timestamp=2021-07-02 00:14:15.129481", header=True, multiLine=True)
{code}
I'm running Spark 3.0.1
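
The path above fails because the timestamp in the directory name contains colons. One possible way to sidestep the error, assuming you control how that directory is named when it is written, is to format the timestamp without colons. This is only a sketch: the helper name colon_free_timestamp is hypothetical, and the ticket does not confirm this as a fix for the reader itself.

```python
from datetime import datetime


def colon_free_timestamp(ts: datetime) -> str:
    """Format a timestamp for a directory name without any ':' characters.

    Hypothetical helper: swaps the usual HH:MM:SS separators for '-'.
    """
    return ts.strftime("%Y-%m-%d_%H-%M-%S.%f")


# Same instant as in the failing path above.
ts = datetime(2021, 7, 2, 0, 14, 15, 129481)
partition_dir = "timestamp=" + colon_free_timestamp(ts)
print(partition_dir)  # timestamp=2021-07-02_00-14-15.129481
```

Data already written under a name with colons would still need to be renamed or read without multiLine.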

 

> Setting CSV reader option "multiLine" to "true" causes URISyntaxException 
> when colon is in file path
> 
>
> Key: SPARK-34883
> URL: https://issues.apache.org/jira/browse/SPARK-34883
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.0.0, 3.1.1
> Reporter: Brady Tello
> Priority: Major
>
> Setting the CSV reader's "multiLine" option to "True" throws the following 
> exception when a ':' character is in the file path.
>  
> {code:java}
> java.net.URISyntaxException: Relative path in absolute URI: test:dir
> {code}
> I've tested this in both Spark 3.0.0 and Spark 3.1.1 and I get the same error 
> whether I use Scala, Python, or SQL.
> The following code works fine:
>  
> {code:java}
> csvFile = "/FileStore/myDir/test:dir/pageviews_by_second.tsv" 
> tempDF = spark.read.option("sep", "\t").csv(csvFile)
> {code}
> While the following code fails:
>  
> {code:java}
> csvFile = "/FileStore/myDir/test:dir/pageviews_by_second.tsv"
> tempDF = spark.read.option("sep", "\t").option("multiLine", "True").csv(csvFile)
> {code}
> Full Stack Trace from Python:
>  
> {code:java}
> --- 
> IllegalArgumentException Traceback (most recent call last)  
> in  
> 3 csvFile = "/FileStore/myDir/test:dir/pageviews_by_second.tsv" 
> 4 
> ----> 5 tempDF = (spark.read.option("sep", "\t").option("multiLine", "True") 
> /databricks/spark/python/pyspark/sql/readwriter.py in csv(self, path, schema, 
> sep, encoding, quote, escape, comment, header, inferSchema, 
> ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, nullValue, nanValue, 
> positiveInf, negativeInf, dateFormat, timestampFormat, maxColumns, 
> maxCharsPerColumn, maxMalformedLogPerPartition, mode, 
> columnNameOfCorruptRecord, multiLine, charToEscapeQuoteEscaping, 
> samplingRatio, enforceSchema, emptyValue, locale, lineSep, pathGlobFilter, 
> recursiveFileLookup, modifiedBefore, modifiedAfter, unescapedQuoteHandling) 
> 735 path = [path] 
> 736 if type(path) == list: 
> --> 737 return 
> self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path))) 
> 738 elif isinstance(path, RDD): 
> 739 def func(iterator): 
> /databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in 
> __call__(self, *args) 
> 1302 
> 1303 answer = self.gateway_client.send_command(command) 
> -> 1304 return_value = get_return_value( 
> 1305 answer, self.gateway_client, self.target_id, self.name) 
> 1306 
> /databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw) 
> 114 # Hide where the exception came from that shows a non-Pythonic 
> 115 # JVM exception message. 
> --> 116 raise converted from None 
> 117 else: 
> 118 raise 
> 
> IllegalArgumentException: java.net.URISyntaxException: Relative path in 
> absolute URI: test:dir
> {code}
>  
>  
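
Since the report ties the failure to the combination of multiLine and a ':' in the path, a small pre-check can flag risky paths before the option is enabled. The following is a minimal sketch using only the Python standard library. The function name colon_segments is hypothetical, and the handling of a "scheme://authority" prefix is a simplifying assumption, not Spark's or Hadoop's own logic.

```python
from typing import List


def colon_segments(path: str) -> List[str]:
    """Return the path segments that contain a ':' character.

    A leading "scheme://authority" prefix, if present, is skipped so that
    the scheme separator itself is not reported.
    """
    if "://" in path:
        # Drop "scheme://", then drop the authority up to the first "/".
        rest = path.split("://", 1)[1].partition("/")[2]
    else:
        rest = path
    return [segment for segment in rest.split("/") if ":" in segment]


# Path from the report: the directory "test:dir" is flagged.
print(colon_segments("/FileStore/myDir/test:dir/pageviews_by_second.tsv"))
# ['test:dir']
```

An empty result means no segment contains a colon, so this particular error would not be expected for that path.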



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23814) Couldn't read file with colon in name and new line character in one of the field.

2021-07-02 Thread Mike Pieters (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17373526#comment-17373526
 ] 

Mike Pieters commented on SPARK-23814:
--

I also got the same error in version 3.0.1.

> Couldn't read file with colon in name and new line character in one of the 
> field.
> -
>
> Key: SPARK-23814
> URL: https://issues.apache.org/jira/browse/SPARK-23814
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, Spark Shell
> Affects Versions: 2.2.0
> Reporter: bharath kumar avusherla
> Priority: Major
>
> Reading with spark.read.option("multiLine","true").csv("s3n://DirectoryPath/") 
> fails when a file name contains a colon and the data contains a newline 
> character inside a field. It throws *"java.lang.IllegalArgumentException: 
> java.net.URISyntaxException: Relative path in absolute URI: 
> 2017-08-01T00:00:00Z.csv.gz"*. If we remove option("multiLine","true"), the 
> read works fine even though the file name has a colon in it. Applying 
> *option("multiLine","true")* to any other file whose name has no colon also 
> works fine. The read fails only when both are present (a colon in the file 
> name and a newline in the data).
> {quote}java.lang.IllegalArgumentException: java.net.URISyntaxException: 
> Relative path in absolute URI: 2017-08-01T00:00:00Z.csv.gz
>   at org.apache.hadoop.fs.Path.initialize(Path.java:205)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:171)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:93)
>   at org.apache.hadoop.fs.Globber.glob(Globber.java:253)
>   at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1676)
>   at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294)
>   at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
>   at 
> org.apache.spark.input.StreamFileInputFormat.setMinPartitions(PortableDataStream.scala:51)
>   at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:46)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
>   at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1333)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.take(RDD.scala:1327)
>   at 
> org.apache.spark.sql.execution.datasources.csv.MultiLineCSVDataSource$.infer(CSVDataSource.scala:224)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:176)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
>   ... 48 elided
> Caused by: java.net.URISyntaxException: Relative path in absolute URI: 
> 2017-08-01T00:00:00Z.csv.gz
>   at java.net.URI.checkPath(URI.java:1823)
>   at java.net.URI.<init>(URI.java:745)
>   at org.apache.hadoop.fs.Path.initialize(Path.java:202)
>   ... 86 more
> {quote}
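
The trace above ends in java.net.URI.checkPath, reached from org.apache.hadoop.fs.Path.initialize via Globber.glob, and the message names only the file name rather than the full path. One plausible reading is that the file name alone is parsed as a path string, so the text before its first ':' is taken as a URI scheme and the remainder becomes a relative path. The Python below is a simplified model of that reading, not Hadoop's actual code. The function names are hypothetical.

```python
from typing import Optional, Tuple


def split_scheme(path_string: str) -> Tuple[Optional[str], str]:
    """Split off a scheme: the text before the first ':' counts as a
    scheme unless a '/' appears earlier in the string."""
    colon = path_string.find(":")
    slash = path_string.find("/")
    if colon != -1 and (slash == -1 or colon < slash):
        return path_string[:colon], path_string[colon + 1:]
    return None, path_string


def is_relative_path_in_absolute_uri(path_string: str) -> bool:
    """True when a scheme is present but the remaining path is non-empty
    and does not start with '/' (the condition named in the error)."""
    scheme, path = split_scheme(path_string)
    return scheme is not None and path != "" and not path.startswith("/")


# The bare file name from the trace is rejected in this model...
print(is_relative_path_in_absolute_uri("2017-08-01T00:00:00Z.csv.gz"))  # True
# ...while the directory URI passed to csv() is not.
print(is_relative_path_in_absolute_uri("s3n://DirectoryPath/"))  # False
```

Under this model a full path such as "/FileStore/myDir/test:dir/pageviews_by_second.tsv" is accepted, because a '/' precedes the colon. The single component "test:dir" is rejected, which matches the string reported in SPARK-34883.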


