[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chandan updated SPARK-32614:
----------------------------
    Component/s: Spark Core

> Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
> -------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-32614
>                 URL: https://issues.apache.org/jira/browse/SPARK-32614
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, SQL
>    Affects Versions: 3.0.0
>            Reporter: Chandan
>            Assignee: Jeff Evans
>            Priority: Major
>
> Currently, the delimiter option in Spark 2.0 for reading and splitting CSV files/data supports only a single-character delimiter. If we try to provide a multi-character delimiter, we observe the following error message.
> e.g.:
>     Dataset<Row> df = spark.read()
>             .option("inferSchema", "true")
>             .option("header", "false")
>             .option("delimiter", ", ")
>             .csv("C:\\test.txt");  // note the "\\": a bare "\t" would be a tab escape in Java
>
> Exception in thread "main" java.lang.IllegalArgumentException: Delimiter cannot be more than one character: ,
>     at org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111)
>     at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:83)
>     at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:39)
>     at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55)
>     at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
>     at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
>     at scala.Option.orElse(Option.scala:289)
>     at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201)
>     at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
>     at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>     at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
>     at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473)
>
> In practice, the data to be processed often contains multi-character delimiters, and at present we need to clean up the source/input file manually, which does not work well in large applications that consume numerous files.
> There is a workaround: reading the data as text and splitting each line manually (see the sketch below), but in my opinion this defeats the purpose, advantage, and efficiency of reading the CSV file directly.
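>
> For reference, a minimal, self-contained sketch of that text-and-split workaround. The two-character delimiter ", ", the three-field layout, the class name MultiCharDelimiterWorkaround, and the path "C:\\test.txt" are all illustrative assumptions, not part of the issue:
>
>     import org.apache.spark.sql.Dataset;
>     import org.apache.spark.sql.Row;
>     import org.apache.spark.sql.SparkSession;
>     import java.util.regex.Pattern;
>     import static org.apache.spark.sql.functions.col;
>     import static org.apache.spark.sql.functions.split;
>
>     public class MultiCharDelimiterWorkaround {
>         public static void main(String[] args) {
>             SparkSession spark = SparkSession.builder()
>                     .appName("multi-char-delimiter-workaround")
>                     .master("local[*]")   // illustrative local run
>                     .getOrCreate();
>
>             // Read each line of the file into a single string column named "value".
>             Dataset<Row> lines = spark.read().text("C:\\test.txt");
>
>             // split() interprets its second argument as a Java regex, so
>             // Pattern.quote() keeps the delimiter literal even when it
>             // contains regex metacharacters.
>             Dataset<Row> parts = lines.select(
>                     split(col("value"), Pattern.quote(", ")).alias("fields"));
>
>             // Project the array elements into named columns; the three-field
>             // layout assumed here is illustrative only.
>             Dataset<Row> df = parts.select(
>                     col("fields").getItem(0).alias("c0"),
>                     col("fields").getItem(1).alias("c1"),
>                     col("fields").getItem(2).alias("c2"));
>
>             df.show();
>             spark.stop();
>         }
>     }
>
> Even when this works, the naive split gives up the CSV reader's schema inference, header handling, quoting, and escaping, which is exactly the loss of purpose and efficiency described above.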