[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chandan updated SPARK-32614:
----------------------------
    Description: 
In most data warehousing scenarios, files do not contain comment records, and
every line needs to be treated as a valid record even if it starts with the
default comment character \u0000 (the null character). Although the user can
set a comment character other than \u0000, there is still a chance that an
actual record starts with that character.

Currently:

e.g.:
Dataset<Row> df = spark.read()
        .option("inferSchema", "true")
        .option("header", "false")
        .option("delimiter", ",")
        .csv("/tmp/delimitedfile.dat");

*+TestData+*
 

 

  was:
Currently, the delimiter option used by Spark 2.0 to read and split CSV
files/data only supports a single-character delimiter. If we try to provide a
multi-character delimiter, we observe the following error message.

e.g.:
Dataset<Row> df = spark.read()
        .option("inferSchema", "true")
        .option("header", "false")
        .option("delimiter", ", ")
        .csv("C:\\test.txt");

Exception in thread "main" java.lang.IllegalArgumentException: Delimiter cannot be more than one character: , 

 at org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111)
 at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:83)
 at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:39)
 at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55)
 at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
 at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
 at scala.Option.orElse(Option.scala:289)
 at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201)
 at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
 at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473)

 

Generally, the data to be processed contains multi-character delimiters, and at
present we have to do a manual clean-up of the source/input file, which does
not work well in large applications that consume numerous files.

There is a work-around: reading the data as text and splitting it manually
(sketched below), but in my opinion this defeats the purpose, advantage, and
efficiency of a direct read from a CSV file.
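
As a concrete illustration of that work-around, here is a hedged sketch (the
file path, the "||" delimiter, and the column names are assumptions for this
example): read the file as plain text and split each line on the
multi-character delimiter.

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.split;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MultiCharDelimiterWorkaround {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("MultiCharDelimiterWorkaround")
        .master("local[*]")
        .getOrCreate();

    // Read raw lines; each row has a single "value" column.
    Dataset<Row> lines = spark.read().text("/tmp/multidelimited.dat");

    // Split on the multi-character delimiter "||" (escaped, because split()
    // takes a regular expression) and project the pieces into named columns.
    Dataset<Row> df = lines
        .withColumn("cols", split(col("value"), "\\|\\|"))
        .selectExpr("cols[0] as c0", "cols[1] as c1", "cols[2] as c2");

    df.show();
  }
}

This avoids the single-character restriction, but it gives up schema inference
and the other conveniences of the CSV reader, which is the inefficiency
referred to above.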

 


> Support for treating the line as valid record if it starts with \u0000 or 
> null character, or starts with any character mentioned as comment
> -------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-32614
>                 URL: https://issues.apache.org/jira/browse/SPARK-32614
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, SQL
>    Affects Versions: 3.0.0
>            Reporter: Chandan
>            Assignee: Jeff Evans
>            Priority: Major
>

