[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chandan updated SPARK-32614:
----------------------------
    Description: 
In most data warehousing scenarios, files do not contain comment records, and
every line needs to be treated as a valid record even if it starts with the
default comment character \u0000 (the null character). Although the user can
set a comment character other than \u0000, there is still a chance that an
actual record starts with that character.

Currently:

e.g.:
Dataset<Row> df = spark.read()
        .option("inferSchema", "true")
        .option("header", "false")
        .option("delimiter", ",")
        .csv("/tmp/delimitedfile.dat");

*+TestData+*
 

 

  was:
Currently, the delimiter option used by Spark 2.0 to read and split CSV
files/data only supports a single-character delimiter. If we try to provide a
multi-character delimiter, we observe the following error message.

e.g.:
Dataset<Row> df = spark.read()
        .option("inferSchema", "true")
        .option("header", "false")
        .option("delimiter", ", ")
        .csv("C:\\test.txt");

Exception in thread "main" java.lang.IllegalArgumentException: Delimiter cannot be more than one character: , 

 at org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111)
 at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:83)
 at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:39)
 at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55)
 at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
 at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
 at scala.Option.orElse(Option.scala:289)
 at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201)
 at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
 at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473)

 

Generally, the data to be processed contains multi-character delimiters, and at
present we have to do a manual clean-up of the source/input file, which does
not work well in large applications that consume numerous files.

There is a work-around: reading the data as text and splitting it manually
(sketched below), but in my opinion this defeats the purpose, advantage, and
efficiency of a direct read from a CSV file.
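
As a concrete illustration of that work-around, here is a hedged sketch (the
file path, the "||" delimiter, and the column names are assumptions for this
example): read the file as plain text and split each line on the
multi-character delimiter.

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.split;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MultiCharDelimiterWorkaround {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("MultiCharDelimiterWorkaround")
        .master("local[*]")
        .getOrCreate();

    // Read raw lines; each row has a single "value" column.
    Dataset<Row> lines = spark.read().text("/tmp/multidelimited.dat");

    // Split on the multi-character delimiter "||" (escaped, because split()
    // takes a regular expression) and project the pieces into named columns.
    Dataset<Row> df = lines
        .withColumn("cols", split(col("value"), "\\|\\|"))
        .selectExpr("cols[0] as c0", "cols[1] as c1", "cols[2] as c2");

    df.show();
  }
}

This avoids the single-character restriction, but it gives up schema inference
and the other conveniences of the CSV reader, which is the inefficiency
referred to above.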

 


> Support for treating the line as valid record if it starts with \u0000 or 
> null character, or starts with any character mentioned as comment
> -------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-32614
>                 URL: https://issues.apache.org/jira/browse/SPARK-32614
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, SQL
>    Affects Versions: 3.0.0
>            Reporter: Chandan
>            Assignee: Jeff Evans
>            Priority: Major
>

