[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chandan updated SPARK-32614:
----------------------------
    Description:
In most data warehousing scenarios, files do not contain comment records, and every line needs to be treated as a valid record even if it starts with the default comment character, \u0000 (the null character). Although the user can set a comment character other than \u0000, there is still a chance that an actual record starts with that character.

Currently
eg: Dataset<Row> df = spark.read().option("inferSchema", "true")
                           .option("header", "false")
                           .option("delimiter", ",")
                           .csv("/tmp/delimitedfile.dat");

*+TestData+*

  was:
Currently, the delimiter option in Spark 2.0 for reading and splitting CSV files/data only supports a single-character delimiter. If we try to provide a multi-character delimiter, we observe the following error message.

eg: Dataset<Row> df = spark.read().option("inferSchema", "true")
                           .option("header", "false")
                           .option("delimiter", ", ")
                           .csv("C:\test.txt");

Exception in thread "main" java.lang.IllegalArgumentException: Delimiter cannot be more than one character: ,
    at org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111)
    at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:83)
    at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:39)
    at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
    at scala.Option.orElse(Option.scala:289)
    at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473)

Generally, the data to be processed contains multi-character delimiters, and at present we need to clean up the source/input file manually, which doesn't work well in large applications that consume numerous files.

There seem to be workarounds, such as reading the data as text and using the split option, but in my opinion this defeats the purpose, advantage, and efficiency of reading directly from a CSV file.

> Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
> -------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-32614
>                 URL: https://issues.apache.org/jira/browse/SPARK-32614
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, SQL
>    Affects Versions: 3.0.0
>            Reporter: Chandan
>            Assignee: Jeff Evans
>            Priority: Major
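
The difference between the behaviour reported in the updated description and the behaviour being requested can be sketched in plain Java. This is only an illustration under stated assumptions, not Spark's CSV reader: the class CommentFilterSketch, both method names, and the reading of the request as "keep every line when the comment character is left at its \u0000 default" are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CommentFilterSketch {
    // Default comment character named in the report (the null character).
    static final char DEFAULT_COMMENT = '\u0000';

    // Reported behaviour: a line whose first character equals the
    // comment character is skipped, even when that character is \u0000.
    static List<String> skipComments(List<String> lines, char comment) {
        List<String> records = new ArrayList<>();
        for (String line : lines) {
            if (!line.isEmpty() && line.charAt(0) == comment) {
                continue; // treated as a comment, so the record is lost
            }
            records.add(line);
        }
        return records;
    }

    // Requested behaviour (assumed interpretation): if the comment
    // character was left at its default, treat every line as a record.
    static List<String> keepAllUnlessCommentSet(List<String> lines, char comment) {
        if (comment == DEFAULT_COMMENT) {
            return new ArrayList<>(lines);
        }
        return skipComments(lines, comment);
    }

    public static void main(String[] args) {
        // Hypothetical input: the second line starts with the null character.
        List<String> lines = Arrays.asList("a,1", "\u0000b,2", "#c,3");

        System.out.println(skipComments(lines, DEFAULT_COMMENT).size());            // prints 2
        System.out.println(keepAllUnlessCommentSet(lines, DEFAULT_COMMENT).size()); // prints 3
        System.out.println(keepAllUnlessCommentSet(lines, '#').size());             // prints 2
    }
}
```

With the default left in place, the first call drops the record that starts with \u0000, while the second keeps all three lines. Setting an explicit comment character such as '#' still skips lines that start with it.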