[ https://issues.apache.org/jira/browse/SPARK-24540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944783#comment-16944783 ]
Jeff Evans commented on SPARK-24540:
------------------------------------

I created a pull request to support this (which was linked above). I'm not entirely clear on why SPARK-17967 would be a blocker, though.

> Support for multiple delimiter in Spark CSV read
> ------------------------------------------------
>
>                 Key: SPARK-24540
>                 URL: https://issues.apache.org/jira/browse/SPARK-24540
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Ashwin K
>            Priority: Major
>
> Currently, the delimiter option used in Spark 2.0 to read and split CSV
> files/data only supports a single-character delimiter. If we try to provide
> a delimiter of more than one character, we observe the following error
> message.
> e.g.: Dataset<Row> df = spark.read().option("inferSchema", "true")
>           .option("header", "false")
>           .option("delimiter", ", ")
>           .csv("C:\\test.txt");
> Exception in thread "main" java.lang.IllegalArgumentException: Delimiter cannot be more than one character: , 
>     at org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111)
>     at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:83)
>     at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:39)
>     at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55)
>     at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
>     at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
>     at scala.Option.orElse(Option.scala:289)
>     at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201)
>     at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
>     at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>     at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
>     at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473)
>
> Generally, the data to be processed contains multi-character delimiters,
> and at present we need to clean up the source/input files manually, which
> doesn't work well in large applications that consume numerous files.
> There seems to be a workaround, namely reading the data as text and
> splitting each line, but in my opinion this defeats the purpose, advantage
> and efficiency of reading directly from a CSV file.
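
For reference, the per-line step of the "read as text and split" workaround mentioned in the description might look roughly like the sketch below. It is plain Java with no Spark dependency, and the class name MultiCharDelimiterSplit and the helper splitLine are hypothetical names used only for illustration. It assumes the delimiter is a literal string and that fields are not quoted, so it does not replace real CSV parsing.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class MultiCharDelimiterSplit {

    // Hypothetical helper: splits one line of text on a literal,
    // possibly multi-character delimiter such as ", ".
    public static List<String> splitLine(String line, String delimiter) {
        // Pattern.quote treats the delimiter as literal text, not as a regex.
        // The limit of -1 keeps trailing empty fields instead of dropping them.
        return Arrays.asList(line.split(Pattern.quote(delimiter), -1));
    }

    public static void main(String[] args) {
        // Sample input lines (made up for this sketch).
        List<String> lines = Arrays.asList("1, alice, 30", "2, bob, ");
        for (String line : lines) {
            System.out.println(splitLine(line, ", "));
        }
    }
}
```

In a Spark job the same split would presumably be applied to each line of a text dataset, which is the extra step the reporter would like to avoid. Quoting, escaping and schema inference would all have to be handled separately under this approach.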