[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-32614.
----------------------------------
    Fix Version/s: 3.1.0
                   3.0.1
       Resolution: Fixed

Issue resolved by pull request 29516
[https://github.com/apache/spark/pull/29516]

> Support for treating a line as a valid record if it starts with \u0000
> (null character), or starts with any character configured as the comment character
> -------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-32614
>                 URL: https://issues.apache.org/jira/browse/SPARK-32614
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.2.3, 2.4.5, 3.0.0
>            Reporter: Chandan Ray
>            Assignee: Sean R. Owen
>            Priority: Minor
>              Labels: correctness
>             Fix For: 3.0.1, 3.1.0
>
>         Attachments: screenshot-1.png
>
>
> In most data-warehousing scenarios, files do not contain comment records, and every line needs to be treated as a valid record even if it starts with the default comment character, \u0000 (the null character). Although the user can set a comment character other than \u0000, there is still a chance that an actual record starts with that character.
> Currently, for the code below, with test data whose first row starts with the null character (\u0000), Spark throws the error shown below.
> *eg: val df = spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat");
> df.show(false);*
> *+TestData+*
>
> !screenshot-1.png!
> Internal state when error was thrown: line=1, column=0, record=0, charIndex=7
> at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
> at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:552)
> at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:160)
> at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:148)
> at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62)
> at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57)
> *Note:*
> Although this is a limitation of the univocity parser, and the workaround is to set a different comment character via .option("comment","#"), if the actual data starts with that character then the corresponding row is discarded.
> I have pushed code to the univocity parser to handle this scenario in the PR below:
> https://github.com/uniVocity/univocity-parsers/pull/412
> Please accept this JIRA so that we can enable this feature in spark-csv by adding a parameter to the Spark CSV options.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
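
For context, the failure and the workaround described in the report can be sketched as follows. This is a minimal illustration, not the patch itself: it assumes a local Spark session, and the file path is the illustrative one from the report.

```scala
import org.apache.spark.sql.SparkSession

object CommentCharRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SPARK-32614 repro")
      .master("local[*]")
      .getOrCreate()

    // Before the fix: if the first line of the file begins with \u0000
    // (the parser's default comment character), CSV schema inference
    // fails with the univocity exception quoted in the stack trace above.
    val df = spark.read
      .option("delimiter", ",")
      .csv("file:/E:/Data/Testdata.dat") // path from the report; illustrative

    // Workaround from the report: choose a different comment character.
    // The trade-off is that any row genuinely starting with '#' is
    // silently dropped as a comment.
    val df2 = spark.read
      .option("delimiter", ",")
      .option("comment", "#")
      .csv("file:/E:/Data/Testdata.dat")

    df2.show(false)
    spark.stop()
  }
}
```

The fix referenced above (pull request 29516, shipped in 3.0.1 and 3.1.0) addresses the case where no safe comment character exists, so data beginning with \u0000 can be read as ordinary records.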