[jira] [Updated] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment

Chandan (Jira) Thu, 13 Aug 2020 22:03:56 -0700


     [ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Chandan updated SPARK-32614:
----------------------------
    Description: 
In most data warehousing scenarios, files do not contain comment records, and 
every line needs to be treated as a valid record even if it starts with the 
default comment character, \u0000 (the null character). Although the user can 
set a comment character other than \u0000, an actual record may still start 
with whichever character is chosen.
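
To illustrate why a comment character can conflict with real data, here is a minimal sketch in plain Scala. It is a simplified, hypothetical model of comment-line handling and not the univocity or Spark implementation; the names CommentLineSketch and records are made up for this example. The failure reported in this issue is an exception during schema inference, which this model does not reproduce. It only shows how a record that begins with the comment character stops being treated as data.

```scala
// Hypothetical, simplified model of comment-line handling in a CSV reader.
// This is an illustration only, not the univocity or Spark code path.
object CommentLineSketch {
  // When a comment character is supplied, lines starting with it are dropped.
  // When it is None, comment handling is off and every line is a record.
  def records(lines: Seq[String], comment: Option[Char]): Seq[Array[String]] =
    lines
      .filter(line => comment.forall(c => !line.startsWith(c.toString)))
      .map(_.split(",", -1))

  def main(args: Array[String]): Unit = {
    // The first record starts with the NUL character, as in the report.
    val lines = Seq("\u0000abc,1", "def,2")

    // NUL configured as the comment character: the first record is lost.
    println(records(lines, Some('\u0000')).length) // prints 1

    // Comment handling disabled: both lines are kept as records.
    println(records(lines, None).length) // prints 2
  }
}
```

Under this model, the request in this issue amounts to a way of selecting the None case, so that no leading character causes a line to be treated as a comment.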

Currently, the code below throws the following error when run against the 
given test data, whose first row starts with the null character (\u0000).

Internal state when error was thrown: line=1, column=0, record=0, charIndex=7
        at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
        at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:552)
        at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:160)
        at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:148)
        at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62)
        at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57)

Example:

      val df = spark.read.option("delimiter", ",").csv("file:/E:/Data/Testdata.dat")
      df.show(false)

*+TestData+*
 
 !screenshot-1.png! 
 

  was:
In most of the data ware housing scenarios files does not have comment records 
and every line needs to be treated as a valid record even though it starts with 
default comment character as \u0000 or null character.Though user can set any 
comment character other than \u0000, but there is a chance the actual record 
can start with those characters.

Currently for the below piece of code and the given testdata where first row 
starts with null \u0000
character it will throw the 

eg: val df = 
spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat");
      df.show(false);

*+TestData+*
 
 !screenshot-1.png! 
 


> Support for treating the line as valid record if it starts with \u0000 or 
> null character, or starts with any character mentioned as comment
> -------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-32614
>                 URL: https://issues.apache.org/jira/browse/SPARK-32614
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.4.5, 3.0.0
>            Reporter: Chandan
>            Assignee: Jeff Evans
>            Priority: Major
>         Attachments: screenshot-1.png
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
