[jira] [Comment Edited] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment

chanduhawk (Jira) Sat, 22 Aug 2020 22:21:43 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182586#comment-17182586
 ]


chanduhawk edited comment on SPARK-32614 at 8/23/20, 5:20 AM:
--------------------------------------------------------------

[~srowen]

The univocity parser uses # as its default comment character, and Spark 
appears to override that setting with the \u0000 character. Please see 
https://github.com/uniVocity/univocity-parsers/pull/412, which adds a new 
option to enable or disable comment processing. Given that PR, I still think 
Spark CSV should expose an option to enable or disable comment processing, so 
that the Boolean value can be passed through to the univocity parser.

As per the PR, if we change the behaviour that way, it might affect existing 
users for whom \u0000 is the default comment character. So I would say a 
separate optional config is the better solution. My point is that we need to 
wait for univocity 3.0.0, which will include the new changes, and then we can 
add the Spark changes properly.
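
To make the proposed semantics concrete, here is a minimal sketch in plain 
Scala. It is only a simplified model and uses neither the Spark nor the 
univocity API: the names CommentSettings, processComments and 
CommentFilter are hypothetical and chosen for illustration. It also skips 
comment lines cleanly, whereas the reported case fails with a parser error. 
The point is the toggle: with comment processing disabled, every line is kept 
as a record regardless of its first character.

```scala
// Hypothetical model of comment handling; NOT the Spark or univocity API.
// commentChar defaults to the NUL character, mirroring the default described above.
final case class CommentSettings(
    commentChar: Char = '\u0000',
    processComments: Boolean = true)

object CommentFilter {
  /** Returns the lines that would be kept as records under the given settings. */
  def records(lines: Seq[String], settings: CommentSettings): Seq[String] =
    if (!settings.processComments) {
      // Comment processing disabled: every line is a valid record.
      lines
    } else {
      // Comment processing enabled: drop lines starting with the comment character.
      lines.filterNot(l => l.nonEmpty && l.charAt(0) == settings.commentChar)
    }
}

object CommentFilterDemo {
  def main(args: Array[String]): Unit = {
    val lines = Seq("\u0000abc,1", "def,2", "#ghi,3")

    // Default settings: the row starting with NUL is dropped.
    println(CommentFilter.records(lines, CommentSettings()).size)

    // Workaround with '#': the NUL row survives, but the '#' row is dropped.
    println(CommentFilter.records(lines, CommentSettings(commentChar = '#')).size)

    // Proposed optional flag: nothing is dropped.
    println(CommentFilter.records(lines, CommentSettings(processComments = false)).size)
  }
}
```

Keeping the flag separate from the comment character means existing users who 
rely on the current default see no change unless they opt in.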


was (Author: chanduhawk):
[~srowen]

The univocity parser uses # as its default comment character, and Spark 
appears to override that setting with the \u0000 character. Please see 
https://github.com/uniVocity/univocity-parsers/pull/412, which adds a new 
option to enable or disable comment processing. Given that PR, I still think 
Spark CSV should expose an option to enable or disable comment processing, so 
that the Boolean value can be passed through to the univocity parser.

> Support for treating the line as valid record if it starts with \u0000 or 
> null character, or starts with any character mentioned as comment
> -------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-32614
>                 URL: https://issues.apache.org/jira/browse/SPARK-32614
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.2.3, 2.4.5, 3.0.0
>            Reporter: chanduhawk
>            Priority: Minor
>              Labels: correctness
>         Attachments: screenshot-1.png
>
>
> In most data warehousing scenarios, files do not contain comment 
> records, and every line needs to be treated as a valid record even if it 
> starts with the default comment character, \u0000 (the null character). 
> Although the user can set a comment character other than \u0000, there is 
> still a chance that an actual record starts with that character.
> Currently, for the code below and the given test data, where the first row 
> starts with the null character \u0000, the error below is thrown.
> *eg: val df = 
> spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat");
>       df.show(false);*
> *+TestData+*
>  
>  !screenshot-1.png! 
> Internal state when error was thrown: line=1, column=0, record=0, charIndex=7
>       at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
>       at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:552)
>       at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:160)
>       at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:148)
>       at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62)
>       at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57)
> *Note:*
> This is a limitation of the univocity parser. The workaround is to set a 
> different comment character with .option("comment","#"), but any row in 
> my actual data that starts with that character will then be discarded.
> I have pushed code to the univocity parser to handle this scenario as 
> part of the PR below:
> https://github.com/uniVocity/univocity-parsers/pull/412
> Please accept the Jira so that we can enable this feature in spark-csv by 
> adding a parameter to Spark's CSV options.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
