[jira] [Commented] (SPARK-40808) Infer schema for CSV files - wrong behavior using header + merge schema

zzzzming95 (Jira) Sat, 22 Oct 2022 02:34:05 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-40808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17622594#comment-17622594
 ]


zzzzming95 commented on SPARK-40808:
------------------------------------

[~ohadm] 

In Spark, CSV schema inference skips the first line when the header option is 
set to true. However, only one file's header is treated as the real header 
line. If two files have different headers, the header of the other file is 
therefore treated as data during schema inference.
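
To make this concrete, here is a toy model of the behavior in plain Python. It is only a sketch of the explanation in this comment, not Spark's actual code: the names infer_type and infer_schema are made up for illustration, and it assumes that the header is taken from the first file passed in and that only lines identical to it are skipped. The sample data is taken from the issue description.

```python
import csv

def infer_type(values):
    """Pick the narrowest of int, double, string that parses every value."""
    for name, cast in (("int", int), ("double", float)):
        try:
            for value in values:
                cast(value)
            return name
        except ValueError:
            continue
    return "string"

def infer_schema(files):
    """Toy inference over a list of CSV texts (hypothetical helper, not Spark)."""
    # Assumption: the first line of the first file is the one real header.
    header_line = files[0].splitlines()[0]
    # Only lines identical to that header are skipped, so a different
    # header in another file stays in the data.
    data_lines = [
        line
        for text in files
        for line in text.splitlines()
        if line.strip() and line != header_line
    ]
    names = next(csv.reader([header_line]))
    columns = {name: [] for name in names}
    for row in csv.reader(data_lines):
        # Shorter rows simply contribute nothing to the trailing columns.
        for name, value in zip(names, row):
            columns[name].append(value)
    return {name: infer_type(values) for name, values in columns.items()}

FILE1 = '''"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
'''

FILE2 = '''"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
'''

# File2 alone: its own header is skipped, so the types come out as expected.
print(infer_schema([FILE2]))
# File2 plus File1: File1's header survives as a data row and drags the
# first four columns down to string, while int2_col stays int.
print(infer_schema([FILE2, FILE1]))
```

With the header taken from File2, this toy model gives the same column types as the results in the report, both for File2 alone and for the two files together.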

 

In this case, you can give all files the same header so that unit test 4 passes. 

 
{code:java}
//file2.csv
"int_col","string_col","double_col","int2_col"
12,"hello2",1.432
22,"world2",5.5342
32,"my name2",86.4552
42,"is ohad2",6.2342 {code}
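
If it helps, a small check like the one below can be run in the unit tests before the files are handed to Spark, to confirm that all files really share one header. This is only a sketch using Python's standard csv module. The helper names read_header and group_by_header are hypothetical, and it assumes the files are UTF-8 text on the local file system.

```python
import csv

def read_header(path):
    """Return the first CSV record of a file as a list of column names."""
    with open(path, newline="", encoding="utf-8") as handle:
        # An empty file yields an empty header instead of raising.
        return next(csv.reader(handle), [])

def group_by_header(paths):
    """Map each distinct header (as a tuple) to the files that use it."""
    groups = {}
    for path in paths:
        groups.setdefault(tuple(read_header(path)), []).append(str(path))
    return groups
```

If group_by_header returns more than one key, the files do not share a header, and each group can either be fixed or read separately.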

> Infer schema for CSV files - wrong behavior using header + merge schema
> -----------------------------------------------------------------------
>
>                 Key: SPARK-40808
>                 URL: https://issues.apache.org/jira/browse/SPARK-40808
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.2
>            Reporter: ohad
>            Priority: Major
>              Labels: CSVReader, csv, csvparser
>         Attachments: test_csv.py
>
>
> Hello. 
> I am writing unit-tests to some functionality in my application that reading 
> data from CSV files using Spark.
> I am reading the data using:
> {code:java}
> header=True
> mergeSchema=True
> inferSchema=True{code}
> When I am reading this single file:
> {code:java}
> File1:
> "int_col","string_col","decimal_col","date_col"
> 1,"hello",1.43,2022-02-23
> 2,"world",5.534,2021-05-05
> 3,"my name",86.455,2011-08-15
> 4,"is ohad",6.234,2002-03-22{code}
> I am getting this schema:
> {code:java}
> int_col=int
> string_col=string
> decimal_col=double
> date_col=string{code}
> When I am duplicating this file, I am getting the same schema.
> The strange part is when I am adding new int column, it looks like spark is 
> getting confused and think that the column that already identified as int are 
> now string:
> {code:java}
> File1:
> "int_col","string_col","decimal_col","date_col"
> 1,"hello",1.43,2022-02-23
> 2,"world",5.534,2021-05-05
> 3,"my name",86.455,2011-08-15
> 4,"is ohad",6.234,2002-03-22
> File2:
> "int_col","string_col","decimal_col","date_col","int2_col"
> 1,"hello",1.43,2022-02-23,234
> 2,"world",5.534,2021-05-05,5
> 3,"my name",86.455,2011-08-15,32
> 4,"is ohad",6.234,2002-03-22,2
> {code}
> result:
> {code:java}
> int_col=string
> string_col=string
> decimal_col=string
> date_col=string
> int2_col=int{code}
> When I am reading only the second file, it looks fine:
> {code:java}
> File2:
> "int_col","string_col","decimal_col","date_col","int2_col"
> 1,"hello",1.43,2022-02-23,234
> 2,"world",5.534,2021-05-05,5
> 3,"my name",86.455,2011-08-15,32
> 4,"is ohad",6.234,2002-03-22,2{code}
> result:
> {code:java}
> int_col=int
> string_col=string
> decimal_col=double
> date_col=string
> int2_col=int{code}
> For conclusion, it looks like there is a bug mixing the two features: header 
> recognition and merge schema.



