[ https://issues.apache.org/jira/browse/SPARK-40808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17622594#comment-17622594 ]
zzzzming95 commented on SPARK-40808:
------------------------------------

[~ohadm] In Spark, CSV schema inference skips the first line when the header option is set to true. However, only one file's header is treated as the real first line. If two files have different headers, the header row of the other file is treated as data during schema inference. A column that contains header text can only be inferred as string, which would explain the result you see.

In this case, you can give all files the same header so that unit test4 passes.

{code:java}
//file2.csv
"int_col","string_col","double_col","int2_col"
12,"hello2",1.432
22,"world2",5.5342
32,"my name2",86.4552
42,"is ohad2",6.2342
{code}

> Infer schema for CSV files - wrong behavior using header + merge schema
> -----------------------------------------------------------------------
>
>                 Key: SPARK-40808
>                 URL: https://issues.apache.org/jira/browse/SPARK-40808
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.2
>            Reporter: ohad
>            Priority: Major
>              Labels: CSVReader, csv, csvparser
>         Attachments: test_csv.py
>
>
> Hello.
> I am writing unit-tests to some functionality in my application that reading
> data from CSV files using Spark.
> I am reading the data using:
> {code:java}
> header=True
> mergeSchema=True
> inferSchema=True{code}
> When I am reading this single file:
> {code:java}
> File1:
> "int_col","string_col","decimal_col","date_col"
> 1,"hello",1.43,2022-02-23
> 2,"world",5.534,2021-05-05
> 3,"my name",86.455,2011-08-15
> 4,"is ohad",6.234,2002-03-22{code}
> I am getting this schema:
> {code:java}
> int_col=int
> string_col=string
> decimal_col=double
> date_col=string{code}
> When I am duplicating this file, I am getting the same schema.
> The strange part is when I am adding new int column, it looks like spark is
> getting confused and think that the column that already identified as int are
> now string:
> {code:java}
> File1:
> "int_col","string_col","decimal_col","date_col"
> 1,"hello",1.43,2022-02-23
> 2,"world",5.534,2021-05-05
> 3,"my name",86.455,2011-08-15
> 4,"is ohad",6.234,2002-03-22
> File2:
> "int_col","string_col","decimal_col","date_col","int2_col"
> 1,"hello",1.43,2022-02-23,234
> 2,"world",5.534,2021-05-05,5
> 3,"my name",86.455,2011-08-15,32
> 4,"is ohad",6.234,2002-03-22,2
> {code}
> result:
> {code:java}
> int_col=string
> string_col=string
> decimal_col=string
> date_col=string
> int2_col=int{code}
> When I am reading only the second file, it looks fine:
> {code:java}
> File2:
> "int_col","string_col","decimal_col","date_col","int2_col"
> 1,"hello",1.43,2022-02-23,234
> 2,"world",5.534,2021-05-05,5
> 3,"my name",86.455,2011-08-15,32
> 4,"is ohad",6.234,2002-03-22,2{code}
> result:
> {code:java}
> int_col=int
> string_col=string
> decimal_col=double
> date_col=string
> int2_col=int{code}
> For conclusion, it looks like there is a bug mixing the two features: header
> recognition and merge schema.
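The advice in the comment (keep the same header in every file) can be checked before the files are handed to Spark. Below is a minimal sketch in plain Python that groups CSV paths by their header row. The helper name `group_files_by_header` and the shortened sample files are illustrative assumptions, not part of Spark or of the attached test_csv.py.

```python
import csv
import os
import tempfile
from collections import defaultdict


def group_files_by_header(paths):
    """Group CSV file paths by header row (hypothetical helper, not a Spark API)."""
    groups = defaultdict(list)
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            # The first parsed row is taken as the header; an empty file yields [].
            header = next(csv.reader(f), [])
        groups[tuple(header)].append(path)
    return dict(groups)


# Shortened versions of File1 and File2 from the report (one data row each).
FILE1 = '"int_col","string_col","decimal_col","date_col"\n1,"hello",1.43,2022-02-23\n'
FILE2 = (
    '"int_col","string_col","decimal_col","date_col","int2_col"\n'
    '1,"hello",1.43,2022-02-23,234\n'
)

with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, text in (("file1.csv", FILE1), ("file2.csv", FILE2)):
        path = os.path.join(tmp, name)
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        paths.append(path)
    groups = group_files_by_header(paths)

# More than one group means the inputs do not share a single header.
headers = sorted(groups, key=len)
```

If more than one group comes back, the inputs do not share a header, which is the situation the comment describes. One option in that case, not verified against this report, is to read each group separately and combine the results with `DataFrame.unionByName(..., allowMissingColumns=True)` (available since Spark 3.1).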