[ https://issues.apache.org/jira/browse/SPARK-40808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ohad updated SPARK-40808:
-------------------------
Description:

Hello.
I am writing unit tests for some functionality in my application that reads data from CSV files using Spark.
I am reading the data using:
```
header=True
mergeSchema=True
inferSchema=True
```
When I read this single file:
```
"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22
```
I get this schema:
```
int_col=int
string_col=string
decimal_col=double
date_col=string
```
When I duplicate this file, I get the same schema.
The strange part is that when I add a new int column, Spark seems to get confused and treats the columns that were already identified as int as string:
```
File1:
"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22
File2:
"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
```
result:
```
int_col=string
string_col=string
decimal_col=string
date_col=string
int2_col=int
```
When I read only the second file, it looks fine:
```
File2:
"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
```
result:
```
int_col=int
string_col=string
decimal_col=double
date_col=string
int2_col=int
```
In conclusion, there appears to be a bug when combining the two features: header recognition and merge schema.

was:

Hello.
I am writing some unit tests for some functionality in my application that reads data from CSV files using Spark.
I am reading the data using:
```
header=True
mergeSchema=True
inferSchema=True
```
When I read this single file:
```
"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22
```
I get this schema:
```
int_col=int
string_col=string
decimal_col=double
date_col=string
```
When I duplicate this file, I get the same schema.
The strange part is that when I add a new int column, Spark seems to get confused and treats the columns that were already identified as int as string:
```
File1:
"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22
File2:
"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
```
result:
```
int_col=string
string_col=string
decimal_col=string
date_col=string
int2_col=int
```
When I read only the second file, it looks fine:
```
File2:
"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
```
result:
```
int_col=int
string_col=string
decimal_col=double
date_col=string
int2_col=int
```
In conclusion, there appears to be a bug when combining the two features: header recognition and merge schema.

> Infer schema for CSV files - wrong behavior using header + merge schema
> -----------------------------------------------------------------------
>
> Key: SPARK-40808
> URL: https://issues.apache.org/jira/browse/SPARK-40808
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.2
> Reporter: ohad
> Priority: Major
> Labels: CSVReader, csv, csvparser
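The scenario in this report could be reproduced with something like the following. This is only a minimal sketch under stated assumptions: the helper names `write_fixtures` and `inferred_schema`, the file names `file1.csv` and `file2.csv`, and the local Spark session settings are illustrative and do not come from the report. The three reader options are passed exactly as the report lists them; whether the CSV reader actually honors `mergeSchema` is not confirmed here.

```python
from pathlib import Path

# File contents copied from the report.
FILE1 = '''"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22
'''

FILE2 = '''"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
'''


def write_fixtures(directory):
    """Write the two CSV files from the report and return their paths.

    The file names are hypothetical; the report does not name them.
    """
    directory = Path(directory)
    paths = []
    for name, content in (("file1.csv", FILE1), ("file2.csv", FILE2)):
        path = directory / name
        path.write_text(content, encoding="utf-8")
        paths.append(str(path))
    return paths


def inferred_schema(paths):
    """Read the CSV paths with the reported options; return {column: type}."""
    # Imported lazily so write_fixtures works without PySpark installed.
    from pyspark.sql import SparkSession

    # A local single-threaded session is an assumption made for this sketch.
    spark = (
        SparkSession.builder.master("local[1]")
        .appName("SPARK-40808-repro")
        .getOrCreate()
    )
    df = (
        spark.read.option("header", True)
        .option("mergeSchema", True)
        .option("inferSchema", True)
        .csv(paths)
    )
    return dict(df.dtypes)
```

With PySpark installed, comparing `inferred_schema(paths)` against `inferred_schema(paths[1:])`, where `paths = write_fixtures(some_dir)`, should show whether the types differ between the two-file read and the File2-only read, as the report describes for version 3.2.2.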