[jira] [Commented] (SPARK-40808) Infer schema for CSV files - wrong behavior using header + merge schema
[ https://issues.apache.org/jira/browse/SPARK-40808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618839#comment-17618839 ]

ohad commented on SPARK-40808:
------------------------------

[~Zing] [~hyukjin.kwon] File test_csv.py attached. As you can see, the last test fails.

> Infer schema for CSV files - wrong behavior using header + merge schema
> -----------------------------------------------------------------------
>
>                 Key: SPARK-40808
>                 URL: https://issues.apache.org/jira/browse/SPARK-40808
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.2
>            Reporter: ohad
>            Priority: Major
>              Labels: CSVReader, csv, csvparser
>         Attachments: test_csv.py
>
> Hello.
> I am writing unit tests for functionality in my application that reads data from CSV files using Spark.
> I am reading the data with:
> {code:java}
> header=True
> mergeSchema=True
> inferSchema=True{code}
> When I read this single file:
> {code:java}
> File1:
> "int_col","string_col","decimal_col","date_col"
> 1,"hello",1.43,2022-02-23
> 2,"world",5.534,2021-05-05
> 3,"my name",86.455,2011-08-15
> 4,"is ohad",6.234,2002-03-22{code}
> I get this schema:
> {code:java}
> int_col=int
> string_col=string
> decimal_col=double
> date_col=string{code}
> When I duplicate this file, I get the same schema.
> The strange part is that when I add a new int column, Spark appears to get confused and treats the columns already identified as int as string:
> {code:java}
> File1:
> "int_col","string_col","decimal_col","date_col"
> 1,"hello",1.43,2022-02-23
> 2,"world",5.534,2021-05-05
> 3,"my name",86.455,2011-08-15
> 4,"is ohad",6.234,2002-03-22
> File2:
> "int_col","string_col","decimal_col","date_col","int2_col"
> 1,"hello",1.43,2022-02-23,234
> 2,"world",5.534,2021-05-05,5
> 3,"my name",86.455,2011-08-15,32
> 4,"is ohad",6.234,2002-03-22,2
> {code}
> result:
> {code:java}
> int_col=string
> string_col=string
> decimal_col=string
> date_col=string
> int2_col=int{code}
> When I read only the second file, the schema looks fine:
> {code:java}
> File2:
> "int_col","string_col","decimal_col","date_col","int2_col"
> 1,"hello",1.43,2022-02-23,234
> 2,"world",5.534,2021-05-05,5
> 3,"my name",86.455,2011-08-15,32
> 4,"is ohad",6.234,2002-03-22,2{code}
> result:
> {code:java}
> int_col=int
> string_col=string
> decimal_col=double
> date_col=string
> int2_col=int{code}
> In conclusion, it looks like there is a bug in the interaction between the two features: header recognition and merge schema.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
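The reported result pattern (shared columns degrade to string, the new column stays int) is consistent with one plausible explanation: the header row of one file is treated as a data row during inference, and conflicting per-file types widen to string when the schemas are merged. The pure-Python sketch below illustrates that hypothesis only; it is not Spark's actual implementation, and `infer_type`, `infer_schema`, and `merge_schemas` are made-up names. If File2 happens to be inferred first (header skipped) and File1's header is then read as data, the merge reproduces the reported schema exactly:

```python
import csv
import io

FILE1 = '''"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22'''

FILE2 = '''"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2'''

def infer_type(value):
    """Crudely classify a single CSV cell as int, double, or string."""
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "double"
    except ValueError:
        return "string"

def infer_schema(text, skip_header):
    """Infer one file's schema. If skip_header is False, the header row is
    (wrongly) treated as a data row, so every column infers as string."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], (rows[1:] if skip_header else rows)
    schema = {}
    for i, name in enumerate(header):
        types = {infer_type(row[i]) for row in data}
        if types == {"int"}:
            schema[name] = "int"
        elif types <= {"int", "double"}:
            schema[name] = "double"
        else:
            schema[name] = "string"
    return schema

def merge_schemas(first, second):
    """Merge two per-file schemas; conflicting column types widen to string."""
    merged = dict(first)
    for name, typ in second.items():
        merged[name] = typ if merged.get(name, typ) == typ else "string"
    return merged

# File2 inferred correctly, File1's header mistaken for data, then merged:
merged = merge_schemas(infer_schema(FILE2, skip_header=True),
                       infer_schema(FILE1, skip_header=False))
print(merged)
# {'int_col': 'string', 'string_col': 'string', 'decimal_col': 'string',
#  'date_col': 'string', 'int2_col': 'int'}
```

Note that inferring File1 with `skip_header=True` yields the expected `int/string/double/string` schema, matching the single-file result in the report; only the header-mistaken-for-data path degrades the merged schema.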
[jira] [Updated] (SPARK-40808) Infer schema for CSV files - wrong behavior using header + merge schema
[ https://issues.apache.org/jira/browse/SPARK-40808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ohad updated SPARK-40808:
-------------------------
    Attachment: test_csv.py
[jira] [Updated] (SPARK-40808) Infer schema for CSV files - wrong behavior using header + merge schema
[ https://issues.apache.org/jira/browse/SPARK-40808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ohad updated SPARK-40808:
-------------------------
    Description: (code blocks converted from Markdown fences to Jira {code} markup; text otherwise as in the quoted description above)
[jira] [Updated] (SPARK-40808) Infer schema for CSV files - wrong behavior using header + merge schema
[ https://issues.apache.org/jira/browse/SPARK-40808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ohad updated SPARK-40808:
-------------------------
    Description: (minor wording revision; text otherwise as in the quoted description above)
[jira] [Created] (SPARK-40808) Infer schema for CSV files - wrong behavior using header + merge schema
ohad created SPARK-40808:
-------------------------

             Summary: Infer schema for CSV files - wrong behavior using header + merge schema
                 Key: SPARK-40808
                 URL: https://issues.apache.org/jira/browse/SPARK-40808
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.2.2
            Reporter: ohad
[jira] [Created] (SPARK-33557) spark.storage.blockManagerSlaveTimeoutMs default value does not follow spark.network.timeout value when the latter was changed
Ohad created SPARK-33557:
-------------------------

             Summary: spark.storage.blockManagerSlaveTimeoutMs default value does not follow spark.network.timeout value when the latter was changed
                 Key: SPARK-33557
                 URL: https://issues.apache.org/jira/browse/SPARK-33557
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.0.1, 3.0.0
            Reporter: Ohad

According to the documentation, "spark.network.timeout" is the default for "spark.storage.blockManagerSlaveTimeoutMs", which implies that when the user sets "spark.network.timeout", the effective value of "spark.storage.blockManagerSlaveTimeoutMs" should also change if it was not explicitly set. However, this is not the case: the default value of "spark.storage.blockManagerSlaveTimeoutMs" is always the default value of "spark.network.timeout" (120s).

"spark.storage.blockManagerSlaveTimeoutMs" is defined in the package object of "org.apache.spark.internal.config" as follows:

{code:java}
private[spark] val STORAGE_BLOCKMANAGER_SLAVE_TIMEOUT =
  ConfigBuilder("spark.storage.blockManagerSlaveTimeoutMs")
    .version("0.7.0")
    .timeConf(TimeUnit.MILLISECONDS)
    .createWithDefaultString(Network.NETWORK_TIMEOUT.defaultValueString)
{code}

So it seems that its default value is indeed "fixed" to the default value of "spark.network.timeout".
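The distinction the report draws can be sketched in plain Python (hypothetical helper names; this is not Spark's ConfigBuilder API): capturing the other key's *default* at definition time bakes in 120s forever, whereas the documented behavior implies resolving the *effective* value of spark.network.timeout at lookup time.

```python
# Illustration of the two resolution strategies; names are made up and
# this is not Spark's actual configuration machinery.

DEFAULTS = {"spark.network.timeout": "120s"}

def get_eager(conf):
    # What the report describes: the fallback was captured once, at
    # definition time, from spark.network.timeout's *default* value.
    baked_in = DEFAULTS["spark.network.timeout"]  # always "120s"
    return conf.get("spark.storage.blockManagerSlaveTimeoutMs", baked_in)

def get_fallback(conf):
    # What the documentation implies: fall back to the *effective*
    # value of spark.network.timeout at lookup time.
    network = conf.get("spark.network.timeout",
                       DEFAULTS["spark.network.timeout"])
    return conf.get("spark.storage.blockManagerSlaveTimeoutMs", network)

conf = {"spark.network.timeout": "300s"}  # user raised the network timeout
print(get_eager(conf))     # "120s" - ignores the user's setting
print(get_fallback(conf))  # "300s" - follows it, as documented
```

When the user sets spark.storage.blockManagerSlaveTimeoutMs explicitly, both strategies return that value; the two only diverge when the storage timeout is unset and spark.network.timeout is changed, which is exactly the case the report exercises.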