[jira] [Commented] (SPARK-40808) Infer schema for CSV files - wrong behavior using header + merge schema

2022-10-17 Thread ohad (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618839#comment-17618839
 ] 

ohad commented on SPARK-40808:
--

[~Zing] [~hyukjin.kwon] 

File test_csv.py is attached.

As you can see, the last test fails.
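
For anyone who can't open the attachment, a minimal standalone reproduction along
the same lines (a hypothetical sketch, not the attached test_csv.py) could look
like this:
{code:python}
import os
import tempfile

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Two headered CSV files; file2 adds an extra int column.
path = tempfile.mkdtemp()
with open(os.path.join(path, "file1.csv"), "w") as f:
    f.write('"int_col","string_col","decimal_col","date_col"\n'
            '1,"hello",1.43,2022-02-23\n'
            '2,"world",5.534,2021-05-05\n')
with open(os.path.join(path, "file2.csv"), "w") as f:
    f.write('"int_col","string_col","decimal_col","date_col","int2_col"\n'
            '3,"my name",86.455,2011-08-15,32\n'
            '4,"is ohad",6.234,2002-03-22,2\n')

df = (spark.read
      .option("header", True)
      .option("mergeSchema", True)
      .option("inferSchema", True)
      .csv(path))

# Expected: the same types as when each file is read alone (int_col=int,
# decimal_col=double). Observed per this report: every column shared with
# file1 comes back as string once file2 introduces the extra int column.
df.printSchema()
spark.stop()
{code}
In the meantime, supplying an explicit schema with .schema(...) instead of
inferSchema=True avoids the inference path entirely and can serve as a workaround.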

> Infer schema for CSV files - wrong behavior using header + merge schema
> ---
>
> Key: SPARK-40808
> URL: https://issues.apache.org/jira/browse/SPARK-40808
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.2
>Reporter: ohad
>Priority: Major
>  Labels: CSVReader, csv, csvparser
> Attachments: test_csv.py
>
>
> Hello.
> I am writing unit tests for some functionality in my application that reads
> data from CSV files using Spark.
> I am reading the data using:
> {code:java}
> header=True
> mergeSchema=True
> inferSchema=True{code}
> When I read this single file:
> {code:java}
> File1:
> "int_col","string_col","decimal_col","date_col"
> 1,"hello",1.43,2022-02-23
> 2,"world",5.534,2021-05-05
> 3,"my name",86.455,2011-08-15
> 4,"is ohad",6.234,2002-03-22{code}
> I get this schema:
> {code:java}
> int_col=int
> string_col=string
> decimal_col=double
> date_col=string{code}
> When I duplicate this file, I get the same schema.
> The strange part is that when I add a new int column, it looks like Spark gets
> confused and thinks that the columns already identified as int are now string:
> {code:java}
> File1:
> "int_col","string_col","decimal_col","date_col"
> 1,"hello",1.43,2022-02-23
> 2,"world",5.534,2021-05-05
> 3,"my name",86.455,2011-08-15
> 4,"is ohad",6.234,2002-03-22
> File2:
> "int_col","string_col","decimal_col","date_col","int2_col"
> 1,"hello",1.43,2022-02-23,234
> 2,"world",5.534,2021-05-05,5
> 3,"my name",86.455,2011-08-15,32
> 4,"is ohad",6.234,2002-03-22,2
> {code}
> result:
> {code:java}
> int_col=string
> string_col=string
> decimal_col=string
> date_col=string
> int2_col=int{code}
> When I read only the second file, it looks fine:
> {code:java}
> File2:
> "int_col","string_col","decimal_col","date_col","int2_col"
> 1,"hello",1.43,2022-02-23,234
> 2,"world",5.534,2021-05-05,5
> 3,"my name",86.455,2011-08-15,32
> 4,"is ohad",6.234,2002-03-22,2{code}
> result:
> {code:java}
> int_col=int
> string_col=string
> decimal_col=double
> date_col=string
> int2_col=int{code}
> In conclusion, it looks like there is a bug in the interaction of two
> features: header recognition and schema merging.






[jira] [Updated] (SPARK-40808) Infer schema for CSV files - wrong behavior using header + merge schema

2022-10-17 Thread ohad (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ohad updated SPARK-40808:
-
Attachment: test_csv.py




[jira] [Updated] (SPARK-40808) Infer schema for CSV files - wrong behavior using header + merge schema

2022-10-16 Thread ohad (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ohad updated SPARK-40808:
-
Description: 
Hello.
I am writing unit tests for some functionality in my application that reads
data from CSV files using Spark.

I am reading the data using:
{code:java}
header=True
mergeSchema=True
inferSchema=True{code}
When I read this single file:
{code:java}
File1:
"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22{code}
I get this schema:
{code:java}
int_col=int
string_col=string
decimal_col=double
date_col=string{code}

When I duplicate this file, I get the same schema.

The strange part is that when I add a new int column, it looks like Spark gets
confused and thinks that the columns already identified as int are now string:
{code:java}
File1:
"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22
File2:
"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
{code}
result:
{code:java}
int_col=string
string_col=string
decimal_col=string
date_col=string
int2_col=int{code}

When I read only the second file, it looks fine:
{code:java}
File2:
"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2{code}
result:
{code:java}
int_col=int
string_col=string
decimal_col=double
date_col=string
int2_col=int{code}
In conclusion, it looks like there is a bug in the interaction of two features:
header recognition and schema merging.

  was:
Hello.
I am writing unit tests for some functionality in my application that reads
data from CSV files using Spark.

I am reading the data using:
```
header=True
mergeSchema=True
inferSchema=True
```

When I read this single file:
```
File1:
"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22
```

I get this schema:
```
int_col=int
string_col=string
decimal_col=double
date_col=string
```

When I duplicate this file, I get the same schema.

The strange part is that when I add a new int column, it looks like Spark gets
confused and thinks that the columns already identified as int are now string:
```
File1:
"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22

File2:
"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
```

result:
```
int_col=string
string_col=string
decimal_col=string
date_col=string
int2_col=int
```

When I read only the second file, it looks fine:
```
File2:
"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
```

result:
```
int_col=int
string_col=string
decimal_col=double
date_col=string
int2_col=int
```

In conclusion, it looks like there is a bug in the interaction of two features:
header recognition and schema merging.



[jira] [Updated] (SPARK-40808) Infer schema for CSV files - wrong behavior using header + merge schema

2022-10-16 Thread ohad (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ohad updated SPARK-40808:
-
Description: 
Hello.
I am writing unit tests for some functionality in my application that reads
data from CSV files using Spark.

I am reading the data using:
```
header=True
mergeSchema=True
inferSchema=True
```

When I read this single file:
```
File1:
"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22
```

I get this schema:
```
int_col=int
string_col=string
decimal_col=double
date_col=string
```

When I duplicate this file, I get the same schema.

The strange part is that when I add a new int column, it looks like Spark gets
confused and thinks that the columns already identified as int are now string:
```
File1:
"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22

File2:
"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
```

result:
```
int_col=string
string_col=string
decimal_col=string
date_col=string
int2_col=int
```

When I read only the second file, it looks fine:
```
File2:
"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
```

result:
```
int_col=int
string_col=string
decimal_col=double
date_col=string
int2_col=int
```

In conclusion, it looks like there is a bug in the interaction of two features:
header recognition and schema merging.

  was:
Hello.
I am writing unit tests for some functionality in my application that reads
data from CSV files using Spark.

I am reading the data using:
```
header=True
mergeSchema=True
inferSchema=True
```

When I read this single file:
```
File1:
"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22
```

I get this schema:
```
int_col=int
string_col=string
decimal_col=double
date_col=string
```

When I duplicate this file, I get the same schema.

The strange part is that when I add a new int column, it looks like Spark gets
confused and thinks that the columns already identified as int are now string:
```
File1:
"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22

File2:
"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
```

result:
```
int_col=string
string_col=string
decimal_col=string
date_col=string
int2_col=int
```

When I read only the second file, it looks fine:
```
File2:
"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
```

result:
```
int_col=int
string_col=string
decimal_col=double
date_col=string
int2_col=int
```

In conclusion, it looks like there is a bug in the interaction of two features:
header recognition and schema merging.



[jira] [Created] (SPARK-40808) Infer schema for CSV files - wrong behavior using header + merge schema

2022-10-16 Thread ohad (Jira)
ohad created SPARK-40808:


 Summary: Infer schema for CSV files - wrong behavior using header 
+ merge schema
 Key: SPARK-40808
 URL: https://issues.apache.org/jira/browse/SPARK-40808
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.2
Reporter: ohad


Hello.
I am writing unit tests for some functionality in my application that reads
data from CSV files using Spark.

I am reading the data using:
```
header=True
mergeSchema=True
inferSchema=True
```

When I read this single file:
```
File1:
"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22
```

I get this schema:
```
int_col=int
string_col=string
decimal_col=double
date_col=string
```

When I duplicate this file, I get the same schema.

The strange part is that when I add a new int column, it looks like Spark gets
confused and thinks that the columns already identified as int are now string:
```
File1:
"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22

File2:
"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
```

result:
```
int_col=string
string_col=string
decimal_col=string
date_col=string
int2_col=int
```

When I read only the second file, it looks fine:
```
File2:
"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
```

result:
```
int_col=int
string_col=string
decimal_col=double
date_col=string
int2_col=int
```

In conclusion, it looks like there is a bug in the interaction of two features:
header recognition and schema merging.






[jira] [Created] (SPARK-33557) spark.storage.blockManagerSlaveTimeoutMs default value does not follow spark.network.timeout value when the latter was changed

2020-11-25 Thread Ohad (Jira)
Ohad created SPARK-33557:


 Summary: spark.storage.blockManagerSlaveTimeoutMs default value 
does not follow spark.network.timeout value when the latter was changed
 Key: SPARK-33557
 URL: https://issues.apache.org/jira/browse/SPARK-33557
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.1, 3.0.0
Reporter: Ohad


According to the documentation, "spark.network.timeout" is the default timeout
for "spark.storage.blockManagerSlaveTimeoutMs", which implies that when the user
sets "spark.network.timeout", the effective value of
"spark.storage.blockManagerSlaveTimeoutMs" should also change unless it was
explicitly set.

However, this is not the case: the default value of
"spark.storage.blockManagerSlaveTimeoutMs" is always the default value of
"spark.network.timeout" (120s), even when "spark.network.timeout" is overridden.

"spark.storage.blockManagerSlaveTimeoutMs" is defined in the package object of 
"org.apache.spark.internal.config" as follows:
{code:java}
private[spark] val STORAGE_BLOCKMANAGER_SLAVE_TIMEOUT =
  ConfigBuilder("spark.storage.blockManagerSlaveTimeoutMs")
    .version("0.7.0")
    .timeConf(TimeUnit.MILLISECONDS)
    .createWithDefaultString(Network.NETWORK_TIMEOUT.defaultValueString)
{code}
So it seems that its default value is indeed "fixed" to the default value of
"spark.network.timeout", rather than tracking its effective value.
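
A hypothetical probe from user code (assuming a local session; note that
SparkConf only reports explicitly-set values, so the frozen default itself is
resolved internally and cannot be printed directly):
{code:python}
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Override the network timeout, then check whether the block manager
# timeout is reported as following it.
conf = SparkConf().set("spark.network.timeout", "300s")
spark = (SparkSession.builder
         .master("local[1]")
         .config(conf=conf)
         .getOrCreate())

sc_conf = spark.sparkContext.getConf()
print(sc_conf.get("spark.network.timeout"))  # 300s
# Not explicitly set, so the effective value falls back to the ConfigEntry
# default above -- which, per this report, is the compiled-in default of
# spark.network.timeout (120s) rather than the 300s override.
print(sc_conf.get("spark.storage.blockManagerSlaveTimeoutMs", "<unset>"))
spark.stop()
{code}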


--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org