[jira] [Updated] (SPARK-42905) pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect and inconsistent results for the same DataFrame if it has huge amount of Ties.

2023-03-22 Thread dronzer (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dronzer updated SPARK-42905:

Labels: correctness  (was: )

> pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect 
> and inconsistent results for the same DataFrame if it has huge amount of Ties.
> -
>
> Key: SPARK-42905
> URL: https://issues.apache.org/jira/browse/SPARK-42905
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.3.0
>Reporter: dronzer
>Priority: Blocker
>  Labels: correctness
> Attachments: image-2023-03-23-10-51-28-420.png, 
> image-2023-03-23-10-52-11-481.png, image-2023-03-23-10-52-49-392.png, 
> image-2023-03-23-10-53-37-461.png, image-2023-03-23-10-55-26-879.png
>
>
> pyspark.ml.stat.Correlation
> The following is a scenario where the Correlation function fails to give
> correct Spearman coefficient results.
> Tested example: a Spark DataFrame with 2 columns, A and B.
> !image-2023-03-23-10-55-26-879.png|width=562,height=162!
> Column A has 3 distinct values over a total of 108 million rows.
> Column B has 4 distinct values over a total of 108 million rows.
> If I calculate the correlation for this DataFrame with pandas DF.corr, it
> gives the correct answer, and even if I run the same code multiple times the
> same answer is produced. (Each column has only 3-4 distinct values.)
> !image-2023-03-23-10-53-37-461.png|width=468,height=287!
>  
> In Spark, using Spearman correlation produces *different results* for the
> *same DataFrame* on multiple runs (see below). (Each column in this df has
> only 3-4 distinct values.)
> !image-2023-03-23-10-52-49-392.png|width=516,height=322!
>  
> In pandas, DF.corr gives the same results on the same DataFrame across
> multiple runs, which is the expected behaviour. In Spark, however, the same
> data gives a different result; moreover, running the same cell with the same
> data multiple times produces different results, meaning the output is
> inconsistent.
> Looking at the data, the only observation I could make is the ties in the
> data (only 3-4 distinct values over 108M rows). This scenario is not handled
> by Spark's Correlation method, as the same data produces consistent results
> in Python with df.corr.
> The only workaround we could find in Spark that gives consistent output,
> matching Python, is a Pandas UDF, as shown below:
> !image-2023-03-23-10-52-11-481.png|width=518,height=111!
> !image-2023-03-23-10-51-28-420.png|width=509,height=270!
>  
> We also tried the pyspark.pandas.DataFrame.corr method, and it produces
> incorrect and inconsistent results for this case too.
> Only a Pandas UDF seems to provide consistent results.
>  
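For reference, here is a minimal sketch of the reported scenario. It is an editorial illustration only: the SparkSession setup, the 1 million generated rows, and the modulo-derived values below stand in for the reporter's 108M-row data shown in the screenshots and are not taken from the report.

{code:python}
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

spark = SparkSession.builder.getOrCreate()

# Tie-heavy data: two columns with only 3 and 4 distinct values respectively.
df = (spark.range(0, 1_000_000)
      .select((F.col("id") % 3).cast("double").alias("A"),
              (F.col("id") % 4).cast("double").alias("B")))

# pyspark.ml.stat.Correlation expects a single vector column.
vec = VectorAssembler(inputCols=["A", "B"], outputCol="features").transform(df)

# Per the report, the off-diagonal value can differ between runs when the
# data contains many ties.
for run in range(2):
    matrix = Correlation.corr(vec, "features", method="spearman").head()[0]
    print(f"Spark run {run}: {matrix.toArray()[0, 1]}")

# pandas on the same data returns the same value on every run.
print("pandas:", df.toPandas().corr(method="spearman").iloc[0, 1])
{code}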



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42905) pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect and inconsistent results for the same DataFrame if it has huge amount of Ties.

2023-03-22 Thread dronzer (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dronzer updated SPARK-42905:

Attachment: image-2023-03-23-10-55-26-879.png

> pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect 
> and inconsistent results for the same DataFrame if it has huge amount of Ties.
> -
>
> Key: SPARK-42905
> URL: https://issues.apache.org/jira/browse/SPARK-42905
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.3.0
>Reporter: dronzer
>Priority: Blocker
> Attachments: image-2023-03-23-10-51-28-420.png, 
> image-2023-03-23-10-52-11-481.png, image-2023-03-23-10-52-49-392.png, 
> image-2023-03-23-10-53-37-461.png, image-2023-03-23-10-55-26-879.png
>
>
> pyspark.ml.stat.Correlation
> The following is a scenario where the Correlation function fails to give
> correct Spearman coefficient results.
> Tested example: a Spark DataFrame with 2 columns, A and B.
> Column A has 3 distinct values over a total of 108 million rows.
> Column B has 4 distinct values over a total of 108 million rows.
> If I calculate the correlation for this DataFrame with pandas DF.corr, it
> gives the correct answer, and even if I run the same code multiple times the
> same answer is produced.
> !image-2023-03-23-10-38-49-071.png|width=526,height=258!
>  
> In Spark, using Spearman correlation produces *different results* for the
> *same DataFrame* on multiple runs (see below).
> !image-2023-03-23-10-41-38-696.png|width=527,height=329!
>  
> In pandas, DF.corr gives the same results on the same DataFrame across
> multiple runs, which is the expected behaviour. In Spark, however, the same
> data gives a different result; moreover, running the same cell with the same
> data multiple times produces different results, meaning the output is
> inconsistent.
> Looking at the data, the only observation I could make is the ties in the
> data (only 3-4 distinct values over 108M rows). This scenario is not handled
> by Spark's Correlation method, as the same data produces consistent results
> in Python with df.corr.
> The only workaround we could find in Spark that gives consistent output,
> matching Python, is a Pandas UDF, as shown below:
> !image-2023-03-23-10-48-01-045.png|width=554,height=94!
> !image-2023-03-23-10-48-55-922.png|width=568,height=301!
>  
> We also tried the pyspark.pandas.DataFrame.corr method, and it produces
> incorrect and inconsistent results for this case too.
> Only a Pandas UDF seems to provide consistent results.
>  






[jira] [Updated] (SPARK-42905) pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect and inconsistent results for the same DataFrame if it has huge amount of Ties.

2023-03-22 Thread dronzer (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dronzer updated SPARK-42905:

Description: 
pyspark.ml.stat.Correlation

The following is a scenario where the Correlation function fails to give
correct Spearman coefficient results.

Tested example: a Spark DataFrame with 2 columns, A and B.

!image-2023-03-23-10-55-26-879.png|width=562,height=162!

Column A has 3 distinct values over a total of 108 million rows.

Column B has 4 distinct values over a total of 108 million rows.

If I calculate the correlation for this DataFrame with pandas DF.corr, it
gives the correct answer, and even if I run the same code multiple times the
same answer is produced. (Each column has only 3-4 distinct values.)

!image-2023-03-23-10-53-37-461.png|width=468,height=287!

In Spark, using Spearman correlation produces *different results* for the
*same DataFrame* on multiple runs (see below). (Each column in this df has
only 3-4 distinct values.)

!image-2023-03-23-10-52-49-392.png|width=516,height=322!

In pandas, DF.corr gives the same results on the same DataFrame across
multiple runs, which is the expected behaviour. In Spark, however, the same
data gives a different result; moreover, running the same cell with the same
data multiple times produces different results, meaning the output is
inconsistent.

Looking at the data, the only observation I could make is the ties in the
data (only 3-4 distinct values over 108M rows). This scenario is not handled
by Spark's Correlation method, as the same data produces consistent results
in Python with df.corr.

The only workaround we could find in Spark that gives consistent output,
matching Python, is a Pandas UDF, as shown below:

!image-2023-03-23-10-52-11-481.png|width=518,height=111!

!image-2023-03-23-10-51-28-420.png|width=509,height=270!

We also tried the pyspark.pandas.DataFrame.corr method, and it produces
incorrect and inconsistent results for this case too.

Only a Pandas UDF seems to provide consistent results.

 

  was:
pyspark.ml.stat.Correlation

The following is a scenario where the Correlation function fails to give
correct Spearman coefficient results.

Tested example: a Spark DataFrame with 2 columns, A and B.

Column A has 3 distinct values over a total of 108 million rows.

Column B has 4 distinct values over a total of 108 million rows.

If I calculate the correlation for this DataFrame with pandas DF.corr, it
gives the correct answer, and even if I run the same code multiple times the
same answer is produced.

!image-2023-03-23-10-38-49-071.png|width=526,height=258!

In Spark, using Spearman correlation produces *different results* for the
*same DataFrame* on multiple runs (see below).

!image-2023-03-23-10-41-38-696.png|width=527,height=329!

In pandas, DF.corr gives the same results on the same DataFrame across
multiple runs, which is the expected behaviour. In Spark, however, the same
data gives a different result; moreover, running the same cell with the same
data multiple times produces different results, meaning the output is
inconsistent.

Looking at the data, the only observation I could make is the ties in the
data (only 3-4 distinct values over 108M rows). This scenario is not handled
by Spark's Correlation method, as the same data produces consistent results
in Python with df.corr.

The only workaround we could find in Spark that gives consistent output,
matching Python, is a Pandas UDF, as shown below:

!image-2023-03-23-10-48-01-045.png|width=554,height=94!

!image-2023-03-23-10-48-55-922.png|width=568,height=301!

We also tried the pyspark.pandas.DataFrame.corr method, and it produces
incorrect and inconsistent results for this case too.

Only a Pandas UDF seems to provide consistent results.

 


> pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect 
> and inconsistent results for the same DataFrame if it has huge amount of Ties.
> -
>
> Key: SPARK-42905
> URL: https://issues.apache.org/jira/browse/SPARK-42905
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.3.0
>Reporter: dronzer
>Priority: Blocker
> Attachments: image-2023-03-23-10-51-28-420.png, 
> image-2023-03-23-10-52-11-481.png, image-2023-03-23-10-52-49-392.png, 
> image-2023-03-23-10-53-37-461.png, image-2023-03-23-10-55-26-879.png
>
>
> pyspark.ml.stat.Correlation
> The following is a scenario where the Correlation function fails to give
> correct Spearman coefficient results.
> Tested example: a Spark DataFrame with 2 columns, A and B.
> !image-2023-03-23-10-55-26-879.png|width=562,height=162!
> Column A has 3 Distinct Values and total of 

[jira] [Updated] (SPARK-42905) pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect and inconsistent results for the same DataFrame if it has huge amount of Ties.

2023-03-22 Thread dronzer (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dronzer updated SPARK-42905:

Attachment: image-2023-03-23-10-53-37-461.png

> pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect 
> and inconsistent results for the same DataFrame if it has huge amount of Ties.
> -
>
> Key: SPARK-42905
> URL: https://issues.apache.org/jira/browse/SPARK-42905
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.3.0
>Reporter: dronzer
>Priority: Blocker
> Attachments: image-2023-03-23-10-51-28-420.png, 
> image-2023-03-23-10-52-11-481.png, image-2023-03-23-10-52-49-392.png, 
> image-2023-03-23-10-53-37-461.png
>
>
> pyspark.ml.stat.Correlation
> The following is a scenario where the Correlation function fails to give
> correct Spearman coefficient results.
> Tested example: a Spark DataFrame with 2 columns, A and B.
> Column A has 3 distinct values over a total of 108 million rows.
> Column B has 4 distinct values over a total of 108 million rows.
> If I calculate the correlation for this DataFrame with pandas DF.corr, it
> gives the correct answer, and even if I run the same code multiple times the
> same answer is produced.
> !image-2023-03-23-10-38-49-071.png|width=526,height=258!
>  
> In Spark, using Spearman correlation produces *different results* for the
> *same DataFrame* on multiple runs (see below).
> !image-2023-03-23-10-41-38-696.png|width=527,height=329!
>  
> In pandas, DF.corr gives the same results on the same DataFrame across
> multiple runs, which is the expected behaviour. In Spark, however, the same
> data gives a different result; moreover, running the same cell with the same
> data multiple times produces different results, meaning the output is
> inconsistent.
> Looking at the data, the only observation I could make is the ties in the
> data (only 3-4 distinct values over 108M rows). This scenario is not handled
> by Spark's Correlation method, as the same data produces consistent results
> in Python with df.corr.
> The only workaround we could find in Spark that gives consistent output,
> matching Python, is a Pandas UDF, as shown below:
> !image-2023-03-23-10-48-01-045.png|width=554,height=94!
> !image-2023-03-23-10-48-55-922.png|width=568,height=301!
>  
> We also tried the pyspark.pandas.DataFrame.corr method, and it produces
> incorrect and inconsistent results for this case too.
> Only a Pandas UDF seems to provide consistent results.
>  






[jira] [Updated] (SPARK-42905) pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect and inconsistent results for the same DataFrame if it has huge amount of Ties.

2023-03-22 Thread dronzer (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dronzer updated SPARK-42905:

Attachment: image-2023-03-23-10-52-49-392.png

> pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect 
> and inconsistent results for the same DataFrame if it has huge amount of Ties.
> -
>
> Key: SPARK-42905
> URL: https://issues.apache.org/jira/browse/SPARK-42905
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.3.0
>Reporter: dronzer
>Priority: Blocker
> Attachments: image-2023-03-23-10-51-28-420.png, 
> image-2023-03-23-10-52-11-481.png, image-2023-03-23-10-52-49-392.png
>
>
> pyspark.ml.stat.Correlation
> The following is a scenario where the Correlation function fails to give
> correct Spearman coefficient results.
> Tested example: a Spark DataFrame with 2 columns, A and B.
> Column A has 3 distinct values over a total of 108 million rows.
> Column B has 4 distinct values over a total of 108 million rows.
> If I calculate the correlation for this DataFrame with pandas DF.corr, it
> gives the correct answer, and even if I run the same code multiple times the
> same answer is produced.
> !image-2023-03-23-10-38-49-071.png|width=526,height=258!
>  
> In Spark, using Spearman correlation produces *different results* for the
> *same DataFrame* on multiple runs (see below).
> !image-2023-03-23-10-41-38-696.png|width=527,height=329!
>  
> In pandas, DF.corr gives the same results on the same DataFrame across
> multiple runs, which is the expected behaviour. In Spark, however, the same
> data gives a different result; moreover, running the same cell with the same
> data multiple times produces different results, meaning the output is
> inconsistent.
> Looking at the data, the only observation I could make is the ties in the
> data (only 3-4 distinct values over 108M rows). This scenario is not handled
> by Spark's Correlation method, as the same data produces consistent results
> in Python with df.corr.
> The only workaround we could find in Spark that gives consistent output,
> matching Python, is a Pandas UDF, as shown below:
> !image-2023-03-23-10-48-01-045.png|width=554,height=94!
> !image-2023-03-23-10-48-55-922.png|width=568,height=301!
>  
> We also tried the pyspark.pandas.DataFrame.corr method, and it produces
> incorrect and inconsistent results for this case too.
> Only a Pandas UDF seems to provide consistent results.
>  






[jira] [Updated] (SPARK-42905) pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect and inconsistent results for the same DataFrame if it has huge amount of Ties.

2023-03-22 Thread dronzer (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dronzer updated SPARK-42905:

Attachment: image-2023-03-23-10-52-11-481.png

> pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect 
> and inconsistent results for the same DataFrame if it has huge amount of Ties.
> -
>
> Key: SPARK-42905
> URL: https://issues.apache.org/jira/browse/SPARK-42905
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.3.0
>Reporter: dronzer
>Priority: Blocker
> Attachments: image-2023-03-23-10-51-28-420.png, 
> image-2023-03-23-10-52-11-481.png, image-2023-03-23-10-52-49-392.png
>
>
> pyspark.ml.stat.Correlation
> The following is a scenario where the Correlation function fails to give
> correct Spearman coefficient results.
> Tested example: a Spark DataFrame with 2 columns, A and B.
> Column A has 3 distinct values over a total of 108 million rows.
> Column B has 4 distinct values over a total of 108 million rows.
> If I calculate the correlation for this DataFrame with pandas DF.corr, it
> gives the correct answer, and even if I run the same code multiple times the
> same answer is produced.
> !image-2023-03-23-10-38-49-071.png|width=526,height=258!
>  
> In Spark, using Spearman correlation produces *different results* for the
> *same DataFrame* on multiple runs (see below).
> !image-2023-03-23-10-41-38-696.png|width=527,height=329!
>  
> In pandas, DF.corr gives the same results on the same DataFrame across
> multiple runs, which is the expected behaviour. In Spark, however, the same
> data gives a different result; moreover, running the same cell with the same
> data multiple times produces different results, meaning the output is
> inconsistent.
> Looking at the data, the only observation I could make is the ties in the
> data (only 3-4 distinct values over 108M rows). This scenario is not handled
> by Spark's Correlation method, as the same data produces consistent results
> in Python with df.corr.
> The only workaround we could find in Spark that gives consistent output,
> matching Python, is a Pandas UDF, as shown below:
> !image-2023-03-23-10-48-01-045.png|width=554,height=94!
> !image-2023-03-23-10-48-55-922.png|width=568,height=301!
>  
> We also tried the pyspark.pandas.DataFrame.corr method, and it produces
> incorrect and inconsistent results for this case too.
> Only a Pandas UDF seems to provide consistent results.
>  






[jira] [Updated] (SPARK-42905) pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect and inconsistent results for the same DataFrame if it has huge amount of Ties.

2023-03-22 Thread dronzer (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dronzer updated SPARK-42905:

Attachment: image-2023-03-23-10-51-28-420.png

> pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect 
> and inconsistent results for the same DataFrame if it has huge amount of Ties.
> -
>
> Key: SPARK-42905
> URL: https://issues.apache.org/jira/browse/SPARK-42905
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.3.0
>Reporter: dronzer
>Priority: Blocker
> Attachments: image-2023-03-23-10-51-28-420.png
>
>
> pyspark.ml.stat.Correlation
> The following is a scenario where the Correlation function fails to give
> correct Spearman coefficient results.
> Tested example: a Spark DataFrame with 2 columns, A and B.
> Column A has 3 distinct values over a total of 108 million rows.
> Column B has 4 distinct values over a total of 108 million rows.
> If I calculate the correlation for this DataFrame with pandas DF.corr, it
> gives the correct answer, and even if I run the same code multiple times the
> same answer is produced.
> !image-2023-03-23-10-38-49-071.png|width=526,height=258!
>  
> In Spark, using Spearman correlation produces *different results* for the
> *same DataFrame* on multiple runs (see below).
> !image-2023-03-23-10-41-38-696.png|width=527,height=329!
>  
> In pandas, DF.corr gives the same results on the same DataFrame across
> multiple runs, which is the expected behaviour. In Spark, however, the same
> data gives a different result; moreover, running the same cell with the same
> data multiple times produces different results, meaning the output is
> inconsistent.
> Looking at the data, the only observation I could make is the ties in the
> data (only 3-4 distinct values over 108M rows). This scenario is not handled
> by Spark's Correlation method, as the same data produces consistent results
> in Python with df.corr.
> The only workaround we could find in Spark that gives consistent output,
> matching Python, is a Pandas UDF, as shown below:
> !image-2023-03-23-10-48-01-045.png|width=554,height=94!
> !image-2023-03-23-10-48-55-922.png|width=568,height=301!
>  
> We also tried the pyspark.pandas.DataFrame.corr method, and it produces
> incorrect and inconsistent results for this case too.
> Only a Pandas UDF seems to provide consistent results.
>  






[jira] [Created] (SPARK-42905) pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect and inconsistent results for the same DataFrame if it has huge amount of Ties.

2023-03-22 Thread dronzer (Jira)
dronzer created SPARK-42905:
---

 Summary: pyspark.ml.stat.Correlation - Spearman Correlation method 
giving incorrect and inconsistent results for the same DataFrame if it has huge 
amount of Ties.
 Key: SPARK-42905
 URL: https://issues.apache.org/jira/browse/SPARK-42905
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 3.3.0
Reporter: dronzer


pyspark.ml.stat.Correlation

The following is a scenario where the Correlation function fails to give
correct Spearman coefficient results.

Tested example: a Spark DataFrame with 2 columns, A and B.

Column A has 3 distinct values over a total of 108 million rows.

Column B has 4 distinct values over a total of 108 million rows.

If I calculate the correlation for this DataFrame with pandas DF.corr, it
gives the correct answer, and even if I run the same code multiple times the
same answer is produced.

!image-2023-03-23-10-38-49-071.png|width=526,height=258!

In Spark, using Spearman correlation produces *different results* for the
*same DataFrame* on multiple runs (see below).

!image-2023-03-23-10-41-38-696.png|width=527,height=329!

In pandas, DF.corr gives the same results on the same DataFrame across
multiple runs, which is the expected behaviour. In Spark, however, the same
data gives a different result; moreover, running the same cell with the same
data multiple times produces different results, meaning the output is
inconsistent.

Looking at the data, the only observation I could make is the ties in the
data (only 3-4 distinct values over 108M rows). This scenario is not handled
by Spark's Correlation method, as the same data produces consistent results
in Python with df.corr.

The only workaround we could find in Spark that gives consistent output,
matching Python, is a Pandas UDF, as shown below:

!image-2023-03-23-10-48-01-045.png|width=554,height=94!

!image-2023-03-23-10-48-55-922.png|width=568,height=301!

We also tried the pyspark.pandas.DataFrame.corr method, and it produces
incorrect and inconsistent results for this case too.

Only a Pandas UDF seems to provide consistent results.
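For reference, here is a sketch of the kind of Pandas UDF workaround described above. The reporter's actual UDF is only visible in the attached screenshots, so the grouping trick, output schema, and toy data below are assumptions; it also requires the whole dataset to fit into a single pandas group on one executor.

{code:python}
import pandas as pd
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the reporter's two-column, tie-heavy DataFrame.
df = spark.createDataFrame(
    [(1.0, 2.0), (1.0, 3.0), (2.0, 2.0), (3.0, 4.0)], ["A", "B"])

out_schema = StructType([StructField("spearman_A_B", DoubleType())])

def spearman(pdf: pd.DataFrame) -> pd.DataFrame:
    # Delegate ranking and tie handling to pandas, which is deterministic.
    value = pdf["A"].corr(pdf["B"], method="spearman")
    return pd.DataFrame({"spearman_A_B": [value]})

# Put all rows into one group so pandas sees the full dataset at once.
df.groupBy(F.lit(1)).applyInPandas(spearman, schema=out_schema).show()
{code}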

 






[jira] [Created] (SPARK-42904) Char/Varchar Support for JDBC Catalog

2023-03-22 Thread Kent Yao (Jira)
Kent Yao created SPARK-42904:


 Summary: Char/Varchar Support for JDBC Catalog
 Key: SPARK-42904
 URL: https://issues.apache.org/jira/browse/SPARK-42904
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Kent Yao


create table pub.src(c char(10)) -> 
org.apache.spark.SparkIllegalArgumentException: Can't get JDBC type for 
char(10).
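For context, a sketch of one way to hit this from PySpark. The `spark.sql.catalog.*` settings and the JDBCTableCatalog class follow the Spark JDBC data source documentation, but the H2 URL and driver below are illustrative assumptions, not taken from the report:

{code:python}
from pyspark.sql import SparkSession

# Register "pub" as a DataSource V2 JDBC catalog (connection details assumed).
spark = (SparkSession.builder
         .config("spark.sql.catalog.pub",
                 "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
         .config("spark.sql.catalog.pub.url", "jdbc:h2:mem:testdb")
         .config("spark.sql.catalog.pub.driver", "org.h2.Driver")
         .getOrCreate())

# Expected failure per the report:
# org.apache.spark.SparkIllegalArgumentException: Can't get JDBC type for char(10).
spark.sql("create table pub.src(c char(10))")
{code}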






[jira] [Updated] (SPARK-42903) Avoid documenting None as a return value in docstring

2023-03-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-42903:
-
Priority: Trivial  (was: Major)

> Avoid documenting None as a return value in docstring
> 
>
> Key: SPARK-42903
> URL: https://issues.apache.org/jira/browse/SPARK-42903
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> e.g.:
> {code}
> +++ b/python/pyspark/sql/dataframe.py
> @@ -385,10 +385,6 @@ class DataFrame(PandasMapOpsMixin, 
> PandasConversionMixin):
>  name : str
>  Name of the view.
> -Returns
> ----
> -None
> -
>  Examples
>  
>  Create a local temporary view named 'people'.
> @@ -426,10 +422,6 @@ class DataFrame(PandasMapOpsMixin, 
> PandasConversionMixin):
>  name : str
>  Name of the view.
> -Returns
> ----
> -None
> -
>  Examples
>  
>  Create a global temporary view.
> @@ -467,10 +459,6 @@ class DataFrame(PandasMapOpsMixin, 
> PandasConversionMixin):
>  name : str
>  Name of the view.
> -Returns
> ----
> -None
> {code}
> to be consistent. In Python, it is idiomatic not to document the return value
> when a function returns None.






[jira] [Resolved] (SPARK-42863) Review and fix issues in PySpark API docs

2023-03-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42863.
--
Resolution: Done

> Review and fix issues in PySpark API docs
> -
>
> Key: SPARK-42863
> URL: https://issues.apache.org/jira/browse/SPARK-42863
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Hyukjin Kwon
>Priority: Major
>







[jira] [Created] (SPARK-42903) Avoid documenting None as a return value in docstring

2023-03-22 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-42903:


 Summary: Avoid documenting None as a return value in docstring
 Key: SPARK-42903
 URL: https://issues.apache.org/jira/browse/SPARK-42903
 Project: Spark
  Issue Type: Documentation
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon


e.g.:

{code}
+++ b/python/pyspark/sql/dataframe.py
@@ -385,10 +385,6 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
 name : str
 Name of the view.

-Returns
----
-None
-
 Examples
 
 Create a local temporary view named 'people'.
@@ -426,10 +422,6 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
 name : str
 Name of the view.

-Returns
----
-None
-
 Examples
 
 Create a global temporary view.
@@ -467,10 +459,6 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
 name : str
 Name of the view.

-Returns
----
-None
{code}

to be consistent. In Python, it is idiomatic not to document the return value
when a function returns None.
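As an illustration, here is a hypothetical docstring written the way this change suggests, with no Returns section for a function that returns None (the helper name and body are assumptions, not from the patch):

{code:python}
# Hypothetical helper: it returns None, so the docstring documents the
# Parameters section but deliberately omits a "Returns" section.
def register_temp_view(df, name: str) -> None:
    """Create or replace a local temporary view from ``df``.

    Parameters
    ----------
    name : str
        Name of the view.
    """
    df.createOrReplaceTempView(name)
{code}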






[jira] [Resolved] (SPARK-42901) Should move message StorageLevel from `base.proto` to a separate file

2023-03-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42901.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40518
[https://github.com/apache/spark/pull/40518]

> Should move message StorageLevel from `base.proto` to a separate file
> -
>
> Key: SPARK-42901
> URL: https://issues.apache.org/jira/browse/SPARK-42901
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.5.0
>
>
> [https://github.com/apache/spark/pull/40510] introduced `message StorageLevel`
> into `base.proto`, but if we try to import `base.proto` in `catalog.proto` to
> reuse `StorageLevel` in `message CacheTable` and run `build/sbt
> "connect-common/compile"` to compile, the following messages appear in the
> compile log:
>  
> {code:java}
> spark/connect/base.proto:23:1: File recursively imports itself: 
> spark/connect/base.proto -> spark/connect/commands.proto -> 
> spark/connect/relations.proto -> spark/connect/catalog.proto -> 
> spark/connect/base.proto
> spark/connect/catalog.proto:22:1: Import "spark/connect/base.proto" was not 
> found or had errors.
> spark/connect/catalog.proto:144:12: "spark.connect.DataType" seems to be 
> defined in "spark/connect/types.proto", which is not imported by 
> "spark/connect/catalog.proto".  To use it here, please add the necessary 
> import.
> spark/connect/catalog.proto:161:12: "spark.connect.DataType" seems to be 
> defined in "spark/connect/types.proto", which is not imported by 
> "spark/connect/catalog.proto".  To use it here, please add the necessary 
> import.
> spark/connect/relations.proto:25:1: Import "spark/connect/catalog.proto" was 
> not found or had errors.
> spark/connect/relations.proto:84:5: "Catalog" is not defined.
> spark/connect/commands.proto:22:1: Import "spark/connect/relations.proto" was 
> not found or had errors.
> spark/connect/commands.proto:63:3: "Relation" is not defined.
> spark/connect/commands.proto:81:3: "Relation" is not defined.
> spark/connect/commands.proto:142:3: "Relation" is not defined.
> spark/connect/base.proto:23:1: Import "spark/connect/commands.proto" was not 
> found or had errors.
> spark/connect/base.proto:25:1: Import "spark/connect/relations.proto" was not 
> found or had errors.
>  {code}
> So we should move `message StorageLevel` from `base.proto` to a separate file 
> to avoid this issue
>  
>  






[jira] [Assigned] (SPARK-42901) Should move message StorageLevel from `base.proto` to a separate file

2023-03-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-42901:


Assignee: Yang Jie

> Should move message StorageLevel from `base.proto` to a separate file
> -
>
> Key: SPARK-42901
> URL: https://issues.apache.org/jira/browse/SPARK-42901
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>
> [https://github.com/apache/spark/pull/40510] introduced `message StorageLevel`
> into `base.proto`, but if we try to import `base.proto` in `catalog.proto` to
> reuse `StorageLevel` in `message CacheTable` and run `build/sbt
> "connect-common/compile"` to compile, the following messages appear in the
> compile log:
>  
> {code:java}
> spark/connect/base.proto:23:1: File recursively imports itself: 
> spark/connect/base.proto -> spark/connect/commands.proto -> 
> spark/connect/relations.proto -> spark/connect/catalog.proto -> 
> spark/connect/base.proto
> spark/connect/catalog.proto:22:1: Import "spark/connect/base.proto" was not 
> found or had errors.
> spark/connect/catalog.proto:144:12: "spark.connect.DataType" seems to be 
> defined in "spark/connect/types.proto", which is not imported by 
> "spark/connect/catalog.proto".  To use it here, please add the necessary 
> import.
> spark/connect/catalog.proto:161:12: "spark.connect.DataType" seems to be 
> defined in "spark/connect/types.proto", which is not imported by 
> "spark/connect/catalog.proto".  To use it here, please add the necessary 
> import.
> spark/connect/relations.proto:25:1: Import "spark/connect/catalog.proto" was 
> not found or had errors.
> spark/connect/relations.proto:84:5: "Catalog" is not defined.
> spark/connect/commands.proto:22:1: Import "spark/connect/relations.proto" was 
> not found or had errors.
> spark/connect/commands.proto:63:3: "Relation" is not defined.
> spark/connect/commands.proto:81:3: "Relation" is not defined.
> spark/connect/commands.proto:142:3: "Relation" is not defined.
> spark/connect/base.proto:23:1: Import "spark/connect/commands.proto" was not 
> found or had errors.
> spark/connect/base.proto:25:1: Import "spark/connect/relations.proto" was not 
> found or had errors.
>  {code}
> So we should move `message StorageLevel` from `base.proto` to a separate file 
> to avoid this issue
>  
>  






[jira] [Created] (SPARK-42902) CVE-2020-13936 request for upgrading version of Velocity

2023-03-22 Thread JacobZheng (Jira)
JacobZheng created SPARK-42902:
--

 Summary: CVE-2020-13936 request for upgrading version of Velocity
 Key: SPARK-42902
 URL: https://issues.apache.org/jira/browse/SPARK-42902
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Build
Affects Versions: 3.2.3
Reporter: JacobZheng


An attacker that is able to modify Velocity templates may execute arbitrary 
Java code or run arbitrary system commands with the same privileges as the 
account running the Servlet container. This applies to applications that allow 
untrusted users to upload/modify velocity templates running Apache Velocity 
Engine versions up to 2.2.

The current version of Velocity that Spark relies on is 1.5; should we upgrade
to version 2.3?







[jira] [Updated] (SPARK-42901) Should move message StorageLevel from `base.proto` to a separate file

2023-03-22 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-42901:
-
Summary: Should move message StorageLevel from `base.proto` to a separate 
file  (was: Should move message StorageLevel from `base.proto` into a separate 
file)

> Should move message StorageLevel from `base.proto` to a separate file
> -
>
> Key: SPARK-42901
> URL: https://issues.apache.org/jira/browse/SPARK-42901
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> [https://github.com/apache/spark/pull/40510] introduced `message StorageLevel`
> into `base.proto`, but if we try to import `base.proto` in `catalog.proto` to
> reuse `StorageLevel` in `message CacheTable` and run `build/sbt
> "connect-common/compile"` to compile, the following messages appear in the
> compile log:
>  
> {code:java}
> spark/connect/base.proto:23:1: File recursively imports itself: 
> spark/connect/base.proto -> spark/connect/commands.proto -> 
> spark/connect/relations.proto -> spark/connect/catalog.proto -> 
> spark/connect/base.proto
> spark/connect/catalog.proto:22:1: Import "spark/connect/base.proto" was not 
> found or had errors.
> spark/connect/catalog.proto:144:12: "spark.connect.DataType" seems to be 
> defined in "spark/connect/types.proto", which is not imported by 
> "spark/connect/catalog.proto".  To use it here, please add the necessary 
> import.
> spark/connect/catalog.proto:161:12: "spark.connect.DataType" seems to be 
> defined in "spark/connect/types.proto", which is not imported by 
> "spark/connect/catalog.proto".  To use it here, please add the necessary 
> import.
> spark/connect/relations.proto:25:1: Import "spark/connect/catalog.proto" was 
> not found or had errors.
> spark/connect/relations.proto:84:5: "Catalog" is not defined.
> spark/connect/commands.proto:22:1: Import "spark/connect/relations.proto" was 
> not found or had errors.
> spark/connect/commands.proto:63:3: "Relation" is not defined.
> spark/connect/commands.proto:81:3: "Relation" is not defined.
> spark/connect/commands.proto:142:3: "Relation" is not defined.
> spark/connect/base.proto:23:1: Import "spark/connect/commands.proto" was not 
> found or had errors.
> spark/connect/base.proto:25:1: Import "spark/connect/relations.proto" was not 
> found or had errors.
>  {code}
> So we should move `message StorageLevel` from `base.proto` to a separate file 
> to avoid this issue
>  
>  






[jira] [Resolved] (SPARK-41500) auto generate concat as Double when string minus an INTERVAL type

2023-03-22 Thread JacobZheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JacobZheng resolved SPARK-41500.

Resolution: Won't Fix

> auto generate concat as Double when string minus an INTERVAL type
> -
>
> Key: SPARK-41500
> URL: https://issues.apache.org/jira/browse/SPARK-41500
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.2.2
>Reporter: JacobZheng
>Priority: Major
>
> h2. *Describe the bug*
> Here is a sql.
> {code:sql}
> select '2022-02-01'- INTERVAL 1 year
> {code}
> Spark automatically generates cast('2022-02-01' as double) - INTERVAL 1 year,
> and a type mismatch occurs.
> h2. *To Reproduce*
> On Spark 3.0.1 using spark-shell
> {code:java}
> scala> spark.sql("select '2022-02-01'- interval 1 year").show
> +--+  
>   
> |CAST(CAST(2022-02-01 AS TIMESTAMP) - INTERVAL '1 years' AS STRING)|
> +--+
> |   2021-02-01 00:00:00|
> +--+
> {code}
> On Spark 3.2.1 using spark-shell
> {code:java}
> scala> spark.sql("select '2022-02-01'- interval 1 year").show
> org.apache.spark.sql.AnalysisException: cannot resolve '(CAST('2022-02-01' AS 
> DOUBLE) - INTERVAL '1' YEAR)' due to data type mismatch: differing types in 
> '(CAST('2022-02-01' AS DOUBLE) - INTERVAL '1' YEAR)' (double and interval 
> year).; line 1 pos 7;
> 'Project [unresolvedalias((cast(2022-02-01 as double) - INTERVAL '1' YEAR), 
> None)]
> +- OneRowRelation
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:190)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:175)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$2(TreeNode.scala:535)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:535)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532)
>   at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1128)
>   at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1127)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:467)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUpWithPruning$1(QueryPlan.scala:181)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:209)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>   at scala.collection.immutable.List.foreach(List.scala:431)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>   at scala.collection.immutable.List.map(List.scala:305)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:209)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUpWithPruning(QueryPlan.scala:181)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:161)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:175)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:94)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:263)
>   at 
> 

[jira] [Resolved] (SPARK-42899) DataFrame.to(schema) fails when it contains non-nullable nested field in nullable field

2023-03-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42899.
--
Fix Version/s: 3.4.1
 Assignee: Takuya Ueshin
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/40526

> DataFrame.to(schema) fails when it contains non-nullable nested field in 
> nullable field
> ---
>
> Key: SPARK-42899
> URL: https://issues.apache.org/jira/browse/SPARK-42899
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.4.1
>
>
> {{DataFrame.to(schema)}} fails when the schema contains a non-nullable nested
> field inside a nullable field:
> {code:scala}
> scala> val df = spark.sql("VALUES (1, STRUCT(1 as i)), (NULL, NULL) as t(a, 
> b)")
> df: org.apache.spark.sql.DataFrame = [a: int, b: struct<i: int>]
> scala> df.printSchema()
> root
>  |-- a: integer (nullable = true)
>  |-- b: struct (nullable = true)
>  ||-- i: integer (nullable = false)
> scala> df.to(df.schema)
> org.apache.spark.sql.AnalysisException: [NULLABLE_COLUMN_OR_FIELD] Column or 
> field `b`.`i` is nullable while it's required to be non-nullable.
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42901) Should move message StorageLevel from `base.proto` into a separate file

2023-03-22 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-42901:
-
Summary: Should move message StorageLevel from `base.proto` into a separate 
file  (was: Should move message StorageLevel from `base.proto` to a separate 
file)

> Should move message StorageLevel from `base.proto` into a separate file
> ---
>
> Key: SPARK-42901
> URL: https://issues.apache.org/jira/browse/SPARK-42901
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> [https://github.com/apache/spark/pull/40510] introduced `message StorageLevel`
> into `base.proto`, but if we try to import `base.proto` in `catalog.proto` to
> reuse `StorageLevel` in `message CacheTable` and run `build/sbt
> "connect-common/compile"` to compile, the following messages appear in the
> compile log:
>  
> {code:java}
> spark/connect/base.proto:23:1: File recursively imports itself: 
> spark/connect/base.proto -> spark/connect/commands.proto -> 
> spark/connect/relations.proto -> spark/connect/catalog.proto -> 
> spark/connect/base.proto
> spark/connect/catalog.proto:22:1: Import "spark/connect/base.proto" was not 
> found or had errors.
> spark/connect/catalog.proto:144:12: "spark.connect.DataType" seems to be 
> defined in "spark/connect/types.proto", which is not imported by 
> "spark/connect/catalog.proto".  To use it here, please add the necessary 
> import.
> spark/connect/catalog.proto:161:12: "spark.connect.DataType" seems to be 
> defined in "spark/connect/types.proto", which is not imported by 
> "spark/connect/catalog.proto".  To use it here, please add the necessary 
> import.
> spark/connect/relations.proto:25:1: Import "spark/connect/catalog.proto" was 
> not found or had errors.
> spark/connect/relations.proto:84:5: "Catalog" is not defined.
> spark/connect/commands.proto:22:1: Import "spark/connect/relations.proto" was 
> not found or had errors.
> spark/connect/commands.proto:63:3: "Relation" is not defined.
> spark/connect/commands.proto:81:3: "Relation" is not defined.
> spark/connect/commands.proto:142:3: "Relation" is not defined.
> spark/connect/base.proto:23:1: Import "spark/connect/commands.proto" was not 
> found or had errors.
> spark/connect/base.proto:25:1: Import "spark/connect/relations.proto" was not 
> found or had errors.
>  {code}
> So we should move `message StorageLevel` from `base.proto` to a separate file 
> to avoid this issue
>  
>  






[jira] [Updated] (SPARK-42901) Should move message StorageLevel from into a separate file

2023-03-22 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-42901:
-
Summary: Should move message StorageLevel from into a separate file  (was: 
Should move StorageLevel into a separate file)

> Should move message StorageLevel from into a separate file
> --
>
> Key: SPARK-42901
> URL: https://issues.apache.org/jira/browse/SPARK-42901
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> https://github.com/apache/spark/pull/40510 introduced `message StorageLevel`
> into `base.proto`, but if we try to import `base.proto` in `catalog.proto` to
> reuse `StorageLevel` in `message CacheTable` and run `build/sbt
> "connect-common/compile"` to compile, the following messages appear in the
> compile log:
>  
> {code:java}
> spark/connect/base.proto:23:1: File recursively imports itself: 
> spark/connect/base.proto -> spark/connect/commands.proto -> 
> spark/connect/relations.proto -> spark/connect/catalog.proto -> 
> spark/connect/base.proto
> spark/connect/catalog.proto:22:1: Import "spark/connect/base.proto" was not 
> found or had errors.
> spark/connect/catalog.proto:144:12: "spark.connect.DataType" seems to be 
> defined in "spark/connect/types.proto", which is not imported by 
> "spark/connect/catalog.proto".  To use it here, please add the necessary 
> import.
> spark/connect/catalog.proto:161:12: "spark.connect.DataType" seems to be 
> defined in "spark/connect/types.proto", which is not imported by 
> "spark/connect/catalog.proto".  To use it here, please add the necessary 
> import.
> spark/connect/relations.proto:25:1: Import "spark/connect/catalog.proto" was 
> not found or had errors.
> spark/connect/relations.proto:84:5: "Catalog" is not defined.
> spark/connect/commands.proto:22:1: Import "spark/connect/relations.proto" was 
> not found or had errors.
> spark/connect/commands.proto:63:3: "Relation" is not defined.
> spark/connect/commands.proto:81:3: "Relation" is not defined.
> spark/connect/commands.proto:142:3: "Relation" is not defined.
> spark/connect/base.proto:23:1: Import "spark/connect/commands.proto" was not 
> found or had errors.
> spark/connect/base.proto:25:1: Import "spark/connect/relations.proto" was not 
> found or had errors.
>  {code}
> So we should move `message StorageLevel` to a separate file to avoid
> this issue.
>  
>  






[jira] [Updated] (SPARK-42901) Should move message StorageLevel from into a separate file

2023-03-22 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-42901:
-
Description: 
[https://github.com/apache/spark/pull/40510] introduced `message StorageLevel`
into `base.proto`, but if we try to import `base.proto` in `catalog.proto` to
reuse `StorageLevel` in `message CacheTable` and run `build/sbt
"connect-common/compile"` to compile, the following messages appear in the
compile log:

 
{code:java}
spark/connect/base.proto:23:1: File recursively imports itself: 
spark/connect/base.proto -> spark/connect/commands.proto -> 
spark/connect/relations.proto -> spark/connect/catalog.proto -> 
spark/connect/base.proto
spark/connect/catalog.proto:22:1: Import "spark/connect/base.proto" was not 
found or had errors.
spark/connect/catalog.proto:144:12: "spark.connect.DataType" seems to be 
defined in "spark/connect/types.proto", which is not imported by 
"spark/connect/catalog.proto".  To use it here, please add the necessary import.
spark/connect/catalog.proto:161:12: "spark.connect.DataType" seems to be 
defined in "spark/connect/types.proto", which is not imported by 
"spark/connect/catalog.proto".  To use it here, please add the necessary import.
spark/connect/relations.proto:25:1: Import "spark/connect/catalog.proto" was 
not found or had errors.
spark/connect/relations.proto:84:5: "Catalog" is not defined.
spark/connect/commands.proto:22:1: Import "spark/connect/relations.proto" was 
not found or had errors.
spark/connect/commands.proto:63:3: "Relation" is not defined.
spark/connect/commands.proto:81:3: "Relation" is not defined.
spark/connect/commands.proto:142:3: "Relation" is not defined.
spark/connect/base.proto:23:1: Import "spark/connect/commands.proto" was not 
found or had errors.
spark/connect/base.proto:25:1: Import "spark/connect/relations.proto" was not 
found or had errors.
 {code}
So we should move `message StorageLevel` from `base.proto` to a separate file 
to avoid this issue

 

 

  was:
https://github.com/apache/spark/pull/40510 introduced `message StorageLevel`
into `base.proto`, but if we try to import `base.proto` in `catalog.proto` to
reuse `StorageLevel` in `message CacheTable` and run `build/sbt
"connect-common/compile"` to compile, the following messages appear in the
compile log:

 
{code:java}
spark/connect/base.proto:23:1: File recursively imports itself: 
spark/connect/base.proto -> spark/connect/commands.proto -> 
spark/connect/relations.proto -> spark/connect/catalog.proto -> 
spark/connect/base.proto
spark/connect/catalog.proto:22:1: Import "spark/connect/base.proto" was not 
found or had errors.
spark/connect/catalog.proto:144:12: "spark.connect.DataType" seems to be 
defined in "spark/connect/types.proto", which is not imported by 
"spark/connect/catalog.proto".  To use it here, please add the necessary import.
spark/connect/catalog.proto:161:12: "spark.connect.DataType" seems to be 
defined in "spark/connect/types.proto", which is not imported by 
"spark/connect/catalog.proto".  To use it here, please add the necessary import.
spark/connect/relations.proto:25:1: Import "spark/connect/catalog.proto" was 
not found or had errors.
spark/connect/relations.proto:84:5: "Catalog" is not defined.
spark/connect/commands.proto:22:1: Import "spark/connect/relations.proto" was 
not found or had errors.
spark/connect/commands.proto:63:3: "Relation" is not defined.
spark/connect/commands.proto:81:3: "Relation" is not defined.
spark/connect/commands.proto:142:3: "Relation" is not defined.
spark/connect/base.proto:23:1: Import "spark/connect/commands.proto" was not 
found or had errors.
spark/connect/base.proto:25:1: Import "spark/connect/relations.proto" was not 
found or had errors.
 {code}
So we should move `message StorageLevel` to a separate file to avoid
this issue.

 

 


> Should move message StorageLevel from into a separate file
> --
>
> Key: SPARK-42901
> URL: https://issues.apache.org/jira/browse/SPARK-42901
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> [https://github.com/apache/spark/pull/40510] introduced `message StorageLevel`
> into `base.proto`, but if we try to import `base.proto` in `catalog.proto` to
> reuse `StorageLevel` in `message CacheTable` and run `build/sbt
> "connect-common/compile"` to compile, the following messages appear in the
> compile log:
>  
> {code:java}
> spark/connect/base.proto:23:1: File recursively imports itself: 
> spark/connect/base.proto -> spark/connect/commands.proto -> 
> spark/connect/relations.proto -> spark/connect/catalog.proto -> 
> spark/connect/base.proto
> spark/connect/catalog.proto:22:1: Import "spark/connect/base.proto" was not 
> found or had errors.
> 

[jira] [Updated] (SPARK-42901) Should move message StorageLevel from `base.proto` to a separate file

2023-03-22 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-42901:
-
Summary: Should move message StorageLevel from `base.proto` to a separate 
file  (was: Should move message StorageLevel from into a separate file)

> Should move message StorageLevel from `base.proto` to a separate file
> -
>
> Key: SPARK-42901
> URL: https://issues.apache.org/jira/browse/SPARK-42901
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> [https://github.com/apache/spark/pull/40510] introduced `message StorageLevel`
> into `base.proto`, but if we try to import `base.proto` in `catalog.proto` to
> reuse `StorageLevel` in `message CacheTable` and run `build/sbt
> "connect-common/compile"` to compile, the following messages appear in the
> compile log:
>  
> {code:java}
> spark/connect/base.proto:23:1: File recursively imports itself: 
> spark/connect/base.proto -> spark/connect/commands.proto -> 
> spark/connect/relations.proto -> spark/connect/catalog.proto -> 
> spark/connect/base.proto
> spark/connect/catalog.proto:22:1: Import "spark/connect/base.proto" was not 
> found or had errors.
> spark/connect/catalog.proto:144:12: "spark.connect.DataType" seems to be 
> defined in "spark/connect/types.proto", which is not imported by 
> "spark/connect/catalog.proto".  To use it here, please add the necessary 
> import.
> spark/connect/catalog.proto:161:12: "spark.connect.DataType" seems to be 
> defined in "spark/connect/types.proto", which is not imported by 
> "spark/connect/catalog.proto".  To use it here, please add the necessary 
> import.
> spark/connect/relations.proto:25:1: Import "spark/connect/catalog.proto" was 
> not found or had errors.
> spark/connect/relations.proto:84:5: "Catalog" is not defined.
> spark/connect/commands.proto:22:1: Import "spark/connect/relations.proto" was 
> not found or had errors.
> spark/connect/commands.proto:63:3: "Relation" is not defined.
> spark/connect/commands.proto:81:3: "Relation" is not defined.
> spark/connect/commands.proto:142:3: "Relation" is not defined.
> spark/connect/base.proto:23:1: Import "spark/connect/commands.proto" was not 
> found or had errors.
> spark/connect/base.proto:25:1: Import "spark/connect/relations.proto" was not 
> found or had errors.
>  {code}
> So we should move `message StorageLevel` from `base.proto` to a separate file 
> to avoid this issue
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42901) Should move StorageLevel into a separate file

2023-03-22 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-42901:
-
Summary: Should move StorageLevel into a separate file  (was: Move 
StorageLevel into a separate file to avoid potential file recursively imports)

> Should move StorageLevel into a separate file
> -
>
> Key: SPARK-42901
> URL: https://issues.apache.org/jira/browse/SPARK-42901
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> https://github.com/apache/spark/pull/40510 introduced `message StorageLevel` 
> to `base.proto`, but if we try to import `base.proto` in `catalog.proto` to 
> reuse `StorageLevel` in `message CacheTable` and run `build/sbt 
> "connect-common/compile"` to compile, the following messages appear in the 
> compile log:
>  
> {code:java}
> spark/connect/base.proto:23:1: File recursively imports itself: 
> spark/connect/base.proto -> spark/connect/commands.proto -> 
> spark/connect/relations.proto -> spark/connect/catalog.proto -> 
> spark/connect/base.proto
> spark/connect/catalog.proto:22:1: Import "spark/connect/base.proto" was not 
> found or had errors.
> spark/connect/catalog.proto:144:12: "spark.connect.DataType" seems to be 
> defined in "spark/connect/types.proto", which is not imported by 
> "spark/connect/catalog.proto".  To use it here, please add the necessary 
> import.
> spark/connect/catalog.proto:161:12: "spark.connect.DataType" seems to be 
> defined in "spark/connect/types.proto", which is not imported by 
> "spark/connect/catalog.proto".  To use it here, please add the necessary 
> import.
> spark/connect/relations.proto:25:1: Import "spark/connect/catalog.proto" was 
> not found or had errors.
> spark/connect/relations.proto:84:5: "Catalog" is not defined.
> spark/connect/commands.proto:22:1: Import "spark/connect/relations.proto" was 
> not found or had errors.
> spark/connect/commands.proto:63:3: "Relation" is not defined.
> spark/connect/commands.proto:81:3: "Relation" is not defined.
> spark/connect/commands.proto:142:3: "Relation" is not defined.
> spark/connect/base.proto:23:1: Import "spark/connect/commands.proto" was not 
> found or had errors.
> spark/connect/base.proto:25:1: Import "spark/connect/relations.proto" was not 
> found or had errors.
>  {code}
> So we should move `message StorageLevel` from `base.proto` to a separate file 
> to avoid this issue.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42901) Move StorageLevel into a separate file to avoid potential file recursively imports

2023-03-22 Thread Yang Jie (Jira)
Yang Jie created SPARK-42901:


 Summary: Move StorageLevel into a separate file to avoid potential 
file recursively imports
 Key: SPARK-42901
 URL: https://issues.apache.org/jira/browse/SPARK-42901
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.4.0
Reporter: Yang Jie


https://github.com/apache/spark/pull/40510 introduced `message StorageLevel` to 
`base.proto`, but if we try to import `base.proto` in `catalog.proto` to reuse 
`StorageLevel` in `message CacheTable` and run `build/sbt 
"connect-common/compile"` to compile, the following messages appear in the 
compile log:

 
{code:java}
spark/connect/base.proto:23:1: File recursively imports itself: 
spark/connect/base.proto -> spark/connect/commands.proto -> 
spark/connect/relations.proto -> spark/connect/catalog.proto -> 
spark/connect/base.proto
spark/connect/catalog.proto:22:1: Import "spark/connect/base.proto" was not 
found or had errors.
spark/connect/catalog.proto:144:12: "spark.connect.DataType" seems to be 
defined in "spark/connect/types.proto", which is not imported by 
"spark/connect/catalog.proto".  To use it here, please add the necessary import.
spark/connect/catalog.proto:161:12: "spark.connect.DataType" seems to be 
defined in "spark/connect/types.proto", which is not imported by 
"spark/connect/catalog.proto".  To use it here, please add the necessary import.
spark/connect/relations.proto:25:1: Import "spark/connect/catalog.proto" was 
not found or had errors.
spark/connect/relations.proto:84:5: "Catalog" is not defined.
spark/connect/commands.proto:22:1: Import "spark/connect/relations.proto" was 
not found or had errors.
spark/connect/commands.proto:63:3: "Relation" is not defined.
spark/connect/commands.proto:81:3: "Relation" is not defined.
spark/connect/commands.proto:142:3: "Relation" is not defined.
spark/connect/base.proto:23:1: Import "spark/connect/commands.proto" was not 
found or had errors.
spark/connect/base.proto:25:1: Import "spark/connect/relations.proto" was not 
found or had errors.
 {code}
So we should move `message StorageLevel` from `base.proto` to a separate file to 
avoid this issue.

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42895) ValueError when invoking any session operations on a stopped Spark session

2023-03-22 Thread Rui Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703815#comment-17703815
 ] 

Rui Wang commented on SPARK-42895:
--

Of course, though it probably needs a better error message locally.

> ValueError when invoking any session operations on a stopped Spark session
> --
>
> Key: SPARK-42895
> URL: https://issues.apache.org/jira/browse/SPARK-42895
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Allison Wang
>Priority: Major
>
> If a remote Spark session is stopped, trying to invoke any session operations 
> will result in a ValueError. For example:
>  
> {code:java}
> spark.stop()
> spark.sql("select 1")
> ValueError: Cannot invoke RPC: Channel closed!
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   ...
>     return e.code() == grpc.StatusCode.UNAVAILABLE
> AttributeError: 'ValueError' object has no attribute 'code'{code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42748) Server-side Artifact Management

2023-03-22 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-42748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-42748.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

> Server-side Artifact Management
> ---
>
> Key: SPARK-42748
> URL: https://issues.apache.org/jira/browse/SPARK-42748
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Venkata Sai Akhil Gudesa
>Priority: Major
> Fix For: 3.5.0
>
>
> https://issues.apache.org/jira/browse/SPARK-42653 implements the client-side 
> transfer of artifacts to the server, but currently the server does not 
> process these requests.
>  
> We need to implement a server-side management mechanism to handle storage of 
> these artifacts on the driver as well as perform further processing (such as 
> adding jars and moving class files to the right directories).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42895) ValueError when invoking any session operations on a stopped Spark session

2023-03-22 Thread Rui Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703814#comment-17703814
 ] 

Rui Wang commented on SPARK-42895:
--

In fact, spark.stop() does not stop the remote Spark session; it only closes the 
local session and the gRPC channel. This error is expected.
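A minimal sketch (assuming the Spark Connect Python client; the connect URL below 
is illustrative) of how a caller recovers after stop():

{code:python}
# Minimal sketch, assuming the Spark Connect Python client; the remote URL is
# illustrative. After stop() the local session and gRPC channel are gone, so a
# new session has to be created before issuing further commands.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
spark.stop()  # closes only the local session and the gRPC channel

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
spark.sql("select 1").show()  # works again on the fresh session
{code}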

> ValueError when invoking any session operations on a stopped Spark session
> --
>
> Key: SPARK-42895
> URL: https://issues.apache.org/jira/browse/SPARK-42895
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Allison Wang
>Priority: Major
>
> If a remote Spark session is stopped, trying to invoke any session operations 
> will result in a ValueError. For example:
>  
> {code:java}
> spark.stop()
> spark.sql("select 1")
> ValueError: Cannot invoke RPC: Channel closed!
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   ...
>     return e.code() == grpc.StatusCode.UNAVAILABLE
> AttributeError: 'ValueError' object has no attribute 'code'{code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42900) Fix createDataFrame to respect both type inference and column names.

2023-03-22 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-42900:
-

 Summary: Fix createDataFrame to respect both type inference and 
column names.
 Key: SPARK-42900
 URL: https://issues.apache.org/jira/browse/SPARK-42900
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Takuya Ueshin






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42899) DataFrame.to(schema) fails when it contains non-nullable nested field in nullable field

2023-03-22 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-42899:
--
Summary: DataFrame.to(schema) fails when it contains non-nullable nested 
field in nullable field  (was: DataFrame.to(schema) fails with the schema of 
itself.)

> DataFrame.to(schema) fails when it contains non-nullable nested field in 
> nullable field
> ---
>
> Key: SPARK-42899
> URL: https://issues.apache.org/jira/browse/SPARK-42899
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> {{DataFrame.to(schema)}} fails when it contains a non-nullable nested field in 
> a nullable field:
> {code:scala}
> scala> val df = spark.sql("VALUES (1, STRUCT(1 as i)), (NULL, NULL) as t(a, 
> b)")
> df: org.apache.spark.sql.DataFrame = [a: int, b: struct]
> scala> df.printSchema()
> root
>  |-- a: integer (nullable = true)
>  |-- b: struct (nullable = true)
>  ||-- i: integer (nullable = false)
> scala> df.to(df.schema)
> org.apache.spark.sql.AnalysisException: [NULLABLE_COLUMN_OR_FIELD] Column or 
> field `b`.`i` is nullable while it's required to be non-nullable.
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42899) DataFrame.to(schema) fails with the schema of itself.

2023-03-22 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-42899:
--
Description: 
{{DataFrame.to(schema)}} fails when it contains a non-nullable nested field in a 
nullable field:
{code:scala}
scala> val df = spark.sql("VALUES (1, STRUCT(1 as i)), (NULL, NULL) as t(a, b)")
df: org.apache.spark.sql.DataFrame = [a: int, b: struct]
scala> df.printSchema()
root
 |-- a: integer (nullable = true)
 |-- b: struct (nullable = true)
 ||-- i: integer (nullable = false)

scala> df.to(df.schema)
org.apache.spark.sql.AnalysisException: [NULLABLE_COLUMN_OR_FIELD] Column or 
field `b`.`i` is nullable while it's required to be non-nullable.
{code}

  was:
{{DataFrame.to(schema)}} fails with the schema of itself, when it contains 
non-nullable nested field in nullable field:

{code:scala}
scala> val df = spark.sql("VALUES (1, STRUCT(1 as i)), (NULL, NULL) as t(a, b)")
df: org.apache.spark.sql.DataFrame = [a: int, b: struct]
scala> df.printSchema()
root
 |-- a: integer (nullable = true)
 |-- b: struct (nullable = true)
 ||-- i: integer (nullable = false)

scala> df.to(df.schema)
org.apache.spark.sql.AnalysisException: [NULLABLE_COLUMN_OR_FIELD] Column or 
field `b`.`i` is nullable while it's required to be non-nullable.
{code}



> DataFrame.to(schema) fails with the schema of itself.
> -
>
> Key: SPARK-42899
> URL: https://issues.apache.org/jira/browse/SPARK-42899
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> {{DataFrame.to(schema)}} fails when it contains a non-nullable nested field in 
> a nullable field:
> {code:scala}
> scala> val df = spark.sql("VALUES (1, STRUCT(1 as i)), (NULL, NULL) as t(a, 
> b)")
> df: org.apache.spark.sql.DataFrame = [a: int, b: struct]
> scala> df.printSchema()
> root
>  |-- a: integer (nullable = true)
>  |-- b: struct (nullable = true)
>  ||-- i: integer (nullable = false)
> scala> df.to(df.schema)
> org.apache.spark.sql.AnalysisException: [NULLABLE_COLUMN_OR_FIELD] Column or 
> field `b`.`i` is nullable while it's required to be non-nullable.
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42899) DataFrame.to(schema) fails with the schema of itself.

2023-03-22 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-42899:
-

 Summary: DataFrame.to(schema) fails with the schema of itself.
 Key: SPARK-42899
 URL: https://issues.apache.org/jira/browse/SPARK-42899
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Takuya Ueshin


{{DataFrame.to(schema)}} fails with the schema of itself, when it contains a 
non-nullable nested field in a nullable field:

{code:scala}
scala> val df = spark.sql("VALUES (1, STRUCT(1 as i)), (NULL, NULL) as t(a, b)")
df: org.apache.spark.sql.DataFrame = [a: int, b: struct]
scala> df.printSchema()
root
 |-- a: integer (nullable = true)
 |-- b: struct (nullable = true)
 ||-- i: integer (nullable = false)

scala> df.to(df.schema)
org.apache.spark.sql.AnalysisException: [NULLABLE_COLUMN_OR_FIELD] Column or 
field `b`.`i` is nullable while it's required to be non-nullable.
{code}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42892) Move sameType and relevant methods out of DataType

2023-03-22 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-42892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-42892.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

> Move sameType and relevant methods out of DataType
> --
>
> Key: SPARK-42892
> URL: https://issues.apache.org/jira/browse/SPARK-42892
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42832) Remove repartition if it is the child of LocalLimit

2023-03-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-42832.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40462
[https://github.com/apache/spark/pull/40462]

> Remove repartition if it is the child of LocalLimit
> ---
>
> Key: SPARK-42832
> URL: https://issues.apache.org/jira/browse/SPARK-42832
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42832) Remove repartition if it is the child of LocalLimit

2023-03-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-42832:
-

Assignee: Yuming Wang

> Remove repartition if it is the child of LocalLimit
> ---
>
> Key: SPARK-42832
> URL: https://issues.apache.org/jira/browse/SPARK-42832
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42894) Implement cache, persist, unpersist, and storageLevel

2023-03-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-42894:
-

Assignee: Yang Jie

> Implement cache, persist, unpersist, and storageLevel
> -
>
> Key: SPARK-42894
> URL: https://issues.apache.org/jira/browse/SPARK-42894
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42894) Implement cache, persist, unpersist, and storageLevel

2023-03-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-42894.
---
Fix Version/s: 3.4.1
   Resolution: Fixed

Issue resolved by pull request 40516
[https://github.com/apache/spark/pull/40516]

> Implement cache, persist, unpersist, and storageLevel
> -
>
> Key: SPARK-42894
> URL: https://issues.apache.org/jira/browse/SPARK-42894
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.4.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42898) Cast from string to date and date to string say timezone is needed, but it is not used

2023-03-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42898:


Assignee: (was: Apache Spark)

> Cast from string to date and date to string say timezone is needed, but it is 
> not used
> --
>
> Key: SPARK-42898
> URL: https://issues.apache.org/jira/browse/SPARK-42898
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Robert Joseph Evans
>Priority: Major
>
> This is really minor, but SPARK-35581 removed the need for a timezone when 
> casting from a `StringType` to a `DateType`; however, the patch didn't update 
> the `needsTimeZone` function to indicate that it was no longer required.
> Currently, casting from a DateType to a StringType also says that it needs the 
> timezone, but it only uses the `DateFormatter` with its default parameters, 
> which do not use the time zone at all.
> I think this can be fixed with just a two-line change.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42898) Cast from string to date and date to string say timezone is needed, but it is not used

2023-03-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703720#comment-17703720
 ] 

Apache Spark commented on SPARK-42898:
--

User 'revans2' has created a pull request for this issue:
https://github.com/apache/spark/pull/40524

> Cast from string to date and date to string say timezone is needed, but it is 
> not used
> --
>
> Key: SPARK-42898
> URL: https://issues.apache.org/jira/browse/SPARK-42898
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Robert Joseph Evans
>Priority: Major
>
> This is really minor, but SPARK-35581 removed the need for a timezone when 
> casting from a `StringType` to a `DateType`; however, the patch didn't update 
> the `needsTimeZone` function to indicate that it was no longer required.
> Currently, casting from a DateType to a StringType also says that it needs the 
> timezone, but it only uses the `DateFormatter` with its default parameters, 
> which do not use the time zone at all.
> I think this can be fixed with just a two-line change.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42898) Cast from string to date and date to string say timezone is needed, but it is not used

2023-03-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42898:


Assignee: Apache Spark

> Cast from string to date and date to string say timezone is needed, but it is 
> not used
> --
>
> Key: SPARK-42898
> URL: https://issues.apache.org/jira/browse/SPARK-42898
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Robert Joseph Evans
>Assignee: Apache Spark
>Priority: Major
>
> This is really minor, but SPARK-35581 removed the need for a timezone when 
> casting from a `StringType` to a `DateType`; however, the patch didn't update 
> the `needsTimeZone` function to indicate that it was no longer required.
> Currently, casting from a DateType to a StringType also says that it needs the 
> timezone, but it only uses the `DateFormatter` with its default parameters, 
> which do not use the time zone at all.
> I think this can be fixed with just a two-line change.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42898) Cast from string to date and date to string say timezone is needed, but it is not used

2023-03-22 Thread Robert Joseph Evans (Jira)
Robert Joseph Evans created SPARK-42898:
---

 Summary: Cast from string to date and date to string say timezone 
is needed, but it is not used
 Key: SPARK-42898
 URL: https://issues.apache.org/jira/browse/SPARK-42898
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Robert Joseph Evans


This is really minor, but SPARK-35581 removed the need for a timezone when 
casting from a `StringType` to a `DateType`; however, the patch didn't update 
the `needsTimeZone` function to indicate that it was no longer required.

Currently, casting from a DateType to a StringType also says that it needs the 
timezone, but it only uses the `DateFormatter` with its default parameters, 
which do not use the time zone at all.

I think this can be fixed with just a two-line change.
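A minimal PySpark sketch (illustrative values only) of why the flagged timezone 
goes unused: the string-to-date and date-to-string casts return the same result 
regardless of the session time zone.

{code:python}
# Minimal sketch, illustrative values only: both casts are insensitive to the
# session time zone, which is why needsTimeZone should not require one here.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
for tz in ("UTC", "America/Los_Angeles"):
    spark.conf.set("spark.sql.session.timeZone", tz)
    spark.sql(
        "SELECT CAST('2023-03-22' AS DATE) AS to_date, "
        "CAST(DATE'2023-03-22' AS STRING) AS to_string"
    ).show()
# Both iterations print to_date = 2023-03-22 and to_string = 2023-03-22.
{code}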



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42815) Subexpression elimination support shortcut expression

2023-03-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-42815:
---

Assignee: XiDuo You

> Subexpression elimination support shortcut expression
> -
>
> Key: SPARK-42815
> URL: https://issues.apache.org/jira/browse/SPARK-42815
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Minor
> Fix For: 3.5.0
>
>
> The subexpression may not need to be evaluated even if it appears more than 
> once. E.g., in {{if(or(a, and(b, b)))}}, the expression {{b}} would be skipped 
> if {{a}} is true.
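A minimal PySpark sketch (hypothetical boolean columns {{a}} and {{b}}) of the 
shape of expression involved: the repeated subexpression sits behind a 
short-circuiting {{or}}, so it is only needed when {{a}} is false.

{code:python}
# Minimal sketch with hypothetical columns a and b: when a is true, the repeated
# subexpression (b AND b) never needs to be evaluated.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(True, None), (False, True)], "a: boolean, b: boolean")
df.select(F.expr("IF(a OR (b AND b), 'hit', 'miss')").alias("result")).show()
{code}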



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42897) Avoid evaluate more than once for the variables from the left side in the FullOuter SMJ condition

2023-03-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703686#comment-17703686
 ] 

Apache Spark commented on SPARK-42897:
--

User 'wankunde' has created a pull request for this issue:
https://github.com/apache/spark/pull/40523

> Avoid evaluate more than once for the variables from the left side in the 
> FullOuter SMJ condition
> -
>
> Key: SPARK-42897
> URL: https://issues.apache.org/jira/browse/SPARK-42897
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wan Kun
>Priority: Minor
>
> Codegen issue for FullOuter SMJ,  for example
> {code}
> val df1 = spark.range(5).select($"id".as("k1"))
> val df2 = spark.range(10).select($"id".as("k2"))
> df1.join(df2.hint("SHUFFLE_MERGE"),
> $"k1" === $"k2" % 3 && $"k1" + 3 =!= $"k2" && $"k1" + 5 =!= $"k2", 
> "full_outer")
> {code}
> Both conjuncts of the join condition *$"k1" + 3 =!= $"k2" && $"k1" + 5 =!= 
> $"k2"* evaluate the variable *k1*, which causes codegen to fail.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42897) Avoid evaluate more than once for the variables from the left side in the FullOuter SMJ condition

2023-03-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42897:


Assignee: Apache Spark

> Avoid evaluate more than once for the variables from the left side in the 
> FullOuter SMJ condition
> -
>
> Key: SPARK-42897
> URL: https://issues.apache.org/jira/browse/SPARK-42897
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wan Kun
>Assignee: Apache Spark
>Priority: Minor
>
> Codegen issue for FullOuter SMJ,  for example
> {code}
> val df1 = spark.range(5).select($"id".as("k1"))
> val df2 = spark.range(10).select($"id".as("k2"))
> df1.join(df2.hint("SHUFFLE_MERGE"),
> $"k1" === $"k2" % 3 && $"k1" + 3 =!= $"k2" && $"k1" + 5 =!= $"k2", 
> "full_outer")
> {code}
> Both conjuncts of the join condition *$"k1" + 3 =!= $"k2" && $"k1" + 5 =!= 
> $"k2"* evaluate the variable *k1*, which causes codegen to fail.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42897) Avoid evaluate more than once for the variables from the left side in the FullOuter SMJ condition

2023-03-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42897:


Assignee: (was: Apache Spark)

> Avoid evaluate more than once for the variables from the left side in the 
> FullOuter SMJ condition
> -
>
> Key: SPARK-42897
> URL: https://issues.apache.org/jira/browse/SPARK-42897
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wan Kun
>Priority: Minor
>
> Codegen issue for FullOuter SMJ,  for example
> {code}
> val df1 = spark.range(5).select($"id".as("k1"))
> val df2 = spark.range(10).select($"id".as("k2"))
> df1.join(df2.hint("SHUFFLE_MERGE"),
> $"k1" === $"k2" % 3 && $"k1" + 3 =!= $"k2" && $"k1" + 5 =!= $"k2", 
> "full_outer")
> {code}
> Both conjuncts of the join condition *$"k1" + 3 =!= $"k2" && $"k1" + 5 =!= 
> $"k2"* evaluate the variable *k1*, which causes codegen to fail.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42815) Subexpression elimination support shortcut expression

2023-03-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-42815.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40446
[https://github.com/apache/spark/pull/40446]

> Subexpression elimination support shortcut expression
> -
>
> Key: SPARK-42815
> URL: https://issues.apache.org/jira/browse/SPARK-42815
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Priority: Minor
> Fix For: 3.5.0
>
>
> The subexpression may not need to eval even if it appears more than once.
> e.g., {{{}if(or(a, and(b, b))){}}}, the expression {{b}} would be skipped if 
> {{a}} is true.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42897) Avoid evaluate more than once for the variables from the left side in the FullOuter SMJ condition

2023-03-22 Thread Wan Kun (Jira)
Wan Kun created SPARK-42897:
---

 Summary: Avoid evaluate more than once for the variables from the 
left side in the FullOuter SMJ condition
 Key: SPARK-42897
 URL: https://issues.apache.org/jira/browse/SPARK-42897
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Wan Kun


Codegen issue for FullOuter SMJ,  for example
{code}
val df1 = spark.range(5).select($"id".as("k1"))
val df2 = spark.range(10).select($"id".as("k2"))
df1.join(df2.hint("SHUFFLE_MERGE"),
$"k1" === $"k2" % 3 && $"k1" + 3 =!= $"k2" && $"k1" + 5 =!= $"k2", 
"full_outer")
{code}
Both conjuncts of the join condition *$"k1" + 3 =!= $"k2" && $"k1" + 5 =!= $"k2"* 
evaluate the variable *k1*, which causes codegen to fail.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42101) Wrap InMemoryTableScanExec with QueryStage

2023-03-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703677#comment-17703677
 ] 

Apache Spark commented on SPARK-42101:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/40522

> Wrap InMemoryTableScanExec with QueryStage
> --
>
> Key: SPARK-42101
> URL: https://issues.apache.org/jira/browse/SPARK-42101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.5.0
>
>
> The first access to the cached plan with AQE enabled is tricky. Currently, we 
> cannot preserve its output partitioning and ordering.
> The whole query plan also misses many optimizations in the AQE framework. 
> Wrapping InMemoryTableScanExec in a query stage can resolve all of these 
> issues.
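A minimal PySpark sketch (hypothetical data) of the scenario described above: 
reads of a cached, shuffled DataFrame under AQE go through InMemoryTableScanExec.

{code:python}
# Minimal sketch with hypothetical data: reads of the cached, shuffled DataFrame
# under AQE go through InMemoryTableScanExec.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)
cached = spark.range(1000).repartition("id").cache()
cached.count()  # materializes the cache
cached.groupBy("id").count().explain()  # scans the cached plan under AQE
{code}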



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42896) Make `mapInPandas` / `mapInArrow` support barrier mode execution

2023-03-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703637#comment-17703637
 ] 

Apache Spark commented on SPARK-42896:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/40520

> Make `mapInPandas` / `mapInArrow` support barrier mode execution
> ---
>
> Key: SPARK-42896
> URL: https://issues.apache.org/jira/browse/SPARK-42896
> Project: Spark
>  Issue Type: New Feature
>  Components: Pandas API on Spark, PySpark, SQL
>Affects Versions: 3.5.0
>Reporter: Weichen Xu
>Priority: Major
>
> Make `mapInPandas` / `mapInArrow` support barrier mode execution



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42896) Make `mapInPandas` / `mapInArrow` support barrier mode execution

2023-03-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42896:


Assignee: (was: Apache Spark)

> Make `mapInPandas` / `mapInArrow` support barrier mode execution
> ---
>
> Key: SPARK-42896
> URL: https://issues.apache.org/jira/browse/SPARK-42896
> Project: Spark
>  Issue Type: New Feature
>  Components: Pandas API on Spark, PySpark, SQL
>Affects Versions: 3.5.0
>Reporter: Weichen Xu
>Priority: Major
>
> Make `mapInPandas` / `mapInArrow` support barrier mode execution



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42896) Make `mapInPandas` / `mapInArrow` support barrier mode execution

2023-03-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42896:


Assignee: Apache Spark

> Make `mapInPandas` / `mapInArrow` support barrier mode execution
> ---
>
> Key: SPARK-42896
> URL: https://issues.apache.org/jira/browse/SPARK-42896
> Project: Spark
>  Issue Type: New Feature
>  Components: Pandas API on Spark, PySpark, SQL
>Affects Versions: 3.5.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>Priority: Major
>
> Make `mapInPandas` / `mapInArrow` support barrier mode execution



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42896) Make `mapInPandas` / `mapInArrow` support barrier mode execution

2023-03-22 Thread Weichen Xu (Jira)
Weichen Xu created SPARK-42896:
--

 Summary: Make `mapInPandas` / `mapInArrow` support barrier mode 
execution
 Key: SPARK-42896
 URL: https://issues.apache.org/jira/browse/SPARK-42896
 Project: Spark
  Issue Type: New Feature
  Components: Pandas API on Spark, PySpark, SQL
Affects Versions: 3.5.0
Reporter: Weichen Xu


Make `mapInPandas` / `mapInArrow` support barrier mode execution
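A minimal sketch of how this might look from the user side; the {{barrier}} flag 
is the proposed addition (not an existing parameter at the time of filing), and 
the data and function are hypothetical.

{code:python}
# Minimal sketch, hypothetical data and function; barrier=True is the proposed
# flag that would run the pandas function in barrier execution mode.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(8)

def add_one(batches):
    # batches is an iterator of pandas DataFrames
    for pdf in batches:
        yield pdf.assign(id=pdf["id"] + 1)

df.mapInPandas(add_one, schema="id long", barrier=True).show()
{code}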



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42864) Review and fix issues in MLlib API docs

2023-03-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703615#comment-17703615
 ] 

Apache Spark commented on SPARK-42864:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40519

> Review and fix issues in MLlib API docs
> ---
>
> Key: SPARK-42864
> URL: https://issues.apache.org/jira/browse/SPARK-42864
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42864) Review and fix issues in MLlib API docs

2023-03-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703614#comment-17703614
 ] 

Apache Spark commented on SPARK-42864:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40519

> Review and fix issues in MLlib API docs
> ---
>
> Key: SPARK-42864
> URL: https://issues.apache.org/jira/browse/SPARK-42864
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42895) ValueError when invoking any session operations on a stopped Spark session

2023-03-22 Thread Allison Wang (Jira)
Allison Wang created SPARK-42895:


 Summary: ValueError when invoking any session operations on a 
stopped Spark session
 Key: SPARK-42895
 URL: https://issues.apache.org/jira/browse/SPARK-42895
 Project: Spark
  Issue Type: Bug
  Components: Connect
Affects Versions: 3.5.0
Reporter: Allison Wang


If a remote Spark session is stopped, trying to invoke any session operations 
will result in a ValueError. For example:

 
{code:java}
spark.stop()
spark.sql("select 1")

ValueError: Cannot invoke RPC: Channel closed!
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  ...
    return e.code() == grpc.StatusCode.UNAVAILABLE
AttributeError: 'ValueError' object has no attribute 'code'{code}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42889) Implement cache, persist, unpersist, and storageLevel

2023-03-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703541#comment-17703541
 ] 

Apache Spark commented on SPARK-42889:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40518

> Implement cache, persist, unpersist, and storageLevel
> -
>
> Key: SPARK-42889
> URL: https://issues.apache.org/jira/browse/SPARK-42889
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42874) Enable new golden file test framework for analysis for all input files

2023-03-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42874.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40496
[https://github.com/apache/spark/pull/40496]

> Enable new golden file test framework for analysis for all input files
> --
>
> Key: SPARK-42874
> URL: https://issues.apache.org/jira/browse/SPARK-42874
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42874) Enable new golden file test framework for analysis for all input files

2023-03-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-42874:


Assignee: Daniel

> Enable new golden file test framework for analysis for all input files
> --
>
> Key: SPARK-42874
> URL: https://issues.apache.org/jira/browse/SPARK-42874
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42893) Block Arrow-optimized Python UDFs

2023-03-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-42893:


Assignee: Xinrong Meng

> Block Arrow-optimized Python UDFs
> -
>
> Key: SPARK-42893
> URL: https://issues.apache.org/jira/browse/SPARK-42893
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> Considering the upcoming improvements to the result inconsistencies between 
> traditional pickled Python UDFs and Arrow-optimized Python UDFs, we'd better 
> block the feature; otherwise, users who try it out should expect behavior 
> changes in the next release.
> In addition, since the Spark Connect Python Client (SCPC) has been introduced 
> in Spark 3.4, we'd better ensure the feature is ready in both vanilla PySpark 
> and SCPC at the same time for compatibility.
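A minimal sketch (config name as of Spark 3.4; illustration only) of the feature 
being blocked, i.e. opting a Python UDF into Arrow optimization:

{code:python}
# Minimal sketch, illustration only: the Arrow-optimized Python UDF path is
# opted into via this config (or a per-UDF useArrow argument).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", "true")

@F.udf(returnType="int")
def plus_one(x):
    return x + 1

spark.range(3).select(plus_one("id").alias("id_plus_one")).show()
{code}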



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42893) Block Arrow-optimized Python UDFs

2023-03-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42893.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 40513
[https://github.com/apache/spark/pull/40513]

> Block Arrow-optimized Python UDFs
> -
>
> Key: SPARK-42893
> URL: https://issues.apache.org/jira/browse/SPARK-42893
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Considering the upcoming improvements to the result inconsistencies between 
> traditional pickled Python UDFs and Arrow-optimized Python UDFs, we'd better 
> block the feature; otherwise, users who try it out should expect behavior 
> changes in the next release.
> In addition, since the Spark Connect Python Client (SCPC) has been introduced 
> in Spark 3.4, we'd better ensure the feature is ready in both vanilla PySpark 
> and SCPC at the same time for compatibility.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41233) High-order function: array_prepend

2023-03-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41233:


Assignee: Takuya Ueshin

> High-order function: array_prepend
> --
>
> Key: SPARK-41233
> URL: https://issues.apache.org/jira/browse/SPARK-41233
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.5.0
>
>
> Refer to 
> https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_prepend.html
> 1. About the data type validation:
> In Snowflake’s array_append, array_prepend and array_insert functions, the 
> element data type does not need to match the data type of the existing 
> elements in the array.
> In Spark, we want to apply the same data type validation as array_remove.
> 2. About the NULL handling:
> Currently, SparkSQL, SnowSQL and PostgreSQL deal with NULL values in 
> different ways.
> The existing functions array_contains, array_position and array_remove in 
> SparkSQL handle NULL like this: if the input array and/or element is NULL, 
> they return NULL. However, array_prepend should break from this behavior.
> We should implement the NULL handling in array_prepend as follows (see the 
> sketch after this description):
> 2.1. If the array is NULL, return NULL;
> 2.2. If the array is not NULL and the element is NULL, prepend the NULL value 
> to the array.
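A minimal PySpark sketch (array_prepend as proposed here; the data is 
hypothetical) of the intended NULL handling:

{code:python}
# Minimal sketch, hypothetical data: a NULL array yields NULL, while a NULL
# element is prepended into a non-NULL array.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([([1, 2],), (None,)], "arr: array<int>")
df.select(
    F.array_prepend("arr", F.lit(0)).alias("prepend_zero"),
    F.array_prepend("arr", F.lit(None).cast("int")).alias("prepend_null"),
).show()
# Expected: row 1 -> [0, 1, 2] and [null, 1, 2]; row 2 -> null and null.
{code}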



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42508) Extract the common .ml classes to `mllib-common`

2023-03-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703503#comment-17703503
 ] 

Apache Spark commented on SPARK-42508:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40517

> Extract the common .ml classes to `mllib-common`
> 
>
> Key: SPARK-42508
> URL: https://issues.apache.org/jira/browse/SPARK-42508
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, ML
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40082) DAGScheduler may not schedule new stage in condition of push-based shuffle enabled

2023-03-22 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-40082.
-
Fix Version/s: 3.5.0
 Assignee: Fencheng Mei
   Resolution: Fixed

> DAGScheduler may not schedule new stage in condition of push-based shuffle 
> enabled
> --
>
> Key: SPARK-40082
> URL: https://issues.apache.org/jira/browse/SPARK-40082
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.1.1
>Reporter: Penglei Shi
>Assignee: Fencheng Mei
>Priority: Major
> Fix For: 3.5.0
>
> Attachments: missParentStages.png, shuffleMergeFinalized.png, 
> submitMissingTasks.png
>
>
> When push-based shuffle is enabled and speculative tasks exist, a 
> shuffleMapStage is resubmitted once a fetchFailed occurs; its parent stages 
> are resubmitted first, which takes some time to compute. Before the 
> shuffleMapStage is resubmitted, all of its speculative tasks succeed and 
> register their map output, but speculative-task success events cannot trigger 
> shuffleMergeFinalized because this stage has been removed from runningStages.
> Then this stage is resubmitted, but the speculative tasks have already 
> registered map output and there are no missing tasks to compute, so 
> resubmitting the stage also does not trigger shuffleMergeFinalized. 
> Eventually this stage's _shuffleMergedFinalized stays false.
> Then AQE submits the next stages, which depend on the shuffleMapStage that 
> hit fetchFailed. In getMissingParentStages, this stage is marked as missing 
> and resubmitted, but the next stages are only added to waitingStages after 
> this stage finishes, so they are not submitted even though the resubmission 
> has finished.
> I have only hit this a few times in my production environment and it is 
> difficult to reproduce.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org