[jira] [Updated] (SPARK-42905) pyspark.ml.stat.Correlation - Spearman correlation method gives incorrect and inconsistent results for the same DataFrame when it has a huge number of ties.
[ https://issues.apache.org/jira/browse/SPARK-42905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dronzer updated SPARK-42905: Labels: correctness (was: )
> pyspark.ml.stat.Correlation - Spearman correlation method gives incorrect and inconsistent results for the same DataFrame when it has a huge number of ties.
>
> Key: SPARK-42905
> URL: https://issues.apache.org/jira/browse/SPARK-42905
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 3.3.0
> Reporter: dronzer
> Priority: Blocker
> Labels: correctness
> Attachments: image-2023-03-23-10-51-28-420.png, image-2023-03-23-10-52-11-481.png, image-2023-03-23-10-52-49-392.png, image-2023-03-23-10-53-37-461.png, image-2023-03-23-10-55-26-879.png
>
> pyspark.ml.stat.Correlation
> The following scenario shows the Correlation function failing to produce correct Spearman coefficient results.
> Test example: a Spark DataFrame with two columns, A and B.
> !image-2023-03-23-10-55-26-879.png|width=562,height=162!
> Column A has 3 distinct values across 108 million rows.
> Column B has 4 distinct values across 108 million rows.
> If I calculate the correlation for this DataFrame with pandas DataFrame.corr, it gives the correct answer, and running the same code multiple times produces the same answer. (Each column has only 3-4 distinct values.)
> !image-2023-03-23-10-53-37-461.png|width=468,height=287!
>
> Spark's Spearman correlation, by contrast, produces *different results* for the *same DataFrame* on multiple runs (see below), even though each column in this DataFrame has only 3-4 distinct values.
> !image-2023-03-23-10-52-49-392.png|width=516,height=322!
>
> In short, pandas DataFrame.corr returns the same result for the same DataFrame on every run, which is the expected behaviour. Spark, given the same data, returns a different result, and re-running the same cell with the same data multiple times produces different results each time, so the output is inconsistent.
> Looking at the data, the only notable property I can point to is the ties: only 3-4 distinct values over 108M rows. This scenario does not appear to be handled by Spark's Correlation method, since the same data produces consistent results with pandas df.corr.
> The only workaround we found that gives consistent output in Spark, matching the pandas result, is a Pandas UDF, as shown below:
> !image-2023-03-23-10-52-11-481.png|width=518,height=111!
> !image-2023-03-23-10-51-28-420.png|width=509,height=270!
>
> We also tried the pyspark.pandas.DataFrame.corr method, and it produces incorrect and inconsistent results for this case too.
> Only the Pandas UDF approach seems to provide consistent results.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
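[Editor's note for context on the suspected cause: with ties, Spearman's rho is only well defined if tied values receive the average of their ranks; any rank assignment that breaks ties by arrival order can become nondeterministic under a distributed shuffle. A minimal single-machine sketch of the average-rank computation in plain Python follows; it illustrates the statistic itself, not Spark's distributed implementation.]

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their rank range."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of equal values (the "ties").
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho = Pearson correlation of the average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Because the average rank of a tie group depends only on the values, not on the order rows happen to arrive in, this computation is deterministic across runs, which is the property the ticket reports Spark losing.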
[jira] [Updated] (SPARK-42905) pyspark.ml.stat.Correlation - Spearman correlation method gives incorrect and inconsistent results for the same DataFrame when it has a huge number of ties.
[ https://issues.apache.org/jira/browse/SPARK-42905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dronzer updated SPARK-42905: Attachment: image-2023-03-23-10-55-26-879.png
[jira] [Updated] (SPARK-42905) pyspark.ml.stat.Correlation - Spearman correlation method gives incorrect and inconsistent results for the same DataFrame when it has a huge number of ties.
[ https://issues.apache.org/jira/browse/SPARK-42905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dronzer updated SPARK-42905: Description: (edited; the full updated description is quoted in the first message above)
[jira] [Updated] (SPARK-42905) pyspark.ml.stat.Correlation - Spearman correlation method gives incorrect and inconsistent results for the same DataFrame when it has a huge number of ties.
[ https://issues.apache.org/jira/browse/SPARK-42905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dronzer updated SPARK-42905: Attachment: image-2023-03-23-10-53-37-461.png
[jira] [Updated] (SPARK-42905) pyspark.ml.stat.Correlation - Spearman correlation method gives incorrect and inconsistent results for the same DataFrame when it has a huge number of ties.
[ https://issues.apache.org/jira/browse/SPARK-42905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dronzer updated SPARK-42905: Attachment: image-2023-03-23-10-52-49-392.png
[jira] [Updated] (SPARK-42905) pyspark.ml.stat.Correlation - Spearman correlation method gives incorrect and inconsistent results for the same DataFrame when it has a huge number of ties.
[ https://issues.apache.org/jira/browse/SPARK-42905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dronzer updated SPARK-42905: Attachment: image-2023-03-23-10-52-11-481.png
[jira] [Updated] (SPARK-42905) pyspark.ml.stat.Correlation - Spearman correlation method gives incorrect and inconsistent results for the same DataFrame when it has a huge number of ties.
[ https://issues.apache.org/jira/browse/SPARK-42905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dronzer updated SPARK-42905: Attachment: image-2023-03-23-10-51-28-420.png
[jira] [Created] (SPARK-42905) pyspark.ml.stat.Correlation - Spearman correlation method gives incorrect and inconsistent results for the same DataFrame when it has a huge number of ties.
dronzer created SPARK-42905: --- Summary: pyspark.ml.stat.Correlation - Spearman correlation method gives incorrect and inconsistent results for the same DataFrame when it has a huge number of ties. Key: SPARK-42905 URL: https://issues.apache.org/jira/browse/SPARK-42905 Project: Spark Issue Type: Bug Components: ML Affects Versions: 3.3.0 Reporter: dronzer (description as quoted in the update above)
[jira] [Created] (SPARK-42904) Char/Varchar Support for JDBC Catalog
Kent Yao created SPARK-42904: Summary: Char/Varchar Support for JDBC Catalog Key: SPARK-42904 URL: https://issues.apache.org/jira/browse/SPARK-42904 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Kent Yao
create table pub.src(c char(10)) -> org.apache.spark.SparkIllegalArgumentException: Can't get JDBC type for char(10).
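[Editor's note: the error suggests the JDBC dialect's type mapper has no case for char(n)/varchar(n). The kind of mapping the ticket asks for can be sketched in plain Python with hypothetical names; Spark's actual JdbcDialect is Scala code and is not reproduced here.]

```python
import re

def jdbc_type_for(spark_type: str) -> str:
    """Hypothetical illustration: map a Spark SQL char/varchar type string
    to a JDBC column definition, the case SPARK-42904 asks the dialect to add."""
    m = re.fullmatch(r"char\((\d+)\)", spark_type, re.IGNORECASE)
    if m:
        return f"CHAR({m.group(1)})"
    m = re.fullmatch(r"varchar\((\d+)\)", spark_type, re.IGNORECASE)
    if m:
        return f"VARCHAR({m.group(1)})"
    # Mirrors the reported failure mode for unmapped types.
    raise ValueError(f"Can't get JDBC type for {spark_type}")
```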
[jira] [Updated] (SPARK-42903) Avoid documenting None as a return value in docstring
[ https://issues.apache.org/jira/browse/SPARK-42903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-42903: - Priority: Trivial (was: Major)
> Avoid documenting None as a return value in docstring
>
> Key: SPARK-42903
> URL: https://issues.apache.org/jira/browse/SPARK-42903
> Project: Spark
> Issue Type: Documentation
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Hyukjin Kwon
> Priority: Trivial
>
> e.g.:
> {code}
> +++ b/python/pyspark/sql/dataframe.py
> @@ -385,10 +385,6 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
>         name : str
>             Name of the view.
> -        Returns
> -        -------
> -        None
> -
>         Examples
>         --------
>         Create a local temporary view named 'people'.
> @@ -426,10 +422,6 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
>         name : str
>             Name of the view.
> -        Returns
> -        -------
> -        None
> -
>         Examples
>         --------
>         Create a global temporary view.
> @@ -467,10 +459,6 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
>         name : str
>             Name of the view.
> -        Returns
> -        -------
> -        None
> {code}
> to be consistent. In Python, it is idiomatic not to document the return value of a function that returns None.
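[Editor's note: a small self-contained example of the convention the ticket asks for. The class and attribute here are stand-ins, not PySpark's real implementation; the point is that a method returning None documents its parameters and examples but carries no Returns section.]

```python
class DataFrame:
    """Minimal stand-in for the PySpark class discussed in the ticket."""

    def createTempView(self, name: str) -> None:
        """Create a local temporary view with this DataFrame.

        Parameters
        ----------
        name : str
            Name of the view.

        Examples
        --------
        >>> DataFrame().createTempView("people")
        """
        # The method returns nothing, so the docstring omits "Returns".
        self._view_name = name
```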
[jira] [Resolved] (SPARK-42863) Review and fix issues in PySpark API docs
[ https://issues.apache.org/jira/browse/SPARK-42863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42863. Resolution: Done
> Review and fix issues in PySpark API docs
>
> Key: SPARK-42863
> URL: https://issues.apache.org/jira/browse/SPARK-42863
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Xinrong Meng
> Assignee: Hyukjin Kwon
> Priority: Major
[jira] [Created] (SPARK-42903) Avoid documenting None as a return value in docstring
Hyukjin Kwon created SPARK-42903: Summary: Avoid documenting None as a return value in docstring Key: SPARK-42903 URL: https://issues.apache.org/jira/browse/SPARK-42903 Project: Spark Issue Type: Documentation Components: PySpark Affects Versions: 3.4.0 Reporter: Hyukjin Kwon (description as quoted in the update above)
[jira] [Resolved] (SPARK-42901) Should move message StorageLevel from `base.proto` to a separate file
[ https://issues.apache.org/jira/browse/SPARK-42901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42901. Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40518 [https://github.com/apache/spark/pull/40518]
> Should move message StorageLevel from `base.proto` to a separate file
>
> Key: SPARK-42901
> URL: https://issues.apache.org/jira/browse/SPARK-42901
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Yang Jie
> Assignee: Yang Jie
> Priority: Major
> Fix For: 3.5.0
>
> [https://github.com/apache/spark/pull/40510] introduced `message StorageLevel` in `base.proto`, but if we try to import `base.proto` from `catalog.proto` to reuse `StorageLevel` in `message CacheTable`, running `build/sbt "connect-common/compile"` produces the following messages in the compile log:
>
> {code:java}
> spark/connect/base.proto:23:1: File recursively imports itself: spark/connect/base.proto -> spark/connect/commands.proto -> spark/connect/relations.proto -> spark/connect/catalog.proto -> spark/connect/base.proto
> spark/connect/catalog.proto:22:1: Import "spark/connect/base.proto" was not found or had errors.
> spark/connect/catalog.proto:144:12: "spark.connect.DataType" seems to be defined in "spark/connect/types.proto", which is not imported by "spark/connect/catalog.proto". To use it here, please add the necessary import.
> spark/connect/catalog.proto:161:12: "spark.connect.DataType" seems to be defined in "spark/connect/types.proto", which is not imported by "spark/connect/catalog.proto". To use it here, please add the necessary import.
> spark/connect/relations.proto:25:1: Import "spark/connect/catalog.proto" was not found or had errors.
> spark/connect/relations.proto:84:5: "Catalog" is not defined.
> spark/connect/commands.proto:22:1: Import "spark/connect/relations.proto" was not found or had errors.
> spark/connect/commands.proto:63:3: "Relation" is not defined.
> spark/connect/commands.proto:81:3: "Relation" is not defined.
> spark/connect/commands.proto:142:3: "Relation" is not defined.
> spark/connect/base.proto:23:1: Import "spark/connect/commands.proto" was not found or had errors.
> spark/connect/base.proto:25:1: Import "spark/connect/relations.proto" was not found or had errors.
> {code}
> So we should move `message StorageLevel` out of `base.proto` into a separate file to avoid this issue.
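[Editor's note: a sketch of the proposed fix. The file name and message fields below are assumptions for illustration; the key point is that a leaf file with no imports can be included by base.proto, catalog.proto, and commands.proto alike without creating the import cycle shown in the log.]

```protobuf
// Hypothetical spark/connect/common.proto: a leaf file that imports nothing,
// so any other Spark Connect proto can import it without recursion.
syntax = "proto3";

package spark.connect;

// Moved out of base.proto so catalog.proto can reuse it in CacheTable.
message StorageLevel {
  bool use_disk = 1;
  bool use_memory = 2;
  bool use_off_heap = 3;
  bool deserialized = 4;
  int32 replication = 5;
}
```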
[jira] [Assigned] (SPARK-42901) Should move message StorageLevel from `base.proto` to a separate file
[ https://issues.apache.org/jira/browse/SPARK-42901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42901: Assignee: Yang Jie
[jira] [Created] (SPARK-42902) CVE-2020-13936 request for upgrading version of Velocity
JacobZheng created SPARK-42902: -- Summary: CVE-2020-13936 request for upgrading version of Velocity Key: SPARK-42902 URL: https://issues.apache.org/jira/browse/SPARK-42902 Project: Spark Issue Type: Dependency upgrade Components: Build Affects Versions: 3.2.3 Reporter: JacobZheng An attacker that is able to modify Velocity templates may execute arbitrary Java code or run arbitrary system commands with the same privileges as the account running the Servlet container. This applies to applications that allow untrusted users to upload/modify Velocity templates running Apache Velocity Engine versions up to 2.2. The current version of Velocity that Spark relies on is 1.5; should we upgrade to version 2.3? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42901) Should move message StorageLevel from `base.proto` to a separate file
[ https://issues.apache.org/jira/browse/SPARK-42901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-42901: - Summary: Should move message StorageLevel from `base.proto` to a separate file (was: Should move message StorageLevel from `base.proto` into a separate file) > Should move message StorageLevel from `base.proto` to a separate file > - > > Key: SPARK-42901 > URL: https://issues.apache.org/jira/browse/SPARK-42901 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > > [https://github.com/apache/spark/pull/40510] introduces `message StorageLevel` > in `base.proto`, but if we try to import `base.proto` in `catalog.proto` to > reuse `StorageLevel` in `message CacheTable` and run `build/sbt > "connect-common/compile"` to compile, the following messages will appear in the > compile log: > > {code:java} > spark/connect/base.proto:23:1: File recursively imports itself: > spark/connect/base.proto -> spark/connect/commands.proto -> > spark/connect/relations.proto -> spark/connect/catalog.proto -> > spark/connect/base.proto > spark/connect/catalog.proto:22:1: Import "spark/connect/base.proto" was not > found or had errors. > spark/connect/catalog.proto:144:12: "spark.connect.DataType" seems to be > defined in "spark/connect/types.proto", which is not imported by > "spark/connect/catalog.proto". To use it here, please add the necessary > import. > spark/connect/catalog.proto:161:12: "spark.connect.DataType" seems to be > defined in "spark/connect/types.proto", which is not imported by > "spark/connect/catalog.proto". To use it here, please add the necessary > import. > spark/connect/relations.proto:25:1: Import "spark/connect/catalog.proto" was > not found or had errors. > spark/connect/relations.proto:84:5: "Catalog" is not defined. > spark/connect/commands.proto:22:1: Import "spark/connect/relations.proto" was > not found or had errors. 
> spark/connect/commands.proto:63:3: "Relation" is not defined. > spark/connect/commands.proto:81:3: "Relation" is not defined. > spark/connect/commands.proto:142:3: "Relation" is not defined. > spark/connect/base.proto:23:1: Import "spark/connect/commands.proto" was not > found or had errors. > spark/connect/base.proto:25:1: Import "spark/connect/relations.proto" was not > found or had errors. > {code} > So we should move `message StorageLevel` from `base.proto` to a separate file > to avoid this issue > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
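A sketch of the proposed split (the file name `common.proto`, the options, and the exact field layout below are assumptions for illustration, not taken from the PR): give `StorageLevel` its own file that both `base.proto` and `catalog.proto` can import without creating an edge back into the `base.proto -> commands.proto -> relations.proto -> catalog.proto` cycle.

```protobuf
// spark/connect/common.proto -- hypothetical new home for StorageLevel
syntax = "proto3";

package spark.connect;

option java_multiple_files = true;
option java_package = "org.apache.spark.connect.proto";

// Mirrors org.apache.spark.storage.StorageLevel; the exact field set is an
// assumption for this sketch.
message StorageLevel {
  bool use_disk = 1;
  bool use_memory = 2;
  bool use_off_heap = 3;
  bool deserialized = 4;
  int32 replication = 5;
}
```

`message CacheTable` in `catalog.proto` could then `import "spark/connect/common.proto";` and reference `StorageLevel` directly, and `base.proto` would import the same file instead of defining the message itself, so no file in the chain imports `base.proto`.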
[jira] [Resolved] (SPARK-41500) auto generate concat as Double when string minus an INTERVAL type
[ https://issues.apache.org/jira/browse/SPARK-41500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JacobZheng resolved SPARK-41500. Resolution: Won't Fix > auto generate concat as Double when string minus an INTERVAL type > - > > Key: SPARK-41500 > URL: https://issues.apache.org/jira/browse/SPARK-41500 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.2.2 >Reporter: JacobZheng >Priority: Major > > h2. *Describe the bug* > Here is a sql. > {code:sql} > select '2022-02-01'- INTERVAL 1 year > {code} > spark generate cast('2022-02-01' as double) - INTERVAL 1 year automatically > and type mismatch happened. > h2. *To Reproduce* > On Spark 3.0.1 using spark-shell > {code:java} > scala> spark.sql("select '2022-02-01'- interval 1 year").show > +--+ > > |CAST(CAST(2022-02-01 AS TIMESTAMP) - INTERVAL '1 years' AS STRING)| > +--+ > | 2021-02-01 00:00:00| > +--+ > {code} > On Spark 3.2.1 using spark-shell > {code:java} > scala> spark.sql("select '2022-02-01'- interval 1 year").show > org.apache.spark.sql.AnalysisException: cannot resolve '(CAST('2022-02-01' AS > DOUBLE) - INTERVAL '1' YEAR)' due to data type mismatch: differing types in > '(CAST('2022-02-01' AS DOUBLE) - INTERVAL '1' YEAR)' (double and interval > year).; line 1 pos 7; > 'Project [unresolvedalias((cast(2022-02-01 as double) - INTERVAL '1' YEAR), > None)] > +- OneRowRelation > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:190) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:175) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$2(TreeNode.scala:535) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:535) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1128) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1127) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:467) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUpWithPruning$1(QueryPlan.scala:181) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:209) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at scala.collection.immutable.List.foreach(List.scala:431) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.immutable.List.map(List.scala:305) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:209) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUpWithPruning(QueryPlan.scala:181) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:161) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:175) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:94) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:263) > at >
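As a workaround sketch for the 3.2.x behavior (assuming the intent is date/time arithmetic), casting the string explicitly keeps the analyzer from falling back to the numeric coercion; this mirrors what 3.0.x did implicitly:

```sql
-- Explicit cast instead of relying on implicit string-to-timestamp coercion
SELECT CAST('2022-02-01' AS TIMESTAMP) - INTERVAL 1 YEAR;
```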
[jira] [Resolved] (SPARK-42899) DataFrame.to(schema) fails when it contains non-nullable nested field in nullable field
[ https://issues.apache.org/jira/browse/SPARK-42899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42899. -- Fix Version/s: 3.4.1 Assignee: Takuya Ueshin Resolution: Fixed Fixed in https://github.com/apache/spark/pull/40526 > DataFrame.to(schema) fails when it contains non-nullable nested field in > nullable field > --- > > Key: SPARK-42899 > URL: https://issues.apache.org/jira/browse/SPARK-42899 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.4.1 > > > {{DataFrame.to(schema)}} fails when it contains a non-nullable nested field in a > nullable field: > {code:scala} > scala> val df = spark.sql("VALUES (1, STRUCT(1 as i)), (NULL, NULL) as t(a, > b)") > df: org.apache.spark.sql.DataFrame = [a: int, b: struct<i: int>] > scala> df.printSchema() > root > |-- a: integer (nullable = true) > |-- b: struct (nullable = true) > |    |-- i: integer (nullable = false) > scala> df.to(df.schema) > org.apache.spark.sql.AnalysisException: [NULLABLE_COLUMN_OR_FIELD] Column or > field `b`.`i` is nullable while it's required to be non-nullable. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
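The rule behind the fix can be illustrated with a small, self-contained Python sketch (the helper function is hypothetical, not a Spark API): a source field is incompatible with a target field only when the source may produce nulls while the target forbids them, so any schema must be assignable to itself.

```python
def nullable_compatible(source_nullable: bool, target_nullable: bool) -> bool:
    """Hypothetical helper, not a Spark API: a source field fits a target
    field unless the source may produce nulls while the target forbids them."""
    return target_nullable or not source_nullable

# Identity: every field is compatible with itself, including the
# non-nullable nested field `b`.`i` from the report.
for nullable in (True, False):
    assert nullable_compatible(nullable, nullable)

# The only illegal direction: nullable source into non-nullable target.
assert not nullable_compatible(True, False)
assert nullable_compatible(False, True)
```

The reported bug was that the nested field was effectively treated as nullable (because its parent struct `b` is nullable) before being checked against the non-nullable target, so even the identity case `df.to(df.schema)` was rejected.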
[jira] [Updated] (SPARK-42901) Should move message StorageLevel from `base.proto` into a separate file
[ https://issues.apache.org/jira/browse/SPARK-42901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-42901: - Summary: Should move message StorageLevel from `base.proto` into a separate file (was: Should move message StorageLevel from `base.proto` to a separate file) > Should move message StorageLevel from `base.proto` into a separate file > --- > > Key: SPARK-42901 > URL: https://issues.apache.org/jira/browse/SPARK-42901 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > > [https://github.com/apache/spark/pull/40510] introduce `message StorageLevel` > to `base.proto`, but if we try to import `base.proto` in `catalog.proto` to > reuse `StorageLevel` in `message CacheTable` and run `build/sbt > "connect-common/compile" to compile, there will be following message in > compile log: > > {code:java} > spark/connect/base.proto:23:1: File recursively imports itself: > spark/connect/base.proto -> spark/connect/commands.proto -> > spark/connect/relations.proto -> spark/connect/catalog.proto -> > spark/connect/base.proto > spark/connect/catalog.proto:22:1: Import "spark/connect/base.proto" was not > found or had errors. > spark/connect/catalog.proto:144:12: "spark.connect.DataType" seems to be > defined in "spark/connect/types.proto", which is not imported by > "spark/connect/catalog.proto". To use it here, please add the necessary > import. > spark/connect/catalog.proto:161:12: "spark.connect.DataType" seems to be > defined in "spark/connect/types.proto", which is not imported by > "spark/connect/catalog.proto". To use it here, please add the necessary > import. > spark/connect/relations.proto:25:1: Import "spark/connect/catalog.proto" was > not found or had errors. > spark/connect/relations.proto:84:5: "Catalog" is not defined. > spark/connect/commands.proto:22:1: Import "spark/connect/relations.proto" was > not found or had errors. 
> spark/connect/commands.proto:63:3: "Relation" is not defined. > spark/connect/commands.proto:81:3: "Relation" is not defined. > spark/connect/commands.proto:142:3: "Relation" is not defined. > spark/connect/base.proto:23:1: Import "spark/connect/commands.proto" was not > found or had errors. > spark/connect/base.proto:25:1: Import "spark/connect/relations.proto" was not > found or had errors. > {code} > So we should move `message StorageLevel` from `base.proto` to a separate file > to avoid this issue > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42901) Should move message StorageLevel from into a separate file
[ https://issues.apache.org/jira/browse/SPARK-42901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-42901: - Summary: Should move message StorageLevel from into a separate file (was: Should move StorageLevel into a separate file) > Should move message StorageLevel from into a separate file > -- > > Key: SPARK-42901 > URL: https://issues.apache.org/jira/browse/SPARK-42901 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > > https://github.com/apache/spark/pull/40510 introduce `message StorageLevel` > to `base.proto`, but if we try to import `base.proto` in `catalog.proto` to > reuse `StorageLevel` in `message CacheTable` and run `build/sbt > "connect-common/compile" to compile, there will be following message in > compile log: > > {code:java} > spark/connect/base.proto:23:1: File recursively imports itself: > spark/connect/base.proto -> spark/connect/commands.proto -> > spark/connect/relations.proto -> spark/connect/catalog.proto -> > spark/connect/base.proto > spark/connect/catalog.proto:22:1: Import "spark/connect/base.proto" was not > found or had errors. > spark/connect/catalog.proto:144:12: "spark.connect.DataType" seems to be > defined in "spark/connect/types.proto", which is not imported by > "spark/connect/catalog.proto". To use it here, please add the necessary > import. > spark/connect/catalog.proto:161:12: "spark.connect.DataType" seems to be > defined in "spark/connect/types.proto", which is not imported by > "spark/connect/catalog.proto". To use it here, please add the necessary > import. > spark/connect/relations.proto:25:1: Import "spark/connect/catalog.proto" was > not found or had errors. > spark/connect/relations.proto:84:5: "Catalog" is not defined. > spark/connect/commands.proto:22:1: Import "spark/connect/relations.proto" was > not found or had errors. > spark/connect/commands.proto:63:3: "Relation" is not defined. 
> spark/connect/commands.proto:81:3: "Relation" is not defined. > spark/connect/commands.proto:142:3: "Relation" is not defined. > spark/connect/base.proto:23:1: Import "spark/connect/commands.proto" was not > found or had errors. > spark/connect/base.proto:25:1: Import "spark/connect/relations.proto" was not > found or had errors. > {code} > So we should move `message StorageLevel` from a to a separate file to avoid > this issue > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42901) Should move message StorageLevel from into a separate file
[ https://issues.apache.org/jira/browse/SPARK-42901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-42901: - Description: [https://github.com/apache/spark/pull/40510] introduce `message StorageLevel` to `base.proto`, but if we try to import `base.proto` in `catalog.proto` to reuse `StorageLevel` in `message CacheTable` and run `build/sbt "connect-common/compile" to compile, there will be following message in compile log: {code:java} spark/connect/base.proto:23:1: File recursively imports itself: spark/connect/base.proto -> spark/connect/commands.proto -> spark/connect/relations.proto -> spark/connect/catalog.proto -> spark/connect/base.proto spark/connect/catalog.proto:22:1: Import "spark/connect/base.proto" was not found or had errors. spark/connect/catalog.proto:144:12: "spark.connect.DataType" seems to be defined in "spark/connect/types.proto", which is not imported by "spark/connect/catalog.proto". To use it here, please add the necessary import. spark/connect/catalog.proto:161:12: "spark.connect.DataType" seems to be defined in "spark/connect/types.proto", which is not imported by "spark/connect/catalog.proto". To use it here, please add the necessary import. spark/connect/relations.proto:25:1: Import "spark/connect/catalog.proto" was not found or had errors. spark/connect/relations.proto:84:5: "Catalog" is not defined. spark/connect/commands.proto:22:1: Import "spark/connect/relations.proto" was not found or had errors. spark/connect/commands.proto:63:3: "Relation" is not defined. spark/connect/commands.proto:81:3: "Relation" is not defined. spark/connect/commands.proto:142:3: "Relation" is not defined. spark/connect/base.proto:23:1: Import "spark/connect/commands.proto" was not found or had errors. spark/connect/base.proto:25:1: Import "spark/connect/relations.proto" was not found or had errors. 
{code} So we should move `message StorageLevel` from `base.proto` to a separate file to avoid this issue was: https://github.com/apache/spark/pull/40510 introduce `message StorageLevel` to `base.proto`, but if we try to import `base.proto` in `catalog.proto` to reuse `StorageLevel` in `message CacheTable` and run `build/sbt "connect-common/compile" to compile, there will be following message in compile log: {code:java} spark/connect/base.proto:23:1: File recursively imports itself: spark/connect/base.proto -> spark/connect/commands.proto -> spark/connect/relations.proto -> spark/connect/catalog.proto -> spark/connect/base.proto spark/connect/catalog.proto:22:1: Import "spark/connect/base.proto" was not found or had errors. spark/connect/catalog.proto:144:12: "spark.connect.DataType" seems to be defined in "spark/connect/types.proto", which is not imported by "spark/connect/catalog.proto". To use it here, please add the necessary import. spark/connect/catalog.proto:161:12: "spark.connect.DataType" seems to be defined in "spark/connect/types.proto", which is not imported by "spark/connect/catalog.proto". To use it here, please add the necessary import. spark/connect/relations.proto:25:1: Import "spark/connect/catalog.proto" was not found or had errors. spark/connect/relations.proto:84:5: "Catalog" is not defined. spark/connect/commands.proto:22:1: Import "spark/connect/relations.proto" was not found or had errors. spark/connect/commands.proto:63:3: "Relation" is not defined. spark/connect/commands.proto:81:3: "Relation" is not defined. spark/connect/commands.proto:142:3: "Relation" is not defined. spark/connect/base.proto:23:1: Import "spark/connect/commands.proto" was not found or had errors. spark/connect/base.proto:25:1: Import "spark/connect/relations.proto" was not found or had errors. 
{code} So we should move `message StorageLevel` from a to a separate file to avoid this issue > Should move message StorageLevel from into a separate file > -- > > Key: SPARK-42901 > URL: https://issues.apache.org/jira/browse/SPARK-42901 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > > [https://github.com/apache/spark/pull/40510] introduce `message StorageLevel` > to `base.proto`, but if we try to import `base.proto` in `catalog.proto` to > reuse `StorageLevel` in `message CacheTable` and run `build/sbt > "connect-common/compile" to compile, there will be following message in > compile log: > > {code:java} > spark/connect/base.proto:23:1: File recursively imports itself: > spark/connect/base.proto -> spark/connect/commands.proto -> > spark/connect/relations.proto -> spark/connect/catalog.proto -> > spark/connect/base.proto > spark/connect/catalog.proto:22:1: Import "spark/connect/base.proto" was not > found or had errors. >
[jira] [Updated] (SPARK-42901) Should move message StorageLevel from `base.proto` to a separate file
[ https://issues.apache.org/jira/browse/SPARK-42901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-42901: - Summary: Should move message StorageLevel from `base.proto` to a separate file (was: Should move message StorageLevel from into a separate file) > Should move message StorageLevel from `base.proto` to a separate file > - > > Key: SPARK-42901 > URL: https://issues.apache.org/jira/browse/SPARK-42901 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > > [https://github.com/apache/spark/pull/40510] introduces `message StorageLevel` > in `base.proto`, but if we try to import `base.proto` in `catalog.proto` to > reuse `StorageLevel` in `message CacheTable` and run `build/sbt > "connect-common/compile"` to compile, the following messages will appear in the > compile log: > > {code:java} > spark/connect/base.proto:23:1: File recursively imports itself: > spark/connect/base.proto -> spark/connect/commands.proto -> > spark/connect/relations.proto -> spark/connect/catalog.proto -> > spark/connect/base.proto > spark/connect/catalog.proto:22:1: Import "spark/connect/base.proto" was not > found or had errors. > spark/connect/catalog.proto:144:12: "spark.connect.DataType" seems to be > defined in "spark/connect/types.proto", which is not imported by > "spark/connect/catalog.proto". To use it here, please add the necessary > import. > spark/connect/catalog.proto:161:12: "spark.connect.DataType" seems to be > defined in "spark/connect/types.proto", which is not imported by > "spark/connect/catalog.proto". To use it here, please add the necessary > import. > spark/connect/relations.proto:25:1: Import "spark/connect/catalog.proto" was > not found or had errors. > spark/connect/relations.proto:84:5: "Catalog" is not defined. > spark/connect/commands.proto:22:1: Import "spark/connect/relations.proto" was > not found or had errors. 
> spark/connect/commands.proto:63:3: "Relation" is not defined. > spark/connect/commands.proto:81:3: "Relation" is not defined. > spark/connect/commands.proto:142:3: "Relation" is not defined. > spark/connect/base.proto:23:1: Import "spark/connect/commands.proto" was not > found or had errors. > spark/connect/base.proto:25:1: Import "spark/connect/relations.proto" was not > found or had errors. > {code} > So we should move `message StorageLevel` from `base.proto` to a separate file > to avoid this issue > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42901) Should move StorageLevel into a separate file
[ https://issues.apache.org/jira/browse/SPARK-42901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-42901: - Summary: Should move StorageLevel into a separate file (was: Move StorageLevel into a separate file to avoid potential file recursively imports) > Should move StorageLevel into a separate file > - > > Key: SPARK-42901 > URL: https://issues.apache.org/jira/browse/SPARK-42901 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > > https://github.com/apache/spark/pull/40510 introduce `message StorageLevel` > to `base.proto`, but if we try to import `base.proto` in `catalog.proto` to > reuse `StorageLevel` in `message CacheTable` and run `build/sbt > "connect-common/compile" to compile, there will be following message in > compile log: > > {code:java} > spark/connect/base.proto:23:1: File recursively imports itself: > spark/connect/base.proto -> spark/connect/commands.proto -> > spark/connect/relations.proto -> spark/connect/catalog.proto -> > spark/connect/base.proto > spark/connect/catalog.proto:22:1: Import "spark/connect/base.proto" was not > found or had errors. > spark/connect/catalog.proto:144:12: "spark.connect.DataType" seems to be > defined in "spark/connect/types.proto", which is not imported by > "spark/connect/catalog.proto". To use it here, please add the necessary > import. > spark/connect/catalog.proto:161:12: "spark.connect.DataType" seems to be > defined in "spark/connect/types.proto", which is not imported by > "spark/connect/catalog.proto". To use it here, please add the necessary > import. > spark/connect/relations.proto:25:1: Import "spark/connect/catalog.proto" was > not found or had errors. > spark/connect/relations.proto:84:5: "Catalog" is not defined. > spark/connect/commands.proto:22:1: Import "spark/connect/relations.proto" was > not found or had errors. > spark/connect/commands.proto:63:3: "Relation" is not defined. 
> spark/connect/commands.proto:81:3: "Relation" is not defined. > spark/connect/commands.proto:142:3: "Relation" is not defined. > spark/connect/base.proto:23:1: Import "spark/connect/commands.proto" was not > found or had errors. > spark/connect/base.proto:25:1: Import "spark/connect/relations.proto" was not > found or had errors. > {code} > So we should move `message StorageLevel` from a to a separate file to avoid > this issue > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42901) Move StorageLevel into a separate file to avoid potential file recursively imports
Yang Jie created SPARK-42901: Summary: Move StorageLevel into a separate file to avoid potential file recursively imports Key: SPARK-42901 URL: https://issues.apache.org/jira/browse/SPARK-42901 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.4.0 Reporter: Yang Jie https://github.com/apache/spark/pull/40510 introduce `message StorageLevel` to `base.proto`, but if we try to import `base.proto` in `catalog.proto` to reuse `StorageLevel` in `message CacheTable` and run `build/sbt "connect-common/compile" to compile, there will be following message in compile log: {code:java} spark/connect/base.proto:23:1: File recursively imports itself: spark/connect/base.proto -> spark/connect/commands.proto -> spark/connect/relations.proto -> spark/connect/catalog.proto -> spark/connect/base.proto spark/connect/catalog.proto:22:1: Import "spark/connect/base.proto" was not found or had errors. spark/connect/catalog.proto:144:12: "spark.connect.DataType" seems to be defined in "spark/connect/types.proto", which is not imported by "spark/connect/catalog.proto". To use it here, please add the necessary import. spark/connect/catalog.proto:161:12: "spark.connect.DataType" seems to be defined in "spark/connect/types.proto", which is not imported by "spark/connect/catalog.proto". To use it here, please add the necessary import. spark/connect/relations.proto:25:1: Import "spark/connect/catalog.proto" was not found or had errors. spark/connect/relations.proto:84:5: "Catalog" is not defined. spark/connect/commands.proto:22:1: Import "spark/connect/relations.proto" was not found or had errors. spark/connect/commands.proto:63:3: "Relation" is not defined. spark/connect/commands.proto:81:3: "Relation" is not defined. spark/connect/commands.proto:142:3: "Relation" is not defined. spark/connect/base.proto:23:1: Import "spark/connect/commands.proto" was not found or had errors. 
spark/connect/base.proto:25:1: Import "spark/connect/relations.proto" was not found or had errors. {code} So we should move `message StorageLevel` from a to a separate file to avoid this issue -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42895) ValueError when invoking any session operations on a stopped Spark session
[ https://issues.apache.org/jira/browse/SPARK-42895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703815#comment-17703815 ] Rui Wang commented on SPARK-42895: -- Of course, it probably needs a better error message locally, though. > ValueError when invoking any session operations on a stopped Spark session > -- > > Key: SPARK-42895 > URL: https://issues.apache.org/jira/browse/SPARK-42895 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Allison Wang >Priority: Major > > If a remote Spark session is stopped, trying to invoke any session operations > will result in a ValueError. For example: > > {code:java} > spark.stop() > spark.sql("select 1") > ValueError: Cannot invoke RPC: Channel closed! > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > ... > return e.code() == grpc.StatusCode.UNAVAILABLE > AttributeError: 'ValueError' object has no attribute 'code'{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
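The secondary AttributeError in that traceback comes from a retry guard that assumes every failure object exposes gRPC's .code() method. A minimal, self-contained Python sketch of the bug and a type-guarded fix (`FakeRpcError` and both function names are hypothetical stand-ins, not the real grpc/pyspark types):

```python
class FakeRpcError(Exception):
    """Hypothetical stand-in for grpc.RpcError, which exposes .code()."""
    def code(self) -> str:
        return "UNAVAILABLE"

def is_retryable_buggy(e: Exception) -> bool:
    # Raises AttributeError when e is the plain ValueError
    # ("Cannot invoke RPC: Channel closed!") from a closed channel.
    return e.code() == "UNAVAILABLE"

def is_retryable(e: Exception) -> bool:
    # Type guard: only RpcError-like exceptions carry .code(), so a
    # closed-channel ValueError is simply reported as non-retryable.
    return isinstance(e, FakeRpcError) and e.code() == "UNAVAILABLE"

assert is_retryable(FakeRpcError())
assert not is_retryable(ValueError("Cannot invoke RPC: Channel closed!"))
```

With the guard in place, the original ValueError would surface directly (ideally replaced by a clearer "session already stopped" error) instead of being masked by the AttributeError.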
[jira] [Resolved] (SPARK-42748) Server-side Artifact Management
[ https://issues.apache.org/jira/browse/SPARK-42748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-42748. --- Fix Version/s: 3.5.0 Resolution: Fixed > Server-side Artifact Management > --- > > Key: SPARK-42748 > URL: https://issues.apache.org/jira/browse/SPARK-42748 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Venkata Sai Akhil Gudesa >Priority: Major > Fix For: 3.5.0 > > > https://issues.apache.org/jira/browse/SPARK-42653 implements the client-side > transfer of artifacts to the server but currently, the server does not > process these requests. > > We need to implement a server-side management mechanism to handle storage of > these artifacts on the driver as well as perform further processing (such as > adding jars and moving class files to the right directories) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42895) ValueError when invoking any session operations on a stopped Spark session
[ https://issues.apache.org/jira/browse/SPARK-42895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703814#comment-17703814 ] Rui Wang commented on SPARK-42895: -- In fact, spark.stop() does not stop the remote Spark session; it only closes the local session and the gRPC channel. This error is expected. > ValueError when invoking any session operations on a stopped Spark session > -- > > Key: SPARK-42895 > URL: https://issues.apache.org/jira/browse/SPARK-42895 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Allison Wang >Priority: Major > > If a remote Spark session is stopped, trying to invoke any session operations > will result in a ValueError. For example: > > {code:java} > spark.stop() > spark.sql("select 1") > ValueError: Cannot invoke RPC: Channel closed! > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > ... > return e.code() == grpc.StatusCode.UNAVAILABLE > AttributeError: 'ValueError' object has no attribute 'code'{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42900) Fix createDataFrame to respect both type inference and column names.
Takuya Ueshin created SPARK-42900: - Summary: Fix createDataFrame to respect both type inference and column names. Key: SPARK-42900 URL: https://issues.apache.org/jira/browse/SPARK-42900 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42899) DataFrame.to(schema) fails when it contains non-nullable nested field in nullable field
[ https://issues.apache.org/jira/browse/SPARK-42899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-42899: -- Summary: DataFrame.to(schema) fails when it contains non-nullable nested field in nullable field (was: DataFrame.to(schema) fails with the schema of itself.) > DataFrame.to(schema) fails when it contains non-nullable nested field in > nullable field > --- > > Key: SPARK-42899 > URL: https://issues.apache.org/jira/browse/SPARK-42899 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Priority: Major > > {{DataFrame.to(schema)}} fails when it contains non-nullable nested field in > nullable field: > {code:scala} > scala> val df = spark.sql("VALUES (1, STRUCT(1 as i)), (NULL, NULL) as t(a, > b)") > df: org.apache.spark.sql.DataFrame = [a: int, b: struct] > scala> df.printSchema() > root > |-- a: integer (nullable = true) > |-- b: struct (nullable = true) > ||-- i: integer (nullable = false) > scala> df.to(df.schema) > org.apache.spark.sql.AnalysisException: [NULLABLE_COLUMN_OR_FIELD] Column or > field `b`.`i` is nullable while it's required to be non-nullable. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42899) DataFrame.to(schema) fails with the schema of itself.
[ https://issues.apache.org/jira/browse/SPARK-42899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-42899: -- Description: {{DataFrame.to(schema)}} fails when it contains non-nullable nested field in nullable field: {code:scala} scala> val df = spark.sql("VALUES (1, STRUCT(1 as i)), (NULL, NULL) as t(a, b)") df: org.apache.spark.sql.DataFrame = [a: int, b: struct] scala> df.printSchema() root |-- a: integer (nullable = true) |-- b: struct (nullable = true) ||-- i: integer (nullable = false) scala> df.to(df.schema) org.apache.spark.sql.AnalysisException: [NULLABLE_COLUMN_OR_FIELD] Column or field `b`.`i` is nullable while it's required to be non-nullable. {code} was: {{DataFrame.to(schema)}} fails with the schema of itself, when it contains non-nullable nested field in nullable field: {code:scala} scala> val df = spark.sql("VALUES (1, STRUCT(1 as i)), (NULL, NULL) as t(a, b)") df: org.apache.spark.sql.DataFrame = [a: int, b: struct] scala> df.printSchema() root |-- a: integer (nullable = true) |-- b: struct (nullable = true) ||-- i: integer (nullable = false) scala> df.to(df.schema) org.apache.spark.sql.AnalysisException: [NULLABLE_COLUMN_OR_FIELD] Column or field `b`.`i` is nullable while it's required to be non-nullable. {code} > DataFrame.to(schema) fails with the schema of itself. 
> - > > Key: SPARK-42899 > URL: https://issues.apache.org/jira/browse/SPARK-42899 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Priority: Major > > {{DataFrame.to(schema)}} fails when it contains non-nullable nested field in > nullable field: > {code:scala} > scala> val df = spark.sql("VALUES (1, STRUCT(1 as i)), (NULL, NULL) as t(a, > b)") > df: org.apache.spark.sql.DataFrame = [a: int, b: struct] > scala> df.printSchema() > root > |-- a: integer (nullable = true) > |-- b: struct (nullable = true) > ||-- i: integer (nullable = false) > scala> df.to(df.schema) > org.apache.spark.sql.AnalysisException: [NULLABLE_COLUMN_OR_FIELD] Column or > field `b`.`i` is nullable while it's required to be non-nullable. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42899) DataFrame.to(schema) fails with the schema of itself.
Takuya Ueshin created SPARK-42899: - Summary: DataFrame.to(schema) fails with the schema of itself. Key: SPARK-42899 URL: https://issues.apache.org/jira/browse/SPARK-42899 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Takuya Ueshin {{DataFrame.to(schema)}} fails with its own schema when it contains a non-nullable nested field inside a nullable field: {code:scala} scala> val df = spark.sql("VALUES (1, STRUCT(1 as i)), (NULL, NULL) as t(a, b)") df: org.apache.spark.sql.DataFrame = [a: int, b: struct] scala> df.printSchema() root |-- a: integer (nullable = true) |-- b: struct (nullable = true) ||-- i: integer (nullable = false) scala> df.to(df.schema) org.apache.spark.sql.AnalysisException: [NULLABLE_COLUMN_OR_FIELD] Column or field `b`.`i` is nullable while it's required to be non-nullable. {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
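A plausible reading of the failure, sketched in plain Python (the helper below is illustrative, not Spark's actual check): because `b` itself is nullable, the nested `b`.`i` is effectively nullable at runtime, so it cannot satisfy a target field declared non-nullable.

```python
# Illustrative nullability-compatibility check (not Spark's implementation):
# a write is allowed only if the target is nullable or the source never
# produces nulls.
def can_write(source_nullable: bool, target_nullable: bool) -> bool:
    return target_nullable or not source_nullable

# b is nullable, so the nested field b.i is effectively nullable even though
# it is declared non-nullable in the schema -- the round-trip is rejected.
effectively_nullable_source = True
declared_non_nullable_target = False
assert can_write(effectively_nullable_source, declared_non_nullable_target) is False
```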
[jira] [Resolved] (SPARK-42892) Move sameType and relevant methods out of DataType
[ https://issues.apache.org/jira/browse/SPARK-42892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-42892. --- Fix Version/s: 3.5.0 Resolution: Fixed > Move sameType and relevant methods out of DataType > -- > > Key: SPARK-42892 > URL: https://issues.apache.org/jira/browse/SPARK-42892 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42832) Remove repartition if it is the child of LocalLimit
[ https://issues.apache.org/jira/browse/SPARK-42832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42832. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40462 [https://github.com/apache/spark/pull/40462] > Remove repartition if it is the child of LocalLimit > --- > > Key: SPARK-42832 > URL: https://issues.apache.org/jira/browse/SPARK-42832 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42832) Remove repartition if it is the child of LocalLimit
[ https://issues.apache.org/jira/browse/SPARK-42832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42832: - Assignee: Yuming Wang > Remove repartition if it is the child of LocalLimit > --- > > Key: SPARK-42832 > URL: https://issues.apache.org/jira/browse/SPARK-42832 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42894) Implement cache, persist, unpersist, and storageLevel
[ https://issues.apache.org/jira/browse/SPARK-42894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42894: - Assignee: Yang Jie > Implement cache, persist, unpersist, and storageLevel > - > > Key: SPARK-42894 > URL: https://issues.apache.org/jira/browse/SPARK-42894 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42894) Implement cache, persist, unpersist, and storageLevel
[ https://issues.apache.org/jira/browse/SPARK-42894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42894. --- Fix Version/s: 3.4.1 Resolution: Fixed Issue resolved by pull request 40516 [https://github.com/apache/spark/pull/40516] > Implement cache, persist, unpersist, and storageLevel > - > > Key: SPARK-42894 > URL: https://issues.apache.org/jira/browse/SPARK-42894 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.4.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42898) Cast from string to date and date to string say timezone is needed, but it is not used
[ https://issues.apache.org/jira/browse/SPARK-42898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42898: Assignee: (was: Apache Spark) > Cast from string to date and date to string say timezone is needed, but it is > not used > -- > > Key: SPARK-42898 > URL: https://issues.apache.org/jira/browse/SPARK-42898 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Robert Joseph Evans >Priority: Major > > This is really minor but SPARK-35581 removed the need for a timezone when > casting from a `StringType` to a `DateType`, but the patch didn't update the > `needsTimeZone` function to indicate that it was no longer required. > Currently, casting from a `DateType` to a `StringType` also says that it needs the > timezone, but it only uses the `DateFormatter` with its default parameters, > which do not use the time zone at all. > I think this can be fixed with just a two-line change. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42898) Cast from string to date and date to string say timezone is needed, but it is not used
[ https://issues.apache.org/jira/browse/SPARK-42898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703720#comment-17703720 ] Apache Spark commented on SPARK-42898: -- User 'revans2' has created a pull request for this issue: https://github.com/apache/spark/pull/40524 > Cast from string to date and date to string say timezone is needed, but it is > not used > -- > > Key: SPARK-42898 > URL: https://issues.apache.org/jira/browse/SPARK-42898 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Robert Joseph Evans >Priority: Major > > This is really minor but SPARK-35581 removed the need for a timezone when > casting from a `StringType` to a `DateType`, but the patch didn't update the > `needsTimeZone` function to indicate that it was no longer required. > Currently, casting from a `DateType` to a `StringType` also says that it needs the > timezone, but it only uses the `DateFormatter` with its default parameters, > which do not use the time zone at all. > I think this can be fixed with just a two-line change. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42898) Cast from string to date and date to string say timezone is needed, but it is not used
[ https://issues.apache.org/jira/browse/SPARK-42898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42898: Assignee: Apache Spark > Cast from string to date and date to string say timezone is needed, but it is > not used > -- > > Key: SPARK-42898 > URL: https://issues.apache.org/jira/browse/SPARK-42898 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Robert Joseph Evans >Assignee: Apache Spark >Priority: Major > > This is really minor but SPARK-35581 removed the need for a timezone when > casting from a `StringType` to a `DateType`, but the patch didn't update the > `needsTimeZone` function to indicate that it was no longer required. > Currently, casting from a `DateType` to a `StringType` also says that it needs the > timezone, but it only uses the `DateFormatter` with its default parameters, > which do not use the time zone at all. > I think this can be fixed with just a two-line change. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42898) Cast from string to date and date to string say timezone is needed, but it is not used
Robert Joseph Evans created SPARK-42898: --- Summary: Cast from string to date and date to string say timezone is needed, but it is not used Key: SPARK-42898 URL: https://issues.apache.org/jira/browse/SPARK-42898 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: Robert Joseph Evans This is really minor but SPARK-35581 removed the need for a timezone when casting from a `StringType` to a `DateType`, but the patch didn't update the `needsTimeZone` function to indicate that it was no longer required. Currently, casting from a `DateType` to a `StringType` also says that it needs the timezone, but it only uses the `DateFormatter` with its default parameters, which do not use the time zone at all. I think this can be fixed with just a two-line change. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
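The claim that no time zone is needed can be illustrated outside Spark: a calendar date carries no time-of-day component, so formatting or parsing it yields the same text regardless of zone. A small sketch with Python's standard library (not Spark's `DateFormatter`):

```python
from datetime import date

# A date has no time-of-day, so rendering it as a string cannot depend on a
# time zone -- which is why the cast's timezone requirement is spurious.
d = date(2023, 3, 23)
assert d.isoformat() == "2023-03-23"

# The reverse direction (string to date) is equally zone-free.
parsed = date.fromisoformat("2023-03-23")
assert parsed == d
```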
[jira] [Assigned] (SPARK-42815) Subexpression elimination support shortcut expression
[ https://issues.apache.org/jira/browse/SPARK-42815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-42815: --- Assignee: XiDuo You > Subexpression elimination support shortcut expression > - > > Key: SPARK-42815 > URL: https://issues.apache.org/jira/browse/SPARK-42815 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Minor > Fix For: 3.5.0 > > > The subexpression may not need to eval even if it appears more than once. > e.g., {{{}if(or(a, and(b, b))){}}}, the expression {{b}} would be skipped if > {{a}} is true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
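The short-circuit case described above can be demonstrated in plain Python (illustrative only; Spark's subexpression elimination operates on Catalyst expressions, not Python callables):

```python
calls = []

def b() -> bool:
    # Instrumented so we can observe whether the subexpression ran at all.
    calls.append("b")
    return True

a = True
# Mirrors if(or(a, and(b, b))): when a is true, b is never evaluated, so
# hoisting the common subexpression b out unconditionally would do work
# that the original plan skips entirely.
result = a or (b() and b())

assert result is True
assert calls == []  # b was short-circuited away
```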
[jira] [Commented] (SPARK-42897) Avoid evaluate more than once for the variables from the left side in the FullOuter SMJ condition
[ https://issues.apache.org/jira/browse/SPARK-42897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703686#comment-17703686 ] Apache Spark commented on SPARK-42897: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/40523 > Avoid evaluate more than once for the variables from the left side in the > FullOuter SMJ condition > - > > Key: SPARK-42897 > URL: https://issues.apache.org/jira/browse/SPARK-42897 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wan Kun >Priority: Minor > > Codegen issue for FullOuter SMJ, for example: > {code} > val df1 = spark.range(5).select($"id".as("k1")) > val df2 = spark.range(10).select($"id".as("k2")) > df1.join(df2.hint("SHUFFLE_MERGE"), > $"k1" === $"k2" % 3 && $"k1" + 3 =!= $"k2" && $"k1" + 5 =!= $"k2", > "full_outer") > {code} > Both clauses of the join condition *$"k1" + 3 =!= $"k2" && $"k1" + 5 =!= $"k2"* > evaluate the variable *k1*, which causes codegen to fail. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42897) Avoid evaluate more than once for the variables from the left side in the FullOuter SMJ condition
[ https://issues.apache.org/jira/browse/SPARK-42897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42897: Assignee: Apache Spark > Avoid evaluate more than once for the variables from the left side in the > FullOuter SMJ condition > - > > Key: SPARK-42897 > URL: https://issues.apache.org/jira/browse/SPARK-42897 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wan Kun >Assignee: Apache Spark >Priority: Minor > > Codegen issue for FullOuter SMJ, for example: > {code} > val df1 = spark.range(5).select($"id".as("k1")) > val df2 = spark.range(10).select($"id".as("k2")) > df1.join(df2.hint("SHUFFLE_MERGE"), > $"k1" === $"k2" % 3 && $"k1" + 3 =!= $"k2" && $"k1" + 5 =!= $"k2", > "full_outer") > {code} > Both clauses of the join condition *$"k1" + 3 =!= $"k2" && $"k1" + 5 =!= $"k2"* > evaluate the variable *k1*, which causes codegen to fail. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42897) Avoid evaluate more than once for the variables from the left side in the FullOuter SMJ condition
[ https://issues.apache.org/jira/browse/SPARK-42897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42897: Assignee: (was: Apache Spark) > Avoid evaluate more than once for the variables from the left side in the > FullOuter SMJ condition > - > > Key: SPARK-42897 > URL: https://issues.apache.org/jira/browse/SPARK-42897 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wan Kun >Priority: Minor > > Codegen issue for FullOuter SMJ, for example: > {code} > val df1 = spark.range(5).select($"id".as("k1")) > val df2 = spark.range(10).select($"id".as("k2")) > df1.join(df2.hint("SHUFFLE_MERGE"), > $"k1" === $"k2" % 3 && $"k1" + 3 =!= $"k2" && $"k1" + 5 =!= $"k2", > "full_outer") > {code} > Both clauses of the join condition *$"k1" + 3 =!= $"k2" && $"k1" + 5 =!= $"k2"* > evaluate the variable *k1*, which causes codegen to fail. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42815) Subexpression elimination support shortcut expression
[ https://issues.apache.org/jira/browse/SPARK-42815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-42815. - Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40446 [https://github.com/apache/spark/pull/40446] > Subexpression elimination support shortcut expression > - > > Key: SPARK-42815 > URL: https://issues.apache.org/jira/browse/SPARK-42815 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Priority: Minor > Fix For: 3.5.0 > > > The subexpression may not need to eval even if it appears more than once. > e.g., {{{}if(or(a, and(b, b))){}}}, the expression {{b}} would be skipped if > {{a}} is true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42897) Avoid evaluate more than once for the variables from the left side in the FullOuter SMJ condition
Wan Kun created SPARK-42897: --- Summary: Avoid evaluate more than once for the variables from the left side in the FullOuter SMJ condition Key: SPARK-42897 URL: https://issues.apache.org/jira/browse/SPARK-42897 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Wan Kun Codegen issue for FullOuter SMJ, for example: {code} val df1 = spark.range(5).select($"id".as("k1")) val df2 = spark.range(10).select($"id".as("k2")) df1.join(df2.hint("SHUFFLE_MERGE"), $"k1" === $"k2" % 3 && $"k1" + 3 =!= $"k2" && $"k1" + 5 =!= $"k2", "full_outer") {code} Both clauses of the join condition *$"k1" + 3 =!= $"k2" && $"k1" + 5 =!= $"k2"* evaluate the variable *k1*, which causes codegen to fail. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
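A toy model of why double evaluation breaks the generated code (illustrative only, not Spark's codegen): the first condition emits the Java local for the left-side variable, and a second condition that emits it again would redeclare the same local, which does not compile. Caching by variable name makes the second use reuse the first declaration.

```python
# Illustrative declare-once model of generated-code variable materialization.
declared = set()

def materialize(var: str) -> str:
    if var in declared:
        return ""                      # already declared: reuse, emit nothing
    declared.add(var)
    return f"long {var} = leftRow.getLong(0);"  # hypothetical emitted Java

first = materialize("k1")   # emitted while compiling k1 + 3 =!= k2
second = materialize("k1")  # k1 + 5 =!= k2 must reuse, not redeclare

assert first != ""
assert second == ""
```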
[jira] [Commented] (SPARK-42101) Wrap InMemoryTableScanExec with QueryStage
[ https://issues.apache.org/jira/browse/SPARK-42101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703677#comment-17703677 ] Apache Spark commented on SPARK-42101: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/40522 > Wrap InMemoryTableScanExec with QueryStage > -- > > Key: SPARK-42101 > URL: https://issues.apache.org/jira/browse/SPARK-42101 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.5.0 > > > The first access to an AQE-enabled cached plan is tricky. Currently, > we cannot preserve its output partitioning and ordering. > The whole query plan also misses many optimizations in the AQE framework. Wrapping > InMemoryTableScanExec in a query stage resolves all these issues. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42896) Make `mapInPandas` / `mapInArrow` support barrier mode execution
[ https://issues.apache.org/jira/browse/SPARK-42896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703637#comment-17703637 ] Apache Spark commented on SPARK-42896: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/40520 > Make `mapInPandas` / `mapInArrow` support barrier mode execution > --- > > Key: SPARK-42896 > URL: https://issues.apache.org/jira/browse/SPARK-42896 > Project: Spark > Issue Type: New Feature > Components: Pandas API on Spark, PySpark, SQL >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Priority: Major > > Make `mapInPandas` / `mapInArrow` support barrier mode execution -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42896) Make `mapInPandas` / `mapInArrow` support barrier mode execution
[ https://issues.apache.org/jira/browse/SPARK-42896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42896: Assignee: (was: Apache Spark) > Make `mapInPandas` / `mapInArrow` support barrier mode execution > --- > > Key: SPARK-42896 > URL: https://issues.apache.org/jira/browse/SPARK-42896 > Project: Spark > Issue Type: New Feature > Components: Pandas API on Spark, PySpark, SQL >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Priority: Major > > Make `mapInPandas` / `mapInArrow` support barrier mode execution -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42896) Make `mapInPandas` / `mapInArrow` support barrier mode execution
[ https://issues.apache.org/jira/browse/SPARK-42896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42896: Assignee: Apache Spark > Make `mapInPandas` / `mapInArrow` support barrier mode execution > --- > > Key: SPARK-42896 > URL: https://issues.apache.org/jira/browse/SPARK-42896 > Project: Spark > Issue Type: New Feature > Components: Pandas API on Spark, PySpark, SQL >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Assignee: Apache Spark >Priority: Major > > Make `mapInPandas` / `mapInArrow` support barrier mode execution -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42896) Make `mapInPandas` / `mapInArrow` support barrier mode execution
Weichen Xu created SPARK-42896: -- Summary: Make `mapInPandas` / `mapInArrow` support barrier mode execution Key: SPARK-42896 URL: https://issues.apache.org/jira/browse/SPARK-42896 Project: Spark Issue Type: New Feature Components: Pandas API on Spark, PySpark, SQL Affects Versions: 3.5.0 Reporter: Weichen Xu Make `mapInPandas` / `mapInArrow` support barrier mode execution -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
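Barrier execution means every task in a stage launches together and can rendezvous, which distributed training frameworks rely on. A local sketch of that semantics using a thread barrier (illustrative only; the feature itself concerns Spark task scheduling, not threads):

```python
import threading

# Local illustration of barrier semantics: no task proceeds past the
# rendezvous point until all tasks in the stage have reached it.
NUM_TASKS = 4
barrier = threading.Barrier(NUM_TASKS)
order = []
lock = threading.Lock()

def task(i: int) -> None:
    barrier.wait()  # rendezvous: all NUM_TASKS arrive before any continues
    with lock:
        order.append(i)

threads = [threading.Thread(target=task, args=(i,)) for i in range(NUM_TASKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert sorted(order) == [0, 1, 2, 3]
```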
[jira] [Commented] (SPARK-42864) Review and fix issues in MLlib API docs
[ https://issues.apache.org/jira/browse/SPARK-42864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703615#comment-17703615 ] Apache Spark commented on SPARK-42864: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/40519 > Review and fix issues in MLlib API docs > --- > > Key: SPARK-42864 > URL: https://issues.apache.org/jira/browse/SPARK-42864 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42895) ValueError when invoking any session operations on a stopped Spark session
Allison Wang created SPARK-42895: Summary: ValueError when invoking any session operations on a stopped Spark session Key: SPARK-42895 URL: https://issues.apache.org/jira/browse/SPARK-42895 Project: Spark Issue Type: Bug Components: Connect Affects Versions: 3.5.0 Reporter: Allison Wang If a remote Spark session is stopped, trying to invoke any session operations will result in a ValueError. For example: {code:java} spark.stop() spark.sql("select 1") ValueError: Cannot invoke RPC: Channel closed! During handling of the above exception, another exception occurred: Traceback (most recent call last): ... return e.code() == grpc.StatusCode.UNAVAILABLE AttributeError: 'ValueError' object has no attribute 'code'{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42889) Implement cache, persist, unpersist, and storageLevel
[ https://issues.apache.org/jira/browse/SPARK-42889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703541#comment-17703541 ] Apache Spark commented on SPARK-42889: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/40518 > Implement cache, persist, unpersist, and storageLevel > - > > Key: SPARK-42889 > URL: https://issues.apache.org/jira/browse/SPARK-42889 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42874) Enable new golden file test framework for analysis for all input files
[ https://issues.apache.org/jira/browse/SPARK-42874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42874. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40496 [https://github.com/apache/spark/pull/40496] > Enable new golden file test framework for analysis for all input files > -- > > Key: SPARK-42874 > URL: https://issues.apache.org/jira/browse/SPARK-42874 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Assignee: Daniel >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42874) Enable new golden file test framework for analysis for all input files
[ https://issues.apache.org/jira/browse/SPARK-42874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42874: Assignee: Daniel > Enable new golden file test framework for analysis for all input files > -- > > Key: SPARK-42874 > URL: https://issues.apache.org/jira/browse/SPARK-42874 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Assignee: Daniel >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42893) Block Arrow-optimized Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-42893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42893: Assignee: Xinrong Meng > Block Arrow-optimized Python UDFs > - > > Key: SPARK-42893 > URL: https://issues.apache.org/jira/browse/SPARK-42893 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > > Considering the upcoming improvements to the result inconsistencies between > traditional pickled Python UDFs and Arrow-optimized Python UDFs, we'd better > block the feature for now; otherwise, users who try it out will face behavior > changes in the next release. > In addition, since the Spark Connect Python Client (SCPC) was introduced in > Spark 3.4, we'd better ensure the feature is ready in both vanilla PySpark > and SCPC at the same time for compatibility. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42893) Block Arrow-optimized Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-42893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42893. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40513 [https://github.com/apache/spark/pull/40513] > Block Arrow-optimized Python UDFs > - > > Key: SPARK-42893 > URL: https://issues.apache.org/jira/browse/SPARK-42893 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Considering the upcoming improvements to the result inconsistencies between > traditional pickled Python UDFs and Arrow-optimized Python UDFs, we'd better > block the feature; otherwise, users who try it out will face behavior changes > in the next release. > In addition, since the Spark Connect Python Client (SCPC) was introduced in > Spark 3.4, we'd better ensure the feature is ready in both vanilla PySpark > and SCPC at the same time for compatibility.
[jira] [Assigned] (SPARK-41233) High-order function: array_prepend
[ https://issues.apache.org/jira/browse/SPARK-41233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41233: Assignee: Takuya Ueshin > High-order function: array_prepend > -- > > Key: SPARK-41233 > URL: https://issues.apache.org/jira/browse/SPARK-41233 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.5.0 > > > Refer to > https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_prepend.html > 1. About the data type validation: > In Snowflake's array_append, array_prepend and array_insert functions, the > element data type does not need to match the data type of the existing > elements in the array. > In Spark, we want to apply the same data type validation as array_remove. > 2. About the NULL handling: > Currently, Spark SQL, SnowSQL and PostgreSQL deal with NULL values in > different ways. > The existing functions array_contains, array_position and array_remove in > Spark SQL handle NULL in this way: if the input array and/or element is NULL, > they return NULL. However, array_prepend should not follow this behavior. > We should implement the NULL handling in array_prepend in this way: > 2.1 If the array is NULL, return NULL; > 2.2 If the array is not NULL and the element is NULL, prepend the NULL value > into the array
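The NULL-handling rules 2.1 and 2.2 proposed above can be sketched as a plain-Python reference model (a hypothetical illustration of the intended semantics, not Spark's actual implementation, with None standing in for SQL NULL):

```python
def array_prepend(array, element):
    """Reference model of the proposed NULL handling for array_prepend.

    Rule 2.1: if the input array is NULL (None), return NULL.
    Rule 2.2: if the array is not NULL and the element is NULL,
              the NULL value is still prepended to the array.
    """
    if array is None:
        return None
    return [element] + array

# Rule 2.1: a NULL array yields a NULL result.
assert array_prepend(None, 1) is None
# Rule 2.2: a NULL element is prepended rather than swallowed.
assert array_prepend([2, 3], None) == [None, 2, 3]
# Ordinary case: the element lands at the front.
assert array_prepend([2, 3], 1) == [1, 2, 3]
```

Note how this differs from the existing array_contains/array_position/array_remove behavior described above, where a NULL element makes the whole result NULL.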
[jira] [Commented] (SPARK-42508) Extract the common .ml classes to `mllib-common`
[ https://issues.apache.org/jira/browse/SPARK-42508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703503#comment-17703503 ] Apache Spark commented on SPARK-42508: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/40517 > Extract the common .ml classes to `mllib-common` > > > Key: SPARK-42508 > URL: https://issues.apache.org/jira/browse/SPARK-42508 > Project: Spark > Issue Type: Sub-task > Components: Connect, ML >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.5.0 > >
[jira] [Resolved] (SPARK-40082) DAGScheduler may not schedule new stages when push-based shuffle is enabled
[ https://issues.apache.org/jira/browse/SPARK-40082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-40082. - Fix Version/s: 3.5.0 Assignee: Fencheng Mei Resolution: Fixed > DAGScheduler may not schedule new stages when push-based shuffle is > enabled > -- > > Key: SPARK-40082 > URL: https://issues.apache.org/jira/browse/SPARK-40082 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 3.1.1 >Reporter: Penglei Shi >Assignee: Fencheng Mei >Priority: Major > Fix For: 3.5.0 > > Attachments: missParentStages.png, shuffleMergeFinalized.png, > submitMissingTasks.png > > > With push-based shuffle enabled and speculative tasks present, a > shuffleMapStage is resubmitted once a fetchFailed occurs; its parent stages > are resubmitted first, which takes some time to compute. Before the > shuffleMapStage is resubmitted, all of its speculative tasks succeed and > register their map output, but the speculative-task success events cannot > trigger shuffleMergeFinalized because the stage has been removed from > runningStages. > When the stage is then resubmitted, the speculative tasks have already > registered the map output and there are no missing tasks to compute, so > resubmitting the stage does not trigger shuffleMergeFinalized either. > Eventually the stage's _shuffleMergedFinalized remains false. > AQE then submits the next stages, which depend on the shuffleMapStage that > hit the fetchFailed. In getMissingParentStages this stage is marked as > missing and is resubmitted, but the next stages are only added to > waitingStages after this stage finishes, so they are not submitted even > though the resubmission of this stage has completed. 
> I have only hit this a few times in my production environment and it is > difficult to reproduce.
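The race described in SPARK-40082 can be illustrated with a toy Python model (a deliberately simplified sketch, not Spark's DAGScheduler; all names here are hypothetical): merge finalization only fires for stages still in runningStages, so a stage resubmitted after its speculative tasks already registered all map output never finalizes.

```python
class ToyScheduler:
    """Toy model of the stuck-stage race: finalization events are only
    honored for stages present in running_stages."""

    def __init__(self):
        self.running_stages = set()
        self.merge_finalized = set()
        self.map_outputs = {}  # stage -> set of registered partitions

    def submit(self, stage, partitions):
        self.running_stages.add(stage)
        self.map_outputs.setdefault(stage, set())
        missing = partitions - self.map_outputs[stage]
        if not missing:
            # All outputs were already registered by speculative tasks:
            # no tasks run, so no success event will ever arrive.
            self.running_stages.discard(stage)

    def on_task_success(self, stage, partition, all_partitions):
        self.map_outputs[stage].add(partition)
        # Finalization only triggers for stages still running.
        if stage in self.running_stages and self.map_outputs[stage] == all_partitions:
            self.merge_finalized.add(stage)

s = ToyScheduler()
parts = {0, 1}
s.submit("shuffleMapStage", parts)
# fetchFailed: the stage is removed from running_stages before resubmission.
s.running_stages.discard("shuffleMapStage")
# Speculative tasks succeed while the stage is *not* running:
# map output is registered but finalization is skipped.
s.on_task_success("shuffleMapStage", 0, parts)
s.on_task_success("shuffleMapStage", 1, parts)
# Resubmission finds no missing tasks, so finalization never happens.
s.submit("shuffleMapStage", parts)
assert "shuffleMapStage" not in s.merge_finalized  # stuck, as described
```

The model shows why both paths fail: the success events arrive while the stage is outside running_stages, and the resubmission has no tasks left to produce new events.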