[jira] [Updated] (SPARK-41857) Enable test_between_function, test_datetime_functions, test_expr, test_math_functions, test_window_functions_cumulative_sum, test_corr, test_cov, test_crosstab, test_app
[ https://issues.apache.org/jira/browse/SPARK-41857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41857: -- Summary: Enable test_between_function, test_datetime_functions, test_expr, test_math_functions, test_window_functions_cumulative_sum, test_corr, test_cov, test_crosstab, test_approxQuantile (was: Enable test_between_function, test_datetime_functions, test_expr, test_function_parity, test_math_functions, test_window_functions_cumulative_sum, test_corr, test_cov, test_crosstab, test_approxQuantile) > Enable test_between_function, test_datetime_functions, test_expr, > test_math_functions, test_window_functions_cumulative_sum, test_corr, > test_cov, test_crosstab, test_approxQuantile > > > Key: SPARK-41857 > URL: https://issues.apache.org/jira/browse/SPARK-41857 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41857) Enable test_between_function, test_datetime_functions, test_expr, test_function_parity, test_math_functions, test_window_functions_cumulative_sum, test_corr, test_cov,
[ https://issues.apache.org/jira/browse/SPARK-41857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41857: Assignee: Apache Spark (was: Hyukjin Kwon) > Enable test_between_function, test_datetime_functions, test_expr, > test_function_parity, test_math_functions, > test_window_functions_cumulative_sum, test_corr, test_cov, test_crosstab, > test_approxQuantile > -- > > Key: SPARK-41857 > URL: https://issues.apache.org/jira/browse/SPARK-41857 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Apache Spark >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41857) Enable test_between_function, test_datetime_functions, test_expr, test_function_parity, test_math_functions, test_window_functions_cumulative_sum, test_corr, test_cov,
[ https://issues.apache.org/jira/browse/SPARK-41857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653829#comment-17653829 ] Apache Spark commented on SPARK-41857: -- User 'techaddict' has created a pull request for this issue: https://github.com/apache/spark/pull/39359 > Enable test_between_function, test_datetime_functions, test_expr, > test_function_parity, test_math_functions, > test_window_functions_cumulative_sum, test_corr, test_cov, test_crosstab, > test_approxQuantile > -- > > Key: SPARK-41857 > URL: https://issues.apache.org/jira/browse/SPARK-41857 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41857) Enable test_between_function, test_datetime_functions, test_expr, test_function_parity, test_math_functions, test_window_functions_cumulative_sum, test_corr, test_cov,
[ https://issues.apache.org/jira/browse/SPARK-41857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41857: Assignee: Hyukjin Kwon (was: Apache Spark) > Enable test_between_function, test_datetime_functions, test_expr, > test_function_parity, test_math_functions, > test_window_functions_cumulative_sum, test_corr, test_cov, test_crosstab, > test_approxQuantile > -- > > Key: SPARK-41857 > URL: https://issues.apache.org/jira/browse/SPARK-41857 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41857) Enable test_between_function, test_datetime_functions, test_expr, test_function_parity, test_math_functions, test_window_functions_cumulative_sum, test_corr, test_cov, t
[ https://issues.apache.org/jira/browse/SPARK-41857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41857: -- Summary: Enable test_between_function, test_datetime_functions, test_expr, test_function_parity, test_math_functions, test_window_functions_cumulative_sum, test_corr, test_cov, test_crosstab, test_approxQuantile (was: Enable 10 tests that pass) > Enable test_between_function, test_datetime_functions, test_expr, > test_function_parity, test_math_functions, > test_window_functions_cumulative_sum, test_corr, test_cov, test_crosstab, > test_approxQuantile > -- > > Key: SPARK-41857 > URL: https://issues.apache.org/jira/browse/SPARK-41857 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41857) Enable 10 tests that pass
Sandeep Singh created SPARK-41857: - Summary: Enable 10 tests that pass Key: SPARK-41857 URL: https://issues.apache.org/jira/browse/SPARK-41857 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh Assignee: Hyukjin Kwon Fix For: 3.4.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41856) Enable test_create_nan_decimal_dataframe, test_freqItems, test_input_files, test_toDF_with_schema_string, test_to_pandas_required_pandas_not_found
[ https://issues.apache.org/jira/browse/SPARK-41856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653819#comment-17653819 ] Apache Spark commented on SPARK-41856: -- User 'techaddict' has created a pull request for this issue: https://github.com/apache/spark/pull/39358 > Enable test_create_nan_decimal_dataframe, test_freqItems, test_input_files, > test_toDF_with_schema_string, test_to_pandas_required_pandas_not_found > -- > > Key: SPARK-41856 > URL: https://issues.apache.org/jira/browse/SPARK-41856 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > 5 tests pass now. Should enable them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41856) Enable test_create_nan_decimal_dataframe, test_freqItems, test_input_files, test_toDF_with_schema_string, test_to_pandas_required_pandas_not_found
[ https://issues.apache.org/jira/browse/SPARK-41856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41856: Assignee: Apache Spark (was: Hyukjin Kwon) > Enable test_create_nan_decimal_dataframe, test_freqItems, test_input_files, > test_toDF_with_schema_string, test_to_pandas_required_pandas_not_found > -- > > Key: SPARK-41856 > URL: https://issues.apache.org/jira/browse/SPARK-41856 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Apache Spark >Priority: Major > Fix For: 3.4.0 > > > 5 tests pass now. Should enable them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41856) Enable test_create_nan_decimal_dataframe, test_freqItems, test_input_files, test_toDF_with_schema_string, test_to_pandas_required_pandas_not_found
[ https://issues.apache.org/jira/browse/SPARK-41856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41856: Assignee: Hyukjin Kwon (was: Apache Spark) > Enable test_create_nan_decimal_dataframe, test_freqItems, test_input_files, > test_toDF_with_schema_string, test_to_pandas_required_pandas_not_found > -- > > Key: SPARK-41856 > URL: https://issues.apache.org/jira/browse/SPARK-41856 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > 5 tests pass now. Should enable them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41856) Enable test_create_nan_decimal_dataframe, test_freqItems, test_input_files, test_toDF_with_schema_string, test_to_pandas_required_pandas_not_found
[ https://issues.apache.org/jira/browse/SPARK-41856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41856: -- Description: 5 tests pass now. Should enable them. (was: These tests pass now. Should enable them.) > Enable test_create_nan_decimal_dataframe, test_freqItems, test_input_files, > test_toDF_with_schema_string, test_to_pandas_required_pandas_not_found > -- > > Key: SPARK-41856 > URL: https://issues.apache.org/jira/browse/SPARK-41856 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > 5 tests pass now. Should enable them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41856) Enable test_create_nan_decimal_dataframe, test_freqItems, test_input_files, test_toDF_with_schema_string, test_to_pandas_required_pandas_not_found
Sandeep Singh created SPARK-41856: - Summary: Enable test_create_nan_decimal_dataframe, test_freqItems, test_input_files, test_toDF_with_schema_string, test_to_pandas_required_pandas_not_found Key: SPARK-41856 URL: https://issues.apache.org/jira/browse/SPARK-41856 Project: Spark Issue Type: Sub-task Components: Connect, Tests Affects Versions: 3.4.0 Reporter: Sandeep Singh Assignee: Hyukjin Kwon Fix For: 3.4.0 These tests pass now. Should enable them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41677) Protobuf serializer for StreamingQueryProgressWrapper
[ https://issues.apache.org/jira/browse/SPARK-41677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653802#comment-17653802 ] Apache Spark commented on SPARK-41677: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/39357 > Protobuf serializer for StreamingQueryProgressWrapper > - > > Key: SPARK-41677 > URL: https://issues.apache.org/jira/browse/SPARK-41677 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41677) Protobuf serializer for StreamingQueryProgressWrapper
[ https://issues.apache.org/jira/browse/SPARK-41677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41677: Assignee: (was: Apache Spark) > Protobuf serializer for StreamingQueryProgressWrapper > - > > Key: SPARK-41677 > URL: https://issues.apache.org/jira/browse/SPARK-41677 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41677) Protobuf serializer for StreamingQueryProgressWrapper
[ https://issues.apache.org/jira/browse/SPARK-41677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41677: Assignee: Apache Spark > Protobuf serializer for StreamingQueryProgressWrapper > - > > Key: SPARK-41677 > URL: https://issues.apache.org/jira/browse/SPARK-41677 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41423) Protobuf serializer for StageDataWrapper
[ https://issues.apache.org/jira/browse/SPARK-41423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653798#comment-17653798 ] Apache Spark commented on SPARK-41423: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39356 > Protobuf serializer for StageDataWrapper > > > Key: SPARK-41423 > URL: https://issues.apache.org/jira/browse/SPARK-41423 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: BingKun Pan >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40263) Use interruptible lock instead of synchronized in TransportClientFactory.createClient()
[ https://issues.apache.org/jira/browse/SPARK-40263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653795#comment-17653795 ] Apache Spark commented on SPARK-40263: -- User 'techaddict' has created a pull request for this issue: https://github.com/apache/spark/pull/39355 > Use interruptible lock instead of synchronized in > TransportClientFactory.createClient() > --- > > Key: SPARK-40263 > URL: https://issues.apache.org/jira/browse/SPARK-40263 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Josh Rosen >Priority: Major > > Followup to SPARK-40235: we should apply a similar fix in > TransportClientFactory.createClient -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
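SPARK-40263 replaces a `synchronized` block, which cannot be interrupted while a thread is waiting to enter it, with an explicit lock; in Java the mechanism is `ReentrantLock.lockInterruptibly()`. The same idea can be sketched in Python (an illustrative model, not Spark code; all names here are invented): poll the lock with a timeout and abandon the wait as soon as a cancellation flag is set.

```python
import threading

class LockInterrupted(Exception):
    pass

def acquire_interruptibly(lock, cancelled, poll_secs=0.05):
    """Keep trying to take `lock`, but give up as soon as `cancelled` is set.
    Analogous in spirit to Java's ReentrantLock.lockInterruptibly(): a thread
    blocked entering a plain `synchronized` section ignores Thread.interrupt()
    until it obtains the monitor, whereas this wait can be abandoned."""
    while not lock.acquire(timeout=poll_secs):
        if cancelled.is_set():
            raise LockInterrupted("gave up waiting for the lock")

lock = threading.Lock()
cancelled = threading.Event()
outcome = []

lock.acquire()  # another "client creation" holds the lock indefinitely

def waiter():
    try:
        acquire_interruptibly(lock, cancelled)
        outcome.append("acquired")
        lock.release()
    except LockInterrupted:
        outcome.append("interrupted")

t = threading.Thread(target=waiter)
t.start()
cancelled.set()   # "interrupt" the waiting thread
t.join()
print(outcome)    # ['interrupted']
```

With a bare `synchronized` (or an unconditional `lock.acquire()`), the waiter would block until the holder released the lock; the timed-poll loop is what makes the wait cancellable.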
[jira] [Assigned] (SPARK-40263) Use interruptible lock instead of synchronized in TransportClientFactory.createClient()
[ https://issues.apache.org/jira/browse/SPARK-40263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40263: Assignee: (was: Apache Spark) > Use interruptible lock instead of synchronized in > TransportClientFactory.createClient() > --- > > Key: SPARK-40263 > URL: https://issues.apache.org/jira/browse/SPARK-40263 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Josh Rosen >Priority: Major > > Followup to SPARK-40235: we should apply a similar fix in > TransportClientFactory.createClient -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40263) Use interruptible lock instead of synchronized in TransportClientFactory.createClient()
[ https://issues.apache.org/jira/browse/SPARK-40263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40263: Assignee: Apache Spark > Use interruptible lock instead of synchronized in > TransportClientFactory.createClient() > --- > > Key: SPARK-40263 > URL: https://issues.apache.org/jira/browse/SPARK-40263 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Josh Rosen >Assignee: Apache Spark >Priority: Major > > Followup to SPARK-40235: we should apply a similar fix in > TransportClientFactory.createClient -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41656) Enable doctests in pyspark.sql.connect.dataframe
[ https://issues.apache.org/jira/browse/SPARK-41656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653787#comment-17653787 ] Apache Spark commented on SPARK-41656: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39354 > Enable doctests in pyspark.sql.connect.dataframe > > > Key: SPARK-41656 > URL: https://issues.apache.org/jira/browse/SPARK-41656 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41839) Implement SparkSession.sparkContext
[ https://issues.apache.org/jira/browse/SPARK-41839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41839. -- Resolution: Invalid I am resolving this because Spark Connect is not designed to support Spark Context or RDD API. > Implement SparkSession.sparkContext > --- > > Key: SPARK-41839 > URL: https://issues.apache.org/jira/browse/SPARK-41839 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41819) Implement Dataframe.rdd getNumPartitions
[ https://issues.apache.org/jira/browse/SPARK-41819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41819.
--
Resolution: Invalid

I am resolving this because Spark Connect is not designed to support the Spark Context or RDD API.

> Implement Dataframe.rdd getNumPartitions
> ----------------------------------------
>
> Key: SPARK-41819
> URL: https://issues.apache.org/jira/browse/SPARK-41819
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Priority: Major
>
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 243, in pyspark.sql.connect.dataframe.DataFrame.coalesce
> Failed example:
>     df.coalesce(1).rdd.getNumPartitions()
> Exception raised:
>     Traceback (most recent call last):
>       File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.dataframe.DataFrame.coalesce[0]>", line 1, in <module>
>         df.coalesce(1).rdd.getNumPartitions()
>     AttributeError: 'function' object has no attribute 'getNumPartitions'{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41658) Enable doctests in pyspark.sql.connect.functions
[ https://issues.apache.org/jira/browse/SPARK-41658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653784#comment-17653784 ] Apache Spark commented on SPARK-41658: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39354 > Enable doctests in pyspark.sql.connect.functions > > > Key: SPARK-41658 > URL: https://issues.apache.org/jira/browse/SPARK-41658 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Sandeep Singh >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41497) Accumulator undercounting in the case of retry task with rdd cache
[ https://issues.apache.org/jira/browse/SPARK-41497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653781#comment-17653781 ] wuyi commented on SPARK-41497:
--
> If I am not wrong, SQL makes very heavy use of accumulators, and so most stages will end up having them anyway - right?
Right.
> I would expect this scenario (even without accumulator) to be fairly low frequency enough that the cost of extra recomputation might be fine.
Agree. So shall we proceed with the improved Option 4 that you proposed, [~mridulm80]? [~ivoson] can help with the fix.

> Accumulator undercounting in the case of retry task with rdd cache
> ------------------------------------------------------------------
>
> Key: SPARK-41497
> URL: https://issues.apache.org/jira/browse/SPARK-41497
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.2, 3.3.1
> Reporter: wuyi
> Priority: Major
>
> An accumulator can be undercounted when a retried task has an rdd cache. See the example below; a complete and reproducible example is at [https://github.com/apache/spark/compare/master...Ngone51:spark:fix-acc]
>
> {code:scala}
> test("SPARK-XXX") {
>   // Set up a cluster with 2 executors
>   val conf = new SparkConf()
>     .setMaster("local-cluster[2, 1, 1024]").setAppName("TaskSchedulerImplSuite")
>   sc = new SparkContext(conf)
>   // Set up a custom task scheduler. The scheduler will fail the first task attempt of the job
>   // submitted below. In particular, the failed first attempt succeeds at the computation itself
>   // (accumulator accounting, result caching) but fails to report its success status due to a
>   // concurrent executor loss. The second task attempt succeeds.
>   taskScheduler = setupSchedulerWithCustomStatusUpdate(sc)
>   val myAcc = sc.longAccumulator("myAcc")
>   // Initiate an rdd with only one partition so there's only one task, and specify storage
>   // level MEMORY_ONLY_2 so that the rdd result is cached on both executors.
>   val rdd = sc.parallelize(0 until 10, 1).mapPartitions { iter =>
>     myAcc.add(100)
>     iter.map(x => x + 1)
>   }.persist(StorageLevel.MEMORY_ONLY_2)
>   // This passes since the second task attempt succeeds.
>   assert(rdd.count() === 10)
>   // This fails because `myAcc.add(100)` is not executed during the second task attempt:
>   // the second attempt loads the rdd cache directly instead of executing the task
>   // function, so `myAcc.add(100)` is skipped.
>   assert(myAcc.value === 100)
> }
> {code}
>
> We can also hit this issue with decommissioning, even if the rdd has only one copy. For example, decommissioning can migrate the rdd cache block to another executor (the result is effectively the same as having 2 copies) and the decommissioned executor is lost before the task reports its success status to the driver.
>
> The issue is more complicated to fix than expected. I have tried several fixes, but none of them is ideal:
> Option 1: Clean up any rdd cache related to the failed task. In practice, this already fixes the issue in most cases. However, theoretically, the rdd cache could be reported to the driver right after the driver cleans up the failed task's caches, due to asynchronous communication, so this option can't resolve the issue thoroughly.
> Option 2: Disallow rdd cache reuse across task attempts for the same task. This option fixes the issue 100%, but it also affects cases where the rdd cache could safely be reused across attempts (e.g., when there is no accumulator operation in the task), which can cause a performance regression.
> Option 3: Introduce an accumulator cache. First, this requires a new framework for supporting accumulator caching; second, the driver would need extra logic to distinguish whether the cached accumulator value should be reported to the user, to avoid overcounting. For example, in the case of a task retry the value should be reported, but in the case of rdd cache reuse it shouldn't be (should it?).
> Option 4: Validate task success when a task tries to load the rdd cache. This defines an rdd cache as valid/accessible only if the task has succeeded. It could be either overkill or a bit complex, because Spark currently cleans up task state once a task finishes, so we would need to maintain a structure recording whether a task once succeeded.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
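The undercounting mechanism described in SPARK-41497 can be sketched as a toy model in plain Python (not Spark APIs; all names are illustrative): a "task attempt" computes a partition, bumps a local accumulator, and writes a cache block, while the "driver" merges accumulator updates only from attempts that successfully report their status.

```python
# Toy model of the accumulator undercount: the cache write survives a lost
# status report, so the retry hits the cache and skips the accumulator add.

block_cache = {}   # stands in for the executors' rdd cache blocks

def run_task_attempt(partition, report_success=True):
    """Return (partition result, local accumulator update) for one attempt."""
    local_acc = 0
    if partition in block_cache:
        # Cache hit: the task body, including the accumulator add, is skipped.
        return block_cache[partition], local_acc
    result = [x + 1 for x in range(10)]   # mirrors iter.map(x => x + 1)
    local_acc += 100                      # mirrors myAcc.add(100)
    block_cache[partition] = result       # cache is written even though the
                                          # status report below is lost
    if not report_success:
        raise ConnectionError("executor lost before reporting success")
    return result, local_acc

accumulator = 0   # driver-side accumulator value

# Attempt 1: computes and caches, but the success report is lost, so the
# driver discards its accumulator update and schedules a retry.
try:
    run_task_attempt(0, report_success=False)
except ConnectionError:
    pass

# Attempt 2: hits the cache, so the accumulator add never runs again.
result, update = run_task_attempt(0)
accumulator += update

print(len(result))    # 10  -- the data itself is correct
print(accumulator)    # 0   -- undercounted: the user expected 100
```

Option 1 in the ticket corresponds to deleting the `block_cache` entry when attempt 1 fails; Option 4 corresponds to refusing the cache hit in attempt 2 because no attempt for that task has yet succeeded.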
[jira] [Resolved] (SPARK-41826) Implement Dataframe.readStream
[ https://issues.apache.org/jira/browse/SPARK-41826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41826.
--
Resolution: Duplicate

> Implement Dataframe.readStream
> ------------------------------
>
> Key: SPARK-41826
> URL: https://issues.apache.org/jira/browse/SPARK-41826
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Priority: Major
>
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line ?, in pyspark.sql.connect.dataframe.DataFrame.isStreaming
> Failed example:
>     df = spark.readStream.format("rate").load()
> Exception raised:
>     Traceback (most recent call last):
>       File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.dataframe.DataFrame.isStreaming[0]>", line 1, in <module>
>         df = spark.readStream.format("rate").load()
>     AttributeError: 'SparkSession' object has no attribute 'readStream'{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41234) High-order function: array_insert
[ https://issues.apache.org/jira/browse/SPARK-41234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653778#comment-17653778 ] Ruifeng Zheng commented on SPARK-41234:
--
Snowflake's ARRAY_INSERT results, for reference (NULL values render as "undefined" in array output):

{code:sql}
ARRAY_INSERT(ARRAY_CONSTRUCT(1,2,3), 0, 4)       => [ 4, 1, 2, 3 ]
ARRAY_INSERT(ARRAY_CONSTRUCT(1,2,3), 2, 4)       => [ 1, 2, 4, 3 ]
ARRAY_INSERT(ARRAY_CONSTRUCT(1,2,3), 2, NULL)    => [ 1, 2, undefined, 3 ]
ARRAY_INSERT(NULL, 2, 1)                         => NULL
ARRAY_INSERT(NULL, 2, NULL)                      => NULL
ARRAY_INSERT(ARRAY_CONSTRUCT(1,2,3), 10, 1)      => [ 1, 2, 3, undefined (x7), 1 ]
ARRAY_INSERT(ARRAY_CONSTRUCT(1,2,3), -10, 1)     => [ 1, undefined (x7), 1, 2, 3 ]
ARRAY_INSERT(ARRAY_CONSTRUCT(1,2,3), 10, NULL)   => [ 1, 2, 3, undefined (x8) ]
ARRAY_INSERT(ARRAY_CONSTRUCT(1,2,3), -10, NULL)  => [
{code}
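The quoted Snowflake behavior can be captured in a small Python model (illustrative only; this is not Spark's `array_insert` implementation, and the in-range-negative branch is an assumption not exercised by the output above): positions are 0-based, a NULL array yields NULL, and out-of-range positions pad the gap with "undefined" markers.

```python
def array_insert(arr, pos, elem, undef="undefined"):
    """Toy model of the Snowflake ARRAY_INSERT outputs quoted above.
    0-based positions; None arrays yield None; out-of-range positions
    pad with `undef` markers."""
    if arr is None:
        return None
    if pos >= 0:
        if pos <= len(arr):
            return arr[:pos] + [elem] + arr[pos:]
        # pad so the new element lands exactly at index `pos`
        return arr + [undef] * (pos - len(arr)) + [elem]
    if -pos <= len(arr):
        # in-range negative position: count back from the end
        # (assumption -- not exercised by the quoted output)
        i = len(arr) + pos
        return arr[:i] + [elem] + arr[i:]
    # pad so exactly -pos elements of the result follow the new element
    return [elem] + [undef] * (-pos - len(arr)) + arr

print(array_insert([1, 2, 3], 0, 4))    # [4, 1, 2, 3]
print(array_insert([1, 2, 3], 10, 1))   # element 1 lands at index 10, with padding
```

Note that an inserted NULL and the padding are distinct in the model (`None` vs `undef`), even though Snowflake prints both as "undefined".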
[jira] [Assigned] (SPARK-41311) Rewrite test RENAME_SRC_PATH_NOT_FOUND to trigger the error from user space
[ https://issues.apache.org/jira/browse/SPARK-41311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-41311: Assignee: Immanuel Buder > Rewrite test RENAME_SRC_PATH_NOT_FOUND to trigger the error from user space > --- > > Key: SPARK-41311 > URL: https://issues.apache.org/jira/browse/SPARK-41311 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Immanuel Buder >Assignee: Immanuel Buder >Priority: Minor > > Rewrite the test for error class *RENAME_SRC_PATH_NOT_FOUND* in > [QueryExecutionErrorsSuite.scala|https://github.com/apache/spark/pull/38782/files/bea17e4fa61a06ff566b0ff1c3fcc39fa1100912#diff-b1989c7e0e4b50291fb7bdd2993da387102c669a47fe6e2077b23b37d78b18be] > to trigger the error from user space. The current test uses non-user-facing > class FileSystemBasedCheckpointFileManager directly to trigger the error. > (see > [https://github.com/apache/spark/pull/38782/files/bea17e4fa61a06ff566b0ff1c3fcc39fa1100912#diff-b1989c7e0e4b50291fb7bdd2993da387102c669a47fe6e2077b23b37d78b18beR680] > ) > Done when: the test uses user-facing APIs as much as possible. > Proposed solution: rewrite the test following the example of > [https://github.com/apache/spark/pull/38782/files/bea17e4fa61a06ff566b0ff1c3fcc39fa1100912#diff-b1989c7e0e4b50291fb7bdd2993da387102c669a47fe6e2077b23b37d78b18beR641] > See [https://github.com/apache/spark/pull/38782#discussion_r1033013064] for > more context -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41311) Rewrite test RENAME_SRC_PATH_NOT_FOUND to trigger the error from user space
[ https://issues.apache.org/jira/browse/SPARK-41311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-41311. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39348 [https://github.com/apache/spark/pull/39348] > Rewrite test RENAME_SRC_PATH_NOT_FOUND to trigger the error from user space > --- > > Key: SPARK-41311 > URL: https://issues.apache.org/jira/browse/SPARK-41311 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Immanuel Buder >Assignee: Immanuel Buder >Priority: Minor > Fix For: 3.4.0 > > > Rewrite the test for error class *RENAME_SRC_PATH_NOT_FOUND* in > [QueryExecutionErrorsSuite.scala|https://github.com/apache/spark/pull/38782/files/bea17e4fa61a06ff566b0ff1c3fcc39fa1100912#diff-b1989c7e0e4b50291fb7bdd2993da387102c669a47fe6e2077b23b37d78b18be] > to trigger the error from user space. The current test uses non-user-facing > class FileSystemBasedCheckpointFileManager directly to trigger the error. > (see > [https://github.com/apache/spark/pull/38782/files/bea17e4fa61a06ff566b0ff1c3fcc39fa1100912#diff-b1989c7e0e4b50291fb7bdd2993da387102c669a47fe6e2077b23b37d78b18beR680] > ) > Done when: the test uses user-facing APIs as much as possible. > Proposed solution: rewrite the test following the example of > [https://github.com/apache/spark/pull/38782/files/bea17e4fa61a06ff566b0ff1c3fcc39fa1100912#diff-b1989c7e0e4b50291fb7bdd2993da387102c669a47fe6e2077b23b37d78b18beR641] > See [https://github.com/apache/spark/pull/38782#discussion_r1033013064] for > more context -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41661) Support for Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-41661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41661: Assignee: Xinrong Meng > Support for Python UDFs > --- > > Key: SPARK-41661 > URL: https://issues.apache.org/jira/browse/SPARK-41661 > Project: Spark > Issue Type: Umbrella > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Xinrong Meng >Priority: Major > > Spark Connect should support Python UDFs -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41651) Test parity: pyspark.sql.tests.test_dataframe
[ https://issues.apache.org/jira/browse/SPARK-41651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653775#comment-17653775 ] Hyukjin Kwon commented on SPARK-41651: -- cc [~techaddict] in case you're interested in this. We can do a similar approach by enabling some tests and/or fixing the skipping messages with new JIRAs > Test parity: pyspark.sql.tests.test_dataframe > - > > Key: SPARK-41651 > URL: https://issues.apache.org/jira/browse/SPARK-41651 > Project: Spark > Issue Type: Umbrella > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > After https://github.com/apache/spark/pull/39041 (SPARK-41528), we now reuse > the same test cases, see > {{python/pyspark/sql/tests/connect/test_parity_dataframe.py}}. > We should remove all the test cases defined there, and fix Spark Connect > behaviours accordingly. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41652) Test parity: pyspark.sql.tests.test_functions
[ https://issues.apache.org/jira/browse/SPARK-41652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653776#comment-17653776 ] Hyukjin Kwon commented on SPARK-41652: -- cc [~techaddict] in case you're interested in this. We can do a similar approach by enabling some tests and/or fixing the skipping messages with new JIRAs > Test parity: pyspark.sql.tests.test_functions > - > > Key: SPARK-41652 > URL: https://issues.apache.org/jira/browse/SPARK-41652 > Project: Spark > Issue Type: Umbrella > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > After https://github.com/apache/spark/pull/39041 (SPARK-41528), we now reuse > the same test cases, see > {{python/pyspark/sql/tests/connect/test_parity_functions.py}}. > We should remove all the test cases defined there, and fix Spark Connect > behaviours accordingly. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41658) Enable doctests in pyspark.sql.connect.functions
[ https://issues.apache.org/jira/browse/SPARK-41658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653774#comment-17653774 ] Hyukjin Kwon commented on SPARK-41658: -- Fixed in https://github.com/apache/spark/pull/39347 > Enable doctests in pyspark.sql.connect.functions > > > Key: SPARK-41658 > URL: https://issues.apache.org/jira/browse/SPARK-41658 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Sandeep Singh >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41658) Enable doctests in pyspark.sql.connect.functions
[ https://issues.apache.org/jira/browse/SPARK-41658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41658. -- Resolution: Fixed > Enable doctests in pyspark.sql.connect.functions > > > Key: SPARK-41658 > URL: https://issues.apache.org/jira/browse/SPARK-41658 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Sandeep Singh >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41653) Test parity: enable doctests in Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-41653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41653. -- Resolution: Done > Test parity: enable doctests in Spark Connect > - > > Key: SPARK-41653 > URL: https://issues.apache.org/jira/browse/SPARK-41653 > Project: Spark > Issue Type: Umbrella > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Sandeep Singh >Priority: Major > > We should actually run the doctests of Spark Connect. > We should add something like > https://github.com/apache/spark/blob/master/python/pyspark/sql/column.py#L1227-L1247 > to Spark Connect modules, and add the module into > https://github.com/apache/spark/blob/master/dev/sparktestsupport/modules.py#L507 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41658) Enable doctests in pyspark.sql.connect.functions
[ https://issues.apache.org/jira/browse/SPARK-41658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41658: Assignee: Sandeep Singh > Enable doctests in pyspark.sql.connect.functions > > > Key: SPARK-41658 > URL: https://issues.apache.org/jira/browse/SPARK-41658 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Sandeep Singh >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41807) Remove non-existent error class: UNSUPPORTED_FEATURE.DISTRIBUTE_BY
[ https://issues.apache.org/jira/browse/SPARK-41807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-41807: Assignee: BingKun Pan > Remove non-existent error class: UNSUPPORTED_FEATURE.DISTRIBUTE_BY > -- > > Key: SPARK-41807 > URL: https://issues.apache.org/jira/browse/SPARK-41807 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41807) Remove non-existent error class: UNSUPPORTED_FEATURE.DISTRIBUTE_BY
[ https://issues.apache.org/jira/browse/SPARK-41807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-41807. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39335 [https://github.com/apache/spark/pull/39335] > Remove non-existent error class: UNSUPPORTED_FEATURE.DISTRIBUTE_BY > -- > > Key: SPARK-41807 > URL: https://issues.apache.org/jira/browse/SPARK-41807 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41841) Support PyPI packaging without JVM
[ https://issues.apache.org/jira/browse/SPARK-41841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41841: Assignee: Apache Spark > Support PyPI packaging without JVM > -- > > Key: SPARK-41841 > URL: https://issues.apache.org/jira/browse/SPARK-41841 > Project: Spark > Issue Type: Sub-task > Components: Build, Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Blocker > > We should support pip install pyspark without JVM so Spark Connect can be a > real lightweight library. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41841) Support PyPI packaging without JVM
[ https://issues.apache.org/jira/browse/SPARK-41841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653770#comment-17653770 ] Apache Spark commented on SPARK-41841: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39353 > Support PyPI packaging without JVM > -- > > Key: SPARK-41841 > URL: https://issues.apache.org/jira/browse/SPARK-41841 > Project: Spark > Issue Type: Sub-task > Components: Build, Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Blocker > > We should support pip install pyspark without JVM so Spark Connect can be a > real lightweight library. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41841) Support PyPI packaging without JVM
[ https://issues.apache.org/jira/browse/SPARK-41841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41841: Assignee: (was: Apache Spark) > Support PyPI packaging without JVM > -- > > Key: SPARK-41841 > URL: https://issues.apache.org/jira/browse/SPARK-41841 > Project: Spark > Issue Type: Sub-task > Components: Build, Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Blocker > > We should support pip install pyspark without JVM so Spark Connect can be a > real lightweight library. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41854) Automatic reformat/check python/setup.py
[ https://issues.apache.org/jira/browse/SPARK-41854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41854. -- Fix Version/s: 3.4.0 Assignee: Hyukjin Kwon Resolution: Fixed Fixed in https://github.com/apache/spark/pull/39352 > Automatic reformat/check python/setup.py > - > > Key: SPARK-41854 > URL: https://issues.apache.org/jira/browse/SPARK-41854 > Project: Spark > Issue Type: Test > Components: Build, PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > python/setup.py should be also reformatted via ./dev/reformat-python -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41855) `createDataFrame` doesn't handle None/NaN properly
[ https://issues.apache.org/jira/browse/SPARK-41855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-41855: -- Description: {code:python} data = [Row(id=1, value=float("NaN")), Row(id=2, value=42.0), Row(id=3, value=None)] # +---+-+ # | id|value| # +---+-+ # | 1| NaN| # | 2| 42.0| # | 3| null| # +---+-+ cdf = self.connect.createDataFrame(data) sdf = self.spark.createDataFrame(data) print() print() print(cdf._show_string(100, 100, False)) print() print(cdf.schema) print() print(sdf._jdf.showString(100, 100, False)) print() print(sdf.schema) self.compare_by_show(cdf, sdf) {code} {code:java} +---+-+ | id|value| +---+-+ | 1| null| | 2| 42.0| | 3| null| +---+-+ StructType([StructField('id', LongType(), True), StructField('value', DoubleType(), True)]) +---+-+ | id|value| +---+-+ | 1| NaN| | 2| 42.0| | 3| null| +---+-+ StructType([StructField('id', LongType(), True), StructField('value', DoubleType(), True)]) {code} this issue is due to that `createDataFrame` can't handle None/NaN properly: 1, in the conversion from local data to pd.DataFrame, it automatically converts both None and NaN to NaN 2, then in the conversion from pd.DataFrame to pa.Table, it always converts NaN to null was: {code:python} data = [Row(id=1, value=float("NaN")), Row(id=2, value=42.0), Row(id=3, value=None)] # +---+-+ # | id|value| # +---+-+ # | 1| NaN| # | 2| 42.0| # | 3| null| # +---+-+ cdf = self.connect.createDataFrame(data) sdf = self.spark.createDataFrame(data) print() print() print(cdf._show_string(100, 100, False)) print() print(cdf.schema) print() print(sdf._jdf.showString(100, 100, False)) print() print(sdf.schema) self.compare_by_show(cdf, sdf) {code} {code:java} +---+-+ | id|value| +---+-+ | 1| null| | 2| 42.0| | 3| null| +---+-+ StructType([StructField('id', LongType(), True), StructField('value', DoubleType(), True)]) +---+-+ | id|value| +---+-+ | 1| NaN| | 2| 42.0| | 3| null| +---+-+ StructType([StructField('id', LongType(), True), 
StructField('value', DoubleType(), True)]) {code} this issue is due to that `createDataFrame` can't handle None/NaN properly: 1, in the conversion from local data to pd.DataFrame, it automatically converts None to NaN 2, then in the conversion from pd.DataFrame to pa.Table, it always converts NaN to null > `createDataFrame` doesn't handle None/NaN properly > -- > > Key: SPARK-41855 > URL: https://issues.apache.org/jira/browse/SPARK-41855 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > > {code:python} > data = [Row(id=1, value=float("NaN")), Row(id=2, value=42.0), > Row(id=3, value=None)] > # +---+-+ > # | id|value| > # +---+-+ > # | 1| NaN| > # | 2| 42.0| > # | 3| null| > # +---+-+ > cdf = self.connect.createDataFrame(data) > sdf = self.spark.createDataFrame(data) > print() > print() > print(cdf._show_string(100, 100, False)) > print() > print(cdf.schema) > print() > print(sdf._jdf.showString(100, 100, False)) > print() > print(sdf.schema) > self.compare_by_show(cdf, sdf) > {code} > {code:java} > +---+-+ > | id|value| > +---+-+ > | 1| null| > | 2| 42.0| > | 3| null| > +---+-+ > StructType([StructField('id', LongType(), True), StructField('value', > DoubleType(), True)]) > +---+-+ > | id|value| > +---+-+ > | 1| NaN| > | 2| 42.0| > | 3| null| > +---+-+ > StructType([StructField('id', LongType(), True), StructField('value', > DoubleType(), True)]) > {code} > this issue is due to that `createDataFrame` can't handle None/NaN properly: > 1, in the conversion from local data to pd.DataFrame, it automatically > converts both None and NaN to NaN > 2, then in the conversion from pd.DataFrame to pa.Table, it always converts > NaN to null -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41855) `createDataFrame` doesn't handle None/NaN properly
[ https://issues.apache.org/jira/browse/SPARK-41855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-41855: -- Description: {code:python} data = [Row(id=1, value=float("NaN")), Row(id=2, value=42.0), Row(id=3, value=None)] # +---+-+ # | id|value| # +---+-+ # | 1| NaN| # | 2| 42.0| # | 3| null| # +---+-+ cdf = self.connect.createDataFrame(data) sdf = self.spark.createDataFrame(data) print() print() print(cdf._show_string(100, 100, False)) print() print(cdf.schema) print() print(sdf._jdf.showString(100, 100, False)) print() print(sdf.schema) self.compare_by_show(cdf, sdf) {code} {code:java} +---+-+ | id|value| +---+-+ | 1| null| | 2| 42.0| | 3| null| +---+-+ StructType([StructField('id', LongType(), True), StructField('value', DoubleType(), True)]) +---+-+ | id|value| +---+-+ | 1| NaN| | 2| 42.0| | 3| null| +---+-+ StructType([StructField('id', LongType(), True), StructField('value', DoubleType(), True)]) {code} this issue is due to that `createDataFrame` can't handle None/NaN properly: 1, in the conversion from local data to pd.DataFrame, it automatically converts None to NaN 2, then in the conversion from pd.DataFrame to pa.Table, it always converts NaN to null was: {code:python} data = [Row(id=1, value=float("NaN")), Row(id=2, value=42.0), Row(id=3, value=None)] # +---+-+ # | id|value| # +---+-+ # | 1| NaN| # | 2| 42.0| # | 3| null| # +---+-+ cdf = self.connect.createDataFrame(data) sdf = self.spark.createDataFrame(data) print() print() print(cdf._show_string(100, 100, False)) print() print(cdf.schema) print() print(sdf._jdf.showString(100, 100, False)) print() print(sdf.schema) self.compare_by_show(cdf, sdf) {code} {code:java} +---+-+ | id|value| +---+-+ | 1| null| | 2| 42.0| | 3| null| +---+-+ StructType([StructField('id', LongType(), True), StructField('value', DoubleType(), True)]) +---+-+ | id|value| +---+-+ | 1| NaN| | 2| 42.0| | 3| null| +---+-+ StructType([StructField('id', LongType(), True), 
StructField('value', DoubleType(), True)]) {code} this issue is due to that `createDataFrame` can't handle None properly: 1, in the conversion from local data to pd.DataFrame, it automatically converts None to NaN 2, then in the conversion from pd.DataFrame to pa.Table, it always converts NaN to null > `createDataFrame` doesn't handle None/NaN properly > -- > > Key: SPARK-41855 > URL: https://issues.apache.org/jira/browse/SPARK-41855 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > > {code:python} > data = [Row(id=1, value=float("NaN")), Row(id=2, value=42.0), > Row(id=3, value=None)] > # +---+-+ > # | id|value| > # +---+-+ > # | 1| NaN| > # | 2| 42.0| > # | 3| null| > # +---+-+ > cdf = self.connect.createDataFrame(data) > sdf = self.spark.createDataFrame(data) > print() > print() > print(cdf._show_string(100, 100, False)) > print() > print(cdf.schema) > print() > print(sdf._jdf.showString(100, 100, False)) > print() > print(sdf.schema) > self.compare_by_show(cdf, sdf) > {code} > {code:java} > +---+-+ > | id|value| > +---+-+ > | 1| null| > | 2| 42.0| > | 3| null| > +---+-+ > StructType([StructField('id', LongType(), True), StructField('value', > DoubleType(), True)]) > +---+-+ > | id|value| > +---+-+ > | 1| NaN| > | 2| 42.0| > | 3| null| > +---+-+ > StructType([StructField('id', LongType(), True), StructField('value', > DoubleType(), True)]) > {code} > this issue is due to that `createDataFrame` can't handle None/NaN properly: > 1, in the conversion from local data to pd.DataFrame, it automatically > converts None to NaN > 2, then in the conversion from pd.DataFrame to pa.Table, it always converts > NaN to null -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41855) `createDataFrame` doesn't handle None/NaN properly
[ https://issues.apache.org/jira/browse/SPARK-41855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-41855: -- Summary: `createDataFrame` doesn't handle None/NaN properly (was: `createDataFrame` doesn't handle None properly) > `createDataFrame` doesn't handle None/NaN properly > -- > > Key: SPARK-41855 > URL: https://issues.apache.org/jira/browse/SPARK-41855 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > > {code:python} > data = [Row(id=1, value=float("NaN")), Row(id=2, value=42.0), > Row(id=3, value=None)] > # +---+-+ > # | id|value| > # +---+-+ > # | 1| NaN| > # | 2| 42.0| > # | 3| null| > # +---+-+ > cdf = self.connect.createDataFrame(data) > sdf = self.spark.createDataFrame(data) > print() > print() > print(cdf._show_string(100, 100, False)) > print() > print(cdf.schema) > print() > print(sdf._jdf.showString(100, 100, False)) > print() > print(sdf.schema) > self.compare_by_show(cdf, sdf) > {code} > {code:java} > +---+-+ > | id|value| > +---+-+ > | 1| null| > | 2| 42.0| > | 3| null| > +---+-+ > StructType([StructField('id', LongType(), True), StructField('value', > DoubleType(), True)]) > +---+-+ > | id|value| > +---+-+ > | 1| NaN| > | 2| 42.0| > | 3| null| > +---+-+ > StructType([StructField('id', LongType(), True), StructField('value', > DoubleType(), True)]) > {code} > this issue is due to that `createDataFrame` can't handle None properly: > 1, in the conversion from local data to pd.DataFrame, it automatically > converts None to NaN > 2, then in the conversion from pd.DataFrame to pa.Table, it always converts > NaN to null -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41855) `createDataFrame` doesn't handle None properly
Ruifeng Zheng created SPARK-41855: - Summary: `createDataFrame` doesn't handle None properly Key: SPARK-41855 URL: https://issues.apache.org/jira/browse/SPARK-41855 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng
{code:python}
data = [Row(id=1, value=float("NaN")), Row(id=2, value=42.0), Row(id=3, value=None)]
# +---+-----+
# | id|value|
# +---+-----+
# |  1|  NaN|
# |  2| 42.0|
# |  3| null|
# +---+-----+
cdf = self.connect.createDataFrame(data)
sdf = self.spark.createDataFrame(data)
print(cdf._show_string(100, 100, False))
print(cdf.schema)
print(sdf._jdf.showString(100, 100, False))
print(sdf.schema)
self.compare_by_show(cdf, sdf)
{code}
{code:java}
+---+-----+
| id|value|
+---+-----+
|  1| null|
|  2| 42.0|
|  3| null|
+---+-----+

StructType([StructField('id', LongType(), True), StructField('value', DoubleType(), True)])

+---+-----+
| id|value|
+---+-----+
|  1|  NaN|
|  2| 42.0|
|  3| null|
+---+-----+

StructType([StructField('id', LongType(), True), StructField('value', DoubleType(), True)])
{code}
This issue is because `createDataFrame` can't handle None properly: 1. in the conversion from local data to pd.DataFrame, it automatically converts None to NaN; 2. then in the conversion from pd.DataFrame to pa.Table, it always converts NaN to null. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
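The first conversion step described above can be reproduced with pandas alone. A minimal sketch, assuming pandas is installed: once the rows land in a float64 column, the original `None` is stored as NaN and the NaN/None distinction is already lost before any Arrow conversion happens.

```python
import math

import pandas as pd

# Rebuild the intermediate step: local rows -> pd.DataFrame.
# The "value" column becomes float64, and pandas represents the missing
# value (None) as NaN in that dtype.
pdf = pd.DataFrame({"id": [1, 2, 3], "value": [float("nan"), 42.0, None]})

assert str(pdf["value"].dtype) == "float64"
assert math.isnan(pdf["value"].iloc[0])  # was NaN in the input
assert math.isnan(pdf["value"].iloc[2])  # was None, now indistinguishable
```

The second step (pd.DataFrame to pa.Table) then treats every NaN in the column as null, which is why the Connect output above shows `null` where `NaN` was expected.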
[jira] [Commented] (SPARK-41854) Automatic reformat/check python/setup.py
[ https://issues.apache.org/jira/browse/SPARK-41854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653756#comment-17653756 ] Apache Spark commented on SPARK-41854: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39352 > Automatic reformat/check python/setup.py > - > > Key: SPARK-41854 > URL: https://issues.apache.org/jira/browse/SPARK-41854 > Project: Spark > Issue Type: Test > Components: Build, PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > python/setup.py should be also reformatted via ./dev/reformat-python -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41854) Automatic reformat/check python/setup.py
[ https://issues.apache.org/jira/browse/SPARK-41854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41854: Assignee: Apache Spark > Automatic reformat/check python/setup.py > - > > Key: SPARK-41854 > URL: https://issues.apache.org/jira/browse/SPARK-41854 > Project: Spark > Issue Type: Test > Components: Build, PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > python/setup.py should be also reformatted via ./dev/reformat-python -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41854) Automatic reformat/check python/setup.py
[ https://issues.apache.org/jira/browse/SPARK-41854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41854: Assignee: (was: Apache Spark) > Automatic reformat/check python/setup.py > - > > Key: SPARK-41854 > URL: https://issues.apache.org/jira/browse/SPARK-41854 > Project: Spark > Issue Type: Test > Components: Build, PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > python/setup.py should be also reformatted via ./dev/reformat-python -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41854) Automatic reformat/check python/setup.py
Hyukjin Kwon created SPARK-41854: Summary: Automatic reformat/check python/setup.py Key: SPARK-41854 URL: https://issues.apache.org/jira/browse/SPARK-41854 Project: Spark Issue Type: Test Components: Build, PySpark Affects Versions: 3.4.0 Reporter: Hyukjin Kwon python/setup.py should be also reformatted via ./dev/reformat-python -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39995) PySpark installation doesn't support Scala 2.13 binaries
[ https://issues.apache.org/jira/browse/SPARK-39995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653753#comment-17653753 ] Hyukjin Kwon commented on SPARK-39995: -- I think I will be able to pick this up before Spark 3.4. > PySpark installation doesn't support Scala 2.13 binaries > > > Key: SPARK-39995 > URL: https://issues.apache.org/jira/browse/SPARK-39995 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Oleksandr Shevchenko >Priority: Major > > [PyPi|https://pypi.org/project/pyspark/] doesn't support Spark binary > [installation|https://spark.apache.org/docs/latest/api/python/getting_started/install.html#using-pypi] > for Scala 2.13. > Currently, the setup > [script|https://github.com/apache/spark/blob/master/python/pyspark/install.py] > allows to set versions of Spark, Hadoop (PYSPARK_HADOOP_VERSION), and mirror > (PYSPARK_RELEASE_MIRROR) to download needed Spark binaries, but it's always > Scala 2.12 compatible binaries. There isn't any parameter to download > "spark-3.3.0-bin-hadoop3-scala2.13.tgz". > It's possible to download Spark manually and set the needed SPARK_HOME, but > it's hard to use with pip or Poetry. > Also, env vars (e.g. PYSPARK_HADOOP_VERSION) are easy to use with pip and CLI > but not possible with package managers like Poetry. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39995) PySpark installation doesn't support Scala 2.13 binaries
[ https://issues.apache.org/jira/browse/SPARK-39995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653752#comment-17653752 ] Hyukjin Kwon commented on SPARK-39995: -- For: {quote} Also, env vars (e.g. PYSPARK_HADOOP_VERSION) are easy to use with pip and CLI but not possible with package managers like Poetry. {quote} We can't do this because of the issue in pip itself, see SPARK-32837 > PySpark installation doesn't support Scala 2.13 binaries > > > Key: SPARK-39995 > URL: https://issues.apache.org/jira/browse/SPARK-39995 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Oleksandr Shevchenko >Priority: Major > > [PyPi|https://pypi.org/project/pyspark/] doesn't support Spark binary > [installation|https://spark.apache.org/docs/latest/api/python/getting_started/install.html#using-pypi] > for Scala 2.13. > Currently, the setup > [script|https://github.com/apache/spark/blob/master/python/pyspark/install.py] > allows to set versions of Spark, Hadoop (PYSPARK_HADOOP_VERSION), and mirror > (PYSPARK_RELEASE_MIRROR) to download needed Spark binaries, but it's always > Scala 2.12 compatible binaries. There isn't any parameter to download > "spark-3.3.0-bin-hadoop3-scala2.13.tgz". > It's possible to download Spark manually and set the needed SPARK_HOME, but > it's hard to use with pip or Poetry. > Also, env vars (e.g. PYSPARK_HADOOP_VERSION) are easy to use with pip and CLI > but not possible with package managers like Poetry. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41852) Fix `pmod` function
[ https://issues.apache.org/jira/browse/SPARK-41852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653750#comment-17653750 ] Sandeep Singh commented on SPARK-41852: --- [~podongfeng] these are from the doctests {code:java} >>> from pyspark.sql.functions import pmod >>> df = spark.createDataFrame([ ... (1.0, float('nan')), (float('nan'), 2.0), (10.0, 3.0), ... (float('nan'), float('nan')), (-3.0, 4.0), (-10.0, 3.0), ... (-5.0, -6.0), (7.0, -8.0), (1.0, 2.0)], ... ("a", "b")) >>> df.select(pmod("a", "b")).show() {code} > Fix `pmod` function > --- > > Key: SPARK-41852 > URL: https://issues.apache.org/jira/browse/SPARK-41852 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 622, in pyspark.sql.connect.functions.pmod > Failed example: > df.select(pmod("a", "b")).show() > Expected: > +--+ > |pmod(a, b)| > +--+ > | NaN| > | NaN| > | 1.0| > | NaN| > | 1.0| > | 2.0| > | -5.0| > | 7.0| > | 1.0| > +--+ > Got: > +--+ > |pmod(a, b)| > +--+ > | null| > | null| > | 1.0| > | null| > | 1.0| > | 2.0| > | -5.0| > | 7.0| > | 1.0| > +--+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
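The expected column in the doctest above follows pmod's positive-modulus semantics with NaN propagation. A plain-Python sketch of those semantics (not Spark's implementation) reproduces every expected row:

```python
import math

def pmod(a, b):
    # Sketch of Spark pmod semantics: NaN in either operand propagates to
    # NaN; otherwise take the truncated remainder (math.fmod) and shift a
    # negative result by the divisor, as the expected doctest output shows.
    if math.isnan(a) or math.isnan(b):
        return float("nan")
    r = math.fmod(a, b)
    return math.fmod(r + b, b) if r < 0 else r

pairs = [(1.0, float("nan")), (float("nan"), 2.0), (10.0, 3.0),
         (float("nan"), float("nan")), (-3.0, 4.0), (-10.0, 3.0),
         (-5.0, -6.0), (7.0, -8.0), (1.0, 2.0)]
print([pmod(a, b) for a, b in pairs])
# [nan, nan, 1.0, nan, 1.0, 2.0, -5.0, 7.0, 1.0] — matching the expected doctest
```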
[jira] [Commented] (SPARK-41851) Fix `nanvl` function
[ https://issues.apache.org/jira/browse/SPARK-41851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653751#comment-17653751 ] Sandeep Singh commented on SPARK-41851: --- [~podongfeng] {code:java} >>> df = spark.createDataFrame([(1.0, float('nan')), (float('nan'), 2.0)], >>> ("a", "b")) >>> df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, >>> df.b).alias("r2")).collect() {code} > Fix `nanvl` function > > > Key: SPARK-41851 > URL: https://issues.apache.org/jira/browse/SPARK-41851 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 313, in pyspark.sql.connect.functions.nanvl > Failed example: > df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, > df.b).alias("r2")).collect() > Expected: > [Row(r1=1.0, r2=1.0), Row(r1=2.0, r2=2.0)] > Got: > [Row(r1=1.0, r2=1.0), Row(r1=nan, r2=nan)]{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
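The expected doctest output follows nanvl's contract: keep the first argument unless it is NaN, then fall back to the second. A minimal plain-Python model of that contract, applied to the rows from the snippet above:

```python
import math

def nanvl(a, b):
    # nanvl(a, b) keeps a unless it is NaN, in which case it returns b.
    # Note it tests specifically for NaN, not for null/None — the failing
    # output above suggests the NaN inputs were turned into nulls upstream.
    return b if math.isnan(a) else a

rows = [(1.0, float("nan")), (float("nan"), 2.0)]
print([nanvl(a, b) for a, b in rows])  # [1.0, 2.0], the expected r1/r2 values
```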
[jira] [Commented] (SPARK-41815) Column.isNull returns nan instead of None
[ https://issues.apache.org/jira/browse/SPARK-41815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653748#comment-17653748 ] Ruifeng Zheng commented on SPARK-41815: --- similar to the issue in `createDataFrame` https://issues.apache.org/jira/browse/SPARK-41814 > Column.isNull returns nan instead of None > - > > Key: SPARK-41815 > URL: https://issues.apache.org/jira/browse/SPARK-41815 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > File "/.../spark/python/pyspark/sql/connect/column.py", line 99, in > pyspark.sql.connect.column.Column.isNull > Failed example: > df.filter(df.height.isNull()).collect() > Expected: > [Row(name='Alice', height=None)] > Got: > [Row(name='Alice', height=nan)] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
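The distinction the doctest depends on can be shown in plain Python: NaN is a legitimate floating-point value, while null/None marks a missing entry, and isNull should match only the latter.

```python
def is_null(x):
    # Sketch of Column.isNull semantics: true only for missing values,
    # never for NaN. If the client converts None to NaN while building the
    # DataFrame (as the linked ticket describes), isNull can no longer
    # find the row it should match.
    return x is None

values = [None, float("nan"), 1.5]
print([is_null(v) for v in values])  # [True, False, False]
```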
[jira] [Comment Edited] (SPARK-41814) Column.eqNullSafe fails on NaN comparison
[ https://issues.apache.org/jira/browse/SPARK-41814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653736#comment-17653736 ] Ruifeng Zheng edited comment on SPARK-41814 at 1/3/23 3:12 AM: --- this issue is due to that `createDataFrame` can't handle NaN/None properly: 1, the conversion from rows to pd.DataFrame, which automatically convert None to NaN 2, then the conversion from pd.DataFrame to pa.Table, which convert NaN to null was (Author: podongfeng): this issue is due to that `createDataFrame` can't handle NaN/None properly: 1, the conversion from rows to pd.DataFrame, which automatically convert null to NaN 2, then the conversion from pd.DataFrame to pa.Table, which convert NaN to null > Column.eqNullSafe fails on NaN comparison > - > > Key: SPARK-41814 > URL: https://issues.apache.org/jira/browse/SPARK-41814 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > File "/.../spark/python/pyspark/sql/connect/column.py", line 115, in > pyspark.sql.connect.column.Column.eqNullSafe > Failed example: > df2.select( > df2['value'].eqNullSafe(None), > df2['value'].eqNullSafe(float('NaN')), > df2['value'].eqNullSafe(42.0) > ).show() > Expected: > ++---++ > |(value <=> NULL)|(value <=> NaN)|(value <=> 42.0)| > ++---++ > | false| true| false| > | false| false|true| > |true| false| false| > ++---++ > Got: > ++---++ > |(value <=> NULL)|(value <=> NaN)|(value <=> 42.0)| > ++---++ > |true| false| false| > | false| false|true| > |true| false| false| > ++---++ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
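The expected table in the ticket follows Spark's null-safe equality (`<=>`): two nulls compare equal, a null never equals a non-null, and Spark treats NaN as equal to NaN. A plain-Python sketch of those semantics (None standing in for SQL null) reproduces the expected rows — and makes clear why collapsing None into NaN during `createDataFrame`, as described above, flips the first column:

```python
import math

def eq_null_safe(a, b):
    # Sketch of Spark's <=> operator: null <=> null is true, null against
    # anything else is false, and NaN compares equal to NaN (Spark SQL
    # semantics, unlike IEEE-754 equality).
    if a is None or b is None:
        return a is None and b is None
    if isinstance(a, float) and isinstance(b, float):
        if math.isnan(a) or math.isnan(b):
            return math.isnan(a) and math.isnan(b)
    return a == b

values = [float("nan"), 42.0, None]  # the rows in the doctest's df2
for v in values:
    print(eq_null_safe(v, None), eq_null_safe(v, float("nan")), eq_null_safe(v, 42.0))
# False True  False
# False False True
# True  False False   — the "Expected" table in the ticket
```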
[jira] [Commented] (SPARK-41851) Fix `nanvl` function
[ https://issues.apache.org/jira/browse/SPARK-41851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653746#comment-17653746 ] Ruifeng Zheng commented on SPARK-41851: --- could you please also provide the code to create the dataframe? a known issue is that `session.createDataFrame` doesn't handle NaN/None correctly. https://issues.apache.org/jira/browse/SPARK-41814 > Fix `nanvl` function > > > Key: SPARK-41851 > URL: https://issues.apache.org/jira/browse/SPARK-41851 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 313, in pyspark.sql.connect.functions.nanvl > Failed example: > df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, > df.b).alias("r2")).collect() > Expected: > [Row(r1=1.0, r2=1.0), Row(r1=2.0, r2=2.0)] > Got: > [Row(r1=1.0, r2=1.0), Row(r1=nan, r2=nan)]{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-41814) Column.eqNullSafe fails on NaN comparison
[ https://issues.apache.org/jira/browse/SPARK-41814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653736#comment-17653736 ] Ruifeng Zheng edited comment on SPARK-41814 at 1/3/23 3:06 AM: --- this issue is due to that `createDataFrame` can't handle NaN/None properly: 1, the conversion from rows to pd.DataFrame, which automatically convert null to NaN 2, then the conversion from pd.DataFrame to pa.Table, which convert NaN to null was (Author: podongfeng): this issue is due to: 1, the conversion from rows to pd.DataFrame, which automatically convert null to NaN 2, then the conversion from pd.DataFrame to pa.Table, which convert NaN to null > Column.eqNullSafe fails on NaN comparison > - > > Key: SPARK-41814 > URL: https://issues.apache.org/jira/browse/SPARK-41814 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > File "/.../spark/python/pyspark/sql/connect/column.py", line 115, in > pyspark.sql.connect.column.Column.eqNullSafe > Failed example: > df2.select( > df2['value'].eqNullSafe(None), > df2['value'].eqNullSafe(float('NaN')), > df2['value'].eqNullSafe(42.0) > ).show() > Expected: > ++---++ > |(value <=> NULL)|(value <=> NaN)|(value <=> 42.0)| > ++---++ > | false| true| false| > | false| false|true| > |true| false| false| > ++---++ > Got: > ++---++ > |(value <=> NULL)|(value <=> NaN)|(value <=> 42.0)| > ++---++ > |true| false| false| > | false| false|true| > |true| false| false| > ++---++ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41852) Fix `pmod` function
[ https://issues.apache.org/jira/browse/SPARK-41852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653743#comment-17653743 ] Ruifeng Zheng commented on SPARK-41852: --- could you please also provide the code to create the dataframe? a known issue is that `session.createDataFrame` doesn't handle NaN/None correctly. > Fix `pmod` function > --- > > Key: SPARK-41852 > URL: https://issues.apache.org/jira/browse/SPARK-41852 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 622, in pyspark.sql.connect.functions.pmod > Failed example: > df.select(pmod("a", "b")).show() > Expected: > +--+ > |pmod(a, b)| > +--+ > | NaN| > | NaN| > | 1.0| > | NaN| > | 1.0| > | 2.0| > | -5.0| > | 7.0| > | 1.0| > +--+ > Got: > +--+ > |pmod(a, b)| > +--+ > | null| > | null| > | 1.0| > | null| > | 1.0| > | 2.0| > | -5.0| > | 7.0| > | 1.0| > +--+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41853) Use Map in place of SortedMap for ErrorClassesJsonReader
[ https://issues.apache.org/jira/browse/SPARK-41853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653741#comment-17653741 ] Apache Spark commented on SPARK-41853: -- User 'tedyu' has created a pull request for this issue: https://github.com/apache/spark/pull/39351 > Use Map in place of SortedMap for ErrorClassesJsonReader > > > Key: SPARK-41853 > URL: https://issues.apache.org/jira/browse/SPARK-41853 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.2.3 >Reporter: Ted Yu >Priority: Minor > > The use of SortedMap in ErrorClassesJsonReader was mostly for making tests > easier to write. > This PR replaces SortedMap with Map since SortedMap is slower compared to Map. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41847) DataFrame mapfield, structlist invalid type
[ https://issues.apache.org/jira/browse/SPARK-41847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41847: -- Description: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1270, in pyspark.sql.connect.functions.explode Failed example: eDF.select(explode(eDF.mapfield).alias("key", "value")).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in eDF.select(explode(eDF.mapfield).alias("key", "value")).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type "STRUCT" while it's required to be "MAP". 
Plan: {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1364, in pyspark.sql.connect.functions.inline Failed example: df.select(inline(df.structlist)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(inline(df.structlist)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `structlist`.`element` is of type "ARRAY" while it's required to be "STRUCT". 
Plan: {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1411, in pyspark.sql.connect.functions.map_filter Failed example: df.select(map_filter( "data", lambda _, v: v > 30.0).alias("data_filtered") ).show(truncate=False) Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(map_filter( File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error)
[jira] [Assigned] (SPARK-41853) Use Map in place of SortedMap for ErrorClassesJsonReader
[ https://issues.apache.org/jira/browse/SPARK-41853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41853: Assignee: Apache Spark > Use Map in place of SortedMap for ErrorClassesJsonReader > > > Key: SPARK-41853 > URL: https://issues.apache.org/jira/browse/SPARK-41853 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.2.3 >Reporter: Ted Yu >Assignee: Apache Spark >Priority: Minor > > The use of SortedMap in ErrorClassesJsonReader was mostly for making tests > easier to write. > This PR replaces SortedMap with Map since SortedMap is slower compared to Map. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41853) Use Map in place of SortedMap for ErrorClassesJsonReader
[ https://issues.apache.org/jira/browse/SPARK-41853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41853: Assignee: (was: Apache Spark) > Use Map in place of SortedMap for ErrorClassesJsonReader > > > Key: SPARK-41853 > URL: https://issues.apache.org/jira/browse/SPARK-41853 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.2.3 >Reporter: Ted Yu >Priority: Minor > > The use of SortedMap in ErrorClassesJsonReader was mostly for making tests > easier to write. > This PR replaces SortedMap with Map since SortedMap is slower compared to Map. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41853) Use Map in place of SortedMap for ErrorClassesJsonReader
Ted Yu created SPARK-41853: -- Summary: Use Map in place of SortedMap for ErrorClassesJsonReader Key: SPARK-41853 URL: https://issues.apache.org/jira/browse/SPARK-41853 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 3.2.3 Reporter: Ted Yu The use of SortedMap in ErrorClassesJsonReader was mostly for making tests easier to write. This PR replaces SortedMap with Map since SortedMap is slower compared to Map. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41852) Fix `pmod` function
Sandeep Singh created SPARK-41852: - Summary: Fix `pmod` function Key: SPARK-41852 URL: https://issues.apache.org/jira/browse/SPARK-41852 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Sandeep Singh Fix For: 3.4.0 {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 313, in pyspark.sql.connect.functions.nanvl Failed example: df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, df.b).alias("r2")).collect() Expected: [Row(r1=1.0, r2=1.0), Row(r1=2.0, r2=2.0)] Got: [Row(r1=1.0, r2=1.0), Row(r1=nan, r2=nan)]{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41852) Fix `pmod` function
[ https://issues.apache.org/jira/browse/SPARK-41852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41852: -- Description: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 622, in pyspark.sql.connect.functions.pmod Failed example: df.select(pmod("a", "b")).show() Expected: +--+ |pmod(a, b)| +--+ | NaN| | NaN| | 1.0| | NaN| | 1.0| | 2.0| | -5.0| | 7.0| | 1.0| +--+ Got: +--+ |pmod(a, b)| +--+ | null| | null| | 1.0| | null| | 1.0| | 2.0| | -5.0| | 7.0| | 1.0| +--+ {code} was: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 313, in pyspark.sql.connect.functions.nanvl Failed example: df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, df.b).alias("r2")).collect() Expected: [Row(r1=1.0, r2=1.0), Row(r1=2.0, r2=2.0)] Got: [Row(r1=1.0, r2=1.0), Row(r1=nan, r2=nan)]{code} > Fix `pmod` function > --- > > Key: SPARK-41852 > URL: https://issues.apache.org/jira/browse/SPARK-41852 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 622, in pyspark.sql.connect.functions.pmod > Failed example: > df.select(pmod("a", "b")).show() > Expected: > +--+ > |pmod(a, b)| > +--+ > | NaN| > | NaN| > | 1.0| > | NaN| > | 1.0| > | 2.0| > | -5.0| > | 7.0| > | 1.0| > +--+ > Got: > +--+ > |pmod(a, b)| > +--+ > | null| > | null| > | 1.0| > | null| > | 1.0| > | 2.0| > | -5.0| > | 7.0| > | 1.0| > +--+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41848) Tasks are over-scheduled with TaskResourceProfile
[ https://issues.apache.org/jira/browse/SPARK-41848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-41848: - Priority: Blocker (was: Major) > Tasks are over-scheduled with TaskResourceProfile > - > > Key: SPARK-41848 > URL: https://issues.apache.org/jira/browse/SPARK-41848 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: wuyi >Priority: Blocker > > {code:java} > test("SPARK-XXX") { > val conf = new > SparkConf().setAppName("test").setMaster("local-cluster[1,4,1024]") > sc = new SparkContext(conf) > val req = new TaskResourceRequests().cpus(3) > val rp = new ResourceProfileBuilder().require(req).build() > val res = sc.parallelize(Seq(0, 1), 2).withResources(rp).map { x => > Thread.sleep(5000) > x * 2 > }.collect() > assert(res === Array(0, 2)) > } {code} > In this test, tasks are supposed to be scheduled in order since each task > requires 3 cores but the executor only has 4 cores. However, we noticed 2 > tasks are launched concurrently from the logs. > It turns out that we used the TaskResourceProfile (taskCpus=3) of the taskset > for task scheduling: > {code:java} > val rpId = taskSet.taskSet.resourceProfileId > val taskSetProf = sc.resourceProfileManager.resourceProfileFromId(rpId) > val taskCpus = ResourceProfile.getTaskCpusOrDefaultForProfile(taskSetProf, > conf) {code} > but the ResourceProfile (taskCpus=1) of the executor for updating the free > cores in ExecutorData: > {code:java} > val rpId = executorData.resourceProfileId > val prof = scheduler.sc.resourceProfileManager.resourceProfileFromId(rpId) > val taskCpus = ResourceProfile.getTaskCpusOrDefaultForProfile(prof, conf) > executorData.freeCores -= taskCpus {code} > which results in the inconsistency of the available cores. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
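The accounting mismatch in the report above can be modeled in a few lines of Python (a toy model, not Spark code): availability is checked against the task set's TaskResourceProfile (taskCpus=3), but `freeCores` is decremented using the executor's default profile (taskCpus=1), so a 4-core executor appears to fit both 3-core tasks at once.

```python
EXECUTOR_CORES = 4
SCHEDULING_TASK_CPUS = 3   # taskCpus from the task set's resource profile
BOOKKEEPING_TASK_CPUS = 1  # taskCpus from the executor's default profile (the bug)

def launchable_tasks(pending=2):
    free, launched = EXECUTOR_CORES, 0
    while launched < pending and free >= SCHEDULING_TASK_CPUS:  # scheduling check
        launched += 1
        free -= BOOKKEEPING_TASK_CPUS  # should subtract SCHEDULING_TASK_CPUS
    return launched

print(launchable_tasks())  # 2 — both tasks launch concurrently, over-committing cores
```

With consistent bookkeeping (subtracting 3), only one task would fit and the second would wait, which is the behavior the test expects.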
[jira] [Commented] (SPARK-41848) Tasks are over-scheduled with TaskResourceProfile
[ https://issues.apache.org/jira/browse/SPARK-41848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653739#comment-17653739 ] wuyi commented on SPARK-41848: -- cc [~ivoson] > Tasks are over-scheduled with TaskResourceProfile > - > > Key: SPARK-41848 > URL: https://issues.apache.org/jira/browse/SPARK-41848 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: wuyi >Priority: Major > > {code:java} > test("SPARK-XXX") { > val conf = new > SparkConf().setAppName("test").setMaster("local-cluster[1,4,1024]") > sc = new SparkContext(conf) > val req = new TaskResourceRequests().cpus(3) > val rp = new ResourceProfileBuilder().require(req).build() > val res = sc.parallelize(Seq(0, 1), 2).withResources(rp).map { x => > Thread.sleep(5000) > x * 2 > }.collect() > assert(res === Array(0, 2)) > } {code} > In this test, tasks are supposed to be scheduled in order since each task > requires 3 cores but the executor only has 4 cores. However, we noticed 2 > tasks are launched concurrently from the logs. > It turns out that we used the TaskResourceProfile (taskCpus=3) of the taskset > for task scheduling: > {code:java} > val rpId = taskSet.taskSet.resourceProfileId > val taskSetProf = sc.resourceProfileManager.resourceProfileFromId(rpId) > val taskCpus = ResourceProfile.getTaskCpusOrDefaultForProfile(taskSetProf, > conf) {code} > but the ResourceProfile (taskCpus=1) of the executor for updating the free > cores in ExecutorData: > {code:java} > val rpId = executorData.resourceProfileId > val prof = scheduler.sc.resourceProfileManager.resourceProfileFromId(rpId) > val taskCpus = ResourceProfile.getTaskCpusOrDefaultForProfile(prof, conf) > executorData.freeCores -= taskCpus {code} > which results in the inconsistency of the available cores. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41851) Fix `nanvl` function
Sandeep Singh created SPARK-41851: - Summary: Fix `nanvl` function Key: SPARK-41851 URL: https://issues.apache.org/jira/browse/SPARK-41851 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Sandeep Singh Fix For: 3.4.0 {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 801, in pyspark.sql.connect.functions.count Failed example: df.select(count(expr("*")), count(df.alphabets)).show() Expected: +++ |count(1)|count(alphabets)| +++ | 4| 3| +++ Got: +++ |count(alphabets)|count(alphabets)| +++ | 3| 3| +++ {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41851) Fix `nanvl` function
[ https://issues.apache.org/jira/browse/SPARK-41851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41851: -- Description: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 313, in pyspark.sql.connect.functions.nanvl Failed example: df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, df.b).alias("r2")).collect() Expected: [Row(r1=1.0, r2=1.0), Row(r1=2.0, r2=2.0)] Got: [Row(r1=1.0, r2=1.0), Row(r1=nan, r2=nan)]{code} was: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 801, in pyspark.sql.connect.functions.count Failed example: df.select(count(expr("*")), count(df.alphabets)).show() Expected: +++ |count(1)|count(alphabets)| +++ | 4| 3| +++ Got: +++ |count(alphabets)|count(alphabets)| +++ | 3| 3| +++ {code} > Fix `nanvl` function > > > Key: SPARK-41851 > URL: https://issues.apache.org/jira/browse/SPARK-41851 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 313, in pyspark.sql.connect.functions.nanvl > Failed example: > df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, > df.b).alias("r2")).collect() > Expected: > [Row(r1=1.0, r2=1.0), Row(r1=2.0, r2=2.0)] > Got: > [Row(r1=1.0, r2=1.0), Row(r1=nan, r2=nan)]{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41850) Fix `isnan` function
[ https://issues.apache.org/jira/browse/SPARK-41850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653738#comment-17653738 ] Sandeep Singh commented on SPARK-41850: --- This should be moved under SPARK-41283
> Fix `isnan` function
>
> Key: SPARK-41850
> URL: https://issues.apache.org/jira/browse/SPARK-41850
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Priority: Major
>
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 288, in pyspark.sql.connect.functions.isnan
> Failed example:
>     df.select("a", "b", isnan("a").alias("r1"), isnan(df.b).alias("r2")).show()
> Expected:
>     +---+---+-----+-----+
>     |  a|  b|   r1|   r2|
>     +---+---+-----+-----+
>     |1.0|NaN|false| true|
>     |NaN|2.0| true|false|
>     +---+---+-----+-----+
> Got:
>     +----+----+-----+-----+
>     |   a|   b|   r1|   r2|
>     +----+----+-----+-----+
>     | 1.0|null|false|false|
>     |null| 2.0|false|false|
>     +----+----+-----+-----+
> {code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41850) Fix `isnan` function
[ https://issues.apache.org/jira/browse/SPARK-41850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41850: -- Description:
{code:java}
File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 288, in pyspark.sql.connect.functions.isnan
Failed example:
    df.select("a", "b", isnan("a").alias("r1"), isnan(df.b).alias("r2")).show()
Expected:
    +---+---+-----+-----+
    |  a|  b|   r1|   r2|
    +---+---+-----+-----+
    |1.0|NaN|false| true|
    |NaN|2.0| true|false|
    +---+---+-----+-----+
Got:
    +----+----+-----+-----+
    |   a|   b|   r1|   r2|
    +----+----+-----+-----+
    | 1.0|null|false|false|
    |null| 2.0|false|false|
    +----+----+-----+-----+
{code}
was:
{code:java}
File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 276, in pyspark.sql.connect.functions.input_file_name
Failed example:
    df = spark.read.text(path)
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in
        df = spark.read.text(path)
    AttributeError: 'DataFrameReader' object has no attribute 'text'{code}
> Fix `isnan` function
>
> Key: SPARK-41850
> URL: https://issues.apache.org/jira/browse/SPARK-41850
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Priority: Major
>
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 288, in pyspark.sql.connect.functions.isnan
> Failed example:
>     df.select("a", "b", isnan("a").alias("r1"), isnan(df.b).alias("r2")).show()
> Expected:
>     +---+---+-----+-----+
>     |  a|  b|   r1|   r2|
>     +---+---+-----+-----+
>     |1.0|NaN|false| true|
>     |NaN|2.0| true|false|
>     +---+---+-----+-----+
> Got:
>     +----+----+-----+-----+
>     |   a|   b|   r1|   r2|
>     +----+----+-----+-----+
>     | 1.0|null|false|false|
>     |null| 2.0|false|false|
>     +----+----+-----+-----+
> {code}
[jira] [Updated] (SPARK-41850) Fix `isnan` function
[ https://issues.apache.org/jira/browse/SPARK-41850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41850: -- Summary: Fix `isnan` function (was: Fix DataFrameReader.isnan) > Fix `isnan` function > > > Key: SPARK-41850 > URL: https://issues.apache.org/jira/browse/SPARK-41850 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 276, in pyspark.sql.connect.functions.input_file_name > Failed example: > df = spark.read.text(path) > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line > 1, in > df = spark.read.text(path) > AttributeError: 'DataFrameReader' object has no attribute 'text'{code}
[jira] [Created] (SPARK-41850) Fix DataFrameReader.isnan
Sandeep Singh created SPARK-41850: - Summary: Fix DataFrameReader.isnan Key: SPARK-41850 URL: https://issues.apache.org/jira/browse/SPARK-41850 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 276, in pyspark.sql.connect.functions.input_file_name Failed example: df = spark.read.text(path) Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df = spark.read.text(path) AttributeError: 'DataFrameReader' object has no attribute 'text'{code}
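The distinction the `isnan` doctest above depends on (NaN is a concrete floating-point value, while null/None marks a missing one) can be sketched in plain Python, independent of Spark:

```python
import math

nan = float("nan")

# NaN is a real float value; isnan() detects it
assert math.isnan(nan)

# None (the Python counterpart of SQL null) is not a float at all,
# so a null check and a NaN check answer different questions
assert nan is not None

# NaN compares unequal even to itself, which is why a dedicated
# isnan() test exists instead of a plain equality check
assert nan != nan

print("NaN semantics hold")
```

This is why the "Got" output above, where NaN inputs arrive as null, makes every `isnan` call return false.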
[jira] [Commented] (SPARK-41814) Column.eqNullSafe fails on NaN comparison
[ https://issues.apache.org/jira/browse/SPARK-41814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653736#comment-17653736 ] Ruifeng Zheng commented on SPARK-41814: --- This issue is due to: 1) the conversion from rows to pd.DataFrame, which automatically converts null to NaN; 2) the subsequent conversion from pd.DataFrame to pa.Table, which converts NaN to null.
> Column.eqNullSafe fails on NaN comparison
> -
>
> Key: SPARK-41814
> URL: https://issues.apache.org/jira/browse/SPARK-41814
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> {code}
> File "/.../spark/python/pyspark/sql/connect/column.py", line 115, in pyspark.sql.connect.column.Column.eqNullSafe
> Failed example:
>     df2.select(
>         df2['value'].eqNullSafe(None),
>         df2['value'].eqNullSafe(float('NaN')),
>         df2['value'].eqNullSafe(42.0)
>     ).show()
> Expected:
>     +----------------+---------------+----------------+
>     |(value <=> NULL)|(value <=> NaN)|(value <=> 42.0)|
>     +----------------+---------------+----------------+
>     |           false|           true|           false|
>     |           false|          false|            true|
>     |            true|          false|           false|
>     +----------------+---------------+----------------+
> Got:
>     +----------------+---------------+----------------+
>     |(value <=> NULL)|(value <=> NaN)|(value <=> 42.0)|
>     +----------------+---------------+----------------+
>     |            true|          false|           false|
>     |           false|          false|            true|
>     |            true|          false|           false|
>     +----------------+---------------+----------------+
> {code}
[jira] [Created] (SPARK-41849) Implement DataFrameReader.text
Sandeep Singh created SPARK-41849: - Summary: Implement DataFrameReader.text Key: SPARK-41849 URL: https://issues.apache.org/jira/browse/SPARK-41849 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1270, in pyspark.sql.connect.functions.explode Failed example: eDF.select(explode(eDF.mapfield).alias("key", "value")).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in eDF.select(explode(eDF.mapfield).alias("key", "value")).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type "STRUCT" while it's required to be "MAP". 
Plan: {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1364, in pyspark.sql.connect.functions.inline Failed example: df.select(inline(df.structlist)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(inline(df.structlist)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `structlist`.`element` is of type "ARRAY" while it's required to be "STRUCT". Plan: {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41849) Implement DataFrameReader.text
[ https://issues.apache.org/jira/browse/SPARK-41849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41849: -- Description: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 276, in pyspark.sql.connect.functions.input_file_name Failed example: df = spark.read.text(path) Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df = spark.read.text(path) AttributeError: 'DataFrameReader' object has no attribute 'text'{code} was: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1270, in pyspark.sql.connect.functions.explode Failed example: eDF.select(explode(eDF.mapfield).alias("key", "value")).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in eDF.select(explode(eDF.mapfield).alias("key", "value")).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type "STRUCT" while it's required to be "MAP". Plan: {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1364, in pyspark.sql.connect.functions.inline Failed example: df.select(inline(df.structlist)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(inline(df.structlist)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `structlist`.`element` is of type "ARRAY" while it's required to be "STRUCT". 
Plan: {code} > Implement DataFrameReader.text > -- > > Key: SPARK-41849 > URL: https://issues.apache.org/jira/browse/SPARK-41849 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 276, in pyspark.sql.connect.functions.input_file_name > Failed example: > df = spark.read.text(path) > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", >
[jira] [Created] (SPARK-41848) Tasks are over-scheduled with TaskResourceProfile
wuyi created SPARK-41848: Summary: Tasks are over-scheduled with TaskResourceProfile Key: SPARK-41848 URL: https://issues.apache.org/jira/browse/SPARK-41848 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.4.0 Reporter: wuyi
{code:java}
test("SPARK-XXX") {
  val conf = new SparkConf().setAppName("test").setMaster("local-cluster[1,4,1024]")
  sc = new SparkContext(conf)
  val req = new TaskResourceRequests().cpus(3)
  val rp = new ResourceProfileBuilder().require(req).build()
  val res = sc.parallelize(Seq(0, 1), 2).withResources(rp).map { x =>
    Thread.sleep(5000)
    x * 2
  }.collect()
  assert(res === Array(0, 2))
}
{code}
In this test, tasks are supposed to be scheduled sequentially, since each task requires 3 cores and the executor has only 4 cores. However, the logs show that the 2 tasks are launched concurrently. It turns out that task scheduling uses the TaskResourceProfile of the task set (taskCpus=3):
{code:java}
val rpId = taskSet.taskSet.resourceProfileId
val taskSetProf = sc.resourceProfileManager.resourceProfileFromId(rpId)
val taskCpus = ResourceProfile.getTaskCpusOrDefaultForProfile(taskSetProf, conf)
{code}
but uses the ResourceProfile of the executor (taskCpus=1) when updating the free cores in ExecutorData:
{code:java}
val rpId = executorData.resourceProfileId
val prof = scheduler.sc.resourceProfileManager.resourceProfileFromId(rpId)
val taskCpus = ResourceProfile.getTaskCpusOrDefaultForProfile(prof, conf)
executorData.freeCores -= taskCpus
{code}
which results in inconsistent accounting of the available cores.
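The accounting mismatch described above can be sketched as a toy simulation (hypothetical names, not Spark's actual scheduler code): the availability check reads taskCpus from the task set's profile (3), while the deduction reads it from the executor's default profile (1), so both tasks pass the check.

```python
# Toy model of the bug: check and deduction use taskCpus from two
# different resource profiles (hypothetical constants, for illustration)
TASKSET_TASK_CPUS = 3    # from the TaskResourceProfile of the task set
EXECUTOR_TASK_CPUS = 1   # from the default profile of the executor

free_cores = 4
launched = 0
for _ in range(2):  # two pending tasks
    if free_cores >= TASKSET_TASK_CPUS:   # availability check uses 3
        launched += 1
        free_cores -= EXECUTOR_TASK_CPUS  # deduction only subtracts 1
print(launched)  # 2: both tasks launch, although 2 * 3 cores > 4 cores
```

With a consistent taskCpus of 3 for both the check and the deduction, the second iteration would find only 1 free core and the tasks would run one after the other.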
[jira] [Updated] (SPARK-41847) DataFrame mapfield,structlist invalid type
[ https://issues.apache.org/jira/browse/SPARK-41847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41847: -- Description: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1270, in pyspark.sql.connect.functions.explode Failed example: eDF.select(explode(eDF.mapfield).alias("key", "value")).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in eDF.select(explode(eDF.mapfield).alias("key", "value")).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type "STRUCT" while it's required to be "MAP". 
Plan: {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1364, in pyspark.sql.connect.functions.inline Failed example: df.select(inline(df.structlist)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(inline(df.structlist)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `structlist`.`element` is of type "ARRAY" while it's required to be "STRUCT". 
Plan: {code} was: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1270, in pyspark.sql.connect.functions.explode Failed example: eDF.select(explode(eDF.mapfield).alias("key", "value")).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in eDF.select(explode(eDF.mapfield).alias("key", "value")).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error)
[jira] [Updated] (SPARK-41847) DataFrame mapfield,structlist invalid type
[ https://issues.apache.org/jira/browse/SPARK-41847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41847: -- Summary: DataFrame mapfield,structlist invalid type (was: DataFrame mapfield invalid type) > DataFrame mapfield,structlist invalid type > -- > > Key: SPARK-41847 > URL: https://issues.apache.org/jira/browse/SPARK-41847 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 1270, in pyspark.sql.connect.functions.explode > Failed example: > eDF.select(explode(eDF.mapfield).alias("key", "value")).show() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > eDF.select(explode(eDF.mapfield).alias("key", "value")).show() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 534, in show > print(self._show_string(n, truncate, vertical)) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 423, in _show_string > ).toPandas() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1031, in toPandas > return self._session.client.to_pandas(query) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 413, in to_pandas > return self._execute_and_fetch(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 573, in _execute_and_fetch > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 619, in _handle_error > raise SparkConnectAnalysisException( > 
pyspark.sql.connect.client.SparkConnectAnalysisException: > [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type > "STRUCT" while it's required to be "MAP". > Plan: {code}
[jira] [Updated] (SPARK-41847) DataFrame mapfield invalid type
[ https://issues.apache.org/jira/browse/SPARK-41847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41847: -- Description: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1270, in pyspark.sql.connect.functions.explode Failed example: eDF.select(explode(eDF.mapfield).alias("key", "value")).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in eDF.select(explode(eDF.mapfield).alias("key", "value")).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type "STRUCT" while it's required to be "MAP". 
Plan: {code} was: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1098, in pyspark.sql.connect.functions.rank Failed example: df.withColumn("drank", rank().over(w)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.withColumn("drank", rank().over(w)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `value` cannot be resolved. Did you mean one of the following? 
[`_1`] Plan: 'Project [_1#4000L, rank() windowspecdefinition('value ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS drank#4003] +- Project [0#3998L AS _1#4000L] +- LocalRelation [0#3998L] {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1032, in pyspark.sql.connect.functions.cume_dist Failed example: df.withColumn("cd", cume_dist().over(w)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.withColumn("cd", cume_dist().over(w)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py",
[jira] [Created] (SPARK-41847) DataFrame mapfield invalid type
Sandeep Singh created SPARK-41847:
-------------------------------------

             Summary: DataFrame mapfield invalid type
                 Key: SPARK-41847
                 URL: https://issues.apache.org/jira/browse/SPARK-41847
             Project: Spark
          Issue Type: Sub-task
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Sandeep Singh


{code:java}
File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1098, in pyspark.sql.connect.functions.rank
Failed example:
    df.withColumn("drank", rank().over(w)).show()
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in
        df.withColumn("drank", rank().over(w)).show()
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show
        print(self._show_string(n, truncate, vertical))
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string
        ).toPandas()
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas
        return self._execute_and_fetch(req)
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `value` cannot be resolved. Did you mean one of the following? [`_1`]
    Plan: 'Project [_1#4000L, rank() windowspecdefinition('value ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS drank#4003]
    +- Project [0#3998L AS _1#4000L]
       +- LocalRelation [0#3998L]
{code}
{code:java}
File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1032, in pyspark.sql.connect.functions.cume_dist
Failed example:
    df.withColumn("cd", cume_dist().over(w)).show()
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in
        df.withColumn("cd", cume_dist().over(w)).show()
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show
        print(self._show_string(n, truncate, vertical))
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string
        ).toPandas()
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas
        return self._execute_and_fetch(req)
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `value` cannot be resolved. Did you mean one of the following? [`_1`]
    Plan: 'Project [_1#2202L, cume_dist() windowspecdefinition('value ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS cd#2205]
    +- Project [0#2200L AS _1#2202L]
       +- LocalRelation [0#2200L]
{code}


--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41846) DataFrame windowspec functions : unresolved columns
[ https://issues.apache.org/jira/browse/SPARK-41846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandeep Singh updated SPARK-41846:
----------------------------------
    Summary: DataFrame windowspec functions : unresolved columns  (was: DataFrame aggregation functions : unresolved columns)

> DataFrame windowspec functions : unresolved columns
> ---
>
>                 Key: SPARK-41846
>                 URL: https://issues.apache.org/jira/browse/SPARK-41846
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Sandeep Singh
>            Priority: Major
>
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1098, in pyspark.sql.connect.functions.rank
> Failed example:
>     df.withColumn("drank", rank().over(w)).show()
> Exception raised:
>     Traceback (most recent call last):
>       File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in
>         df.withColumn("drank", rank().over(w)).show()
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show
>         print(self._show_string(n, truncate, vertical))
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string
>         ).toPandas()
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas
>         return self._execute_and_fetch(req)
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch
>         self._handle_error(rpc_error)
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error
>         raise SparkConnectAnalysisException(
>     pyspark.sql.connect.client.SparkConnectAnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `value` cannot be resolved. Did you mean one of the following? [`_1`]
>     Plan: 'Project [_1#4000L, rank() windowspecdefinition('value ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS drank#4003]
>     +- Project [0#3998L AS _1#4000L]
>        +- LocalRelation [0#3998L]
> {code}
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1032, in pyspark.sql.connect.functions.cume_dist
> Failed example:
>     df.withColumn("cd", cume_dist().over(w)).show()
> Exception raised:
>     Traceback (most recent call last):
>       File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in
>         df.withColumn("cd", cume_dist().over(w)).show()
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show
>         print(self._show_string(n, truncate, vertical))
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string
>         ).toPandas()
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas
>         return self._execute_and_fetch(req)
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch
>         self._handle_error(rpc_error)
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error
>         raise SparkConnectAnalysisException(
>     pyspark.sql.connect.client.SparkConnectAnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `value` cannot be resolved. Did you mean one of the following? [`_1`]
>     Plan: 'Project [_1#2202L, cume_dist() windowspecdefinition('value ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS cd#2205]
>     +- Project [0#2200L AS _1#2202L]
>        +- LocalRelation [0#2200L]
> {code}
[jira] [Updated] (SPARK-41846) DataFrame aggregation functions : unresolved columns
[ https://issues.apache.org/jira/browse/SPARK-41846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41846: -- Description: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1098, in pyspark.sql.connect.functions.rank Failed example: df.withColumn("drank", rank().over(w)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.withColumn("drank", rank().over(w)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `value` cannot be resolved. Did you mean one of the following? 
[`_1`] Plan: 'Project [_1#4000L, rank() windowspecdefinition('value ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS drank#4003] +- Project [0#3998L AS _1#4000L] +- LocalRelation [0#3998L] {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1032, in pyspark.sql.connect.functions.cume_dist Failed example: df.withColumn("cd", cume_dist().over(w)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.withColumn("cd", cume_dist().over(w)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `value` cannot be resolved. Did you mean one of the following? 
[`_1`] Plan: 'Project [_1#2202L, cume_dist() windowspecdefinition('value ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS cd#2205] +- Project [0#2200L AS _1#2202L] +- LocalRelation [0#2200L] {code} was: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1098, in pyspark.sql.connect.functions.rank Failed example: df.withColumn("drank", rank().over(w)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.withColumn("drank", rank().over(w)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File
[jira] [Updated] (SPARK-39853) Support stage level schedule for standalone cluster when dynamic allocation is disabled
[ https://issues.apache.org/jira/browse/SPARK-39853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wuyi updated SPARK-39853:
-------------------------
    Fix Version/s: 3.4.0

> Support stage level schedule for standalone cluster when dynamic allocation is disabled
> ---
>
>                 Key: SPARK-39853
>                 URL: https://issues.apache.org/jira/browse/SPARK-39853
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.3.0
>            Reporter: huangtengfei
>            Assignee: huangtengfei
>            Priority: Major
>             Fix For: 3.4.0
>
> [SPARK-39062|https://issues.apache.org/jira/browse/SPARK-39062] added stage-level schedule support for standalone clusters when dynamic allocation is enabled: Spark requests executors for the different resource profiles. When dynamic allocation is disabled, we can also leverage stage-level scheduling to schedule tasks, based on each stage's resource profile (task resource requests), onto executors with the default resource profile.
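The description above boils down to a fit check: with dynamic allocation off, every executor carries the default resource profile, so a stage's task resource requests are matched against those fixed executor resources instead of triggering new executor requests. A rough plain-Python sketch of that check (all names and numbers are hypothetical, not Spark's actual scheduler code):

```python
# Hypothetical sketch: can a task with the given resource-profile requests
# be scheduled onto an executor that has the default (fixed) resources?
def fits_default_executor(task_reqs, executor_resources):
    """True if every requested resource amount fits the executor."""
    return all(executor_resources.get(name, 0) >= amount
               for name, amount in task_reqs.items())

default_executor = {"cores": 4, "gpu": 1}  # assumed default profile

assert fits_default_executor({"cores": 2, "gpu": 1}, default_executor)
assert not fits_default_executor({"cores": 2, "gpu": 2}, default_executor)
assert not fits_default_executor({"fpga": 1}, default_executor)
```

In the real scheduler the same idea decides whether tasks from a stage with a custom resource profile can run on default-profile executors at all.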
[jira] [Updated] (SPARK-41846) DataFrame aggregation functions : unresolved columns
[ https://issues.apache.org/jira/browse/SPARK-41846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41846: -- Description: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1098, in pyspark.sql.connect.functions.rank Failed example: df.withColumn("drank", rank().over(w)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.withColumn("drank", rank().over(w)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `value` cannot be resolved. Did you mean one of the following? 
[`_1`] Plan: 'Project [_1#4000L, rank() windowspecdefinition('value ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS drank#4003] +- Project [0#3998L AS _1#4000L] +- LocalRelation [0#3998L] {code} was: {code} File "/.../spark/python/pyspark/sql/connect/column.py", line 106, in pyspark.sql.connect.column.Column.eqNullSafe Failed example: df1.join(df2, df1["value"] == df2["value"]).count() Exception raised: Traceback (most recent call last): File "/.../miniconda3/envs/python3.9/lib/python3.9/doctest.py", line 1336, in __run exec(compile(example.source, filename, "single", File "", line 1, in df1.join(df2, df1["value"] == df2["value"]).count() File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 151, in count pdd = self.agg(_invoke_function("count", lit(1))).toPandas() File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/.../spark/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/.../spark/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/.../spark/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [AMBIGUOUS_REFERENCE] Reference `value` is ambiguous, could be: [`value`, `value`]. 
{code} > DataFrame aggregation functions : unresolved columns > > > Key: SPARK-41846 > URL: https://issues.apache.org/jira/browse/SPARK-41846 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 1098, in pyspark.sql.connect.functions.rank > Failed example: > df.withColumn("drank", rank().over(w)).show() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > df.withColumn("drank", rank().over(w)).show() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 534, in show > print(self._show_string(n, truncate, vertical)) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 423, in _show_string > ).toPandas() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1031, in toPandas > return self._session.client.to_pandas(query) > File
[jira] [Created] (SPARK-41846) DataFrame aggregation functions : unresolved columns
Sandeep Singh created SPARK-41846:
-------------------------------------

             Summary: DataFrame aggregation functions : unresolved columns
                 Key: SPARK-41846
                 URL: https://issues.apache.org/jira/browse/SPARK-41846
             Project: Spark
          Issue Type: Sub-task
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Sandeep Singh


{code}
File "/.../spark/python/pyspark/sql/connect/column.py", line 106, in pyspark.sql.connect.column.Column.eqNullSafe
Failed example:
    df1.join(df2, df1["value"] == df2["value"]).count()
Exception raised:
    Traceback (most recent call last):
      File "/.../miniconda3/envs/python3.9/lib/python3.9/doctest.py", line 1336, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in
        df1.join(df2, df1["value"] == df2["value"]).count()
      File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 151, in count
        pdd = self.agg(_invoke_function("count", lit(1))).toPandas()
      File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File "/.../spark/python/pyspark/sql/connect/client.py", line 413, in to_pandas
        return self._execute_and_fetch(req)
      File "/.../spark/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File "/.../spark/python/pyspark/sql/connect/client.py", line 619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: [AMBIGUOUS_REFERENCE] Reference `value` is ambiguous, could be: [`value`, `value`].
{code}
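The AMBIGUOUS_REFERENCE error above arises because, after the join, both input relations contribute a column named `value`, so name lookup finds two candidates and cannot pick one. A plain-Python sketch of that resolution logic (hypothetical, not Spark's analyzer code):

```python
# Hypothetical sketch of column-name resolution after a join: a name that
# matches columns from both sides of the join is ambiguous.
def resolve(name, columns):
    matches = [c for c in columns if c == name]
    if not matches:
        raise KeyError(f"[UNRESOLVED_COLUMN] `{name}`")
    if len(matches) > 1:
        raise ValueError(
            f"[AMBIGUOUS_REFERENCE] Reference `{name}` is ambiguous, "
            f"could be: {matches}")
    return matches[0]

columns_after_join = ["value", "value"]  # one from df1, one from df2
try:
    resolve("value", columns_after_join)
    ambiguous = False
except ValueError:
    ambiguous = True
assert ambiguous
assert resolve("id", ["id", "value"]) == "id"  # a unique name resolves fine
```

In PySpark the usual way to avoid this is to alias each side before joining (e.g. `df1.alias("a")`, `df2.alias("b")`, then refer to `col("a.value")`).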
[jira] [Commented] (SPARK-37677) spark on k8s, when the user want to push python3.6.6.zip to the pod , but no permission to execute
[ https://issues.apache.org/jira/browse/SPARK-37677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653733#comment-17653733 ]

jingxiong zhong commented on SPARK-37677:
-----------------------------------------

This has been fixed for Hadoop 3.3.5, which has not been released yet; Spark will need to upgrade its Hadoop version to pick up the fix. [~valux]

> spark on k8s, when the user want to push python3.6.6.zip to the pod , but no permission to execute
> --
>
>                 Key: SPARK-37677
>                 URL: https://issues.apache.org/jira/browse/SPARK-37677
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.2.0
>            Reporter: jingxiong zhong
>            Priority: Major
>
> In cluster mode, when I unzip python3.6.6.zip in the pod, there is no permission to execute it. My submit command is as follows:
> {code:sh}
> spark-submit \
> --archives ./python3.6.6.zip#python3.6.6 \
> --conf "spark.pyspark.python=python3.6.6/python3.6.6/bin/python3" \
> --conf "spark.pyspark.driver.python=python3.6.6/python3.6.6/bin/python3" \
> --conf spark.kubernetes.container.image.pullPolicy=Always \
> ./examples/src/main/python/pi.py 100
> {code}
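The lost execute permission is a general property of how zip archives are commonly extracted: the execute bit stored when archiving is not restored. A small self-contained demonstration using Python's standard library (paths hypothetical; Python's `zipfile.extract` is one extractor that behaves this way):

```python
# Sketch: a file that was executable when zipped comes out of a plain
# zipfile extraction without its execute bit.
import os
import stat
import tempfile
import zipfile

workdir = tempfile.mkdtemp()
script = os.path.join(workdir, "python3")
with open(script, "w") as f:
    f.write("#!/bin/sh\n")
os.chmod(script, 0o755)  # executable before archiving

archive = os.path.join(workdir, "python3.6.6.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.write(script, "bin/python3")

outdir = os.path.join(workdir, "out")
with zipfile.ZipFile(archive) as zf:
    zf.extract("bin/python3", outdir)

extracted = os.path.join(outdir, "bin", "python3")
mode = stat.S_IMODE(os.stat(extracted).st_mode)
assert not mode & stat.S_IXUSR  # execute bit was lost on extraction
```

Workarounds discussed for such cases include running `chmod +x` on the interpreter after extraction, or shipping the environment as a `.tar.gz` (e.g. built with conda-pack), since tar preserves file modes.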
[jira] [Resolved] (SPARK-37521) insert overwrite table but the partition information stored in Metastore was not changed
[ https://issues.apache.org/jira/browse/SPARK-37521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jingxiong zhong resolved SPARK-37521.
-------------------------------------
    Resolution: Won't Fix

> insert overwrite table but the partition information stored in Metastore was not changed
> --
>
>                 Key: SPARK-37521
>                 URL: https://issues.apache.org/jira/browse/SPARK-37521
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.0
>         Environment: spark3.2.0
> hive2.3.9
> metastore2.3.9
>            Reporter: jingxiong zhong
>            Priority: Major
>
> I create a partitioned table in Spark SQL, insert a row, add a regular column, and finally insert a new row into the partition. The query works in Spark SQL, but the value of the newly added column comes back NULL in Hive 2.3.9. For example:
> create table updata_col_test1(a int) partitioned by (dt string);
> insert overwrite table updata_col_test1 partition(dt='20200101') values(1);
> insert overwrite table updata_col_test1 partition(dt='20200102') values(1);
> insert overwrite table updata_col_test1 partition(dt='20200103') values(1);
> alter table updata_col_test1 add columns (b int);
> insert overwrite table updata_col_test1 partition(dt) values(1, 2, '20200101');   fails
> insert overwrite table updata_col_test1 partition(dt='20200101') values(1, 2);   fails
> insert overwrite table updata_col_test1 partition(dt='20200104') values(1, 2);   successfully
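One plausible reading of the behavior above, sketched in plain Python: the metastore records a column list per partition, and a reader that honors per-partition schemas (as Hive 2.3 does) returns NULL for table columns absent from a partition created before the `ALTER TABLE ... ADD COLUMNS`. This is a simplified model for illustration, not Hive's actual code:

```python
# Hypothetical model: partitions created before ALTER TABLE keep their
# original column list, so new table columns read back as NULL (None).
table_columns = ["a", "b"]  # table schema after ALTER TABLE ... ADD COLUMNS (b int)

partitions = {
    "dt=20200101": {"columns": ["a"], "rows": [(1,)]},         # created before ALTER
    "dt=20200104": {"columns": ["a", "b"], "rows": [(1, 2)]},  # created after ALTER
}

def read_partition(part):
    """Project the partition's rows onto the table schema."""
    out = []
    for row in part["rows"]:
        values = dict(zip(part["columns"], row))
        out.append(tuple(values.get(c) for c in table_columns))
    return out

assert read_partition(partitions["dt=20200101"]) == [(1, None)]
assert read_partition(partitions["dt=20200104"]) == [(1, 2)]
```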
[jira] [Resolved] (SPARK-41823) DataFrame.join creating ambiguous column names
[ https://issues.apache.org/jira/browse/SPARK-41823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh resolved SPARK-41823. --- Resolution: Duplicate > DataFrame.join creating ambiguous column names > -- > > Key: SPARK-41823 > URL: https://issues.apache.org/jira/browse/SPARK-41823 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 254, in pyspark.sql.connect.dataframe.DataFrame.drop > Failed example: > df.join(df2, df.name == df2.name, 'inner').drop('name').show() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line > 1, in > df.join(df2, df.name == df2.name, 'inner').drop('name').show() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 534, in show > print(self._show_string(n, truncate, vertical)) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 423, in _show_string > ).toPandas() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1031, in toPandas > return self._session.client.to_pandas(query) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 413, in to_pandas > return self._execute_and_fetch(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 573, in _execute_and_fetch > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 619, in _handle_error > raise SparkConnectAnalysisException( > pyspark.sql.connect.client.SparkConnectAnalysisException: > [AMBIGUOUS_REFERENCE] 
Reference `name` is ambiguous, could be: [`name`,
> `name`].
> Plan: {code}
[jira] [Updated] (SPARK-41845) Fix `count(expr("*"))` function
[ https://issues.apache.org/jira/browse/SPARK-41845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandeep Singh updated SPARK-41845:
----------------------------------
    Description: 
{code:java}
File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 801, in pyspark.sql.connect.functions.count
Failed example:
    df.select(count(expr("*")), count(df.alphabets)).show()
Expected:
    +--------+----------------+
    |count(1)|count(alphabets)|
    +--------+----------------+
    |       4|               3|
    +--------+----------------+
Got:
    +----------------+----------------+
    |count(alphabets)|count(alphabets)|
    +----------------+----------------+
    |               3|               3|
    +----------------+----------------+
{code}

  was:
{code:java}
File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 2332, in pyspark.sql.connect.functions.call_udf
Failed example:
    df.select(call_udf("intX2", "id")).show()
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in
        df.select(call_udf("intX2", "id")).show()
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show
        print(self._show_string(n, truncate, vertical))
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string
        ).toPandas()
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas
        return self._execute_and_fetch(req)
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: [UNRESOLVED_ROUTINE] Cannot resolve function `intX2` on search path [`system`.`builtin`, `system`.`session`, `spark_catalog`.`default`].
    Plan: 
{code}

> Fix `count(expr("*"))` function
> ---
>
>                 Key: SPARK-41845
>                 URL: https://issues.apache.org/jira/browse/SPARK-41845
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Sandeep Singh
>            Priority: Major
>             Fix For: 3.4.0
>
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 801, in pyspark.sql.connect.functions.count
> Failed example:
>     df.select(count(expr("*")), count(df.alphabets)).show()
> Expected:
>     +--------+----------------+
>     |count(1)|count(alphabets)|
>     +--------+----------------+
>     |       4|               3|
>     +--------+----------------+
> Got:
>     +----------------+----------------+
>     |count(alphabets)|count(alphabets)|
>     +----------------+----------------+
>     |               3|               3|
>     +----------------+----------------+
> {code}
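The regression in the doctest above is that `count(expr("*"))` gets resolved as if it were `count(alphabets)`. In SQL semantics, `COUNT(*)` counts every row while `COUNT(col)` skips NULLs, which is why the expected output shows 4 and 3. A plain-Python restatement of those semantics (the data is hypothetical, mirroring the doctest's four rows with one NULL):

```python
# COUNT(*) vs COUNT(col): the former counts rows, the latter non-NULL values.
rows = ["a", "b", "c", None]  # hypothetical: 4 rows, one NULL in `alphabets`

count_star = len(rows)                             # COUNT(*): all rows
count_col = sum(1 for v in rows if v is not None)  # COUNT(col): non-NULL only

assert count_star == 4
assert count_col == 3
```

The buggy behavior collapses both columns to `count_col`, producing 3 and 3 instead of 4 and 3.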
[jira] [Created] (SPARK-41845) Fix `count(expr("*"))` function
Sandeep Singh created SPARK-41845:
-------------------------------------

             Summary: Fix `count(expr("*"))` function
                 Key: SPARK-41845
                 URL: https://issues.apache.org/jira/browse/SPARK-41845
             Project: Spark
          Issue Type: Sub-task
          Components: Connect, PySpark
    Affects Versions: 3.4.0
            Reporter: Sandeep Singh
             Fix For: 3.4.0


{code:java}
File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 2332, in pyspark.sql.connect.functions.call_udf
Failed example:
    df.select(call_udf("intX2", "id")).show()
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in
        df.select(call_udf("intX2", "id")).show()
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show
        print(self._show_string(n, truncate, vertical))
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string
        ).toPandas()
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas
        return self._execute_and_fetch(req)
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: [UNRESOLVED_ROUTINE] Cannot resolve function `intX2` on search path [`system`.`builtin`, `system`.`session`, `spark_catalog`.`default`].
    Plan: 
{code}
[jira] [Resolved] (SPARK-41844) Implement `intX2` function
[ https://issues.apache.org/jira/browse/SPARK-41844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh resolved SPARK-41844. --- Resolution: Invalid > Implement `intX2` function > -- > > Key: SPARK-41844 > URL: https://issues.apache.org/jira/browse/SPARK-41844 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 2332, in pyspark.sql.connect.functions.call_udf > Failed example: > df.select(call_udf("intX2", "id")).show() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > df.select(call_udf("intX2", "id")).show() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 534, in show > print(self._show_string(n, truncate, vertical)) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 423, in _show_string > ).toPandas() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1031, in toPandas > return self._session.client.to_pandas(query) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 413, in to_pandas > return self._execute_and_fetch(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 573, in _execute_and_fetch > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 619, in _handle_error > raise SparkConnectAnalysisException( > pyspark.sql.connect.client.SparkConnectAnalysisException: > [UNRESOLVED_ROUTINE] Cannot resolve function `intX2` on search 
path
> [`system`.`builtin`, `system`.`session`, `spark_catalog`.`default`].
> Plan: {code}
[jira] [Created] (SPARK-41844) Implement `intX2` function
Sandeep Singh created SPARK-41844: - Summary: Implement `intX2` function Key: SPARK-41844 URL: https://issues.apache.org/jira/browse/SPARK-41844 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Sandeep Singh Fix For: 3.4.0 {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1611, in pyspark.sql.connect.functions.transform_keys Failed example: df.select(transform_keys( "data", lambda k, _: upper(k)).alias("data_upper") ).show(truncate=False) Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(transform_keys( File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "transform_keys(data, lambdafunction(upper(x_11), x_11, y_12))" due to data type mismatch: Parameter 1 requires the "MAP" type, however "data" has the type "STRUCT". 
Plan: 'Project [transform_keys(data#4493, lambdafunction('upper(lambda 'x_11), lambda 'x_11, lambda 'y_12, false)) AS data_upper#4496]
+- Project [0#4488L AS id#4492L, 1#4489 AS data#4493]
   +- LocalRelation [0#4488L, 1#4489]
{code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41844) Implement `intX2` function
[ https://issues.apache.org/jira/browse/SPARK-41844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandeep Singh updated SPARK-41844:
----------------------------------
    Description: 
{code:java}
File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 2332, in pyspark.sql.connect.functions.call_udf
Failed example:
    df.select(call_udf("intX2", "id")).show()
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in
        df.select(call_udf("intX2", "id")).show()
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show
        print(self._show_string(n, truncate, vertical))
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string
        ).toPandas()
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas
        return self._execute_and_fetch(req)
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error
        raise SparkConnectAnalysisException(
pyspark.sql.connect.client.SparkConnectAnalysisException: [UNRESOLVED_ROUTINE] Cannot resolve function `intX2` on search path [`system`.`builtin`, `system`.`session`, `spark_catalog`.`default`].
Plan: 
{code}

was:
{code:java}
File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1611, in pyspark.sql.connect.functions.transform_keys
Failed example:
    df.select(transform_keys(
        "data", lambda k, _: upper(k)).alias("data_upper")
    ).show(truncate=False)
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in
        df.select(transform_keys(
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show
        print(self._show_string(n, truncate, vertical))
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string
        ).toPandas()
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas
        return self._execute_and_fetch(req)
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error
        raise SparkConnectAnalysisException(
pyspark.sql.connect.client.SparkConnectAnalysisException: [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "transform_keys(data, lambdafunction(upper(x_11), x_11, y_12))" due to data type mismatch: Parameter 1 requires the "MAP" type, however "data" has the type "STRUCT".
Plan: 'Project [transform_keys(data#4493, lambdafunction('upper(lambda 'x_11), lambda 'x_11, lambda 'y_12, false)) AS data_upper#4496]
+- Project [0#4488L AS id#4492L, 1#4489 AS data#4493]
   +- LocalRelation [0#4488L, 1#4489]
{code}

> Implement `intX2` function
> --------------------------
>
>                 Key: SPARK-41844
>                 URL: https://issues.apache.org/jira/browse/SPARK-41844
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Sandeep Singh
>            Priority: Major
>             Fix For: 3.4.0
>
>
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 2332, in pyspark.sql.connect.functions.call_udf
> Failed example:
>     df.select(call_udf("intX2", "id")).show()
> Exception raised:
>     Traceback (most recent call last):
>       File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in
>         df.select(call_udf("intX2", "id")).show()
>       File
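The `[UNRESOLVED_ROUTINE]` failure in the updated description occurs because `call_udf` can only invoke a function that is already registered in the session. A minimal sketch of the fix follows; the plain-Python `intX2` body is hypothetical (the doctest never shows its implementation), and the Spark calls are kept in comments because they assume a live SparkSession named `spark`:

```python
# Sketch of the fix for the [UNRESOLVED_ROUTINE] error: register the UDF
# under the name "intX2" before referring to it via call_udf.
# intX2 is a hypothetical UDF body (doubling an integer); only the plain
# Python part runs here, the Spark calls assume a live session `spark`.

def intX2(i: int) -> int:
    # Double an integer column value.
    return i * 2

# With a live SparkSession, the failing doctest line would be preceded by:
#
#     from pyspark.sql.functions import call_udf
#     from pyspark.sql.types import IntegerType
#     spark.udf.register("intX2", intX2, IntegerType())  # register first
#     df = spark.createDataFrame([(1,), (2,), (3,)], ["id"])
#     df.select(call_udf("intX2", "id")).show()          # now resolvable
```

Registration puts the name on the `system`.`session` search path that the error message lists, which is why the unregistered call cannot resolve.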
[jira] [Updated] (SPARK-41835) Implement `transform_keys` function
[ https://issues.apache.org/jira/browse/SPARK-41835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandeep Singh updated SPARK-41835:
----------------------------------
    Description: 
{code:java}
File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1611, in pyspark.sql.connect.functions.transform_keys
Failed example:
    df.select(transform_keys(
        "data", lambda k, _: upper(k)).alias("data_upper")
    ).show(truncate=False)
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in
        df.select(transform_keys(
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show
        print(self._show_string(n, truncate, vertical))
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string
        ).toPandas()
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas
        return self._execute_and_fetch(req)
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error
        raise SparkConnectAnalysisException(
pyspark.sql.connect.client.SparkConnectAnalysisException: [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "transform_keys(data, lambdafunction(upper(x_11), x_11, y_12))" due to data type mismatch: Parameter 1 requires the "MAP" type, however "data" has the type "STRUCT".
Plan: 'Project [transform_keys(data#4493, lambdafunction('upper(lambda 'x_11), lambda 'x_11, lambda 'y_12, false)) AS data_upper#4496]
+- Project [0#4488L AS id#4492L, 1#4489 AS data#4493]
   +- LocalRelation [0#4488L, 1#4489]
{code}

> Implement `transform_keys` function
> -----------------------------------
>
>                 Key: SPARK-41835
>                 URL: https://issues.apache.org/jira/browse/SPARK-41835
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Sandeep Singh
>            Priority: Major
>             Fix For: 3.4.0
>
>
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1611, in pyspark.sql.connect.functions.transform_keys
> Failed example:
>     df.select(transform_keys(
>         "data", lambda k, _: upper(k)).alias("data_upper")
>     ).show(truncate=False)
> Exception raised:
>     Traceback (most recent call last):
>       File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in
>         df.select(transform_keys(
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show
>         print(self._show_string(n, truncate, vertical))
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string
>         ).toPandas()
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas
>         return self._execute_and_fetch(req)
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch
>         self._handle_error(rpc_error)
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error
>         raise SparkConnectAnalysisException(
> pyspark.sql.connect.client.SparkConnectAnalysisException: [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "transform_keys(data, lambdafunction(upper(x_11), x_11, y_12))" due to data type mismatch: Parameter 1 requires the "MAP" type, however "data" has the type "STRUCT".
> Plan: 'Project [transform_keys(data#4493, lambdafunction('upper(lambda 'x_11), lambda 'x_11, lambda 'y_12, false)) AS data_upper#4496]
> +- Project [0#4488L AS id#4492L, 1#4489 AS data#4493]
>    +- LocalRelation [0#4488L, 1#4489]
> {code}
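The `DATATYPE_MISMATCH` error above arises because the doctest builds the `data` column from a tuple, which Spark infers as a STRUCT, while `transform_keys` requires a MAP column. A minimal sketch of the fix follows; the `upper_keys` helper is hypothetical and mirrors the lambda's effect in plain Python, and the Spark calls stay in comments because they assume a live SparkSession named `spark`:

```python
# Sketch of the fix for the DATATYPE_MISMATCH error: transform_keys needs a
# MAP column, so the DataFrame must be built from a Python dict (inferred as
# MapType), not a tuple (inferred as StructType). upper_keys is a
# hypothetical plain-Python mirror of `lambda k, _: upper(k)`; the Spark
# calls assume a live SparkSession `spark`.

def upper_keys(data: dict) -> dict:
    # Same effect as transform_keys("data", lambda k, _: upper(k)):
    # upper-case every key, leave every value untouched.
    return {k.upper(): v for k, v in data.items()}

# With a live SparkSession, the doctest setup would use a dict literal:
#
#     from pyspark.sql.functions import transform_keys, upper
#     df = spark.createDataFrame(
#         [(1, {"foo": -2.0, "bar": 2.0})], ("id", "data"))  # dict -> MAP
#     df.select(transform_keys(
#         "data", lambda k, _: upper(k)).alias("data_upper")
#     ).show(truncate=False)
```

The analysis error disappears once the column's type is MAP, because parameter 1 of `transform_keys` then matches the type the resolver demands.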