[jira] [Created] (SPARK-41202) Update ORC to 1.7.7
William Hyun created SPARK-41202: Summary: Update ORC to 1.7.7 Key: SPARK-41202 URL: https://issues.apache.org/jira/browse/SPARK-41202 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.3.2 Reporter: William Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41175) Assign a name to the error class _LEGACY_ERROR_TEMP_1078
[ https://issues.apache.org/jira/browse/SPARK-41175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-41175. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38696 [https://github.com/apache/spark/pull/38696] > Assign a name to the error class _LEGACY_ERROR_TEMP_1078 > > > Key: SPARK-41175 > URL: https://issues.apache.org/jira/browse/SPARK-41175 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: BingKun Pan >Priority: Major > Fix For: 3.4.0 > > > Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1078 and make it > visible to users.
[jira] [Assigned] (SPARK-41175) Assign a name to the error class _LEGACY_ERROR_TEMP_1078
[ https://issues.apache.org/jira/browse/SPARK-41175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-41175: Assignee: BingKun Pan > Assign a name to the error class _LEGACY_ERROR_TEMP_1078 > > > Key: SPARK-41175 > URL: https://issues.apache.org/jira/browse/SPARK-41175 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: BingKun Pan >Priority: Major > > Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1078 and make it > visible to users.
[jira] [Assigned] (SPARK-41201) Implement `DataFrame.SelectExpr` in Python client
[ https://issues.apache.org/jira/browse/SPARK-41201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41201: Assignee: (was: Apache Spark) > Implement `DataFrame.SelectExpr` in Python client > - > > Key: SPARK-41201 > URL: https://issues.apache.org/jira/browse/SPARK-41201 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major >
[jira] [Commented] (SPARK-41201) Implement `DataFrame.SelectExpr` in Python client
[ https://issues.apache.org/jira/browse/SPARK-41201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636109#comment-17636109 ] Apache Spark commented on SPARK-41201: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/38723 > Implement `DataFrame.SelectExpr` in Python client > - > > Key: SPARK-41201 > URL: https://issues.apache.org/jira/browse/SPARK-41201 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major >
[jira] [Assigned] (SPARK-41201) Implement `DataFrame.SelectExpr` in Python client
[ https://issues.apache.org/jira/browse/SPARK-41201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41201: Assignee: Apache Spark > Implement `DataFrame.SelectExpr` in Python client > - > > Key: SPARK-41201 > URL: https://issues.apache.org/jira/browse/SPARK-41201 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Apache Spark >Priority: Major >
[jira] [Created] (SPARK-41201) Implement `DataFrame.SelectExpr` in Python client
Rui Wang created SPARK-41201: Summary: Implement `DataFrame.SelectExpr` in Python client Key: SPARK-41201 URL: https://issues.apache.org/jira/browse/SPARK-41201 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Rui Wang
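For context on the ticket above: in PySpark, `selectExpr` is a convenience that routes each SQL string through the expression parser and then delegates to `select`. The sketch below illustrates that delegation pattern with stand-in names (`FakeDataFrame`, `parse_expr`); it is not the Spark Connect implementation.

```python
# Minimal sketch of the selectExpr -> select(expr(...)) delegation pattern.
# parse_expr and FakeDataFrame are illustrative stand-ins, not Spark APIs.

def parse_expr(s: str):
    # Stand-in for Spark's SQL expression parser: just records the text.
    return ("expr", s.strip())

class FakeDataFrame:
    def select(self, *cols):
        # Stand-in projection: return the resolved columns.
        return list(cols)

    def selectExpr(self, *exprs):
        # Delegate each SQL string through the expression parser.
        return self.select(*[parse_expr(e) for e in exprs])

df = FakeDataFrame()
print(df.selectExpr("a", "b + 1"))  # [('expr', 'a'), ('expr', 'b + 1')]
```

The real PySpark `DataFrame.selectExpr("a", "b + 1")` is documented as equivalent to `df.select(expr("a"), expr("b + 1"))`; a Connect client mainly needs to ship the unparsed strings to the server.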
[jira] [Commented] (SPARK-41200) BytesToBytesMap's longArray size can be up to MAX_CAPACITY
[ https://issues.apache.org/jira/browse/SPARK-41200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636106#comment-17636106 ] Apache Spark commented on SPARK-41200: -- User 'WangGuangxin' has created a pull request for this issue: https://github.com/apache/spark/pull/38722 > BytesToBytesMap's longArray size can be up to MAX_CAPACITY > -- > > Key: SPARK-41200 > URL: https://issues.apache.org/jira/browse/SPARK-41200 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: EdisonWang >Priority: Minor >
[jira] [Assigned] (SPARK-41200) BytesToBytesMap's longArray size can be up to MAX_CAPACITY
[ https://issues.apache.org/jira/browse/SPARK-41200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41200: Assignee: Apache Spark > BytesToBytesMap's longArray size can be up to MAX_CAPACITY > -- > > Key: SPARK-41200 > URL: https://issues.apache.org/jira/browse/SPARK-41200 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: EdisonWang >Assignee: Apache Spark >Priority: Minor >
[jira] [Commented] (SPARK-41200) BytesToBytesMap's longArray size can be up to MAX_CAPACITY
[ https://issues.apache.org/jira/browse/SPARK-41200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636105#comment-17636105 ] Apache Spark commented on SPARK-41200: -- User 'WangGuangxin' has created a pull request for this issue: https://github.com/apache/spark/pull/38722 > BytesToBytesMap's longArray size can be up to MAX_CAPACITY > -- > > Key: SPARK-41200 > URL: https://issues.apache.org/jira/browse/SPARK-41200 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: EdisonWang >Priority: Minor >
[jira] [Assigned] (SPARK-41200) BytesToBytesMap's longArray size can be up to MAX_CAPACITY
[ https://issues.apache.org/jira/browse/SPARK-41200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41200: Assignee: (was: Apache Spark) > BytesToBytesMap's longArray size can be up to MAX_CAPACITY > -- > > Key: SPARK-41200 > URL: https://issues.apache.org/jira/browse/SPARK-41200 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: EdisonWang >Priority: Minor >
[jira] [Created] (SPARK-41200) BytesToBytesMap's longArray size can be up to MAX_CAPACITY
EdisonWang created SPARK-41200: -- Summary: BytesToBytesMap's longArray size can be up to MAX_CAPACITY Key: SPARK-41200 URL: https://issues.apache.org/jira/browse/SPARK-41200 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.3.0 Reporter: EdisonWang
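For context on the ticket above: Spark's `BytesToBytesMap` grows by doubling its slot count, capped at `MAX_CAPACITY`, and its backing long array holds two 8-byte words per slot. The sketch below models that growth rule in plain Python under the assumption that `MAX_CAPACITY` is `1 << 29`, as in the Spark source at the time of writing; the function names are illustrative, not Spark internals.

```python
# Sketch of bounded capacity doubling, loosely modeled on Spark's
# BytesToBytesMap. MAX_CAPACITY = 1 << 29 is assumed from the Spark source;
# next_capacity / long_array_len are illustrative names.

MAX_CAPACITY = 1 << 29

def next_capacity(current: int) -> int:
    # Double on growth, but never exceed the hard cap.
    return min(current * 2, MAX_CAPACITY)

def long_array_len(capacity: int) -> int:
    # Two longs per slot: one for the record address, one for the hash code.
    return capacity * 2

cap = 64
while cap < MAX_CAPACITY:
    cap = next_capacity(cap)

print(cap == MAX_CAPACITY)           # True
print(long_array_len(MAX_CAPACITY))  # 1073741824
```

At the cap, the long array reaches 2^30 longs, i.e. 8 GiB of addresses and hashes, which is why sizing it all the way to `MAX_CAPACITY` matters.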
[jira] (SPARK-41172) Migrate the ambiguous ref error to an error class
[ https://issues.apache.org/jira/browse/SPARK-41172 ] BingKun Pan deleted comment on SPARK-41172: - was (Author: panbingkun): I work on it. > Migrate the ambiguous ref error to an error class > - > > Key: SPARK-41172 > URL: https://issues.apache.org/jira/browse/SPARK-41172 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Use an error class in > https://github.com/apache/spark/blob/99ae1d9a897909990881f14c5ea70a0d1a0bf456/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L372
[jira] [Commented] (SPARK-41172) Migrate the ambiguous ref error to an error class
[ https://issues.apache.org/jira/browse/SPARK-41172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636098#comment-17636098 ] Apache Spark commented on SPARK-41172: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/38721 > Migrate the ambiguous ref error to an error class > - > > Key: SPARK-41172 > URL: https://issues.apache.org/jira/browse/SPARK-41172 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Use an error class in > https://github.com/apache/spark/blob/99ae1d9a897909990881f14c5ea70a0d1a0bf456/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L372
[jira] [Assigned] (SPARK-41172) Migrate the ambiguous ref error to an error class
[ https://issues.apache.org/jira/browse/SPARK-41172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41172: Assignee: (was: Apache Spark) > Migrate the ambiguous ref error to an error class > - > > Key: SPARK-41172 > URL: https://issues.apache.org/jira/browse/SPARK-41172 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Use an error class in > https://github.com/apache/spark/blob/99ae1d9a897909990881f14c5ea70a0d1a0bf456/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L372
[jira] [Commented] (SPARK-41172) Migrate the ambiguous ref error to an error class
[ https://issues.apache.org/jira/browse/SPARK-41172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636097#comment-17636097 ] Apache Spark commented on SPARK-41172: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/38721 > Migrate the ambiguous ref error to an error class > - > > Key: SPARK-41172 > URL: https://issues.apache.org/jira/browse/SPARK-41172 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Use an error class in > https://github.com/apache/spark/blob/99ae1d9a897909990881f14c5ea70a0d1a0bf456/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L372
[jira] [Assigned] (SPARK-41172) Migrate the ambiguous ref error to an error class
[ https://issues.apache.org/jira/browse/SPARK-41172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41172: Assignee: Apache Spark > Migrate the ambiguous ref error to an error class > - > > Key: SPARK-41172 > URL: https://issues.apache.org/jira/browse/SPARK-41172 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Use an error class in > https://github.com/apache/spark/blob/99ae1d9a897909990881f14c5ea70a0d1a0bf456/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L372
[jira] [Resolved] (SPARK-41186) Fix doctest for new version mlfow
[ https://issues.apache.org/jira/browse/SPARK-41186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yikun Jiang resolved SPARK-41186. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38698 [https://github.com/apache/spark/pull/38698] > Fix doctest for new version mlfow > - > > Key: SPARK-41186 > URL: https://issues.apache.org/jira/browse/SPARK-41186 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.4.0 > > > > > ** > File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 168, in > pyspark.pandas.mlflow.load_model > Failed example: > run_info = client.list_run_infos(exp_id)[-1] > Exception raised: > Traceback (most recent call last): > File "/usr/lib/python3.9/doctest.py", line 1336, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > run_info = client.list_run_infos(exp_id)[-1] > AttributeError: 'MlflowClient' object has no attribute 'list_run_infos' > ** > File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 169, in > pyspark.pandas.mlflow.load_model > Failed example: > model = > load_model("runs:/{run_id}/model".format(run_id=run_info.run_uuid)) > Exception raised: > Traceback (most recent call last): > File "/usr/lib/python3.9/doctest.py", line 1336, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > model = > load_model("runs:/{run_id}/model".format(run_id=run_info.run_uuid)) > NameError: name 'run_info' is not defined > ** > File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 171, in > pyspark.pandas.mlflow.load_model > Failed example: > prediction_df["prediction"] = model.predict(prediction_df) > Exception raised: > Traceback (most recent call last): > File "/usr/lib/python3.9/doctest.py", line 1336, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > prediction_df["prediction"] = model.predict(prediction_df) > NameError: name 'model' is not defined > ** > File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 172, in > pyspark.pandas.mlflow.load_model > Failed example: > prediction_df > Expected: > x1 x2 prediction > 0 2.0 4.01.31 > Got: > x1 x2 > 0 2.0 4.0 > ** > File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 178, in > pyspark.pandas.mlflow.load_model > Failed example: > model.predict(prediction_df[["x1", "x2"]].to_pandas()) > Exception raised: > Traceback (most recent call last): > File "/usr/lib/python3.9/doctest.py", line 1336, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > model.predict(prediction_df[["x1", "x2"]].to_pandas()) > NameError: name 'model' is not defined > ** > File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 189, in > pyspark.pandas.mlflow.load_model > Failed example: > y = model.predict(features) > Exception raised: > Traceback (most recent call last): > File "/usr/lib/python3.9/doctest.py", line 1336, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > y = model.predict(features) > NameError: name 'model' is not defined > ** > File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 198, in > pyspark.pandas.mlflow.load_model > Failed example: > features['y'] = y > Exception raised: > Traceback (most recent call last): > File "/usr/lib/python3.9/doctest.py", line 1336, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > features['y'] = y > NameError: name 'y' is not defined > ** > File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 200, in > pyspark.pandas.mlflow.load_model > Failed example: > everything > Expected: > x1 x2 z y > 0 2.0 3.0 -1 1.376932 > Got: > x1 x2 z > 0 2.0 3.0 -1 >
[jira] [Assigned] (SPARK-41186) Fix doctest for new version mlfow
[ https://issues.apache.org/jira/browse/SPARK-41186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yikun Jiang reassigned SPARK-41186: --- Assignee: Yikun Jiang > Fix doctest for new version mlfow > - > > Key: SPARK-41186 > URL: https://issues.apache.org/jira/browse/SPARK-41186 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > > > > ** > File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 168, in > pyspark.pandas.mlflow.load_model > Failed example: > run_info = client.list_run_infos(exp_id)[-1] > Exception raised: > Traceback (most recent call last): > File "/usr/lib/python3.9/doctest.py", line 1336, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > run_info = client.list_run_infos(exp_id)[-1] > AttributeError: 'MlflowClient' object has no attribute 'list_run_infos' > ** > File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 169, in > pyspark.pandas.mlflow.load_model > Failed example: > model = > load_model("runs:/{run_id}/model".format(run_id=run_info.run_uuid)) > Exception raised: > Traceback (most recent call last): > File "/usr/lib/python3.9/doctest.py", line 1336, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > model = > load_model("runs:/{run_id}/model".format(run_id=run_info.run_uuid)) > NameError: name 'run_info' is not defined > ** > File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 171, in > pyspark.pandas.mlflow.load_model > Failed example: > prediction_df["prediction"] = model.predict(prediction_df) > Exception raised: > Traceback (most recent call last): > File "/usr/lib/python3.9/doctest.py", line 1336, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > prediction_df["prediction"] = model.predict(prediction_df) > NameError: name 'model' is not defined > ** > File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 172, in > pyspark.pandas.mlflow.load_model > Failed example: > prediction_df > Expected: > x1 x2 prediction > 0 2.0 4.01.31 > Got: > x1 x2 > 0 2.0 4.0 > ** > File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 178, in > pyspark.pandas.mlflow.load_model > Failed example: > model.predict(prediction_df[["x1", "x2"]].to_pandas()) > Exception raised: > Traceback (most recent call last): > File "/usr/lib/python3.9/doctest.py", line 1336, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > model.predict(prediction_df[["x1", "x2"]].to_pandas()) > NameError: name 'model' is not defined > ** > File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 189, in > pyspark.pandas.mlflow.load_model > Failed example: > y = model.predict(features) > Exception raised: > Traceback (most recent call last): > File "/usr/lib/python3.9/doctest.py", line 1336, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > y = model.predict(features) > NameError: name 'model' is not defined > ** > File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 198, in > pyspark.pandas.mlflow.load_model > Failed example: > features['y'] = y > Exception raised: > Traceback (most recent call last): > File "/usr/lib/python3.9/doctest.py", line 1336, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > features['y'] = y > NameError: name 'y' is not defined > ** > File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 200, in > pyspark.pandas.mlflow.load_model > Failed example: > everything > Expected: > x1 x2 z y > 0 2.0 3.0 -1 1.376932 > Got: > x1 x2 z > 0 2.0 3.0 -1 > ** >8 of 26 in pyspark.pandas.mlflow.load_model
[jira] [Commented] (SPARK-41172) Migrate the ambiguous ref error to an error class
[ https://issues.apache.org/jira/browse/SPARK-41172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636083#comment-17636083 ] BingKun Pan commented on SPARK-41172: - I work on it. > Migrate the ambiguous ref error to an error class > - > > Key: SPARK-41172 > URL: https://issues.apache.org/jira/browse/SPARK-41172 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Use an error class in > https://github.com/apache/spark/blob/99ae1d9a897909990881f14c5ea70a0d1a0bf456/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L372
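For context on the migration tracked above: Spark's error framework keeps user-facing messages in a registry (error-classes.json) keyed by a stable class name, and call sites raise by name with parameters instead of hard-coding text. The Python sketch below illustrates that pattern only; the registry entry and `AnalysisError` class here are illustrative approximations, not Spark's actual JSON text or exception hierarchy.

```python
# Illustrative sketch of the error-class pattern: message templates live in a
# registry keyed by a stable name, and errors are raised by name + parameters.
# The template text below is made up for illustration.

ERROR_CLASSES = {
    "AMBIGUOUS_REFERENCE": "Reference `{name}` is ambiguous, could be: {referenceNames}.",
}

class AnalysisError(Exception):
    def __init__(self, error_class: str, **params):
        self.error_class = error_class
        # Render the registered template with the call site's parameters.
        super().__init__(f"[{error_class}] " + ERROR_CLASSES[error_class].format(**params))

try:
    raise AnalysisError("AMBIGUOUS_REFERENCE", name="id", referenceNames="[a.id, b.id]")
except AnalysisError as e:
    print(e.error_class)  # AMBIGUOUS_REFERENCE
    print(str(e))
```

The payoff of the pattern is that tests and clients can match on the stable class name rather than on free-form message text, which is what "make it visible to users" refers to in these tickets.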
[jira] [Commented] (SPARK-41184) Fill NA tests are flaky
[ https://issues.apache.org/jira/browse/SPARK-41184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636082#comment-17636082 ] Apache Spark commented on SPARK-41184: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/38720 > Fill NA tests are flaky > --- > > Key: SPARK-41184 > URL: https://issues.apache.org/jira/browse/SPARK-41184 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.4.0 > > > Connect's fill.na tests for python are flaky. We need to disable them, and > investigate what is going on with the typing.
[jira] [Commented] (SPARK-41184) Fill NA tests are flaky
[ https://issues.apache.org/jira/browse/SPARK-41184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636081#comment-17636081 ] Apache Spark commented on SPARK-41184: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/38720 > Fill NA tests are flaky > --- > > Key: SPARK-41184 > URL: https://issues.apache.org/jira/browse/SPARK-41184 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.4.0 > > > Connect's fill.na tests for python are flaky. We need to disable them, and > investigate what is going on with the typing.
[jira] [Commented] (SPARK-41165) Arrow collect should factor in failures
[ https://issues.apache.org/jira/browse/SPARK-41165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636080#comment-17636080 ] Apache Spark commented on SPARK-41165: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/38720 > Arrow collect should factor in failures > --- > > Key: SPARK-41165 > URL: https://issues.apache.org/jira/browse/SPARK-41165 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.4.0 > > > Connect's arrow collect path does not factor in failures. If a failure occurs > the collect code path will hang.
[jira] [Assigned] (SPARK-41199) Streaming query metrics is broken with mixed-up usage of DSv1 streaming source and DSv2 streaming source
[ https://issues.apache.org/jira/browse/SPARK-41199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41199: Assignee: Apache Spark > Streaming query metrics is broken with mixed-up usage of DSv1 streaming > source and DSv2 streaming source > > > Key: SPARK-41199 > URL: https://issues.apache.org/jira/browse/SPARK-41199 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.2, 3.4.0, 3.3.1 >Reporter: Jungtaek Lim >Assignee: Apache Spark >Priority: Major > > (It seems like a long standing issue. It probably applies to 2.x as well. I > just marked the version line we did not EOL.) > If a streaming query contains both DSv1 and DSv2 streaming sources together, > it only collects metrics properly for DSv1 sources. >
[jira] [Assigned] (SPARK-41199) Streaming query metrics is broken with mixed-up usage of DSv1 streaming source and DSv2 streaming source
[ https://issues.apache.org/jira/browse/SPARK-41199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41199: Assignee: (was: Apache Spark) > Streaming query metrics is broken with mixed-up usage of DSv1 streaming > source and DSv2 streaming source > > > Key: SPARK-41199 > URL: https://issues.apache.org/jira/browse/SPARK-41199 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.2, 3.4.0, 3.3.1 >Reporter: Jungtaek Lim >Priority: Major > > (It seems like a long standing issue. It probably applies to 2.x as well. I > just marked the version line we did not EOL.) > If a streaming query contains both DSv1 and DSv2 streaming sources together, > it only collects metrics properly for DSv1 sources. >
[jira] [Commented] (SPARK-41199) Streaming query metrics is broken with mixed-up usage of DSv1 streaming source and DSv2 streaming source
[ https://issues.apache.org/jira/browse/SPARK-41199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636041#comment-17636041 ] Apache Spark commented on SPARK-41199: -- User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/38719 > Streaming query metrics is broken with mixed-up usage of DSv1 streaming > source and DSv2 streaming source > > > Key: SPARK-41199 > URL: https://issues.apache.org/jira/browse/SPARK-41199 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.2, 3.4.0, 3.3.1 >Reporter: Jungtaek Lim >Priority: Major > > (It seems like a long standing issue. It probably applies to 2.x as well. I > just marked the version line we did not EOL.) > If a streaming query contains both DSv1 and DSv2 streaming sources together, > it only collects metrics properly for DSv1 sources. >
[jira] [Commented] (SPARK-41196) Homogenize the protobuf version across server and client
[ https://issues.apache.org/jira/browse/SPARK-41196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636034#comment-17636034 ] Apache Spark commented on SPARK-41196: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/38718 > Homogenize the protobuf version across server and client > > > Key: SPARK-41196 > URL: https://issues.apache.org/jira/browse/SPARK-41196 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Major > Fix For: 3.4.0 > > > Homogenize the protobuf version across server and client
[jira] [Commented] (SPARK-41196) Homogenize the protobuf version across server and client
[ https://issues.apache.org/jira/browse/SPARK-41196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636033#comment-17636033 ] Apache Spark commented on SPARK-41196: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/38718 > Homogenize the protobuf version across server and client > > > Key: SPARK-41196 > URL: https://issues.apache.org/jira/browse/SPARK-41196 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Major > Fix For: 3.4.0 > > > Homogenize the protobuf version across server and client
[jira] [Created] (SPARK-41199) Streaming query metrics is broken with mixed-up usage of DSv1 streaming source and DSv2 streaming source
Jungtaek Lim created SPARK-41199: Summary: Streaming query metrics is broken with mixed-up usage of DSv1 streaming source and DSv2 streaming source Key: SPARK-41199 URL: https://issues.apache.org/jira/browse/SPARK-41199 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.3.1, 3.2.2, 3.4.0 Reporter: Jungtaek Lim (It seems like a long standing issue. It probably applies to 2.x as well. I just marked the version line we did not EOL.) If a streaming query contains both DSv1 and DSv2 streaming sources together, it only collects metrics properly for DSv1 sources.
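The failure mode described in SPARK-41199 can be reproduced in miniature: if the metrics-matching code only recognizes DSv1-style source nodes, any DSv2 source in the same query silently contributes no metrics. The class and function names below are illustrative toys, not Spark internals.

```python
# Toy reproduction of the reported behavior: metric collection written
# against DSv1 sources only drops DSv2 metrics in a mixed query.
# DSv1Source / DSv2Source / collect_metrics_v1_only are illustrative names.

class DSv1Source:
    def __init__(self, name: str, rows: int):
        self.name, self.rows = name, rows

class DSv2Source:
    def __init__(self, name: str, rows: int):
        self.name, self.rows = name, rows

def collect_metrics_v1_only(sources):
    # Bug pattern: the type check was written for DSv1 only, so DSv2
    # sources fall through and report nothing.
    return {s.name: s.rows for s in sources if isinstance(s, DSv1Source)}

sources = [DSv1Source("socket_v1", 100), DSv2Source("rate_v2", 50)]
print(collect_metrics_v1_only(sources))  # {'socket_v1': 100} - DSv2 rows lost
```

A fix along the lines of the linked PR would match sources of both API generations when associating per-source progress numbers.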
[jira] [Commented] (SPARK-41198) Streaming query metrics is broken with CTE
[ https://issues.apache.org/jira/browse/SPARK-41198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636024#comment-17636024 ] Apache Spark commented on SPARK-41198: -- User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/38717 > Streaming query metrics is broken with CTE > -- > > Key: SPARK-41198 > URL: https://issues.apache.org/jira/browse/SPARK-41198 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.2, 3.4.0, 3.3.1 >Reporter: Jungtaek Lim >Priority: Major > > We have observed a case the metrics are not available for the streaming query > which contains CTE. > Looks like CTE was inlined in analysis phase in Spark 3.1.x and it was > changed to be inlined in optimization phase in Spark 3.2.x. ProgressReporter > depends on analyzed plan, hence the change made ProgressReporter to see CTE > nodes, which ends up with having different number of leaf nodes between > analyzed plan and executed plan.
[jira] [Assigned] (SPARK-41198) Streaming query metrics is broken with CTE
[ https://issues.apache.org/jira/browse/SPARK-41198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41198: Assignee: (was: Apache Spark) > Streaming query metrics is broken with CTE > -- > > Key: SPARK-41198 > URL: https://issues.apache.org/jira/browse/SPARK-41198 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.2, 3.4.0, 3.3.1 >Reporter: Jungtaek Lim >Priority: Major > > We have observed a case the metrics are not available for the streaming query > which contains CTE. > Looks like CTE was inlined in analysis phase in Spark 3.1.x and it was > changed to be inlined in optimization phase in Spark 3.2.x. ProgressReporter > depends on analyzed plan, hence the change made ProgressReporter to see CTE > nodes, which ends up with having different number of leaf nodes between > analyzed plan and executed plan. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41198) Streaming query metrics is broken with CTE
[ https://issues.apache.org/jira/browse/SPARK-41198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41198: Assignee: Apache Spark > Streaming query metrics is broken with CTE > -- > > Key: SPARK-41198 > URL: https://issues.apache.org/jira/browse/SPARK-41198 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.2, 3.4.0, 3.3.1 >Reporter: Jungtaek Lim >Assignee: Apache Spark >Priority: Major > > We have observed a case the metrics are not available for the streaming query > which contains CTE. > Looks like CTE was inlined in analysis phase in Spark 3.1.x and it was > changed to be inlined in optimization phase in Spark 3.2.x. ProgressReporter > depends on analyzed plan, hence the change made ProgressReporter to see CTE > nodes, which ends up with having different number of leaf nodes between > analyzed plan and executed plan. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41173) Move `require()` out from the constructors of string expressions
[ https://issues.apache.org/jira/browse/SPARK-41173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-41173: Assignee: Yang Jie > Move `require()` out from the constructors of string expressions > > > Key: SPARK-41173 > URL: https://issues.apache.org/jira/browse/SPARK-41173 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Yang Jie >Priority: Major > > 1. ConcatWs: > https://github.com/apache/spark/blob/fabea7101ea55db991590ca2fbe1d4dfd25e5b28/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L70 > 2. FormatString > https://github.com/apache/spark/blob/fabea7101ea55db991590ca2fbe1d4dfd25e5b28/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L1665 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41173) Move `require()` out from the constructors of string expressions
[ https://issues.apache.org/jira/browse/SPARK-41173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-41173. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38705 [https://github.com/apache/spark/pull/38705] > Move `require()` out from the constructors of string expressions > > > Key: SPARK-41173 > URL: https://issues.apache.org/jira/browse/SPARK-41173 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Yang Jie >Priority: Major > Fix For: 3.4.0 > > > 1. ConcatWs: > https://github.com/apache/spark/blob/fabea7101ea55db991590ca2fbe1d4dfd25e5b28/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L70 > 2. FormatString > https://github.com/apache/spark/blob/fabea7101ea55db991590ca2fbe1d4dfd25e5b28/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L1665 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41198) Streaming query metrics is broken with CTE
Jungtaek Lim created SPARK-41198: Summary: Streaming query metrics is broken with CTE Key: SPARK-41198 URL: https://issues.apache.org/jira/browse/SPARK-41198 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.3.1, 3.2.2, 3.4.0 Reporter: Jungtaek Lim We have observed a case where the metrics are not available for a streaming query that contains a CTE. The CTE was inlined in the analysis phase in Spark 3.1.x, and this changed to the optimization phase in Spark 3.2.x. ProgressReporter depends on the analyzed plan, so after the change it sees CTE nodes, which results in a different number of leaf nodes between the analyzed plan and the executed plan. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
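A sketch of the affected query shape, using a rate source registered as a temp view (all names are illustrative): the WITH clause survives into the analyzed plan on 3.2.x and later, so ProgressReporter counts leaf nodes differently in the analyzed and executed plans.

{code:scala}
import org.apache.spark.sql.SparkSession

object CteStreamingMetricsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cte-streaming-metrics").getOrCreate()

    // Register a streaming source as a view so SQL with a CTE can read it.
    spark.readStream.format("rate").load().createOrReplaceTempView("events")

    // The analyzed plan of this query contains CTE nodes; the executed plan
    // has them inlined, so the two plans have different numbers of leaf nodes.
    val query = spark.sql(
      """WITH evens AS (SELECT value FROM events WHERE value % 2 = 0)
        |SELECT value % 10 AS bucket, count(*) AS cnt FROM evens GROUP BY value % 10
      """.stripMargin)
      .writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
{code}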
[jira] [Commented] (SPARK-41197) Upgrade Kafka version to 3.3 release
[ https://issues.apache.org/jira/browse/SPARK-41197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635992#comment-17635992 ] Apache Spark commented on SPARK-41197: -- User 'tedyu' has created a pull request for this issue: https://github.com/apache/spark/pull/38715 > Upgrade Kafka version to 3.3 release > > > Key: SPARK-41197 > URL: https://issues.apache.org/jira/browse/SPARK-41197 > Project: Spark > Issue Type: Improvement > Components: Java API >Affects Versions: 3.3.1 >Reporter: Ted Yu >Priority: Minor > > Kafka 3.3 has been released. > This issue upgrades Kafka dependency to 3.3 release. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41197) Upgrade Kafka version to 3.3 release
[ https://issues.apache.org/jira/browse/SPARK-41197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41197: Assignee: Apache Spark > Upgrade Kafka version to 3.3 release > > > Key: SPARK-41197 > URL: https://issues.apache.org/jira/browse/SPARK-41197 > Project: Spark > Issue Type: Improvement > Components: Java API >Affects Versions: 3.3.1 >Reporter: Ted Yu >Assignee: Apache Spark >Priority: Minor > > Kafka 3.3 has been released. > This issue upgrades Kafka dependency to 3.3 release. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41197) Upgrade Kafka version to 3.3 release
[ https://issues.apache.org/jira/browse/SPARK-41197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41197: Assignee: (was: Apache Spark) > Upgrade Kafka version to 3.3 release > > > Key: SPARK-41197 > URL: https://issues.apache.org/jira/browse/SPARK-41197 > Project: Spark > Issue Type: Improvement > Components: Java API >Affects Versions: 3.3.1 >Reporter: Ted Yu >Priority: Minor > > Kafka 3.3 has been released. > This issue upgrades Kafka dependency to 3.3 release. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41141) avoid introducing a new aggregate expression in the analysis phase when subquery is referencing it
[ https://issues.apache.org/jira/browse/SPARK-41141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635989#comment-17635989 ] Asif commented on SPARK-41141: -- Opened the following PR [SPARK-41141-PR|https://github.com/apache/spark/pull/38714/files] > avoid introducing a new aggregate expression in the analysis phase when > subquery is referencing it > -- > > Key: SPARK-41141 > URL: https://issues.apache.org/jira/browse/SPARK-41141 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.1 >Reporter: Asif >Priority: Minor > Labels: spark-sql > > Currently, the analyzer rules that handle a subquery referencing an aggregate > expression in the outer query avoid introducing a new aggregate expression only for a > single-level aggregate function; they introduce a new aggregate expression for > nested aggregate functions. > It is possible to avoid adding this extra aggregate expression, > at least when the outer projection involving the aggregate function is exactly the same > as the one used in the subquery, or when the outer query's projection > involving the aggregate function is a subtree of the subquery's expression. > > Consider the following two cases: > 1) select cos(sum(a)), b from t1 group by b having exists (select x from > t2 where y = cos(sum(a))) > 2) select sum(a), b from t1 group by b having exists (select x from t2 > where y = cos(sum(a))) > > In both cases there is no need to add an extra aggregate > expression. > > I am also investigating whether it is possible to avoid it for the case: > > 3) select cos(sum(a)), b from t1 group by b having exists (select x from > t2 where y = sum(a)) > > This Jira is also needed for another issue where a subquery over DataSource V2 > projects columns that are not needed (no Jira filed for that yet; will > do that). > > Will be opening a PR for this soon.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
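The two HAVING EXISTS shapes from the description can be written out as follows. The table schemas t1(a, b) and t2(x, y) are inferred from the queries; the snippet only illustrates the query shapes discussed, not the proposed fix.

{code:scala}
import org.apache.spark.sql.SparkSession

object HavingExistsShapes {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("having-exists-shapes").getOrCreate()

    // Case 1: the outer projection cos(sum(a)) is exactly the expression
    // referenced by the subquery.
    spark.sql(
      """SELECT cos(sum(a)), b FROM t1 GROUP BY b
        |HAVING EXISTS (SELECT x FROM t2 WHERE y = cos(sum(a)))""".stripMargin).explain(true)

    // Case 2: the outer projection sum(a) is a subtree of the subquery's
    // expression cos(sum(a)).
    spark.sql(
      """SELECT sum(a), b FROM t1 GROUP BY b
        |HAVING EXISTS (SELECT x FROM t2 WHERE y = cos(sum(a)))""".stripMargin).explain(true)
  }
}
{code}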
[jira] [Assigned] (SPARK-41141) avoid introducing a new aggregate expression in the analysis phase when subquery is referencing it
[ https://issues.apache.org/jira/browse/SPARK-41141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41141: Assignee: (was: Apache Spark) > avoid introducing a new aggregate expression in the analysis phase when > subquery is referencing it > -- > > Key: SPARK-41141 > URL: https://issues.apache.org/jira/browse/SPARK-41141 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.1 >Reporter: Asif >Priority: Minor > Labels: spark-sql > > Currently the analyzer phase rules on subquery referencing the aggregate > expression in outer query, avoids introducing a new aggregate only for a > single level aggregate function. It introduces new aggregate expression for > nested aggregate functions. > It is possible to avoid adding this extra aggregate expression easily, > atleast if the outer projection involving aggregate function is exactly same > as the one that is used in subquery, or if the outer query's projection > involving aggregate function is a subtree of the subquery's expression. > > Thus consider the following 2 cases: > 1) select cos (sum(a)) , b from t1 group by b having exists (select x from > t2 where y = cos(sum(a)) ) > 2) select sum(a) , b from t1 group by b having exists (select x from t2 > where y = cos(sum(a)) ) > > In both the above cases, there is no need for adding extra aggregate > expression. > > I am also investigating if its possible to avoid if the case is > > 3) select Cos(sum(a)) , b from t1 group by b having exists (select x from > t2 where y = sum(a) ) > > This Jira also is needed for another issue where subquery datasource v2 is > projecting columns which are not needed. ( no Jira filed yet for that, will > do that..) > > Will be opening a PR for this soon.. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41141) avoid introducing a new aggregate expression in the analysis phase when subquery is referencing it
[ https://issues.apache.org/jira/browse/SPARK-41141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635988#comment-17635988 ] Apache Spark commented on SPARK-41141: -- User 'ahshahid' has created a pull request for this issue: https://github.com/apache/spark/pull/38714 > avoid introducing a new aggregate expression in the analysis phase when > subquery is referencing it > -- > > Key: SPARK-41141 > URL: https://issues.apache.org/jira/browse/SPARK-41141 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.1 >Reporter: Asif >Priority: Minor > Labels: spark-sql > > Currently the analyzer phase rules on subquery referencing the aggregate > expression in outer query, avoids introducing a new aggregate only for a > single level aggregate function. It introduces new aggregate expression for > nested aggregate functions. > It is possible to avoid adding this extra aggregate expression easily, > atleast if the outer projection involving aggregate function is exactly same > as the one that is used in subquery, or if the outer query's projection > involving aggregate function is a subtree of the subquery's expression. > > Thus consider the following 2 cases: > 1) select cos (sum(a)) , b from t1 group by b having exists (select x from > t2 where y = cos(sum(a)) ) > 2) select sum(a) , b from t1 group by b having exists (select x from t2 > where y = cos(sum(a)) ) > > In both the above cases, there is no need for adding extra aggregate > expression. > > I am also investigating if its possible to avoid if the case is > > 3) select Cos(sum(a)) , b from t1 group by b having exists (select x from > t2 where y = sum(a) ) > > This Jira also is needed for another issue where subquery datasource v2 is > projecting columns which are not needed. ( no Jira filed yet for that, will > do that..) > > Will be opening a PR for this soon.. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41141) avoid introducing a new aggregate expression in the analysis phase when subquery is referencing it
[ https://issues.apache.org/jira/browse/SPARK-41141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41141: Assignee: Apache Spark > avoid introducing a new aggregate expression in the analysis phase when > subquery is referencing it > -- > > Key: SPARK-41141 > URL: https://issues.apache.org/jira/browse/SPARK-41141 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.1 >Reporter: Asif >Assignee: Apache Spark >Priority: Minor > Labels: spark-sql > > Currently the analyzer phase rules on subquery referencing the aggregate > expression in outer query, avoids introducing a new aggregate only for a > single level aggregate function. It introduces new aggregate expression for > nested aggregate functions. > It is possible to avoid adding this extra aggregate expression easily, > atleast if the outer projection involving aggregate function is exactly same > as the one that is used in subquery, or if the outer query's projection > involving aggregate function is a subtree of the subquery's expression. > > Thus consider the following 2 cases: > 1) select cos (sum(a)) , b from t1 group by b having exists (select x from > t2 where y = cos(sum(a)) ) > 2) select sum(a) , b from t1 group by b having exists (select x from t2 > where y = cos(sum(a)) ) > > In both the above cases, there is no need for adding extra aggregate > expression. > > I am also investigating if its possible to avoid if the case is > > 3) select Cos(sum(a)) , b from t1 group by b having exists (select x from > t2 where y = sum(a) ) > > This Jira also is needed for another issue where subquery datasource v2 is > projecting columns which are not needed. ( no Jira filed yet for that, will > do that..) > > Will be opening a PR for this soon.. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41141) avoid introducing a new aggregate expression in the analysis phase when subquery is referencing it
[ https://issues.apache.org/jira/browse/SPARK-41141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-41141: - Priority: Minor (was: Major) > avoid introducing a new aggregate expression in the analysis phase when > subquery is referencing it > -- > > Key: SPARK-41141 > URL: https://issues.apache.org/jira/browse/SPARK-41141 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.1 >Reporter: Asif >Priority: Minor > Labels: spark-sql > > Currently the analyzer phase rules on subquery referencing the aggregate > expression in outer query, avoids introducing a new aggregate only for a > single level aggregate function. It introduces new aggregate expression for > nested aggregate functions. > It is possible to avoid adding this extra aggregate expression easily, > atleast if the outer projection involving aggregate function is exactly same > as the one that is used in subquery, or if the outer query's projection > involving aggregate function is a subtree of the subquery's expression. > > Thus consider the following 2 cases: > 1) select cos (sum(a)) , b from t1 group by b having exists (select x from > t2 where y = cos(sum(a)) ) > 2) select sum(a) , b from t1 group by b having exists (select x from t2 > where y = cos(sum(a)) ) > > In both the above cases, there is no need for adding extra aggregate > expression. > > I am also investigating if its possible to avoid if the case is > > 3) select Cos(sum(a)) , b from t1 group by b having exists (select x from > t2 where y = sum(a) ) > > This Jira also is needed for another issue where subquery datasource v2 is > projecting columns which are not needed. ( no Jira filed yet for that, will > do that..) > > Will be opening a PR for this soon.. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41196) Homogenize the protobuf version across server and client
[ https://issues.apache.org/jira/browse/SPARK-41196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-41196. --- Fix Version/s: 3.4.0 Assignee: Martin Grund Resolution: Fixed > Homogenize the protobuf version across server and client > > > Key: SPARK-41196 > URL: https://issues.apache.org/jira/browse/SPARK-41196 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Major > Fix For: 3.4.0 > > > Homogenize the protobuf version across server and client -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41197) Upgrade Kafka version to 3.3 release
Ted Yu created SPARK-41197: -- Summary: Upgrade Kafka version to 3.3 release Key: SPARK-41197 URL: https://issues.apache.org/jira/browse/SPARK-41197 Project: Spark Issue Type: Improvement Components: Java API Affects Versions: 3.3.1 Reporter: Ted Yu Kafka 3.3 has been released. This issue upgrades Kafka dependency to 3.3 release. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41196) Homogenize the protobuf version across server and client
[ https://issues.apache.org/jira/browse/SPARK-41196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41196: Assignee: (was: Apache Spark) > Homogenize the protobuf version across server and client > > > Key: SPARK-41196 > URL: https://issues.apache.org/jira/browse/SPARK-41196 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Major > > Homogenize the protobuf version across server and client -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41196) Homogenize the protobuf version across server and client
[ https://issues.apache.org/jira/browse/SPARK-41196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41196: Assignee: Apache Spark > Homogenize the protobuf version across server and client > > > Key: SPARK-41196 > URL: https://issues.apache.org/jira/browse/SPARK-41196 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Apache Spark >Priority: Major > > Homogenize the protobuf version across server and client -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41196) Homogenize the protobuf version across server and client
[ https://issues.apache.org/jira/browse/SPARK-41196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635973#comment-17635973 ] Apache Spark commented on SPARK-41196: -- User 'grundprinzip' has created a pull request for this issue: https://github.com/apache/spark/pull/38693 > Homogenize the protobuf version across server and client > > > Key: SPARK-41196 > URL: https://issues.apache.org/jira/browse/SPARK-41196 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Major > > Homogenize the protobuf version across server and client -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41196) Homogenize the protobuf version across server and client
Martin Grund created SPARK-41196: Summary: Homogenize the protobuf version across server and client Key: SPARK-41196 URL: https://issues.apache.org/jira/browse/SPARK-41196 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Martin Grund Homogenize the protobuf version across server and client -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41195) Support PIVOT/UNPIVOT with join children
[ https://issues.apache.org/jira/browse/SPARK-41195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41195: Assignee: (was: Apache Spark) > Support PIVOT/UNPIVOT with join children > > > Key: SPARK-41195 > URL: https://issues.apache.org/jira/browse/SPARK-41195 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41195) Support PIVOT/UNPIVOT with join children
[ https://issues.apache.org/jira/browse/SPARK-41195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635916#comment-17635916 ] Apache Spark commented on SPARK-41195: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/38713 > Support PIVOT/UNPIVOT with join children > > > Key: SPARK-41195 > URL: https://issues.apache.org/jira/browse/SPARK-41195 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41195) Support PIVOT/UNPIVOT with join children
[ https://issues.apache.org/jira/browse/SPARK-41195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41195: Assignee: Apache Spark > Support PIVOT/UNPIVOT with join children > > > Key: SPARK-41195 > URL: https://issues.apache.org/jira/browse/SPARK-41195 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37648) Spark catalog and Delta tables
[ https://issues.apache.org/jira/browse/SPARK-37648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635913#comment-17635913 ] Michael F commented on SPARK-37648: --- Any update here? This continues to be an issue in 3.3.1 > Spark catalog and Delta tables > -- > > Key: SPARK-37648 > URL: https://issues.apache.org/jira/browse/SPARK-37648 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 > Environment: Spark version 3.1.2 > Scala version 2.12.10 > Hive version 2.3.7 > Delta version 1.0.0 >Reporter: Hanna Liashchuk >Priority: Major > > I'm using Spark with Delta tables; the tables are created, but they show no > columns. > Steps to reproduce: > 1. Start spark-shell > {code:java} > spark-shell --conf > "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf > "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" > --conf "spark.sql.legacy.parquet.int96RebaseModeInWrite=LEGACY"{code} > 2. Create delta table > {code:java} > spark.range(10).write.format("delta").option("path", > "tmp/delta").saveAsTable("delta"){code} > 3. Make sure the table exists > {code:java} > spark.catalog.listTables.show{code} > 4. Find out that the columns are not listed > {code:java} > spark.catalog.listColumns("delta").show{code} > This is critical for Delta integration with BI tools such as Power > BI or Tableau, as they query the Spark catalog for metadata and we get > errors that no columns are found. > Discussion can be found in the Delta repository - > https://github.com/delta-io/delta/issues/695 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41195) Support PIVOT/UNPIVOT with join children
Wenchen Fan created SPARK-41195: --- Summary: Support PIVOT/UNPIVOT with join children Key: SPARK-41195 URL: https://issues.apache.org/jira/browse/SPARK-41195 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.4.0 Reporter: Wenchen Fan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40999) Hints on subqueries are not properly propagated
[ https://issues.apache.org/jira/browse/SPARK-40999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-40999: --- Assignee: Fredrik Klauß > Hints on subqueries are not properly propagated > --- > > Key: SPARK-40999 > URL: https://issues.apache.org/jira/browse/SPARK-40999 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Core >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.4.0, 3.3.1 >Reporter: Fredrik Klauß >Assignee: Fredrik Klauß >Priority: Major > Fix For: 3.4.0 > > > Currently, if a user tries to specify a query like the following, the hints > on the subquery will be lost. > {code:java} > SELECT * FROM target t WHERE EXISTS > (SELECT /*+ BROADCAST */ * FROM source s WHERE s.key = t.key){code} > This happens as hints are removed from the plan and pulled into joins in the > beginning of the optimization stage, but subqueries are only turned into > joins during optimization. As we remove any hints that are not below a join, > we end up removing hints that are below a subquery. > > It worked prior to a refactoring that added hints as a field to joins > (SPARK-26065) and can cause a regression if someone made use of hints on > subqueries before. > > To resolve this, we add a hint field to SubqueryExpression that any hints > inside a subquery's plan can be pulled into during EliminateResolvedHint, and > then pass this hint on when the subquery is turned into a join. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40999) Hints on subqueries are not properly propagated
[ https://issues.apache.org/jira/browse/SPARK-40999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-40999. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38497 [https://github.com/apache/spark/pull/38497] > Hints on subqueries are not properly propagated > --- > > Key: SPARK-40999 > URL: https://issues.apache.org/jira/browse/SPARK-40999 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Core >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.4.0, 3.3.1 >Reporter: Fredrik Klauß >Priority: Major > Fix For: 3.4.0 > > > Currently, if a user specifies a query like the following, the hints on the > subquery will be lost. > {code:java} > SELECT * FROM target t WHERE EXISTS > (SELECT /*+ BROADCAST */ * FROM source s WHERE s.key = t.key){code} > This happens because hints are removed from the plan and pulled into joins at > the beginning of the optimization stage, but subqueries are only turned into > joins during optimization. As we remove any hints that are not below a join, > we end up removing hints that are below a subquery. > > Hints on subqueries worked prior to the refactoring that added hints as a > field on joins (SPARK-26065), so this can cause a regression for anyone who > relied on them. > > To resolve this, we add a hint field to SubqueryExpression into which any > hints inside a subquery's plan are pulled during EliminateResolvedHint; this > hint is then passed on when the subquery is turned into a join. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
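The fix described above can be sketched as a tiny tree transform. The following is a toy model in Python, not Spark's actual classes (the names `Hint`, `Scan`, `Subquery`, and `pull_hints` are illustrative only): hint nodes found inside a subquery's plan are removed from the plan and stored on the subquery node itself, so they survive until the subquery is rewritten into a join.

```python
# Toy model of the SPARK-40999 fix: pull hints out of a subquery's plan into
# a field on the subquery node so they are not lost before the subquery is
# turned into a join. Class and function names are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Scan:
    table: str


@dataclass
class Hint:
    name: str
    child: object


@dataclass
class Subquery:
    plan: object
    hints: list = field(default_factory=list)


def pull_hints(node):
    """Return (plan without Hint nodes, hints collected at this level)."""
    if isinstance(node, Hint):
        child, hints = pull_hints(node.child)
        return child, [node.name] + hints
    if isinstance(node, Subquery):
        inner, hints = pull_hints(node.plan)
        # Keep the collected hints on the subquery instead of discarding them.
        return Subquery(inner, hints), []
    return node, []


plan, _ = pull_hints(Subquery(Hint("BROADCAST", Scan("source"))))
print(plan)  # Subquery(plan=Scan(table='source'), hints=['BROADCAST'])
```

Without the `Subquery` branch retaining `hints`, the BROADCAST hint would simply be stripped, which mirrors the lost-hint behavior the issue reports.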
[jira] [Assigned] (SPARK-41161) Upgrade `scala-parser-combinators` to 2.1.1
[ https://issues.apache.org/jira/browse/SPARK-41161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41161: Assignee: Yang Jie > Upgrade `scala-parser-combinators` to 2.1.1 > --- > > Key: SPARK-41161 > URL: https://issues.apache.org/jira/browse/SPARK-41161 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41161) Upgrade `scala-parser-combinators` to 2.1.1
[ https://issues.apache.org/jira/browse/SPARK-41161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41161. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38675 [https://github.com/apache/spark/pull/38675] > Upgrade `scala-parser-combinators` to 2.1.1 > --- > > Key: SPARK-41161 > URL: https://issues.apache.org/jira/browse/SPARK-41161 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41166) Check errorSubClass of DataTypeMismatch in *ExpressionSuites
[ https://issues.apache.org/jira/browse/SPARK-41166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-41166. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38688 [https://github.com/apache/spark/pull/38688] > Check errorSubClass of DataTypeMismatch in *ExpressionSuites > > > Key: SPARK-41166 > URL: https://issues.apache.org/jira/browse/SPARK-41166 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41166) Check errorSubClass of DataTypeMismatch in *ExpressionSuites
[ https://issues.apache.org/jira/browse/SPARK-41166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-41166: Assignee: BingKun Pan > Check errorSubClass of DataTypeMismatch in *ExpressionSuites > > > Key: SPARK-41166 > URL: https://issues.apache.org/jira/browse/SPARK-41166 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38093) Set shuffleMergeAllowed to false for a determinate stage after the stage is finalized
[ https://issues.apache.org/jira/browse/SPARK-38093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635786#comment-17635786 ] Mars commented on SPARK-38093: -- comment https://github.com/apache/spark/pull/34122#discussion_r796929787 > Set shuffleMergeAllowed to false for a determinate stage after the stage is > finalized > - > > Key: SPARK-38093 > URL: https://issues.apache.org/jira/browse/SPARK-38093 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.2.1 >Reporter: Venkata krishnan Sowrirajan >Priority: Major > > Currently we set shuffleMergeAllowed to false before > prepareShuffleServicesForShuffleMapStage if the shuffle dependency is already > finalized. Ideally, this should be done right after shuffle dependency > finalization for a determinate stage. cc [~mridulm80] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41130) Rename OUT_OF_DECIMAL_TYPE_RANGE to NUMERIC_OUT_OF_SUPPORTED_RANGE
[ https://issues.apache.org/jira/browse/SPARK-41130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-41130: Assignee: Haejoon Lee > Rename OUT_OF_DECIMAL_TYPE_RANGE to NUMERIC_OUT_OF_SUPPORTED_RANGE > -- > > Key: SPARK-41130 > URL: https://issues.apache.org/jira/browse/SPARK-41130 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > > We should use a proper name for the error class and a clear error message -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41130) Rename OUT_OF_DECIMAL_TYPE_RANGE to NUMERIC_OUT_OF_SUPPORTED_RANGE
[ https://issues.apache.org/jira/browse/SPARK-41130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-41130. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38644 [https://github.com/apache/spark/pull/38644] > Rename OUT_OF_DECIMAL_TYPE_RANGE to NUMERIC_OUT_OF_SUPPORTED_RANGE > -- > > Key: SPARK-41130 > URL: https://issues.apache.org/jira/browse/SPARK-41130 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > We should use a proper name for the error class and a clear error message -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41192) Task finished before speculative task scheduled leads to holding idle executors
[ https://issues.apache.org/jira/browse/SPARK-41192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635763#comment-17635763 ] Apache Spark commented on SPARK-41192: -- User 'toujours33' has created a pull request for this issue: https://github.com/apache/spark/pull/38711 > Task finished before speculative task scheduled leads to holding idle > executors > --- > > Key: SPARK-41192 > URL: https://issues.apache.org/jira/browse/SPARK-41192 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.2, 3.3.1 >Reporter: Yazhi Wang >Priority: Minor > Labels: dynamic_allocation > Attachments: dynamic-executors, dynamic-log > > > When a task finishes before its speculative task has been scheduled by the > DAGScheduler, the speculative task is still considered pending and counts > towards the calculation of the number of needed executors, which leads to > requesting more executors than needed. > h2. Background & Reproduce > In one of our production jobs, we found that ExecutorAllocationManager was > holding more executors than needed. > This is difficult to reproduce in a test environment, so in order to stably > reproduce and debug it, we temporarily commented out the scheduling code of > speculative tasks in TaskSetManager:363 to ensure that the task completes > before the speculative task is scheduled.
> {code:java} > // Original code > private def dequeueTask( > execId: String, > host: String, > maxLocality: TaskLocality.Value): Option[(Int, TaskLocality.Value, > Boolean)] = { > // Tries to schedule a regular task first; if it returns None, then > schedules > // a speculative task > dequeueTaskHelper(execId, host, maxLocality, false).orElse( > dequeueTaskHelper(execId, host, maxLocality, true)) > } > // Speculative task will never be scheduled > private def dequeueTask( > execId: String, > host: String, > maxLocality: TaskLocality.Value): Option[(Int, TaskLocality.Value, > Boolean)] = { > // Tries to schedule a regular task first; if it returns None, then > schedules > // a speculative task > dequeueTaskHelper(execId, host, maxLocality, false) > } {code} > Referring to the examples in SPARK-30511: > You will see that when running the last task, we are holding 38 executors > (see attachment), which is exactly (149 + 1) / 4 = 38. But actually there are > only 2 tasks running, which requires only Math.max(20, 2 / 4) = 20 executors. > {code:java} > ./bin/spark-shell --master yarn --conf spark.speculation=true --conf > spark.executor.cores=4 --conf spark.dynamicAllocation.enabled=true --conf > spark.dynamicAllocation.minExecutors=20 --conf > spark.dynamicAllocation.maxExecutors=1000 {code} > {code:java} > val n = 4000 > val someRDD = sc.parallelize(1 to n, n) > someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => { > if (index > 3998) { > Thread.sleep(1000 * 1000) > } else if (index > 3850) { > Thread.sleep(50 * 1000) // Fake running tasks > } else { > Thread.sleep(100) > } > Array.fill[Int](1)(1).iterator > }){code} > > I will have a PR ready to fix this issue -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
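The accounting error above can be illustrated with a small sketch. This is a simplified, hypothetical version of the dynamic-allocation target arithmetic, not Spark's actual ExecutorAllocationManager code: a speculative copy of an already-finished task should not be counted as pending.

```python
import math


def needed_executors(pending, running, cores_per_executor, min_executors):
    # Simplified dynamic-allocation target: enough executors to run all
    # unfinished tasks, but never fewer than the configured minimum.
    return max(min_executors,
               math.ceil((pending + running) / cores_per_executor))


# Buggy accounting: speculative copies of 149 already-finished tasks are still
# counted as pending alongside 1 running task, so (149 + 1) / 4 rounds up to
# 38 executors.
print(needed_executors(pending=149, running=1,
                       cores_per_executor=4, min_executors=20))  # 38

# Correct accounting: only the 2 genuinely unfinished tasks remain, so the
# configured minimum of 20 executors already suffices.
print(needed_executors(pending=0, running=2,
                       cores_per_executor=4, min_executors=20))  # 20
```

This reproduces the numbers in the report: 38 executors held versus the 20 actually needed.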
[jira] [Commented] (SPARK-38958) Override S3 Client in Spark Write/Read calls
[ https://issues.apache.org/jira/browse/SPARK-38958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635756#comment-17635756 ] Daniel Carl Jones commented on SPARK-38958: --- Upgrading from the V1 to the V2 AWS SDK is likely to introduce a breaking change to the interface of the client factory, since we will be changing the Java interface returned by the factory for starters. This means the factory function signatures will need to be updated, and, given that the V2 SDK has both a sync and an async client, a second method may be needed (with the same headers attached again). > Override S3 Client in Spark Write/Read calls > > > Key: SPARK-38958 > URL: https://issues.apache.org/jira/browse/SPARK-38958 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: Hershal >Priority: Major > > Hello, > I have been working to use Spark to read and write data to S3. Unfortunately, > there are a few S3 headers that I need to add to my Spark read/write calls. > After much looking, I have not found a way to replace the S3 client that > Spark uses to make the read/write calls. I also have not found a > configuration that allows me to pass in S3 headers. Here is an example of > some common S3 request headers > ([https://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonRequestHeaders.html]). > Does there already exist functionality to add S3 headers to Spark read/write > calls, or to pass in a custom client that would attach these headers to every > read/write request? Appreciate the help and feedback > > Thanks, -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38958) Override S3 Client in Spark Write/Read calls
[ https://issues.apache.org/jira/browse/SPARK-38958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635755#comment-17635755 ] Daniel Carl Jones commented on SPARK-38958: --- I had someone reach out to me with a similar request - static headers on all S3 requests for a given S3A file system. If static per-filesystem headers were to be added as a configuration-driven feature, do we have any idea what the configuration might look like? I.e., how do we model a list of key/value pairs in the Hadoop configuration? The best I see is "getStrings", for which we would need to check that the list is even (the right number of k,v pairs), or maybe have each k,v pair be one string joined by an equals symbol. Also, are there any reasons not to have such a configuration, or any better way to design it? > Override S3 Client in Spark Write/Read calls > > > Key: SPARK-38958 > URL: https://issues.apache.org/jira/browse/SPARK-38958 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: Hershal >Priority: Major > > Hello, > I have been working to use Spark to read and write data to S3. Unfortunately, > there are a few S3 headers that I need to add to my Spark read/write calls. > After much looking, I have not found a way to replace the S3 client that > Spark uses to make the read/write calls. I also have not found a > configuration that allows me to pass in S3 headers. Here is an example of > some common S3 request headers > ([https://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonRequestHeaders.html]). > Does there already exist functionality to add S3 headers to Spark read/write > calls, or to pass in a custom client that would attach these headers to every > read/write request? Appreciate the help and feedback > > Thanks, -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
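One of the options discussed above, each key/value pair as a single "key=value" string in a multi-valued config entry, could be parsed along these lines. This is a sketch only: the config key `fs.s3a.custom.headers` and the helper are hypothetical, not an existing S3A option.

```python
def parse_header_pairs(values):
    """Parse entries like ["Cache-Control=no-store", "x-amz-meta-team=data"]
    into a header dict.

    Each entry is one "key=value" string; splitting on the first '=' lets the
    value itself contain '='. Malformed entries raise instead of being
    silently dropped.
    """
    headers = {}
    for raw in values:
        key, sep, value = raw.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"malformed header entry: {raw!r}")
        headers[key.strip()] = value.strip()
    return headers


# Simulating a comma-separated value as Hadoop's Configuration.getStrings
# would return it for a hypothetical key like "fs.s3a.custom.headers":
raw = "Cache-Control=no-store, x-amz-meta-team=data"
print(parse_header_pairs([s for s in raw.split(",")]))
# {'Cache-Control': 'no-store', 'x-amz-meta-team': 'data'}
```

Joining on '=' avoids the even-length check that a flat alternating key/value list would require, which is why it may be the more robust of the two shapes floated in the comment.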
[jira] [Assigned] (SPARK-41107) Install memory-profiler in the CI
[ https://issues.apache.org/jira/browse/SPARK-41107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yikun Jiang reassigned SPARK-41107: --- Assignee: Xinrong Meng > Install memory-profiler in the CI > - > > Key: SPARK-41107 > URL: https://issues.apache.org/jira/browse/SPARK-41107 > Project: Spark > Issue Type: Sub-task > Components: Build, PySpark, Tests >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > > PySpark memory profiler depends on > [memory-profiler|https://pypi.org/project/memory-profiler/]. > The ticket proposes to install memory-profiler in the CI to enable related > tests. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41107) Install memory-profiler in the CI
[ https://issues.apache.org/jira/browse/SPARK-41107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yikun Jiang resolved SPARK-41107. - Resolution: Fixed Resolved by https://github.com/apache/spark/pull/38611 > Install memory-profiler in the CI > - > > Key: SPARK-41107 > URL: https://issues.apache.org/jira/browse/SPARK-41107 > Project: Spark > Issue Type: Sub-task > Components: Build, PySpark, Tests >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > > PySpark memory profiler depends on > [memory-profiler|https://pypi.org/project/memory-profiler/]. > The ticket proposes to install memory-profiler in the CI to enable related > tests. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41189) Add an environment to switch on and off namedtuple hack
[ https://issues.apache.org/jira/browse/SPARK-41189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41189. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38700 [https://github.com/apache/spark/pull/38700] > Add an environment to switch on and off namedtuple hack > > > Key: SPARK-41189 > URL: https://issues.apache.org/jira/browse/SPARK-41189 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.0, 3.3.1 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > SPARK-32079 removed the namedtuple hack, but there are still bugs being fixed > in cloudpickle upstream. This JIRA aims to add a switch to turn the hack on > and off. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
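An environment-variable switch of this kind usually follows a simple opt-out pattern, read once at import time. A minimal sketch of that pattern follows; the variable name `PYSPARK_ENABLE_NAMEDTUPLE_HACK` is hypothetical, not necessarily the one the actual pull request uses.

```python
import os


def namedtuple_hack_enabled(environ=os.environ):
    # Opt-out switch: the hack stays on unless explicitly disabled, so the
    # default behavior is unchanged for existing users. The variable name
    # below is an illustrative placeholder.
    value = environ.get("PYSPARK_ENABLE_NAMEDTUPLE_HACK", "1")
    return value.lower() not in ("0", "false")


print(namedtuple_hack_enabled({}))  # True (default on)
print(namedtuple_hack_enabled({"PYSPARK_ENABLE_NAMEDTUPLE_HACK": "false"}))  # False
```

Defaulting to "on" is the conservative choice here: it preserves current behavior while giving users hit by upstream cloudpickle bugs a way to disable the hack.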
[jira] [Assigned] (SPARK-41189) Add an environment to switch on and off namedtuple hack
[ https://issues.apache.org/jira/browse/SPARK-41189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41189: Assignee: Hyukjin Kwon > Add an environment to switch on and off namedtuple hack > > > Key: SPARK-41189 > URL: https://issues.apache.org/jira/browse/SPARK-41189 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.0, 3.3.1 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > SPARK-32079 removed the namedtuple hack, but there are still bugs being fixed > in cloudpickle upstream. This JIRA aims to add a switch to turn the hack on > and off. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org