[jira] [Updated] (SPARK-46674) Remove the Hive Index methods in HiveShim
[ https://issues.apache.org/jira/browse/SPARK-46674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kent Yao updated SPARK-46674:
-----------------------------
    Summary: Remove the Hive Index methods in HiveShim  (was: Remove the Hive Index methods)

> Remove the Hive Index methods in HiveShim
> -----------------------------------------
>
> Key: SPARK-46674
> URL: https://issues.apache.org/jira/browse/SPARK-46674
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Kent Yao
> Priority: Major

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46667) XML: Throw error on multiple XML data source
[ https://issues.apache.org/jira/browse/SPARK-46667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-46667:
-----------------------------------
    Labels: pull-request-available  (was: )

> XML: Throw error on multiple XML data source
> --------------------------------------------
>
> Key: SPARK-46667
> URL: https://issues.apache.org/jira/browse/SPARK-46667
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Sandip Agarwala
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (SPARK-46674) Remove the Hive Index methods
Kent Yao created SPARK-46674:
-----------------------------

Summary: Remove the Hive Index methods
Key: SPARK-46674
URL: https://issues.apache.org/jira/browse/SPARK-46674
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao
[jira] [Created] (SPARK-46673) Refine docstring `aes_encrypt\aes_decrypt\try_aes_decrypt`
BingKun Pan created SPARK-46673:
--------------------------------

Summary: Refine docstring `aes_encrypt\aes_decrypt\try_aes_decrypt`
Key: SPARK-46673
URL: https://issues.apache.org/jira/browse/SPARK-46673
Project: Spark
Issue Type: Sub-task
Components: Documentation
Affects Versions: 4.0.0
Reporter: BingKun Pan
[jira] [Updated] (SPARK-46673) Refine docstring `aes_encrypt/aes_decrypt/try_aes_decrypt`
[ https://issues.apache.org/jira/browse/SPARK-46673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

BingKun Pan updated SPARK-46673:
--------------------------------
    Summary: Refine docstring `aes_encrypt/aes_decrypt/try_aes_decrypt`  (was: Refine docstring `aes_encrypt\aes_decrypt\try_aes_decrypt`)

> Refine docstring `aes_encrypt/aes_decrypt/try_aes_decrypt`
> ----------------------------------------------------------
>
> Key: SPARK-46673
> URL: https://issues.apache.org/jira/browse/SPARK-46673
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation
> Affects Versions: 4.0.0
> Reporter: BingKun Pan
> Priority: Minor
[jira] [Updated] (SPARK-46672) Upgrade log4j2 to 2.22.1
[ https://issues.apache.org/jira/browse/SPARK-46672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-46672:
-----------------------------------
    Labels: pull-request-available  (was: )

> Upgrade log4j2 to 2.22.1
> ------------------------
>
> Key: SPARK-46672
> URL: https://issues.apache.org/jira/browse/SPARK-46672
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 4.0.0
> Reporter: Yang Jie
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (SPARK-46672) Upgrade log4j2 to 2.22.1
Yang Jie created SPARK-46672:
-----------------------------

Summary: Upgrade log4j2 to 2.22.1
Key: SPARK-46672
URL: https://issues.apache.org/jira/browse/SPARK-46672
Project: Spark
Issue Type: Improvement
Components: Build
Affects Versions: 4.0.0
Reporter: Yang Jie
[jira] [Resolved] (SPARK-46614) Refine docstring `make_timestamp/make_timestamp_ltz/make_timestamp_ntz/make_ym_interval`
[ https://issues.apache.org/jira/browse/SPARK-46614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-46614.
----------------------------------
    Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/44679

> Refine docstring `make_timestamp/make_timestamp_ltz/make_timestamp_ntz/make_ym_interval`
> ----------------------------------------------------------------------------------------
>
> Key: SPARK-46614
> URL: https://issues.apache.org/jira/browse/SPARK-46614
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation
> Affects Versions: 4.0.0
> Reporter: BingKun Pan
> Priority: Minor
> Labels: pull-request-available
[jira] [Assigned] (SPARK-46614) Refine docstring `make_timestamp/make_timestamp_ltz/make_timestamp_ntz/make_ym_interval`
[ https://issues.apache.org/jira/browse/SPARK-46614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-46614:
------------------------------------
    Assignee: BingKun Pan

> Refine docstring `make_timestamp/make_timestamp_ltz/make_timestamp_ntz/make_ym_interval`
> ----------------------------------------------------------------------------------------
>
> Key: SPARK-46614
> URL: https://issues.apache.org/jira/browse/SPARK-46614
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation
> Affects Versions: 4.0.0
> Reporter: BingKun Pan
> Assignee: BingKun Pan
> Priority: Minor
> Labels: pull-request-available
[jira] [Resolved] (SPARK-46666) Make lxml as an optional testing dependency in test_session
[ https://issues.apache.org/jira/browse/SPARK-46666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-46666.
----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 44676
[https://github.com/apache/spark/pull/44676]

> Make lxml as an optional testing dependency in test_session
> -----------------------------------------------------------
>
> Key: SPARK-46666
> URL: https://issues.apache.org/jira/browse/SPARK-46666
> Project: Spark
> Issue Type: Test
> Components: PySpark, Tests
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
>
> {code}
> Traceback (most recent call last):
>   File "", line 198, in _run_module_as_main
>   File "", line 88, in _run_code
>   File "/__w/spark/spark/python/pyspark/sql/tests/test_session.py", line 22, in
>     from lxml import etree
> ModuleNotFoundError: No module named 'lxml'
> {code}
[jira] [Assigned] (SPARK-46666) Make lxml as an optional testing dependency in test_session
[ https://issues.apache.org/jira/browse/SPARK-46666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-46666:
------------------------------------
    Assignee: Hyukjin Kwon

> Make lxml as an optional testing dependency in test_session
> -----------------------------------------------------------
>
> Key: SPARK-46666
> URL: https://issues.apache.org/jira/browse/SPARK-46666
> Project: Spark
> Issue Type: Test
> Components: PySpark, Tests
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Minor
> Labels: pull-request-available
>
> {code}
> Traceback (most recent call last):
>   File "", line 198, in _run_module_as_main
>   File "", line 88, in _run_code
>   File "/__w/spark/spark/python/pyspark/sql/tests/test_session.py", line 22, in
>     from lxml import etree
> ModuleNotFoundError: No module named 'lxml'
> {code}
[jira] [Created] (SPARK-46671) InferFiltersFromConstraint rule is creating a redundant filter
Asif created SPARK-46671:
-------------------------

Summary: InferFiltersFromConstraint rule is creating a redundant filter
Key: SPARK-46671
URL: https://issues.apache.org/jira/browse/SPARK-46671
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.5.0
Reporter: Asif

While bringing my old PR, which uses a different approach to the ConstraintPropagation algorithm ([SPARK-33152|https://issues.apache.org/jira/browse/SPARK-33152]), in sync with current master, I noticed a test failure in my branch for SPARK-33152. The failing test is in InferFiltersFromConstraintSuite:

{code}
test("SPARK-43095: Avoid Once strategy's idempotence is broken for batch: Infer Filters") {
  val x = testRelation.as("x")
  val y = testRelation.as("y")
  val z = testRelation.as("z")

  // Removes EqualNullSafe when constructing candidate constraints
  comparePlans(
    InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa"))
      .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze),
    x.select($"x.a", $"x.a".as("xa"))
      .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && $"xa" === $"x.a").analyze)

  // Once strategy's idempotence is not broken
  val originalQuery = x.join(y, condition = Some($"x.a" === $"y.a"))
    .select($"x.a", $"x.a".as("xa")).as("xy")
    .join(z, condition = Some($"xy.a" === $"z.a")).analyze
  val correctAnswer = x.where($"a".isNotNull)
    .join(y.where($"a".isNotNull), condition = Some($"x.a" === $"y.a"))
    .select($"x.a", $"x.a".as("xa")).as("xy")
    .join(z.where($"a".isNotNull), condition = Some($"xy.a" === $"z.a")).analyze
  val optimizedQuery = InferFiltersFromConstraints(originalQuery)
  comparePlans(optimizedQuery, correctAnswer)
  comparePlans(InferFiltersFromConstraints(optimizedQuery), correctAnswer)
}
{code}

In the above test, I believe the assertion below is not correct: a redundant filter is getting created. Of the two isNotNull constraints, only one should be created.

{code}
$"xa".isNotNull && $"x.a".isNotNull
{code}

Because the presence of (xa#0 = a#0) automatically implies that if one attribute is not null, the other also has to be not null:

{code}
// Removes EqualNullSafe when constructing candidate constraints
comparePlans(
  InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa"))
    .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze),
  x.select($"x.a", $"x.a".as("xa"))
    .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && $"xa" === $"x.a").analyze)
{code}

This is not a big issue, but it highlights the need to take a fresh look at the ConstraintPropagation code and related logic. I am filing this jira so that the constraint code can be tightened/made more robust.
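The deduplication the reporter is asking for can be illustrated with a toy model (a Python sketch, not Spark's actual ConstraintPropagation code): group attributes into equivalence classes by strict-equality constraints, then emit a single IsNotNull per class, since `a = b` evaluating to true forces both sides to be non-null together.

```python
# Toy constraint inference: one IsNotNull per equality-equivalence class.
# A simplified sketch for illustration only, not Spark's implementation.

def infer_not_null(equalities):
    """equalities: list of (attr, attr) pairs from strict '=' predicates."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:           # path halving
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in equalities:
        union(a, b)

    # Collect equivalence classes and pick one representative per class.
    classes = {}
    for x in list(parent):
        classes.setdefault(find(x), set()).add(x)
    return sorted(f"isnotnull({min(members)})" for members in classes.values())

# `xa = x.a` yields a single constraint instead of two redundant ones:
print(infer_not_null([("xa", "x.a")]))  # → ['isnotnull(x.a)']
```

With transitive chains the same collapse happens: `a = b` and `b = c` produce one constraint for the whole class rather than three.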
[jira] [Updated] (SPARK-46670) Make DataSourceManager isolated and self clone-able
[ https://issues.apache.org/jira/browse/SPARK-46670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-46670:
-----------------------------------
    Labels: pull-request-available  (was: )

> Make DataSourceManager isolated and self clone-able
> ---------------------------------------------------
>
> Key: SPARK-46670
> URL: https://issues.apache.org/jira/browse/SPARK-46670
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark, SQL
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Priority: Minor
> Labels: pull-request-available
>
> Make DataSourceManager isolated and self clone-able
[jira] [Created] (SPARK-46670) Make DataSourceManager isolated and self clone-able
Hyukjin Kwon created SPARK-46670:
---------------------------------

Summary: Make DataSourceManager isolated and self clone-able
Key: SPARK-46670
URL: https://issues.apache.org/jira/browse/SPARK-46670
Project: Spark
Issue Type: Sub-task
Components: PySpark, SQL
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon

Make DataSourceManager isolated and self clone-able
[jira] [Updated] (SPARK-46668) Parallelize Sphinx build of Python API docs
[ https://issues.apache.org/jira/browse/SPARK-46668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-46668:
-----------------------------------
    Labels: pull-request-available  (was: )

> Parallelize Sphinx build of Python API docs
> -------------------------------------------
>
> Key: SPARK-46668
> URL: https://issues.apache.org/jira/browse/SPARK-46668
> Project: Spark
> Issue Type: Improvement
> Components: Documentation, PySpark
> Affects Versions: 4.0.0
> Reporter: Nicholas Chammas
> Priority: Minor
> Labels: pull-request-available
[jira] [Resolved] (SPARK-46669) Bump Kubernetes Client 6.10.0
[ https://issues.apache.org/jira/browse/SPARK-46669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Pan resolved SPARK-46669.
-------------------------------
    Resolution: Duplicate

> Bump Kubernetes Client 6.10.0
> -----------------------------
>
> Key: SPARK-46669
> URL: https://issues.apache.org/jira/browse/SPARK-46669
> Project: Spark
> Issue Type: Dependency upgrade
> Components: k8s
> Affects Versions: 4.0.0
> Reporter: Cheng Pan
> Priority: Major
[jira] [Created] (SPARK-46669) Bump Kubernetes Client 6.10.0
Cheng Pan created SPARK-46669:
------------------------------

Summary: Bump Kubernetes Client 6.10.0
Key: SPARK-46669
URL: https://issues.apache.org/jira/browse/SPARK-46669
Project: Spark
Issue Type: Dependency upgrade
Components: k8s
Affects Versions: 4.0.0
Reporter: Cheng Pan
[jira] [Created] (SPARK-46668) Parallelize Sphinx build of Python API docs
Nicholas Chammas created SPARK-46668:
-------------------------------------

Summary: Parallelize Sphinx build of Python API docs
Key: SPARK-46668
URL: https://issues.apache.org/jira/browse/SPARK-46668
Project: Spark
Issue Type: Improvement
Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Nicholas Chammas
[jira] [Updated] (SPARK-46653) Code-gen for full outer sort merge join output line by line
[ https://issues.apache.org/jira/browse/SPARK-46653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mingliang Zhu updated SPARK-46653:
----------------------------------
    Description: Be consistent with code-gen disabled: avoid OOM when the parent of SortMergeJoin cannot codegen and there are a large number of duplicate keys in a full outer sort merge join.  (was: Be consistent with code-gen disabled: avoid OOM when there are a large number of duplicate keys in a full outer sort merge join.)

> Code-gen for full outer sort merge join output line by line
> -----------------------------------------------------------
>
> Key: SPARK-46653
> URL: https://issues.apache.org/jira/browse/SPARK-46653
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: Mingliang Zhu
> Priority: Major
> Labels: pull-request-available
>
> Be consistent with code-gen disabled: avoid OOM when the parent of SortMergeJoin cannot codegen and there are a large number of duplicate keys in a full outer sort merge join.
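The "output line by line" idea behind SPARK-46653 can be sketched with a generator-based merge join (a simplified Python illustration, not Spark's generated code): each joined row is yielded one at a time, so only the current duplicate-key groups need buffering rather than the whole join output.

```python
# Full outer sort-merge join over two key-sorted lists, streaming one row
# at a time. A pedagogical sketch; Spark's actual codegen differs.

def full_outer_merge_join(left, right, key=lambda r: r[0]):
    """Yield (left_row, right_row) pairs; None marks the unmatched side."""
    i = j = 0
    while i < len(left) or j < len(right):
        if j >= len(right) or (i < len(left) and key(left[i]) < key(right[j])):
            yield (left[i], None)           # left row with no match
            i += 1
        elif i >= len(left) or key(right[j]) < key(left[i]):
            yield (None, right[j])          # right row with no match
            j += 1
        else:
            # Equal keys: gather both duplicate-key groups, then stream
            # their cross product row by row instead of materializing it.
            k, i0, j0 = key(left[i]), i, j
            while i < len(left) and key(left[i]) == k:
                i += 1
            while j < len(right) and key(right[j]) == k:
                j += 1
            for l in left[i0:i]:
                for r in right[j0:j]:
                    yield (l, r)

rows = list(full_outer_merge_join([(1, "a"), (2, "b")], [(2, "x"), (3, "y")]))
# → [((1, 'a'), None), ((2, 'b'), (2, 'x')), (None, (3, 'y'))]
```

Because the function is a generator, a consumer that processes rows incrementally never holds the full cross product of a duplicate-key group in memory at once.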
[jira] [Assigned] (SPARK-46575) Make HiveThriftServer2.startWithContext DevelopApi retriable
[ https://issues.apache.org/jira/browse/SPARK-46575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kent Yao reassigned SPARK-46575:
--------------------------------
    Assignee: Kent Yao

> Make HiveThriftServer2.startWithContext DevelopApi retriable
> ------------------------------------------------------------
>
> Key: SPARK-46575
> URL: https://issues.apache.org/jira/browse/SPARK-46575
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (SPARK-46667) XML: Throw error on multiple XML data source
Sandip Agarwala created SPARK-46667:
------------------------------------

Summary: XML: Throw error on multiple XML data source
Key: SPARK-46667
URL: https://issues.apache.org/jira/browse/SPARK-46667
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 4.0.0
Reporter: Sandip Agarwala
[jira] [Resolved] (SPARK-46575) Make HiveThriftServer2.startWithContext DevelopApi retriable
[ https://issues.apache.org/jira/browse/SPARK-46575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kent Yao resolved SPARK-46575.
------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 44575
[https://github.com/apache/spark/pull/44575]

> Make HiveThriftServer2.startWithContext DevelopApi retriable
> ------------------------------------------------------------
>
> Key: SPARK-46575
> URL: https://issues.apache.org/jira/browse/SPARK-46575
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Updated] (SPARK-46614) Refine docstring `make_timestamp/make_timestamp_ltz/make_timestamp_ntz/make_ym_interval`
[ https://issues.apache.org/jira/browse/SPARK-46614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-46614:
-----------------------------------
    Labels: pull-request-available  (was: )

> Refine docstring `make_timestamp/make_timestamp_ltz/make_timestamp_ntz/make_ym_interval`
> ----------------------------------------------------------------------------------------
>
> Key: SPARK-46614
> URL: https://issues.apache.org/jira/browse/SPARK-46614
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation
> Affects Versions: 4.0.0
> Reporter: BingKun Pan
> Priority: Minor
> Labels: pull-request-available
[jira] [Assigned] (SPARK-46656) Split `GroupbyParitySplitApplyTests`
[ https://issues.apache.org/jira/browse/SPARK-46656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng reassigned SPARK-46656:
-------------------------------------
    Assignee: Ruifeng Zheng

> Split `GroupbyParitySplitApplyTests`
> ------------------------------------
>
> Key: SPARK-46656
> URL: https://issues.apache.org/jira/browse/SPARK-46656
> Project: Spark
> Issue Type: Sub-task
> Components: PS, Tests
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Minor
> Labels: pull-request-available
[jira] [Resolved] (SPARK-46656) Split `GroupbyParitySplitApplyTests`
[ https://issues.apache.org/jira/browse/SPARK-46656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng resolved SPARK-46656.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 44664
[https://github.com/apache/spark/pull/44664]

> Split `GroupbyParitySplitApplyTests`
> ------------------------------------
>
> Key: SPARK-46656
> URL: https://issues.apache.org/jira/browse/SPARK-46656
> Project: Spark
> Issue Type: Sub-task
> Components: PS, Tests
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Resolved] (SPARK-46642) Add `getMessageTemplate` to PySpark error framework
[ https://issues.apache.org/jira/browse/SPARK-46642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haejoon Lee resolved SPARK-46642.
---------------------------------
    Resolution: Won't Fix

> Add `getMessageTemplate` to PySpark error framework
> ---------------------------------------------------
>
> Key: SPARK-46642
> URL: https://issues.apache.org/jira/browse/SPARK-46642
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Haejoon Lee
> Priority: Major
> Labels: pull-request-available
>
> We should add `getMessageTemplate` to the PySpark error framework to reach feature parity with the JVM side.
[jira] [Updated] (SPARK-46638) Create API to acquire execution memory for 'eval' and 'terminate' methods
[ https://issues.apache.org/jira/browse/SPARK-46638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-46638:
-----------------------------------
    Labels: pull-request-available  (was: )

> Create API to acquire execution memory for 'eval' and 'terminate' methods
> -------------------------------------------------------------------------
>
> Key: SPARK-46638
> URL: https://issues.apache.org/jira/browse/SPARK-46638
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Daniel
> Priority: Major
> Labels: pull-request-available
[jira] [Assigned] (SPARK-46662) Upgrade kubernetes-client to 6.10.0
[ https://issues.apache.org/jira/browse/SPARK-46662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-46662:
-------------------------------------
    Assignee: Bjørn Jørgensen

> Upgrade kubernetes-client to 6.10.0
> -----------------------------------
>
> Key: SPARK-46662
> URL: https://issues.apache.org/jira/browse/SPARK-46662
> Project: Spark
> Issue Type: Dependency upgrade
> Components: Kubernetes
> Affects Versions: 4.0.0
> Reporter: Bjørn Jørgensen
> Assignee: Bjørn Jørgensen
> Priority: Major
> Labels: pull-request-available
>
> new version https://github.com/fabric8io/kubernetes-client/releases
[jira] [Resolved] (SPARK-46662) Upgrade kubernetes-client to 6.10.0
[ https://issues.apache.org/jira/browse/SPARK-46662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-46662.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 44672
[https://github.com/apache/spark/pull/44672]

> Upgrade kubernetes-client to 6.10.0
> -----------------------------------
>
> Key: SPARK-46662
> URL: https://issues.apache.org/jira/browse/SPARK-46662
> Project: Spark
> Issue Type: Dependency upgrade
> Components: Kubernetes
> Affects Versions: 4.0.0
> Reporter: Bjørn Jørgensen
> Assignee: Bjørn Jørgensen
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> new version https://github.com/fabric8io/kubernetes-client/releases
[jira] [Resolved] (SPARK-46658) Loosen Ruby dependency specs for doc build
[ https://issues.apache.org/jira/browse/SPARK-46658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-46658.
----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 44667
[https://github.com/apache/spark/pull/44667]

> Loosen Ruby dependency specs for doc build
> ------------------------------------------
>
> Key: SPARK-46658
> URL: https://issues.apache.org/jira/browse/SPARK-46658
> Project: Spark
> Issue Type: Improvement
> Components: Documentation
> Affects Versions: 4.0.0
> Reporter: Nicholas Chammas
> Assignee: Nicholas Chammas
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Assigned] (SPARK-46658) Loosen Ruby dependency specs for doc build
[ https://issues.apache.org/jira/browse/SPARK-46658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-46658:
------------------------------------
    Assignee: Nicholas Chammas

> Loosen Ruby dependency specs for doc build
> ------------------------------------------
>
> Key: SPARK-46658
> URL: https://issues.apache.org/jira/browse/SPARK-46658
> Project: Spark
> Issue Type: Improvement
> Components: Documentation
> Affects Versions: 4.0.0
> Reporter: Nicholas Chammas
> Assignee: Nicholas Chammas
> Priority: Minor
> Labels: pull-request-available
[jira] [Updated] (SPARK-46666) Make lxml as an optional testing dependency in test_session
[ https://issues.apache.org/jira/browse/SPARK-46666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-46666:
-----------------------------------
    Labels: pull-request-available  (was: )

> Make lxml as an optional testing dependency in test_session
> -----------------------------------------------------------
>
> Key: SPARK-46666
> URL: https://issues.apache.org/jira/browse/SPARK-46666
> Project: Spark
> Issue Type: Test
> Components: PySpark, Tests
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Priority: Minor
> Labels: pull-request-available
>
> {code}
> Traceback (most recent call last):
>   File "", line 198, in _run_module_as_main
>   File "", line 88, in _run_code
>   File "/__w/spark/spark/python/pyspark/sql/tests/test_session.py", line 22, in
>     from lxml import etree
> ModuleNotFoundError: No module named 'lxml'
> {code}
[jira] [Created] (SPARK-46666) Make lxml as an optional testing dependency in test_session
Hyukjin Kwon created SPARK-46666:
---------------------------------

Summary: Make lxml as an optional testing dependency in test_session
Key: SPARK-46666
URL: https://issues.apache.org/jira/browse/SPARK-46666
Project: Spark
Issue Type: Test
Components: PySpark, Tests
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon

{code}
Traceback (most recent call last):
  File "", line 198, in _run_module_as_main
  File "", line 88, in _run_code
  File "/__w/spark/spark/python/pyspark/sql/tests/test_session.py", line 22, in
    from lxml import etree
ModuleNotFoundError: No module named 'lxml'
{code}
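The failure above comes from importing lxml unconditionally at module load time. One common way to make such a test dependency optional (a sketch of the general pattern; the class and test names here are hypothetical, and the actual Spark fix may differ) is to probe for the module and skip the affected tests instead of failing on import:

```python
# Probe for an optional dependency and skip tests when it is missing,
# rather than raising ModuleNotFoundError at collection time.
import unittest
from importlib.util import find_spec

have_lxml = find_spec("lxml") is not None

class SessionXmlTests(unittest.TestCase):       # hypothetical test class
    @unittest.skipIf(not have_lxml, "lxml is not installed")
    def test_parse_xml(self):
        from lxml import etree                  # imported lazily, only when available
        self.assertEqual(etree.fromstring(b"<a/>").tag, "a")

if __name__ == "__main__":
    unittest.main()
```

With this shape, environments without lxml report the test as skipped instead of crashing the whole module.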
[jira] [Updated] (SPARK-46665) Remove Pandas dependency for pyspark.testing
[ https://issues.apache.org/jira/browse/SPARK-46665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-46665:
-----------------------------------
    Labels: pull-request-available  (was: )

> Remove Pandas dependency for pyspark.testing
> --------------------------------------------
>
> Key: SPARK-46665
> URL: https://issues.apache.org/jira/browse/SPARK-46665
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Haejoon Lee
> Priority: Major
> Labels: pull-request-available
>
> We should not make pyspark.testing depend on Pandas.
[jira] [Created] (SPARK-46665) Remove Pandas dependency for pyspark.testing
Haejoon Lee created SPARK-46665:
--------------------------------

Summary: Remove Pandas dependency for pyspark.testing
Key: SPARK-46665
URL: https://issues.apache.org/jira/browse/SPARK-46665
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 4.0.0
Reporter: Haejoon Lee

We should not make pyspark.testing depend on Pandas.
[jira] [Updated] (SPARK-46662) Upgrade kubernetes-client to 6.10.0
[ https://issues.apache.org/jira/browse/SPARK-46662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-46662:
-----------------------------------
    Labels: pull-request-available  (was: )

> Upgrade kubernetes-client to 6.10.0
> -----------------------------------
>
> Key: SPARK-46662
> URL: https://issues.apache.org/jira/browse/SPARK-46662
> Project: Spark
> Issue Type: Dependency upgrade
> Components: Kubernetes
> Affects Versions: 4.0.0
> Reporter: Bjørn Jørgensen
> Priority: Major
> Labels: pull-request-available
>
> new version https://github.com/fabric8io/kubernetes-client/releases
[jira] [Updated] (SPARK-46640) RemoveRedundantAliases does not account for SubqueryExpression when removing aliases
[ https://issues.apache.org/jira/browse/SPARK-46640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nikhil Sheoran updated SPARK-46640:
-----------------------------------
    Fix Version/s: (was: 4.0.0)

> RemoveRedundantAliases does not account for SubqueryExpression when removing aliases
> ------------------------------------------------------------------------------------
>
> Key: SPARK-46640
> URL: https://issues.apache.org/jira/browse/SPARK-46640
> Project: Spark
> Issue Type: Bug
> Components: Optimizer
> Affects Versions: 4.0.0
> Reporter: Nikhil Sheoran
> Priority: Minor
> Labels: pull-request-available
>
> `RemoveRedundantAliases` does not take into account the outer attributes of a `SubqueryExpression` when removing aliases, potentially removing them if it thinks they are redundant.
> This can cause scenarios where a subquery expression has conditions like `a#x = a#x`, i.e. both the attribute names and the expression ID(s) are the same. This can then lead to conflicting expression ID(s) errors.
> In `RemoveRedundantAliases`, we have an excluded AttributeSet argument denoting the references for which we should not remove aliases. For a query with a subquery expression, adding the references of this subquery to the excluded set prevents such a rewrite from happening.
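The "excluded set" mechanism the description mentions can be modeled with a toy Python sketch (the function and data shapes here are illustrative only, not Spark's optimizer code): an alias that merely renames an attribute to itself is dropped as redundant, unless it appears in the excluded set of references (e.g. outer references of a subquery expression).

```python
# Toy model of redundant-alias removal with an exclusion set.
# aliases maps alias_name -> underlying attribute name; a self-mapping
# alias is "redundant". Hypothetical shapes, for illustration only.

def remove_redundant_aliases(aliases, excluded):
    return {name: attr for name, attr in aliases.items()
            if not (name == attr and name not in excluded)}

plan = {"xa": "xa", "b": "a"}   # "xa" aliases itself -> candidate for removal

# Without exclusions, the self-alias is removed:
print(remove_redundant_aliases(plan, excluded=set()))     # → {'b': 'a'}

# If a subquery's outer references include "xa", it must survive:
print(remove_redundant_aliases(plan, excluded={"xa"}))    # → {'xa': 'xa', 'b': 'a'}
```

The second call mirrors the fix: references used by a subquery expression are added to the excluded set, so the rewrite that would later produce conflicting expression IDs never happens.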
[jira] [Created] (SPARK-46662) Upgrade kubernetes-client to 6.10.0
Bjørn Jørgensen created SPARK-46662:
------------------------------------

Summary: Upgrade kubernetes-client to 6.10.0
Key: SPARK-46662
URL: https://issues.apache.org/jira/browse/SPARK-46662
Project: Spark
Issue Type: Dependency upgrade
Components: Kubernetes
Affects Versions: 4.0.0
Reporter: Bjørn Jørgensen

new version https://github.com/fabric8io/kubernetes-client/releases
[jira] [Resolved] (SPARK-46657) Install `lxml` in Python 3.12
[ https://issues.apache.org/jira/browse/SPARK-46657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46657. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44666 [https://github.com/apache/spark/pull/44666] > Install `lxml` in Python 3.12 > - > > Key: SPARK-46657 > URL: https://issues.apache.org/jira/browse/SPARK-46657 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46657) Install `lxml` in Python 3.12
[ https://issues.apache.org/jira/browse/SPARK-46657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46657: - Assignee: Dongjoon Hyun > Install `lxml` in Python 3.12 > - > > Key: SPARK-46657 > URL: https://issues.apache.org/jira/browse/SPARK-46657 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46660) ReattachExecute requests do not refresh aliveness of SessionHolder
[ https://issues.apache.org/jira/browse/SPARK-46660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46660: --- Labels: pull-request-available (was: ) > ReattachExecute requests do not refresh aliveness of SessionHolder > -- > > Key: SPARK-46660 > URL: https://issues.apache.org/jira/browse/SPARK-46660 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 4.0.0 >Reporter: Venkata Sai Akhil Gudesa >Priority: Major > Labels: pull-request-available > > In the first executePlan request, creating the {{ExecuteHolder}} triggers > {{getOrCreateIsolatedSession}}, which refreshes the aliveness of the > {{SessionHolder}}. However, in {{ReattachExecute}}, we fetch the > {{ExecuteHolder}} directly without going through the {{SessionHolder}} (hence > making it seem like the {{SessionHolder}} is idle). > > This can result in long-running queries (which do not send release execute > requests, since those refresh aliveness) failing because the > {{SessionHolder}} would expire during active query execution. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
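The fix direction implied above — routing the reattach path through the session holder so aliveness is refreshed — can be sketched with an illustrative class. The clock is injected to make the behavior testable; none of these names are Spark Connect's real API:

```scala
// Illustrative session holder with an injectable clock. The point of the
// fix is that *every* access path, including reattach, calls touch(), so
// the session is never wrongly judged idle during a long-running query.
final class SessionHolderSketch(clock: () => Long) {
  private var lastAccessMs: Long = clock()

  // Refresh the aliveness timestamp.
  def touch(): Unit = { lastAccessMs = clock() }

  // How long this session has appeared idle, as seen by an expiry checker.
  def idleFor(nowMs: Long): Long = nowMs - lastAccessMs

  // A reattach must refresh aliveness before handing back the execution.
  def reattach(): Unit = touch()
}
```

If reattach fetched the execution directly without `touch()`, `idleFor` would keep growing across reattaches and the session would eventually be expired mid-query, which is the reported bug.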
[jira] [Updated] (SPARK-46661) Add customizable property spark.dynamicAllocation.lastExecutorIdleTimeout for last remaining executor, defaulting to spark.dynamicAllocation.executorIdleTimeout
[ https://issues.apache.org/jira/browse/SPARK-46661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arnaud Nauwynck updated SPARK-46661: Description: When using dynamicAllocation, the parameter "spark.dynamicAllocation.executorIdleTimeout" is applied to every executor, regardless of whether it is the last one running or any other idle one. However, it may be interesting to keep the last remaining executor alive longer, so that any incoming task is processed immediately, instead of waiting for a complete restart of executors, which may take 30 seconds or more. This is particularly frequent when using Spark Streaming and polling for micro-batches. Keeping one executor alive helps respond faster, while still allowing dynamic allocation of 2..N executors. In practice, this might change only the following source code lines. In [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/package.scala#L647-L653], add {code:java} private[spark] val DYN_ALLOCATION_LAST_EXECUTOR_IDLE_TIMEOUT = ConfigBuilder("spark.dynamicAllocation.lastExecutorIdleTimeout") .version("3.6.0") .timeConf(TimeUnit.SECONDS) .checkValue(_ >= 0L, "Last timeout must be >= 0 (and preferably >= spark.dynamicAllocation.executorIdleTimeout)") .createWithDefault(60) {code} In [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala#L46], change {code:java} private val idleTimeoutNs = TimeUnit.SECONDS.toNanos( conf.get(DYN_ALLOCATION_EXECUTOR_IDLE_TIMEOUT)) {code} to {code:java} private val idleTimeoutNs = TimeUnit.SECONDS.toNanos( conf.get(DYN_ALLOCATION_EXECUTOR_IDLE_TIMEOUT)) private val lastIdleTimeoutNs = TimeUnit.SECONDS.toNanos( conf.get(DYN_ALLOCATION_LAST_EXECUTOR_IDLE_TIMEOUT)) {code} and insert an if-condition in [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala#L573], replacing {code:java} def updateTimeout(): Unit = { ... val timeout = Seq(_cacheTimeout, _shuffleTimeout, idleTimeoutNs).max ... {code} with something like {code:java} def updateTimeout(): Unit = { ... val isOnlyOneLastExecutorRemaining = ... val currIdleTimeoutNs = if (isOnlyOneLastExecutorRemaining) lastIdleTimeoutNs else idleTimeoutNs val timeout = Seq(_cacheTimeout, _shuffleTimeout, currIdleTimeoutNs).max ... {code}
[jira] [Created] (SPARK-46661) Add customizable property spark.dynamicAllocation.lastExecutorIdleTimeout for last remaining executor, defaulting to spark.dynamicAllocation.executorIdleTimeout
Arnaud Nauwynck created SPARK-46661: --- Summary: Add customizable property spark.dynamicAllocation.lastExecutorIdleTimeout for last remaining executor, defaulting to spark.dynamicAllocation.executorIdleTimeout Key: SPARK-46661 URL: https://issues.apache.org/jira/browse/SPARK-46661 Project: Spark Issue Type: Wish Components: Spark Core Affects Versions: 3.5.0, 4.0.0 Reporter: Arnaud Nauwynck When using dynamicAllocation, the parameter "spark.dynamicAllocation.executorIdleTimeout" is applied to every executor, regardless of whether it is the last one running or any other idle one. However, it may be interesting to keep the last remaining executor alive longer, so that any incoming task is processed immediately, instead of waiting for a complete restart of executors, which may take 30 seconds or more. This is particularly frequent when using Spark Streaming and polling for micro-batches. Keeping one executor alive helps respond faster, while still allowing dynamic allocation of 2..N executors. In practice, this might change only the following source code lines. In [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/package.scala#L647-L653], add {code:java} private[spark] val DYN_ALLOCATION_LAST_EXECUTOR_IDLE_TIMEOUT = ConfigBuilder("spark.dynamicAllocation.lastExecutorIdleTimeout") .version("3.6.0") .timeConf(TimeUnit.SECONDS) .checkValue(_ >= 0L, "Last timeout must be >= 0 (and preferably >= spark.dynamicAllocation.executorIdleTimeout)") .createWithDefault(60) {code} In [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala#L46], change {code:java} private val idleTimeoutNs = TimeUnit.SECONDS.toNanos( conf.get(DYN_ALLOCATION_EXECUTOR_IDLE_TIMEOUT)) {code} to {code:java} private val idleTimeoutNs = TimeUnit.SECONDS.toNanos( conf.get(DYN_ALLOCATION_EXECUTOR_IDLE_TIMEOUT)) private val lastIdleTimeoutNs = TimeUnit.SECONDS.toNanos( conf.get(DYN_ALLOCATION_LAST_EXECUTOR_IDLE_TIMEOUT)) {code} and insert an if-condition in [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala#L573], replacing {code:java} def updateTimeout(): Unit = { ... val timeout = Seq(_cacheTimeout, _shuffleTimeout, idleTimeoutNs).max ... {code} with something like {code:java} def updateTimeout(): Unit = { ... val isOnlyOneLastExecutorRemaining = ... val currIdleTimeoutNs = if (isOnlyOneLastExecutorRemaining) lastIdleTimeoutNs else idleTimeoutNs val timeout = Seq(_cacheTimeout, _shuffleTimeout, currIdleTimeoutNs).max ... {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
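The proposal above boils down to a one-line policy change in the timeout selection. A minimal, self-contained sketch under stated assumptions (illustrative names, not the actual ExecutorMonitor code; whether the last executor count is "<= 1" and which timeout wins are assumptions of this sketch):

```scala
object IdleTimeoutSketch {
  // Effective idle timeout in seconds: the last remaining executor gets
  // the (presumably longer) lastExecutorIdleTimeout; every other executor
  // keeps the regular executorIdleTimeout.
  def effectiveIdleTimeout(
      aliveExecutors: Int,
      idleTimeoutSec: Long,
      lastIdleTimeoutSec: Long): Long =
    if (aliveExecutors <= 1) lastIdleTimeoutSec else idleTimeoutSec
}
```

For example, with executorIdleTimeout=60 and lastExecutorIdleTimeout=300, a cluster shrinks normally down to one executor, and only that final executor lingers for the longer window, keeping micro-batch latency low.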
[jira] [Created] (SPARK-46660) ReattachExecute requests do not refresh aliveness of SessionHolder
Venkata Sai Akhil Gudesa created SPARK-46660: Summary: ReattachExecute requests do not refresh aliveness of SessionHolder Key: SPARK-46660 URL: https://issues.apache.org/jira/browse/SPARK-46660 Project: Spark Issue Type: Bug Components: Connect Affects Versions: 4.0.0 Reporter: Venkata Sai Akhil Gudesa In the first executePlan request, creating the {{ExecuteHolder}} triggers {{getOrCreateIsolatedSession}} which refreshes the aliveness of {{{}SessionHolder{}}}. However in {{ReattachExecute}} , we fetch the {{ExecuteHolder}} directly without going through the {{SessionHolder}} (and hence making it seem like the {{SessionHolder}} is idle). This would result in long-running queries (which do not send release execute requests since that refreshes aliveness) failing because the {{SessionHolder}} would expire during active query execution. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46659) Add customizable TaskScheduling param, to avoid randomly choosing executor for tasks, and downscale on low micro-batches activity
[ https://issues.apache.org/jira/browse/SPARK-46659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arnaud Nauwynck updated SPARK-46659: Description: When using dynamicAllocation (but not spark.decommission.enabled=true) with micro-batch activity, very small tasks arrive at regular intervals and are processed extremely quickly. The flow of events being processed may consume less than 1% of the CPU of the cluster. Yet globally, the number of executors stays at a high level (spark.dynamicAllocation.maxExecutors), even though they are all idle 99% of the time. Unfortunately, in the current code, tasks are assigned randomly to executors, so a constant flow of very small tasks artificially keeps all the executors in an "active" status: all executors receive tasks from time to time, so strictly speaking they are never considered idle for longer than "spark.dynamicAllocation.executorIdleTimeout". Therefore, executors are never marked as candidates for decommissioning, and they continue to receive tasks forever, while those tasks could easily be assigned to any other executor (chosen non-randomly). The proposal is therefore to add a new configuration property to suppress the random shuffling of assignable offers for tasks. See this code: [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L773] {code:java} /** * Shuffle offers around to avoid always placing tasks on the same workers. Exposed to allow * overriding in tests, so it can be deterministic. */ protected def shuffleOffers(offers: IndexedSeq[WorkerOffer]): IndexedSeq[WorkerOffer] = { Random.shuffle(offers) } {code} It could be replaced simply by {code:java} val SKIP_RANDOMIZE_WORKER_OFFERS = ConfigBuilder("spark.task.skipRandomizeWorkerOffers") .version("3.6.0") .booleanConf .createWithDefault(false) .. val skipRandomizeWorkerOffers = conf.get(SKIP_RANDOMIZE_WORKER_OFFERS) .. protected def shuffleOffers(offers: IndexedSeq[WorkerOffer]): IndexedSeq[WorkerOffer] = { if (skipRandomizeWorkerOffers) { offers } else { Random.shuffle(offers) } } {code} > Add customizable TaskScheduling param, to avoid randomly choosing executor > for tasks, and downscale on low micro-batches activity > - > > Key: SPARK-46659 > URL: https://issues.apache.org/jira/browse/SPARK-46659
[jira] [Created] (SPARK-46659) Add customizable TaskScheduling param, to avoid randomly choosing executor for tasks, and downscale on low micro-batches activity
Arnaud Nauwynck created SPARK-46659: --- Summary: Add customizable TaskScheduling param, to avoid randomly choosing executor for tasks, and downscale on low micro-batches activity Key: SPARK-46659 URL: https://issues.apache.org/jira/browse/SPARK-46659 Project: Spark Issue Type: Wish Components: Spark Core Affects Versions: 3.5.0, 3.4.0, 4.0.0 Reporter: Arnaud Nauwynck When using dynamicAllocation (but not spark.decommission.enabled=true) with micro-batch activity, very small tasks arrive at regular intervals and are processed extremely quickly. The flow of events being processed may consume less than 1% of the CPU of the cluster. Yet globally, the number of executors stays at a high level (spark.dynamicAllocation.maxExecutors), even though they are all idle 99% of the time. Unfortunately, in the current code, tasks are assigned randomly to executors, so a constant flow of very small tasks artificially keeps all the executors in an "active" status: all executors receive tasks from time to time, so strictly speaking they are never considered idle for longer than "spark.dynamicAllocation.executorIdleTimeout". Therefore, executors are never marked as candidates for decommissioning, and they continue to receive tasks forever, while those tasks could easily be assigned to any other executor (chosen non-randomly). The proposal is therefore to add a new configuration property to suppress the random shuffling of assignable offers for tasks. See this code: [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L773] {code:java} /** * Shuffle offers around to avoid always placing tasks on the same workers. Exposed to allow * overriding in tests, so it can be deterministic. */ protected def shuffleOffers(offers: IndexedSeq[WorkerOffer]): IndexedSeq[WorkerOffer] = { Random.shuffle(offers) } {code} It could be replaced simply by {code:java} val SKIP_RANDOMIZE_WORKER_OFFERS = ConfigBuilder("spark.task.skipRandomizeWorkerOffers") .version("3.6.0") .booleanConf .createWithDefault(false) .. val skipRandomizeWorkerOffers = conf.get(SKIP_RANDOMIZE_WORKER_OFFERS) .. protected def shuffleOffers(offers: IndexedSeq[WorkerOffer]): IndexedSeq[WorkerOffer] = { if (skipRandomizeWorkerOffers) { offers } else { Random.shuffle(offers) } } {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
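The proposed flag amounts to guarding the shuffle call. A runnable sketch of that guard in isolation (the SparkConf plumbing is elided; `skipRandomize` here is just a boolean parameter standing in for the hypothetical `spark.task.skipRandomizeWorkerOffers` setting, and `WorkerOffer` is replaced by a generic element type):

```scala
import scala.util.Random

object OfferOrderSketch {
  // When skipRandomize is set, offers keep their stable order, so the
  // scheduler keeps preferring the same executors and the rest can stay
  // idle long enough to be reclaimed by dynamic allocation.
  def shuffleOffers[T](
      offers: IndexedSeq[T],
      skipRandomize: Boolean): IndexedSeq[T] =
    if (skipRandomize) offers else Random.shuffle(offers)
}
```

Note the trade-off this sketch makes visible: a stable order concentrates load on the head of the offer list, which is exactly what lets the tail of executors hit their idle timeout.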
[jira] [Updated] (SPARK-46658) Loosen Ruby dependency specs for doc build
[ https://issues.apache.org/jira/browse/SPARK-46658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46658: --- Labels: pull-request-available (was: ) > Loosen Ruby dependency specs for doc build > -- > > Key: SPARK-46658 > URL: https://issues.apache.org/jira/browse/SPARK-46658 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 4.0.0 >Reporter: Nicholas Chammas >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42199) groupByKey creates columns that may conflict with existing columns
[ https://issues.apache.org/jira/browse/SPARK-42199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-42199: --- Labels: pull-request-available (was: ) > groupByKey creates columns that may conflict with existing columns > - > > Key: SPARK-42199 > URL: https://issues.apache.org/jira/browse/SPARK-42199 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.3, 3.2.3, 3.3.2, 3.4.0, 3.5.0 >Reporter: Enrico Minack >Priority: Major > Labels: pull-request-available > > Calling {{ds.groupByKey(func: V => K)}} creates columns to store the key > value. These columns may conflict with columns that already exist in {{ds}}. > Function {{Dataset.groupByKey.agg}} accounts for this with a very specific > rule, which has some surprising weaknesses: > {code:scala} > spark.range(1) > // groupByKey adds column 'value' > .groupByKey(id => id) > // which cannot be referenced, though it is suggested > .agg(count("value")) > {code} > {code:java} > org.apache.spark.sql.AnalysisException: Column 'value' does not exist. Did > you mean one of the following? [value, id]; > {code} > An existing 'value' column can be referenced: > {code:scala} > // dataset with column 'value' > spark.range(1).select($"id".as("value")).as[Long] > // groupByKey adds another column 'value' > .groupByKey(id => id) > // agg accounts for the extra column and excludes it when resolving 'value' > .agg(count("value")) > .show() > {code} > {code:java} > +---+------------+ > |key|count(value)| > +---+------------+ > | 0| 1| > +---+------------+ > {code} > While the column suggestion shows both 'value' columns: > {code:scala} > spark.range(1).select($"id".as("value")).as[Long] > .groupByKey(id => id) > .agg(count("unknown")) > {code} > {code:java} > org.apache.spark.sql.AnalysisException: Column 'unknown' does not exist. Did > you mean one of the following? [value, value] > {code} > However, {{mapValues}} introduces another 'value' column, which should be > referenceable, but it breaks the exclusion introduced by {{agg}}: > {code:scala} > spark.range(1) > // groupByKey adds column 'value' > .groupByKey(id => id) > // adds another 'value' column > .mapValues(value => value) > // which cannot be referenced in agg > .agg(count("value")) > {code} > {code:java} > org.apache.spark.sql.AnalysisException: Reference 'value' is ambiguous, could > be: value, value. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46658) Loosen Ruby dependency specs for doc build
Nicholas Chammas created SPARK-46658: Summary: Loosen Ruby dependency specs for doc build Key: SPARK-46658 URL: https://issues.apache.org/jira/browse/SPARK-46658 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 4.0.0 Reporter: Nicholas Chammas -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44173) Make Spark an sbt build only project
[ https://issues.apache.org/jira/browse/SPARK-44173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805237#comment-17805237 ] Dongjoon Hyun commented on SPARK-44173: --- IIRC, there is a discussion about this and we decided to stick to Maven because its explicit dependency management was preferred at that time, [~LuciferYang]. > Make Spark an sbt build only project > > > Key: SPARK-44173 > URL: https://issues.apache.org/jira/browse/SPARK-44173 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Minor > > Supporting both Maven and SBT always brings various testing problems and > increases the complexity of testing code writing > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44173) Make Spark an sbt build only project
[ https://issues.apache.org/jira/browse/SPARK-44173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805237#comment-17805237 ] Dongjoon Hyun edited comment on SPARK-44173 at 1/10/24 5:40 PM: IIRC, there was a discussion about this and we decided to stick to Maven because its explicit dependency management was preferred at that time, [~LuciferYang]. was (Author: dongjoon): IIRC, there is a discussion about this and we decided to stick to Maven because its explicit dependency management was preferred at that time, [~LuciferYang]. > Make Spark an sbt build only project > > > Key: SPARK-44173 > URL: https://issues.apache.org/jira/browse/SPARK-44173 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Minor > > Supporting both Maven and SBT always brings various testing problems and > increases the complexity of testing code writing > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46657) Install `lxml` in Python 3.12
[ https://issues.apache.org/jira/browse/SPARK-46657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46657: --- Labels: pull-request-available (was: ) > Install `lxml` in Python 3.12 > - > > Key: SPARK-46657 > URL: https://issues.apache.org/jira/browse/SPARK-46657 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46657) Install `lxml` in Python 3.12
Dongjoon Hyun created SPARK-46657: - Summary: Install `lxml` in Python 3.12 Key: SPARK-46657 URL: https://issues.apache.org/jira/browse/SPARK-46657 Project: Spark Issue Type: Bug Components: Project Infra Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46547) Fix deadlock issue between maintenance thread and streaming agg physical operators
[ https://issues.apache.org/jira/browse/SPARK-46547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-46547. -- Fix Version/s: 3.5.1 4.0.0 Resolution: Fixed Issue resolved by pull request 44542 [https://github.com/apache/spark/pull/44542] > Fix deadlock issue between maintenance thread and streaming agg physical > operators > -- > > Key: SPARK-46547 > URL: https://issues.apache.org/jira/browse/SPARK-46547 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Anish Shrigondekar >Assignee: Anish Shrigondekar >Priority: Major > Labels: pull-request-available > Fix For: 3.5.1, 4.0.0 > > > Fix deadlock issue between maintenance thread and streaming agg physical > operators -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
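The SPARK-46547 ticket above does not describe the fix itself, only that a deadlock between the state-store maintenance thread and the streaming aggregation operators was resolved. As general background on this class of bug, such deadlocks typically arise when two threads acquire two locks in opposite orders. A minimal, hypothetical Java sketch of the standard remedy, a single global lock-acquisition order (the names are illustrative, not Spark's):

```java
import java.util.concurrent.locks.ReentrantLock;

public class LockOrdering {
    // Two resources that both a maintenance thread and a query thread touch.
    // Names are hypothetical; they do not correspond to Spark classes.
    private static final ReentrantLock stateLock = new ReentrantLock();
    private static final ReentrantLock fileLock = new ReentrantLock();

    static int sharedCounter = 0;

    // Every thread acquires the locks in the same global order
    // (stateLock before fileLock), so a cycle in the wait-for graph --
    // the precondition for deadlock -- cannot form.
    static void doWork() {
        stateLock.lock();
        try {
            fileLock.lock();
            try {
                sharedCounter++;
            } finally {
                fileLock.unlock();
            }
        } finally {
            stateLock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread maintenance = new Thread(() -> { for (int i = 0; i < 1000; i++) doWork(); });
        Thread query = new Thread(() -> { for (int i = 0; i < 1000; i++) doWork(); });
        maintenance.start();
        query.start();
        maintenance.join();
        query.join();
        // Both threads completed without deadlocking; all increments are visible.
        System.out.println(sharedCounter); // 2000
    }
}
```

Had one thread taken fileLock before stateLock while the other did the reverse, the two could block each other forever; the consistent order above rules that out regardless of timing.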
[jira] [Updated] (SPARK-46653) Code-gen for full outer sort merge join output line by line
[ https://issues.apache.org/jira/browse/SPARK-46653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mingliang Zhu updated SPARK-46653: -- Description: Be consistent with closing code-gen, avoid oom when there are a large number of duplicate keys in full outer sort merge join. (was: Be consistent with closing code-gen, avoid oom when there are a large number of duplicate keys and the parent of SortMergeJoin cannot code-gen.) > Code-gen for full outer sort merge join output line by line > > > Key: SPARK-46653 > URL: https://issues.apache.org/jira/browse/SPARK-46653 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Mingliang Zhu >Priority: Major > Labels: pull-request-available > > Be consistent with closing code-gen, avoid oom when there are a large number > of duplicate keys in full outer sort merge join. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
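The SPARK-46653 description above is terse; the underlying issue is that a full outer sort merge join must emit the cross product of every pair of duplicate-key groups, so materializing the whole product before emitting rows can exhaust memory, while emitting rows one by one bounds memory by the group sizes. A generic plain-Java sketch of this (not Spark's generated code; keys stand in for full rows):

```java
import java.util.ArrayList;
import java.util.List;

public class SortMergeJoinSketch {
    // Full outer sort-merge join of two key-sorted int arrays.
    // Equal keys form groups; the matched output is the cross product of
    // the two groups. Emitting those pairs one by one (as below) keeps the
    // buffered state proportional to the group sizes, not their product.
    static List<String> fullOuterJoin(int[] left, int[] right) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.length || j < right.length) {
            if (j >= right.length || (i < left.length && left[i] < right[j])) {
                out.add(left[i] + ",null"); i++;      // unmatched left row
            } else if (i >= left.length || right[j] < left[i]) {
                out.add("null," + right[j]); j++;     // unmatched right row
            } else {
                int key = left[i];
                int i2 = i, j2 = j;
                while (i2 < left.length && left[i2] == key) i2++;  // left group
                while (j2 < right.length && right[j2] == key) j2++; // right group
                // cross product of the two duplicate-key groups, row by row
                for (int a = i; a < i2; a++)
                    for (int b = j; b < j2; b++)
                        out.add(left[a] + "," + right[b]);
                i = i2; j = j2;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // key 2 appears twice on each side -> 2 x 2 = 4 matched rows
        int[] left = {1, 2, 2};
        int[] right = {2, 2, 3};
        List<String> rows = fullOuterJoin(left, right);
        System.out.println(rows.size()); // 1 unmatched left + 4 matched + 1 unmatched right = 6
    }
}
```

With n duplicates of a key on each side the matched output is n², which is why a code-gen path that buffers the whole product can OOM where a line-by-line path does not.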
[jira] [Updated] (SPARK-46654) df.show() of pyspark displayed different results between Regular Spark and Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-46654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46654: --- Labels: pull-request-available (was: ) > df.show() of pyspark displayed different results between Regular Spark and > Spark Connect > > > Key: SPARK-46654 > URL: https://issues.apache.org/jira/browse/SPARK-46654 > Project: Spark > Issue Type: Bug > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > > The following doctest will throw an error in the tests of the pyspark-connect > module > {code:java} > Example 2: Converting a complex StructType to a CSV string > >>> from pyspark.sql import Row, functions as sf > >>> data = [(1, Row(age=2, name='Alice', scores=[100, 200, 300]))] > >>> df = spark.createDataFrame(data, ("key", "value")) > >>> df.select(sf.to_csv(df.value)).show(truncate=False) # doctest: +SKIP > +-----------------------+ > |to_csv(value)          | > +-----------------------+ > |2,Alice,"[100,200,300]"| > +-----------------------+{code} > {code:java} > ** > 3953 File "/__w/spark/spark/python/pyspark/sql/connect/functions/builtin.py", > line 2232, in pyspark.sql.connect.functions.builtin.to_csv > 3954 Failed example: > 3955 df.select(sf.to_csv(df.value)).show(truncate=False) > 3956 Expected: > 3957 +-----------------------+ > 3958 |to_csv(value)          | > 3959 +-----------------------+ > 3960 |2,Alice,"[100,200,300]"| > 3961 +-----------------------+ > 3962 Got: > 3963 +--------------------------------------------------------------------------+ > 3964 |to_csv(value)                                                             | > 3965 +--------------------------------------------------------------------------+ > 3966 |2,Alice,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@99c5e30f| > 3967 +--------------------------------------------------------------------------+ > 3968 > 3969 ** > 3970 1 of 18 in pyspark.sql.connect.functions.builtin.to_csv > 3971 ***Test Failed*** 1 failures. {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46656) Split `GroupbyParitySplitApplyTests`
[ https://issues.apache.org/jira/browse/SPARK-46656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46656: --- Labels: pull-request-available (was: ) > Split `GroupbyParitySplitApplyTests` > > > Key: SPARK-46656 > URL: https://issues.apache.org/jira/browse/SPARK-46656 > Project: Spark > Issue Type: Sub-task > Components: PS, Tests >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46656) Split `GroupbyParitySplitApplyTests`
Ruifeng Zheng created SPARK-46656: - Summary: Split `GroupbyParitySplitApplyTests` Key: SPARK-46656 URL: https://issues.apache.org/jira/browse/SPARK-46656 Project: Spark Issue Type: Sub-task Components: PS, Tests Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-46257) Upgrade Derby to 10.16.1.1
[ https://issues.apache.org/jira/browse/SPARK-46257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805059#comment-17805059 ] Laurenceau Julien edited comment on SPARK-46257 at 1/10/24 10:36 AM: - Yes you are right. The only version that fix this vuln currently released on maven central is : [10.17.1.0|https://mvnrepository.com/artifact/org.apache.derby/derby/10.17.1.0] [https://mvnrepository.com/artifact/org.apache.derby/derby] Do you think it will be possible to upgrade to 10.17.x for spark 4.0.0 ? NB: I asked to derby on their ticket about the vuln if there is a release planned for 10.16.1.2. was (Author: julienlau): Yes you are right. The only version that fix this vuln currently released on maven central is : [10.17.1.0|https://mvnrepository.com/artifact/org.apache.derby/derby/10.17.1.0] [https://mvnrepository.com/artifact/org.apache.derby/derby] Do you think it will be possible to upgrade to 10.17.x for spark 4.0.0 ? > Upgrade Derby to 10.16.1.1 > -- > > Key: SPARK-46257 > URL: https://issues.apache.org/jira/browse/SPARK-46257 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > https://db.apache.org/derby/releases/release-10_16_1_1.cgi -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46257) Upgrade Derby to 10.16.1.1
[ https://issues.apache.org/jira/browse/SPARK-46257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805059#comment-17805059 ] Laurenceau Julien commented on SPARK-46257: --- Yes you are right. The only version that fix this vuln currently released on maven central is : [10.17.1.0|https://mvnrepository.com/artifact/org.apache.derby/derby/10.17.1.0] [https://mvnrepository.com/artifact/org.apache.derby/derby] Do you think it will be possible to upgrade to 10.17.x for spark 4.0.0 ? > Upgrade Derby to 10.16.1.1 > -- > > Key: SPARK-46257 > URL: https://issues.apache.org/jira/browse/SPARK-46257 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > https://db.apache.org/derby/releases/release-10_16_1_1.cgi -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46652) Remove `Snappy` from `TPCDSQueryBenchmark` benchmark case name
[ https://issues.apache.org/jira/browse/SPARK-46652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46652: - Assignee: Dongjoon Hyun > Remove `Snappy` from `TPCDSQueryBenchmark` benchmark case name > -- > > Key: SPARK-46652 > URL: https://issues.apache.org/jira/browse/SPARK-46652 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46652) Remove `Snappy` from `TPCDSQueryBenchmark` benchmark case name
[ https://issues.apache.org/jira/browse/SPARK-46652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46652. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44657 [https://github.com/apache/spark/pull/44657] > Remove `Snappy` from `TPCDSQueryBenchmark` benchmark case name > -- > > Key: SPARK-46652 > URL: https://issues.apache.org/jira/browse/SPARK-46652 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46650) Replace AtomicBoolean with volatile boolean
[ https://issues.apache.org/jira/browse/SPARK-46650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-46650: -- Assignee: (was: Apache Spark) > Replace AtomicBoolean with volatile boolean > --- > > Key: SPARK-46650 > URL: https://issues.apache.org/jira/browse/SPARK-46650 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
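The rationale for SPARK-46650 is not spelled out in the ticket, but a `volatile boolean` is a drop-in replacement for `AtomicBoolean` only when the flag is just written and read, never updated with an atomic read-modify-write such as `compareAndSet`; in that case `volatile` gives the same cross-thread visibility with less indirection. A minimal sketch of the pattern (hypothetical names, not the Spark code in question):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class VolatileFlag {
    // Before: an AtomicBoolean, needed only if compareAndSet-style
    // read-modify-write operations are used (they are not, here).
    private static final AtomicBoolean stoppedAtomic = new AtomicBoolean(false);

    // After: a volatile boolean suffices when the flag is only ever
    // written ("stop") and read ("should I stop?") -- volatile guarantees
    // a write by one thread is visible to subsequent reads in others.
    private static volatile boolean stopped = false;

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            while (!stopped) {
                // busy-wait; a plain read of the volatile flag
            }
            System.out.println("worker observed stop");
        });
        worker.start();
        Thread.sleep(10);
        stopped = true;  // plain write, no CAS needed
        worker.join();
        System.out.println("done");
    }
}
```

Without `volatile` (or equivalent synchronization), the JIT may hoist the flag read out of the loop and the worker may spin forever; with it, the shutdown write is guaranteed to become visible.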
[jira] [Resolved] (SPARK-46635) Refine docstring of `from_csv/schema_of_csv/to_csv`
[ https://issues.apache.org/jira/browse/SPARK-46635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-46635. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44639 [https://github.com/apache/spark/pull/44639] > Refine docstring of `from_csv/schema_of_csv/to_csv` > --- > > Key: SPARK-46635 > URL: https://issues.apache.org/jira/browse/SPARK-46635 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46635) Refine docstring of `from_csv/schema_of_csv/to_csv`
[ https://issues.apache.org/jira/browse/SPARK-46635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-46635: Assignee: Yang Jie > Refine docstring of `from_csv/schema_of_csv/to_csv` > --- > > Key: SPARK-46635 > URL: https://issues.apache.org/jira/browse/SPARK-46635 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org