[jira] [Commented] (SPARK-39690) Reuse exchange across subqueries is broken with AQE if subquery side exchange materialized first
[ https://issues.apache.org/jira/browse/SPARK-39690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562984#comment-17562984 ]

Apache Spark commented on SPARK-39690:
--

User 'mskapilks' has created a pull request for this issue:
https://github.com/apache/spark/pull/37098

> Reuse exchange across subqueries is broken with AQE if subquery side exchange materialized first
>
> Key: SPARK-39690
> URL: https://issues.apache.org/jira/browse/SPARK-39690
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Kapil Singh
> Priority: Major
>
> When trying to reuse the Exchange of a subquery in the main plan, if the Exchange inside the subquery materializes first, the main ASPE node won't have that stage info (in [stageToReplace|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala#L243]) to substitute into the current logical plan. This causes AQE to produce a new candidate physical plan without reusing the exchange present inside the subquery, and depending on how complex the inner plan is (number of exchanges), AQE could choose a plan without ReusedExchange.
>
> We have seen this with multiple queries with our private build. This can happen in DPP also.

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
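The ordering problem described in the issue can be sketched in plain Python. Everything below is a stand-in for AQE internals (the class, the `stage_cache` dict, and the `stages_to_replace` list are simplified models of AdaptiveSparkPlanExec's bookkeeping, not Spark's actual code); only the race it demonstrates is the reported behavior: a stage materialized by the subquery's own adaptive plan never reaches the main plan's local list of stages to substitute.

```python
# Toy model of the AQE bookkeeping described in SPARK-39690. All names here
# are stand-ins for Spark internals; only the ordering problem is real.
class ToyAdaptivePlan:
    """One adaptive plan; each subquery gets its own instance."""

    def __init__(self, name, stage_cache):
        self.name = name
        self.stage_cache = stage_cache   # shared exchange-reuse cache
        self.stages_to_replace = []      # stages THIS node knows about

    def materialize(self, exchange_key):
        # A newly materialized stage lands in the shared cache but is
        # recorded only in this node's local stages_to_replace list.
        stage = f"stage({exchange_key})"
        self.stage_cache.setdefault(exchange_key, stage)
        self.stages_to_replace.append(stage)
        return stage


shared_cache = {}
subquery = ToyAdaptivePlan("subquery", shared_cache)
main = ToyAdaptivePlan("main", shared_cache)

# The subquery side wins the race and materializes the common exchange first.
subquery.materialize("exchange-1")

# The main plan's node never saw that stage, so when it re-optimizes it has
# nothing to substitute into its logical plan and can pick a candidate plan
# without a ReusedExchange.
assert shared_cache["exchange-1"] == "stage(exchange-1)"
assert main.stages_to_replace == []   # the bug: the main plan missed the stage
```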
[jira] [Assigned] (SPARK-39690) Reuse exchange across subqueries is broken with AQE if subquery side exchange materialized first
[ https://issues.apache.org/jira/browse/SPARK-39690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39690:
Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-39690) Reuse exchange across subqueries is broken with AQE if subquery side exchange materialized first
[ https://issues.apache.org/jira/browse/SPARK-39690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562983#comment-17562983 ]

Apache Spark commented on SPARK-39690:

User 'mskapilks' has created a pull request for this issue:
https://github.com/apache/spark/pull/37098
[jira] [Assigned] (SPARK-39690) Reuse exchange across subqueries is broken with AQE if subquery side exchange materialized first
[ https://issues.apache.org/jira/browse/SPARK-39690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39690:
Assignee: Apache Spark
[jira] [Updated] (SPARK-39690) Reuse exchange across subqueries is broken with AQE if subquery side exchange materialized first
[ https://issues.apache.org/jira/browse/SPARK-39690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kapil Singh updated SPARK-39690:
Description: (last paragraph changed) "We have seen with multiple queries with our private build. This can happen in DPP also."
was: "We have seen in with multiple queries with our private build. This can happen in DPP also."
[jira] [Created] (SPARK-39690) Reuse exchange across subqueries is broken with AQE if subquery side exchange materialized first
Kapil Singh created SPARK-39690:
---

Summary: Reuse exchange across subqueries is broken with AQE if subquery side exchange materialized first
Key: SPARK-39690
URL: https://issues.apache.org/jira/browse/SPARK-39690
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.3.0
Reporter: Kapil Singh
[jira] [Created] (SPARK-39689) Support 2-chars lineSep in CSV datasource
Yaohua Zhao created SPARK-39689:
---

Summary: Support 2-chars lineSep in CSV datasource
Key: SPARK-39689
URL: https://issues.apache.org/jira/browse/SPARK-39689
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.3.0
Reporter: Yaohua Zhao

The Univocity parser allows the line separator to be 1 or 2 characters ([code|https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/common/Format.java#L103]), so the CSV options should not block this usage ([code|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala#L218]). Due to the limitation around `normalizedNewLine` (https://github.com/uniVocity/univocity-parsers/issues/170), setting a 2-character line separator could cause some weird/bad behaviors. Thus, we should probably leave this proposed fix as an undocumented feature and warn users who rely on it. A more proper fix could be investigated in the future.
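What a 2-character line separator means can be illustrated outside Spark with plain Python (no Univocity or Spark involved): splitting records on a multi-character delimiter is straightforward string handling, which is why the 1-character restriction in CSVOptions looks artificial. The `|~` separator and the sample data below are made up for the illustration.

```python
# Hypothetical CSV payload using a 2-character record separator "|~".
# This is a plain-Python sketch of the concept, not Spark's CSV reader.
raw = "name,age|~alice,30|~bob,25"

# Splitting records on a 2-char separator is no harder than on "\n".
records = raw.split("|~")
header, *rows = [r.split(",") for r in records]

assert header == ["name", "age"]
assert rows == [["alice", "30"], ["bob", "25"]]
```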
[jira] [Comment Edited] (SPARK-31749) Allow to set owner reference for the driver pod (cluster mode)
[ https://issues.apache.org/jira/browse/SPARK-31749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562935#comment-17562935 ]

magaoyun edited comment on SPARK-31749 at 7/6/22 3:23 AM:
--

I haven't found this setting on [https://spark.apache.org/docs/3.1.1/running-on-kubernetes.html#configuration]:
{code:java}
spark.kubernetes.driver.ownerReferences
{code}
Does Spark v3.1.1 support it?

was (Author: JIRAUSER292332):
I have found this setting on https://spark.apache.org/docs/3.1.1/running-on-kubernetes.html#configuration.
{code:java}
spark.kubernetes.driver.ownerReferences
{code}
Does Spark v3.1.1 support it?

> Allow to set owner reference for the driver pod (cluster mode)
>
> Key: SPARK-31749
> URL: https://issues.apache.org/jira/browse/SPARK-31749
> Project: Spark
> Issue Type: New Feature
> Components: Kubernetes
> Affects Versions: 2.4.5
> Reporter: Tamas Jambor
> Priority: Major
> Labels: bulk-closed
>
> Currently there is no way to pass ownerReferences to the driver pod in cluster mode. This makes it difficult for the upstream process to clean up pods after they complete.
>
> Something like this would be useful:
> spark.kubernetes.driver.ownerReferences.[Name]
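As a sketch of what the requested feature would have to produce, the snippet below builds the `ownerReferences` entry that Kubernetes garbage collection keys on. The field names (`apiVersion`, `kind`, `name`, `uid`, `controller`, `blockOwnerDeletion`) are the real Kubernetes `metadata.ownerReferences` fields; the job name and uid values, and the `owner_reference` helper itself, are hypothetical, and this is not an existing Spark API.

```python
# Sketch of the ownerReference metadata a hypothetical
# spark.kubernetes.driver.ownerReferences.[Name] config would inject into
# the driver pod. Field names follow the Kubernetes ownerReferences API;
# the concrete values are made up for illustration.
def owner_reference(api_version, kind, name, uid):
    return {
        "apiVersion": api_version,
        "kind": kind,
        "name": name,
        "uid": uid,
        # controller marks the managing owner; blockOwnerDeletion keeps the
        # owner from being deleted before this dependent is collected.
        "controller": True,
        "blockOwnerDeletion": True,
    }


driver_pod_metadata = {
    "name": "my-spark-driver",
    "ownerReferences": [
        owner_reference("batch/v1", "Job", "upstream-job",
                        "d9607e19-f88f-11e6-a518-42010a800195"),
    ],
}

# Once the pod carries this reference, deleting the upstream Job lets the
# Kubernetes garbage collector clean up the driver pod automatically.
assert driver_pod_metadata["ownerReferences"][0]["kind"] == "Job"
```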
[jira] [Commented] (SPARK-31749) Allow to set owner reference for the driver pod (cluster mode)
[ https://issues.apache.org/jira/browse/SPARK-31749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562935#comment-17562935 ]

magaoyun commented on SPARK-31749:

I have found this setting on https://spark.apache.org/docs/3.1.1/running-on-kubernetes.html#configuration.
{code:java}
spark.kubernetes.driver.ownerReferences
{code}
Does Spark v3.1.1 support it?
[jira] [Commented] (SPARK-39616) Upgrade Breeze to 2.0
[ https://issues.apache.org/jira/browse/SPARK-39616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562930#comment-17562930 ]

Apache Spark commented on SPARK-39616:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37097

> Upgrade Breeze to 2.0
>
> Key: SPARK-39616
> URL: https://issues.apache.org/jira/browse/SPARK-39616
> Project: Spark
> Issue Type: Improvement
> Components: Build, ML
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Fix For: 3.4.0
[jira] [Commented] (SPARK-39616) Upgrade Breeze to 2.0
[ https://issues.apache.org/jira/browse/SPARK-39616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562929#comment-17562929 ]

Apache Spark commented on SPARK-39616:

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37097
[jira] [Updated] (SPARK-39522) Add Apache Spark infra GA image cache
[ https://issues.apache.org/jira/browse/SPARK-39522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yikun Jiang updated SPARK-39522:
Summary: Add Apache Spark infra GA image cache (was: Uses Docker image cache over a custom image)

> Add Apache Spark infra GA image cache
>
> Key: SPARK-39522
> URL: https://issues.apache.org/jira/browse/SPARK-39522
> Project: Spark
> Issue Type: Sub-task
> Components: Project Infra
> Affects Versions: 3.4.0
> Reporter: Hyukjin Kwon
> Assignee: Yikun Jiang
> Priority: Major
> Fix For: 3.4.0
>
> We should probably replace the base image (https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L302, https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage) with a plain ubuntu image using a Docker image cache. See also https://github.com/docker/build-push-action/blob/master/docs/advanced/cache.md
[jira] [Resolved] (SPARK-39522) Uses Docker image cache over a custom image
[ https://issues.apache.org/jira/browse/SPARK-39522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-39522.
--
Fix Version/s: 3.4.0
Resolution: Fixed

Issue resolved by pull request 37003
[https://github.com/apache/spark/pull/37003]
[jira] [Assigned] (SPARK-39522) Uses Docker image cache over a custom image
[ https://issues.apache.org/jira/browse/SPARK-39522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-39522:
Assignee: Yikun Jiang
[jira] [Commented] (SPARK-39681) PySpark build broken in branch-3.2 with Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-39681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562890#comment-17562890 ]

Hyukjin Kwon commented on SPARK-39681:
--

Thanks [~yikunkero]

> PySpark build broken in branch-3.2 with Scala 2.13
>
> Key: SPARK-39681
> URL: https://issues.apache.org/jira/browse/SPARK-39681
> Project: Spark
> Issue Type: Test
> Components: Build, PySpark, Tests
> Affects Versions: 3.2.1
> Reporter: Hyukjin Kwon
> Priority: Major
>
> https://github.com/apache/spark/runs/7189972241?check_suite_focus=true
> {code}
> Running PySpark tests
>
> Traceback (most recent call last):
>   File "./python/run-tests.py", line 65, in
>     raise RuntimeError("Cannot find assembly build directory, please build Spark first.")
> RuntimeError: Cannot find assembly build directory, please build Spark first.
> {code}
[jira] [Assigned] (SPARK-39687) Make sure new catalog methods listed in API reference
[ https://issues.apache.org/jira/browse/SPARK-39687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-39687:
Assignee: Ruifeng Zheng

> Make sure new catalog methods listed in API reference
>
> Key: SPARK-39687
> URL: https://issues.apache.org/jira/browse/SPARK-39687
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, PySpark
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Minor
[jira] [Resolved] (SPARK-39687) Make sure new catalog methods listed in API reference
[ https://issues.apache.org/jira/browse/SPARK-39687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-39687.
--
Fix Version/s: 3.4.0
Resolution: Fixed

Issue resolved by pull request 37092
[https://github.com/apache/spark/pull/37092]
[jira] [Resolved] (SPARK-39686) Disable scheduled builds that do not pass even once
[ https://issues.apache.org/jira/browse/SPARK-39686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-39686.
--
Fix Version/s: 3.4.0
Resolution: Fixed

Issue resolved by pull request 37091
[https://github.com/apache/spark/pull/37091]

> Disable scheduled builds that do not pass even once
>
> Key: SPARK-39686
> URL: https://issues.apache.org/jira/browse/SPARK-39686
> Project: Spark
> Issue Type: Sub-task
> Components: Build, Project Infra
> Affects Versions: 3.4.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Major
> Fix For: 3.4.0
>
> We added some more builds that have never been tested so far. We should probably disable them for now to make the build status easy to check.
> See also SPARK-39681, SPARK-39685, SPARK-39682 and SPARK-39684
[jira] [Assigned] (SPARK-39686) Disable scheduled builds that do not pass even once
[ https://issues.apache.org/jira/browse/SPARK-39686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-39686:
Assignee: Hyukjin Kwon
[jira] [Assigned] (SPARK-39584) Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results
[ https://issues.apache.org/jira/browse/SPARK-39584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39584:
Assignee: (was: Apache Spark)

> Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results
>
> Key: SPARK-39584
> URL: https://issues.apache.org/jira/browse/SPARK-39584
> Project: Spark
> Issue Type: Test
> Components: Tests
> Affects Versions: 3.0.3, 3.1.2, 3.2.1, 3.3.0, 3.4.0
> Reporter: Kazuyuki Tanimura
> Priority: Minor
>
> GenTPCDSData uses the schema defined in `TPCDSSchema`, which contains char(N) columns. When GenTPCDSData generates Parquet, it pads strings whose lengths are < N with spaces.
> When TPCDSQueryBenchmark reads data from the Parquet generated by GenTPCDSData, it uses the schema from the Parquet file and keeps the padding. Due to the extra spaces, the string filter queries of TPC-DS fail to match. For example, the q13 query results are all nulls and the query returns too fast because the string filter matches no rows.
> Therefore, TPCDSQueryBenchmark is benchmarking wrong query results, and that is inflating some performance results.
> I am exploring two possible solutions now:
> 1. Call `CREATE TABLE tableName schema USING parquet LOCATION path` before reading. This is what the Spark TPC-DS unit tests do.
> 2. Change char to string in the schema. This is what the [databricks data generator|https://github.com/databricks/spark-sql-perf] does.
> TPCDSQueryBenchmark was ported from databricks/spark-sql-perf in https://issues.apache.org/jira/browse/SPARK-35192
> History of the related char issue: [https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn]
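The padding effect behind the bug is easy to reproduce outside Spark. This plain-Python sketch (no Parquet or Spark involved; the width 16 and the value "Unknown" are arbitrary) shows why an equality filter silently misses char(N)-padded values:

```python
# char(N) semantics pad stored values with trailing spaces to length N.
# A file written from a char(16) column keeps the padded form, so a later
# equality filter against the unpadded literal never matches.
N = 16
stored = "Unknown".ljust(N)          # the padded form that gets persisted

assert stored == "Unknown" + " " * 9  # 16 - len("Unknown") = 9 pad spaces
assert stored != "Unknown"            # the filter comparison that fails
assert stored.rstrip() == "Unknown"   # matching requires stripping the pad
```

This is why the two proposed fixes work: reading through a `CREATE TABLE` with the char(N) schema lets Spark apply char-comparison semantics, while generating the data as string avoids the padding in the first place.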
[jira] [Commented] (SPARK-39584) Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results
[ https://issues.apache.org/jira/browse/SPARK-39584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562841#comment-17562841 ]

Apache Spark commented on SPARK-39584:

User 'kazuyukitanimura' has created a pull request for this issue:
https://github.com/apache/spark/pull/37096
[jira] [Commented] (SPARK-39584) Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results
[ https://issues.apache.org/jira/browse/SPARK-39584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562840#comment-17562840 ]

Apache Spark commented on SPARK-39584:

User 'kazuyukitanimura' has created a pull request for this issue:
https://github.com/apache/spark/pull/37096
[jira] [Assigned] (SPARK-39584) Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results
[ https://issues.apache.org/jira/browse/SPARK-39584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39584:
Assignee: Apache Spark
[jira] [Resolved] (SPARK-39688) getReusablePVCs should handle accounts with no PVC permission
[ https://issues.apache.org/jira/browse/SPARK-39688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-39688. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37095 [https://github.com/apache/spark/pull/37095] > getReusablePVCs should handle accounts with no PVC permission > - > > Key: SPARK-39688 > URL: https://issues.apache.org/jira/browse/SPARK-39688 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.4.0 > >
[jira] [Assigned] (SPARK-34927) Support TPCDSQueryBenchmark in Benchmarks
[ https://issues.apache.org/jira/browse/SPARK-34927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-34927: - Assignee: Kazuaki Ishizaki > Support TPCDSQueryBenchmark in Benchmarks > - > > Key: SPARK-34927 > URL: https://issues.apache.org/jira/browse/SPARK-34927 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Kazuaki Ishizaki >Priority: Minor > Fix For: 3.4.0 > > > Benchmarks.scala currently does not support TPCDSQueryBenchmark. We should > make it supported. See also > https://github.com/apache/spark/pull/32015#issuecomment-89046
[jira] [Resolved] (SPARK-34927) Support TPCDSQueryBenchmark in Benchmarks
[ https://issues.apache.org/jira/browse/SPARK-34927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-34927. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37020 [https://github.com/apache/spark/pull/37020]
[jira] [Assigned] (SPARK-34927) Support TPCDSQueryBenchmark in Benchmarks
[ https://issues.apache.org/jira/browse/SPARK-34927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-34927: - Assignee: Kazuyuki Tanimura (was: Kazuaki Ishizaki)
[jira] [Assigned] (SPARK-39688) getReusablePVCs should handle accounts with no PVC permission
[ https://issues.apache.org/jira/browse/SPARK-39688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-39688: - Assignee: Dongjoon Hyun
[jira] [Resolved] (SPARK-39616) Upgrade Breeze to 2.0
[ https://issues.apache.org/jira/browse/SPARK-39616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-39616. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37002 [https://github.com/apache/spark/pull/37002] > Upgrade Breeze to 2.0 > - > > Key: SPARK-39616 > URL: https://issues.apache.org/jira/browse/SPARK-39616 > Project: Spark > Issue Type: Improvement > Components: Build, ML >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > >
[jira] [Assigned] (SPARK-39616) Upgrade Breeze to 2.0
[ https://issues.apache.org/jira/browse/SPARK-39616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-39616: - Assignee: Ruifeng Zheng
[jira] [Assigned] (SPARK-39688) getReusablePVCs should handle accounts with no PVC permission
[ https://issues.apache.org/jira/browse/SPARK-39688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39688: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-39688) getReusablePVCs should handle accounts with no PVC permission
[ https://issues.apache.org/jira/browse/SPARK-39688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39688: Assignee: Apache Spark
[jira] [Commented] (SPARK-39688) getReusablePVCs should handle accounts with no PVC permission
[ https://issues.apache.org/jira/browse/SPARK-39688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562820#comment-17562820 ] Apache Spark commented on SPARK-39688: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/37095
[jira] [Created] (SPARK-39688) getReusablePVCs should handle accounts with no PVC permission
Dongjoon Hyun created SPARK-39688: - Summary: getReusablePVCs should handle accounts with no PVC permission Key: SPARK-39688 URL: https://issues.apache.org/jira/browse/SPARK-39688 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.4.0 Reporter: Dongjoon Hyun
[jira] [Updated] (SPARK-39018) Add support for YARN decommissioning when ESS is Disabled
[ https://issues.apache.org/jira/browse/SPARK-39018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Dixit updated SPARK-39018: --- Fix Version/s: 3.4.0 > Add support for YARN decommissioning when ESS is Disabled > - > > Key: SPARK-39018 > URL: https://issues.apache.org/jira/browse/SPARK-39018 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, YARN >Affects Versions: 3.2.1 >Reporter: Abhishek Dixit >Priority: Major > Fix For: 3.4.0 > > > Subtask to handle Yarn Executor Decommissioning when Shuffle Service is > Disabled. > This relates to > [SPARK-30835|https://issues.apache.org/jira/browse/SPARK-30835]
[jira] [Resolved] (SPARK-39018) Add support for YARN decommissioning when ESS is Disabled
[ https://issues.apache.org/jira/browse/SPARK-39018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Dixit resolved SPARK-39018. Resolution: Fixed
[jira] [Comment Edited] (SPARK-35662) Support Timestamp without time zone data type
[ https://issues.apache.org/jira/browse/SPARK-35662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562724#comment-17562724 ] Bill Schneider edited comment on SPARK-35662 at 7/5/22 5:07 PM: Is this delayed until 3.4.0? It did not appear to work in Spark 3.3. However, Spark 3.4.0-SNAPSHOT appears to do exactly what I wanted it to: `cast(string, DataTypes.TimestampNTZType)` when written to Parquet, will be exactly the same timestamp when read from a Spark session in a different timezone. was (Author: wrschneider99): Is this delayed until 3.4.0? It did not appear to work in Spark 3.3 > Support Timestamp without time zone data type > - > > Key: SPARK-35662 > URL: https://issues.apache.org/jira/browse/SPARK-35662 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > Spark SQL today supports the TIMESTAMP data type. However the semantics > provided actually match TIMESTAMP WITH LOCAL TIMEZONE as defined by Oracle. > Timestamps embedded in a SQL query or passed through JDBC are presumed to be > in session local timezone and cast to UTC before being processed. > These are desirable semantics in many cases, such as when dealing with > calendars. > In many (more) other cases, such as when dealing with log files it is > desirable that the provided timestamps not be altered. > SQL users expect that they can model either behavior and do so by using > TIMESTAMP WITHOUT TIME ZONE for time zone insensitive data and TIMESTAMP WITH > LOCAL TIME ZONE for time zone sensitive data. > Most traditional RDBMS map TIMESTAMP to TIMESTAMP WITHOUT TIME ZONE and will > be surprised to see TIMESTAMP WITH LOCAL TIME ZONE, a feature that does not > exist in the standard. > In this new feature, we will introduce TIMESTAMP WITH LOCAL TIMEZONE to > describe the existing timestamp type and add TIMESTAMP WITHOUT TIME ZONE for > standard semantic. 
> Using these two types will provide clarity. > We will also allow users to set the default behavior for TIMESTAMP to either > use TIMESTAMP WITH LOCAL TIME ZONE or TIMESTAMP WITHOUT TIME ZONE. > h3. Milestone 1 – Spark Timestamp equivalency ( The new Timestamp type > TimestampWithoutTZ meets or exceeds all function of the existing SQL > Timestamp): > * Add a new DataType implementation for TimestampWithoutTZ. > * Support TimestampWithoutTZ in Dataset/UDF. > * TimestampWithoutTZ literals > * TimestampWithoutTZ arithmetic(e.g. TimestampWithoutTZ - > TimestampWithoutTZ, TimestampWithoutTZ - Date) > * Datetime functions/operators: dayofweek, weekofyear, year, etc > * Cast to and from TimestampWithoutTZ, cast String/Timestamp to > TimestampWithoutTZ, cast TimestampWithoutTZ to string (pretty > printing)/Timestamp, with the SQL syntax to specify the types > * Support sorting TimestampWithoutTZ. > h3. Milestone 2 – Persistence: > * Ability to create tables of type TimestampWithoutTZ > * Ability to write to common file formats such as Parquet and JSON. > * INSERT, SELECT, UPDATE, MERGE > * Discovery > h3. Milestone 3 – Client support > * JDBC support > * Hive Thrift server > h3. Milestone 4 – PySpark and Spark R integration > * Python UDF can take and return TimestampWithoutTZ > * DataFrame support
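The semantic difference between the two proposed types can be sketched with plain Python datetimes (an analogy, not Spark code): a zone-less value keeps its wall-clock fields across sessions, while an instant is normalized to UTC and re-rendered per session time zone.

```python
from datetime import datetime, timedelta, timezone

# TIMESTAMP WITHOUT TIME ZONE: stores wall-clock fields, never shifted.
wall = datetime(2022, 1, 1, 12, 0, 0)

# TIMESTAMP WITH LOCAL TIME ZONE: stores an instant (normalized to UTC)
# and re-renders it in each session's time zone.
instant = wall.replace(tzinfo=timezone.utc)

# A session running in UTC+2 sees the instant's clock time shift...
assert instant.astimezone(timezone(timedelta(hours=2))).hour == 14
# ...while the zone-less value keeps exactly the fields that were written.
assert wall.hour == 12
```

This is the behavior the commenter above observed with TimestampNTZType: the value read back in a different session time zone is the same wall-clock timestamp that was written.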
[jira] [Comment Edited] (SPARK-39681) PySpark build broken in branch-3.2 with Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-39681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562735#comment-17562735 ] Yikun Jiang edited comment on SPARK-39681 at 7/5/22 4:06 PM: - [~hyukjin.kwon] I think this should be backported first. https://issues.apache.org/jira/browse/SPARK-37059 git cherry-pick 81aa51469424a6618b3a3e680c6341c831be2fb3 Then, I am not sure whether we should pin to the old image; I also mentioned it here: https://github.com/apache/spark/pull/37091#issuecomment-1174989349 was (Author: yikunkero): [~hyukjin.kwon] I think this should be backported first. https://issues.apache.org/jira/browse/SPARK-37059 git cherry-pick 81aa51469424a6618b3a3e680c6341c831be2fb3 > PySpark build broken in branch-3.2 with Scala 2.13 > -- > > Key: SPARK-39681 > URL: https://issues.apache.org/jira/browse/SPARK-39681 > Project: Spark > Issue Type: Test > Components: Build, PySpark, Tests >Affects Versions: 3.2.1 >Reporter: Hyukjin Kwon >Priority: Major > > https://github.com/apache/spark/runs/7189972241?check_suite_focus=true > {code} > > Running PySpark tests > > Traceback (most recent call last): > File "./python/run-tests.py", line 65, in > raise RuntimeError("Cannot find assembly build directory, please build > Spark first.") > RuntimeError: Cannot find assembly build directory, please build Spark first. > {code}
[jira] [Comment Edited] (SPARK-39681) PySpark build broken in branch-3.2 with Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-39681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562735#comment-17562735 ] Yikun Jiang edited comment on SPARK-39681 at 7/5/22 4:04 PM: - [~hyukjin.kwon] I think this should be backported first. https://issues.apache.org/jira/browse/SPARK-37059 git cherry-pick 81aa51469424a6618b3a3e680c6341c831be2fb3 was (Author: yikunkero): I think this should be backported first. https://issues.apache.org/jira/browse/SPARK-37059 git cherry-pick 81aa51469424a6618b3a3e680c6341c831be2fb3
[jira] [Comment Edited] (SPARK-39681) PySpark build broken in branch-3.2 with Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-39681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562735#comment-17562735 ] Yikun Jiang edited comment on SPARK-39681 at 7/5/22 4:04 PM: - I think this should be backported first. https://issues.apache.org/jira/browse/SPARK-37059 git cherry-pick 81aa51469424a6618b3a3e680c6341c831be2fb3 was (Author: yikunkero): I think this should be backport first. https://issues.apache.org/jira/browse/SPARK-37059 git cherry-pick 81aa51469424a6618b3a3e680c6341c831be2fb3
[jira] [Commented] (SPARK-39681) PySpark build broken in branch-3.2 with Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-39681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562735#comment-17562735 ] Yikun Jiang commented on SPARK-39681: - I think this should be backport first. https://issues.apache.org/jira/browse/SPARK-37059 git cherry-pick 81aa51469424a6618b3a3e680c6341c831be2fb3
[jira] [Commented] (SPARK-35662) Support Timestamp without time zone data type
[ https://issues.apache.org/jira/browse/SPARK-35662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562724#comment-17562724 ] Bill Schneider commented on SPARK-35662: Is this delayed until 3.4.0? It did not appear to work in Spark 3.3
[jira] [Updated] (SPARK-39677) Wrong args item formatting of the regexp functions
[ https://issues.apache.org/jira/browse/SPARK-39677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-39677: - Fix Version/s: 3.2.2 > Wrong args item formatting of the regexp functions > -- > > Key: SPARK-39677 > URL: https://issues.apache.org/jira/browse/SPARK-39677 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.1.4, 3.2.2, 3.4.0, 3.3.1 > > Attachments: Screenshot 2022-07-05 at 09.48.28.png > > > See the attached screenshot.
[jira] [Updated] (SPARK-39677) Wrong args item formatting of the regexp functions
[ https://issues.apache.org/jira/browse/SPARK-39677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-39677: - Fix Version/s: 3.1.4
[jira] [Commented] (SPARK-39623) partitionng by datestamp leads to wrong query on backend?
[ https://issues.apache.org/jira/browse/SPARK-39623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562695#comment-17562695 ] Dmitry commented on SPARK-39623: It is quoted in my issue, here it is again: I see this query on DB backend: {code:sql} SELECT 1 FROM billinginfo WHERE "datestamp" < '2022-01-02 11:59:59.9375' or "datestamp" is null {code} > partitionng by datestamp leads to wrong query on backend? > - > > Key: SPARK-39623 > URL: https://issues.apache.org/jira/browse/SPARK-39623 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Dmitry >Priority: Major > > Hello, > I am new to Apache spark, so please bear with me. I would like to report what > seems to me a bug, but may be I am just not understanding something. > My goal is to run data analysis on a spark cluster. Data is stored in > PostgreSQL DB. Tables contained timestamped entries (timestamp with time > zone). > The code look like: > {code:python} > from pyspark.sql import SparkSession > spark = SparkSession \ > .builder \ > .appName("foo") \ > .config("spark.jars", "/opt/postgresql-42.4.0.jar") \ > .getOrCreate() > df = spark.read \ > .format("jdbc") \ > .option("url", "jdbc:postgresql://example.org:5432/postgres") \ > .option("dbtable", "billing") \ > .option("user", "user") \ > .option("driver", "org.postgresql.Driver") \ > .option("numPartitions", "4") \ > .option("partitionColumn", "datestamp") \ > .option("lowerBound", "2022-01-01 00:00:00") \ > .option("upperBound", "2022-06-26 23:59:59") \ > .option("fetchsize", 100) \ > .load() > t0 = time.time() > print("Number of entries is => ", df.count(), " Time to execute ", > time.time()-t0) > ... > {code} > datestamp is timestamp with time zone. > I see this query on DB backend: > {code:java} > SELECT 1 FROM billinginfo WHERE "datestamp" < '2022-01-02 11:59:59.9375' or > "datestamp" is null > {code} > The table is huge and entries go way back before > 2022-01-02 11:59:59. 
So what ends up happening: all workers but one > complete, and the one remaining continues to process that query, which, to me, > looks like it wants to get all the data before 2022-01-02 11:59:59, which is > not what I intended. > I remedied this by changing to: > {code:python} > .option("dbtable", "(select * from billinginfo where datestamp > '2022-01-01 00:00:00') as foo") \ > {code} > And that seems to have solved the issue. But this seems kludgy. Am I doing > something wrong, or is there a bug in the way partitioning queries are > generated?
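The behavior reported above matches how Spark's JDBC source derives partition predicates: lowerBound/upperBound only decide the stride boundaries, they do not filter rows, and the first partition's clause is open-ended below (plus NULLs). A simplified, pure-Python sketch of that clause generation (numeric bounds stand in for timestamps; the actual logic lives in Spark's JDBCRelation, and this is an illustration rather than the real implementation):

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Sketch of how a JDBC source can split a partition column into
    per-worker WHERE clauses; lower/upper bound the stride, not the data."""
    stride = (upper - lower) / num_partitions
    preds = []
    for i in range(num_partitions):
        lo = lower + stride * i
        hi = lower + stride * (i + 1)
        if i == 0:
            # The first partition sweeps up everything below the first
            # boundary, including rows far older than lowerBound, plus
            # NULLs -- which is why one worker keeps scanning old data.
            preds.append(f'"{column}" < {hi} or "{column}" is null')
        elif i == num_partitions - 1:
            # The last partition is likewise open-ended above.
            preds.append(f'"{column}" >= {lo}')
        else:
            preds.append(f'"{column}" >= {lo} AND "{column}" < {hi}')
    return preds

for pred in jdbc_partition_predicates("datestamp", 0, 100, 4):
    print(pred)
```

Under this scheme, pre-filtering inside the dbtable subquery, as the reporter did, is the way to bound the data itself rather than just the stride.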
[jira] [Commented] (SPARK-39609) PySpark need to support pypy3.8 to avoid "No module named '_pickle"
[ https://issues.apache.org/jira/browse/SPARK-39609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562682#comment-17562682 ] Yikun Jiang commented on SPARK-39609: - [https://github.com/cloudpipe/cloudpickle/pull/461] cloudpickle doesn't support pypy3.8 yet. > PySpark need to support pypy3.8 to avoid "No module named '_pickle" > --- > > Key: SPARK-39609 > URL: https://issues.apache.org/jira/browse/SPARK-39609 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > > {code:java} > Starting test(pypy3): pyspark.sql.tests.test_arrow (temp output: > /tmp/pypy3__pyspark.sql.tests.test_arrow__jx96qdzs.log) > Traceback (most recent call last): > File "/usr/lib/pypy3.8/runpy.py", line 188, in _run_module_as_main > mod_name, mod_spec, code = _get_module_details(mod_name, _Error) > File "/usr/lib/pypy3.8/runpy.py", line 111, in _get_module_details > __import__(pkg_name) > File "/__w/spark/spark/python/pyspark/__init__.py", line 59, in > from pyspark.rdd import RDD, RDDBarrier > File "/__w/spark/spark/python/pyspark/rdd.py", line 54, in > from pyspark.java_gateway import local_connect_and_auth > File "/__w/spark/spark/python/pyspark/java_gateway.py", line 32, in > from pyspark.serializers import read_int, write_with_length, > UTF8Deserializer > File "/__w/spark/spark/python/pyspark/serializers.py", line 68, in > from pyspark import cloudpickle > File "/__w/spark/spark/python/pyspark/cloudpickle/__init__.py", line 4, in > > from pyspark.cloudpickle.cloudpickle import * # noqa > File "/__w/spark/spark/python/pyspark/cloudpickle/cloudpickle.py", line 57, > in > from .compat import pickle > File "/__w/spark/spark/python/pyspark/cloudpickle/compat.py", line 13, in > > from _pickle import Pickler # noqa: F401 > ModuleNotFoundError: No module named '_pickle' > Had test failures in pyspark.sql.tests.test_arrow with pypy3; see logs.
{code} > Building the latest Dockerfile upgrades pypy3 to 3.8 (originally 3.7), but it seems > cloudpickle has a bug. > This may be related: > https://github.com/cloudpipe/cloudpickle/commit/8bbea3e140767f51dd935a3c8f21c9a8e8702b7c, > but applying it also failed. This needs a deeper look; if you know > the reason for this, please let me know.
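The failing line in the traceback above is an import of CPython's C accelerator module, which PyPy does not ship. A hedged sketch of the usual compat-shim pattern for this situation (falling back to the stdlib's pure-Python pickler when `_pickle` is absent; whether upstream cloudpickle adopts exactly this shape is up to the linked PR):

```python
import io
import pickle

# Prefer CPython's C accelerator; PyPy has no _pickle module,
# so fall back to the pure-Python Pickler from the stdlib.
try:
    from _pickle import Pickler  # CPython fast path
except ImportError:
    from pickle import Pickler   # pure-Python fallback, works on PyPy

# Either Pickler round-trips through the standard pickle protocol.
buf = io.BytesIO()
Pickler(buf).dump({"spark": 3})
assert pickle.loads(buf.getvalue()) == {"spark": 3}
```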
[jira] [Commented] (SPARK-39677) Wrong args item formatting of the regexp functions
[ https://issues.apache.org/jira/browse/SPARK-39677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562641#comment-17562641 ] Apache Spark commented on SPARK-39677: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/37094
[jira] [Commented] (SPARK-39677) Wrong args item formatting of the regexp functions
[ https://issues.apache.org/jira/browse/SPARK-39677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562642#comment-17562642 ] Apache Spark commented on SPARK-39677: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/37093
[jira] [Assigned] (SPARK-39687) Make sure new catalog methods listed in API reference
[ https://issues.apache.org/jira/browse/SPARK-39687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39687: Assignee: (was: Apache Spark) > Make sure new catalog methods listed in API reference > - > > Key: SPARK-39687 > URL: https://issues.apache.org/jira/browse/SPARK-39687 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor >
[jira] [Commented] (SPARK-39687) Make sure new catalog methods listed in API reference
[ https://issues.apache.org/jira/browse/SPARK-39687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562639#comment-17562639 ] Apache Spark commented on SPARK-39687: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37092 > Make sure new catalog methods listed in API reference > - > > Key: SPARK-39687 > URL: https://issues.apache.org/jira/browse/SPARK-39687 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor >
[jira] [Assigned] (SPARK-39687) Make sure new catalog methods listed in API reference
[ https://issues.apache.org/jira/browse/SPARK-39687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39687: Assignee: Apache Spark > Make sure new catalog methods listed in API reference > - > > Key: SPARK-39687 > URL: https://issues.apache.org/jira/browse/SPARK-39687 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Minor >
[jira] [Created] (SPARK-39687) Make sure new catalog methods listed in API reference
Ruifeng Zheng created SPARK-39687: - Summary: Make sure new catalog methods listed in API reference Key: SPARK-39687 URL: https://issues.apache.org/jira/browse/SPARK-39687 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng
[jira] [Resolved] (SPARK-39539) Overflow on converting valid Milliseconds to Microseconds
[ https://issues.apache.org/jira/browse/SPARK-39539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-39539. -- Resolution: Not A Problem > Overflow on converting valid Milliseconds to Microseconds > - > > Key: SPARK-39539 > URL: https://issues.apache.org/jira/browse/SPARK-39539 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: senmiao >Priority: Major > > > {code:java} > scala> import org.apache.spark.sql.catalyst.util.DateTimeUtils._ > import org.apache.spark.sql.catalyst.util.DateTimeUtils._ > scala> millisToMicros(Long.MinValue) > java.lang.ArithmeticException: long overflow > at java.lang.Math.multiplyExact(Math.java:892) > at > org.apache.spark.sql.catalyst.util.DateTimeUtils$.millisToMicros(DateTimeUtils.scala:213) > ... 49 elided > scala> millisToMicros(Long.MaxValue) > java.lang.ArithmeticException: long overflow > at java.lang.Math.multiplyExact(Math.java:892) > at > org.apache.spark.sql.catalyst.util.DateTimeUtils$.millisToMicros(DateTimeUtils.scala:213) > ... 49 elided > {code} > >
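The overflow above is easy to reason about: millisToMicros multiplies its input by 1,000 via Math.multiplyExact, so any value whose magnitude exceeds Long.MaxValue / 1000 must throw. A minimal sketch of that check (in Python, with Java's signed 64-bit bounds written out explicitly as assumptions, not Spark's actual implementation):

```python
# Java long bounds (signed 64-bit), as used by Math.multiplyExact.
LONG_MIN = -2**63
LONG_MAX = 2**63 - 1

def millis_to_micros(millis: int) -> int:
    """Sketch of DateTimeUtils.millisToMicros: multiply by 1000 and,
    like Math.multiplyExact, raise when the result leaves 64-bit range."""
    result = millis * 1000
    if result < LONG_MIN or result > LONG_MAX:
        raise ArithmeticError("long overflow")
    return result
```

Both Long.MinValue and Long.MaxValue trip this guard, which is why the ticket was closed as Not A Problem: the inputs simply are not valid millisecond timestamps once scaled.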
[jira] [Assigned] (SPARK-39610) Add safe.directory for container based job
[ https://issues.apache.org/jira/browse/SPARK-39610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39610: Assignee: Yikun Jiang > Add safe.directory for container based job > -- > > Key: SPARK-39610 > URL: https://issues.apache.org/jira/browse/SPARK-39610 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > > {code:java} > ``` > fatal: unsafe repository ('/__w/spark/spark' is owned by someone else) > To add an exception for this directory, call: > git config --global --add safe.directory /__w/spark/spark > fatal: unsafe repository ('/__w/spark/spark' is owned by someone else) > To add an exception for this directory, call: > git config --global --add safe.directory /__w/spark/spark > Error: Process completed with exit code 128. > ``` {code} > https://github.blog/2022-04-12-git-security-vulnerability-announced/ > [https://github.com/actions/checkout/issues/760] > ```yaml > - name: Github Actions permissions workaround > run: | > git config --global --add safe.directory ${GITHUB_WORKSPACE} > ```
[jira] [Resolved] (SPARK-39610) Add safe.directory for container based job
[ https://issues.apache.org/jira/browse/SPARK-39610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39610. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37079 [https://github.com/apache/spark/pull/37079] > Add safe.directory for container based job > -- > > Key: SPARK-39610 > URL: https://issues.apache.org/jira/browse/SPARK-39610 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.4.0 > > > {code:java} > ``` > fatal: unsafe repository ('/__w/spark/spark' is owned by someone else) > To add an exception for this directory, call: > git config --global --add safe.directory /__w/spark/spark > fatal: unsafe repository ('/__w/spark/spark' is owned by someone else) > To add an exception for this directory, call: > git config --global --add safe.directory /__w/spark/spark > Error: Process completed with exit code 128. > ``` {code} > https://github.blog/2022-04-12-git-security-vulnerability-announced/ > [https://github.com/actions/checkout/issues/760] > ```yaml > - name: Github Actions permissions workaround > run: | > git config --global --add safe.directory ${GITHUB_WORKSPACE} > ```
[jira] [Resolved] (SPARK-39611) PySpark support numpy 1.23.X
[ https://issues.apache.org/jira/browse/SPARK-39611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39611. -- Fix Version/s: 3.3.1 3.2.2 3.4.0 Resolution: Fixed Issue resolved by pull request 37078 [https://github.com/apache/spark/pull/37078] > PySpark support numpy 1.23.X > > > Key: SPARK-39611 > URL: https://issues.apache.org/jira/browse/SPARK-39611 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.3.1, 3.2.2, 3.4.0 > > > {code:java} > == > ERROR [2.102s]: test_arithmetic_op_exceptions > (pyspark.pandas.tests.test_series_datetime.SeriesDateTimeTest) > -- > Traceback (most recent call last): > File "/__w/spark/spark/python/pyspark/pandas/tests/test_series_datetime.py", > line 99, in test_arithmetic_op_exceptions self.assertRaisesRegex(TypeError, > expected_err_msg, lambda: other / psser) File > "/usr/lib/python3.9/unittest/case.py", line 1276, in assertRaisesRegex return > context.handle('assertRaisesRegex', args, kwargs) > File "/usr/lib/python3.9/unittest/case.py", line 201, in handle > callable_obj(*args, **kwargs) > File "/__w/spark/spark/python/pyspark/pandas/tests/test_series_datetime.py", > line 99, in self.assertRaisesRegex(TypeError, expected_err_msg, > lambda: other / psser) > File "/__w/spark/spark/python/pyspark/pandas/base.py", line 465, in > __array_ufunc__ > raise NotImplementedError(NotImplementedError: pandas-on-Spark objects > currently do not support . > -- > {code} > >
[jira] [Assigned] (SPARK-39611) PySpark support numpy 1.23.X
[ https://issues.apache.org/jira/browse/SPARK-39611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39611: Assignee: Yikun Jiang > PySpark support numpy 1.23.X > > > Key: SPARK-39611 > URL: https://issues.apache.org/jira/browse/SPARK-39611 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > > {code:java} > == > ERROR [2.102s]: test_arithmetic_op_exceptions > (pyspark.pandas.tests.test_series_datetime.SeriesDateTimeTest) > -- > Traceback (most recent call last): > File "/__w/spark/spark/python/pyspark/pandas/tests/test_series_datetime.py", > line 99, in test_arithmetic_op_exceptions self.assertRaisesRegex(TypeError, > expected_err_msg, lambda: other / psser) File > "/usr/lib/python3.9/unittest/case.py", line 1276, in assertRaisesRegex return > context.handle('assertRaisesRegex', args, kwargs) > File "/usr/lib/python3.9/unittest/case.py", line 201, in handle > callable_obj(*args, **kwargs) > File "/__w/spark/spark/python/pyspark/pandas/tests/test_series_datetime.py", > line 99, in self.assertRaisesRegex(TypeError, expected_err_msg, > lambda: other / psser) > File "/__w/spark/spark/python/pyspark/pandas/base.py", line 465, in > __array_ufunc__ > raise NotImplementedError(NotImplementedError: pandas-on-Spark objects > currently do not support . > -- > {code} > >
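The failing test in the traceback above follows a standard unittest pattern: wrap the offending binary operation in a lambda and assert that it raises with a message matching a regex. A stdlib-only sketch of the same shape (the Dummy class here is hypothetical, standing in for the pandas-on-Spark Series whose __array_ufunc__ rejects the operation; it is not PySpark code):

```python
import unittest

class Dummy:
    # Hypothetical stand-in for a pandas-on-Spark Series: reject
    # reflected division, analogous to how base.py's __array_ufunc__
    # raises NotImplementedError for unsupported ufuncs.
    def __rtruediv__(self, other):
        raise TypeError("objects currently do not support this operation")

class ArithmeticOpExceptionsTest(unittest.TestCase):
    def test_div_raises(self):
        # Same shape as test_arithmetic_op_exceptions: the operation
        # must fail with a TypeError matching the expected pattern.
        self.assertRaisesRegex(
            TypeError, "do not support", lambda: 1 / Dummy())
```

The numpy 1.23 breakage was precisely a mismatch between the exception actually raised and the pattern such an assertion expects.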
[jira] [Resolved] (SPARK-39612) The dataframe returned by exceptAll() can no longer perform operations such as count() or isEmpty(), or an exception will be thrown.
[ https://issues.apache.org/jira/browse/SPARK-39612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39612. -- Fix Version/s: 3.3.1 3.4.0 Resolution: Fixed Issue resolved by pull request 37084 [https://github.com/apache/spark/pull/37084] > The dataframe returned by exceptAll() can no longer perform operations such > as count() or isEmpty(), or an exception will be thrown. > > > Key: SPARK-39612 > URL: https://issues.apache.org/jira/browse/SPARK-39612 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 > Environment: OS: centos stream 8 > {code:java} > $ uname -a > Linux thomaszhu1.fyre.ibm.com 4.18.0-348.7.1.el8_5.x86_64 #1 SMP Wed Dec 22 > 13:25:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux > $ python --version > Python 3.8.13 > $ pyspark --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 3.3.0 > /_/ > > Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.11 > Branch HEAD > Compiled by user ubuntu on 2022-06-09T19:58:58Z > Revision f74867bddfbcdd4d08076db36851e88b15e66556 > Url https://github.com/apache/spark > Type --help for more information. > $ java --version > openjdk 11.0.11 2021-04-20 > OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9) > OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode) > {code} > >Reporter: Zhu JunYong >Assignee: Hyukjin Kwon >Priority: Critical > Fix For: 3.3.1, 3.4.0 > > > As I said, the dataframe returned by `exceptAll()` can no longer perform > operations such as `count()` or `isEmpty()`, or an exception will be thrown. 
> > > {code:java} > >>> d1 = spark.createDataFrame([("a")], 'STRING') > >>> d1.show() > +-+ > |value| > +-+ > | a| > +-+ > >>> d2 = d1.exceptAll(d1) > >>> d2.show() > +-+ > |value| > +-+ > +-+ > >>> d2.count() > 22/06/27 11:22:15 ERROR Executor: Exception in task 0.0 in stage 113.0 (TID > 525) > java.lang.IllegalStateException: Couldn't find value#465 in [sum#494L] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at scala.collection.immutable.List.map(List.scala:297) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:528) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73) > at > org.apache.spark.sql.execution.GenerateExec.boundGenerator$lzycompute(GenerateExec.scala:75) > at > org.apache.spark.sql.execution.GenerateExec.boundGenerator(GenerateExec.scala:75) > at > org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$10(GenerateExec.scala:114) > at > org.apache.spark.sql.execution.LazyIterator.results$lzycompute(GenerateExec.scala:36) > at > org.apache.spark.sql.execution.LazyIterator.results(GenerateExec.scala:36) > at > 
org.apache.spark.sql.execution.LazyIterator.hasNext(GenerateExec.scala:37) > at scala.collection.Iterator$ConcatIterator.advance(Iterator.scala:199) > at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:227) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.hashAgg_doAggregateWithoutKey_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) >
[jira] [Commented] (SPARK-39686) Disable scheduled builds that do not pass even once
[ https://issues.apache.org/jira/browse/SPARK-39686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562622#comment-17562622 ] Apache Spark commented on SPARK-39686: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/37091 > Disable scheduled builds that do not pass even once > --- > > Key: SPARK-39686 > URL: https://issues.apache.org/jira/browse/SPARK-39686 > Project: Spark > Issue Type: Sub-task > Components: Build, Project Infra >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > We added some more builds that have never been tested so far. We should > probably disable them to check the build status easily for now. > See also SPARK-39681, SPARK-39685, SPARK-39682 and SPARK-39684
[jira] [Assigned] (SPARK-39612) The dataframe returned by exceptAll() can no longer perform operations such as count() or isEmpty(), or an exception will be thrown.
[ https://issues.apache.org/jira/browse/SPARK-39612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39612: Assignee: Hyukjin Kwon > The dataframe returned by exceptAll() can no longer perform operations such > as count() or isEmpty(), or an exception will be thrown. > > > Key: SPARK-39612 > URL: https://issues.apache.org/jira/browse/SPARK-39612 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 > Environment: OS: centos stream 8 > {code:java} > $ uname -a > Linux thomaszhu1.fyre.ibm.com 4.18.0-348.7.1.el8_5.x86_64 #1 SMP Wed Dec 22 > 13:25:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux > $ python --version > Python 3.8.13 > $ pyspark --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 3.3.0 > /_/ > > Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.11 > Branch HEAD > Compiled by user ubuntu on 2022-06-09T19:58:58Z > Revision f74867bddfbcdd4d08076db36851e88b15e66556 > Url https://github.com/apache/spark > Type --help for more information. > $ java --version > openjdk 11.0.11 2021-04-20 > OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9) > OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode) > {code} > >Reporter: Zhu JunYong >Assignee: Hyukjin Kwon >Priority: Critical > > As I said, the dataframe returned by `exceptAll()` can no longer perform > operations such as `count()` or `isEmpty()`, or an exception will be thrown. 
> > > {code:java} > >>> d1 = spark.createDataFrame([("a")], 'STRING') > >>> d1.show() > +-+ > |value| > +-+ > | a| > +-+ > >>> d2 = d1.exceptAll(d1) > >>> d2.show() > +-+ > |value| > +-+ > +-+ > >>> d2.count() > 22/06/27 11:22:15 ERROR Executor: Exception in task 0.0 in stage 113.0 (TID > 525) > java.lang.IllegalStateException: Couldn't find value#465 in [sum#494L] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at scala.collection.immutable.List.map(List.scala:297) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:528) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73) > at > org.apache.spark.sql.execution.GenerateExec.boundGenerator$lzycompute(GenerateExec.scala:75) > at > org.apache.spark.sql.execution.GenerateExec.boundGenerator(GenerateExec.scala:75) > at > org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$10(GenerateExec.scala:114) > at > org.apache.spark.sql.execution.LazyIterator.results$lzycompute(GenerateExec.scala:36) > at > org.apache.spark.sql.execution.LazyIterator.results(GenerateExec.scala:36) > at > 
org.apache.spark.sql.execution.LazyIterator.hasNext(GenerateExec.scala:37) > at scala.collection.Iterator$ConcatIterator.advance(Iterator.scala:199) > at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:227) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.hashAgg_doAggregateWithoutKey_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShu
[jira] [Commented] (SPARK-39686) Disable scheduled builds that do not pass even once
[ https://issues.apache.org/jira/browse/SPARK-39686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562620#comment-17562620 ] Apache Spark commented on SPARK-39686: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/37091 > Disable scheduled builds that do not pass even once > --- > > Key: SPARK-39686 > URL: https://issues.apache.org/jira/browse/SPARK-39686 > Project: Spark > Issue Type: Sub-task > Components: Build, Project Infra >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > We added some more builds that have never been tested so far. We should > probably disable them to check the build status easily for now. > See also SPARK-39681, SPARK-39685, SPARK-39682 and SPARK-39684
[jira] [Resolved] (SPARK-39606) Use child stats to estimate order operator
[ https://issues.apache.org/jira/browse/SPARK-39606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-39606. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36994 [https://github.com/apache/spark/pull/36994] > Use child stats to estimate order operator > -- > > Key: SPARK-39606 > URL: https://issues.apache.org/jira/browse/SPARK-39606 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.4.0 > >
[jira] [Assigned] (SPARK-39686) Disable scheduled builds that do not pass even once
[ https://issues.apache.org/jira/browse/SPARK-39686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39686: Assignee: (was: Apache Spark) > Disable scheduled builds that do not pass even once > --- > > Key: SPARK-39686 > URL: https://issues.apache.org/jira/browse/SPARK-39686 > Project: Spark > Issue Type: Sub-task > Components: Build, Project Infra >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > We added some more builds that have never been tested so far. We should > probably disable them to check the build status easily for now. > See also SPARK-39681, SPARK-39685, SPARK-39682 and SPARK-39684
[jira] [Assigned] (SPARK-39606) Use child stats to estimate order operator
[ https://issues.apache.org/jira/browse/SPARK-39606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-39606: --- Assignee: Yuming Wang > Use child stats to estimate order operator > -- > > Key: SPARK-39606 > URL: https://issues.apache.org/jira/browse/SPARK-39606 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major >
[jira] [Assigned] (SPARK-39686) Disable scheduled builds that do not pass even once
[ https://issues.apache.org/jira/browse/SPARK-39686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39686: Assignee: Apache Spark > Disable scheduled builds that do not pass even once > --- > > Key: SPARK-39686 > URL: https://issues.apache.org/jira/browse/SPARK-39686 > Project: Spark > Issue Type: Sub-task > Components: Build, Project Infra >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > We added some more builds that have never been tested so far. We should > probably disable them to check the build status easily for now. > See also SPARK-39681, SPARK-39685, SPARK-39682 and SPARK-39684
[jira] [Updated] (SPARK-39681) PySpark build broken in branch-3.2 with Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-39681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-39681: - Description: https://github.com/apache/spark/runs/7189972241?check_suite_focus=true {code} Running PySpark tests Traceback (most recent call last): File "./python/run-tests.py", line 65, in <module> raise RuntimeError("Cannot find assembly build directory, please build Spark first.") RuntimeError: Cannot find assembly build directory, please build Spark first. {code} was: {code} Running PySpark tests Traceback (most recent call last): File "./python/run-tests.py", line 65, in <module> raise RuntimeError("Cannot find assembly build directory, please build Spark first.") RuntimeError: Cannot find assembly build directory, please build Spark first. {code} https://github.com/apache/spark/runs/7189972241?check_suite_focus=true > PySpark build broken in branch-3.2 with Scala 2.13 > -- > > Key: SPARK-39681 > URL: https://issues.apache.org/jira/browse/SPARK-39681 > Project: Spark > Issue Type: Test > Components: Build, PySpark, Tests >Affects Versions: 3.2.1 >Reporter: Hyukjin Kwon >Priority: Major > > https://github.com/apache/spark/runs/7189972241?check_suite_focus=true > {code} > > Running PySpark tests > > Traceback (most recent call last): > File "./python/run-tests.py", line 65, in <module> > raise RuntimeError("Cannot find assembly build directory, please build > Spark first.") > RuntimeError: Cannot find assembly build directory, please build Spark first. > {code}
[jira] [Created] (SPARK-39686) Disable scheduled builds that do not pass even once
Hyukjin Kwon created SPARK-39686: Summary: Disable scheduled builds that do not pass even once Key: SPARK-39686 URL: https://issues.apache.org/jira/browse/SPARK-39686 Project: Spark Issue Type: Sub-task Components: Build, Project Infra Affects Versions: 3.4.0 Reporter: Hyukjin Kwon We added some more builds that have never been tested so far. We should probably disable them to check the build status easily for now. See also SPARK-39681, SPARK-39685, SPARK-39682 and SPARK-39684
[jira] [Updated] (SPARK-39684) Docker IT build broken in master with Hadoop 2
[ https://issues.apache.org/jira/browse/SPARK-39684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-39684: - Affects Version/s: 3.4.0 (was: 3.2.1) > Docker IT build broken in master with Hadoop 2 > -- > > Key: SPARK-39684 > URL: https://issues.apache.org/jira/browse/SPARK-39684 > Project: Spark > Issue Type: Test > Components: Build, SQL, Tests >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > https://github.com/apache/spark/runs/7054055518?check_suite_focus=true > {code} > [info] DB2KrbIntegrationSuite: > [info] org.apache.spark.sql.jdbc.DB2KrbIntegrationSuite *** ABORTED *** (3 > minutes, 47 seconds) > [info] The code passed to eventually never returned normally. Attempted 185 > times over 3.009751720714 minutes. Last failure message: Login failure > for db2/10.1.0...@example.com from keytab > /home/runner/work/spark/spark/target/tmp/spark-0c9cf0ca-6ce0-491c-b032-bbccf22d51ac/db2.keytab. > (DockerJDBCIntegrationSuite.scala:166) > [info] org.scalatest.exceptions.TestFailedDueToTimeoutException: > [info] at > org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:185) > [info] at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:192) > [info] at > org.scalatest.concurrent.Eventually.eventually(Eventually.scala:402) > [info] at > org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:401) > [info] at > org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.eventually(DockerJDBCIntegrationSuite.scala:95) > [info] at > org.scalatest.concurrent.Eventually.eventually(Eventually.scala:312) > [info] at > org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:311) > [info] at > org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.eventually(DockerJDBCIntegrationSuite.scala:95) > [info] at > org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.$anonfun$beforeAll$1(DockerJDBCIntegrationSuite.scala:166) > [info] at > 
org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled(DockerIntegrationFunSuite.scala:49) > [info] at > org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled$(DockerIntegrationFunSuite.scala:47) > [info] at > org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.runIfTestsEnabled(DockerJDBCIntegrationSuite.scala:95) > [info] at > org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:118) > [info] at > org.apache.spark.sql.jdbc.DockerKrbJDBCIntegrationSuite.super$beforeAll(DockerKrbJDBCIntegrationSuite.scala:65) > [info] at > org.apache.spark.sql.jdbc.DockerKrbJDBCIntegrationSuite.$anonfun$beforeAll$1(DockerKrbJDBCIntegrationSuite.scala:65) > [info] at > org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled(DockerIntegrationFunSuite.scala:49) > [info] at > org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled$(DockerIntegrationFunSuite.scala:47) > [info] at > org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.runIfTestsEnabled(DockerJDBCIntegrationSuite.scala:95) > [info] at > org.apache.spark.sql.jdbc.DockerKrbJDBCIntegrationSuite.beforeAll(DockerKrbJDBCIntegrationSuite.scala:44) > [info] at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212) > [info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > [info] at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > [info] at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:64) > [info] at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:318) > [info] at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:513) > [info] at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413) > [info] at java.util.concurrent.FutureTask.run(FutureTask.java:266) > [info] at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [info] at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [info] at java.lang.Thread.run(Thread.java:750) > [info] Cause: java.io.IOException: Login failure for > db2/10.1.0...@example.com from keytab > /home/runner/work/spark/spark/target/tmp/spark-0c9cf0ca-6ce0-491c-b032-bbccf22d51ac/db2.keytab > [info] at > org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytabAndReturnUGI(UserGroupInformation.java:1231) > [info] at > org.apache.spark.sql.jdbc.DB2KrbIntegrationSuite.getConnection(DB2KrbIntegrationSuite.scala:93) > [info] at > org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.$anonfun$beforeAll$3(DockerJDBCIntegrationSuite.scala:167) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info]
[jira] [Created] (SPARK-39685) Linter build broken in branch-3.2 with Scala 2.13
Hyukjin Kwon created SPARK-39685: Summary: Linter build broken in branch-3.2 with Scala 2.13 Key: SPARK-39685 URL: https://issues.apache.org/jira/browse/SPARK-39685 Project: Spark Issue Type: Test Components: Build, PySpark, Tests Affects Versions: 3.2.1 Reporter: Hyukjin Kwon https://github.com/apache/spark/runs/7189971643?check_suite_focus=true {code} mypy checks failed: python/pyspark/pandas/typedef/string_typehints.py:24: error: Name "np" already defined (by an import) python/pyspark/pandas/typedef/string_typehints.py:30: error: Incompatible import of "DataFrame" (imported name has type "Type[pyspark.pandas.frame.DataFrame[Any]]", local name has type "Type[pandas.core.frame.DataFrame]") python/pyspark/pandas/typedef/string_typehints.py:30: error: Incompatible import of "Series" (imported name has type "Type[pyspark.pandas.series.Series[Any]]", local name has type "Type[pandas.core.series.Series]") python/pyspark/sql/pandas/_typing/__init__.pyi:39: error: Incompatible types in assignment (expression has type "Type[DataFrame]", variable has type "Type[DataFrameLike]") python/pyspark/sql/pandas/_typing/__init__.pyi:40: error: Incompatible types in assignment (expression has type "Type[Series]", variable has type "Type[SeriesLike]") python/pyspark/pandas/datetimes.py:37: error: Cannot determine type of "spark" python/pyspark/pandas/datetimes.py:39: error: Cannot determine type of "spark" python/pyspark/pandas/datetimes.py:52: error: Cannot determine type of "spark" python/pyspark/pandas/datetimes.py:67: error: Cannot determine type of "spark" python/pyspark/pandas/datetimes.py:74: error: Cannot determine type of "spark" python/pyspark/pandas/datetimes.py:81: error: Cannot determine type of "spark" python/pyspark/pandas/datetimes.py:88: error: Cannot determine type of "spark" python/pyspark/pandas/datetimes.py:95: error: Cannot determine type of "spark" python/pyspark/pandas/datetimes.py:102: error: Cannot determine type of "spark" 
python/pyspark/pandas/datetimes.py:125: error: Cannot determine type of "spark" python/pyspark/pandas/typedef/typehints.py:36: error: Module "pandas.api.types" has no attribute "pandas_dtype" python/pyspark/pandas/typedef/typehints.py:526: error: Argument 3 to "DataFrameType" has incompatible type "List[None]"; expected "List[Optional[str]]" python/pyspark/pandas/typedef/typehints.py:526: note: "List" is invariant -- see https://mypy.readthedocs.io/en/stable/common_issues.html#variance python/pyspark/pandas/typedef/typehints.py:526: note: Consider using "Sequence" instead, which is covariant python/pyspark/pandas/utils.py:45: error: Module "pandas.api.types" has no attribute "is_list_like" python/pyspark/pandas/internal.py:1540: error: Argument "name" to "StructField" has incompatible type "Optional[Hashable]"; expected "str" python/pyspark/pandas/indexing.py:27: error: Module "pandas.api.types" has no attribute "is_list_like" python/pyspark/pandas/indexing.py:993: error: Cannot determine type of "spark" python/pyspark/pandas/indexing.py:994: error: Cannot determine type of "spark" python/pyspark/pandas/indexing.py:1014: error: Cannot determine type of "spark" python/pyspark/pandas/indexing.py:1021: error: Cannot determine type of "spark" python/pyspark/pandas/indexing.py:1023: error: Cannot determine type of "spark" python/pyspark/pandas/indexing.py:1024: error: Cannot determine type of "spark" python/pyspark/pandas/indexing.py:1055: error: Cannot determine type of "spark" python/pyspark/pandas/indexing.py:1067: error: Cannot determine type of "spark" python/pyspark/pandas/indexing.py:1079: error: Cannot determine type of "spark" python/pyspark/pandas/indexing.py:1139: error: Cannot determine type of "spark" python/pyspark/pandas/indexing.py:1142: error: Cannot determine type of "spark" python/pyspark/pandas/indexing.py:1148: error: Cannot determine type of "spark" python/pyspark/pandas/generic.py:42: error: Module "pandas.api.types" has no attribute 
"is_list_like" python/pyspark/pandas/generic.py:1206: error: Cannot determine type of "spark" python/pyspark/pandas/generic.py:1207: error: Cannot determine type of "spark" python/pyspark/pandas/generic.py:1293: error: Cannot determine type of "spark" python/pyspark/pandas/generic.py:1294: error: Cannot determine type of "spark" python/pyspark/pandas/generic.py:1379: error: Cannot determine type of "spark" python/pyspark/pandas/generic.py:1380: error: Cannot determine type of "spark" python/pyspark/pandas/generic.py:1452: error: Cannot determine type of "spark" python/pyspark/pandas/generic.py:1453: error: Cannot determine type of "spark" python/pyspark/pandas/generic.py:1520: error: Cannot determine type of "spark" python/pyspark/pandas/generic.py:1521: error: Cannot determine type of "spark" python/pyspark/pandas/generic.py:1590: error: Cannot determine type of "spark" python/
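The `List[None]` vs. `List[Optional[str]]` error at typehints.py:526 follows from `List` being invariant in its element type, exactly as mypy's note says; `Sequence` is covariant and accepts both. A small, hypothetical sketch of the distinction (`join_names` is illustrative, not Spark code):

```python
from typing import List, Optional, Sequence

def join_names(names: Sequence[Optional[str]]) -> str:
    # Sequence is covariant in its element type, so List[None] and
    # List[str] arguments both type-check; a List[Optional[str]]
    # parameter would reject them, because List is invariant.
    return ",".join(n if n is not None else "?" for n in names)

nones: List[None] = [None, None]
print(join_names(nones))  # prints "?,?"
```

This is why the mypy note suggests switching the annotation to `Sequence` rather than changing the call sites.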
[jira] [Updated] (SPARK-39681) PySpark build broken in branch-3.2 with Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-39681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-39681: - Component/s: Tests > PySpark build broken in branch-3.2 with Scala 2.13 > -- > > Key: SPARK-39681 > URL: https://issues.apache.org/jira/browse/SPARK-39681 > Project: Spark > Issue Type: Test > Components: Build, PySpark, Tests >Affects Versions: 3.2.1 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > > Running PySpark tests > > Traceback (most recent call last): > File "./python/run-tests.py", line 65, in > raise RuntimeError("Cannot find assembly build directory, please build > Spark first.") > RuntimeError: Cannot find assembly build directory, please build Spark first. > {code} > https://github.com/apache/spark/runs/7189972241?check_suite_focus=true -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
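The RuntimeError in SPARK-39681 comes from a guard in python/run-tests.py that refuses to start the PySpark tests when no Spark assembly build directory exists yet; building Spark first resolves it. A rough sketch of that kind of guard (paths simplified, not the actual Spark script):

```python
import os

def find_assembly_dir(spark_home: str) -> str:
    # The real script searches for assembly/target/scala-<version>;
    # this sketch only checks that an assembly target directory exists.
    candidate = os.path.join(spark_home, "assembly", "target")
    if not os.path.isdir(candidate):
        raise RuntimeError(
            "Cannot find assembly build directory, please build Spark first.")
    return candidate
```

In the CI failure above the guard fires because the Scala 2.13 build step did not produce the assembly directory the test runner expects.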
[jira] [Updated] (SPARK-39682) Docker IT build broken in branch-3.2 with Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-39682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-39682: - Component/s: SQL > Docker IT build broken in branch-3.2 with Scala 2.13 > > > Key: SPARK-39682 > URL: https://issues.apache.org/jira/browse/SPARK-39682 > Project: Spark > Issue Type: Test > Components: Build, SQL, Tests >Affects Versions: 3.2.1 >Reporter: Hyukjin Kwon >Priority: Major > > https://github.com/apache/spark/runs/7189971505?check_suite_focus=true > {code} > [info] OracleIntegrationSuite: > [info] org.apache.spark.sql.jdbc.v2.OracleIntegrationSuite *** ABORTED *** (8 > minutes, 1 second) > [info] The code passed to eventually never returned normally. Attempted 426 > times over 7.008370057216667 minutes. Last failure message: IO Error: The > Network Adapter could not establish the connection. > (DockerJDBCIntegrationSuite.scala:166) > [info] org.scalatest.exceptions.TestFailedDueToTimeoutException: > [info] at > org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:189) > [info] at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:196) > [info] at > org.scalatest.concurrent.Eventually.eventually(Eventually.scala:313) > [info] at > org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:312) > [info] at > org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.eventually(DockerJDBCIntegrationSuite.scala:95) > [info] at > org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.$anonfun$beforeAll$1(DockerJDBCIntegrationSuite.scala:166) > [info] at > org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled(DockerIntegrationFunSuite.scala:49) > [info] at > org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled$(DockerIntegrationFunSuite.scala:47) > [info] at > org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.runIfTestsEnabled(DockerJDBCIntegrationSuite.scala:95) > [info] at > 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:118) > [info] at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212) > [info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > [info] at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > [info] at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:62) > [info] at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:318) > [info] at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:513) > [info] at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413) > [info] at java.util.concurrent.FutureTask.run(FutureTask.java:266) > [info] at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [info] at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [info] at java.lang.Thread.run(Thread.java:750) > [info] Cause: java.sql.SQLRecoverableException: IO Error: The Network > Adapter could not establish the connection > [info] at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:858) > [info] at > oracle.jdbc.driver.PhysicalConnection.connect(PhysicalConnection.java:793) > [info] at > oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:57) > [info] at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:747) > [info] at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:562) > [info] at java.sql.DriverManager.getConnection(DriverManager.java:664) > [info] at java.sql.DriverManager.getConnection(DriverManager.java:208) > [info] at > org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.getConnection(DockerJDBCIntegrationSuite.scala:200) > [info] at > org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.$anonfun$beforeAll$3(DockerJDBCIntegrationSuite.scala:167) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at > 
org.scalatest.enablers.Retrying$$anon$4.makeAValiantAttempt$1(Retrying.scala:154) > [info] at > org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:166) > [info] at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:196) > [info] at > org.scalatest.concurrent.Eventually.eventually(Eventually.scala:313) > [info] at > org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:312) > [info] at > org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.eventually(DockerJDBCIntegrationSuite.scala:95) > [info] at > org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.$anonfun$beforeAll$1(DockerJDBCIntegrationSuite.scala:166) > [info] at > org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled(DockerIntegrationFunSuite.scala:
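Both Docker IT aborts surface through ScalaTest's `eventually`, which retries a block until it succeeds or a timeout expires; here the Oracle suite retried the JDBC connection 426 times over roughly 7 minutes before giving up. The retry pattern itself, sketched in Python (illustrative only, not Spark's test code):

```python
import time

def eventually(check, timeout=5.0, interval=0.1):
    # Retry `check` until it returns without raising; once the timeout
    # elapses, re-raise the last failure, like ScalaTest's eventually.
    deadline = time.monotonic() + timeout
    attempts = 0
    while True:
        attempts += 1
        try:
            return check(), attempts
        except Exception:
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)
```

In DockerJDBCIntegrationSuite the retried block opens a JDBC connection to the container, so the abort above means the container never became reachable within the configured timeout.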
[jira] [Created] (SPARK-39684) Docker IT build broken in master with Hadoop 2
Hyukjin Kwon created SPARK-39684: Summary: Docker IT build broken in master with Hadoop 2 Key: SPARK-39684 URL: https://issues.apache.org/jira/browse/SPARK-39684 Project: Spark Issue Type: Test Components: Build, SQL, Tests Affects Versions: 3.2.1 Reporter: Hyukjin Kwon https://github.com/apache/spark/runs/7054055518?check_suite_focus=true {code} [info] DB2KrbIntegrationSuite: [info] org.apache.spark.sql.jdbc.DB2KrbIntegrationSuite *** ABORTED *** (3 minutes, 47 seconds) [info] The code passed to eventually never returned normally. Attempted 185 times over 3.009751720714 minutes. Last failure message: Login failure for db2/10.1.0...@example.com from keytab /home/runner/work/spark/spark/target/tmp/spark-0c9cf0ca-6ce0-491c-b032-bbccf22d51ac/db2.keytab. (DockerJDBCIntegrationSuite.scala:166) [info] org.scalatest.exceptions.TestFailedDueToTimeoutException: [info] at org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:185) [info] at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:192) [info] at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:402) [info] at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:401) [info] at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.eventually(DockerJDBCIntegrationSuite.scala:95) [info] at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:312) [info] at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:311) [info] at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.eventually(DockerJDBCIntegrationSuite.scala:95) [info] at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.$anonfun$beforeAll$1(DockerJDBCIntegrationSuite.scala:166) [info] at org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled(DockerIntegrationFunSuite.scala:49) [info] at org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled$(DockerIntegrationFunSuite.scala:47) [info] at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.runIfTestsEnabled(DockerJDBCIntegrationSuite.scala:95) [info] at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:118) [info] at org.apache.spark.sql.jdbc.DockerKrbJDBCIntegrationSuite.super$beforeAll(DockerKrbJDBCIntegrationSuite.scala:65) [info] at org.apache.spark.sql.jdbc.DockerKrbJDBCIntegrationSuite.$anonfun$beforeAll$1(DockerKrbJDBCIntegrationSuite.scala:65) [info] at org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled(DockerIntegrationFunSuite.scala:49) [info] at org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled$(DockerIntegrationFunSuite.scala:47) [info] at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.runIfTestsEnabled(DockerJDBCIntegrationSuite.scala:95) [info] at org.apache.spark.sql.jdbc.DockerKrbJDBCIntegrationSuite.beforeAll(DockerKrbJDBCIntegrationSuite.scala:44) [info] at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212) [info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) [info] at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) [info] at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:64) [info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:318) [info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:513) [info] at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413) [info] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [info] at java.lang.Thread.run(Thread.java:750) [info] Cause: java.io.IOException: Login failure for db2/10.1.0...@example.com from keytab /home/runner/work/spark/spark/target/tmp/spark-0c9cf0ca-6ce0-491c-b032-bbccf22d51ac/db2.keytab [info] at 
org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytabAndReturnUGI(UserGroupInformation.java:1231) [info] at org.apache.spark.sql.jdbc.DB2KrbIntegrationSuite.getConnection(DB2KrbIntegrationSuite.scala:93) [info] at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.$anonfun$beforeAll$3(DockerJDBCIntegrationSuite.scala:167) [info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) [info] at org.scalatest.enablers.Retrying$$anon$4.makeAValiantAttempt$1(Retrying.scala:150) [info] at org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:162) [info] at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:192) [info] at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:402) [info] at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:401)
[jira] [Updated] (SPARK-39683) Did not find value which can be converted into java.lang.String
[ https://issues.apache.org/jira/browse/SPARK-39683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zehra updated SPARK-39683: -- Description: Hi, I have a problem with loading the model in pyspark. I wrote it in detail at this link. Could you help me with it? Thanks. [https://stackoverflow.com/questions/72868619/apache-spark-loading-als-model-did-not-find-value-which-can-be-converted-into] was: Hi, I have a problem with loading the model in pyspark. I wrote it in detail at this link. Could you help me with it? Thanks. [Error Message|http://example.com/] https://stackoverflow.com/questions/72868619/apache-spark-loading-als-model-did-not-find-value-which-can-be-converted-into > Did not find value which can be converted into java.lang.String > --- > > Key: SPARK-39683 > URL: https://issues.apache.org/jira/browse/SPARK-39683 > Project: Spark > Issue Type: Question > Components: ML, PySpark >Affects Versions: 3.2.0 >Reporter: zehra >Priority: Major > > Hi, I have a problem with loading the model in pyspark. I wrote it in > detail at this link. Could you help me with it? Thanks. > > [https://stackoverflow.com/questions/72868619/apache-spark-loading-als-model-did-not-find-value-which-can-be-converted-into]
[jira] [Updated] (SPARK-39683) Did not find value which can be converted into java.lang.String
[ https://issues.apache.org/jira/browse/SPARK-39683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zehra updated SPARK-39683: -- Description: Hi, I have a problem with loading the ALS model in pyspark. I wrote it in detail at this link. Could you help me with it? Thanks. [https://stackoverflow.com/questions/72868619/apache-spark-loading-als-model-did-not-find-value-which-can-be-converted-into] was: Hi, I have a problem with loading the model in pyspark. I wrote it in detail at this link. Could you help me with it? Thanks. [https://stackoverflow.com/questions/72868619/apache-spark-loading-als-model-did-not-find-value-which-can-be-converted-into] > Did not find value which can be converted into java.lang.String > --- > > Key: SPARK-39683 > URL: https://issues.apache.org/jira/browse/SPARK-39683 > Project: Spark > Issue Type: Question > Components: ML, PySpark >Affects Versions: 3.2.0 >Reporter: zehra >Priority: Major > > Hi, I have a problem with loading the ALS model in pyspark. I wrote it in > detail at this link. Could you help me with it? Thanks. > > [https://stackoverflow.com/questions/72868619/apache-spark-loading-als-model-did-not-find-value-which-can-be-converted-into]
[jira] [Updated] (SPARK-39683) Did not find value which can be converted into java.lang.String
[ https://issues.apache.org/jira/browse/SPARK-39683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zehra updated SPARK-39683: -- Description: Hi, I have a problem with loading the model in pyspark. I wrote it in detail at this link. Could you help me with it? Thanks. [Error Message|http://example.com/] https://stackoverflow.com/questions/72868619/apache-spark-loading-als-model-did-not-find-value-which-can-be-converted-into was: Hi, I have a problem with loading the model in pyspark. I wrote it in detail at this link. Could you help me with it? Thanks. [Error Message|http://example.com]https://stackoverflow.com/questions/72868619/apache-spark-loading-als-model-did-not-find-value-which-can-be-converted-into > Did not find value which can be converted into java.lang.String > --- > > Key: SPARK-39683 > URL: https://issues.apache.org/jira/browse/SPARK-39683 > Project: Spark > Issue Type: Question > Components: ML, PySpark >Affects Versions: 3.2.0 >Reporter: zehra >Priority: Major > > Hi, I have a problem with loading the model in pyspark. I wrote it in > detail at this link. Could you help me with it? Thanks. > [Error Message|http://example.com/] > https://stackoverflow.com/questions/72868619/apache-spark-loading-als-model-did-not-find-value-which-can-be-converted-into
[jira] [Created] (SPARK-39683) Did not find value which can be converted into java.lang.String
zehra created SPARK-39683: - Summary: Did not find value which can be converted into java.lang.String Key: SPARK-39683 URL: https://issues.apache.org/jira/browse/SPARK-39683 Project: Spark Issue Type: Question Components: ML, PySpark Affects Versions: 3.2.0 Reporter: zehra Hi, I have a problem with loading the model in pyspark. I wrote it in detail at this link. Could you help me with it? Thanks. [Error Message|http://example.com]https://stackoverflow.com/questions/72868619/apache-spark-loading-als-model-did-not-find-value-which-can-be-converted-into
[jira] [Updated] (SPARK-39681) PySpark build broken in branch-3.2 with Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-39681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-39681: - Summary: PySpark build broken in branch-3.2 with Scala 2.13 (was: PySpark build in branch-3.2 with Scala 2.13 fails) > PySpark build broken in branch-3.2 with Scala 2.13 > -- > > Key: SPARK-39681 > URL: https://issues.apache.org/jira/browse/SPARK-39681 > Project: Spark > Issue Type: Test > Components: Build, PySpark >Affects Versions: 3.2.1 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > > Running PySpark tests > > Traceback (most recent call last): > File "./python/run-tests.py", line 65, in > raise RuntimeError("Cannot find assembly build directory, please build > Spark first.") > RuntimeError: Cannot find assembly build directory, please build Spark first. > {code} > https://github.com/apache/spark/runs/7189972241?check_suite_focus=true
[jira] [Created] (SPARK-39682) Docker IT build broken in branch-3.2 with Scala 2.13
Hyukjin Kwon created SPARK-39682: Summary: Docker IT build broken in branch-3.2 with Scala 2.13 Key: SPARK-39682 URL: https://issues.apache.org/jira/browse/SPARK-39682 Project: Spark Issue Type: Test Components: Build, Tests Affects Versions: 3.2.1 Reporter: Hyukjin Kwon https://github.com/apache/spark/runs/7189971505?check_suite_focus=true {code} [info] OracleIntegrationSuite: [info] org.apache.spark.sql.jdbc.v2.OracleIntegrationSuite *** ABORTED *** (8 minutes, 1 second) [info] The code passed to eventually never returned normally. Attempted 426 times over 7.008370057216667 minutes. Last failure message: IO Error: The Network Adapter could not establish the connection. (DockerJDBCIntegrationSuite.scala:166) [info] org.scalatest.exceptions.TestFailedDueToTimeoutException: [info] at org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:189) [info] at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:196) [info] at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:313) [info] at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:312) [info] at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.eventually(DockerJDBCIntegrationSuite.scala:95) [info] at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.$anonfun$beforeAll$1(DockerJDBCIntegrationSuite.scala:166) [info] at org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled(DockerIntegrationFunSuite.scala:49) [info] at org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled$(DockerIntegrationFunSuite.scala:47) [info] at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.runIfTestsEnabled(DockerJDBCIntegrationSuite.scala:95) [info] at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:118) [info] at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212) [info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) [info] at 
org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) [info] at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:62) [info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:318) [info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:513) [info] at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413) [info] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [info] at java.lang.Thread.run(Thread.java:750) [info] Cause: java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection [info] at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:858) [info] at oracle.jdbc.driver.PhysicalConnection.connect(PhysicalConnection.java:793) [info] at oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:57) [info] at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:747) [info] at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:562) [info] at java.sql.DriverManager.getConnection(DriverManager.java:664) [info] at java.sql.DriverManager.getConnection(DriverManager.java:208) [info] at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.getConnection(DockerJDBCIntegrationSuite.scala:200) [info] at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.$anonfun$beforeAll$3(DockerJDBCIntegrationSuite.scala:167) [info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) [info] at org.scalatest.enablers.Retrying$$anon$4.makeAValiantAttempt$1(Retrying.scala:154) [info] at org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:166) [info] at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:196) [info] at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:313) [info] at 
org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:312) [info] at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.eventually(DockerJDBCIntegrationSuite.scala:95) [info] at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.$anonfun$beforeAll$1(DockerJDBCIntegrationSuite.scala:166) [info] at org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled(DockerIntegrationFunSuite.scala:49) [info] at org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled$(DockerIntegrationFunSuite.scala:47) [info] at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.runIfTestsEnabled(DockerJDBCIntegrationSuite.scala:95) [info] at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:118) [info] at org.scala
[jira] [Created] (SPARK-39681) PySpark build in branch-3.2 with Scala 2.13 fails
Hyukjin Kwon created SPARK-39681: Summary: PySpark build in branch-3.2 with Scala 2.13 fails Key: SPARK-39681 URL: https://issues.apache.org/jira/browse/SPARK-39681 Project: Spark Issue Type: Test Components: Build, PySpark Affects Versions: 3.2.1 Reporter: Hyukjin Kwon {code} Running PySpark tests Traceback (most recent call last): File "./python/run-tests.py", line 65, in raise RuntimeError("Cannot find assembly build directory, please build Spark first.") RuntimeError: Cannot find assembly build directory, please build Spark first. {code} https://github.com/apache/spark/runs/7189972241?check_suite_focus=true
[jira] [Updated] (SPARK-39677) Wrong args item formatting of the regexp functions
[ https://issues.apache.org/jira/browse/SPARK-39677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-39677: - Fix Version/s: 3.4.0 3.3.1 > Wrong args item formatting of the regexp functions > -- > > Key: SPARK-39677 > URL: https://issues.apache.org/jira/browse/SPARK-39677 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0, 3.3.1 > > Attachments: Screenshot 2022-07-05 at 09.48.28.png > > > See the attached screenshot.
[jira] [Resolved] (SPARK-39677) Wrong args item formatting of the regexp functions
[ https://issues.apache.org/jira/browse/SPARK-39677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-39677. -- Resolution: Fixed > Wrong args item formatting of the regexp functions > -- > > Key: SPARK-39677 > URL: https://issues.apache.org/jira/browse/SPARK-39677 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0, 3.3.1 > > Attachments: Screenshot 2022-07-05 at 09.48.28.png > > > See the attached screenshot.
[jira] [Commented] (SPARK-37527) Translate more standard aggregate functions for pushdown
[ https://issues.apache.org/jira/browse/SPARK-37527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562538#comment-17562538 ] Apache Spark commented on SPARK-37527: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/37090 > Translate more standard aggregate functions for pushdown > > > Key: SPARK-37527 > URL: https://issues.apache.org/jira/browse/SPARK-37527 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.3.0 > > > Currently, Spark aggregate pushdown translates some standard aggregate > functions so that they can be compiled into the dialect of the target database. > After this change, users can override JdbcDialect.compileAggregate to > implement aggregate functions specific to a given database.
[jira] [Commented] (SPARK-39579) Make ListFunctions/getFunction/functionExists API compatible
[ https://issues.apache.org/jira/browse/SPARK-39579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562520#comment-17562520 ] Apache Spark commented on SPARK-39579: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37088 > Make ListFunctions/getFunction/functionExists API compatible > - > > Key: SPARK-39579 > URL: https://issues.apache.org/jira/browse/SPARK-39579 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0
[jira] [Commented] (SPARK-39447) Only non-broadcast query stage can propagate empty relation
[ https://issues.apache.org/jira/browse/SPARK-39447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562514#comment-17562514 ] Apache Spark commented on SPARK-39447: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/37087 > Only non-broadcast query stage can propagate empty relation > --- > > Key: SPARK-39447 > URL: https://issues.apache.org/jira/browse/SPARK-39447 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39680) Remove `hasSpaceForAnotherRecord` from `UnsafeExternalSorter`
[ https://issues.apache.org/jira/browse/SPARK-39680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562513#comment-17562513 ] Apache Spark commented on SPARK-39680: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/37086 > Remove `hasSpaceForAnotherRecord` from `UnsafeExternalSorter` > - > > Key: SPARK-39680 > URL: https://issues.apache.org/jira/browse/SPARK-39680 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > The `hasSpaceForAnotherRecord()` method is marked `@VisibleForTesting` > and is no longer used in the master branch. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39680) Remove `hasSpaceForAnotherRecord` from `UnsafeExternalSorter`
[ https://issues.apache.org/jira/browse/SPARK-39680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39680: Assignee: (was: Apache Spark) > Remove `hasSpaceForAnotherRecord` from `UnsafeExternalSorter` > - > > Key: SPARK-39680 > URL: https://issues.apache.org/jira/browse/SPARK-39680 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > The `hasSpaceForAnotherRecord()` method is marked `@VisibleForTesting` > and is no longer used in the master branch. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39680) Remove `hasSpaceForAnotherRecord` from `UnsafeExternalSorter`
[ https://issues.apache.org/jira/browse/SPARK-39680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562512#comment-17562512 ] Apache Spark commented on SPARK-39680: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/37086 > Remove `hasSpaceForAnotherRecord` from `UnsafeExternalSorter` > - > > Key: SPARK-39680 > URL: https://issues.apache.org/jira/browse/SPARK-39680 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > The `hasSpaceForAnotherRecord()` method is marked `@VisibleForTesting` > and is no longer used in the master branch. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39680) Remove `hasSpaceForAnotherRecord` from `UnsafeExternalSorter`
[ https://issues.apache.org/jira/browse/SPARK-39680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39680: Assignee: Apache Spark > Remove `hasSpaceForAnotherRecord` from `UnsafeExternalSorter` > - > > Key: SPARK-39680 > URL: https://issues.apache.org/jira/browse/SPARK-39680 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor > > The `hasSpaceForAnotherRecord()` method is marked `@VisibleForTesting` > and is no longer used in the master branch. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39612) The dataframe returned by exceptAll() can no longer perform operations such as count() or isEmpty(), or an exception will be thrown.
[ https://issues.apache.org/jira/browse/SPARK-39612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39612: Assignee: (was: Apache Spark) > The dataframe returned by exceptAll() can no longer perform operations such > as count() or isEmpty(), or an exception will be thrown. > > > Key: SPARK-39612 > URL: https://issues.apache.org/jira/browse/SPARK-39612 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 > Environment: OS: centos stream 8 > {code:java} > $ uname -a > Linux thomaszhu1.fyre.ibm.com 4.18.0-348.7.1.el8_5.x86_64 #1 SMP Wed Dec 22 > 13:25:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux > $ python --version > Python 3.8.13 > $ pyspark --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 3.3.0 > /_/ > > Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.11 > Branch HEAD > Compiled by user ubuntu on 2022-06-09T19:58:58Z > Revision f74867bddfbcdd4d08076db36851e88b15e66556 > Url https://github.com/apache/spark > Type --help for more information. > $ java --version > openjdk 11.0.11 2021-04-20 > OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9) > OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode) > {code} > >Reporter: Zhu JunYong >Priority: Critical > > As I said, the dataframe returned by `exceptAll()` can no longer perform > operations such as `count()` or `isEmpty()`, or an exception will be thrown. 
> > > {code:java} > >>> d1 = spark.createDataFrame([("a")], 'STRING') > >>> d1.show() > +-+ > |value| > +-+ > | a| > +-+ > >>> d2 = d1.exceptAll(d1) > >>> d2.show() > +-+ > |value| > +-+ > +-+ > >>> d2.count() > 22/06/27 11:22:15 ERROR Executor: Exception in task 0.0 in stage 113.0 (TID > 525) > java.lang.IllegalStateException: Couldn't find value#465 in [sum#494L] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at scala.collection.immutable.List.map(List.scala:297) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:528) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73) > at > org.apache.spark.sql.execution.GenerateExec.boundGenerator$lzycompute(GenerateExec.scala:75) > at > org.apache.spark.sql.execution.GenerateExec.boundGenerator(GenerateExec.scala:75) > at > org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$10(GenerateExec.scala:114) > at > org.apache.spark.sql.execution.LazyIterator.results$lzycompute(GenerateExec.scala:36) > at > org.apache.spark.sql.execution.LazyIterator.results(GenerateExec.scala:36) > at > 
org.apache.spark.sql.execution.LazyIterator.hasNext(GenerateExec.scala:37) > at scala.collection.Iterator$ConcatIterator.advance(Iterator.scala:199) > at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:227) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.hashAgg_doAggregateWithoutKey_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140) >
[jira] [Commented] (SPARK-39612) The dataframe returned by exceptAll() can no longer perform operations such as count() or isEmpty(), or an exception will be thrown.
[ https://issues.apache.org/jira/browse/SPARK-39612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562511#comment-17562511 ] Apache Spark commented on SPARK-39612: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/37084 > The dataframe returned by exceptAll() can no longer perform operations such > as count() or isEmpty(), or an exception will be thrown. > > > Key: SPARK-39612 > URL: https://issues.apache.org/jira/browse/SPARK-39612 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 > Environment: OS: centos stream 8 > {code:java} > $ uname -a > Linux thomaszhu1.fyre.ibm.com 4.18.0-348.7.1.el8_5.x86_64 #1 SMP Wed Dec 22 > 13:25:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux > $ python --version > Python 3.8.13 > $ pyspark --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 3.3.0 > /_/ > > Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.11 > Branch HEAD > Compiled by user ubuntu on 2022-06-09T19:58:58Z > Revision f74867bddfbcdd4d08076db36851e88b15e66556 > Url https://github.com/apache/spark > Type --help for more information. > $ java --version > openjdk 11.0.11 2021-04-20 > OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9) > OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode) > {code} > >Reporter: Zhu JunYong >Priority: Critical > > As I said, the dataframe returned by `exceptAll()` can no longer perform > operations such as `count()` or `isEmpty()`, or an exception will be thrown. 
> > > {code:java} > >>> d1 = spark.createDataFrame([("a")], 'STRING') > >>> d1.show() > +-+ > |value| > +-+ > | a| > +-+ > >>> d2 = d1.exceptAll(d1) > >>> d2.show() > +-+ > |value| > +-+ > +-+ > >>> d2.count() > 22/06/27 11:22:15 ERROR Executor: Exception in task 0.0 in stage 113.0 (TID > 525) > java.lang.IllegalStateException: Couldn't find value#465 in [sum#494L] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at scala.collection.immutable.List.map(List.scala:297) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:528) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73) > at > org.apache.spark.sql.execution.GenerateExec.boundGenerator$lzycompute(GenerateExec.scala:75) > at > org.apache.spark.sql.execution.GenerateExec.boundGenerator(GenerateExec.scala:75) > at > org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$10(GenerateExec.scala:114) > at > org.apache.spark.sql.execution.LazyIterator.results$lzycompute(GenerateExec.scala:36) > at > org.apache.spark.sql.execution.LazyIterator.results(GenerateExec.scala:36) > at > 
org.apache.spark.sql.execution.LazyIterator.hasNext(GenerateExec.scala:37) > at scala.collection.Iterator$ConcatIterator.advance(Iterator.scala:199) > at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:227) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.hashAgg_doAggregateWithoutKey_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) >
[jira] [Assigned] (SPARK-39612) The dataframe returned by exceptAll() can no longer perform operations such as count() or isEmpty(), or an exception will be thrown.
[ https://issues.apache.org/jira/browse/SPARK-39612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39612: Assignee: Apache Spark > The dataframe returned by exceptAll() can no longer perform operations such > as count() or isEmpty(), or an exception will be thrown. > > > Key: SPARK-39612 > URL: https://issues.apache.org/jira/browse/SPARK-39612 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 > Environment: OS: centos stream 8 > {code:java} > $ uname -a > Linux thomaszhu1.fyre.ibm.com 4.18.0-348.7.1.el8_5.x86_64 #1 SMP Wed Dec 22 > 13:25:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux > $ python --version > Python 3.8.13 > $ pyspark --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 3.3.0 > /_/ > > Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.11 > Branch HEAD > Compiled by user ubuntu on 2022-06-09T19:58:58Z > Revision f74867bddfbcdd4d08076db36851e88b15e66556 > Url https://github.com/apache/spark > Type --help for more information. > $ java --version > openjdk 11.0.11 2021-04-20 > OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9) > OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode) > {code} > >Reporter: Zhu JunYong >Assignee: Apache Spark >Priority: Critical > > As I said, the dataframe returned by `exceptAll()` can no longer perform > operations such as `count()` or `isEmpty()`, or an exception will be thrown. 
> > > {code:java} > >>> d1 = spark.createDataFrame([("a")], 'STRING') > >>> d1.show() > +-+ > |value| > +-+ > | a| > +-+ > >>> d2 = d1.exceptAll(d1) > >>> d2.show() > +-+ > |value| > +-+ > +-+ > >>> d2.count() > 22/06/27 11:22:15 ERROR Executor: Exception in task 0.0 in stage 113.0 (TID > 525) > java.lang.IllegalStateException: Couldn't find value#465 in [sum#494L] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at scala.collection.immutable.List.map(List.scala:297) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:528) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73) > at > org.apache.spark.sql.execution.GenerateExec.boundGenerator$lzycompute(GenerateExec.scala:75) > at > org.apache.spark.sql.execution.GenerateExec.boundGenerator(GenerateExec.scala:75) > at > org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$10(GenerateExec.scala:114) > at > org.apache.spark.sql.execution.LazyIterator.results$lzycompute(GenerateExec.scala:36) > at > org.apache.spark.sql.execution.LazyIterator.results(GenerateExec.scala:36) > at > 
org.apache.spark.sql.execution.LazyIterator.hasNext(GenerateExec.scala:37) > at scala.collection.Iterator$ConcatIterator.advance(Iterator.scala:199) > at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:227) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.hashAgg_doAggregateWithoutKey_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShu
[jira] [Assigned] (SPARK-39679) TakeOrderedAndProjectExec should respect child output ordering
[ https://issues.apache.org/jira/browse/SPARK-39679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39679: Assignee: Apache Spark > TakeOrderedAndProjectExec should respect child output ordering > -- > > Key: SPARK-39679 > URL: https://issues.apache.org/jira/browse/SPARK-39679 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Assignee: Apache Spark >Priority: Major > > TakeOrderedAndProjectExec should respect child output ordering to avoid > an unnecessary sort. For example: TakeOrderedAndProjectExec on top of > SortMergeJoin. > {code:java} > SELECT * FROM t1 JOIN t2 ON t1.c1 = t2.c2 ORDER BY t1.c1 LIMIT 100; > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39679) TakeOrderedAndProjectExec should respect child output ordering
[ https://issues.apache.org/jira/browse/SPARK-39679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39679: Assignee: (was: Apache Spark) > TakeOrderedAndProjectExec should respect child output ordering > -- > > Key: SPARK-39679 > URL: https://issues.apache.org/jira/browse/SPARK-39679 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major > > TakeOrderedAndProjectExec should respect child output ordering to avoid > an unnecessary sort. For example: TakeOrderedAndProjectExec on top of > SortMergeJoin. > {code:java} > SELECT * FROM t1 JOIN t2 ON t1.c1 = t2.c2 ORDER BY t1.c1 LIMIT 100; > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39679) TakeOrderedAndProjectExec should respect child output ordering
[ https://issues.apache.org/jira/browse/SPARK-39679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562510#comment-17562510 ] Apache Spark commented on SPARK-39679: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/37085 > TakeOrderedAndProjectExec should respect child output ordering > -- > > Key: SPARK-39679 > URL: https://issues.apache.org/jira/browse/SPARK-39679 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major > > TakeOrderedAndProjectExec should respect child output ordering to avoid > an unnecessary sort. For example: TakeOrderedAndProjectExec on top of > SortMergeJoin. > {code:java} > SELECT * FROM t1 JOIN t2 ON t1.c1 = t2.c2 ORDER BY t1.c1 LIMIT 100; > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38531) "Prune unrequired child index" branch of ColumnPruning has wrong condition
[ https://issues.apache.org/jira/browse/SPARK-38531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38531: Assignee: (was: Apache Spark) > "Prune unrequired child index" branch of ColumnPruning has wrong condition > -- > > Key: SPARK-38531 > URL: https://issues.apache.org/jira/browse/SPARK-38531 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1, 3.4.0, 3.3.1 >Reporter: Min Yang >Priority: Minor > > The "prune unrequired references" branch has the condition: > {code:java} > case p @ Project(_, g: Generate) if p.references != g.outputSet => {code} > This is wrong, as generators like Inline will always enter this branch as long > as the project does not use all of the generator output. > > Example: > > input: <col1: array<struct<a: struct<a: int>, b: int>>> > > Project(a.a as x) > - Generate(Inline(col1), ..., a, b) > > p.references is [a] > g.outputSet is [a, b] > > This bug means we never enter the GeneratorNestedColumnAliasing branch below, > and thus miss some optimization opportunities. The condition should be > {code:java} > g.requiredChildOutput.exists(!p.references.contains(_)) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38531) "Prune unrequired child index" branch of ColumnPruning has wrong condition
[ https://issues.apache.org/jira/browse/SPARK-38531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38531: Assignee: Apache Spark > "Prune unrequired child index" branch of ColumnPruning has wrong condition > -- > > Key: SPARK-38531 > URL: https://issues.apache.org/jira/browse/SPARK-38531 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1, 3.4.0, 3.3.1 >Reporter: Min Yang >Assignee: Apache Spark >Priority: Minor > > The "prune unrequired references" branch has the condition: > {code:java} > case p @ Project(_, g: Generate) if p.references != g.outputSet => {code} > This is wrong, as generators like Inline will always enter this branch as long > as the project does not use all of the generator output. > > Example: > > input: <col1: array<struct<a: struct<a: int>, b: int>>> > > Project(a.a as x) > - Generate(Inline(col1), ..., a, b) > > p.references is [a] > g.outputSet is [a, b] > > This bug means we never enter the GeneratorNestedColumnAliasing branch below, > and thus miss some optimization opportunities. The condition should be > {code:java} > g.requiredChildOutput.exists(!p.references.contains(_)) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39680) Remove `hasSpaceForAnotherRecord` from `UnsafeExternalSorter`
Yang Jie created SPARK-39680: Summary: Remove `hasSpaceForAnotherRecord` from `UnsafeExternalSorter` Key: SPARK-39680 URL: https://issues.apache.org/jira/browse/SPARK-39680 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.4.0 Reporter: Yang Jie The `hasSpaceForAnotherRecord()` method is marked `@VisibleForTesting` and is no longer used in the master branch. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39679) TakeOrderedAndProjectExec should respect child output ordering
XiDuo You created SPARK-39679: - Summary: TakeOrderedAndProjectExec should respect child output ordering Key: SPARK-39679 URL: https://issues.apache.org/jira/browse/SPARK-39679 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: XiDuo You TakeOrderedAndProjectExec should respect child output ordering to avoid an unnecessary sort. For example: TakeOrderedAndProjectExec on top of SortMergeJoin. {code:java} SELECT * FROM t1 JOIN t2 ON t1.c1 = t2.c2 ORDER BY t1.c1 LIMIT 100; {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
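The optimization in SPARK-39679 can be illustrated in plain Python: taking the first k rows of a stream that is already in the required order is equivalent to a full sort followed by LIMIT k, but skips the sort entirely. This is a sketch of the idea, not Spark code; the function name and the `child_is_ordered` flag are hypothetical stand-ins for the planner's ordering check.

```python
import heapq
from itertools import islice

def top_k(rows, k, key, child_is_ordered):
    """Return the k smallest rows by `key` from an iterator of rows."""
    if child_is_ordered:
        # Child already emits rows in the required order (e.g. after a
        # sort-merge join): just take the prefix, no sort needed.
        return list(islice(rows, k))
    # Otherwise a partial sort is required, analogous to the local/global
    # top-k that TakeOrderedAndProjectExec performs.
    return heapq.nsmallest(k, rows, key=key)

rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]  # already sorted by c1
assert top_k(iter(rows), 2, key=lambda r: r[0], child_is_ordered=True) == \
       top_k(iter(rows), 2, key=lambda r: r[0], child_is_ordered=False)
```

Both paths produce the same result; the ordered path is O(k) while the unordered path pays O(n log k) for the partial sort.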
[jira] [Commented] (SPARK-38531) "Prune unrequired child index" branch of ColumnPruning has wrong condition
[ https://issues.apache.org/jira/browse/SPARK-38531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562496#comment-17562496 ] Hyukjin Kwon commented on SPARK-38531: -- Ah, maybe I should have filed a new JIRA > "Prune unrequired child index" branch of ColumnPruning has wrong condition > -- > > Key: SPARK-38531 > URL: https://issues.apache.org/jira/browse/SPARK-38531 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1, 3.4.0, 3.3.1 >Reporter: Min Yang >Priority: Minor > > The "prune unrequired references" branch has the condition: > {code:java} > case p @ Project(_, g: Generate) if p.references != g.outputSet => {code} > This is wrong, as generators like Inline will always enter this branch as long > as the project does not use all of the generator output. > > Example: > > input: <col1: array<struct<a: struct<a: int>, b: int>>> > > Project(a.a as x) > - Generate(Inline(col1), ..., a, b) > > p.references is [a] > g.outputSet is [a, b] > > This bug means we never enter the GeneratorNestedColumnAliasing branch below, > and thus miss some optimization opportunities. The condition should be > {code:java} > g.requiredChildOutput.exists(!p.references.contains(_)) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38531) "Prune unrequired child index" branch of ColumnPruning has wrong condition
[ https://issues.apache.org/jira/browse/SPARK-38531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38531: - Fix Version/s: (was: 3.3.0) > "Prune unrequired child index" branch of ColumnPruning has wrong condition > -- > > Key: SPARK-38531 > URL: https://issues.apache.org/jira/browse/SPARK-38531 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1 >Reporter: Min Yang >Priority: Minor > > The "prune unrequired references" branch has the condition: > {code:java} > case p @ Project(_, g: Generate) if p.references != g.outputSet => {code} > This is wrong, as generators like Inline will always enter this branch as long > as the project does not use all of the generator output. > > Example: > > input: <col1: array<struct<a: struct<a: int>, b: int>>> > > Project(a.a as x) > - Generate(Inline(col1), ..., a, b) > > p.references is [a] > g.outputSet is [a, b] > > This bug means we never enter the GeneratorNestedColumnAliasing branch below, > and thus miss some optimization opportunities. The condition should be > {code:java} > g.requiredChildOutput.exists(!p.references.contains(_)) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
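The SPARK-38531 condition bug can be modeled with plain Python sets: a generator like Inline adds its own output columns, so the project's references almost never equal the generator's full output set, and the buggy guard fires even when every required child column is still referenced. The intended guard checks whether some required child output is not referenced. This is an illustration of the set logic only, not Spark's Catalyst code; the attribute names are hypothetical.

```python
def wrong_guard(p_references, g_output_set):
    # Mirrors `p.references != g.outputSet`: fires whenever the project
    # uses anything less than the generator's entire output.
    return p_references != g_output_set

def intended_guard(p_references, g_required_child_output):
    # Mirrors `g.requiredChildOutput.exists(!p.references.contains(_))`:
    # fires only if some required child column is genuinely unreferenced.
    return any(col not in p_references for col in g_required_child_output)

# Generator outputs {a, b}; the project references only {a}.
p_refs = {"a"}
assert wrong_guard(p_refs, {"a", "b"})      # fires spuriously
assert not intended_guard(p_refs, {"a"})    # every required column is used
```

The difference is exactly what the ticket describes: the first guard misfires on partial use of generator output, preventing the GeneratorNestedColumnAliasing branch from ever running.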