[jira] [Created] (SPARK-39690) Reuse exchange across subqueries is broken with AQE if subquery side exchange materialized first

2022-07-05 Thread Kapil Singh (Jira)
Kapil Singh created SPARK-39690:
---

 Summary: Reuse exchange across subqueries is broken with AQE if 
subquery side exchange materialized first
 Key: SPARK-39690
 URL: https://issues.apache.org/jira/browse/SPARK-39690
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Kapil Singh


When trying to reuse the Exchange of a subquery in the main plan, if the Exchange 
inside the subquery materializes first, the main AdaptiveSparkPlanExec (ASPE) node 
won't have that stage info (in 
[stageToReplace|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala#L243])
 to replace in the current logical plan. This causes AQE to produce a new 
candidate physical plan without reusing the exchange present inside the subquery, 
and depending on how complex the inner plan is (number of exchanges), AQE could 
choose a plan without ReusedExchange.

We have seen this with multiple queries in our private build. This can also happen 
with DPP.
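
A hedged sketch of a query shape where this can surface (the `sales` table and 
column names are made up): both the outer aggregation and the one inside the 
scalar subquery build the same shuffle exchange, so with exchange reuse enabled 
the second is normally planned as a ReusedExchange; which exchange materializes 
first under AQE then decides whether that reuse survives.
{code:scala}
// Assumed setup: an active SparkSession `spark` and a `sales` table.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.exchange.reuse", "true")

spark.sql("""
  SELECT s.item_id, s.total
  FROM (SELECT item_id, SUM(amount) AS total FROM sales GROUP BY item_id) s
  WHERE s.total > (SELECT AVG(t.total)
                   FROM (SELECT item_id, SUM(amount) AS total
                         FROM sales GROUP BY item_id) t)
""").collect()
{code}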






[jira] [Created] (SPARK-39689) Support 2-chars lineSep in CSV datasource

2022-07-05 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-39689:
---

 Summary: Support 2-chars lineSep in CSV datasource
 Key: SPARK-39689
 URL: https://issues.apache.org/jira/browse/SPARK-39689
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Yaohua Zhao


The Univocity parser allows the line separator to be 1 or 2 characters 
([code|https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/common/Format.java#L103]);
 CSV options should not block this usage 
([code|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala#L218]).

Due to the limitation around `normalizedNewLine` 
(https://github.com/uniVocity/univocity-parsers/issues/170), setting 2 characters 
as a line separator could cause some weird/bad behaviors. Thus, we should probably 
leave this proposed fix as an undocumented feature and warn users who rely on it.

A more proper fix could be investigated in the future.
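
For illustration, a minimal sketch of how a 2-character separator would be passed 
once the length check is relaxed (the path is made up; `lineSep` is the existing 
CSV option that is currently restricted to a single character):
{code:scala}
// Hedged sketch: assumes the one-character restriction in CSVOptions is lifted.
val df = spark.read
  .option("header", "true")
  .option("lineSep", "\r\n") // 2-character line separator, rejected today
  .csv("/path/to/data.csv")
{code}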






[jira] [Comment Edited] (SPARK-31749) Allow to set owner reference for the driver pod (cluster mode)

2022-07-05 Thread magaoyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562935#comment-17562935
 ] 

magaoyun edited comment on SPARK-31749 at 7/6/22 3:23 AM:
--

I haven't found this setting on 
[https://spark.apache.org/docs/3.1.1/running-on-kubernetes.html#configuration].
{code:java}
spark.kubernetes.driver.ownerReferences {code}
Does Spark v3.1.1 support it?


was (Author: JIRAUSER292332):
I have found this setting on 
https://spark.apache.org/docs/3.1.1/running-on-kubernetes.html#configuration.
{code:java}
spark.kubernetes.driver.ownerReferences {code}
Does Spark v3.1.1 support it?

> Allow to set owner reference for the driver pod (cluster mode)
> --
>
> Key: SPARK-31749
> URL: https://issues.apache.org/jira/browse/SPARK-31749
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.5
>Reporter: Tamas Jambor
>Priority: Major
>  Labels: bulk-closed
>
> Currently there is no way to pass ownerReferences to the driver pod in 
> cluster mode. This makes it difficult for the upstream process to clean up 
> pods after they have completed. 
>  
> Something like this would be useful:
> spark.kubernetes.driver.ownerReferences.[Name]
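
For illustration, a hedged sketch of what the requested configuration could look 
like; the `spark.kubernetes.driver.ownerReferences.*` keys below are hypothetical 
(this is the feature being requested, not an existing API), and the fields mirror 
a Kubernetes ownerReference:
{code:scala}
import org.apache.spark.SparkConf

// Hypothetical configuration keys -- not an existing Spark API.
// The values mirror the fields of a Kubernetes ownerReference so the driver
// pod would be garbage-collected together with the owning resource.
val conf = new SparkConf()
  .set("spark.kubernetes.driver.ownerReferences.upstream.apiVersion", "batch/v1")
  .set("spark.kubernetes.driver.ownerReferences.upstream.kind", "Job")
  .set("spark.kubernetes.driver.ownerReferences.upstream.name", "upstream-job")
  .set("spark.kubernetes.driver.ownerReferences.upstream.uid", "uid-of-the-owning-job") // placeholder value
{code}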






[jira] [Commented] (SPARK-31749) Allow to set owner reference for the driver pod (cluster mode)

2022-07-05 Thread magaoyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562935#comment-17562935
 ] 

magaoyun commented on SPARK-31749:
--

I have found this setting on 
https://spark.apache.org/docs/3.1.1/running-on-kubernetes.html#configuration.
{code:java}
spark.kubernetes.driver.ownerReferences {code}
Does Spark v3.1.1 support it?

> Allow to set owner reference for the driver pod (cluster mode)
> --
>
> Key: SPARK-31749
> URL: https://issues.apache.org/jira/browse/SPARK-31749
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.5
>Reporter: Tamas Jambor
>Priority: Major
>  Labels: bulk-closed
>
> Currently there is no way to pass ownerReferences to the driver pod in 
> cluster mode. This makes it difficult for the upstream process to clean up 
> pods after they have completed. 
>  
> Something like this would be useful:
> spark.kubernetes.driver.ownerReferences.[Name]






[jira] [Commented] (SPARK-39616) Upgrade Breeze to 2.0

2022-07-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562930#comment-17562930
 ] 

Apache Spark commented on SPARK-39616:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37097

> Upgrade Breeze to 2.0
> -
>
> Key: SPARK-39616
> URL: https://issues.apache.org/jira/browse/SPARK-39616
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, ML
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Commented] (SPARK-39616) Upgrade Breeze to 2.0

2022-07-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562929#comment-17562929
 ] 

Apache Spark commented on SPARK-39616:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37097

> Upgrade Breeze to 2.0
> -
>
> Key: SPARK-39616
> URL: https://issues.apache.org/jira/browse/SPARK-39616
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, ML
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Updated] (SPARK-39522) Add Apache Spark infra GA image cache

2022-07-05 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-39522:

Summary: Add Apache Spark infra GA image cache  (was: Uses Docker image 
cache over a custom image)

> Add Apache Spark infra GA image cache
> -
>
> Key: SPARK-39522
> URL: https://issues.apache.org/jira/browse/SPARK-39522
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>
> We should probably replace the base image 
> (https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L302,
>  https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage) with a plain 
> Ubuntu image w/ a Docker image cache. See also 
> https://github.com/docker/build-push-action/blob/master/docs/advanced/cache.md






[jira] [Resolved] (SPARK-39522) Uses Docker image cache over a custom image

2022-07-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-39522.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37003
[https://github.com/apache/spark/pull/37003]

> Uses Docker image cache over a custom image
> ---
>
> Key: SPARK-39522
> URL: https://issues.apache.org/jira/browse/SPARK-39522
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>
> We should probably replace the base image 
> (https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L302,
>  https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage) with a plain 
> Ubuntu image w/ a Docker image cache. See also 
> https://github.com/docker/build-push-action/blob/master/docs/advanced/cache.md






[jira] [Assigned] (SPARK-39522) Uses Docker image cache over a custom image

2022-07-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-39522:


Assignee: Yikun Jiang

> Uses Docker image cache over a custom image
> ---
>
> Key: SPARK-39522
> URL: https://issues.apache.org/jira/browse/SPARK-39522
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Yikun Jiang
>Priority: Major
>
> We should probably replace the base image 
> (https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L302,
>  https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage) with a plain 
> Ubuntu image w/ a Docker image cache. See also 
> https://github.com/docker/build-push-action/blob/master/docs/advanced/cache.md






[jira] [Commented] (SPARK-39681) PySpark build broken in branch-3.2 with Scala 2.13

2022-07-05 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562890#comment-17562890
 ] 

Hyukjin Kwon commented on SPARK-39681:
--

Thanks [~yikunkero]

> PySpark build broken in branch-3.2 with Scala 2.13
> --
>
> Key: SPARK-39681
> URL: https://issues.apache.org/jira/browse/SPARK-39681
> Project: Spark
>  Issue Type: Test
>  Components: Build, PySpark, Tests
>Affects Versions: 3.2.1
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://github.com/apache/spark/runs/7189972241?check_suite_focus=true
> {code}
> 
> Running PySpark tests
> 
> Traceback (most recent call last):
>   File "./python/run-tests.py", line 65, in 
> raise RuntimeError("Cannot find assembly build directory, please build 
> Spark first.")
> RuntimeError: Cannot find assembly build directory, please build Spark first.
> {code}






[jira] [Assigned] (SPARK-39687) Make sure new catalog methods listed in API reference

2022-07-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-39687:


Assignee: Ruifeng Zheng

> Make sure new catalog methods listed in API reference
> -
>
> Key: SPARK-39687
> URL: https://issues.apache.org/jira/browse/SPARK-39687
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
>







[jira] [Resolved] (SPARK-39687) Make sure new catalog methods listed in API reference

2022-07-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-39687.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37092
[https://github.com/apache/spark/pull/37092]

> Make sure new catalog methods listed in API reference
> -
>
> Key: SPARK-39687
> URL: https://issues.apache.org/jira/browse/SPARK-39687
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
> Fix For: 3.4.0
>
>







[jira] [Resolved] (SPARK-39686) Disable scheduled builds that do not pass even once

2022-07-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-39686.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37091
[https://github.com/apache/spark/pull/37091]

> Disable scheduled builds that do not pass even once
> ---
>
> Key: SPARK-39686
> URL: https://issues.apache.org/jira/browse/SPARK-39686
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Project Infra
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
> We added some more builds that have never been tested so far. We should 
> probably disable them so the build status can be checked easily for now.
> See also SPARK-39681, SPARK-39685, SPARK-39682 and SPARK-39684






[jira] [Assigned] (SPARK-39686) Disable scheduled builds that do not pass even once

2022-07-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-39686:


Assignee: Hyukjin Kwon

> Disable scheduled builds that do not pass even once
> ---
>
> Key: SPARK-39686
> URL: https://issues.apache.org/jira/browse/SPARK-39686
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Project Infra
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> We added some more builds that have never been tested so far. We should 
> probably disable them so the build status can be checked easily for now.
> See also SPARK-39681, SPARK-39685, SPARK-39682 and SPARK-39684






[jira] [Assigned] (SPARK-39584) Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results

2022-07-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39584:


Assignee: (was: Apache Spark)

> Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results
> 
>
> Key: SPARK-39584
> URL: https://issues.apache.org/jira/browse/SPARK-39584
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.3, 3.1.2, 3.2.1, 3.3.0, 3.4.0
>Reporter: Kazuyuki Tanimura
>Priority: Minor
>
> GenTPCDSData uses the schema defined in `TPCDSSchema`, which contains char(N) 
> columns. When GenTPCDSData generates Parquet, it pads strings whose lengths 
> are < N with spaces.
> When TPCDSQueryBenchmark reads the Parquet generated by GenTPCDSData, it uses 
> the schema from the Parquet files and keeps the padding. Due to the extra 
> spaces, the string filters of the TPC-DS queries fail to match. For example, 
> the q13 query results are all nulls and the query returns too fast because the 
> string filter does not match any rows.
> Therefore, TPCDSQueryBenchmark is benchmarking wrong query results, and that 
> inflates some performance results.
> I am exploring two possible solutions now:
> 1. Call `CREATE TABLE tableName schema USING parquet LOCATION path` before 
> reading. This is what the Spark TPC-DS unit tests do.
> 2. Change char to string in the schema. This is what the [databricks data 
> generator|https://github.com/databricks/spark-sql-perf] does.
> TPCDSQueryBenchmark was ported from databricks/spark-sql-perf in 
> https://issues.apache.org/jira/browse/SPARK-35192.
> History of the related char issue: 
> [https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn]
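
A hedged sketch of option 1 (the path and the truncated column list are 
illustrative, not the actual `TPCDSSchema` definition): register a table that 
declares the char(N) types on top of the generated Parquet files, so the 
benchmark queries resolve against the declared schema rather than the 
space-padded string columns taken from the Parquet footer.
{code:scala}
// Hedged sketch, assuming the generated data lives under /data/tpcds.
spark.sql("""
  CREATE TABLE customer_demographics (
    cd_demo_sk          INT,
    cd_marital_status   CHAR(1),
    cd_education_status CHAR(20)
    -- remaining columns elided for brevity
  )
  USING parquet
  LOCATION '/data/tpcds/customer_demographics'
""")
{code}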






[jira] [Commented] (SPARK-39584) Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results

2022-07-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562841#comment-17562841
 ] 

Apache Spark commented on SPARK-39584:
--

User 'kazuyukitanimura' has created a pull request for this issue:
https://github.com/apache/spark/pull/37096

> Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results
> 
>
> Key: SPARK-39584
> URL: https://issues.apache.org/jira/browse/SPARK-39584
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.3, 3.1.2, 3.2.1, 3.3.0, 3.4.0
>Reporter: Kazuyuki Tanimura
>Priority: Minor
>
> GenTPCDSData uses the schema defined in `TPCDSSchema`, which contains char(N) 
> columns. When GenTPCDSData generates Parquet, it pads strings whose lengths 
> are < N with spaces.
> When TPCDSQueryBenchmark reads the Parquet generated by GenTPCDSData, it uses 
> the schema from the Parquet files and keeps the padding. Due to the extra 
> spaces, the string filters of the TPC-DS queries fail to match. For example, 
> the q13 query results are all nulls and the query returns too fast because the 
> string filter does not match any rows.
> Therefore, TPCDSQueryBenchmark is benchmarking wrong query results, and that 
> inflates some performance results.
> I am exploring two possible solutions now:
> 1. Call `CREATE TABLE tableName schema USING parquet LOCATION path` before 
> reading. This is what the Spark TPC-DS unit tests do.
> 2. Change char to string in the schema. This is what the [databricks data 
> generator|https://github.com/databricks/spark-sql-perf] does.
> TPCDSQueryBenchmark was ported from databricks/spark-sql-perf in 
> https://issues.apache.org/jira/browse/SPARK-35192.
> History of the related char issue: 
> [https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn]






[jira] [Commented] (SPARK-39584) Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results

2022-07-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562840#comment-17562840
 ] 

Apache Spark commented on SPARK-39584:
--

User 'kazuyukitanimura' has created a pull request for this issue:
https://github.com/apache/spark/pull/37096

> Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results
> 
>
> Key: SPARK-39584
> URL: https://issues.apache.org/jira/browse/SPARK-39584
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.3, 3.1.2, 3.2.1, 3.3.0, 3.4.0
>Reporter: Kazuyuki Tanimura
>Priority: Minor
>
> GenTPCDSData uses the schema defined in `TPCDSSchema`, which contains char(N) 
> columns. When GenTPCDSData generates Parquet, it pads strings whose lengths 
> are < N with spaces.
> When TPCDSQueryBenchmark reads the Parquet generated by GenTPCDSData, it uses 
> the schema from the Parquet files and keeps the padding. Due to the extra 
> spaces, the string filters of the TPC-DS queries fail to match. For example, 
> the q13 query results are all nulls and the query returns too fast because the 
> string filter does not match any rows.
> Therefore, TPCDSQueryBenchmark is benchmarking wrong query results, and that 
> inflates some performance results.
> I am exploring two possible solutions now:
> 1. Call `CREATE TABLE tableName schema USING parquet LOCATION path` before 
> reading. This is what the Spark TPC-DS unit tests do.
> 2. Change char to string in the schema. This is what the [databricks data 
> generator|https://github.com/databricks/spark-sql-perf] does.
> TPCDSQueryBenchmark was ported from databricks/spark-sql-perf in 
> https://issues.apache.org/jira/browse/SPARK-35192.
> History of the related char issue: 
> [https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn]






[jira] [Assigned] (SPARK-39584) Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results

2022-07-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39584:


Assignee: Apache Spark

> Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results
> 
>
> Key: SPARK-39584
> URL: https://issues.apache.org/jira/browse/SPARK-39584
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.3, 3.1.2, 3.2.1, 3.3.0, 3.4.0
>Reporter: Kazuyuki Tanimura
>Assignee: Apache Spark
>Priority: Minor
>
> GenTPCDSData uses the schema defined in `TPCDSSchema`, which contains char(N) 
> columns. When GenTPCDSData generates Parquet, it pads strings whose lengths 
> are < N with spaces.
> When TPCDSQueryBenchmark reads the Parquet generated by GenTPCDSData, it uses 
> the schema from the Parquet files and keeps the padding. Due to the extra 
> spaces, the string filters of the TPC-DS queries fail to match. For example, 
> the q13 query results are all nulls and the query returns too fast because the 
> string filter does not match any rows.
> Therefore, TPCDSQueryBenchmark is benchmarking wrong query results, and that 
> inflates some performance results.
> I am exploring two possible solutions now:
> 1. Call `CREATE TABLE tableName schema USING parquet LOCATION path` before 
> reading. This is what the Spark TPC-DS unit tests do.
> 2. Change char to string in the schema. This is what the [databricks data 
> generator|https://github.com/databricks/spark-sql-perf] does.
> TPCDSQueryBenchmark was ported from databricks/spark-sql-perf in 
> https://issues.apache.org/jira/browse/SPARK-35192.
> History of the related char issue: 
> [https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn]






[jira] [Resolved] (SPARK-39688) getReusablePVCs should handle accounts with no PVC permission

2022-07-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-39688.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37095
[https://github.com/apache/spark/pull/37095]

> getReusablePVCs should handle accounts with no PVC permission
> -
>
> Key: SPARK-39688
> URL: https://issues.apache.org/jira/browse/SPARK-39688
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Assigned] (SPARK-34927) Support TPCDSQueryBenchmark in Benchmarks

2022-07-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-34927:
-

Assignee: Kazuaki Ishizaki

> Support TPCDSQueryBenchmark in Benchmarks
> -
>
> Key: SPARK-34927
> URL: https://issues.apache.org/jira/browse/SPARK-34927
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Kazuaki Ishizaki
>Priority: Minor
> Fix For: 3.4.0
>
>
> Benchmarks.scala currently does not support TPCDSQueryBenchmark. We should 
> add support for it. See also 
> https://github.com/apache/spark/pull/32015#issuecomment-89046






[jira] [Resolved] (SPARK-34927) Support TPCDSQueryBenchmark in Benchmarks

2022-07-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34927.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37020
[https://github.com/apache/spark/pull/37020]

> Support TPCDSQueryBenchmark in Benchmarks
> -
>
> Key: SPARK-34927
> URL: https://issues.apache.org/jira/browse/SPARK-34927
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.4.0
>
>
> Benchmarks.scala currently does not support TPCDSQueryBenchmark. We should 
> add support for it. See also 
> https://github.com/apache/spark/pull/32015#issuecomment-89046






[jira] [Assigned] (SPARK-34927) Support TPCDSQueryBenchmark in Benchmarks

2022-07-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-34927:
-

Assignee: Kazuyuki Tanimura  (was: Kazuaki Ishizaki)

> Support TPCDSQueryBenchmark in Benchmarks
> -
>
> Key: SPARK-34927
> URL: https://issues.apache.org/jira/browse/SPARK-34927
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Kazuyuki Tanimura
>Priority: Minor
> Fix For: 3.4.0
>
>
> Benchmarks.scala currently does not support TPCDSQueryBenchmark. We should 
> add support for it. See also 
> https://github.com/apache/spark/pull/32015#issuecomment-89046






[jira] [Assigned] (SPARK-39688) getReusablePVCs should handle accounts with no PVC permission

2022-07-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-39688:
-

Assignee: Dongjoon Hyun

> getReusablePVCs should handle accounts with no PVC permission
> -
>
> Key: SPARK-39688
> URL: https://issues.apache.org/jira/browse/SPARK-39688
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>







[jira] [Resolved] (SPARK-39616) Upgrade Breeze to 2.0

2022-07-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-39616.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37002
[https://github.com/apache/spark/pull/37002]

> Upgrade Breeze to 2.0
> -
>
> Key: SPARK-39616
> URL: https://issues.apache.org/jira/browse/SPARK-39616
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, ML
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Assigned] (SPARK-39616) Upgrade Breeze to 2.0

2022-07-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-39616:
-

Assignee: Ruifeng Zheng

> Upgrade Breeze to 2.0
> -
>
> Key: SPARK-39616
> URL: https://issues.apache.org/jira/browse/SPARK-39616
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, ML
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>







[jira] [Assigned] (SPARK-39688) getReusablePVCs should handle accounts with no PVC permission

2022-07-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39688:


Assignee: (was: Apache Spark)

> getReusablePVCs should handle accounts with no PVC permission
> -
>
> Key: SPARK-39688
> URL: https://issues.apache.org/jira/browse/SPARK-39688
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Assigned] (SPARK-39688) getReusablePVCs should handle accounts with no PVC permission

2022-07-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39688:


Assignee: Apache Spark

> getReusablePVCs should handle accounts with no PVC permission
> -
>
> Key: SPARK-39688
> URL: https://issues.apache.org/jira/browse/SPARK-39688
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-39688) getReusablePVCs should handle accounts with no PVC permission

2022-07-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562820#comment-17562820
 ] 

Apache Spark commented on SPARK-39688:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37095

> getReusablePVCs should handle accounts with no PVC permission
> -
>
> Key: SPARK-39688
> URL: https://issues.apache.org/jira/browse/SPARK-39688
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Created] (SPARK-39688) getReusablePVCs should handle accounts with no PVC permission

2022-07-05 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-39688:
-

 Summary: getReusablePVCs should handle accounts with no PVC 
permission
 Key: SPARK-39688
 URL: https://issues.apache.org/jira/browse/SPARK-39688
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.4.0
Reporter: Dongjoon Hyun









[jira] [Updated] (SPARK-39018) Add support for YARN decommissioning when ESS is Disabled

2022-07-05 Thread Abhishek Dixit (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Dixit updated SPARK-39018:
---
Fix Version/s: 3.4.0

> Add support for YARN decommissioning when ESS is Disabled
> -
>
> Key: SPARK-39018
> URL: https://issues.apache.org/jira/browse/SPARK-39018
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, YARN
>Affects Versions: 3.2.1
>Reporter: Abhishek Dixit
>Priority: Major
> Fix For: 3.4.0
>
>
> Subtask to handle Yarn Executor Decommissioning when Shuffle Service is 
> Disabled.
> This relates to 
> [SPARK-30835|https://issues.apache.org/jira/browse/SPARK-30835]






[jira] [Resolved] (SPARK-39018) Add support for YARN decommissioning when ESS is Disabled

2022-07-05 Thread Abhishek Dixit (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Dixit resolved SPARK-39018.

Resolution: Fixed

> Add support for YARN decommissioning when ESS is Disabled
> -
>
> Key: SPARK-39018
> URL: https://issues.apache.org/jira/browse/SPARK-39018
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, YARN
>Affects Versions: 3.2.1
>Reporter: Abhishek Dixit
>Priority: Major
>
> Subtask to handle Yarn Executor Decommissioning when Shuffle Service is 
> Disabled.
> This relates to 
> [SPARK-30835|https://issues.apache.org/jira/browse/SPARK-30835]






[jira] [Comment Edited] (SPARK-35662) Support Timestamp without time zone data type

2022-07-05 Thread Bill Schneider (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562724#comment-17562724
 ] 

Bill Schneider edited comment on SPARK-35662 at 7/5/22 5:07 PM:


Is this delayed until 3.4.0?  It did not appear to work in Spark 3.3.  

However, Spark 3.4.0-SNAPSHOT appears to do exactly what I wanted it to:

`cast(string, DataTypes.TimestampNTZType)`

when written to Parquet, reads back as exactly the same timestamp from a 
Spark session in a different timezone.
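
A minimal sketch of that round trip against a Spark 3.4.0-SNAPSHOT session (the 
path and column names are made up):
{code:scala}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DataTypes
import spark.implicits._

val df = Seq("2022-07-05 17:07:00").toDF("ts_string")
  .withColumn("ts_ntz", col("ts_string").cast(DataTypes.TimestampNTZType))

df.write.mode("overwrite").parquet("/tmp/ntz_roundtrip")

// Read back from a session with a different spark.sql.session.timeZone:
// the wall-clock value should be unchanged because TIMESTAMP_NTZ carries no zone.
spark.read.parquet("/tmp/ntz_roundtrip").show(false)
{code}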


was (Author: wrschneider99):
Is this delayed until 3.4.0?  It did not appear to work in Spark 3.3

> Support Timestamp without time zone data type
> -
>
> Key: SPARK-35662
> URL: https://issues.apache.org/jira/browse/SPARK-35662
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> Spark SQL today supports the TIMESTAMP data type. However the semantics 
> provided actually match TIMESTAMP WITH LOCAL TIMEZONE as defined by Oracle. 
> Timestamps embedded in a SQL query or passed through JDBC are presumed to be 
> in session local timezone and cast to UTC before being processed.
>  These are desirable semantics in many cases, such as when dealing with 
> calendars.
>  In many (more) other cases, such as when dealing with log files it is 
> desirable that the provided timestamps not be altered.
>  SQL users expect that they can model either behavior and do so by using 
> TIMESTAMP WITHOUT TIME ZONE for time zone insensitive data and TIMESTAMP WITH 
> LOCAL TIME ZONE for time zone sensitive data.
>  Most traditional RDBMS map TIMESTAMP to TIMESTAMP WITHOUT TIME ZONE and will 
> be surprised to see TIMESTAMP WITH LOCAL TIME ZONE, a feature that does not 
> exist in the standard.
> In this new feature, we will introduce TIMESTAMP WITH LOCAL TIMEZONE to 
> describe the existing timestamp type and add TIMESTAMP WITHOUT TIME ZONE for 
> standard semantic.
>  Using these two types will provide clarity.
>  We will also allow users to set the default behavior for TIMESTAMP to either 
> use TIMESTAMP WITH LOCAL TIME ZONE or TIMESTAMP WITHOUT TIME ZONE.
> h3. Milestone 1 – Spark Timestamp equivalency ( The new Timestamp type 
> TimestampWithoutTZ meets or exceeds all function of the existing SQL 
> Timestamp):
>  * Add a new DataType implementation for TimestampWithoutTZ.
>  * Support TimestampWithoutTZ in Dataset/UDF.
>  * TimestampWithoutTZ literals
>  * TimestampWithoutTZ arithmetic(e.g. TimestampWithoutTZ - 
> TimestampWithoutTZ, TimestampWithoutTZ - Date)
>  * Datetime functions/operators: dayofweek, weekofyear, year, etc
>  * Cast to and from TimestampWithoutTZ, cast String/Timestamp to 
> TimestampWithoutTZ, cast TimestampWithoutTZ to string (pretty 
> printing)/Timestamp, with the SQL syntax to specify the types
>  * Support sorting TimestampWithoutTZ.
> h3. Milestone 2 – Persistence:
>  * Ability to create tables of type TimestampWithoutTZ
>  * Ability to write to common file formats such as Parquet and JSON.
>  * INSERT, SELECT, UPDATE, MERGE
>  * Discovery
> h3. Milestone 3 – Client support
>  * JDBC support
>  * Hive Thrift server
> h3. Milestone 4 – PySpark and Spark R integration
>  * Python UDF can take and return TimestampWithoutTZ
>  * DataFrame support






[jira] [Comment Edited] (SPARK-39681) PySpark build broken in branch-3.2 with Scala 2.13

2022-07-05 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562735#comment-17562735
 ] 

Yikun Jiang edited comment on SPARK-39681 at 7/5/22 4:06 PM:
-

[~hyukjin.kwon] 

I think this should be backported first. 
https://issues.apache.org/jira/browse/SPARK-37059

git cherry-pick 81aa51469424a6618b3a3e680c6341c831be2fb3

 

Then, I am not sure whether we should pin to the old image; I also mentioned it here: 
https://github.com/apache/spark/pull/37091#issuecomment-1174989349


was (Author: yikunkero):
[~hyukjin.kwon] 

I think this should be backported first. 
https://issues.apache.org/jira/browse/SPARK-37059

git cherry-pick 81aa51469424a6618b3a3e680c6341c831be2fb3

> PySpark build broken in branch-3.2 with Scala 2.13
> --
>
> Key: SPARK-39681
> URL: https://issues.apache.org/jira/browse/SPARK-39681
> Project: Spark
>  Issue Type: Test
>  Components: Build, PySpark, Tests
>Affects Versions: 3.2.1
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://github.com/apache/spark/runs/7189972241?check_suite_focus=true
> {code}
> 
> Running PySpark tests
> 
> Traceback (most recent call last):
>   File "./python/run-tests.py", line 65, in 
> raise RuntimeError("Cannot find assembly build directory, please build 
> Spark first.")
> RuntimeError: Cannot find assembly build directory, please build Spark first.
> {code}






[jira] [Comment Edited] (SPARK-39681) PySpark build broken in branch-3.2 with Scala 2.13

2022-07-05 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562735#comment-17562735
 ] 

Yikun Jiang edited comment on SPARK-39681 at 7/5/22 4:04 PM:
-

[~hyukjin.kwon] 

I think this should be backported first. 
https://issues.apache.org/jira/browse/SPARK-37059

git cherry-pick 81aa51469424a6618b3a3e680c6341c831be2fb3


was (Author: yikunkero):
I think this should be backported first. 
https://issues.apache.org/jira/browse/SPARK-37059

git cherry-pick 81aa51469424a6618b3a3e680c6341c831be2fb3

> PySpark build broken in branch-3.2 with Scala 2.13
> --
>
> Key: SPARK-39681
> URL: https://issues.apache.org/jira/browse/SPARK-39681
> Project: Spark
>  Issue Type: Test
>  Components: Build, PySpark, Tests
>Affects Versions: 3.2.1
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://github.com/apache/spark/runs/7189972241?check_suite_focus=true
> {code}
> 
> Running PySpark tests
> 
> Traceback (most recent call last):
>   File "./python/run-tests.py", line 65, in 
> raise RuntimeError("Cannot find assembly build directory, please build 
> Spark first.")
> RuntimeError: Cannot find assembly build directory, please build Spark first.
> {code}






[jira] [Comment Edited] (SPARK-39681) PySpark build broken in branch-3.2 with Scala 2.13

2022-07-05 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562735#comment-17562735
 ] 

Yikun Jiang edited comment on SPARK-39681 at 7/5/22 4:04 PM:
-

I think this should be backported first. 
https://issues.apache.org/jira/browse/SPARK-37059

git cherry-pick 81aa51469424a6618b3a3e680c6341c831be2fb3


was (Author: yikunkero):
I think this should be backport first. 
https://issues.apache.org/jira/browse/SPARK-37059

git cherry-pick 81aa51469424a6618b3a3e680c6341c831be2fb3

> PySpark build broken in branch-3.2 with Scala 2.13
> --
>
> Key: SPARK-39681
> URL: https://issues.apache.org/jira/browse/SPARK-39681
> Project: Spark
>  Issue Type: Test
>  Components: Build, PySpark, Tests
>Affects Versions: 3.2.1
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://github.com/apache/spark/runs/7189972241?check_suite_focus=true
> {code}
> 
> Running PySpark tests
> 
> Traceback (most recent call last):
>   File "./python/run-tests.py", line 65, in 
> raise RuntimeError("Cannot find assembly build directory, please build 
> Spark first.")
> RuntimeError: Cannot find assembly build directory, please build Spark first.
> {code}






[jira] [Commented] (SPARK-39681) PySpark build broken in branch-3.2 with Scala 2.13

2022-07-05 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562735#comment-17562735
 ] 

Yikun Jiang commented on SPARK-39681:
-

I think this should be backport first. 
https://issues.apache.org/jira/browse/SPARK-37059

git cherry-pick 81aa51469424a6618b3a3e680c6341c831be2fb3

> PySpark build broken in branch-3.2 with Scala 2.13
> --
>
> Key: SPARK-39681
> URL: https://issues.apache.org/jira/browse/SPARK-39681
> Project: Spark
>  Issue Type: Test
>  Components: Build, PySpark, Tests
>Affects Versions: 3.2.1
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://github.com/apache/spark/runs/7189972241?check_suite_focus=true
> {code}
> 
> Running PySpark tests
> 
> Traceback (most recent call last):
>   File "./python/run-tests.py", line 65, in 
> raise RuntimeError("Cannot find assembly build directory, please build 
> Spark first.")
> RuntimeError: Cannot find assembly build directory, please build Spark first.
> {code}






[jira] [Commented] (SPARK-35662) Support Timestamp without time zone data type

2022-07-05 Thread Bill Schneider (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562724#comment-17562724
 ] 

Bill Schneider commented on SPARK-35662:


Is this delayed until 3.4.0?  It did not appear to work in Spark 3.3

> Support Timestamp without time zone data type
> -
>
> Key: SPARK-35662
> URL: https://issues.apache.org/jira/browse/SPARK-35662
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> Spark SQL today supports the TIMESTAMP data type. However the semantics 
> provided actually match TIMESTAMP WITH LOCAL TIMEZONE as defined by Oracle. 
> Timestamps embedded in a SQL query or passed through JDBC are presumed to be 
> in session local timezone and cast to UTC before being processed.
>  These are desirable semantics in many cases, such as when dealing with 
> calendars.
>  In many (more) other cases, such as when dealing with log files it is 
> desirable that the provided timestamps not be altered.
>  SQL users expect that they can model either behavior and do so by using 
> TIMESTAMP WITHOUT TIME ZONE for time zone insensitive data and TIMESTAMP WITH 
> LOCAL TIME ZONE for time zone sensitive data.
>  Most traditional RDBMS map TIMESTAMP to TIMESTAMP WITHOUT TIME ZONE and will 
> be surprised to see TIMESTAMP WITH LOCAL TIME ZONE, a feature that does not 
> exist in the standard.
> In this new feature, we will introduce TIMESTAMP WITH LOCAL TIMEZONE to 
> describe the existing timestamp type and add TIMESTAMP WITHOUT TIME ZONE for 
> standard semantic.
>  Using these two types will provide clarity.
>  We will also allow users to set the default behavior for TIMESTAMP to either 
> use TIMESTAMP WITH LOCAL TIME ZONE or TIMESTAMP WITHOUT TIME ZONE.
> h3. Milestone 1 – Spark Timestamp equivalency ( The new Timestamp type 
> TimestampWithoutTZ meets or exceeds all function of the existing SQL 
> Timestamp):
>  * Add a new DataType implementation for TimestampWithoutTZ.
>  * Support TimestampWithoutTZ in Dataset/UDF.
>  * TimestampWithoutTZ literals
>  * TimestampWithoutTZ arithmetic(e.g. TimestampWithoutTZ - 
> TimestampWithoutTZ, TimestampWithoutTZ - Date)
>  * Datetime functions/operators: dayofweek, weekofyear, year, etc
>  * Cast to and from TimestampWithoutTZ, cast String/Timestamp to 
> TimestampWithoutTZ, cast TimestampWithoutTZ to string (pretty 
> printing)/Timestamp, with the SQL syntax to specify the types
>  * Support sorting TimestampWithoutTZ.
> h3. Milestone 2 – Persistence:
>  * Ability to create tables of type TimestampWithoutTZ
>  * Ability to write to common file formats such as Parquet and JSON.
>  * INSERT, SELECT, UPDATE, MERGE
>  * Discovery
> h3. Milestone 3 – Client support
>  * JDBC support
>  * Hive Thrift server
> h3. Milestone 4 – PySpark and Spark R integration
>  * Python UDF can take and return TimestampWithoutTZ
>  * DataFrame support






[jira] [Updated] (SPARK-39677) Wrong args item formatting of the regexp functions

2022-07-05 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-39677:
-
Fix Version/s: 3.2.2

> Wrong args item formatting of the regexp functions
> --
>
> Key: SPARK-39677
> URL: https://issues.apache.org/jira/browse/SPARK-39677
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.1.4, 3.2.2, 3.4.0, 3.3.1
>
> Attachments: Screenshot 2022-07-05 at 09.48.28.png
>
>
> See the attached screenshot.






[jira] [Updated] (SPARK-39677) Wrong args item formatting of the regexp functions

2022-07-05 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-39677:
-
Fix Version/s: 3.1.4

> Wrong args item formatting of the regexp functions
> --
>
> Key: SPARK-39677
> URL: https://issues.apache.org/jira/browse/SPARK-39677
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.1.4, 3.4.0, 3.3.1
>
> Attachments: Screenshot 2022-07-05 at 09.48.28.png
>
>
> See the attached screenshot.






[jira] [Commented] (SPARK-39623) partitionng by datestamp leads to wrong query on backend?

2022-07-05 Thread Dmitry (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562695#comment-17562695
 ] 

Dmitry commented on SPARK-39623:


It is quoted in my issue, here it is again:

I see this query on DB backend:
{code:sql}
SELECT 1 FROM billinginfo  WHERE "datestamp" < '2022-01-02 11:59:59.9375' or 
"datestamp" is null
{code}

> partitionng by datestamp leads to wrong query on backend?
> -
>
> Key: SPARK-39623
> URL: https://issues.apache.org/jira/browse/SPARK-39623
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Dmitry
>Priority: Major
>
> Hello,
> I am new to Apache Spark, so please bear with me. I would like to report what 
> seems to me a bug, but maybe I am just not understanding something.
> My goal is to run data analysis on a Spark cluster. The data is stored in a 
> PostgreSQL DB. Tables contain timestamped entries (timestamp with time 
> zone).
> The code looks like:
>  {code:python}
> from pyspark.sql import SparkSession
> spark = SparkSession \
> .builder \
> .appName("foo") \
> .config("spark.jars", "/opt/postgresql-42.4.0.jar") \
> .getOrCreate()
> df = spark.read \
>  .format("jdbc") \
>  .option("url", "jdbc:postgresql://example.org:5432/postgres") \
>  .option("dbtable", "billing") \
>  .option("user", "user") \
>  .option("driver", "org.postgresql.Driver") \
>  .option("numPartitions", "4") \
>  .option("partitionColumn", "datestamp") \
>  .option("lowerBound", "2022-01-01 00:00:00") \
>  .option("upperBound", "2022-06-26 23:59:59") \
>  .option("fetchsize", 100) \
>  .load()
> t0 = time.time()
> print("Number of entries is => ", df.count(), " Time to execute ", 
> time.time()-t0)
> ...
> {code}
> datestamp is timestamp with time zone. 
> I see this query on DB backend:
> {code:java}
> SELECT 1 FROM billinginfo  WHERE "datestamp" < '2022-01-02 11:59:59.9375' or 
> "datestamp" is null
> {code}
> The table is huge and entries go way back before 2022-01-02 11:59:59. So what 
> ends up happening is that all workers but one complete, and the one remaining 
> continues to process that query, which, to me, looks like it wants to get all 
> the data before 2022-01-02 11:59:59. Which is not what I intended. 
> I remedied this by changing to:
> {code:python}
>  .option("dbtable", "(select * from billinginfo where datestamp > '2022-01-01 00:00:00') as foo") \
> {code}
> And that seems to have solved the issue. But this seems kludgy. Am I doing 
> something wrong or is there a bug in the way partitioning queries are 
> generated?
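
For reference, a hedged sketch of how the JDBC source turns `numPartitions`, 
`lowerBound`, and `upperBound` into per-partition predicates (an illustration of 
the splitting logic, not the actual Spark code): the bounds only set the stride, 
they are not filters, so the first and last partitions are left open-ended and 
rows outside the range all land in one of them.
{code:scala}
import java.sql.Timestamp

// Illustrative stride computation under the settings from the report above.
val lower = Timestamp.valueOf("2022-01-01 00:00:00").getTime
val upper = Timestamp.valueOf("2022-06-26 23:59:59").getTime
val numPartitions = 4
val stride = (upper - lower) / numPartitions
val bounds = (1 until numPartitions).map(i => new Timestamp(lower + i * stride))

// Roughly the WHERE clauses generated per partition:
//   partition 0:       "datestamp" < bounds(0) OR "datestamp" IS NULL
//   partitions 1..N-2: "datestamp" >= bounds(i-1) AND "datestamp" < bounds(i)
//   partition N-1:     "datestamp" >= bounds(N-2)
println(bounds.mkString(", "))
{code}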






[jira] [Commented] (SPARK-39609) PySpark need to support pypy3.8 to avoid "No module named '_pickle"

2022-07-05 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562682#comment-17562682
 ] 

Yikun Jiang commented on SPARK-39609:
-

[https://github.com/cloudpipe/cloudpickle/pull/461]

 

cloudpickle doesn't support pypy3.8 yet.

> PySpark need to support pypy3.8 to avoid "No module named '_pickle"
> ---
>
> Key: SPARK-39609
> URL: https://issues.apache.org/jira/browse/SPARK-39609
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> {code:java}
> Starting test(pypy3): pyspark.sql.tests.test_arrow (temp output: 
> /tmp/pypy3__pyspark.sql.tests.test_arrow__jx96qdzs.log)
> Traceback (most recent call last):
>   File "/usr/lib/pypy3.8/runpy.py", line 188, in _run_module_as_main
>     mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
>   File "/usr/lib/pypy3.8/runpy.py", line 111, in _get_module_details
>     __import__(pkg_name)
>   File "/__w/spark/spark/python/pyspark/__init__.py", line 59, in <module>
>     from pyspark.rdd import RDD, RDDBarrier
>   File "/__w/spark/spark/python/pyspark/rdd.py", line 54, in <module>
>     from pyspark.java_gateway import local_connect_and_auth
>   File "/__w/spark/spark/python/pyspark/java_gateway.py", line 32, in <module>
>     from pyspark.serializers import read_int, write_with_length, UTF8Deserializer
>   File "/__w/spark/spark/python/pyspark/serializers.py", line 68, in <module>
>     from pyspark import cloudpickle
>   File "/__w/spark/spark/python/pyspark/cloudpickle/__init__.py", line 4, in <module>
>     from pyspark.cloudpickle.cloudpickle import *  # noqa
>   File "/__w/spark/spark/python/pyspark/cloudpickle/cloudpickle.py", line 57, in <module>
>     from .compat import pickle
>   File "/__w/spark/spark/python/pyspark/cloudpickle/compat.py", line 13, in <module>
>     from _pickle import Pickler  # noqa: F401
> ModuleNotFoundError: No module named '_pickle'
> Had test failures in pyspark.sql.tests.test_arrow with pypy3; see logs. {code}
> Building the latest dockerfile upgrades pypy3 to 3.8 (originally 3.7), but it seems 
> cloudpickle has a bug.
> This may be related: 
> https://github.com/cloudpipe/cloudpickle/commit/8bbea3e140767f51dd935a3c8f21c9a8e8702b7c,
>  but I tried to apply it and that also failed. This needs a deeper look; if you know 
> the reason, please let me know.






[jira] [Commented] (SPARK-39677) Wrong args item formatting of the regexp functions

2022-07-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562641#comment-17562641
 ] 

Apache Spark commented on SPARK-39677:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/37094

> Wrong args item formatting of the regexp functions
> --
>
> Key: SPARK-39677
> URL: https://issues.apache.org/jira/browse/SPARK-39677
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: Screenshot 2022-07-05 at 09.48.28.png
>
>
> See the attached screenshot.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39677) Wrong args item formatting of the regexp functions

2022-07-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562642#comment-17562642
 ] 

Apache Spark commented on SPARK-39677:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/37093

> Wrong args item formatting of the regexp functions
> --
>
> Key: SPARK-39677
> URL: https://issues.apache.org/jira/browse/SPARK-39677
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: Screenshot 2022-07-05 at 09.48.28.png
>
>
> See the attached screenshot.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39687) Make sure new catalog methods listed in API reference

2022-07-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39687:


Assignee: (was: Apache Spark)

> Make sure new catalog methods listed in API reference
> -
>
> Key: SPARK-39687
> URL: https://issues.apache.org/jira/browse/SPARK-39687
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39687) Make sure new catalog methods listed in API reference

2022-07-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562639#comment-17562639
 ] 

Apache Spark commented on SPARK-39687:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37092

> Make sure new catalog methods listed in API reference
> -
>
> Key: SPARK-39687
> URL: https://issues.apache.org/jira/browse/SPARK-39687
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39687) Make sure new catalog methods listed in API reference

2022-07-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39687:


Assignee: Apache Spark

> Make sure new catalog methods listed in API reference
> -
>
> Key: SPARK-39687
> URL: https://issues.apache.org/jira/browse/SPARK-39687
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39687) Make sure new catalog methods listed in API reference

2022-07-05 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-39687:
-

 Summary: Make sure new catalog methods listed in API reference
 Key: SPARK-39687
 URL: https://issues.apache.org/jira/browse/SPARK-39687
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39539) Overflow on converting valid Milliseconds to Microseconds

2022-07-05 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-39539.
--
Resolution: Not A Problem

> Overflow on converting valid Milliseconds to Microseconds
> -
>
> Key: SPARK-39539
> URL: https://issues.apache.org/jira/browse/SPARK-39539
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: senmiao
>Priority: Major
>
>  
> {code:java}
> scala> import org.apache.spark.sql.catalyst.util.DateTimeUtils._
> import org.apache.spark.sql.catalyst.util.DateTimeUtils._
> scala> millisToMicros(Long.MinValue) 
> java.lang.ArithmeticException: long overflow
>    at java.lang.Math.multiplyExact(Math.java:892)   
> at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.millisToMicros(DateTimeUtils.scala:213)
>    ... 49 elided
> scala> millisToMicros(Long.MaxValue)
> java.lang.ArithmeticException: long overflow
>   at java.lang.Math.multiplyExact(Math.java:892)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.millisToMicros(DateTimeUtils.scala:213)
>   ... 49 elided
> {code}
>  
>  
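For context, the overflow is expected 64-bit arithmetic rather than a bug: `millisToMicros` multiplies by 1,000 via `Math.multiplyExact`, so any input whose magnitude exceeds roughly `Long.MaxValue / 1000` milliseconds cannot be represented as microseconds in a long. A small Python sketch of that bound (illustrative only; the constant names are assumptions, not Spark code):

{code:python}
# Illustrative only: why Long.MinValue / Long.MaxValue overflow when
# converted from milliseconds to microseconds in 64-bit arithmetic.
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1
MICROS_PER_MILLIS = 1_000

def millis_to_micros(millis: int) -> int:
    """Mimics Math.multiplyExact semantics for a signed 64-bit long."""
    result = millis * MICROS_PER_MILLIS
    if not LONG_MIN <= result <= LONG_MAX:
        raise ArithmeticError("long overflow")
    return result

print(millis_to_micros(LONG_MAX // MICROS_PER_MILLIS))  # largest valid input
try:
    millis_to_micros(LONG_MAX)
except ArithmeticError as e:
    print(e)  # long overflow, matching the Scala behaviour above
{code}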



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39610) Add safe.directory for container based job

2022-07-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-39610:


Assignee: Yikun Jiang

> Add safe.directory for container based job
> --
>
> Key: SPARK-39610
> URL: https://issues.apache.org/jira/browse/SPARK-39610
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
>
> {code:java}
> ```
> fatal: unsafe repository ('/__w/spark/spark' is owned by someone else)
> To add an exception for this directory, call:
>     git config --global --add safe.directory /__w/spark/spark
> fatal: unsafe repository ('/__w/spark/spark' is owned by someone else)
> To add an exception for this directory, call:
>     git config --global --add safe.directory /__w/spark/spark
> Error: Process completed with exit code 128.
> ``` {code}
> https://github.blog/2022-04-12-git-security-vulnerability-announced/
> [https://github.com/actions/checkout/issues/760]
> ```yaml
>     - name: Github Actions permissions workaround
>       run: |
>         git config --global --add safe.directory ${GITHUB_WORKSPACE}
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39610) Add safe.directory for container based job

2022-07-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-39610.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37079
[https://github.com/apache/spark/pull/37079]

> Add safe.directory for container based job
> --
>
> Key: SPARK-39610
> URL: https://issues.apache.org/jira/browse/SPARK-39610
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> ```
> fatal: unsafe repository ('/__w/spark/spark' is owned by someone else)
> To add an exception for this directory, call:
>     git config --global --add safe.directory /__w/spark/spark
> fatal: unsafe repository ('/__w/spark/spark' is owned by someone else)
> To add an exception for this directory, call:
>     git config --global --add safe.directory /__w/spark/spark
> Error: Process completed with exit code 128.
> ``` {code}
> https://github.blog/2022-04-12-git-security-vulnerability-announced/
> [https://github.com/actions/checkout/issues/760]
> ```yaml
>     - name: Github Actions permissions workaround
>       run: |
>         git config --global --add safe.directory ${GITHUB_WORKSPACE}
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39611) PySpark support numpy 1.23.X

2022-07-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-39611.
--
Fix Version/s: 3.3.1
   3.2.2
   3.4.0
   Resolution: Fixed

Issue resolved by pull request 37078
[https://github.com/apache/spark/pull/37078]

> PySpark support numpy 1.23.X
> 
>
> Key: SPARK-39611
> URL: https://issues.apache.org/jira/browse/SPARK-39611
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.3.1, 3.2.2, 3.4.0
>
>
> {code:java}
> ==
> ERROR [2.102s]: test_arithmetic_op_exceptions 
> (pyspark.pandas.tests.test_series_datetime.SeriesDateTimeTest) 
> --
> Traceback (most recent call last):
> File "/__w/spark/spark/python/pyspark/pandas/tests/test_series_datetime.py", 
> line 99, in test_arithmetic_op_exceptions self.assertRaisesRegex(TypeError, 
> expected_err_msg, lambda: other / psser) File 
> "/usr/lib/python3.9/unittest/case.py", line 1276, in assertRaisesRegex return 
> context.handle('assertRaisesRegex', args, kwargs)
> File "/usr/lib/python3.9/unittest/case.py", line 201, in handle 
> callable_obj(*args, **kwargs)
> File "/__w/spark/spark/python/pyspark/pandas/tests/test_series_datetime.py", 
> line 99, in  self.assertRaisesRegex(TypeError, expected_err_msg, 
> lambda: other / psser)
> File "/__w/spark/spark/python/pyspark/pandas/base.py", line 465, in 
> __array_ufunc__ 
> raise NotImplementedError(NotImplementedError: pandas-on-Spark objects 
> currently do not support .
> --
> {code}
>  
>  
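For context, the failing assertion applies a NumPy ufunc to a pandas-on-Spark series; under NumPy 1.23 the dispatch reaches `Series.__array_ufunc__`, which raises `NotImplementedError` instead of the `TypeError` the test expects. A rough repro sketch (assumes PySpark with a usable SparkSession; the exact exception type depends on the NumPy version):

{code:python}
# Rough repro sketch for the failure above; needs PySpark and a SparkSession.
import numpy as np
import pandas as pd
import pyspark.pandas as ps

psser = ps.from_pandas(pd.Series(pd.date_range("2022-01-01", periods=3)))
other = np.zeros(3)

try:
    other / psser  # dispatches through Series.__array_ufunc__
except (TypeError, NotImplementedError) as exc:
    # The test expects TypeError; NumPy 1.23 surfaces NotImplementedError here.
    print(type(exc).__name__, exc)
{code}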



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39611) PySpark support numpy 1.23.X

2022-07-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-39611:


Assignee: Yikun Jiang

> PySpark support numpy 1.23.X
> 
>
> Key: SPARK-39611
> URL: https://issues.apache.org/jira/browse/SPARK-39611
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
>
> {code:java}
> ==
> ERROR [2.102s]: test_arithmetic_op_exceptions 
> (pyspark.pandas.tests.test_series_datetime.SeriesDateTimeTest) 
> --
> Traceback (most recent call last):
> File "/__w/spark/spark/python/pyspark/pandas/tests/test_series_datetime.py", 
> line 99, in test_arithmetic_op_exceptions self.assertRaisesRegex(TypeError, 
> expected_err_msg, lambda: other / psser) File 
> "/usr/lib/python3.9/unittest/case.py", line 1276, in assertRaisesRegex return 
> context.handle('assertRaisesRegex', args, kwargs)
> File "/usr/lib/python3.9/unittest/case.py", line 201, in handle 
> callable_obj(*args, **kwargs)
> File "/__w/spark/spark/python/pyspark/pandas/tests/test_series_datetime.py", 
> line 99, in  self.assertRaisesRegex(TypeError, expected_err_msg, 
> lambda: other / psser)
> File "/__w/spark/spark/python/pyspark/pandas/base.py", line 465, in 
> __array_ufunc__ 
> raise NotImplementedError(NotImplementedError: pandas-on-Spark objects 
> currently do not support .
> --
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39612) The dataframe returned by exceptAll() can no longer perform operations such as count() or isEmpty(), or an exception will be thrown.

2022-07-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-39612.
--
Fix Version/s: 3.3.1
   3.4.0
   Resolution: Fixed

Issue resolved by pull request 37084
[https://github.com/apache/spark/pull/37084]

> The dataframe returned by exceptAll() can no longer perform operations such 
> as count() or isEmpty(), or an exception will be thrown.
> 
>
> Key: SPARK-39612
> URL: https://issues.apache.org/jira/browse/SPARK-39612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
> Environment: OS: centos stream 8
> {code:java}
> $ uname -a
> Linux thomaszhu1.fyre.ibm.com 4.18.0-348.7.1.el8_5.x86_64 #1 SMP Wed Dec 22 
> 13:25:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> $ python --version
> Python 3.8.13 
> $ pyspark --version
> Welcome to
>                     __
>      / __/__  ___ _/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 3.3.0
>       /_/
>                         
> Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.11
> Branch HEAD
> Compiled by user ubuntu on 2022-06-09T19:58:58Z
> Revision f74867bddfbcdd4d08076db36851e88b15e66556
> Url https://github.com/apache/spark
> Type --help for more information.
> $ java --version
> openjdk 11.0.11 2021-04-20
> OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9)
> OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode) 
> {code}
>  
>Reporter: Zhu JunYong
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 3.3.1, 3.4.0
>
>
> As stated in the title, the dataframe returned by `exceptAll()` can no longer 
> be used with operations such as `count()` or `isEmpty()`; an exception is thrown.
>  
>  
> {code:java}
> >>> d1 = spark.createDataFrame([("a")], 'STRING')
> >>> d1.show()
> +-+
> |value|
> +-+
> |    a|
> +-+
> >>> d2 = d1.exceptAll(d1)
> >>> d2.show()
> +-+
> |value|
> +-+
> +-+
> >>> d2.count()
> 22/06/27 11:22:15 ERROR Executor: Exception in task 0.0 in stage 113.0 (TID 
> 525)
> java.lang.IllegalStateException: Couldn't find value#465 in [sum#494L]
>     at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
>     at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
>     at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589)
>     at scala.collection.immutable.List.map(List.scala:297)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:528)
>     at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73)
>     at 
> org.apache.spark.sql.execution.GenerateExec.boundGenerator$lzycompute(GenerateExec.scala:75)
>     at 
> org.apache.spark.sql.execution.GenerateExec.boundGenerator(GenerateExec.scala:75)
>     at 
> org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$10(GenerateExec.scala:114)
>     at 
> org.apache.spark.sql.execution.LazyIterator.results$lzycompute(GenerateExec.scala:36)
>     at 
> org.apache.spark.sql.execution.LazyIterator.results(GenerateExec.scala:36)
>     at 
> org.apache.spark.sql.execution.LazyIterator.hasNext(GenerateExec.scala:37)
>     at scala.collection.Iterator$ConcatIterator.advance(Iterator.scala:199)
>     at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:227)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.hashAgg_doAggregateWithoutKey_0$(Unknown
>  Source)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown
>  Source)
>     at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>  

[jira] [Commented] (SPARK-39686) Disable scheduled builds that do not pass even once

2022-07-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562622#comment-17562622
 ] 

Apache Spark commented on SPARK-39686:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/37091

> Disable scheduled builds that do not pass even once
> ---
>
> Key: SPARK-39686
> URL: https://issues.apache.org/jira/browse/SPARK-39686
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Project Infra
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We added some more builds that have never been tested so far. We should 
> probably disable them for now so that the build status is easy to check.
> See also SPARK-39681, SPARK-39685, SPARK-39682 and SPARK-39684



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39612) The dataframe returned by exceptAll() can no longer perform operations such as count() or isEmpty(), or an exception will be thrown.

2022-07-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-39612:


Assignee: Hyukjin Kwon

> The dataframe returned by exceptAll() can no longer perform operations such 
> as count() or isEmpty(), or an exception will be thrown.
> 
>
> Key: SPARK-39612
> URL: https://issues.apache.org/jira/browse/SPARK-39612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
> Environment: OS: centos stream 8
> {code:java}
> $ uname -a
> Linux thomaszhu1.fyre.ibm.com 4.18.0-348.7.1.el8_5.x86_64 #1 SMP Wed Dec 22 
> 13:25:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> $ python --version
> Python 3.8.13 
> $ pyspark --version
> Welcome to
>                     __
>      / __/__  ___ _/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 3.3.0
>       /_/
>                         
> Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.11
> Branch HEAD
> Compiled by user ubuntu on 2022-06-09T19:58:58Z
> Revision f74867bddfbcdd4d08076db36851e88b15e66556
> Url https://github.com/apache/spark
> Type --help for more information.
> $ java --version
> openjdk 11.0.11 2021-04-20
> OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9)
> OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode) 
> {code}
>  
>Reporter: Zhu JunYong
>Assignee: Hyukjin Kwon
>Priority: Critical
>
> As stated in the title, the dataframe returned by `exceptAll()` can no longer 
> be used with operations such as `count()` or `isEmpty()`; an exception is thrown.
>  
>  
> {code:java}
> >>> d1 = spark.createDataFrame([("a")], 'STRING')
> >>> d1.show()
> +-+
> |value|
> +-+
> |    a|
> +-+
> >>> d2 = d1.exceptAll(d1)
> >>> d2.show()
> +-+
> |value|
> +-+
> +-+
> >>> d2.count()
> 22/06/27 11:22:15 ERROR Executor: Exception in task 0.0 in stage 113.0 (TID 
> 525)
> java.lang.IllegalStateException: Couldn't find value#465 in [sum#494L]
>     at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
>     at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
>     at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589)
>     at scala.collection.immutable.List.map(List.scala:297)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:528)
>     at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73)
>     at 
> org.apache.spark.sql.execution.GenerateExec.boundGenerator$lzycompute(GenerateExec.scala:75)
>     at 
> org.apache.spark.sql.execution.GenerateExec.boundGenerator(GenerateExec.scala:75)
>     at 
> org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$10(GenerateExec.scala:114)
>     at 
> org.apache.spark.sql.execution.LazyIterator.results$lzycompute(GenerateExec.scala:36)
>     at 
> org.apache.spark.sql.execution.LazyIterator.results(GenerateExec.scala:36)
>     at 
> org.apache.spark.sql.execution.LazyIterator.hasNext(GenerateExec.scala:37)
>     at scala.collection.Iterator$ConcatIterator.advance(Iterator.scala:199)
>     at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:227)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.hashAgg_doAggregateWithoutKey_0$(Unknown
>  Source)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown
>  Source)
>     at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>     at 
> 

[jira] [Commented] (SPARK-39686) Disable scheduled builds that do not pass even once

2022-07-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562620#comment-17562620
 ] 

Apache Spark commented on SPARK-39686:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/37091

> Disable scheduled builds that do not pass even once
> ---
>
> Key: SPARK-39686
> URL: https://issues.apache.org/jira/browse/SPARK-39686
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Project Infra
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We added some more builds that have never been tested so far. We should 
> probably disable them for now so that the build status is easy to check.
> See also SPARK-39681, SPARK-39685, SPARK-39682 and SPARK-39684



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39606) Use child stats to estimate order operator

2022-07-05 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-39606.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36994
[https://github.com/apache/spark/pull/36994]

> Use child stats to estimate order operator
> --
>
> Key: SPARK-39606
> URL: https://issues.apache.org/jira/browse/SPARK-39606
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39686) Disable scheduled builds that do not pass even once

2022-07-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39686:


Assignee: (was: Apache Spark)

> Disable scheduled builds that do not pass even once
> ---
>
> Key: SPARK-39686
> URL: https://issues.apache.org/jira/browse/SPARK-39686
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Project Infra
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We added some more builds that have never been tested so far. We should 
> probably disable them for now so that the build status is easy to check.
> See also SPARK-39681, SPARK-39685, SPARK-39682 and SPARK-39684



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39606) Use child stats to estimate order operator

2022-07-05 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-39606:
---

Assignee: Yuming Wang

> Use child stats to estimate order operator
> --
>
> Key: SPARK-39606
> URL: https://issues.apache.org/jira/browse/SPARK-39606
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39686) Disable scheduled builds that do not pass even once

2022-07-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39686:


Assignee: Apache Spark

> Disable scheduled builds that do not pass even once
> ---
>
> Key: SPARK-39686
> URL: https://issues.apache.org/jira/browse/SPARK-39686
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Project Infra
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> We added some more builds that have never been tested so far. We should 
> probably disable them for now so that the build status is easy to check.
> See also SPARK-39681, SPARK-39685, SPARK-39682 and SPARK-39684



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39681) PySpark build broken in branch-3.2 with Scala 2.13

2022-07-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-39681:
-
Description: 
https://github.com/apache/spark/runs/7189972241?check_suite_focus=true

{code}

Running PySpark tests

Traceback (most recent call last):
  File "./python/run-tests.py", line 65, in 
raise RuntimeError("Cannot find assembly build directory, please build 
Spark first.")
RuntimeError: Cannot find assembly build directory, please build Spark first.
{code}



  was:
{code}


Running PySpark tests

Traceback (most recent call last):
  File "./python/run-tests.py", line 65, in 
raise RuntimeError("Cannot find assembly build directory, please build 
Spark first.")
RuntimeError: Cannot find assembly build directory, please build Spark first.
{code}

https://github.com/apache/spark/runs/7189972241?check_suite_focus=true


> PySpark build broken in branch-3.2 with Scala 2.13
> --
>
> Key: SPARK-39681
> URL: https://issues.apache.org/jira/browse/SPARK-39681
> Project: Spark
>  Issue Type: Test
>  Components: Build, PySpark, Tests
>Affects Versions: 3.2.1
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://github.com/apache/spark/runs/7189972241?check_suite_focus=true
> {code}
> 
> Running PySpark tests
> 
> Traceback (most recent call last):
>   File "./python/run-tests.py", line 65, in 
> raise RuntimeError("Cannot find assembly build directory, please build 
> Spark first.")
> RuntimeError: Cannot find assembly build directory, please build Spark first.
> {code}
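The message itself comes from a pre-flight guard in `./python/run-tests.py`, which refuses to run when no Spark assembly has been built for the active Scala version. A rough sketch of that kind of check (the path pattern and logic are illustrative assumptions, not the exact script):

{code:python}
# Illustrative sketch of the guard that raises the error above; the real
# ./python/run-tests.py check may differ in detail.
import glob
import os

def assembly_jars_dir(spark_home: str) -> str:
    # An SBT/Maven build places runtime jars under assembly/target/scala-*/jars.
    matches = glob.glob(
        os.path.join(spark_home, "assembly", "target", "scala-*", "jars"))
    if not matches:
        raise RuntimeError(
            "Cannot find assembly build directory, please build Spark first.")
    return matches[0]

print(assembly_jars_dir(os.environ.get("SPARK_HOME", ".")))
{code}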



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39686) Disable scheduled builds that do not pass even once

2022-07-05 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-39686:


 Summary: Disable scheduled builds that do not pass even once
 Key: SPARK-39686
 URL: https://issues.apache.org/jira/browse/SPARK-39686
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Project Infra
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon


We added some more builds that have never been tested so far. We should 
probably disable them for now so that the build status is easy to check.

See also SPARK-39681, SPARK-39685, SPARK-39682 and SPARK-39684



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39684) Docker IT build broken in master with Hadoop 2

2022-07-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-39684:
-
Affects Version/s: 3.4.0
   (was: 3.2.1)

> Docker IT build broken in master with Hadoop 2
> --
>
> Key: SPARK-39684
> URL: https://issues.apache.org/jira/browse/SPARK-39684
> Project: Spark
>  Issue Type: Test
>  Components: Build, SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://github.com/apache/spark/runs/7054055518?check_suite_focus=true
> {code}
> [info] DB2KrbIntegrationSuite:
> [info] org.apache.spark.sql.jdbc.DB2KrbIntegrationSuite *** ABORTED *** (3 
> minutes, 47 seconds)
> [info]   The code passed to eventually never returned normally. Attempted 185 
> times over 3.009751720714 minutes. Last failure message: Login failure 
> for db2/10.1.0...@example.com from keytab 
> /home/runner/work/spark/spark/target/tmp/spark-0c9cf0ca-6ce0-491c-b032-bbccf22d51ac/db2.keytab.
>  (DockerJDBCIntegrationSuite.scala:166)
> [info]   org.scalatest.exceptions.TestFailedDueToTimeoutException:
> [info]   at 
> org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:185)
> [info]   at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:192)
> [info]   at 
> org.scalatest.concurrent.Eventually.eventually(Eventually.scala:402)
> [info]   at 
> org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:401)
> [info]   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.eventually(DockerJDBCIntegrationSuite.scala:95)
> [info]   at 
> org.scalatest.concurrent.Eventually.eventually(Eventually.scala:312)
> [info]   at 
> org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:311)
> [info]   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.eventually(DockerJDBCIntegrationSuite.scala:95)
> [info]   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.$anonfun$beforeAll$1(DockerJDBCIntegrationSuite.scala:166)
> [info]   at 
> org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled(DockerIntegrationFunSuite.scala:49)
> [info]   at 
> org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled$(DockerIntegrationFunSuite.scala:47)
> [info]   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.runIfTestsEnabled(DockerJDBCIntegrationSuite.scala:95)
> [info]   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:118)
> [info]   at 
> org.apache.spark.sql.jdbc.DockerKrbJDBCIntegrationSuite.super$beforeAll(DockerKrbJDBCIntegrationSuite.scala:65)
> [info]   at 
> org.apache.spark.sql.jdbc.DockerKrbJDBCIntegrationSuite.$anonfun$beforeAll$1(DockerKrbJDBCIntegrationSuite.scala:65)
> [info]   at 
> org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled(DockerIntegrationFunSuite.scala:49)
> [info]   at 
> org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled$(DockerIntegrationFunSuite.scala:47)
> [info]   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.runIfTestsEnabled(DockerJDBCIntegrationSuite.scala:95)
> [info]   at 
> org.apache.spark.sql.jdbc.DockerKrbJDBCIntegrationSuite.beforeAll(DockerKrbJDBCIntegrationSuite.scala:44)
> [info]   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
> [info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
> [info]   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
> [info]   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:64)
> [info]   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:318)
> [info]   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:513)
> [info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413)
> [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [info]   at java.lang.Thread.run(Thread.java:750)
> [info]   Cause: java.io.IOException: Login failure for 
> db2/10.1.0...@example.com from keytab 
> /home/runner/work/spark/spark/target/tmp/spark-0c9cf0ca-6ce0-491c-b032-bbccf22d51ac/db2.keytab
> [info]   at 
> org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytabAndReturnUGI(UserGroupInformation.java:1231)
> [info]   at 
> org.apache.spark.sql.jdbc.DB2KrbIntegrationSuite.getConnection(DB2KrbIntegrationSuite.scala:93)
> [info]   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.$anonfun$beforeAll$3(DockerJDBCIntegrationSuite.scala:167)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info] 

[jira] [Created] (SPARK-39685) Linter build broken in branch-3.2 with Scala 2.13

2022-07-05 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-39685:


 Summary: Linter build broken in branch-3.2 with Scala 2.13
 Key: SPARK-39685
 URL: https://issues.apache.org/jira/browse/SPARK-39685
 Project: Spark
  Issue Type: Test
  Components: Build, PySpark, Tests
Affects Versions: 3.2.1
Reporter: Hyukjin Kwon


https://github.com/apache/spark/runs/7189971643?check_suite_focus=true

{code}
mypy checks failed:
python/pyspark/pandas/typedef/string_typehints.py:24: error: Name "np" already 
defined (by an import)
python/pyspark/pandas/typedef/string_typehints.py:30: error: Incompatible 
import of "DataFrame" (imported name has type 
"Type[pyspark.pandas.frame.DataFrame[Any]]", local name has type 
"Type[pandas.core.frame.DataFrame]")
python/pyspark/pandas/typedef/string_typehints.py:30: error: Incompatible 
import of "Series" (imported name has type 
"Type[pyspark.pandas.series.Series[Any]]", local name has type 
"Type[pandas.core.series.Series]")
python/pyspark/sql/pandas/_typing/__init__.pyi:39: error: Incompatible types in 
assignment (expression has type "Type[DataFrame]", variable has type 
"Type[DataFrameLike]")
python/pyspark/sql/pandas/_typing/__init__.pyi:40: error: Incompatible types in 
assignment (expression has type "Type[Series]", variable has type 
"Type[SeriesLike]")
python/pyspark/pandas/datetimes.py:37: error: Cannot determine type of "spark"
python/pyspark/pandas/datetimes.py:39: error: Cannot determine type of "spark"
python/pyspark/pandas/datetimes.py:52: error: Cannot determine type of "spark"
python/pyspark/pandas/datetimes.py:67: error: Cannot determine type of "spark"
python/pyspark/pandas/datetimes.py:74: error: Cannot determine type of "spark"
python/pyspark/pandas/datetimes.py:81: error: Cannot determine type of "spark"
python/pyspark/pandas/datetimes.py:88: error: Cannot determine type of "spark"
python/pyspark/pandas/datetimes.py:95: error: Cannot determine type of "spark"
python/pyspark/pandas/datetimes.py:102: error: Cannot determine type of "spark"
python/pyspark/pandas/datetimes.py:125: error: Cannot determine type of "spark"
python/pyspark/pandas/typedef/typehints.py:36: error: Module "pandas.api.types" 
has no attribute "pandas_dtype"
python/pyspark/pandas/typedef/typehints.py:526: error: Argument 3 to 
"DataFrameType" has incompatible type "List[None]"; expected 
"List[Optional[str]]"
python/pyspark/pandas/typedef/typehints.py:526: note: "List" is invariant -- 
see https://mypy.readthedocs.io/en/stable/common_issues.html#variance
python/pyspark/pandas/typedef/typehints.py:526: note: Consider using "Sequence" 
instead, which is covariant
python/pyspark/pandas/utils.py:45: error: Module "pandas.api.types" has no 
attribute "is_list_like"
python/pyspark/pandas/internal.py:1540: error: Argument "name" to "StructField" 
has incompatible type "Optional[Hashable]"; expected "str"
python/pyspark/pandas/indexing.py:27: error: Module "pandas.api.types" has no 
attribute "is_list_like"
python/pyspark/pandas/indexing.py:993: error: Cannot determine type of "spark"
python/pyspark/pandas/indexing.py:994: error: Cannot determine type of "spark"
python/pyspark/pandas/indexing.py:1014: error: Cannot determine type of "spark"
python/pyspark/pandas/indexing.py:1021: error: Cannot determine type of "spark"
python/pyspark/pandas/indexing.py:1023: error: Cannot determine type of "spark"
python/pyspark/pandas/indexing.py:1024: error: Cannot determine type of "spark"
python/pyspark/pandas/indexing.py:1055: error: Cannot determine type of "spark"
python/pyspark/pandas/indexing.py:1067: error: Cannot determine type of "spark"
python/pyspark/pandas/indexing.py:1079: error: Cannot determine type of "spark"
python/pyspark/pandas/indexing.py:1139: error: Cannot determine type of "spark"
python/pyspark/pandas/indexing.py:1142: error: Cannot determine type of "spark"
python/pyspark/pandas/indexing.py:1148: error: Cannot determine type of "spark"
python/pyspark/pandas/generic.py:42: error: Module "pandas.api.types" has no 
attribute "is_list_like"
python/pyspark/pandas/generic.py:1206: error: Cannot determine type of "spark"
python/pyspark/pandas/generic.py:1207: error: Cannot determine type of "spark"
python/pyspark/pandas/generic.py:1293: error: Cannot determine type of "spark"
python/pyspark/pandas/generic.py:1294: error: Cannot determine type of "spark"
python/pyspark/pandas/generic.py:1379: error: Cannot determine type of "spark"
python/pyspark/pandas/generic.py:1380: error: Cannot determine type of "spark"
python/pyspark/pandas/generic.py:1452: error: Cannot determine type of "spark"
python/pyspark/pandas/generic.py:1453: error: Cannot determine type of "spark"
python/pyspark/pandas/generic.py:1520: error: Cannot determine type of "spark"
python/pyspark/pandas/generic.py:1521: error: Cannot determine type of "spark"
python/pyspark/pandas/generic.py:1590: error: Cannot determine type of "spark"

[jira] [Updated] (SPARK-39681) PySpark build broken in branch-3.2 with Scala 2.13

2022-07-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-39681:
-
Component/s: Tests

> PySpark build broken in branch-3.2 with Scala 2.13
> --
>
> Key: SPARK-39681
> URL: https://issues.apache.org/jira/browse/SPARK-39681
> Project: Spark
>  Issue Type: Test
>  Components: Build, PySpark, Tests
>Affects Versions: 3.2.1
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> 
> Running PySpark tests
> 
> Traceback (most recent call last):
>   File "./python/run-tests.py", line 65, in 
> raise RuntimeError("Cannot find assembly build directory, please build 
> Spark first.")
> RuntimeError: Cannot find assembly build directory, please build Spark first.
> {code}
> https://github.com/apache/spark/runs/7189972241?check_suite_focus=true



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39682) Docker IT build broken in branch-3.2 with Scala 2.13

2022-07-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-39682:
-
Component/s: SQL

> Docker IT build broken in branch-3.2 with Scala 2.13
> 
>
> Key: SPARK-39682
> URL: https://issues.apache.org/jira/browse/SPARK-39682
> Project: Spark
>  Issue Type: Test
>  Components: Build, SQL, Tests
>Affects Versions: 3.2.1
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://github.com/apache/spark/runs/7189971505?check_suite_focus=true
> {code}
> [info] OracleIntegrationSuite:
> [info] org.apache.spark.sql.jdbc.v2.OracleIntegrationSuite *** ABORTED *** (8 
> minutes, 1 second)
> [info]   The code passed to eventually never returned normally. Attempted 426 
> times over 7.008370057216667 minutes. Last failure message: IO Error: The 
> Network Adapter could not establish the connection. 
> (DockerJDBCIntegrationSuite.scala:166)
> [info]   org.scalatest.exceptions.TestFailedDueToTimeoutException:
> [info]   at 
> org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:189)
> [info]   at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:196)
> [info]   at 
> org.scalatest.concurrent.Eventually.eventually(Eventually.scala:313)
> [info]   at 
> org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:312)
> [info]   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.eventually(DockerJDBCIntegrationSuite.scala:95)
> [info]   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.$anonfun$beforeAll$1(DockerJDBCIntegrationSuite.scala:166)
> [info]   at 
> org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled(DockerIntegrationFunSuite.scala:49)
> [info]   at 
> org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled$(DockerIntegrationFunSuite.scala:47)
> [info]   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.runIfTestsEnabled(DockerJDBCIntegrationSuite.scala:95)
> [info]   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:118)
> [info]   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
> [info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
> [info]   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
> [info]   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:62)
> [info]   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:318)
> [info]   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:513)
> [info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413)
> [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [info]   at java.lang.Thread.run(Thread.java:750)
> [info]   Cause: java.sql.SQLRecoverableException: IO Error: The Network 
> Adapter could not establish the connection
> [info]   at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:858)
> [info]   at 
> oracle.jdbc.driver.PhysicalConnection.connect(PhysicalConnection.java:793)
> [info]   at 
> oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:57)
> [info]   at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:747)
> [info]   at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:562)
> [info]   at java.sql.DriverManager.getConnection(DriverManager.java:664)
> [info]   at java.sql.DriverManager.getConnection(DriverManager.java:208)
> [info]   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.getConnection(DockerJDBCIntegrationSuite.scala:200)
> [info]   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.$anonfun$beforeAll$3(DockerJDBCIntegrationSuite.scala:167)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at 
> org.scalatest.enablers.Retrying$$anon$4.makeAValiantAttempt$1(Retrying.scala:154)
> [info]   at 
> org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:166)
> [info]   at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:196)
> [info]   at 
> org.scalatest.concurrent.Eventually.eventually(Eventually.scala:313)
> [info]   at 
> org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:312)
> [info]   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.eventually(DockerJDBCIntegrationSuite.scala:95)
> [info]   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.$anonfun$beforeAll$1(DockerJDBCIntegrationSuite.scala:166)
> [info]   at 
> 

[jira] [Created] (SPARK-39684) Docker IT build broken in master with Hadoop 2

2022-07-05 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-39684:


 Summary: Docker IT build broken in master with Hadoop 2
 Key: SPARK-39684
 URL: https://issues.apache.org/jira/browse/SPARK-39684
 Project: Spark
  Issue Type: Test
  Components: Build, SQL, Tests
Affects Versions: 3.2.1
Reporter: Hyukjin Kwon


https://github.com/apache/spark/runs/7054055518?check_suite_focus=true

{code}
[info] DB2KrbIntegrationSuite:
[info] org.apache.spark.sql.jdbc.DB2KrbIntegrationSuite *** ABORTED *** (3 
minutes, 47 seconds)
[info]   The code passed to eventually never returned normally. Attempted 185 
times over 3.009751720714 minutes. Last failure message: Login failure for 
db2/10.1.0...@example.com from keytab 
/home/runner/work/spark/spark/target/tmp/spark-0c9cf0ca-6ce0-491c-b032-bbccf22d51ac/db2.keytab.
 (DockerJDBCIntegrationSuite.scala:166)
[info]   org.scalatest.exceptions.TestFailedDueToTimeoutException:
[info]   at 
org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:185)
[info]   at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:192)
[info]   at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:402)
[info]   at 
org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:401)
[info]   at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.eventually(DockerJDBCIntegrationSuite.scala:95)
[info]   at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:312)
[info]   at 
org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:311)
[info]   at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.eventually(DockerJDBCIntegrationSuite.scala:95)
[info]   at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.$anonfun$beforeAll$1(DockerJDBCIntegrationSuite.scala:166)
[info]   at 
org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled(DockerIntegrationFunSuite.scala:49)
[info]   at 
org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled$(DockerIntegrationFunSuite.scala:47)
[info]   at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.runIfTestsEnabled(DockerJDBCIntegrationSuite.scala:95)
[info]   at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:118)
[info]   at 
org.apache.spark.sql.jdbc.DockerKrbJDBCIntegrationSuite.super$beforeAll(DockerKrbJDBCIntegrationSuite.scala:65)
[info]   at 
org.apache.spark.sql.jdbc.DockerKrbJDBCIntegrationSuite.$anonfun$beforeAll$1(DockerKrbJDBCIntegrationSuite.scala:65)
[info]   at 
org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled(DockerIntegrationFunSuite.scala:49)
[info]   at 
org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled$(DockerIntegrationFunSuite.scala:47)
[info]   at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.runIfTestsEnabled(DockerJDBCIntegrationSuite.scala:95)
[info]   at 
org.apache.spark.sql.jdbc.DockerKrbJDBCIntegrationSuite.beforeAll(DockerKrbJDBCIntegrationSuite.scala:44)
[info]   at 
org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
[info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info]   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info]   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:64)
[info]   at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:318)
[info]   at 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:513)
[info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413)
[info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info]   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info]   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[info]   at java.lang.Thread.run(Thread.java:750)
[info]   Cause: java.io.IOException: Login failure for 
db2/10.1.0...@example.com from keytab 
/home/runner/work/spark/spark/target/tmp/spark-0c9cf0ca-6ce0-491c-b032-bbccf22d51ac/db2.keytab
[info]   at 
org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytabAndReturnUGI(UserGroupInformation.java:1231)
[info]   at 
org.apache.spark.sql.jdbc.DB2KrbIntegrationSuite.getConnection(DB2KrbIntegrationSuite.scala:93)
[info]   at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.$anonfun$beforeAll$3(DockerJDBCIntegrationSuite.scala:167)
[info]   at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at 
org.scalatest.enablers.Retrying$$anon$4.makeAValiantAttempt$1(Retrying.scala:150)
[info]   at 
org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:162)
[info]   at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:192)
[info]   at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:402)
[info]   at 

[jira] [Updated] (SPARK-39683) Did not find value which can be converted into java.lang.String

2022-07-05 Thread zehra (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zehra updated SPARK-39683:
--
Description: 
Hi, I have a problem with loading the model in pyspark. I described it in detail 
at this link. Could you help me with it? Thanks.

 

[https://stackoverflow.com/questions/72868619/apache-spark-loading-als-model-did-not-find-value-which-can-be-converted-into]

  was:
Hi, I have a problem with loading the model in pyspark. I wrote it in a detail 
at this link. Could you help me with it? Thanks.

[Error Message|http://example.com/]

https://stackoverflow.com/questions/72868619/apache-spark-loading-als-model-did-not-find-value-which-can-be-converted-into


> Did not find value which can be converted into java.lang.String
> ---
>
> Key: SPARK-39683
> URL: https://issues.apache.org/jira/browse/SPARK-39683
> Project: Spark
>  Issue Type: Question
>  Components: ML, PySpark
>Affects Versions: 3.2.0
>Reporter: zehra
>Priority: Major
>
> Hi, I have a problem with loading the model in pyspark. I described it in 
> detail at this link. Could you help me with it? Thanks.
>  
> [https://stackoverflow.com/questions/72868619/apache-spark-loading-als-model-did-not-find-value-which-can-be-converted-into]
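It is hard to diagnose from the summary alone; for comparison, a minimal ALS save/load round-trip looks like the sketch below (toy data and a placeholder path; assumes a working PySpark installation, not the reporter's actual pipeline):

{code:python}
# Minimal ALS save/load sketch with a toy dataset; the path is a placeholder.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS, ALSModel

spark = SparkSession.builder.getOrCreate()
ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 5.0)],
    ["userId", "movieId", "rating"],
)

als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          rank=2, maxIter=2, coldStartStrategy="drop")
model = als.fit(ratings)

model.write().overwrite().save("/tmp/als_model")

# Reload with the model class (ALSModel), not the ALS estimator.
reloaded = ALSModel.load("/tmp/als_model")
reloaded.transform(ratings).show()
{code}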



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39683) Did not find value which can be converted into java.lang.String

2022-07-05 Thread zehra (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zehra updated SPARK-39683:
--
Description: 
Hi, I have a problem with loading the ALS model in pyspark. I described it in 
detail at this link. Could you help me with it? Thanks.

 

[https://stackoverflow.com/questions/72868619/apache-spark-loading-als-model-did-not-find-value-which-can-be-converted-into]

  was:
Hi, I have a problem with loading the model in pyspark. I wrote it in a detail 
at this link. Could you help me with it? Thanks.

 

[https://stackoverflow.com/questions/72868619/apache-spark-loading-als-model-did-not-find-value-which-can-be-converted-into]


> Did not find value which can be converted into java.lang.String
> ---
>
> Key: SPARK-39683
> URL: https://issues.apache.org/jira/browse/SPARK-39683
> Project: Spark
>  Issue Type: Question
>  Components: ML, PySpark
>Affects Versions: 3.2.0
>Reporter: zehra
>Priority: Major
>
> Hi, I have a problem with loading the ALS model in pyspark. I described it in 
> detail at this link. Could you help me with it? Thanks.
>  
> [https://stackoverflow.com/questions/72868619/apache-spark-loading-als-model-did-not-find-value-which-can-be-converted-into]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39683) Did not find value which can be converted into java.lang.String

2022-07-05 Thread zehra (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zehra updated SPARK-39683:
--
Description: 
Hi, I have a problem with loading the model in pyspark. I described it in detail 
at this link. Could you help me with it? Thanks.

[Error Message|http://example.com/]

https://stackoverflow.com/questions/72868619/apache-spark-loading-als-model-did-not-find-value-which-can-be-converted-into

  was:
Hi, I have a problem with loading the model in pyspark. I wrote it in a detail 
at this link. Could you help me with it? Thanks.

[Error 
Message|http://example.com]https://stackoverflow.com/questions/72868619/apache-spark-loading-als-model-did-not-find-value-which-can-be-converted-into


> Did not find value which can be converted into java.lang.String
> ---
>
> Key: SPARK-39683
> URL: https://issues.apache.org/jira/browse/SPARK-39683
> Project: Spark
>  Issue Type: Question
>  Components: ML, PySpark
>Affects Versions: 3.2.0
>Reporter: zehra
>Priority: Major
>
> Hi, I have a problem with loading the model in pyspark. I described it in 
> detail at this link. Could you help me with it? Thanks.
> [Error Message|http://example.com/]
> https://stackoverflow.com/questions/72868619/apache-spark-loading-als-model-did-not-find-value-which-can-be-converted-into



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39683) Did not find value which can be converted into java.lang.String

2022-07-05 Thread zehra (Jira)
zehra created SPARK-39683:
-

 Summary: Did not find value which can be converted into 
java.lang.String
 Key: SPARK-39683
 URL: https://issues.apache.org/jira/browse/SPARK-39683
 Project: Spark
  Issue Type: Question
  Components: ML, PySpark
Affects Versions: 3.2.0
Reporter: zehra


Hi, I have a problem with loading the model in pyspark. I described it in detail 
at this link. Could you help me with it? Thanks.

[Error 
Message|http://example.com]https://stackoverflow.com/questions/72868619/apache-spark-loading-als-model-did-not-find-value-which-can-be-converted-into
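For context, a minimal Scala sketch of the save/load round trip this report describes (the actual failing code is PySpark and lives at the linked question; the toy data and path below are illustrative only):

{code:java}
import org.apache.spark.ml.recommendation.{ALS, ALSModel}
import spark.implicits._

// Tiny, made-up ratings just to produce a model that can be saved and reloaded.
val ratings = Seq((0, 0, 4.0f), (0, 1, 2.0f), (1, 1, 3.0f))
  .toDF("user", "item", "rating")

val model = new ALS()
  .setUserCol("user")
  .setItemCol("item")
  .setRatingCol("rating")
  .fit(ratings)

model.write.overwrite().save("/tmp/als-model")  // illustrative path
val reloaded = ALSModel.load("/tmp/als-model")  // the load step is what reportedly fails
{code}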



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39681) PySpark build broken in branch-3.2 with Scala 2.13

2022-07-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-39681:
-
Summary: PySpark build broken in branch-3.2 with Scala 2.13  (was: PySpark 
build in branch-3.2 with Scala 2.13 fails)

> PySpark build broken in branch-3.2 with Scala 2.13
> --
>
> Key: SPARK-39681
> URL: https://issues.apache.org/jira/browse/SPARK-39681
> Project: Spark
>  Issue Type: Test
>  Components: Build, PySpark
>Affects Versions: 3.2.1
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> 
> Running PySpark tests
> 
> Traceback (most recent call last):
>   File "./python/run-tests.py", line 65, in <module>
> raise RuntimeError("Cannot find assembly build directory, please build 
> Spark first.")
> RuntimeError: Cannot find assembly build directory, please build Spark first.
> {code}
> https://github.com/apache/spark/runs/7189972241?check_suite_focus=true



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39682) Docker IT build broken in branch-3.2 with Scala 2.13

2022-07-05 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-39682:


 Summary: Docker IT build broken in branch-3.2 with Scala 2.13
 Key: SPARK-39682
 URL: https://issues.apache.org/jira/browse/SPARK-39682
 Project: Spark
  Issue Type: Test
  Components: Build, Tests
Affects Versions: 3.2.1
Reporter: Hyukjin Kwon


https://github.com/apache/spark/runs/7189971505?check_suite_focus=true

{code}
[info] OracleIntegrationSuite:
[info] org.apache.spark.sql.jdbc.v2.OracleIntegrationSuite *** ABORTED *** (8 
minutes, 1 second)
[info]   The code passed to eventually never returned normally. Attempted 426 
times over 7.008370057216667 minutes. Last failure message: IO Error: The 
Network Adapter could not establish the connection. 
(DockerJDBCIntegrationSuite.scala:166)
[info]   org.scalatest.exceptions.TestFailedDueToTimeoutException:
[info]   at 
org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:189)
[info]   at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:196)
[info]   at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:313)
[info]   at 
org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:312)
[info]   at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.eventually(DockerJDBCIntegrationSuite.scala:95)
[info]   at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.$anonfun$beforeAll$1(DockerJDBCIntegrationSuite.scala:166)
[info]   at 
org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled(DockerIntegrationFunSuite.scala:49)
[info]   at 
org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled$(DockerIntegrationFunSuite.scala:47)
[info]   at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.runIfTestsEnabled(DockerJDBCIntegrationSuite.scala:95)
[info]   at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:118)
[info]   at 
org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
[info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info]   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info]   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:62)
[info]   at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:318)
[info]   at 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:513)
[info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413)
[info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info]   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info]   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[info]   at java.lang.Thread.run(Thread.java:750)
[info]   Cause: java.sql.SQLRecoverableException: IO Error: The Network Adapter 
could not establish the connection
[info]   at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:858)
[info]   at 
oracle.jdbc.driver.PhysicalConnection.connect(PhysicalConnection.java:793)
[info]   at 
oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:57)
[info]   at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:747)
[info]   at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:562)
[info]   at java.sql.DriverManager.getConnection(DriverManager.java:664)
[info]   at java.sql.DriverManager.getConnection(DriverManager.java:208)
[info]   at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.getConnection(DockerJDBCIntegrationSuite.scala:200)
[info]   at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.$anonfun$beforeAll$3(DockerJDBCIntegrationSuite.scala:167)
[info]   at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at 
org.scalatest.enablers.Retrying$$anon$4.makeAValiantAttempt$1(Retrying.scala:154)
[info]   at 
org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:166)
[info]   at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:196)
[info]   at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:313)
[info]   at 
org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:312)
[info]   at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.eventually(DockerJDBCIntegrationSuite.scala:95)
[info]   at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.$anonfun$beforeAll$1(DockerJDBCIntegrationSuite.scala:166)
[info]   at 
org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled(DockerIntegrationFunSuite.scala:49)
[info]   at 
org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled$(DockerIntegrationFunSuite.scala:47)
[info]   at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.runIfTestsEnabled(DockerJDBCIntegrationSuite.scala:95)
[info]   at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:118)
[info]   at 

[jira] [Created] (SPARK-39681) PySpark build in branch-3.2 with Scala 2.13 fails

2022-07-05 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-39681:


 Summary: PySpark build in branch-3.2 with Scala 2.13 fails
 Key: SPARK-39681
 URL: https://issues.apache.org/jira/browse/SPARK-39681
 Project: Spark
  Issue Type: Test
  Components: Build, PySpark
Affects Versions: 3.2.1
Reporter: Hyukjin Kwon


{code}


Running PySpark tests

Traceback (most recent call last):
  File "./python/run-tests.py", line 65, in <module>
raise RuntimeError("Cannot find assembly build directory, please build 
Spark first.")
RuntimeError: Cannot find assembly build directory, please build Spark first.
{code}

https://github.com/apache/spark/runs/7189972241?check_suite_focus=true



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39677) Wrong args item formatting of the regexp functions

2022-07-05 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-39677:
-
Fix Version/s: 3.4.0
   3.3.1

> Wrong args item formatting of the regexp functions
> --
>
> Key: SPARK-39677
> URL: https://issues.apache.org/jira/browse/SPARK-39677
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: Screenshot 2022-07-05 at 09.48.28.png
>
>
> See the attached screenshot.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39677) Wrong args item formatting of the regexp functions

2022-07-05 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-39677.
--
Resolution: Fixed

> Wrong args item formatting of the regexp functions
> --
>
> Key: SPARK-39677
> URL: https://issues.apache.org/jira/browse/SPARK-39677
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: Screenshot 2022-07-05 at 09.48.28.png
>
>
> See the attached screenshot.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37527) Translate more standard aggregate functions for pushdown

2022-07-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562538#comment-17562538
 ] 

Apache Spark commented on SPARK-37527:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/37090

> Translate more standard aggregate functions for pushdown
> 
>
> Key: SPARK-37527
> URL: https://issues.apache.org/jira/browse/SPARK-37527
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently, Spark aggregate pushdown translates some standard aggregate 
> functions so that they can be compiled into SQL suited to the specified database.
> After this work, users can override JdbcDialect.compileAggregate to 
> implement aggregate functions supported by a specific database.
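As a rough illustration of the extension point mentioned above (the `jdbc:mydb` URL and dialect object are hypothetical; this sketch simply delegates to the built-in translation to show where vendor-specific aggregates would be handled):

{code:java}
import org.apache.spark.sql.connector.expressions.aggregate.AggregateFunc
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Hypothetical dialect for a database reachable via jdbc:mydb URLs.
object MyDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:mydb")

  override def compileAggregate(aggFunction: AggregateFunc): Option[String] = {
    // The default implementation already translates the standard functions;
    // a real dialect would add cases here for aggregates that only its
    // database supports, before falling back to the default.
    super.compileAggregate(aggFunction)
  }
}

JdbcDialects.registerDialect(MyDialect)
{code}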



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39579) Make ListFunctions/getFunction/functionExists API compatible

2022-07-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562520#comment-17562520
 ] 

Apache Spark commented on SPARK-39579:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37088

> Make ListFunctions/getFunction/functionExists API compatible 
> -
>
> Key: SPARK-39579
> URL: https://issues.apache.org/jira/browse/SPARK-39579
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39447) Only non-broadcast query stage can propagate empty relation

2022-07-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562514#comment-17562514
 ] 

Apache Spark commented on SPARK-39447:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/37087

> Only non-broadcast query stage can propagate empty relation
> ---
>
> Key: SPARK-39447
> URL: https://issues.apache.org/jira/browse/SPARK-39447
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39680) Remove `hasSpaceForAnotherRecord` from `UnsafeExternalSorter`

2022-07-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562513#comment-17562513
 ] 

Apache Spark commented on SPARK-39680:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37086

> Remove `hasSpaceForAnotherRecord` from `UnsafeExternalSorter`
> -
>
> Key: SPARK-39680
> URL: https://issues.apache.org/jira/browse/SPARK-39680
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> The `hasSpaceForAnotherRecord()` method is marked `@VisibleForTesting` and is 
> no longer used in the master branch.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39680) Remove `hasSpaceForAnotherRecord` from `UnsafeExternalSorter`

2022-07-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39680:


Assignee: (was: Apache Spark)

> Remove `hasSpaceForAnotherRecord` from `UnsafeExternalSorter`
> -
>
> Key: SPARK-39680
> URL: https://issues.apache.org/jira/browse/SPARK-39680
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> The `hasSpaceForAnotherRecord()` method is marked `@VisibleForTesting` and is 
> no longer used in the master branch.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39680) Remove `hasSpaceForAnotherRecord` from `UnsafeExternalSorter`

2022-07-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562512#comment-17562512
 ] 

Apache Spark commented on SPARK-39680:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37086

> Remove `hasSpaceForAnotherRecord` from `UnsafeExternalSorter`
> -
>
> Key: SPARK-39680
> URL: https://issues.apache.org/jira/browse/SPARK-39680
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> The `hasSpaceForAnotherRecord()` method is marked `@VisibleForTesting` and is 
> no longer used in the master branch.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39680) Remove `hasSpaceForAnotherRecord` from `UnsafeExternalSorter`

2022-07-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39680:


Assignee: Apache Spark

> Remove `hasSpaceForAnotherRecord` from `UnsafeExternalSorter`
> -
>
> Key: SPARK-39680
> URL: https://issues.apache.org/jira/browse/SPARK-39680
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> The `hasSpaceForAnotherRecord()` method is marked `@VisibleForTesting` and is 
> no longer used in the master branch.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39612) The dataframe returned by exceptAll() can no longer perform operations such as count() or isEmpty(), or an exception will be thrown.

2022-07-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39612:


Assignee: (was: Apache Spark)

> The dataframe returned by exceptAll() can no longer perform operations such 
> as count() or isEmpty(), or an exception will be thrown.
> 
>
> Key: SPARK-39612
> URL: https://issues.apache.org/jira/browse/SPARK-39612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
> Environment: OS: centos stream 8
> {code:java}
> $ uname -a
> Linux thomaszhu1.fyre.ibm.com 4.18.0-348.7.1.el8_5.x86_64 #1 SMP Wed Dec 22 
> 13:25:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> $ python --version
> Python 3.8.13 
> $ pyspark --version
> Welcome to
>                     __
>      / __/__  ___ _/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 3.3.0
>       /_/
>                         
> Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.11
> Branch HEAD
> Compiled by user ubuntu on 2022-06-09T19:58:58Z
> Revision f74867bddfbcdd4d08076db36851e88b15e66556
> Url https://github.com/apache/spark
> Type --help for more information.
> $ java --version
> openjdk 11.0.11 2021-04-20
> OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9)
> OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode) 
> {code}
>  
>Reporter: Zhu JunYong
>Priority: Critical
>
> As stated in the summary, calling operations such as `count()` or `isEmpty()` on 
> the dataframe returned by `exceptAll()` throws an exception.
>  
>  
> {code:java}
> >>> d1 = spark.createDataFrame([("a")], 'STRING')
> >>> d1.show()
> +-----+
> |value|
> +-----+
> |    a|
> +-----+
> >>> d2 = d1.exceptAll(d1)
> >>> d2.show()
> +-----+
> |value|
> +-----+
> +-----+
> >>> d2.count()
> 22/06/27 11:22:15 ERROR Executor: Exception in task 0.0 in stage 113.0 (TID 
> 525)
> java.lang.IllegalStateException: Couldn't find value#465 in [sum#494L]
>     at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
>     at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
>     at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589)
>     at scala.collection.immutable.List.map(List.scala:297)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:528)
>     at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73)
>     at 
> org.apache.spark.sql.execution.GenerateExec.boundGenerator$lzycompute(GenerateExec.scala:75)
>     at 
> org.apache.spark.sql.execution.GenerateExec.boundGenerator(GenerateExec.scala:75)
>     at 
> org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$10(GenerateExec.scala:114)
>     at 
> org.apache.spark.sql.execution.LazyIterator.results$lzycompute(GenerateExec.scala:36)
>     at 
> org.apache.spark.sql.execution.LazyIterator.results(GenerateExec.scala:36)
>     at 
> org.apache.spark.sql.execution.LazyIterator.hasNext(GenerateExec.scala:37)
>     at scala.collection.Iterator$ConcatIterator.advance(Iterator.scala:199)
>     at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:227)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.hashAgg_doAggregateWithoutKey_0$(Unknown
>  Source)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown
>  Source)
>     at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>     at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
>   

[jira] [Commented] (SPARK-39612) The dataframe returned by exceptAll() can no longer perform operations such as count() or isEmpty(), or an exception will be thrown.

2022-07-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562511#comment-17562511
 ] 

Apache Spark commented on SPARK-39612:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/37084

> The dataframe returned by exceptAll() can no longer perform operations such 
> as count() or isEmpty(), or an exception will be thrown.
> 
>
> Key: SPARK-39612
> URL: https://issues.apache.org/jira/browse/SPARK-39612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
> Environment: OS: centos stream 8
> {code:java}
> $ uname -a
> Linux thomaszhu1.fyre.ibm.com 4.18.0-348.7.1.el8_5.x86_64 #1 SMP Wed Dec 22 
> 13:25:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> $ python --version
> Python 3.8.13 
> $ pyspark --version
> Welcome to
>                     __
>      / __/__  ___ _/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 3.3.0
>       /_/
>                         
> Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.11
> Branch HEAD
> Compiled by user ubuntu on 2022-06-09T19:58:58Z
> Revision f74867bddfbcdd4d08076db36851e88b15e66556
> Url https://github.com/apache/spark
> Type --help for more information.
> $ java --version
> openjdk 11.0.11 2021-04-20
> OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9)
> OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode) 
> {code}
>  
>Reporter: Zhu JunYong
>Priority: Critical
>
> As stated in the summary, calling operations such as `count()` or `isEmpty()` on 
> the dataframe returned by `exceptAll()` throws an exception.
>  
>  
> {code:java}
> >>> d1 = spark.createDataFrame([("a")], 'STRING')
> >>> d1.show()
> +-----+
> |value|
> +-----+
> |    a|
> +-----+
> >>> d2 = d1.exceptAll(d1)
> >>> d2.show()
> +-----+
> |value|
> +-----+
> +-----+
> >>> d2.count()
> 22/06/27 11:22:15 ERROR Executor: Exception in task 0.0 in stage 113.0 (TID 
> 525)
> java.lang.IllegalStateException: Couldn't find value#465 in [sum#494L]
>     at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
>     at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
>     at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589)
>     at scala.collection.immutable.List.map(List.scala:297)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:528)
>     at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73)
>     at 
> org.apache.spark.sql.execution.GenerateExec.boundGenerator$lzycompute(GenerateExec.scala:75)
>     at 
> org.apache.spark.sql.execution.GenerateExec.boundGenerator(GenerateExec.scala:75)
>     at 
> org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$10(GenerateExec.scala:114)
>     at 
> org.apache.spark.sql.execution.LazyIterator.results$lzycompute(GenerateExec.scala:36)
>     at 
> org.apache.spark.sql.execution.LazyIterator.results(GenerateExec.scala:36)
>     at 
> org.apache.spark.sql.execution.LazyIterator.hasNext(GenerateExec.scala:37)
>     at scala.collection.Iterator$ConcatIterator.advance(Iterator.scala:199)
>     at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:227)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.hashAgg_doAggregateWithoutKey_0$(Unknown
>  Source)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown
>  Source)
>     at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>     at 
> 

[jira] [Assigned] (SPARK-39612) The dataframe returned by exceptAll() can no longer perform operations such as count() or isEmpty(), or an exception will be thrown.

2022-07-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39612:


Assignee: Apache Spark

> The dataframe returned by exceptAll() can no longer perform operations such 
> as count() or isEmpty(), or an exception will be thrown.
> 
>
> Key: SPARK-39612
> URL: https://issues.apache.org/jira/browse/SPARK-39612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
> Environment: OS: centos stream 8
> {code:java}
> $ uname -a
> Linux thomaszhu1.fyre.ibm.com 4.18.0-348.7.1.el8_5.x86_64 #1 SMP Wed Dec 22 
> 13:25:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> $ python --version
> Python 3.8.13 
> $ pyspark --version
> Welcome to
>                     __
>      / __/__  ___ _/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 3.3.0
>       /_/
>                         
> Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.11
> Branch HEAD
> Compiled by user ubuntu on 2022-06-09T19:58:58Z
> Revision f74867bddfbcdd4d08076db36851e88b15e66556
> Url https://github.com/apache/spark
> Type --help for more information.
> $ java --version
> openjdk 11.0.11 2021-04-20
> OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9)
> OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode) 
> {code}
>  
>Reporter: Zhu JunYong
>Assignee: Apache Spark
>Priority: Critical
>
> As stated in the summary, calling operations such as `count()` or `isEmpty()` on 
> the dataframe returned by `exceptAll()` throws an exception.
>  
>  
> {code:java}
> >>> d1 = spark.createDataFrame([("a")], 'STRING')
> >>> d1.show()
> +-----+
> |value|
> +-----+
> |    a|
> +-----+
> >>> d2 = d1.exceptAll(d1)
> >>> d2.show()
> +-----+
> |value|
> +-----+
> +-----+
> >>> d2.count()
> 22/06/27 11:22:15 ERROR Executor: Exception in task 0.0 in stage 113.0 (TID 
> 525)
> java.lang.IllegalStateException: Couldn't find value#465 in [sum#494L]
>     at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
>     at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
>     at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589)
>     at scala.collection.immutable.List.map(List.scala:297)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:528)
>     at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73)
>     at 
> org.apache.spark.sql.execution.GenerateExec.boundGenerator$lzycompute(GenerateExec.scala:75)
>     at 
> org.apache.spark.sql.execution.GenerateExec.boundGenerator(GenerateExec.scala:75)
>     at 
> org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$10(GenerateExec.scala:114)
>     at 
> org.apache.spark.sql.execution.LazyIterator.results$lzycompute(GenerateExec.scala:36)
>     at 
> org.apache.spark.sql.execution.LazyIterator.results(GenerateExec.scala:36)
>     at 
> org.apache.spark.sql.execution.LazyIterator.hasNext(GenerateExec.scala:37)
>     at scala.collection.Iterator$ConcatIterator.advance(Iterator.scala:199)
>     at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:227)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.hashAgg_doAggregateWithoutKey_0$(Unknown
>  Source)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown
>  Source)
>     at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>     at 
> 

[jira] [Assigned] (SPARK-39679) TakeOrderedAndProjectExec should respect child output ordering

2022-07-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39679:


Assignee: Apache Spark

> TakeOrderedAndProjectExec should respect child output ordering
> --
>
> Key: SPARK-39679
> URL: https://issues.apache.org/jira/browse/SPARK-39679
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: Apache Spark
>Priority: Major
>
> TakeOrderedAndProjectExec should respect child output ordering to avoid an 
> unnecessary sort, for example when TakeOrderedAndProjectExec sits on top of a 
> SortMergeJoin.
> {code:java}
> SELECT * FROM t1 JOIN t2 ON t1.c1 = t2.c2 ORDER BY t1.c1 LIMIT 100;
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39679) TakeOrderedAndProjectExec should respect child output ordering

2022-07-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39679:


Assignee: (was: Apache Spark)

> TakeOrderedAndProjectExec should respect child output ordering
> --
>
> Key: SPARK-39679
> URL: https://issues.apache.org/jira/browse/SPARK-39679
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> TakeOrderedAndProjectExec should respect child output ordering to avoid an 
> unnecessary sort, for example when TakeOrderedAndProjectExec sits on top of a 
> SortMergeJoin.
> {code:java}
> SELECT * FROM t1 JOIN t2 ON t1.c1 = t2.c2 ORDER BY t1.c1 LIMIT 100;
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39679) TakeOrderedAndProjectExec should respect child output ordering

2022-07-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562510#comment-17562510
 ] 

Apache Spark commented on SPARK-39679:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/37085

> TakeOrderedAndProjectExec should respect child output ordering
> --
>
> Key: SPARK-39679
> URL: https://issues.apache.org/jira/browse/SPARK-39679
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> TakeOrderedAndProjectExec should respect child output ordering to avoid an 
> unnecessary sort, for example when TakeOrderedAndProjectExec sits on top of a 
> SortMergeJoin.
> {code:java}
> SELECT * FROM t1 JOIN t2 ON t1.c1 = t2.c2 ORDER BY t1.c1 LIMIT 100;
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38531) "Prune unrequired child index" branch of ColumnPruning has wrong condition

2022-07-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38531:


Assignee: (was: Apache Spark)

> "Prune unrequired child index" branch of ColumnPruning has wrong condition
> --
>
> Key: SPARK-38531
> URL: https://issues.apache.org/jira/browse/SPARK-38531
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.2.1, 3.4.0, 3.3.1
>Reporter: Min Yang
>Priority: Minor
>
> The "prune unrequired references" branch has the condition:
> {code:java}
> case p @ Project(_, g: Generate) if p.references != g.outputSet => {code}
> This is wrong as generators like Inline will always enter this branch as long 
> as it does not use all the generator output.
>  
> Example:
>  
> input: <col1: array<struct<a: struct<a: int>, b: int>>>
>  
> Project(a.a as x)
> - Generate(Inline(col1), ..., a, b)
>  
> p.references is [a]
> g.outputSet is [a, b]
>  
> This bug means we never enter the GeneratorNestedColumnAliasing branch below and 
> thus miss some optimization opportunities. The condition should be
> {code:java}
> g.requiredChildOutput.contains(!p.references.contains(_)) {code}
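A self-contained sketch that builds the plan shape from the example above (the schema and literal values are made up to match it); the plans printed by explain(true) show whether the Generate output gets pruned:

{code:java}
// Only the nested field a.a is projected after inline(col1),
// yet b is still part of the Generate's output.
val df = spark.sql(
  "SELECT array(named_struct('a', named_struct('a', 1), 'b', 2)) AS col1")

df.selectExpr("inline(col1)")   // generates columns a (struct) and b (int)
  .selectExpr("a.a AS x")       // uses only a.a
  .explain(true)
{code}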



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38531) "Prune unrequired child index" branch of ColumnPruning has wrong condition

2022-07-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38531:


Assignee: Apache Spark

> "Prune unrequired child index" branch of ColumnPruning has wrong condition
> --
>
> Key: SPARK-38531
> URL: https://issues.apache.org/jira/browse/SPARK-38531
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.2.1, 3.4.0, 3.3.1
>Reporter: Min Yang
>Assignee: Apache Spark
>Priority: Minor
>
> The "prune unrequired references" branch has the condition:
> {code:java}
> case p @ Project(_, g: Generate) if p.references != g.outputSet => {code}
> This is wrong as generators like Inline will always enter this branch as long 
> as it does not use all the generator output.
>  
> Example:
>  
> input: <col1: array<struct<a: struct<a: int>, b: int>>>
>  
> Project(a.a as x)
> - Generate(Inline(col1), ..., a, b)
>  
> p.references is [a]
> g.outputSet is [a, b]
>  
> This bug means we never enter the GeneratorNestedColumnAliasing branch below and 
> thus miss some optimization opportunities. The condition should be
> {code:java}
> g.requiredChildOutput.contains(!p.references.contains(_)) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39680) Remove `hasSpaceForAnotherRecord` from `UnsafeExternalSorter`

2022-07-05 Thread Yang Jie (Jira)
Yang Jie created SPARK-39680:


 Summary: Remove `hasSpaceForAnotherRecord` from 
`UnsafeExternalSorter`
 Key: SPARK-39680
 URL: https://issues.apache.org/jira/browse/SPARK-39680
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Yang Jie


The `hasSpaceForAnotherRecord()` method is marked `@VisibleForTesting` and is 
no longer used in the master branch.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39679) TakeOrderedAndProjectExec should respect child output ordering

2022-07-05 Thread XiDuo You (Jira)
XiDuo You created SPARK-39679:
-

 Summary: TakeOrderedAndProjectExec should respect child output 
ordering
 Key: SPARK-39679
 URL: https://issues.apache.org/jira/browse/SPARK-39679
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: XiDuo You


TakeOrderedAndProjectExec should respect child output ordering to avoid an 
unnecessary sort, for example when TakeOrderedAndProjectExec sits on top of a 
SortMergeJoin.
{code:java}
SELECT * FROM t1 JOIN t2 ON t1.c1 = t2.c2 ORDER BY t1.c1 LIMIT 100;
{code}
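A quick way to look at the plan in question (table names follow the example above; broadcast is disabled here so the join becomes a sort-merge join; whether the extra Sort under the limit disappears depends on the proposed change):

{code:java}
// Force a sort-merge join, then order-by + limit on the join key.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
spark.range(1000).toDF("c1").createOrReplaceTempView("t1")
spark.range(1000).toDF("c2").createOrReplaceTempView("t2")

spark.sql(
  "SELECT * FROM t1 JOIN t2 ON t1.c1 = t2.c2 ORDER BY t1.c1 LIMIT 100"
).explain()
{code}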



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38531) "Prune unrequired child index" branch of ColumnPruning has wrong condition

2022-07-05 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562496#comment-17562496
 ] 

Hyukjin Kwon commented on SPARK-38531:
--

Ah, maybe I should have filed a new JIRA  

> "Prune unrequired child index" branch of ColumnPruning has wrong condition
> --
>
> Key: SPARK-38531
> URL: https://issues.apache.org/jira/browse/SPARK-38531
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.2.1, 3.4.0, 3.3.1
>Reporter: Min Yang
>Priority: Minor
>
> The "prune unrequired references" branch has the condition:
> {code:java}
> case p @ Project(_, g: Generate) if p.references != g.outputSet => {code}
> This is wrong as generators like Inline will always enter this branch as long 
> as it does not use all the generator output.
>  
> Example:
>  
> input: <col1: array<struct<a: struct<a: int>, b: int>>>
>  
> Project(a.a as x)
> - Generate(Inline(col1), ..., a, b)
>  
> p.references is [a]
> g.outputSet is [a, b]
>  
> This bug means we never enter the GeneratorNestedColumnAliasing branch below and 
> thus miss some optimization opportunities. The condition should be
> {code:java}
> g.requiredChildOutput.contains(!p.references.contains(_)) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38531) "Prune unrequired child index" branch of ColumnPruning has wrong condition

2022-07-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-38531:
-
Fix Version/s: (was: 3.3.0)

> "Prune unrequired child index" branch of ColumnPruning has wrong condition
> --
>
> Key: SPARK-38531
> URL: https://issues.apache.org/jira/browse/SPARK-38531
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.2.1
>Reporter: Min Yang
>Priority: Minor
>
> The "prune unrequired references" branch has the condition:
> {code:java}
> case p @ Project(_, g: Generate) if p.references != g.outputSet => {code}
> This is wrong as generators like Inline will always enter this branch as long 
> as it does not use all the generator output.
>  
> Example:
>  
> input: <col1: array<struct<a: struct<a: int>, b: int>>>
>  
> Project(a.a as x)
> - Generate(Inline(col1), ..., a, b)
>  
> p.references is [a]
> g.outputSet is [a, b]
>  
> This bug means we never enter the GeneratorNestedColumnAliasing branch below and 
> thus miss some optimization opportunities. The condition should be
> {code:java}
> g.requiredChildOutput.contains(!p.references.contains(_)) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-38531) "Prune unrequired child index" branch of ColumnPruning has wrong condition

2022-07-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-38531:
--
  Assignee: (was: Min Yang)

> "Prune unrequired child index" branch of ColumnPruning has wrong condition
> --
>
> Key: SPARK-38531
> URL: https://issues.apache.org/jira/browse/SPARK-38531
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.2.1
>Reporter: Min Yang
>Priority: Minor
> Fix For: 3.3.0
>
>
> The "prune unrequired references" branch has the condition:
> {code:java}
> case p @ Project(_, g: Generate) if p.references != g.outputSet => {code}
> This is wrong as generators like Inline will always enter this branch as long 
> as it does not use all the generator output.
>  
> Example:
>  
> input: <col1: array<struct<a: struct<a: int>, b: int>>>
>  
> Project(a.a as x)
> - Generate(Inline(col1), ..., a, b)
>  
> p.references is [a]
> g.outputSet is [a, b]
>  
> This bug means we never enter the GeneratorNestedColumnAliasing branch below and 
> thus miss some optimization opportunities. The condition should be
> {code:java}
> g.requiredChildOutput.contains(!p.references.contains(_)) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38531) "Prune unrequired child index" branch of ColumnPruning has wrong condition

2022-07-05 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562495#comment-17562495
 ] 

Hyukjin Kwon commented on SPARK-38531:
--

Reverted at 
https://github.com/apache/spark/commit/161c596cafea9c235b5c918d8999c085401d73a9 
and 
https://github.com/apache/spark/commit/4512e0943036d30587ab19a95efb0e66b47dd746

> "Prune unrequired child index" branch of ColumnPruning has wrong condition
> --
>
> Key: SPARK-38531
> URL: https://issues.apache.org/jira/browse/SPARK-38531
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.2.1
>Reporter: Min Yang
>Assignee: Min Yang
>Priority: Minor
> Fix For: 3.3.0
>
>
> The "prune unrequired references" branch has the condition:
> {code:java}
> case p @ Project(_, g: Generate) if p.references != g.outputSet => {code}
> This is wrong as generators like Inline will always enter this branch as long 
> as it does not use all the generator output.
>  
> Example:
>  
> input: <col1: array<struct<a: struct<a: int>, b: int>>>
>  
> Project(a.a as x)
> - Generate(Inline(col1), ..., a, b)
>  
> p.references is [a]
> g.outputSet is [a, b]
>  
> This bug means we never enter the GeneratorNestedColumnAliasing branch below and 
> thus miss some optimization opportunities. The condition should be
> {code:java}
> g.requiredChildOutput.contains(!p.references.contains(_)) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38531) "Prune unrequired child index" branch of ColumnPruning has wrong condition

2022-07-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-38531:
-
Affects Version/s: 3.4.0
   3.3.1

> "Prune unrequired child index" branch of ColumnPruning has wrong condition
> --
>
> Key: SPARK-38531
> URL: https://issues.apache.org/jira/browse/SPARK-38531
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.2.1, 3.4.0, 3.3.1
>Reporter: Min Yang
>Priority: Minor
>
> The "prune unrequired references" branch has the condition:
> {code:java}
> case p @ Project(_, g: Generate) if p.references != g.outputSet => {code}
> This is wrong as generators like Inline will always enter this branch as long 
> as it does not use all the generator output.
>  
> Example:
>  
> input: <col1: array<struct<a: struct<a: int>, b: int>>>
>  
> Project(a.a as x)
> - Generate(Inline(col1), ..., a, b)
>  
> p.references is [a]
> g.outputSet is [a, b]
>  
> This bug means we never enter the GeneratorNestedColumnAliasing branch below and 
> thus miss some optimization opportunities. The condition should be
> {code:java}
> g.requiredChildOutput.contains(!p.references.contains(_)) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39678) Improve stats estimation for v2 tables

2022-07-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39678:


Assignee: (was: Apache Spark)

> Improve stats estimation for v2 tables
> --
>
> Key: SPARK-39678
> URL: https://issues.apache.org/jira/browse/SPARK-39678
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.3.0
>Reporter: Prashant Singh
>Priority: Minor
>
> In the case of v2 tables, connectors can bubble up both [sizeInBytes and rowCount 
> |https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Statistics.java].
> Presently, SizeInBytesOnlyStatsPlanVisitor omits propagating/estimating 
> rowCount stats in some places, such as:
>  * 
> [CodePointer1|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L54-L58]
>  * [CodePointer2 
> |https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L46-L47]
> For the 
> [non-cbo|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/LogicalPlanStats.scala#L34-L39]
>  flow, as per my understanding, this can improve the stats estimation, since 
> rowcount is indirectly used in places to estimate the size as well. 
>  
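For reference, a small sketch of the v2 Statistics interface linked above, with made-up numbers, showing the two values a connector's Scan (via SupportsReportStatistics) can report:

{code:java}
import java.util.OptionalLong
import org.apache.spark.sql.connector.read.Statistics

// Illustrative values only; a real connector would compute these from its metadata.
class MyScanStatistics extends Statistics {
  override def sizeInBytes(): OptionalLong = OptionalLong.of(128L * 1024 * 1024)
  override def numRows(): OptionalLong = OptionalLong.of(1000000L)
}
{code}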



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39678) Improve stats estimation for v2 tables

2022-07-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562475#comment-17562475
 ] 

Apache Spark commented on SPARK-39678:
--

User 'singhpk234' has created a pull request for this issue:
https://github.com/apache/spark/pull/37083

> Improve stats estimation for v2 tables
> --
>
> Key: SPARK-39678
> URL: https://issues.apache.org/jira/browse/SPARK-39678
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.3.0
>Reporter: Prashant Singh
>Priority: Minor
>
> In the case of v2 tables, connectors can bubble up both [sizeInBytes and rowCount 
> |https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Statistics.java].
> Presently, SizeInBytesOnlyStatsPlanVisitor omits propagating/estimating 
> rowCount stats in some places, such as:
>  * 
> [CodePointer1|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L54-L58]
>  * [CodePointer2 
> |https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L46-L47]
> For the 
> [non-cbo|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/LogicalPlanStats.scala#L34-L39]
>  flow, as per my understanding, this can improve the stats estimation, since 
> rowcount is indirectly used in places to estimate the size as well. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


