[jira] [Updated] (SPARK-34882) RewriteDistinctAggregates can cause a bug if the aggregator does not ignore NULLs

2021-03-30 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-34882:
-
Affects Version/s: 3.0.3
   3.1.2
   2.4.8

> RewriteDistinctAggregates can cause a bug if the aggregator does not ignore 
> NULLs
> -
>
> Key: SPARK-34882
> URL: https://issues.apache.org/jira/browse/SPARK-34882
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.2.0, 3.1.2, 3.0.3
>Reporter: Tanel Kiis
>Priority: Major
>  Labels: correctness
>
> {code:title=group-by.sql}
> SELECT
> first(DISTINCT a), last(DISTINCT a),
> first(a), last(a),
> first(DISTINCT b), last(DISTINCT b),
> first(b), last(b)
> FROM testData WHERE a IS NOT NULL AND b IS NOT NULL;{code}
> {code:title=group-by.sql.out}
> -- !query schema
> struct<first(DISTINCT a):int,last(DISTINCT a):int,first(a):int,last(a):int,first(DISTINCT 
> b):int,last(DISTINCT b):int,first(b):int,last(b):int>
> -- !query output
> NULL  1   1   3   1   NULL  1   2
> {code}
> The results should not be NULL, because NULL inputs are filtered out.
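For context, below is a minimal, self-contained sketch (hypothetical data, not the test suite's actual fixture) of the kind of query that goes through RewriteDistinctAggregates: several DISTINCT aggregates over different columns mixed with non-DISTINCT ones. The rewrite expands each input row once per distinct group and pads the other groups' columns with NULLs; first()/last() do not ignore NULLs by default, so they can pick up that padding even though the input itself has no NULLs.

{code:scala}
// Sketch only: column names and rows are illustrative, not the suite's testData.
import org.apache.spark.sql.SparkSession

object FirstLastDistinctSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("SPARK-34882-sketch").getOrCreate()
    import spark.implicits._

    // The input contains no NULLs at all.
    Seq((1, 1), (1, 2), (2, 1), (3, 2)).toDF("a", "b").createOrReplaceTempView("testData")

    // Mixing DISTINCT and non-DISTINCT aggregates routes the plan through
    // RewriteDistinctAggregates on the affected versions.
    spark.sql(
      """SELECT
        |  first(DISTINCT a), last(DISTINCT a),
        |  first(a), last(a),
        |  first(DISTINCT b), last(DISTINCT b),
        |  first(b), last(b)
        |FROM testData WHERE a IS NOT NULL AND b IS NOT NULL""".stripMargin
    ).show(truncate = false)

    spark.stop()
  }
}
{code}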



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34904) Old import of LZ4 package inside CompressionCodec.scala

2021-03-30 Thread Michal Zeman (Jira)
Michal Zeman created SPARK-34904:


 Summary: Old import of LZ4 package inside CompressionCodec.scala 
 Key: SPARK-34904
 URL: https://issues.apache.org/jira/browse/SPARK-34904
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.1.1, 2.4.0
Reporter: Michal Zeman


The following commit upgraded the version of the LZ4 package: 
[https://github.com/apache/spark/commit/b78cf13bf05f0eadd7ae97df84b6e1505dc5ff9f]

The dependency was changed. However, inside the file 
[CompressionCodec.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/io/CompressionCodec.scala] 
the old import referencing net.jpountz.lz4 (where only versions up to 1.3 are 
published) remains.

Probably for backward compatibility, the newer org.lz4 package still contains 
net.jpountz.lz4 
([https://github.com/lz4/lz4-java/tree/master/src/java/net/jpountz/lz4]). 
Therefore this import does not cause problems at first sight.
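To illustrate the point, here is a minimal standalone sketch (illustrative only, not Spark's CompressionCodec code) showing that the net.jpountz.lz4 classes still resolve when only the org.lz4:lz4-java artifact is on the classpath, because lz4-java kept its original Java package name:

{code:scala}
import java.io.ByteArrayOutputStream

// This class ships in the org.lz4:lz4-java artifact but keeps the old package name.
import net.jpountz.lz4.LZ4BlockOutputStream

object Lz4ImportCheck {
  def main(args: Array[String]): Unit = {
    val buffer = new ByteArrayOutputStream()
    val out = new LZ4BlockOutputStream(buffer) // default block size
    out.write("hello lz4".getBytes("UTF-8"))
    out.close()
    println(s"compressed to ${buffer.size()} bytes via the net.jpountz.lz4 import")
  }
}
{code}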

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34900) Some `spark-submit` commands used to run benchmarks in the user's guide are wrong

2021-03-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311418#comment-17311418
 ] 

Apache Spark commented on SPARK-34900:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/32002

> Some `spark-submit` commands used to run benchmarks in the user's guide are 
> wrong
> -
>
> Key: SPARK-34900
> URL: https://issues.apache.org/jira/browse/SPARK-34900
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Trivial
> Fix For: 3.2.0
>
>
> For example, the guide for running JoinBenchmark is as follows:
>
> {code:java}
> /**
>  * Benchmark to measure performance for joins.
>  * To run this benchmark:
>  * {{{
>  *   1. without sbt:
>  *      bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar>
>  *   2. build/sbt "sql/test:runMain <this class>"
>  *   3. generate result:
>  *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
>  *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
>  * }}}
>  */
> object JoinBenchmark extends SqlBasedBenchmark {
> {code}
>
> But if we run JoinBenchmark with the command
>
> {code:java}
> bin/spark-submit --class 
> org.apache.spark.sql.execution.benchmark.JoinBenchmark --jars 
> spark-core_2.12-3.2.0-SNAPSHOT-tests.jar 
> spark-sql_2.12-3.2.0-SNAPSHOT-tests.jar 
> {code}
>
> the following exception is thrown:
>
> {code:java}
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/apache/spark/sql/catalyst/plans/SQLHelper
>   at java.lang.ClassLoader.defineClass1(Native Method)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
>   at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:369){code}
>
> This is because the SqlBasedBenchmark trait extends BenchmarkBase and 
> SQLHelper, and SQLHelper is defined in the spark-catalyst tests jar, which is 
> not in the --jars list above.
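A simplified sketch of the inheritance chain described above (skeleton traits only, not the real Spark definitions): because SqlBasedBenchmark mixes in SQLHelper, and SQLHelper ships in the spark-catalyst tests jar, that jar must also be on the classpath when the benchmark is submitted.

{code:scala}
// Skeletons only; the real traits live in different Spark test jars:
//   BenchmarkBase -> spark-core tests jar
//   SQLHelper     -> spark-catalyst tests jar (org.apache.spark.sql.catalyst.plans.SQLHelper)
trait BenchmarkBase
trait SQLHelper
trait SqlBasedBenchmark extends BenchmarkBase with SQLHelper

// Loading a benchmark object requires SQLHelper to be resolvable, which is why
// the catalyst tests jar must be added to --jars as well.
object JoinBenchmarkSketch extends SqlBasedBenchmark {
  def main(args: Array[String]): Unit = println("classpath OK")
}
{code}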



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34900) Some `spark-submit` commands used to run benchmarks in the user's guide are wrong

2021-03-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311423#comment-17311423
 ] 

Apache Spark commented on SPARK-34900:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/32003

> Some `spark-submit` commands used to run benchmarks in the user's guide are 
> wrong
> -
>
> Key: SPARK-34900
> URL: https://issues.apache.org/jira/browse/SPARK-34900
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Trivial
> Fix For: 3.2.0
>
>
> For example, the guide for running JoinBenchmark is as follows:
>
> {code:java}
> /**
>  * Benchmark to measure performance for joins.
>  * To run this benchmark:
>  * {{{
>  *   1. without sbt:
>  *      bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar>
>  *   2. build/sbt "sql/test:runMain <this class>"
>  *   3. generate result:
>  *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
>  *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
>  * }}}
>  */
> object JoinBenchmark extends SqlBasedBenchmark {
> {code}
>
> But if we run JoinBenchmark with the command
>
> {code:java}
> bin/spark-submit --class 
> org.apache.spark.sql.execution.benchmark.JoinBenchmark --jars 
> spark-core_2.12-3.2.0-SNAPSHOT-tests.jar 
> spark-sql_2.12-3.2.0-SNAPSHOT-tests.jar 
> {code}
>
> the following exception is thrown:
>
> {code:java}
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/apache/spark/sql/catalyst/plans/SQLHelper
>   at java.lang.ClassLoader.defineClass1(Native Method)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
>   at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:369){code}
>
> This is because the SqlBasedBenchmark trait extends BenchmarkBase and 
> SQLHelper, and SQLHelper is defined in the spark-catalyst tests jar, which is 
> not in the --jars list above.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34900) Some `spark-submit` commands used to run benchmarks in the user's guide are wrong

2021-03-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311424#comment-17311424
 ] 

Apache Spark commented on SPARK-34900:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/32003

> Some `spark-submit` commands used to run benchmarks in the user's guide are 
> wrong
> -
>
> Key: SPARK-34900
> URL: https://issues.apache.org/jira/browse/SPARK-34900
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Trivial
> Fix For: 3.2.0
>
>
> For example, the guide for running JoinBenchmark is as follows:
>
> {code:java}
> /**
>  * Benchmark to measure performance for joins.
>  * To run this benchmark:
>  * {{{
>  *   1. without sbt:
>  *      bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar>
>  *   2. build/sbt "sql/test:runMain <this class>"
>  *   3. generate result:
>  *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
>  *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
>  * }}}
>  */
> object JoinBenchmark extends SqlBasedBenchmark {
> {code}
>
> But if we run JoinBenchmark with the command
>
> {code:java}
> bin/spark-submit --class 
> org.apache.spark.sql.execution.benchmark.JoinBenchmark --jars 
> spark-core_2.12-3.2.0-SNAPSHOT-tests.jar 
> spark-sql_2.12-3.2.0-SNAPSHOT-tests.jar 
> {code}
>
> the following exception is thrown:
>
> {code:java}
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/apache/spark/sql/catalyst/plans/SQLHelper
>   at java.lang.ClassLoader.defineClass1(Native Method)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
>   at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:369){code}
>
> This is because the SqlBasedBenchmark trait extends BenchmarkBase and 
> SQLHelper, and SQLHelper is defined in the spark-catalyst tests jar, which is 
> not in the --jars list above.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34905) Enable ANSI intervals in SQLQueryTestSuite

2021-03-30 Thread Max Gekk (Jira)
Max Gekk created SPARK-34905:


 Summary: Enable ANSI intervals in SQLQueryTestSuite
 Key: SPARK-34905
 URL: https://issues.apache.org/jira/browse/SPARK-34905
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.2.0
Reporter: Max Gekk


Remove the following code from SQLQueryTestSuite:
{code:java}
localSparkSession.conf.set(SQLConf.LEGACY_INTERVAL_ENABLED.key, true)
{code}
and use the ANSI interval where it is possible.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34905) Enable ANSI intervals in SQLQueryTestSuite

2021-03-30 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-34905:
-
Description: 
Remove the following code from SQLQueryTestSuite:
{code:java}
localSparkSession.conf.set(SQLConf.LEGACY_INTERVAL_ENABLED.key, true)
{code}
and use the ANSI interval where it is possible.

Probably, this depends on casting intervals to strings.

  was:
Remove the following code from SQLQueryTestSuite:
{code:java}
localSparkSession.conf.set(SQLConf.LEGACY_INTERVAL_ENABLED.key, true)
{code}
and use the ANSI interval where it is possible.



> Enable ANSI intervals in SQLQueryTestSuite
> --
>
> Key: SPARK-34905
> URL: https://issues.apache.org/jira/browse/SPARK-34905
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Remove the following code from SQLQueryTestSuite:
> {code:java}
> localSparkSession.conf.set(SQLConf.LEGACY_INTERVAL_ENABLED.key, true)
> {code}
> and use the ANSI interval where it is possible.
> Probably, this depends on casting intervals to strings.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33308) support CUBE(...) and ROLLUP(...), GROUPING SETS(...) as group by expr in parser level

2021-03-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33308:
---

Assignee: angerszhu

> support CUBE(...) and ROLLUP(...), GROUPING SETS(...) as group by expr in 
> parser level
> --
>
> Key: SPARK-33308
> URL: https://issues.apache.org/jira/browse/SPARK-33308
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>
> support CUBE(...) and ROLLUP(...), GROUPING SETS(...) as group by expr in 
> parser level



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34884) Improve dynamic partition pruning evaluation

2021-03-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-34884:
---

Assignee: Yuming Wang

> Improve dynamic partition pruning evaluation
> 
>
> Key: SPARK-34884
> URL: https://issues.apache.org/jira/browse/SPARK-34884
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> Fail fast if the filtering side cannot be broadcast based on its size.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34884) Improve dynamic partition pruning evaluation

2021-03-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34884.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31984
[https://github.com/apache/spark/pull/31984]

> Improve dynamic partition pruning evaluation
> 
>
> Key: SPARK-34884
> URL: https://issues.apache.org/jira/browse/SPARK-34884
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Fail fast if the filtering side cannot be broadcast based on its size.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33308) support CUBE(...) and ROLLUP(...), GROUPING SETS(...) as group by expr in parser level

2021-03-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33308.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 30212
[https://github.com/apache/spark/pull/30212]

> support CUBE(...) and ROLLUP(...), GROUPING SETS(...) as group by expr in 
> parser level
> --
>
> Key: SPARK-33308
> URL: https://issues.apache.org/jira/browse/SPARK-33308
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
>
> support CUBE(...) and ROLLUP(...), GROUPING SETS(...) as group by expr in 
> parser level



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26404) set spark.pyspark.python or PYSPARK_PYTHON doesn't work in k8s client-cluster mode.

2021-03-30 Thread Vincenzo Eduardo Padulano (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311505#comment-17311505
 ] 

Vincenzo Eduardo Padulano commented on SPARK-26404:
---

I found the same issue with this very basic setup: my default Python version is 
Python 3, and I also have a Python 2 environment with pyspark installed. I've 
written this simple script based on the Pi estimation example at 
[https://spark.apache.org/examples.html]:
{code:python}
import sys
import random
import pyspark

confdict = {"spark.app.name": "spark_pi",
    "spark.master": "local[4]",
    "spark.pyspark.python": sys.executable}
sparkconf = pyspark.SparkConf().setAll(confdict.items())
sparkcontext = pyspark.SparkContext(conf=sparkconf)

def inside(p):
    x, y = random.random(), random.random()
    return x * x + y * y < 1

num_samples = 1e4
num_partitions = 256

count = sparkcontext.parallelize(range(int(num_samples)), 
num_partitions).filter(inside).count()
print("Pi is roughly %.4f" % (4.0 * count / num_samples))
{code}
If I run the script with my default Python executable, everything works:
{code:bash}
$: python spark_pi.py 
21/03/30 14:55:39 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
Pi is roughly 3.1292 
{code}
But when I use Python 2, it doesn't pick up the `spark.pyspark.python` 
configuration option I set in my `SparkConf` object:
{code:bash}
$: python2 spark_pi.py 
21/03/30 14:56:59 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
/home/vpadulan/.local/lib/python2.7/site-packages/pyspark/context.py:227: 
DeprecationWarning: Support for Python 2 and Python 3 prior to version 3.6 is 
deprecated as of Spark 3.0. See also the plan for dropping Python 2 support at 
https://spark.apache.org/news/plan-for-dropping-python-2-support.html.
  DeprecationWarning)
21/03/30 14:57:03 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File 
"/home/vpadulan/.local/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py",
 line 473, in main
raise Exception(("Python in worker has different version %s than that in " +
Exception: Python in worker has different version 3.8 than that in driver 2.7, 
PySpark cannot run with different minor versions. Please check environment 
variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
{code}
I can't understand how this issue was resolved in the first place; the 
[official 
documentation|https://spark.apache.org/docs/latest/configuration.html#environment-variables]
 states that `spark.pyspark.python` should take precedence over 
`PYSPARK_PYTHON`:
{noformat}
Python binary executable to use for PySpark in both driver and workers (default 
is python3 if available, otherwise python). Property spark.pyspark.python take 
precedence if it is set
{noformat}

> set spark.pyspark.python or PYSPARK_PYTHON doesn't work in k8s client-cluster 
> mode.
> ---
>
> Key: SPARK-26404
> URL: https://issues.apache.org/jira/browse/SPARK-26404
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.0
>Reporter: Dongqing  Liu
>Priority: Major
>
> Neither
>    conf.set("spark.executorEnv.PYSPARK_PYTHON", "/opt/pythonenvs/bin/python")
> nor 
>   conf.set("spark.pyspark.python", "/opt/pythonenvs/bin/python") 
> works. 
> Looks like the executor always picks python from PATH.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34856) ANSI mode: Allow casting complex types as string type

2021-03-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311506#comment-17311506
 ] 

Apache Spark commented on SPARK-34856:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/32004

> ANSI mode: Allow casting complex types as string type
> -
>
> Key: SPARK-34856
> URL: https://issues.apache.org/jira/browse/SPARK-34856
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently, complex types are not allowed to be cast as string type. This 
> breaks the Dataset.show() API. E.g.
> {code:java}
> scala> sql("select array(1, 2, 2)").show(false)
> org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`array(1, 2, 2)` 
> AS STRING)' due to data type mismatch:
>  cannot cast array<int> to string with ANSI mode on.
> {code}
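For reference, a minimal sketch (assuming the ANSI behaviour described above; not the project's own test code) of the call pattern that hits this: Dataset.show() renders every column by casting it to string, so an array column fails to display when ANSI mode forbids the cast.

{code:scala}
import org.apache.spark.sql.SparkSession

object AnsiShowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("ansi-show-sketch").getOrCreate()
    spark.conf.set("spark.sql.ansi.enabled", "true")

    // array(1, 2, 2) is an ArrayType column; show() needs CAST(... AS STRING) to
    // render it, which fails with an AnalysisException on the affected versions.
    spark.sql("select array(1, 2, 2)").show(false)

    spark.stop()
  }
}
{code}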



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34856) ANSI mode: Allow casting complex types as string type

2021-03-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311508#comment-17311508
 ] 

Apache Spark commented on SPARK-34856:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/32004

> ANSI mode: Allow casting complex types as string type
> -
>
> Key: SPARK-34856
> URL: https://issues.apache.org/jira/browse/SPARK-34856
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently, complex types are not allowed to be cast as string type. This 
> breaks the Dataset.show() API. E.g.
> {code:java}
> scala> sql("select array(1, 2, 2)").show(false)
> org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`array(1, 2, 2)` 
> AS STRING)' due to data type mismatch:
>  cannot cast array<int> to string with ANSI mode on.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34906) Refactor TreeNode's children handling methods into specialized traits

2021-03-30 Thread Ali Afroozeh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ali Afroozeh updated SPARK-34906:
-
Description: 
Spark query plan node hierarchy has specialized traits (or abstract classes) 
for handling nodes with fixed number of children, for example 
`UnaryExpression`, `UnaryNode` and `UnaryExec` for representing an expression, 
a logical plan and a physical plan with child, respectively. This PR refactors 
the `TreeNode` hierarchy by extracting the children handling functionality into 
the following traits. The former nodes such as `UnaryExpression` now extend the 
corresponding new trait:

{{trait LeafLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  override final def children: Seq[T] = Nil}}
 {{}}}

{{trait UnaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  def child: T}}
 {{  @transient override final lazy val children: Seq[T] = child :: Nil}}
 {{}}}

{{trait BinaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  def left: T}}
 {{  def right: T}}
 {{  @transient override final lazy val children: Seq[T] = left :: right :: 
Nil}}
 {{}}}

{{trait TernaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  def first: T}}
 {{  def second: T}}
 {{  def third: T}}
 {{  @transient override final lazy val children: Seq[T] = first :: second :: 
third :: Nil}}
 {{}}}

 

This refactoring, which is part of a bigger effort to make tree transformations 
in Spark more efficient, has two benefits:
 * It moves the children handling to a single place, instead of being spread in 
specific subclasses, which will help the future optimizations for tree 
traversals.
 * It allows to mix in these traits with some concrete node types that could 
not extend the previous classes. For example, expressions with one child that 
extend `AggregateFunction` cannot extend `UnaryExpression` as 
`AggregateFunction` defines the `foldable` method final while `UnaryExpression` 
defines it as non final. With the new traits, we can directly extend the 
concrete class from `UnaryLike` in these cases. Classes with more specific 
child handling will make tree traversal methods faster.

In this PR we have also updated many concrete node types to extend these traits 
to benefit from more specific child handling.

  was:
Spark query plan node hierarchy has specialized traits (or abstract classes) 
for handling nodes with fixed number of children, for example 
`UnaryExpression`, `UnaryNode` and `UnaryExec` for representing an expression, 
a logical plan and a physical plan with child, respectively. This PR refactors 
the `TreeNode` hierarchy by extracting the children handling functionality into 
the following traits. The former nodes such as `UnaryExpression` now extend the 
corresponding new trait:


{{trait LeafLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
{{  override final def children: Seq[T] = Nil}}
{{}}}

{{trait UnaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
{{  def child: T}}
{{  @transient override final lazy val children: Seq[T] = child :: Nil}}
{{}}}

{{trait BinaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
{{  def left: T}}
{{  def right: T}}
{{  @transient override final lazy val children: Seq[T] = left :: right :: Nil}}
{{}}}

{{trait TernaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
{{  def first: T}}
{{  def second: T}}
{{  def third: T}}
{{  @transient override final lazy val children: Seq[T] = first :: second :: 
third :: Nil}}
{{}}}
 * This refactoring, which is part of a bigger effort to make tree 
transformations in Spark more efficient, has two benefits:
It moves the children handling to a single place, instead of being spread in 
specific subclasses, which will help the future optimizations for tree 
traversals.
 * It allows to mix in these traits with some concrete node types that could 
not extend the previous classes. For example, expressions with one child that 
extend `AggregateFunction` cannot extend `UnaryExpression` as 
`AggregateFunction` defines the `foldable` method final while `UnaryExpression` 
defines it as non final. With the new traits, we can directly extend the 
concrete class from `UnaryLike` in these cases. Classes with more specific 
child handling will make tree traversal methods faster.

In this PR we have also updated many concrete node types to extend these traits 
to benefit from more specific child handling.


> Refactor TreeNode's children handling methods into specialized traits
> -
>
> Key: SPARK-34906
> URL: https://issues.apache.org/jira/browse/SPARK-34906
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Ali Afroozeh
>Priority: Major
>
> Spark query plan node hierarchy has specialized traits (or abstract classes) 
> for handling nodes with fixed number of 

[jira] [Created] (SPARK-34906) Refactor TreeNode's children handling methods into specialized traits

2021-03-30 Thread Ali Afroozeh (Jira)
Ali Afroozeh created SPARK-34906:


 Summary: Refactor TreeNode's children handling methods into 
specialized traits
 Key: SPARK-34906
 URL: https://issues.apache.org/jira/browse/SPARK-34906
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.1
Reporter: Ali Afroozeh


Spark query plan node hierarchy has specialized traits (or abstract classes) 
for handling nodes with fixed number of children, for example 
`UnaryExpression`, `UnaryNode` and `UnaryExec` for representing an expression, 
a logical plan and a physical plan with child, respectively. This PR refactors 
the `TreeNode` hierarchy by extracting the children handling functionality into 
the following traits. The former nodes such as `UnaryExpression` now extend the 
corresponding new trait:


{{trait LeafLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
{{  override final def children: Seq[T] = Nil}}
{{}}}

{{trait UnaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
{{  def child: T}}
{{  @transient override final lazy val children: Seq[T] = child :: Nil}}
{{}}}

{{trait BinaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
{{  def left: T}}
{{  def right: T}}
{{  @transient override final lazy val children: Seq[T] = left :: right :: Nil}}
{{}}}

{{trait TernaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
{{  def first: T}}
{{  def second: T}}
{{  def third: T}}
{{  @transient override final lazy val children: Seq[T] = first :: second :: 
third :: Nil}}
{{}}}
 * This refactoring, which is part of a bigger effort to make tree 
transformations in Spark more efficient, has two benefits:
It moves the children handling to a single place, instead of being spread in 
specific subclasses, which will help the future optimizations for tree 
traversals.
 * It allows to mix in these traits with some concrete node types that could 
not extend the previous classes. For example, expressions with one child that 
extend `AggregateFunction` cannot extend `UnaryExpression` as 
`AggregateFunction` defines the `foldable` method final while `UnaryExpression` 
defines it as non final. With the new traits, we can directly extend the 
concrete class from `UnaryLike` in these cases. Classes with more specific 
child handling will make tree traversal methods faster.

In this PR we have also updated many concrete node types to extend these traits 
to benefit from more specific child handling.
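As a small illustration, here is a self-contained, hypothetical sketch (toy types, not Spark's actual TreeNode hierarchy) of how a concrete node mixes in one of the new traits instead of overriding children by hand:

{code:scala}
// Toy hierarchy for illustration only.
trait TreeNode[T <: TreeNode[T]] { def children: Seq[T] }

trait LeafLike[T <: TreeNode[T]] { self: TreeNode[T] =>
  override final def children: Seq[T] = Nil
}

trait UnaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>
  def child: T
  @transient override final lazy val children: Seq[T] = child :: Nil
}

// Hypothetical expression nodes: only `child` is declared, `children` comes from the trait.
sealed trait Expr extends TreeNode[Expr]
case class Lit(value: Int) extends Expr with LeafLike[Expr]
case class Negate(child: Expr) extends Expr with UnaryLike[Expr]

object TraitSketch {
  def main(args: Array[String]): Unit =
    println(Negate(Lit(1)).children) // List(Lit(1))
}
{code}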



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34906) Refactor TreeNode's children handling methods into specialized traits

2021-03-30 Thread Ali Afroozeh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ali Afroozeh updated SPARK-34906:
-
Description: 
Spark query plan node hierarchy has specialized traits (or abstract classes) 
for handling nodes with fixed number of children, for example 
`UnaryExpression`, `UnaryNode` and `UnaryExec` for representing an expression, 
a logical plan and a physical plan with only one child, respectively. This PR 
refactors the `TreeNode` hierarchy by extracting the children handling 
functionality into the following traits. The former nodes such as 
`UnaryExpression` now extend the corresponding new trait:

{{trait LeafLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  override final def children: Seq[T] = Nil}}
 {{}}}

{{trait UnaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  def child: T}}
 {{  @transient override final lazy val children: Seq[T] = child :: Nil}}
 {{}}}

{{trait BinaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  def left: T}}
 {{  def right: T}}
 {{  @transient override final lazy val children: Seq[T] = left :: right :: 
Nil}}
 {{}}}

{{trait TernaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  def first: T}}
 {{  def second: T}}
 {{  def third: T}}
 {{  @transient override final lazy val children: Seq[T] = first :: second :: 
third :: Nil}}
 {{}}}

 

This refactoring, which is part of a bigger effort to make tree transformations 
in Spark more efficient, has two benefits:
 * It moves the children handling to a single place, instead of being spread in 
specific subclasses, which will help the future optimizations for tree 
traversals.
 * It allows to mix in these traits with some concrete node types that could 
not extend the previous classes. For example, expressions with one child that 
extend `AggregateFunction` cannot extend `UnaryExpression` as 
`AggregateFunction` defines the `foldable` method final while `UnaryExpression` 
defines it as non final. With the new traits, we can directly extend the 
concrete class from `UnaryLike` in these cases. Classes with more specific 
child handling will make tree traversal methods faster.

In this PR we have also updated many concrete node types to extend these traits 
to benefit from more specific child handling.

  was:
Spark query plan node hierarchy has specialized traits (or abstract classes) 
for handling nodes with fixed number of children, for example 
`UnaryExpression`, `UnaryNode` and `UnaryExec` for representing an expression, 
a logical plan and a physical plan with child, respectively. This PR refactors 
the `TreeNode` hierarchy by extracting the children handling functionality into 
the following traits. The former nodes such as `UnaryExpression` now extend the 
corresponding new trait:

{{trait LeafLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  override final def children: Seq[T] = Nil}}
 {{}}}

{{trait UnaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  def child: T}}
 {{  @transient override final lazy val children: Seq[T] = child :: Nil}}
 {{}}}

{{trait BinaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  def left: T}}
 {{  def right: T}}
 {{  @transient override final lazy val children: Seq[T] = left :: right :: 
Nil}}
 {{}}}

{{trait TernaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  def first: T}}
 {{  def second: T}}
 {{  def third: T}}
 {{  @transient override final lazy val children: Seq[T] = first :: second :: 
third :: Nil}}
 {{}}}

 

This refactoring, which is part of a bigger effort to make tree transformations 
in Spark more efficient, has two benefits:
 * It moves the children handling to a single place, instead of being spread in 
specific subclasses, which will help the future optimizations for tree 
traversals.
 * It allows to mix in these traits with some concrete node types that could 
not extend the previous classes. For example, expressions with one child that 
extend `AggregateFunction` cannot extend `UnaryExpression` as 
`AggregateFunction` defines the `foldable` method final while `UnaryExpression` 
defines it as non final. With the new traits, we can directly extend the 
concrete class from `UnaryLike` in these cases. Classes with more specific 
child handling will make tree traversal methods faster.

In this PR we have also updated many concrete node types to extend these traits 
to benefit from more specific child handling.


> Refactor TreeNode's children handling methods into specialized traits
> -
>
> Key: SPARK-34906
> URL: https://issues.apache.org/jira/browse/SPARK-34906
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Ali Afroozeh
>Priority: Major
>
> Spark query plan node hierarchy has specialized traits (or abstract classes) 
> for handling 

[jira] [Commented] (SPARK-34906) Refactor TreeNode's children handling methods into specialized traits

2021-03-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311535#comment-17311535
 ] 

Apache Spark commented on SPARK-34906:
--

User 'dbaliafroozeh' has created a pull request for this issue:
https://github.com/apache/spark/pull/31932

> Refactor TreeNode's children handling methods into specialized traits
> -
>
> Key: SPARK-34906
> URL: https://issues.apache.org/jira/browse/SPARK-34906
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Ali Afroozeh
>Priority: Major
>
> Spark query plan node hierarchy has specialized traits (or abstract classes) 
> for handling nodes with fixed number of children, for example 
> `UnaryExpression`, `UnaryNode` and `UnaryExec` for representing an 
> expression, a logical plan and a physical plan with only one child, 
> respectively. This PR refactors the `TreeNode` hierarchy by extracting the 
> children handling functionality into the following traits. The former nodes 
> such as `UnaryExpression` now extend the corresponding new trait:
> {{trait LeafLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
>  {{  override final def children: Seq[T] = Nil}}
>  {{}}}
> {{trait UnaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
>  {{  def child: T}}
>  {{  @transient override final lazy val children: Seq[T] = child :: Nil}}
>  {{}}}
> {{trait BinaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
>  {{  def left: T}}
>  {{  def right: T}}
>  {{  @transient override final lazy val children: Seq[T] = left :: right :: 
> Nil}}
>  {{}}}
> {{trait TernaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
>  {{  def first: T}}
>  {{  def second: T}}
>  {{  def third: T}}
>  {{  @transient override final lazy val children: Seq[T] = first :: second :: 
> third :: Nil}}
>  {{}}}
>  
> This refactoring, which is part of a bigger effort to make tree 
> transformations in Spark more efficient, has two benefits:
>  * It moves the children handling to a single place, instead of being spread 
> in specific subclasses, which will help the future optimizations for tree 
> traversals.
>  * It allows to mix in these traits with some concrete node types that could 
> not extend the previous classes. For example, expressions with one child that 
> extend `AggregateFunction` cannot extend `UnaryExpression` as 
> `AggregateFunction` defines the `foldable` method final while 
> `UnaryExpression` defines it as non final. With the new traits, we can 
> directly extend the concrete class from `UnaryLike` in these cases. Classes 
> with more specific child handling will make tree traversal methods faster.
> In this PR we have also updated many concrete node types to extend these 
> traits to benefit from more specific child handling.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34906) Refactor TreeNode's children handling methods into specialized traits

2021-03-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34906:


Assignee: (was: Apache Spark)

> Refactor TreeNode's children handling methods into specialized traits
> -
>
> Key: SPARK-34906
> URL: https://issues.apache.org/jira/browse/SPARK-34906
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Ali Afroozeh
>Priority: Major
>
> Spark query plan node hierarchy has specialized traits (or abstract classes) 
> for handling nodes with fixed number of children, for example 
> `UnaryExpression`, `UnaryNode` and `UnaryExec` for representing an 
> expression, a logical plan and a physical plan with only one child, 
> respectively. This PR refactors the `TreeNode` hierarchy by extracting the 
> children handling functionality into the following traits. The former nodes 
> such as `UnaryExpression` now extend the corresponding new trait:
> {{trait LeafLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
>  {{  override final def children: Seq[T] = Nil}}
>  {{}}}
> {{trait UnaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
>  {{  def child: T}}
>  {{  @transient override final lazy val children: Seq[T] = child :: Nil}}
>  {{}}}
> {{trait BinaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
>  {{  def left: T}}
>  {{  def right: T}}
>  {{  @transient override final lazy val children: Seq[T] = left :: right :: 
> Nil}}
>  {{}}}
> {{trait TernaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
>  {{  def first: T}}
>  {{  def second: T}}
>  {{  def third: T}}
>  {{  @transient override final lazy val children: Seq[T] = first :: second :: 
> third :: Nil}}
>  {{}}}
>  
> This refactoring, which is part of a bigger effort to make tree 
> transformations in Spark more efficient, has two benefits:
>  * It moves the children handling to a single place, instead of being spread 
> in specific subclasses, which will help the future optimizations for tree 
> traversals.
>  * It allows to mix in these traits with some concrete node types that could 
> not extend the previous classes. For example, expressions with one child that 
> extend `AggregateFunction` cannot extend `UnaryExpression` as 
> `AggregateFunction` defines the `foldable` method final while 
> `UnaryExpression` defines it as non final. With the new traits, we can 
> directly extend the concrete class from `UnaryLike` in these cases. Classes 
> with more specific child handling will make tree traversal methods faster.
> In this PR we have also updated many concrete node types to extend these 
> traits to benefit from more specific child handling.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34906) Refactor TreeNode's children handling methods into specialized traits

2021-03-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34906:


Assignee: Apache Spark

> Refactor TreeNode's children handling methods into specialized traits
> -
>
> Key: SPARK-34906
> URL: https://issues.apache.org/jira/browse/SPARK-34906
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Ali Afroozeh
>Assignee: Apache Spark
>Priority: Major
>
> Spark query plan node hierarchy has specialized traits (or abstract classes) 
> for handling nodes with fixed number of children, for example 
> `UnaryExpression`, `UnaryNode` and `UnaryExec` for representing an 
> expression, a logical plan and a physical plan with only one child, 
> respectively. This PR refactors the `TreeNode` hierarchy by extracting the 
> children handling functionality into the following traits. The former nodes 
> such as `UnaryExpression` now extend the corresponding new trait:
> {{trait LeafLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
>  {{  override final def children: Seq[T] = Nil}}
>  {{}}}
> {{trait UnaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
>  {{  def child: T}}
>  {{  @transient override final lazy val children: Seq[T] = child :: Nil}}
>  {{}}}
> {{trait BinaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
>  {{  def left: T}}
>  {{  def right: T}}
>  {{  @transient override final lazy val children: Seq[T] = left :: right :: 
> Nil}}
>  {{}}}
> {{trait TernaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
>  {{  def first: T}}
>  {{  def second: T}}
>  {{  def third: T}}
>  {{  @transient override final lazy val children: Seq[T] = first :: second :: 
> third :: Nil}}
>  {{}}}
>  
> This refactoring, which is part of a bigger effort to make tree 
> transformations in Spark more efficient, has two benefits:
>  * It moves the children handling to a single place, instead of being spread 
> in specific subclasses, which will help the future optimizations for tree 
> traversals.
>  * It allows to mix in these traits with some concrete node types that could 
> not extend the previous classes. For example, expressions with one child that 
> extend `AggregateFunction` cannot extend `UnaryExpression` as 
> `AggregateFunction` defines the `foldable` method final while 
> `UnaryExpression` defines it as non final. With the new traits, we can 
> directly extend the concrete class from `UnaryLike` in these cases. Classes 
> with more specific child handling will make tree traversal methods faster.
> In this PR we have also updated many concrete node types to extend these 
> traits to benefit from more specific child handling.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34906) Refactor TreeNode's children handling methods into specialized traits

2021-03-30 Thread Ali Afroozeh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ali Afroozeh updated SPARK-34906:
-
Description: 
Spark query plan node hierarchy has specialized traits (or abstract classes) 
for handling nodes with fixed number of children, for example UnaryExpression, 
UnaryNode and UnaryExec for representing an expression, a logical plan and a 
physical plan with only one child, respectively. This PR refactors the 
`TreeNode` hierarchy by extracting the children handling functionality into the 
following traits. The former nodes such as UnaryExpression now extend the 
corresponding new trait:

{{trait LeafLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  override final def children: Seq[T] = Nil}}
 {{}}}

{{trait UnaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  def child: T}}
 {{  @transient override final lazy val children: Seq[T] = child :: Nil}}
 {{}}}

{{trait BinaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  def left: T}}
 {{  def right: T}}
 {{  @transient override final lazy val children: Seq[T] = left :: right :: 
Nil}}
 {{}}}

{{trait TernaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  def first: T}}
 {{  def second: T}}
 {{  def third: T}}
 {{  @transient override final lazy val children: Seq[T] = first :: second :: 
third :: Nil}}
 {{}}}

 

This refactoring, which is part of a bigger effort to make tree transformations 
in Spark more efficient, has two benefits:
 * It moves the children handling to a single place, instead of being spread in 
specific subclasses, which will help the future optimizations for tree 
traversals.
 * It allows to mix in these traits with some concrete node types that could 
not extend the previous classes. For example, expressions with one child that 
extend AggregateFunction cannot extend UnaryExpression as AggregateFunction 
defines the foldable method final while UnaryExpression defines it as non 
final. With the new traits, we can directly extend the concrete class from 
UnaryLike in these cases. Classes with more specific child handling will make 
tree traversal methods faster.

In this PR we have also updated many concrete node types to extend these traits 
to benefit from more specific child handling.

  was:
Spark query plan node hierarchy has specialized traits (or abstract classes) 
for handling nodes with fixed number of children, for example 
`UnaryExpression`, `UnaryNode` and `UnaryExec` for representing an expression, 
a logical plan and a physical plan with only one child, respectively. This PR 
refactors the `TreeNode` hierarchy by extracting the children handling 
functionality into the following traits. The former nodes such as 
`UnaryExpression` now extend the corresponding new trait:

{{trait LeafLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  override final def children: Seq[T] = Nil}}
 {{}}}

{{trait UnaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  def child: T}}
 {{  @transient override final lazy val children: Seq[T] = child :: Nil}}
 {{}}}

{{trait BinaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  def left: T}}
 {{  def right: T}}
 {{  @transient override final lazy val children: Seq[T] = left :: right :: 
Nil}}
 {{}}}

{{trait TernaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  def first: T}}
 {{  def second: T}}
 {{  def third: T}}
 {{  @transient override final lazy val children: Seq[T] = first :: second :: 
third :: Nil}}
 {{}}}

 

This refactoring, which is part of a bigger effort to make tree transformations 
in Spark more efficient, has two benefits:
 * It moves the children handling to a single place, instead of being spread in 
specific subclasses, which will help the future optimizations for tree 
traversals.
 * It allows to mix in these traits with some concrete node types that could 
not extend the previous classes. For example, expressions with one child that 
extend `AggregateFunction` cannot extend `UnaryExpression` as 
`AggregateFunction` defines the `foldable` method final while `UnaryExpression` 
defines it as non final. With the new traits, we can directly extend the 
concrete class from `UnaryLike` in these cases. Classes with more specific 
child handling will make tree traversal methods faster.

In this PR we have also updated many concrete node types to extend these traits 
to benefit from more specific child handling.


> Refactor TreeNode's children handling methods into specialized traits
> -
>
> Key: SPARK-34906
> URL: https://issues.apache.org/jira/browse/SPARK-34906
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Ali Afroozeh
>Priority: Major
>
> Spark query plan node hierarchy has specialized traits (or abstract classes) 
> for handling nodes with 

[jira] [Updated] (SPARK-34906) Refactor TreeNode's children handling methods into specialized traits

2021-03-30 Thread Ali Afroozeh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ali Afroozeh updated SPARK-34906:
-
Description: 
Spark query plan node hierarchy has specialized traits (or abstract classes) 
for handling nodes with fixed number of children, for example UnaryExpression, 
UnaryNode and UnaryExec for representing an expression, a logical plan and a 
physical plan with only one child, respectively. This PR refactors the TreeNode 
hierarchy by extracting the children handling functionality into the following 
traits. The former nodes such as UnaryExpression now extend the corresponding 
new trait:

{{trait LeafLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  override final def children: Seq[T] = Nil}}
 {{}}}

{{trait UnaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  def child: T}}
 {{  @transient override final lazy val children: Seq[T] = child :: Nil}}
 {{}}}

{{trait BinaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  def left: T}}
 {{  def right: T}}
 {{  @transient override final lazy val children: Seq[T] = left :: right :: 
Nil}}
 {{}}}

{{trait TernaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  def first: T}}
 {{  def second: T}}
 {{  def third: T}}
 {{  @transient override final lazy val children: Seq[T] = first :: second :: 
third :: Nil}}
 {{}}}

 

This refactoring, which is part of a bigger effort to make tree transformations 
in Spark more efficient, has two benefits:
 * It moves the children handling to a single place, instead of being spread in 
specific subclasses, which will help the future optimizations for tree 
traversals.
 * It allows to mix in these traits with some concrete node types that could 
not extend the previous classes. For example, expressions with one child that 
extend AggregateFunction cannot extend UnaryExpression as AggregateFunction 
defines the foldable method final while UnaryExpression defines it as non 
final. With the new traits, we can directly extend the concrete class from 
UnaryLike in these cases. Classes with more specific child handling will make 
tree traversal methods faster.

In this PR we have also updated many concrete node types to extend these traits 
to benefit from more specific child handling.

  was:
Spark query plan node hierarchy has specialized traits (or abstract classes) 
for handling nodes with fixed number of children, for example UnaryExpression, 
UnaryNode and UnaryExec for representing an expression, a logical plan and a 
physical plan with only one child, respectively. This PR refactors the 
`TreeNode` hierarchy by extracting the children handling functionality into the 
following traits. The former nodes such as UnaryExpression now extend the 
corresponding new trait:

{{trait LeafLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  override final def children: Seq[T] = Nil}}
 {{}}}

{{trait UnaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  def child: T}}
 {{  @transient override final lazy val children: Seq[T] = child :: Nil}}
 {{}}}

{{trait BinaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  def left: T}}
 {{  def right: T}}
 {{  @transient override final lazy val children: Seq[T] = left :: right :: 
Nil}}
 {{}}}

{{trait TernaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  def first: T}}
 {{  def second: T}}
 {{  def third: T}}
 {{  @transient override final lazy val children: Seq[T] = first :: second :: 
third :: Nil}}
 {{}}}

 

This refactoring, which is part of a bigger effort to make tree transformations 
in Spark more efficient, has two benefits:
 * It moves the children handling to a single place, instead of being spread in 
specific subclasses, which will help the future optimizations for tree 
traversals.
 * It allows to mix in these traits with some concrete node types that could 
not extend the previous classes. For example, expressions with one child that 
extend AggregateFunction cannot extend UnaryExpression as AggregateFunction 
defines the foldable method final while UnaryExpression defines it as non 
final. With the new traits, we can directly extend the concrete class from 
UnaryLike in these cases. Classes with more specific child handling will make 
tree traversal methods faster.

In this PR we have also updated many concrete node types to extend these traits 
to benefit from more specific child handling.


> Refactor TreeNode's children handling methods into specialized traits
> -
>
> Key: SPARK-34906
> URL: https://issues.apache.org/jira/browse/SPARK-34906
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Ali Afroozeh
>Priority: Major
>
> Spark query plan node hierarchy has specialized traits (or abstract classes) 
> for handling nodes with fixed number of childr

[jira] [Assigned] (SPARK-34899) Use origin plan if can not coalesce shuffle partition

2021-03-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-34899:
---

Assignee: ulysses you

> Use origin plan if can not coalesce shuffle partition
> -
>
> Key: SPARK-34899
> URL: https://issues.apache.org/jira/browse/SPARK-34899
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
>
> `CoalesceShufflePartitions` cannot coalesce in such a case: when the total 
> size of the mappers' shuffle partitions is big enough. It is then confusing to 
> use a `CustomShuffleReaderExec` that is marked as `coalesced` but has no 
> effect on the partition number.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34899) Use origin plan if can not coalesce shuffle partition

2021-03-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34899.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31994
[https://github.com/apache/spark/pull/31994]

> Use origin plan if can not coalesce shuffle partition
> -
>
> Key: SPARK-34899
> URL: https://issues.apache.org/jira/browse/SPARK-34899
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
> Fix For: 3.2.0
>
>
> `CoalesceShufflePartitions` cannot coalesce in such a case: when the total 
> size of the mappers' shuffle partitions is big enough. It is then confusing to 
> use a `CustomShuffleReaderExec` that is marked as `coalesced` but has no 
> effect on the partition number.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34906) Refactor TreeNode's children handling methods into specialized traits

2021-03-30 Thread Ali Afroozeh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ali Afroozeh updated SPARK-34906:
-
Description: 
Spark query plan node hierarchy has specialized traits (or abstract classes) 
for handling nodes with fixed number of children, for example UnaryExpression, 
UnaryNode and UnaryExec for representing an expression, a logical plan and a 
physical plan with only one child, respectively. This PR refactors the TreeNode 
hierarchy by extracting the children handling functionality into the following 
traits. UnaryExpression` and other similar classes now extend the corresponding 
new trait:

{{trait LeafLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  override final def children: Seq[T] = Nil}}
 {{}}}

{{trait UnaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  def child: T}}
 {{  @transient override final lazy val children: Seq[T] = child :: Nil}}
 {{}}}

{{trait BinaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  def left: T}}
 {{  def right: T}}
 {{  @transient override final lazy val children: Seq[T] = left :: right :: 
Nil}}
 {{}}}

{{trait TernaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  def first: T}}
 {{  def second: T}}
 {{  def third: T}}
 {{  @transient override final lazy val children: Seq[T] = first :: second :: 
third :: Nil}}
 {{}}}

 

This refactoring, which is part of a bigger effort to make tree transformations 
in Spark more efficient, has two benefits:
 * It moves children handling to a single place, instead of spreading it across 
specific subclasses, which will help future optimizations for tree traversals.
 * It allows these traits to be mixed into concrete node types that could not 
extend the previous classes. For example, expressions with one child that extend 
AggregateFunction cannot extend UnaryExpression, as AggregateFunction defines the 
foldable method as final while UnaryExpression defines it as non-final. With the 
new traits, we can directly extend the concrete class from UnaryLike in these 
cases. Classes with more specific child handling will make tree traversal methods 
faster.

In this PR we have also updated many concrete node types to extend these traits 
to benefit from more specific child handling.

  was:
Spark query plan node hierarchy has specialized traits (or abstract classes) 
for handling nodes with fixed number of children, for example UnaryExpression, 
UnaryNode and UnaryExec for representing an expression, a logical plan and a 
physical plan with only one child, respectively. This PR refactors the TreeNode 
hierarchy by extracting the children handling functionality into the following 
traits. The former nodes such as UnaryExpression now extend the corresponding 
new trait:

{{trait LeafLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  override final def children: Seq[T] = Nil}}
 {{}}}

{{trait UnaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  def child: T}}
 {{  @transient override final lazy val children: Seq[T] = child :: Nil}}
 {{}}}

{{trait BinaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  def left: T}}
 {{  def right: T}}
 {{  @transient override final lazy val children: Seq[T] = left :: right :: 
Nil}}
 {{}}}

{{trait TernaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
 {{  def first: T}}
 {{  def second: T}}
 {{  def third: T}}
 {{  @transient override final lazy val children: Seq[T] = first :: second :: 
third :: Nil}}
 {{}}}

 

This refactoring, which is part of a bigger effort to make tree transformations 
in Spark more efficient, has two benefits:
 * It moves the children handling to a single place, instead of being spread in 
specific subclasses, which will help the future optimizations for tree 
traversals.
 * It allows to mix in these traits with some concrete node types that could 
not extend the previous classes. For example, expressions with one child that 
extend AggregateFunction cannot extend UnaryExpression as AggregateFunction 
defines the foldable method final while UnaryExpression defines it as non 
final. With the new traits, we can directly extend the concrete class from 
UnaryLike in these cases. Classes with more specific child handling will make 
tree traversal methods faster.

In this PR we have also updated many concrete node types to extend these traits 
to benefit from more specific child handling.


> Refactor TreeNode's children handling methods into specialized traits
> -
>
> Key: SPARK-34906
> URL: https://issues.apache.org/jira/browse/SPARK-34906
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Ali Afroozeh
>Priority: Major
>
> Spark query plan node hierarchy has specialized traits (or abstract classes) 
> for handling nodes with fixed number of childr

[jira] [Commented] (SPARK-34668) Support casting of day-time intervals to strings

2021-03-30 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311544#comment-17311544
 ] 

angerszhu commented on SPARK-34668:
---

[~maxgekk] Should we support cast String to DayTimeIntervalType too.


> Support casting of day-time intervals to strings
> 
>
> Key: SPARK-34668
> URL: https://issues.apache.org/jira/browse/SPARK-34668
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Extend the Cast expression and support DayTimeIntervalType in casting to 
> StringType.
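
For context, a hedged sketch of the kind of query this sub-task would enable. The 
ANSI day-time interval literal syntax is assumed here, and the exact string 
rendering is up to the Cast implementation, not asserted by this example:

{code:java}
// Assumes the ANSI day-time interval literal syntax; the exact output format
// is decided by the Cast implementation and is not asserted here.
spark.sql("SELECT CAST(INTERVAL '1 02:03:04' DAY TO SECOND AS STRING) AS s")
  .show(false)
{code}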



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34668) Support casting of day-time intervals to strings

2021-03-30 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311544#comment-17311544
 ] 

angerszhu edited comment on SPARK-34668 at 3/30/21, 2:15 PM:
-

[~maxgekk] Should we support cast String to DayTimeIntervalType too ?



was (Author: angerszhuuu):
[~maxgekk] Should we support cast String to DayTimeIntervalType too.


> Support casting of day-time intervals to strings
> 
>
> Key: SPARK-34668
> URL: https://issues.apache.org/jira/browse/SPARK-34668
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Extend the Cast expression and support DayTimeIntervalType in casting to 
> StringType.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32027) EventLoggingListener threw java.util.ConcurrentModificationException

2021-03-30 Thread Kristopher Kane (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311545#comment-17311545
 ] 

Kristopher Kane commented on SPARK-32027:
-

Possibly related and fixed with 
https://issues.apache.org/jira/browse/SPARK-34731
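
For reference, a standalone sketch (not Spark code) of the failure mode suggested 
by the stack trace below: iterating a java.util.Properties while another thread 
mutates it can throw ConcurrentModificationException, since the underlying 
Hashtable iterator is fail-fast.

{code:java}
import java.util.Properties

val props = new Properties()
(1 to 10000).foreach(i => props.setProperty(s"k$i", s"v$i"))

// Mutate the Properties from another thread while the main thread iterates it.
val writer = new Thread(() => (1 to 10000).foreach(i => props.setProperty(s"x$i", "y")))
writer.start()

val it = props.entrySet().iterator()
while (it.hasNext) it.next()   // may throw ConcurrentModificationException
writer.join()
{code}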

> EventLoggingListener threw  java.util.ConcurrentModificationException
> -
>
> Key: SPARK-32027
> URL: https://issues.apache.org/jira/browse/SPARK-32027
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> 20/06/18 20:22:25 ERROR AsyncEventQueue: Listener EventLoggingListener threw 
> an exception
> java.util.ConcurrentModificationException
>   at java.util.Hashtable$Enumerator.next(Hashtable.java:1387)
>   at 
> scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$6.next(Wrappers.scala:424)
>   at 
> scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$6.next(Wrappers.scala:420)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at org.apache.spark.util.JsonProtocol$.mapToJson(JsonProtocol.scala:568)
>   at 
> org.apache.spark.util.JsonProtocol$.$anonfun$propertiesToJson$1(JsonProtocol.scala:574)
>   at scala.Option.map(Option.scala:230)
>   at 
> org.apache.spark.util.JsonProtocol$.propertiesToJson(JsonProtocol.scala:573)
>   at 
> org.apache.spark.util.JsonProtocol$.jobStartToJson(JsonProtocol.scala:159)
>   at 
> org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:81)
>   at 
> org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:97)
>   at 
> org.apache.spark.scheduler.EventLoggingListener.onJobStart(EventLoggingListener.scala:159)
>   at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37)
>   at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>   at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:115)
>   at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:99)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
>   at 
> scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
>   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
> 20/06/18 20:22:25 ERROR AsyncEventQueue: Listener EventLoggingListener threw 
> an exception
> java.util.ConcurrentModificationException
>   at java.util.Hashtable$Enumerator.next(Hashtable.java:1387)
>   at 
> scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$6.next(Wrappers.scala:424)
>   at 
> scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$6.next(Wrappers.scala:420)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at org.apache.spark.util.JsonProtocol$.mapToJson(JsonPr

[jira] [Updated] (SPARK-34900) Some `spark-submit`  commands used to run benchmarks in the user's guide is wrong

2021-03-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34900:
-
Fix Version/s: 3.0.3
   3.1.2

> Some `spark-submit`  commands used to run benchmarks in the user's guide is 
> wrong
> -
>
> Key: SPARK-34900
> URL: https://issues.apache.org/jira/browse/SPARK-34900
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Trivial
> Fix For: 3.2.0, 3.1.2, 3.0.3
>
>
> For example, the guide for running JoinBenchmark as follows:
>  
> {code:java}
> /**
>  * Benchmark to measure performance for joins.
>  * To run this benchmark:
>  * {{{
>  *   1. without sbt:
>  *  bin/spark-submit --class  --jars  
> 
>  *   2. build/sbt "sql/test:runMain "
>  *   3. generate result:
>  *  SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain  class>"
>  *  Results will be written to "benchmarks/JoinBenchmark-results.txt".
>  * }}}
>  */
> object JoinBenchmark extends SqlBasedBenchmark {
> {code}
>  
>  
> but if we run JoinBenchmark with the command
>  
> {code:java}
> bin/spark-submit --class 
> org.apache.spark.sql.execution.benchmark.JoinBenchmark --jars 
> spark-core_2.12-3.2.0-SNAPSHOT-tests.jar 
> spark-sql_2.12-3.2.0-SNAPSHOT-tests.jar 
> {code}
>  
> The following exception will be thrown:
>  
> {code:java}
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/apache/spark/sql/catalyst/plans/SQLHelper
>   at java.lang.ClassLoader.defineClass1(Native Method)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
>   at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:369){code}
>  
> because the SqlBasedBenchmark trait extends BenchmarkBase and SQLHelper, and 
> SQLHelper is defined in spark-catalyst-tests.jar.
>  
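
A hedged guess at a working invocation, assuming the fix is to also put the 
catalyst tests jar (which contains SQLHelper) on --jars; the exact jar names 
depend on the build and are illustrative here:

{code:java}
bin/spark-submit \
  --class org.apache.spark.sql.execution.benchmark.JoinBenchmark \
  --jars spark-core_2.12-3.2.0-SNAPSHOT-tests.jar,spark-catalyst_2.12-3.2.0-SNAPSHOT-tests.jar \
  spark-sql_2.12-3.2.0-SNAPSHOT-tests.jar
{code}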



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34828) YARN Shuffle Service: Support configurability of aux service name and service-specific config overrides

2021-03-30 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-34828.
---
Fix Version/s: 3.2.0
 Assignee: Erik Krogen
   Resolution: Fixed

> YARN Shuffle Service: Support configurability of aux service name and 
> service-specific config overrides
> ---
>
> Key: SPARK-34828
> URL: https://issues.apache.org/jira/browse/SPARK-34828
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, YARN
>Affects Versions: 3.1.1
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: 3.2.0
>
>
> In some cases it may be desirable to run multiple instances of the Spark 
> Shuffle Service which are using different versions of Spark. This can be 
> helpful, for example, when running a YARN cluster with a mixed workload of 
> applications running multiple Spark versions, since a given version of the 
> shuffle service is not always compatible with other versions of Spark. (See 
> SPARK-27780 for more detail on this)
> YARN versions since 2.9.0 support the ability to run shuffle services within 
> an isolated classloader (see YARN-4577), meaning multiple Spark versions can 
> coexist within a single NodeManager.
> To support this from the Spark side, we need to make two enhancements:
> * Make the name of the shuffle service configurable. Currently it is 
> hard-coded to be {{spark_shuffle}} on both the client and server side. The 
> server-side name is not actually used anywhere, as it is the value within the 
> {{yarn.nodemanager.aux-services}} which is considered by the NodeManager to 
> be definitive name. However, if you change this in the configs, the 
> hard-coded name within the client will no longer match. So, this needs to be 
> configurable.
> * Add a way to separately configure the two shuffle service instances. Since 
> the configurations such as the port number are taken from the NodeManager 
> config, they will both try to use the same port, which obviously won't work. 
> So, we need to provide a way to selectively configure the two shuffle service 
> instances. I will go into details on my proposal for how to achieve this 
> within the PR.
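
For illustration, a hedged sketch of what such a deployment could look like. The 
second service name (spark_shuffle_3_2) and the client-side 
spark.shuffle.service.name key are illustrative assumptions, not the final 
configuration surface described in the PR:

{noformat}
<!-- yarn-site.xml: two shuffle service instances, e.g. one per Spark version -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>spark_shuffle,spark_shuffle_3_2</value>
</property>

# Spark application side: point the client at the matching instance (illustrative)
--conf spark.shuffle.service.enabled=true
--conf spark.shuffle.service.name=spark_shuffle_3_2
{noformat}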



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34907) Add main class that runs all benchmarks

2021-03-30 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-34907:


 Summary: Add main class that runs all benchmarks
 Key: SPARK-34907
 URL: https://issues.apache.org/jira/browse/SPARK-34907
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 3.2.0
Reporter: Hyukjin Kwon


This is related to SPARK-31471. It would be good if we had an automatic way to do 
it.
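
A minimal sketch of what such an entry point could look like (the object name and 
the benchmark list are hypothetical; the real implementation may discover 
benchmark classes automatically instead of hard-coding them):

{code:java}
// Hypothetical sketch, not the actual implementation: one entry point that
// forwards to each benchmark's main(). The benchmark list is illustrative.
object RunAllBenchmarks {
  def main(args: Array[String]): Unit = {
    val benchmarks: Seq[Array[String] => Unit] = Seq(
      org.apache.spark.sql.execution.benchmark.JoinBenchmark.main _,
      org.apache.spark.sql.execution.benchmark.AggregateBenchmark.main _)
    // Forward any CLI arguments to every benchmark in turn.
    benchmarks.foreach(runBenchmark => runBenchmark(args))
  }
}
{code}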



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31471) Add a script to run multiple benchmarks

2021-03-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31471.
--
Resolution: Duplicate

> Add a script to run multiple benchmarks
> ---
>
> Key: SPARK-31471
> URL: https://issues.apache.org/jira/browse/SPARK-31471
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Max Gekk
>Priority: Minor
>
> Add a python script to run multiple benchmarks. The script can be taken from 
> [https://github.com/apache/spark/pull/27078]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34907) Add main class that runs all benchmarks

2021-03-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311643#comment-17311643
 ] 

Apache Spark commented on SPARK-34907:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/32005

> Add main class that runs all benchmarks
> ---
>
> Key: SPARK-34907
> URL: https://issues.apache.org/jira/browse/SPARK-34907
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This is related to SPARK-31471. It should be good if we can have an automatic 
> way to do it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34907) Add main class that runs all benchmarks

2021-03-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34907:


Assignee: (was: Apache Spark)

> Add main class that runs all benchmarks
> ---
>
> Key: SPARK-34907
> URL: https://issues.apache.org/jira/browse/SPARK-34907
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This is related to SPARK-31471. It should be good if we can have an automatic 
> way to do it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34907) Add main class that runs all benchmarks

2021-03-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34907:


Assignee: Apache Spark

> Add main class that runs all benchmarks
> ---
>
> Key: SPARK-34907
> URL: https://issues.apache.org/jira/browse/SPARK-34907
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> This is related to SPARK-31471. It should be good if we can have an automatic 
> way to do it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34908) Add test cases for char and varchar with functions

2021-03-30 Thread Kent Yao (Jira)
Kent Yao created SPARK-34908:


 Summary: Add test cases for char and varchar with functions
 Key: SPARK-34908
 URL: https://issues.apache.org/jira/browse/SPARK-34908
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Kent Yao


Add test cases for char and varchar with functions to show the behavior
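
A hedged sketch of the kind of queries such tests could exercise; the table name 
is illustrative and results are deliberately not asserted here, since documenting 
the behavior is exactly the point of the ticket:

{code:java}
// Illustrative queries only; results are not asserted.
spark.sql("CREATE TABLE char_tbl(c CHAR(5), v VARCHAR(5)) USING parquet")
spark.sql("INSERT INTO char_tbl VALUES ('abc', 'abc')")
// How do built-in functions see the padded CHAR value vs. the VARCHAR value?
spark.sql(
  "SELECT length(c), length(v), concat(c, '|'), concat(v, '|') FROM char_tbl"
).show(false)
{code}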



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34908) Add test cases for char and varchar with functions

2021-03-30 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-34908:
-
Priority: Minor  (was: Major)

> Add test cases for char and varchar with functions
> --
>
> Key: SPARK-34908
> URL: https://issues.apache.org/jira/browse/SPARK-34908
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Priority: Minor
>
> Add test cases for char and varchar with functions to show the behavior



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34909) conv() does not convert negative inputs to unsigned correctly

2021-03-30 Thread Tim Armstrong (Jira)
Tim Armstrong created SPARK-34909:
-

 Summary: conv() does not convert negative inputs to unsigned 
correctly
 Key: SPARK-34909
 URL: https://issues.apache.org/jira/browse/SPARK-34909
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: Tim Armstrong


{noformat}
scala> spark.sql("select conv('-10', 11, 7)").show(20, 150)
+---+
|   conv(-10, 11, 7)|
+---+
|4501202152252313413456|
+---+
scala> spark.sql("select hex(conv('-10', 11, 7))").show(20, 150)
+--+
| hex(conv(-10, 11, 7))|
+--+
|3435303132303231353232353233313334313334353600|
+--+
{noformat}

The correct result is 45012021522523134134555. The above output has an 
incorrect second-to-last digit (6 instead of 5), and the last digit is a 
non-printing character, the null byte.

I tracked the bug down to NumberConverter.unsignedLongDiv returning incorrect 
results. I tried replacing with java.lang.Long.divideUnsigned and that fixed it.
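
A small Scala snippet illustrating the difference: signed division mishandles 
longs whose top bit is set (i.e. "negative" values reinterpreted as unsigned), 
which is exactly the conv('-10', ...) case. The values in the comments are what 
the JDK unsigned helpers return for -10 reinterpreted as an unsigned 64-bit value.

{code:java}
val v = -10L  // as an unsigned 64-bit value: 18446744073709551606

println(v / 7L)                                   // -1 (signed division, wrong here)
println(java.lang.Long.divideUnsigned(v, 7L))     // 2635249153387078800
println(java.lang.Long.remainderUnsigned(v, 7L))  // 6
{code}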



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34909) conv() does not convert negative inputs to unsigned correctly

2021-03-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34909:


Assignee: Apache Spark

> conv() does not convert negative inputs to unsigned correctly
> -
>
> Key: SPARK-34909
> URL: https://issues.apache.org/jira/browse/SPARK-34909
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Tim Armstrong
>Assignee: Apache Spark
>Priority: Major
>
> {noformat}
> scala> spark.sql("select conv('-10', 11, 7)").show(20, 150)
> +---+
> |   conv(-10, 11, 7)|
> +---+
> |4501202152252313413456|
> +---+
> scala> spark.sql("select hex(conv('-10', 11, 7))").show(20, 150)
> +--+
> | hex(conv(-10, 11, 7))|
> +--+
> |3435303132303231353232353233313334313334353600|
> +--+
> {noformat}
> The correct result is 45012021522523134134555. The above output has an 
> incorrect second-to-last digit (6 instead of 5) and the last digit is a 
> non-printing character the null byte.
> I tracked the bug down to NumberConverter.unsignedLongDiv returning incorrect 
> results. I tried replacing with java.lang.Long.divideUnsigned and that fixed 
> it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34909) conv() does not convert negative inputs to unsigned correctly

2021-03-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311699#comment-17311699
 ] 

Apache Spark commented on SPARK-34909:
--

User 'timarmstrong' has created a pull request for this issue:
https://github.com/apache/spark/pull/32006

> conv() does not convert negative inputs to unsigned correctly
> -
>
> Key: SPARK-34909
> URL: https://issues.apache.org/jira/browse/SPARK-34909
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Tim Armstrong
>Priority: Major
>
> {noformat}
> scala> spark.sql("select conv('-10', 11, 7)").show(20, 150)
> +---+
> |   conv(-10, 11, 7)|
> +---+
> |4501202152252313413456|
> +---+
> scala> spark.sql("select hex(conv('-10', 11, 7))").show(20, 150)
> +--+
> | hex(conv(-10, 11, 7))|
> +--+
> |3435303132303231353232353233313334313334353600|
> +--+
> {noformat}
> The correct result is 45012021522523134134555. The above output has an 
> incorrect second-to-last digit (6 instead of 5) and the last digit is a 
> non-printing character the null byte.
> I tracked the bug down to NumberConverter.unsignedLongDiv returning incorrect 
> results. I tried replacing with java.lang.Long.divideUnsigned and that fixed 
> it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34909) conv() does not convert negative inputs to unsigned correctly

2021-03-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34909:


Assignee: (was: Apache Spark)

> conv() does not convert negative inputs to unsigned correctly
> -
>
> Key: SPARK-34909
> URL: https://issues.apache.org/jira/browse/SPARK-34909
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Tim Armstrong
>Priority: Major
>
> {noformat}
> scala> spark.sql("select conv('-10', 11, 7)").show(20, 150)
> +---+
> |   conv(-10, 11, 7)|
> +---+
> |4501202152252313413456|
> +---+
> scala> spark.sql("select hex(conv('-10', 11, 7))").show(20, 150)
> +--+
> | hex(conv(-10, 11, 7))|
> +--+
> |3435303132303231353232353233313334313334353600|
> +--+
> {noformat}
> The correct result is 45012021522523134134555. The above output has an 
> incorrect second-to-last digit (6 instead of 5) and the last digit is a 
> non-printing character the null byte.
> I tracked the bug down to NumberConverter.unsignedLongDiv returning incorrect 
> results. I tried replacing with java.lang.Long.divideUnsigned and that fixed 
> it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34909) conv() does not convert negative inputs to unsigned correctly

2021-03-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311700#comment-17311700
 ] 

Apache Spark commented on SPARK-34909:
--

User 'timarmstrong' has created a pull request for this issue:
https://github.com/apache/spark/pull/32006

> conv() does not convert negative inputs to unsigned correctly
> -
>
> Key: SPARK-34909
> URL: https://issues.apache.org/jira/browse/SPARK-34909
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Tim Armstrong
>Priority: Major
>
> {noformat}
> scala> spark.sql("select conv('-10', 11, 7)").show(20, 150)
> +---+
> |   conv(-10, 11, 7)|
> +---+
> |4501202152252313413456|
> +---+
> scala> spark.sql("select hex(conv('-10', 11, 7))").show(20, 150)
> +--+
> | hex(conv(-10, 11, 7))|
> +--+
> |3435303132303231353232353233313334313334353600|
> +--+
> {noformat}
> The correct result is 45012021522523134134555. The above output has an 
> incorrect second-to-last digit (6 instead of 5) and the last digit is a 
> non-printing character the null byte.
> I tracked the bug down to NumberConverter.unsignedLongDiv returning incorrect 
> results. I tried replacing with java.lang.Long.divideUnsigned and that fixed 
> it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34906) Refactor TreeNode's children handling methods into specialized traits

2021-03-30 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-34906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-34906.
---
Fix Version/s: 3.2.0
 Assignee: Ali Afroozeh
   Resolution: Fixed

> Refactor TreeNode's children handling methods into specialized traits
> -
>
> Key: SPARK-34906
> URL: https://issues.apache.org/jira/browse/SPARK-34906
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Ali Afroozeh
>Assignee: Ali Afroozeh
>Priority: Major
> Fix For: 3.2.0
>
>
> Spark query plan node hierarchy has specialized traits (or abstract classes) 
> for handling nodes with fixed number of children, for example 
> UnaryExpression, UnaryNode and UnaryExec for representing an expression, a 
> logical plan and a physical plan with only one child, respectively. This PR 
> refactors the TreeNode hierarchy by extracting the children handling 
> functionality into the following traits. UnaryExpression and other similar 
> classes now extend the corresponding new trait:
> {{trait LeafLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
>  {{  override final def children: Seq[T] = Nil}}
>  {{}}}
> {{trait UnaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
>  {{  def child: T}}
>  {{  @transient override final lazy val children: Seq[T] = child :: Nil}}
>  {{}}}
> {{trait BinaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
>  {{  def left: T}}
>  {{  def right: T}}
>  {{  @transient override final lazy val children: Seq[T] = left :: right :: 
> Nil}}
>  {{}}}
> {{trait TernaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
>  {{  def first: T}}
>  {{  def second: T}}
>  {{  def third: T}}
>  {{  @transient override final lazy val children: Seq[T] = first :: second :: 
> third :: Nil}}
>  {{}}}
>  
> This refactoring, which is part of a bigger effort to make tree 
> transformations in Spark more efficient, has two benefits:
>  * It moves the children handling to a single place, instead of being spread 
> in specific subclasses, which will help the future optimizations for tree 
> traversals.
>  * It allows to mix in these traits with some concrete node types that could 
> not extend the previous classes. For example, expressions with one child that 
> extend AggregateFunction cannot extend UnaryExpression as AggregateFunction 
> defines the foldable method final while UnaryExpression defines it as non 
> final. With the new traits, we can directly extend the concrete class from 
> UnaryLike in these cases. Classes with more specific child handling will make 
> tree traversal methods faster.
> In this PR we have also updated many concrete node types to extend these 
> traits to benefit from more specific child handling.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23977) Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism

2021-03-30 Thread Daniel Zhi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311818#comment-17311818
 ] 

Daniel Zhi commented on SPARK-23977:


[~ste...@apache.org] Thanks for the info. Below are the related (key, value) pairs 
we used:
 # spark.hadoop.fs.s3a.committer.name --- partitioned
 # spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a --- 
org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
 # spark.sql.sources.commitProtocolClass --- 
org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
 # spark.sql.parquet.output.committer.class --- 
org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter

3 & 4 appear to be necessary to ensure the S3A committers are used by Spark for 
parquet output, except that "INSERT OVERWRITE" is blocked by the 
dynamicPartitionOverwrite exception. It would be helpful and appreciated if you 
can patiently elaborate on the proper way to "use the partitioned committer and 
configure it to do the right thing ..." in Spark.
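
For reference, the same four settings expressed as spark-submit flags (the keys 
and values are copied verbatim from the list above; the rest of the command line 
is elided):

{noformat}
spark-submit \
  --conf spark.hadoop.fs.s3a.committer.name=partitioned \
  --conf spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a=org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory \
  --conf spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol \
  --conf spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter \
  ...
{noformat}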

> Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
> ---
>
> Key: SPARK-23977
> URL: https://issues.apache.org/jira/browse/SPARK-23977
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: 3.0.0
>
>
> Hadoop 3.1 adds a mechanism for job-specific and store-specific committers 
> (MAPREDUCE-6823, MAPREDUCE-6956), and one key implementation, S3A committers, 
> HADOOP-13786
> These committers deliver high-performance output of MR and spark jobs to S3, 
> and offer the key semantics which Spark depends on: no visible output until 
> job commit, a failure of a task at an stage, including partway through task 
> commit, can be handled by executing and committing another task attempt. 
> In contrast, the FileOutputFormat commit algorithms on S3 have issues:
> * Awful performance because files are copied by rename
> * FileOutputFormat v1: weak task commit failure recovery semantics as the 
> (v1) expectation: "directory renames are atomic" doesn't hold.
> * S3 metadata eventual consistency can cause rename to miss files or fail 
> entirely (SPARK-15849)
> Note also that FileOutputFormat "v2" commit algorithm doesn't offer any of 
> the commit semantics w.r.t observability of or recovery from task commit 
> failure, on any filesystem.
> The S3A committers address these by way of uploading all data to the 
> destination through multipart uploads, uploads which are only completed in 
> job commit.
> The new {{PathOutputCommitter}} factory mechanism allows applications to work 
> with the S3A committers and any other, by adding a plugin mechanism into the 
> MRv2 FileOutputFormat class, where it job config and filesystem configuration 
> options can dynamically choose the output committer.
> Spark can use these with some binding classes to 
> # Add a subclass of {{HadoopMapReduceCommitProtocol}} which uses the MRv2 
> classes and {{PathOutputCommitterFactory}} to create the committers.
> # Add a {{BindingParquetOutputCommitter extends ParquetOutputCommitter}}
> to wire up Parquet output even when code requires the committer to be a 
> subclass of {{ParquetOutputCommitter}}
> This patch builds on SPARK-23807 for setting up the dependencies.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34910) Add an option for different stride orders

2021-03-30 Thread Jason Yarbrough (Jira)
Jason Yarbrough created SPARK-34910:
---

 Summary: Add an option for different stride orders
 Key: SPARK-34910
 URL: https://issues.apache.org/jira/browse/SPARK-34910
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Jason Yarbrough


Currently, the JDBCRelation columnPartition function orders the strides in 
ascending order, starting from the lower bound and working its way towards the 
upper bound.

I'm proposing leaving that as the default, but adding an option (such as 
strideOrder) in JDBCOptions. Since it will default to the current behavior, 
this will keep people's current code working as expected. However, people who 
may have data skew closer to the upper bound might appreciate being able to 
have the strides in descending order, thus filling up the first partition with 
the last stride and so forth. Also, people with nondeterministic data skew or 
sporadic data density might be able to benefit from a random ordering of the 
strides.

I have already written the code to implement this, and it creates a pattern that 
can be used to add other algorithms that people may want (such as counting the 
rows and ranking each stride, and then ordering from most dense to least). The 
two options I have coded so far are 'descending' and 'random.'

The original idea was to create something closer to Spark's hash partitioner, 
but for JDBC and pushed down to the database engine for efficiency. However, 
that would require adding hashing algorithms for each dialect, and the 
performance cost of those algorithms may outweigh the benefit. The method I'm 
proposing in this ticket avoids those complexities while still giving some of 
the benefit (in the case of random ordering).

I'll put a PR in if others feel this is a good idea.
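
For illustration, a hedged sketch of how the proposed option might sit alongside 
the existing JDBC partitioning options. The "strideOrder" option and its values 
are the proposal above, not an existing Spark option, and the connection details 
are illustrative:

{code:java}
// Sketch of the proposed API surface; "strideOrder" does not exist yet.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host/sales")  // illustrative URL
  .option("dbtable", "orders")
  .option("partitionColumn", "order_id")
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "64")
  .option("strideOrder", "descending")               // proposed option
  .load()
{code}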



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34910) JDBC - Add an option for different stride orders

2021-03-30 Thread Jason Yarbrough (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Yarbrough updated SPARK-34910:

Summary: JDBC - Add an option for different stride orders  (was: Add an 
option for different stride orders)

> JDBC - Add an option for different stride orders
> 
>
> Key: SPARK-34910
> URL: https://issues.apache.org/jira/browse/SPARK-34910
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Jason Yarbrough
>Priority: Trivial
>
> Currently, the JDBCRelation columnPartition function orders the strides in 
> ascending order, starting from the lower bound and working its way towards 
> the upper bound.
> I'm proposing leaving that as the default, but adding an option (such as 
> strideOrder) in JDBCOptions. Since it will default to the current behavior, 
> this will keep people's current code working as expected. However, people who 
> may have data skew closer to the upper bound might appreciate being able to 
> have the strides in descending order, thus filling up the first partition 
> with the last stride and so forth. Also, people with nondeterministic data 
> skew or sporadic data density might be able to benefit from a random ordering 
> of the strides.
> I have the code created to implement this, and it creates a pattern that can 
> be used to add other algorithms that people may want to add (such as counting 
> the rows and ranking each stride, and then ordering from most dense to 
> least). The two options I have coded so far are 'descending' and 'random.'
> The original idea was to create something closer to Spark's hash partitioner, 
> but for JDBC and pushed down to the database engine for efficiency. However, 
> that would require adding hashing algorithms for each dialect, and the 
> performance from those algorithms may outweigh the benefit. The method I'm 
> proposing in this ticket avoids those complexities while still giving some of 
> the benefit (in the case of random ordering).
> I'll put a PR in if others feel this is a good idea.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34464) `first` function is sorting the dataset while sometimes it is used to get "any value"

2021-03-30 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311845#comment-17311845
 ] 

Pablo Langa Blanco commented on SPARK-34464:


Hi [~lfruleux],

Here is a link that explains well when the different types of aggregation are 
applied:

[https://www.waitingforcode.com/apache-spark-sql/aggregations-execution-apache-spark-sql/read]

In the case you describe, two things make the aggregation fall back to a 
SortAggregate. The first is that the aggregation's types are not mutable primitive 
types (which HashAggregate requires). The first fallback is ObjectHashAggregate, 
but the first function is not supported by ObjectHashAggregate because it is not a 
TypedImperativeAggregate, so it falls back to SortAggregate.

I don't know whether there is a reason for this; I'm going to take a look at 
whether it's possible to make it a TypedImperativeAggregate so it can fall back to 
ObjectHashAggregate.

Thanks!

> `first` function is sorting the dataset while sometimes it is used to get 
> "any value"
> -
>
> Key: SPARK-34464
> URL: https://issues.apache.org/jira/browse/SPARK-34464
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Louis Fruleux
>Priority: Minor
>
> When one wants to groupBy and take any value (not necessarily the first), one 
> usually uses 
> [first|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L485]
>  aggregation function.
> Unfortunately, this method uses a `SortAggregate` for some data types, which 
> is not always necessary and might impact performance. Is this the desired 
> behavior?
>  
>  
> {code:java}
> Current behavior:
>  val df = Seq((0, "value")).toDF("key", "value")
> df.groupBy("key").agg(first("value")).explain()
>  /*
>  == Physical Plan ==
>  SortAggregate(key=key#342, functions=first(value#343, false))
>  +- *(2) Sort key#342 ASC NULLS FIRST, false, 0
>     +- Exchange hashpartitioning(key#342, 200)
>        +- SortAggregate(key=key#342, functions=partial_first(value#343, 
> false))
>           +- *(1) Sort key#342 ASC NULLS FIRST, false, 0
>              +- LocalTableScan key#342, value#343
>  */
> {code}
>  
> My understanding of the source code does not allow me to fully understand why 
> this is the current behavior.
> The solution might be to implement a new aggregate function. But the code 
> would be highly similar to the first one. And if I don't fully understand why 
> this 
> [createAggregate|https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala#L45]
>  method falls back to SortAggregate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34860) Multinomial Logistic Regression with intercept support centering

2021-03-30 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-34860.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31985
[https://github.com/apache/spark/pull/31985]

> Multinomial Logistic Regression with intercept support centering
> 
>
> Key: SPARK-34860
> URL: https://issues.apache.org/jira/browse/SPARK-34860
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34860) Multinomial Logistic Regression with intercept support centering

2021-03-30 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-34860:


Assignee: zhengruifeng

> Multinomial Logistic Regression with intercept support centering
> 
>
> Key: SPARK-34860
> URL: https://issues.apache.org/jira/browse/SPARK-34860
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34795) Adds a new job in GitHub Actions to check the output of TPC-DS queries

2021-03-30 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-34795.
--
Fix Version/s: 3.2.0
 Assignee: Takeshi Yamamuro
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/31886

> Adds a new job in GitHub Actions to check the output of TPC-DS queries
> --
>
> Key: SPARK-34795
> URL: https://issues.apache.org/jira/browse/SPARK-34795
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.2.0
>
>
> This ticket aims at adding a new job in GitHub Actions to check the output of 
> TPC-DS queries. There are some cases where we noticed runtime-related bugs 
> after merging commits (e.g., SPARK-33822). Therefore, I think it is worth 
> adding a new job in GitHub Actions to check query output of TPC-DS (sf=1).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33350) Add support to DiskBlockManager to create merge directory and to get the local shuffle merged data

2021-03-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33350:


Assignee: Apache Spark

> Add support to DiskBlockManager to create merge directory and to get the 
> local shuffle merged data
> --
>
> Key: SPARK-33350
> URL: https://issues.apache.org/jira/browse/SPARK-33350
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: Chandni Singh
>Assignee: Apache Spark
>Priority: Major
>
> DiskBlockManager should be able to create the {{merge_manager}} directory, 
> where the push-based merged shuffle files are written and also create 
> sub-dirs under it. 
> It should also be able to serve the local merged shuffle data/index/meta 
> files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33350) Add support to DiskBlockManager to create merge directory and to get the local shuffle merged data

2021-03-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311942#comment-17311942
 ] 

Apache Spark commented on SPARK-33350:
--

User 'zhouyejoe' has created a pull request for this issue:
https://github.com/apache/spark/pull/32007

> Add support to DiskBlockManager to create merge directory and to get the 
> local shuffle merged data
> --
>
> Key: SPARK-33350
> URL: https://issues.apache.org/jira/browse/SPARK-33350
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: Chandni Singh
>Priority: Major
>
> DiskBlockManager should be able to create the {{merge_manager}} directory, 
> where the push-based merged shuffle files are written and also create 
> sub-dirs under it. 
> It should also be able to serve the local merged shuffle data/index/meta 
> files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33350) Add support to DiskBlockManager to create merge directory and to get the local shuffle merged data

2021-03-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33350:


Assignee: (was: Apache Spark)

> Add support to DiskBlockManager to create merge directory and to get the 
> local shuffle merged data
> --
>
> Key: SPARK-33350
> URL: https://issues.apache.org/jira/browse/SPARK-33350
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: Chandni Singh
>Priority: Major
>
> DiskBlockManager should be able to create the {{merge_manager}} directory, 
> where the push-based merged shuffle files are written and also create 
> sub-dirs under it. 
> It should also be able to serve the local merged shuffle data/index/meta 
> files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34911) Fix code close issue in monitoring.md

2021-03-30 Thread angerszhu (Jira)
angerszhu created SPARK-34911:
-

 Summary: Fix code close issue in monitoring.md
 Key: SPARK-34911
 URL: https://issues.apache.org/jira/browse/SPARK-34911
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu


Fix code close issue in monitoring.md



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34911) Fix code close issue in monitoring.md

2021-03-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34911:


Assignee: (was: Apache Spark)

> Fix code close issue in monitoring.md
> -
>
> Key: SPARK-34911
> URL: https://issues.apache.org/jira/browse/SPARK-34911
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Fix code close issue in monitoring.md



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34911) Fix code close issue in monitoring.md

2021-03-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34911:


Assignee: Apache Spark

> Fix code close issue in monitoring.md
> -
>
> Key: SPARK-34911
> URL: https://issues.apache.org/jira/browse/SPARK-34911
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> Fix code close issue in monitoring.md



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34911) Fix code close issue in monitoring.md

2021-03-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311964#comment-17311964
 ] 

Apache Spark commented on SPARK-34911:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/32008

> Fix code close issue in monitoring.md
> -
>
> Key: SPARK-34911
> URL: https://issues.apache.org/jira/browse/SPARK-34911
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Fix code close issue in monitoring.md



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34911) Fix code close issue in monitoring.md

2021-03-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311965#comment-17311965
 ] 

Apache Spark commented on SPARK-34911:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/32008

> Fix code close issue in monitoring.md
> -
>
> Key: SPARK-34911
> URL: https://issues.apache.org/jira/browse/SPARK-34911
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Fix code close issue in monitoring.md



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34911) Fix code close issue in monitoring.md

2021-03-30 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-34911:
--
Component/s: (was: SQL)
 Spark Core

> Fix code close issue in monitoring.md
> -
>
> Key: SPARK-34911
> URL: https://issues.apache.org/jira/browse/SPARK-34911
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Fix code close issue in monitoring.md



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34912) Error when reading a file after starting spark-shell

2021-03-30 Thread czh (Jira)
czh created SPARK-34912:
---

 Summary: Error when reading a file after starting spark-shell
 Key: SPARK-34912
 URL: https://issues.apache.org/jira/browse/SPARK-34912
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.6.0
Reporter: czh


Starting spark-shell and reading an external file fails with the error:

Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

Spark version 1.6, Hadoop version 2.6.0; aws-java-sdk-1.7.4.jar and hadoop-aws-2.6.0.jar have already been copied into Spark's lib directory.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34913) Error when reading a file after starting spark-shell

2021-03-30 Thread czh (Jira)
czh created SPARK-34913:
---

 Summary: Error when reading a file after starting spark-shell
 Key: SPARK-34913
 URL: https://issues.apache.org/jira/browse/SPARK-34913
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.6.0
Reporter: czh


Starting spark-shell and reading an external file fails with the error:

Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

Spark version 1.6, Hadoop version 2.6.0; aws-java-sdk-1.7.4.jar and hadoop-aws-2.6.0.jar have already been copied into Spark's lib directory.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34510) .foreachPartition command hangs when ran inside Python package but works when ran from Python file outside the package on EMR

2021-03-30 Thread Yuriy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311978#comment-17311978
 ] 

Yuriy commented on SPARK-34510:
---

Sean, the code is not actually performing any S3 operations. If you take a 
closer look at s3_repo.py, all it is doing is running .foreachPartition on a 
data frame that contains 3 records and then printing out the results. It works 
locally for me, just not when it's deployed to EMR.

> .foreachPartition command hangs when run inside Python package but works when 
> run from Python file outside the package on EMR
> -
>
> Key: SPARK-34510
> URL: https://issues.apache.org/jira/browse/SPARK-34510
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, PySpark
>Affects Versions: 3.0.0
>Reporter: Yuriy
>Priority: Minor
> Attachments: Code.zip
>
>
> I'm running PySpark 3.0.0 on EMR with the project structure below; process.py 
> controls the flow of the application and calls code inside the 
> _file_processor_ package. The command hangs when the .foreachPartition code 
> located inside _s3_repo.py_ is called by _process.py_. When the same 
> .foreachPartition code is moved out of _s3_repo.py_ and placed inside 
> _process.py_, it runs just fine.
> {code:java}
> process.py
> file_processor
>   config
> spark.py
>   repository
> s3_repo.py
>   structure
> table_creator.py
> {code}
> *process.py*
> {code:java}
> from file_processor.structure import table_creator
> from file_processor.repository import s3_repo
> def process():
> table_creator.create_table()
> s3_repo.save_to_s3()
> if __name__ == '__main__':
> process()
> {code}
> *spark.py*
> {code:java}
> from pyspark.sql import SparkSession
> spark_session = SparkSession.builder.appName("Test").getOrCreate()
> {code}
> *s3_repo.py* 
> {code:java}
> from file_processor.config.spark import spark_session
> def save_to_s3():
> spark_session.sql('SELECT * FROM 
> rawFileData').toJSON().foreachPartition(_save_to_s3)
> def _save_to_s3(iterator):   
> for record in iterator:
> print(record)
> {code}
>  *table_creator.py*
> {code:java}
> from file_processor.config.spark import spark_session
> from pyspark.sql import Row
> def create_table():
> file_contents = [
> {'line_num': 1, 'contents': 'line 1'},
> {'line_num': 2, 'contents': 'line 2'},
> {'line_num': 3, 'contents': 'line 3'}
> ]
> spark_session.createDataFrame(Row(**row) for row in 
> file_contents).cache().createOrReplaceTempView("rawFileData")
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34914) Local scheduler backend support update token

2021-03-30 Thread ulysses you (Jira)
ulysses you created SPARK-34914:
---

 Summary: Local scheduler backend support update token
 Key: SPARK-34914
 URL: https://issues.apache.org/jira/browse/SPARK-34914
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: ulysses you


`LocalSchedulerBackend` doesn't extend `CoarseGrainedSchedulerBackend`, so in 
local mode we don't support updating delegation tokens.

In the proxy-user use case, the following command will hit an exception:
{code:java}
 ./bin/spark-shell --master local --proxy-user user_name

> spark.sql("show tables")
{code}


{code:java}
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: 
No valid credentials provided (Mechanism level: Failed to find any Kerberos 
tgt)]
at 
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
at 
org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
at 
org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
at 
org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:477)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:285)
at 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:70)
{code}
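
For context, a minimal sketch (not Spark's actual scheduler API; the method 
name and signature below are illustrative assumptions) of what "supporting 
token updates" amounts to: deserializing freshly obtained delegation tokens and 
attaching them to the current UGI so that later Hive/HDFS calls can 
authenticate.

{code:scala}
import java.io.{ByteArrayInputStream, DataInputStream}
import org.apache.hadoop.security.{Credentials, UserGroupInformation}

// Illustrative only: a local backend would need an equivalent of this hook,
// which CoarseGrainedSchedulerBackend already provides for cluster modes.
def updateDelegationTokens(serializedTokens: Array[Byte]): Unit = {
  val creds = new Credentials()
  // Tokens are shipped in Hadoop's token-storage format.
  creds.readTokenStorageStream(new DataInputStream(new ByteArrayInputStream(serializedTokens)))
  // Make the refreshed tokens visible to the current (proxy) user.
  UserGroupInformation.getCurrentUser.addCredentials(creds)
}
{code}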




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34914) Local scheduler backend support update token

2021-03-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34914:


Assignee: (was: Apache Spark)

> Local scheduler backend support update token
> 
>
> Key: SPARK-34914
> URL: https://issues.apache.org/jira/browse/SPARK-34914
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Priority: Minor
>
> `LocalSchedulerBackend` doesn't extend `CoarseGrainedSchedulerBackend`, so in 
> local mode we don't support updating delegation tokens.
> In the proxy-user use case, the following command will hit an exception:
> {code:java}
>  ./bin/spark-shell --master local --proxy-user user_name
> > spark.sql("show tables")
> {code}
> {code:java}
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
>   at 
> org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
>   at 
> org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
>   at 
> org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
>   at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
>   at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
>   at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:477)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:285)
>   at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:70)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34914) Local scheduler backend support update token

2021-03-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34914:


Assignee: Apache Spark

> Local scheduler backend support update token
> 
>
> Key: SPARK-34914
> URL: https://issues.apache.org/jira/browse/SPARK-34914
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: Apache Spark
>Priority: Minor
>
> `LocalSchedulerBackend` doesn't extend `CoarseGrainedSchedulerBackend`, so in 
> local mode we don't support updating delegation tokens.
> In the proxy-user use case, the following command will hit an exception:
> {code:java}
>  ./bin/spark-shell --master local --proxy-user user_name
> > spark.sql("show tables")
> {code}
> {code:java}
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
>   at 
> org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
>   at 
> org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
>   at 
> org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
>   at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
>   at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
>   at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:477)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:285)
>   at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:70)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34914) Local scheduler backend support update token

2021-03-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311989#comment-17311989
 ] 

Apache Spark commented on SPARK-34914:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/32009

> Local scheduler backend support update token
> 
>
> Key: SPARK-34914
> URL: https://issues.apache.org/jira/browse/SPARK-34914
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Priority: Minor
>
> `LocalSchedulerBackend` doesn't extend `CoarseGrainedSchedulerBackend`, so in 
> local mode we don't support updating delegation tokens.
> In the proxy-user use case, the following command will hit an exception:
> {code:java}
>  ./bin/spark-shell --master local --proxy-user user_name
> > spark.sql("show tables")
> {code}
> {code:java}
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
>   at 
> org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
>   at 
> org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
>   at 
> org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
>   at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
>   at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
>   at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:477)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:285)
>   at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:70)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34913) Error when starting spark-shell to read a file

2021-03-30 Thread czh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

czh updated SPARK-34913:

Attachment: 微信截图_20210331110125.png

> Error when starting spark-shell to read a file
> ---
>
> Key: SPARK-34913
> URL: https://issues.apache.org/jira/browse/SPARK-34913
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.0
>Reporter: czh
>Priority: Major
> Attachments: 微信截图_20210331110125.png
>
>
> Starting spark-shell and reading an external file reports an error:
> Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
> Spark version is 1.6 and Hadoop version is 2.6.0; aws-java-sdk-1.7.4.jar and 
> hadoop-aws-2.6.0.jar have already been copied into Spark's lib directory.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34911) Fix code close issue in monitoring.md

2021-03-30 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-34911:
-
Priority: Trivial  (was: Major)

> Fix code close issue in monitoring.md
> -
>
> Key: SPARK-34911
> URL: https://issues.apache.org/jira/browse/SPARK-34911
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Trivial
>
> Fix code close issue in monitoring.md



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34911) Fix code close issue in monitoring.md

2021-03-30 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17312000#comment-17312000
 ] 

Sean R. Owen commented on SPARK-34911:
--

This could have just been a follow-up to the other JIRA(s), too.

> Fix code close issue in monitoring.md
> -
>
> Key: SPARK-34911
> URL: https://issues.apache.org/jira/browse/SPARK-34911
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Trivial
>
> Fix code close issue in monitoring.md



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34912) Error when starting spark-shell to read a file

2021-03-30 Thread czh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

czh updated SPARK-34912:

Description: 
Starting spark-shell to read an external file reports an error:

Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

Spark version is 1.6 and Hadoop version is 2.6.0; aws-java-sdk-1.7.4.jar and 
hadoop-aws-2.6.0.jar have already been copied into Spark's lib directory.

  was:
Starting spark-shell and reading an external file reports an error:

Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

Spark version is 1.6 and Hadoop version is 2.6.0; aws-java-sdk-1.7.4.jar and 
hadoop-aws-2.6.0.jar have already been copied into Spark's lib directory.


> Error when starting spark-shell to read a file
> ---
>
> Key: SPARK-34912
> URL: https://issues.apache.org/jira/browse/SPARK-34912
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.0
>Reporter: czh
>Priority: Major
>
> Starting spark-shell to read an external file reports an error:
> Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
> Spark version is 1.6 and Hadoop version is 2.6.0; aws-java-sdk-1.7.4.jar and 
> hadoop-aws-2.6.0.jar have already been copied into Spark's lib directory.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34913) Error when starting spark-shell to read a file

2021-03-30 Thread czh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

czh updated SPARK-34913:

Description: 
Starting spark-shell to read an external file reports an error:

Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

Spark version is 1.6 and Hadoop version is 2.6.0; aws-java-sdk-1.7.4.jar and 
hadoop-aws-2.6.0.jar have already been copied into Spark's lib directory.

  was:
Starting spark-shell and reading an external file reports an error:

Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

Spark version is 1.6 and Hadoop version is 2.6.0; aws-java-sdk-1.7.4.jar and 
hadoop-aws-2.6.0.jar have already been copied into Spark's lib directory.


> Error when starting spark-shell to read a file
> ---
>
> Key: SPARK-34913
> URL: https://issues.apache.org/jira/browse/SPARK-34913
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.0
>Reporter: czh
>Priority: Major
> Attachments: 微信截图_20210331110125.png
>
>
> Starting spark-shell to read an external file reports an error:
> Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
> Spark version is 1.6 and Hadoop version is 2.6.0; aws-java-sdk-1.7.4.jar and 
> hadoop-aws-2.6.0.jar have already been copied into Spark's lib directory.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34913) Start spark shell to read file and report an error

2021-03-30 Thread czh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

czh updated SPARK-34913:

Summary: Start spark shell to read file and report an error  (was: 
Error when starting spark-shell to read a file)

> Start spark shell to read file and report an error
> --
>
> Key: SPARK-34913
> URL: https://issues.apache.org/jira/browse/SPARK-34913
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.0
>Reporter: czh
>Priority: Major
> Attachments: 微信截图_20210331110125.png
>
>
> Starting spark-shell to read an external file reports an error:
> Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
> Spark version is 1.6 and Hadoop version is 2.6.0; aws-java-sdk-1.7.4.jar and 
> hadoop-aws-2.6.0.jar have already been copied into Spark's lib directory.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34912) Start spark shell to read file and report an error

2021-03-30 Thread czh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

czh updated SPARK-34912:

Summary: Start spark shell to read file and report an error  (was: 
Error when starting spark-shell to read a file)

> Start spark shell to read file and report an error
> --
>
> Key: SPARK-34912
> URL: https://issues.apache.org/jira/browse/SPARK-34912
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.0
>Reporter: czh
>Priority: Major
>
> Starting spark-shell to read an external file reports an error:
> Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
> Spark version is 1.6 and Hadoop version is 2.6.0; aws-java-sdk-1.7.4.jar and 
> hadoop-aws-2.6.0.jar have already been copied into Spark's lib directory.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34908) Add test cases for char and varchar with functions

2021-03-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34908:


Assignee: (was: Apache Spark)

> Add test cases for char and varchar with functions
> --
>
> Key: SPARK-34908
> URL: https://issues.apache.org/jira/browse/SPARK-34908
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Priority: Minor
>
> Add test cases for char and varchar with functions to show the behavior



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34908) Add test cases for char and varchar with functions

2021-03-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17312035#comment-17312035
 ] 

Apache Spark commented on SPARK-34908:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/32010

> Add test cases for char and varchar with functions
> --
>
> Key: SPARK-34908
> URL: https://issues.apache.org/jira/browse/SPARK-34908
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Priority: Minor
>
> Add test cases for char and varchar with functions to show the behavior



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34908) Add test cases for char and varchar with functions

2021-03-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34908:


Assignee: Apache Spark

> Add test cases for char and varchar with functions
> --
>
> Key: SPARK-34908
> URL: https://issues.apache.org/jira/browse/SPARK-34908
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Minor
>
> Add test cases for char and varchar with functions to show the behavior



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34907) Add main class that runs all benchmarks

2021-03-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34907.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32005
[https://github.com/apache/spark/pull/32005]

> Add main class that runs all benchmarks
> ---
>
> Key: SPARK-34907
> URL: https://issues.apache.org/jira/browse/SPARK-34907
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.2.0
>
>
> This is related to SPARK-31471. It would be good if we had an automatic way 
> to do it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34907) Add main class that runs all benchmarks

2021-03-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-34907:


Assignee: Hyukjin Kwon

> Add main class that runs all benchmarks
> ---
>
> Key: SPARK-34907
> URL: https://issues.apache.org/jira/browse/SPARK-34907
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> This is related to SPARK-31471. It would be good if we had an automatic way 
> to do it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34915) Cache Maven, SBT and Scala in all jobs that use them

2021-03-30 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-34915:


 Summary: Cache Maven, SBT and Scala in all jobs that use them
 Key: SPARK-34915
 URL: https://issues.apache.org/jira/browse/SPARK-34915
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 3.1.1, 3.0.2, 3.2.0
Reporter: Hyukjin Kwon


We should cache SBT, Maven and Scala for all jobs that use them. This is 
currently missing in some jobs such as 
https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L411-L430



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34916) Reduce tree traversals in transform/resolve function families

2021-03-30 Thread Yingyi Bu (Jira)
Yingyi Bu created SPARK-34916:
-

 Summary: Reduce tree traversals in transform/resolve function 
families
 Key: SPARK-34916
 URL: https://issues.apache.org/jira/browse/SPARK-34916
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.1
Reporter: Yingyi Bu
 Fix For: 3.2.0


Transform/resolve functions are called ~280k times per query on average for 
TPC-DS queries, which is far more than necessary. We can reduce those calls 
with early-exit information and conditions.
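
A hedged illustration of the idea with a generic tree (this is not Catalyst's 
TreeNode API): if each node carries a cheap, precomputed flag saying whether a 
rule can possibly apply in its subtree, a transform can skip whole subtrees 
instead of visiting every node on every invocation.

{code:scala}
// Generic sketch: containsPattern is assumed to be precomputed bottom-up.
case class Node(name: String, children: Seq[Node], containsPattern: Boolean) {
  def transform(rule: PartialFunction[Node, Node]): Node = {
    if (!containsPattern) {
      this  // early exit: nothing in this subtree can match the rule
    } else {
      val rewritten = rule.applyOrElse(this, (n: Node) => n)
      rewritten.copy(children = rewritten.children.map(_.transform(rule)))
    }
  }
}
{code}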



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34915) Cache Maven, SBT and Scala in all jobs that use them

2021-03-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34915:


Assignee: (was: Apache Spark)

> Cache Maven, SBT and Scala in all jobs that use them
> 
>
> Key: SPARK-34915
> URL: https://issues.apache.org/jira/browse/SPARK-34915
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> We should cache SBT, Maven and Scala for all jobs that use them. This is 
> currently missing in some jobs such as 
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L411-L430



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34915) Cache Maven, SBT and Scala in all jobs that use them

2021-03-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34915:


Assignee: Apache Spark

> Cache Maven, SBT and Scala in all jobs that use them
> 
>
> Key: SPARK-34915
> URL: https://issues.apache.org/jira/browse/SPARK-34915
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> We should cache SBT, Maven and Scala for all jobs that use them. This is 
> currently missing in some jobs such as 
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L411-L430



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34915) Cache Maven, SBT and Scala in all jobs that use them

2021-03-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17312055#comment-17312055
 ] 

Apache Spark commented on SPARK-34915:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/32011

> Cache Maven, SBT and Scala in all jobs that use them
> 
>
> Key: SPARK-34915
> URL: https://issues.apache.org/jira/browse/SPARK-34915
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> We should cache SBT, Maven and Scala for all jobs that use them. This is 
> currently missing in some jobs such as 
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L411-L430



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34916) Reduce tree traversals in transform/resolve function families

2021-03-30 Thread Yingyi Bu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yingyi Bu updated SPARK-34916:
--
   Fix Version/s: (was: 3.2.0)
Shepherd: Herman van Hövell
Target Version/s: 3.2.0

> Reduce tree traversals in transform/resolve function families
> -
>
> Key: SPARK-34916
> URL: https://issues.apache.org/jira/browse/SPARK-34916
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Yingyi Bu
>Priority: Major
>
> Transform/resolve functions are called ~280k times per query on average for 
> TPC-DS queries, which is far more than necessary. We can reduce those calls 
> with early-exit information and conditions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34915) Cache Maven, SBT and Scala in all jobs that use them

2021-03-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17312056#comment-17312056
 ] 

Apache Spark commented on SPARK-34915:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/32011

> Cache Maven, SBT and Scala in all jobs that use them
> 
>
> Key: SPARK-34915
> URL: https://issues.apache.org/jira/browse/SPARK-34915
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> We should cache SBT, Maven and Scala for all jobs that use them. This is 
> currently missing in some jobs such as 
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L411-L430



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34909) conv() does not convert negative inputs to unsigned correctly

2021-03-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-34909:
---

Assignee: Tim Armstrong

> conv() does not convert negative inputs to unsigned correctly
> -
>
> Key: SPARK-34909
> URL: https://issues.apache.org/jira/browse/SPARK-34909
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Tim Armstrong
>Assignee: Tim Armstrong
>Priority: Major
>
> {noformat}
> scala> spark.sql("select conv('-10', 11, 7)").show(20, 150)
> +---+
> |   conv(-10, 11, 7)|
> +---+
> |4501202152252313413456|
> +---+
> scala> spark.sql("select hex(conv('-10', 11, 7))").show(20, 150)
> +--+
> | hex(conv(-10, 11, 7))|
> +--+
> |3435303132303231353232353233313334313334353600|
> +--+
> {noformat}
> The correct result is 45012021522523134134555. The above output has an 
> incorrect second-to-last digit (6 instead of 5), and the last digit is a 
> non-printing character (the null byte).
> I tracked the bug down to NumberConverter.unsignedLongDiv returning incorrect 
> results. I tried replacing it with java.lang.Long.divideUnsigned and that 
> fixed it.
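
As a side note for readers (a hedged illustration, not the NumberConverter code 
itself), the difference between signed and unsigned 64-bit division is easy to 
see in a Scala REPL:

{code:scala}
// -10L reinterpreted as an unsigned 64-bit value is 2^64 - 10.
val v = -10L
java.lang.Long.toUnsignedString(v)    // "18446744073709551606"
v / 7L                                // -1: signed division, wrong for this purpose
java.lang.Long.divideUnsigned(v, 7L)  // 2635249153387078800: the unsigned quotient
{code}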



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34909) conv() does not convert negative inputs to unsigned correctly

2021-03-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34909.
-
Fix Version/s: 3.0.3
   3.1.2
   2.4.8
   3.2.0
   Resolution: Fixed

Issue resolved by pull request 32006
[https://github.com/apache/spark/pull/32006]

> conv() does not convert negative inputs to unsigned correctly
> -
>
> Key: SPARK-34909
> URL: https://issues.apache.org/jira/browse/SPARK-34909
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Tim Armstrong
>Assignee: Tim Armstrong
>Priority: Major
> Fix For: 3.2.0, 2.4.8, 3.1.2, 3.0.3
>
>
> {noformat}
> scala> spark.sql("select conv('-10', 11, 7)").show(20, 150)
> +---+
> |   conv(-10, 11, 7)|
> +---+
> |4501202152252313413456|
> +---+
> scala> spark.sql("select hex(conv('-10', 11, 7))").show(20, 150)
> +--+
> | hex(conv(-10, 11, 7))|
> +--+
> |3435303132303231353232353233313334313334353600|
> +--+
> {noformat}
> The correct result is 45012021522523134134555. The above output has an 
> incorrect second-to-last digit (6 instead of 5), and the last digit is a 
> non-printing character (the null byte).
> I tracked the bug down to NumberConverter.unsignedLongDiv returning incorrect 
> results. I tried replacing it with java.lang.Long.divideUnsigned and that 
> fixed it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34916) Reduce tree traversals in transform/resolve function families

2021-03-30 Thread Yingyi Bu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yingyi Bu updated SPARK-34916:
--
Description: Transform/resolve functions are called ~280k times on average per 
TPC-DS query, which is far more than necessary. We can reduce those calls with 
early-exit information and conditions.  (was: Transform/resolve functions are 
called ~280k times per query on average for TPC-DS queries, which is far more 
than necessary. We can reduce those calls with early-exit information and 
conditions.)

> Reduce tree traversals in transform/resolve function families
> -
>
> Key: SPARK-34916
> URL: https://issues.apache.org/jira/browse/SPARK-34916
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Yingyi Bu
>Priority: Major
>
> Transform/resolve functions are called ~280k times on average per TPC-DS 
> query, which is far more than necessary. We can reduce those calls with 
> early-exit information and conditions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34896) Return day-time interval from dates subtraction

2021-03-30 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-34896.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31996
[https://github.com/apache/spark/pull/31996]

> Return day-time interval from dates subtraction
> ---
>
> Key: SPARK-34896
> URL: https://issues.apache.org/jira/browse/SPARK-34896
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> # Add SQL config to switch between new ANSI intervals and CalendarIntervalType
> # Modify SubtractDates to return DayTimeIntervalType when the config is 
> enabled.
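
Roughly, the intended behavior is the following (a hedged sketch; the exact SQL 
config name introduced by the patch is deliberately not shown here):

{code:scala}
// With the new behavior enabled, subtracting two dates yields a
// DayTimeIntervalType ("interval day to second") column instead of
// a CalendarIntervalType column.
spark.sql("SELECT DATE'2021-03-31' - DATE'2021-03-01' AS diff").printSchema()
{code}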



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34917) Create SQL syntax document for CAST

2021-03-30 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-34917:
--

 Summary: Create SQL syntax document for CAST
 Key: SPARK-34917
 URL: https://issues.apache.org/jira/browse/SPARK-34917
 Project: Spark
  Issue Type: Task
  Components: Documentation
Affects Versions: 3.2.0
Reporter: Gengliang Wang


Documentation for the behavior of CAST, including valid conversion type 
combinations, the results of integral overflow, string parsing errors, etc.
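
These are the kinds of cases such a document would need to spell out, sketched 
here in spark-shell form (illustrative only; the exact behavior depends on 
whether spark.sql.ansi.enabled is set):

{code:scala}
// Unparsable strings become NULL under the default (non-ANSI) mode:
spark.sql("SELECT CAST('abc' AS INT)").show()
// Integral overflow silently wraps under the default mode; with
// spark.sql.ansi.enabled=true the same cast raises a runtime error instead:
spark.sql("SELECT CAST(2147483648L AS INT)").show()
{code}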



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34918) Create SQL syntax document for TRY_CAST

2021-03-30 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-34918:
--

 Summary: Create SQL syntax document for TRY_CAST
 Key: SPARK-34918
 URL: https://issues.apache.org/jira/browse/SPARK-34918
 Project: Spark
  Issue Type: Task
  Components: Documentation
Affects Versions: 3.2.0
Reporter: Gengliang Wang


Documentation for the behavior of TRY_CAST, including valid conversion type 
combinations, the results of integral overflow, string parsing errors, etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34568) enableHiveSupport should ignore if SparkContext is created

2021-03-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-34568:
---

Assignee: angerszhu

> enableHiveSupport should ignore if SparkContext is created
> --
>
> Key: SPARK-34568
> URL: https://issues.apache.org/jira/browse/SPARK-34568
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>
> If a SparkContext has already been created, 
> SparkSession.builder.enableHiveSupport().getOrCreate() won't load Hive 
> metadata.
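
A hedged reproduction sketch of what the report describes (the conf key read at 
the end is used only to observe which catalog implementation was picked):

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

// Create a bare SparkContext first, then ask for Hive support afterwards.
val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("repro"))
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Before the fix this prints "in-memory": the enableHiveSupport() request is
// silently dropped because getOrCreate() builds on the pre-existing context.
println(spark.conf.get("spark.sql.catalogImplementation"))
{code}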



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34568) enableHiveSupport should ignore if SparkContext is created

2021-03-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34568.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31680
[https://github.com/apache/spark/pull/31680]

> enableHiveSupport should ignore if SparkContext is created
> --
>
> Key: SPARK-34568
> URL: https://issues.apache.org/jira/browse/SPARK-34568
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
>
> If a SparkContext has already been created, 
> SparkSession.builder.enableHiveSupport().getOrCreate() won't load Hive 
> metadata.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34354) CostBasedJoinReorder can fail on self-join

2021-03-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-34354:
---

Assignee: wuyi

> CostBasedJoinReorder can fail on self-join
> --
>
> Key: SPARK-34354
> URL: https://issues.apache.org/jira/browse/SPARK-34354
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
>  
> For example:
> {code:java}
> test("join reorder with self-join") {
>   val plan = t2.join(t1, Inner, Some(nameToAttr("t1.k-1-2") === 
> nameToAttr("t2.k-1-5")))
> .select(nameToAttr("t1.v-1-10"))
> .join(t2, Inner, Some(nameToAttr("t1.v-1-10") === nameToAttr("t2.k-1-5")))
>   // this can fail
>   Optimize.execute(plan.analyze)
> }
> {code}
> error:
> {code:java}
> [info]   java.lang.AssertionError: assertion failed
> [info]   at scala.Predef$.assert(Predef.scala:208)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.JoinReorderDP$.search(CostBasedJoinReorder.scala:178)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.org$apache$spark$sql$catalyst$optimizer$CostBasedJoinReorder$$reorder(CostBasedJoinReorder.scala:64)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$1.applyOrElse(CostBasedJoinReorder.scala:45)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$1.applyOrElse(CostBasedJoinReorder.scala:41)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:171)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:169)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.apply(CostBasedJoinReorder.scala:41)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.apply(CostBasedJoinReorder.scala:35)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34354) CostBasedJoinReorder can fail on self-join

2021-03-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34354.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31470
[https://github.com/apache/spark/pull/31470]

> CostBasedJoinReorder can fail on self-join
> --
>
> Key: SPARK-34354
> URL: https://issues.apache.org/jira/browse/SPARK-34354
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.2.0
>
>
>  
> For example:
> {code:java}
> test("join reorder with self-join") {
>   val plan = t2.join(t1, Inner, Some(nameToAttr("t1.k-1-2") === 
> nameToAttr("t2.k-1-5")))
> .select(nameToAttr("t1.v-1-10"))
> .join(t2, Inner, Some(nameToAttr("t1.v-1-10") === nameToAttr("t2.k-1-5")))
>   // this can fail
>   Optimize.execute(plan.analyze)
> }
> {code}
> error:
> {code:java}
> [info]   java.lang.AssertionError: assertion failed
> [info]   at scala.Predef$.assert(Predef.scala:208)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.JoinReorderDP$.search(CostBasedJoinReorder.scala:178)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.org$apache$spark$sql$catalyst$optimizer$CostBasedJoinReorder$$reorder(CostBasedJoinReorder.scala:64)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$1.applyOrElse(CostBasedJoinReorder.scala:45)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$1.applyOrElse(CostBasedJoinReorder.scala:41)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:171)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:169)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.apply(CostBasedJoinReorder.scala:41)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.apply(CostBasedJoinReorder.scala:35)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org