[jira] [Resolved] (SPARK-38382) Refactor migration guide's sentences

2022-03-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-38382.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35705
[https://github.com/apache/spark/pull/35705]

> Refactor migration guide's sentences
> 
>
> Key: SPARK-38382
> URL: https://issues.apache.org/jira/browse/SPARK-38382
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Trivial
> Fix For: 3.3.0
>
>
> The current migration guide uses both "Since Spark x.x.x" and "In Spark x.x.x";
> we should unify the wording.






[jira] [Assigned] (SPARK-38382) Refactor migration guide's sentences

2022-03-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-38382:
---

Assignee: angerszhu

> Refactor migration guide's sentences
> 
>
> Key: SPARK-38382
> URL: https://issues.apache.org/jira/browse/SPARK-38382
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Trivial
>
> The current migration guide uses both "Since Spark x.x.x" and "In Spark x.x.x";
> we should unify the wording.






[jira] [Commented] (SPARK-38285) ClassCastException: GenericArrayData cannot be cast to InternalRow

2022-03-07 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502123#comment-17502123
 ] 

L. C. Hsieh commented on SPARK-38285:
-

No worries, bug fixes are not blocked by the code freeze. I've submitted a PR for
this.
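
For reference, the workaround quoted in the description below can also be applied
when building the session, before any query runs. A minimal sketch, assuming an
application that is about to run the reported query; the two config keys are taken
verbatim from the report, everything else is illustrative:

{code:python}
from pyspark.sql import SparkSession

# Disable nested pruning before the problematic query runs. Per the report,
# these two settings make the query return the expected rows on 3.2.1.
spark = (
    SparkSession.builder
    .config("spark.sql.optimizer.expression.nestedPruning.enabled", "false")
    .config("spark.sql.optimizer.nestedSchemaPruning.enabled", "false")
    .getOrCreate()
)
{code}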

> ClassCastException: GenericArrayData cannot be cast to InternalRow
> --
>
> Key: SPARK-38285
> URL: https://issues.apache.org/jira/browse/SPARK-38285
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Alessandro Bacchini
>Priority: Major
>
> The following code with Spark 3.2.1 raises an exception:
> {code:python}
> import pyspark.sql.functions as F
> from pyspark.sql.types import StructType, StructField, ArrayType, StringType
> t = StructType([
>     StructField('o', 
>         ArrayType(
>             StructType([
>                 StructField('s', StringType(), False),
>                 StructField('b', ArrayType(
>                     StructType([
>                         StructField('e', StringType(), False)
>                     ]),
>                     True),
>                 False)
>             ]), 
>         True),
>     False)])
> value = {
>     "o": [
>         {
>             "s": "string1",
>             "b": [
>                 {
>                     "e": "string2"
>                 },
>                 {
>                     "e": "string3"
>                 }
>             ]
>         },
>         {
>             "s": "string4",
>             "b": [
>                 {
>                     "e": "string5"
>                 },
>                 {
>                     "e": "string6"
>                 },
>                 {
>                     "e": "string7"
>                 }
>             ]
>         }
>     ]
> }
> df = (
>     spark.createDataFrame([value], schema=t)
>     .select(F.explode("o").alias("eo"))
>     .select("eo.b.e")
> )
> df.show()
> {code}
> The exception message is:
> {code}
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to 
> org.apache.spark.sql.catalyst.InternalRow
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.getStruct(GenericArrayData.scala:76)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
>   at 
> org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
>   at 
> org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:155)
>   at 
> org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
>   at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>   at 
> org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
>   at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
>   at org.apache.spark.scheduler.Task.doRunTask(Task.scala:153)
>   at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:122)
>   at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>   at org.apache.spark.scheduler.Task.run(Task.scala:93)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:824)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1641)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:827)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:683)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> I am using Spark 3.2.1, but I don't know whether Spark 3.3.0 is also affected.
> Please note that the issue seems to be related to SPARK-37577: I am using the
> same DataFrame schema, but this time I have populated it with non-empty values.
> I think that this is a bug because with the following configuration it works as
> expected:
> {code:python}
> spark.conf.set("spark.sql.optimizer.expression.nestedPruning.enabled", False)

[jira] [Assigned] (SPARK-38431) Support to delete matched rows from jdbc tables

2022-03-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38431:


Assignee: Apache Spark

> Support to delete matched rows from jdbc tables
> ---
>
> Key: SPARK-38431
> URL: https://issues.apache.org/jira/browse/SPARK-38431
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: caican
>Assignee: Apache Spark
>Priority: Major
>
> Spark SQL cannot perform a DELETE operation when it accesses an RDBMS. I
> think that is not user-friendly.
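
For context, a minimal sketch of the kind of statement this issue asks to work
end-to-end against an RDBMS. The catalog name, JDBC URL, driver and table are
hypothetical, and whether the DELETE runs today depends on the Spark version and
the JDBC catalog in use; this only illustrates the requested capability:

{code:python}
from pyspark.sql import SparkSession

# Register a DataSource V2 JDBC catalog (placeholder connection settings).
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.mysql",
            "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
    .config("spark.sql.catalog.mysql.url", "jdbc:mysql://host:3306")
    .config("spark.sql.catalog.mysql.driver", "com.mysql.cj.jdbc.Driver")
    .getOrCreate()
)

# The requested behaviour: delete matched rows directly in the JDBC table.
spark.sql("DELETE FROM mysql.db.orders WHERE status = 'cancelled'")
{code}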






[jira] [Assigned] (SPARK-38431) Support to delete matched rows from jdbc tables

2022-03-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38431:


Assignee: (was: Apache Spark)

> Support to delete matched rows from jdbc tables
> ---
>
> Key: SPARK-38431
> URL: https://issues.apache.org/jira/browse/SPARK-38431
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: caican
>Priority: Major
>
> Spark SQL cannot perform a DELETE operation when it accesses an RDBMS. I
> think that is not user-friendly.






[jira] [Assigned] (SPARK-38285) ClassCastException: GenericArrayData cannot be cast to InternalRow

2022-03-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38285:


Assignee: (was: Apache Spark)

> ClassCastException: GenericArrayData cannot be cast to InternalRow
> --
>
> Key: SPARK-38285
> URL: https://issues.apache.org/jira/browse/SPARK-38285
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Alessandro Bacchini
>Priority: Major
>
> The following code with Spark 3.2.1 raises an exception:
> {code:python}
> import pyspark.sql.functions as F
> from pyspark.sql.types import StructType, StructField, ArrayType, StringType
> t = StructType([
>     StructField('o', 
>         ArrayType(
>             StructType([
>                 StructField('s', StringType(), False),
>                 StructField('b', ArrayType(
>                     StructType([
>                         StructField('e', StringType(), False)
>                     ]),
>                     True),
>                 False)
>             ]), 
>         True),
>     False)])
> value = {
>     "o": [
>         {
>             "s": "string1",
>             "b": [
>                 {
>                     "e": "string2"
>                 },
>                 {
>                     "e": "string3"
>                 }
>             ]
>         },
>         {
>             "s": "string4",
>             "b": [
>                 {
>                     "e": "string5"
>                 },
>                 {
>                     "e": "string6"
>                 },
>                 {
>                     "e": "string7"
>                 }
>             ]
>         }
>     ]
> }
> df = (
>     spark.createDataFrame([value], schema=t)
>     .select(F.explode("o").alias("eo"))
>     .select("eo.b.e")
> )
> df.show()
> {code}
> The exception message is:
> {code}
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to 
> org.apache.spark.sql.catalyst.InternalRow
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.getStruct(GenericArrayData.scala:76)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
>   at 
> org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
>   at 
> org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:155)
>   at 
> org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
>   at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>   at 
> org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
>   at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
>   at org.apache.spark.scheduler.Task.doRunTask(Task.scala:153)
>   at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:122)
>   at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>   at org.apache.spark.scheduler.Task.run(Task.scala:93)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:824)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1641)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:827)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:683)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> I am using Spark 3.2.1, but I don't know whether Spark 3.3.0 is also affected.
> Please note that the issue seems to be related to SPARK-37577: I am using the
> same DataFrame schema, but this time I have populated it with non-empty values.
> I think that this is a bug because with the following configuration it works as
> expected:
> {code:python}
> spark.conf.set("spark.sql.optimizer.expression.nestedPruning.enabled", False)
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", False)
> {code}
> Update: 

[jira] [Commented] (SPARK-38285) ClassCastException: GenericArrayData cannot be cast to InternalRow

2022-03-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502135#comment-17502135
 ] 

Apache Spark commented on SPARK-38285:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/35749

> ClassCastException: GenericArrayData cannot be cast to InternalRow
> --
>
> Key: SPARK-38285
> URL: https://issues.apache.org/jira/browse/SPARK-38285
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Alessandro Bacchini
>Priority: Major
>
> The following code with Spark 3.2.1 raises an exception:
> {code:python}
> import pyspark.sql.functions as F
> from pyspark.sql.types import StructType, StructField, ArrayType, StringType
> t = StructType([
>     StructField('o', 
>         ArrayType(
>             StructType([
>                 StructField('s', StringType(), False),
>                 StructField('b', ArrayType(
>                     StructType([
>                         StructField('e', StringType(), False)
>                     ]),
>                     True),
>                 False)
>             ]), 
>         True),
>     False)])
> value = {
>     "o": [
>         {
>             "s": "string1",
>             "b": [
>                 {
>                     "e": "string2"
>                 },
>                 {
>                     "e": "string3"
>                 }
>             ]
>         },
>         {
>             "s": "string4",
>             "b": [
>                 {
>                     "e": "string5"
>                 },
>                 {
>                     "e": "string6"
>                 },
>                 {
>                     "e": "string7"
>                 }
>             ]
>         }
>     ]
> }
> df = (
>     spark.createDataFrame([value], schema=t)
>     .select(F.explode("o").alias("eo"))
>     .select("eo.b.e")
> )
> df.show()
> {code}
> The exception message is:
> {code}
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to 
> org.apache.spark.sql.catalyst.InternalRow
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.getStruct(GenericArrayData.scala:76)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
>   at 
> org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
>   at 
> org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:155)
>   at 
> org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
>   at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>   at 
> org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
>   at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
>   at org.apache.spark.scheduler.Task.doRunTask(Task.scala:153)
>   at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:122)
>   at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>   at org.apache.spark.scheduler.Task.run(Task.scala:93)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:824)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1641)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:827)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:683)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> I am using Spark 3.2.1, but I don't know whether Spark 3.3.0 is also affected.
> Please note that the issue seems to be related to SPARK-37577: I am using the
> same DataFrame schema, but this time I have populated it with non-empty values.
> I think that this is a bug because with the following configuration it works as
> expected:
> {code:python}
> spark.conf.set("spark.sql.optimizer.expression.nestedPrunin

[jira] [Commented] (SPARK-38431) Support to delete matched rows from jdbc tables

2022-03-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502134#comment-17502134
 ] 

Apache Spark commented on SPARK-38431:
--

User 'caican00' has created a pull request for this issue:
https://github.com/apache/spark/pull/35748

> Support to delete matched rows from jdbc tables
> ---
>
> Key: SPARK-38431
> URL: https://issues.apache.org/jira/browse/SPARK-38431
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: caican
>Priority: Major
>
> Spark SQL cannot perform a DELETE operation when it accesses an RDBMS. I
> think that is not user-friendly.






[jira] [Assigned] (SPARK-38285) ClassCastException: GenericArrayData cannot be cast to InternalRow

2022-03-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38285:


Assignee: Apache Spark

> ClassCastException: GenericArrayData cannot be cast to InternalRow
> --
>
> Key: SPARK-38285
> URL: https://issues.apache.org/jira/browse/SPARK-38285
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Alessandro Bacchini
>Assignee: Apache Spark
>Priority: Major
>
> The following code with Spark 3.2.1 raises an exception:
> {code:python}
> import pyspark.sql.functions as F
> from pyspark.sql.types import StructType, StructField, ArrayType, StringType
> t = StructType([
>     StructField('o', 
>         ArrayType(
>             StructType([
>                 StructField('s', StringType(), False),
>                 StructField('b', ArrayType(
>                     StructType([
>                         StructField('e', StringType(), False)
>                     ]),
>                     True),
>                 False)
>             ]), 
>         True),
>     False)])
> value = {
>     "o": [
>         {
>             "s": "string1",
>             "b": [
>                 {
>                     "e": "string2"
>                 },
>                 {
>                     "e": "string3"
>                 }
>             ]
>         },
>         {
>             "s": "string4",
>             "b": [
>                 {
>                     "e": "string5"
>                 },
>                 {
>                     "e": "string6"
>                 },
>                 {
>                     "e": "string7"
>                 }
>             ]
>         }
>     ]
> }
> df = (
>     spark.createDataFrame([value], schema=t)
>     .select(F.explode("o").alias("eo"))
>     .select("eo.b.e")
> )
> df.show()
> {code}
> The exception message is:
> {code}
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to 
> org.apache.spark.sql.catalyst.InternalRow
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.getStruct(GenericArrayData.scala:76)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
>   at 
> org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
>   at 
> org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:155)
>   at 
> org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
>   at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>   at 
> org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
>   at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
>   at org.apache.spark.scheduler.Task.doRunTask(Task.scala:153)
>   at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:122)
>   at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>   at org.apache.spark.scheduler.Task.run(Task.scala:93)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:824)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1641)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:827)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:683)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> I am using Spark 3.2.1, but I don't know whether Spark 3.3.0 is also affected.
> Please note that the issue seems to be related to SPARK-37577: I am using the
> same DataFrame schema, but this time I have populated it with non-empty values.
> I think that this is a bug because with the following configuration it works as
> expected:
> {code:python}
> spark.conf.set("spark.sql.optimizer.expression.nestedPruning.enabled", False)
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", F

[jira] [Resolved] (SPARK-38335) Parser changes for DEFAULT column support

2022-03-07 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-38335.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35690
[https://github.com/apache/spark/pull/35690]

> Parser changes for DEFAULT column support
> -
>
> Key: SPARK-38335
> URL: https://issues.apache.org/jira/browse/SPARK-38335
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Daniel
>Priority: Major
> Fix For: 3.3.0
>
>







[jira] [Created] (SPARK-38433) Add Shell Code Style Check Action

2022-03-07 Thread Jackey Lee (Jira)
Jackey Lee created SPARK-38433:
--

 Summary: Add Shell Code Style Check Action
 Key: SPARK-38433
 URL: https://issues.apache.org/jira/browse/SPARK-38433
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.3.0
Reporter: Jackey Lee


There is no shell script style check in the current Spark GitHub Actions. Some
shell scripts are written incorrectly and behave abnormally in corner cases.
They also fail the checks of the shellcheck plugin, for example in IDEA or in
shellcheck actions.






[jira] [Commented] (SPARK-28173) Add Kafka delegation token proxy user support

2022-03-07 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502162#comment-17502162
 ] 

Gabor Somogyi commented on SPARK-28173:
---

It looks like there is willingness to merge the Kafka part. I hope we can move
this forward relatively soon and finish this feature.

> Add Kafka delegation token proxy user support
> -
>
> Key: SPARK-28173
> URL: https://issues.apache.org/jira/browse/SPARK-28173
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> In SPARK-26592 I've turned off proxy user usage because
> https://issues.apache.org/jira/browse/KAFKA-6945 is not yet implemented.
> Since the KIP will be under discussion and hopefully implemented, here is this
> JIRA to track the Spark-side effort.






[jira] [Assigned] (SPARK-38335) Parser changes for DEFAULT column support

2022-03-07 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-38335:
--

Assignee: Daniel

> Parser changes for DEFAULT column support
> -
>
> Key: SPARK-38335
> URL: https://issues.apache.org/jira/browse/SPARK-38335
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
> Fix For: 3.3.0
>
>







[jira] [Commented] (SPARK-38413) Invalid call to exprId

2022-03-07 Thread liujintao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502169#comment-17502169
 ] 

liujintao commented on SPARK-38413:
---

create view step_29100348500 as
select c_user_id, c_column_1, c_column_2, c_column_3, c_column_4, c_column_5,
       c_column_6, c_column_7, c_column_8, c_column_9, c_column_10, c_column_11,
       c_column_12, c_column_13, c_column_14, c_column_15, c_column_16, c_column_17,
       c_column_18, c_column_19, c_column_20, c_column_21, c_column_22, c_column_23,
       c_column_24, c_column_25, c_column_26, c_column_27, c_column_28, c_column_29,
       c_column_30, c_target, cast(abs(c_column_2) as bigint) c_60767
from step_29100328097

This is my SQL; I only support 2.4.

[~hyukjin.kwon]

> Invalid call to exprId
> --
>
> Key: SPARK-38413
> URL: https://issues.apache.org/jira/browse/SPARK-38413
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: liujintao
>Priority: Major
>
> java.sql.SQLException: 
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> exprId on unresolved object, tree: 'c_column_2
> java.lang.RuntimeException: java.sql.SQLException: 
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> exprId on unresolved object, tree: 'c_column_2
>     at 
> com.bonc.ds.xquery.explorev2.stephandler.utils.HiveJdbcExecutor.createTableWithSql(HiveJdbcExecutor.java:602)
>     at 
> com.bonc.ds.xquery.explorev2.stephandler.operator.AbstractOperator.execute(AbstractOperator.java:183)
> I see that this issue is fixed in 2.4.0.
> I have no idea; I use Hadoop 3.1.4 and Spark 2.4.4.






[jira] [Commented] (SPARK-38413) Invalid call to exprId

2022-03-07 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502171#comment-17502171
 ] 

Hyukjin Kwon commented on SPARK-38413:
--

It would be great to make the reproduction self-contained. Also, we likely won't
fix it in 2.x because it's EOL. It would be good to check whether the same issue
exists in 3.x too.
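
For reference, a minimal self-contained sketch of the query shape from the comment
above (column list trimmed, data made up) that could be used to check whether 3.x
shows the same error:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for the source view; the real one has ~30 c_column_* columns.
spark.createDataFrame([(1, -2.5), (2, 3.0)], ["c_user_id", "c_column_2"]) \
    .createOrReplaceTempView("step_29100328097")

# Same shape as the reported statement: a view whose select list includes
# cast(abs(c_column_2) as bigint).
spark.sql("""
    CREATE OR REPLACE TEMP VIEW step_29100348500 AS
    SELECT c_user_id, c_column_2,
           CAST(ABS(c_column_2) AS BIGINT) AS c_60767
    FROM step_29100328097
""")
spark.table("step_29100348500").show()
{code}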

> Invalid call to exprId
> --
>
> Key: SPARK-38413
> URL: https://issues.apache.org/jira/browse/SPARK-38413
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: liujintao
>Priority: Major
>
> java.sql.SQLException: 
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> exprId on unresolved object, tree: 'c_column_2
> java.lang.RuntimeException: java.sql.SQLException: 
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> exprId on unresolved object, tree: 'c_column_2
>     at 
> com.bonc.ds.xquery.explorev2.stephandler.utils.HiveJdbcExecutor.createTableWithSql(HiveJdbcExecutor.java:602)
>     at 
> com.bonc.ds.xquery.explorev2.stephandler.operator.AbstractOperator.execute(AbstractOperator.java:183)
> I see that this issue is fixed in 2.4.0.
> I have no idea; I use Hadoop 3.1.4 and Spark 2.4.4.






[jira] [Commented] (SPARK-38433) Add Shell Code Style Check Action

2022-03-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502172#comment-17502172
 ] 

Apache Spark commented on SPARK-38433:
--

User 'stczwd' has created a pull request for this issue:
https://github.com/apache/spark/pull/35751

> Add Shell Code Style Check Action
> -
>
> Key: SPARK-38433
> URL: https://issues.apache.org/jira/browse/SPARK-38433
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Jackey Lee
>Priority: Major
>
> There is no shell script style check in the current Spark GitHub Actions. Some
> shell scripts are written incorrectly and behave abnormally in corner cases.
> They also fail the checks of the shellcheck plugin, for example in IDEA or in
> shellcheck actions.






[jira] [Assigned] (SPARK-38433) Add Shell Code Style Check Action

2022-03-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38433:


Assignee: (was: Apache Spark)

> Add Shell Code Style Check Action
> -
>
> Key: SPARK-38433
> URL: https://issues.apache.org/jira/browse/SPARK-38433
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Jackey Lee
>Priority: Major
>
> There is no shell script style check in the current Spark GitHub Actions. Some
> shell scripts are written incorrectly and behave abnormally in corner cases.
> They also fail the checks of the shellcheck plugin, for example in IDEA or in
> shellcheck actions.






[jira] [Assigned] (SPARK-38433) Add Shell Code Style Check Action

2022-03-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38433:


Assignee: Apache Spark

> Add Shell Code Style Check Action
> -
>
> Key: SPARK-38433
> URL: https://issues.apache.org/jira/browse/SPARK-38433
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Jackey Lee
>Assignee: Apache Spark
>Priority: Major
>
> There is no shell script style check in the current Spark GitHub Actions. Some
> shell scripts are written incorrectly and behave abnormally in corner cases.
> They also fail the checks of the shellcheck plugin, for example in IDEA or in
> shellcheck actions.






[jira] [Resolved] (SPARK-38413) Invalid call to exprId

2022-03-07 Thread liujintao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liujintao resolved SPARK-38413.
---
Resolution: Fixed

> Invalid call to exprId
> --
>
> Key: SPARK-38413
> URL: https://issues.apache.org/jira/browse/SPARK-38413
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: liujintao
>Priority: Major
>
> java.sql.SQLException: 
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> exprId on unresolved object, tree: 'c_column_2
> java.lang.RuntimeException: java.sql.SQLException: 
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> exprId on unresolved object, tree: 'c_column_2
>     at 
> com.bonc.ds.xquery.explorev2.stephandler.utils.HiveJdbcExecutor.createTableWithSql(HiveJdbcExecutor.java:602)
>     at 
> com.bonc.ds.xquery.explorev2.stephandler.operator.AbstractOperator.execute(AbstractOperator.java:183)
> I see that this issue is fixed in 2.4.0.
> I have no idea; I use Hadoop 3.1.4 and Spark 2.4.4.






[jira] [Created] (SPARK-38434) Correct semantic of CheckAnalysis.getDataTypesAreCompatibleFn method

2022-03-07 Thread huangtengfei (Jira)
huangtengfei created SPARK-38434:


 Summary: Correct semantic of 
CheckAnalysis.getDataTypesAreCompatibleFn method
 Key: SPARK-38434
 URL: https://issues.apache.org/jira/browse/SPARK-38434
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.1
Reporter: huangtengfei


Currently, the `CheckAnalysis` method
[getDataTypesAreCompatibleFn|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L606]
is implemented as:

{code:java}
  private def getDataTypesAreCompatibleFn(plan: LogicalPlan): (DataType, DataType) => Boolean = {
    val isUnion = plan.isInstanceOf[Union]
    if (isUnion) {
      (dt1: DataType, dt2: DataType) =>
        !DataType.equalsStructurally(dt1, dt2, true)
    } else {
      // SPARK-18058: we shall not care about the nullability of columns
      (dt1: DataType, dt2: DataType) =>
        TypeCoercion.findWiderTypeForTwo(dt1.asNullable, dt2.asNullable).isEmpty
    }
  }
{code}

It returns false when the data types are compatible and true otherwise, which is
pretty confusing.







[jira] [Assigned] (SPARK-38434) Correct semantic of CheckAnalysis.getDataTypesAreCompatibleFn method

2022-03-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38434:


Assignee: Apache Spark

> Correct semantic of CheckAnalysis.getDataTypesAreCompatibleFn method
> 
>
> Key: SPARK-38434
> URL: https://issues.apache.org/jira/browse/SPARK-38434
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: huangtengfei
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, the `CheckAnalysis` method
> [getDataTypesAreCompatibleFn|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L606]
> is implemented as:
> {code:java}
>   private def getDataTypesAreCompatibleFn(plan: LogicalPlan): (DataType, 
> DataType) => Boolean = {
> val isUnion = plan.isInstanceOf[Union]
> if (isUnion) {
>   (dt1: DataType, dt2: DataType) =>
> !DataType.equalsStructurally(dt1, dt2, true)
> } else {
>   // SPARK-18058: we shall not care about the nullability of columns
>   (dt1: DataType, dt2: DataType) =>
> TypeCoercion.findWiderTypeForTwo(dt1.asNullable, 
> dt2.asNullable).isEmpty
> }
>   }
> {code}
> It returns false when the data types are compatible and true otherwise, which
> is pretty confusing.






[jira] [Assigned] (SPARK-38434) Correct semantic of CheckAnalysis.getDataTypesAreCompatibleFn method

2022-03-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38434:


Assignee: (was: Apache Spark)

> Correct semantic of CheckAnalysis.getDataTypesAreCompatibleFn method
> 
>
> Key: SPARK-38434
> URL: https://issues.apache.org/jira/browse/SPARK-38434
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: huangtengfei
>Priority: Minor
>
> Currently, the `CheckAnalysis` method
> [getDataTypesAreCompatibleFn|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L606]
> is implemented as:
> {code:java}
>   private def getDataTypesAreCompatibleFn(plan: LogicalPlan): (DataType, 
> DataType) => Boolean = {
> val isUnion = plan.isInstanceOf[Union]
> if (isUnion) {
>   (dt1: DataType, dt2: DataType) =>
> !DataType.equalsStructurally(dt1, dt2, true)
> } else {
>   // SPARK-18058: we shall not care about the nullability of columns
>   (dt1: DataType, dt2: DataType) =>
> TypeCoercion.findWiderTypeForTwo(dt1.asNullable, 
> dt2.asNullable).isEmpty
> }
>   }
> {code}
> It returns false when the data types are compatible and true otherwise, which
> is pretty confusing.






[jira] [Commented] (SPARK-38434) Correct semantic of CheckAnalysis.getDataTypesAreCompatibleFn method

2022-03-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502215#comment-17502215
 ] 

Apache Spark commented on SPARK-38434:
--

User 'ivoson' has created a pull request for this issue:
https://github.com/apache/spark/pull/35752

> Correct semantic of CheckAnalysis.getDataTypesAreCompatibleFn method
> 
>
> Key: SPARK-38434
> URL: https://issues.apache.org/jira/browse/SPARK-38434
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: huangtengfei
>Priority: Minor
>
> Currently, the `CheckAnalysis` method
> [getDataTypesAreCompatibleFn|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L606]
> is implemented as:
> {code:java}
>   private def getDataTypesAreCompatibleFn(plan: LogicalPlan): (DataType, 
> DataType) => Boolean = {
> val isUnion = plan.isInstanceOf[Union]
> if (isUnion) {
>   (dt1: DataType, dt2: DataType) =>
> !DataType.equalsStructurally(dt1, dt2, true)
> } else {
>   // SPARK-18058: we shall not care about the nullability of columns
>   (dt1: DataType, dt2: DataType) =>
> TypeCoercion.findWiderTypeForTwo(dt1.asNullable, 
> dt2.asNullable).isEmpty
> }
>   }
> {code}
> It returns false when the data types are compatible and true otherwise, which
> is pretty confusing.






[jira] [Commented] (SPARK-38434) Correct semantic of CheckAnalysis.getDataTypesAreCompatibleFn method

2022-03-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502214#comment-17502214
 ] 

Apache Spark commented on SPARK-38434:
--

User 'ivoson' has created a pull request for this issue:
https://github.com/apache/spark/pull/35752

> Correct semantic of CheckAnalysis.getDataTypesAreCompatibleFn method
> 
>
> Key: SPARK-38434
> URL: https://issues.apache.org/jira/browse/SPARK-38434
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: huangtengfei
>Priority: Minor
>
> Currently, the `CheckAnalysis` method
> [getDataTypesAreCompatibleFn|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L606]
> is implemented as:
> {code:java}
>   private def getDataTypesAreCompatibleFn(plan: LogicalPlan): (DataType, 
> DataType) => Boolean = {
> val isUnion = plan.isInstanceOf[Union]
> if (isUnion) {
>   (dt1: DataType, dt2: DataType) =>
> !DataType.equalsStructurally(dt1, dt2, true)
> } else {
>   // SPARK-18058: we shall not care about the nullability of columns
>   (dt1: DataType, dt2: DataType) =>
> TypeCoercion.findWiderTypeForTwo(dt1.asNullable, 
> dt2.asNullable).isEmpty
> }
>   }
> {code}
> It returns false when the data types are compatible and true otherwise, which
> is pretty confusing.






[jira] [Commented] (SPARK-38394) build of spark sql against hadoop-3.4.0-snapshot failing with bouncycastle classpath error

2022-03-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502219#comment-17502219
 ] 

Apache Spark commented on SPARK-38394:
--

User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/35753

> build of spark sql against hadoop-3.4.0-snapshot failing with bouncycastle 
> classpath error
> --
>
> Key: SPARK-38394
> URL: https://issues.apache.org/jira/browse/SPARK-38394
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: 3.3.0
>
>
> building Spark master with {{-Dhadoop.version=3.4.0-SNAPSHOT}} and a local
> Hadoop build breaks in the sbt compiler plugin
> {code}
> [ERROR] Failed to execute goal 
> net.alchim31.maven:scala-maven-plugin:4.3.0:testCompile 
> (scala-test-compile-first) on project spark-sql_2.12: Execution 
> scala-test-compile-first of goal 
> net.alchim31.maven:scala-maven-plugin:4.3.0:testCompile failed: A required 
> class was missing while executing 
> net.alchim31.maven:scala-maven-plugin:4.3.0:testCompile: 
> org/bouncycastle/jce/provider/BouncyCastleProvider
> [ERROR] -
> [ERROR] realm =plugin>net.alchim31.maven:scala-maven-plugin:4.3.0
> {code}
> * this is the classpath of the sbt compiler
> * hadoop hasn't been doing anything related to bouncy castle.
> Setting scala-maven-plugin to 3.4.0 makes this go away, i.e. reapplying
> SPARK-36547.
> The implication here is that the plugin version is going to have to be
> configured in different profiles.






[jira] [Commented] (SPARK-38394) build of spark sql against hadoop-3.4.0-snapshot failing with bouncycastle classpath error

2022-03-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502218#comment-17502218
 ] 

Apache Spark commented on SPARK-38394:
--

User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/35753

> build of spark sql against hadoop-3.4.0-snapshot failing with bouncycastle 
> classpath error
> --
>
> Key: SPARK-38394
> URL: https://issues.apache.org/jira/browse/SPARK-38394
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: 3.3.0
>
>
> building Spark master with {{-Dhadoop.version=3.4.0-SNAPSHOT}} and a local
> Hadoop build breaks in the sbt compiler plugin
> {code}
> [ERROR] Failed to execute goal 
> net.alchim31.maven:scala-maven-plugin:4.3.0:testCompile 
> (scala-test-compile-first) on project spark-sql_2.12: Execution 
> scala-test-compile-first of goal 
> net.alchim31.maven:scala-maven-plugin:4.3.0:testCompile failed: A required 
> class was missing while executing 
> net.alchim31.maven:scala-maven-plugin:4.3.0:testCompile: 
> org/bouncycastle/jce/provider/BouncyCastleProvider
> [ERROR] -
> [ERROR] realm =plugin>net.alchim31.maven:scala-maven-plugin:4.3.0
> {code}
> * this is the classpath of the sbt compiler
> * hadoop hasn't been doing anything related to bouncy castle.
> Setting scala-maven-plugin to 3.4.0 makes this go away, i.e. reapplying
> SPARK-36547.
> The implication here is that the plugin version is going to have to be
> configured in different profiles.






[jira] [Created] (SPARK-38435) Pandas UDF with type hints crashes at import

2022-03-07 Thread Julien Peloton (Jira)
Julien Peloton created SPARK-38435:
--

 Summary: Pandas UDF with type hints crashes at import
 Key: SPARK-38435
 URL: https://issues.apache.org/jira/browse/SPARK-38435
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.1.0
 Environment: Spark: 3.1
Python: 3.7
Reporter: Julien Peloton


## Old style pandas UDF

let's consider a pandas UDF defined in the old style:

 

 
 

 

```python
# mymod.py
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StringType

@pandas_udf(StringType(), PandasUDFType.SCALAR)
def to_upper(s):
    return s.str.upper()
```

I can import it and use it as:

```python
# main.py
from pyspark.sql import SparkSession
from mymod import to_upper

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("John Doe",)], ("name",))
df.select(to_upper("name")).show()
```

and launch it via:

```bash
spark-submit main.py
spark/python/lib/pyspark.zip/pyspark/sql/pandas/functions.py:392: 
UserWarning: In Python 3.6+ and Spark 3.0+, it is preferred to 
specify type hints for pandas UDF instead of specifying pandas 
UDF type which will be deprecated in the future releases. See 
SPARK-28264 for more details.
+--+
|to_upper(name)|
+--+
|      JOHN DOE|
+--+
```

Except for the `UserWarning`, the code works as expected.

## New style pandas UDF: using type hint

Let's now switch to the version using type hints:

```python
# mymod.py
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def to_upper(s: pd.Series) -> pd.Series:
    return s.str.upper()
```

But this time, I obtain an `AttributeError`:

```
spark-submit main.py
Traceback (most recent call last):
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py",
 line 835, in _parse_datatype_string
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py",
 line 827, in from_ddl_datatype
AttributeError: 'NoneType' object has no attribute '_jvm'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py",
 line 839, in _parse_datatype_string
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py",
 line 827, in from_ddl_datatype
AttributeError: 'NoneType' object has no attribute '_jvm'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/julien/Documents/workspace/myrepos/fink-filters/main.py", line 
2, in 
    from mymod import to_upper
  File "/Users/julien/Documents/workspace/myrepos/fink-filters/mymod.py", line 
5, in 
    def to_upper(s: pd.Series) -> pd.Series:
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/pandas/functions.py",
 line 432, in _create_pandas_udf
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py",
 line 43, in _create_udf
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py",
 line 206, in _wrapped
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py",
 line 96, in returnType
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py",
 line 841, in _parse_datatype_string
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py",
 line 831, in _parse_datatype_string
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py",
 line 823, in from_ddl_schema
AttributeError: 'NoneType' object has no attribute '_jvm'
log4j:WARN No appenders could be found for logger 
(org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.
```

The code crashes at the import level. Looking at the code, the spark context 
needs to exist:

https://github.com/apache/spark/blob/8d70d5da3d74ebdd612b2cdc38201e121b88b24f/python/pyspark/sql/types.py#L819-L827

which at the time of the import is not the case. 
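
For what it's worth, the traceback suggests the crash comes from parsing the DDL
string "string", which requires the JVM. A hedged, untested sketch of two ways to
avoid the import-time failure, using the same names as in this report (not an
official fix):

```python
# mymod.py -- two possible workarounds (untested sketches)
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

# Option 1: pass a DataType instance instead of the DDL string "string", so no
# JVM-backed DDL parsing is needed at import time (mirrors the old-style example,
# which imports fine without a SparkSession).
@pandas_udf(StringType())
def to_upper(s: pd.Series) -> pd.Series:
    return s.str.upper()

# Option 2: defer UDF creation until a SparkSession exists.
def make_to_upper():
    @pandas_udf("string")
    def to_upper_hinted(s: pd.Series) -> pd.Series:
        return s.str.upper()
    return to_upper_hinted
```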

## Questions

First, am I doing something wrong? I do not see any mention of this in the
documentation
(https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.pandas_udf.html),
and it seems it should affect many users who are moving from old-style to
new-style pandas UDFs.

Second, is this the expected behaviour? Looking at the old-style pandas UDF,
where the module can be imported without problem, the new behaviour looks like
a regression. Why users wou

[jira] [Updated] (SPARK-38435) Pandas UDF with type hints crashes at import

2022-03-07 Thread Julien Peloton (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Peloton updated SPARK-38435:
---
Description: 
h2. Old style pandas UDF

let's consider a pandas UDF defined in the old style:
{code:java}
# mymod.py
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StringType

@pandas_udf(StringType(), PandasUDFType.SCALAR)
def to_upper(s):
    return s.str.upper(){code}
I can import it and use it as:
{code:java}
# main.py
from pyspark.sql import SparkSession
from mymod import to_upper

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("John Doe",)], ("name",))
df.select(to_upper("name")).show() {code}
and launch it via:
{code:java}
spark-submit main.py
spark/python/lib/pyspark.zip/pyspark/sql/pandas/functions.py:392: 
UserWarning: In Python 3.6+ and Spark 3.0+, it is preferred to 
specify type hints for pandas UDF instead of specifying pandas 
UDF type which will be deprecated in the future releases. See 
SPARK-28264 for more details.
+--+
|to_upper(name)|
+--+
|      JOHN DOE|
+--+ {code}
Except for the `UserWarning`, the code works as expected.
h2. New style pandas UDF: using type hint

Let's now switch to the version using type hints:

```python
 # mymod.py
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def to_upper(s: pd.Series) -> pd.Series:
    return s.str.upper()
```

But this time, I obtain an `AttributeError`:

```
spark-submit main.py
Traceback (most recent call last):
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py",
 line 835, in _parse_datatype_string
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py",
 line 827, in from_ddl_datatype
AttributeError: 'NoneType' object has no attribute '_jvm'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py",
 line 839, in _parse_datatype_string
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py",
 line 827, in from_ddl_datatype
AttributeError: 'NoneType' object has no attribute '_jvm'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/julien/Documents/workspace/myrepos/fink-filters/main.py", line 
2, in 
    from mymod import to_upper
  File "/Users/julien/Documents/workspace/myrepos/fink-filters/mymod.py", line 
5, in 
    def to_upper(s: pd.Series) -> pd.Series:
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/pandas/functions.py",
 line 432, in _create_pandas_udf
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py",
 line 43, in _create_udf
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py",
 line 206, in _wrapped
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py",
 line 96, in returnType
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py",
 line 841, in _parse_datatype_string
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py",
 line 831, in _parse_datatype_string
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py",
 line 823, in from_ddl_schema
AttributeError: 'NoneType' object has no attribute '_jvm'
log4j:WARN No appenders could be found for logger 
(org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See [http://logging.apache.org/log4j/1.2/faq.html#noconfig] for more 
info.
```

The code crashes at import time. Looking at the code, the Spark context 
needs to exist:

[https://github.com/apache/spark/blob/8d70d5da3d74ebdd612b2cdc38201e121b88b24f/python/pyspark/sql/types.py#L819-L827]

which is not the case at import time.
h2. Questions

First, am I doing something wrong? I do not see any mention of this in the documentation 
([https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.pandas_udf.html]),
and it seems likely to affect many users who are moving from the old style to 
the new style of pandas UDFs.

Second, is this the expected behaviour? Given that a module containing an 
old-style pandas UDF can be imported without problem, the new behaviour looks 
like a regression. Why should users need an active Spark context just to import 
a module that contains a pandas UDF?

Third, what could we do? I see that an assert was recently added on the master 
branch (https://issues.apache.org/jira/browse/SPARK-37620):

[https://github

[jira] [Updated] (SPARK-38435) Pandas UDF with type hints crashes at import

2022-03-07 Thread Julien Peloton (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Peloton updated SPARK-38435:
---
Description: 
h2. Old style pandas UDF

Let's consider a pandas UDF defined in the old style:
{code:python}
# mymod.py
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StringType

@pandas_udf(StringType(), PandasUDFType.SCALAR)
def to_upper(s):
    return s.str.upper()
{code}
I can import it and use it as:
{code:python}
# main.py
from pyspark.sql import SparkSession
from mymod import to_upper

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("John Doe",)], ("name",))
df.select(to_upper("name")).show() {code}
and launch it via:
{code:bash}
spark-submit main.py
spark/python/lib/pyspark.zip/pyspark/sql/pandas/functions.py:392: 
UserWarning: In Python 3.6+ and Spark 3.0+, it is preferred to 
specify type hints for pandas UDF instead of specifying pandas 
UDF type which will be deprecated in the future releases. See 
SPARK-28264 for more details.
+--+
|to_upper(name)|
+--+
|      JOHN DOE|
+--+ {code}
As expected, we obtain the `UserWarning`, but the code works fine.
h2. New style pandas UDF: using type hints

Let's now switch to the version using type hints:
{code:python}
# mymod.py
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def to_upper(s: pd.Series) -> pd.Series:
    return s.str.upper() {code}
But this time, I obtain an `AttributeError`:
{code:bash}
spark-submit main.py
Traceback (most recent call last):
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py",
 line 835, in _parse_datatype_string
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py",
 line 827, in from_ddl_datatype
AttributeError: 'NoneType' object has no attribute '_jvm'During handling of the 
above exception, another exception occurred:Traceback (most recent call last):
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py",
 line 839, in _parse_datatype_string
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py",
 line 827, in from_ddl_datatype
AttributeError: 'NoneType' object has no attribute '_jvm'During handling of the 
above exception, another exception occurred:Traceback (most recent call last):
  File "/Users/julien/Documents/workspace/myrepos/fink-filters/main.py", line 
2, in 
    from mymod import to_upper
  File "/Users/julien/Documents/workspace/myrepos/fink-filters/mymod.py", line 
5, in 
    def to_upper(s: pd.Series) -> pd.Series:
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/pandas/functions.py",
 line 432, in _create_pandas_udf
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py",
 line 43, in _create_udf
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py",
 line 206, in _wrapped
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py",
 line 96, in returnType
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py",
 line 841, in _parse_datatype_string
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py",
 line 831, in _parse_datatype_string
  File 
"/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py",
 line 823, in from_ddl_schema
AttributeError: 'NoneType' object has no attribute '_jvm'
log4j:WARN No appenders could be found for logger 
(org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info. {code}
 

The code crashes at import time. Looking at the code, the Spark context 
needs to exist:

[https://github.com/apache/spark/blob/8d70d5da3d74ebdd612b2cdc38201e121b88b24f/python/pyspark/sql/types.py#L819-L827]

which is not the case at import time.
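
One possible workaround (a minimal sketch based on my reading of the traceback, not verified against the Spark source): passing a DataType instance instead of a DDL string appears to avoid the eager call to _parse_datatype_string, so the module can be imported without an active Spark context:
{code:python}
# mymod.py -- hypothetical workaround sketch: use a DataType instance instead
# of the DDL string "string", so no JVM access is needed at decoration time.
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def to_upper(s: pd.Series) -> pd.Series:
    return s.str.upper()
{code}
If that is confirmed, it would at least unblock module-level imports, although the string form should arguably work as well.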
h2. Questions

First, am I doing something wrong? I do not see any mention of this in the documentation 
([https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.pandas_udf.html]),
and it seems likely to affect many users who are moving from the old style to 
the new style of pandas UDFs.

Second, is this the expected behaviour? Given that a module containing an 
old-style pandas UDF can be imported without problem, the new behaviour looks 
like a regression. Why should users need an active Spark context just to import 
a module that contains a pandas UDF?

Third, what could we do? I see that an assert was recently added on the master 
branch (https://issues.apache.org/jira/browse/SPARK-37620):

[https://

[jira] [Commented] (SPARK-38407) ANSI Cast: loosen the limitation of casting non-null complex types

2022-03-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502239#comment-17502239
 ] 

Apache Spark commented on SPARK-38407:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35754

> ANSI Cast: loosen the limitation of casting non-null complex types
> --
>
> Key: SPARK-38407
> URL: https://issues.apache.org/jira/browse/SPARK-38407
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> When ANSI mode is off, `ArrayType(DoubleType, containsNull = false)` can't be 
> cast as `ArrayType(IntegerType, containsNull = false)`, since an overflow can 
> produce null results and break the non-null constraint.
>  
> When ANSI mode is on, Spark SQL currently has the same behavior. However, 
> this is not correct, since the non-null constraint won't be broken. Spark SQL 
> can just execute the cast and throw a runtime error on overflow, just like 
> when casting DoubleType as IntegerType.
>  
> This applies to MapType and StructType as well.
>  
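
For illustration, a minimal PySpark sketch of the intended behaviour (my reading of the description; the exact semantics are defined by the linked pull request):
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, DoubleType, IntegerType, StructType, StructField

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.ansi.enabled", "true")

# Array column whose elements are declared non-null.
schema = StructType([StructField("arr", ArrayType(DoubleType(), containsNull=False), False)])
df = spark.createDataFrame([([1.0, 2.0],)], schema)

# With the proposed change, this cast is accepted under ANSI mode and only
# fails at runtime if an element overflows IntegerType.
df.select(df["arr"].cast(ArrayType(IntegerType(), containsNull=False))).show()
{code}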



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38434) Correct semantic of CheckAnalysis.getDataTypesAreCompatibleFn method

2022-03-07 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-38434.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35752
[https://github.com/apache/spark/pull/35752]

> Correct semantic of CheckAnalysis.getDataTypesAreCompatibleFn method
> 
>
> Key: SPARK-38434
> URL: https://issues.apache.org/jira/browse/SPARK-38434
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: huangtengfei
>Assignee: huangtengfei
>Priority: Minor
> Fix For: 3.3.0
>
>
> Currently, the `CheckAnalysis` method [getDataTypesAreCompatibleFn 
> |https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L606]
>  is implemented as:
> {code:java}
>   private def getDataTypesAreCompatibleFn(plan: LogicalPlan): (DataType, 
> DataType) => Boolean = {
> val isUnion = plan.isInstanceOf[Union]
> if (isUnion) {
>   (dt1: DataType, dt2: DataType) =>
> !DataType.equalsStructurally(dt1, dt2, true)
> } else {
>   // SPARK-18058: we shall not care about the nullability of columns
>   (dt1: DataType, dt2: DataType) =>
> TypeCoercion.findWiderTypeForTwo(dt1.asNullable, 
> dt2.asNullable).isEmpty
> }
>   }
> {code}
> It returns false when the data types are compatible and true otherwise, which 
> is confusing given the method name.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38434) Correct semantic of CheckAnalysis.getDataTypesAreCompatibleFn method

2022-03-07 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-38434:
--

Assignee: huangtengfei

> Correct semantic of CheckAnalysis.getDataTypesAreCompatibleFn method
> 
>
> Key: SPARK-38434
> URL: https://issues.apache.org/jira/browse/SPARK-38434
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: huangtengfei
>Assignee: huangtengfei
>Priority: Minor
>
> Currently, the `CheckAnalysis` method [getDataTypesAreCompatibleFn 
> |https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L606]
>  is implemented as:
> {code:java}
>   private def getDataTypesAreCompatibleFn(plan: LogicalPlan): (DataType, 
> DataType) => Boolean = {
> val isUnion = plan.isInstanceOf[Union]
> if (isUnion) {
>   (dt1: DataType, dt2: DataType) =>
> !DataType.equalsStructurally(dt1, dt2, true)
> } else {
>   // SPARK-18058: we shall not care about the nullability of columns
>   (dt1: DataType, dt2: DataType) =>
> TypeCoercion.findWiderTypeForTwo(dt1.asNullable, 
> dt2.asNullable).isEmpty
> }
>   }
> {code}
> It returns false when the data types are compatible and true otherwise, which 
> is confusing given the method name.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38436) Duplicate test_floor functions

2022-03-07 Thread Jira
Bjørn Jørgensen created SPARK-38436:
---

 Summary: Duplicate test_floor functions 
 Key: SPARK-38436
 URL: https://issues.apache.org/jira/browse/SPARK-38436
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Bjørn Jørgensen


In python/pyspark/pandas/tests/test_series_datetime.py, line 262:



{code:python}
def test_floor(self):
    self.check_func(lambda x: x.dt.floor(freq="min"))
    self.check_func(lambda x: x.dt.floor(freq="H"))

def test_ceil(self):
    self.check_func(lambda x: x.dt.floor(freq="min"))
    self.check_func(lambda x: x.dt.floor(freq="H"))
{code}

Change x.dt.floor to x.dt.ceil in test_ceil.
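
The corrected test would then look like this (a sketch of the suggested change):
{code:python}
def test_ceil(self):
    # exercise ceil, not floor, so the test actually covers dt.ceil
    self.check_func(lambda x: x.dt.ceil(freq="min"))
    self.check_func(lambda x: x.dt.ceil(freq="H"))
{code}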




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38436) Duplicate test_floor functions

2022-03-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502316#comment-17502316
 ] 

Apache Spark commented on SPARK-38436:
--

User 'bjornjorgensen' has created a pull request for this issue:
https://github.com/apache/spark/pull/35755

> Duplicate test_floor functions 
> ---
>
> Key: SPARK-38436
> URL: https://issues.apache.org/jira/browse/SPARK-38436
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Bjørn Jørgensen
>Priority: Trivial
>
> In python/pyspark/pandas/tests/test_series_datetime.py line 262 
> {code:java}
> def test_floor(self):
> self.check_func(lambda x: x.dt.floor(freq="min"))
> self.check_func(lambda x: x.dt.floor(freq="H"))
> def test_ceil(self):
> self.check_func(lambda x: x.dt.floor(freq="min"))
> self.check_func(lambda x: x.dt.floor(freq="H"))
> {code}
> Change  x.dt.floor to  x.dt.ceil 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38436) Duplicate test_floor functions

2022-03-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38436:


Assignee: Apache Spark

> Duplicate test_floor functions 
> ---
>
> Key: SPARK-38436
> URL: https://issues.apache.org/jira/browse/SPARK-38436
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Bjørn Jørgensen
>Assignee: Apache Spark
>Priority: Trivial
>
> In python/pyspark/pandas/tests/test_series_datetime.py line 262 
> {code:java}
> def test_floor(self):
> self.check_func(lambda x: x.dt.floor(freq="min"))
> self.check_func(lambda x: x.dt.floor(freq="H"))
> def test_ceil(self):
> self.check_func(lambda x: x.dt.floor(freq="min"))
> self.check_func(lambda x: x.dt.floor(freq="H"))
> {code}
> Change  x.dt.floor to  x.dt.ceil 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38436) Duplicate test_floor functions

2022-03-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502315#comment-17502315
 ] 

Apache Spark commented on SPARK-38436:
--

User 'bjornjorgensen' has created a pull request for this issue:
https://github.com/apache/spark/pull/35755

> Duplicate test_floor functions 
> ---
>
> Key: SPARK-38436
> URL: https://issues.apache.org/jira/browse/SPARK-38436
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Bjørn Jørgensen
>Priority: Trivial
>
> In python/pyspark/pandas/tests/test_series_datetime.py line 262 
> {code:java}
> def test_floor(self):
> self.check_func(lambda x: x.dt.floor(freq="min"))
> self.check_func(lambda x: x.dt.floor(freq="H"))
> def test_ceil(self):
> self.check_func(lambda x: x.dt.floor(freq="min"))
> self.check_func(lambda x: x.dt.floor(freq="H"))
> {code}
> Change  x.dt.floor to  x.dt.ceil 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38436) Duplicate test_floor functions

2022-03-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38436:


Assignee: (was: Apache Spark)

> Duplicate test_floor functions 
> ---
>
> Key: SPARK-38436
> URL: https://issues.apache.org/jira/browse/SPARK-38436
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Bjørn Jørgensen
>Priority: Trivial
>
> In python/pyspark/pandas/tests/test_series_datetime.py line 262 
> {code:java}
> def test_floor(self):
> self.check_func(lambda x: x.dt.floor(freq="min"))
> self.check_func(lambda x: x.dt.floor(freq="H"))
> def test_ceil(self):
> self.check_func(lambda x: x.dt.floor(freq="min"))
> self.check_func(lambda x: x.dt.floor(freq="H"))
> {code}
> Change  x.dt.floor to  x.dt.ceil 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38282) Avoid duplicating complex partitioning expressions

2022-03-07 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502373#comment-17502373
 ] 

Wenchen Fan commented on SPARK-38282:
-

cc [~yumwang] [~chengsu] 

> Avoid duplicating complex partitioning expressions
> --
>
> Key: SPARK-38282
> URL: https://issues.apache.org/jira/browse/SPARK-38282
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Tanel Kiis
>Priority: Major
>
> Spark will duplicate all non-trivial expressions in Window.partitionBy, which 
> results in duplicate exchanges and WindowExec nodes.
> An example unit test:
> {code}
>   test("SPARK-38282: Avoid duplicating complex partitioning expressions") {
> val group = functions.col("id") % 2
> val min = functions.min("id").over(Window.partitionBy(group))
> val max = functions.max("id").over(Window.partitionBy(group))
> val df1 = spark.range(1, 4)
>   .withColumn("ratio", max / min)
> val df2 = spark.range(1, 4)
>   .withColumn("min", min)
>   .withColumn("max", max)
>   .select(col("id"), (col("max") / col("min")).as("ratio"))
> Seq(df1, df2).foreach { df =>
>   checkAnswer(
> df,
> Seq(Row(1L, 3.0), Row(2L, 1.0), Row(3L, 3.0)))
>   val windows = collect(df.queryExecution.executedPlan) {
> case w: WindowExec => w
>   }
>   assert(windows.size == 1)
> }
>   }
> {code}
> The query plan for this (_w0#5L and _w1#6L are duplicates):
> {code}
> Window [min(id#2L) windowspecdefinition(_w1#6L, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
> AS _we1#8L], [_w1#6L]
>+- *(4) Sort [_w1#6L ASC NULLS FIRST], false, 0
>   +- AQEShuffleRead coalesced
>  +- ShuffleQueryStage 1
> +- Exchange hashpartitioning(_w1#6L, 5), ENSURE_REQUIREMENTS, 
> [id=#256]
>+- *(3) Project [id#2L, _w1#6L, _we0#7L]
>   +- Window [max(id#2L) windowspecdefinition(_w0#5L, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
> AS _we0#7L], [_w0#5L]
>  +- *(2) Sort [_w0#5L ASC NULLS FIRST], false, 0
> +- AQEShuffleRead coalesced
>+- ShuffleQueryStage 0
>   +- Exchange hashpartitioning(_w0#5L, 5), 
> ENSURE_REQUIREMENTS, [id=#203]
>  +- *(1) Project [id#2L, (id#2L % 2) AS 
> _w0#5L, (id#2L % 2) AS _w1#6L]
> +- *(1) Range (1, 4, step=1, splits=2)
> {code}
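
A possible workaround until this is optimized (an untested sketch, assuming that partitioning by a pre-computed column avoids the duplicated projections):
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Materialize the partitioning expression once, then partition by the column,
# so both window aggregates share the same partitioning attribute.
df = spark.range(1, 4).withColumn("group", F.col("id") % 2)
w = Window.partitionBy("group")

df.select(
    "id",
    (F.max("id").over(w) / F.min("id").over(w)).alias("ratio"),
).show()
{code}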



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38104) Use error classes in the parsing errors of windows

2022-03-07 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-38104.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35718
[https://github.com/apache/spark/pull/35718]

> Use error classes in the parsing errors of windows
> --
>
> Key: SPARK-38104
> URL: https://issues.apache.org/jira/browse/SPARK-38104
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Yuto Akutsu
>Priority: Major
> Fix For: 3.3.0
>
>
> Migrate the following errors in QueryParsingErrors:
> * repetitiveWindowDefinitionError
> * invalidWindowReferenceError
> * cannotResolveWindowReferenceError
> to use error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryParsingErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38104) Use error classes in the parsing errors of windows

2022-03-07 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-38104:


Assignee: Yuto Akutsu

> Use error classes in the parsing errors of windows
> --
>
> Key: SPARK-38104
> URL: https://issues.apache.org/jira/browse/SPARK-38104
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Yuto Akutsu
>Priority: Major
>
> Migrate the following errors in QueryParsingErrors:
> * repetitiveWindowDefinitionError
> * invalidWindowReferenceError
> * cannotResolveWindowReferenceError
> to use error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryParsingErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-38388) Repartition + Stage retries could lead to incorrect data

2022-03-07 Thread Jason Xu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17501894#comment-17501894
 ] 

Jason Xu edited comment on SPARK-38388 at 3/7/22, 6:08 PM:
---

[~jiangxb1987] thanks for the reply. The repro example above uses DataFrame APIs and 
doesn't use RDDs directly, so I don't follow how to override the 
`getOutputDeterministicLevel` function in this case. If I missed something, 
could you suggest how to modify the repro code above?
I see the `getOutputDeterministicLevel` function in RDD was introduced in 
[https://github.com/apache/spark/pull/22112|https://github.com/apache/spark/pull/22112]; 
does it only help when a user creates a customized RDD?


was (Author: kings129):
[~jiangxb1987] thanks for reply, above repo example uses DataFrame APIs and 
doesn't use RDD directly, I don't follow how to override the 
`getOutputDeterministicLevel` function in this case. If I missed something, 
could you help suggest how to modify above repo code?
I see `getOutputDeterministicLevel` function in RDD is introduced in 
[https://github.com/apache/spark/pull/22112,] does it only help when user 
create a customized RDD?

> Repartition + Stage retries could lead to incorrect data 
> -
>
> Key: SPARK-38388
> URL: https://issues.apache.org/jira/browse/SPARK-38388
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.1.1
> Environment: Spark 2.4 and 3.1
>Reporter: Jason Xu
>Priority: Major
>
> Spark repartition uses RoundRobinPartitioning; the generated result is 
> non-deterministic when the data has some randomness and stage/task retries happen.
> The bug can be triggered when the upstream data has some randomness, a 
> repartition is called on it, and this is then followed by a result stage (there 
> could be more stages).
> As the pattern shows below:
> upstream stage (data with randomness) -> (repartition shuffle) -> result stage
> When one executor goes down at the result stage, some tasks of that stage might 
> have finished while others fail; shuffle files on that executor also get lost, 
> so some tasks from the previous stage (upstream data generation, repartition) 
> will need to rerun to regenerate the dependent shuffle data files.
> Because the data has some randomness, the data regenerated by the retried 
> upstream tasks is slightly different, the repartition then produces an 
> inconsistent ordering, and the tasks retried at the result stage generate 
> different data.
> This is similar to, but different from, 
> https://issues.apache.org/jira/browse/SPARK-23207: the fix there uses an extra 
> local sort to make the row ordering deterministic, and the sorting algorithm it 
> uses simply compares row/record binaries. But in this case the upstream data has 
> some randomness, so the sorting algorithm doesn't help keep the order, and thus 
> RoundRobinPartitioning introduces a non-deterministic result.
> The following code returns 986415 instead of 1000000:
> {code:java}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> case class TestObject(id: Long, value: Double)
> val ds = spark.range(0, 1000 * 1000, 1).repartition(100, 
> $"id").withColumn("val", rand()).repartition(100).map { 
>   row => if (TaskContext.get.stageAttemptNumber == 0 && 
> TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId > 97) {
> throw new Exception("pkill -f java".!!)
>   }
>   TestObject(row.getLong(0), row.getDouble(1))
> }
> ds.toDF("id", "value").write.mode("overwrite").saveAsTable("tmp.test_table")
> spark.sql("select count(distinct id) from tmp.test_table").show{code}
> Command: 
> {code:java}
> spark-shell --num-executors 10 (--conf spark.dynamicAllocation.enabled=false 
> --conf spark.shuffle.service.enabled=false){code}
> To simulate the issue, the external shuffle service needs to be disabled (if 
> it is enabled by default in your environment); this triggers shuffle file loss 
> and previous-stage retries.
> In our production environment we have the external shuffle service enabled; 
> this data correctness issue happened when there were node losses.
> Although there's some non-deterministic factor in the upstream data, users 
> wouldn't expect to see incorrect results.
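
As a possible mitigation until the root cause is addressed (an untested sketch, my assumption): checkpointing the randomized data to reliable storage before the round-robin repartition truncates the lineage, so a stage retry re-reads the checkpointed data instead of regenerating different random values:
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # hypothetical path

ds = (
    spark.range(0, 1000 * 1000, 1)
    .repartition(100, "id")
    .withColumn("val", F.rand())
    .checkpoint()       # persist the random values to reliable storage
    .repartition(100)   # round-robin repartition now sees stable input on retry
)
ds.write.mode("overwrite").saveAsTable("tmp.test_table")
{code}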



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38388) Repartition + Stage retries could lead to incorrect data

2022-03-07 Thread Xingbo Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502481#comment-17502481
 ] 

Xingbo Jiang commented on SPARK-38388:
--

> upstream stage (data with randomness)

I assume you are using some customized data source that generates the random 
data but didn't correctly set the outputDeterministicLevel. A proper fix is to 
modify the data source to reflect the nature of the data's randomness.

> Repartition + Stage retries could lead to incorrect data 
> -
>
> Key: SPARK-38388
> URL: https://issues.apache.org/jira/browse/SPARK-38388
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.1.1
> Environment: Spark 2.4 and 3.1
>Reporter: Jason Xu
>Priority: Major
>
> Spark repartition uses RoundRobinPartitioning; the generated result is 
> non-deterministic when the data has some randomness and stage/task retries happen.
> The bug can be triggered when the upstream data has some randomness, a 
> repartition is called on it, and this is then followed by a result stage (there 
> could be more stages).
> As the pattern shows below:
> upstream stage (data with randomness) -> (repartition shuffle) -> result stage
> When one executor goes down at the result stage, some tasks of that stage might 
> have finished while others fail; shuffle files on that executor also get lost, 
> so some tasks from the previous stage (upstream data generation, repartition) 
> will need to rerun to regenerate the dependent shuffle data files.
> Because the data has some randomness, the data regenerated by the retried 
> upstream tasks is slightly different, the repartition then produces an 
> inconsistent ordering, and the tasks retried at the result stage generate 
> different data.
> This is similar to, but different from, 
> https://issues.apache.org/jira/browse/SPARK-23207: the fix there uses an extra 
> local sort to make the row ordering deterministic, and the sorting algorithm it 
> uses simply compares row/record binaries. But in this case the upstream data has 
> some randomness, so the sorting algorithm doesn't help keep the order, and thus 
> RoundRobinPartitioning introduces a non-deterministic result.
> The following code returns 986415 instead of 1000000:
> {code:java}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> case class TestObject(id: Long, value: Double)
> val ds = spark.range(0, 1000 * 1000, 1).repartition(100, 
> $"id").withColumn("val", rand()).repartition(100).map { 
>   row => if (TaskContext.get.stageAttemptNumber == 0 && 
> TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId > 97) {
> throw new Exception("pkill -f java".!!)
>   }
>   TestObject(row.getLong(0), row.getDouble(1))
> }
> ds.toDF("id", "value").write.mode("overwrite").saveAsTable("tmp.test_table")
> spark.sql("select count(distinct id) from tmp.test_table").show{code}
> Command: 
> {code:java}
> spark-shell --num-executors 10 (--conf spark.dynamicAllocation.enabled=false 
> --conf spark.shuffle.service.enabled=false){code}
> To simulate the issue, the external shuffle service needs to be disabled (if 
> it is enabled by default in your environment); this triggers shuffle file loss 
> and previous-stage retries.
> In our production environment we have the external shuffle service enabled; 
> this data correctness issue happened when there were node losses.
> Although there's some non-deterministic factor in the upstream data, users 
> wouldn't expect to see incorrect results.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38388) Repartition + Stage retries could lead to incorrect data

2022-03-07 Thread Xingbo Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502486#comment-17502486
 ] 

Xingbo Jiang commented on SPARK-38388:
--

I see that in your example you used the `rand()` function between repartition 
operators. I don't think this is a valid use case; do you really need to do 
this in production?

> Repartition + Stage retries could lead to incorrect data 
> -
>
> Key: SPARK-38388
> URL: https://issues.apache.org/jira/browse/SPARK-38388
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.1.1
> Environment: Spark 2.4 and 3.1
>Reporter: Jason Xu
>Priority: Major
>
> Spark repartition uses RoundRobinPartitioning, and the generated result is 
> non-deterministic when the data has some randomness and stage/task retries 
> happen.
> The bug can be triggered when the upstream data has some randomness, a 
> repartition is called on it, and it is then followed by a result stage (there 
> could be more stages).
> As the pattern below shows:
> upstream stage (data with randomness) -> (repartition shuffle) -> result stage
> When one executor goes down during the result stage, some tasks of that stage 
> might have finished while others fail, the shuffle files on that executor are 
> also lost, and some tasks from the previous stages (upstream data generation, 
> repartition) need to rerun to regenerate the shuffle files they depend on.
> Because the data has some randomness, the data regenerated by the retried 
> upstream tasks is slightly different, the repartition then produces an 
> inconsistent ordering, and the retried tasks of the result stage generate 
> different data.
> This is similar to, but different from, 
> https://issues.apache.org/jira/browse/SPARK-23207: the fix for that issue uses 
> an extra local sort to make the row ordering deterministic, and the sorting 
> algorithm it uses simply compares row/record binaries. But in this case the 
> upstream data has some randomness, so the sorting algorithm does not keep the 
> order stable, and RoundRobinPartitioning still introduces a non-deterministic 
> result.
> The following code returns 986415 instead of the expected 1000000:
> {code:java}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> case class TestObject(id: Long, value: Double)
> val ds = spark.range(0, 1000 * 1000, 1).repartition(100, 
> $"id").withColumn("val", rand()).repartition(100).map { 
>   row => if (TaskContext.get.stageAttemptNumber == 0 && 
> TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId > 97) {
> throw new Exception("pkill -f java".!!)
>   }
>   TestObject(row.getLong(0), row.getDouble(1))
> }
> ds.toDF("id", "value").write.mode("overwrite").saveAsTable("tmp.test_table")
> spark.sql("select count(distinct id) from tmp.test_table").show{code}
> Command: 
> {code:java}
> spark-shell --num-executors 10 (--conf spark.dynamicAllocation.enabled=false 
> --conf spark.shuffle.service.enabled=false){code}
> To simulate the issue, disabling the external shuffle service is needed (if it 
> is enabled by default in your environment); this is to trigger shuffle file 
> loss and retries of the previous stage.
> In our production environment we have the external shuffle service enabled, and 
> this data correctness issue happened when there were node losses.
> Although there is some non-deterministic factor in the upstream data, users 
> wouldn't expect to see an incorrect result.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38414) Remove redundant SuppressWarnings

2022-03-07 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao resolved SPARK-38414.

Fix Version/s: 3.3.0
 Assignee: Yang Jie
   Resolution: Fixed

> Remove redundant SuppressWarnings
> -
>
> Key: SPARK-38414
> URL: https://issues.apache.org/jira/browse/SPARK-38414
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Trivial
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38436) Duplicate test_floor functions

2022-03-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-38436:
-

Assignee: Bjørn Jørgensen

> Duplicate test_floor functions 
> ---
>
> Key: SPARK-38436
> URL: https://issues.apache.org/jira/browse/SPARK-38436
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Trivial
>
> In python/pyspark/pandas/tests/test_series_datetime.py line 262 
> {code:java}
> def test_floor(self):
> self.check_func(lambda x: x.dt.floor(freq="min"))
> self.check_func(lambda x: x.dt.floor(freq="H"))
> def test_ceil(self):
> self.check_func(lambda x: x.dt.floor(freq="min"))
> self.check_func(lambda x: x.dt.floor(freq="H"))
> {code}
> Change  x.dt.floor to  x.dt.ceil 
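
The corrected version implied by the description (illustrative, mirroring the 
snippet above) would be:

{code:python}
# Fixed test: test_ceil should exercise ceil, not floor.
def test_ceil(self):
    self.check_func(lambda x: x.dt.ceil(freq="min"))
    self.check_func(lambda x: x.dt.ceil(freq="H"))
{code}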



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38436) Fix `test_ceil` to test `ceil`

2022-03-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38436:
--
Summary: Fix `test_ceil` to test `ceil`  (was: Duplicate test_floor 
functions )

> Fix `test_ceil` to test `ceil`
> --
>
> Key: SPARK-38436
> URL: https://issues.apache.org/jira/browse/SPARK-38436
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Trivial
> Fix For: 3.3.0, 3.2.2
>
>
> In python/pyspark/pandas/tests/test_series_datetime.py line 262 
> {code:java}
> def test_floor(self):
> self.check_func(lambda x: x.dt.floor(freq="min"))
> self.check_func(lambda x: x.dt.floor(freq="H"))
> def test_ceil(self):
> self.check_func(lambda x: x.dt.floor(freq="min"))
> self.check_func(lambda x: x.dt.floor(freq="H"))
> {code}
> Change  x.dt.floor to  x.dt.ceil 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38436) Duplicate test_floor functions

2022-03-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-38436.
---
Fix Version/s: 3.3.0
   3.2.2
   Resolution: Fixed

Issue resolved by pull request 35755
[https://github.com/apache/spark/pull/35755]

> Duplicate test_floor functions 
> ---
>
> Key: SPARK-38436
> URL: https://issues.apache.org/jira/browse/SPARK-38436
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Trivial
> Fix For: 3.3.0, 3.2.2
>
>
> In python/pyspark/pandas/tests/test_series_datetime.py line 262 
> {code:java}
> def test_floor(self):
> self.check_func(lambda x: x.dt.floor(freq="min"))
> self.check_func(lambda x: x.dt.floor(freq="H"))
> def test_ceil(self):
> self.check_func(lambda x: x.dt.floor(freq="min"))
> self.check_func(lambda x: x.dt.floor(freq="H"))
> {code}
> Change  x.dt.floor to  x.dt.ceil 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38437) Dynamic serialization of Java datetime objects to micros/days

2022-03-07 Thread Max Gekk (Jira)
Max Gekk created SPARK-38437:


 Summary: Dynamic serialization of Java datetime objects to 
micros/days
 Key: SPARK-38437
 URL: https://issues.apache.org/jira/browse/SPARK-38437
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Max Gekk


Make the serializers to micros/days more tolerant of input Java objects, and accept:
- for timestamps: java.sql.Timestamp and java.time.Instant
- for days: java.sql.Date and java.time.LocalDate

This should make Spark SQL more reliable for user and datasource inputs.
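
For illustration, a rough sketch of what such a tolerant conversion could look 
like (hypothetical helper names, not the actual Spark code):

{code:scala}
// Hypothetical sketch: accept either external Java type and convert it to the
// internal Catalyst representation (microseconds since the epoch / days since the epoch).
import java.time.{Instant, LocalDate}
import java.sql.{Date, Timestamp}

def toMicros(obj: Any): Long = obj match {
  case i: Instant   => i.getEpochSecond * 1000000L + i.getNano / 1000L
  case t: Timestamp => Math.floorDiv(t.getTime, 1000L) * 1000000L + t.getNanos / 1000L
  case other        => throw new IllegalArgumentException(s"Unsupported timestamp value: $other")
}

def toDays(obj: Any): Int = obj match {
  case d: LocalDate => d.toEpochDay.toInt
  case d: Date      => d.toLocalDate.toEpochDay.toInt // uses the default JVM time zone
  case other        => throw new IllegalArgumentException(s"Unsupported date value: $other")
}
{code}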



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38437) Dynamic serialization of Java datetime objects to micros/days

2022-03-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502530#comment-17502530
 ] 

Apache Spark commented on SPARK-38437:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/35756

> Dynamic serialization of Java datetime objects to micros/days
> -
>
> Key: SPARK-38437
> URL: https://issues.apache.org/jira/browse/SPARK-38437
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Make the serializers to micros/days more tolerant of input Java objects, and 
> accept:
> - for timestamps: java.sql.Timestamp and java.time.Instant
> - for days: java.sql.Date and java.time.LocalDate
> This should make Spark SQL more reliable for user and datasource inputs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38437) Dynamic serialization of Java datetime objects to micros/days

2022-03-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38437:


Assignee: Apache Spark

> Dynamic serialization of Java datetime objects to micros/days
> -
>
> Key: SPARK-38437
> URL: https://issues.apache.org/jira/browse/SPARK-38437
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Make the serializers to micros/days more tolerant of input Java objects, and 
> accept:
> - for timestamps: java.sql.Timestamp and java.time.Instant
> - for days: java.sql.Date and java.time.LocalDate
> This should make Spark SQL more reliable for user and datasource inputs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38437) Dynamic serialization of Java datetime objects to micros/days

2022-03-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38437:


Assignee: (was: Apache Spark)

> Dynamic serialization of Java datetime objects to micros/days
> -
>
> Key: SPARK-38437
> URL: https://issues.apache.org/jira/browse/SPARK-38437
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Make the serializers to micros/days more tolerant of input Java objects, and 
> accept:
> - for timestamps: java.sql.Timestamp and java.time.Instant
> - for days: java.sql.Date and java.time.LocalDate
> This should make Spark SQL more reliable for user and datasource inputs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38402) Improve user experience when working on data frames created from CSV and JSON in PERMISSIVE mode.

2022-03-07 Thread Dilip Biswal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502532#comment-17502532
 ] 

Dilip Biswal commented on SPARK-38402:
--

[~hyukjin.kwon] Thanks!
Yeah, that should work. The only thing is, this puts an extra burden on the 
application to be aware of the context (i.e. accessing the error data frame) and 
to do the additional branching. We were wondering if this could be done 
implicitly by the runtime. After all, we are simply trying to run an operation on 
a data frame that was returned to us by Spark.

> Improve user experience when working on data frames created from CSV and JSON 
> in PERMISSIVE mode.
> -
>
> Key: SPARK-38402
> URL: https://issues.apache.org/jira/browse/SPARK-38402
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Dilip Biswal
>Priority: Major
>
> In our data processing pipeline, we first process the user-supplied data and 
> eliminate invalid/corrupt records. So we parse JSON and CSV files in 
> PERMISSIVE mode, where all the invalid records are captured in 
> "_corrupt_record". We then apply predicates on "_corrupt_record" to eliminate 
> the bad records before passing the good records further down the processing 
> pipeline.
> We encountered two issues.
> 1. The introduction of "predicate pushdown" for CSV does not take into account 
> this system-generated "_corrupt_record" column and tries to push it down to 
> the scan, resulting in an exception because the column is not part of the base 
> schema. 
> 2. Applying predicates on the "_corrupt_record" column results in an 
> AnalysisException like the following.
> {code:java}
> val schema = new StructType()
>   .add("id",IntegerType,true)
>   .add("weight",IntegerType,true) // The weight field is defined wrongly. The 
> actual data contains floating point numbers, while the schema specifies an 
> integer.
>   .add("price",IntegerType,true)
>   .add("_corrupt_record", StringType, true) // The schema contains a special 
> column _corrupt_record, which does not exist in the data. This column 
> captures rows that did not parse correctly.
> val csv_with_wrong_schema = spark.read.format("csv")
>   .option("header", "true")
>   .schema(schema)
>   .load("/FileStore/tables/csv_corrupt_record.csv")
> val badRows = csv_with_wrong_schema.filter($"_corrupt_record".isNotNull)
> val numBadRows = badRows.count()
>  Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
> referenced columns only include the internal corrupt record column
> (named _corrupt_record by default). For example:
> spark.read.schema(schema).csv(file).filter($"_corrupt_record".isNotNull).count()
> and spark.read.schema(schema).csv(file).select("_corrupt_record").show().
> Instead, you can cache or save the parsed results and then send the same 
> query.
> For example, val df = spark.read.schema(schema).csv(file).cache() and then
> df.filter($"_corrupt_record".isNotNull).count().
> {code}
> For (1), we have disabled predicate pushdown.
> For (2), we currently cache the data frame before using it; however, it's not 
> convenient and we would like to see a better user experience.  
>  
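
For reference, the caching workaround currently in use looks roughly like this 
(a sketch based on the snippet and error message above, not a proposed fix):

{code:scala}
// Reuses the `schema` defined in the snippet above: cache the parsed result first,
// then filter on the internal corrupt record column.
import org.apache.spark.sql.functions.col

val df = spark.read.format("csv")
  .option("header", "true")
  .schema(schema)
  .load("/FileStore/tables/csv_corrupt_record.csv")
  .cache()

val badRows = df.filter(col("_corrupt_record").isNotNull)
val numBadRows = badRows.count()
{code}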



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38285) ClassCastException: GenericArrayData cannot be cast to InternalRow

2022-03-07 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-38285:
---

Assignee: L. C. Hsieh

> ClassCastException: GenericArrayData cannot be cast to InternalRow
> --
>
> Key: SPARK-38285
> URL: https://issues.apache.org/jira/browse/SPARK-38285
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Alessandro Bacchini
>Assignee: L. C. Hsieh
>Priority: Major
>
> The following code with Spark 3.2.1 raises an exception:
> {code:python}
> import pyspark.sql.functions as F
> from pyspark.sql.types import StructType, StructField, ArrayType, StringType
> t = StructType([
>     StructField('o', 
>         ArrayType(
>             StructType([
>                 StructField('s', StringType(), False),
>                 StructField('b', ArrayType(
>                     StructType([
>                         StructField('e', StringType(), False)
>                     ]),
>                     True),
>                 False)
>             ]), 
>         True),
>     False)])
> value = {
>     "o": [
>         {
>             "s": "string1",
>             "b": [
>                 {
>                     "e": "string2"
>                 },
>                 {
>                     "e": "string3"
>                 }
>             ]
>         },
>         {
>             "s": "string4",
>             "b": [
>                 {
>                     "e": "string5"
>                 },
>                 {
>                     "e": "string6"
>                 },
>                 {
>                     "e": "string7"
>                 }
>             ]
>         }
>     ]
> }
> df = (
>     spark.createDataFrame([value], schema=t)
>     .select(F.explode("o").alias("eo"))
>     .select("eo.b.e")
> )
> df.show()
> {code}
> The exception message is:
> {code}
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to 
> org.apache.spark.sql.catalyst.InternalRow
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.getStruct(GenericArrayData.scala:76)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
>   at 
> org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
>   at 
> org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:155)
>   at 
> org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
>   at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>   at 
> org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
>   at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
>   at org.apache.spark.scheduler.Task.doRunTask(Task.scala:153)
>   at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:122)
>   at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>   at org.apache.spark.scheduler.Task.run(Task.scala:93)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:824)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1641)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:827)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:683)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> I am using Spark 3.2.1, but I don't know whether Spark 3.3.0 is also affected.
> Please note that the issue seems to be related to SPARK-37577: I am using the 
> same DataFrame schema, but this time I have populated it with a non-empty value.
> I think that this is a bug because with the following configuration it works as 
> expected:
> {code:python}
> spark.conf.set("spark.sql.optimizer.expression.nestedPruning.enabled", False)
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", False

[jira] [Resolved] (SPARK-38285) ClassCastException: GenericArrayData cannot be cast to InternalRow

2022-03-07 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-38285.
-
Fix Version/s: 3.3.0
   3.2.2
   Resolution: Fixed

Issue resolved by pull request 35749
[https://github.com/apache/spark/pull/35749]

> ClassCastException: GenericArrayData cannot be cast to InternalRow
> --
>
> Key: SPARK-38285
> URL: https://issues.apache.org/jira/browse/SPARK-38285
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Alessandro Bacchini
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
>
> The following code with Spark 3.2.1 raises an exception:
> {code:python}
> import pyspark.sql.functions as F
> from pyspark.sql.types import StructType, StructField, ArrayType, StringType
> t = StructType([
>     StructField('o', 
>         ArrayType(
>             StructType([
>                 StructField('s', StringType(), False),
>                 StructField('b', ArrayType(
>                     StructType([
>                         StructField('e', StringType(), False)
>                     ]),
>                     True),
>                 False)
>             ]), 
>         True),
>     False)])
> value = {
>     "o": [
>         {
>             "s": "string1",
>             "b": [
>                 {
>                     "e": "string2"
>                 },
>                 {
>                     "e": "string3"
>                 }
>             ]
>         },
>         {
>             "s": "string4",
>             "b": [
>                 {
>                     "e": "string5"
>                 },
>                 {
>                     "e": "string6"
>                 },
>                 {
>                     "e": "string7"
>                 }
>             ]
>         }
>     ]
> }
> df = (
>     spark.createDataFrame([value], schema=t)
>     .select(F.explode("o").alias("eo"))
>     .select("eo.b.e")
> )
> df.show()
> {code}
> The exception message is:
> {code}
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to 
> org.apache.spark.sql.catalyst.InternalRow
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.getStruct(GenericArrayData.scala:76)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
>   at 
> org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
>   at 
> org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:155)
>   at 
> org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
>   at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>   at 
> org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
>   at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
>   at org.apache.spark.scheduler.Task.doRunTask(Task.scala:153)
>   at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:122)
>   at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>   at org.apache.spark.scheduler.Task.run(Task.scala:93)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:824)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1641)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:827)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:683)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> I am using Spark 3.2.1, but I don't know whether Spark 3.3.0 is also affected.
> Please note that the issue seems to be related to SPARK-37577: I am using the 
> same DataFrame schema, but this time I have populated it with a non-empty value.
> I think that this is a bug because with the following configuration it works as 
> expected:
> {c

[jira] [Commented] (SPARK-38379) Kubernetes: NoSuchElementException: spark.app.id when using PersistentVolumes

2022-03-07 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502555#comment-17502555
 ] 

Thomas Graves commented on SPARK-38379:
---

So I actually created another pod with the Spark client in it and used the 
spark-shell.

[https://spark.apache.org/docs/3.2.1/running-on-kubernetes.html#client-mode]

The only thing I had to do was make sure the ports were available.

Since you don't run in this mode, I can investigate more.

> Kubernetes: NoSuchElementException: spark.app.id when using PersistentVolumes 
> --
>
> Key: SPARK-38379
> URL: https://issues.apache.org/jira/browse/SPARK-38379
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.1
>Reporter: Thomas Graves
>Priority: Major
>
> I'm using Spark 3.2.1 on a Kubernetes cluster and starting a spark-shell in 
> client mode. I'm using persistent local volumes to mount NVMe under /data in 
> the executors, and on startup the driver always throws the warning below.
> using these options:
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand
>  \
>      --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=fast-disks
>  \
>      --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=500Gi
>  \
>      --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/data
>  \
>      --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false
>  
>  
> {code:java}
> 22/03/01 20:21:22 WARN ExecutorPodsSnapshotsStoreImpl: Exception when 
> notifying snapshot subscriber.
> java.util.NoSuchElementException: spark.app.id
>         at org.apache.spark.SparkConf.$anonfun$get$1(SparkConf.scala:245)
>         at scala.Option.getOrElse(Option.scala:189)
>         at org.apache.spark.SparkConf.get(SparkConf.scala:245)
>         at org.apache.spark.SparkConf.getAppId(SparkConf.scala:450)
>         at 
> org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.$anonfun$constructVolumes$4(MountVolumesFeatureStep.scala:88)
>         at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>         at scala.collection.Iterator.foreach(Iterator.scala:943)
>         at scala.collection.Iterator.foreach$(Iterator.scala:943)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>         at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>         at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>         at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>         at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>         at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>         at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>         at 
> org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.constructVolumes(MountVolumesFeatureStep.scala:57)
>         at 
> org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.configurePod(MountVolumesFeatureStep.scala:34)
>         at 
> org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBuilder.$anonfun$buildFromFeatures$4(KubernetesExecutorBuilder.scala:64)
>         at 
> scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
>         at 
> scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
>         at scala.collection.immutable.List.foldLeft(List.scala:91)
>         at 
> org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBuilder.buildFromFeatures(KubernetesExecutorBuilder.scala:63)
>         at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$requestNewExecutors$1(ExecutorPodsAllocator.scala:391)
>         at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
>         at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.requestNewExecutors(ExecutorPodsAllocator.scala:382)
>         at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36(ExecutorPodsAllocator.scala:346)
>         at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36$adapted(ExecutorPodsAllocator.scala:339)
>         at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>         at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>         at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.onNewSnapshots(ExecutorPodsAllocator.scala:339)
>         at

[jira] [Commented] (SPARK-38183) Show warning when creating pandas-on-Spark session under ANSI mode.

2022-03-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502610#comment-17502610
 ] 

Apache Spark commented on SPARK-38183:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/35757

> Show warning when creating pandas-on-Spark session under ANSI mode.
> ---
>
> Key: SPARK-38183
> URL: https://issues.apache.org/jira/browse/SPARK-38183
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.3.0
>
>
> Since pandas API on Spark follows the behavior of pandas, not SQL, some 
> unexpected behavior can occur when "spark.sql.ansi.enabled" is True.
> For example,
>  * It raises an exception when {{div}}- & {{mod}}-related methods return null 
> (e.g. {{{}DataFrame.rmod{}}})
> {code:java}
> >>> df
>angels  degress
> 0   0  360
> 1   3  180
> 2   4  360
> >>> df.rmod(2)
> Traceback (most recent call last):
> ...
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 32.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 32.0 (TID 165) (172.30.1.44 executor driver): 
> org.apache.spark.SparkArithmeticException: divide by zero. To return NULL 
> instead, use 'try_divide'. If necessary set spark.sql.ansi.enabled to false 
> (except for ANSI interval type) to bypass this error.{code}
>  * It raises an exception when the DataFrame passed to {{ps.melt}} does not 
> have the same type for all columns.
>  
> {code:java}
> >>> df
>    A  B  C
> 0  a  1  2
> 1  b  3  4
> 2  c  5  6
> >>> ps.melt(df)
> Traceback (most recent call last):
> ...
> pyspark.sql.utils.AnalysisException: cannot resolve 'array(struct('A', A), 
> struct('B', B), struct('C', C))' due to data type mismatch: input to function 
> array should all be the same type, but it's 
> [struct, struct, 
> struct]
> To fix the error, you might need to add explicit type casts. If necessary set 
> spark.sql.ansi.enabled to false to bypass this error.;
> 'Project [__index_level_0__#223L, A#224, B#225L, C#226L, 
> __natural_order__#231L, explode(array(struct(variable, A, value, A#224), 
> struct(variable, B, value, B#225L), struct(variable, C, value, C#226L))) AS 
> pairs#269]
> +- Project [__index_level_0__#223L, A#224, B#225L, C#226L, 
> monotonically_increasing_id() AS __natural_order__#231L]
>    +- LogicalRDD [__index_level_0__#223L, A#224, B#225L, C#226L], false{code}
>  * It raises an exception when {{CategoricalIndex.remove_categories}} doesn't 
> remove the entire index
> {code:java}
> >>> idx
> CategoricalIndex(['a', 'b', 'b', 'c', 'c', 'c'], categories=['a', 'b', 'c'], 
> ordered=False, dtype='category')
> >>> idx.remove_categories('b')
> 22/02/14 09:16:14 ERROR Executor: Exception in task 2.0 in stage 41.0 (TID 
> 215)
> org.apache.spark.SparkNoSuchElementException: Key b does not exist. If 
> necessary set spark.sql.ansi.strictIndexOperator to false to bypass this 
> error.
> ...
> ...{code}
>  * It raises an exception when {{CategoricalIndex.set_categories}} doesn't set 
> the entire index
> {code:java}
> >>> idx.set_categories(['b', 'c'])
> 22/02/14 09:16:14 ERROR Executor: Exception in task 2.0 in stage 41.0 (TID 
> 215)
> org.apache.spark.SparkNoSuchElementException: Key a does not exist. If 
> necessary set spark.sql.ansi.strictIndexOperator to false to bypass this 
> error.
> ...
> ...{code}
>  * It raises an exception when {{ps.to_numeric}} gets a non-numeric type
> {code:java}
> >>> psser
> 0    apple
> 1      1.0
> 2        2
> 3       -3
> dtype: object
> >>> ps.to_numeric(psser)
> 22/02/14 09:22:36 ERROR Executor: Exception in task 2.0 in stage 63.0 (TID 
> 328)
> org.apache.spark.SparkNumberFormatException: invalid input syntax for type 
> numeric: apple. To return NULL instead, use 'try_cast'. If necessary set 
> spark.sql.ansi.enabled to false to bypass this error.
> ...{code}
>  * It raises an exception when {{strings.StringMethods.rsplit}} - also 
> {{strings.StringMethods.split}} - with {{expand=True}} returns null columns
> {code:java}
> >>> s
> 0                       this is a regular sentence
> 1    https://docs.python.org/3/tutorial/index.html
> 2                                             None
> dtype: object
> >>> s.str.split(n=4, expand=True)
> 22/02/14 09:26:23 ERROR Executor: Exception in task 5.0 in stage 69.0 (TID 
> 356)
> org.apache.spark.SparkArrayIndexOutOfBoundsException: Invalid index: 1, 
> numElements: 1. If necessary set spark.sql.ansi.strictIndexOperator to false 
> to bypass this error.{code}
>  * It raises exception when {{as_type}} with {{{}CategoricalDtype{}}}, and 
> the categories of {{CategoricalDtype}} is not matched wit

[jira] [Created] (SPARK-38438) Can't update spark.jars.packages on existing global/default context

2022-03-07 Thread Rafal Wojdyla (Jira)
Rafal Wojdyla created SPARK-38438:
-

 Summary: Can't update spark.jars.packages on existing 
global/default context
 Key: SPARK-38438
 URL: https://issues.apache.org/jira/browse/SPARK-38438
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, Spark Core
Affects Versions: 3.2.1
 Environment: py: 3.9
spark: 3.2.1
Reporter: Rafal Wojdyla


Reproduction:

{code:python}
from pyspark import SparkConf
from pyspark.sql import SparkSession

# default session:
s = SparkSession.builder.getOrCreate()

# later on we want to update jars.packages, here's e.g. spark-hats
s = (SparkSession.builder
 .config("spark.jars.packages", "za.co.absa:spark-hats_2.12:0.2.2")
 .getOrCreate())

# line below return None, the config was not propagated:
s._sc._conf.get("spark.jars.packages")
{code}

Stopping the context doesn't help, in fact it's even more confusing, because 
the configuration is updated, but doesn't have an effect:

{code:python}
from pyspark import SparkConf
from pyspark.sql import SparkSession

# default session:
s = SparkSession.builder.getOrCreate()

s.stop()

s = (SparkSession.builder
 .config("spark.jars.packages", "za.co.absa:spark-hats_2.12:0.2.2")
 .getOrCreate())

# now this line returns 'za.co.absa:spark-hats_2.12:0.2.2', but the context
# doesn't download the jar/package, as it would if there was no global context
# thus the extra package is unusable. It's not downloaded, or added to the
# classpath.
s._sc._conf.get("spark.jars.packages")
{code}

One workaround is to stop the context AND kill the JVM gateway, which seems to 
be a kind of hard reset:

{code:python}
from pyspark import SparkConf
from pyspark.sql import SparkSession

# default session:
s = SparkSession.builder.getOrCreate()

# Hard reset:
s.stop()
s._sc._gateway.shutdown()
SparkContext._gateway = None
SparkContext._jvm = None

s = (SparkSession.builder
 .config("spark.jars.packages", "za.co.absa:spark-hats_2.12:0.2.2")
 .getOrCreate())

# Now we are guaranteed there's a new spark session, and packages
# are downloaded, added to the classpath etc.
{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38438) Can't update spark.jars.packages on existing global/default context

2022-03-07 Thread Rafal Wojdyla (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rafal Wojdyla updated SPARK-38438:
--
Description: 
Reproduction:

{code:python}
from pyspark.sql import SparkSession

# default session:
s = SparkSession.builder.getOrCreate()

# later on we want to update jars.packages, here's e.g. spark-hats
s = (SparkSession.builder
 .config("spark.jars.packages", "za.co.absa:spark-hats_2.12:0.2.2")
 .getOrCreate())

# line below return None, the config was not propagated:
s._sc._conf.get("spark.jars.packages")
{code}

Stopping the context doesn't help, in fact it's even more confusing, because 
the configuration is updated, but doesn't have an effect:

{code:python}
from pyspark.sql import SparkSession

# default session:
s = SparkSession.builder.getOrCreate()

s.stop()

s = (SparkSession.builder
 .config("spark.jars.packages", "za.co.absa:spark-hats_2.12:0.2.2")
 .getOrCreate())

# now this line returns 'za.co.absa:spark-hats_2.12:0.2.2', but the context
# doesn't download the jar/package, as it would if there was no global context
# thus the extra package is unusable. It's not downloaded, or added to the
# classpath.
s._sc._conf.get("spark.jars.packages")
{code}

One workaround is to stop the context AND kill the JVM gateway, which seems to 
be a kind of hard reset:

{code:python}
from pyspark import SparkContext
from pyspark.sql import SparkSession

# default session:
s = SparkSession.builder.getOrCreate()

# Hard reset:
s.stop()
s._sc._gateway.shutdown()
SparkContext._gateway = None
SparkContext._jvm = None

s = (SparkSession.builder
 .config("spark.jars.packages", "za.co.absa:spark-hats_2.12:0.2.2")
 .getOrCreate())

# Now we are guaranteed there's a new spark session, and packages
# are downloaded, added to the classpath etc.
{code}

  was:
Reproduction:

{code:python}
from pyspark import SparkConf
from pyspark.sql import SparkSession

# default session:
s = SparkSession.builder.getOrCreate()

# later on we want to update jars.packages, here's e.g. spark-hats
s = (SparkSession.builder
 .config("spark.jars.packages", "za.co.absa:spark-hats_2.12:0.2.2")
 .getOrCreate())

# line below return None, the config was not propagated:
s._sc._conf.get("spark.jars.packages")
{code}

Stopping the context doesn't help, in fact it's even more confusing, because 
the configuration is updated, but doesn't have an effect:

{code:python}
from pyspark import SparkConf
from pyspark.sql import SparkSession

# default session:
s = SparkSession.builder.getOrCreate()

s.stop()

s = (SparkSession.builder
 .config("spark.jars.packages", "za.co.absa:spark-hats_2.12:0.2.2")
 .getOrCreate())

# now this line returns 'za.co.absa:spark-hats_2.12:0.2.2', but the context
# doesn't download the jar/package, as it would if there was no global context
# thus the extra package is unusable. It's not downloaded, or added to the
# classpath.
s._sc._conf.get("spark.jars.packages")
{code}

One workaround is to stop the context AND kill the JVM gateway, which seems to 
be a kind of hard reset:

{code:python}
from pyspark import SparkConf
from pyspark.sql import SparkSession

# default session:
s = SparkSession.builder.getOrCreate()

# Hard reset:
s.stop()
s._sc._gateway.shutdown()
SparkContext._gateway = None
SparkContext._jvm = None

s = (SparkSession.builder
 .config("spark.jars.packages", "za.co.absa:spark-hats_2.12:0.2.2")
 .getOrCreate())

# Now we are guaranteed there's a new spark session, and packages
# are downloaded, added to the classpath etc.
{code}


> Can't update spark.jars.packages on existing global/default context
> ---
>
> Key: SPARK-38438
> URL: https://issues.apache.org/jira/browse/SPARK-38438
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, Spark Core
>Affects Versions: 3.2.1
> Environment: py: 3.9
> spark: 3.2.1
>Reporter: Rafal Wojdyla
>Priority: Major
>
> Reproduction:
> {code:python}
> from pyspark.sql import SparkSession
> # default session:
> s = SparkSession.builder.getOrCreate()
> # later on we want to update jars.packages, here's e.g. spark-hats
> s = (SparkSession.builder
>  .config("spark.jars.packages", "za.co.absa:spark-hats_2.12:0.2.2")
>  .getOrCreate())
> # line below return None, the config was not propagated:
> s._sc._conf.get("spark.jars.packages")
> {code}
> Stopping the context doesn't help, in fact it's even more confusing, because 
> the configuration is updated, but doesn't have an effect:
> {code:python}
> from pyspark.sql import SparkSession
> # default session:
> s = SparkSession.builder.getOrCreate()
> s.stop()
> s = (SparkSession.builder
>  .config("spark.jars.packages", "za.co.absa:spark-hats_2.12:0.2.2")
>  .getOrCreate())
> # now this line returns 'za

[jira] [Commented] (SPARK-38438) Can't update spark.jars.packages on existing global/default context

2022-03-07 Thread Rafal Wojdyla (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502633#comment-17502633
 ] 

Rafal Wojdyla commented on SPARK-38438:
---

The workaround actually doesn't stop the existing JVM. It does stop most of the 
threads in the JVM (including the SparkContext-related ones and the py4j 
gateway); it turns out the only thread left is the `main` thread:

{noformat}
"main" #1 prio=5 os_prio=31 cpu=1381.53ms elapsed=67.25s tid=0x7fc478809000 
nid=0x2703 runnable  [0x7c094000]
   java.lang.Thread.State: RUNNABLE
at java.io.FileInputStream.readBytes(java.base@11.0.9.1/Native Method)
at 
java.io.FileInputStream.read(java.base@11.0.9.1/FileInputStream.java:279)
at 
java.io.BufferedInputStream.fill(java.base@11.0.9.1/BufferedInputStream.java:252)
at 
java.io.BufferedInputStream.read(java.base@11.0.9.1/BufferedInputStream.java:271)
- locked <0x0007c1012ca0> (a java.io.BufferedInputStream)
at 
org.apache.spark.api.python.PythonGatewayServer$.main(PythonGatewayServer.scala:68)
at 
org.apache.spark.api.python.PythonGatewayServer.main(PythonGatewayServer.scala)
at 
jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(java.base@11.0.9.1/Native 
Method)
at 
jdk.internal.reflect.NativeMethodAccessorImpl.invoke(java.base@11.0.9.1/NativeMethodAccessorImpl.java:62)
at 
jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@11.0.9.1/DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(java.base@11.0.9.1/Method.java:566)
at 
org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
at 
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{noformat}

This is waiting on the python process to stop: 
https://github.com/apache/spark/blob/71991f75ff441e80a52cb71f66f46bfebdb05671/core/src/main/scala/org/apache/spark/api/python/PythonGatewayServer.scala#L68-L70

Would it make sense to just close the stdin to trigger shutdown of the JVM, in 
which case the hard reset would be:

{code:python}
s.stop()
s._sc._gateway.shutdown()
s._sc._gateway.proc.stdin.close()
SparkContext._gateway = None
SparkContext._jvm = None
{code}

> Can't update spark.jars.packages on existing global/default context
> ---
>
> Key: SPARK-38438
> URL: https://issues.apache.org/jira/browse/SPARK-38438
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, Spark Core
>Affects Versions: 3.2.1
> Environment: py: 3.9
> spark: 3.2.1
>Reporter: Rafal Wojdyla
>Priority: Major
>
> Reproduction:
> {code:python}
> from pyspark.sql import SparkSession
> # default session:
> s = SparkSession.builder.getOrCreate()
> # later on we want to update jars.packages, here's e.g. spark-hats
> s = (SparkSession.builder
>  .config("spark.jars.packages", "za.co.absa:spark-hats_2.12:0.2.2")
>  .getOrCreate())
> # line below return None, the config was not propagated:
> s._sc._conf.get("spark.jars.packages")
> {code}
> Stopping the context doesn't help, in fact it's even more confusing, because 
> the configuration is updated, but doesn't have an effect:
> {code:python}
> from pyspark.sql import SparkSession
> # default session:
> s = SparkSession.builder.getOrCreate()
> s.stop()
> s = (SparkSession.builder
>  .config("spark.jars.packages", "za.co.absa:spark-hats_2.12:0.2.2")
>  .getOrCreate())
> # now this line returns 'za.co.absa:spark-hats_2.12:0.2.2', but the context
> # doesn't download the jar/package, as it would if there was no global context
> # thus the extra package is unusable. It's not downloaded, or added to the
> # classpath.
> s._sc._conf.get("spark.jars.packages")
> {code}
> One workaround is to stop the context AND kill the JVM gateway, which seems 
> to be a kind of hard reset:
> {code:python}
> from pyspark import SparkContext
> from pyspark.sql import SparkSession
> # default session:
> s = SparkSession.builder.getOrCreate()
> # Hard reset:
> s.stop()
> s._sc._gateway.shutdown()
> SparkContext._gateway = None
> SparkContext._jvm = None
> s = (SparkSession.builder
>  .config("spark.jars.packages", "za.co.absa:spark-hats_2.12:0.2.2")
>  .getOrCreate())
> # Now we are guaranteed there's a 

[jira] [Updated] (SPARK-38438) Can't update spark.jars.packages on existing global/default context

2022-03-07 Thread Rafal Wojdyla (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rafal Wojdyla updated SPARK-38438:
--
Issue Type: Bug  (was: New Feature)

> Can't update spark.jars.packages on existing global/default context
> ---
>
> Key: SPARK-38438
> URL: https://issues.apache.org/jira/browse/SPARK-38438
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.2.1
> Environment: py: 3.9
> spark: 3.2.1
>Reporter: Rafal Wojdyla
>Priority: Major
>
> Reproduction:
> {code:python}
> from pyspark.sql import SparkSession
> # default session:
> s = SparkSession.builder.getOrCreate()
> # later on we want to update jars.packages, here's e.g. spark-hats
> s = (SparkSession.builder
>  .config("spark.jars.packages", "za.co.absa:spark-hats_2.12:0.2.2")
>  .getOrCreate())
> # line below return None, the config was not propagated:
> s._sc._conf.get("spark.jars.packages")
> {code}
> Stopping the context doesn't help, in fact it's even more confusing, because 
> the configuration is updated, but doesn't have an effect:
> {code:python}
> from pyspark.sql import SparkSession
> # default session:
> s = SparkSession.builder.getOrCreate()
> s.stop()
> s = (SparkSession.builder
>  .config("spark.jars.packages", "za.co.absa:spark-hats_2.12:0.2.2")
>  .getOrCreate())
> # now this line returns 'za.co.absa:spark-hats_2.12:0.2.2', but the context
> # doesn't download the jar/package, as it would if there was no global context
> # thus the extra package is unusable. It's not downloaded, or added to the
> # classpath.
> s._sc._conf.get("spark.jars.packages")
> {code}
> One workaround is to stop the context AND kill the JVM gateway, which seems 
> to be a kind of hard reset:
> {code:python}
> from pyspark import SparkContext
> from pyspark.sql import SparkSession
> # default session:
> s = SparkSession.builder.getOrCreate()
> # Hard reset:
> s.stop()
> s._sc._gateway.shutdown()
> SparkContext._gateway = None
> SparkContext._jvm = None
> s = (SparkSession.builder
>  .config("spark.jars.packages", "za.co.absa:spark-hats_2.12:0.2.2")
>  .getOrCreate())
> # Now we are guaranteed there's a new spark session, and packages
> # are downloaded, added to the classpath etc.
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38385) Improve error messages of 'mismatched input' cases from ANTLR

2022-03-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-38385.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35707
[https://github.com/apache/spark/pull/35707]

> Improve error messages of 'mismatched input' cases from ANTLR
> -
>
> Key: SPARK-38385
> URL: https://issues.apache.org/jira/browse/SPARK-38385
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinyi Yu
>Assignee: Xinyi Yu
>Priority: Major
> Fix For: 3.3.0
>
>
> Please view the parent task description for the general idea: 
> https://issues.apache.org/jira/browse/SPARK-38384
> h1. Mismatched Input
> h2. Case 1
> Before
> {code:java}
> ParseException: 
> mismatched input 'sel' expecting {'(', 'APPLY', 'CONVERT', 'COPY', 
> 'OPTIMIZE', 'RESTORE', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 
> 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 
> 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 
> 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 
> 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'SYNC', 'TABLE', 
> 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, 
> pos 0)
> == SQL ==
> sel 1
> ^^^ {code}
> After
> {code:java}
> ParseException: 
> syntax error at or near 'sel'(line 1, pos 0)
> == SQL ==
> sel 1
> ^^^ {code}
> Changes:
>  # Adjust the wording from ‘mismatched input {}’ to a more readable one, ‘syntax 
> error at or near {}.’. This also aligns with the PostgreSQL error messages. 
>  # Remove the full list of expected tokens.
> h2. Case 2
> Before
> {code:java}
> ParseException: 
> mismatched input '<EOF>' expecting {'(', 'CONVERT', 'COPY', 'OPTIMIZE', 
> 'RESTORE', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 
> 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 
> 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 
> 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 
> 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 
> 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0)
> == SQL ==
> ^^^ {code}
> After
> {code:java}
> ParseException: 
> syntax error, unexpected empty SQL statement(line 1, pos 0)
> == SQL ==
> ^^^{code}
> Changes:
>  # For empty query, output a specific error message ‘syntax error, unexpected 
> empty SQL statement’
> h2. Case 3
> Before
> {code:java}
> ParseException: 
> mismatched input '<EOF>' expecting {'APPLY', 'CALLED', 'CHANGES', 'CLONE', 
> 'COLLECT', 'CONTAINS', 'CONVERT', 'COPY', 'COPY_OPTIONS', 'CREDENTIAL', 
> 'CREDENTIALS', 'DEEP', 'DEFINER', 'DELTA', 'DETERMINISTIC', 'ENCRYPTION', 
> 'EXPECT', 'FAIL', 'FILES',… (omit long message) 'TRIM', 'TRUE', 'TRUNCATE', 
> 'TRY_CAST', 'TYPE', 'UNARCHIVE', 'UNBOUNDED', 'UNCACHE', 'UNION', 'UNIQUE', 
> 'UNKNOWN', 'UNLOCK', 'UNSET', 'UPDATE', 'USE', 'USER', 'USING', 'VALUES', 
> 'VERSION', 'VIEW', 'VIEWS', 'WHEN', 'WHERE', 'WINDOW', 'WITH', 'WITHIN', 
> 'YEAR', 'ZONE', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 11)
> == SQL ==
> select 1  (
> ---^^^ {code}
> After
> {code:java}
> ParseException: 
> syntax error at or near end of input(line 1, pos 11)
> == SQL ==
> select 1  (
> ---^^^{code}
> Changes:
>  # For the faulty token <EOF>, substitute it with the readable string ‘end of 
> input’.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38385) Improve error messages of 'mismatched input' cases from ANTLR

2022-03-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-38385:
---

Assignee: Xinyi Yu

> Improve error messages of 'mismatched input' cases from ANTLR
> -
>
> Key: SPARK-38385
> URL: https://issues.apache.org/jira/browse/SPARK-38385
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinyi Yu
>Assignee: Xinyi Yu
>Priority: Major
>
> Please view the parent task description for the general idea: 
> https://issues.apache.org/jira/browse/SPARK-38384
> h1. Mismatched Input
> h2. Case 1
> Before
> {code:java}
> ParseException: 
> mismatched input 'sel' expecting {'(', 'APPLY', 'CONVERT', 'COPY', 
> 'OPTIMIZE', 'RESTORE', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 
> 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 
> 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 
> 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 
> 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'SYNC', 'TABLE', 
> 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, 
> pos 0)
> == SQL ==
> sel 1
> ^^^ {code}
> After
> {code:java}
> ParseException: 
> syntax error at or near 'sel'(line 1, pos 0)
> == SQL ==
> sel 1
> ^^^ {code}
> Changes:
>  # Adjust the wording from ‘mismatched input {}’ to a more readable one, ‘syntax 
> error at or near {}.’. This also aligns with the PostgreSQL error messages. 
>  # Remove the full list of expected tokens.
> h2. Case 2
> Before
> {code:java}
> ParseException: 
> mismatched input '<EOF>' expecting {'(', 'CONVERT', 'COPY', 'OPTIMIZE', 
> 'RESTORE', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 
> 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 
> 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 
> 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 
> 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 
> 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0)
> == SQL ==
> ^^^ {code}
> After
> {code:java}
> ParseException: 
> syntax error, unexpected empty SQL statement(line 1, pos 0)
> == SQL ==
> ^^^{code}
> Changes:
>  # For empty query, output a specific error message ‘syntax error, unexpected 
> empty SQL statement’
> h2. Case 3
> Before
> {code:java}
> ParseException: 
> mismatched input '<EOF>' expecting {'APPLY', 'CALLED', 'CHANGES', 'CLONE', 
> 'COLLECT', 'CONTAINS', 'CONVERT', 'COPY', 'COPY_OPTIONS', 'CREDENTIAL', 
> 'CREDENTIALS', 'DEEP', 'DEFINER', 'DELTA', 'DETERMINISTIC', 'ENCRYPTION', 
> 'EXPECT', 'FAIL', 'FILES',… (omit long message) 'TRIM', 'TRUE', 'TRUNCATE', 
> 'TRY_CAST', 'TYPE', 'UNARCHIVE', 'UNBOUNDED', 'UNCACHE', 'UNION', 'UNIQUE', 
> 'UNKNOWN', 'UNLOCK', 'UNSET', 'UPDATE', 'USE', 'USER', 'USING', 'VALUES', 
> 'VERSION', 'VIEW', 'VIEWS', 'WHEN', 'WHERE', 'WINDOW', 'WITH', 'WITHIN', 
> 'YEAR', 'ZONE', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 11)
> == SQL ==
> select 1  (
> ---^^^ {code}
> After
> {code:java}
> ParseException: 
> syntax error at or near end of input(line 1, pos 11)
> == SQL ==
> select 1  (
> ---^^^{code}
> Changes:
>  # For the faulty token <EOF>, substitute it with the readable string ‘end of 
> input’.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38439) Add Braces with if,else,for,do and while statements

2022-03-07 Thread qian (Jira)
qian created SPARK-38439:


 Summary: Add Braces with if,else,for,do and while statements
 Key: SPARK-38439
 URL: https://issues.apache.org/jira/browse/SPARK-38439
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 3.2.1, 3.2.0
Reporter: qian


Braces should be used with {_}if{_}, {_}else{_}, {_}for{_}, _do_ and _while_ 
statements, even if the body contains only a single statement. Avoid writing code 
like the following example: 
{code:java}
if (condition) statements;
{code}
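
For example, the braced equivalent (illustrative only) would be:

{code:java}
// Preferred: always wrap the body in braces, even for a single statement.
if (condition) {
    statements;
}
{code}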



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38282) Avoid duplicating complex partitioning expressions

2022-03-07 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502674#comment-17502674
 ] 

Yuming Wang commented on SPARK-38282:
-

Some related PRs:
https://github.com/apache/spark/pull/33522
https://github.com/apache/spark/pull/34334

> Avoid duplicating complex partitioning expressions
> --
>
> Key: SPARK-38282
> URL: https://issues.apache.org/jira/browse/SPARK-38282
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Tanel Kiis
>Priority: Major
>
> Spark will duplicate all non-trivial expressions in Window.partitionBy, which 
> results in duplicate exchanges and WindowExec nodes.
> An example unit test:
> {code}
>   test("SPARK-38282: Avoid duplicating complex partitioning expressions") {
> val group = functions.col("id") % 2
> val min = functions.min("id").over(Window.partitionBy(group))
> val max = functions.max("id").over(Window.partitionBy(group))
> val df1 = spark.range(1, 4)
>   .withColumn("ratio", max / min)
> val df2 = spark.range(1, 4)
>   .withColumn("min", min)
>   .withColumn("max", max)
>   .select(col("id"), (col("max") / col("min")).as("ratio"))
> Seq(df1, df2).foreach { df =>
>   checkAnswer(
> df,
> Seq(Row(1L, 3.0), Row(2L, 1.0), Row(3L, 3.0)))
>   val windows = collect(df.queryExecution.executedPlan) {
> case w: WindowExec => w
>   }
>   assert(windows.size == 1)
> }
>   }
> {code}
> The query plan for this (_w0#5L and _w1#6L are duplicates):
> {code}
> Window [min(id#2L) windowspecdefinition(_w1#6L, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
> AS _we1#8L], [_w1#6L]
>+- *(4) Sort [_w1#6L ASC NULLS FIRST], false, 0
>   +- AQEShuffleRead coalesced
>  +- ShuffleQueryStage 1
> +- Exchange hashpartitioning(_w1#6L, 5), ENSURE_REQUIREMENTS, 
> [id=#256]
>+- *(3) Project [id#2L, _w1#6L, _we0#7L]
>   +- Window [max(id#2L) windowspecdefinition(_w0#5L, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
> AS _we0#7L], [_w0#5L]
>  +- *(2) Sort [_w0#5L ASC NULLS FIRST], false, 0
> +- AQEShuffleRead coalesced
>+- ShuffleQueryStage 0
>   +- Exchange hashpartitioning(_w0#5L, 5), 
> ENSURE_REQUIREMENTS, [id=#203]
>  +- *(1) Project [id#2L, (id#2L % 2) AS 
> _w0#5L, (id#2L % 2) AS _w1#6L]
> +- *(1) Range (1, 4, step=1, splits=2)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37895) Error while joining two tables with non-english field names

2022-03-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37895.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35726
[https://github.com/apache/spark/pull/35726]

> Error while joining two tables with non-english field names
> ---
>
> Key: SPARK-37895
> URL: https://issues.apache.org/jira/browse/SPARK-37895
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Marina Krasilnikova
>Assignee: Pablo Langa Blanco
>Priority: Minor
> Fix For: 3.3.0
>
>
> While trying to join two tables with non-English field names in PostgreSQL 
> with a query like
> "select view1.`Имя1` , view1.`Имя2`, view2.`Имя3` from view1 left join  view2 
> on view1.`Имя2`=view2.`Имя4`"
> we get an error saying that there is no field "`Имя4`" (the field name is 
> surrounded by backticks).
> It appears that, to get the data from the second table, it constructs a query like
> SELECT "Имя3","Имя4" FROM "public"."tab2"  WHERE ("`Имя4`" IS NOT NULL) 
> and these backticks are redundant in the WHERE clause.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37895) Error while joining two tables with non-english field names

2022-03-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37895:
---

Assignee: Pablo Langa Blanco

> Error while joining two tables with non-english field names
> ---
>
> Key: SPARK-37895
> URL: https://issues.apache.org/jira/browse/SPARK-37895
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Marina Krasilnikova
>Assignee: Pablo Langa Blanco
>Priority: Minor
>
> While trying to join two tables with non-English field names in PostgreSQL 
> with a query like
> "select view1.`Имя1` , view1.`Имя2`, view2.`Имя3` from view1 left join  view2 
> on view1.`Имя2`=view2.`Имя4`"
> we get an error saying that there is no field "`Имя4`" (the field name is 
> surrounded by backticks).
> It appears that, to get the data from the second table, it constructs a query like
> SELECT "Имя3","Имя4" FROM "public"."tab2"  WHERE ("`Имя4`" IS NOT NULL) 
> and these backticks are redundant in the WHERE clause.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38440) DataFilter pushed down with PartitionFilter for Parquet V1 Datasource

2022-03-07 Thread Jackey Lee (Jira)
Jackey Lee created SPARK-38440:
--

 Summary: DataFilter pushed down with PartitionFilter for Parquet 
V1 Datasource
 Key: SPARK-38440
 URL: https://issues.apache.org/jira/browse/SPARK-38440
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Jackey Lee


Based on SPARK-38041, we can push down the dataFilter together with the partitionFilter to the 
Parquet V1 datasource, and remove the partitionFilter at runtime.
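A sketch of the kind of query this targets (the path, partition column `dt` and data column `value` below are assumptions for illustration, not part of the proposal):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("pushdown-sketch").getOrCreate()
import spark.implicits._

// Assume a Parquet table at /tmp/events partitioned by `dt` with a data column `value`.
val df = spark.read.parquet("/tmp/events")
  .filter($"dt" === "2022-03-07" && $"value" > 10)

// Idea: push `value > 10` to the Parquet reader together with the partition filter,
// and drop `dt = '2022-03-07'` from runtime evaluation since partition pruning
// has already applied it.
df.explain()
{code}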



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-38438) Can't update spark.jars.packages on existing global/default context

2022-03-07 Thread Rafal Wojdyla (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502633#comment-17502633
 ] 

Rafal Wojdyla edited comment on SPARK-38438 at 3/8/22, 4:33 AM:


The workaround doesn't actually stop the existing JVM. It does stop most of the 
threads in the JVM (including the Spark context related ones and the py4j gateway), but it 
turns out the only (non-daemon) thread left is the `main` thread:

{noformat}
"main" #1 prio=5 os_prio=31 cpu=1381.53ms elapsed=67.25s tid=0x7fc478809000 
nid=0x2703 runnable  [0x7c094000]
   java.lang.Thread.State: RUNNABLE
at java.io.FileInputStream.readBytes(java.base@11.0.9.1/Native Method)
at 
java.io.FileInputStream.read(java.base@11.0.9.1/FileInputStream.java:279)
at 
java.io.BufferedInputStream.fill(java.base@11.0.9.1/BufferedInputStream.java:252)
at 
java.io.BufferedInputStream.read(java.base@11.0.9.1/BufferedInputStream.java:271)
- locked <0x0007c1012ca0> (a java.io.BufferedInputStream)
at 
org.apache.spark.api.python.PythonGatewayServer$.main(PythonGatewayServer.scala:68)
at 
org.apache.spark.api.python.PythonGatewayServer.main(PythonGatewayServer.scala)
at 
jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(java.base@11.0.9.1/Native 
Method)
at 
jdk.internal.reflect.NativeMethodAccessorImpl.invoke(java.base@11.0.9.1/NativeMethodAccessorImpl.java:62)
at 
jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@11.0.9.1/DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(java.base@11.0.9.1/Method.java:566)
at 
org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
at 
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{noformat}

This is waiting on the Python process to stop: 
https://github.com/apache/spark/blob/71991f75ff441e80a52cb71f66f46bfebdb05671/core/src/main/scala/org/apache/spark/api/python/PythonGatewayServer.scala#L68-L70

Would it make sense to just close stdin to trigger a shutdown of the JVM? In 
that case the hard reset would be:

{code:python}
s.stop()
s._sc._gateway.shutdown()
s._sc._gateway.proc.stdin.close()
SparkContext._gateway = None
SparkContext._jvm = None
{code}


was (Author: ravwojdyla):
The workaround actually doesn't stop the existing JVM, it does stop most of the 
threads in the JVM (including spark context related, and py4j gateway), turns 
out the only thread left is the `main` thread:

{noformat}
"main" #1 prio=5 os_prio=31 cpu=1381.53ms elapsed=67.25s tid=0x7fc478809000 
nid=0x2703 runnable  [0x7c094000]
   java.lang.Thread.State: RUNNABLE
at java.io.FileInputStream.readBytes(java.base@11.0.9.1/Native Method)
at 
java.io.FileInputStream.read(java.base@11.0.9.1/FileInputStream.java:279)
at 
java.io.BufferedInputStream.fill(java.base@11.0.9.1/BufferedInputStream.java:252)
at 
java.io.BufferedInputStream.read(java.base@11.0.9.1/BufferedInputStream.java:271)
- locked <0x0007c1012ca0> (a java.io.BufferedInputStream)
at 
org.apache.spark.api.python.PythonGatewayServer$.main(PythonGatewayServer.scala:68)
at 
org.apache.spark.api.python.PythonGatewayServer.main(PythonGatewayServer.scala)
at 
jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(java.base@11.0.9.1/Native 
Method)
at 
jdk.internal.reflect.NativeMethodAccessorImpl.invoke(java.base@11.0.9.1/NativeMethodAccessorImpl.java:62)
at 
jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@11.0.9.1/DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(java.base@11.0.9.1/Method.java:566)
at 
org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
at 
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
at org.apache.spark.deploy.SparkSubmit.main(Sp

[jira] [Assigned] (SPARK-38439) Add Braces with if,else,for,do and while statements

2022-03-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38439:


Assignee: Apache Spark

> Add Braces with if,else,for,do and while statements
> ---
>
> Key: SPARK-38439
> URL: https://issues.apache.org/jira/browse/SPARK-38439
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0, 3.2.1
>Reporter: qian
>Assignee: Apache Spark
>Priority: Minor
>
> Braces should be used with _if_, _else_, _for_, _do_ and _while_ 
> statements, even if the body contains only a single statement. Avoid 
> the following pattern: 
> {code:java}
> if (condition) statements;
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38439) Add Braces with if,else,for,do and while statements

2022-03-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502712#comment-17502712
 ] 

Apache Spark commented on SPARK-38439:
--

User 'dcoliversun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35758

> Add Braces with if,else,for,do and while statements
> ---
>
> Key: SPARK-38439
> URL: https://issues.apache.org/jira/browse/SPARK-38439
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0, 3.2.1
>Reporter: qian
>Priority: Minor
>
> Braces should be used with _if_, _else_, _for_, _do_ and _while_ 
> statements, even if the body contains only a single statement. Avoid 
> the following pattern: 
> {code:java}
> if (condition) statements;
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38439) Add Braces with if,else,for,do and while statements

2022-03-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38439:


Assignee: (was: Apache Spark)

> Add Braces with if,else,for,do and while statements
> ---
>
> Key: SPARK-38439
> URL: https://issues.apache.org/jira/browse/SPARK-38439
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0, 3.2.1
>Reporter: qian
>Priority: Minor
>
> Braces should be used with _if_, _else_, _for_, _do_ and _while_ 
> statements, even if the body contains only a single statement. Avoid 
> the following pattern: 
> {code:java}
> if (condition) statements;
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38103) Use error classes in the parsing errors of transform

2022-03-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502725#comment-17502725
 ] 

Apache Spark commented on SPARK-38103:
--

User 'yutoacts' has created a pull request for this issue:
https://github.com/apache/spark/pull/35759

> Use error classes in the parsing errors of transform
> 
>
> Key: SPARK-38103
> URL: https://issues.apache.org/jira/browse/SPARK-38103
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryParsingErrors:
> * transformNotSupportQuantifierError
> * transformWithSerdeUnsupportedError
> * tooManyArgumentsForTransformError
> * notEnoughArgumentsForTransformError
> * invalidTransformArgumentError
> onto error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryParsingErrorsSuite (a minimal sketch follows below).
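A minimal sketch of the intended shape of such a test (the triggering query and the printed error class are assumptions for illustration, not taken from the Spark source):
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.parser.ParseException

// Sketch only: trigger the "TRANSFORM does not support a quantifier" parse error
// and check that the exception carries a stable error class after the migration.
val spark = SparkSession.builder().master("local[1]").appName("error-class-sketch").getOrCreate()
try {
  spark.sql("SELECT TRANSFORM(DISTINCT a) USING 'cat' FROM VALUES (1) AS t(a)")
} catch {
  case e: ParseException =>
    // Before the migration this is null; afterwards it should be a named error class.
    println(e.getErrorClass)
}
{code}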



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37865) Spark should not dedup the groupingExpressions when the first child of Union has duplicate columns

2022-03-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502724#comment-17502724
 ] 

Apache Spark commented on SPARK-37865:
--

User 'karenfeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/35760

> Spark should not dedup the groupingExpressions when the first child of Union 
> has duplicate columns
> --
>
> Key: SPARK-37865
> URL: https://issues.apache.org/jira/browse/SPARK-37865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Gao
>Priority: Major
>
> When the first child of Union has duplicate columns, like select a, a from t1 
> union select a, b from t2, Spark only uses the first column to aggregate the 
> results, which makes the results incorrect, and this behavior is 
> inconsistent with other engines such as PostgreSQL and MySQL. We could alias the 
> attribute of the first child of the union to resolve this (a sketch of that 
> follows below), or one could argue that this is a feature of Spark SQL.
> Sample query:
> select
> a,
> a
> from values (1, 1), (1, 2) as t1(a, b)
> UNION
> SELECT
> a,
> b
> from values (1, 1), (1, 2) as t2(a, b)
> Spark's result:
> (1,1)
> Result from PostgreSQL and MySQL:
> (1,1)
> (1,2)
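A minimal sketch of the aliasing idea mentioned above (run in spark-shell; the alias name is only an illustration):
{code:scala}
// Alias the duplicated column in the first branch so that UNION's distinct
// aggregation keeps two separate output attributes instead of collapsing them.
val fixed = spark.sql("""
  SELECT a, a AS a_copy FROM VALUES (1, 1), (1, 2) AS t1(a, b)
  UNION
  SELECT a, b           FROM VALUES (1, 1), (1, 2) AS t2(a, b)
""")
fixed.show()
// Expected rows: (1,1) and (1,2), matching PostgreSQL and MySQL.
{code}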



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38103) Use error classes in the parsing errors of transform

2022-03-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38103:


Assignee: Apache Spark

> Use error classes in the parsing errors of transform
> 
>
> Key: SPARK-38103
> URL: https://issues.apache.org/jira/browse/SPARK-38103
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Migrate the following errors in QueryParsingErrors:
> * transformNotSupportQuantifierError
> * transformWithSerdeUnsupportedError
> * tooManyArgumentsForTransformError
> * notEnoughArgumentsForTransformError
> * invalidTransformArgumentError
> onto error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryParsingErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38103) Use error classes in the parsing errors of transform

2022-03-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38103:


Assignee: (was: Apache Spark)

> Use error classes in the parsing errors of transform
> 
>
> Key: SPARK-38103
> URL: https://issues.apache.org/jira/browse/SPARK-38103
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryParsingErrors:
> * transformNotSupportQuantifierError
> * transformWithSerdeUnsupportedError
> * tooManyArgumentsForTransformError
> * notEnoughArgumentsForTransformError
> * invalidTransformArgumentError
> onto error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryParsingErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38441) Support string and bool `regex` in `Series.replace`

2022-03-07 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-38441:


 Summary: Support string and bool `regex` in `Series.replace`
 Key: SPARK-38441
 URL: https://issues.apache.org/jira/browse/SPARK-38441
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Xinrong Meng


Support string and bool `regex` in `Series.replace` in order to reach parity 
with pandas.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37690) Recursive view `df` detected (cycle: `df` -> `df`)

2022-03-07 Thread Yishai Chernovitzky (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502743#comment-17502743
 ] 

Yishai Chernovitzky commented on SPARK-37690:
-

It seems to be resolved by SPARK-38318

> Recursive view `df` detected (cycle: `df` -> `df`)
> --
>
> Key: SPARK-37690
> URL: https://issues.apache.org/jira/browse/SPARK-37690
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Robin
>Priority: Major
>
> In Spark 3.2.0, you can no longer reuse the same name for a temporary view.  
> This change is backwards incompatible, and means a common way of running 
> pipelines of SQL queries no longer works.   The following is a simple 
> reproducible example that works in Spark 2.x and 3.1.2, but not in 3.2.0: 
> {code:python}from pyspark.context import SparkContext 
> from pyspark.sql import SparkSession 
> sc = SparkContext.getOrCreate() 
> spark = SparkSession(sc) 
> sql = """ SELECT id as col_1, rand() AS col_2 FROM RANGE(10); """ 
> df = spark.sql(sql) 
> df.createOrReplaceTempView("df") 
> sql = """ SELECT * FROM df """ 
> df = spark.sql(sql) 
> df.createOrReplaceTempView("df") 
> sql = """ SELECT * FROM df """ 
> df = spark.sql(sql) {code}   
> The following error is now produced:   
> {code:python}AnalysisException: Recursive view `df` detected (cycle: `df` -> 
> `df`) 
> {code} 
> I'm reasonably sure this change is unintentional in 3.2.0 since it breaks a 
> lot of legacy code, and the `createOrReplaceTempView` method is named 
> explicitly such that replacing an existing view should be allowed.   An 
> internet search suggests other users have run into similar problems, e.g. 
> [here|https://community.databricks.com/s/question/0D53f1Qugr7CAB/upgrading-from-spark-24-to-32-recursive-view-errors-when-using]
>   



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38441) Support string and bool `regex` in `Series.replace`

2022-03-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38441:


Assignee: (was: Apache Spark)

> Support string and bool `regex` in `Series.replace`
> ---
>
> Key: SPARK-38441
> URL: https://issues.apache.org/jira/browse/SPARK-38441
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Support string and bool `regex` in `Series.replace` in order to reach parity 
> with pandas.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38441) Support string and bool `regex` in `Series.replace`

2022-03-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38441:


Assignee: Apache Spark

> Support string and bool `regex` in `Series.replace`
> ---
>
> Key: SPARK-38441
> URL: https://issues.apache.org/jira/browse/SPARK-38441
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> Support string and bool `regex` in `Series.replace` in order to reach parity 
> with pandas.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38441) Support string and bool `regex` in `Series.replace`

2022-03-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502753#comment-17502753
 ] 

Apache Spark commented on SPARK-38441:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/35747

> Support string and bool `regex` in `Series.replace`
> ---
>
> Key: SPARK-38441
> URL: https://issues.apache.org/jira/browse/SPARK-38441
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Support string and bool `regex` in `Series.replace` in order to reach parity 
> with pandas.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38442) Fix ConstantFoldingSuite/ColumnExpressionSuite/DataFrameSuite/AdaptiveQueryExecSuite under ANSI mode

2022-03-07 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-38442:
--

 Summary: Fix 
ConstantFoldingSuite/ColumnExpressionSuite/DataFrameSuite/AdaptiveQueryExecSuite
 under ANSI mode
 Key: SPARK-38442
 URL: https://issues.apache.org/jira/browse/SPARK-38442
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38442) Fix ConstantFoldingSuite/ColumnExpressionSuite/DataFrameSuite/AdaptiveQueryExecSuite under ANSI mode

2022-03-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38442:


Assignee: Apache Spark  (was: Gengliang Wang)

> Fix 
> ConstantFoldingSuite/ColumnExpressionSuite/DataFrameSuite/AdaptiveQueryExecSuite
>  under ANSI mode
> 
>
> Key: SPARK-38442
> URL: https://issues.apache.org/jira/browse/SPARK-38442
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38442) Fix ConstantFoldingSuite/ColumnExpressionSuite/DataFrameSuite/AdaptiveQueryExecSuite under ANSI mode

2022-03-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502759#comment-17502759
 ] 

Apache Spark commented on SPARK-38442:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35761

> Fix 
> ConstantFoldingSuite/ColumnExpressionSuite/DataFrameSuite/AdaptiveQueryExecSuite
>  under ANSI mode
> 
>
> Key: SPARK-38442
> URL: https://issues.apache.org/jira/browse/SPARK-38442
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38442) Fix ConstantFoldingSuite/ColumnExpressionSuite/DataFrameSuite/AdaptiveQueryExecSuite under ANSI mode

2022-03-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38442:


Assignee: Gengliang Wang  (was: Apache Spark)

> Fix 
> ConstantFoldingSuite/ColumnExpressionSuite/DataFrameSuite/AdaptiveQueryExecSuite
>  under ANSI mode
> 
>
> Key: SPARK-38442
> URL: https://issues.apache.org/jira/browse/SPARK-38442
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38442) Fix ConstantFoldingSuite/ColumnExpressionSuite/DataFrameSuite/AdaptiveQueryExecSuite under ANSI mode

2022-03-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502761#comment-17502761
 ] 

Apache Spark commented on SPARK-38442:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35761

> Fix 
> ConstantFoldingSuite/ColumnExpressionSuite/DataFrameSuite/AdaptiveQueryExecSuite
>  under ANSI mode
> 
>
> Key: SPARK-38442
> URL: https://issues.apache.org/jira/browse/SPARK-38442
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38379) Kubernetes: NoSuchElementException: spark.app.id when using PersistentVolumes

2022-03-07 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502771#comment-17502771
 ] 

Dongjoon Hyun commented on SPARK-38379:
---

Thank you for your investigation, [~tgraves].

> Kubernetes: NoSuchElementException: spark.app.id when using PersistentVolumes 
> --
>
> Key: SPARK-38379
> URL: https://issues.apache.org/jira/browse/SPARK-38379
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.1
>Reporter: Thomas Graves
>Priority: Major
>
> I'm using Spark 3.2.1 on a Kubernetes cluster and starting a spark-shell in 
> client mode.  I'm using persistent local volumes to mount NVMe under /data in 
> the executors, and on startup the driver always throws the warning below.
> using these options:
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand
>  \
>      --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=fast-disks
>  \
>      --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=500Gi
>  \
>      --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/data
>  \
>      --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false
>  
>  
> {code:java}
> 22/03/01 20:21:22 WARN ExecutorPodsSnapshotsStoreImpl: Exception when 
> notifying snapshot subscriber.
> java.util.NoSuchElementException: spark.app.id
>         at org.apache.spark.SparkConf.$anonfun$get$1(SparkConf.scala:245)
>         at scala.Option.getOrElse(Option.scala:189)
>         at org.apache.spark.SparkConf.get(SparkConf.scala:245)
>         at org.apache.spark.SparkConf.getAppId(SparkConf.scala:450)
>         at 
> org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.$anonfun$constructVolumes$4(MountVolumesFeatureStep.scala:88)
>         at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>         at scala.collection.Iterator.foreach(Iterator.scala:943)
>         at scala.collection.Iterator.foreach$(Iterator.scala:943)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>         at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>         at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>         at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>         at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>         at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>         at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>         at 
> org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.constructVolumes(MountVolumesFeatureStep.scala:57)
>         at 
> org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.configurePod(MountVolumesFeatureStep.scala:34)
>         at 
> org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBuilder.$anonfun$buildFromFeatures$4(KubernetesExecutorBuilder.scala:64)
>         at 
> scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
>         at 
> scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
>         at scala.collection.immutable.List.foldLeft(List.scala:91)
>         at 
> org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBuilder.buildFromFeatures(KubernetesExecutorBuilder.scala:63)
>         at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$requestNewExecutors$1(ExecutorPodsAllocator.scala:391)
>         at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
>         at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.requestNewExecutors(ExecutorPodsAllocator.scala:382)
>         at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36(ExecutorPodsAllocator.scala:346)
>         at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36$adapted(ExecutorPodsAllocator.scala:339)
>         at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>         at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>         at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.onNewSnapshots(ExecutorPodsAllocator.scala:339)
>         at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$3(ExecutorPodsAllocator.scala:117)
>         at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$3$adapted(ExecutorPodsAllocat

[jira] [Created] (SPARK-38443) Document config STREAMING_SESSION_WINDOW_MERGE_SESSIONS_IN_LOCAL_PARTITION

2022-03-07 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-38443:
---

 Summary: Document config 
STREAMING_SESSION_WINDOW_MERGE_SESSIONS_IN_LOCAL_PARTITION
 Key: SPARK-38443
 URL: https://issues.apache.org/jira/browse/SPARK-38443
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.3.0
Reporter: L. C. Hsieh


We have a use case where a customer faces an issue with large shuffle writes in session 
windows. The config is hidden, although it is useful for such cases. We should 
document it so end users can find it easily.
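A sketch of how an end user would enable it once documented (this assumes the conf key backing STREAMING_SESSION_WINDOW_MERGE_SESSIONS_IN_LOCAL_PARTITION is "spark.sql.streaming.sessionWindow.merge.sessions.in.local.partition"):
{code:scala}
// Assumption: the internal config constant maps to the key below.
spark.conf.set(
  "spark.sql.streaming.sessionWindow.merge.sessions.in.local.partition", "true")
{code}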



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38443) Document config STREAMING_SESSION_WINDOW_MERGE_SESSIONS_IN_LOCAL_PARTITION

2022-03-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38443:


Assignee: Apache Spark

> Document config STREAMING_SESSION_WINDOW_MERGE_SESSIONS_IN_LOCAL_PARTITION
> --
>
> Key: SPARK-38443
> URL: https://issues.apache.org/jira/browse/SPARK-38443
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Minor
>
> We have a use case where a customer faces an issue with large shuffle writes in 
> session windows. The config is hidden, although it is useful for such cases. We 
> should document it so end users can find it easily.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38443) Document config STREAMING_SESSION_WINDOW_MERGE_SESSIONS_IN_LOCAL_PARTITION

2022-03-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38443:


Assignee: (was: Apache Spark)

> Document config STREAMING_SESSION_WINDOW_MERGE_SESSIONS_IN_LOCAL_PARTITION
> --
>
> Key: SPARK-38443
> URL: https://issues.apache.org/jira/browse/SPARK-38443
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Priority: Minor
>
> We have a use case where a customer faces an issue with large shuffle writes in 
> session windows. The config is hidden, although it is useful for such cases. We 
> should document it so end users can find it easily.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38443) Document config STREAMING_SESSION_WINDOW_MERGE_SESSIONS_IN_LOCAL_PARTITION

2022-03-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502779#comment-17502779
 ] 

Apache Spark commented on SPARK-38443:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/35762

> Document config STREAMING_SESSION_WINDOW_MERGE_SESSIONS_IN_LOCAL_PARTITION
> --
>
> Key: SPARK-38443
> URL: https://issues.apache.org/jira/browse/SPARK-38443
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Priority: Minor
>
> We have a use case where a customer faces an issue with large shuffle writes in 
> session windows. The config is hidden, although it is useful for such cases. We 
> should document it so end users can find it easily.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38443) Document config STREAMING_SESSION_WINDOW_MERGE_SESSIONS_IN_LOCAL_PARTITION

2022-03-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502780#comment-17502780
 ] 

Apache Spark commented on SPARK-38443:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/35762

> Document config STREAMING_SESSION_WINDOW_MERGE_SESSIONS_IN_LOCAL_PARTITION
> --
>
> Key: SPARK-38443
> URL: https://issues.apache.org/jira/browse/SPARK-38443
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Priority: Minor
>
> We have a use case where a customer faces an issue with large shuffle writes in 
> session windows. The config is hidden, although it is useful for such cases. We 
> should document it so end users can find it easily.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org