[jira] [Created] (SPARK-37577) ClassCastException: ArrayType cannot be cast to StructType

2021-12-08 Thread Rafal Wojdyla (Jira)
Rafal Wojdyla created SPARK-37577:
-

 Summary: ClassCastException: ArrayType cannot be cast to StructType
 Key: SPARK-37577
 URL: https://issues.apache.org/jira/browse/SPARK-37577
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.2.0
 Environment: Py: 3.9
Reporter: Rafal Wojdyla


Reproduction:

{code:python}
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, ArrayType, StringType

t = StructType([
    StructField('o', ArrayType(StructType([
        StructField('s', StringType(), False),
        StructField('b', ArrayType(StructType([
            StructField('e', StringType(), False)
        ]), True), False)
    ]), True), False)
])

(
    spark.createDataFrame([], schema=t)
    .select(F.explode("o").alias("eo"))
    .select("eo.*")
    .select(F.explode("b"))
    .count()
)
{code}

The code above works fine in 3.1.2 but fails in 3.2.0; see the stack trace below. Note that 
if you remove field {{s}}, the code works fine, which is a bit unexpected and 
likely a clue.

{noformat}
Py4JJavaError: An error occurred while calling o156.count.
: java.lang.ClassCastException: class org.apache.spark.sql.types.ArrayType 
cannot be cast to class org.apache.spark.sql.types.StructType 
(org.apache.spark.sql.types.ArrayType and org.apache.spark.sql.types.StructType 
are in unnamed module of loader 'app')
at 
org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema$lzycompute(complexTypeExtractors.scala:107)
at 
org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema(complexTypeExtractors.scala:107)
at 
org.apache.spark.sql.catalyst.expressions.GetStructField.$anonfun$extractFieldName$1(complexTypeExtractors.scala:117)
at scala.Option.getOrElse(Option.scala:189)
at 
org.apache.spark.sql.catalyst.expressions.GetStructField.extractFieldName(complexTypeExtractors.scala:117)
at 
org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1$$anonfun$2.applyOrElse(NestedColumnAliasing.scala:372)
at 
org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1$$anonfun$2.applyOrElse(NestedColumnAliasing.scala:368)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$4(TreeNode.scala:539)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:539)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:508)
at 
org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1.applyOrElse(NestedColumnAliasing.scala:368)
at 
org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1.applyOrElse(NestedColumnAliasing.scala:366)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsDownWithPruning$1(QueryPlan.scala:152)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDownWithPruning(QueryPlan.scala:152)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsWithPruning(QueryPlan.scala:123)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:101)
at 
org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$.unapply(NestedColumnAliasing.scala:366)
at 
org.apache.spark.sql.catalyst.optimizer.ColumnPruning$$anonfun$apply$14.applyOrElse(Optimizer.scala:826)
at 
org.apache.spark.sql.catalyst.optimizer.ColumnPruning$$anonfun$apply$14.applyOrElse(Optimizer.scala:783)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(Tree

[jira] [Updated] (SPARK-37570) mypy breaks on pyspark.pandas.plot.core.Bucketizer

2021-12-08 Thread Rafal Wojdyla (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rafal Wojdyla updated SPARK-37570:
--
Environment: 
Mypy: v0.910-1
Py: 3.9

  was:
Mypy version: v0.910-1
Py: 3.9


> mypy breaks on pyspark.pandas.plot.core.Bucketizer
> --
>
> Key: SPARK-37570
> URL: https://issues.apache.org/jira/browse/SPARK-37570
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: Mypy: v0.910-1
> Py: 3.9
>Reporter: Rafal Wojdyla
>Priority: Major
>
> Mypy breaks on a project with a pyspark 3.2.0 dependency (it worked fine with 
> 3.1.2); see the stack trace below:
> {noformat}
> Traceback (most recent call last):
>   File "/Users/rav/.cache/pre-commit/repoytd0jvc_/py_env-python3.9/bin/mypy", 
> line 8, in 
> sys.exit(console_entry())
>   File 
> "/Users/rav/.cache/pre-commit/repoytd0jvc_/py_env-python3.9/lib/python3.9/site-packages/mypy/__main__.py",
>  line 11, in console_entry
> main(None, sys.stdout, sys.stderr)
>   File "mypy/main.py", line 87, in main
>   File "mypy/main.py", line 165, in run_build
>   File "mypy/build.py", line 179, in build
>   File "mypy/build.py", line 254, in _build
>   File "mypy/build.py", line 2697, in dispatch
>   File "mypy/build.py", line 3021, in process_graph
>   File "mypy/build.py", line 3138, in process_stale_scc
>   File "mypy/build.py", line 2288, in write_cache
>   File "mypy/build.py", line 1475, in write_cache
>   File "mypy/nodes.py", line 313, in serialize
>   File "mypy/nodes.py", line 3149, in serialize
>   File "mypy/nodes.py", line 3083, in serialize
> AssertionError: Definition of pyspark.pandas.plot.core.Bucketizer is 
> unexpectedly incomplete
> {noformat}






[jira] [Updated] (SPARK-37570) mypy breaks on pyspark.pandas.plot.core.Bucketizer

2021-12-08 Thread Rafal Wojdyla (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rafal Wojdyla updated SPARK-37570:
--
Description: 
Mypy breaks on a project with a pyspark 3.2.0 dependency (it worked fine with 3.1.2); 
see the stack trace below:

{noformat}
Traceback (most recent call last):
  File "/Users/rav/.cache/pre-commit/repoytd0jvc_/py_env-python3.9/bin/mypy", 
line 8, in 
sys.exit(console_entry())
  File 
"/Users/rav/.cache/pre-commit/repoytd0jvc_/py_env-python3.9/lib/python3.9/site-packages/mypy/__main__.py",
 line 11, in console_entry
main(None, sys.stdout, sys.stderr)
  File "mypy/main.py", line 87, in main
  File "mypy/main.py", line 165, in run_build
  File "mypy/build.py", line 179, in build
  File "mypy/build.py", line 254, in _build
  File "mypy/build.py", line 2697, in dispatch
  File "mypy/build.py", line 3021, in process_graph
  File "mypy/build.py", line 3138, in process_stale_scc
  File "mypy/build.py", line 2288, in write_cache
  File "mypy/build.py", line 1475, in write_cache
  File "mypy/nodes.py", line 313, in serialize
  File "mypy/nodes.py", line 3149, in serialize
  File "mypy/nodes.py", line 3083, in serialize
AssertionError: Definition of pyspark.pandas.plot.core.Bucketizer is 
unexpectedly incomplete
{noformat}

  was:
Mypy breaks on a project with pyspark 3.2.0 dependency (worked fine for 3.1.2), 
see stacktrace below:

{noformat}
Traceback (most recent call last):
  File "/Users/rav/.cache/pre-commit/repoytd0jvc_/py_env-python3.9/bin/mypy", 
line 8, in 
sys.exit(console_entry())
  File 
"/Users/rav/.cache/pre-commit/repoytd0jvc_/py_env-python3.9/lib/python3.9/site-packages/mypy/__main__.py",
 line 11, in console_entry
main(None, sys.stdout, sys.stderr)
  File "mypy/main.py", line 87, in main
  File "mypy/main.py", line 165, in run_build
  File "mypy/build.py", line 179, in build
  File "mypy/build.py", line 254, in _build
  File "mypy/build.py", line 2697, in dispatch
  File "mypy/build.py", line 3021, in process_graph
  File "mypy/build.py", line 3138, in process_stale_scc
  File "mypy/build.py", line 2288, in write_cache
  File "mypy/build.py", line 1475, in write_cache
  File "mypy/nodes.py", line 313, in serialize
  File "mypy/nodes.py", line 3149, in serialize
  File "mypy/nodes.py", line 3083, in serialize
AssertionError: Definition of pyspark.pandas.plot.core.Bucketizer is 
unexpectedly incomplete
{noformat}

Mypy version: v0.910-1
Py: 3.9


> mypy breaks on pyspark.pandas.plot.core.Bucketizer
> --
>
> Key: SPARK-37570
> URL: https://issues.apache.org/jira/browse/SPARK-37570
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Rafal Wojdyla
>Priority: Major
>
> Mypy breaks on a project with a pyspark 3.2.0 dependency (it worked fine with 
> 3.1.2); see the stack trace below:
> {noformat}
> Traceback (most recent call last):
>   File "/Users/rav/.cache/pre-commit/repoytd0jvc_/py_env-python3.9/bin/mypy", 
> line 8, in 
> sys.exit(console_entry())
>   File 
> "/Users/rav/.cache/pre-commit/repoytd0jvc_/py_env-python3.9/lib/python3.9/site-packages/mypy/__main__.py",
>  line 11, in console_entry
> main(None, sys.stdout, sys.stderr)
>   File "mypy/main.py", line 87, in main
>   File "mypy/main.py", line 165, in run_build
>   File "mypy/build.py", line 179, in build
>   File "mypy/build.py", line 254, in _build
>   File "mypy/build.py", line 2697, in dispatch
>   File "mypy/build.py", line 3021, in process_graph
>   File "mypy/build.py", line 3138, in process_stale_scc
>   File "mypy/build.py", line 2288, in write_cache
>   File "mypy/build.py", line 1475, in write_cache
>   File "mypy/nodes.py", line 313, in serialize
>   File "mypy/nodes.py", line 3149, in serialize
>   File "mypy/nodes.py", line 3083, in serialize
> AssertionError: Definition of pyspark.pandas.plot.core.Bucketizer is 
> unexpectedly incomplete
> {noformat}






[jira] [Updated] (SPARK-37570) mypy breaks on pyspark.pandas.plot.core.Bucketizer

2021-12-08 Thread Rafal Wojdyla (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rafal Wojdyla updated SPARK-37570:
--
Environment: 
Mypy version: v0.910-1
Py: 3.9

> mypy breaks on pyspark.pandas.plot.core.Bucketizer
> --
>
> Key: SPARK-37570
> URL: https://issues.apache.org/jira/browse/SPARK-37570
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: Mypy version: v0.910-1
> Py: 3.9
>Reporter: Rafal Wojdyla
>Priority: Major
>
> Mypy breaks on a project with a pyspark 3.2.0 dependency (it worked fine with 
> 3.1.2); see the stack trace below:
> {noformat}
> Traceback (most recent call last):
>   File "/Users/rav/.cache/pre-commit/repoytd0jvc_/py_env-python3.9/bin/mypy", 
> line 8, in 
> sys.exit(console_entry())
>   File 
> "/Users/rav/.cache/pre-commit/repoytd0jvc_/py_env-python3.9/lib/python3.9/site-packages/mypy/__main__.py",
>  line 11, in console_entry
> main(None, sys.stdout, sys.stderr)
>   File "mypy/main.py", line 87, in main
>   File "mypy/main.py", line 165, in run_build
>   File "mypy/build.py", line 179, in build
>   File "mypy/build.py", line 254, in _build
>   File "mypy/build.py", line 2697, in dispatch
>   File "mypy/build.py", line 3021, in process_graph
>   File "mypy/build.py", line 3138, in process_stale_scc
>   File "mypy/build.py", line 2288, in write_cache
>   File "mypy/build.py", line 1475, in write_cache
>   File "mypy/nodes.py", line 313, in serialize
>   File "mypy/nodes.py", line 3149, in serialize
>   File "mypy/nodes.py", line 3083, in serialize
> AssertionError: Definition of pyspark.pandas.plot.core.Bucketizer is 
> unexpectedly incomplete
> {noformat}






[jira] [Commented] (SPARK-37533) New SQL function: try_element_at

2021-12-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455039#comment-17455039
 ] 

Apache Spark commented on SPARK-37533:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/34833

> New SQL function: try_element_at
> 
>
> Key: SPARK-37533
> URL: https://issues.apache.org/jira/browse/SPARK-37533
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.3.0
>
>
> Add a new SQL function `try_element_at`, which is identical to `element_at` 
> except that it returns null if an error occurs.
>  






[jira] [Commented] (SPARK-37575) Empty strings and null values are both saved as quoted empty Strings "" rather than "" (for empty strings) and nothing(for null values)

2021-12-08 Thread Guo Wei (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455043#comment-17455043
 ] 

Guo Wei commented on SPARK-37575:
-

I think I may have found the root cause by debugging the Spark source code.

In {color:#0747a6}UnivocityGenerator{color}, when a column value is null, values(i) is 
replaced with {color:#00875a}options.nullValue{color}, whose default value is "":

 
{code:java}
private def convertRow(row: InternalRow): Seq[String] = {
  var i = 0
  val values = new Array[String](row.numFields)
  while (i < row.numFields) {
if (!row.isNullAt(i)) {
  values(i) = valueConverters(i).apply(row, i)
} else {
  values(i) = options.nullValue
}
i += 1
  }
  values
} {code}
 

So, in the {color:#0747a6}univocity-parsers{color} library (a Spark dependency), in the 
{color:#0747a6}AbstractWriter{color} class, an element that was originally null has already 
been changed to '' by UnivocityGenerator, so it does not satisfy the condition (element 
== null) and instead ends up as emptyValue, whose default value is "\"\"":

 
{code:java}
protected String getStringValue(Object element) {
   usingNullOrEmptyValue = false;
   if (element == null) {
  usingNullOrEmptyValue = true;
  return nullValue;
   }
   String string = String.valueOf(element);
   if (string.isEmpty()) {
  usingNullOrEmptyValue = true;
  return emptyValue;
   }
   return string;
} {code}
[~hyukjin.kwon] Should we fix this substitution in the isNullAt branch of 
{color:#0747a6}UnivocityGenerator{color}?
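
As a side note, the workaround already mentioned in the migration guide (setting the CSV writer option {{emptyValue}} to an unquoted empty string) can be sketched as below; the data and output path are illustrative only, and per the analysis above null values go through the same code path, so this only changes the quoting and does not restore the null/empty distinction:
{code:java}
// Minimal sketch, assuming a spark-shell session (or `import spark.implicits._` for toDF).
val data = List("spark", null, "").toDF("name")
data.coalesce(1)
  .write
  .option("emptyValue", "")           // default is "\"\"" (a quoted empty string)
  .csv("/tmp/spark_csv_emptyvalue")   // hypothetical output path
{code}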

 

 

> Empty strings and null values are both saved as quoted empty Strings "" 
> rather than "" (for empty strings) and nothing(for null values)
> ---
>
> Key: SPARK-37575
> URL: https://issues.apache.org/jira/browse/SPARK-37575
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Guo Wei
>Priority: Major
>
> As mentioned in sql migration 
> guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]),
> {noformat}
> Since Spark 2.4, empty strings are saved as quoted empty strings "". In 
> version 2.3 and earlier, empty strings are equal to null values and do not 
> reflect to any characters in saved CSV files. For example, the row of "a", 
> null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as 
> a,,"",1. To restore the previous behavior, set the CSV option emptyValue to 
> empty (not quoted) string.{noformat}
>  
> But actually, both empty strings and null values are saved as quoted empty 
> strings "" rather than "" (for empty strings) and nothing (for null values).
> code:
> {code:java}
> val data = List("spark", null, "").toDF("name")
> data.coalesce(1).write.csv("spark_csv_test")
> {code}
>  actual result:
> {noformat}
> line1: spark
> line2: ""
> line3: ""{noformat}
> expected result:
> {noformat}
> line1: spark
> line2: 
> line3: ""
> {noformat}






[jira] [Comment Edited] (SPARK-37575) Empty strings and null values are both saved as quoted empty Strings "" rather than "" (for empty strings) and nothing(for null values)

2021-12-08 Thread Guo Wei (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455043#comment-17455043
 ] 

Guo Wei edited comment on SPARK-37575 at 12/8/21, 8:40 AM:
---

I think I may have found the root cause by debugging the Spark source code.

In {color:#0747a6}UnivocityGenerator{color}, when a column's value is null, it is 
replaced with {color:#00875a}options.nullValue{color}, whose default value is "":

 
{code:java}
private def convertRow(row: InternalRow): Seq[String] = {
  var i = 0
  val values = new Array[String](row.numFields)
  while (i < row.numFields) {
if (!row.isNullAt(i)) {
  values(i) = valueConverters(i).apply(row, i)
} else {
  values(i) = options.nullValue
}
i += 1
  }
  values
} {code}
 

So, in the {color:#0747a6}univocity-parsers{color} library (a Spark dependency), in the 
{color:#0747a6}AbstractWriter{color} class, an element that was originally null has already 
been changed to '' by UnivocityGenerator, so it does not satisfy the condition (element 
== null) and instead ends up as emptyValue, whose default value is "\"\"":

 
{code:java}
protected String getStringValue(Object element) {
   usingNullOrEmptyValue = false;
   if (element == null) {
  usingNullOrEmptyValue = true;
  return nullValue;
   }
   String string = String.valueOf(element);
   if (string.isEmpty()) {
  usingNullOrEmptyValue = true;
  return emptyValue;
   }
   return string;
} {code}
[~hyukjin.kwon] Should we fix this substitution in the isNullAt branch of 
{color:#0747a6}UnivocityGenerator{color}?

 

 


was (Author: wayne guo):
I think I maybe have found the root cause through debug Spark source code.

In {color:#0747a6}UnivocityGenerator{color}, when the value of column is null 
values, values(i) has been changed  to {color:#00875a}options.nullValue{color}, 
default value  is ""

 
{code:java}
private def convertRow(row: InternalRow): Seq[String] = {
  var i = 0
  val values = new Array[String](row.numFields)
  while (i < row.numFields) {
if (!row.isNullAt(i)) {
  values(i) = valueConverters(i).apply(row, i)
} else {
  values(i) = options.nullValue
}
i += 1
  }
  values
} {code}
 

So,in {color:#0747a6}univocity-parsers lib{color}(depended by Spark) 
{color:#0747a6}AbstractWriter{color} class, element( is original null values) 
has been changed to  '' in UnivocityGenerator,not satisfied condition(element 
== null),finally equals emptyValue, default value  is "\"\""

 
{code:java}
protected String getStringValue(Object element) {
   usingNullOrEmptyValue = false;
   if (element == null) {
  usingNullOrEmptyValue = true;
  return nullValue;
   }
   String string = String.valueOf(element);
   if (string.isEmpty()) {
  usingNullOrEmptyValue = true;
  return emptyValue;
   }
   return string;
} {code}
[~hyukjin.kwon]  Should we fix the change(isNullAt) in 
{color:#0747a6}UnivocityGenerator{color:#172b4d}?{color}{color}

 

 

> Empty strings and null values are both saved as quoted empty Strings "" 
> rather than "" (for empty strings) and nothing(for null values)
> ---
>
> Key: SPARK-37575
> URL: https://issues.apache.org/jira/browse/SPARK-37575
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Guo Wei
>Priority: Major
>
> As mentioned in sql migration 
> guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]),
> {noformat}
> Since Spark 2.4, empty strings are saved as quoted empty strings "". In 
> version 2.3 and earlier, empty strings are equal to null values and do not 
> reflect to any characters in saved CSV files. For example, the row of "a", 
> null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as 
> a,,"",1. To restore the previous behavior, set the CSV option emptyValue to 
> empty (not quoted) string.{noformat}
>  
> But actually, both empty strings and null values are saved as quoted empty 
> strings "" rather than "" (for empty strings) and nothing (for null values).
> code:
> {code:java}
> val data = List("spark", null, "").toDF("name")
> data.coalesce(1).write.csv("spark_csv_test")
> {code}
>  actual result:
> {noformat}
> line1: spark
> line2: ""
> line3: ""{noformat}
> expected result:
> {noformat}
> line1: spark
> line2: 
> line3: ""
> {noformat}






[jira] [Comment Edited] (SPARK-37575) Empty strings and null values are both saved as quoted empty Strings "" rather than "" (for empty strings) and nothing(for null values)

2021-12-08 Thread Guo Wei (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455043#comment-17455043
 ] 

Guo Wei edited comment on SPARK-37575 at 12/8/21, 8:40 AM:
---

I think I may have found the root cause by debugging the Spark source code.

In {color:#0747a6}UnivocityGenerator{color}, when a column's value is null, it is 
replaced with {color:#00875a}options.nullValue{color}, whose default value is "":

 
{code:java}
private def convertRow(row: InternalRow): Seq[String] = {
  var i = 0
  val values = new Array[String](row.numFields)
  while (i < row.numFields) {
if (!row.isNullAt(i)) {
  values(i) = valueConverters(i).apply(row, i)
} else {
  values(i) = options.nullValue
}
i += 1
  }
  values
} {code}
 

So, in the {color:#0747a6}univocity-parsers{color} library (a Spark dependency), in the 
{color:#0747a6}AbstractWriter{color} class, an element that was originally null has already 
been changed to '' by UnivocityGenerator, so it does not satisfy the condition (element 
== null) and instead ends up as emptyValue, whose default value is "\"\"":

 
{code:java}
protected String getStringValue(Object element) {
   usingNullOrEmptyValue = false;
   if (element == null) {
  usingNullOrEmptyValue = true;
  return nullValue;
   }
   String string = String.valueOf(element);
   if (string.isEmpty()) {
  usingNullOrEmptyValue = true;
  return emptyValue;
   }
   return string;
} {code}
[~hyukjin.kwon] Should we fix this substitution in the isNullAt branch of 
{color:#0747a6}UnivocityGenerator{color}?

 


was (Author: wayne guo):
I think I maybe have found the root cause through debug Spark source code.

In {color:#0747a6}UnivocityGenerator{color}, when the value of column is null 
values, column's value has been changed  to 
{color:#00875a}options.nullValue{color}, default value  is ""

 
{code:java}
private def convertRow(row: InternalRow): Seq[String] = {
  var i = 0
  val values = new Array[String](row.numFields)
  while (i < row.numFields) {
if (!row.isNullAt(i)) {
  values(i) = valueConverters(i).apply(row, i)
} else {
  values(i) = options.nullValue
}
i += 1
  }
  values
} {code}
 

So,in {color:#0747a6}univocity-parsers lib{color}(depended by Spark) 
{color:#0747a6}AbstractWriter{color} class, element( is original null values) 
has been changed to  '' in UnivocityGenerator,not satisfied condition(element 
== null),finally equals emptyValue, default value  is "\"\""

 
{code:java}
protected String getStringValue(Object element) {
   usingNullOrEmptyValue = false;
   if (element == null) {
  usingNullOrEmptyValue = true;
  return nullValue;
   }
   String string = String.valueOf(element);
   if (string.isEmpty()) {
  usingNullOrEmptyValue = true;
  return emptyValue;
   }
   return string;
} {code}
[~hyukjin.kwon]  Should we fix the change(isNullAt) in 
{color:#0747a6}UnivocityGenerator?{color}

 

 

> Empty strings and null values are both saved as quoted empty Strings "" 
> rather than "" (for empty strings) and nothing(for null values)
> ---
>
> Key: SPARK-37575
> URL: https://issues.apache.org/jira/browse/SPARK-37575
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Guo Wei
>Priority: Major
>
> As mentioned in sql migration 
> guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]),
> {noformat}
> Since Spark 2.4, empty strings are saved as quoted empty strings "". In 
> version 2.3 and earlier, empty strings are equal to null values and do not 
> reflect to any characters in saved CSV files. For example, the row of "a", 
> null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as 
> a,,"",1. To restore the previous behavior, set the CSV option emptyValue to 
> empty (not quoted) string.{noformat}
>  
> But actually, both empty strings and null values are saved as quoted empty 
> strings "" rather than "" (for empty strings) and nothing (for null values).
> code:
> {code:java}
> val data = List("spark", null, "").toDF("name")
> data.coalesce(1).write.csv("spark_csv_test")
> {code}
>  actual result:
> {noformat}
> line1: spark
> line2: ""
> line3: ""{noformat}
> expected result:
> {noformat}
> line1: spark
> line2: 
> line3: ""
> {noformat}






[jira] [Comment Edited] (SPARK-37575) Empty strings and null values are both saved as quoted empty Strings "" rather than "" (for empty strings) and nothing(for null values)

2021-12-08 Thread Guo Wei (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455043#comment-17455043
 ] 

Guo Wei edited comment on SPARK-37575 at 12/8/21, 8:41 AM:
---

I think I may have found the root cause by debugging the Spark source code.

In {color:#0747a6}UnivocityGenerator{color}, when a column's value is null, it is 
replaced with {color:#00875a}options.nullValue{color}, whose default value is "":

 
{code:java}
private def convertRow(row: InternalRow): Seq[String] = {
  var i = 0
  val values = new Array[String](row.numFields)
  while (i < row.numFields) {
if (!row.isNullAt(i)) {
  values(i) = valueConverters(i).apply(row, i)
} else {
  values(i) = options.nullValue
}
i += 1
  }
  values
} {code}
 

So, in the {color:#0747a6}univocity-parsers{color} library (a Spark dependency), in the 
{color:#0747a6}AbstractWriter{color} class, an element that was originally null has already 
been changed to "" by UnivocityGenerator, so it does not satisfy the condition (element 
== null) and instead ends up as emptyValue, whose default value is "\"\"":

 
{code:java}
protected String getStringValue(Object element) {
   usingNullOrEmptyValue = false;
   if (element == null) {
  usingNullOrEmptyValue = true;
  return nullValue;
   }
   String string = String.valueOf(element);
   if (string.isEmpty()) {
  usingNullOrEmptyValue = true;
  return emptyValue;
   }
   return string;
} {code}
[~hyukjin.kwon] Should we fix this substitution in the isNullAt branch of 
{color:#0747a6}UnivocityGenerator{color}?

 


was (Author: wayne guo):
I think I maybe have found the root cause through debug Spark source code.

In {color:#0747a6}UnivocityGenerator{color}, when the value of column is null 
values, column's value has been changed  to 
{color:#00875a}options.nullValue{color}, default value  is ""

 
{code:java}
private def convertRow(row: InternalRow): Seq[String] = {
  var i = 0
  val values = new Array[String](row.numFields)
  while (i < row.numFields) {
if (!row.isNullAt(i)) {
  values(i) = valueConverters(i).apply(row, i)
} else {
  values(i) = options.nullValue
}
i += 1
  }
  values
} {code}
 

So,in {color:#0747a6}univocity-parsers lib{color}(depended by Spark) 
{color:#0747a6}AbstractWriter{color} class, element( is original null values) 
has been changed to  '' in UnivocityGenerator,not satisfied condition(element 
== null),finally equals emptyValue, default value  is "\"\""

 
{code:java}
protected String getStringValue(Object element) {
   usingNullOrEmptyValue = false;
   if (element == null) {
  usingNullOrEmptyValue = true;
  return nullValue;
   }
   String string = String.valueOf(element);
   if (string.isEmpty()) {
  usingNullOrEmptyValue = true;
  return emptyValue;
   }
   return string;
} {code}
[~hyukjin.kwon]  Should we fix the change(isNullAt) in 
{color:#0747a6}UnivocityGenerator?{color}

 

> Empty strings and null values are both saved as quoted empty Strings "" 
> rather than "" (for empty strings) and nothing(for null values)
> ---
>
> Key: SPARK-37575
> URL: https://issues.apache.org/jira/browse/SPARK-37575
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Guo Wei
>Priority: Major
>
> As mentioned in sql migration 
> guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]),
> {noformat}
> Since Spark 2.4, empty strings are saved as quoted empty strings "". In 
> version 2.3 and earlier, empty strings are equal to null values and do not 
> reflect to any characters in saved CSV files. For example, the row of "a", 
> null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as 
> a,,"",1. To restore the previous behavior, set the CSV option emptyValue to 
> empty (not quoted) string.{noformat}
>  
> But actually, both empty strings and null values are saved as quoted empty 
> strings "" rather than "" (for empty strings) and nothing (for null values).
> code:
> {code:java}
> val data = List("spark", null, "").toDF("name")
> data.coalesce(1).write.csv("spark_csv_test")
> {code}
>  actual result:
> {noformat}
> line1: spark
> line2: ""
> line3: ""{noformat}
> expected result:
> {noformat}
> line1: spark
> line2: 
> line3: ""
> {noformat}






[jira] [Commented] (SPARK-37575) Empty strings and null values are both saved as quoted empty Strings "" rather than "" (for empty strings) and nothing(for null values)

2021-12-08 Thread Guo Wei (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455053#comment-17455053
 ] 

Guo Wei commented on SPARK-37575:
-

I think I can put together a quick fix to verify my conclusion and add some test 
cases.

> Empty strings and null values are both saved as quoted empty Strings "" 
> rather than "" (for empty strings) and nothing(for null values)
> ---
>
> Key: SPARK-37575
> URL: https://issues.apache.org/jira/browse/SPARK-37575
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Guo Wei
>Priority: Major
>
> As mentioned in sql migration 
> guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]),
> {noformat}
> Since Spark 2.4, empty strings are saved as quoted empty strings "". In 
> version 2.3 and earlier, empty strings are equal to null values and do not 
> reflect to any characters in saved CSV files. For example, the row of "a", 
> null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as 
> a,,"",1. To restore the previous behavior, set the CSV option emptyValue to 
> empty (not quoted) string.{noformat}
>  
> But actually, both empty strings and null values are saved as quoted empty 
> strings "" rather than "" (for empty strings) and nothing (for null values).
> code:
> {code:java}
> val data = List("spark", null, "").toDF("name")
> data.coalesce(1).write.csv("spark_csv_test")
> {code}
>  actual result:
> {noformat}
> line1: spark
> line2: ""
> line3: ""{noformat}
> expected result:
> {noformat}
> line1: spark
> line2: 
> line3: ""
> {noformat}






[jira] [Commented] (SPARK-37575) Empty strings and null values are both saved as quoted empty Strings "" rather than "" (for empty strings) and nothing(for null values)

2021-12-08 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455055#comment-17455055
 ] 

Hyukjin Kwon commented on SPARK-37575:
--

Yeah, it would be great to fix it, and upgrade the version used in Apache Spark.

> Empty strings and null values are both saved as quoted empty Strings "" 
> rather than "" (for empty strings) and nothing(for null values)
> ---
>
> Key: SPARK-37575
> URL: https://issues.apache.org/jira/browse/SPARK-37575
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Guo Wei
>Priority: Major
>
> As mentioned in sql migration 
> guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]),
> {noformat}
> Since Spark 2.4, empty strings are saved as quoted empty strings "". In 
> version 2.3 and earlier, empty strings are equal to null values and do not 
> reflect to any characters in saved CSV files. For example, the row of "a", 
> null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as 
> a,,"",1. To restore the previous behavior, set the CSV option emptyValue to 
> empty (not quoted) string.{noformat}
>  
> But actually, both empty strings and null values are saved as quoted empty 
> strings "" rather than "" (for empty strings) and nothing (for null values).
> code:
> {code:java}
> val data = List("spark", null, "").toDF("name")
> data.coalesce(1).write.csv("spark_csv_test")
> {code}
>  actual result:
> {noformat}
> line1: spark
> line2: ""
> line3: ""{noformat}
> expected result:
> {noformat}
> line1: spark
> line2: 
> line3: ""
> {noformat}






[jira] [Created] (SPARK-37578) DSV2 is not updating Output Metrics

2021-12-08 Thread Sandeep Katta (Jira)
Sandeep Katta created SPARK-37578:
-

 Summary: DSV2 is not updating Output Metrics
 Key: SPARK-37578
 URL: https://issues.apache.org/jira/browse/SPARK-37578
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.1.2, 3.0.3
Reporter: Sandeep Katta


Repro code

./bin/spark-shell --master local  --jars 
/Users/jars/iceberg-spark3-runtime-0.12.1.jar

 
{code:java}

import scala.collection.mutable
import org.apache.spark.scheduler._

val bytesWritten = new mutable.ArrayBuffer[Long]()
val recordsWritten = new mutable.ArrayBuffer[Long]()
val bytesWrittenListener = new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    bytesWritten += taskEnd.taskMetrics.outputMetrics.bytesWritten
    recordsWritten += taskEnd.taskMetrics.outputMetrics.recordsWritten
  }
}

spark.sparkContext.addSparkListener(bytesWrittenListener)
try {
  val df = spark.range(1000).toDF("id")
  df.write.format("iceberg").save("Users/data/dsv2_test")

  assert(bytesWritten.sum > 0)
  assert(recordsWritten.sum > 0)
} finally {
  spark.sparkContext.removeSparkListener(bytesWrittenListener)
} {code}
 

 






[jira] [Created] (SPARK-37579) Called spark.sql multiple times,union multiple DataFrame, groupBy and pivot, join other table view cause exception

2021-12-08 Thread page (Jira)
page created SPARK-37579:


 Summary: Called spark.sql multiple times,union multiple DataFrame, 
groupBy and pivot, join other table view cause exception
 Key: SPARK-37579
 URL: https://issues.apache.org/jira/browse/SPARK-37579
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.7
Reporter: page


Possible steps to reproduce (see the sketch after this list):
1. Run spark.sql multiple times, getting a list of DataFrames [d1, d2, d3, d4]
2. Combine the DataFrames [d1, d2, d3, d4] into one DataFrame d5 by calling 
Dataset#unionByName
3. Run 
{code:java}
d5.groupBy("c1").pivot("c2").agg(concat_ws(", ", collect_list("value"))){code}
, producing DataFrame d6
4. Join DataFrame d6 with another DataFrame d7
5. Call an action such as count to trigger a Spark job
6. An exception occurs
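
A rough, hypothetical sketch of these steps (Spark 2.4, spark-shell session); the table and column names t1..t4, other_table, c1, c2 and value are placeholders, not taken from this report:
{code:java}
import org.apache.spark.sql.functions.{concat_ws, collect_list}

// Step 1: run spark.sql multiple times (placeholder queries).
val dfs = Seq("t1", "t2", "t3", "t4").map(t => spark.sql(s"SELECT c1, c2, value FROM $t"))

// Step 2: combine the DataFrames with unionByName.
val d5 = dfs.reduce((a, b) => a.unionByName(b))

// Step 3: groupBy + pivot + agg, producing d6.
val d6 = d5.groupBy("c1").pivot("c2").agg(concat_ws(", ", collect_list("value")))

// Step 4: join d6 with another DataFrame d7.
val d7 = spark.table("other_table")
val joined = d6.join(d7, Seq("c1"))

// Step 5: an action such as count triggers the job (and, per the report, the exception).
joined.count()
{code}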
 
stack trace:
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
at 
org.apache.spark.sql.execution.adaptive.QueryStage.executeChildStages(QueryStage.scala:88)
at 
org.apache.spark.sql.execution.adaptive.QueryStage.prepareExecuteStage(QueryStage.scala:136)
at 
org.apache.spark.sql.execution.adaptive.QueryStage.executeCollect(QueryStage.scala:242)
at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2837)
at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2836)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3441)
at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:92)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:139)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withAction(Dataset.scala:3440)
at org.apache.spark.sql.Dataset.count(Dataset.scala:2836)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Can't zip RDDs with unequal 
numbers of partitions: List(2, 1)
at 
org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:269)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:269)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:269)
at org.apache.spark.ShuffleDependency.(Dependency.scala:94)
at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:361)
at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:69)
at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.eagerExecute(ShuffleExchangeExec.scala:112)
at 
org.apache.spark.sql.execution.adaptive.ShuffleQueryStage.executeStage(QueryStage.scala:284)
at 
org.apache.spark.sql.execution.adaptive.QueryStage.doExecute(QueryStage.scala:236)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:137)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:133)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:161)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:158)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:133)
at 
org.apache.spark.sql.execution.adaptive.QueryStage$$anonfun$8$$anonfun$apply$2$$anonfun$apply$3.apply(QueryStage.scala:81)
at 
org.apache.spark.sql.execution.adaptive.QueryStage$$anonfun$8$$anonfun$apply$2$$anonfun$apply$3.apply(QueryStage.scala:81)
at 
org.apache.spark.sql.execution.SQLExecution$.withExecutionIdAndJobDesc(SQLExecution.scala:157)
at 
org.apache.spark.sql.execution.adaptive.QueryStage$$anonfun$8$$anonfun$apply$2.apply(QueryStage.scala:80)
at 
org.apache.spark.sql.execution.adaptive.QueryStage$$anonfun$8$$anonfun$apply$2.apply(QueryStage.scala:78)

[jira] [Created] (SPARK-37580) Optimize current TaskSetManager abort logic when task failed count reach the threshold

2021-12-08 Thread wangshengjie (Jira)
wangshengjie created SPARK-37580:


 Summary: Optimize current TaskSetManager abort logic when task 
failed count reach the threshold
 Key: SPARK-37580
 URL: https://issues.apache.org/jira/browse/SPARK-37580
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: wangshengjie


In our production environment, we found a logic gap in the TaskSetManager abort 
handling. For example:

If a task has already failed 3 times (the default maximum failure threshold is 4), and 
both a retry attempt and a speculative attempt are running, then one of these 2 attempts 
succeeds and Spark tries to cancel the other. But if the executor running the attempt to 
be cancelled is lost (an OOM in our case), that attempt is marked as failed, and when 
TaskSetManager handles this failed attempt the task has now failed 4 times, so the stage 
is aborted and the job fails.

I have created a patch for this bug and will open a pull request soon.

 






[jira] [Commented] (SPARK-37580) Optimize current TaskSetManager abort logic when task failed count reach the threshold

2021-12-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455094#comment-17455094
 ] 

Apache Spark commented on SPARK-37580:
--

User 'wangshengjie123' has created a pull request for this issue:
https://github.com/apache/spark/pull/34834

> Optimize current TaskSetManager abort logic when task failed count reach the 
> threshold
> --
>
> Key: SPARK-37580
> URL: https://issues.apache.org/jira/browse/SPARK-37580
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: wangshengjie
>Priority: Major
>
> In our production environment, we found a logic gap in the TaskSetManager abort 
> handling. For example:
> If a task has already failed 3 times (the default maximum failure threshold is 4), 
> and both a retry attempt and a speculative attempt are running, then one of these 2 
> attempts succeeds and Spark tries to cancel the other. But if the executor running 
> the attempt to be cancelled is lost (an OOM in our case), that attempt is marked as 
> failed, and when TaskSetManager handles this failed attempt the task has now failed 
> 4 times, so the stage is aborted and the job fails.
> I have created a patch for this bug and will open a pull request soon.
>  






[jira] [Assigned] (SPARK-37580) Optimize current TaskSetManager abort logic when task failed count reach the threshold

2021-12-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37580:


Assignee: (was: Apache Spark)

> Optimize current TaskSetManager abort logic when task failed count reach the 
> threshold
> --
>
> Key: SPARK-37580
> URL: https://issues.apache.org/jira/browse/SPARK-37580
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: wangshengjie
>Priority: Major
>
> In our production environment, we found a logic gap in the TaskSetManager abort 
> handling. For example:
> If a task has already failed 3 times (the default maximum failure threshold is 4), 
> and both a retry attempt and a speculative attempt are running, then one of these 2 
> attempts succeeds and Spark tries to cancel the other. But if the executor running 
> the attempt to be cancelled is lost (an OOM in our case), that attempt is marked as 
> failed, and when TaskSetManager handles this failed attempt the task has now failed 
> 4 times, so the stage is aborted and the job fails.
> I have created a patch for this bug and will open a pull request soon.
>  






[jira] [Assigned] (SPARK-37580) Optimize current TaskSetManager abort logic when task failed count reach the threshold

2021-12-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37580:


Assignee: Apache Spark

> Optimize current TaskSetManager abort logic when task failed count reach the 
> threshold
> --
>
> Key: SPARK-37580
> URL: https://issues.apache.org/jira/browse/SPARK-37580
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: wangshengjie
>Assignee: Apache Spark
>Priority: Major
>
> In our production environment, we found a logic gap in the TaskSetManager abort 
> handling. For example:
> If a task has already failed 3 times (the default maximum failure threshold is 4), 
> and both a retry attempt and a speculative attempt are running, then one of these 2 
> attempts succeeds and Spark tries to cancel the other. But if the executor running 
> the attempt to be cancelled is lost (an OOM in our case), that attempt is marked as 
> failed, and when TaskSetManager handles this failed attempt the task has now failed 
> 4 times, so the stage is aborted and the job fails.
> I have created a patch for this bug and will open a pull request soon.
> I created the patch for this bug and will soon be sent the pull request.
>  






[jira] [Created] (SPARK-37581) sql hang at planning stage

2021-12-08 Thread ocean (Jira)
ocean created SPARK-37581:
-

 Summary: sql hang at planning stage
 Key: SPARK-37581
 URL: https://issues.apache.org/jira/browse/SPARK-37581
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0, 3.1.1
Reporter: ocean


When executing a certain SQL query, it hangs at the planning stage.

When DPP (dynamic partition pruning) is disabled, the query finishes normally; a minimal 
way to disable it is sketched below.
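
A minimal sketch of disabling DPP for the session; the config key is the standard spark.sql.optimizer.dynamicPartitionPruning.enabled flag, everything else is illustrative:
{code:java}
// Turn dynamic partition pruning off before running the reproduction query below,
// then restore the default afterwards if desired.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "false")
// ... run the CREATE TABLE / INSERT / CREATE TABLE ... AS SELECT statements below via spark.sql(...) ...
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
{code}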

We can reproduce this problem with the example below:

create table test.test_a (
day string,
week int,
weekday int)
partitioned by (
dt varchar(8))
stored as orc;

insert into test.test_a partition (dt=20211126) values('1',1,2);

create table test.test_b (
session_id string,
device_id string,
brand string,
model string,
wx_version string,
os string,
net_work_type string,
app_id string,
app_name string,
col_z string,
page_url string,
page_title string,
olabel string,
otitle string,
source string,
send_dt string,
recv_dt string,
request_time string,
write_time string,
client_ip string,
col_a string,
dt_hour varchar(12),
product string,
channelfrom string,
customer_um string,
kb_code string,
col_b string,
rectype string,
errcode string,
col_c string,
pageid_merge string)
partitioned by (
dt varchar(8))
stored as orc;

insert into test.test_b partition(dt=20211126)
values('2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2');

create table if not exists test.test_c stored as ORCFILE as
select calendar.day,calendar.week,calendar.weekday, a_kbs,
b_kbs, c_kbs,d_kbs,e_kbs,f_kbs,g_kbs,h_kbs,i_kbs,
j_kbs,k_kbs,l_kbs,m_kbs,n_kbs,o_kbs,p_kbs,q_kbs,r_kbs,s_kbs
from (select * from test.test_a where dt = '20211126') calendar
left join
(select dt,count(distinct kb_code) as a_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t1
on calendar.dt = t1.dt

left join
(select dt,count(distinct kb_code) as b_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t2
on calendar.dt = t2.dt


left join
(select dt,count(distinct kb_code) as c_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t3
on calendar.dt = t3.dt

left join
(select dt,count(distinct kb_code) as d_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t4
on calendar.dt = t4.dt
left join
(select dt,count(distinct kb_code) as e_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t5
on calendar.dt = t5.dt
left join
(select dt,count(distinct kb_code) as f_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t6
on calendar.dt = t6.dt

left join
(select dt,count(distinct kb_code) as g_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t7
on calendar.dt = t7.dt

left join
(select dt,count(distinct kb_code) as h_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t8
on calendar.dt = t8.dt

left join
(select dt,count(distinct kb_code) as i_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t9
on calendar.dt = t9.dt

left join
(select dt,count(distinct kb_code) as j_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t10
on calendar.dt = t10.dt

left join
(select dt,count(distinct kb_code) as k_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t11
on calendar.dt = t11.dt

left join
(select dt,count(distinct kb_code) as l_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t12
on calendar.dt = t12.dt

left join
(select dt,count(distinct kb_code) as m_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t13
on calendar.dt = t13.dt

left join
(select dt,count(distinct kb_code) as n_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t14
on calendar.dt = t14.dt

left join
(select dt,count(distinct kb_code) as o_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge 

[jira] [Created] (SPARK-37582) Support the binary type by contains()

2021-12-08 Thread Max Gekk (Jira)
Max Gekk created SPARK-37582:


 Summary: Support the binary type by contains()
 Key: SPARK-37582
 URL: https://issues.apache.org/jira/browse/SPARK-37582
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.3.0
Reporter: Max Gekk


The new function exposed by https://github.com/apache/spark/pull/34761 accepts the 
string type only. This ticket aims to support the binary type as well.
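
For illustration, a minimal sketch of the current string-only usage (the example values are arbitrary and assume a build that already includes the function from the pull request above):
{code:java}
// contains() currently accepts STRING arguments only; with this ticket the same
// call is expected to also accept BINARY arguments.
spark.sql("SELECT contains('Apache Spark', 'Spark')").show()  // expected: true
{code}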






[jira] [Commented] (SPARK-37582) Support the binary type by contains()

2021-12-08 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455194#comment-17455194
 ] 

Max Gekk commented on SPARK-37582:
--

[~angerszhuuu] Would you like to work on this? If so, please leave a comment 
here; otherwise I will implement it myself.

> Support the binary type by contains()
> -
>
> Key: SPARK-37582
> URL: https://issues.apache.org/jira/browse/SPARK-37582
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> The new function exposed by https://github.com/apache/spark/pull/34761 accepts 
> the string type only. This ticket aims to support the binary type as 
> well.






[jira] [Created] (SPARK-37583) Support the binary type by startswith() and endswith()

2021-12-08 Thread Max Gekk (Jira)
Max Gekk created SPARK-37583:


 Summary: Support the binary type by startswith() and endswith()
 Key: SPARK-37583
 URL: https://issues.apache.org/jira/browse/SPARK-37583
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.3.0
Reporter: Max Gekk


New functions were exposed by https://github.com/apache/spark/pull/34782, but 
they accept the string type only. To make migration from other systems easier, 
it would be nice to support the binary type too.






[jira] [Commented] (SPARK-37582) Support the binary type by contains()

2021-12-08 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455200#comment-17455200
 ] 

angerszhu commented on SPARK-37582:
---

Sure, will raise a pr soon

> Support the binary type by contains()
> -
>
> Key: SPARK-37582
> URL: https://issues.apache.org/jira/browse/SPARK-37582
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> The new function exposed by https://github.com/apache/spark/pull/34761 accepts 
> the string type only. This ticket aims to support the binary type as 
> well.






[jira] [Created] (SPARK-37584) New function: map_contains_key

2021-12-08 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-37584:
--

 Summary: New function: map_contains_key
 Key: SPARK-37584
 URL: https://issues.apache.org/jira/browse/SPARK-37584
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.3.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang


Add a new function map_contains_key, which returns true if the map contains the 
key

Examples:
> SELECT map_contains_key(map(1, 'a', 2, 'b'), 1);
true
> SELECT map_contains_key(map(1, 'a', 2, 'b'), 3);
false






[jira] [Commented] (SPARK-37445) Update hadoop-profile

2021-12-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455202#comment-17455202
 ] 

Apache Spark commented on SPARK-37445:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/34835

> Update hadoop-profile
> -
>
> Key: SPARK-37445
> URL: https://issues.apache.org/jira/browse/SPARK-37445
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.3.0
>
>
> The current Hadoop profile is hadoop-3.2; update it to hadoop-3.3.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37584) New SQL function: map_contains_key

2021-12-08 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-37584:
---
Summary: New SQL function: map_contains_key  (was: New function: 
map_contains_key)

> New SQL function: map_contains_key
> --
>
> Key: SPARK-37584
> URL: https://issues.apache.org/jira/browse/SPARK-37584
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Add a new function map_contains_key, which returns true if the map contains 
> the key
> Examples:
> > SELECT map_contains_key(map(1, 'a', 2, 'b'), 1);
> true
> > SELECT map_contains_key(map(1, 'a', 2, 'b'), 3);
> false



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37584) New SQL function: map_contains_key

2021-12-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455204#comment-17455204
 ] 

Apache Spark commented on SPARK-37584:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/34836

> New SQL function: map_contains_key
> --
>
> Key: SPARK-37584
> URL: https://issues.apache.org/jira/browse/SPARK-37584
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Add a new function map_contains_key, which returns true if the map contains 
> the key
> Examples:
> > SELECT map_contains_key(map(1, 'a', 2, 'b'), 1);
> true
> > SELECT map_contains_key(map(1, 'a', 2, 'b'), 3);
> false



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37584) New SQL function: map_contains_key

2021-12-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37584:


Assignee: Gengliang Wang  (was: Apache Spark)

> New SQL function: map_contains_key
> --
>
> Key: SPARK-37584
> URL: https://issues.apache.org/jira/browse/SPARK-37584
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Add a new function map_contains_key, which returns true if the map contains 
> the key
> Examples:
> > SELECT map_contains_key(map(1, 'a', 2, 'b'), 1);
> true
> > SELECT map_contains_key(map(1, 'a', 2, 'b'), 3);
> false



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37584) New SQL function: map_contains_key

2021-12-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37584:


Assignee: Apache Spark  (was: Gengliang Wang)

> New SQL function: map_contains_key
> --
>
> Key: SPARK-37584
> URL: https://issues.apache.org/jira/browse/SPARK-37584
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> Add a new function map_contains_key, which returns true if the map contains 
> the key
> Examples:
> > SELECT map_contains_key(map(1, 'a', 2, 'b'), 1);
> true
> > SELECT map_contains_key(map(1, 'a', 2, 'b'), 3);
> false



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37584) New SQL function: map_contains_key

2021-12-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37584:


Assignee: Apache Spark  (was: Gengliang Wang)

> New SQL function: map_contains_key
> --
>
> Key: SPARK-37584
> URL: https://issues.apache.org/jira/browse/SPARK-37584
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> Add a new function map_contains_key, which returns true if the map contains 
> the key
> Examples:
> > SELECT map_contains_key(map(1, 'a', 2, 'b'), 1);
> true
> > SELECT map_contains_key(map(1, 'a', 2, 'b'), 3);
> false



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37585) DSV2 InputMetrics are not getting update in corner case

2021-12-08 Thread Sandeep Katta (Jira)
Sandeep Katta created SPARK-37585:
-

 Summary: DSV2 InputMetrics are not getting update in corner case
 Key: SPARK-37585
 URL: https://issues.apache.org/jira/browse/SPARK-37585
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.1.2, 3.0.3
Reporter: Sandeep Katta


In some corner cases, DSV2 is not updating the input metrics.

 

This is a very special case where the number of records read is less than 1000 
and *hasNext* is not called for the last element (input.hasNext returns false, 
so MetricsIterator.hasNext is not called).

 

hasNext implementation of MetricsIterator

 
{code:java}
override def hasNext: Boolean = {
  if (iter.hasNext) {
    true
  } else {
    // force a final metrics update once the underlying iterator is exhausted
    metricsHandler.updateMetrics(0, force = true)
    false
  }
}
{code}
 

You can reproduce this issue easily in spark-shell by running the code below:
{code:java}
import scala.collection.mutable
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

spark.conf.set("spark.sql.sources.useV1SourceList", "")

val dir = "Users/tmp1"
spark.range(0, 100).write.format("parquet").mode("overwrite").save(dir)
val df = spark.read.format("parquet").load(dir)

val bytesReads = new mutable.ArrayBuffer[Long]()
val recordsRead = new mutable.ArrayBuffer[Long]()

val bytesReadListener = new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    bytesReads += taskEnd.taskMetrics.inputMetrics.bytesRead
    recordsRead += taskEnd.taskMetrics.inputMetrics.recordsRead
  }
}
spark.sparkContext.addSparkListener(bytesReadListener)

try {
  df.limit(10).collect()
  assert(recordsRead.sum > 0)
  assert(bytesReads.sum > 0)
} finally {
  spark.sparkContext.removeSparkListener(bytesReadListener)
}
{code}
This code generally fails at *assert(bytesReads.sum > 0)*, which confirms that 
the updateMetrics API is not called.
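
One possible direction for a fix, sketched under the assumption that the 
iterator has access to a TaskContext and to the metricsHandler shown above 
(this is not the committed Spark patch), is to flush metrics on task completion 
instead of relying on a final hasNext call:

{code:java}
// Hypothetical sketch: force a final metrics update when the task finishes,
// so callers that stop iterating early (e.g. a limit) still report input
// metrics even though hasNext is never called past the last element.
Option(org.apache.spark.TaskContext.get()).foreach { ctx =>
  ctx.addTaskCompletionListener[Unit] { _ =>
    metricsHandler.updateMetrics(0, force = true)
  }
}
{code}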

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37583) Support the binary type by startswith() and endswith()

2021-12-08 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455262#comment-17455262
 ] 

angerszhu commented on SPARK-37583:
---

I think this should be done together with contains.

> Support the binary type by startswith() and endswith()
> --
>
> Key: SPARK-37583
> URL: https://issues.apache.org/jira/browse/SPARK-37583
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> New functions were exposed by https://github.com/apache/spark/pull/34782, but 
> they accept only the string type. To make migration from other systems 
> easier, it would be nice to support the binary type too.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37583) Support the binary type by startswith() and endswith()

2021-12-08 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455290#comment-17455290
 ] 

Max Gekk commented on SPARK-37583:
--

[~angerszhuuu] Sure, you can combine it into one PR.

> Support the binary type by startswith() and endswith()
> --
>
> Key: SPARK-37583
> URL: https://issues.apache.org/jira/browse/SPARK-37583
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> New functions were exposed by https://github.com/apache/spark/pull/34782, but 
> they accept only the string type. To make migration from other systems 
> easier, it would be nice to support the binary type too.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37576) Support built-in K8s executor roll plugin

2021-12-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37576.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34832
[https://github.com/apache/spark/pull/34832]

> Support built-in K8s executor roll plugin
> -
>
> Key: SPARK-37576
> URL: https://issues.apache.org/jira/browse/SPARK-37576
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37576) Support built-in K8s executor roll plugin

2021-12-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37576:
-

Assignee: Dongjoon Hyun

> Support built-in K8s executor roll plugin
> -
>
> Key: SPARK-37576
> URL: https://issues.apache.org/jira/browse/SPARK-37576
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37586) Add cipher mode option and set default cipher mode for aes_encrypt and aes_decrypt

2021-12-08 Thread Max Gekk (Jira)
Max Gekk created SPARK-37586:


 Summary: Add cipher mode option and set default cipher mode for 
aes_encrypt and aes_decrypt
 Key: SPARK-37586
 URL: https://issues.apache.org/jira/browse/SPARK-37586
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Max Gekk


https://github.com/apache/spark/pull/32801 added the aes_encrypt/aes_decrypt 
functions to Spark. However, they rely on the JVM's configuration regarding 
which cipher mode to use, which is problematic because it is not fixed across 
versions and systems.

Let's hardcode a default cipher mode and also allow users to set a cipher mode 
as an argument to the functions.

In the future, we can support other modes like GCM and CBC that are already 
supported by other systems:
# Snowflake: https://docs.snowflake.com/en/sql-reference/functions/encrypt.html
# Bigquery: 
https://cloud.google.com/bigquery/docs/reference/standard-sql/aead-encryption-concepts#block_cipher_modes
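
For illustration, a sketch of how the proposed mode argument might look (the 
third argument and the 'GCM' mode name are assumptions of this proposal, not 
the current signature):

{code:java}
// Hypothetical usage sketch: an explicit cipher mode instead of whatever the
// JVM's default Cipher configuration happens to be. Treat the extra 'GCM'
// argument as illustrative of the proposal only.
spark.sql(
  "SELECT cast(aes_decrypt(aes_encrypt('Spark', '0000111122223333', 'GCM'), " +
  "'0000111122223333', 'GCM') AS STRING) AS roundtrip").show()
// expected once supported: 'Spark'
{code}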



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37586) Add cipher mode option and set default cipher mode for aes_encrypt and aes_decrypt

2021-12-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37586:


Assignee: (was: Apache Spark)

> Add cipher mode option and set default cipher mode for aes_encrypt and 
> aes_decrypt
> --
>
> Key: SPARK-37586
> URL: https://issues.apache.org/jira/browse/SPARK-37586
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> https://github.com/apache/spark/pull/32801 added the aes_encrypt/aes_decrypt 
> functions to Spark. However, they rely on the JVM's configuration regarding 
> which cipher mode to use, which is problematic because it is not fixed across 
> versions and systems.
> Let's hardcode a default cipher mode and also allow users to set a cipher 
> mode as an argument to the functions.
> In the future, we can support other modes like GCM and CBC that are already 
> supported by other systems:
> # Snowflake: 
> https://docs.snowflake.com/en/sql-reference/functions/encrypt.html
> # Bigquery: 
> https://cloud.google.com/bigquery/docs/reference/standard-sql/aead-encryption-concepts#block_cipher_modes



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37586) Add cipher mode option and set default cipher mode for aes_encrypt and aes_decrypt

2021-12-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455351#comment-17455351
 ] 

Apache Spark commented on SPARK-37586:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/34837

> Add cipher mode option and set default cipher mode for aes_encrypt and 
> aes_decrypt
> --
>
> Key: SPARK-37586
> URL: https://issues.apache.org/jira/browse/SPARK-37586
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> https://github.com/apache/spark/pull/32801 added the aes_encrypt/aes_decrypt 
> functions to Spark. However, they rely on the JVM's configuration regarding 
> which cipher mode to use, which is problematic because it is not fixed across 
> versions and systems.
> Let's hardcode a default cipher mode and also allow users to set a cipher 
> mode as an argument to the functions.
> In the future, we can support other modes like GCM and CBC that are already 
> supported by other systems:
> # Snowflake: 
> https://docs.snowflake.com/en/sql-reference/functions/encrypt.html
> # Bigquery: 
> https://cloud.google.com/bigquery/docs/reference/standard-sql/aead-encryption-concepts#block_cipher_modes



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37586) Add cipher mode option and set default cipher mode for aes_encrypt and aes_decrypt

2021-12-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37586:


Assignee: Apache Spark

> Add cipher mode option and set default cipher mode for aes_encrypt and 
> aes_decrypt
> --
>
> Key: SPARK-37586
> URL: https://issues.apache.org/jira/browse/SPARK-37586
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> https://github.com/apache/spark/pull/32801 added the aes_encrypt/aes_decrypt 
> functions to Spark. However, they rely on the JVM's configuration regarding 
> which cipher mode to use, which is problematic because it is not fixed across 
> versions and systems.
> Let's hardcode a default cipher mode and also allow users to set a cipher 
> mode as an argument to the functions.
> In the future, we can support other modes like GCM and CBC that are already 
> supported by other systems:
> # Snowflake: 
> https://docs.snowflake.com/en/sql-reference/functions/encrypt.html
> # Bigquery: 
> https://cloud.google.com/bigquery/docs/reference/standard-sql/aead-encryption-concepts#block_cipher_modes



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37451) Performance improvement regressed String to Decimal cast

2021-12-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37451:
--
Priority: Blocker  (was: Major)

> Performance improvement regressed String to Decimal cast
> 
>
> Key: SPARK-37451
> URL: https://issues.apache.org/jira/browse/SPARK-37451
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.2.1
>Reporter: Raza Jafri
>Priority: Blocker
>  Labels: correctness
>
> A performance improvement to how Spark casts Strings to Decimal in this [PR 
> title|https://issues.apache.org/jira/browse/SPARK-32706], has introduced a 
> regression
> {noformat}
> scala> :paste 
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true)
> spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true)
> spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true)
> val data = Seq(Row("7.836725755512218E38"))
> val schema=StructType(Array(StructField("a", StringType, false)))
> val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> df.select(col("a").cast(DecimalType(37,-17))).show
> // Exiting paste mode, now interpreting.
> ++
> |                   a|
> ++
> |7.836725755512218...|
> ++
> scala> spark.version
> res2: String = 3.0.1
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true)
> spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true)
> spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true)
> val data = Seq(Row("7.836725755512218E38"))
> val schema=StructType(Array(StructField("a", StringType, false)))
> val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> df.select(col("a").cast(DecimalType(37,-17))).show
> // Exiting paste mode, now interpreting.
> ++
> |   a|
> ++
> |null|
> ++
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> data: Seq[org.apache.spark.sql.Row] = List([7.836725755512218E38])
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(a,StringType,false))
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> spark.version
> res1: String = 3.1.1
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37451) Performance improvement regressed String to Decimal cast

2021-12-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37451:
--
Target Version/s: 3.2.0, 3.1.3
 Description: 
A performance improvement to how Spark casts Strings to Decimal in this [PR 
title|https://issues.apache.org/jira/browse/SPARK-32706], has introduced a 
regression


{noformat}

scala> :paste 
// Entering paste mode (ctrl-D to finish)

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true)
spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true)
spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true)
val data = Seq(Row("7.836725755512218E38"))
val schema=StructType(Array(StructField("a", StringType, false)))
val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.select(col("a").cast(DecimalType(37,-17))).show


// Exiting paste mode, now interpreting.

++
|                   a|
++
|7.836725755512218...|
++

scala> spark.version
res2: String = 3.0.1


scala> :paste
// Entering paste mode (ctrl-D to finish)

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true)
spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true)
spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true)
val data = Seq(Row("7.836725755512218E38"))
val schema=StructType(Array(StructField("a", StringType, false)))
val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.select(col("a").cast(DecimalType(37,-17))).show

// Exiting paste mode, now interpreting.

++
|   a|
++
|null|
++

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
data: Seq[org.apache.spark.sql.Row] = List([7.836725755512218E38])
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(a,StringType,false))
df: org.apache.spark.sql.DataFrame = [a: string]

scala> spark.version
res1: String = 3.1.1
{noformat}



  was:

A performance improvement to how Spark casts Strings to Decimal in this [PR 
title|https://issues.apache.org/jira/browse/SPARK-32706], has introduced a 
regression


{noformat}

scala> :paste 
// Entering paste mode (ctrl-D to finish)

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true)
spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true)
spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true)
val data = Seq(Row("7.836725755512218E38"))
val schema=StructType(Array(StructField("a", StringType, false)))
val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.select(col("a").cast(DecimalType(37,-17))).show


// Exiting paste mode, now interpreting.

++
|                   a|
++
|7.836725755512218...|
++

scala> spark.version
res2: String = 3.0.1


scala> :paste
// Entering paste mode (ctrl-D to finish)

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true)
spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true)
spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true)
val data = Seq(Row("7.836725755512218E38"))
val schema=StructType(Array(StructField("a", StringType, false)))
val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.select(col("a").cast(DecimalType(37,-17))).show

// Exiting paste mode, now interpreting.

++
|   a|
++
|null|
++

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
data: Seq[org.apache.spark.sql.Row] = List([7.836725755512218E38])
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(a,StringType,false))
df: org.apache.spark.sql.DataFrame = [a: string]

scala> spark.version
res1: String = 3.1.1
{noformat}




> Performance improvement regressed String to Decimal cast
> 
>
> Key: SPARK-37451
> URL: https://issues.apache.org/jira/browse/SPARK-37451
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.2.1
>Reporter: Raza Jafri
>Priority: Blocker
>  Labels: correctness
>
> A performance improvement to how Spark casts Strings to Decimal in this [PR 
> title|https://issues.apache.org/jira/browse/SPARK-32706], has introduced a 
> regression
> {noformat}
> scala> :paste 
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal",

[jira] [Updated] (SPARK-37451) Performance improvement regressed String to Decimal cast

2021-12-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37451:
--
Labels: correctness  (was: )

> Performance improvement regressed String to Decimal cast
> 
>
> Key: SPARK-37451
> URL: https://issues.apache.org/jira/browse/SPARK-37451
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.2.1
>Reporter: Raza Jafri
>Priority: Major
>  Labels: correctness
>
> A performance improvement to how Spark casts Strings to Decimal in this [PR 
> title|https://issues.apache.org/jira/browse/SPARK-32706], has introduced a 
> regression
> {noformat}
> scala> :paste 
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true)
> spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true)
> spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true)
> val data = Seq(Row("7.836725755512218E38"))
> val schema=StructType(Array(StructField("a", StringType, false)))
> val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> df.select(col("a").cast(DecimalType(37,-17))).show
> // Exiting paste mode, now interpreting.
> ++
> |                   a|
> ++
> |7.836725755512218...|
> ++
> scala> spark.version
> res2: String = 3.0.1
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true)
> spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true)
> spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true)
> val data = Seq(Row("7.836725755512218E38"))
> val schema=StructType(Array(StructField("a", StringType, false)))
> val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> df.select(col("a").cast(DecimalType(37,-17))).show
> // Exiting paste mode, now interpreting.
> ++
> |   a|
> ++
> |null|
> ++
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> data: Seq[org.apache.spark.sql.Row] = List([7.836725755512218E38])
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(a,StringType,false))
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> spark.version
> res1: String = 3.1.1
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37451) Performance improvement regressed String to Decimal cast

2021-12-08 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455357#comment-17455357
 ] 

Dongjoon Hyun commented on SPARK-37451:
---

According to the JIRA description, I added a correctness label to this PR and 
set Target Versions: 3.1.3 and 3.2.0.
cc [~huaxingao]

> Performance improvement regressed String to Decimal cast
> 
>
> Key: SPARK-37451
> URL: https://issues.apache.org/jira/browse/SPARK-37451
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.2.1
>Reporter: Raza Jafri
>Priority: Blocker
>  Labels: correctness
>
> A performance improvement to how Spark casts Strings to Decimal in this [PR 
> title|https://issues.apache.org/jira/browse/SPARK-32706], has introduced a 
> regression
> {noformat}
> scala> :paste 
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true)
> spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true)
> spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true)
> val data = Seq(Row("7.836725755512218E38"))
> val schema=StructType(Array(StructField("a", StringType, false)))
> val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> df.select(col("a").cast(DecimalType(37,-17))).show
> // Exiting paste mode, now interpreting.
> ++
> |                   a|
> ++
> |7.836725755512218...|
> ++
> scala> spark.version
> res2: String = 3.0.1
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true)
> spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true)
> spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true)
> val data = Seq(Row("7.836725755512218E38"))
> val schema=StructType(Array(StructField("a", StringType, false)))
> val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> df.select(col("a").cast(DecimalType(37,-17))).show
> // Exiting paste mode, now interpreting.
> ++
> |   a|
> ++
> |null|
> ++
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> data: Seq[org.apache.spark.sql.Row] = List([7.836725755512218E38])
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(a,StringType,false))
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> spark.version
> res1: String = 3.1.1
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37451) Performance improvement regressed String to Decimal cast

2021-12-08 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455357#comment-17455357
 ] 

Dongjoon Hyun edited comment on SPARK-37451 at 12/8/21, 4:16 PM:
-

According to the JIRA description, I added a correctness label to this PR and 
set Target Versions: 3.1.3 and 3.2.1.
cc [~huaxingao]


was (Author: dongjoon):
According to the JIRA description, I added a correctness label to this PR and 
set Target Versions: 3.1.3 and 3.2.0.
cc [~huaxingao]

> Performance improvement regressed String to Decimal cast
> 
>
> Key: SPARK-37451
> URL: https://issues.apache.org/jira/browse/SPARK-37451
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.2.1
>Reporter: Raza Jafri
>Priority: Blocker
>  Labels: correctness
>
> A performance improvement to how Spark casts Strings to Decimal in this [PR 
> title|https://issues.apache.org/jira/browse/SPARK-32706], has introduced a 
> regression
> {noformat}
> scala> :paste 
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true)
> spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true)
> spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true)
> val data = Seq(Row("7.836725755512218E38"))
> val schema=StructType(Array(StructField("a", StringType, false)))
> val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> df.select(col("a").cast(DecimalType(37,-17))).show
> // Exiting paste mode, now interpreting.
> ++
> |                   a|
> ++
> |7.836725755512218...|
> ++
> scala> spark.version
> res2: String = 3.0.1
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true)
> spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true)
> spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true)
> val data = Seq(Row("7.836725755512218E38"))
> val schema=StructType(Array(StructField("a", StringType, false)))
> val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> df.select(col("a").cast(DecimalType(37,-17))).show
> // Exiting paste mode, now interpreting.
> ++
> |   a|
> ++
> |null|
> ++
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> data: Seq[org.apache.spark.sql.Row] = List([7.836725755512218E38])
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(a,StringType,false))
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> spark.version
> res1: String = 3.1.1
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37451) Performance improvement regressed String to Decimal cast

2021-12-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37451:
--
Target Version/s: 3.1.3, 3.2.1  (was: 3.2.0, 3.1.3)

> Performance improvement regressed String to Decimal cast
> 
>
> Key: SPARK-37451
> URL: https://issues.apache.org/jira/browse/SPARK-37451
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.2.1
>Reporter: Raza Jafri
>Priority: Blocker
>  Labels: correctness
>
> A performance improvement to how Spark casts Strings to Decimal in this [PR 
> title|https://issues.apache.org/jira/browse/SPARK-32706], has introduced a 
> regression
> {noformat}
> scala> :paste 
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true)
> spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true)
> spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true)
> val data = Seq(Row("7.836725755512218E38"))
> val schema=StructType(Array(StructField("a", StringType, false)))
> val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> df.select(col("a").cast(DecimalType(37,-17))).show
> // Exiting paste mode, now interpreting.
> ++
> |                   a|
> ++
> |7.836725755512218...|
> ++
> scala> spark.version
> res2: String = 3.0.1
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true)
> spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true)
> spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true)
> val data = Seq(Row("7.836725755512218E38"))
> val schema=StructType(Array(StructField("a", StringType, false)))
> val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> df.select(col("a").cast(DecimalType(37,-17))).show
> // Exiting paste mode, now interpreting.
> ++
> |   a|
> ++
> |null|
> ++
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> data: Seq[org.apache.spark.sql.Row] = List([7.836725755512218E38])
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(a,StringType,false))
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> spark.version
> res1: String = 3.1.1
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37572) Flexible ways of launching executors

2021-12-08 Thread Dagang Wei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dagang Wei updated SPARK-37572:
---
Affects Version/s: 3.3.0
   (was: 3.2.0)

> Flexible ways of launching executors
> 
>
> Key: SPARK-37572
> URL: https://issues.apache.org/jira/browse/SPARK-37572
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy
>Affects Versions: 3.3.0
>Reporter: Dagang Wei
>Priority: Major
>
> Currently Spark launches executor processes by constructing and running 
> commands [1], for example:
> {code:java}
> /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp 
> /opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* 
> -Xmx1024M -Dspark.driver.port=35729 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://coarsegrainedschedu...@dagang.svl.corp.google.com:35729 --executor-id 
> 0 --hostname 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 
> --worker-url spark://Worker@100.116.124.193:45287 {code}
> But there are use cases which require more flexible ways of launching 
> executors. In particular, our use case is that we run Spark in standalone 
> mode, with the Spark master and workers running in VMs. We want to allow 
> Spark app developers to provide custom container images to customize the job 
> runtime environment (typically Java and Python dependencies), so executors 
> (which run the job code) need to run in Docker containers.
> After investigating the source code, we found that the concept of a Spark 
> command runner might be a good solution. Basically, we want to introduce an 
> optional command runner in Spark, so that instead of running the executor 
> launch command directly, Spark passes the command to the runner, which then 
> runs it with its own strategy: running it in Docker, or by default running 
> the command directly.
> The runner should be customizable through an env variable 
> `SPARK_COMMAND_RUNNER`, which by default could be a simple script like:
> {code:java}
> #!/bin/bash
> exec "$@" {code}
> or in the case of Docker container:
> {code:java}
> #!/bin/bash
> docker run ... -- "$@" {code}
>  
> I already have a patch for this feature and have tested it in our environment.
>  
> [1]: 
> [https://github.com/apache/spark/blob/v3.2.0/core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala#L52]
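
To make the idea concrete, here is a minimal sketch in plain Scala of what the 
proposal amounts to (this is not the actual CommandUtils patch; the function 
name and wiring are illustrative only):

{code:java}
// Minimal sketch of the proposal, not the actual Spark change: if the
// SPARK_COMMAND_RUNNER env variable is set, prepend it to the executor
// launch command so the runner decides how to execute it (e.g. in Docker).
import scala.sys.process._

def launchExecutorCommand(command: Seq[String]): Process = {
  val runner = sys.env.get("SPARK_COMMAND_RUNNER").toSeq
  Process(runner ++ command).run()
}
{code}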



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37572) Flexible ways of launching executors

2021-12-08 Thread Dagang Wei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dagang Wei updated SPARK-37572:
---
Affects Version/s: 3.2.0
   (was: 3.3.0)

> Flexible ways of launching executors
> 
>
> Key: SPARK-37572
> URL: https://issues.apache.org/jira/browse/SPARK-37572
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy
>Affects Versions: 3.2.0
>Reporter: Dagang Wei
>Priority: Major
>
> Currently Spark launches executor processes by constructing and running 
> commands [1], for example:
> {code:java}
> /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp 
> /opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* 
> -Xmx1024M -Dspark.driver.port=35729 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://coarsegrainedschedu...@dagang.svl.corp.google.com:35729 --executor-id 
> 0 --hostname 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 
> --worker-url spark://Worker@100.116.124.193:45287 {code}
> But there are use cases which require more flexible ways of launching 
> executors. In particular, our use case is that we run Spark in standalone 
> mode, with the Spark master and workers running in VMs. We want to allow 
> Spark app developers to provide custom container images to customize the job 
> runtime environment (typically Java and Python dependencies), so executors 
> (which run the job code) need to run in Docker containers.
> After investigating the source code, we found that the concept of a Spark 
> command runner might be a good solution. Basically, we want to introduce an 
> optional command runner in Spark, so that instead of running the executor 
> launch command directly, Spark passes the command to the runner, which then 
> runs it with its own strategy: running it in Docker, or by default running 
> the command directly.
> The runner should be customizable through an env variable 
> `SPARK_COMMAND_RUNNER`, which by default could be a simple script like:
> {code:java}
> #!/bin/bash
> exec "$@" {code}
> or in the case of Docker container:
> {code:java}
> #!/bin/bash
> docker run ... -- "$@" {code}
>  
> I already have a patch for this feature and have tested it in our environment.
>  
> [1]: 
> [https://github.com/apache/spark/blob/v3.2.0/core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala#L52]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37205) Support mapreduce.job.send-token-conf when starting containers in YARN

2021-12-08 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-37205:


Assignee: Chao Sun

> Support mapreduce.job.send-token-conf when starting containers in YARN
> --
>
> Key: SPARK-37205
> URL: https://issues.apache.org/jira/browse/SPARK-37205
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>
> {{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see 
> [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]), with which the 
> RM is not required to statically have the config for all the secure HDFS 
> clusters. Currently it only works for MRv2, but it'd be nice if Spark could 
> also use this feature. I think we only need to pass the config to 
> {{LaunchContainerContext}} in {{Client.createContainerLaunchContext}}.
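
For context, a rough sketch of how a job might opt in once this is supported. 
The spark.hadoop.* passthrough itself is standard, but whether YARN honors the 
key for Spark containers is exactly what this ticket proposes, and the value 
shown is illustrative only:

{code:java}
// Hypothetical sketch: forward the Hadoop key via Spark's spark.hadoop.*
// passthrough. The regex value is just an example of the kind of HDFS client
// configs one might want sent along with the token, not a recommendation.
val spark = org.apache.spark.sql.SparkSession.builder()
  .config("spark.hadoop.mapreduce.job.send-token-conf",
    "dfs.nameservices|^dfs.namenode.rpc-address.*$|^dfs.ha.namenodes.*$")
  .getOrCreate()
{code}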



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37205) Support mapreduce.job.send-token-conf when starting containers in YARN

2021-12-08 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-37205.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34635
[https://github.com/apache/spark/pull/34635]

> Support mapreduce.job.send-token-conf when starting containers in YARN
> --
>
> Key: SPARK-37205
> URL: https://issues.apache.org/jira/browse/SPARK-37205
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.3.0
>
>
> {{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see 
> [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]), with which the 
> RM is not required to statically have the config for all the secure HDFS 
> clusters. Currently it only works for MRv2, but it'd be nice if Spark could 
> also use this feature. I think we only need to pass the config to 
> {{LaunchContainerContext}} in {{Client.createContainerLaunchContext}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37570) mypy breaks on pyspark.pandas.plot.core.Bucketizer

2021-12-08 Thread Rafal Wojdyla (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455897#comment-17455897
 ] 

Rafal Wojdyla commented on SPARK-37570:
---

A workaround in {{setup.cfg}}:

{noformat}
 [mypy-pyspark.pandas.plot.core]
 follow_imports = skip 
{noformat}

> mypy breaks on pyspark.pandas.plot.core.Bucketizer
> --
>
> Key: SPARK-37570
> URL: https://issues.apache.org/jira/browse/SPARK-37570
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: Mypy: v0.910-1
> Py: 3.9
>Reporter: Rafal Wojdyla
>Priority: Major
>
> Mypy breaks on a project with pyspark 3.2.0 dependency (worked fine for 
> 3.1.2), see stacktrace below:
> {noformat}
> Traceback (most recent call last):
>   File "/Users/rav/.cache/pre-commit/repoytd0jvc_/py_env-python3.9/bin/mypy", 
> line 8, in 
> sys.exit(console_entry())
>   File 
> "/Users/rav/.cache/pre-commit/repoytd0jvc_/py_env-python3.9/lib/python3.9/site-packages/mypy/__main__.py",
>  line 11, in console_entry
> main(None, sys.stdout, sys.stderr)
>   File "mypy/main.py", line 87, in main
>   File "mypy/main.py", line 165, in run_build
>   File "mypy/build.py", line 179, in build
>   File "mypy/build.py", line 254, in _build
>   File "mypy/build.py", line 2697, in dispatch
>   File "mypy/build.py", line 3021, in process_graph
>   File "mypy/build.py", line 3138, in process_stale_scc
>   File "mypy/build.py", line 2288, in write_cache
>   File "mypy/build.py", line 1475, in write_cache
>   File "mypy/nodes.py", line 313, in serialize
>   File "mypy/nodes.py", line 3149, in serialize
>   File "mypy/nodes.py", line 3083, in serialize
> AssertionError: Definition of pyspark.pandas.plot.core.Bucketizer is 
> unexpectedly incomplete
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37587) Forever-running streams get completed with finished status when k8s worker is lost

2021-12-08 Thread In-Ho Yi (Jira)
In-Ho Yi created SPARK-37587:


 Summary: Forever-running streams get completed with finished 
status when k8s worker is lost
 Key: SPARK-37587
 URL: https://issues.apache.org/jira/browse/SPARK-37587
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes, Structured Streaming
Affects Versions: 3.1.1
Reporter: In-Ho Yi


We have forever-running streaming jobs, defined as:

{{spark.readStream}}
{{.format("avro")}}
{{.option("maxFilesPerTrigger", 10)}}
{{.schema(schema)}}
{{.load(loadPath)}}
{{.as[T]}}
{{.writeStream}}
{{.option("checkpointLocation", checkpointPath)}}

We have several of these running, and at the end of main, I call

{{spark.streams.awaitAnyTermination()}}

so that the whole job fails if any stream fails.

Now, we are running on a k8s runner with arguments (showing only relevant ones):

{{- /opt/spark/bin/spark-submit}}
{{# From `kubectl cluster-info`}}
{{- --master}}
{{- k8s://https://SOMEADDRESS.gr7.us-east-1.eks.amazonaws.com:443}}
{{- --deploy-mode}}
{{- client}}
{{- --name}}
{{- ourstreamingjob}}
{{# Driver connection via headless service}}
{{- --conf}}
{{- spark.driver.host=spark-driver-service}}
{{- --conf}}
{{- spark.driver.port=31337}}
{{# Move ivy temp dirs under /tmp. See 
https://stackoverflow.com/a/55921242/760482}}
{{- --conf}}
{{- spark.jars.ivy=/tmp/.ivy}}
{{# JVM settings for ivy and log4j.}}
{{- --conf}}
{{- spark.driver.extraJavaOptions=-Divy.cache.dir=/tmp -Divy.home=/tmp 
-Dlog4j.configuration=file:///opt/spark/log4j.properties}}
{{- --conf}}
{{- 
spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///opt/spark/log4j.properties}}
{{# Spark on k8s settings.}}
{{- --conf}}
{{- spark.executor.instances=10}}
{{- --conf}}
{{- spark.executor.memory=8g}}
{{- --conf}}
{{- spark.executor.memoryOverhead=2g}}
{{- --conf}}
{{- 
spark.kubernetes.container.image=xxx.dkr.ecr.us-east-1.amazonaws.com/ourstreamingjob:latesthash}}
{{- --conf}}
{{- spark.kubernetes.container.image.pullPolicy=Always}}



The container image is built from the provided spark:3.1.1-hadoop3.2 image, and 
we add our jar to it.

Our monitoring found that a couple of these streams were not running anymore; 
they had been deemed "complete" with FINISHED status. I couldn't find any error 
logs from the streams themselves. We did find that the time when these jobs 
completed coincides with the time when one of the k8s workers was lost and a 
new worker was added to the cluster.

From the console, it says:
{quote}Executor 7
Removed at 2021/12/07 22:26:12
Reason: The executor with ID 7 (registered at 1638858647668 ms) was not found 
in the cluster at the polling time (1638915971948 ms) which is after the 
accepted detect delta time (3 ms) configured by 
`spark.kubernetes.executor.missingPodDetectDelta`. The executor may have been 
deleted but the driver missed the deletion event. Marking this executor as 
failed.
{quote}
I think the correct behavior would be to continue the stream from the last 
known checkpoint. If there is really an error, it should be thrown and 
propagated so that the pipeline eventually terminates.

I've looked through the changes since 3.1.1 but couldn't find a fix for this 
specific issue. Also, is there any workaround you could recommend?

Thanks!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37569) View Analysis incorrectly marks nested fields as nullable

2021-12-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455950#comment-17455950
 ] 

Apache Spark commented on SPARK-37569:
--

User 'shardulm94' has created a pull request for this issue:
https://github.com/apache/spark/pull/34839

> View Analysis incorrectly marks nested fields as nullable
> -
>
> Key: SPARK-37569
> URL: https://issues.apache.org/jira/browse/SPARK-37569
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Shardul Mahadik
>Priority: Major
>
> Consider a view as follows with all fields non-nullable (required)
> {code:java}
> spark.sql("""
> CREATE OR REPLACE VIEW v AS 
> SELECT id, named_struct('a', id) AS nested
> FROM RANGE(10)
> """)
> {code}
> we can see that the view schema has been correctly stored as non-nullable
> {code:java}
> scala> 
> System.out.println(spark.sessionState.catalog.externalCatalog.getTable("default",
>  "v2"))
> CatalogTable(
> Database: default
> Table: v2
> Owner: smahadik
> Created Time: Tue Dec 07 09:00:42 PST 2021
> Last Access: UNKNOWN
> Created By: Spark 3.3.0-SNAPSHOT
> Type: VIEW
> View Text: SELECT id, named_struct('a', id) AS nested
> FROM RANGE(10)
> View Original Text: SELECT id, named_struct('a', id) AS nested
> FROM RANGE(10)
> View Catalog and Namespace: spark_catalog.default
> View Query Output Columns: [id, nested]
> Table Properties: [transient_lastDdlTime=1638896442]
> Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
> OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> Storage Properties: [serialization.format=1]
> Schema: root
>  |-- id: long (nullable = false)
>  |-- nested: struct (nullable = false)
>  ||-- a: long (nullable = false)
> )
> {code}
> However, when trying to read this view, it incorrectly marks nested column 
> {{a}} as nullable
> {code:java}
> scala> spark.table("v2").printSchema
> root
>  |-- id: long (nullable = false)
>  |-- nested: struct (nullable = false)
>  ||-- a: long (nullable = true)
> {code}
> This is caused by [this 
> line|https://github.com/apache/spark/blob/fb40c0e19f84f2de9a3d69d809e9e4031f76ef90/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L3546]
>  in Analyzer.scala. Going through the history of changes for this block of 
> code, it seems like {{asNullable}} is a remnant of a time before we added 
> [checks|https://github.com/apache/spark/blob/fb40c0e19f84f2de9a3d69d809e9e4031f76ef90/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L3543]
>  to ensure that the from and to types of the cast were compatible. As 
> nullability is already checked, it should be safe to add a cast without 
> converting the target datatype to nullable.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37569) View Analysis incorrectly marks nested fields as nullable

2021-12-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37569:


Assignee: (was: Apache Spark)

> View Analysis incorrectly marks nested fields as nullable
> -
>
> Key: SPARK-37569
> URL: https://issues.apache.org/jira/browse/SPARK-37569
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Shardul Mahadik
>Priority: Major
>
> Consider a view as follows with all fields non-nullable (required)
> {code:java}
> spark.sql("""
> CREATE OR REPLACE VIEW v AS 
> SELECT id, named_struct('a', id) AS nested
> FROM RANGE(10)
> """)
> {code}
> we can see that the view schema has been correctly stored as non-nullable
> {code:java}
> scala> 
> System.out.println(spark.sessionState.catalog.externalCatalog.getTable("default",
>  "v2"))
> CatalogTable(
> Database: default
> Table: v2
> Owner: smahadik
> Created Time: Tue Dec 07 09:00:42 PST 2021
> Last Access: UNKNOWN
> Created By: Spark 3.3.0-SNAPSHOT
> Type: VIEW
> View Text: SELECT id, named_struct('a', id) AS nested
> FROM RANGE(10)
> View Original Text: SELECT id, named_struct('a', id) AS nested
> FROM RANGE(10)
> View Catalog and Namespace: spark_catalog.default
> View Query Output Columns: [id, nested]
> Table Properties: [transient_lastDdlTime=1638896442]
> Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
> OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> Storage Properties: [serialization.format=1]
> Schema: root
>  |-- id: long (nullable = false)
>  |-- nested: struct (nullable = false)
>  ||-- a: long (nullable = false)
> )
> {code}
> However, when trying to read this view, it incorrectly marks nested column 
> {{a}} as nullable
> {code:java}
> scala> spark.table("v2").printSchema
> root
>  |-- id: long (nullable = false)
>  |-- nested: struct (nullable = false)
>  ||-- a: long (nullable = true)
> {code}
> This is caused by [this 
> line|https://github.com/apache/spark/blob/fb40c0e19f84f2de9a3d69d809e9e4031f76ef90/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L3546]
>  in Analyzer.scala. Going through the history of changes for this block of 
> code, it seems like {{asNullable}} is a remnant of a time before we added 
> [checks|https://github.com/apache/spark/blob/fb40c0e19f84f2de9a3d69d809e9e4031f76ef90/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L3543]
>  to ensure that the from and to types of the cast were compatible. As 
> nullability is already checked, it should be safe to add a cast without 
> converting the target datatype to nullable.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37569) View Analysis incorrectly marks nested fields as nullable

2021-12-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37569:


Assignee: Apache Spark

> View Analysis incorrectly marks nested fields as nullable
> -
>
> Key: SPARK-37569
> URL: https://issues.apache.org/jira/browse/SPARK-37569
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Shardul Mahadik
>Assignee: Apache Spark
>Priority: Major
>
> Consider a view as follows with all fields non-nullable (required):
> {code:java}
> spark.sql("""
> CREATE OR REPLACE VIEW v2 AS 
> SELECT id, named_struct('a', id) AS nested
> FROM RANGE(10)
> """)
> {code}
> We can see that the view schema has been correctly stored as non-nullable:
> {code:java}
> scala> 
> System.out.println(spark.sessionState.catalog.externalCatalog.getTable("default",
>  "v2"))
> CatalogTable(
> Database: default
> Table: v2
> Owner: smahadik
> Created Time: Tue Dec 07 09:00:42 PST 2021
> Last Access: UNKNOWN
> Created By: Spark 3.3.0-SNAPSHOT
> Type: VIEW
> View Text: SELECT id, named_struct('a', id) AS nested
> FROM RANGE(10)
> View Original Text: SELECT id, named_struct('a', id) AS nested
> FROM RANGE(10)
> View Catalog and Namespace: spark_catalog.default
> View Query Output Columns: [id, nested]
> Table Properties: [transient_lastDdlTime=1638896442]
> Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
> OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> Storage Properties: [serialization.format=1]
> Schema: root
>  |-- id: long (nullable = false)
>  |-- nested: struct (nullable = false)
>  ||-- a: long (nullable = false)
> )
> {code}
> However, when reading this view, Spark incorrectly marks the nested column 
> {{a}} as nullable:
> {code:java}
> scala> spark.table("v2").printSchema
> root
>  |-- id: long (nullable = false)
>  |-- nested: struct (nullable = false)
>  ||-- a: long (nullable = true)
> {code}
> This is caused by [this 
> line|https://github.com/apache/spark/blob/fb40c0e19f84f2de9a3d69d809e9e4031f76ef90/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L3546]
>  in Analyzer.scala. Going through the history of changes for this block of 
> code, it seems like {{asNullable}} is a remnant of a time before we added 
> [checks|https://github.com/apache/spark/blob/fb40c0e19f84f2de9a3d69d809e9e4031f76ef90/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L3543]
>  to ensure that the from and to types of the cast were compatible. As 
> nullability is already checked, it should be safe to add a cast without 
> converting the target datatype to nullable.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37581) sql hang at planning stage

2021-12-08 Thread ocean (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ocean updated SPARK-37581:
--
Priority: Critical  (was: Major)

> sql hang at planning stage
> --
>
> Key: SPARK-37581
> URL: https://issues.apache.org/jira/browse/SPARK-37581
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.0
>Reporter: ocean
>Priority: Critical
>
> When executing a SQL query, it hangs at the planning stage.
> When dynamic partition pruning (DPP) is disabled, the query finishes normally.
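> As a hedged illustration of that workaround (not part of the original report; the config key is the standard DPP switch):
> {code:scala}
> // Workaround sketch: disable dynamic partition pruning for the current session
> spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "false")
> {code}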
> We can reproduce this problem with the example below:
> create table test.test_a (
> day string,
> week int,
> weekday int)
> partitioned by (
> dt varchar(8))
> stored as orc;
> insert into test.test_a partition (dt=20211126) values('1',1,2);
> create table test.test_b (
> session_id string,
> device_id string,
> brand string,
> model string,
> wx_version string,
> os string,
> net_work_type string,
> app_id string,
> app_name string,
> col_z string,
> page_url string,
> page_title string,
> olabel string,
> otitle string,
> source string,
> send_dt string,
> recv_dt string,
> request_time string,
> write_time string,
> client_ip string,
> col_a string,
> dt_hour varchar(12),
> product string,
> channelfrom string,
> customer_um string,
> kb_code string,
> col_b string,
> rectype string,
> errcode string,
> col_c string,
> pageid_merge string)
> partitioned by (
> dt varchar(8))
> stored as orc;
> insert into test.test_b partition(dt=20211126)
> values('2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2');
> create table if not exists test.test_c stored as ORCFILE as
> select calendar.day,calendar.week,calendar.weekday, a_kbs,
> b_kbs, c_kbs,d_kbs,e_kbs,f_kbs,g_kbs,h_kbs,i_kbs,
> j_kbs,k_kbs,l_kbs,m_kbs,n_kbs,o_kbs,p_kbs,q_kbs,r_kbs,s_kbs
> from (select * from test.test_a where dt = '20211126') calendar
> left join
> (select dt,count(distinct kb_code) as a_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t1
> on calendar.dt = t1.dt
> left join
> (select dt,count(distinct kb_code) as b_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t2
> on calendar.dt = t2.dt
> left join
> (select dt,count(distinct kb_code) as c_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t3
> on calendar.dt = t3.dt
> left join
> (select dt,count(distinct kb_code) as d_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t4
> on calendar.dt = t4.dt
> left join
> (select dt,count(distinct kb_code) as e_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t5
> on calendar.dt = t5.dt
> left join
> (select dt,count(distinct kb_code) as f_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t6
> on calendar.dt = t6.dt
> left join
> (select dt,count(distinct kb_code) as g_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t7
> on calendar.dt = t7.dt
> left join
> (select dt,count(distinct kb_code) as h_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t8
> on calendar.dt = t8.dt
> left join
> (select dt,count(distinct kb_code) as i_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t9
> on calendar.dt = t9.dt
> left join
> (select dt,count(distinct kb_code) as j_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t10
> on calendar.dt = t10.dt
> left join
> (select dt,count(distinct kb_code) as k_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t11
> on calendar.dt = t11.dt
> left join
> (select dt,count(distinct kb_code) as l_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t12
> on calendar.dt = t12.dt
> left join
> (select dt,count(distinct kb_code) as m_kbs
> from test.test_b
> where dt = '20

[jira] [Resolved] (SPARK-37561) Avoid loading all functions when obtaining hive's DelegationToken

2021-12-08 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-37561.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34822
[https://github.com/apache/spark/pull/34822]

> Avoid loading all functions when obtaining hive's DelegationToken
> -
>
> Key: SPARK-37561
> URL: https://issues.apache.org/jira/browse/SPARK-37561
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Trivial
> Fix For: 3.3.0
>
> Attachments: getDelegationToken_load_functions.png
>
>
> At present, when obtaining Hive's delegation token, all functions are loaded.
> This is unnecessary: loading the functions takes time and also increases the 
> burden on the Hive metastore.
>  
> !getDelegationToken_load_functions.png!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37561) Avoid loading all functions when obtaining hive's DelegationToken

2021-12-08 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-37561:


Assignee: dzcxzl

> Avoid loading all functions when obtaining hive's DelegationToken
> -
>
> Key: SPARK-37561
> URL: https://issues.apache.org/jira/browse/SPARK-37561
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Trivial
> Attachments: getDelegationToken_load_functions.png
>
>
> At present, when obtaining Hive's delegation token, all functions are loaded.
> This is unnecessary: loading the functions takes time and also increases the 
> burden on the Hive metastore.
>  
> !getDelegationToken_load_functions.png!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31785) Add a helper function to test all parquet readers

2021-12-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456088#comment-17456088
 ] 

Apache Spark commented on SPARK-31785:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/34840

> Add a helper function to test all parquet readers
> -
>
> Key: SPARK-31785
> URL: https://issues.apache.org/jira/browse/SPARK-31785
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> Add a method withAllParquetReaders {} that runs a block of code for all 
> supported Parquet readers, and re-use it in the test suites. This should 
> de-duplicate code and allow OSS Spark-based projects that have their own 
> Parquet readers to re-use the existing tests.
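As an illustration, a minimal hedged sketch of what such a helper could look like (the real helper lives in Spark's Parquet test suites; withSQLConf and the config key below are the assumed building blocks):

{code:scala}
// Hedged sketch of the helper described above, assuming a test trait that
// already provides withSQLConf (as Spark's SQLTestUtils does).
protected def withAllParquetReaders(code: => Unit): Unit = {
  // Run the block once with the row-based (non-vectorized) reader ...
  withSQLConf("spark.sql.parquet.enableVectorizedReader" -> "false")(code)
  // ... and once with the vectorized reader.
  withSQLConf("spark.sql.parquet.enableVectorizedReader" -> "true")(code)
}
{code}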



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31785) Add a helper function to test all parquet readers

2021-12-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456090#comment-17456090
 ] 

Apache Spark commented on SPARK-31785:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/34840

> Add a helper function to test all parquet readers
> -
>
> Key: SPARK-31785
> URL: https://issues.apache.org/jira/browse/SPARK-31785
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> Add a method withAllParquetReaders {} that runs a block of code for all 
> supported Parquet readers, and re-use it in the test suites. This should 
> de-duplicate code and allow OSS Spark-based projects that have their own 
> Parquet readers to re-use the existing tests.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37581) sql hang at planning stage

2021-12-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37581:
-
Priority: Major  (was: Critical)

> sql hang at planning stage
> --
>
> Key: SPARK-37581
> URL: https://issues.apache.org/jira/browse/SPARK-37581
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.0
>Reporter: ocean
>Priority: Major
>
> When executing a SQL query, it hangs at the planning stage.
> When dynamic partition pruning (DPP) is disabled, the query finishes normally.
> We can reproduce this problem with the example below:
> create table test.test_a (
> day string,
> week int,
> weekday int)
> partitioned by (
> dt varchar(8))
> stored as orc;
> insert into test.test_a partition (dt=20211126) values('1',1,2);
> create table test.test_b (
> session_id string,
> device_id string,
> brand string,
> model string,
> wx_version string,
> os string,
> net_work_type string,
> app_id string,
> app_name string,
> col_z string,
> page_url string,
> page_title string,
> olabel string,
> otitle string,
> source string,
> send_dt string,
> recv_dt string,
> request_time string,
> write_time string,
> client_ip string,
> col_a string,
> dt_hour varchar(12),
> product string,
> channelfrom string,
> customer_um string,
> kb_code string,
> col_b string,
> rectype string,
> errcode string,
> col_c string,
> pageid_merge string)
> partitioned by (
> dt varchar(8))
> stored as orc;
> insert into test.test_b partition(dt=20211126)
> values('2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2');
> create table if not exists test.test_c stored as ORCFILE as
> select calendar.day,calendar.week,calendar.weekday, a_kbs,
> b_kbs, c_kbs,d_kbs,e_kbs,f_kbs,g_kbs,h_kbs,i_kbs,
> j_kbs,k_kbs,l_kbs,m_kbs,n_kbs,o_kbs,p_kbs,q_kbs,r_kbs,s_kbs
> from (select * from test.test_a where dt = '20211126') calendar
> left join
> (select dt,count(distinct kb_code) as a_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t1
> on calendar.dt = t1.dt
> left join
> (select dt,count(distinct kb_code) as b_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t2
> on calendar.dt = t2.dt
> left join
> (select dt,count(distinct kb_code) as c_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t3
> on calendar.dt = t3.dt
> left join
> (select dt,count(distinct kb_code) as d_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t4
> on calendar.dt = t4.dt
> left join
> (select dt,count(distinct kb_code) as e_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t5
> on calendar.dt = t5.dt
> left join
> (select dt,count(distinct kb_code) as f_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t6
> on calendar.dt = t6.dt
> left join
> (select dt,count(distinct kb_code) as g_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t7
> on calendar.dt = t7.dt
> left join
> (select dt,count(distinct kb_code) as h_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t8
> on calendar.dt = t8.dt
> left join
> (select dt,count(distinct kb_code) as i_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t9
> on calendar.dt = t9.dt
> left join
> (select dt,count(distinct kb_code) as j_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t10
> on calendar.dt = t10.dt
> left join
> (select dt,count(distinct kb_code) as k_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t11
> on calendar.dt = t11.dt
> left join
> (select dt,count(distinct kb_code) as l_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t12
> on calendar.dt = t12.dt
> left join
> (select dt,count(distinct kb_code) as m_kbs
> from test.test_b
> whe

[jira] [Commented] (SPARK-37581) sql hang at planning stage

2021-12-08 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456091#comment-17456091
 ] 

Hyukjin Kwon commented on SPARK-37581:
--

[~oceaneast] do you mind narrowing this down and making the reproducer smaller? The 
current reproducer is too complicated to debug and investigate further.

> sql hang at planning stage
> --
>
> Key: SPARK-37581
> URL: https://issues.apache.org/jira/browse/SPARK-37581
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.0
>Reporter: ocean
>Priority: Critical
>
> When executing a SQL query, it hangs at the planning stage.
> When dynamic partition pruning (DPP) is disabled, the query finishes normally.
> We can reproduce this problem with the example below:
> create table test.test_a (
> day string,
> week int,
> weekday int)
> partitioned by (
> dt varchar(8))
> stored as orc;
> insert into test.test_a partition (dt=20211126) values('1',1,2);
> create table test.test_b (
> session_id string,
> device_id string,
> brand string,
> model string,
> wx_version string,
> os string,
> net_work_type string,
> app_id string,
> app_name string,
> col_z string,
> page_url string,
> page_title string,
> olabel string,
> otitle string,
> source string,
> send_dt string,
> recv_dt string,
> request_time string,
> write_time string,
> client_ip string,
> col_a string,
> dt_hour varchar(12),
> product string,
> channelfrom string,
> customer_um string,
> kb_code string,
> col_b string,
> rectype string,
> errcode string,
> col_c string,
> pageid_merge string)
> partitioned by (
> dt varchar(8))
> stored as orc;
> insert into test.test_b partition(dt=20211126)
> values('2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2');
> create table if not exists test.test_c stored as ORCFILE as
> select calendar.day,calendar.week,calendar.weekday, a_kbs,
> b_kbs, c_kbs,d_kbs,e_kbs,f_kbs,g_kbs,h_kbs,i_kbs,
> j_kbs,k_kbs,l_kbs,m_kbs,n_kbs,o_kbs,p_kbs,q_kbs,r_kbs,s_kbs
> from (select * from test.test_a where dt = '20211126') calendar
> left join
> (select dt,count(distinct kb_code) as a_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t1
> on calendar.dt = t1.dt
> left join
> (select dt,count(distinct kb_code) as b_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t2
> on calendar.dt = t2.dt
> left join
> (select dt,count(distinct kb_code) as c_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t3
> on calendar.dt = t3.dt
> left join
> (select dt,count(distinct kb_code) as d_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t4
> on calendar.dt = t4.dt
> left join
> (select dt,count(distinct kb_code) as e_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t5
> on calendar.dt = t5.dt
> left join
> (select dt,count(distinct kb_code) as f_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t6
> on calendar.dt = t6.dt
> left join
> (select dt,count(distinct kb_code) as g_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t7
> on calendar.dt = t7.dt
> left join
> (select dt,count(distinct kb_code) as h_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t8
> on calendar.dt = t8.dt
> left join
> (select dt,count(distinct kb_code) as i_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t9
> on calendar.dt = t9.dt
> left join
> (select dt,count(distinct kb_code) as j_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t10
> on calendar.dt = t10.dt
> left join
> (select dt,count(distinct kb_code) as k_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t11
> on calendar.dt = t11.dt
> left join
> (select dt,count(distinct kb_code) as l_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) =

[jira] [Commented] (SPARK-37579) Called spark.sql multiple times, union multiple DataFrame, groupBy and pivot, join other table view cause exception

2021-12-08 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456092#comment-17456092
 ] 

Hyukjin Kwon commented on SPARK-37579:
--

Spark 2.4 is EOL. Mind checking whether this passes with Spark 3+?

> Called spark.sql multiple times, union multiple DataFrame, groupBy and pivot, 
> join other table view cause exception
> --
>
> Key: SPARK-37579
> URL: https://issues.apache.org/jira/browse/SPARK-37579
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7
>Reporter: page
>Priority: Major
>
> Possible steps to reproduce:
> 1. Run spark.sql multiple times to get a list of DataFrames [d1, d2, d3, d4]
> 2. Combine the DataFrames [d1, d2, d3, d4] into a single DataFrame d5 by calling 
> Dataset#unionByName
> 3. Run 
> {code:java}
> d5.groupBy("c1").pivot("c2").agg(concat_ws(", ", collect_list("value"))){code}
> to produce DataFrame d6
> 4. Join DataFrame d6 with another DataFrame d7
> 5. Call an action such as count to trigger the Spark job
> 6. An exception happens (a hedged sketch of these steps follows below)
>  
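> A hedged Scala sketch of those steps (only c1, c2 and value come from the report; the queries, view names and join key are assumptions for illustration):
> {code:scala}
> import org.apache.spark.sql.functions.{collect_list, concat_ws}
> 
> // Steps 1-2: several spark.sql results combined with unionByName (queries assumed)
> val parts = Seq("q1", "q2", "q3", "q4").map(q => spark.sql(s"SELECT c1, c2, value FROM ${q}_view"))
> val d5 = parts.reduce(_ unionByName _)
> // Step 3: groupBy + pivot + aggregate, producing d6
> val d6 = d5.groupBy("c1").pivot("c2").agg(concat_ws(", ", collect_list("value")))
> // Steps 4-5: join with another DataFrame d7 and trigger the job with count
> val d7 = spark.table("some_other_view")
> d6.join(d7, "c1").count()
> {code}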
> stack trace:
> org.apache.spark.SparkException: Exception thrown in awaitResult:
> at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
> at 
> org.apache.spark.sql.execution.adaptive.QueryStage.executeChildStages(QueryStage.scala:88)
> at 
> org.apache.spark.sql.execution.adaptive.QueryStage.prepareExecuteStage(QueryStage.scala:136)
> at 
> org.apache.spark.sql.execution.adaptive.QueryStage.executeCollect(QueryStage.scala:242)
> at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2837)
> at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2836)
> at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3441)
> at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:92)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:139)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withAction(Dataset.scala:3440)
> at org.apache.spark.sql.Dataset.count(Dataset.scala:2836)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalArgumentException: Can't zip RDDs with unequal 
> numbers of partitions: List(2, 1)
> at 
> org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:269)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:269)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:269)
> at org.apache.spark.ShuffleDependency.(Dependency.scala:94)
> at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:361)
> at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:69)
> at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.eagerExecute(ShuffleExchangeExec.scala:112)
> at 
> org.apache.spark.sql.execution.adaptive.ShuffleQueryStage.executeStage(QueryStage.scala:284)
> at 
> org.apache.spark.sql.execution.adaptive.QueryStage.doExecute(QueryStage.scala:236)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:137)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:133)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:161)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:158)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:133)
> at 
> org.a

[jira] [Updated] (SPARK-37577) ClassCastException: ArrayType cannot be cast to StructType

2021-12-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37577:
-
Priority: Major  (was: Critical)

> ClassCastException: ArrayType cannot be cast to StructType
> --
>
> Key: SPARK-37577
> URL: https://issues.apache.org/jira/browse/SPARK-37577
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: Py: 3.9
>Reporter: Rafal Wojdyla
>Priority: Major
>
> Reproduction:
> {code:python}
> import pyspark.sql.functions as F
> from pyspark.sql.types import StructType, StructField, ArrayType, StringType
> t = StructType([StructField('o', ArrayType(StructType([StructField('s',
>StringType(), False), StructField('b',
>ArrayType(StructType([StructField('e', StringType(),
>False)]), True), False)]), True), False)])
> (
> spark.createDataFrame([], schema=t)
> .select(F.explode("o").alias("eo"))
> .select("eo.*")
> .select(F.explode("b"))
> .count()
> )
> {code}
> The code above works fine in 3.1.2 and fails in 3.2.0. See the stacktrace 
> below. Note that if you remove field {{s}}, the code works fine, which is a 
> bit unexpected and likely a clue.
> {noformat}
> Py4JJavaError: An error occurred while calling o156.count.
> : java.lang.ClassCastException: class org.apache.spark.sql.types.ArrayType 
> cannot be cast to class org.apache.spark.sql.types.StructType 
> (org.apache.spark.sql.types.ArrayType and 
> org.apache.spark.sql.types.StructType are in unnamed module of loader 'app')
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema$lzycompute(complexTypeExtractors.scala:107)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema(complexTypeExtractors.scala:107)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.$anonfun$extractFieldName$1(complexTypeExtractors.scala:117)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.extractFieldName(complexTypeExtractors.scala:117)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1$$anonfun$2.applyOrElse(NestedColumnAliasing.scala:372)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1$$anonfun$2.applyOrElse(NestedColumnAliasing.scala:368)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$4(TreeNode.scala:539)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:539)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:508)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1.applyOrElse(NestedColumnAliasing.scala:368)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1.applyOrElse(NestedColumnAliasing.scala:366)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsDownWithPruning$1(QueryPlan.scala:152)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDownWithPruning(QueryPlan.scala:152)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsWithPruning(QueryPlan.scala:123)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:101)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$.unapply(NestedColumnAliasing.scala:366)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ColumnPruning$$anonfun$apply$1

[jira] [Commented] (SPARK-37577) ClassCastException: ArrayType cannot be cast to StructType

2021-12-08 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456093#comment-17456093
 ] 

Hyukjin Kwon commented on SPARK-37577:
--

It works if you disable:

{code}
spark.conf.set("spark.sql.optimizer.expression.nestedPruning.enabled", False)
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", False)
{code}

cc [~viirya] FYI.

> ClassCastException: ArrayType cannot be cast to StructType
> --
>
> Key: SPARK-37577
> URL: https://issues.apache.org/jira/browse/SPARK-37577
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: Py: 3.9
>Reporter: Rafal Wojdyla
>Priority: Major
>
> Reproduction:
> {code:python}
> import pyspark.sql.functions as F
> from pyspark.sql.types import StructType, StructField, ArrayType, StringType
> t = StructType([StructField('o', ArrayType(StructType([StructField('s',
>StringType(), False), StructField('b',
>ArrayType(StructType([StructField('e', StringType(),
>False)]), True), False)]), True), False)])
> (
> spark.createDataFrame([], schema=t)
> .select(F.explode("o").alias("eo"))
> .select("eo.*")
> .select(F.explode("b"))
> .count()
> )
> {code}
> The code above works fine in 3.1.2 and fails in 3.2.0. See the stacktrace 
> below. Note that if you remove field {{s}}, the code works fine, which is a 
> bit unexpected and likely a clue.
> {noformat}
> Py4JJavaError: An error occurred while calling o156.count.
> : java.lang.ClassCastException: class org.apache.spark.sql.types.ArrayType 
> cannot be cast to class org.apache.spark.sql.types.StructType 
> (org.apache.spark.sql.types.ArrayType and 
> org.apache.spark.sql.types.StructType are in unnamed module of loader 'app')
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema$lzycompute(complexTypeExtractors.scala:107)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema(complexTypeExtractors.scala:107)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.$anonfun$extractFieldName$1(complexTypeExtractors.scala:117)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.extractFieldName(complexTypeExtractors.scala:117)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1$$anonfun$2.applyOrElse(NestedColumnAliasing.scala:372)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1$$anonfun$2.applyOrElse(NestedColumnAliasing.scala:368)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$4(TreeNode.scala:539)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:539)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:508)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1.applyOrElse(NestedColumnAliasing.scala:368)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1.applyOrElse(NestedColumnAliasing.scala:366)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsDownWithPruning$1(QueryPlan.scala:152)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDownWithPruning(QueryPlan.scala:152)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsWithPruning(QueryPlan.scala:123)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions

[jira] [Comment Edited] (SPARK-37577) ClassCastException: ArrayType cannot be cast to StructType

2021-12-08 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456093#comment-17456093
 ] 

Hyukjin Kwon edited comment on SPARK-37577 at 12/9/21, 2:16 AM:


It works if you disable:

{code}
spark.conf.set("spark.sql.optimizer.expression.nestedPruning.enabled", False)
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", False)
{code}

cc [~viirya] FYI.

log:

{code}
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.ColumnPruning ===
Before:
Aggregate [count(1) AS count#22L]
+- Project [col#9]
   +- Generate explode(b#6), false, [col#9]
      +- Project [eo#3.s AS s#5, eo#3.b AS b#6]
         +- Project [eo#3]
            +- Generate explode(o#0), false, [eo#3]
               +- LogicalRDD [o#0], false
After:
Aggregate [count(1) AS count#22L]
+- Project
   +- Project
      +- Generate explode(b#6), [0], false, [col#9]
         +- Project [b#6]
            +- Project [eo#3.b AS b#6]
               +- Project [eo#3]
                  +- Generate explode(o#0), [0], false, [eo#3]
                     +- LogicalRDD [o#0], false

21/12/09 11:14:50 TRACE PlanChangeLogger:
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.CollapseProject ===
Before:
Aggregate [count(1) AS count#22L]
+- Project
   +- Project
      +- Generate explode(b#6), [0], false, [col#9]
         +- Project [b#6]
            +- Project [eo#3.b AS b#6]
               +- Project [eo#3]
                  +- Generate explode(o#0), [0], false, [eo#3]
                     +- LogicalRDD [o#0], false
After:
Aggregate [count(1) AS count#22L]
+- Project
   +- Generate explode(b#6), [0], false, [col#9]
      +- Project [eo#3.b AS b#6]
         +- Generate explode(o#0), [0], false, [eo#3]
            +- LogicalRDD [o#0], false
{code}


was (Author: hyukjin.kwon):
It works if you disable:

{code}
spark.conf.set("spark.sql.optimizer.expression.nestedPruning.enabled", False)
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", False)
{code}

cc [~viirya] FYI.

> ClassCastException: ArrayType cannot be cast to StructType
> --
>
> Key: SPARK-37577
> URL: https://issues.apache.org/jira/browse/SPARK-37577
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: Py: 3.9
>Reporter: Rafal Wojdyla
>Priority: Major
>
> Reproduction:
> {code:python}
> import pyspark.sql.functions as F
> from pyspark.sql.types import StructType, StructField, ArrayType, StringType
> t = StructType([StructField('o', ArrayType(StructType([StructField('s',
>StringType(), False), StructField('b',
>ArrayType(StructType([StructField('e', StringType(),
>False)]), True), False)]), True), False)])
> (
> spark.createDataFrame([], schema=t)
> .select(F.explode("o").alias("eo"))
> .select("eo.*")
> .select(F.explode("b"))
> .count()
> )
> {code}
> The code above works fine in 3.1.2 and fails in 3.2.0. See the stacktrace 
> below. Note that if you remove field {{s}}, the code works fine, which is a 
> bit unexpected and likely a clue.
> {noformat}
> Py4JJavaError: An error occurred while calling o156.count.
> : java.lang.ClassCastException: class org.apache.spark.sql.types.ArrayType 
> cannot be cast to class org.apache.spark.sql.types.StructType 
> (org.apache.spark.sql.types.ArrayType and 
> org.apache.spark.sql.types.StructType are in unnamed module of loader 'app')
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema$lzycompute(complexTypeExtractors.scala:107)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema(complexTypeExtractors.scala:107)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.$anonfun$extractFieldName$1(complexTypeExtractors.scala:117)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.extractFieldName(complexTypeExtractors.scala:117)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1$$anonfun$2.applyOrElse(NestedColumnAliasing.scala:372)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1$$anonfun$2.applyOrElse(NestedColumnAliasing.scala:368)
>   at 
> org.apache.spark.sql.cataly

[jira] [Created] (SPARK-37588) lot of strings get accumulated in the heap dump of spark thrift server

2021-12-08 Thread ramakrishna chilaka (Jira)
ramakrishna chilaka created SPARK-37588:
---

 Summary: lot of strings get accumulated in the heap dump of spark 
thrift server
 Key: SPARK-37588
 URL: https://issues.apache.org/jira/browse/SPARK-37588
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 3.2.0
 Environment: Open JDK (8 build 1.8.0_312-b07) and scala 12.12

OS: Red Hat Enterprise Linux 8.4 (Ootpa), platform:el8
Reporter: ramakrishna chilaka


I am starting the Spark Thrift Server using the following options:

```

/data/spark/sbin/start-thriftserver.sh --master spark://*:7077 --conf 
"spark.cores.max=320" --conf "spark.executor.cores=3" --conf 
"spark.driver.cores=15" --executor-memory=10G --driver-memory=50G --conf 
spark.sql.adaptive.coalescePartitions.enabled=true --conf 
spark.sql.adaptive.skewJoin.enabled=true --conf spark.sql.cbo.enabled=true 
--conf spark.sql.adaptive.enabled=true --conf spark.rpc.io.serverThreads=64 
--conf "spark.driver.maxResultSize=4G" --conf 
"spark.max.fetch.failures.per.stage=10" --conf 
"spark.sql.thriftServer.incrementalCollect=false" --conf 
"spark.ui.reverseProxy=true" --conf "spark.ui.reverseProxyUrl=/spark_ui" --conf 
"spark.sql.autoBroadcastJoinThreshold=1073741824" --conf 
spark.sql.thriftServer.interruptOnCancel=true --conf 
spark.sql.thriftServer.queryTimeout=0 --hiveconf 
hive.server2.transport.mode=http --hiveconf 
hive.server2.thrift.http.path=spark_sql --hiveconf 
hive.server2.thrift.min.worker.threads=500 --hiveconf 
hive.server2.thrift.max.worker.threads=2147483647 --hiveconf 
hive.server2.thrift.http.cookie.is.secure=false --hiveconf 
hive.server2.thrift.http.cookie.auth.enabled=false --hiveconf 
hive.server2.authentication=NONE --hiveconf hive.server2.enable.doAs=false 
--hiveconf spark.sql.hive.thriftServer.singleSession=true --hiveconf 
hive.server2.thrift.bind.host=0.0.0.0 --conf 
"spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf 
"spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
 --conf "spark.sql.cbo.joinReorder.enabled=true" --conf 
"spark.sql.optimizer.dynamicPartitionPruning.enabled=true" --conf 
"spark.worker.cleanup.enabled=true" --conf 
"spark.worker.cleanup.appDataTtl=3600" --hiveconf 
hive.exec.scratchdir=/data/spark_scratch/hive --hiveconf 
hive.exec.local.scratchdir=/data/spark_scratch/local_scratch_dir --hiveconf 
hive.download.resources.dir=/data/spark_scratch/hive.downloaded.resources.dir 
--hiveconf hive.querylog.location=/data/spark_scratch/hive.querylog.location 
--conf spark.executor.extraJavaOptions="-XX:+PrintGCDetails 
-XX:+PrintGCTimeStamps" --conf spark.driver.extraJavaOptions="-verbose:gc 
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps 
-Xloggc:/data/thrift_driver_gc.log -XX:+ExplicitGCInvokesConcurrent 
-XX:MinHeapFreeRatio=20 -XX:MaxHeapFreeRatio=40 -XX:GCTimeRatio=4 
-XX:AdaptiveSizePolicyWeight=90 -XX:MaxRAM=55g" --hiveconf 
"hive.server2.session.check.interval=6" --hiveconf 
"hive.server2.idle.session.timeout=90" --hiveconf 
"hive.server2.idle.session.check.operation=true" --conf 
"spark.eventLog.enabled=false" --conf "spark.cleaner.periodicGC.interval=5min" 
--conf "spark.appStateStore.asyncTracking.enable=false" --conf 
"spark.ui.retainedJobs=30" --conf "spark.ui.retainedStages=100" --conf 
"spark.ui.retainedTasks=500" --conf "spark.sql.ui.retainedExecutions=10" --conf 
"spark.ui.retainedDeadExecutors=10" --conf 
"spark.worker.ui.retainedExecutors=10" --conf 
"spark.worker.ui.retainedDrivers=10" --conf spark.ui.enabled=false --conf 
spark.stage.maxConsecutiveAttempts=10 --conf spark.executor.memoryOverhead=1G 
--conf "spark.io.compression.codec=snappy" --conf 
"spark.default.parallelism=640" --conf spark.memory.offHeap.enabled=true --conf 
"spark.memory.offHeap.size=3g" --conf "spark.memory.fraction=0.75" --conf 
"spark.memory.storageFraction=0.75"

```

The Java heap dump after heavy usage is as follows:
```
 1:  50465861 9745837152  [C
   2:  23337896 1924089944  [Ljava.lang.Object;
   3:  72524905 1740597720  java.lang.Long
   4:  50463694 1614838208  java.lang.String
   5:  22718029  726976928  
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
   6:   2259416  343483328  [Lscala.collection.mutable.HashEntry;
   7:16  141744616  [Lorg.apache.spark.sql.Row;
   8:532529  123546728  
org.apache.spark.sql.catalyst.expressions.Cast
   9:535418   72816848  
org.apache.spark.sql.catalyst.expressions.Literal
  10:   1105284   70738176  scala.collection.mutable.LinkedHashSet
  11:   1725833   70655016  [J
  12:   1154128   55398144  scala.collection.mutable.HashMap
  13:   1720740   55063680  org.apache.spark.util.collection.BitSet
  14:57   50355536  scala.collection.immutable.V

[jira] [Updated] (SPARK-37588) lot of strings get accumulated in the heap dump of spark thrift server

2021-12-08 Thread ramakrishna chilaka (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramakrishna chilaka updated SPARK-37588:

Description: 
I am starting the Spark Thrift Server using the following options:

```

/data/spark/sbin/start-thriftserver.sh --master spark://*:7077 --conf 
"spark.cores.max=320" --conf "spark.executor.cores=3" --conf 
"spark.driver.cores=15" --executor-memory=10G --driver-memory=50G --conf 
spark.sql.adaptive.coalescePartitions.enabled=true --conf 
spark.sql.adaptive.skewJoin.enabled=true --conf spark.sql.cbo.enabled=true 
--conf spark.sql.adaptive.enabled=true --conf spark.rpc.io.serverThreads=64 
--conf "spark.driver.maxResultSize=4G" --conf 
"spark.max.fetch.failures.per.stage=10" --conf 
"spark.sql.thriftServer.incrementalCollect=false" --conf 
"spark.ui.reverseProxy=true" --conf "spark.ui.reverseProxyUrl=/spark_ui" --conf 
"spark.sql.autoBroadcastJoinThreshold=1073741824" --conf 
spark.sql.thriftServer.interruptOnCancel=true --conf 
spark.sql.thriftServer.queryTimeout=0 --hiveconf 
hive.server2.transport.mode=http --hiveconf 
hive.server2.thrift.http.path=spark_sql --hiveconf 
hive.server2.thrift.min.worker.threads=500 --hiveconf 
hive.server2.thrift.max.worker.threads=2147483647 --hiveconf 
hive.server2.thrift.http.cookie.is.secure=false --hiveconf 
hive.server2.thrift.http.cookie.auth.enabled=false --hiveconf 
hive.server2.authentication=NONE --hiveconf hive.server2.enable.doAs=false 
--hiveconf spark.sql.hive.thriftServer.singleSession=true --hiveconf 
hive.server2.thrift.bind.host=0.0.0.0 --conf 
"spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf 
"spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
 --conf "spark.sql.cbo.joinReorder.enabled=true" --conf 
"spark.sql.optimizer.dynamicPartitionPruning.enabled=true" --conf 
"spark.worker.cleanup.enabled=true" --conf 
"spark.worker.cleanup.appDataTtl=3600" --hiveconf 
hive.exec.scratchdir=/data/spark_scratch/hive --hiveconf 
hive.exec.local.scratchdir=/data/spark_scratch/local_scratch_dir --hiveconf 
hive.download.resources.dir=/data/spark_scratch/hive.downloaded.resources.dir 
--hiveconf hive.querylog.location=/data/spark_scratch/hive.querylog.location 
--conf spark.executor.extraJavaOptions="-XX:+PrintGCDetails 
-XX:+PrintGCTimeStamps" --conf spark.driver.extraJavaOptions="-verbose:gc 
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps 
-Xloggc:/data/thrift_driver_gc.log -XX:+ExplicitGCInvokesConcurrent 
-XX:MinHeapFreeRatio=20 -XX:MaxHeapFreeRatio=40 -XX:GCTimeRatio=4 
-XX:AdaptiveSizePolicyWeight=90 -XX:MaxRAM=55g" --hiveconf 
"hive.server2.session.check.interval=6" --hiveconf 
"hive.server2.idle.session.timeout=90" --hiveconf 
"hive.server2.idle.session.check.operation=true" --conf 
"spark.eventLog.enabled=false" --conf "spark.cleaner.periodicGC.interval=5min" 
--conf "spark.appStateStore.asyncTracking.enable=false" --conf 
"spark.ui.retainedJobs=30" --conf "spark.ui.retainedStages=100" --conf 
"spark.ui.retainedTasks=500" --conf "spark.sql.ui.retainedExecutions=10" --conf 
"spark.ui.retainedDeadExecutors=10" --conf 
"spark.worker.ui.retainedExecutors=10" --conf 
"spark.worker.ui.retainedDrivers=10" --conf spark.ui.enabled=false --conf 
spark.stage.maxConsecutiveAttempts=10 --conf spark.executor.memoryOverhead=1G 
--conf "spark.io.compression.codec=snappy" --conf 
"spark.default.parallelism=640" --conf spark.memory.offHeap.enabled=true --conf 
"spark.memory.offHeap.size=3g" --conf "spark.memory.fraction=0.75" --conf 
"spark.memory.storageFraction=0.75"

```

The Java heap dump after heavy usage is as follows:
```
 1:  50465861 9745837152  [C
   2:  23337896 1924089944  [Ljava.lang.Object;
   3:  72524905 1740597720  java.lang.Long
   4:  50463694 1614838208  java.lang.String
   5:  22718029  726976928  
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
   6:   2259416  343483328  [Lscala.collection.mutable.HashEntry;
   7:16  141744616  [Lorg.apache.spark.sql.Row;
   8:532529  123546728  
org.apache.spark.sql.catalyst.expressions.Cast
   9:535418   72816848  
org.apache.spark.sql.catalyst.expressions.Literal
  10:   1105284   70738176  scala.collection.mutable.LinkedHashSet
  11:   1725833   70655016  [J
  12:   1154128   55398144  scala.collection.mutable.HashMap
  13:   1720740   55063680  org.apache.spark.util.collection.BitSet
  14:57   50355536  scala.collection.immutable.Vector
  15:   1602297   38455128  scala.Some
  16:   1154303   36937696  scala.collection.immutable.$colon$colon
  17:   1105284   26526816  
org.apache.spark.sql.catalyst.expressions.AttributeSet
  18:   1066442   25594608  java.lang.Integer
  19:735502   23536064  scala.collection.immutable.HashSet$H

[jira] [Updated] (SPARK-37572) Flexible ways of launching executors

2021-12-08 Thread Dagang Wei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dagang Wei updated SPARK-37572:
---
Description: 
Currently Spark launches executor processes by constructing and running 
commands [1], for example:
{code:java}
/usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp 
/opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* 
-Xmx1024M -Dspark.driver.port=35729 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://coarsegrainedschedu...@dagang.svl.corp.google.com:35729 --executor-id 0 
--hostname 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 
--worker-url spark://Worker@100.116.124.193:45287 {code}
But there are use cases which require more flexible ways of launching 
executors. In particular, our use case is that we run Spark in standalone mode, 
Spark master and workers are running in VMs. We want to allow Spark app 
developers to provide custom container images to customize the job runtime 
environment (typically Java and Python dependencies), so executors (which run 
the job code) need to run in Docker containers.

After reading the source code, we found that the concept of Spark Command 
Runner might be a good solution. Basically, we want to introduce an optional 
Spark command runner in Spark, so that instead of running the command to launch 
executor directly, it passes the command to the runner, the runner then runs 
the command with its own strategy which could be running in Docker, or by 
default running the command directly.

The runner should be customizable through an env variable 
`SPARK_COMMAND_RUNNER`, which by default could be a simple script like:
{code:java}
#!/bin/bash
exec "$@" {code}
or in the case of Docker container:
{code:java}
#!/bin/bash
docker run ... -- "$@" {code}
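
For illustration only, a hedged sketch of how the launch path could consult such a runner before executing the command (this is not the author's patch; everything except the SPARK_COMMAND_RUNNER variable name is assumed):

{code:scala}
// Hedged sketch: prepend the configured runner, if any, to the executor launch command.
def wrapWithCommandRunner(command: Seq[String], env: Map[String, String]): Seq[String] =
  env.get("SPARK_COMMAND_RUNNER") match {
    case Some(runner) => runner +: command   // e.g. a script that ends with: exec "$@"
    case None         => command             // default: run the command directly
  }
{code}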
 

I already have a patch for this feature and have tested in our environment.

 

[1]: 
[https://github.com/apache/spark/blob/v3.2.0/core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala#L52]

  was:
Currently Spark launches executor processes by constructing and running 
commands [1], for example:
{code:java}
/usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp 
/opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* 
-Xmx1024M -Dspark.driver.port=35729 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://coarsegrainedschedu...@dagang.svl.corp.google.com:35729 --executor-id 0 
--hostname 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 
--worker-url spark://Worker@100.116.124.193:45287 {code}
But there are use cases which require more flexible ways of launching 
executors. In particular, our use case is that we run Spark in standalone mode, 
Spark master and workers are running in VMs. We want to allow Spark app 
developers to provide custom container images to customize the job runtime 
environment (typically Java and Python dependencies), so executors (which run 
the job code) need to run in Docker containers.

After reading the source code, we found that the concept of Spark Command 
Runner might be a good solution. Basically, we want to introduce an optional 
Spark command runner in Spark, so that instead of running the command to launch 
executor directly, it passes the command to the runner, which the runner then 
runs the command with its own strategy which could be running in Docker, or by 
default running the command directly.

The runner should be customizable through an env variable 
`SPARK_COMMAND_RUNNER`, which by default could be a simple script like:
{code:java}
#!/bin/bash
exec "$@" {code}
or in the case of Docker container:
{code:java}
#!/bin/bash
docker run ... -- "$@" {code}
 

I already have a patch for this feature and have tested in our environment.

 

[1]: 
[https://github.com/apache/spark/blob/v3.2.0/core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala#L52]


> Flexible ways of launching executors
> 
>
> Key: SPARK-37572
> URL: https://issues.apache.org/jira/browse/SPARK-37572
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy
>Affects Versions: 3.2.0
>Reporter: Dagang Wei
>Priority: Major
>
> Currently Spark launches executor processes by constructing and running 
> commands [1], for example:
> {code:java}
> /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp 
> /opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* 
> -Xmx1024M -Dspark.driver.port=35729 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://coarsegrainedschedu...@dagang.svl.corp.google.com:35729 --executor-id 
> 0 --hostname 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 
> --worker-url spark://Worker@100.116.124.193:45287 {code}
> But there are use cases which requ

[jira] [Updated] (SPARK-37572) Flexible ways of launching executors

2021-12-08 Thread Dagang Wei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dagang Wei updated SPARK-37572:
---
Description: 
Currently Spark launches executor processes by constructing and running 
commands [1], for example:
{code:java}
/usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp 
/opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* 
-Xmx1024M -Dspark.driver.port=35729 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://coarsegrainedschedu...@dagang.svl.corp.google.com:35729 --executor-id 0 
--hostname 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 
--worker-url spark://Worker@100.116.124.193:45287 {code}
But there are use cases which require more flexible ways of launching 
executors. In particular, in our use case we run Spark in standalone mode, with 
the Spark master and workers running in VMs. We want to allow Spark app 
developers to provide custom container images to customize the job runtime 
environment (typically Java and Python dependencies), so executors (which run 
the job code) need to run in Docker containers.

After reading the source code, we found that the concept of a Spark command 
runner might be a good solution. Basically, we want to introduce an optional 
command runner in Spark: instead of running the executor launch command 
directly, Spark passes the command to the runner, and the runner then executes 
it with its own strategy, which could be running it in Docker or, by default, 
running it directly.

The runner should be customizable through an env variable 
`SPARK_COMMAND_RUNNER`, which by default could be a simple script like:
{code:java}
#!/bin/bash
exec "$@" {code}
or in the case of Docker container:
{code:java}
#!/bin/bash
docker run ... -- "$@" {code}
 

I already have a patch for this feature and have tested it in our environment.

 

[1]: 
[https://github.com/apache/spark/blob/v3.2.0/core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala#L52]
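For illustration only, a minimal sketch (not the actual patch; the object and 
method names are made up) of how the worker side could prepend such a runner to 
the command it builds:

{code:scala}
// Minimal sketch of the idea: if SPARK_COMMAND_RUNNER is set, prepend it to
// the executor launch command so the runner decides how to execute it
// (directly, inside a Docker container, etc.); otherwise run it as today.
// Names are illustrative only; this is not the actual patch.
object CommandRunnerSketch {
  def launch(baseCommand: Seq[String]): Process = {
    val finalCommand = sys.env.get("SPARK_COMMAND_RUNNER") match {
      case Some(runner) => runner +: baseCommand
      case None         => baseCommand
    }
    new ProcessBuilder(finalCommand: _*).inheritIO().start()
  }
}
{code}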

  was:
Currently Spark launches executor processes by constructing and running 
commands [1], for example:
{code:java}
/usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp 
/opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* 
-Xmx1024M -Dspark.driver.port=35729 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://coarsegrainedschedu...@dagang.svl.corp.google.com:35729 --executor-id 0 
--hostname 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 
--worker-url spark://Worker@100.116.124.193:45287 {code}
But there are use cases which require more flexible ways of launching 
executors. In particular, in our use case we run Spark in standalone mode, with 
the Spark master and workers running in VMs. We want to allow Spark app 
developers to provide custom container images to customize the job runtime 
environment (typically Java and Python dependencies), so executors (which run 
the job code) need to run in Docker containers.

After investigating the source code, we found that the concept of a Spark 
command runner might be a good solution. Basically, we want to introduce an 
optional command runner in Spark: instead of running the executor launch 
command directly, Spark passes the command to the runner, and the runner then 
executes it with its own strategy, which could be running it in Docker or, by 
default, running it directly.

The runner should be customizable through an env variable 
`SPARK_COMMAND_RUNNER`, which by default could be a simple script like:
{code:java}
#!/bin/bash
exec "$@" {code}
or in the case of Docker container:
{code:java}
#!/bin/bash
docker run ... -- "$@" {code}
 

I already have a patch for this feature and have tested it in our environment.

 

[1]: 
[https://github.com/apache/spark/blob/v3.2.0/core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala#L52]


> Flexible ways of launching executors
> 
>
> Key: SPARK-37572
> URL: https://issues.apache.org/jira/browse/SPARK-37572
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy
>Affects Versions: 3.2.0
>Reporter: Dagang Wei
>Priority: Major
>
> Currently Spark launches executor processes by constructing and running 
> commands [1], for example:
> {code:java}
> /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp 
> /opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* 
> -Xmx1024M -Dspark.driver.port=35729 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://coarsegrainedschedu...@dagang.svl.corp.google.com:35729 --executor-id 
> 0 --hostname 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 
> --worker-url spark://Worker@100.116.124.193:45287 {code}
> But there are use c

[jira] [Updated] (SPARK-37572) Flexible ways of launching executors

2021-12-08 Thread Dagang Wei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dagang Wei updated SPARK-37572:
---
Description: 
Currently Spark launches executor processes by constructing and running 
commands [1], for example:
{code:java}
/usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp 
/opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* 
-Xmx1024M -Dspark.driver.port=35729 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://coarsegrainedschedu...@dagang.svl.corp.google.com:35729 --executor-id 0 
--hostname 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 
--worker-url spark://Worker@100.116.124.193:45287 {code}
But there are use cases which require more flexible ways of launching 
executors. In particular, in our use case we run Spark in standalone mode, with 
the Spark master and workers running in VMs. We want to allow Spark app 
developers to provide custom container images to customize the job runtime 
environment (typically Java and Python dependencies), so executors (which run 
the job code) need to run in Docker containers.

After reading the source code, we found that the concept of a Spark command 
runner might be a good solution. Basically, we want to introduce an optional 
command runner in Spark: instead of running the executor launch command 
directly, Spark passes the command to the runner, and the runner then executes 
it with its own strategy, which could be running it in Docker or, by default, 
running it directly.

The runner is set through an env variable `SPARK_COMMAND_RUNNER`, which by 
default could be a simple script like:
{code:java}
#!/bin/bash
exec "$@" {code}
or in the case of Docker container:
{code:java}
#!/bin/bash
docker run ... -- "$@" {code}
 

I already have a patch for this feature and have tested it in our environment.

 

[1]: 
[https://github.com/apache/spark/blob/v3.2.0/core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala#L52]

  was:
Currently Spark launches executor processes by constructing and running 
commands [1], for example:
{code:java}
/usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp 
/opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* 
-Xmx1024M -Dspark.driver.port=35729 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://coarsegrainedschedu...@dagang.svl.corp.google.com:35729 --executor-id 0 
--hostname 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 
--worker-url spark://Worker@100.116.124.193:45287 {code}
But there are use cases which require more flexible ways of launching 
executors. In particular, in our use case we run Spark in standalone mode, with 
the Spark master and workers running in VMs. We want to allow Spark app 
developers to provide custom container images to customize the job runtime 
environment (typically Java and Python dependencies), so executors (which run 
the job code) need to run in Docker containers.

After reading the source code, we found that the concept of a Spark command 
runner might be a good solution. Basically, we want to introduce an optional 
command runner in Spark: instead of running the executor launch command 
directly, Spark passes the command to the runner, and the runner then executes 
it with its own strategy, which could be running it in Docker or, by default, 
running it directly.

The runner should be customizable through an env variable 
`SPARK_COMMAND_RUNNER`, which by default could be a simple script like:
{code:java}
#!/bin/bash
exec "$@" {code}
or in the case of Docker container:
{code:java}
#!/bin/bash
docker run ... -- "$@" {code}
 

I already have a patch for this feature and have tested it in our environment.

 

[1]: 
[https://github.com/apache/spark/blob/v3.2.0/core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala#L52]


> Flexible ways of launching executors
> 
>
> Key: SPARK-37572
> URL: https://issues.apache.org/jira/browse/SPARK-37572
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy
>Affects Versions: 3.2.0
>Reporter: Dagang Wei
>Priority: Major
>
> Currently Spark launches executor processes by constructing and running 
> commands [1], for example:
> {code:java}
> /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp 
> /opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* 
> -Xmx1024M -Dspark.driver.port=35729 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://coarsegrainedschedu...@dagang.svl.corp.google.com:35729 --executor-id 
> 0 --hostname 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 
> --worker-url spark://Worker@100.116.124.193:45287 {code}
> But there are use cases which require more flexible ways

[jira] [Updated] (SPARK-37588) lot of strings get accumulated in the heap dump of spark thrift server

2021-12-08 Thread ramakrishna chilaka (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramakrishna chilaka updated SPARK-37588:

Environment: 
Open JDK (8 build 1.8.0_312-b07) and scala 2.12

OS: Red Hat Enterprise Linux 8.4 (Ootpa), platform:el8

  was:
Open JDK (8 build 1.8.0_312-b07) and scala 12.12

OS: Red Hat Enterprise Linux 8.4 (Ootpa), platform:el8


> lot of strings get accumulated in the heap dump of spark thrift server
> --
>
> Key: SPARK-37588
> URL: https://issues.apache.org/jira/browse/SPARK-37588
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
> Environment: Open JDK (8 build 1.8.0_312-b07) and scala 2.12
> OS: Red Hat Enterprise Linux 8.4 (Ootpa), platform:el8
>Reporter: ramakrishna chilaka
>Priority: Major
>
> I am starting the Spark Thrift Server using the following options
> ```
> /data/spark/sbin/start-thriftserver.sh --master spark://*:7077 --conf 
> "spark.cores.max=320" --conf "spark.executor.cores=3" --conf 
> "spark.driver.cores=15" --executor-memory=10G --driver-memory=50G --conf 
> spark.sql.adaptive.coalescePartitions.enabled=true --conf 
> spark.sql.adaptive.skewJoin.enabled=true --conf spark.sql.cbo.enabled=true 
> --conf spark.sql.adaptive.enabled=true --conf spark.rpc.io.serverThreads=64 
> --conf "spark.driver.maxResultSize=4G" --conf 
> "spark.max.fetch.failures.per.stage=10" --conf 
> "spark.sql.thriftServer.incrementalCollect=false" --conf 
> "spark.ui.reverseProxy=true" --conf "spark.ui.reverseProxyUrl=/spark_ui" 
> --conf "spark.sql.autoBroadcastJoinThreshold=1073741824" --conf 
> spark.sql.thriftServer.interruptOnCancel=true --conf 
> spark.sql.thriftServer.queryTimeout=0 --hiveconf 
> hive.server2.transport.mode=http --hiveconf 
> hive.server2.thrift.http.path=spark_sql --hiveconf 
> hive.server2.thrift.min.worker.threads=500 --hiveconf 
> hive.server2.thrift.max.worker.threads=2147483647 --hiveconf 
> hive.server2.thrift.http.cookie.is.secure=false --hiveconf 
> hive.server2.thrift.http.cookie.auth.enabled=false --hiveconf 
> hive.server2.authentication=NONE --hiveconf hive.server2.enable.doAs=false 
> --hiveconf spark.sql.hive.thriftServer.singleSession=true --hiveconf 
> hive.server2.thrift.bind.host=0.0.0.0 --conf 
> "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf 
> "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
>  --conf "spark.sql.cbo.joinReorder.enabled=true" --conf 
> "spark.sql.optimizer.dynamicPartitionPruning.enabled=true" --conf 
> "spark.worker.cleanup.enabled=true" --conf 
> "spark.worker.cleanup.appDataTtl=3600" --hiveconf 
> hive.exec.scratchdir=/data/spark_scratch/hive --hiveconf 
> hive.exec.local.scratchdir=/data/spark_scratch/local_scratch_dir --hiveconf 
> hive.download.resources.dir=/data/spark_scratch/hive.downloaded.resources.dir 
> --hiveconf hive.querylog.location=/data/spark_scratch/hive.querylog.location 
> --conf spark.executor.extraJavaOptions="-XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps" --conf spark.driver.extraJavaOptions="-verbose:gc 
> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps 
> -Xloggc:/data/thrift_driver_gc.log -XX:+ExplicitGCInvokesConcurrent 
> -XX:MinHeapFreeRatio=20 -XX:MaxHeapFreeRatio=40 -XX:GCTimeRatio=4 
> -XX:AdaptiveSizePolicyWeight=90 -XX:MaxRAM=55g" --hiveconf 
> "hive.server2.session.check.interval=6" --hiveconf 
> "hive.server2.idle.session.timeout=90" --hiveconf 
> "hive.server2.idle.session.check.operation=true" --conf 
> "spark.eventLog.enabled=false" --conf 
> "spark.cleaner.periodicGC.interval=5min" --conf 
> "spark.appStateStore.asyncTracking.enable=false" --conf 
> "spark.ui.retainedJobs=30" --conf "spark.ui.retainedStages=100" --conf 
> "spark.ui.retainedTasks=500" --conf "spark.sql.ui.retainedExecutions=10" 
> --conf "spark.ui.retainedDeadExecutors=10" --conf 
> "spark.worker.ui.retainedExecutors=10" --conf 
> "spark.worker.ui.retainedDrivers=10" --conf spark.ui.enabled=false --conf 
> spark.stage.maxConsecutiveAttempts=10 --conf spark.executor.memoryOverhead=1G 
> --conf "spark.io.compression.codec=snappy" --conf 
> "spark.default.parallelism=640" --conf spark.memory.offHeap.enabled=true 
> --conf "spark.memory.offHeap.size=3g" --conf "spark.memory.fraction=0.75" 
> --conf "spark.memory.storageFraction=0.75"
> ```
> the java heap dump after heavy usage is as follows
> ```
>  1:  50465861 9745837152  [C
>2:  23337896 1924089944  [Ljava.lang.Object;
>3:  72524905 1740597720  java.lang.Long
>4:  50463694 1614838208  java.lang.String
>5:  22718029  726976928  
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
>6:   2259416  343483328  [Lscala.colle

[jira] [Updated] (SPARK-37572) Flexible ways of launching executors

2021-12-08 Thread Dagang Wei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dagang Wei updated SPARK-37572:
---
Description: 
Currently Spark launches executor processes by constructing and running 
commands [1], for example:
{code:java}
/usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp 
/opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* 
-Xmx1024M -Dspark.driver.port=35729 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://coarsegrainedschedu...@dagang.svl.corp.google.com:35729 --executor-id 0 
--hostname 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 
--worker-url spark://Worker@100.116.124.193:45287 {code}
But there are use cases which require more flexible ways of launching 
executors. In particular, in our use case we run Spark in standalone mode, with 
the Spark master and workers running in VMs. We want to allow Spark app 
developers to provide custom container images to customize the job runtime 
environment (typically Java and Python dependencies), so executors (which run 
the job code) need to run in Docker containers.

After reading the source code, we found that the concept of a Spark command 
runner might be a good solution. Basically, we want to introduce an optional 
command runner in Spark: instead of running the executor launch command 
directly, Spark passes the command to the runner, and the runner then executes 
it with its own strategy, which could be running it in Docker or, by default, 
running it directly.

The runner is specified through an env variable `SPARK_COMMAND_RUNNER`, which 
by default could be a simple script like:
{code:java}
#!/bin/bash
exec "$@" {code}
or in the case of Docker container:
{code:java}
#!/bin/bash
docker run ... -- "$@" {code}
 

I already have a patch for this feature and have tested it in our environment.

 

[1]: 
[https://github.com/apache/spark/blob/v3.2.0/core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala#L52]

  was:
Currently Spark launches executor processes by constructing and running 
commands [1], for example:
{code:java}
/usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp 
/opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* 
-Xmx1024M -Dspark.driver.port=35729 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://coarsegrainedschedu...@dagang.svl.corp.google.com:35729 --executor-id 0 
--hostname 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 
--worker-url spark://Worker@100.116.124.193:45287 {code}
But there are use cases which require more flexible ways of launching 
executors. In particular, in our use case we run Spark in standalone mode, with 
the Spark master and workers running in VMs. We want to allow Spark app 
developers to provide custom container images to customize the job runtime 
environment (typically Java and Python dependencies), so executors (which run 
the job code) need to run in Docker containers.

After reading the source code, we found that the concept of a Spark command 
runner might be a good solution. Basically, we want to introduce an optional 
command runner in Spark: instead of running the executor launch command 
directly, Spark passes the command to the runner, and the runner then executes 
it with its own strategy, which could be running it in Docker or, by default, 
running it directly.

The runner is set through an env variable `SPARK_COMMAND_RUNNER`, which by 
default could be a simple script like:
{code:java}
#!/bin/bash
exec "$@" {code}
or in the case of Docker container:
{code:java}
#!/bin/bash
docker run ... -- "$@" {code}
 

I already have a patch for this feature and have tested it in our environment.

 

[1]: 
[https://github.com/apache/spark/blob/v3.2.0/core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala#L52]


> Flexible ways of launching executors
> 
>
> Key: SPARK-37572
> URL: https://issues.apache.org/jira/browse/SPARK-37572
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy
>Affects Versions: 3.2.0
>Reporter: Dagang Wei
>Priority: Major
>
> Currently Spark launches executor processes by constructing and running 
> commands [1], for example:
> {code:java}
> /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp 
> /opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* 
> -Xmx1024M -Dspark.driver.port=35729 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://coarsegrainedschedu...@dagang.svl.corp.google.com:35729 --executor-id 
> 0 --hostname 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 
> --worker-url spark://Worker@100.116.124.193:45287 {code}
> But there are use cases which require more flexible ways of launch

[jira] [Commented] (SPARK-37577) ClassCastException: ArrayType cannot be cast to StructType

2021-12-08 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456103#comment-17456103
 ] 

L. C. Hsieh commented on SPARK-37577:
-

Thanks [~hyukjin.kwon]. I will take a look.

> ClassCastException: ArrayType cannot be cast to StructType
> --
>
> Key: SPARK-37577
> URL: https://issues.apache.org/jira/browse/SPARK-37577
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: Py: 3.9
>Reporter: Rafal Wojdyla
>Priority: Major
>
> Reproduction:
> {code:python}
> import pyspark.sql.functions as F
> from pyspark.sql.types import StructType, StructField, ArrayType, StringType
> t = StructType([StructField('o', ArrayType(StructType([StructField('s',
>StringType(), False), StructField('b',
>ArrayType(StructType([StructField('e', StringType(),
>False)]), True), False)]), True), False)])
> (
> spark.createDataFrame([], schema=t)
> .select(F.explode("o").alias("eo"))
> .select("eo.*")
> .select(F.explode("b"))
> .count()
> )
> {code}
> Code above works fine in 3.1.2, fails in 3.2.0. See stacktrace below. Note 
> that if you remove, field {{s}}, the code works fine, which is a bit 
> unexpected and likely a clue.
> {noformat}
> Py4JJavaError: An error occurred while calling o156.count.
> : java.lang.ClassCastException: class org.apache.spark.sql.types.ArrayType 
> cannot be cast to class org.apache.spark.sql.types.StructType 
> (org.apache.spark.sql.types.ArrayType and 
> org.apache.spark.sql.types.StructType are in unnamed module of loader 'app')
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema$lzycompute(complexTypeExtractors.scala:107)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema(complexTypeExtractors.scala:107)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.$anonfun$extractFieldName$1(complexTypeExtractors.scala:117)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.extractFieldName(complexTypeExtractors.scala:117)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1$$anonfun$2.applyOrElse(NestedColumnAliasing.scala:372)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1$$anonfun$2.applyOrElse(NestedColumnAliasing.scala:368)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$4(TreeNode.scala:539)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:539)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:508)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1.applyOrElse(NestedColumnAliasing.scala:368)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1.applyOrElse(NestedColumnAliasing.scala:366)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsDownWithPruning$1(QueryPlan.scala:152)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDownWithPruning(QueryPlan.scala:152)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsWithPruning(QueryPlan.scala:123)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:101)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$.unapply(NestedColumnAliasing.scala:366)
>   at 
> org.apa

[jira] [Created] (SPARK-37589) Improve the performance of PlanParserSuite

2021-12-08 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-37589:
--

 Summary: Improve the performance of PlanParserSuite
 Key: SPARK-37589
 URL: https://issues.apache.org/jira/browse/SPARK-37589
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: jiaan.geng


Reduce the time spent on executing PlanParserSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37589) Improve the performance of PlanParserSuite

2021-12-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37589:


Assignee: Apache Spark

> Improve the performance of PlanParserSuite
> --
>
> Key: SPARK-37589
> URL: https://issues.apache.org/jira/browse/SPARK-37589
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> Reduce the time spent on executing PlanParserSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37589) Improve the performance of PlanParserSuite

2021-12-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37589:


Assignee: (was: Apache Spark)

> Improve the performance of PlanParserSuite
> --
>
> Key: SPARK-37589
> URL: https://issues.apache.org/jira/browse/SPARK-37589
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> Reduce the time spent on executing PlanParserSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37589) Improve the performance of PlanParserSuite

2021-12-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456116#comment-17456116
 ] 

Apache Spark commented on SPARK-37589:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/34841

> Improve the performance of PlanParserSuite
> --
>
> Key: SPARK-37589
> URL: https://issues.apache.org/jira/browse/SPARK-37589
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> Reduce the time spent on executing PlanParserSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37589) Improve the performance of PlanParserSuite

2021-12-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37589:


Assignee: Apache Spark

> Improve the performance of PlanParserSuite
> --
>
> Key: SPARK-37589
> URL: https://issues.apache.org/jira/browse/SPARK-37589
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> Reduce the time spent on executing PlanParserSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37579) Called spark.sql multiple times,union multiple DataFrame, groupBy and pivot, join other table view cause exception

2021-12-08 Thread page (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456120#comment-17456120
 ] 

page commented on SPARK-37579:
--

There are no current plans to move to Spark 3+. Can I still get advice or help on Spark 2.4+?

> Called spark.sql multiple times,union multiple DataFrame, groupBy and pivot, 
> join other table view cause exception
> --
>
> Key: SPARK-37579
> URL: https://issues.apache.org/jira/browse/SPARK-37579
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7
>Reporter: page
>Priority: Major
>
> Possible steps to reproduce:
> 1. Run spark.sql multiple times, get DataFrame list [d1, d2, d3, d4]
> 2. Combine DataFrame list [d1, d2, d3, d4] to a DataFrame d5 by calling 
> Dataset#unionByName
> 3. Run 
> {code:java}
> d5.groupBy("c1").pivot("c2").agg(concat_ws(", ", collect_list("value"))){code}
> to produce DataFrame d6
> 4. Join DataFrame d6 with another DataFrame d7
> 5. Call an action such as count to trigger a Spark job
> 6. An exception happens
>  
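For readers trying to reproduce this, a minimal Scala sketch of the steps above; 
the table and column names (t, part, c1, c2, value, other_table) are 
placeholders, not taken from the original report:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, concat_ws}

object PivotUnionRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("repro").getOrCreate()

    // 1. Run spark.sql multiple times to get a list of DataFrames.
    val parts = Seq(1, 2, 3, 4).map(p =>
      spark.sql(s"SELECT c1, c2, value FROM t WHERE part = $p"))
    // 2. Combine them into one DataFrame with unionByName.
    val d5 = parts.reduce((a, b) => a.unionByName(b))
    // 3. groupBy + pivot + aggregate.
    val d6 = d5.groupBy("c1").pivot("c2").agg(concat_ws(", ", collect_list("value")))
    // 4. Join with another DataFrame.
    val d7 = spark.table("other_table")
    val joined = d6.join(d7, Seq("c1"), "left")
    // 5. Trigger a job; this is the action on which the exception was reported.
    joined.count()
  }
}
{code}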
> stack trace:
> org.apache.spark.SparkException: Exception thrown in awaitResult:
> at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
> at 
> org.apache.spark.sql.execution.adaptive.QueryStage.executeChildStages(QueryStage.scala:88)
> at 
> org.apache.spark.sql.execution.adaptive.QueryStage.prepareExecuteStage(QueryStage.scala:136)
> at 
> org.apache.spark.sql.execution.adaptive.QueryStage.executeCollect(QueryStage.scala:242)
> at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2837)
> at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2836)
> at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3441)
> at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:92)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:139)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withAction(Dataset.scala:3440)
> at org.apache.spark.sql.Dataset.count(Dataset.scala:2836)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalArgumentException: Can't zip RDDs with unequal 
> numbers of partitions: List(2, 1)
> at 
> org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:269)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:269)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:269)
> at org.apache.spark.ShuffleDependency.(Dependency.scala:94)
> at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:361)
> at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:69)
> at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.eagerExecute(ShuffleExchangeExec.scala:112)
> at 
> org.apache.spark.sql.execution.adaptive.ShuffleQueryStage.executeStage(QueryStage.scala:284)
> at 
> org.apache.spark.sql.execution.adaptive.QueryStage.doExecute(QueryStage.scala:236)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:137)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:133)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:161)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:158)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:133)
> at 
> org.

[jira] [Commented] (SPARK-37579) Called spark.sql multiple times,union multiple DataFrame, groupBy and pivot, join other table view cause exception

2021-12-08 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456121#comment-17456121
 ] 

Hyukjin Kwon commented on SPARK-37579:
--

For that, you can use the Spark dev mailing list instead of filing it as an 
issue.

> Called spark.sql multiple times,union multiple DataFrame, groupBy and pivot, 
> join other table view cause exception
> --
>
> Key: SPARK-37579
> URL: https://issues.apache.org/jira/browse/SPARK-37579
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7
>Reporter: page
>Priority: Major
>
> Possible steps to reproduce:
> 1. Run spark.sql multiple times, get DataFrame list [d1, d2, d3, d4]
> 2. Combine DataFrame list [d1, d2, d3, d4] to a DataFrame d5 by calling 
> Dataset#unionByName
> 3. Run 
> {code:java}
> d5.groupBy("c1").pivot("c2").agg(concat_ws(", ", collect_list("value"))){code}
> to produce DataFrame d6
> 4. Join DataFrame d6 with another DataFrame d7
> 5. Call an action such as count to trigger a Spark job
> 6. An exception happens
>  
> stack trace:
> org.apache.spark.SparkException: Exception thrown in awaitResult:
> at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
> at 
> org.apache.spark.sql.execution.adaptive.QueryStage.executeChildStages(QueryStage.scala:88)
> at 
> org.apache.spark.sql.execution.adaptive.QueryStage.prepareExecuteStage(QueryStage.scala:136)
> at 
> org.apache.spark.sql.execution.adaptive.QueryStage.executeCollect(QueryStage.scala:242)
> at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2837)
> at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2836)
> at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3441)
> at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:92)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:139)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withAction(Dataset.scala:3440)
> at org.apache.spark.sql.Dataset.count(Dataset.scala:2836)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalArgumentException: Can't zip RDDs with unequal 
> numbers of partitions: List(2, 1)
> at 
> org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:269)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:269)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:269)
> at org.apache.spark.ShuffleDependency.(Dependency.scala:94)
> at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:361)
> at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:69)
> at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.eagerExecute(ShuffleExchangeExec.scala:112)
> at 
> org.apache.spark.sql.execution.adaptive.ShuffleQueryStage.executeStage(QueryStage.scala:284)
> at 
> org.apache.spark.sql.execution.adaptive.QueryStage.doExecute(QueryStage.scala:236)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:137)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:133)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:161)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:158)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.s

[jira] [Resolved] (SPARK-37579) Called spark.sql multiple times,union multiple DataFrame, groupBy and pivot, join other table view cause exception

2021-12-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37579.
--
Resolution: Incomplete

> Called spark.sql multiple times,union multiple DataFrame, groupBy and pivot, 
> join other table view cause exception
> --
>
> Key: SPARK-37579
> URL: https://issues.apache.org/jira/browse/SPARK-37579
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7
>Reporter: page
>Priority: Major
>
> Possible steps to reproduce:
> 1. Run spark.sql multiple times, get DataFrame list [d1, d2, d3, d4]
> 2. Combine DataFrame list [d1, d2, d3, d4] to a DataFrame d5 by calling 
> Dataset#unionByName
> 3. Run 
> {code:java}
> d5.groupBy("c1").pivot("c2").agg(concat_ws(", ", collect_list("value"))){code}
> to produce DataFrame d6
> 4. Join DataFrame d6 with another DataFrame d7
> 5. Call an action such as count to trigger a Spark job
> 6. An exception happens
>  
> stack trace:
> org.apache.spark.SparkException: Exception thrown in awaitResult:
> at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
> at 
> org.apache.spark.sql.execution.adaptive.QueryStage.executeChildStages(QueryStage.scala:88)
> at 
> org.apache.spark.sql.execution.adaptive.QueryStage.prepareExecuteStage(QueryStage.scala:136)
> at 
> org.apache.spark.sql.execution.adaptive.QueryStage.executeCollect(QueryStage.scala:242)
> at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2837)
> at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2836)
> at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3441)
> at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:92)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:139)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withAction(Dataset.scala:3440)
> at org.apache.spark.sql.Dataset.count(Dataset.scala:2836)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalArgumentException: Can't zip RDDs with unequal 
> numbers of partitions: List(2, 1)
> at 
> org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:269)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:269)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:269)
> at org.apache.spark.ShuffleDependency.(Dependency.scala:94)
> at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:361)
> at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:69)
> at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.eagerExecute(ShuffleExchangeExec.scala:112)
> at 
> org.apache.spark.sql.execution.adaptive.ShuffleQueryStage.executeStage(QueryStage.scala:284)
> at 
> org.apache.spark.sql.execution.adaptive.QueryStage.doExecute(QueryStage.scala:236)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:137)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:133)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:161)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:158)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:133)
> at 
> org.apache.spark.sql.execution.adaptive.QueryStage$$anonfun$8$$anonfun$apply$2$$anonfun$apply$3.

[jira] [Created] (SPARK-37590) Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests

2021-12-08 Thread Terry Kim (Jira)
Terry Kim created SPARK-37590:
-

 Summary: Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests
 Key: SPARK-37590
 URL: https://issues.apache.org/jira/browse/SPARK-37590
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Terry Kim


Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37590) Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests

2021-12-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456134#comment-17456134
 ] 

Apache Spark commented on SPARK-37590:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/34842

> Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests
> 
>
> Key: SPARK-37590
> URL: https://issues.apache.org/jira/browse/SPARK-37590
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Terry Kim
>Priority: Major
>
> Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37590) Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests

2021-12-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37590:


Assignee: Apache Spark

> Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests
> 
>
> Key: SPARK-37590
> URL: https://issues.apache.org/jira/browse/SPARK-37590
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Terry Kim
>Assignee: Apache Spark
>Priority: Major
>
> Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37590) Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests

2021-12-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37590:


Assignee: (was: Apache Spark)

> Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests
> 
>
> Key: SPARK-37590
> URL: https://issues.apache.org/jira/browse/SPARK-37590
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Terry Kim
>Priority: Major
>
> Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37590) Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests

2021-12-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456135#comment-17456135
 ] 

Apache Spark commented on SPARK-37590:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/34842

> Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests
> 
>
> Key: SPARK-37590
> URL: https://issues.apache.org/jira/browse/SPARK-37590
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Terry Kim
>Priority: Major
>
> Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36137) HiveShim always fallback to getAllPartitionsOf regardless of whether directSQL is enabled in remote HMS

2021-12-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456141#comment-17456141
 ] 

Apache Spark commented on SPARK-36137:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/34843

> HiveShim always fallback to getAllPartitionsOf regardless of whether 
> directSQL is enabled in remote HMS
> ---
>
> Key: SPARK-36137
> URL: https://issues.apache.org/jira/browse/SPARK-36137
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.3.0
>
>
> At the moment {{getPartitionsByFilter}} in the Hive shim only falls back to 
> {{getAllPartitionsOf}} when {{hive.metastore.try.direct.sql}} is enabled in 
> the remote HMS. However, in certain cases the remote HMS will fall back to 
> ORM (which only supports string types for partition columns) to query the 
> underlying RDBMS even if this config is set to true, and Spark will not be 
> able to recover from the error and will just fail the query. 
> For instance, we encountered the bug HIVE-21497 in an HMS running Hive 3.1.2, 
> and Spark was not able to push down a filter on a {{date}} column.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37586) Add cipher mode option and set default cipher mode for aes_encrypt and aes_decrypt

2021-12-08 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-37586.

Fix Version/s: 3.3.0
 Assignee: Max Gekk
   Resolution: Fixed

Issue resolved in https://github.com/apache/spark/pull/34837

> Add cipher mode option and set default cipher mode for aes_encrypt and 
> aes_decrypt
> --
>
> Key: SPARK-37586
> URL: https://issues.apache.org/jira/browse/SPARK-37586
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.3.0
>
>
> https://github.com/apache/spark/pull/32801 added aes_encrypt/aes_decrypt 
> functions to Spark. However, they rely on the JVM's configuration for which 
> cipher mode to support, which is problematic as it is not fixed across 
> versions and systems.
> Let's hardcode a default cipher mode and also allow users to set a cipher 
> mode as an argument to the function.
> In the future, we can support other modes like GCM and CBC that are already 
> supported by other systems:
> # Snowflake: 
> https://docs.snowflake.com/en/sql-reference/functions/encrypt.html
> # Bigquery: 
> https://cloud.google.com/bigquery/docs/reference/standard-sql/aead-encryption-concepts#block_cipher_modes
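To make the proposed option concrete, a hedged usage sketch; the argument order 
and the accepted mode strings here are assumptions based on the description 
above, not the final API:

{code:scala}
// Hypothetical usage for SPARK-37586/SPARK-37591: assumes aes_encrypt and
// aes_decrypt accept an optional argument that selects the cipher mode.
import org.apache.spark.sql.SparkSession

object AesModeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("aes-mode").getOrCreate()
    spark.sql(
      """SELECT CAST(aes_decrypt(
        |         aes_encrypt('Spark', '0000111122223333', 'GCM'),
        |         '0000111122223333', 'GCM') AS STRING) AS roundtrip
        |""".stripMargin).show(false)
    spark.stop()
  }
}
{code}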



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37591) Support the GCM mode by aes_encrypt()/aes_decrypt()

2021-12-08 Thread Max Gekk (Jira)
Max Gekk created SPARK-37591:


 Summary: Support the GCM mode by aes_encrypt()/aes_decrypt()
 Key: SPARK-37591
 URL: https://issues.apache.org/jira/browse/SPARK-37591
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Max Gekk
Assignee: Max Gekk


Implement the GCM mode in AES - aes_encrypt() and aes_decrypt()



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37591) Support the GCM mode by aes_encrypt()/aes_decrypt()

2021-12-08 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456153#comment-17456153
 ] 

Max Gekk commented on SPARK-37591:
--

I am working on this.

> Support the GCM mode by aes_encrypt()/aes_decrypt()
> ---
>
> Key: SPARK-37591
> URL: https://issues.apache.org/jira/browse/SPARK-37591
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Implement the GCM mode in AES - aes_encrypt() and aes_decrypt()



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37581) sql hang at planning stage

2021-12-08 Thread ocean (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456154#comment-17456154
 ] 

ocean commented on SPARK-37581:
---

Hi [~hyukjin.kwon]. This SQL has 19 join operators, but the joins all follow 
the same pattern. I found that with 10 join operators it costs 17s, with 11 
join operators 39s, with 12 join operators 120s, and with 13 join operators it 
cannot finish.

I think we can debug it with 11 join operators to find out why it is so slow.

===

drop table if exists test.test_c;create table if not exists test.test_c stored 
as ORCFILE as
select calendar.day,calendar.week,calendar.weekday, a_kbs,
b_kbs, c_kbs,d_kbs,e_kbs,f_kbs,g_kbs,h_kbs,i_kbs,j_kbs,k_kbs
from (select * from test.test_a where dt = '20211126') calendar
left join
(select dt,count(distinct kb_code) as a_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t1
on calendar.dt = t1.dt

left join
(select dt,count(distinct kb_code) as b_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t2
on calendar.dt = t2.dt


left join
(select dt,count(distinct kb_code) as c_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t3
on calendar.dt = t3.dt

left join
(select dt,count(distinct kb_code) as d_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t4
on calendar.dt = t4.dt

left join
(select dt,count(distinct kb_code) as e_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t5
on calendar.dt = t5.dt

left join
(select dt,count(distinct kb_code) as f_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t6
on calendar.dt = t6.dt

left join
(select dt,count(distinct kb_code) as g_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t7
on calendar.dt = t7.dt

left join
(select dt,count(distinct kb_code) as h_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t8
on calendar.dt = t8.dt

left join
(select dt,count(distinct kb_code) as i_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t9
on calendar.dt = t9.dt

left join
(select dt,count(distinct kb_code) as j_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t10
on calendar.dt = t10.dt

left join
(select dt,count(distinct kb_code) as k_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t11
on calendar.dt = t11.dt
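
Since the issue description notes that the plan finishes when DPP is disabled, 
a quick sanity check on the 11-join variant above is to toggle the optimizer 
flag before rerunning the statement. A sketch for spark-shell follows (the 
commented line uses ctasStatement as a placeholder for the CREATE TABLE ... AS 
SELECT text above):

{code:scala}
// Workaround check, not a fix: disable dynamic partition pruning, then rerun
// the CREATE TABLE ... AS SELECT statement above. Assumes a spark-shell
// session where `spark` is the active SparkSession.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "false")
// spark.sql(ctasStatement)  // `ctasStatement` = placeholder for the CTAS text above
{code}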







 

> sql hang at planning stage
> --
>
> Key: SPARK-37581
> URL: https://issues.apache.org/jira/browse/SPARK-37581
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.0
>Reporter: ocean
>Priority: Major
>
> When executing a SQL query, it hangs at the planning stage.
> When DPP (dynamic partition pruning) is disabled, the SQL finishes normally.
> We can reproduce this problem through the example below:
> create table test.test_a (
> day string,
> week int,
> weekday int)
> partitioned by (
> dt varchar(8))
> stored as orc;
> insert into test.test_a partition (dt=20211126) values('1',1,2);
> create table test.test_b (
> session_id string,
> device_id string,
> brand string,
> model string,
> wx_version string,
> os string,
> net_work_type string,
> app_id string,
> app_name string,
> col_z string,
> page_url string,
> page_title string,
> olabel string,
> otitle string,
> source string,
> send_dt string,
> recv_dt string,
> request_time string,
> write_time string,
> client_ip string,
> col_a string,
> dt_hour varchar(12),
> product string,
> channelfrom string,
> customer_um string,
> kb_code string,
> col_b string,
> rectype string,
> errcode string,
> col_c string,
> pageid_merge string)
> partitioned by (
> dt varchar(8))
> stored as orc;
> insert into test.test_b partition(dt=20211126)
> values('2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2');
> create table if not exists test.test_c stored as ORCFILE as
> select calendar.day,calendar.week,calendar.weekday, a_kbs,
> b_kbs, c_kbs,d_kbs,e_kbs,f_kbs,g_kbs

[jira] [Comment Edited] (SPARK-37581) sql hang at planning stage

2021-12-08 Thread ocean (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456154#comment-17456154
 ] 

ocean edited comment on SPARK-37581 at 12/9/21, 6:06 AM:
-

Hi [~hyukjin.kwon]. This SQL has 19 join operators, but all of the joins follow
the same pattern. I found that with 10 join operators planning costs 17s, with 11
join operators it costs 39s, with 12 join operators 120s, and with 13 join
operators it cannot finish.

 

I think we can debug the 11-join case to find out why it is so slow.

===

drop table if exists test.test_c;create table if not exists test.test_c stored 
as ORCFILE as
select calendar.day,calendar.week,calendar.weekday, a_kbs,
b_kbs, c_kbs,d_kbs,e_kbs,f_kbs,g_kbs,h_kbs,i_kbs,j_kbs,k_kbs
from (select * from test.test_a where dt = '20211126') calendar
left join
(select dt,count(distinct kb_code) as a_kbs
from test.test_b
where dt = '20211126'
group by dt) t1
on calendar.dt = t1.dt

left join
(select dt,count(distinct kb_code) as b_kbs
from test.test_b
where dt = '20211126'
group by dt) t2
on calendar.dt = t2.dt


left join
(select dt,count(distinct kb_code) as c_kbs
from test.test_b
where dt = '20211126'
group by dt) t3
on calendar.dt = t3.dt

left join
(select dt,count(distinct kb_code) as d_kbs
from test.test_b
where dt = '20211126'
group by dt) t4
on calendar.dt = t4.dt

left join
(select dt,count(distinct kb_code) as e_kbs
from test.test_b
where dt = '20211126'
group by dt) t5
on calendar.dt = t5.dt

left join
(select dt,count(distinct kb_code) as f_kbs
from test.test_b
where dt = '20211126'
group by dt) t6
on calendar.dt = t6.dt

left join
(select dt,count(distinct kb_code) as g_kbs
from test.test_b
where dt = '20211126'
group by dt) t7
on calendar.dt = t7.dt

left join
(select dt,count(distinct kb_code) as h_kbs
from test.test_b
where dt = '20211126'
group by dt) t8
on calendar.dt = t8.dt

left join
(select dt,count(distinct kb_code) as i_kbs
from test.test_b
where dt = '20211126'
group by dt) t9
on calendar.dt = t9.dt

left join
(select dt,count(distinct kb_code) as j_kbs
from test.test_b
where dt = '20211126'
group by dt) t10
on calendar.dt = t10.dt

left join
(select dt,count(distinct kb_code) as k_kbs
from test.test_b
where dt = '20211126'
group by dt) t11
on calendar.dt = t11.dt







 


was (Author: oceaneast):
Hi [~hyukjin.kwon]. This SQL has 19 join operators, but all of the joins follow
the same pattern. I found that with 10 join operators planning costs 17s, with 11
join operators it costs 39s, with 12 join operators 120s, and with 13 join
operators it cannot finish.

 

I think we can debug the 11-join case to find out why it is so slow.

===

drop table if exists test.test_c;create table if not exists test.test_c stored 
as ORCFILE as
select calendar.day,calendar.week,calendar.weekday, a_kbs,
b_kbs, c_kbs,d_kbs,e_kbs,f_kbs,g_kbs,h_kbs,i_kbs,j_kbs,k_kbs
from (select * from test.test_a where dt = '20211126') calendar
left join
(select dt,count(distinct kb_code) as a_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t1
on calendar.dt = t1.dt

left join
(select dt,count(distinct kb_code) as b_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t2
on calendar.dt = t2.dt


left join
(select dt,count(distinct kb_code) as c_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t3
on calendar.dt = t3.dt

left join
(select dt,count(distinct kb_code) as d_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t4
on calendar.dt = t4.dt

left join
(select dt,count(distinct kb_code) as e_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t5
on calendar.dt = t5.dt

left join
(select dt,count(distinct kb_code) as f_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t6
on calendar.dt = t6.dt

left join
(select dt,count(distinct kb_code) as g_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t7
on calendar.dt = t7.dt

left join
(select dt,count(distinct kb_code) as h_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t8
on calendar.dt = t8.dt

left join
(select dt,count(distinct kb_code) as i_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group

[jira] [Comment Edited] (SPARK-37581) sql hang at planning stage

2021-12-08 Thread ocean (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456154#comment-17456154
 ] 

ocean edited comment on SPARK-37581 at 12/9/21, 6:12 AM:
-

Hi [~hyukjin.kwon]. This SQL has 19 join operators, but all of the joins follow
the same pattern. I found that with 10 join operators planning costs 17s, with 11
join operators it costs 39s, with 12 join operators 120s, and with 13 join
operators it cannot finish.

 

I think we can debug the 11-join case to find out why it is so slow. I have
narrowed it down as below:

===

drop table if exists test.test_c;create table if not exists test.test_c stored 
as ORCFILE as
select calendar.day,calendar.week,calendar.weekday, a_kbs,
b_kbs, c_kbs,d_kbs,e_kbs,f_kbs,g_kbs,h_kbs,i_kbs,j_kbs,k_kbs
from (select * from test.test_a where dt = '20211126') calendar
left join
(select dt,count(distinct kb_code) as a_kbs
from test.test_b
where dt = '20211126'
group by dt) t1
on calendar.dt = t1.dt

left join
(select dt,count(distinct kb_code) as b_kbs
from test.test_b
where dt = '20211126'
group by dt) t2
on calendar.dt = t2.dt

left join
(select dt,count(distinct kb_code) as c_kbs
from test.test_b
where dt = '20211126'
group by dt) t3
on calendar.dt = t3.dt

left join
(select dt,count(distinct kb_code) as d_kbs
from test.test_b
where dt = '20211126'
group by dt) t4
on calendar.dt = t4.dt

left join
(select dt,count(distinct kb_code) as e_kbs
from test.test_b
where dt = '20211126'
group by dt) t5
on calendar.dt = t5.dt

left join
(select dt,count(distinct kb_code) as f_kbs
from test.test_b
where dt = '20211126'
group by dt) t6
on calendar.dt = t6.dt

left join
(select dt,count(distinct kb_code) as g_kbs
from test.test_b
where dt = '20211126'
group by dt) t7
on calendar.dt = t7.dt

left join
(select dt,count(distinct kb_code) as h_kbs
from test.test_b
where dt = '20211126'
group by dt) t8
on calendar.dt = t8.dt

left join
(select dt,count(distinct kb_code) as i_kbs
from test.test_b
where dt = '20211126'
group by dt) t9
on calendar.dt = t9.dt

left join
(select dt,count(distinct kb_code) as j_kbs
from test.test_b
where dt = '20211126'
group by dt) t10
on calendar.dt = t10.dt

left join
(select dt,count(distinct kb_code) as k_kbs
from test.test_b
where dt = '20211126'
group by dt) t11
on calendar.dt = t11.dt
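
A minimal sketch for measuring just the planning cost of the statement above (spark-shell assumed; {{selectText}} is a hypothetical variable holding the SELECT part of the CTAS):
{code:java}
// Sketch only: force analysis, optimization and physical planning of the
// SELECT without running a job. `selectText` is a placeholder for the query
// text above.
val df = spark.sql(selectText)
spark.time {
  df.queryExecution.executedPlan
}
{code}
Comparing the reported time with DPP enabled and disabled should show whether the extra cost is spent in the optimizer.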

 


was (Author: oceaneast):
Hi [~hyukjin.kwon]. This SQL has 19 join operators, but all of the joins follow
the same pattern. I found that with 10 join operators planning costs 17s, with 11
join operators it costs 39s, with 12 join operators 120s, and with 13 join
operators it cannot finish.

 

I think we can debug the 11-join case to find out why it is so slow.

===

drop table if exists test.test_c;create table if not exists test.test_c stored 
as ORCFILE as
select calendar.day,calendar.week,calendar.weekday, a_kbs,
b_kbs, c_kbs,d_kbs,e_kbs,f_kbs,g_kbs,h_kbs,i_kbs,j_kbs,k_kbs
from (select * from test.test_a where dt = '20211126') calendar
left join
(select dt,count(distinct kb_code) as a_kbs
from test.test_b
where dt = '20211126'
group by dt) t1
on calendar.dt = t1.dt

left join
(select dt,count(distinct kb_code) as b_kbs
from test.test_b
where dt = '20211126'
group by dt) t2
on calendar.dt = t2.dt


left join
(select dt,count(distinct kb_code) as c_kbs
from test.test_b
where dt = '20211126'
group by dt) t3
on calendar.dt = t3.dt

left join
(select dt,count(distinct kb_code) as d_kbs
from test.test_b
where dt = '20211126'
group by dt) t4
on calendar.dt = t4.dt

left join
(select dt,count(distinct kb_code) as e_kbs
from test.test_b
where dt = '20211126'
group by dt) t5
on calendar.dt = t5.dt

left join
(select dt,count(distinct kb_code) as f_kbs
from test.test_b
where dt = '20211126'
group by dt) t6
on calendar.dt = t6.dt

left join
(select dt,count(distinct kb_code) as g_kbs
from test.test_b
where dt = '20211126'
group by dt) t7
on calendar.dt = t7.dt

left join
(select dt,count(distinct kb_code) as h_kbs
from test.test_b
where dt = '20211126'
group by dt) t8
on calendar.dt = t8.dt

left join
(select dt,count(distinct kb_code) as i_kbs
from test.test_b
where dt = '20211126'
group by dt) t9
on calendar.dt = t9.dt

left join
(select dt,count(distinct kb_code) as j_kbs
from test.test_b
where dt = '20211126'
group by dt) t10
on calendar.dt = t10.dt

left join
(select dt,count(distinct kb_code) as k_kbs
from test.test_b
where dt = '20211126'
group by dt) t11
on calendar.dt = t11.dt







 

> sql hang at planning stage
> --
>
> Key: SPARK-37581
> URL: https://issues.apache.org/jira/browse/SPARK-37581
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.0
>Reporter: ocean
>Priority: Major
>
> When executing a SQL query, it hangs at the planning stage.
> When DPP is disabled, the SQL can finish 

[jira] [Updated] (SPARK-37581) sql hang at planning stage

2021-12-08 Thread ocean (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ocean updated SPARK-37581:
--
Description: 
When executing a SQL query, it hangs at the planning stage.

When DPP is disabled, the SQL can finish very quickly.
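
A minimal sketch of the workaround (spark-shell assumed; {{sqlText}} is a hypothetical variable holding the statement below):
{code:java}
// Sketch only: disable Dynamic Partition Pruning for the current session and
// then run the statement. `sqlText` is a placeholder for the CTAS text below.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "false")
spark.sql(sqlText)
{code}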

We can reproduce this problem with the example below:

create table test.test_a (
day string,
week int,
weekday int)
partitioned by (
dt varchar(8))
stored as orc;

insert into test.test_a partition (dt=20211126) values('1',1,2);

create table test.test_b (
session_id string,
device_id string,
brand string,
model string,
wx_version string,
os string,
net_work_type string,
app_id string,
app_name string,
col_z string,
page_url string,
page_title string,
olabel string,
otitle string,
source string,
send_dt string,
recv_dt string,
request_time string,
write_time string,
client_ip string,
col_a string,
dt_hour varchar(12),
product string,
channelfrom string,
customer_um string,
kb_code string,
col_b string,
rectype string,
errcode string,
col_c string,
pageid_merge string)
partitioned by (
dt varchar(8))
stored as orc;

insert into test.test_b partition(dt=20211126)
values('2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2');

create table if not exists test.test_c stored as ORCFILE as
select calendar.day,calendar.week,calendar.weekday, a_kbs,
b_kbs, c_kbs,d_kbs,e_kbs,f_kbs,g_kbs,h_kbs,i_kbs,
j_kbs,k_kbs,l_kbs,m_kbs,n_kbs,o_kbs,p_kbs,q_kbs,r_kbs,s_kbs
from (select * from test.test_a where dt = '20211126') calendar
left join
(select dt,count(distinct kb_code) as a_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t1
on calendar.dt = t1.dt

left join
(select dt,count(distinct kb_code) as b_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t2
on calendar.dt = t2.dt

left join
(select dt,count(distinct kb_code) as c_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t3
on calendar.dt = t3.dt

left join
(select dt,count(distinct kb_code) as d_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t4
on calendar.dt = t4.dt
left join
(select dt,count(distinct kb_code) as e_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t5
on calendar.dt = t5.dt
left join
(select dt,count(distinct kb_code) as f_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t6
on calendar.dt = t6.dt

left join
(select dt,count(distinct kb_code) as g_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t7
on calendar.dt = t7.dt

left join
(select dt,count(distinct kb_code) as h_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t8
on calendar.dt = t8.dt

left join
(select dt,count(distinct kb_code) as i_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t9
on calendar.dt = t9.dt

left join
(select dt,count(distinct kb_code) as j_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t10
on calendar.dt = t10.dt

left join
(select dt,count(distinct kb_code) as k_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t11
on calendar.dt = t11.dt

left join
(select dt,count(distinct kb_code) as l_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t12
on calendar.dt = t12.dt

left join
(select dt,count(distinct kb_code) as m_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t13
on calendar.dt = t13.dt

left join
(select dt,count(distinct kb_code) as n_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t14
on calendar.dt = t14.dt

left join
(select dt,count(distinct kb_code) as o_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t15
on calendar.dt = t15.dt

left join
(select dt,count(distinct kb_code) as p_kbs
from test.test_b
where dt = '20211126'
and app_

[jira] [Commented] (SPARK-37575) Empty strings and null values are both saved as quoted empty Strings "" rather than "" (for empty strings) and nothing(for null values)

2021-12-08 Thread Guo Wei (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456167#comment-17456167
 ] 

Guo Wei commented on SPARK-37575:
-

I also found that if emptyValueInRead is set to "\"\"", reading CSV data as shown
below:
{noformat}
tesla,,""{noformat}
The final result is as follows:
||name||brand||comment||
|tesla|null|null|

But, the expected result should be:
||name||brand||comment||
|tesla|null| |
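
A minimal sketch of that read path (spark-shell assumed; the schema and file path are placeholders), which makes it easy to check whether the comment column comes back as an empty string or as null:
{code:java}
// Sketch only: read a file containing the single line tesla,,"" with the
// CSV emptyValue read option set to the literal two-character string "".
val df = spark.read
  .schema("name STRING, brand STRING, comment STRING")
  .option("emptyValue", "\"\"")
  .csv("/tmp/empty_value_read.csv")   // placeholder path
df.show()
{code}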

> Empty strings and null values are both saved as quoted empty Strings "" 
> rather than "" (for empty strings) and nothing(for null values)
> ---
>
> Key: SPARK-37575
> URL: https://issues.apache.org/jira/browse/SPARK-37575
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Guo Wei
>Priority: Major
>
> As mentioned in sql migration 
> guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]),
> {noformat}
> Since Spark 2.4, empty strings are saved as quoted empty strings "". In 
> version 2.3 and earlier, empty strings are equal to null values and do not 
> reflect to any characters in saved CSV files. For example, the row of "a", 
> null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as 
> a,,"",1. To restore the previous behavior, set the CSV option emptyValue to 
> empty (not quoted) string.{noformat}
>  
> But actually, both empty strings and null values are saved as quoted empty 
> Strings "" rather than "" (for empty strings) and nothing(for null values)。
> code:
> {code:java}
> val data = List("spark", null, "").toDF("name")
> data.coalesce(1).write.csv("spark_csv_test")
> {code}
>  actual result:
> {noformat}
> line1: spark
> line2: ""
> line3: ""{noformat}
> expected result:
> {noformat}
> line1: spark
> line2: 
> line3: ""
> {noformat}
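
A minimal sketch of the write-side option the migration guide refers to (spark-shell assumed; the output path is a placeholder). The report above is about the default behavior, where both the null and the empty string end up quoted:
{code:java}
// Sketch only: mirror the snippet from the report, but set the CSV emptyValue
// write option to an unquoted empty string as the migration guide suggests.
val data = List("spark", null, "").toDF("name")
data.coalesce(1)
  .write
  .option("emptyValue", "")
  .csv("/tmp/spark_csv_empty_value")   // placeholder path
{code}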






[jira] [Comment Edited] (SPARK-37575) Empty strings and null values are both saved as quoted empty Strings "" rather than "" (for empty strings) and nothing(for null values)

2021-12-08 Thread Guo Wei (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456167#comment-17456167
 ] 

Guo Wei edited comment on SPARK-37575 at 12/9/21, 6:49 AM:
---

I also found that if emptyValueInRead is set to "\"\"", reading CSV data as shown
below:
{noformat}
tesla,,""{noformat}
The final result is as follows:
||name||brand||comment||
|tesla|null|null|

But, the expected result should be:
||name||brand||comment||
|tesla|null| |


was (Author: wayne guo):
I also found that if emptyValueInRead is set to "\"\"", reading CSV data as shown
below:
{noformat}
tesla,,""{noformat}
The final result is as follows:
||name||brand||comment||
|tesla|null|null|

But, the expected result should be:
||name||brand||comment||
|tesla|null| |

> Empty strings and null values are both saved as quoted empty Strings "" 
> rather than "" (for empty strings) and nothing(for null values)
> ---
>
> Key: SPARK-37575
> URL: https://issues.apache.org/jira/browse/SPARK-37575
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Guo Wei
>Priority: Major
>
> As mentioned in sql migration 
> guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]),
> {noformat}
> Since Spark 2.4, empty strings are saved as quoted empty strings "". In 
> version 2.3 and earlier, empty strings are equal to null values and do not 
> reflect to any characters in saved CSV files. For example, the row of "a", 
> null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as 
> a,,"",1. To restore the previous behavior, set the CSV option emptyValue to 
> empty (not quoted) string.{noformat}
>  
> But actually, both empty strings and null values are saved as quoted empty 
> Strings "" rather than "" (for empty strings) and nothing(for null values)。
> code:
> {code:java}
> val data = List("spark", null, "").toDF("name")
> data.coalesce(1).write.csv("spark_csv_test")
> {code}
>  actual result:
> {noformat}
> line1: spark
> line2: ""
> line3: ""{noformat}
> expected result:
> {noformat}
> line1: spark
> line2: 
> line3: ""
> {noformat}






[jira] [Comment Edited] (SPARK-37575) Empty strings and null values are both saved as quoted empty Strings "" rather than "" (for empty strings) and nothing(for null values)

2021-12-08 Thread Guo Wei (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456167#comment-17456167
 ] 

Guo Wei edited comment on SPARK-37575 at 12/9/21, 6:52 AM:
---

I also found that if emptyValueInRead is set to "\"\"", reading CSV data as shown
below:
{noformat}
name,brand,comment
tesla,,""{noformat}
The final result is as follows:
||name||brand||comment||
|tesla|null|null|

But, the expected result should be:
||name||brand||comment||
|tesla|null| |


was (Author: wayne guo):
I also found that if emptyValueInRead is set to "\"\"", reading CSV data as shown
below:
{noformat}
tesla,,""{noformat}
The final result is as follows:
||name||brand||comment||
|tesla|null|null|

But, the expected result should be:
||name||brand||comment||
|tesla|null| |

> Empty strings and null values are both saved as quoted empty Strings "" 
> rather than "" (for empty strings) and nothing(for null values)
> ---
>
> Key: SPARK-37575
> URL: https://issues.apache.org/jira/browse/SPARK-37575
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Guo Wei
>Priority: Major
>
> As mentioned in sql migration 
> guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]),
> {noformat}
> Since Spark 2.4, empty strings are saved as quoted empty strings "". In 
> version 2.3 and earlier, empty strings are equal to null values and do not 
> reflect to any characters in saved CSV files. For example, the row of "a", 
> null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as 
> a,,"",1. To restore the previous behavior, set the CSV option emptyValue to 
> empty (not quoted) string.{noformat}
>  
> But actually, both empty strings and null values are saved as quoted empty 
> Strings "" rather than "" (for empty strings) and nothing(for null values)。
> code:
> {code:java}
> val data = List("spark", null, "").toDF("name")
> data.coalesce(1).write.csv("spark_csv_test")
> {code}
>  actual result:
> {noformat}
> line1: spark
> line2: ""
> line3: ""{noformat}
> expected result:
> {noformat}
> line1: spark
> line2: 
> line3: ""
> {noformat}






[jira] [Commented] (SPARK-37575) Empty strings and null values are both saved as quoted empty Strings "" rather than "" (for empty strings) and nothing(for null values)

2021-12-08 Thread Guo Wei (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456168#comment-17456168
 ] 

Guo Wei commented on SPARK-37575:
-

I have made a bug fix in my local repo and added some test cases that pass.
Can you assign this issue to me? [~hyukjin.kwon]

> Empty strings and null values are both saved as quoted empty Strings "" 
> rather than "" (for empty strings) and nothing(for null values)
> ---
>
> Key: SPARK-37575
> URL: https://issues.apache.org/jira/browse/SPARK-37575
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Guo Wei
>Priority: Major
>
> As mentioned in sql migration 
> guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]),
> {noformat}
> Since Spark 2.4, empty strings are saved as quoted empty strings "". In 
> version 2.3 and earlier, empty strings are equal to null values and do not 
> reflect to any characters in saved CSV files. For example, the row of "a", 
> null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as 
> a,,"",1. To restore the previous behavior, set the CSV option emptyValue to 
> empty (not quoted) string.{noformat}
>  
> But actually, both empty strings and null values are saved as quoted empty 
> Strings "" rather than "" (for empty strings) and nothing(for null values)。
> code:
> {code:java}
> val data = List("spark", null, "").toDF("name")
> data.coalesce(1).write.csv("spark_csv_test")
> {code}
>  actual result:
> {noformat}
> line1: spark
> line2: ""
> line3: ""{noformat}
> expected result:
> {noformat}
> line1: spark
> line2: 
> line3: ""
> {noformat}






[jira] [Commented] (SPARK-37575) Empty strings and null values are both saved as quoted empty Strings "" rather than "" (for empty strings) and nothing(for null values)

2021-12-08 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456172#comment-17456172
 ] 

Hyukjin Kwon commented on SPARK-37575:
--

We don't usually assign issues to someone in advance. Once you open a PR, the
issue will be assigned automatically as appropriate. You can just add a comment
saying you're working on this ticket.

> Empty strings and null values are both saved as quoted empty Strings "" 
> rather than "" (for empty strings) and nothing(for null values)
> ---
>
> Key: SPARK-37575
> URL: https://issues.apache.org/jira/browse/SPARK-37575
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Guo Wei
>Priority: Major
>
> As mentioned in sql migration 
> guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]),
> {noformat}
> Since Spark 2.4, empty strings are saved as quoted empty strings "". In 
> version 2.3 and earlier, empty strings are equal to null values and do not 
> reflect to any characters in saved CSV files. For example, the row of "a", 
> null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as 
> a,,"",1. To restore the previous behavior, set the CSV option emptyValue to 
> empty (not quoted) string.{noformat}
>  
> But actually, both empty strings and null values are saved as quoted empty 
> Strings "" rather than "" (for empty strings) and nothing(for null values)。
> code:
> {code:java}
> val data = List("spark", null, "").toDF("name")
> data.coalesce(1).write.csv("spark_csv_test")
> {code}
>  actual result:
> {noformat}
> line1: spark
> line2: ""
> line3: ""{noformat}
> expected result:
> {noformat}
> line1: spark
> line2: 
> line3: ""
> {noformat}






[jira] [Commented] (SPARK-37581) sql hang at planning stage

2021-12-08 Thread ocean (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456193#comment-17456193
 ] 

ocean commented on SPARK-37581:
---

I have narrowed it down in the comment. Please have a look.

> sql hang at planning stage
> --
>
> Key: SPARK-37581
> URL: https://issues.apache.org/jira/browse/SPARK-37581
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.0
>Reporter: ocean
>Priority: Major
>
> When executing a SQL query, it hangs at the planning stage.
> When DPP is disabled, the SQL can finish very quickly.
> We can reproduce this problem with the example below:
> create table test.test_a (
> day string,
> week int,
> weekday int)
> partitioned by (
> dt varchar(8))
> stored as orc;
> insert into test.test_a partition (dt=20211126) values('1',1,2);
> create table test.test_b (
> session_id string,
> device_id string,
> brand string,
> model string,
> wx_version string,
> os string,
> net_work_type string,
> app_id string,
> app_name string,
> col_z string,
> page_url string,
> page_title string,
> olabel string,
> otitle string,
> source string,
> send_dt string,
> recv_dt string,
> request_time string,
> write_time string,
> client_ip string,
> col_a string,
> dt_hour varchar(12),
> product string,
> channelfrom string,
> customer_um string,
> kb_code string,
> col_b string,
> rectype string,
> errcode string,
> col_c string,
> pageid_merge string)
> partitioned by (
> dt varchar(8))
> stored as orc;
> insert into test.test_b partition(dt=20211126)
> values('2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2');
> create table if not exists test.test_c stored as ORCFILE as
> select calendar.day,calendar.week,calendar.weekday, a_kbs,
> b_kbs, c_kbs,d_kbs,e_kbs,f_kbs,g_kbs,h_kbs,i_kbs,
> j_kbs,k_kbs,l_kbs,m_kbs,n_kbs,o_kbs,p_kbs,q_kbs,r_kbs,s_kbs
> from (select * from test.test_a where dt = '20211126') calendar
> left join
> (select dt,count(distinct kb_code) as a_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t1
> on calendar.dt = t1.dt
> left join
> (select dt,count(distinct kb_code) as b_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t2
> on calendar.dt = t2.dt
> left join
> (select dt,count(distinct kb_code) as c_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t3
> on calendar.dt = t3.dt
> left join
> (select dt,count(distinct kb_code) as d_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t4
> on calendar.dt = t4.dt
> left join
> (select dt,count(distinct kb_code) as e_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t5
> on calendar.dt = t5.dt
> left join
> (select dt,count(distinct kb_code) as f_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t6
> on calendar.dt = t6.dt
> left join
> (select dt,count(distinct kb_code) as g_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t7
> on calendar.dt = t7.dt
> left join
> (select dt,count(distinct kb_code) as h_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t8
> on calendar.dt = t8.dt
> left join
> (select dt,count(distinct kb_code) as i_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t9
> on calendar.dt = t9.dt
> left join
> (select dt,count(distinct kb_code) as j_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t10
> on calendar.dt = t10.dt
> left join
> (select dt,count(distinct kb_code) as k_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t11
> on calendar.dt = t11.dt
> left join
> (select dt,count(distinct kb_code) as l_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t12
> on calendar.dt = t12.dt
> left join
> (

[jira] (SPARK-37581) sql hang at planning stage

2021-12-08 Thread ocean (Jira)


[ https://issues.apache.org/jira/browse/SPARK-37581 ]


ocean deleted comment on SPARK-37581:
---

was (Author: oceaneast):
I have narrowed it down in the comment. Please have a look.

> sql hang at planning stage
> --
>
> Key: SPARK-37581
> URL: https://issues.apache.org/jira/browse/SPARK-37581
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.0
>Reporter: ocean
>Priority: Major
>
> When executing a SQL query, it hangs at the planning stage.
> When DPP is disabled, the SQL can finish very quickly.
> We can reproduce this problem with the example below:
> create table test.test_a (
> day string,
> week int,
> weekday int)
> partitioned by (
> dt varchar(8))
> stored as orc;
> insert into test.test_a partition (dt=20211126) values('1',1,2);
> create table test.test_b (
> session_id string,
> device_id string,
> brand string,
> model string,
> wx_version string,
> os string,
> net_work_type string,
> app_id string,
> app_name string,
> col_z string,
> page_url string,
> page_title string,
> olabel string,
> otitle string,
> source string,
> send_dt string,
> recv_dt string,
> request_time string,
> write_time string,
> client_ip string,
> col_a string,
> dt_hour varchar(12),
> product string,
> channelfrom string,
> customer_um string,
> kb_code string,
> col_b string,
> rectype string,
> errcode string,
> col_c string,
> pageid_merge string)
> partitioned by (
> dt varchar(8))
> stored as orc;
> insert into test.test_b partition(dt=20211126)
> values('2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2');
> create table if not exists test.test_c stored as ORCFILE as
> select calendar.day,calendar.week,calendar.weekday, a_kbs,
> b_kbs, c_kbs,d_kbs,e_kbs,f_kbs,g_kbs,h_kbs,i_kbs,
> j_kbs,k_kbs,l_kbs,m_kbs,n_kbs,o_kbs,p_kbs,q_kbs,r_kbs,s_kbs
> from (select * from test.test_a where dt = '20211126') calendar
> left join
> (select dt,count(distinct kb_code) as a_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t1
> on calendar.dt = t1.dt
> left join
> (select dt,count(distinct kb_code) as b_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t2
> on calendar.dt = t2.dt
> left join
> (select dt,count(distinct kb_code) as c_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t3
> on calendar.dt = t3.dt
> left join
> (select dt,count(distinct kb_code) as d_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t4
> on calendar.dt = t4.dt
> left join
> (select dt,count(distinct kb_code) as e_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t5
> on calendar.dt = t5.dt
> left join
> (select dt,count(distinct kb_code) as f_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t6
> on calendar.dt = t6.dt
> left join
> (select dt,count(distinct kb_code) as g_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t7
> on calendar.dt = t7.dt
> left join
> (select dt,count(distinct kb_code) as h_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t8
> on calendar.dt = t8.dt
> left join
> (select dt,count(distinct kb_code) as i_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t9
> on calendar.dt = t9.dt
> left join
> (select dt,count(distinct kb_code) as j_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t10
> on calendar.dt = t10.dt
> left join
> (select dt,count(distinct kb_code) as k_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t11
> on calendar.dt = t11.dt
> left join
> (select dt,count(distinct kb_code) as l_kbs
> from test.test_b
> where dt = '20211126'
> and app_id in ('1','2')
> and substr(kb_code,1,6) = '66'
> and pageid_merge = 'a'
> group by dt) t12
> on calendar.dt = t12.dt
> left join
> (select dt,count(distinct kb_code) as m_kbs
> from test.test_b
> where dt = '
