[jira] [Created] (SPARK-37577) ClassCastException: ArrayType cannot be cast to StructType
Rafal Wojdyla created SPARK-37577: - Summary: ClassCastException: ArrayType cannot be cast to StructType Key: SPARK-37577 URL: https://issues.apache.org/jira/browse/SPARK-37577 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.2.0 Environment: Py: 3.9 Reporter: Rafal Wojdyla Reproduction:
{code:python}
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, ArrayType, StringType

t = StructType([StructField('o', ArrayType(StructType([StructField('s', StringType(), False), StructField('b', ArrayType(StructType([StructField('e', StringType(), False)]), True), False)]), True), False)])

(
    spark.createDataFrame([], schema=t)
    .select(F.explode("o").alias("eo"))
    .select("eo.*")
    .select(F.explode("b"))
    .count()
)
{code}
The code above works fine in 3.1.2 but fails in 3.2.0, see the stacktrace below. Note that if you remove field {{s}}, the code works fine, which is a bit unexpected and likely a clue.
{noformat} Py4JJavaError: An error occurred while calling o156.count. : java.lang.ClassCastException: class org.apache.spark.sql.types.ArrayType cannot be cast to class org.apache.spark.sql.types.StructType (org.apache.spark.sql.types.ArrayType and org.apache.spark.sql.types.StructType are in unnamed module of loader 'app') at org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema$lzycompute(complexTypeExtractors.scala:107) at org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema(complexTypeExtractors.scala:107) at org.apache.spark.sql.catalyst.expressions.GetStructField.$anonfun$extractFieldName$1(complexTypeExtractors.scala:117) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.catalyst.expressions.GetStructField.extractFieldName(complexTypeExtractors.scala:117) at org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1$$anonfun$2.applyOrElse(NestedColumnAliasing.scala:372) at org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1$$anonfun$2.applyOrElse(NestedColumnAliasing.scala:368) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$4(TreeNode.scala:539) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:539) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:508) at org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1.applyOrElse(NestedColumnAliasing.scala:368) at org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1.applyOrElse(NestedColumnAliasing.scala:366) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481) at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsDownWithPruning$1(QueryPlan.scala:152) at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193) at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204) at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214) at
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323) at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDownWithPruning(QueryPlan.scala:152) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsWithPruning(QueryPlan.scala:123) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:101) at org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$.unapply(NestedColumnAliasing.scala:366) at org.apache.spark.sql.catalyst.optimizer.ColumnPruning$$anonfun$apply$14.applyOrElse(Optimizer.scala:826) at org.apache.spark.sql.catalyst.optimizer.ColumnPruning$$anonfun$apply$14.applyOrElse(Optimizer.scala:783) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(Tree
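A possible workaround for the repro above, sketched under the assumption that the failure is triggered when the optimizer rewrites the generator input after the {{eo.*}} expansion; this is untested against 3.2.0 and may not sidestep the bug:
{code:python}
# Hedged workaround sketch (untested): project the nested array explicitly
# instead of expanding the whole struct with "eo.*". `spark` and `t` are the
# session and schema from the reproduction above.
import pyspark.sql.functions as F

(
    spark.createDataFrame([], schema=t)
    .select(F.explode("o").alias("eo"))
    .select(F.col("eo.b").alias("b"))
    .select(F.explode("b"))
    .count()
)
{code}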
[jira] [Updated] (SPARK-37570) mypy breaks on pyspark.pandas.plot.core.Bucketizer
[ https://issues.apache.org/jira/browse/SPARK-37570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rafal Wojdyla updated SPARK-37570: -- Environment: Mypy: v0.910-1 Py: 3.9 was: Mypy version: v0.910-1 Py: 3.9 > mypy breaks on pyspark.pandas.plot.core.Bucketizer > -- > > Key: SPARK-37570 > URL: https://issues.apache.org/jira/browse/SPARK-37570 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: Mypy: v0.910-1 > Py: 3.9 >Reporter: Rafal Wojdyla >Priority: Major > > Mypy breaks on a project with pyspark 3.2.0 dependency (worked fine for > 3.1.2), see stacktrace below: > {noformat} > Traceback (most recent call last): > File "/Users/rav/.cache/pre-commit/repoytd0jvc_/py_env-python3.9/bin/mypy", > line 8, in > sys.exit(console_entry()) > File > "/Users/rav/.cache/pre-commit/repoytd0jvc_/py_env-python3.9/lib/python3.9/site-packages/mypy/__main__.py", > line 11, in console_entry > main(None, sys.stdout, sys.stderr) > File "mypy/main.py", line 87, in main > File "mypy/main.py", line 165, in run_build > File "mypy/build.py", line 179, in build > File "mypy/build.py", line 254, in _build > File "mypy/build.py", line 2697, in dispatch > File "mypy/build.py", line 3021, in process_graph > File "mypy/build.py", line 3138, in process_stale_scc > File "mypy/build.py", line 2288, in write_cache > File "mypy/build.py", line 1475, in write_cache > File "mypy/nodes.py", line 313, in serialize > File "mypy/nodes.py", line 3149, in serialize > File "mypy/nodes.py", line 3083, in serialize > AssertionError: Definition of pyspark.pandas.plot.core.Bucketizer is > unexpectedly incomplete > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37570) mypy breaks on pyspark.pandas.plot.core.Bucketizer
[ https://issues.apache.org/jira/browse/SPARK-37570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rafal Wojdyla updated SPARK-37570: -- Description: Mypy breaks on a project with pyspark 3.2.0 dependency (worked fine for 3.1.2), see stacktrace below: {noformat} Traceback (most recent call last): File "/Users/rav/.cache/pre-commit/repoytd0jvc_/py_env-python3.9/bin/mypy", line 8, in sys.exit(console_entry()) File "/Users/rav/.cache/pre-commit/repoytd0jvc_/py_env-python3.9/lib/python3.9/site-packages/mypy/__main__.py", line 11, in console_entry main(None, sys.stdout, sys.stderr) File "mypy/main.py", line 87, in main File "mypy/main.py", line 165, in run_build File "mypy/build.py", line 179, in build File "mypy/build.py", line 254, in _build File "mypy/build.py", line 2697, in dispatch File "mypy/build.py", line 3021, in process_graph File "mypy/build.py", line 3138, in process_stale_scc File "mypy/build.py", line 2288, in write_cache File "mypy/build.py", line 1475, in write_cache File "mypy/nodes.py", line 313, in serialize File "mypy/nodes.py", line 3149, in serialize File "mypy/nodes.py", line 3083, in serialize AssertionError: Definition of pyspark.pandas.plot.core.Bucketizer is unexpectedly incomplete {noformat} was: Mypy breaks on a project with pyspark 3.2.0 dependency (worked fine for 3.1.2), see stacktrace below: {noformat} Traceback (most recent call last): File "/Users/rav/.cache/pre-commit/repoytd0jvc_/py_env-python3.9/bin/mypy", line 8, in sys.exit(console_entry()) File "/Users/rav/.cache/pre-commit/repoytd0jvc_/py_env-python3.9/lib/python3.9/site-packages/mypy/__main__.py", line 11, in console_entry main(None, sys.stdout, sys.stderr) File "mypy/main.py", line 87, in main File "mypy/main.py", line 165, in run_build File "mypy/build.py", line 179, in build File "mypy/build.py", line 254, in _build File "mypy/build.py", line 2697, in dispatch File "mypy/build.py", line 3021, in process_graph File "mypy/build.py", line 3138, in process_stale_scc File "mypy/build.py", line 2288, in write_cache File "mypy/build.py", line 1475, in write_cache File "mypy/nodes.py", line 313, in serialize File "mypy/nodes.py", line 3149, in serialize File "mypy/nodes.py", line 3083, in serialize AssertionError: Definition of pyspark.pandas.plot.core.Bucketizer is unexpectedly incomplete {noformat} Mypy version: v0.910-1 Py: 3.9 > mypy breaks on pyspark.pandas.plot.core.Bucketizer > -- > > Key: SPARK-37570 > URL: https://issues.apache.org/jira/browse/SPARK-37570 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Rafal Wojdyla >Priority: Major > > Mypy breaks on a project with pyspark 3.2.0 dependency (worked fine for > 3.1.2), see stacktrace below: > {noformat} > Traceback (most recent call last): > File "/Users/rav/.cache/pre-commit/repoytd0jvc_/py_env-python3.9/bin/mypy", > line 8, in > sys.exit(console_entry()) > File > "/Users/rav/.cache/pre-commit/repoytd0jvc_/py_env-python3.9/lib/python3.9/site-packages/mypy/__main__.py", > line 11, in console_entry > main(None, sys.stdout, sys.stderr) > File "mypy/main.py", line 87, in main > File "mypy/main.py", line 165, in run_build > File "mypy/build.py", line 179, in build > File "mypy/build.py", line 254, in _build > File "mypy/build.py", line 2697, in dispatch > File "mypy/build.py", line 3021, in process_graph > File "mypy/build.py", line 3138, in process_stale_scc > File "mypy/build.py", line 2288, in write_cache > File "mypy/build.py", line 1475, in write_cache > File "mypy/nodes.py", line 
313, in serialize > File "mypy/nodes.py", line 3149, in serialize > File "mypy/nodes.py", line 3083, in serialize > AssertionError: Definition of pyspark.pandas.plot.core.Bucketizer is > unexpectedly incomplete > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37570) mypy breaks on pyspark.pandas.plot.core.Bucketizer
[ https://issues.apache.org/jira/browse/SPARK-37570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rafal Wojdyla updated SPARK-37570: -- Environment: Mypy version: v0.910-1 Py: 3.9 > mypy breaks on pyspark.pandas.plot.core.Bucketizer > -- > > Key: SPARK-37570 > URL: https://issues.apache.org/jira/browse/SPARK-37570 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: Mypy version: v0.910-1 > Py: 3.9 >Reporter: Rafal Wojdyla >Priority: Major > > Mypy breaks on a project with pyspark 3.2.0 dependency (worked fine for > 3.1.2), see stacktrace below: > {noformat} > Traceback (most recent call last): > File "/Users/rav/.cache/pre-commit/repoytd0jvc_/py_env-python3.9/bin/mypy", > line 8, in > sys.exit(console_entry()) > File > "/Users/rav/.cache/pre-commit/repoytd0jvc_/py_env-python3.9/lib/python3.9/site-packages/mypy/__main__.py", > line 11, in console_entry > main(None, sys.stdout, sys.stderr) > File "mypy/main.py", line 87, in main > File "mypy/main.py", line 165, in run_build > File "mypy/build.py", line 179, in build > File "mypy/build.py", line 254, in _build > File "mypy/build.py", line 2697, in dispatch > File "mypy/build.py", line 3021, in process_graph > File "mypy/build.py", line 3138, in process_stale_scc > File "mypy/build.py", line 2288, in write_cache > File "mypy/build.py", line 1475, in write_cache > File "mypy/nodes.py", line 313, in serialize > File "mypy/nodes.py", line 3149, in serialize > File "mypy/nodes.py", line 3083, in serialize > AssertionError: Definition of pyspark.pandas.plot.core.Bucketizer is > unexpectedly incomplete > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
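For reference, a minimal file that appears to reproduce the crash in a project depending on pyspark 3.2.0 (a hedged sketch: the file name is made up, and the assumption is that mypy only needs to follow imports into pyspark.pandas.plot.core to hit the cache-serialization assertion):
{code:python}
# repro.py (hypothetical name); run `mypy repro.py` with mypy 0.910 installed
# alongside pyspark 3.2.0.
import pyspark.pandas as ps

df = ps.DataFrame({"x": [1, 2, 3]})
df.plot  # the plotting accessor pulls in pyspark.pandas.plot.core
{code}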
[jira] [Commented] (SPARK-37533) New SQL function: try_element_at
[ https://issues.apache.org/jira/browse/SPARK-37533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455039#comment-17455039 ] Apache Spark commented on SPARK-37533: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/34833 > New SQL function: try_element_at > > > Key: SPARK-37533 > URL: https://issues.apache.org/jira/browse/SPARK-37533 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.3.0 > > > Add New SQL functions `try_element_at`, which is identical to the > `element_at` except that it returns null if error occurs > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37575) Empty strings and null values are both saved as quoted empty Strings "" rather than "" (for empty strings) and nothing(for null values)
[ https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455043#comment-17455043 ] Guo Wei commented on SPARK-37575: - I think I may have found the root cause by debugging the Spark source code. In {color:#0747a6}UnivocityGenerator{color}, when a column's value is null, values(i) is set to {color:#00875a}options.nullValue{color}, whose default value is ""
{code:java}
private def convertRow(row: InternalRow): Seq[String] = {
  var i = 0
  val values = new Array[String](row.numFields)
  while (i < row.numFields) {
    if (!row.isNullAt(i)) {
      values(i) = valueConverters(i).apply(row, i)
    } else {
      values(i) = options.nullValue
    }
    i += 1
  }
  values
}
{code}
So in the {color:#0747a6}univocity-parsers lib{color} (a Spark dependency), in the {color:#0747a6}AbstractWriter{color} class, the element (originally a null value) has already been turned into an empty string by UnivocityGenerator, so it does not satisfy the condition (element == null) and finally equals emptyValue, whose default value is "\"\""
{code:java}
protected String getStringValue(Object element) {
  usingNullOrEmptyValue = false;
  if (element == null) {
    usingNullOrEmptyValue = true;
    return nullValue;
  }
  String string = String.valueOf(element);
  if (string.isEmpty()) {
    usingNullOrEmptyValue = true;
    return emptyValue;
  }
  return string;
}
{code}
[~hyukjin.kwon] Should we fix the change (isNullAt) in {color:#0747a6}UnivocityGenerator{color}? > Empty strings and null values are both saved as quoted empty Strings "" > rather than "" (for empty strings) and nothing(for null values) > --- > > Key: SPARK-37575 > URL: https://issues.apache.org/jira/browse/SPARK-37575 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.2.0 >Reporter: Guo Wei >Priority: Major > > As mentioned in sql migration > guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]), > {noformat} > Since Spark 2.4, empty strings are saved as quoted empty strings "". In > version 2.3 and earlier, empty strings are equal to null values and do not > reflect to any characters in saved CSV files. For example, the row of "a", > null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as > a,,"",1. To restore the previous behavior, set the CSV option emptyValue to > empty (not quoted) string.{noformat} > > But actually, both empty strings and null values are saved as quoted empty > Strings "" rather than "" (for empty strings) and nothing(for null values)。 > code: > {code:java} > val data = List("spark", null, "").toDF("name") > data.coalesce(1).write.csv("spark_csv_test") > {code} > actual result: > {noformat} > line1: spark > line2: "" > line3: ""{noformat} > expected result: > {noformat} > line1: spark > line2: > line3: "" > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37575) Empty strings and null values are both saved as quoted empty Strings "" rather than "" (for empty strings) and nothing(for null values)
[ https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455043#comment-17455043 ] Guo Wei edited comment on SPARK-37575 at 12/8/21, 8:40 AM: --- I think I maybe have found the root cause through debug Spark source code. In {color:#0747a6}UnivocityGenerator{color}, when the value of column is null values, column's value has been changed to {color:#00875a}options.nullValue{color}, default value is "" {code:java} private def convertRow(row: InternalRow): Seq[String] = { var i = 0 val values = new Array[String](row.numFields) while (i < row.numFields) { if (!row.isNullAt(i)) { values(i) = valueConverters(i).apply(row, i) } else { values(i) = options.nullValue } i += 1 } values } {code} So,in {color:#0747a6}univocity-parsers lib{color}(depended by Spark) {color:#0747a6}AbstractWriter{color} class, element( is original null values) has been changed to '' in UnivocityGenerator,not satisfied condition(element == null),finally equals emptyValue, default value is "\"\"" {code:java} protected String getStringValue(Object element) { usingNullOrEmptyValue = false; if (element == null) { usingNullOrEmptyValue = true; return nullValue; } String string = String.valueOf(element); if (string.isEmpty()) { usingNullOrEmptyValue = true; return emptyValue; } return string; } {code} [~hyukjin.kwon] Should we fix the change(isNullAt) in {color:#0747a6}UnivocityGenerator?{color} was (Author: wayne guo): I think I maybe have found the root cause through debug Spark source code. In {color:#0747a6}UnivocityGenerator{color}, when the value of column is null values, values(i) has been changed to {color:#00875a}options.nullValue{color}, default value is "" {code:java} private def convertRow(row: InternalRow): Seq[String] = { var i = 0 val values = new Array[String](row.numFields) while (i < row.numFields) { if (!row.isNullAt(i)) { values(i) = valueConverters(i).apply(row, i) } else { values(i) = options.nullValue } i += 1 } values } {code} So,in {color:#0747a6}univocity-parsers lib{color}(depended by Spark) {color:#0747a6}AbstractWriter{color} class, element( is original null values) has been changed to '' in UnivocityGenerator,not satisfied condition(element == null),finally equals emptyValue, default value is "\"\"" {code:java} protected String getStringValue(Object element) { usingNullOrEmptyValue = false; if (element == null) { usingNullOrEmptyValue = true; return nullValue; } String string = String.valueOf(element); if (string.isEmpty()) { usingNullOrEmptyValue = true; return emptyValue; } return string; } {code} [~hyukjin.kwon] Should we fix the change(isNullAt) in {color:#0747a6}UnivocityGenerator{color:#172b4d}?{color}{color} > Empty strings and null values are both saved as quoted empty Strings "" > rather than "" (for empty strings) and nothing(for null values) > --- > > Key: SPARK-37575 > URL: https://issues.apache.org/jira/browse/SPARK-37575 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.2.0 >Reporter: Guo Wei >Priority: Major > > As mentioned in sql migration > guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]), > {noformat} > Since Spark 2.4, empty strings are saved as quoted empty strings "". In > version 2.3 and earlier, empty strings are equal to null values and do not > reflect to any characters in saved CSV files. For example, the row of "a", > null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as > a,,"",1. 
To restore the previous behavior, set the CSV option emptyValue to > empty (not quoted) string.{noformat} > > But actually, both empty strings and null values are saved as quoted empty > Strings "" rather than "" (for empty strings) and nothing(for null values)。 > code: > {code:java} > val data = List("spark", null, "").toDF("name") > data.coalesce(1).write.csv("spark_csv_test") > {code} > actual result: > {noformat} > line1: spark > line2: "" > line3: ""{noformat} > expected result: > {noformat} > line1: spark > line2: > line3: "" > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37575) Empty strings and null values are both saved as quoted empty Strings "" rather than "" (for empty strings) and nothing(for null values)
[ https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455043#comment-17455043 ] Guo Wei edited comment on SPARK-37575 at 12/8/21, 8:40 AM: --- I think I maybe have found the root cause through debug Spark source code. In {color:#0747a6}UnivocityGenerator{color}, when the value of column is null values, column's value has been changed to {color:#00875a}options.nullValue{color}, default value is "" {code:java} private def convertRow(row: InternalRow): Seq[String] = { var i = 0 val values = new Array[String](row.numFields) while (i < row.numFields) { if (!row.isNullAt(i)) { values(i) = valueConverters(i).apply(row, i) } else { values(i) = options.nullValue } i += 1 } values } {code} So,in {color:#0747a6}univocity-parsers lib{color}(depended by Spark) {color:#0747a6}AbstractWriter{color} class, element( is original null values) has been changed to '' in UnivocityGenerator,not satisfied condition(element == null),finally equals emptyValue, default value is "\"\"" {code:java} protected String getStringValue(Object element) { usingNullOrEmptyValue = false; if (element == null) { usingNullOrEmptyValue = true; return nullValue; } String string = String.valueOf(element); if (string.isEmpty()) { usingNullOrEmptyValue = true; return emptyValue; } return string; } {code} [~hyukjin.kwon] Should we fix the change(isNullAt) in {color:#0747a6}UnivocityGenerator?{color} was (Author: wayne guo): I think I maybe have found the root cause through debug Spark source code. In {color:#0747a6}UnivocityGenerator{color}, when the value of column is null values, column's value has been changed to {color:#00875a}options.nullValue{color}, default value is "" {code:java} private def convertRow(row: InternalRow): Seq[String] = { var i = 0 val values = new Array[String](row.numFields) while (i < row.numFields) { if (!row.isNullAt(i)) { values(i) = valueConverters(i).apply(row, i) } else { values(i) = options.nullValue } i += 1 } values } {code} So,in {color:#0747a6}univocity-parsers lib{color}(depended by Spark) {color:#0747a6}AbstractWriter{color} class, element( is original null values) has been changed to '' in UnivocityGenerator,not satisfied condition(element == null),finally equals emptyValue, default value is "\"\"" {code:java} protected String getStringValue(Object element) { usingNullOrEmptyValue = false; if (element == null) { usingNullOrEmptyValue = true; return nullValue; } String string = String.valueOf(element); if (string.isEmpty()) { usingNullOrEmptyValue = true; return emptyValue; } return string; } {code} [~hyukjin.kwon] Should we fix the change(isNullAt) in {color:#0747a6}UnivocityGenerator?{color} > Empty strings and null values are both saved as quoted empty Strings "" > rather than "" (for empty strings) and nothing(for null values) > --- > > Key: SPARK-37575 > URL: https://issues.apache.org/jira/browse/SPARK-37575 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.2.0 >Reporter: Guo Wei >Priority: Major > > As mentioned in sql migration > guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]), > {noformat} > Since Spark 2.4, empty strings are saved as quoted empty strings "". In > version 2.3 and earlier, empty strings are equal to null values and do not > reflect to any characters in saved CSV files. For example, the row of "a", > null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as > a,,"",1. 
To restore the previous behavior, set the CSV option emptyValue to > empty (not quoted) string.{noformat} > > But actually, both empty strings and null values are saved as quoted empty > Strings "" rather than "" (for empty strings) and nothing(for null values)。 > code: > {code:java} > val data = List("spark", null, "").toDF("name") > data.coalesce(1).write.csv("spark_csv_test") > {code} > actual result: > {noformat} > line1: spark > line2: "" > line3: ""{noformat} > expected result: > {noformat} > line1: spark > line2: > line3: "" > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37575) Empty strings and null values are both saved as quoted empty Strings "" rather than "" (for empty strings) and nothing(for null values)
[ https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455043#comment-17455043 ] Guo Wei edited comment on SPARK-37575 at 12/8/21, 8:41 AM: --- I think I maybe have found the root cause through debug Spark source code. In {color:#0747a6}UnivocityGenerator{color}, when the value of column is null values, column's value has been changed to {color:#00875a}options.nullValue{color}, default value is "" {code:java} private def convertRow(row: InternalRow): Seq[String] = { var i = 0 val values = new Array[String](row.numFields) while (i < row.numFields) { if (!row.isNullAt(i)) { values(i) = valueConverters(i).apply(row, i) } else { values(i) = options.nullValue } i += 1 } values } {code} So,in {color:#0747a6}univocity-parsers lib{color}(depended by Spark) {color:#0747a6}AbstractWriter{color} class, element( is original null values) has been changed to "" in UnivocityGenerator,not satisfied condition(element == null),finally equals emptyValue, default value is "\"\"" {code:java} protected String getStringValue(Object element) { usingNullOrEmptyValue = false; if (element == null) { usingNullOrEmptyValue = true; return nullValue; } String string = String.valueOf(element); if (string.isEmpty()) { usingNullOrEmptyValue = true; return emptyValue; } return string; } {code} [~hyukjin.kwon] Should we fix the change(isNullAt) in {color:#0747a6}UnivocityGenerator?{color} was (Author: wayne guo): I think I maybe have found the root cause through debug Spark source code. In {color:#0747a6}UnivocityGenerator{color}, when the value of column is null values, column's value has been changed to {color:#00875a}options.nullValue{color}, default value is "" {code:java} private def convertRow(row: InternalRow): Seq[String] = { var i = 0 val values = new Array[String](row.numFields) while (i < row.numFields) { if (!row.isNullAt(i)) { values(i) = valueConverters(i).apply(row, i) } else { values(i) = options.nullValue } i += 1 } values } {code} So,in {color:#0747a6}univocity-parsers lib{color}(depended by Spark) {color:#0747a6}AbstractWriter{color} class, element( is original null values) has been changed to '' in UnivocityGenerator,not satisfied condition(element == null),finally equals emptyValue, default value is "\"\"" {code:java} protected String getStringValue(Object element) { usingNullOrEmptyValue = false; if (element == null) { usingNullOrEmptyValue = true; return nullValue; } String string = String.valueOf(element); if (string.isEmpty()) { usingNullOrEmptyValue = true; return emptyValue; } return string; } {code} [~hyukjin.kwon] Should we fix the change(isNullAt) in {color:#0747a6}UnivocityGenerator?{color} > Empty strings and null values are both saved as quoted empty Strings "" > rather than "" (for empty strings) and nothing(for null values) > --- > > Key: SPARK-37575 > URL: https://issues.apache.org/jira/browse/SPARK-37575 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.2.0 >Reporter: Guo Wei >Priority: Major > > As mentioned in sql migration > guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]), > {noformat} > Since Spark 2.4, empty strings are saved as quoted empty strings "". In > version 2.3 and earlier, empty strings are equal to null values and do not > reflect to any characters in saved CSV files. For example, the row of "a", > null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as > a,,"",1. 
To restore the previous behavior, set the CSV option emptyValue to > empty (not quoted) string.{noformat} > > But actually, both empty strings and null values are saved as quoted empty > Strings "" rather than "" (for empty strings) and nothing(for null values)。 > code: > {code:java} > val data = List("spark", null, "").toDF("name") > data.coalesce(1).write.csv("spark_csv_test") > {code} > actual result: > {noformat} > line1: spark > line2: "" > line3: ""{noformat} > expected result: > {noformat} > line1: spark > line2: > line3: "" > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37575) Empty strings and null values are both saved as quoted empty Strings "" rather than "" (for empty strings) and nothing(for null values)
[ https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455053#comment-17455053 ] Guo Wei commented on SPARK-37575: - I think I can make a fast repair to verify my conclusion and add some test cases. > Empty strings and null values are both saved as quoted empty Strings "" > rather than "" (for empty strings) and nothing(for null values) > --- > > Key: SPARK-37575 > URL: https://issues.apache.org/jira/browse/SPARK-37575 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.2.0 >Reporter: Guo Wei >Priority: Major > > As mentioned in sql migration > guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]), > {noformat} > Since Spark 2.4, empty strings are saved as quoted empty strings "". In > version 2.3 and earlier, empty strings are equal to null values and do not > reflect to any characters in saved CSV files. For example, the row of "a", > null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as > a,,"",1. To restore the previous behavior, set the CSV option emptyValue to > empty (not quoted) string.{noformat} > > But actually, both empty strings and null values are saved as quoted empty > Strings "" rather than "" (for empty strings) and nothing(for null values)。 > code: > {code:java} > val data = List("spark", null, "").toDF("name") > data.coalesce(1).write.csv("spark_csv_test") > {code} > actual result: > {noformat} > line1: spark > line2: "" > line3: ""{noformat} > expected result: > {noformat} > line1: spark > line2: > line3: "" > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37575) Empty strings and null values are both saved as quoted empty Strings "" rather than "" (for empty strings) and nothing(for null values)
[ https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455055#comment-17455055 ] Hyukjin Kwon commented on SPARK-37575: -- Yeah, it would be great to fix it, and upgrade the version used in Apache Spark. > Empty strings and null values are both saved as quoted empty Strings "" > rather than "" (for empty strings) and nothing(for null values) > --- > > Key: SPARK-37575 > URL: https://issues.apache.org/jira/browse/SPARK-37575 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.2.0 >Reporter: Guo Wei >Priority: Major > > As mentioned in sql migration > guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]), > {noformat} > Since Spark 2.4, empty strings are saved as quoted empty strings "". In > version 2.3 and earlier, empty strings are equal to null values and do not > reflect to any characters in saved CSV files. For example, the row of "a", > null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as > a,,"",1. To restore the previous behavior, set the CSV option emptyValue to > empty (not quoted) string.{noformat} > > But actually, both empty strings and null values are saved as quoted empty > Strings "" rather than "" (for empty strings) and nothing(for null values)。 > code: > {code:java} > val data = List("spark", null, "").toDF("name") > data.coalesce(1).write.csv("spark_csv_test") > {code} > actual result: > {noformat} > line1: spark > line2: "" > line3: ""{noformat} > expected result: > {noformat} > line1: spark > line2: > line3: "" > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
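Until the writer distinguishes the two cases, the mitigation documented in the migration guide can be applied per write; below is a hedged PySpark sketch of the Scala repro above (the output path is an arbitrary example):
{code:python}
# Hedged sketch: with emptyValue set to an unquoted empty string, both empty
# strings and nulls are written as nothing (the pre-2.4 behaviour). This
# avoids the quoted "" but still does not distinguish the two cases.
df = spark.createDataFrame([("spark",), (None,), ("",)], "name string")
(
    df.coalesce(1)
    .write.mode("overwrite")
    .option("emptyValue", "")
    .csv("/tmp/spark_csv_test")  # arbitrary example path
)
{code}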
[jira] [Created] (SPARK-37578) DSV2 is not updating Output Metrics
Sandeep Katta created SPARK-37578: - Summary: DSV2 is not updating Output Metrics Key: SPARK-37578 URL: https://issues.apache.org/jira/browse/SPARK-37578 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.1.2, 3.0.3 Reporter: Sandeep Katta Repro code:
./bin/spark-shell --master local --jars /Users/jars/iceberg-spark3-runtime-0.12.1.jar
{code:java}
import scala.collection.mutable
import org.apache.spark.scheduler._

val bytesWritten = new mutable.ArrayBuffer[Long]()
val recordsWritten = new mutable.ArrayBuffer[Long]()
val bytesWrittenListener = new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    bytesWritten += taskEnd.taskMetrics.outputMetrics.bytesWritten
    recordsWritten += taskEnd.taskMetrics.outputMetrics.recordsWritten
  }
}

spark.sparkContext.addSparkListener(bytesWrittenListener)
try {
  val df = spark.range(1000).toDF("id")
  df.write.format("iceberg").save("Users/data/dsv2_test")
  assert(bytesWritten.sum > 0)
  assert(recordsWritten.sum > 0)
} finally {
  spark.sparkContext.removeSparkListener(bytesWrittenListener)
}
{code}
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37579) Called spark.sql multiple times,union multiple DataFrame, groupBy and pivot, join other table view cause exception
page created SPARK-37579: Summary: Called spark.sql multiple times,union multiple DataFrame, groupBy and pivot, join other table view cause exception Key: SPARK-37579 URL: https://issues.apache.org/jira/browse/SPARK-37579 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.7 Reporter: page Possible steps to reproduce: 1. Run spark.sql multiple times, get DataFrame list [d1, d2, d3, d4] 2. Combine DataFrame list [d1, d2, d3, d4] to a DataFrame d5 by calling Dataset#unionByName 3. Run {code:java} d5.groupBy("c1").pivot("c2").agg(concat_ws(", ", collect_list("value"))){code} ,produce DataFrame d6 4. DataFrame d6 join another DataFrame d7 5. Call function like count to trigger spark job 6. Exception happend stack trace: org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) at org.apache.spark.sql.execution.adaptive.QueryStage.executeChildStages(QueryStage.scala:88) at org.apache.spark.sql.execution.adaptive.QueryStage.prepareExecuteStage(QueryStage.scala:136) at org.apache.spark.sql.execution.adaptive.QueryStage.executeCollect(QueryStage.scala:242) at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2837) at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2836) at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3441) at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:92) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:139) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withAction(Dataset.scala:3440) at org.apache.spark.sql.Dataset.count(Dataset.scala:2836) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions: List(2, 1) at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:269) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:269) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:269) at org.apache.spark.ShuffleDependency.(Dependency.scala:94) at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:361) at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:69) at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.eagerExecute(ShuffleExchangeExec.scala:112) at org.apache.spark.sql.execution.adaptive.ShuffleQueryStage.executeStage(QueryStage.scala:284) at org.apache.spark.sql.execution.adaptive.QueryStage.doExecute(QueryStage.scala:236) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:137) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:133) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:161) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:158) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:133) at org.apache.spark.sql.execution.adaptive.QueryStage$$anonfun$8$$anonfun$apply$2$$anonfun$apply$3.apply(QueryStage.scala:81) at org.apache.spark.sql.execution.adaptive.QueryStage$$anonfun$8$$anonfun$apply$2$$anonfun$apply$3.apply(QueryStage.scala:81) at org.apache.spark.sql.execution.SQLExecution$.withExecutionIdAndJobDesc(SQLExecution.scala:157) at org.apache.spark.sql.execution.adaptive.QueryStage$$anonfun$8$$anonfun$apply$2.apply(QueryStage.scala:80) at org.apache.spark.sql.execution.adaptive.QueryStage$$anonfun$8$$anonfun$apply$2.apply(QueryStage.scala:78)
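The listed steps can be summarised in code; below is a hedged PySpark sketch (table names, column names, and the join key are assumptions, and the report is against 2.4.7 with adaptive execution enabled, so this sketch alone may not reproduce the hang):
{code:python}
from pyspark.sql import functions as F

# Steps 1-2: several spark.sql results combined with unionByName.
parts = [spark.sql(f"SELECT c1, c2, value FROM t{i}") for i in range(1, 5)]  # t1..t4 are assumed tables
d5 = parts[0]
for p in parts[1:]:
    d5 = d5.unionByName(p)

# Step 3: groupBy + pivot + concat_ws over collect_list.
d6 = d5.groupBy("c1").pivot("c2").agg(F.concat_ws(", ", F.collect_list("value")))

# Steps 4-5: join another DataFrame and trigger a job.
d7 = spark.table("other_table")  # assumed
d6.join(d7, "c1").count()
{code}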
[jira] [Created] (SPARK-37580) Optimize current TaskSetManager abort logic when task failed count reach the threshold
wangshengjie created SPARK-37580: Summary: Optimize current TaskSetManager abort logic when task failed count reach the threshold Key: SPARK-37580 URL: https://issues.apache.org/jira/browse/SPARK-37580 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.2.0 Reporter: wangshengjie In our production environment, we found a gap in the TaskSetManager abort logic. For example: if one task has already failed 3 times (the max failure threshold is 4 by default), and both a retry attempt and a speculative attempt are running, then when one of these 2 attempts succeeds the other is cancelled. But if the executor running the attempt to be cancelled is lost (an OOM in our situation), that attempt is marked as failed, and when TaskSetManager handles this failed attempt the task has now failed 4 times, so the stage is aborted and the job fails. I have created a patch for this bug and will send a pull request soon. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37580) Optimize current TaskSetManager abort logic when task failed count reach the threshold
[ https://issues.apache.org/jira/browse/SPARK-37580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455094#comment-17455094 ] Apache Spark commented on SPARK-37580: -- User 'wangshengjie123' has created a pull request for this issue: https://github.com/apache/spark/pull/34834 > Optimize current TaskSetManager abort logic when task failed count reach the > threshold > -- > > Key: SPARK-37580 > URL: https://issues.apache.org/jira/browse/SPARK-37580 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: wangshengjie >Priority: Major > > In production environment, we found some logic leak about TaskSetManager > abort. For example: > If one task has failed 3 times(max failed threshold is 4 in default), and > there is a retry task and speculative task both in running state, then one of > these 2 task attempts succeed and to cancel another. But executor which task > need to be cancelled lost(oom in our situcation), this task marked as failed, > and TaskSetManager handle this failed task attempt, it has failed 4 times so > abort this stage and cause job failed. > I created the patch for this bug and will soon be sent the pull request. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37580) Optimize current TaskSetManager abort logic when task failed count reach the threshold
[ https://issues.apache.org/jira/browse/SPARK-37580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37580: Assignee: (was: Apache Spark) > Optimize current TaskSetManager abort logic when task failed count reach the > threshold > -- > > Key: SPARK-37580 > URL: https://issues.apache.org/jira/browse/SPARK-37580 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: wangshengjie >Priority: Major > > In production environment, we found some logic leak about TaskSetManager > abort. For example: > If one task has failed 3 times(max failed threshold is 4 in default), and > there is a retry task and speculative task both in running state, then one of > these 2 task attempts succeed and to cancel another. But executor which task > need to be cancelled lost(oom in our situcation), this task marked as failed, > and TaskSetManager handle this failed task attempt, it has failed 4 times so > abort this stage and cause job failed. > I created the patch for this bug and will soon be sent the pull request. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37580) Optimize current TaskSetManager abort logic when task failed count reach the threshold
[ https://issues.apache.org/jira/browse/SPARK-37580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37580: Assignee: Apache Spark > Optimize current TaskSetManager abort logic when task failed count reach the > threshold > -- > > Key: SPARK-37580 > URL: https://issues.apache.org/jira/browse/SPARK-37580 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: wangshengjie >Assignee: Apache Spark >Priority: Major > > In production environment, we found some logic leak about TaskSetManager > abort. For example: > If one task has failed 3 times(max failed threshold is 4 in default), and > there is a retry task and speculative task both in running state, then one of > these 2 task attempts succeed and to cancel another. But executor which task > need to be cancelled lost(oom in our situcation), this task marked as failed, > and TaskSetManager handle this failed task attempt, it has failed 4 times so > abort this stage and cause job failed. > I created the patch for this bug and will soon be sent the pull request. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37581) sql hang at planning stage
ocean created SPARK-37581: - Summary: sql hang at planning stage Key: SPARK-37581 URL: https://issues.apache.org/jira/browse/SPARK-37581 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0, 3.1.1 Reporter: ocean when exec a sql, this sql hang at planning stage. when disable DPP, sql can finish normally. we can reproduce this problem through example below: create table test.test_a ( day string, week int, weekday int) partitioned by ( dt varchar(8)) stored as orc; insert into test.test_a partition (dt=20211126) values('1',1,2); create table test.test_b ( session_id string, device_id string, brand string, model string, wx_version string, os string, net_work_type string, app_id string, app_name string, col_z string, page_url string, page_title string, olabel string, otitle string, source string, send_dt string, recv_dt string, request_time string, write_time string, client_ip string, col_a string, dt_hour varchar(12), product string, channelfrom string, customer_um string, kb_code string, col_b string, rectype string, errcode string, col_c string, pageid_merge string) partitioned by ( dt varchar(8)) stored as orc; insert into test.test_b partition(dt=20211126) values('2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2'); create table if not exists test.test_c stored as ORCFILE as select calendar.day,calendar.week,calendar.weekday, a_kbs, b_kbs, c_kbs,d_kbs,e_kbs,f_kbs,g_kbs,h_kbs,i_kbs, j_kbs,k_kbs,l_kbs,m_kbs,n_kbs,o_kbs,p_kbs,q_kbs,r_kbs,s_kbs from (select * from test.test_a where dt = '20211126') calendar left join (select dt,count(distinct kb_code) as a_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t1 on calendar.dt = t1.dt left join (select dt,count(distinct kb_code) as b_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t2 on calendar.dt = t2.dt left join (select dt,count(distinct kb_code) as c_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t3 on calendar.dt = t3.dt left join (select dt,count(distinct kb_code) as d_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t4 on calendar.dt = t4.dt left join (select dt,count(distinct kb_code) as e_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t5 on calendar.dt = t5.dt left join (select dt,count(distinct kb_code) as f_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t6 on calendar.dt = t6.dt left join (select dt,count(distinct kb_code) as g_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t7 on calendar.dt = t7.dt left join (select dt,count(distinct kb_code) as h_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t8 on calendar.dt = t8.dt left join (select dt,count(distinct kb_code) as i_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t9 on calendar.dt = t9.dt left join (select dt,count(distinct kb_code) 
as j_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t10 on calendar.dt = t10.dt left join (select dt,count(distinct kb_code) as k_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t11 on calendar.dt = t11.dt left join (select dt,count(distinct kb_code) as l_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t12 on calendar.dt = t12.dt left join (select dt,count(distinct kb_code) as m_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t13 on calendar.dt = t13.dt left join (select dt,count(distinct kb_code) as n_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t14 on calendar.dt = t14.dt left join (select dt,count(distinct kb_code) as o_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge
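The report notes the query finishes when DPP is disabled; a hedged sketch of the standard switch (assuming this is the knob the reporter used) is:
{code:python}
# Hedged sketch: disable dynamic partition pruning for the session before
# running the (truncated) create-table-as-select above, then restore it.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "false")
spark.sql("...")  # the CTAS statement above
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
{code}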
[jira] [Created] (SPARK-37582) Support the binary type by contains()
Max Gekk created SPARK-37582: Summary: Support the binary type by contains() Key: SPARK-37582 URL: https://issues.apache.org/jira/browse/SPARK-37582 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.3.0 Reporter: Max Gekk New function which was exposed by https://github.com/apache/spark/pull/34761 accepts the string type only. The ticket aims to support the binary type as well. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37582) Support the binary type by contains()
[ https://issues.apache.org/jira/browse/SPARK-37582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455194#comment-17455194 ] Max Gekk commented on SPARK-37582: -- [~angerszhuuu] Would you like to work on this? If so, please, leave a comment here otherwise I will implement this myself. > Support the binary type by contains() > - > > Key: SPARK-37582 > URL: https://issues.apache.org/jira/browse/SPARK-37582 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > New function which was exposed by https://github.com/apache/spark/pull/34761 > accepts the string type only. The ticket aims to support the binary type as > well. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37583) Support the binary type by startswith() and endswith()
Max Gekk created SPARK-37583: Summary: Support the binary type by startswith() and endswith() Key: SPARK-37583 URL: https://issues.apache.org/jira/browse/SPARK-37583 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.3.0 Reporter: Max Gekk New functions were exposed by https://github.com/apache/spark/pull/34782 but they accept the string type only. To make migration from other systems easier, it would be nice to support the binary type too. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37582) Support the binary type by contains()
[ https://issues.apache.org/jira/browse/SPARK-37582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455200#comment-17455200 ] angerszhu commented on SPARK-37582: --- Sure, will raise a pr soon > Support the binary type by contains() > - > > Key: SPARK-37582 > URL: https://issues.apache.org/jira/browse/SPARK-37582 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > New function which was exposed by https://github.com/apache/spark/pull/34761 > accepts the string type only. The ticket aims to support the binary type as > well. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37584) New function: map_contains_key
Gengliang Wang created SPARK-37584: -- Summary: New function: map_contains_key Key: SPARK-37584 URL: https://issues.apache.org/jira/browse/SPARK-37584 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.3.0 Reporter: Gengliang Wang Assignee: Gengliang Wang Add a new function map_contains_key, which returns true if the map contains the key Examples: > SELECT map_contains_key(map(1, 'a', 2, 'b'), 1); true > SELECT map_contains_key(map(1, 'a', 2, 'b'), 3); false -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
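Until the new function is available, the same check can be expressed with existing functions; a hedged PySpark sketch (column name and keys are made-up examples):
{code:python}
# Hedged sketch: map_keys + array_contains give the same boolean the proposed
# map_contains_key would return.
from pyspark.sql import functions as F

df = spark.sql("SELECT map(1, 'a', 2, 'b') AS m")
df.select(
    F.array_contains(F.map_keys("m"), 1).alias("has_1"),  # true
    F.array_contains(F.map_keys("m"), 3).alias("has_3"),  # false
).show()
{code}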
[jira] [Commented] (SPARK-37445) Update hadoop-profile
[ https://issues.apache.org/jira/browse/SPARK-37445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455202#comment-17455202 ] Apache Spark commented on SPARK-37445: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/34835 > Update hadoop-profile > - > > Key: SPARK-37445 > URL: https://issues.apache.org/jira/browse/SPARK-37445 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.3.0 > > > Current hadoop profile is hadoop-3.2, update to hadoop-3.3, -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37584) New SQL function: map_contains_key
[ https://issues.apache.org/jira/browse/SPARK-37584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-37584: --- Summary: New SQL function: map_contains_key (was: New function: map_contains_key) > New SQL function: map_contains_key > -- > > Key: SPARK-37584 > URL: https://issues.apache.org/jira/browse/SPARK-37584 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Add a new function map_contains_key, which returns true if the map contains > the key > Examples: > > SELECT map_contains_key(map(1, 'a', 2, 'b'), 1); > true > > SELECT map_contains_key(map(1, 'a', 2, 'b'), 3); > false -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37584) New SQL function: map_contains_key
[ https://issues.apache.org/jira/browse/SPARK-37584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455204#comment-17455204 ] Apache Spark commented on SPARK-37584: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/34836 > New SQL function: map_contains_key > -- > > Key: SPARK-37584 > URL: https://issues.apache.org/jira/browse/SPARK-37584 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Add a new function map_contains_key, which returns true if the map contains > the key > Examples: > > SELECT map_contains_key(map(1, 'a', 2, 'b'), 1); > true > > SELECT map_contains_key(map(1, 'a', 2, 'b'), 3); > false -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37584) New SQL function: map_contains_key
[ https://issues.apache.org/jira/browse/SPARK-37584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37584: Assignee: Gengliang Wang (was: Apache Spark) > New SQL function: map_contains_key > -- > > Key: SPARK-37584 > URL: https://issues.apache.org/jira/browse/SPARK-37584 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Add a new function map_contains_key, which returns true if the map contains > the key > Examples: > > SELECT map_contains_key(map(1, 'a', 2, 'b'), 1); > true > > SELECT map_contains_key(map(1, 'a', 2, 'b'), 3); > false -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37584) New SQL function: map_contains_key
[ https://issues.apache.org/jira/browse/SPARK-37584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37584: Assignee: Apache Spark (was: Gengliang Wang) > New SQL function: map_contains_key > -- > > Key: SPARK-37584 > URL: https://issues.apache.org/jira/browse/SPARK-37584 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > Add a new function map_contains_key, which returns true if the map contains > the key > Examples: > > SELECT map_contains_key(map(1, 'a', 2, 'b'), 1); > true > > SELECT map_contains_key(map(1, 'a', 2, 'b'), 3); > false -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37584) New SQL function: map_contains_key
[ https://issues.apache.org/jira/browse/SPARK-37584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37584: Assignee: Apache Spark (was: Gengliang Wang) > New SQL function: map_contains_key > -- > > Key: SPARK-37584 > URL: https://issues.apache.org/jira/browse/SPARK-37584 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > Add a new function map_contains_key, which returns true if the map contains > the key > Examples: > > SELECT map_contains_key(map(1, 'a', 2, 'b'), 1); > true > > SELECT map_contains_key(map(1, 'a', 2, 'b'), 3); > false -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37585) DSV2 InputMetrics are not getting updated in a corner case
Sandeep Katta created SPARK-37585: - Summary: DSV2 InputMetrics are not getting updated in a corner case Key: SPARK-37585 URL: https://issues.apache.org/jira/browse/SPARK-37585 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.1.2, 3.0.3 Reporter: Sandeep Katta In some corner cases, DSV2 is not updating the input metrics. This is a very special case where the number of records read is less than 1000 and *hasNext* is not called for the last element (because input.hasNext returns false, MetricsIterator.hasNext is not called). The hasNext implementation of MetricsIterator:
{code:java}
override def hasNext: Boolean = {
  if (iter.hasNext) {
    true
  } else {
    metricsHandler.updateMetrics(0, force = true)
    false
  }
}
{code}
You can reproduce this issue easily in spark-shell by running the code below:
{code:java}
import scala.collection.mutable
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

spark.conf.set("spark.sql.sources.useV1SourceList", "")
val dir = "Users/tmp1"
spark.range(0, 100).write.format("parquet").mode("overwrite").save(dir)
val df = spark.read.format("parquet").load(dir)
val bytesReads = new mutable.ArrayBuffer[Long]()
val recordsRead = new mutable.ArrayBuffer[Long]()
val bytesReadListener = new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    bytesReads += taskEnd.taskMetrics.inputMetrics.bytesRead
    recordsRead += taskEnd.taskMetrics.inputMetrics.recordsRead
  }
}
spark.sparkContext.addSparkListener(bytesReadListener)
try {
  df.limit(10).collect()
  assert(recordsRead.sum > 0)
  assert(bytesReads.sum > 0)
} finally {
  spark.sparkContext.removeSparkListener(bytesReadListener)
}
{code}
This code generally fails at *assert(bytesReads.sum > 0)*, which confirms that the updateMetrics API is not called. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
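To make the failure mode concrete, here is a small self-contained sketch (plain Scala, not Spark's actual MetricsIterator) of why a limit-style consumer never reaches the exhausted branch of hasNext, so a forced metrics update placed only there never runs:
{code:scala}
object MetricsIteratorSketch {
  def main(args: Array[String]): Unit = {
    var forcedUpdates = 0                  // stands in for metricsHandler.updateMetrics(0, force = true)
    val underlying = (1 to 100).iterator

    val metered = new Iterator[Int] {
      override def hasNext: Boolean =
        if (underlying.hasNext) true
        else { forcedUpdates += 1; false } // only reached when the source is fully drained
      override def next(): Int = underlying.next()
    }

    // A limit(10)-style consumer stops after 10 rows and never calls hasNext
    // against an exhausted source, so the forced update never happens.
    val rows = metered.take(10).toList
    println(s"rows read = ${rows.size}, forced updates = $forcedUpdates")  // rows read = 10, forced updates = 0
  }
}
{code}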
[jira] [Commented] (SPARK-37583) Support the binary type by startswith() and endswith()
[ https://issues.apache.org/jira/browse/SPARK-37583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455262#comment-17455262 ] angerszhu commented on SPARK-37583: --- I think this should be done together with contains. > Support the binary type by startswith() and endswith() > -- > > Key: SPARK-37583 > URL: https://issues.apache.org/jira/browse/SPARK-37583 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > New function were exposed by https://github.com/apache/spark/pull/34782 but > they accept the string type only. To make migration from other systems > easier, it would be nice to support the binary type too. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37583) Support the binary type by startswith() and endswith()
[ https://issues.apache.org/jira/browse/SPARK-37583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455290#comment-17455290 ] Max Gekk commented on SPARK-37583: -- [~angerszhuuu] Sure, you can combine it to one PR. > Support the binary type by startswith() and endswith() > -- > > Key: SPARK-37583 > URL: https://issues.apache.org/jira/browse/SPARK-37583 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > New function were exposed by https://github.com/apache/spark/pull/34782 but > they accept the string type only. To make migration from other systems > easier, it would be nice to support the binary type too. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37576) Support built-in K8s executor roll plugin
[ https://issues.apache.org/jira/browse/SPARK-37576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-37576. --- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34832 [https://github.com/apache/spark/pull/34832] > Support built-in K8s executor roll plugin > - > > Key: SPARK-37576 > URL: https://issues.apache.org/jira/browse/SPARK-37576 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37576) Support built-in K8s executor roll plugin
[ https://issues.apache.org/jira/browse/SPARK-37576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-37576: - Assignee: Dongjoon Hyun > Support built-in K8s executor roll plugin > - > > Key: SPARK-37576 > URL: https://issues.apache.org/jira/browse/SPARK-37576 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37586) Add cipher mode option and set default cipher mode for aes_encrypt and aes_decrypt
Max Gekk created SPARK-37586: Summary: Add cipher mode option and set default cipher mode for aes_encrypt and aes_decrypt Key: SPARK-37586 URL: https://issues.apache.org/jira/browse/SPARK-37586 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Max Gekk https://github.com/apache/spark/pull/32801 added aes_encrypt/aes_decrypt functions to Spark. However, they rely on the JVM's configuration for which cipher mode to use, which is problematic because it is not fixed across versions and systems. Let's hardcode a default cipher mode and also allow users to set a cipher mode as an argument to the function. In the future, we can support other modes like GCM and CBC that are already supported by other systems: # Snowflake: https://docs.snowflake.com/en/sql-reference/functions/encrypt.html # BigQuery: https://cloud.google.com/bigquery/docs/reference/standard-sql/aead-encryption-concepts#block_cipher_modes -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
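For illustration, a standalone JCE sketch (plain javax.crypto, not Spark's expression code) of the ambiguity described above: a bare "AES" transformation lets the JCE provider pick the mode and padding, while spelling them out pins the behaviour down:
{code:scala}
import javax.crypto.Cipher
import javax.crypto.spec.SecretKeySpec

object AesModeSketch {
  def main(args: Array[String]): Unit = {
    val key = new SecretKeySpec("0123456789abcdef".getBytes("UTF-8"), "AES")  // 16-byte demo key, not for real use
    val implicitMode = Cipher.getInstance("AES")                   // provider chooses mode/padding (often ECB/PKCS5Padding)
    val explicitMode = Cipher.getInstance("AES/ECB/PKCS5Padding")  // transformation fully specified
    implicitMode.init(Cipher.ENCRYPT_MODE, key)
    explicitMode.init(Cipher.ENCRYPT_MODE, key)
    println(implicitMode.getAlgorithm)  // AES
    println(explicitMode.getAlgorithm)  // AES/ECB/PKCS5Padding
  }
}
{code}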
[jira] [Assigned] (SPARK-37586) Add cipher mode option and set default cipher mode for aes_encrypt and aes_decrypt
[ https://issues.apache.org/jira/browse/SPARK-37586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37586: Assignee: (was: Apache Spark) > Add cipher mode option and set default cipher mode for aes_encrypt and > aes_decrypt > -- > > Key: SPARK-37586 > URL: https://issues.apache.org/jira/browse/SPARK-37586 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > https://github.com/apache/spark/pull/32801 added aes_encrypt/aes_decrypt > functions to spark. However they rely on the jvm's configuration regarding > which cipher mode to support, this is problematic as it is not fixed across > versions and systems. > Let's hardcode a default cipher mode and also allow users to set a cipher > mode as an argument to the function. > In the future, we can support other modes like GCM and CBC that have been > already supported by other systems: > # Snowflake: > https://docs.snowflake.com/en/sql-reference/functions/encrypt.html > # Bigquery: > https://cloud.google.com/bigquery/docs/reference/standard-sql/aead-encryption-concepts#block_cipher_modes -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37586) Add cipher mode option and set default cipher mode for aes_encrypt and aes_decrypt
[ https://issues.apache.org/jira/browse/SPARK-37586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455351#comment-17455351 ] Apache Spark commented on SPARK-37586: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/34837 > Add cipher mode option and set default cipher mode for aes_encrypt and > aes_decrypt > -- > > Key: SPARK-37586 > URL: https://issues.apache.org/jira/browse/SPARK-37586 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > https://github.com/apache/spark/pull/32801 added aes_encrypt/aes_decrypt > functions to spark. However they rely on the jvm's configuration regarding > which cipher mode to support, this is problematic as it is not fixed across > versions and systems. > Let's hardcode a default cipher mode and also allow users to set a cipher > mode as an argument to the function. > In the future, we can support other modes like GCM and CBC that have been > already supported by other systems: > # Snowflake: > https://docs.snowflake.com/en/sql-reference/functions/encrypt.html > # Bigquery: > https://cloud.google.com/bigquery/docs/reference/standard-sql/aead-encryption-concepts#block_cipher_modes -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37586) Add cipher mode option and set default cipher mode for aes_encrypt and aes_decrypt
[ https://issues.apache.org/jira/browse/SPARK-37586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37586: Assignee: Apache Spark > Add cipher mode option and set default cipher mode for aes_encrypt and > aes_decrypt > -- > > Key: SPARK-37586 > URL: https://issues.apache.org/jira/browse/SPARK-37586 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > https://github.com/apache/spark/pull/32801 added aes_encrypt/aes_decrypt > functions to spark. However they rely on the jvm's configuration regarding > which cipher mode to support, this is problematic as it is not fixed across > versions and systems. > Let's hardcode a default cipher mode and also allow users to set a cipher > mode as an argument to the function. > In the future, we can support other modes like GCM and CBC that have been > already supported by other systems: > # Snowflake: > https://docs.snowflake.com/en/sql-reference/functions/encrypt.html > # Bigquery: > https://cloud.google.com/bigquery/docs/reference/standard-sql/aead-encryption-concepts#block_cipher_modes -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37451) Performance improvement regressed String to Decimal cast
[ https://issues.apache.org/jira/browse/SPARK-37451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37451: -- Priority: Blocker (was: Major) > Performance improvement regressed String to Decimal cast > > > Key: SPARK-37451 > URL: https://issues.apache.org/jira/browse/SPARK-37451 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.2.1 >Reporter: Raza Jafri >Priority: Blocker > Labels: correctness > > A performance improvement to how Spark casts Strings to Decimal in this [PR > title|https://issues.apache.org/jira/browse/SPARK-32706], has introduced a > regression > {noformat} > scala> :paste > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true) > spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true) > spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true) > val data = Seq(Row("7.836725755512218E38")) > val schema=StructType(Array(StructField("a", StringType, false))) > val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema) > df.select(col("a").cast(DecimalType(37,-17))).show > // Exiting paste mode, now interpreting. > ++ > | a| > ++ > |7.836725755512218...| > ++ > scala> spark.version > res2: String = 3.0.1 > scala> :paste > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true) > spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true) > spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true) > val data = Seq(Row("7.836725755512218E38")) > val schema=StructType(Array(StructField("a", StringType, false))) > val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema) > df.select(col("a").cast(DecimalType(37,-17))).show > // Exiting paste mode, now interpreting. > ++ > | a| > ++ > |null| > ++ > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > data: Seq[org.apache.spark.sql.Row] = List([7.836725755512218E38]) > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(a,StringType,false)) > df: org.apache.spark.sql.DataFrame = [a: string] > scala> spark.version > res1: String = 3.1.1 > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37451) Performance improvement regressed String to Decimal cast
[ https://issues.apache.org/jira/browse/SPARK-37451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37451: -- Target Version/s: 3.2.0, 3.1.3 Description: A performance improvement to how Spark casts Strings to Decimal in this [PR title|https://issues.apache.org/jira/browse/SPARK-32706], has introduced a regression {noformat} scala> :paste // Entering paste mode (ctrl-D to finish) import org.apache.spark.sql.types._ import org.apache.spark.sql.Row spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true) spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true) spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true) val data = Seq(Row("7.836725755512218E38")) val schema=StructType(Array(StructField("a", StringType, false))) val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema) df.select(col("a").cast(DecimalType(37,-17))).show // Exiting paste mode, now interpreting. ++ | a| ++ |7.836725755512218...| ++ scala> spark.version res2: String = 3.0.1 scala> :paste // Entering paste mode (ctrl-D to finish) import org.apache.spark.sql.types._ import org.apache.spark.sql.Row spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true) spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true) spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true) val data = Seq(Row("7.836725755512218E38")) val schema=StructType(Array(StructField("a", StringType, false))) val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema) df.select(col("a").cast(DecimalType(37,-17))).show // Exiting paste mode, now interpreting. ++ | a| ++ |null| ++ import org.apache.spark.sql.types._ import org.apache.spark.sql.Row data: Seq[org.apache.spark.sql.Row] = List([7.836725755512218E38]) schema: org.apache.spark.sql.types.StructType = StructType(StructField(a,StringType,false)) df: org.apache.spark.sql.DataFrame = [a: string] scala> spark.version res1: String = 3.1.1 {noformat} was: A performance improvement to how Spark casts Strings to Decimal in this [PR title|https://issues.apache.org/jira/browse/SPARK-32706], has introduced a regression {noformat} scala> :paste // Entering paste mode (ctrl-D to finish) import org.apache.spark.sql.types._ import org.apache.spark.sql.Row spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true) spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true) spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true) val data = Seq(Row("7.836725755512218E38")) val schema=StructType(Array(StructField("a", StringType, false))) val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema) df.select(col("a").cast(DecimalType(37,-17))).show // Exiting paste mode, now interpreting. ++ | a| ++ |7.836725755512218...| ++ scala> spark.version res2: String = 3.0.1 scala> :paste // Entering paste mode (ctrl-D to finish) import org.apache.spark.sql.types._ import org.apache.spark.sql.Row spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true) spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true) spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true) val data = Seq(Row("7.836725755512218E38")) val schema=StructType(Array(StructField("a", StringType, false))) val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema) df.select(col("a").cast(DecimalType(37,-17))).show // Exiting paste mode, now interpreting. 
++ | a| ++ |null| ++ import org.apache.spark.sql.types._ import org.apache.spark.sql.Row data: Seq[org.apache.spark.sql.Row] = List([7.836725755512218E38]) schema: org.apache.spark.sql.types.StructType = StructType(StructField(a,StringType,false)) df: org.apache.spark.sql.DataFrame = [a: string] scala> spark.version res1: String = 3.1.1 {noformat} > Performance improvement regressed String to Decimal cast > > > Key: SPARK-37451 > URL: https://issues.apache.org/jira/browse/SPARK-37451 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.2.1 >Reporter: Raza Jafri >Priority: Blocker > Labels: correctness > > A performance improvement to how Spark casts Strings to Decimal in this [PR > title|https://issues.apache.org/jira/browse/SPARK-32706], has introduced a > regression > {noformat} > scala> :paste > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal",
[jira] [Updated] (SPARK-37451) Performance improvement regressed String to Decimal cast
[ https://issues.apache.org/jira/browse/SPARK-37451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37451: -- Labels: correctness (was: ) > Performance improvement regressed String to Decimal cast > > > Key: SPARK-37451 > URL: https://issues.apache.org/jira/browse/SPARK-37451 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.2.1 >Reporter: Raza Jafri >Priority: Major > Labels: correctness > > A performance improvement to how Spark casts Strings to Decimal in this [PR > title|https://issues.apache.org/jira/browse/SPARK-32706], has introduced a > regression > {noformat} > scala> :paste > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true) > spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true) > spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true) > val data = Seq(Row("7.836725755512218E38")) > val schema=StructType(Array(StructField("a", StringType, false))) > val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema) > df.select(col("a").cast(DecimalType(37,-17))).show > // Exiting paste mode, now interpreting. > ++ > | a| > ++ > |7.836725755512218...| > ++ > scala> spark.version > res2: String = 3.0.1 > scala> :paste > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true) > spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true) > spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true) > val data = Seq(Row("7.836725755512218E38")) > val schema=StructType(Array(StructField("a", StringType, false))) > val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema) > df.select(col("a").cast(DecimalType(37,-17))).show > // Exiting paste mode, now interpreting. > ++ > | a| > ++ > |null| > ++ > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > data: Seq[org.apache.spark.sql.Row] = List([7.836725755512218E38]) > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(a,StringType,false)) > df: org.apache.spark.sql.DataFrame = [a: string] > scala> spark.version > res1: String = 3.1.1 > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37451) Performance improvement regressed String to Decimal cast
[ https://issues.apache.org/jira/browse/SPARK-37451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455357#comment-17455357 ] Dongjoon Hyun commented on SPARK-37451: --- According to the JIRA description, I added a correctness label to this PR and set Target Versions: 3.1.3 and 3.2.0. cc [~huaxingao] > Performance improvement regressed String to Decimal cast > > > Key: SPARK-37451 > URL: https://issues.apache.org/jira/browse/SPARK-37451 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.2.1 >Reporter: Raza Jafri >Priority: Blocker > Labels: correctness > > A performance improvement to how Spark casts Strings to Decimal in this [PR > title|https://issues.apache.org/jira/browse/SPARK-32706], has introduced a > regression > {noformat} > scala> :paste > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true) > spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true) > spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true) > val data = Seq(Row("7.836725755512218E38")) > val schema=StructType(Array(StructField("a", StringType, false))) > val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema) > df.select(col("a").cast(DecimalType(37,-17))).show > // Exiting paste mode, now interpreting. > ++ > | a| > ++ > |7.836725755512218...| > ++ > scala> spark.version > res2: String = 3.0.1 > scala> :paste > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true) > spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true) > spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true) > val data = Seq(Row("7.836725755512218E38")) > val schema=StructType(Array(StructField("a", StringType, false))) > val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema) > df.select(col("a").cast(DecimalType(37,-17))).show > // Exiting paste mode, now interpreting. > ++ > | a| > ++ > |null| > ++ > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > data: Seq[org.apache.spark.sql.Row] = List([7.836725755512218E38]) > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(a,StringType,false)) > df: org.apache.spark.sql.DataFrame = [a: string] > scala> spark.version > res1: String = 3.1.1 > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37451) Performance improvement regressed String to Decimal cast
[ https://issues.apache.org/jira/browse/SPARK-37451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455357#comment-17455357 ] Dongjoon Hyun edited comment on SPARK-37451 at 12/8/21, 4:16 PM: - According to the JIRA description, I added a correctness label to this PR and set Target Versions: 3.1.3 and 3.2.1. cc [~huaxingao] was (Author: dongjoon): According to the JIRA description, I added a correctness label to this PR and set Target Versions: 3.1.3 and 3.2.0. cc [~huaxingao] > Performance improvement regressed String to Decimal cast > > > Key: SPARK-37451 > URL: https://issues.apache.org/jira/browse/SPARK-37451 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.2.1 >Reporter: Raza Jafri >Priority: Blocker > Labels: correctness > > A performance improvement to how Spark casts Strings to Decimal in this [PR > title|https://issues.apache.org/jira/browse/SPARK-32706], has introduced a > regression > {noformat} > scala> :paste > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true) > spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true) > spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true) > val data = Seq(Row("7.836725755512218E38")) > val schema=StructType(Array(StructField("a", StringType, false))) > val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema) > df.select(col("a").cast(DecimalType(37,-17))).show > // Exiting paste mode, now interpreting. > ++ > | a| > ++ > |7.836725755512218...| > ++ > scala> spark.version > res2: String = 3.0.1 > scala> :paste > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true) > spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true) > spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true) > val data = Seq(Row("7.836725755512218E38")) > val schema=StructType(Array(StructField("a", StringType, false))) > val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema) > df.select(col("a").cast(DecimalType(37,-17))).show > // Exiting paste mode, now interpreting. > ++ > | a| > ++ > |null| > ++ > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > data: Seq[org.apache.spark.sql.Row] = List([7.836725755512218E38]) > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(a,StringType,false)) > df: org.apache.spark.sql.DataFrame = [a: string] > scala> spark.version > res1: String = 3.1.1 > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37451) Performance improvement regressed String to Decimal cast
[ https://issues.apache.org/jira/browse/SPARK-37451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37451: -- Target Version/s: 3.1.3, 3.2.1 (was: 3.2.0, 3.1.3) > Performance improvement regressed String to Decimal cast > > > Key: SPARK-37451 > URL: https://issues.apache.org/jira/browse/SPARK-37451 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.2.1 >Reporter: Raza Jafri >Priority: Blocker > Labels: correctness > > A performance improvement to how Spark casts Strings to Decimal in this [PR > title|https://issues.apache.org/jira/browse/SPARK-32706], has introduced a > regression > {noformat} > scala> :paste > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true) > spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true) > spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true) > val data = Seq(Row("7.836725755512218E38")) > val schema=StructType(Array(StructField("a", StringType, false))) > val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema) > df.select(col("a").cast(DecimalType(37,-17))).show > // Exiting paste mode, now interpreting. > ++ > | a| > ++ > |7.836725755512218...| > ++ > scala> spark.version > res2: String = 3.0.1 > scala> :paste > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true) > spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true) > spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true) > val data = Seq(Row("7.836725755512218E38")) > val schema=StructType(Array(StructField("a", StringType, false))) > val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema) > df.select(col("a").cast(DecimalType(37,-17))).show > // Exiting paste mode, now interpreting. > ++ > | a| > ++ > |null| > ++ > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > data: Seq[org.apache.spark.sql.Row] = List([7.836725755512218E38]) > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(a,StringType,false)) > df: org.apache.spark.sql.DataFrame = [a: string] > scala> spark.version > res1: String = 3.1.1 > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37572) Flexible ways of launching executors
[ https://issues.apache.org/jira/browse/SPARK-37572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dagang Wei updated SPARK-37572: --- Affects Version/s: 3.3.0 (was: 3.2.0) > Flexible ways of launching executors > > > Key: SPARK-37572 > URL: https://issues.apache.org/jira/browse/SPARK-37572 > Project: Spark > Issue Type: New Feature > Components: Deploy >Affects Versions: 3.3.0 >Reporter: Dagang Wei >Priority: Major > > Currently Spark launches executor processes by constructing and running > commands [1], for example: > {code:java} > /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp > /opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* > -Xmx1024M -Dspark.driver.port=35729 > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://coarsegrainedschedu...@dagang.svl.corp.google.com:35729 --executor-id > 0 --hostname 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 > --worker-url spark://Worker@100.116.124.193:45287 {code} > But there are use cases which require more flexible ways of launching > executors. In particular, our use case is that we run Spark in standalone > mode, Spark master and workers are running in VMs. We want to allow Spark app > developers to provide custom container images to customize the job runtime > environment (typically Java and Python dependencies), so executors (which run > the job code) need to run in Docker containers. > After investigating in the source code, we found that the concept of Spark > Command Runner might be a good solution. Basically, we want to introduce an > optional Spark command runner in Spark, so that instead of running the > command to launch executor directly, it passes the command to the runner, > which the runner then runs the command with its own strategy which could be > running in Docker, or by default running the command directly. > The runner should be customizable through an env variable > `SPARK_COMMAND_RUNNER`, which by default could be a simple script like: > {code:java} > #!/bin/bash > exec "$@" {code} > or in the case of Docker container: > {code:java} > #!/bin/bash > docker run ... – "$@" {code} > > I already have a patch for this feature and have tested in our environment. > > [1]: > [https://github.com/apache/spark/blob/v3.2.0/core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala#L52] -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37572) Flexible ways of launching executors
[ https://issues.apache.org/jira/browse/SPARK-37572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dagang Wei updated SPARK-37572: --- Affects Version/s: 3.2.0 (was: 3.3.0) > Flexible ways of launching executors > > > Key: SPARK-37572 > URL: https://issues.apache.org/jira/browse/SPARK-37572 > Project: Spark > Issue Type: New Feature > Components: Deploy >Affects Versions: 3.2.0 >Reporter: Dagang Wei >Priority: Major > > Currently Spark launches executor processes by constructing and running > commands [1], for example: > {code:java} > /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp > /opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* > -Xmx1024M -Dspark.driver.port=35729 > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://coarsegrainedschedu...@dagang.svl.corp.google.com:35729 --executor-id > 0 --hostname 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 > --worker-url spark://Worker@100.116.124.193:45287 {code} > But there are use cases which require more flexible ways of launching > executors. In particular, our use case is that we run Spark in standalone > mode, Spark master and workers are running in VMs. We want to allow Spark app > developers to provide custom container images to customize the job runtime > environment (typically Java and Python dependencies), so executors (which run > the job code) need to run in Docker containers. > After investigating in the source code, we found that the concept of Spark > Command Runner might be a good solution. Basically, we want to introduce an > optional Spark command runner in Spark, so that instead of running the > command to launch executor directly, it passes the command to the runner, > which the runner then runs the command with its own strategy which could be > running in Docker, or by default running the command directly. > The runner should be customizable through an env variable > `SPARK_COMMAND_RUNNER`, which by default could be a simple script like: > {code:java} > #!/bin/bash > exec "$@" {code} > or in the case of Docker container: > {code:java} > #!/bin/bash > docker run ... – "$@" {code} > > I already have a patch for this feature and have tested in our environment. > > [1]: > [https://github.com/apache/spark/blob/v3.2.0/core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala#L52] -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37205) Support mapreduce.job.send-token-conf when starting containers in YARN
[ https://issues.apache.org/jira/browse/SPARK-37205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37205: Assignee: Chao Sun > Support mapreduce.job.send-token-conf when starting containers in YARN > -- > > Key: SPARK-37205 > URL: https://issues.apache.org/jira/browse/SPARK-37205 > Project: Spark > Issue Type: New Feature > Components: YARN >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > > {{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see > [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910] with which RM is > not required to statically have config for all the secure HDFS clusters. > Currently it only works for MRv2 but it'd be nice if Spark can also use this > feature. I think we only need to pass the config to > {{LaunchContainerContext}} in {{Client.createContainerLaunchContext}}. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37205) Support mapreduce.job.send-token-conf when starting containers in YARN
[ https://issues.apache.org/jira/browse/SPARK-37205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37205. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34635 [https://github.com/apache/spark/pull/34635] > Support mapreduce.job.send-token-conf when starting containers in YARN > -- > > Key: SPARK-37205 > URL: https://issues.apache.org/jira/browse/SPARK-37205 > Project: Spark > Issue Type: New Feature > Components: YARN >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.3.0 > > > {{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see > [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910] with which RM is > not required to statically have config for all the secure HDFS clusters. > Currently it only works for MRv2 but it'd be nice if Spark can also use this > feature. I think we only need to pass the config to > {{LaunchContainerContext}} in {{Client.createContainerLaunchContext}}. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
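A hedged sketch of the user-facing side, assuming the standard spark.hadoop.* passthrough into the Hadoop Configuration; the key pattern below is purely illustrative (see YARN-5910 for the actual semantics), and the YARN-side plumbing is what this ticket adds:
{code:scala}
import org.apache.spark.sql.SparkSession

object SendTokenConfExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("send-token-conf-example")
      // Illustrative only: ship the remote secure cluster's NameNode-related configs
      // with the job instead of requiring them statically on the ResourceManager.
      .config("spark.hadoop.mapreduce.job.send-token-conf",
        "^dfs.nameservices$|^dfs.namenode.rpc-address.*$|^dfs.ha.namenodes.*$")
      .getOrCreate()

    // ... job code ...
    spark.stop()
  }
}
{code}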
[jira] [Commented] (SPARK-37570) mypy breaks on pyspark.pandas.plot.core.Bucketizer
[ https://issues.apache.org/jira/browse/SPARK-37570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455897#comment-17455897 ] Rafal Wojdyla commented on SPARK-37570: --- A workaround in {{setup.cfg}}: {noformat} [mypy-pyspark.pandas.plot.core] follow_imports = skip {noformat} > mypy breaks on pyspark.pandas.plot.core.Bucketizer > -- > > Key: SPARK-37570 > URL: https://issues.apache.org/jira/browse/SPARK-37570 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: Mypy: v0.910-1 > Py: 3.9 >Reporter: Rafal Wojdyla >Priority: Major > > Mypy breaks on a project with pyspark 3.2.0 dependency (worked fine for > 3.1.2), see stacktrace below: > {noformat} > Traceback (most recent call last): > File "/Users/rav/.cache/pre-commit/repoytd0jvc_/py_env-python3.9/bin/mypy", > line 8, in > sys.exit(console_entry()) > File > "/Users/rav/.cache/pre-commit/repoytd0jvc_/py_env-python3.9/lib/python3.9/site-packages/mypy/__main__.py", > line 11, in console_entry > main(None, sys.stdout, sys.stderr) > File "mypy/main.py", line 87, in main > File "mypy/main.py", line 165, in run_build > File "mypy/build.py", line 179, in build > File "mypy/build.py", line 254, in _build > File "mypy/build.py", line 2697, in dispatch > File "mypy/build.py", line 3021, in process_graph > File "mypy/build.py", line 3138, in process_stale_scc > File "mypy/build.py", line 2288, in write_cache > File "mypy/build.py", line 1475, in write_cache > File "mypy/nodes.py", line 313, in serialize > File "mypy/nodes.py", line 3149, in serialize > File "mypy/nodes.py", line 3083, in serialize > AssertionError: Definition of pyspark.pandas.plot.core.Bucketizer is > unexpectedly incomplete > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37587) Forever-running streams get completed with finished status when k8s worker is lost
In-Ho Yi created SPARK-37587: Summary: Forever-running streams get completed with finished status when k8s worker is lost Key: SPARK-37587 URL: https://issues.apache.org/jira/browse/SPARK-37587 Project: Spark Issue Type: Bug Components: Kubernetes, Structured Streaming Affects Versions: 3.1.1 Reporter: In-Ho Yi We have forever-running streaming jobs, defined as: {{spark.readStream}} {{.format("avro")}} {{.option("maxFilesPerTrigger", 10)}} {{.schema(schema)}} {{.load(loadPath)}} {{.as[T]}} {{.writeStream}} {{.option("checkpointLocation", checkpointPath)}} We have several of these running, and at the end of main, I call {{spark.streams.awaitAnyTermination()}} To fail the job on any failing jobs. Now, we are running on a k8s runner with arguments (showing only relevant ones): {{- /opt/spark/bin/spark-submit}} {{# From `kubectl cluster-info`}} {{- --master}} {{- k8s://https://SOMEADDRESS.gr7.us-east-1.eks.amazonaws.com:443}} {{- --deploy-mode}} {{- client}} {{- --name}} {{- ourstreamingjob}} {{# Driver connection via headless service}} {{- --conf}} {{- spark.driver.host=spark-driver-service}} {{- --conf}} {{- spark.driver.port=31337}} {{# Move ivy temp dirs under /tmp. See https://stackoverflow.com/a/55921242/760482}} {{- --conf}} {{- spark.jars.ivy=/tmp/.ivy}} {{# JVM settings for ivy and log4j.}} {{- --conf}} {{- spark.driver.extraJavaOptions=-Divy.cache.dir=/tmp -Divy.home=/tmp -Dlog4j.configuration=file:///opt/spark/log4j.properties}} {{- --conf}} {{- spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///opt/spark/log4j.properties}} {{# Spark on k8s settings.}} {{- --conf}} {{- spark.executor.instances=10}} {{- --conf}} {{- spark.executor.memory=8g}} {{- --conf}} {{- spark.executor.memoryOverhead=2g}} {{- --conf}} {{{}- spark.kubernetes.container.image=xxx.dkr.ecr.us-east-1.amazonaws.com/{}}}{{{}ourstreamingjob{}}}{{{}:latesthash{}}} {{- --conf}} {{- spark.kubernetes.container.image.pullPolicy=Always}} The container image is built with provided spark:3.1.1-hadoop3.2 image and we add our jar to it. We have found issues with our monitoring that a couple of these streaming were not running anymore, and found that these streams were deemed "complete" with FINISHED status. I couldn't find any error logs from the streams themselves. We did find that the time when these jobs were completed happen to coincide with the time when one of the k8s workers were lost and a new worker was added to the cluster. >From the console, it says: {quote}Executor 7 Removed at 2021/12/07 22:26:12 Reason: The executor with ID 7 (registered at 1638858647668 ms) was not found in the cluster at the polling time (1638915971948 ms) which is after the accepted detect delta time (3 ms) configured by `spark.kubernetes.executor.missingPodDetectDelta`. The executor may have been deleted but the driver missed the deletion event. Marking this executor as failed. {quote} I think the correct behavior would be to continue on the streaming from the last known checkpoint. If there is really an error, it should throw and the error should propagate so that eventually the pipeline will terminate. I've checked out changes since 3.1.1 but couldn't find any fix to this specific issue. Also, is there any workaround you could recommend? Thanks! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
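Not a fix for the executor-loss behaviour described above, but a hedged sketch of how the driver can at least distinguish a clean FINISHED stop from a failure using the public StreamingQuery API (names here are illustrative):
{code:scala}
import org.apache.spark.sql.streaming.StreamingQuery

// Inspect query handles (as returned by writeStream.start()) that are no longer active:
// a real failure surfaces via q.exception, while the symptom above shows up as an
// inactive query with no exception at all.
def reportStoppedQueries(queries: Seq[StreamingQuery]): Unit = {
  queries.filterNot(_.isActive).foreach { q =>
    q.exception match {
      case Some(e) => println(s"query ${q.name} failed: ${e.getMessage}")
      case None    => println(s"query ${q.name} stopped with FINISHED status")
    }
  }
}
// Call this from a monitoring loop over the handles kept when starting the streams,
// or hook the equivalent logic into a StreamingQueryListener.onQueryTerminated callback.
{code}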
[jira] [Commented] (SPARK-37569) View Analysis incorrectly marks nested fields as nullable
[ https://issues.apache.org/jira/browse/SPARK-37569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455950#comment-17455950 ] Apache Spark commented on SPARK-37569: -- User 'shardulm94' has created a pull request for this issue: https://github.com/apache/spark/pull/34839 > View Analysis incorrectly marks nested fields as nullable > - > > Key: SPARK-37569 > URL: https://issues.apache.org/jira/browse/SPARK-37569 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Shardul Mahadik >Priority: Major > > Consider a view as follows with all fields non-nullable (required) > {code:java} > spark.sql(""" > CREATE OR REPLACE VIEW v AS > SELECT id, named_struct('a', id) AS nested > FROM RANGE(10) > """) > {code} > we can see that the view schema has been correctly stored as non-nullable > {code:java} > scala> > System.out.println(spark.sessionState.catalog.externalCatalog.getTable("default", > "v2")) > CatalogTable( > Database: default > Table: v2 > Owner: smahadik > Created Time: Tue Dec 07 09:00:42 PST 2021 > Last Access: UNKNOWN > Created By: Spark 3.3.0-SNAPSHOT > Type: VIEW > View Text: SELECT id, named_struct('a', id) AS nested > FROM RANGE(10) > View Original Text: SELECT id, named_struct('a', id) AS nested > FROM RANGE(10) > View Catalog and Namespace: spark_catalog.default > View Query Output Columns: [id, nested] > Table Properties: [transient_lastDdlTime=1638896442] > Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat > OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat > Storage Properties: [serialization.format=1] > Schema: root > |-- id: long (nullable = false) > |-- nested: struct (nullable = false) > ||-- a: long (nullable = false) > ) > {code} > However, when trying to read this view, it incorrectly marks nested column > {{a}} as nullable > {code:java} > scala> spark.table("v2").printSchema > root > |-- id: long (nullable = false) > |-- nested: struct (nullable = false) > ||-- a: long (nullable = true) > {code} > This is caused by [this > line|https://github.com/apache/spark/blob/fb40c0e19f84f2de9a3d69d809e9e4031f76ef90/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L3546] > in Analyzer.scala. Going through the history of changes for this block of > code, it seems like {{asNullable}} is a remnant of a time before we added > [checks|https://github.com/apache/spark/blob/fb40c0e19f84f2de9a3d69d809e9e4031f76ef90/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L3543] > to ensure that the from and to types of the cast were compatible. As > nullability is already checked, it should be safe to add a cast without > converting the target datatype to nullable. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37569) View Analysis incorrectly marks nested fields as nullable
[ https://issues.apache.org/jira/browse/SPARK-37569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37569: Assignee: (was: Apache Spark) > View Analysis incorrectly marks nested fields as nullable > - > > Key: SPARK-37569 > URL: https://issues.apache.org/jira/browse/SPARK-37569 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Shardul Mahadik >Priority: Major > > Consider a view as follows with all fields non-nullable (required) > {code:java} > spark.sql(""" > CREATE OR REPLACE VIEW v AS > SELECT id, named_struct('a', id) AS nested > FROM RANGE(10) > """) > {code} > we can see that the view schema has been correctly stored as non-nullable > {code:java} > scala> > System.out.println(spark.sessionState.catalog.externalCatalog.getTable("default", > "v2")) > CatalogTable( > Database: default > Table: v2 > Owner: smahadik > Created Time: Tue Dec 07 09:00:42 PST 2021 > Last Access: UNKNOWN > Created By: Spark 3.3.0-SNAPSHOT > Type: VIEW > View Text: SELECT id, named_struct('a', id) AS nested > FROM RANGE(10) > View Original Text: SELECT id, named_struct('a', id) AS nested > FROM RANGE(10) > View Catalog and Namespace: spark_catalog.default > View Query Output Columns: [id, nested] > Table Properties: [transient_lastDdlTime=1638896442] > Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat > OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat > Storage Properties: [serialization.format=1] > Schema: root > |-- id: long (nullable = false) > |-- nested: struct (nullable = false) > ||-- a: long (nullable = false) > ) > {code} > However, when trying to read this view, it incorrectly marks nested column > {{a}} as nullable > {code:java} > scala> spark.table("v2").printSchema > root > |-- id: long (nullable = false) > |-- nested: struct (nullable = false) > ||-- a: long (nullable = true) > {code} > This is caused by [this > line|https://github.com/apache/spark/blob/fb40c0e19f84f2de9a3d69d809e9e4031f76ef90/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L3546] > in Analyzer.scala. Going through the history of changes for this block of > code, it seems like {{asNullable}} is a remnant of a time before we added > [checks|https://github.com/apache/spark/blob/fb40c0e19f84f2de9a3d69d809e9e4031f76ef90/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L3543] > to ensure that the from and to types of the cast were compatible. As > nullability is already checked, it should be safe to add a cast without > converting the target datatype to nullable. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37569) View Analysis incorrectly marks nested fields as nullable
[ https://issues.apache.org/jira/browse/SPARK-37569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37569: Assignee: Apache Spark > View Analysis incorrectly marks nested fields as nullable > - > > Key: SPARK-37569 > URL: https://issues.apache.org/jira/browse/SPARK-37569 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Shardul Mahadik >Assignee: Apache Spark >Priority: Major > > Consider a view as follows with all fields non-nullable (required) > {code:java} > spark.sql(""" > CREATE OR REPLACE VIEW v AS > SELECT id, named_struct('a', id) AS nested > FROM RANGE(10) > """) > {code} > we can see that the view schema has been correctly stored as non-nullable > {code:java} > scala> > System.out.println(spark.sessionState.catalog.externalCatalog.getTable("default", > "v2")) > CatalogTable( > Database: default > Table: v2 > Owner: smahadik > Created Time: Tue Dec 07 09:00:42 PST 2021 > Last Access: UNKNOWN > Created By: Spark 3.3.0-SNAPSHOT > Type: VIEW > View Text: SELECT id, named_struct('a', id) AS nested > FROM RANGE(10) > View Original Text: SELECT id, named_struct('a', id) AS nested > FROM RANGE(10) > View Catalog and Namespace: spark_catalog.default > View Query Output Columns: [id, nested] > Table Properties: [transient_lastDdlTime=1638896442] > Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat > OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat > Storage Properties: [serialization.format=1] > Schema: root > |-- id: long (nullable = false) > |-- nested: struct (nullable = false) > ||-- a: long (nullable = false) > ) > {code} > However, when trying to read this view, it incorrectly marks nested column > {{a}} as nullable > {code:java} > scala> spark.table("v2").printSchema > root > |-- id: long (nullable = false) > |-- nested: struct (nullable = false) > ||-- a: long (nullable = true) > {code} > This is caused by [this > line|https://github.com/apache/spark/blob/fb40c0e19f84f2de9a3d69d809e9e4031f76ef90/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L3546] > in Analyzer.scala. Going through the history of changes for this block of > code, it seems like {{asNullable}} is a remnant of a time before we added > [checks|https://github.com/apache/spark/blob/fb40c0e19f84f2de9a3d69d809e9e4031f76ef90/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L3543] > to ensure that the from and to types of the cast were compatible. As > nullability is already checked, it should be safe to add a cast without > converting the target datatype to nullable. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37581) sql hang at planning stage
[ https://issues.apache.org/jira/browse/SPARK-37581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ocean updated SPARK-37581: -- Priority: Critical (was: Major) > sql hang at planning stage > -- > > Key: SPARK-37581 > URL: https://issues.apache.org/jira/browse/SPARK-37581 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.0 >Reporter: ocean >Priority: Critical > > when exec a sql, this sql hang at planning stage. > when disable DPP, sql can finish normally. > we can reproduce this problem through example below: > create table test.test_a ( > day string, > week int, > weekday int) > partitioned by ( > dt varchar(8)) > stored as orc; > insert into test.test_a partition (dt=20211126) values('1',1,2); > create table test.test_b ( > session_id string, > device_id string, > brand string, > model string, > wx_version string, > os string, > net_work_type string, > app_id string, > app_name string, > col_z string, > page_url string, > page_title string, > olabel string, > otitle string, > source string, > send_dt string, > recv_dt string, > request_time string, > write_time string, > client_ip string, > col_a string, > dt_hour varchar(12), > product string, > channelfrom string, > customer_um string, > kb_code string, > col_b string, > rectype string, > errcode string, > col_c string, > pageid_merge string) > partitioned by ( > dt varchar(8)) > stored as orc; > insert into test.test_b partition(dt=20211126) > values('2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2'); > create table if not exists test.test_c stored as ORCFILE as > select calendar.day,calendar.week,calendar.weekday, a_kbs, > b_kbs, c_kbs,d_kbs,e_kbs,f_kbs,g_kbs,h_kbs,i_kbs, > j_kbs,k_kbs,l_kbs,m_kbs,n_kbs,o_kbs,p_kbs,q_kbs,r_kbs,s_kbs > from (select * from test.test_a where dt = '20211126') calendar > left join > (select dt,count(distinct kb_code) as a_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t1 > on calendar.dt = t1.dt > left join > (select dt,count(distinct kb_code) as b_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t2 > on calendar.dt = t2.dt > left join > (select dt,count(distinct kb_code) as c_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t3 > on calendar.dt = t3.dt > left join > (select dt,count(distinct kb_code) as d_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t4 > on calendar.dt = t4.dt > left join > (select dt,count(distinct kb_code) as e_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t5 > on calendar.dt = t5.dt > left join > (select dt,count(distinct kb_code) as f_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t6 > on calendar.dt = t6.dt > left join > (select dt,count(distinct kb_code) as g_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t7 > on calendar.dt = t7.dt > left join > (select dt,count(distinct 
kb_code) as h_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t8 > on calendar.dt = t8.dt > left join > (select dt,count(distinct kb_code) as i_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t9 > on calendar.dt = t9.dt > left join > (select dt,count(distinct kb_code) as j_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t10 > on calendar.dt = t10.dt > left join > (select dt,count(distinct kb_code) as k_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t11 > on calendar.dt = t11.dt > left join > (select dt,count(distinct kb_code) as l_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t12 > on calendar.dt = t12.dt > left join > (select dt,count(distinct kb_code) as m_kbs > from test.test_b > where dt = '20
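For context, "disabling DPP" in the report above usually means turning off Spark's dynamic partition pruning optimizer rule. A minimal sketch of that workaround, assuming {{spark.sql.optimizer.dynamicPartitionPruning.enabled}} is the switch the reporter toggled (the CTAS from the report is elided and left as a comment):

{code:python}
# Hedged sketch of the reported workaround: turn dynamic partition pruning off
# for the current session, run the query, then restore the default (on in 3.x).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-37581-workaround").getOrCreate()

spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "false")
# ... run the "create table if not exists test.test_c ... as select ..." statement
# from the report here; with DPP disabled it reportedly finishes planning normally ...
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
{code}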
[jira] [Resolved] (SPARK-37561) Avoid loading all functions when obtaining hive's DelegationToken
[ https://issues.apache.org/jira/browse/SPARK-37561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37561. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34822 [https://github.com/apache/spark/pull/34822] > Avoid loading all functions when obtaining hive's DelegationToken > - > > Key: SPARK-37561 > URL: https://issues.apache.org/jira/browse/SPARK-37561 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Fix For: 3.3.0 > > Attachments: getDelegationToken_load_functions.png > > > At present, when obtaining the delegationToken of hive, all functions will be > loaded. > This is unnecessary, it takes time to load the function, and it also > increases the burden on the hive meta store. > > !getDelegationToken_load_functions.png! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37561) Avoid loading all functions when obtaining hive's DelegationToken
[ https://issues.apache.org/jira/browse/SPARK-37561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37561: Assignee: dzcxzl > Avoid loading all functions when obtaining hive's DelegationToken > - > > Key: SPARK-37561 > URL: https://issues.apache.org/jira/browse/SPARK-37561 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Attachments: getDelegationToken_load_functions.png > > > At present, when obtaining the delegationToken of hive, all functions will be > loaded. > This is unnecessary, it takes time to load the function, and it also > increases the burden on the hive meta store. > > !getDelegationToken_load_functions.png! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31785) Add a helper function to test all parquet readers
[ https://issues.apache.org/jira/browse/SPARK-31785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456088#comment-17456088 ] Apache Spark commented on SPARK-31785: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/34840 > Add a helper function to test all parquet readers > - > > Key: SPARK-31785 > URL: https://issues.apache.org/jira/browse/SPARK-31785 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.0.0 > > > Add the method withAllParquetReaders {} which runs the block of code for all > supported parquet readers. And re-use it in test suites. This should de-dup > code, and allow OSS Spark based projects that have their own parquet readers > to re-use existing tests. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
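For reference, {{withAllParquetReaders}} is a Scala helper inside Spark's SQL test suites, so the snippet below is only a hedged PySpark analog of the same pattern, assuming {{spark.sql.parquet.enableVectorizedReader}} is the flag that selects between the vectorized reader and the parquet-mr reader:

{code:python}
# Hedged analog of the withAllParquetReaders idea: run the same check once per
# Parquet reader code path by flipping the vectorized-reader flag.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-reader-matrix").getOrCreate()

def with_all_parquet_readers(body):
    """Run `body` with the vectorized Parquet reader enabled, then disabled."""
    original = spark.conf.get("spark.sql.parquet.enableVectorizedReader", "true")
    try:
        for enabled in ("true", "false"):
            spark.conf.set("spark.sql.parquet.enableVectorizedReader", enabled)
            body()
    finally:
        spark.conf.set("spark.sql.parquet.enableVectorizedReader", original)

def check_roundtrip():
    # A trivial write/read assertion that should hold under both readers.
    spark.range(10).write.mode("overwrite").parquet("/tmp/parquet_reader_check")
    assert spark.read.parquet("/tmp/parquet_reader_check").count() == 10

with_all_parquet_readers(check_roundtrip)
{code}

This mirrors the de-duplication goal described in the issue: the test body is written once and exercised against every supported reader.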
[jira] [Commented] (SPARK-31785) Add a helper function to test all parquet readers
[ https://issues.apache.org/jira/browse/SPARK-31785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456090#comment-17456090 ] Apache Spark commented on SPARK-31785: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/34840 > Add a helper function to test all parquet readers > - > > Key: SPARK-31785 > URL: https://issues.apache.org/jira/browse/SPARK-31785 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.0.0 > > > Add the method withAllParquetReaders {} which runs the block of code for all > supported parquet readers. And re-use it in test suites. This should de-dup > code, and allow OSS Spark based projects that have their own parquet readers > to re-use existing tests. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37581) sql hang at planning stage
[ https://issues.apache.org/jira/browse/SPARK-37581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37581: - Priority: Major (was: Critical) > sql hang at planning stage > -- > > Key: SPARK-37581 > URL: https://issues.apache.org/jira/browse/SPARK-37581 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.0 >Reporter: ocean >Priority: Major > > when exec a sql, this sql hang at planning stage. > when disable DPP, sql can finish normally. > we can reproduce this problem through example below: > create table test.test_a ( > day string, > week int, > weekday int) > partitioned by ( > dt varchar(8)) > stored as orc; > insert into test.test_a partition (dt=20211126) values('1',1,2); > create table test.test_b ( > session_id string, > device_id string, > brand string, > model string, > wx_version string, > os string, > net_work_type string, > app_id string, > app_name string, > col_z string, > page_url string, > page_title string, > olabel string, > otitle string, > source string, > send_dt string, > recv_dt string, > request_time string, > write_time string, > client_ip string, > col_a string, > dt_hour varchar(12), > product string, > channelfrom string, > customer_um string, > kb_code string, > col_b string, > rectype string, > errcode string, > col_c string, > pageid_merge string) > partitioned by ( > dt varchar(8)) > stored as orc; > insert into test.test_b partition(dt=20211126) > values('2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2'); > create table if not exists test.test_c stored as ORCFILE as > select calendar.day,calendar.week,calendar.weekday, a_kbs, > b_kbs, c_kbs,d_kbs,e_kbs,f_kbs,g_kbs,h_kbs,i_kbs, > j_kbs,k_kbs,l_kbs,m_kbs,n_kbs,o_kbs,p_kbs,q_kbs,r_kbs,s_kbs > from (select * from test.test_a where dt = '20211126') calendar > left join > (select dt,count(distinct kb_code) as a_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t1 > on calendar.dt = t1.dt > left join > (select dt,count(distinct kb_code) as b_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t2 > on calendar.dt = t2.dt > left join > (select dt,count(distinct kb_code) as c_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t3 > on calendar.dt = t3.dt > left join > (select dt,count(distinct kb_code) as d_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t4 > on calendar.dt = t4.dt > left join > (select dt,count(distinct kb_code) as e_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t5 > on calendar.dt = t5.dt > left join > (select dt,count(distinct kb_code) as f_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t6 > on calendar.dt = t6.dt > left join > (select dt,count(distinct kb_code) as g_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t7 > on calendar.dt = t7.dt > left join > (select dt,count(distinct 
kb_code) as h_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t8 > on calendar.dt = t8.dt > left join > (select dt,count(distinct kb_code) as i_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t9 > on calendar.dt = t9.dt > left join > (select dt,count(distinct kb_code) as j_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t10 > on calendar.dt = t10.dt > left join > (select dt,count(distinct kb_code) as k_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t11 > on calendar.dt = t11.dt > left join > (select dt,count(distinct kb_code) as l_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t12 > on calendar.dt = t12.dt > left join > (select dt,count(distinct kb_code) as m_kbs > from test.test_b > whe
[jira] [Commented] (SPARK-37581) sql hang at planning stage
[ https://issues.apache.org/jira/browse/SPARK-37581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456091#comment-17456091 ] Hyukjin Kwon commented on SPARK-37581: -- [~oceaneast] do you mind narrowing down and make the reproducer smaller? The reproducer is too complicated to debug and investigate further. > sql hang at planning stage > -- > > Key: SPARK-37581 > URL: https://issues.apache.org/jira/browse/SPARK-37581 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.0 >Reporter: ocean >Priority: Critical > > when exec a sql, this sql hang at planning stage. > when disable DPP, sql can finish normally. > we can reproduce this problem through example below: > create table test.test_a ( > day string, > week int, > weekday int) > partitioned by ( > dt varchar(8)) > stored as orc; > insert into test.test_a partition (dt=20211126) values('1',1,2); > create table test.test_b ( > session_id string, > device_id string, > brand string, > model string, > wx_version string, > os string, > net_work_type string, > app_id string, > app_name string, > col_z string, > page_url string, > page_title string, > olabel string, > otitle string, > source string, > send_dt string, > recv_dt string, > request_time string, > write_time string, > client_ip string, > col_a string, > dt_hour varchar(12), > product string, > channelfrom string, > customer_um string, > kb_code string, > col_b string, > rectype string, > errcode string, > col_c string, > pageid_merge string) > partitioned by ( > dt varchar(8)) > stored as orc; > insert into test.test_b partition(dt=20211126) > values('2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2'); > create table if not exists test.test_c stored as ORCFILE as > select calendar.day,calendar.week,calendar.weekday, a_kbs, > b_kbs, c_kbs,d_kbs,e_kbs,f_kbs,g_kbs,h_kbs,i_kbs, > j_kbs,k_kbs,l_kbs,m_kbs,n_kbs,o_kbs,p_kbs,q_kbs,r_kbs,s_kbs > from (select * from test.test_a where dt = '20211126') calendar > left join > (select dt,count(distinct kb_code) as a_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t1 > on calendar.dt = t1.dt > left join > (select dt,count(distinct kb_code) as b_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t2 > on calendar.dt = t2.dt > left join > (select dt,count(distinct kb_code) as c_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t3 > on calendar.dt = t3.dt > left join > (select dt,count(distinct kb_code) as d_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t4 > on calendar.dt = t4.dt > left join > (select dt,count(distinct kb_code) as e_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t5 > on calendar.dt = t5.dt > left join > (select dt,count(distinct kb_code) as f_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t6 > on calendar.dt = t6.dt > left join > (select dt,count(distinct kb_code) as g_kbs > from test.test_b > where dt = '20211126' > 
and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t7 > on calendar.dt = t7.dt > left join > (select dt,count(distinct kb_code) as h_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t8 > on calendar.dt = t8.dt > left join > (select dt,count(distinct kb_code) as i_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t9 > on calendar.dt = t9.dt > left join > (select dt,count(distinct kb_code) as j_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t10 > on calendar.dt = t10.dt > left join > (select dt,count(distinct kb_code) as k_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) = '66' > and pageid_merge = 'a' > group by dt) t11 > on calendar.dt = t11.dt > left join > (select dt,count(distinct kb_code) as l_kbs > from test.test_b > where dt = '20211126' > and app_id in ('1','2') > and substr(kb_code,1,6) =
[jira] [Commented] (SPARK-37579) Called spark.sql multiple times, union multiple DataFrame, groupBy and pivot, join other table view cause exception
[ https://issues.apache.org/jira/browse/SPARK-37579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456092#comment-17456092 ] Hyukjin Kwon commented on SPARK-37579: -- Spark 2.4 is EOL. mind checking if that passes with Spark 3+? > Called spark.sql multiple times,union multiple DataFrame, groupBy and pivot, > join other table view cause exception > -- > > Key: SPARK-37579 > URL: https://issues.apache.org/jira/browse/SPARK-37579 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7 >Reporter: page >Priority: Major > > Possible steps to reproduce: > 1. Run spark.sql multiple times, get DataFrame list [d1, d2, d3, d4] > 2. Combine DataFrame list [d1, d2, d3, d4] to a DataFrame d5 by calling > Dataset#unionByName > 3. Run > {code:java} > d5.groupBy("c1").pivot("c2").agg(concat_ws(", ", collect_list("value"))){code} > ,produce DataFrame d6 > 4. DataFrame d6 join another DataFrame d7 > 5. Call function like count to trigger spark job > 6. Exception happend > > stack trace: > org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) > at > org.apache.spark.sql.execution.adaptive.QueryStage.executeChildStages(QueryStage.scala:88) > at > org.apache.spark.sql.execution.adaptive.QueryStage.prepareExecuteStage(QueryStage.scala:136) > at > org.apache.spark.sql.execution.adaptive.QueryStage.executeCollect(QueryStage.scala:242) > at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2837) > at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2836) > at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3441) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:92) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:139) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withAction(Dataset.scala:3440) > at org.apache.spark.sql.Dataset.count(Dataset.scala:2836) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.IllegalArgumentException: Can't zip RDDs with unequal > numbers of partitions: List(2, 1) > at > org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:269) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:269) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269) > at scala.Option.getOrElse(Option.scala:121) > at 
org.apache.spark.rdd.RDD.partitions(RDD.scala:269) > at org.apache.spark.ShuffleDependency.(Dependency.scala:94) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:361) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:69) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.eagerExecute(ShuffleExchangeExec.scala:112) > at > org.apache.spark.sql.execution.adaptive.ShuffleQueryStage.executeStage(QueryStage.scala:284) > at > org.apache.spark.sql.execution.adaptive.QueryStage.doExecute(QueryStage.scala:236) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:137) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:161) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:158) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:133) > at > org.a
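Since the reproduction above is described only in prose, here is a hedged PySpark skeleton of the five steps. The data and the final join are made-up stand-ins (the report does not include the actual queries); only the groupBy/pivot/agg expression is taken from the description:

{code:python}
# Hedged skeleton of the reported steps. On Spark 2.4 with adaptive execution
# the reporter hits "Can't zip RDDs with unequal numbers of partitions" at step 5.
from functools import reduce
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list, concat_ws

spark = SparkSession.builder.appName("spark-37579-sketch").getOrCreate()

# 1. Several query results (stand-ins for the repeated spark.sql calls).
frames = [
    spark.createDataFrame([("a", "x", "1")], ["c1", "c2", "value"]),
    spark.createDataFrame([("a", "y", "2")], ["c1", "c2", "value"]),
    spark.createDataFrame([("b", "x", "3")], ["c1", "c2", "value"]),
]

# 2. Combine the list into a single DataFrame with unionByName.
d5 = reduce(lambda left, right: left.unionByName(right), frames)

# 3. groupBy + pivot + concat_ws(collect_list(...)), as quoted in the report.
d6 = d5.groupBy("c1").pivot("c2").agg(concat_ws(", ", collect_list("value")))

# 4. Join another DataFrame (hypothetical), then 5. trigger a job with count().
d7 = spark.createDataFrame([("a", 1), ("b", 2)], ["c1", "k"])
d6.join(d7, "c1").count()
{code}

Filling in the real queries where the stand-ins are would make this a complete reproducer for re-testing on Spark 3+, as requested above.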
[jira] [Updated] (SPARK-37577) ClassCastException: ArrayType cannot be cast to StructType
[ https://issues.apache.org/jira/browse/SPARK-37577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37577: - Priority: Major (was: Critical) > ClassCastException: ArrayType cannot be cast to StructType > -- > > Key: SPARK-37577 > URL: https://issues.apache.org/jira/browse/SPARK-37577 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: Py: 3.9 >Reporter: Rafal Wojdyla >Priority: Major > > Reproduction: > {code:python} > import pyspark.sql.functions as F > from pyspark.sql.types import StructType, StructField, ArrayType, StringType > t = StructType([StructField('o', ArrayType(StructType([StructField('s', >StringType(), False), StructField('b', >ArrayType(StructType([StructField('e', StringType(), >False)]), True), False)]), True), False)]) > ( > spark.createDataFrame([], schema=t) > .select(F.explode("o").alias("eo")) > .select("eo.*") > .select(F.explode("b")) > .count() > ) > {code} > Code above works fine in 3.1.2, fails in 3.2.0. See stacktrace below. Note > that if you remove, field {{s}}, the code works fine, which is a bit > unexpected and likely a clue. > {noformat} > Py4JJavaError: An error occurred while calling o156.count. > : java.lang.ClassCastException: class org.apache.spark.sql.types.ArrayType > cannot be cast to class org.apache.spark.sql.types.StructType > (org.apache.spark.sql.types.ArrayType and > org.apache.spark.sql.types.StructType are in unnamed module of loader 'app') > at > org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema$lzycompute(complexTypeExtractors.scala:107) > at > org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema(complexTypeExtractors.scala:107) > at > org.apache.spark.sql.catalyst.expressions.GetStructField.$anonfun$extractFieldName$1(complexTypeExtractors.scala:117) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.catalyst.expressions.GetStructField.extractFieldName(complexTypeExtractors.scala:117) > at > org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1$$anonfun$2.applyOrElse(NestedColumnAliasing.scala:372) > at > org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1$$anonfun$2.applyOrElse(NestedColumnAliasing.scala:368) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$4(TreeNode.scala:539) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:539) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:508) > at > org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1.applyOrElse(NestedColumnAliasing.scala:368) > at > org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1.applyOrElse(NestedColumnAliasing.scala:366) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsDownWithPruning$1(QueryPlan.scala:152) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDownWithPruning(QueryPlan.scala:152) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsWithPruning(QueryPlan.scala:123) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:101) > at > org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$.unapply(NestedColumnAliasing.scala:366) > at > org.apache.spark.sql.catalyst.optimizer.ColumnPruning$$anonfun$apply$1
[jira] [Commented] (SPARK-37577) ClassCastException: ArrayType cannot be cast to StructType
[ https://issues.apache.org/jira/browse/SPARK-37577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456093#comment-17456093 ] Hyukjin Kwon commented on SPARK-37577: -- It works if you disable: {code} spark.conf.set("spark.sql.optimizer.expression.nestedPruning.enabled", False) spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", False) {code} cc [~viirya] FYI. > ClassCastException: ArrayType cannot be cast to StructType > -- > > Key: SPARK-37577 > URL: https://issues.apache.org/jira/browse/SPARK-37577 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: Py: 3.9 >Reporter: Rafal Wojdyla >Priority: Major > > Reproduction: > {code:python} > import pyspark.sql.functions as F > from pyspark.sql.types import StructType, StructField, ArrayType, StringType > t = StructType([StructField('o', ArrayType(StructType([StructField('s', >StringType(), False), StructField('b', >ArrayType(StructType([StructField('e', StringType(), >False)]), True), False)]), True), False)]) > ( > spark.createDataFrame([], schema=t) > .select(F.explode("o").alias("eo")) > .select("eo.*") > .select(F.explode("b")) > .count() > ) > {code} > Code above works fine in 3.1.2, fails in 3.2.0. See stacktrace below. Note > that if you remove, field {{s}}, the code works fine, which is a bit > unexpected and likely a clue. > {noformat} > Py4JJavaError: An error occurred while calling o156.count. > : java.lang.ClassCastException: class org.apache.spark.sql.types.ArrayType > cannot be cast to class org.apache.spark.sql.types.StructType > (org.apache.spark.sql.types.ArrayType and > org.apache.spark.sql.types.StructType are in unnamed module of loader 'app') > at > org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema$lzycompute(complexTypeExtractors.scala:107) > at > org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema(complexTypeExtractors.scala:107) > at > org.apache.spark.sql.catalyst.expressions.GetStructField.$anonfun$extractFieldName$1(complexTypeExtractors.scala:117) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.catalyst.expressions.GetStructField.extractFieldName(complexTypeExtractors.scala:117) > at > org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1$$anonfun$2.applyOrElse(NestedColumnAliasing.scala:372) > at > org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1$$anonfun$2.applyOrElse(NestedColumnAliasing.scala:368) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$4(TreeNode.scala:539) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:539) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:508) > at > org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1.applyOrElse(NestedColumnAliasing.scala:368) > at > org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1.applyOrElse(NestedColumnAliasing.scala:366) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsDownWithPruning$1(QueryPlan.scala:152) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDownWithPruning(QueryPlan.scala:152) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsWithPruning(QueryPlan.scala:123) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions
[jira] [Comment Edited] (SPARK-37577) ClassCastException: ArrayType cannot be cast to StructType
[ https://issues.apache.org/jira/browse/SPARK-37577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456093#comment-17456093 ] Hyukjin Kwon edited comment on SPARK-37577 at 12/9/21, 2:16 AM: It works if you disable: {code} spark.conf.set("spark.sql.optimizer.expression.nestedPruning.enabled", False) spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", False) {code} cc [~viirya] FYI. log: {code} === Applying Rule org.apache.spark.sql.catalyst.optimizer.ColumnPruning === Aggregate [count(1) AS count#22L] Aggregate [count(1) AS count#22L] !+- Project [col#9]+- Project ! +- Generate explode(b#6), false, [col#9] +- Project ! +- Project [eo#3.s AS s#5, eo#3.b AS b#6] +- Generate explode(b#6), [0], false, [col#9] ! +- Project [eo#3] +- Project [b#6] !+- Generate explode(o#0), false, [eo#3] +- Project [eo#3.b AS b#6] ! +- LogicalRDD [o#0], false+- Project [eo#3] !+- Generate explode(o#0), [0], false, [eo#3] ! +- LogicalRDD [o#0], false 21/12/09 11:14:50 TRACE PlanChangeLogger: === Applying Rule org.apache.spark.sql.catalyst.optimizer.CollapseProject === Aggregate [count(1) AS count#22L]Aggregate [count(1) AS count#22L] +- Project +- Project ! +- Project +- Generate explode(b#6), [0], false, [col#9] ! +- Generate explode(b#6), [0], false, [col#9]+- Project [eo#3.b AS b#6] ! +- Project [b#6] +- Generate explode(o#0), [0], false, [eo#3] !+- Project [eo#3.b AS b#6] +- LogicalRDD [o#0], false ! +- Project [eo#3] ! +- Generate explode(o#0), [0], false, [eo#3] ! +- LogicalRDD [o#0], false {code} was (Author: hyukjin.kwon): It works if you disable: {code} spark.conf.set("spark.sql.optimizer.expression.nestedPruning.enabled", False) spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", False) {code} cc [~viirya] FYI. > ClassCastException: ArrayType cannot be cast to StructType > -- > > Key: SPARK-37577 > URL: https://issues.apache.org/jira/browse/SPARK-37577 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: Py: 3.9 >Reporter: Rafal Wojdyla >Priority: Major > > Reproduction: > {code:python} > import pyspark.sql.functions as F > from pyspark.sql.types import StructType, StructField, ArrayType, StringType > t = StructType([StructField('o', ArrayType(StructType([StructField('s', >StringType(), False), StructField('b', >ArrayType(StructType([StructField('e', StringType(), >False)]), True), False)]), True), False)]) > ( > spark.createDataFrame([], schema=t) > .select(F.explode("o").alias("eo")) > .select("eo.*") > .select(F.explode("b")) > .count() > ) > {code} > Code above works fine in 3.1.2, fails in 3.2.0. See stacktrace below. Note > that if you remove, field {{s}}, the code works fine, which is a bit > unexpected and likely a clue. > {noformat} > Py4JJavaError: An error occurred while calling o156.count. 
> : java.lang.ClassCastException: class org.apache.spark.sql.types.ArrayType > cannot be cast to class org.apache.spark.sql.types.StructType > (org.apache.spark.sql.types.ArrayType and > org.apache.spark.sql.types.StructType are in unnamed module of loader 'app') > at > org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema$lzycompute(complexTypeExtractors.scala:107) > at > org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema(complexTypeExtractors.scala:107) > at > org.apache.spark.sql.catalyst.expressions.GetStructField.$anonfun$extractFieldName$1(complexTypeExtractors.scala:117) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.catalyst.expressions.GetStructField.extractFieldName(complexTypeExtractors.scala:117) > at > org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1$$anonfun$2.applyOrElse(NestedColumnAliasing.scala:372) > at > org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1$$anonfun$2.applyOrElse(NestedColumnAliasing.scala:368) > at > org.apache.spark.sql.cataly
[jira] [Created] (SPARK-37588) lot of strings get accumulated in the heap dump of spark thrift server
ramakrishna chilaka created SPARK-37588: --- Summary: lot of strings get accumulated in the heap dump of spark thrift server Key: SPARK-37588 URL: https://issues.apache.org/jira/browse/SPARK-37588 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 3.2.0 Environment: Open JDK (8 build 1.8.0_312-b07) and scala 12.12 OS: Red Hat Enterprise Linux 8.4 (Ootpa), platform:el8 Reporter: ramakrishna chilaka I am starting spark thrift server using the following options ``` /data/spark/sbin/start-thriftserver.sh --master spark://*:7077 --conf "spark.cores.max=320" --conf "spark.executor.cores=3" --conf "spark.driver.cores=15" --executor-memory=10G --driver-memory=50G --conf spark.sql.adaptive.coalescePartitions.enabled=true --conf spark.sql.adaptive.skewJoin.enabled=true --conf spark.sql.cbo.enabled=true --conf spark.sql.adaptive.enabled=true --conf spark.rpc.io.serverThreads=64 --conf "spark.driver.maxResultSize=4G" --conf "spark.max.fetch.failures.per.stage=10" --conf "spark.sql.thriftServer.incrementalCollect=false" --conf "spark.ui.reverseProxy=true" --conf "spark.ui.reverseProxyUrl=/spark_ui" --conf "spark.sql.autoBroadcastJoinThreshold=1073741824" --conf spark.sql.thriftServer.interruptOnCancel=true --conf spark.sql.thriftServer.queryTimeout=0 --hiveconf hive.server2.transport.mode=http --hiveconf hive.server2.thrift.http.path=spark_sql --hiveconf hive.server2.thrift.min.worker.threads=500 --hiveconf hive.server2.thrift.max.worker.threads=2147483647 --hiveconf hive.server2.thrift.http.cookie.is.secure=false --hiveconf hive.server2.thrift.http.cookie.auth.enabled=false --hiveconf hive.server2.authentication=NONE --hiveconf hive.server2.enable.doAs=false --hiveconf spark.sql.hive.thriftServer.singleSession=true --hiveconf hive.server2.thrift.bind.host=0.0.0.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" --conf "spark.sql.cbo.joinReorder.enabled=true" --conf "spark.sql.optimizer.dynamicPartitionPruning.enabled=true" --conf "spark.worker.cleanup.enabled=true" --conf "spark.worker.cleanup.appDataTtl=3600" --hiveconf hive.exec.scratchdir=/data/spark_scratch/hive --hiveconf hive.exec.local.scratchdir=/data/spark_scratch/local_scratch_dir --hiveconf hive.download.resources.dir=/data/spark_scratch/hive.downloaded.resources.dir --hiveconf hive.querylog.location=/data/spark_scratch/hive.querylog.location --conf spark.executor.extraJavaOptions="-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" --conf spark.driver.extraJavaOptions="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:/data/thrift_driver_gc.log -XX:+ExplicitGCInvokesConcurrent -XX:MinHeapFreeRatio=20 -XX:MaxHeapFreeRatio=40 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -XX:MaxRAM=55g" --hiveconf "hive.server2.session.check.interval=6" --hiveconf "hive.server2.idle.session.timeout=90" --hiveconf "hive.server2.idle.session.check.operation=true" --conf "spark.eventLog.enabled=false" --conf "spark.cleaner.periodicGC.interval=5min" --conf "spark.appStateStore.asyncTracking.enable=false" --conf "spark.ui.retainedJobs=30" --conf "spark.ui.retainedStages=100" --conf "spark.ui.retainedTasks=500" --conf "spark.sql.ui.retainedExecutions=10" --conf "spark.ui.retainedDeadExecutors=10" --conf "spark.worker.ui.retainedExecutors=10" --conf "spark.worker.ui.retainedDrivers=10" --conf spark.ui.enabled=false --conf spark.stage.maxConsecutiveAttempts=10 --conf spark.executor.memoryOverhead=1G 
--conf "spark.io.compression.codec=snappy" --conf "spark.default.parallelism=640" --conf spark.memory.offHeap.enabled=true --conf "spark.memory.offHeap.size=3g" --conf "spark.memory.fraction=0.75" --conf "spark.memory.storageFraction=0.75" ``` the java heap dump after heavy usage is as follows ``` 1: 50465861 9745837152 [C 2: 23337896 1924089944 [Ljava.lang.Object; 3: 72524905 1740597720 java.lang.Long 4: 50463694 1614838208 java.lang.String 5: 22718029 726976928 org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema 6: 2259416 343483328 [Lscala.collection.mutable.HashEntry; 7:16 141744616 [Lorg.apache.spark.sql.Row; 8:532529 123546728 org.apache.spark.sql.catalyst.expressions.Cast 9:535418 72816848 org.apache.spark.sql.catalyst.expressions.Literal 10: 1105284 70738176 scala.collection.mutable.LinkedHashSet 11: 1725833 70655016 [J 12: 1154128 55398144 scala.collection.mutable.HashMap 13: 1720740 55063680 org.apache.spark.util.collection.BitSet 14:57 50355536 scala.collection.immutable.V
[jira] [Updated] (SPARK-37588) lot of strings get accumulated in the heap dump of spark thrift server
[ https://issues.apache.org/jira/browse/SPARK-37588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramakrishna chilaka updated SPARK-37588: Description: I am starting spark thrift server using the following options ``` /data/spark/sbin/start-thriftserver.sh --master spark://*:7077 --conf "spark.cores.max=320" --conf "spark.executor.cores=3" --conf "spark.driver.cores=15" --executor-memory=10G --driver-memory=50G --conf spark.sql.adaptive.coalescePartitions.enabled=true --conf spark.sql.adaptive.skewJoin.enabled=true --conf spark.sql.cbo.enabled=true --conf spark.sql.adaptive.enabled=true --conf spark.rpc.io.serverThreads=64 --conf "spark.driver.maxResultSize=4G" --conf "spark.max.fetch.failures.per.stage=10" --conf "spark.sql.thriftServer.incrementalCollect=false" --conf "spark.ui.reverseProxy=true" --conf "spark.ui.reverseProxyUrl=/spark_ui" --conf "spark.sql.autoBroadcastJoinThreshold=1073741824" --conf spark.sql.thriftServer.interruptOnCancel=true --conf spark.sql.thriftServer.queryTimeout=0 --hiveconf hive.server2.transport.mode=http --hiveconf hive.server2.thrift.http.path=spark_sql --hiveconf hive.server2.thrift.min.worker.threads=500 --hiveconf hive.server2.thrift.max.worker.threads=2147483647 --hiveconf hive.server2.thrift.http.cookie.is.secure=false --hiveconf hive.server2.thrift.http.cookie.auth.enabled=false --hiveconf hive.server2.authentication=NONE --hiveconf hive.server2.enable.doAs=false --hiveconf spark.sql.hive.thriftServer.singleSession=true --hiveconf hive.server2.thrift.bind.host=0.0.0.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" --conf "spark.sql.cbo.joinReorder.enabled=true" --conf "spark.sql.optimizer.dynamicPartitionPruning.enabled=true" --conf "spark.worker.cleanup.enabled=true" --conf "spark.worker.cleanup.appDataTtl=3600" --hiveconf hive.exec.scratchdir=/data/spark_scratch/hive --hiveconf hive.exec.local.scratchdir=/data/spark_scratch/local_scratch_dir --hiveconf hive.download.resources.dir=/data/spark_scratch/hive.downloaded.resources.dir --hiveconf hive.querylog.location=/data/spark_scratch/hive.querylog.location --conf spark.executor.extraJavaOptions="-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" --conf spark.driver.extraJavaOptions="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:/data/thrift_driver_gc.log -XX:+ExplicitGCInvokesConcurrent -XX:MinHeapFreeRatio=20 -XX:MaxHeapFreeRatio=40 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -XX:MaxRAM=55g" --hiveconf "hive.server2.session.check.interval=6" --hiveconf "hive.server2.idle.session.timeout=90" --hiveconf "hive.server2.idle.session.check.operation=true" --conf "spark.eventLog.enabled=false" --conf "spark.cleaner.periodicGC.interval=5min" --conf "spark.appStateStore.asyncTracking.enable=false" --conf "spark.ui.retainedJobs=30" --conf "spark.ui.retainedStages=100" --conf "spark.ui.retainedTasks=500" --conf "spark.sql.ui.retainedExecutions=10" --conf "spark.ui.retainedDeadExecutors=10" --conf "spark.worker.ui.retainedExecutors=10" --conf "spark.worker.ui.retainedDrivers=10" --conf spark.ui.enabled=false --conf spark.stage.maxConsecutiveAttempts=10 --conf spark.executor.memoryOverhead=1G --conf "spark.io.compression.codec=snappy" --conf "spark.default.parallelism=640" --conf spark.memory.offHeap.enabled=true --conf "spark.memory.offHeap.size=3g" --conf "spark.memory.fraction=0.75" --conf "spark.memory.storageFraction=0.75" ``` the java 
heap dump after heavy usage is as follows ``` 1: 50465861 9745837152 [C 2: 23337896 1924089944 [Ljava.lang.Object; 3: 72524905 1740597720 java.lang.Long 4: 50463694 1614838208 java.lang.String 5: 22718029 726976928 org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema 6: 2259416 343483328 [Lscala.collection.mutable.HashEntry; 7:16 141744616 [Lorg.apache.spark.sql.Row; 8:532529 123546728 org.apache.spark.sql.catalyst.expressions.Cast 9:535418 72816848 org.apache.spark.sql.catalyst.expressions.Literal 10: 1105284 70738176 scala.collection.mutable.LinkedHashSet 11: 1725833 70655016 [J 12: 1154128 55398144 scala.collection.mutable.HashMap 13: 1720740 55063680 org.apache.spark.util.collection.BitSet 14:57 50355536 scala.collection.immutable.Vector 15: 1602297 38455128 scala.Some 16: 1154303 36937696 scala.collection.immutable.$colon$colon 17: 1105284 26526816 org.apache.spark.sql.catalyst.expressions.AttributeSet 18: 1066442 25594608 java.lang.Integer 19:735502 23536064 scala.collection.immutable.HashSet$H
[jira] [Updated] (SPARK-37572) Flexible ways of launching executors
[ https://issues.apache.org/jira/browse/SPARK-37572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dagang Wei updated SPARK-37572: --- Description: Currently Spark launches executor processes by constructing and running commands [1], for example: {code:java} /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp /opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* -Xmx1024M -Dspark.driver.port=35729 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://coarsegrainedschedu...@dagang.svl.corp.google.com:35729 --executor-id 0 --hostname 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 --worker-url spark://Worker@100.116.124.193:45287 {code} But there are use cases which require more flexible ways of launching executors. In particular, our use case is that we run Spark in standalone mode, Spark master and workers are running in VMs. We want to allow Spark app developers to provide custom container images to customize the job runtime environment (typically Java and Python dependencies), so executors (which run the job code) need to run in Docker containers. After reading the source code, we found that the concept of Spark Command Runner might be a good solution. Basically, we want to introduce an optional Spark command runner in Spark, so that instead of running the command to launch executor directly, it passes the command to the runner, the runner then runs the command with its own strategy which could be running in Docker, or by default running the command directly. The runner should be customizable through an env variable `SPARK_COMMAND_RUNNER`, which by default could be a simple script like: {code:java} #!/bin/bash exec "$@" {code} or in the case of Docker container: {code:java} #!/bin/bash docker run ... – "$@" {code} I already have a patch for this feature and have tested in our environment. [1]: [https://github.com/apache/spark/blob/v3.2.0/core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala#L52] was: Currently Spark launches executor processes by constructing and running commands [1], for example: {code:java} /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp /opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* -Xmx1024M -Dspark.driver.port=35729 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://coarsegrainedschedu...@dagang.svl.corp.google.com:35729 --executor-id 0 --hostname 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 --worker-url spark://Worker@100.116.124.193:45287 {code} But there are use cases which require more flexible ways of launching executors. In particular, our use case is that we run Spark in standalone mode, Spark master and workers are running in VMs. We want to allow Spark app developers to provide custom container images to customize the job runtime environment (typically Java and Python dependencies), so executors (which run the job code) need to run in Docker containers. After reading the source code, we found that the concept of Spark Command Runner might be a good solution. Basically, we want to introduce an optional Spark command runner in Spark, so that instead of running the command to launch executor directly, it passes the command to the runner, which the runner then runs the command with its own strategy which could be running in Docker, or by default running the command directly. 
The runner should be customizable through an env variable `SPARK_COMMAND_RUNNER`, which by default could be a simple script like: {code:java} #!/bin/bash exec "$@" {code} or in the case of Docker container: {code:java} #!/bin/bash docker run ... – "$@" {code} I already have a patch for this feature and have tested in our environment. [1]: [https://github.com/apache/spark/blob/v3.2.0/core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala#L52] > Flexible ways of launching executors > > > Key: SPARK-37572 > URL: https://issues.apache.org/jira/browse/SPARK-37572 > Project: Spark > Issue Type: New Feature > Components: Deploy >Affects Versions: 3.2.0 >Reporter: Dagang Wei >Priority: Major > > Currently Spark launches executor processes by constructing and running > commands [1], for example: > {code:java} > /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp > /opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* > -Xmx1024M -Dspark.driver.port=35729 > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://coarsegrainedschedu...@dagang.svl.corp.google.com:35729 --executor-id > 0 --hostname 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 > --worker-url spark://Worker@100.116.124.193:45287 {code} > But there are use cases which requ
[jira] [Updated] (SPARK-37572) Flexible ways of launching executors
[ https://issues.apache.org/jira/browse/SPARK-37572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dagang Wei updated SPARK-37572: --- Description: Currently Spark launches executor processes by constructing and running commands [1], for example: {code:java} /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp /opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* -Xmx1024M -Dspark.driver.port=35729 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://coarsegrainedschedu...@dagang.svl.corp.google.com:35729 --executor-id 0 --hostname 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 --worker-url spark://Worker@100.116.124.193:45287 {code} But there are use cases which require more flexible ways of launching executors. In particular, our use case is that we run Spark in standalone mode, Spark master and workers are running in VMs. We want to allow Spark app developers to provide custom container images to customize the job runtime environment (typically Java and Python dependencies), so executors (which run the job code) need to run in Docker containers. After reading the source code, we found that the concept of Spark Command Runner might be a good solution. Basically, we want to introduce an optional Spark command runner in Spark, so that instead of running the command to launch executor directly, it passes the command to the runner, which the runner then runs the command with its own strategy which could be running in Docker, or by default running the command directly. The runner should be customizable through an env variable `SPARK_COMMAND_RUNNER`, which by default could be a simple script like: {code:java} #!/bin/bash exec "$@" {code} or in the case of Docker container: {code:java} #!/bin/bash docker run ... – "$@" {code} I already have a patch for this feature and have tested in our environment. [1]: [https://github.com/apache/spark/blob/v3.2.0/core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala#L52] was: Currently Spark launches executor processes by constructing and running commands [1], for example: {code:java} /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp /opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* -Xmx1024M -Dspark.driver.port=35729 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://coarsegrainedschedu...@dagang.svl.corp.google.com:35729 --executor-id 0 --hostname 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 --worker-url spark://Worker@100.116.124.193:45287 {code} But there are use cases which require more flexible ways of launching executors. In particular, our use case is that we run Spark in standalone mode, Spark master and workers are running in VMs. We want to allow Spark app developers to provide custom container images to customize the job runtime environment (typically Java and Python dependencies), so executors (which run the job code) need to run in Docker containers. After investigating in the source code, we found that the concept of Spark Command Runner might be a good solution. Basically, we want to introduce an optional Spark command runner in Spark, so that instead of running the command to launch executor directly, it passes the command to the runner, which the runner then runs the command with its own strategy which could be running in Docker, or by default running the command directly. 
The runner should be customizable through an env variable `SPARK_COMMAND_RUNNER`, which by default could be a simple script like: {code:java} #!/bin/bash exec "$@" {code} or in the case of Docker container: {code:java} #!/bin/bash docker run ... – "$@" {code} I already have a patch for this feature and have tested in our environment. [1]: [https://github.com/apache/spark/blob/v3.2.0/core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala#L52] > Flexible ways of launching executors > > > Key: SPARK-37572 > URL: https://issues.apache.org/jira/browse/SPARK-37572 > Project: Spark > Issue Type: New Feature > Components: Deploy >Affects Versions: 3.2.0 >Reporter: Dagang Wei >Priority: Major > > Currently Spark launches executor processes by constructing and running > commands [1], for example: > {code:java} > /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp > /opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* > -Xmx1024M -Dspark.driver.port=35729 > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://coarsegrainedschedu...@dagang.svl.corp.google.com:35729 --executor-id > 0 --hostname 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 > --worker-url spark://Worker@100.116.124.193:45287 {code} > But there are use c
[jira] [Updated] (SPARK-37572) Flexible ways of launching executors
[ https://issues.apache.org/jira/browse/SPARK-37572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dagang Wei updated SPARK-37572: --- Description: Currently Spark launches executor processes by constructing and running commands [1], for example: {code:java} /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp /opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* -Xmx1024M -Dspark.driver.port=35729 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://coarsegrainedschedu...@dagang.svl.corp.google.com:35729 --executor-id 0 --hostname 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 --worker-url spark://Worker@100.116.124.193:45287 {code} But there are use cases which require more flexible ways of launching executors. In particular, our use case is that we run Spark in standalone mode, Spark master and workers are running in VMs. We want to allow Spark app developers to provide custom container images to customize the job runtime environment (typically Java and Python dependencies), so executors (which run the job code) need to run in Docker containers. After reading the source code, we found that the concept of Spark Command Runner might be a good solution. Basically, we want to introduce an optional Spark command runner in Spark, so that instead of running the command to launch executor directly, it passes the command to the runner, the runner then runs the command with its own strategy which could be running in Docker, or by default running the command directly. The runner is set through an env variable `SPARK_COMMAND_RUNNER`, which by default could be a simple script like: {code:java} #!/bin/bash exec "$@" {code} or in the case of Docker container: {code:java} #!/bin/bash docker run ... – "$@" {code} I already have a patch for this feature and have tested in our environment. [1]: [https://github.com/apache/spark/blob/v3.2.0/core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala#L52] was: Currently Spark launches executor processes by constructing and running commands [1], for example: {code:java} /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp /opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* -Xmx1024M -Dspark.driver.port=35729 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://coarsegrainedschedu...@dagang.svl.corp.google.com:35729 --executor-id 0 --hostname 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 --worker-url spark://Worker@100.116.124.193:45287 {code} But there are use cases which require more flexible ways of launching executors. In particular, our use case is that we run Spark in standalone mode, Spark master and workers are running in VMs. We want to allow Spark app developers to provide custom container images to customize the job runtime environment (typically Java and Python dependencies), so executors (which run the job code) need to run in Docker containers. After reading the source code, we found that the concept of Spark Command Runner might be a good solution. Basically, we want to introduce an optional Spark command runner in Spark, so that instead of running the command to launch executor directly, it passes the command to the runner, the runner then runs the command with its own strategy which could be running in Docker, or by default running the command directly. 
The runner should be customizable through an env variable `SPARK_COMMAND_RUNNER`, which by default could be a simple script like: {code:java} #!/bin/bash exec "$@" {code} or in the case of Docker container: {code:java} #!/bin/bash docker run ... – "$@" {code} I already have a patch for this feature and have tested in our environment. [1]: [https://github.com/apache/spark/blob/v3.2.0/core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala#L52] > Flexible ways of launching executors > > > Key: SPARK-37572 > URL: https://issues.apache.org/jira/browse/SPARK-37572 > Project: Spark > Issue Type: New Feature > Components: Deploy >Affects Versions: 3.2.0 >Reporter: Dagang Wei >Priority: Major > > Currently Spark launches executor processes by constructing and running > commands [1], for example: > {code:java} > /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp > /opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* > -Xmx1024M -Dspark.driver.port=35729 > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://coarsegrainedschedu...@dagang.svl.corp.google.com:35729 --executor-id > 0 --hostname 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 > --worker-url spark://Worker@100.116.124.193:45287 {code} > But there are use cases which require more flexible ways
[jira] [Updated] (SPARK-37588) A lot of strings get accumulated in the heap dump of Spark Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-37588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramakrishna chilaka updated SPARK-37588: Environment: Open JDK (8 build 1.8.0_312-b07) and scala 2.12 OS: Red Hat Enterprise Linux 8.4 (Ootpa), platform:el8 was: Open JDK (8 build 1.8.0_312-b07) and scala 12.12 OS: Red Hat Enterprise Linux 8.4 (Ootpa), platform:el8 > lot of strings get accumulated in the heap dump of spark thrift server > -- > > Key: SPARK-37588 > URL: https://issues.apache.org/jira/browse/SPARK-37588 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.2.0 > Environment: Open JDK (8 build 1.8.0_312-b07) and scala 2.12 > OS: Red Hat Enterprise Linux 8.4 (Ootpa), platform:el8 >Reporter: ramakrishna chilaka >Priority: Major > > I am starting spark thrift server using the following options > ``` > /data/spark/sbin/start-thriftserver.sh --master spark://*:7077 --conf > "spark.cores.max=320" --conf "spark.executor.cores=3" --conf > "spark.driver.cores=15" --executor-memory=10G --driver-memory=50G --conf > spark.sql.adaptive.coalescePartitions.enabled=true --conf > spark.sql.adaptive.skewJoin.enabled=true --conf spark.sql.cbo.enabled=true > --conf spark.sql.adaptive.enabled=true --conf spark.rpc.io.serverThreads=64 > --conf "spark.driver.maxResultSize=4G" --conf > "spark.max.fetch.failures.per.stage=10" --conf > "spark.sql.thriftServer.incrementalCollect=false" --conf > "spark.ui.reverseProxy=true" --conf "spark.ui.reverseProxyUrl=/spark_ui" > --conf "spark.sql.autoBroadcastJoinThreshold=1073741824" --conf > spark.sql.thriftServer.interruptOnCancel=true --conf > spark.sql.thriftServer.queryTimeout=0 --hiveconf > hive.server2.transport.mode=http --hiveconf > hive.server2.thrift.http.path=spark_sql --hiveconf > hive.server2.thrift.min.worker.threads=500 --hiveconf > hive.server2.thrift.max.worker.threads=2147483647 --hiveconf > hive.server2.thrift.http.cookie.is.secure=false --hiveconf > hive.server2.thrift.http.cookie.auth.enabled=false --hiveconf > hive.server2.authentication=NONE --hiveconf hive.server2.enable.doAs=false > --hiveconf spark.sql.hive.thriftServer.singleSession=true --hiveconf > hive.server2.thrift.bind.host=0.0.0.0 --conf > "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf > "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" > --conf "spark.sql.cbo.joinReorder.enabled=true" --conf > "spark.sql.optimizer.dynamicPartitionPruning.enabled=true" --conf > "spark.worker.cleanup.enabled=true" --conf > "spark.worker.cleanup.appDataTtl=3600" --hiveconf > hive.exec.scratchdir=/data/spark_scratch/hive --hiveconf > hive.exec.local.scratchdir=/data/spark_scratch/local_scratch_dir --hiveconf > hive.download.resources.dir=/data/spark_scratch/hive.downloaded.resources.dir > --hiveconf hive.querylog.location=/data/spark_scratch/hive.querylog.location > --conf spark.executor.extraJavaOptions="-XX:+PrintGCDetails > -XX:+PrintGCTimeStamps" --conf spark.driver.extraJavaOptions="-verbose:gc > -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps > -Xloggc:/data/thrift_driver_gc.log -XX:+ExplicitGCInvokesConcurrent > -XX:MinHeapFreeRatio=20 -XX:MaxHeapFreeRatio=40 -XX:GCTimeRatio=4 > -XX:AdaptiveSizePolicyWeight=90 -XX:MaxRAM=55g" --hiveconf > "hive.server2.session.check.interval=6" --hiveconf > "hive.server2.idle.session.timeout=90" --hiveconf > "hive.server2.idle.session.check.operation=true" --conf > "spark.eventLog.enabled=false" --conf > 
"spark.cleaner.periodicGC.interval=5min" --conf > "spark.appStateStore.asyncTracking.enable=false" --conf > "spark.ui.retainedJobs=30" --conf "spark.ui.retainedStages=100" --conf > "spark.ui.retainedTasks=500" --conf "spark.sql.ui.retainedExecutions=10" > --conf "spark.ui.retainedDeadExecutors=10" --conf > "spark.worker.ui.retainedExecutors=10" --conf > "spark.worker.ui.retainedDrivers=10" --conf spark.ui.enabled=false --conf > spark.stage.maxConsecutiveAttempts=10 --conf spark.executor.memoryOverhead=1G > --conf "spark.io.compression.codec=snappy" --conf > "spark.default.parallelism=640" --conf spark.memory.offHeap.enabled=true > --conf "spark.memory.offHeap.size=3g" --conf "spark.memory.fraction=0.75" > --conf "spark.memory.storageFraction=0.75" > ``` > the java heap dump after heavy usage is as follows > ``` > 1: 50465861 9745837152 [C >2: 23337896 1924089944 [Ljava.lang.Object; >3: 72524905 1740597720 java.lang.Long >4: 50463694 1614838208 java.lang.String >5: 22718029 726976928 > org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema >6: 2259416 343483328 [Lscala.colle
[jira] [Updated] (SPARK-37572) Flexible ways of launching executors
[ https://issues.apache.org/jira/browse/SPARK-37572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dagang Wei updated SPARK-37572: --- Description: Currently Spark launches executor processes by constructing and running commands [1], for example: {code:java} /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp /opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* -Xmx1024M -Dspark.driver.port=35729 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://coarsegrainedschedu...@dagang.svl.corp.google.com:35729 --executor-id 0 --hostname 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 --worker-url spark://Worker@100.116.124.193:45287 {code} But there are use cases which require more flexible ways of launching executors. In particular, our use case is that we run Spark in standalone mode, Spark master and workers are running in VMs. We want to allow Spark app developers to provide custom container images to customize the job runtime environment (typically Java and Python dependencies), so executors (which run the job code) need to run in Docker containers. After reading the source code, we found that the concept of Spark Command Runner might be a good solution. Basically, we want to introduce an optional Spark command runner in Spark, so that instead of running the command to launch executor directly, it passes the command to the runner, the runner then runs the command with its own strategy which could be running in Docker, or by default running the command directly. The runner is specified through an env variable `SPARK_COMMAND_RUNNER`, which by default could be a simple script like: {code:java} #!/bin/bash exec "$@" {code} or in the case of Docker container: {code:java} #!/bin/bash docker run ... – "$@" {code} I already have a patch for this feature and have tested in our environment. [1]: [https://github.com/apache/spark/blob/v3.2.0/core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala#L52] was: Currently Spark launches executor processes by constructing and running commands [1], for example: {code:java} /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp /opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* -Xmx1024M -Dspark.driver.port=35729 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://coarsegrainedschedu...@dagang.svl.corp.google.com:35729 --executor-id 0 --hostname 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 --worker-url spark://Worker@100.116.124.193:45287 {code} But there are use cases which require more flexible ways of launching executors. In particular, our use case is that we run Spark in standalone mode, Spark master and workers are running in VMs. We want to allow Spark app developers to provide custom container images to customize the job runtime environment (typically Java and Python dependencies), so executors (which run the job code) need to run in Docker containers. After reading the source code, we found that the concept of Spark Command Runner might be a good solution. Basically, we want to introduce an optional Spark command runner in Spark, so that instead of running the command to launch executor directly, it passes the command to the runner, the runner then runs the command with its own strategy which could be running in Docker, or by default running the command directly. 
The runner is set through an env variable `SPARK_COMMAND_RUNNER`, which by default could be a simple script like: {code:java} #!/bin/bash exec "$@" {code} or in the case of Docker container: {code:java} #!/bin/bash docker run ... – "$@" {code} I already have a patch for this feature and have tested in our environment. [1]: [https://github.com/apache/spark/blob/v3.2.0/core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala#L52] > Flexible ways of launching executors > > > Key: SPARK-37572 > URL: https://issues.apache.org/jira/browse/SPARK-37572 > Project: Spark > Issue Type: New Feature > Components: Deploy >Affects Versions: 3.2.0 >Reporter: Dagang Wei >Priority: Major > > Currently Spark launches executor processes by constructing and running > commands [1], for example: > {code:java} > /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp > /opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* > -Xmx1024M -Dspark.driver.port=35729 > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://coarsegrainedschedu...@dagang.svl.corp.google.com:35729 --executor-id > 0 --hostname 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 > --worker-url spark://Worker@100.116.124.193:45287 {code} > But there are use cases which require more flexible ways of launch
[jira] [Commented] (SPARK-37577) ClassCastException: ArrayType cannot be cast to StructType
[ https://issues.apache.org/jira/browse/SPARK-37577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456103#comment-17456103 ] L. C. Hsieh commented on SPARK-37577: - Thanks [~hyukjin.kwon]. I will take a look. > ClassCastException: ArrayType cannot be cast to StructType > -- > > Key: SPARK-37577 > URL: https://issues.apache.org/jira/browse/SPARK-37577 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: Py: 3.9 >Reporter: Rafal Wojdyla >Priority: Major > > Reproduction: > {code:python} > import pyspark.sql.functions as F > from pyspark.sql.types import StructType, StructField, ArrayType, StringType > t = StructType([StructField('o', ArrayType(StructType([StructField('s', >StringType(), False), StructField('b', >ArrayType(StructType([StructField('e', StringType(), >False)]), True), False)]), True), False)]) > ( > spark.createDataFrame([], schema=t) > .select(F.explode("o").alias("eo")) > .select("eo.*") > .select(F.explode("b")) > .count() > ) > {code} > Code above works fine in 3.1.2, fails in 3.2.0. See stacktrace below. Note > that if you remove, field {{s}}, the code works fine, which is a bit > unexpected and likely a clue. > {noformat} > Py4JJavaError: An error occurred while calling o156.count. > : java.lang.ClassCastException: class org.apache.spark.sql.types.ArrayType > cannot be cast to class org.apache.spark.sql.types.StructType > (org.apache.spark.sql.types.ArrayType and > org.apache.spark.sql.types.StructType are in unnamed module of loader 'app') > at > org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema$lzycompute(complexTypeExtractors.scala:107) > at > org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema(complexTypeExtractors.scala:107) > at > org.apache.spark.sql.catalyst.expressions.GetStructField.$anonfun$extractFieldName$1(complexTypeExtractors.scala:117) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.catalyst.expressions.GetStructField.extractFieldName(complexTypeExtractors.scala:117) > at > org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1$$anonfun$2.applyOrElse(NestedColumnAliasing.scala:372) > at > org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1$$anonfun$2.applyOrElse(NestedColumnAliasing.scala:368) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$4(TreeNode.scala:539) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:539) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:508) > at > org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1.applyOrElse(NestedColumnAliasing.scala:368) > at > org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1.applyOrElse(NestedColumnAliasing.scala:366) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsDownWithPruning$1(QueryPlan.scala:152) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193) > at > 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDownWithPruning(QueryPlan.scala:152) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsWithPruning(QueryPlan.scala:123) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:101) > at > org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$.unapply(NestedColumnAliasing.scala:366) > at > org.apa
[jira] [Created] (SPARK-37589) Improve the performance of PlanParserSuite
jiaan.geng created SPARK-37589: -- Summary: Improve the performance of PlanParserSuite Key: SPARK-37589 URL: https://issues.apache.org/jira/browse/SPARK-37589 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: jiaan.geng Reduce the time spent on executing PlanParserSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37589) Improve the performance of PlanParserSuite
[ https://issues.apache.org/jira/browse/SPARK-37589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37589: Assignee: Apache Spark > Improve the performance of PlanParserSuite > -- > > Key: SPARK-37589 > URL: https://issues.apache.org/jira/browse/SPARK-37589 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > > Reduce the time spent on executing PlanParserSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37589) Improve the performance of PlanParserSuite
[ https://issues.apache.org/jira/browse/SPARK-37589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37589: Assignee: (was: Apache Spark) > Improve the performance of PlanParserSuite > -- > > Key: SPARK-37589 > URL: https://issues.apache.org/jira/browse/SPARK-37589 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Priority: Major > > Reduce the time spent on executing PlanParserSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37589) Improve the performance of PlanParserSuite
[ https://issues.apache.org/jira/browse/SPARK-37589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456116#comment-17456116 ] Apache Spark commented on SPARK-37589: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/34841 > Improve the performance of PlanParserSuite > -- > > Key: SPARK-37589 > URL: https://issues.apache.org/jira/browse/SPARK-37589 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Priority: Major > > Reduce the time spent on executing PlanParserSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37579) Called spark.sql multiple times, union multiple DataFrame, groupBy and pivot, join other table view cause exception
[ https://issues.apache.org/jira/browse/SPARK-37579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456120#comment-17456120 ] page commented on SPARK-37579: -- No current plans to use Spark 3+. Can I still get advice or help on spark2.4+? > Called spark.sql multiple times,union multiple DataFrame, groupBy and pivot, > join other table view cause exception > -- > > Key: SPARK-37579 > URL: https://issues.apache.org/jira/browse/SPARK-37579 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7 >Reporter: page >Priority: Major > > Possible steps to reproduce: > 1. Run spark.sql multiple times, get DataFrame list [d1, d2, d3, d4] > 2. Combine DataFrame list [d1, d2, d3, d4] to a DataFrame d5 by calling > Dataset#unionByName > 3. Run > {code:java} > d5.groupBy("c1").pivot("c2").agg(concat_ws(", ", collect_list("value"))){code} > ,produce DataFrame d6 > 4. DataFrame d6 join another DataFrame d7 > 5. Call function like count to trigger spark job > 6. Exception happend > > stack trace: > org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) > at > org.apache.spark.sql.execution.adaptive.QueryStage.executeChildStages(QueryStage.scala:88) > at > org.apache.spark.sql.execution.adaptive.QueryStage.prepareExecuteStage(QueryStage.scala:136) > at > org.apache.spark.sql.execution.adaptive.QueryStage.executeCollect(QueryStage.scala:242) > at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2837) > at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2836) > at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3441) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:92) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:139) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withAction(Dataset.scala:3440) > at org.apache.spark.sql.Dataset.count(Dataset.scala:2836) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.IllegalArgumentException: Can't zip RDDs with unequal > numbers of partitions: List(2, 1) > at > org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:269) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:269) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269) > at scala.Option.getOrElse(Option.scala:121) > at 
org.apache.spark.rdd.RDD.partitions(RDD.scala:269) > at org.apache.spark.ShuffleDependency.(Dependency.scala:94) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:361) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:69) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.eagerExecute(ShuffleExchangeExec.scala:112) > at > org.apache.spark.sql.execution.adaptive.ShuffleQueryStage.executeStage(QueryStage.scala:284) > at > org.apache.spark.sql.execution.adaptive.QueryStage.doExecute(QueryStage.scala:236) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:137) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:161) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:158) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:133) > at > org.
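The numbered reproduction steps quoted in the report above are quite compact, so here is a minimal PySpark sketch of the same sequence of operations. It is not the reporter's actual job: the table names, columns, and data (t1..t4, c1, c2, value, other) are made-up placeholders, and on a healthy setup the final count simply succeeds.
{code:python}
# Illustrative sketch of the reported steps; names and data are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat_ws, collect_list

spark = SparkSession.builder.getOrCreate()

# Register a few small temp views so spark.sql() has something to read.
for i in range(1, 5):
    spark.createDataFrame(
        [("a", "x", str(i)), ("b", "y", str(i))], ["c1", "c2", "value"]
    ).createOrReplaceTempView(f"t{i}")
spark.createDataFrame([("a", 1), ("b", 2)], ["c1", "extra"]).createOrReplaceTempView("other")

# Step 1: run spark.sql multiple times to get a list of DataFrames.
frames = [spark.sql(f"SELECT c1, c2, value FROM t{i}") for i in range(1, 5)]

# Step 2: combine them into one DataFrame with unionByName.
d5 = frames[0]
for df in frames[1:]:
    d5 = d5.unionByName(df)

# Step 3: groupBy + pivot + aggregate, as in the report.
d6 = d5.groupBy("c1").pivot("c2").agg(concat_ws(", ", collect_list("value")))

# Step 4: join another DataFrame, then Step 5: trigger the job with count().
d6.join(spark.sql("SELECT c1, extra FROM other"), "c1").count()
{code}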
[jira] [Commented] (SPARK-37579) Called spark.sql multiple times, union multiple DataFrame, groupBy and pivot, join other table view cause exception
[ https://issues.apache.org/jira/browse/SPARK-37579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456121#comment-17456121 ] Hyukjin Kwon commented on SPARK-37579: -- For that, you can leverage Spark dev mailing list instead of filing it as an issue. > Called spark.sql multiple times,union multiple DataFrame, groupBy and pivot, > join other table view cause exception > -- > > Key: SPARK-37579 > URL: https://issues.apache.org/jira/browse/SPARK-37579 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7 >Reporter: page >Priority: Major > > Possible steps to reproduce: > 1. Run spark.sql multiple times, get DataFrame list [d1, d2, d3, d4] > 2. Combine DataFrame list [d1, d2, d3, d4] to a DataFrame d5 by calling > Dataset#unionByName > 3. Run > {code:java} > d5.groupBy("c1").pivot("c2").agg(concat_ws(", ", collect_list("value"))){code} > ,produce DataFrame d6 > 4. DataFrame d6 join another DataFrame d7 > 5. Call function like count to trigger spark job > 6. Exception happend > > stack trace: > org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) > at > org.apache.spark.sql.execution.adaptive.QueryStage.executeChildStages(QueryStage.scala:88) > at > org.apache.spark.sql.execution.adaptive.QueryStage.prepareExecuteStage(QueryStage.scala:136) > at > org.apache.spark.sql.execution.adaptive.QueryStage.executeCollect(QueryStage.scala:242) > at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2837) > at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2836) > at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3441) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:92) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:139) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withAction(Dataset.scala:3440) > at org.apache.spark.sql.Dataset.count(Dataset.scala:2836) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.IllegalArgumentException: Can't zip RDDs with unequal > numbers of partitions: List(2, 1) > at > org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:269) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:269) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269) > at scala.Option.getOrElse(Option.scala:121) > at 
org.apache.spark.rdd.RDD.partitions(RDD.scala:269) > at org.apache.spark.ShuffleDependency.(Dependency.scala:94) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:361) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:69) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.eagerExecute(ShuffleExchangeExec.scala:112) > at > org.apache.spark.sql.execution.adaptive.ShuffleQueryStage.executeStage(QueryStage.scala:284) > at > org.apache.spark.sql.execution.adaptive.QueryStage.doExecute(QueryStage.scala:236) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:137) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:161) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:158) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.s
[jira] [Resolved] (SPARK-37579) Called spark.sql multiple times, union multiple DataFrame, groupBy and pivot, join other table view cause exception
[ https://issues.apache.org/jira/browse/SPARK-37579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37579. -- Resolution: Incomplete > Called spark.sql multiple times,union multiple DataFrame, groupBy and pivot, > join other table view cause exception > -- > > Key: SPARK-37579 > URL: https://issues.apache.org/jira/browse/SPARK-37579 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7 >Reporter: page >Priority: Major > > Possible steps to reproduce: > 1. Run spark.sql multiple times, get DataFrame list [d1, d2, d3, d4] > 2. Combine DataFrame list [d1, d2, d3, d4] to a DataFrame d5 by calling > Dataset#unionByName > 3. Run > {code:java} > d5.groupBy("c1").pivot("c2").agg(concat_ws(", ", collect_list("value"))){code} > ,produce DataFrame d6 > 4. DataFrame d6 join another DataFrame d7 > 5. Call function like count to trigger spark job > 6. Exception happend > > stack trace: > org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) > at > org.apache.spark.sql.execution.adaptive.QueryStage.executeChildStages(QueryStage.scala:88) > at > org.apache.spark.sql.execution.adaptive.QueryStage.prepareExecuteStage(QueryStage.scala:136) > at > org.apache.spark.sql.execution.adaptive.QueryStage.executeCollect(QueryStage.scala:242) > at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2837) > at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2836) > at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3441) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:92) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:139) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withAction(Dataset.scala:3440) > at org.apache.spark.sql.Dataset.count(Dataset.scala:2836) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.IllegalArgumentException: Can't zip RDDs with unequal > numbers of partitions: List(2, 1) > at > org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:269) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:269) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:269) > at org.apache.spark.ShuffleDependency.(Dependency.scala:94) > at > 
org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:361) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:69) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.eagerExecute(ShuffleExchangeExec.scala:112) > at > org.apache.spark.sql.execution.adaptive.ShuffleQueryStage.executeStage(QueryStage.scala:284) > at > org.apache.spark.sql.execution.adaptive.QueryStage.doExecute(QueryStage.scala:236) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:137) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:161) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:158) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.adaptive.QueryStage$$anonfun$8$$anonfun$apply$2$$anonfun$apply$3.
[jira] [Created] (SPARK-37590) Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests
Terry Kim created SPARK-37590: - Summary: Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests Key: SPARK-37590 URL: https://issues.apache.org/jira/browse/SPARK-37590 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Terry Kim Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
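For readers unfamiliar with the command under test, a small hedged example of the statement family these unified tests exercise, issued through PySpark; the namespace name and property key/value are illustrative only.
{code:python}
# Illustrative only: the ALTER NAMESPACE ... SET PROPERTIES statement family
# covered by the unified v1/v2 tests. Namespace name and property are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("CREATE NAMESPACE IF NOT EXISTS demo_ns")
spark.sql("ALTER NAMESPACE demo_ns SET PROPERTIES ('purpose' = 'testing')")
spark.sql("DESCRIBE NAMESPACE EXTENDED demo_ns").show(truncate=False)
{code}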
[jira] [Commented] (SPARK-37590) Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests
[ https://issues.apache.org/jira/browse/SPARK-37590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456134#comment-17456134 ] Apache Spark commented on SPARK-37590: -- User 'imback82' has created a pull request for this issue: https://github.com/apache/spark/pull/34842 > Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests > > > Key: SPARK-37590 > URL: https://issues.apache.org/jira/browse/SPARK-37590 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Terry Kim >Priority: Major > > Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37590) Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests
[ https://issues.apache.org/jira/browse/SPARK-37590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37590: Assignee: Apache Spark > Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests > > > Key: SPARK-37590 > URL: https://issues.apache.org/jira/browse/SPARK-37590 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Terry Kim >Assignee: Apache Spark >Priority: Major > > Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37590) Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests
[ https://issues.apache.org/jira/browse/SPARK-37590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37590: Assignee: (was: Apache Spark) > Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests > > > Key: SPARK-37590 > URL: https://issues.apache.org/jira/browse/SPARK-37590 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Terry Kim >Priority: Major > > Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37590) Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests
[ https://issues.apache.org/jira/browse/SPARK-37590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456135#comment-17456135 ] Apache Spark commented on SPARK-37590: -- User 'imback82' has created a pull request for this issue: https://github.com/apache/spark/pull/34842 > Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests > > > Key: SPARK-37590 > URL: https://issues.apache.org/jira/browse/SPARK-37590 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Terry Kim >Priority: Major > > Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36137) HiveShim always fallback to getAllPartitionsOf regardless of whether directSQL is enabled in remote HMS
[ https://issues.apache.org/jira/browse/SPARK-36137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456141#comment-17456141 ] Apache Spark commented on SPARK-36137: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/34843 > HiveShim always fallback to getAllPartitionsOf regardless of whether > directSQL is enabled in remote HMS > --- > > Key: SPARK-36137 > URL: https://issues.apache.org/jira/browse/SPARK-36137 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.3.0 > > > At the moment {{getPartitionsByFilter}} in Hive shim only fallback to use > {{getAllPartitionsOf}} when {{hive.metastore.try.direct.sql}} is enabled in > the remote HMS. However, in certain cases the remote HMS will fallback to use > ORM (which only support string type for partition columns) to query the > underlying RDBMS even if this config is set to true, and Spark will not be > able to recover from the error and will just fail the query. > For instance, we encountered this bug HIVE-21497 in HMS running Hive 3.1.2, > and Spark was not able to pushdown filter for {{date}} column. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37586) Add cipher mode option and set default cipher mode for aes_encrypt and aes_decrypt
[ https://issues.apache.org/jira/browse/SPARK-37586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-37586. Fix Version/s: 3.3.0 Assignee: Max Gekk Resolution: Fixed Issue resolved in https://github.com/apache/spark/pull/34837 > Add cipher mode option and set default cipher mode for aes_encrypt and > aes_decrypt > -- > > Key: SPARK-37586 > URL: https://issues.apache.org/jira/browse/SPARK-37586 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.3.0 > > > https://github.com/apache/spark/pull/32801 added aes_encrypt/aes_decrypt > functions to spark. However they rely on the jvm's configuration regarding > which cipher mode to support, this is problematic as it is not fixed across > versions and systems. > Let's hardcode a default cipher mode and also allow users to set a cipher > mode as an argument to the function. > In the future, we can support other modes like GCM and CBC that have been > already supported by other systems: > # Snowflake: > https://docs.snowflake.com/en/sql-reference/functions/encrypt.html > # Bigquery: > https://cloud.google.com/bigquery/docs/reference/standard-sql/aead-encryption-concepts#block_cipher_modes -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
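Once this change is in, usage from SQL would look roughly like the sketch below, run here through PySpark. The 16-character key is a throwaway example and the explicit 'GCM' argument stands in for the mode option this ticket adds; the actual defaults and supported modes are defined by the merged pull request, not by this sketch.
{code:python}
# Rough sketch of aes_encrypt/aes_decrypt with an explicit cipher mode, as
# proposed here. The key and the 'GCM' literal are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

key = "0000111122223333"  # throwaway 16-byte key for the example
spark.sql(f"""
    SELECT CAST(
        aes_decrypt(aes_encrypt('Spark', '{key}', 'GCM'), '{key}', 'GCM')
        AS STRING) AS roundtrip
""").show()
{code}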
[jira] [Created] (SPARK-37591) Support the GCM mode by aes_encrypt()/aes_decrypt()
Max Gekk created SPARK-37591: Summary: Support the GCM mode by aes_encrypt()/aes_decrypt() Key: SPARK-37591 URL: https://issues.apache.org/jira/browse/SPARK-37591 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Max Gekk Assignee: Max Gekk Implement the GCM mode in AES - aes_encrypt() and aes_decrypt() -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37591) Support the GCM mode by aes_encrypt()/aes_decrypt()
[ https://issues.apache.org/jira/browse/SPARK-37591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456153#comment-17456153 ] Max Gekk commented on SPARK-37591: -- I am working on this. > Support the GCM mode by aes_encrypt()/aes_decrypt() > --- > > Key: SPARK-37591 > URL: https://issues.apache.org/jira/browse/SPARK-37591 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Implement the GCM mode in AES - aes_encrypt() and aes_decrypt() -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37581) sql hang at planning stage
[ https://issues.apache.org/jira/browse/SPARK-37581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456154#comment-17456154 ] ocean commented on SPARK-37581: --- Hi [~hyukjin.kwon]. This sql have 19 join operators.But these join have the same pattern. I found that ,when have 10 join operators, it costs 17s. when 11 join operators, costs 39s. when 12 join operators, costs 120s. when 13 join operators , it can not finish. I think, we can debug it at 11 join operators, to find why it is so slow. === drop table if exists test.test_c;create table if not exists test.test_c stored as ORCFILE as select calendar.day,calendar.week,calendar.weekday, a_kbs, b_kbs, c_kbs,d_kbs,e_kbs,f_kbs,g_kbs,h_kbs,i_kbs,j_kbs,k_kbs from (select * from test.test_a where dt = '20211126') calendar left join (select dt,count(distinct kb_code) as a_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t1 on calendar.dt = t1.dt left join (select dt,count(distinct kb_code) as b_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t2 on calendar.dt = t2.dt left join (select dt,count(distinct kb_code) as c_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t3 on calendar.dt = t3.dt left join (select dt,count(distinct kb_code) as d_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t4 on calendar.dt = t4.dt left join (select dt,count(distinct kb_code) as e_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t5 on calendar.dt = t5.dt left join (select dt,count(distinct kb_code) as f_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t6 on calendar.dt = t6.dt left join (select dt,count(distinct kb_code) as g_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t7 on calendar.dt = t7.dt left join (select dt,count(distinct kb_code) as h_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t8 on calendar.dt = t8.dt left join (select dt,count(distinct kb_code) as i_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t9 on calendar.dt = t9.dt left join (select dt,count(distinct kb_code) as j_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t10 on calendar.dt = t10.dt left join (select dt,count(distinct kb_code) as k_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t11 on calendar.dt = t11.dt > sql hang at planning stage > -- > > Key: SPARK-37581 > URL: https://issues.apache.org/jira/browse/SPARK-37581 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.0 >Reporter: ocean >Priority: Major > > when exec a sql, this sql hang at planning stage. > when disable DPP, sql can finish normally. 
> we can reproduce this problem through example below: > create table test.test_a ( > day string, > week int, > weekday int) > partitioned by ( > dt varchar(8)) > stored as orc; > insert into test.test_a partition (dt=20211126) values('1',1,2); > create table test.test_b ( > session_id string, > device_id string, > brand string, > model string, > wx_version string, > os string, > net_work_type string, > app_id string, > app_name string, > col_z string, > page_url string, > page_title string, > olabel string, > otitle string, > source string, > send_dt string, > recv_dt string, > request_time string, > write_time string, > client_ip string, > col_a string, > dt_hour varchar(12), > product string, > channelfrom string, > customer_um string, > kb_code string, > col_b string, > rectype string, > errcode string, > col_c string, > pageid_merge string) > partitioned by ( > dt varchar(8)) > stored as orc; > insert into test.test_b partition(dt=20211126) > values('2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2'); > create table if not exists test.test_c stored as ORCFILE as > select calendar.day,calendar.week,calendar.weekday, a_kbs, > b_kbs, c_kbs,d_kbs,e_kbs,f_kbs,g_kbs
[jira] [Comment Edited] (SPARK-37581) sql hang at planning stage
[ https://issues.apache.org/jira/browse/SPARK-37581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456154#comment-17456154 ] ocean edited comment on SPARK-37581 at 12/9/21, 6:06 AM: - Hi [~hyukjin.kwon]. This sql have 19 join operators.But these join have the same pattern. I found that ,when have 10 join operators, it costs 17s. when 11 join operators, costs 39s. when 12 join operators, costs 120s. when 13 join operators , it can not finish. I think, we can debug it at 11 join operators, to find why it is so slow. === drop table if exists test.test_c;create table if not exists test.test_c stored as ORCFILE as select calendar.day,calendar.week,calendar.weekday, a_kbs, b_kbs, c_kbs,d_kbs,e_kbs,f_kbs,g_kbs,h_kbs,i_kbs,j_kbs,k_kbs from (select * from test.test_a where dt = '20211126') calendar left join (select dt,count(distinct kb_code) as a_kbs from test.test_b where dt = '20211126' group by dt) t1 on calendar.dt = t1.dt left join (select dt,count(distinct kb_code) as b_kbs from test.test_b where dt = '20211126' group by dt) t2 on calendar.dt = t2.dt left join (select dt,count(distinct kb_code) as c_kbs from test.test_b where dt = '20211126' group by dt) t3 on calendar.dt = t3.dt left join (select dt,count(distinct kb_code) as d_kbs from test.test_b where dt = '20211126' group by dt) t4 on calendar.dt = t4.dt left join (select dt,count(distinct kb_code) as e_kbs from test.test_b where dt = '20211126' group by dt) t5 on calendar.dt = t5.dt left join (select dt,count(distinct kb_code) as f_kbs from test.test_b where dt = '20211126' group by dt) t6 on calendar.dt = t6.dt left join (select dt,count(distinct kb_code) as g_kbs from test.test_b where dt = '20211126' group by dt) t7 on calendar.dt = t7.dt left join (select dt,count(distinct kb_code) as h_kbs from test.test_b where dt = '20211126' group by dt) t8 on calendar.dt = t8.dt left join (select dt,count(distinct kb_code) as i_kbs from test.test_b where dt = '20211126' group by dt) t9 on calendar.dt = t9.dt left join (select dt,count(distinct kb_code) as j_kbs from test.test_b where dt = '20211126' group by dt) t10 on calendar.dt = t10.dt left join (select dt,count(distinct kb_code) as k_kbs from test.test_b where dt = '20211126' group by dt) t11 on calendar.dt = t11.dt was (Author: oceaneast): Hi [~hyukjin.kwon]. This sql have 19 join operators.But these join have the same pattern. I found that ,when have 10 join operators, it costs 17s. when 11 join operators, costs 39s. when 12 join operators, costs 120s. when 13 join operators , it can not finish. I think, we can debug it at 11 join operators, to find why it is so slow. 
=== drop table if exists test.test_c;create table if not exists test.test_c stored as ORCFILE as select calendar.day,calendar.week,calendar.weekday, a_kbs, b_kbs, c_kbs,d_kbs,e_kbs,f_kbs,g_kbs,h_kbs,i_kbs,j_kbs,k_kbs from (select * from test.test_a where dt = '20211126') calendar left join (select dt,count(distinct kb_code) as a_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t1 on calendar.dt = t1.dt left join (select dt,count(distinct kb_code) as b_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t2 on calendar.dt = t2.dt left join (select dt,count(distinct kb_code) as c_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t3 on calendar.dt = t3.dt left join (select dt,count(distinct kb_code) as d_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t4 on calendar.dt = t4.dt left join (select dt,count(distinct kb_code) as e_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t5 on calendar.dt = t5.dt left join (select dt,count(distinct kb_code) as f_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t6 on calendar.dt = t6.dt left join (select dt,count(distinct kb_code) as g_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t7 on calendar.dt = t7.dt left join (select dt,count(distinct kb_code) as h_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t8 on calendar.dt = t8.dt left join (select dt,count(distinct kb_code) as i_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group
[jira] [Comment Edited] (SPARK-37581) sql hang at planning stage
[ https://issues.apache.org/jira/browse/SPARK-37581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456154#comment-17456154 ] ocean edited comment on SPARK-37581 at 12/9/21, 6:12 AM: - Hi [~hyukjin.kwon]. This sql have 19 join operators.But these join have the same pattern. I found that ,when have 10 join operators, it costs 17s. when 11 join operators, costs 39s. when 12 join operators, costs 120s. when 13 join operators , it can not finish. I think, we can debug it at 11 join operators, to find why it is so slow. I had narrowed down as below: === drop table if exists test.test_c;create table if not exists test.test_c stored as ORCFILE as select calendar.day,calendar.week,calendar.weekday, a_kbs, b_kbs, c_kbs,d_kbs,e_kbs,f_kbs,g_kbs,h_kbs,i_kbs,j_kbs,k_kbs from (select * from test.test_a where dt = '20211126') calendar left join (select dt,count(distinct kb_code) as a_kbs from test.test_b where dt = '20211126' group by dt) t1 on calendar.dt = t1.dt left join (select dt,count(distinct kb_code) as b_kbs from test.test_b where dt = '20211126' group by dt) t2 on calendar.dt = t2.dt left join (select dt,count(distinct kb_code) as c_kbs from test.test_b where dt = '20211126' group by dt) t3 on calendar.dt = t3.dt left join (select dt,count(distinct kb_code) as d_kbs from test.test_b where dt = '20211126' group by dt) t4 on calendar.dt = t4.dt left join (select dt,count(distinct kb_code) as e_kbs from test.test_b where dt = '20211126' group by dt) t5 on calendar.dt = t5.dt left join (select dt,count(distinct kb_code) as f_kbs from test.test_b where dt = '20211126' group by dt) t6 on calendar.dt = t6.dt left join (select dt,count(distinct kb_code) as g_kbs from test.test_b where dt = '20211126' group by dt) t7 on calendar.dt = t7.dt left join (select dt,count(distinct kb_code) as h_kbs from test.test_b where dt = '20211126' group by dt) t8 on calendar.dt = t8.dt left join (select dt,count(distinct kb_code) as i_kbs from test.test_b where dt = '20211126' group by dt) t9 on calendar.dt = t9.dt left join (select dt,count(distinct kb_code) as j_kbs from test.test_b where dt = '20211126' group by dt) t10 on calendar.dt = t10.dt left join (select dt,count(distinct kb_code) as k_kbs from test.test_b where dt = '20211126' group by dt) t11 on calendar.dt = t11.dt was (Author: oceaneast): Hi [~hyukjin.kwon]. This sql have 19 join operators.But these join have the same pattern. I found that ,when have 10 join operators, it costs 17s. when 11 join operators, costs 39s. when 12 join operators, costs 120s. when 13 join operators , it can not finish. I think, we can debug it at 11 join operators, to find why it is so slow. 
=== drop table if exists test.test_c;create table if not exists test.test_c stored as ORCFILE as select calendar.day,calendar.week,calendar.weekday, a_kbs, b_kbs, c_kbs,d_kbs,e_kbs,f_kbs,g_kbs,h_kbs,i_kbs,j_kbs,k_kbs from (select * from test.test_a where dt = '20211126') calendar left join (select dt,count(distinct kb_code) as a_kbs from test.test_b where dt = '20211126' group by dt) t1 on calendar.dt = t1.dt left join (select dt,count(distinct kb_code) as b_kbs from test.test_b where dt = '20211126' group by dt) t2 on calendar.dt = t2.dt left join (select dt,count(distinct kb_code) as c_kbs from test.test_b where dt = '20211126' group by dt) t3 on calendar.dt = t3.dt left join (select dt,count(distinct kb_code) as d_kbs from test.test_b where dt = '20211126' group by dt) t4 on calendar.dt = t4.dt left join (select dt,count(distinct kb_code) as e_kbs from test.test_b where dt = '20211126' group by dt) t5 on calendar.dt = t5.dt left join (select dt,count(distinct kb_code) as f_kbs from test.test_b where dt = '20211126' group by dt) t6 on calendar.dt = t6.dt left join (select dt,count(distinct kb_code) as g_kbs from test.test_b where dt = '20211126' group by dt) t7 on calendar.dt = t7.dt left join (select dt,count(distinct kb_code) as h_kbs from test.test_b where dt = '20211126' group by dt) t8 on calendar.dt = t8.dt left join (select dt,count(distinct kb_code) as i_kbs from test.test_b where dt = '20211126' group by dt) t9 on calendar.dt = t9.dt left join (select dt,count(distinct kb_code) as j_kbs from test.test_b where dt = '20211126' group by dt) t10 on calendar.dt = t10.dt left join (select dt,count(distinct kb_code) as k_kbs from test.test_b where dt = '20211126' group by dt) t11 on calendar.dt = t11.dt > sql hang at planning stage > -- > > Key: SPARK-37581 > URL: https://issues.apache.org/jira/browse/SPARK-37581 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.0 >Reporter: ocean >Priority: Major > > when exec a sql, this sql hang at planning stage. > when disable DPP, sql can finish
[jira] [Updated] (SPARK-37581) sql hang at planning stage
[ https://issues.apache.org/jira/browse/SPARK-37581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ocean updated SPARK-37581:
--------------------------
    Description:

When executing this SQL statement, the query hangs at the planning stage. When DPP (dynamic partition pruning) is disabled, the same query finishes very quickly. The problem can be reproduced with the example below; a workaround sketch follows the reproduction.

create table test.test_a (
  day string,
  week int,
  weekday int)
partitioned by (dt varchar(8))
stored as orc;

insert into test.test_a partition (dt=20211126) values('1',1,2);

create table test.test_b (
  session_id string, device_id string, brand string, model string, wx_version string,
  os string, net_work_type string, app_id string, app_name string, col_z string,
  page_url string, page_title string, olabel string, otitle string, source string,
  send_dt string, recv_dt string, request_time string, write_time string, client_ip string,
  col_a string, dt_hour varchar(12), product string, channelfrom string, customer_um string,
  kb_code string, col_b string, rectype string, errcode string, col_c string,
  pageid_merge string)
partitioned by (dt varchar(8))
stored as orc;

insert into test.test_b partition(dt=20211126)
values('2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2');

create table if not exists test.test_c stored as ORCFILE as
select calendar.day, calendar.week, calendar.weekday, a_kbs,
       b_kbs, c_kbs, d_kbs, e_kbs, f_kbs, g_kbs, h_kbs, i_kbs,
       j_kbs, k_kbs, l_kbs, m_kbs, n_kbs, o_kbs, p_kbs, q_kbs, r_kbs, s_kbs
from (select * from test.test_a where dt = '20211126') calendar
left join (select dt, count(distinct kb_code) as a_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t1 on calendar.dt = t1.dt
left join (select dt, count(distinct kb_code) as b_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t2 on calendar.dt = t2.dt
left join (select dt, count(distinct kb_code) as c_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t3 on calendar.dt = t3.dt
left join (select dt, count(distinct kb_code) as d_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t4 on calendar.dt = t4.dt
left join (select dt, count(distinct kb_code) as e_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t5 on calendar.dt = t5.dt
left join (select dt, count(distinct kb_code) as f_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t6 on calendar.dt = t6.dt
left join (select dt, count(distinct kb_code) as g_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t7 on calendar.dt = t7.dt
left join (select dt, count(distinct kb_code) as h_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t8 on calendar.dt = t8.dt
left join (select dt, count(distinct kb_code) as i_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t9 on calendar.dt = t9.dt
left join (select dt, count(distinct kb_code) as j_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t10 on calendar.dt = t10.dt
left join (select dt, count(distinct kb_code) as k_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t11 on calendar.dt = t11.dt
left join (select dt, count(distinct kb_code) as l_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t12 on calendar.dt = t12.dt
left join (select dt, count(distinct kb_code) as m_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t13 on calendar.dt = t13.dt
left join (select dt, count(distinct kb_code) as n_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t14 on calendar.dt = t14.dt
left join (select dt, count(distinct kb_code) as o_kbs from test.test_b where dt = '20211126' and app_id in ('1','2') and substr(kb_code,1,6) = '66' and pageid_merge = 'a' group by dt) t15 on calendar.dt = t15.dt
left join (select dt, count(distinct kb_code) as p_kbs from test.test_b where dt = '20211126' and app_
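For anyone who needs the statement to complete while this is investigated, the description implies that turning off DPP avoids the hang. The sketch below is only a workaround outline, not a fix: it assumes an active SparkSession named {{spark}}, that the DDL/DML above has already been run, and a hypothetical {{ctasStatement}} string holding the "create table ... test.test_c ..." statement; the config key is the standard Spark 3.x flag for dynamic partition pruning.

{code:scala}
// Workaround sketch implied by the description (not a fix): run the CTAS with
// dynamic partition pruning disabled, then restore the default.
// Assumptions: `spark` is an active SparkSession, the DDL/DML above has been run,
// and `ctasStatement` (hypothetical) holds the "create table ... test.test_c ..." SQL text.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "false")
spark.sql(ctasStatement)
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
{code}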
[jira] [Commented] (SPARK-37575) Empty strings and null values are both saved as quoted empty Strings "" rather than "" (for empty strings) and nothing(for null values)
[ https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456167#comment-17456167 ]

Guo Wei commented on SPARK-37575:
---------------------------------

I also found that if emptyValueInRead is set to "\"\"", reading the CSV data shown below:
{noformat}
tesla,,""{noformat}
gives the following result:
||name||brand||comment||
|tesla|null|null|
But the expected result should be:
||name||brand||comment||
|tesla|null| |
(A sketch of this read scenario follows the quoted issue below.)

> Empty strings and null values are both saved as quoted empty Strings ""
> rather than "" (for empty strings) and nothing (for null values)
> ------------------------------------------------------------------------
>
>                 Key: SPARK-37575
>                 URL: https://issues.apache.org/jira/browse/SPARK-37575
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0, 3.2.0
>            Reporter: Guo Wei
>            Priority: Major
>
> As mentioned in the SQL migration guide
> ([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]):
> {noformat}
> Since Spark 2.4, empty strings are saved as quoted empty strings "". In
> version 2.3 and earlier, empty strings are equal to null values and do not
> reflect to any characters in saved CSV files. For example, the row of "a",
> null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as
> a,,"",1. To restore the previous behavior, set the CSV option emptyValue to
> empty (not quoted) string.{noformat}
>
> But actually, both empty strings and null values are saved as quoted empty
> Strings "" rather than "" (for empty strings) and nothing (for null values).
> Code:
> {code:java}
> val data = List("spark", null, "").toDF("name")
> data.coalesce(1).write.csv("spark_csv_test")
> {code}
> Actual result:
> {noformat}
> line1: spark
> line2: ""
> line3: ""{noformat}
> Expected result:
> {noformat}
> line1: spark
> line2:
> line3: ""
> {noformat}
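To make the read-side scenario above concrete, here is a minimal, hedged sketch in Scala. It assumes a running SparkSession named {{spark}} and a hypothetical file {{/tmp/cars.csv}} whose only data line is {{tesla,,""}}; the user-facing CSV option is {{emptyValue}}, which is what the internal {{emptyValueInRead}} setting refers to.

{code:scala}
// Sketch of the read-side scenario described in the comment above.
// Assumptions: an active SparkSession named `spark`, and a hypothetical file
// /tmp/cars.csv whose only data line is:  tesla,,""
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("brand", StringType),
  StructField("comment", StringType)))

val df = spark.read
  .schema(schema)
  .option("emptyValue", "\"\"")   // user-facing name of the emptyValueInRead setting
  .csv("/tmp/cars.csv")

df.show()
// Reported behaviour: both `brand` and `comment` come back as null.
// The commenter expects `comment` (written as "") to come back as an empty string.
{code}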
[jira] [Comment Edited] (SPARK-37575) Empty strings and null values are both saved as quoted empty Strings "" rather than "" (for empty strings) and nothing(for null values)
[ https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456167#comment-17456167 ]

Guo Wei edited comment on SPARK-37575 at 12/9/21, 6:49 AM:
-----------------------------------------------------------

I also found that if emptyValueInRead is set to "\"\"", reading the CSV data shown below:
{noformat}
tesla,,""{noformat}
gives the following result:
||name||brand||comment||
|tesla|null|null|
But the expected result should be:
||name||brand||comment||
|tesla|null| |

was (Author: wayne guo):
I also found that if emptyValueInRead is set to "\"\"", reading the CSV data shown below:
{noformat}
tesla,,""{noformat}
gives the following result:
||name||brand||comment||
|tesla|null|null|
But the expected result should be:
||name||brand||comment||
|tesla|null| |
[jira] [Comment Edited] (SPARK-37575) Empty strings and null values are both saved as quoted empty Strings "" rather than "" (for empty strings) and nothing(for null values)
[ https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456167#comment-17456167 ]

Guo Wei edited comment on SPARK-37575 at 12/9/21, 6:52 AM:
-----------------------------------------------------------

I also found that if emptyValueInRead is set to "\"\"", reading the CSV data shown below:
{noformat}
name,brand,comment
tesla,,""{noformat}
gives the following result:
||name||brand||comment||
|tesla|null|null|
But the expected result should be:
||name||brand||comment||
|tesla|null| |

was (Author: wayne guo):
I also found that if emptyValueInRead is set to "\"\"", reading the CSV data shown below:
{noformat}
tesla,,""{noformat}
gives the following result:
||name||brand||comment||
|tesla|null|null|
But the expected result should be:
||name||brand||comment||
|tesla|null| |
[jira] [Commented] (SPARK-37575) Empty strings and null values are both saved as quoted empty Strings "" rather than "" (for empty strings) and nothing(for null values)
[ https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456168#comment-17456168 ]

Guo Wei commented on SPARK-37575:
---------------------------------

I have made a bug fix in my local repo and added some test cases that pass. Can you assign this issue to me? [~hyukjin.kwon]
[jira] [Commented] (SPARK-37575) Empty strings and null values are both saved as quoted empty Strings "" rather than "" (for empty strings) and nothing(for null values)
[ https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456172#comment-17456172 ]

Hyukjin Kwon commented on SPARK-37575:
--------------------------------------

We don't usually assign issues to someone. Once you open a PR, it will be assigned to the appropriate person automatically. You can just add a comment saying that you're working on this ticket.
[jira] [Commented] (SPARK-37581) sql hang at planning stage
[ https://issues.apache.org/jira/browse/SPARK-37581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456193#comment-17456193 ]

ocean commented on SPARK-37581:
-------------------------------

I have narrowed it down in the comment. Please have a look.
[jira] (SPARK-37581) sql hang at planning stage
[ https://issues.apache.org/jira/browse/SPARK-37581 ]

ocean deleted comment on SPARK-37581:
--------------------------------------

was (Author: oceaneast):
I have narrowed it down in the comment. Please have a look.