[jira] [Updated] (SPARK-33831) Update Jetty to 9.4.34
[ https://issues.apache.org/jira/browse/SPARK-33831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-33831:
----------------------------------
    Issue Type: Bug  (was: Improvement)

> Update Jetty to 9.4.34
> ----------------------
>
>          Key: SPARK-33831
>          URL: https://issues.apache.org/jira/browse/SPARK-33831
>      Project: Spark
>   Issue Type: Bug
>   Components: Spark Core, Web UI
> Affects Versions: 3.0.1
>     Reporter: Sean R. Owen
>     Priority: Minor
>
> We should update Jetty to 9.4.34, from 9.4.28, to pick up fixes, plus a possible CVE fix.
> https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.33.v20201020
> https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.34.v20201102

--
This message was sent by Atlassian Jira (v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
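For a dependency bump like this, the version is typically pinned by a single property in the root pom.xml of the build; a sketch of the kind of change involved (property name assumed, not taken from the patch itself):

```xml
<!-- root pom.xml: bump the pinned Jetty version (property name assumed) -->
<jetty.version>9.4.34.v20201102</jetty.version>
```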
[jira] [Updated] (SPARK-33822) TPCDS Q5 fails if spark.sql.adaptive.enabled=true
[ https://issues.apache.org/jira/browse/SPARK-33822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-33822:
----------------------------------
    Fix Version/s: 3.0.2

> TPCDS Q5 fails if spark.sql.adaptive.enabled=true
> -------------------------------------------------
>
>          Key: SPARK-33822
>          URL: https://issues.apache.org/jira/browse/SPARK-33822
>      Project: Spark
>   Issue Type: Sub-task
>   Components: SQL
> Affects Versions: 3.0.0, 3.0.1, 3.1.0, 3.2.0
>     Reporter: Dongjoon Hyun
>     Assignee: Takeshi Yamamuro
>     Priority: Blocker
>      Fix For: 3.0.2, 3.1.0
>
> **PROBLEM STATEMENT**
> {code}
> >>> tables = ['call_center', 'catalog_page', 'catalog_returns', 'catalog_sales', 'customer', 'customer_address', 'customer_demographics', 'date_dim', 'household_demographics', 'income_band', 'inventory', 'item', 'promotion', 'reason', 'ship_mode', 'store', 'store_returns', 'store_sales', 'time_dim', 'warehouse', 'web_page', 'web_returns', 'web_sales', 'web_site']
> >>> for t in tables:
> ...     spark.sql("CREATE TABLE %s USING PARQUET LOCATION '/Users/dongjoon/data/10g/%s'" % (t, t))
> >>> spark.sql(spark.sparkContext.wholeTextFiles("/Users/dongjoon/data/query/q5.sql").take(1)[0][1]).show(1)
> +---------------+----------------+-------------+-----------+-------------+
> |        channel|              id|        sales|    returns|       profit|
> +---------------+----------------+-------------+-----------+-------------+
> |           null|            null|1143646603.07|30617460.71|-317540732.87|
> |catalog channel|            null| 393609478.06| 9451732.79| -44801262.72|
> |catalog channel|catalog_pageA...|         0.00|   39037.48|    -25330.29|
> ...
> +---------------+----------------+-------------+-----------+-------------+
> >>> sql("set spark.sql.adaptive.enabled=true")
> >>> spark.sql(spark.sparkContext.wholeTextFiles("/Users/dongjoon/data/query/q5.sql").take(1)[0][1]).show(1)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/pyspark/sql/dataframe.py", line 440, in show
>     print(self._jdf.showString(n, 20, vertical))
>   File "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
>   File "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/pyspark/sql/utils.py", line 128, in deco
>     return f(*a, **kw)
>   File "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o160.showString.
> : java.lang.UnsupportedOperationException: BroadcastExchange does not support the execute() code path.
>   at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecute(BroadcastExchangeExec.scala:190)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at org.apache.spark.sql.execution.exchange.ReusedExchangeExec.doExecute(Exchange.scala:61)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at org.apache.spark.sql.execution.adaptive.QueryStageExec.doExecute(QueryStageExec.scala:115)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:316)
>   at org.apache.spark.sql.execution.SparkPlan.executeCollectIterator(SparkPlan.scala:392)
>   at org.apache.spark.sql.execution.exchange.BroadcastExchang
[jira] [Assigned] (SPARK-33831) Update Jetty to 9.4.34
[ https://issues.apache.org/jira/browse/SPARK-33831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-33831:
-------------------------------------
    Assignee: Sean R. Owen

> Update Jetty to 9.4.34
> ----------------------
[jira] [Resolved] (SPARK-33831) Update Jetty to 9.4.34
[ https://issues.apache.org/jira/browse/SPARK-33831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-33831.
-----------------------------------
    Fix Version/s: 2.4.8
                   3.0.2
                   3.1.0
       Resolution: Fixed

Issue resolved by pull request 30828
[https://github.com/apache/spark/pull/30828]

> Update Jetty to 9.4.34
> ----------------------
[jira] [Created] (SPARK-33834) Verify ALTER TABLE CHANGE COLUMN with Char and Varchar
Kent Yao created SPARK-33834:
--------------------------------

         Summary: Verify ALTER TABLE CHANGE COLUMN with Char and Varchar
             Key: SPARK-33834
             URL: https://issues.apache.org/jira/browse/SPARK-33834
         Project: Spark
      Issue Type: Improvement
      Components: SQL
Affects Versions: 3.1.0
        Reporter: Kent Yao
[jira] [Created] (SPARK-33835) Refector AbstractCommandBuilder
Xudingyu created SPARK-33835:
--------------------------------

         Summary: Refector AbstractCommandBuilder
             Key: SPARK-33835
             URL: https://issues.apache.org/jira/browse/SPARK-33835
         Project: Spark
      Issue Type: Improvement
      Components: Spark Submit
Affects Versions: 3.0.0
        Reporter: Xudingyu

Refector AbstractCommandBuilder: use firstNonEmpty to get javaHome
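The proposed refactoring replaces nested if/else lookups with a first-non-empty helper. A minimal sketch of that idea in Python (names and sources hypothetical; the actual change is in Spark's Java launcher code):

```python
def first_non_empty(*candidates):
    """Return the first candidate that is a non-empty string, else None.

    Mirrors the 'firstNonEmpty' idea from the ticket: try each possible
    source of a value in priority order instead of chained conditionals.
    (Hypothetical sketch, not Spark's actual implementation.)
    """
    for value in candidates:
        if value:  # skips both None and ""
            return value
    return None


# Resolve a Java home from several hypothetical sources, in priority order.
java_home = first_non_empty(None, "", "/usr/lib/jvm/java-11")
print(java_home)  # -> /usr/lib/jvm/java-11
```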
[jira] [Updated] (SPARK-30186) support Dynamic Partition Pruning in Adaptive Execution
[ https://issues.apache.org/jira/browse/SPARK-30186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-30186:
--------------------------------
        Parent: SPARK-33828
    Issue Type: Sub-task  (was: Improvement)

> support Dynamic Partition Pruning in Adaptive Execution
> -------------------------------------------------------
>
>          Key: SPARK-30186
>          URL: https://issues.apache.org/jira/browse/SPARK-30186
>      Project: Spark
>   Issue Type: Sub-task
>   Components: SQL
> Affects Versions: 3.1.0
>     Reporter: Xiaoju Wu
>     Priority: Major
>
> Currently, Adaptive Execution cannot work if Dynamic Partition Pruning is applied:
> private def supportAdaptive(plan: SparkPlan): Boolean = {
>   // TODO migrate dynamic-partition-pruning onto adaptive execution.
>   sanityCheck(plan) &&
>     !plan.logicalLink.exists(_.isStreaming) &&
>     *!plan.expressions.exists(_.find(_.isInstanceOf[DynamicPruningSubquery]).isDefined)* &&
>     plan.children.forall(supportAdaptive)
> }
> This means we cannot get the performance benefits of both AE and DPP. This ticket targets making DPP and AE work together.
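The highlighted condition above bails out of AQE whenever any expression tree contains a DynamicPruningSubquery node. The traversal it performs can be sketched in Python (the dict-based plan representation is hypothetical; the real check is `Expression.find(_.isInstanceOf[DynamicPruningSubquery])` in Scala):

```python
def contains_dynamic_pruning(expr):
    """Recursively check a (hypothetical) expression tree, given as a dict
    with 'type' and 'children' keys, for a DynamicPruningSubquery node."""
    if expr["type"] == "DynamicPruningSubquery":
        return True
    return any(contains_dynamic_pruning(c) for c in expr.get("children", []))


# A toy filter expression containing a dynamic-pruning subquery.
plan_expr = {
    "type": "And",
    "children": [
        {"type": "EqualTo", "children": []},
        {"type": "DynamicPruningSubquery", "children": []},
    ],
}
print(contains_dynamic_pruning(plan_expr))  # -> True
```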
[jira] [Assigned] (SPARK-33835) Refector AbstractCommandBuilder
[ https://issues.apache.org/jira/browse/SPARK-33835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33835:
------------------------------------
    Assignee: (was: Apache Spark)

> Refector AbstractCommandBuilder
> -------------------------------
[jira] [Commented] (SPARK-33835) Refector AbstractCommandBuilder
[ https://issues.apache.org/jira/browse/SPARK-33835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251479#comment-17251479 ]

Apache Spark commented on SPARK-33835:
--------------------------------------
User 'offthewall123' has created a pull request for this issue:
https://github.com/apache/spark/pull/30831

> Refector AbstractCommandBuilder
> -------------------------------
[jira] [Assigned] (SPARK-33835) Refector AbstractCommandBuilder
[ https://issues.apache.org/jira/browse/SPARK-33835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33835:
------------------------------------
    Assignee: Apache Spark

> Refector AbstractCommandBuilder
> -------------------------------
[jira] [Assigned] (SPARK-33489) Support null for conversion from and to Arrow type
[ https://issues.apache.org/jira/browse/SPARK-33489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33489:
------------------------------------
    Assignee: (was: Apache Spark)

> Support null for conversion from and to Arrow type
> --------------------------------------------------
>
>          Key: SPARK-33489
>          URL: https://issues.apache.org/jira/browse/SPARK-33489
>      Project: Spark
>   Issue Type: Improvement
>   Components: PySpark
> Affects Versions: 3.0.1
>     Reporter: Yuya Kanai
>     Priority: Minor
>
> I got the below error when using from_arrow_type() in pyspark.sql.pandas.types:
> {{Unsupported type in conversion from Arrow: null}}
> I noticed NullType exists under pyspark.sql.types, so it seems possible to convert from the pyarrow null type to the pyspark null type and vice versa.
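A minimal sketch of the kind of mapping change being suggested, in plain Python. The name-keyed dict and function here are hypothetical illustrations; the real from_arrow_type() dispatches on pyarrow type checks rather than on a string table:

```python
# Hypothetical name-based sketch of an Arrow -> Spark SQL type mapping;
# adding a "null" entry is what would avoid the "Unsupported type" error
# for all-null columns.
ARROW_TO_SPARK = {
    "null": "NullType",
    "int64": "LongType",
    "double": "DoubleType",
    "string": "StringType",
}

def from_arrow_type_name(name):
    """Map an Arrow type name to a Spark SQL type name, or raise."""
    try:
        return ARROW_TO_SPARK[name]
    except KeyError:
        raise TypeError("Unsupported type in conversion from Arrow: %s" % name)

print(from_arrow_type_name("null"))  # -> NullType
```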
[jira] [Assigned] (SPARK-33489) Support null for conversion from and to Arrow type
[ https://issues.apache.org/jira/browse/SPARK-33489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33489:
------------------------------------
    Assignee: Apache Spark

> Support null for conversion from and to Arrow type
> --------------------------------------------------
[jira] [Commented] (SPARK-33489) Support null for conversion from and to Arrow type
[ https://issues.apache.org/jira/browse/SPARK-33489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251480#comment-17251480 ]

Apache Spark commented on SPARK-33489:
--------------------------------------
User 'Cactice' has created a pull request for this issue:
https://github.com/apache/spark/pull/30832

> Support null for conversion from and to Arrow type
> --------------------------------------------------
[jira] [Commented] (SPARK-33834) Verify ALTER TABLE CHANGE COLUMN with Char and Varchar
[ https://issues.apache.org/jira/browse/SPARK-33834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251481#comment-17251481 ]

Apache Spark commented on SPARK-33834:
--------------------------------------
User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/30833

> Verify ALTER TABLE CHANGE COLUMN with Char and Varchar
> ------------------------------------------------------
[jira] [Assigned] (SPARK-33834) Verify ALTER TABLE CHANGE COLUMN with Char and Varchar
[ https://issues.apache.org/jira/browse/SPARK-33834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33834:
------------------------------------
    Assignee: (was: Apache Spark)

> Verify ALTER TABLE CHANGE COLUMN with Char and Varchar
> ------------------------------------------------------
[jira] [Assigned] (SPARK-33834) Verify ALTER TABLE CHANGE COLUMN with Char and Varchar
[ https://issues.apache.org/jira/browse/SPARK-33834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33834:
------------------------------------
    Assignee: Apache Spark

> Verify ALTER TABLE CHANGE COLUMN with Char and Varchar
> ------------------------------------------------------
[jira] [Commented] (SPARK-33834) Verify ALTER TABLE CHANGE COLUMN with Char and Varchar
[ https://issues.apache.org/jira/browse/SPARK-33834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251482#comment-17251482 ]

Apache Spark commented on SPARK-33834:
--------------------------------------
User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/30833

> Verify ALTER TABLE CHANGE COLUMN with Char and Varchar
> ------------------------------------------------------
[jira] [Updated] (SPARK-33593) Vector reader got incorrect data with binary partition value
[ https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-33593:
----------------------------------
    Summary: Vector reader got incorrect data with binary partition value  (was: Parquet vector reader incorrect with binary partition value)

> Vector reader got incorrect data with binary partition value
> ------------------------------------------------------------
>
>          Key: SPARK-33593
>          URL: https://issues.apache.org/jira/browse/SPARK-33593
>      Project: Spark
>   Issue Type: Bug
>   Components: SQL
> Affects Versions: 3.1.0
>     Reporter: angerszhu
>     Priority: Major
>       Labels: correctness
>
> {code:java}
> test("Parquet vector reader incorrect with binary partition value") {
>   Seq(false, true).foreach(tag => {
>     withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
>       withTable("t1") {
>         sql(
>           """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
>             | USING PARQUET PARTITIONED BY (part)""".stripMargin)
>         sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', X'537061726B2053514C')")
>         if (tag) {
>           checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
>             Row("a", "Spark SQL", ""))
>         } else {
>           checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
>             Row("a", "Spark SQL", "Spark SQL"))
>         }
>       }
>     }
>   })
> }
> {code}
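As a sanity check on the test fixture above: the hex literal in the INSERT is simply the UTF-8 bytes of "Spark SQL", so a correct reader should round-trip both the id column and the binary partition value to that string, rather than the empty string the vectorized path returns:

```python
# X'537061726B2053514C' decodes to the UTF-8 string "Spark SQL"; the
# vectorized and non-vectorized readers should therefore agree on it.
raw = bytes.fromhex("537061726B2053514C")
print(raw.decode("utf-8"))  # -> Spark SQL
```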
[jira] [Updated] (SPARK-33593) Parquet vector reader incorrect with binary partition value
[ https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-33593:
----------------------------------
    Labels: correctness  (was: )

> Parquet vector reader incorrect with binary partition value
> -----------------------------------------------------------
[jira] [Updated] (SPARK-33593) Vector reader got incorrect data with binary partition value
[ https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-33593:
----------------------------------
    Affects Version/s: 3.2.0
                       3.0.0
                       3.0.1

> Vector reader got incorrect data with binary partition value
> ------------------------------------------------------------
[jira] [Resolved] (SPARK-33817) Use a logical plan to cache instead of dataframe
[ https://issues.apache.org/jira/browse/SPARK-33817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-33817.
---------------------------------
    Fix Version/s: 3.2.0
       Resolution: Fixed

Issue resolved by pull request 30815
[https://github.com/apache/spark/pull/30815]

> Use a logical plan to cache instead of dataframe
> ------------------------------------------------
>
>          Key: SPARK-33817
>          URL: https://issues.apache.org/jira/browse/SPARK-33817
>      Project: Spark
>   Issue Type: Improvement
>   Components: SQL
> Affects Versions: 3.2.0
>     Reporter: Terry Kim
>     Assignee: Terry Kim
>     Priority: Major
>      Fix For: 3.2.0
>
> When caching a query, we can use a logical plan instead of a dataframe (the current implementation) to avoid creating the dataframe.
> This is also consistent with uncaching, which uses a logical plan.
[jira] [Assigned] (SPARK-33817) Use a logical plan to cache instead of dataframe
[ https://issues.apache.org/jira/browse/SPARK-33817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-33817:
-----------------------------------
    Assignee: Terry Kim

> Use a logical plan to cache instead of dataframe
> ------------------------------------------------
[jira] [Updated] (SPARK-33593) Vector reader got incorrect data with binary partition value
[ https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-33593:
----------------------------------
    Affects Version/s: 2.4.7

> Vector reader got incorrect data with binary partition value
> ------------------------------------------------------------
[jira] [Updated] (SPARK-33593) Vector reader got incorrect data with binary partition value
[ https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-33593:
----------------------------------
    Affects Version/s: (was: 3.0.0)
                       2.3.4

> Vector reader got incorrect data with binary partition value
> ------------------------------------------------------------
[jira] [Updated] (SPARK-33593) Vector reader got incorrect data with binary partition value
[ https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-33593:
----------------------------------
    Priority: Blocker  (was: Major)

> Vector reader got incorrect data with binary partition value
> ------------------------------------------------------------
[jira] [Updated] (SPARK-33593) Vector reader got incorrect data with binary partition value
[ https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-33593:
----------------------------------
    Affects Version/s: 2.2.3

> Vector reader got incorrect data with binary partition value
> ------------------------------------------------------------
[jira] [Commented] (SPARK-33593) Vector reader got incorrect data with binary partition value
[ https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251502#comment-17251502 ]

Dongjoon Hyun commented on SPARK-33593:
---------------------------------------
Although this is not a regression, I marked this as a Blocker because this is a correctness issue. cc [~hyukjin.kwon] and [~cloud_fan]

> Vector reader got incorrect data with binary partition value
> ------------------------------------------------------------
[jira] [Updated] (SPARK-33593) Vector reader got incorrect data with binary partition value
[ https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33593: -- Affects Version/s: 2.1.3 > Vector reader got incorrect data with binary partition value > > > Key: SPARK-33593 > URL: https://issues.apache.org/jira/browse/SPARK-33593 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.7, 3.0.1, 3.1.0, 3.2.0 >Reporter: angerszhu >Priority: Blocker > Labels: correctness > > {code:java} > test("Parquet vector reader incorrect with binary partition value") { > Seq(false, true).foreach(tag => { > withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) { > withTable("t1") { > sql( > """CREATE TABLE t1(name STRING, id BINARY, part BINARY) > | USING PARQUET PARTITIONED BY (part)""".stripMargin) > sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', > X'537061726B2053514C')") > if (tag) { > checkAnswer(sql("SELECT name, cast(id as string), cast(part as > string) FROM t1"), > Row("a", "Spark SQL", "")) > } else { > checkAnswer(sql("SELECT name, cast(id as string), cast(part as > string) FROM t1"), > Row("a", "Spark SQL", "Spark SQL")) > } > } > } > }) > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33593) Vector reader got incorrect data with binary partition value
[ https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33593: -- Target Version/s: 2.4.8, 3.0.2, 3.1.0 > Vector reader got incorrect data with binary partition value > > > Key: SPARK-33593 > URL: https://issues.apache.org/jira/browse/SPARK-33593 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.7, 3.0.1, 3.1.0, 3.2.0 >Reporter: angerszhu >Priority: Blocker > Labels: correctness > > {code:java} > test("Parquet vector reader incorrect with binary partition value") { > Seq(false, true).foreach(tag => { > withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) { > withTable("t1") { > sql( > """CREATE TABLE t1(name STRING, id BINARY, part BINARY) > | USING PARQUET PARTITIONED BY (part)""".stripMargin) > sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', > X'537061726B2053514C')") > if (tag) { > checkAnswer(sql("SELECT name, cast(id as string), cast(part as > string) FROM t1"), > Row("a", "Spark SQL", "")) > } else { > checkAnswer(sql("SELECT name, cast(id as string), cast(part as > string) FROM t1"), > Row("a", "Spark SQL", "Spark SQL")) > } > } > } > }) > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33593) Vector reader got incorrect data with binary partition value
[ https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33593: -- Affects Version/s: 2.0.2 > Vector reader got incorrect data with binary partition value > > > Key: SPARK-33593 > URL: https://issues.apache.org/jira/browse/SPARK-33593 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.7, 3.0.1, 3.1.0, 3.2.0 >Reporter: angerszhu >Priority: Blocker > Labels: correctness > > {code:java} > test("Parquet vector reader incorrect with binary partition value") { > Seq(false, true).foreach(tag => { > withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) { > withTable("t1") { > sql( > """CREATE TABLE t1(name STRING, id BINARY, part BINARY) > | USING PARQUET PARTITIONED BY (part)""".stripMargin) > sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', > X'537061726B2053514C')") > if (tag) { > checkAnswer(sql("SELECT name, cast(id as string), cast(part as > string) FROM t1"), > Row("a", "Spark SQL", "")) > } else { > checkAnswer(sql("SELECT name, cast(id as string), cast(part as > string) FROM t1"), > Row("a", "Spark SQL", "Spark SQL")) > } > } > } > }) > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33836) Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark
Jungtaek Lim created SPARK-33836: Summary: Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark Key: SPARK-33836 URL: https://issues.apache.org/jira/browse/SPARK-33836 Project: Spark Issue Type: Task Components: Structured Streaming Affects Versions: 3.1.0 Reporter: Jungtaek Lim From SPARK-32885 and SPARK-32896 we added two public APIs to enable read/write with tables, but only on the Scala side, so only JVM languages can leverage them. Given there are lots of PySpark users, it would be great to expose these public APIs to PySpark as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33836) Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark
[ https://issues.apache.org/jira/browse/SPARK-33836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251509#comment-17251509 ] Hyukjin Kwon commented on SPARK-33836: -- cc [~zero323] FYI > Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark > - > > Key: SPARK-33836 > URL: https://issues.apache.org/jira/browse/SPARK-33836 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Jungtaek Lim >Priority: Major > > From SPARK-32885 and SPARK-32896 we added two public APIs to enable > read/write with table, but only in Scala side so only JVM languages could > leverage them. > Given there're lots of PySpark users, it would be great to expose these > public APIs to PySpark as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33836) Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark
[ https://issues.apache.org/jira/browse/SPARK-33836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251510#comment-17251510 ] Apache Spark commented on SPARK-33836: -- User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/30835 > Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark > - > > Key: SPARK-33836 > URL: https://issues.apache.org/jira/browse/SPARK-33836 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Jungtaek Lim >Priority: Major > > From SPARK-32885 and SPARK-32896 we added two public APIs to enable > read/write with table, but only in Scala side so only JVM languages could > leverage them. > Given there're lots of PySpark users, it would be great to expose these > public APIs to PySpark as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33836) Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark
[ https://issues.apache.org/jira/browse/SPARK-33836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33836: Assignee: (was: Apache Spark) > Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark > - > > Key: SPARK-33836 > URL: https://issues.apache.org/jira/browse/SPARK-33836 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Jungtaek Lim >Priority: Major > > From SPARK-32885 and SPARK-32896 we added two public APIs to enable > read/write with table, but only in Scala side so only JVM languages could > leverage them. > Given there're lots of PySpark users, it would be great to expose these > public APIs to PySpark as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33836) Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark
[ https://issues.apache.org/jira/browse/SPARK-33836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33836: Assignee: Apache Spark > Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark > - > > Key: SPARK-33836 > URL: https://issues.apache.org/jira/browse/SPARK-33836 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Jungtaek Lim >Assignee: Apache Spark >Priority: Major > > From SPARK-32885 and SPARK-32896 we added two public APIs to enable > read/write with table, but only in Scala side so only JVM languages could > leverage them. > Given there're lots of PySpark users, it would be great to expose these > public APIs to PySpark as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33833) Allow Spark Structured Streaming report Kafka Lag through Burrow
[ https://issues.apache.org/jira/browse/SPARK-33833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251512#comment-17251512 ] Hyukjin Kwon commented on SPARK-33833: -- Looks like it leverages listeners. Can you use QueryExecutionListener instead? > Allow Spark Structured Streaming report Kafka Lag through Burrow > > > Key: SPARK-33833 > URL: https://issues.apache.org/jira/browse/SPARK-33833 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.1 >Reporter: Sam Davarnia >Priority: Major > > Because structured streaming tracks Kafka offset consumption by itself, > It is not possible to track total Kafka lag using Burrow similar to DStreams > We have used Stream hooks as mentioned > [here|https://medium.com/@ronbarabash/how-to-measure-consumer-lag-in-spark-structured-streaming-6c3645e45a37] > > It would be great if Spark supports this feature out of the box. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
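For context on what Burrow ultimately reports: consumer lag is just the per-partition gap between the log-end offset and the last offset the consumer has processed, summed over partitions. A minimal sketch of that computation in plain Python (names hypothetical; this is not a Spark or Burrow API):

```python
def total_lag(log_end_offsets, consumed_offsets):
    """Sum of per-partition lag: log-end offset minus last consumed offset.

    Both arguments map a (topic, partition) key to an offset; partitions
    missing from consumed_offsets are treated as fully lagged (offset 0).
    """
    return sum(
        end - consumed_offsets.get(tp, 0)
        for tp, end in log_end_offsets.items()
    )

ends = {("events", 0): 120, ("events", 1): 80}
consumed = {("events", 0): 100, ("events", 1): 80}
print(total_lag(ends, consumed))  # → 20
```

Because Structured Streaming tracks these offsets in its own checkpoints rather than committing them to Kafka's consumer-group offsets, external tools like Burrow never see the `consumed_offsets` side, which is the gap the ticket asks to close.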
[jira] [Commented] (SPARK-33826) InsertIntoHiveTable generate HDFS file with invalid user
[ https://issues.apache.org/jira/browse/SPARK-33826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251513#comment-17251513 ] Hyukjin Kwon commented on SPARK-33826: -- [~AlberyZJG] are you able to show the self-contained reproducer so people can verify easily? > InsertIntoHiveTable generate HDFS file with invalid user > > > Key: SPARK-33826 > URL: https://issues.apache.org/jira/browse/SPARK-33826 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 3.0.0 >Reporter: Zhang Jianguo >Priority: Minor > > *Arch:* Hive on Spark. > > *Version:* Spark 2.3.2 > > *Conf:* > Enable user impersonation > hive.server2.enable.doAs=true > > *Scenario:* > Thriftserver is running with loginUser A, and Task run as User A too. > Client execute SQL with user B > > Data generated by sql "insert into TABLE \[tbl\] select XXX form ." is > written to HDFS on executor, executor doesn't know B. > > *{color:#de350b}So the user file written to HDFS will be user A which should > be B.{color}* > > I also check the inplementation of Spark 3.0.0, It could have the same issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33795) gapply fails execution with rbind error
[ https://issues.apache.org/jira/browse/SPARK-33795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251515#comment-17251515 ] Hyukjin Kwon commented on SPARK-33795: -- [~n8shdw] can you check if this happens in Apache Spark instead of Databricks Runtime? > gapply fails execution with rbind error > --- > > Key: SPARK-33795 > URL: https://issues.apache.org/jira/browse/SPARK-33795 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 3.0.0 > Environment: Databricks runtime 7.3 LTS ML >Reporter: MvR >Priority: Major > Attachments: Rerror.log > > > Executing following code on databricks runtime 7.3 LTS ML errors out showing > some rbind error whereas it is successfully executed without enabling Arrow > in Spark session. Full error message attached. > > ``` > library(dplyr) > library(SparkR) > SparkR::sparkR.session(sparkConfig = > list(spark.sql.execution.arrow.sparkr.enabled = "true")) > mtcars %>% > SparkR::as.DataFrame() %>% > SparkR::gapply(x = ., > cols = c("cyl", "vs"), > > func = function(key, > data){ > > dt <- data[,c("mpg", "qsec")] > res <- apply(dt, 2, mean) > df <- data.frame(firstGroupKey = key[1], > secondGroupKey = key[2], > mean_mpg = res[1], > mean_cyl = res[2]) > return(df) > > }, > schema = structType(structField("cyl", "double"), > structField("vs", "double"), > structField("mpg_mean", "double"), > structField("qsec_mean", "double")) > ) %>% > display() > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
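Setting the Arrow serialization failure itself aside, the `gapply` call above just computes per-(cyl, vs) group means of mpg and qsec. A plain-Python sketch of the same computation (hypothetical helper, handy for checking the expected output against the SparkR result):

```python
from collections import defaultdict

def group_means(rows, keys=("cyl", "vs"), cols=("mpg", "qsec")):
    """Group dict-rows by `keys` and compute the mean of each of `cols`,
    mirroring what the gapply func in the report computes per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    out = []
    for key, members in groups.items():
        rec = dict(zip(keys, key))
        for c in cols:
            rec[f"{c}_mean"] = sum(m[c] for m in members) / len(members)
        out.append(rec)
    return out

rows = [
    {"cyl": 4, "vs": 1, "mpg": 30.0, "qsec": 19.0},
    {"cyl": 4, "vs": 1, "mpg": 28.0, "qsec": 21.0},
    {"cyl": 8, "vs": 0, "mpg": 15.0, "qsec": 17.0},
]
print(group_means(rows))
```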
[jira] [Commented] (SPARK-33791) grouping__id() result does not consistent with hive's version < 2.3
[ https://issues.apache.org/jira/browse/SPARK-33791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251516#comment-17251516 ] Hyukjin Kwon commented on SPARK-33791: -- Can you document it in http://spark.apache.org/docs/latest/sql-migration-guide.html#compatibility-with-apache-hive? > grouping__id() result does not consistent with hive's version < 2.3 > --- > > Key: SPARK-33791 > URL: https://issues.apache.org/jira/browse/SPARK-33791 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3, 3.0.1 >Reporter: Su Qilong >Priority: Minor > > See this > [https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup] > Hive's grouping__id method made a change since hive version 2.3.0. Now spark > does not declare this inconsistency with Hive, which may make user believe > they're safe from migrating their query from Hive 1.x to Spark, but which is > wrong. > I guess we should note this difference in Hive migration guide, and add a > configuration to let grouping__id to use hive 1.x compatible algorithm -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
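To make the incompatibility concrete: the two Hive generations assign different integers to the same grouping set, so queries filtering on GROUPING__ID silently change meaning across the upgrade. The sketch below is illustrative only; it assumes one plausible pair of conventions (bit set when a column is aggregated away vs. bit set when it is present), while the exact Hive semantics are described on the wiki page linked above:

```python
def grouping_id(group_by_cols, grouping_set, legacy=False):
    """Illustrative only: compute a GROUPING__ID-style bitmask.

    Convention assumed here for the modern style: bit i (first GROUP BY
    column = most significant bit) is 1 when the column is aggregated
    away (absent from the grouping set). `legacy=True` flips the
    polarity: a bit is set when the column IS present. Both are
    hypothetical stand-ins chosen to show the integers diverge.
    """
    n = len(group_by_cols)
    gid = 0
    for i, col in enumerate(group_by_cols):
        absent = col not in grouping_set
        if absent != legacy:  # modern: set when absent; legacy: when present
            gid |= 1 << (n - 1 - i)
    return gid

# GROUP BY a, b with grouping set {a}: b is aggregated away.
print(grouping_id(["a", "b"], {"a"}))               # → 1
print(grouping_id(["a", "b"], {"a"}, legacy=True))  # → 2
```

The same grouping set yields 1 under one convention and 2 under the other, which is why a migration note plus a compatibility configuration (as the reporter suggests) matters for queries ported from Hive 1.x.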
[jira] [Updated] (SPARK-33825) Is Spark SQL able to auto update partition stats like hive by setting hive.stats.autogather=true
[ https://issues.apache.org/jira/browse/SPARK-33825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33825: - Target Version/s: (was: 3.0.1) > Is Spark SQL able to auto update partition stats like hive by setting > hive.stats.autogather=true > > > Key: SPARK-33825 > URL: https://issues.apache.org/jira/browse/SPARK-33825 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: yang >Priority: Major > Labels: partitionStat > > {{`spark.sql.statistics.size.autoUpdate.enabled` is only work for table stats > update.}}{{But for partition stats,I can only update it with `ANALYZE TABLE > tablename PARTITION(part) COMPUTE STATISTICS`.So is Spark SQL able to auto > update partition stats like hive by setting hive.stats.autogather=true?}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33825) Is Spark SQL able to auto update partition stats like hive by setting hive.stats.autogather=true
[ https://issues.apache.org/jira/browse/SPARK-33825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33825. -- Resolution: Invalid Let's ask questions to dev mailing list before filing it as an issue. See also https://spark.apache.org/community.html > Is Spark SQL able to auto update partition stats like hive by setting > hive.stats.autogather=true > > > Key: SPARK-33825 > URL: https://issues.apache.org/jira/browse/SPARK-33825 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: yang >Priority: Major > Labels: partitionStat > > {{`spark.sql.statistics.size.autoUpdate.enabled` is only work for table stats > update.}}{{But for partition stats,I can only update it with `ANALYZE TABLE > tablename PARTITION(part) COMPUTE STATISTICS`.So is Spark SQL able to auto > update partition stats like hive by setting hive.stats.autogather=true?}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33836) Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark
[ https://issues.apache.org/jira/browse/SPARK-33836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-33836: - Affects Version/s: (was: 3.1.0) 3.2.0 > Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark > - > > Key: SPARK-33836 > URL: https://issues.apache.org/jira/browse/SPARK-33836 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Priority: Major > > From SPARK-32885 and SPARK-32896 we added two public APIs to enable > read/write with table, but only in Scala side so only JVM languages could > leverage them. > Given there're lots of PySpark users, it would be great to expose these > public APIs to PySpark as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26341) Expose executor memory metrics at the stage level, in the Stages tab
[ https://issues.apache.org/jira/browse/SPARK-26341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-26341. Fix Version/s: 3.2.0 3.1.0 Assignee: angerszhu Resolution: Fixed This issue is resolved in https://github.com/apache/spark/pull/30573 > Expose executor memory metrics at the stage level, in the Stages tab > > > Key: SPARK-26341 > URL: https://issues.apache.org/jira/browse/SPARK-26341 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Web UI >Affects Versions: 3.0.1 >Reporter: Edward Lu >Assignee: angerszhu >Priority: Major > Fix For: 3.1.0, 3.2.0 > > > Sub-task SPARK-23431 will add stage level executor memory metrics (peak > values for each stage, and peak values for each executor for the stage). This > information should also be exposed the the web UI, so that users can see > which stages are memory intensive. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
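For reference, the aggregation the Stages tab needs from the SPARK-23431 metrics amounts to a running maximum, kept per stage and per (stage, executor). A minimal sketch in plain Python (the data shape is hypothetical, not Spark's internal representation):

```python
def stage_peaks(samples):
    """samples: iterable of (stage_id, executor_id, metric_value).

    Returns (peak per stage, peak per (stage, executor)) — the two
    views of executor memory metrics the ticket asks the UI to expose.
    """
    per_stage, per_exec = {}, {}
    for stage, exe, v in samples:
        per_stage[stage] = max(per_stage.get(stage, v), v)
        k = (stage, exe)
        per_exec[k] = max(per_exec.get(k, v), v)
    return per_stage, per_exec

samples = [
    (1, "exec-1", 512), (1, "exec-1", 768), (1, "exec-2", 640),
    (2, "exec-1", 256),
]
stage_peak, exec_peak = stage_peaks(samples)
print(stage_peak)  # → {1: 768, 2: 256}
print(exec_peak)
```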
[jira] [Assigned] (SPARK-33791) grouping__id() result does not consistent with hive's version < 2.3
[ https://issues.apache.org/jira/browse/SPARK-33791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33791: Assignee: (was: Apache Spark) > grouping__id() result does not consistent with hive's version < 2.3 > --- > > Key: SPARK-33791 > URL: https://issues.apache.org/jira/browse/SPARK-33791 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3, 3.0.1 >Reporter: Su Qilong >Priority: Minor > > See this > [https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup] > Hive's grouping__id method made a change since hive version 2.3.0. Now spark > does not declare this inconsistency with Hive, which may make user believe > they're safe from migrating their query from Hive 1.x to Spark, but which is > wrong. > I guess we should note this difference in Hive migration guide, and add a > configuration to let grouping__id to use hive 1.x compatible algorithm -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33791) grouping__id() result does not consistent with hive's version < 2.3
[ https://issues.apache.org/jira/browse/SPARK-33791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33791: Assignee: Apache Spark > grouping__id() result does not consistent with hive's version < 2.3 > --- > > Key: SPARK-33791 > URL: https://issues.apache.org/jira/browse/SPARK-33791 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3, 3.0.1 >Reporter: Su Qilong >Assignee: Apache Spark >Priority: Minor > > See this > [https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup] > Hive's grouping__id method made a change since hive version 2.3.0. Now spark > does not declare this inconsistency with Hive, which may make user believe > they're safe from migrating their query from Hive 1.x to Spark, but which is > wrong. > I guess we should note this difference in Hive migration guide, and add a > configuration to let grouping__id to use hive 1.x compatible algorithm -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33791) grouping__id() result does not consistent with hive's version < 2.3
[ https://issues.apache.org/jira/browse/SPARK-33791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251547#comment-17251547 ] Apache Spark commented on SPARK-33791: -- User 'sqlwindspeaker' has created a pull request for this issue: https://github.com/apache/spark/pull/30836 > grouping__id() result does not consistent with hive's version < 2.3 > --- > > Key: SPARK-33791 > URL: https://issues.apache.org/jira/browse/SPARK-33791 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3, 3.0.1 >Reporter: Su Qilong >Priority: Minor > > See this > [https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup] > Hive's grouping__id method made a change since hive version 2.3.0. Now spark > does not declare this inconsistency with Hive, which may make user believe > they're safe from migrating their query from Hive 1.x to Spark, but which is > wrong. > I guess we should note this difference in Hive migration guide, and add a > configuration to let grouping__id to use hive 1.x compatible algorithm -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31948) expose mapSideCombine in aggByKey/reduceByKey/foldByKey
[ https://issues.apache.org/jira/browse/SPARK-31948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-31948. -- Resolution: Not A Problem > expose mapSideCombine in aggByKey/reduceByKey/foldByKey > --- > > Key: SPARK-31948 > URL: https://issues.apache.org/jira/browse/SPARK-31948 > Project: Spark > Issue Type: Improvement > Components: ML, Spark Core >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Minor > > 1. {{aggregateByKey}}, {{reduceByKey}} and {{foldByKey}} will always perform > {{mapSideCombine}}; > However, this can be skiped sometime, specially in ML (RobustScaler): > {code:java} > vectors.mapPartitions { iter => > if (iter.hasNext) { > val summaries = Array.fill(numFeatures)( > new QuantileSummaries(QuantileSummaries.defaultCompressThreshold, > relativeError)) > while (iter.hasNext) { > val vec = iter.next > vec.foreach { (i, v) => if (!v.isNaN) summaries(i) = > summaries(i).insert(v) } > } > Iterator.tabulate(numFeatures)(i => (i, summaries(i).compress)) > } else Iterator.empty > }.reduceByKey { case (s1, s2) => s1.merge(s2) } {code} > > This {{reduceByKey}} in {{RobustScaler}} does not need {{mapSideCombine}} at > all, similar places exist in {{KMeans}}, {{GMM}}, etc; > To my knowledge, we do not need {{mapSideCombine}} if the reduction factor > isn't high; > > 2. {{treeAggregate}} and {{treeReduce}} are based on {{foldByKey}}, the > {{mapSideCombine}} in the first call of {{foldByKey}} can also be avoided. > > SPARK-772: > {quote} > Map side combine in group by key case does not reduce the amount of data > shuffled. Instead, it forces a lot more objects to go into old gen, and leads > to worse GC. > {quote} > > So what about: > 1. exposing mapSideCombine in {{aggByKey}}/{{reduceByKey}}/{{foldByKey}}, so > that user can disable unnecessary mapSideCombine > 2. disabling the {{mapSideCombine}} in the first call of {{foldByKey}} in > {{treeAggregate}} and {{treeReduce}} > 3. 
disabling the unnecessary {{mapSideCombine}} in ML; > Friendly ping [~srowen] [~huaxingao] [~weichenxu123] [~hyukjin.kwon] > [~viirya] > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
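The reduction-factor argument above can be made concrete with a toy model: map-side combining only shrinks the shuffle when many records per map partition share a key. A plain-Python sketch (not Spark's implementation):

```python
from collections import defaultdict

def shuffled_records(partitions, map_side_combine):
    """Count records 'shuffled' from each map partition for a
    reduce-by-key, with or without map-side (partial) combining."""
    shuffled = 0
    for part in partitions:
        if map_side_combine:
            combined = defaultdict(int)  # partial aggregation per partition
            for k, v in part:
                combined[k] += v
            shuffled += len(combined)
        else:
            shuffled += len(part)
    return shuffled

# High reduction factor (few keys): combining shrinks the shuffle 8 → 2.
part = [("x", 1)] * 4 + [("y", 1)] * 4
print(shuffled_records([part], True), shuffled_records([part], False))    # → 2 8

# Reduction factor ~1 (all keys distinct, like the RobustScaler
# reduceByKey over feature indices): combining saves nothing, yet still
# pays for building a hash map per partition — the cost SPARK-772 notes.
part2 = [(i, 1) for i in range(8)]
print(shuffled_records([part2], True), shuffled_records([part2], False))  # → 8 8
```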