[jira] [Updated] (SPARK-33831) Update Jetty to 9.4.34

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33831:
--
Issue Type: Bug  (was: Improvement)

> Update Jetty to 9.4.34
> --
>
> Key: SPARK-33831
> URL: https://issues.apache.org/jira/browse/SPARK-33831
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 3.0.1
>Reporter: Sean R. Owen
>Priority: Minor
>
> We should update Jetty to 9.4.34, from 9.4.28, to pick up fixes, plus a 
> possible CVE fix.
> https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.33.v20201020
> https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.34.v20201102



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33822) TPCDS Q5 fails if spark.sql.adaptive.enabled=true

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33822:
--
Fix Version/s: 3.0.2

> TPCDS Q5 fails if spark.sql.adaptive.enabled=true
> -
>
> Key: SPARK-33822
> URL: https://issues.apache.org/jira/browse/SPARK-33822
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0, 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Takeshi Yamamuro
>Priority: Blocker
> Fix For: 3.0.2, 3.1.0
>
>
> **PROBLEM STATEMENT**
> {code}
> >>> tables = ['call_center', 'catalog_page', 'catalog_returns', 'catalog_sales',
> ...            'customer', 'customer_address', 'customer_demographics', 'date_dim',
> ...            'household_demographics', 'income_band', 'inventory', 'item',
> ...            'promotion', 'reason', 'ship_mode', 'store', 'store_returns',
> ...            'store_sales', 'time_dim', 'warehouse', 'web_page', 'web_returns',
> ...            'web_sales', 'web_site']
> >>> for t in tables:
> ...     spark.sql("CREATE TABLE %s USING PARQUET LOCATION '/Users/dongjoon/data/10g/%s'" % (t, t))
> ...
> >>> spark.sql(spark.sparkContext.wholeTextFiles("/Users/dongjoon/data/query/q5.sql").take(1)[0][1]).show(1)
> +---------------+----------------+-------------+-----------+-------------+
> |        channel|              id|        sales|    returns|       profit|
> +---------------+----------------+-------------+-----------+-------------+
> |           null|            null|1143646603.07|30617460.71|-317540732.87|
> |catalog channel|            null| 393609478.06| 9451732.79| -44801262.72|
> |catalog channel|catalog_pageA...|         0.00|   39037.48|    -25330.29|
> ...
> +---------------+----------------+-------------+-----------+-------------+
> >>> sql("set spark.sql.adaptive.enabled=true")
> >>> spark.sql(spark.sparkContext.wholeTextFiles("/Users/dongjoon/data/query/q5.sql").take(1)[0][1]).show(1)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/pyspark/sql/dataframe.py", line 440, in show
>     print(self._jdf.showString(n, 20, vertical))
>   File "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
>   File "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/pyspark/sql/utils.py", line 128, in deco
>     return f(*a, **kw)
>   File "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o160.showString.
> : java.lang.UnsupportedOperationException: BroadcastExchange does not support the execute() code path.
>   at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecute(BroadcastExchangeExec.scala:190)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at org.apache.spark.sql.execution.exchange.ReusedExchangeExec.doExecute(Exchange.scala:61)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at org.apache.spark.sql.execution.adaptive.QueryStageExec.doExecute(QueryStageExec.scala:115)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:316)
>   at org.apache.spark.sql.execution.SparkPlan.executeCollectIterator(SparkPlan.scala:392)
>   at org.apache.spark.sql.execution.exchange.BroadcastExchang
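
Until the fix lands, a minimal PySpark workaround sketch, assuming (as the trace above suggests) the failure only occurs with adaptive execution enabled; the query path is the reporter's:

{code:python}
# Hedged workaround sketch, not an official fix: run the affected query
# with AQE disabled for the session.
spark.conf.set("spark.sql.adaptive.enabled", "false")
q5 = spark.sparkContext.wholeTextFiles("/Users/dongjoon/data/query/q5.sql").take(1)[0][1]
spark.sql(q5).show(1)
{code}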

[jira] [Assigned] (SPARK-33831) Update Jetty to 9.4.34

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33831:
-

Assignee: Sean R. Owen

> Update Jetty to 9.4.34
> --
>
> Key: SPARK-33831
> URL: https://issues.apache.org/jira/browse/SPARK-33831
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 3.0.1
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
>
> We should update Jetty to 9.4.34, from 9.4.28, to pick up fixes, plus a 
> possible CVE fix.
> https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.33.v20201020
> https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.34.v20201102



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33831) Update Jetty to 9.4.34

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33831.
---
Fix Version/s: 2.4.8
   3.0.2
   3.1.0
   Resolution: Fixed

Issue resolved by pull request 30828
[https://github.com/apache/spark/pull/30828]

> Update Jetty to 9.4.34
> --
>
> Key: SPARK-33831
> URL: https://issues.apache.org/jira/browse/SPARK-33831
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 3.0.1
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
> Fix For: 3.1.0, 3.0.2, 2.4.8
>
>
> We should update Jetty to 9.4.34, from 9.4.28, to pick up fixes, plus a 
> possible CVE fix.
> https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.33.v20201020
> https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.34.v20201102



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33834) Verify ALTER TABLE CHANGE COLUMN with Char and Varchar

2020-12-17 Thread Kent Yao (Jira)
Kent Yao created SPARK-33834:


 Summary: Verify ALTER TABLE CHANGE COLUMN with Char and Varchar
 Key: SPARK-33834
 URL: https://issues.apache.org/jira/browse/SPARK-33834
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33835) Refactor AbstractCommandBuilder

2020-12-17 Thread Xudingyu (Jira)
Xudingyu created SPARK-33835:


 Summary: Refactor AbstractCommandBuilder
 Key: SPARK-33835
 URL: https://issues.apache.org/jira/browse/SPARK-33835
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 3.0.0
Reporter: Xudingyu


Refactor AbstractCommandBuilder: use firstNonEmpty to get javaHome



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30186) support Dynamic Partition Pruning in Adaptive Execution

2020-12-17 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-30186:

Parent: SPARK-33828
Issue Type: Sub-task  (was: Improvement)

> support Dynamic Partition Pruning in Adaptive Execution
> ---
>
> Key: SPARK-30186
> URL: https://issues.apache.org/jira/browse/SPARK-30186
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiaoju Wu
>Priority: Major
>
> Currently Adaptive Execution cannot work if Dynamic Partition Pruning is 
> applied:
> {code:java}
> private def supportAdaptive(plan: SparkPlan): Boolean = {
>   // TODO migrate dynamic-partition-pruning onto adaptive execution.
>   sanityCheck(plan) &&
>     !plan.logicalLink.exists(_.isStreaming) &&
>     !plan.expressions.exists(_.find(_.isInstanceOf[DynamicPruningSubquery]).isDefined) && // <- the DPP check
>     plan.children.forall(supportAdaptive)
> }
> {code}
> This means we cannot get the performance benefits of AE and DPP at the same 
> time. This ticket targets making DPP and AE work together.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33835) Refactor AbstractCommandBuilder

2020-12-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33835:


Assignee: (was: Apache Spark)

> Refactor AbstractCommandBuilder
> ---
>
> Key: SPARK-33835
> URL: https://issues.apache.org/jira/browse/SPARK-33835
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 3.0.0
>Reporter: Xudingyu
>Priority: Major
>
> Refactor AbstractCommandBuilder: use firstNonEmpty to get javaHome



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33835) Refactor AbstractCommandBuilder

2020-12-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251479#comment-17251479
 ] 

Apache Spark commented on SPARK-33835:
--

User 'offthewall123' has created a pull request for this issue:
https://github.com/apache/spark/pull/30831

> Refactor AbstractCommandBuilder
> ---
>
> Key: SPARK-33835
> URL: https://issues.apache.org/jira/browse/SPARK-33835
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 3.0.0
>Reporter: Xudingyu
>Priority: Major
>
> Refactor AbstractCommandBuilder: use firstNonEmpty to get javaHome



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33835) Refactor AbstractCommandBuilder

2020-12-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33835:


Assignee: Apache Spark

> Refactor AbstractCommandBuilder
> ---
>
> Key: SPARK-33835
> URL: https://issues.apache.org/jira/browse/SPARK-33835
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 3.0.0
>Reporter: Xudingyu
>Assignee: Apache Spark
>Priority: Major
>
> Refactor AbstractCommandBuilder: use firstNonEmpty to get javaHome



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33489) Support null for conversion from and to Arrow type

2020-12-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33489:


Assignee: (was: Apache Spark)

> Support null for conversion from and to Arrow type
> --
>
> Key: SPARK-33489
> URL: https://issues.apache.org/jira/browse/SPARK-33489
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.1
>Reporter: Yuya Kanai
>Priority: Minor
>
> I got the error below when using from_arrow_type() in pyspark.sql.pandas.types:
> {{Unsupported type in conversion from Arrow: null}}
> I noticed NullType exists under pyspark.sql.types, so it seems possible to 
> convert between the pyarrow null type and the PySpark NullType.
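
A rough PySpark sketch of the proposed two-way mapping (hypothetical helper names; the real change would live in pyspark.sql.pandas.types):

{code:python}
import pyarrow as pa
from pyspark.sql.types import NullType

# Hypothetical sketch of the proposed mapping, not the actual patch.
def arrow_null_to_spark(arrow_type):
    if pa.types.is_null(arrow_type):
        return NullType()
    raise TypeError("Unsupported type in conversion from Arrow: %s" % arrow_type)

def spark_null_to_arrow(spark_type):
    if isinstance(spark_type, NullType):
        return pa.null()
    raise TypeError("Unsupported type in conversion to Arrow: %s" % spark_type)
{code}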



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33489) Support null for conversion from and to Arrow type

2020-12-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33489:


Assignee: Apache Spark

> Support null for conversion from and to Arrow type
> --
>
> Key: SPARK-33489
> URL: https://issues.apache.org/jira/browse/SPARK-33489
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.1
>Reporter: Yuya Kanai
>Assignee: Apache Spark
>Priority: Minor
>
> I got the error below when using from_arrow_type() in pyspark.sql.pandas.types:
> {{Unsupported type in conversion from Arrow: null}}
> I noticed NullType exists under pyspark.sql.types, so it seems possible to 
> convert between the pyarrow null type and the PySpark NullType.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33489) Support null for conversion from and to Arrow type

2020-12-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251480#comment-17251480
 ] 

Apache Spark commented on SPARK-33489:
--

User 'Cactice' has created a pull request for this issue:
https://github.com/apache/spark/pull/30832

> Support null for conversion from and to Arrow type
> --
>
> Key: SPARK-33489
> URL: https://issues.apache.org/jira/browse/SPARK-33489
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.1
>Reporter: Yuya Kanai
>Priority: Minor
>
> I got the error below when using from_arrow_type() in pyspark.sql.pandas.types:
> {{Unsupported type in conversion from Arrow: null}}
> I noticed NullType exists under pyspark.sql.types, so it seems possible to 
> convert between the pyarrow null type and the PySpark NullType.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33834) Verify ALTER TABLE CHANGE COLUMN with Char and Varchar

2020-12-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251481#comment-17251481
 ] 

Apache Spark commented on SPARK-33834:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/30833

> Verify ALTER TABLE CHANGE COLUMN with Char and Varchar
> --
>
> Key: SPARK-33834
> URL: https://issues.apache.org/jira/browse/SPARK-33834
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33834) Verify ALTER TABLE CHANGE COLUMN with Char and Varchar

2020-12-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33834:


Assignee: (was: Apache Spark)

> Verify ALTER TABLE CHANGE COLUMN with Char and Varchar
> --
>
> Key: SPARK-33834
> URL: https://issues.apache.org/jira/browse/SPARK-33834
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33834) Verify ALTER TABLE CHANGE COLUMN with Char and Varchar

2020-12-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33834:


Assignee: Apache Spark

> Verify ALTER TABLE CHANGE COLUMN with Char and Varchar
> --
>
> Key: SPARK-33834
> URL: https://issues.apache.org/jira/browse/SPARK-33834
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33834) Verify ALTER TABLE CHANGE COLUMN with Char and Varchar

2020-12-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251482#comment-17251482
 ] 

Apache Spark commented on SPARK-33834:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/30833

> Verify ALTER TABLE CHANGE COLUMN with Char and Varchar
> --
>
> Key: SPARK-33834
> URL: https://issues.apache.org/jira/browse/SPARK-33834
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33593) Vector reader got incorrect data with binary partition value

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33593:
--
Summary: Vector reader got incorrect data with binary partition value  
(was: Parquet vector reader incorrect with binary partition value)

> Vector reader got incorrect data with binary partition value
> 
>
> Key: SPARK-33593
> URL: https://issues.apache.org/jira/browse/SPARK-33593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>  Labels: correctness
>
> {code:java}
> test("Parquet vector reader incorrect with binary partition value") {
>   Seq(false, true).foreach(tag => {
> withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
>   withTable("t1") {
> sql(
>   """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
> | USING PARQUET PARTITIONED BY (part)""".stripMargin)
> sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', 
> X'537061726B2053514C')")
> if (tag) {
>   checkAnswer(sql("SELECT name, cast(id as string), cast(part as 
> string) FROM t1"),
> Row("a", "Spark SQL", ""))
> } else {
>   checkAnswer(sql("SELECT name, cast(id as string), cast(part as 
> string) FROM t1"),
> Row("a", "Spark SQL", "Spark SQL"))
> }
>   }
> }
>   })
> }
> {code}
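
As an aside, the binary literal above is just the ASCII bytes of the partition string; e.g. in Python:

{code:python}
bytes.fromhex("537061726B2053514C").decode("ascii")  # 'Spark SQL'
{code}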



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33593) Parquet vector reader incorrect with binary partition value

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33593:
--
Labels: correctness  (was: )

> Parquet vector reader incorrect with binary partition value
> ---
>
> Key: SPARK-33593
> URL: https://issues.apache.org/jira/browse/SPARK-33593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>  Labels: correctness
>
> {code:java}
> test("Parquet vector reader incorrect with binary partition value") {
>   Seq(false, true).foreach(tag => {
> withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
>   withTable("t1") {
> sql(
>   """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
> | USING PARQUET PARTITIONED BY (part)""".stripMargin)
> sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', 
> X'537061726B2053514C')")
> if (tag) {
>   checkAnswer(sql("SELECT name, cast(id as string), cast(part as 
> string) FROM t1"),
> Row("a", "Spark SQL", ""))
> } else {
>   checkAnswer(sql("SELECT name, cast(id as string), cast(part as 
> string) FROM t1"),
> Row("a", "Spark SQL", "Spark SQL"))
> }
>   }
> }
>   })
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33593) Vector reader got incorrect data with binary partition value

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33593:
--
Affects Version/s: 3.2.0
   3.0.0
   3.0.1

> Vector reader got incorrect data with binary partition value
> 
>
> Key: SPARK-33593
> URL: https://issues.apache.org/jira/browse/SPARK-33593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0, 3.2.0
>Reporter: angerszhu
>Priority: Major
>  Labels: correctness
>
> {code:java}
> test("Parquet vector reader incorrect with binary partition value") {
>   Seq(false, true).foreach(tag => {
> withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
>   withTable("t1") {
> sql(
>   """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
> | USING PARQUET PARTITIONED BY (part)""".stripMargin)
> sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', 
> X'537061726B2053514C')")
> if (tag) {
>   checkAnswer(sql("SELECT name, cast(id as string), cast(part as 
> string) FROM t1"),
> Row("a", "Spark SQL", ""))
> } else {
>   checkAnswer(sql("SELECT name, cast(id as string), cast(part as 
> string) FROM t1"),
> Row("a", "Spark SQL", "Spark SQL"))
> }
>   }
> }
>   })
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33817) Use a logical plan to cache instead of dataframe

2020-12-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33817.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 30815
[https://github.com/apache/spark/pull/30815]

> Use a logical plan to cache instead of dataframe
> 
>
> Key: SPARK-33817
> URL: https://issues.apache.org/jira/browse/SPARK-33817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.2.0
>
>
> When caching a query, we can use a logical plan instead of a dataframe 
> (current implementation) to avoid creating the dataframe.
> This is also consistent with uncaching which uses a logical plan.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33817) Use a logical plan to cache instead of dataframe

2020-12-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33817:
---

Assignee: Terry Kim

> Use a logical plan to cache instead of dataframe
> 
>
> Key: SPARK-33817
> URL: https://issues.apache.org/jira/browse/SPARK-33817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
>
> When caching a query, we can use a logical plan instead of a dataframe 
> (current implementation) to avoid creating the dataframe.
> This is also consistent with uncaching which uses a logical plan.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33593) Vector reader got incorrect data with binary partition value

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33593:
--
Affects Version/s: 2.4.7

> Vector reader got incorrect data with binary partition value
> 
>
> Key: SPARK-33593
> URL: https://issues.apache.org/jira/browse/SPARK-33593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.0, 3.0.1, 3.1.0, 3.2.0
>Reporter: angerszhu
>Priority: Major
>  Labels: correctness
>
> {code:java}
> test("Parquet vector reader incorrect with binary partition value") {
>   Seq(false, true).foreach(tag => {
> withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
>   withTable("t1") {
> sql(
>   """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
> | USING PARQUET PARTITIONED BY (part)""".stripMargin)
> sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', 
> X'537061726B2053514C')")
> if (tag) {
>   checkAnswer(sql("SELECT name, cast(id as string), cast(part as 
> string) FROM t1"),
> Row("a", "Spark SQL", ""))
> } else {
>   checkAnswer(sql("SELECT name, cast(id as string), cast(part as 
> string) FROM t1"),
> Row("a", "Spark SQL", "Spark SQL"))
> }
>   }
> }
>   })
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33593) Vector reader got incorrect data with binary partition value

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33593:
--
Affects Version/s: (was: 3.0.0)
   2.3.4

> Vector reader got incorrect data with binary partition value
> 
>
> Key: SPARK-33593
> URL: https://issues.apache.org/jira/browse/SPARK-33593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.7, 3.0.1, 3.1.0, 3.2.0
>Reporter: angerszhu
>Priority: Major
>  Labels: correctness
>
> {code:java}
> test("Parquet vector reader incorrect with binary partition value") {
>   Seq(false, true).foreach(tag => {
> withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
>   withTable("t1") {
> sql(
>   """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
> | USING PARQUET PARTITIONED BY (part)""".stripMargin)
> sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', 
> X'537061726B2053514C')")
> if (tag) {
>   checkAnswer(sql("SELECT name, cast(id as string), cast(part as 
> string) FROM t1"),
> Row("a", "Spark SQL", ""))
> } else {
>   checkAnswer(sql("SELECT name, cast(id as string), cast(part as 
> string) FROM t1"),
> Row("a", "Spark SQL", "Spark SQL"))
> }
>   }
> }
>   })
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33593) Vector reader got incorrect data with binary partition value

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33593:
--
Priority: Blocker  (was: Major)

> Vector reader got incorrect data with binary partition value
> 
>
> Key: SPARK-33593
> URL: https://issues.apache.org/jira/browse/SPARK-33593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.3, 2.3.4, 2.4.7, 3.0.1, 3.1.0, 3.2.0
>Reporter: angerszhu
>Priority: Blocker
>  Labels: correctness
>
> {code:java}
> test("Parquet vector reader incorrect with binary partition value") {
>   Seq(false, true).foreach(tag => {
> withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
>   withTable("t1") {
> sql(
>   """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
> | USING PARQUET PARTITIONED BY (part)""".stripMargin)
> sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', 
> X'537061726B2053514C')")
> if (tag) {
>   checkAnswer(sql("SELECT name, cast(id as string), cast(part as 
> string) FROM t1"),
> Row("a", "Spark SQL", ""))
> } else {
>   checkAnswer(sql("SELECT name, cast(id as string), cast(part as 
> string) FROM t1"),
> Row("a", "Spark SQL", "Spark SQL"))
> }
>   }
> }
>   })
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33593) Vector reader got incorrect data with binary partition value

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33593:
--
Affects Version/s: 2.2.3

> Vector reader got incorrect data with binary partition value
> 
>
> Key: SPARK-33593
> URL: https://issues.apache.org/jira/browse/SPARK-33593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.3, 2.3.4, 2.4.7, 3.0.1, 3.1.0, 3.2.0
>Reporter: angerszhu
>Priority: Major
>  Labels: correctness
>
> {code:java}
> test("Parquet vector reader incorrect with binary partition value") {
>   Seq(false, true).foreach(tag => {
> withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
>   withTable("t1") {
> sql(
>   """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
> | USING PARQUET PARTITIONED BY (part)""".stripMargin)
> sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', 
> X'537061726B2053514C')")
> if (tag) {
>   checkAnswer(sql("SELECT name, cast(id as string), cast(part as 
> string) FROM t1"),
> Row("a", "Spark SQL", ""))
> } else {
>   checkAnswer(sql("SELECT name, cast(id as string), cast(part as 
> string) FROM t1"),
> Row("a", "Spark SQL", "Spark SQL"))
> }
>   }
> }
>   })
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33593) Vector reader got incorrect data with binary partition value

2020-12-17 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251502#comment-17251502
 ] 

Dongjoon Hyun commented on SPARK-33593:
---

Although this is not a regression, I marked this as a Blocker because this is a 
correctness issue.

cc [~hyukjin.kwon] and [~cloud_fan]

> Vector reader got incorrect data with binary partition value
> 
>
> Key: SPARK-33593
> URL: https://issues.apache.org/jira/browse/SPARK-33593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.3, 2.3.4, 2.4.7, 3.0.1, 3.1.0, 3.2.0
>Reporter: angerszhu
>Priority: Blocker
>  Labels: correctness
>
> {code:java}
> test("Parquet vector reader incorrect with binary partition value") {
>   Seq(false, true).foreach(tag => {
> withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
>   withTable("t1") {
> sql(
>   """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
> | USING PARQUET PARTITIONED BY (part)""".stripMargin)
> sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', 
> X'537061726B2053514C')")
> if (tag) {
>   checkAnswer(sql("SELECT name, cast(id as string), cast(part as 
> string) FROM t1"),
> Row("a", "Spark SQL", ""))
> } else {
>   checkAnswer(sql("SELECT name, cast(id as string), cast(part as 
> string) FROM t1"),
> Row("a", "Spark SQL", "Spark SQL"))
> }
>   }
> }
>   })
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33593) Vector reader got incorrect data with binary partition value

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33593:
--
Affects Version/s: 2.1.3

> Vector reader got incorrect data with binary partition value
> 
>
> Key: SPARK-33593
> URL: https://issues.apache.org/jira/browse/SPARK-33593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.7, 3.0.1, 3.1.0, 3.2.0
>Reporter: angerszhu
>Priority: Blocker
>  Labels: correctness
>
> {code:java}
> test("Parquet vector reader incorrect with binary partition value") {
>   Seq(false, true).foreach(tag => {
> withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
>   withTable("t1") {
> sql(
>   """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
> | USING PARQUET PARTITIONED BY (part)""".stripMargin)
> sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', 
> X'537061726B2053514C')")
> if (tag) {
>   checkAnswer(sql("SELECT name, cast(id as string), cast(part as 
> string) FROM t1"),
> Row("a", "Spark SQL", ""))
> } else {
>   checkAnswer(sql("SELECT name, cast(id as string), cast(part as 
> string) FROM t1"),
> Row("a", "Spark SQL", "Spark SQL"))
> }
>   }
> }
>   })
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33593) Vector reader got incorrect data with binary partition value

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33593:
--
Target Version/s: 2.4.8, 3.0.2, 3.1.0

> Vector reader got incorrect data with binary partition value
> 
>
> Key: SPARK-33593
> URL: https://issues.apache.org/jira/browse/SPARK-33593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.7, 3.0.1, 3.1.0, 3.2.0
>Reporter: angerszhu
>Priority: Blocker
>  Labels: correctness
>
> {code:java}
> test("Parquet vector reader incorrect with binary partition value") {
>   Seq(false, true).foreach(tag => {
> withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
>   withTable("t1") {
> sql(
>   """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
> | USING PARQUET PARTITIONED BY (part)""".stripMargin)
> sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', 
> X'537061726B2053514C')")
> if (tag) {
>   checkAnswer(sql("SELECT name, cast(id as string), cast(part as 
> string) FROM t1"),
> Row("a", "Spark SQL", ""))
> } else {
>   checkAnswer(sql("SELECT name, cast(id as string), cast(part as 
> string) FROM t1"),
> Row("a", "Spark SQL", "Spark SQL"))
> }
>   }
> }
>   })
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33593) Vector reader got incorrect data with binary partition value

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33593:
--
Affects Version/s: 2.0.2

> Vector reader got incorrect data with binary partition value
> 
>
> Key: SPARK-33593
> URL: https://issues.apache.org/jira/browse/SPARK-33593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.7, 3.0.1, 3.1.0, 3.2.0
>Reporter: angerszhu
>Priority: Blocker
>  Labels: correctness
>
> {code:java}
> test("Parquet vector reader incorrect with binary partition value") {
>   Seq(false, true).foreach(tag => {
> withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
>   withTable("t1") {
> sql(
>   """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
> | USING PARQUET PARTITIONED BY (part)""".stripMargin)
> sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', 
> X'537061726B2053514C')")
> if (tag) {
>   checkAnswer(sql("SELECT name, cast(id as string), cast(part as 
> string) FROM t1"),
> Row("a", "Spark SQL", ""))
> } else {
>   checkAnswer(sql("SELECT name, cast(id as string), cast(part as 
> string) FROM t1"),
> Row("a", "Spark SQL", "Spark SQL"))
> }
>   }
> }
>   })
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33836) Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark

2020-12-17 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-33836:


 Summary: Expose DataStreamReader.table and 
DataStreamWriter.toTable to PySpark
 Key: SPARK-33836
 URL: https://issues.apache.org/jira/browse/SPARK-33836
 Project: Spark
  Issue Type: Task
  Components: Structured Streaming
Affects Versions: 3.1.0
Reporter: Jungtaek Lim


From SPARK-32885 and SPARK-32896 we added two public APIs to enable read/write 
with table, but only on the Scala side, so only JVM languages could leverage them.

Given there are lots of PySpark users, it would be great to expose these public 
APIs to PySpark as well.
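
A sketch of what the PySpark surface could look like, mirroring the existing Scala API (table names and checkpoint path are illustrative):

{code:python}
# Read a streaming DataFrame from a catalog table and write the
# transformed result back to another table.
df = spark.readStream.table("input_events")
query = (df.filter("value IS NOT NULL")
           .writeStream
           .option("checkpointLocation", "/tmp/checkpoints/input_events")
           .toTable("cleaned_events"))
{code}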



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33836) Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark

2020-12-17 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251509#comment-17251509
 ] 

Hyukjin Kwon commented on SPARK-33836:
--

cc [~zero323] FYI

> Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark
> -
>
> Key: SPARK-33836
> URL: https://issues.apache.org/jira/browse/SPARK-33836
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> From SPARK-32885 and SPARK-32896 we added two public APIs to enable 
> read/write with table, but only on the Scala side, so only JVM languages 
> could leverage them.
> Given there are lots of PySpark users, it would be great to expose these 
> public APIs to PySpark as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33836) Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark

2020-12-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251510#comment-17251510
 ] 

Apache Spark commented on SPARK-33836:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/30835

> Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark
> -
>
> Key: SPARK-33836
> URL: https://issues.apache.org/jira/browse/SPARK-33836
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> From SPARK-32885 and SPARK-32896 we added two public APIs to enable 
> read/write with table, but only on the Scala side, so only JVM languages 
> could leverage them.
> Given there are lots of PySpark users, it would be great to expose these 
> public APIs to PySpark as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33836) Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark

2020-12-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33836:


Assignee: (was: Apache Spark)

> Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark
> -
>
> Key: SPARK-33836
> URL: https://issues.apache.org/jira/browse/SPARK-33836
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> From SPARK-32885 and SPARK-32896 we added two public APIs to enable 
> read/write with table, but only on the Scala side, so only JVM languages 
> could leverage them.
> Given there are lots of PySpark users, it would be great to expose these 
> public APIs to PySpark as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33836) Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark

2020-12-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33836:


Assignee: Apache Spark

> Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark
> -
>
> Key: SPARK-33836
> URL: https://issues.apache.org/jira/browse/SPARK-33836
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> From SPARK-32885 and SPARK-32896 we added two public APIs to enable 
> read/write with table, but only on the Scala side, so only JVM languages 
> could leverage them.
> Given there are lots of PySpark users, it would be great to expose these 
> public APIs to PySpark as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33833) Allow Spark Structured Streaming report Kafka Lag through Burrow

2020-12-17 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251512#comment-17251512
 ] 

Hyukjin Kwon commented on SPARK-33833:
--

Looks like it leverages listeners. Can you use QueryExecutionListener instead?

> Allow Spark Structured Streaming report Kafka Lag through Burrow
> 
>
> Key: SPARK-33833
> URL: https://issues.apache.org/jira/browse/SPARK-33833
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.1
>Reporter: Sam Davarnia
>Priority: Major
>
> Because Structured Streaming tracks Kafka offset consumption by itself, 
> it is not possible to track total Kafka lag using Burrow the way it can be 
> done with DStreams.
> We have used stream hooks as mentioned 
> [here|https://medium.com/@ronbarabash/how-to-measure-consumer-lag-in-spark-structured-streaming-6c3645e45a37].
> It would be great if Spark supported this feature out of the box.
>  
>  
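
In the meantime, a hedged monitoring sketch from PySpark: poll the query's progress and export the per-source end offsets to whatever feeds Burrow (field names follow the StreamingQueryProgress JSON):

{code:python}
# Assumes `query` is a running StreamingQuery over a Kafka source.
progress = query.lastProgress
if progress is not None:
    for source in progress["sources"]:
        # endOffset is a JSON string mapping topic -> partition -> offset.
        print(source["description"], source["endOffset"])
{code}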



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33826) InsertIntoHiveTable generate HDFS file with invalid user

2020-12-17 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251513#comment-17251513
 ] 

Hyukjin Kwon commented on SPARK-33826:
--

[~AlberyZJG] are you able to share a self-contained reproducer so people can 
verify it easily?

> InsertIntoHiveTable generate HDFS file with invalid user
> 
>
> Key: SPARK-33826
> URL: https://issues.apache.org/jira/browse/SPARK-33826
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 3.0.0
>Reporter: Zhang Jianguo
>Priority: Minor
>
> *Arch:* Hive on Spark.
>  
> *Version:* Spark 2.3.2
>  
> *Conf:*
> Enable user impersonation
> hive.server2.enable.doAs=true
>  
> *Scenario:*
> Thriftserver is running as login user A, and tasks run as user A too.
> The client executes SQL as user B.
>  
> Data generated by the SQL "insert into TABLE  \[tbl\] select XXX from ." is 
> written to HDFS on the executor, and the executor doesn't know B.
>  
> *{color:#de350b}So the owner of the files written to HDFS will be user A, when 
> it should be B.{color}*
>  
> I also checked the implementation of Spark 3.0.0; it could have the same issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33795) gapply fails execution with rbind error

2020-12-17 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251515#comment-17251515
 ] 

Hyukjin Kwon commented on SPARK-33795:
--

[~n8shdw] can you check if this happens in Apache Spark instead of Databricks 
Runtime?

> gapply fails execution with rbind error
> ---
>
> Key: SPARK-33795
> URL: https://issues.apache.org/jira/browse/SPARK-33795
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 3.0.0
> Environment: Databricks runtime 7.3 LTS ML
>Reporter: MvR
>Priority: Major
> Attachments: Rerror.log
>
>
> Executing the following code on Databricks Runtime 7.3 LTS ML errors out with 
> an rbind error, whereas it executes successfully without Arrow enabled in the 
> Spark session. The full error message is attached.
>  
> ```
> library(dplyr)
> library(SparkR)
> SparkR::sparkR.session(sparkConfig = 
>   list(spark.sql.execution.arrow.sparkr.enabled = "true"))
> 
> mtcars %>%
>   SparkR::as.DataFrame() %>%
>   SparkR::gapply(x = .,
>     cols = c("cyl", "vs"),
>     func = function(key, data) {
>       dt <- data[, c("mpg", "qsec")]
>       res <- apply(dt, 2, mean)
>       df <- data.frame(firstGroupKey = key[1],
>                        secondGroupKey = key[2],
>                        mean_mpg = res[1],
>                        mean_cyl = res[2])
>       return(df)
>     },
>     schema = structType(structField("cyl", "double"),
>                         structField("vs", "double"),
>                         structField("mpg_mean", "double"),
>                         structField("qsec_mean", "double"))
>   ) %>%
>   display()
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33791) grouping__id() result is not consistent with Hive version < 2.3

2020-12-17 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251516#comment-17251516
 ] 

Hyukjin Kwon commented on SPARK-33791:
--

Can you document it in 
http://spark.apache.org/docs/latest/sql-migration-guide.html#compatibility-with-apache-hive?

> grouping__id() result is not consistent with Hive version < 2.3
> ---
>
> Key: SPARK-33791
> URL: https://issues.apache.org/jira/browse/SPARK-33791
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3, 3.0.1
>Reporter: Su Qilong
>Priority: Minor
>
> See this 
> [https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup]
> Hive changed its grouping__id algorithm in version 2.3.0. Spark currently 
> does not document this inconsistency with Hive, which may make users believe 
> their queries are safe to migrate from Hive 1.x to Spark, which is wrong.
> I think we should note this difference in the Hive migration guide, and add a 
> configuration to let grouping__id use the Hive 1.x compatible algorithm.
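
To compare outputs between engines, a small probe query could help (the inline data is illustrative; run the same statement on Hive 1.x and on Spark and diff the grouping__id values):

{code:python}
spark.sql("""
    SELECT a, b, grouping__id
    FROM VALUES (1, 10), (1, 20), (2, 10) AS t(a, b)
    GROUP BY a, b GROUPING SETS ((a), (a, b))
""").show()
{code}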



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33825) Is Spark SQL able to auto update partition stats like hive by setting hive.stats.autogather=true

2020-12-17 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33825:
-
Target Version/s:   (was: 3.0.1)

> Is Spark SQL able to auto update partition stats like hive by setting 
> hive.stats.autogather=true
> 
>
> Key: SPARK-33825
> URL: https://issues.apache.org/jira/browse/SPARK-33825
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: yang
>Priority: Major
>  Labels: partitionStat
>
> {{spark.sql.statistics.size.autoUpdate.enabled}} only works for table-level 
> stats updates. For partition stats, I can only update them with {{ANALYZE TABLE 
> tablename PARTITION(part) COMPUTE STATISTICS}}. So is Spark SQL able to auto 
> update partition stats like Hive does when hive.stats.autogather=true?
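
For now, the per-partition refresh has to be issued manually, e.g. from PySpark (table and partition values are illustrative):

{code:python}
# Recompute stats for a single partition after writing to it.
spark.sql("ANALYZE TABLE tablename PARTITION (part='2020-12-17') COMPUTE STATISTICS")
{code}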



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33825) Is Spark SQL able to auto update partition stats like hive by setting hive.stats.autogather=true

2020-12-17 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33825.
--
Resolution: Invalid

Let's ask questions on the dev mailing list before filing an issue. See also 
https://spark.apache.org/community.html

> Is Spark SQL able to auto update partition stats like hive by setting 
> hive.stats.autogather=true
> 
>
> Key: SPARK-33825
> URL: https://issues.apache.org/jira/browse/SPARK-33825
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: yang
>Priority: Major
>  Labels: partitionStat
>
> {{spark.sql.statistics.size.autoUpdate.enabled}} only works for table-level 
> stats updates. For partition stats, I can only update them with {{ANALYZE TABLE 
> tablename PARTITION(part) COMPUTE STATISTICS}}. So is Spark SQL able to auto 
> update partition stats like Hive does when hive.stats.autogather=true?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33836) Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark

2020-12-17 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-33836:
-
Affects Version/s: (was: 3.1.0)
   3.2.0

> Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark
> -
>
> Key: SPARK-33836
> URL: https://issues.apache.org/jira/browse/SPARK-33836
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> From SPARK-32885 and SPARK-32896 we added two public APIs to enable 
> read/write with table, but only on the Scala side, so only JVM languages 
> could leverage them.
> Given there are lots of PySpark users, it would be great to expose these 
> public APIs to PySpark as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26341) Expose executor memory metrics at the stage level, in the Stages tab

2020-12-17 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-26341.

Fix Version/s: 3.2.0
   3.1.0
 Assignee: angerszhu
   Resolution: Fixed

This issue is resolved in https://github.com/apache/spark/pull/30573

> Expose executor memory metrics at the stage level, in the Stages tab
> 
>
> Key: SPARK-26341
> URL: https://issues.apache.org/jira/browse/SPARK-26341
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Web UI
>Affects Versions: 3.0.1
>Reporter: Edward Lu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.1.0, 3.2.0
>
>
> Sub-task SPARK-23431 will add stage level executor memory metrics (peak 
> values for each stage, and peak values for each executor for the stage). This 
> information should also be exposed in the web UI, so that users can see 
> which stages are memory intensive.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33791) grouping__id() result is not consistent with Hive version < 2.3

2020-12-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33791:


Assignee: (was: Apache Spark)

> grouping__id() result is not consistent with Hive version < 2.3
> ---
>
> Key: SPARK-33791
> URL: https://issues.apache.org/jira/browse/SPARK-33791
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3, 3.0.1
>Reporter: Su Qilong
>Priority: Minor
>
> See this 
> [https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup]
> Hive changed its grouping__id algorithm in version 2.3.0. Spark currently 
> does not document this inconsistency with Hive, which may make users believe 
> their queries are safe to migrate from Hive 1.x to Spark, which is wrong.
> I think we should note this difference in the Hive migration guide, and add a 
> configuration to let grouping__id use the Hive 1.x compatible algorithm.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33791) grouping__id() result is not consistent with Hive version < 2.3

2020-12-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33791:


Assignee: Apache Spark

> grouping__id() result is not consistent with Hive version < 2.3
> ---
>
> Key: SPARK-33791
> URL: https://issues.apache.org/jira/browse/SPARK-33791
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3, 3.0.1
>Reporter: Su Qilong
>Assignee: Apache Spark
>Priority: Minor
>
> See this 
> [https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup]
> Hive changed its grouping__id algorithm in version 2.3.0. Spark currently 
> does not document this inconsistency with Hive, which may make users believe 
> their queries are safe to migrate from Hive 1.x to Spark, which is wrong.
> I think we should note this difference in the Hive migration guide, and add a 
> configuration to let grouping__id use the Hive 1.x compatible algorithm.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33791) grouping__id() result is not consistent with Hive version < 2.3

2020-12-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251547#comment-17251547
 ] 

Apache Spark commented on SPARK-33791:
--

User 'sqlwindspeaker' has created a pull request for this issue:
https://github.com/apache/spark/pull/30836

> grouping__id() result is not consistent with Hive version < 2.3
> ---
>
> Key: SPARK-33791
> URL: https://issues.apache.org/jira/browse/SPARK-33791
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3, 3.0.1
>Reporter: Su Qilong
>Priority: Minor
>
> See this 
> [https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup]
> Hive changed its grouping__id algorithm in version 2.3.0. Spark currently 
> does not document this inconsistency with Hive, which may make users believe 
> their queries are safe to migrate from Hive 1.x to Spark, which is wrong.
> I think we should note this difference in the Hive migration guide, and add a 
> configuration to let grouping__id use the Hive 1.x compatible algorithm.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31948) expose mapSideCombine in aggByKey/reduceByKey/foldByKey

2020-12-17 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-31948.
--
Resolution: Not A Problem

> expose mapSideCombine in aggByKey/reduceByKey/foldByKey
> ---
>
> Key: SPARK-31948
> URL: https://issues.apache.org/jira/browse/SPARK-31948
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Spark Core
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Minor
>
> 1. {{aggregateByKey}}, {{reduceByKey}} and {{foldByKey}} will always perform 
> {{mapSideCombine}};
> However, this can be skipped sometimes, especially in ML (RobustScaler):
> {code:java}
> vectors.mapPartitions { iter =>
>   if (iter.hasNext) {
>     val summaries = Array.fill(numFeatures)(
>       new QuantileSummaries(QuantileSummaries.defaultCompressThreshold, relativeError))
>     while (iter.hasNext) {
>       val vec = iter.next
>       vec.foreach { (i, v) => if (!v.isNaN) summaries(i) = summaries(i).insert(v) }
>     }
>     Iterator.tabulate(numFeatures)(i => (i, summaries(i).compress))
>   } else Iterator.empty
> }.reduceByKey { case (s1, s2) => s1.merge(s2) }
> {code}
>  
> This {{reduceByKey}} in {{RobustScaler}} does not need {{mapSideCombine}} at 
> all, similar places exist in {{KMeans}}, {{GMM}}, etc;
> To my knowledge, we do not need {{mapSideCombine}} if the reduction factor 
> isn't high;
>  
> 2. {{treeAggregate}} and {{treeReduce}} are based on {{foldByKey}},  the 
> {{mapSideCombine}} in the first call of {{foldByKey}} can also be avoided.
>  
> SPARK-772:
> {quote}
> Map side combine in group by key case does not reduce the amount of data 
> shuffled. Instead, it forces a lot more objects to go into old gen, and leads 
> to worse GC.
> {quote}
>  
> So what about:
> 1. exposing mapSideCombine in {{aggByKey}}/{{reduceByKey}}/{{foldByKey}}, so 
> that users can disable unnecessary mapSideCombine
> 2. disabling the {{mapSideCombine}} in the first call of {{foldByKey}} in  
> {{treeAggregate}} and {{treeReduce}}
> 3. disabling the unnecessary {{mapSideCombine}} in ML;
> Friendly ping [~srowen] [~huaxingao] [~weichenxu123] [~hyukjin.kwon] 
> [~viirya]  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


