[jira] [Assigned] (SPARK-32939) Avoid re-compute expensive expression

2020-09-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32939:


Assignee: (was: Apache Spark)

> Avoid re-compute expensive expression
> -
>
> Key: SPARK-32939
> URL: https://issues.apache.org/jira/browse/SPARK-32939
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>
> {code:java}
>   test("Pushdown demo") {
> withTable("t") {
>   withTempDir { loc =>
> sql(
>   s"""CREATE TABLE t(c1 INT, s STRING) PARTITIONED BY(P1 STRING)
>  | LOCATION '${loc.getAbsolutePath}'
>  |""".stripMargin)
> sql(
>   """
> |SELECT c1,
> |case
> |  when get_json_object(s,'$.a')=1 then "a"
> |  when get_json_object(s,'$.a')=2 then "b"
> |end as s_type
> |FROM t
> |WHERE get_json_object(s,'$.a') in (1, 2)
>   """.stripMargin).explain(true)
>  }
> }
> }
> will get a plan like:
> == Physical Plan ==
> *(1) Project [c1#1, CASE WHEN (cast(get_json_object(s#2, $.a) as int) = 1) 
> THEN a WHEN (cast(get_json_object(s#2, $.a) as int) = 2) THEN b END AS 
> s_type#0]
> +- *(1) Filter get_json_object(s#2, $.a) IN (1,2)
>+- Scan hive default.t [c1#1, s#2], HiveTableRelation `default`.`t`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#1, s#2], [P1#3], 
> Statistics(sizeInBytes=8.0 EiB)
> we can see that get_json_object(s#2, $.a) will be computed three times.
> Expensive expressions like this are often re-computed many times in such 
> queries.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32939) Avoid re-compute expensive expression

2020-09-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198660#comment-17198660
 ] 

Apache Spark commented on SPARK-32939:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/29807

> Avoid re-compute expensive expression
> -
>
> Key: SPARK-32939
> URL: https://issues.apache.org/jira/browse/SPARK-32939
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>
> {code:java}
>   test("Pushdown demo") {
> withTable("t") {
>   withTempDir { loc =>
> sql(
>   s"""CREATE TABLE t(c1 INT, s STRING) PARTITIONED BY(P1 STRING)
>  | LOCATION '${loc.getAbsolutePath}'
>  |""".stripMargin)
> sql(
>   """
> |SELECT c1,
> |case
> |  when get_json_object(s,'$.a')=1 then "a"
> |  when get_json_object(s,'$.a')=2 then "b"
> |end as s_type
> |FROM t
> |WHERE get_json_object(s,'$.a') in (1, 2)
>   """.stripMargin).explain(true)
>  }
> }
> }
> will get a plan like:
> == Physical Plan ==
> *(1) Project [c1#1, CASE WHEN (cast(get_json_object(s#2, $.a) as int) = 1) 
> THEN a WHEN (cast(get_json_object(s#2, $.a) as int) = 2) THEN b END AS 
> s_type#0]
> +- *(1) Filter get_json_object(s#2, $.a) IN (1,2)
>+- Scan hive default.t [c1#1, s#2], HiveTableRelation `default`.`t`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#1, s#2], [P1#3], 
> Statistics(sizeInBytes=8.0 EiB)
> we can see that get_json_object(s#2, $.a) will be computed three times.
> Expensive expressions like this are often re-computed many times in such 
> queries.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32939) Avoid re-compute expensive expression

2020-09-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32939:


Assignee: Apache Spark

> Avoid re-compute expensive expression
> -
>
> Key: SPARK-32939
> URL: https://issues.apache.org/jira/browse/SPARK-32939
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
>   test("Pushdown demo") {
> withTable("t") {
>   withTempDir { loc =>
> sql(
>   s"""CREATE TABLE t(c1 INT, s STRING) PARTITIONED BY(P1 STRING)
>  | LOCATION '${loc.getAbsolutePath}'
>  |""".stripMargin)
> sql(
>   """
> |SELECT c1,
> |case
> |  when get_json_object(s,'$.a')=1 then "a"
> |  when get_json_object(s,'$.a')=2 then "b"
> |end as s_type
> |FROM t
> |WHERE get_json_object(s,'$.a') in (1, 2)
>   """.stripMargin).explain(true)
>  }
> }
> }
> will get a plan like:
> == Physical Plan ==
> *(1) Project [c1#1, CASE WHEN (cast(get_json_object(s#2, $.a) as int) = 1) 
> THEN a WHEN (cast(get_json_object(s#2, $.a) as int) = 2) THEN b END AS 
> s_type#0]
> +- *(1) Filter get_json_object(s#2, $.a) IN (1,2)
>+- Scan hive default.t [c1#1, s#2], HiveTableRelation `default`.`t`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#1, s#2], [P1#3], 
> Statistics(sizeInBytes=8.0 EiB)
> we can see that get_json_object(s#2, $.a) will be computed three times.
> Expensive expressions like this are often re-computed many times in such 
> queries.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32939) Avoid re-compute expensive expression

2020-09-18 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-32939:
--
Description: 
{code:java}
  test("Pushdown demo") {
withTable("t") {
  withTempDir { loc =>
sql(
  s"""CREATE TABLE t(c1 INT, s STRING) PARTITIONED BY(P1 STRING)
 | LOCATION '${loc.getAbsolutePath}'
 |""".stripMargin)
sql(
  """
|SELECT c1,
|case
|  when get_json_object(s,'$.a')=1 then "a"
|  when get_json_object(s,'$.a')=2 then "b"
|end as s_type
|FROM t
|WHERE get_json_object(s,'$.a') in (1, 2)
  """.stripMargin).explain(true)
 }
}
}
will get a plan like:


== Physical Plan ==
*(1) Project [c1#1, CASE WHEN (cast(get_json_object(s#2, $.a) as int) = 1) THEN 
a WHEN (cast(get_json_object(s#2, $.a) as int) = 2) THEN b END AS s_type#0]
+- *(1) Filter get_json_object(s#2, $.a) IN (1,2)
   +- Scan hive default.t [c1#1, s#2], HiveTableRelation `default`.`t`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#1, s#2], [P1#3], 
Statistics(sizeInBytes=8.0 EiB)

we can see that get_json_object(s#2, $.a) will be computed three times.
Expensive expressions like this are often re-computed many times in such 
queries.

{code}
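For illustration only: a minimal sketch of how a user might try to work around the 
duplication today by materializing the expensive expression as an intermediate 
column (the column name jsonA is hypothetical). Note that the current optimizer may 
still collapse these projections and re-duplicate get_json_object, which is exactly 
the behaviour this ticket wants to improve.
{code:java}
// Hedged sketch: compute the expensive expression once as a named column,
// then filter and project on that column instead of repeating the call.
import org.apache.spark.sql.functions._

val withJson = spark.table("t")
  .withColumn("jsonA", get_json_object(col("s"), "$.a"))

val result = withJson
  .where(col("jsonA").isin(1, 2))
  .select(col("c1"),
    when(col("jsonA") === 1, "a")
      .when(col("jsonA") === 2, "b")
      .as("s_type"))

result.explain(true)
{code}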

> Avoid re-compute expensive expression
> -
>
> Key: SPARK-32939
> URL: https://issues.apache.org/jira/browse/SPARK-32939
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>
> {code:java}
>   test("Pushdown demo") {
> withTable("t") {
>   withTempDir { loc =>
> sql(
>   s"""CREATE TABLE t(c1 INT, s STRING) PARTITIONED BY(P1 STRING)
>  | LOCATION '${loc.getAbsolutePath}'
>  |""".stripMargin)
> sql(
>   """
> |SELECT c1,
> |case
> |  when get_json_object(s,'$.a')=1 then "a"
> |  when get_json_object(s,'$.a')=2 then "b"
> |end as s_type
> |FROM t
> |WHERE get_json_object(s,'$.a') in (1, 2)
>   """.stripMargin).explain(true)
>  }
> }
> }
> will get a plan like:
> == Physical Plan ==
> *(1) Project [c1#1, CASE WHEN (cast(get_json_object(s#2, $.a) as int) = 1) 
> THEN a WHEN (cast(get_json_object(s#2, $.a) as int) = 2) THEN b END AS 
> s_type#0]
> +- *(1) Filter get_json_object(s#2, $.a) IN (1,2)
>+- Scan hive default.t [c1#1, s#2], HiveTableRelation `default`.`t`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#1, s#2], [P1#3], 
> Statistics(sizeInBytes=8.0 EiB)
> we can see that get_json_object(s#2, $.a) will be computed three times.
> Expensive expressions like this are often re-computed many times in such 
> queries.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32939) Avoid re-compute expensive expression

2020-09-18 Thread angerszhu (Jira)
angerszhu created SPARK-32939:
-

 Summary: Avoid re-compute expensive expression
 Key: SPARK-32939
 URL: https://issues.apache.org/jira/browse/SPARK-32939
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: angerszhu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30294) Read-only state store unnecessarily creates and deletes the temp file for delta file every batch

2020-09-18 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198600#comment-17198600
 ] 

L. C. Hsieh commented on SPARK-30294:
-

As this doesn't cause an error or a correctness issue, it seems to me that it is 
more of an improvement than a bug.

> Read-only state store unnecessarily creates and deletes the temp file for 
> delta file every batch
> 
>
> Key: SPARK-30294
> URL: https://issues.apache.org/jira/browse/SPARK-30294
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Minor
>
> [https://github.com/apache/spark/blob/d38f8167483d4d79e8360f24a8c0bffd51460659/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L143-L155]
> {code:java}
> /** Abort all the updates made on this store. This store will not be 
> usable any more. */
> override def abort(): Unit = {
>   // This if statement is to ensure that files are deleted only if there 
> are changes to the
>   // StateStore. We have two StateStores for each task, one which is used 
> only for reading, and
>   // the other used for read+write. We don't want the read-only to delete 
> state files.
>   if (state == UPDATING) {
> state = ABORTED
> cancelDeltaFile(compressedStream, deltaFileStream)
>   } else {
> state = ABORTED
>   }
>   logInfo(s"Aborted version $newVersion for $this")
> } {code}
> Despite the comment, the read-only state store also does the same things to 
> prepare for a write - it creates the temporary file, initializes output streams 
> for the file, closes these output streams, and deletes the temporary file. That 
> is simply unnecessary and confusing, since according to the log messages two 
> different instances seem to write to the same delta file.
>  
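A minimal sketch of the idea, with hypothetical names (this is not the actual 
HDFSBackedStateStoreProvider code): create the temp delta file and its streams 
lazily on the first write, so a store that is only ever read has nothing to clean 
up in abort().
{code:java}
// Hedged sketch with hypothetical names: the temp delta file is created only
// when the first update arrives, so a read-only store never touches the disk.
class SketchStateStore(createTempDeltaFile: () => java.io.OutputStream) {
  private var deltaStream: Option[java.io.OutputStream] = None

  private def ensureDeltaStream(): java.io.OutputStream = {
    if (deltaStream.isEmpty) deltaStream = Some(createTempDeltaFile())
    deltaStream.get
  }

  def put(key: Array[Byte], value: Array[Byte]): Unit = {
    val out = ensureDeltaStream() // file is created only on the first update
    out.write(key)
    out.write(value)
  }

  def abort(): Unit = {
    // For a read-only store deltaStream is still None, so there is no temp
    // file to delete and nothing to log about cancelling a delta file.
    deltaStream.foreach(_.close())
    deltaStream = None
  }
}
{code}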



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32738) thread safe endpoints may hang due to fatal error

2020-09-18 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-32738:

Fix Version/s: 2.4.8

> thread safe endpoints may hang due to fatal error
> -
>
> Key: SPARK-32738
> URL: https://issues.apache.org/jira/browse/SPARK-32738
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.4, 2.4.6, 3.0.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Major
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> Processing for `ThreadSafeRpcEndpoint` is controlled by 'numActiveThreads' in 
> `Inbox`. If any fatal error happens during `Inbox.process`, 'numActiveThreads' 
> is not reduced. Then other threads cannot process messages in that inbox, which 
> causes the endpoint to "hang".
> This problem is more serious in previous Spark 2.x versions since the driver, 
> executor and block manager endpoints are all thread safe endpoints.
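A minimal sketch of the general idea, with a hypothetical simplified class (this is 
not the actual Inbox code): release the active-thread slot in a finally block so 
that a fatal error thrown while processing a message cannot leave the inbox 
permanently "busy".
{code:java}
// Hedged sketch: decrement the counter in finally so a fatal error in a
// message handler does not block all later processing of this inbox.
class SketchInbox {
  private var numActiveThreads = 0
  private val messages = new java.util.LinkedList[() => Unit]()

  def process(): Unit = {
    synchronized { numActiveThreads += 1 }
    try {
      var msg = synchronized { Option(messages.poll()) }
      while (msg.isDefined) {
        msg.get.apply() // may throw a fatal error such as OutOfMemoryError
        msg = synchronized { Option(messages.poll()) }
      }
    } finally {
      synchronized { numActiveThreads -= 1 } // always released
    }
  }
}
{code}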



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17556) Executor side broadcast for broadcast joins

2020-09-18 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-17556:
---

Assignee: L. C. Hsieh

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
>Assignee: L. C. Hsieh
>Priority: Major
> Attachments: executor broadcast.pdf, executor-side-broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.
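For context, a hedged sketch of the driver-side pattern that broadcast joins rely 
on today (the table and column names are hypothetical): the build side is collected 
to the driver and then re-shipped to the executors as a broadcast variable, which 
is the extra round trip that executor-side broadcast would remove.
{code:java}
// Hedged sketch of the current driver-side pattern (hypothetical names):
// the small side travels executors -> driver -> executors.
val small = spark.table("small_dim").collect()        // executors -> driver
val broadcasted = spark.sparkContext.broadcast(small) // driver -> executors

val joined = spark.table("big_fact").rdd.mapPartitions { rows =>
  val lookup = broadcasted.value.map(r => r.getAs[Long]("id") -> r).toMap
  rows.flatMap(r => lookup.get(r.getAs[Long]("fk")).map(d => (r, d)))
}
{code}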



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17556) Executor side broadcast for broadcast joins

2020-09-18 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198571#comment-17198571
 ] 

L. C. Hsieh commented on SPARK-17556:
-

We will try to pick this up again soon.

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
>Assignee: L. C. Hsieh
>Priority: Major
> Attachments: executor broadcast.pdf, executor-side-broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32898) totalExecutorRunTimeMs is too big

2020-09-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32898:
--
Affects Version/s: 2.4.7

> totalExecutorRunTimeMs is too big
> -
>
> Key: SPARK-32898
> URL: https://issues.apache.org/jira/browse/SPARK-32898
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Linhong Liu
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
>
> This might be caused by incorrectly calculating executorRunTimeMs in 
> Executor.scala.
>  The function collectAccumulatorsAndResetStatusOnFailure(taskStartTimeNs) can 
> be called when taskStartTimeNs is not set yet (i.e. it is 0).
> As of now in master branch, here is the problematic code: 
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L470]
>  
> An exception can be thrown before this line, and the catch branch still 
> updates the metric.
>  However, the query shows as successful. Maybe this task is speculative; not 
> sure.
>  
> submissionTime in LiveExecutionData may also have a similar problem.
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala#L449]
>  
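A minimal sketch of the kind of guard that would avoid the bogus value 
(hypothetical, simplified; not the actual Executor.scala code): only derive the run 
time when taskStartTimeNs has actually been set.
{code:java}
// Hedged sketch: skip the metric when the task never recorded a start time
// (taskStartTimeNs == 0) instead of computing System.nanoTime() - 0, which
// yields an enormous executorRunTimeMs.
def executorRunTimeMs(taskStartTimeNs: Long): Long = {
  if (taskStartTimeNs > 0) {
    java.util.concurrent.TimeUnit.NANOSECONDS
      .toMillis(System.nanoTime() - taskStartTimeNs)
  } else {
    0L // start time was never set; report 0 rather than a huge bogus value
  }
}
{code}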



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32898) totalExecutorRunTimeMs is too big

2020-09-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32898.
---
Fix Version/s: 3.1.0
   3.0.2
 Assignee: wuyi
   Resolution: Fixed

> totalExecutorRunTimeMs is too big
> -
>
> Key: SPARK-32898
> URL: https://issues.apache.org/jira/browse/SPARK-32898
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Linhong Liu
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
>
> This might be caused by incorrectly calculating executorRunTimeMs in 
> Executor.scala.
>  The function collectAccumulatorsAndResetStatusOnFailure(taskStartTimeNs) can 
> be called when taskStartTimeNs is not set yet (i.e. it is 0).
> As of now in master branch, here is the problematic code: 
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L470]
>  
> An exception can be thrown before this line, and the catch branch still 
> updates the metric.
>  However, the query shows as successful. Maybe this task is speculative; not 
> sure.
>  
> submissionTime in LiveExecutionData may also have a similar problem.
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala#L449]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result

2020-09-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32635:
--
Fix Version/s: 2.4.8

> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> ---
>
> Key: SPARK-32635
> URL: https://issues.apache.org/jira/browse/SPARK-32635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.7, 3.0.0
>Reporter: Vinod KC
>Assignee: Peter Toth
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> When the pyspark.sql.functions.lit() function is used together with dataframe 
> cache, it returns a wrong result.
> e.g. lit() with cache():
>  ---
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner").cache() 
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show()
> finaldf.select('col2').show() #Wrong result
> {code}
>  
> Output
>  ---
> {code:java}
> >>> finaldf.show()
> ++++
> |col2|col3|col1|
> ++++
> | 2| 9| b|
> ++++
> >>> finaldf.select('col2').show() #Wrong result, instead of 2, got 1
> ++
> |col2|
> ++
> | 1|
> ++
> ++{code}
>  lit() function without cache() function.
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner")
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show() 
> finaldf.select('col2').show() #Correct result
> {code}
>  
> Output
> {code:java}
> --
> >>> finaldf.show()
> ++++
> |col2|col3|col1|
> ++++
> | 2| 9| b|
> ++++
> >>> finaldf.select('col2').show() #Correct result, when df_23_a is not cached 
> ++
> |col2|
> ++
> | 2|
> ++
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32348) Get tests working for Scala 2.13 build

2020-09-18 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198410#comment-17198410
 ] 

Sean R. Owen commented on SPARK-32348:
--

This is really going to be duplicated by N other JIRAs to fix subsets of the 
tests, but I'll leave it open for now as a reminder that there are still a few 
more modules to go.

> Get tests working for Scala 2.13 build
> --
>
> Key: SPARK-32348
> URL: https://issues.apache.org/jira/browse/SPARK-32348
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, Spark Core, SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Major
>
> This is a placeholder for the general task of getting the tests to pass in 
> the Scala 2.13 build, after it compiles.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32808) Pass all `sql/core` module UTs in Scala 2.13

2020-09-18 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-32808.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29711
[https://github.com/apache/spark/pull/29711]

> Pass all `sql/core` module UTs in Scala 2.13
> 
>
> Key: SPARK-32808
> URL: https://issues.apache.org/jira/browse/SPARK-32808
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.1.0
>
>
> There are now 319 failed tests based on commit 
> `f5360e761ef161f7e04526b59a4baf53f1cf8cd5`:
> {code:java}
> Run completed in 1 hour, 20 minutes, 25 seconds.
> Total number of tests run: 8485
> Suites: completed 357, aborted 0
> Tests: succeeded 8166, failed 319, canceled 1, ignored 52, pending 0
> *** 319 TESTS FAILED ***
> {code}
>  
> There are 293 failures associated with TPCDS_XXX_PlanStabilitySuite and 
> TPCDS_XXX_PlanStabilityWithStatsSuite:
>  * TPCDSV2_7_PlanStabilitySuite(33 FAILED)
>  * TPCDSV1_4_PlanStabilityWithStatsSuite(94 FAILED)
>  * TPCDSModifiedPlanStabilityWithStatsSuite(21 FAILED)
>  * TPCDSV1_4_PlanStabilitySuite(92 FAILED)
>  * TPCDSModifiedPlanStabilitySuite(21 FAILED)
>  * TPCDSV2_7_PlanStabilityWithStatsSuite(32 FAILED)
>  
> The other 26 FAILED cases are as follows:
>  * StreamingAggregationSuite
>  ** count distinct - state format version 1 
>  ** count distinct - state format version 2 
>  * GeneratorFunctionSuite
>  ** explode and other columns
>  ** explode_outer and other columns
>  * UDFSuite
>  ** SPARK-26308: udf with complex types of decimal
>  ** SPARK-32459: UDF should not fail on WrappedArray
>  * SQLQueryTestSuite
>  ** decimalArithmeticOperations.sql
>  ** postgreSQL/aggregates_part2.sql
>  ** ansi/decimalArithmeticOperations.sql 
>  ** udf/postgreSQL/udf-aggregates_part2.sql - Scala UDF
>  ** udf/postgreSQL/udf-aggregates_part2.sql - Regular Python UDF
>  * WholeStageCodegenSuite
>  ** SPARK-26680: Stream in groupBy does not cause StackOverflowError
>  * DataFrameSuite:
>  ** explode
>  ** SPARK-28067: Aggregate sum should not return wrong results for decimal 
> overflow
>  ** Star Expansion - ds.explode should fail with a meaningful message if it 
> takes a star
>  * DataStreamReaderWriterSuite
>  ** SPARK-18510: use user specified types for partition columns in file 
> sources
>  * OrcV1QuerySuite\OrcV2QuerySuite
>  ** Simple selection form ORC table * 2
>  * ExpressionsSchemaSuite
>  ** Check schemas for expression examples
>  * DataFrameStatSuite
>  ** SPARK-28818: Respect original column nullability in `freqItems`
>  * JsonV1Suite\JsonV2Suite\JsonLegacyTimeParserSuite
>  ** SPARK-4228 DataFrame to JSON * 3
>  ** backward compatibility * 3



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32808) Pass all `sql/core` module UTs in Scala 2.13

2020-09-18 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-32808:


Assignee: Yang Jie

> Pass all `sql/core` module UTs in Scala 2.13
> 
>
> Key: SPARK-32808
> URL: https://issues.apache.org/jira/browse/SPARK-32808
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>
> There are now 319 failed tests based on commit 
> `f5360e761ef161f7e04526b59a4baf53f1cf8cd5`:
> {code:java}
> Run completed in 1 hour, 20 minutes, 25 seconds.
> Total number of tests run: 8485
> Suites: completed 357, aborted 0
> Tests: succeeded 8166, failed 319, canceled 1, ignored 52, pending 0
> *** 319 TESTS FAILED ***
> {code}
>  
> There are 293 failures associated with TPCDS_XXX_PlanStabilitySuite and 
> TPCDS_XXX_PlanStabilityWithStatsSuite:
>  * TPCDSV2_7_PlanStabilitySuite(33 FAILED)
>  * TPCDSV1_4_PlanStabilityWithStatsSuite(94 FAILED)
>  * TPCDSModifiedPlanStabilityWithStatsSuite(21 FAILED)
>  * TPCDSV1_4_PlanStabilitySuite(92 FAILED)
>  * TPCDSModifiedPlanStabilitySuite(21 FAILED)
>  * TPCDSV2_7_PlanStabilityWithStatsSuite(32 FAILED)
>  
> The other 26 FAILED cases are as follows:
>  * StreamingAggregationSuite
>  ** count distinct - state format version 1 
>  ** count distinct - state format version 2 
>  * GeneratorFunctionSuite
>  ** explode and other columns
>  ** explode_outer and other columns
>  * UDFSuite
>  ** SPARK-26308: udf with complex types of decimal
>  ** SPARK-32459: UDF should not fail on WrappedArray
>  * SQLQueryTestSuite
>  ** decimalArithmeticOperations.sql
>  ** postgreSQL/aggregates_part2.sql
>  ** ansi/decimalArithmeticOperations.sql 
>  ** udf/postgreSQL/udf-aggregates_part2.sql - Scala UDF
>  ** udf/postgreSQL/udf-aggregates_part2.sql - Regular Python UDF
>  * WholeStageCodegenSuite
>  ** SPARK-26680: Stream in groupBy does not cause StackOverflowError
>  * DataFrameSuite:
>  ** explode
>  ** SPARK-28067: Aggregate sum should not return wrong results for decimal 
> overflow
>  ** Star Expansion - ds.explode should fail with a meaningful message if it 
> takes a star
>  * DataStreamReaderWriterSuite
>  ** SPARK-18510: use user specified types for partition columns in file 
> sources
>  * OrcV1QuerySuite\OrcV2QuerySuite
>  ** Simple selection form ORC table * 2
>  * ExpressionsSchemaSuite
>  ** Check schemas for expression examples
>  * DataFrameStatSuite
>  ** SPARK-28818: Respect original column nullability in `freqItems`
>  * JsonV1Suite\JsonV2Suite\JsonLegacyTimeParserSuite
>  ** SPARK-4228 DataFrame to JSON * 3
>  ** backward compatibility * 3



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32848) CostBasedJoinReorder should produce same result in Scala 2.12 and 2.13 with same input

2020-09-18 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-32848.
--
Resolution: Duplicate

> CostBasedJoinReorder should produce same result in Scala 2.12 and 2.13 with 
> same input
> --
>
> Key: SPARK-32848
> URL: https://issues.apache.org/jira/browse/SPARK-32848
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Yang Jie
>Priority: Major
>
> The optimization result of {{CostBasedJoinReorder}} may be different for the 
> same input in Scala 2.12 and Scala 2.13 if there is more than one candidate 
> plan with the same cost.
> The test case named "Test 4: Star with several branches" in 
> StarJoinCostBasedReorderSuite is a typical case.
>  
> If the input is 
> {code:java}
> d1.join(t3).join(t4).join(f1).join(d2).join(t5).join(t6).join(d3).join(t1).join(t2)
>   .where((nameToAttr("d1_c2") === nameToAttr("t3_c1")) &&
> (nameToAttr("t3_c2") === nameToAttr("t4_c2")) &&
> (nameToAttr("d1_pk") === nameToAttr("f1_fk1")) &&
> (nameToAttr("f1_fk2") === nameToAttr("d2_pk")) &&
> (nameToAttr("d2_c2") === nameToAttr("t5_c1")) &&
> (nameToAttr("t5_c2") === nameToAttr("t6_c2")) &&
> (nameToAttr("f1_fk3") === nameToAttr("d3_pk")) &&
> (nameToAttr("d3_c2") === nameToAttr("t1_c1")) &&
> (nameToAttr("t1_c2") === nameToAttr("t2_c2")))
> {code}
> the optimization result  in Scala 2.12 is 
> {code:java}
>   f1.join(d3, Inner, Some(nameToAttr("f1_fk3") === nameToAttr("d3_pk")))
> .join(d1, Inner, Some(nameToAttr("f1_fk1") === nameToAttr("d1_pk")))
> .join(d2, Inner, Some(nameToAttr("f1_fk2") === nameToAttr("d2_pk")))
> .
> .
> .{code}
> and the optimization result  in Scala 2.13 is 
> {code:java}
> f1.join(d3, Inner, Some(nameToAttr("f1_fk3") === nameToAttr("d3_pk")))
> .join(d2, Inner, Some(nameToAttr("f1_fk2") === nameToAttr("d2_pk")))
> .join(d1, Inner, Some(nameToAttr("f1_fk1") === nameToAttr("d1_pk")))
> .
> .
> .
> {code}
>  
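A hedged sketch of one way to make the choice deterministic (hypothetical names, 
not the actual CostBasedJoinReorder code): when several candidate plans have the 
same cost, break the tie with a stable key instead of relying on the iteration 
order of the underlying collection, which can differ between Scala 2.12 and 2.13.
{code:java}
// Hedged sketch with hypothetical types: pick the cheapest candidate and
// break cost ties with a deterministic, order-independent key.
case class Candidate(plan: String, cost: BigDecimal)

def pickBest(candidates: Seq[Candidate]): Candidate =
  candidates.minBy(c => (c.cost, c.plan)) // tie broken by the stable key
{code}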



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32935) File source V2: support bucketing

2020-09-18 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198368#comment-17198368
 ] 

Thomas Graves commented on SPARK-32935:
---

Sorry, it looks like my update went in at the same time as yours and overwrote 
it; I fixed the description to include both. It should cover both writing and 
reading, correct, [~Gengliang.Wang]?

> File source V2: support bucketing
> -
>
> Key: SPARK-32935
> URL: https://issues.apache.org/jira/browse/SPARK-32935
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Datasource V2 does not currently support bucketed reads the way Datasource 
> V1 does.  See DataSourceScanExec and the config 
> spark.sql.sources.bucketing.enabled.  We need to add support to V2 as well.
>  
> Support writing file data source with bucketing 
> {code:java} 
> fileDf.write.bucketBy(...).sortBy(..)... 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32935) File source V2: support bucketing

2020-09-18 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-32935:
--
Description: 
Datasource V2 does not currently support bucketed reads or writes the way 
Datasource V1 does.  See DataSourceScanExec and the config 

spark.sql.sources.bucketing.enabled.  We need to add support to V2 as well.

 

Support writing file data source with bucketing looks like:
{code:java}
 
fileDf.write.bucketBy(...).sortBy(..)... 
{code}

  was:
Datasource V2 does not currently support bucketed reads similar to Datasource 
V1 does.  See DatasourceScanExec and config 

spark.sql.sources.bucketing.enabled.  We need to add support to V2 as well.

 

Support writing file data source with bucketing 

{code:java} 
fileDf.write.bucketBy(...).sortBy(..)... 
{code}


> File source V2: support bucketing
> -
>
> Key: SPARK-32935
> URL: https://issues.apache.org/jira/browse/SPARK-32935
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Datasource V2 does not currently support bucketed reads or writes the way 
> Datasource V1 does.  See DataSourceScanExec and the config 
> spark.sql.sources.bucketing.enabled.  We need to add support to V2 as well.
>  
> Support writing file data source with bucketing looks like:
> {code:java}
>  
> fileDf.write.bucketBy(...).sortBy(..)... 
> {code}
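For reference, a minimal sketch of how bucketing is used with the V1 path today 
(the table and column names are hypothetical, and both tables are assumed to be 
bucketed on the join key); this is the behaviour the V2 file sources would need to 
match on both the write and the read side.
{code:java}
// Hedged sketch with hypothetical names: write a bucketed table, then read it
// with bucketed scans enabled so the shuffle on the bucket key can be avoided.
spark.conf.set("spark.sql.sources.bucketing.enabled", "true")

fileDf.write
  .bucketBy(8, "user_id")
  .sortBy("user_id")
  .saveAsTable("events_bucketed") // bucketBy requires saveAsTable

val joined = spark.table("events_bucketed")
  .join(spark.table("users_bucketed"), "user_id")
joined.explain() // with V1 sources the exchange on user_id can be elided
{code}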



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32935) File source V2: support bucketing

2020-09-18 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-32935:
--
Description: 
Datasource V2 does not currently support bucketed reads the way Datasource 
V1 does.  See DataSourceScanExec and the config 

spark.sql.sources.bucketing.enabled.  We need to add support to V2 as well.

 

Support writing file data source with bucketing 

{code:java} 
fileDf.write.bucketBy(...).sortBy(..)... 
{code}

  was:
Datasource V2 does not currently support bucketed reads similar to Datasource 
V1 does.  See DatasourceScanExec and config 

spark.sql.sources.bucketing.enabled.  We need to add support to V2 as well.


> File source V2: support bucketing
> -
>
> Key: SPARK-32935
> URL: https://issues.apache.org/jira/browse/SPARK-32935
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Datasource V2 does not currently support bucketed reads the way Datasource 
> V1 does.  See DataSourceScanExec and the config 
> spark.sql.sources.bucketing.enabled.  We need to add support to V2 as well.
>  
> Support writing file data source with bucketing 
> {code:java} 
> fileDf.write.bucketBy(...).sortBy(..)... 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32187) User Guide - Shipping Python Package

2020-09-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198363#comment-17198363
 ] 

Apache Spark commented on SPARK-32187:
--

User 'fhoering' has created a pull request for this issue:
https://github.com/apache/spark/pull/29806

> User Guide - Shipping Python Package
> 
>
> Key: SPARK-32187
> URL: https://issues.apache.org/jira/browse/SPARK-32187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Fabian Höring
>Priority: Major
>
> - Zipped file
> - Python files
> - Virtualenv with Yarn
> - PEX \(?\) (see also SPARK-25433)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32938) Spark can not cast long value from Kafka

2020-09-18 Thread Matthias Seiler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Seiler updated SPARK-32938:

Environment: 
Debian 10 (Buster), AMD64

Spark 3.0.0

Kafka 2.5.0

spark-sql-kafka-0-10_2.12

  was:
Debian 10 (Buster), AMD64

Spark 3.0.0

Kafka 2.5.0

**spark-sql-kafka-0-10_2.12


> Spark can not cast long value from Kafka
> 
>
> Key: SPARK-32938
> URL: https://issues.apache.org/jira/browse/SPARK-32938
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL, Structured Streaming
>Affects Versions: 3.0.0
> Environment: Debian 10 (Buster), AMD64
> Spark 3.0.0
> Kafka 2.5.0
> spark-sql-kafka-0-10_2.12
>Reporter: Matthias Seiler
>Priority: Blocker
>
> Spark seems to be unable to cast the key (or value) part from Kafka to a 
> _{color:#172b4d}long{color}_ value and throws
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`key` AS 
> BIGINT)' due to data type mismatch: cannot cast binary to bigint;;{code}
>  
> {color:#172b4d}See this repo for further investigation:{color} 
> [https://github.com/maseiler/spark-kafka-casting-bug]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32187) User Guide - Shipping Python Package

2020-09-18 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198362#comment-17198362
 ] 

Fabian Höring commented on SPARK-32187:
---

[~hyukjin.kwon]

Voilà: [https://github.com/apache/spark/pull/29806]

I spent some time getting Spark 3.0.1 to work on our cluster, testing all the 
examples with Spark 3.0.1, and making it more concise.

I had some issues with pyspark 3.0.1, the latest pyarrow and the latest pandas, 
so I fixed the versions for now to get something merged, and then we can still see.

Another recent blog post, if you are interested: 
[https://www.inovex.de/blog/isolated-virtual-environments-pyspark/]
It is all covered in the doc, I would say.
 
From my point of view it looks really good now.

 

> User Guide - Shipping Python Package
> 
>
> Key: SPARK-32187
> URL: https://issues.apache.org/jira/browse/SPARK-32187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Fabian Höring
>Priority: Major
>
> - Zipped file
> - Python files
> - Virtualenv with Yarn
> - PEX \(?\) (see also SPARK-25433)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32187) User Guide - Shipping Python Package

2020-09-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198360#comment-17198360
 ] 

Apache Spark commented on SPARK-32187:
--

User 'fhoering' has created a pull request for this issue:
https://github.com/apache/spark/pull/29806

> User Guide - Shipping Python Package
> 
>
> Key: SPARK-32187
> URL: https://issues.apache.org/jira/browse/SPARK-32187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Fabian Höring
>Priority: Major
>
> - Zipped file
> - Python files
> - Virtualenv with Yarn
> - PEX \(?\) (see also SPARK-25433)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-32183) User Guide - PySpark Usage Guide for Pandas with Apache Arrow

2020-09-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32183:
-
Comment: was deleted

(was: User 'fhoering' has created a pull request for this issue:
https://github.com/apache/spark/pull/29806)

> User Guide - PySpark Usage Guide for Pandas with Apache Arrow
> -
>
> Key: SPARK-32183
> URL: https://issues.apache.org/jira/browse/SPARK-32183
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> Port https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html 
> to examples.
> See also 
> https://hyukjin-spark.readthedocs.io/en/latest/user_guide/arrow-spark.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32938) Spark can not cast long value from Kafka

2020-09-18 Thread Matthias Seiler (Jira)
Matthias Seiler created SPARK-32938:
---

 Summary: Spark can not cast long value from Kafka
 Key: SPARK-32938
 URL: https://issues.apache.org/jira/browse/SPARK-32938
 Project: Spark
  Issue Type: Bug
  Components: Java API, SQL, Structured Streaming
Affects Versions: 3.0.0
 Environment: Debian 10 (Buster), AMD64

Spark 3.0.0

Kafka 2.5.0

spark-sql-kafka-0-10_2.12
Reporter: Matthias Seiler


Spark seems to be unable to cast the key (or value) part from Kafka to a 
_{color:#172b4d}long{color}_ value and throws
{code:java}
org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`key` AS BIGINT)' 
due to data type mismatch: cannot cast binary to bigint;;{code}
 

{color:#172b4d}See this repo for further investigation:{color} 
[https://github.com/maseiler/spark-kafka-casting-bug]
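Spark exposes the Kafka key and value as binary columns, and CAST(binary AS 
BIGINT) is not a supported conversion, hence the AnalysisException. A hedged 
workaround sketch, assuming the key was produced with Kafka's LongSerializer 
(8 bytes, big-endian) and is never null (the topic and servers are placeholders):
{code:java}
// Hedged sketch: decode the 8-byte big-endian key with a UDF instead of
// CAST(key AS BIGINT).
import org.apache.spark.sql.functions.{col, udf}

val binaryToLong = udf((bytes: Array[Byte]) =>
  java.nio.ByteBuffer.wrap(bytes).getLong)

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
  .option("subscribe", "some-topic")                   // placeholder
  .load()
  .select(binaryToLong(col("key")).as("key_as_long"), col("value"))
{code}
If the producer instead wrote the key as a textual number, casting in two steps 
(key cast to string, then to long) also works and needs no UDF.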



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32183) User Guide - PySpark Usage Guide for Pandas with Apache Arrow

2020-09-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198354#comment-17198354
 ] 

Apache Spark commented on SPARK-32183:
--

User 'fhoering' has created a pull request for this issue:
https://github.com/apache/spark/pull/29806

> User Guide - PySpark Usage Guide for Pandas with Apache Arrow
> -
>
> Key: SPARK-32183
> URL: https://issues.apache.org/jira/browse/SPARK-32183
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> Port https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html 
> to examples.
> See also 
> https://hyukjin-spark.readthedocs.io/en/latest/user_guide/arrow-spark.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31754) Spark Structured Streaming: NullPointerException in Stream Stream join

2020-09-18 Thread Mark Kegel (DSS) (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198348#comment-17198348
 ] 

Mark Kegel (DSS) commented on SPARK-31754:
--

We are seeing this same problem in our data pipeline. As an experiment we tried 
swapping out the default state store for the RocksDB one. We still get an 
exception, but it's a very different one. Hopefully this might point folks 
towards what the issue is. Here is a sample stacktrace:

 
{code:java}
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2362)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2350)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2349)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2349)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:1102)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:1102)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1102)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2582)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2529)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2517)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:897)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2280)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:170)
... 56 more
Caused by: java.lang.IllegalStateException: RocksDB instance could not be 
acquired by [ThreadId: 15124, task: 185.3 in stage 19413, TID 1537636] as it 
was not released by [ThreadId: 12648, task: 185.1 in stage 19413, TID 1535907] 
after 10002 ms
StateStoreId(opId=2,partId=185,name=left-keyToNumValues)
at com.databricks.sql.streaming.state.RocksDB.acquire(RocksDB.scala:332)
at com.databricks.sql.streaming.state.RocksDB.load(RocksDB.scala:103)
at 
com.databricks.sql.streaming.state.RocksDBStateStoreProvider.getStore(RocksDBStateStoreProvider.scala:161)
at 
org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:372)
at 
org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$StateStoreHandler.getStateStore(SymmetricHashJoinStateManager.scala:321)
at 
org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$KeyToNumValuesStore.(SymmetricHashJoinStateManager.scala:347)
at 
org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager.(SymmetricHashJoinStateManager.scala:294)
at 
org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner.(StreamingSymmetricHashJoinExec.scala:397)
at 
org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec.org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$processPartitions(StreamingSymmetricHashJoinExec.scala:229)
at 
org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$doExecute$1.apply(StreamingSymmetricHashJoinExec.scala:205)
at 
org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$doExecute$1.apply(StreamingSymmetricHashJoinExec.scala:205)
at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:353)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:317)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:353)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:317)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:353)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:317)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:140)
at org.apache.spark.scheduler.Task.run(Task.scala:113)

[jira] [Updated] (SPARK-32935) File source V2: support bucketing

2020-09-18 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-32935:
--
Description: 
Datasource V2 does not currently support bucketed reads the way Datasource 
V1 does.  See DataSourceScanExec and the config 

spark.sql.sources.bucketing.enabled.  We need to add support to V2 as well.

  was:
Support writing file data source with bucketing

{code:java}
fileDf.write.bucketBy(...).sortBy(..)...
{code}



> File source V2: support bucketing
> -
>
> Key: SPARK-32935
> URL: https://issues.apache.org/jira/browse/SPARK-32935
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Datasource V2 does not currently support bucketed reads the way Datasource 
> V1 does.  See DataSourceScanExec and the config 
> spark.sql.sources.bucketing.enabled.  We need to add support to V2 as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32936) Pass all `external/avro` module UTs in Scala 2.13

2020-09-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32936.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29801
[https://github.com/apache/spark/pull/29801]

> Pass all `external/avro` module UTs in Scala 2.13
> -
>
> Key: SPARK-32936
> URL: https://issues.apache.org/jira/browse/SPARK-32936
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.1.0
>
>
> There are 14 failed tests in the `external/avro` module, as follows:
>  * AvroV1Suite (7 FAILED)
>  * AvroV2Suite (7 FAILED)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32936) Pass all `external/avro` module UTs in Scala 2.13

2020-09-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32936:


Assignee: Yang Jie

> Pass all `external/avro` module UTs in Scala 2.13
> -
>
> Key: SPARK-32936
> URL: https://issues.apache.org/jira/browse/SPARK-32936
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>
> There are 14 failed tests in the `external/avro` module, as follows:
>  * AvroV1Suite (7 FAILED)
>  * AvroV2Suite (7 FAILED)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27589) Spark file source V2

2020-09-18 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198339#comment-17198339
 ] 

Thomas Graves commented on SPARK-27589:
---

Thanks for confirming and filing the jira; I wanted to make sure I wasn't 
missing something.

> Spark file source V2
> 
>
> Key: SPARK-27589
> URL: https://issues.apache.org/jira/browse/SPARK-27589
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Re-implement file sources with data source V2 API



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32935) File source V2: support bucketing

2020-09-18 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-32935:
---
Description: 
Support writing file data source with bucketing

{code:java}
fileDf.write.bucketBy(...).sortBy(..)...
{code}


> File source V2: support bucketing
> -
>
> Key: SPARK-32935
> URL: https://issues.apache.org/jira/browse/SPARK-32935
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Support writing file data source with bucketing
> {code:java}
> fileDf.write.bucketBy(...).sortBy(..)...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32935) File source V2: support bucketing

2020-09-18 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198338#comment-17198338
 ] 

Gengliang Wang commented on SPARK-32935:


[~rohitmishr1484] Sure

> File source V2: support bucketing
> -
>
> Key: SPARK-32935
> URL: https://issues.apache.org/jira/browse/SPARK-32935
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Support writing file data source with bucketing
> {code:java}
> fileDf.write.bucketBy(...).sortBy(..)...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32911) UnsafeExternalSorter.SpillableIterator cannot spill after reading all records

2020-09-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32911:
---

Assignee: Tom van Bussel

> UnsafeExternalSorter.SpillableIterator cannot spill after reading all records
> -
>
> Key: SPARK-32911
> URL: https://issues.apache.org/jira/browse/SPARK-32911
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Tom van Bussel
>Assignee: Tom van Bussel
>Priority: Major
>
> No memory is freed after calling 
> {{UnsafeExternalSorter.SpillableIterator.spill()}} when all records have been 
> read, even though it is still holding onto some memory. This may starve other 
> {{MemoryConsumer}}s of memory.
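A hedged sketch of the fix direction, with hypothetical, heavily simplified types 
(this is not the actual UnsafeExternalSorter code): when spill() is called after 
the iterator has been exhausted, release the remaining in-memory pages and report 
the freed bytes instead of keeping them allocated.
{code:java}
// Hedged sketch: a spillable iterator that still frees its in-memory pages
// when asked to spill after all records have been read.
class SketchSpillableIterator(private var pages: List[Array[Byte]]) {
  private var exhausted = false

  def markExhausted(): Unit = { exhausted = true }

  /** Returns the number of bytes released by this spill request. */
  def spill(): Long = {
    if (pages.isEmpty) return 0L
    val freed = pages.map(_.length.toLong).sum
    if (!exhausted) {
      // remaining unread records would be written to disk here (omitted)
    }
    pages = Nil // release the pages in both cases so other consumers can run
    freed
  }
}
{code}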



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32911) UnsafeExternalSorter.SpillableIterator cannot spill after reading all records

2020-09-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32911.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29787
[https://github.com/apache/spark/pull/29787]

> UnsafeExternalSorter.SpillableIterator cannot spill after reading all records
> -
>
> Key: SPARK-32911
> URL: https://issues.apache.org/jira/browse/SPARK-32911
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Tom van Bussel
>Assignee: Tom van Bussel
>Priority: Major
> Fix For: 3.1.0
>
>
> No memory is freed after calling 
> {{UnsafeExternalSorter.SpillableIterator.spill()}} when all records have been 
> read, even though it is still holding onto some memory. This may starve other 
> {{MemoryConsumer}}s of memory.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32937) DecomissionSuite in k8s integration tests is failing.

2020-09-18 Thread Prashant Sharma (Jira)
Prashant Sharma created SPARK-32937:
---

 Summary: DecomissionSuite in k8s integration tests is failing.
 Key: SPARK-32937
 URL: https://issues.apache.org/jira/browse/SPARK-32937
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.1.0
Reporter: Prashant Sharma



Logs from the failing test, copied from Jenkins. As of now, it fails 
consistently. 

{code}
- Test basic decommissioning *** FAILED ***
  The code passed to eventually never returned normally. Attempted 182 times 
over 3.00377927275 minutes. Last failure message: "++ id -u
  + myuid=185
  ++ id -g
  + mygid=0
  + set +e
  ++ getent passwd 185
  + uidentry=
  + set -e
  + '[' -z '' ']'
  + '[' -w /etc/passwd ']'
  + echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false'
  + SPARK_CLASSPATH=':/opt/spark/jars/*'
  + env
  + grep SPARK_JAVA_OPT_
  + sort -t_ -k4 -n
  + sed 's/[^=]*=\(.*\)/\1/g'
  + readarray -t SPARK_EXECUTOR_JAVA_OPTS
  + '[' -n '' ']'
  + '[' 3 == 2 ']'
  + '[' 3 == 3 ']'
  ++ python3 -V
  + pyv3='Python 3.7.3'
  + export PYTHON_VERSION=3.7.3
  + PYTHON_VERSION=3.7.3
  + export PYSPARK_PYTHON=python3
  + PYSPARK_PYTHON=python3
  + export PYSPARK_DRIVER_PYTHON=python3
  + PYSPARK_DRIVER_PYTHON=python3
  + '[' -n '' ']'
  + '[' -z ']'
  + '[' -z x ']'
  + SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*'
  + case "$1" in
  + shift 1
  + CMD=("$SPARK_HOME/bin/spark-submit" --conf 
"spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
  + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf 
spark.driver.bindAddress=172.17.0.4 --deploy-mode client --properties-file 
/opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner 
local:///opt/spark/tests/decommissioning.py
  20/09/17 11:06:56 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
  Starting decom test
  Using Spark's default log4j profile: 
org/apache/spark/log4j-defaults.properties
  20/09/17 11:06:56 INFO SparkContext: Running Spark version 3.1.0-SNAPSHOT
  20/09/17 11:06:57 INFO ResourceUtils: 
==
  20/09/17 11:06:57 INFO ResourceUtils: No custom resources configured for 
spark.driver.
  20/09/17 11:06:57 INFO ResourceUtils: 
==
  20/09/17 11:06:57 INFO SparkContext: Submitted application: PyMemoryTest
  20/09/17 11:06:57 INFO ResourceProfile: Default ResourceProfile created, 
executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , 
memory -> name: memory, amount: 1024, script: , vendor: ), task resources: 
Map(cpus -> name: cpus, amount: 1.0)
  20/09/17 11:06:57 INFO ResourceProfile: Limiting resource is cpus at 1 tasks 
per executor
  20/09/17 11:06:57 INFO ResourceProfileManager: Added ResourceProfile id: 0
  20/09/17 11:06:57 INFO SecurityManager: Changing view acls to: 185,jenkins
  20/09/17 11:06:57 INFO SecurityManager: Changing modify acls to: 185,jenkins
  20/09/17 11:06:57 INFO SecurityManager: Changing view acls groups to: 
  20/09/17 11:06:57 INFO SecurityManager: Changing modify acls groups to: 
  20/09/17 11:06:57 INFO SecurityManager: SecurityManager: authentication 
enabled; ui acls disabled; users  with view permissions: Set(185, jenkins); 
groups with view permissions: Set(); users  with modify permissions: Set(185, 
jenkins); groups with modify permissions: Set()
  20/09/17 11:06:57 INFO Utils: Successfully started service 'sparkDriver' on 
port 7078.
  20/09/17 11:06:57 INFO SparkEnv: Registering MapOutputTracker
  20/09/17 11:06:57 INFO SparkEnv: Registering BlockManagerMaster
  20/09/17 11:06:57 INFO BlockManagerMasterEndpoint: Using 
org.apache.spark.storage.DefaultTopologyMapper for getting topology information
  20/09/17 11:06:57 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint 
up
  20/09/17 11:06:57 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
  20/09/17 11:06:57 INFO DiskBlockManager: Created local directory at 
/var/data/spark-7985c075-3b02-42ec-9111-cefba535adf0/blockmgr-3bd403d0-6689-46be-997e-5bc699ecefd3
  20/09/17 11:06:57 INFO MemoryStore: MemoryStore started with capacity 593.9 
MiB
  20/09/17 11:06:57 INFO SparkEnv: Registering OutputCommitCoordinator
  20/09/17 11:06:58 INFO Utils: Successfully started service 'SparkUI' on port 
4040.
  20/09/17 11:06:58 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at 
http://spark-test-app-08853d749bbee080-driver-svc.a0af92633bef4a91b5f7e262e919afd9.svc:4040
  20/09/17 11:06:58 INFO SparkKubernetesClientFactory: Auto-configuring K8S 
client using current context from users K8S config file
  20/09/17 11:06:59 INFO ExecutorPodsAllocator: Going to request 3 executors 
from Kubernetes.
  20/09/17 11:06:59 INFO KubernetesClientUtils: Spark configuration files 

[jira] [Commented] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result

2020-09-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198256#comment-17198256
 ] 

Apache Spark commented on SPARK-32635:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/29805

> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> ---
>
> Key: SPARK-32635
> URL: https://issues.apache.org/jira/browse/SPARK-32635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.7, 3.0.0
>Reporter: Vinod KC
>Assignee: Peter Toth
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.0.2, 3.1.0
>
>
> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> e.g. the lit() function used with the cache() function.
>  ---
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner").cache() 
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show()
> finaldf.select('col2').show() #Wrong result
> {code}
>  
> Output
>  ---
> {code:java}
> >>> finaldf.show()
> +----+----+----+
> |col2|col3|col1|
> +----+----+----+
> |   2|   9|   b|
> +----+----+----+
> >>> finaldf.select('col2').show() #Wrong result, instead of 2, got 1
> +----+
> |col2|
> +----+
> |   1|
> +----+
> {code}
>  lit() function without cache() function.
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner")
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show() 
> finaldf.select('col2').show() #Correct result
> {code}
>  
> Output
> {code:java}
> >>> finaldf.show()
> +----+----+----+
> |col2|col3|col1|
> +----+----+----+
> |   2|   9|   b|
> +----+----+----+
> >>> finaldf.select('col2').show() #Correct result, when df_23_a is not cached
> +----+
> |col2|
> +----+
> |   2|
> +----+
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32859) Introduce SQL physical plan rule to decide enable/disable bucketing

2020-09-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32859:


Assignee: Apache Spark

> Introduce SQL physical plan rule to decide enable/disable bucketing 
> 
>
> Key: SPARK-32859
> URL: https://issues.apache.org/jira/browse/SPARK-32859
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Apache Spark
>Priority: Minor
>
> Discussed with [~cloud_fan] offline: it would be better if we could decide 
> whether to enable or disable SQL bucketing automatically based on the query plan. 
> Currently bucketing is enabled by default (`spark.sql.sources.bucketing.enabled`=true), 
> so for all bucketed tables in the query plan we will use a bucketed table scan 
> (all input files of a bucket are read by the same task). The drawback is that when 
> the bucketed scan brings no benefit (no join/groupby/etc. in the query), it only 
> restricts the number of tasks to the number of buckets and might hurt parallelism.
>  
> The proposed change is to introduce a physical plan rule (right before 
> `ensureRequirements`):
> (1).transformUp() physical plan, matching SparkPlan operator which is 
> FileSourceScanExec, if optionalBucketSet is set, enabling bucket scan (bucket 
> filter in this case).
> (2).transformUp() physical plan, matching SparkPlan operator which is 
> SparkPlanWithInterestingPartitioning.
> SparkPlanWithInterestingPartitioning: the plan is in \{SortMergeJoinExec, 
> ShuffledHashJoinExec, HashAggregateExec, ObjectHashAggregateExec, 
> SortAggregateExec, etc, which has 
> HashClusteredDistribution/ClusteredDistribution in 
> requiredChildDistribution}, and its requiredChildDistribution 
> HashClusteredDistribution/ClusteredDistribution on its underlying 
> FileSourceScanExec's bucketed columns.
> (3). For any child of SparkPlanWithInterestingPartitioning that does not 
> satisfy the plan's requiredChildDistribution, go through the child's sub-query 
> plan tree:
>  if (3.1). every node's outputPartitioning is the same as its child's, and every 
> node's requiredChildDistribution is UnspecifiedDistribution,
>  and (3.2). the leaf node is a FileSourceScanExec on a bucketed table,
>  and (3.3).if enabling bucket scan for this FileSourceScanExec, the 
> outputPartitioning of FileSourceScanExec satisfies requiredChildDistribution 
> of SparkPlanWithInterestingPartitioning.
>  If (3.1),(3.2),(3.3) are all true, enabling bucket scan for this 
> FileSourceScanExec. And double check the new child of 
> SparkPlanWithInterestingPartitioning satisfies requiredChildDistribution.
>  
> The idea of SparkPlanWithInterestingPartitioning is inspired by 
> "interesting order" in "Access Path Selection in a Relational Database 
> Management 
> System"([http://www.inf.ed.ac.uk/teaching/courses/adbs/AccessPath.pdf]).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32859) Introduce SQL physical plan rule to decide enable/disable bucketing

2020-09-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198245#comment-17198245
 ] 

Apache Spark commented on SPARK-32859:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/29804

> Introduce SQL physical plan rule to decide enable/disable bucketing 
> 
>
> Key: SPARK-32859
> URL: https://issues.apache.org/jira/browse/SPARK-32859
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Minor
>
> Discussed with [~cloud_fan] offline: it would be better if we could decide 
> whether to enable or disable SQL bucketing automatically based on the query plan. 
> Currently bucketing is enabled by default (`spark.sql.sources.bucketing.enabled`=true), 
> so for all bucketed tables in the query plan we will use a bucketed table scan 
> (all input files of a bucket are read by the same task). The drawback is that when 
> the bucketed scan brings no benefit (no join/groupby/etc. in the query), it only 
> restricts the number of tasks to the number of buckets and might hurt parallelism.
>  
> The proposed change is to introduce a physical plan rule (right before 
> `ensureRequirements`):
> (1).transformUp() physical plan, matching SparkPlan operator which is 
> FileSourceScanExec, if optionalBucketSet is set, enabling bucket scan (bucket 
> filter in this case).
> (2).transformUp() physical plan, matching SparkPlan operator which is 
> SparkPlanWithInterestingPartitioning.
> SparkPlanWithInterestingPartitioning: the plan is in \{SortMergeJoinExec, 
> ShuffledHashJoinExec, HashAggregateExec, ObjectHashAggregateExec, 
> SortAggregateExec, etc, which has 
> HashClusteredDistribution/ClusteredDistribution in 
> requiredChildDistribution}, and its requiredChildDistribution 
> HashClusteredDistribution/ClusteredDistribution on its underlying 
> FileSourceScanExec's bucketed columns.
> (3). For any child of SparkPlanWithInterestingPartitioning that does not 
> satisfy the plan's requiredChildDistribution, go through the child's sub-query 
> plan tree:
>  if (3.1). every node's outputPartitioning is the same as its child's, and every 
> node's requiredChildDistribution is UnspecifiedDistribution,
>  and (3.2). the leaf node is a FileSourceScanExec on a bucketed table,
>  and (3.3).if enabling bucket scan for this FileSourceScanExec, the 
> outputPartitioning of FileSourceScanExec satisfies requiredChildDistribution 
> of SparkPlanWithInterestingPartitioning.
>  If (3.1),(3.2),(3.3) are all true, enabling bucket scan for this 
> FileSourceScanExec. And double check the new child of 
> SparkPlanWithInterestingPartitioning satisfies requiredChildDistribution.
>  
> The idea of SparkPlanWithInterestingPartitioning is inspired by 
> "interesting order" in "Access Path Selection in a Relational Database 
> Management 
> System"([http://www.inf.ed.ac.uk/teaching/courses/adbs/AccessPath.pdf]).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32859) Introduce SQL physical plan rule to decide enable/disable bucketing

2020-09-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32859:


Assignee: (was: Apache Spark)

> Introduce SQL physical plan rule to decide enable/disable bucketing 
> 
>
> Key: SPARK-32859
> URL: https://issues.apache.org/jira/browse/SPARK-32859
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Minor
>
> Discussed with [~cloud_fan] offline: it would be better if we could decide 
> whether to enable or disable SQL bucketing automatically based on the query plan. 
> Currently bucketing is enabled by default (`spark.sql.sources.bucketing.enabled`=true), 
> so for all bucketed tables in the query plan we will use a bucketed table scan 
> (all input files of a bucket are read by the same task). The drawback is that when 
> the bucketed scan brings no benefit (no join/groupby/etc. in the query), it only 
> restricts the number of tasks to the number of buckets and might hurt parallelism.
>  
> The proposed change is to introduce a physical plan rule (right before 
> `ensureRequirements`):
> (1).transformUp() physical plan, matching SparkPlan operator which is 
> FileSourceScanExec, if optionalBucketSet is set, enabling bucket scan (bucket 
> filter in this case).
> (2).transformUp() physical plan, matching SparkPlan operator which is 
> SparkPlanWithInterestingPartitioning.
> SparkPlanWithInterestingPartitioning: the plan is in \{SortMergeJoinExec, 
> ShuffledHashJoinExec, HashAggregateExec, ObjectHashAggregateExec, 
> SortAggregateExec, etc, which has 
> HashClusteredDistribution/ClusteredDistribution in 
> requiredChildDistribution}, and its requiredChildDistribution 
> HashClusteredDistribution/ClusteredDistribution on its underlying 
> FileSourceScanExec's bucketed columns.
> (3). For any child of SparkPlanWithInterestingPartitioning that does not 
> satisfy the plan's requiredChildDistribution, go through the child's sub-query 
> plan tree:
>  if (3.1). every node's outputPartitioning is the same as its child's, and every 
> node's requiredChildDistribution is UnspecifiedDistribution,
>  and (3.2). the leaf node is a FileSourceScanExec on a bucketed table,
>  and (3.3).if enabling bucket scan for this FileSourceScanExec, the 
> outputPartitioning of FileSourceScanExec satisfies requiredChildDistribution 
> of SparkPlanWithInterestingPartitioning.
>  If (3.1),(3.2),(3.3) are all true, enabling bucket scan for this 
> FileSourceScanExec. And double check the new child of 
> SparkPlanWithInterestingPartitioning satisfies requiredChildDistribution.
>  
> The idea of SparkPlanWithInterestingPartitioning is inspired by 
> "interesting order" in "Access Path Selection in a Relational Database 
> Management 
> System"([http://www.inf.ed.ac.uk/teaching/courses/adbs/AccessPath.pdf]).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32859) Introduce SQL physical plan rule to decide enable/disable bucketing

2020-09-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198244#comment-17198244
 ] 

Apache Spark commented on SPARK-32859:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/29804

> Introduce SQL physical plan rule to decide enable/disable bucketing 
> 
>
> Key: SPARK-32859
> URL: https://issues.apache.org/jira/browse/SPARK-32859
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Minor
>
> Discussed with [~cloud_fan] offline: it would be better if we could decide 
> whether to enable or disable SQL bucketing automatically based on the query plan. 
> Currently bucketing is enabled by default (`spark.sql.sources.bucketing.enabled`=true), 
> so for all bucketed tables in the query plan we will use a bucketed table scan 
> (all input files of a bucket are read by the same task). The drawback is that when 
> the bucketed scan brings no benefit (no join/groupby/etc. in the query), it only 
> restricts the number of tasks to the number of buckets and might hurt parallelism.
>  
> The proposed change is to introduce a physical plan rule (right before 
> `ensureRequirements`):
> (1).transformUp() physical plan, matching SparkPlan operator which is 
> FileSourceScanExec, if optionalBucketSet is set, enabling bucket scan (bucket 
> filter in this case).
> (2).transformUp() physical plan, matching SparkPlan operator which is 
> SparkPlanWithInterestingPartitioning.
> SparkPlanWithInterestingPartitioning: the plan is in \{SortMergeJoinExec, 
> ShuffledHashJoinExec, HashAggregateExec, ObjectHashAggregateExec, 
> SortAggregateExec, etc, which has 
> HashClusteredDistribution/ClusteredDistribution in 
> requiredChildDistribution}, and its requiredChildDistribution 
> HashClusteredDistribution/ClusteredDistribution on its underlying 
> FileSourceScanExec's bucketed columns.
> (3). For any child of SparkPlanWithInterestingPartitioning that does not 
> satisfy the plan's requiredChildDistribution, go through the child's sub-query 
> plan tree:
>  if (3.1). every node's outputPartitioning is the same as its child's, and every 
> node's requiredChildDistribution is UnspecifiedDistribution,
>  and (3.2). the leaf node is a FileSourceScanExec on a bucketed table,
>  and (3.3).if enabling bucket scan for this FileSourceScanExec, the 
> outputPartitioning of FileSourceScanExec satisfies requiredChildDistribution 
> of SparkPlanWithInterestingPartitioning.
>  If (3.1),(3.2),(3.3) are all true, enabling bucket scan for this 
> FileSourceScanExec. And double check the new child of 
> SparkPlanWithInterestingPartitioning satisfies requiredChildDistribution.
>  
> The idea of SparkPlanWithInterestingPartitioning is inspired by 
> "interesting order" in "Access Path Selection in a Relational Database 
> Management 
> System"([http://www.inf.ed.ac.uk/teaching/courses/adbs/AccessPath.pdf]).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32935) File source V2: support bucketing

2020-09-18 Thread Rohit Mishra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198242#comment-17198242
 ] 

Rohit Mishra commented on SPARK-32935:
--

[~Gengliang.Wang], Can you please add a description?

> File source V2: support bucketing
> -
>
> Key: SPARK-32935
> URL: https://issues.apache.org/jira/browse/SPARK-32935
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32930) Replace deprecated isFile/isDirectory methods

2020-09-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198237#comment-17198237
 ] 

Hyukjin Kwon commented on SPARK-32930:
--

The JIRA is closed so it cannot be reopened - this is fixed in Spark 3.0.2 and 3.1.0.

> Replace deprecated isFile/isDirectory methods
> -
>
> Key: SPARK-32930
> URL: https://issues.apache.org/jira/browse/SPARK-32930
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: William Hyun
>Priority: Major
>
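The issue has no description; assuming it refers to Hadoop's deprecated FileSystem.isFile(Path)/isDirectory(Path), the replacement pattern would look roughly like this:

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val path = new Path("/tmp/some/file") // placeholder path
val fs: FileSystem = path.getFileSystem(new Configuration())

// Before (deprecated):
//   fs.isFile(path); fs.isDirectory(path)

// After: a single status lookup, then inspect the FileStatus object.
val status = fs.getFileStatus(path)
val isFile = status.isFile
val isDir = status.isDirectory
{code}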




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32874) Enhance result set meta data check for execute statement operation for thrift server

2020-09-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198227#comment-17198227
 ] 

Apache Spark commented on SPARK-32874:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/29803

> Enhance result set meta data check for execute statement operation for thrift 
> server
> 
>
> Key: SPARK-32874
> URL: https://issues.apache.org/jira/browse/SPARK-32874
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.1.0
>
>
> Add test cases to ensure stability of the JDBC API
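For context, a sketch of the kind of result-set metadata check such tests exercise, written against the plain JDBC API; the connection URL, query, and expected values are placeholders, not taken from the Spark test suite, and the Hive JDBC driver is assumed to be on the classpath.

{code:java}
import java.sql.DriverManager

// Placeholder endpoint for a running Thrift server; adjust host/port/database as needed.
val url = "jdbc:hive2://localhost:10000/default"
val conn = DriverManager.getConnection(url)
try {
  val rs = conn.createStatement().executeQuery("SELECT 1 AS id, 'spark' AS name")
  val meta = rs.getMetaData
  assert(meta.getColumnCount == 2)
  assert(meta.getColumnName(1) == "id")
  assert(meta.getColumnName(2) == "name")
  // The type names reported here depend on the server's type mapping, which is
  // exactly what enhanced metadata checks are meant to pin down.
  println(s"${meta.getColumnTypeName(1)}, ${meta.getColumnTypeName(2)}")
} finally {
  conn.close()
}
{code}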



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result

2020-09-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198219#comment-17198219
 ] 

Apache Spark commented on SPARK-32635:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/29802

> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> ---
>
> Key: SPARK-32635
> URL: https://issues.apache.org/jira/browse/SPARK-32635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.7, 3.0.0
>Reporter: Vinod KC
>Assignee: Peter Toth
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.0.2, 3.1.0
>
>
> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> e.g. the lit() function used with the cache() function.
>  ---
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner").cache() 
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show()
> finaldf.select('col2').show() #Wrong result
> {code}
>  
> Output
>  ---
> {code:java}
> >>> finaldf.show()
> +----+----+----+
> |col2|col3|col1|
> +----+----+----+
> |   2|   9|   b|
> +----+----+----+
> >>> finaldf.select('col2').show() #Wrong result, instead of 2, got 1
> +----+
> |col2|
> +----+
> |   1|
> +----+
> {code}
>  lit() function without cache() function.
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner")
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show() 
> finaldf.select('col2').show() #Correct result
> {code}
>  
> Output
> {code:java}
> >>> finaldf.show()
> +----+----+----+
> |col2|col3|col1|
> +----+----+----+
> |   2|   9|   b|
> +----+----+----+
> >>> finaldf.select('col2').show() #Correct result, when df_23_a is not cached
> +----+
> |col2|
> +----+
> |   2|
> +----+
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32936) Pass all `external/avro` module UTs in Scala 2.13

2020-09-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198218#comment-17198218
 ] 

Apache Spark commented on SPARK-32936:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/29801

> Pass all `external/avro` module UTs in Scala 2.13
> -
>
> Key: SPARK-32936
> URL: https://issues.apache.org/jira/browse/SPARK-32936
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Priority: Major
>
> There are 14 failed tests in the `external/avro` module, as follows:
>  * AvroV1Suite (7 FAILED)
>  * AvroV2Suite (7 FAILED)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32936) Pass all `external/avro` module UTs in Scala 2.13

2020-09-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198216#comment-17198216
 ] 

Apache Spark commented on SPARK-32936:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/29801

> Pass all `external/avro` module UTs in Scala 2.13
> -
>
> Key: SPARK-32936
> URL: https://issues.apache.org/jira/browse/SPARK-32936
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Priority: Major
>
> There are 14 failed tests in the `external/avro` module, as follows:
>  * AvroV1Suite (7 FAILED)
>  * AvroV2Suite (7 FAILED)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32936) Pass all `external/avro` module UTs in Scala 2.13

2020-09-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32936:


Assignee: (was: Apache Spark)

> Pass all `external/avro` module UTs in Scala 2.13
> -
>
> Key: SPARK-32936
> URL: https://issues.apache.org/jira/browse/SPARK-32936
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Priority: Major
>
> There are 14 failed tests in the `external/avro` module, as follows:
>  * AvroV1Suite (7 FAILED)
>  * AvroV2Suite (7 FAILED)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32936) Pass all `external/avro` module UTs in Scala 2.13

2020-09-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32936:


Assignee: Apache Spark

> Pass all `external/avro` module UTs in Scala 2.13
> -
>
> Key: SPARK-32936
> URL: https://issues.apache.org/jira/browse/SPARK-32936
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>
> There are 14 failed tests in the `external/avro` module, as follows:
>  * AvroV1Suite (7 FAILED)
>  * AvroV2Suite (7 FAILED)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32936) Pass all `external/avro` module UTs in Scala 2.13

2020-09-18 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-32936:
-
Description: 
There are 14 failed tests in the `external/avro` module, as follows:
 * AvroV1Suite (7 FAILED)
 * AvroV2Suite (7 FAILED)

  was:
I  found SPARK-32926 add Add Scala 2.13 build test in GitHub Action and there 
are 14 test failed of `external/avro` module as follow:
 * AvroV1Suite(7 FAILED)
 * AvroV2Suite(7 FAILED


> Pass all `external/avro` module UTs in Scala 2.13
> -
>
> Key: SPARK-32936
> URL: https://issues.apache.org/jira/browse/SPARK-32936
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Priority: Major
>
> There are 14 failed tests in the `external/avro` module, as follows:
>  * AvroV1Suite (7 FAILED)
>  * AvroV2Suite (7 FAILED)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32936) Pass all `external/avro` module UTs in Scala 2.13

2020-09-18 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-32936:
-
Summary: Pass all `external/avro` module UTs in Scala 2.13  (was: Pass all 
`external/avro` module test)

> Pass all `external/avro` module UTs in Scala 2.13
> -
>
> Key: SPARK-32936
> URL: https://issues.apache.org/jira/browse/SPARK-32936
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Priority: Major
>
> I found that SPARK-32926 added a Scala 2.13 build test in GitHub Actions, and 
> there are 14 failed tests in the `external/avro` module, as follows:
>  * AvroV1Suite (7 FAILED)
>  * AvroV2Suite (7 FAILED)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32936) Pass all `external/avro` module test

2020-09-18 Thread Yang Jie (Jira)
Yang Jie created SPARK-32936:


 Summary: Pass all `external/avro` module test
 Key: SPARK-32936
 URL: https://issues.apache.org/jira/browse/SPARK-32936
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Yang Jie


I found that SPARK-32926 added a Scala 2.13 build test in GitHub Actions, and 
there are 14 failed tests in the `external/avro` module, as follows:
 * AvroV1Suite (7 FAILED)
 * AvroV2Suite (7 FAILED)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27589) Spark file source V2

2020-09-18 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198203#comment-17198203
 ] 

Gengliang Wang commented on SPARK-27589:


[~tgraves] I am really sorry that I missed your question.
Yes, bucketing is not supported yet. I have just created 
https://issues.apache.org/jira/browse/SPARK-32935

> Spark file source V2
> 
>
> Key: SPARK-27589
> URL: https://issues.apache.org/jira/browse/SPARK-27589
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Re-implement file sources with data source V2 API



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32935) File source V2: support bucketing

2020-09-18 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-32935:
--

 Summary: File source V2: support bucketing
 Key: SPARK-32935
 URL: https://issues.apache.org/jira/browse/SPARK-32935
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28396) Add PathCatalog for data source V2

2020-09-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198199#comment-17198199
 ] 

Dongjoon Hyun commented on SPARK-28396:
---

Got it. Thank you for reply, [~Gengliang.Wang].

> Add PathCatalog for data source V2
> --
>
> Key: SPARK-28396
> URL: https://issues.apache.org/jira/browse/SPARK-28396
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Add PathCatalog for data source V2, so that:
> 1. We can convert SaveMode in DataFrameWriter into catalog table operations, 
> instead of supporting SaveMode in file source V2.
> 2. Support create-table SQL statements like "CREATE TABLE orc.'path'"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32934) Improve the performance for NTH_VALUE,FIRST_VALUE,LAST_VALUE

2020-09-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198195#comment-17198195
 ] 

Apache Spark commented on SPARK-32934:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/29800

> Improve the performance for NTH_VALUE,FIRST_VALUE,LAST_VALUE
> 
>
> Key: SPARK-32934
> URL: https://issues.apache.org/jira/browse/SPARK-32934
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> Spark SQL supports some window functions such as NTH_VALUE, FIRST_VALUE and 
> LAST_VALUE.
> If we specify a window frame like
> {code:java}
> UNBOUNDED PRECEDING AND CURRENT ROW
> {code}
> or
> {code:java}
> UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> {code}
> we can eliminate some calculations.
> For example, if we execute the SQL shown below:
> {code:java}
> SELECT NTH_VALUE(col, 2)
>   OVER(ORDER BY rank ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
> FROM tab;
> {code}
> the output for every row number greater than 1 is the fixed value (the value at 
> row 2) and null otherwise, so we only need to compute the value once and check 
> whether the row number is less than 2.
> The UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING case is even simpler.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32934) Improve the performance for NTH_VALUE,FIRST_VALUE,LAST_VALUE

2020-09-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198193#comment-17198193
 ] 

Apache Spark commented on SPARK-32934:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/29800

> Improve the performance for NTH_VALUE,FIRST_VALUE,LAST_VALUE
> 
>
> Key: SPARK-32934
> URL: https://issues.apache.org/jira/browse/SPARK-32934
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> Spark SQL supports some window functions such as NTH_VALUE, FIRST_VALUE and 
> LAST_VALUE.
> If we specify a window frame like
> {code:java}
> UNBOUNDED PRECEDING AND CURRENT ROW
> {code}
> or
> {code:java}
> UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> {code}
> we can eliminate some calculations.
> For example, if we execute the SQL shown below:
> {code:java}
> SELECT NTH_VALUE(col, 2)
>   OVER(ORDER BY rank ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
> FROM tab;
> {code}
> the output for every row number greater than 1 is the fixed value (the value at 
> row 2) and null otherwise, so we only need to compute the value once and check 
> whether the row number is less than 2.
> The UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING case is even simpler.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32934) Improve the performance for NTH_VALUE,FIRST_VALUE,LAST_VALUE

2020-09-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32934:


Assignee: (was: Apache Spark)

> Improve the performance for NTH_VALUE,FIRST_VALUE,LAST_VALUE
> 
>
> Key: SPARK-32934
> URL: https://issues.apache.org/jira/browse/SPARK-32934
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> Spark SQL supports some window functions such as NTH_VALUE, FIRST_VALUE and 
> LAST_VALUE.
> If we specify a window frame like
> {code:java}
> UNBOUNDED PRECEDING AND CURRENT ROW
> {code}
> or
> {code:java}
> UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> {code}
> we can eliminate some calculations.
> For example, if we execute the SQL shown below:
> {code:java}
> SELECT NTH_VALUE(col, 2)
>   OVER(ORDER BY rank ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
> FROM tab;
> {code}
> the output for every row number greater than 1 is the fixed value (the value at 
> row 2) and null otherwise, so we only need to compute the value once and check 
> whether the row number is less than 2.
> The UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING case is even simpler.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32934) Improve the performance for NTH_VALUE,FIRST_VALUE,LAST_VALUE

2020-09-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32934:


Assignee: Apache Spark

> Improve the performance for NTH_VALUE,FIRST_VALUE,LAST_VALUE
> 
>
> Key: SPARK-32934
> URL: https://issues.apache.org/jira/browse/SPARK-32934
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> Spark SQL supports some window functions such as NTH_VALUE, FIRST_VALUE and 
> LAST_VALUE.
> If we specify a window frame like
> {code:java}
> UNBOUNDED PRECEDING AND CURRENT ROW
> {code}
> or
> {code:java}
> UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> {code}
> we can eliminate some calculations.
> For example, if we execute the SQL shown below:
> {code:java}
> SELECT NTH_VALUE(col, 2)
>   OVER(ORDER BY rank ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
> FROM tab;
> {code}
> the output for every row number greater than 1 is the fixed value (the value at 
> row 2) and null otherwise, so we only need to compute the value once and check 
> whether the row number is less than 2.
> The UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING case is even simpler.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32934) Improve the performance for NTH_VALUE,FIRST_VALUE,LAST_VALUE

2020-09-18 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-32934:
---
Summary: Improve the performance for NTH_VALUE,FIRST_VALUE,LAST_VALUE  
(was: Support a new window frame could improve the performance for 
NTH_VALUE,FIRST_VALUE,LAST_VALUE)

> Improve the performance for NTH_VALUE,FIRST_VALUE,LAST_VALUE
> 
>
> Key: SPARK-32934
> URL: https://issues.apache.org/jira/browse/SPARK-32934
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> Spark SQL supports some window functions such as NTH_VALUE, FIRST_VALUE and 
> LAST_VALUE.
> If we specify a window frame like
> {code:java}
> UNBOUNDED PRECEDING AND CURRENT ROW
> {code}
> or
> {code:java}
> UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> {code}
> we can eliminate some calculations.
> For example, if we execute the SQL shown below:
> {code:java}
> SELECT NTH_VALUE(col, 2)
>   OVER(ORDER BY rank ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
> FROM tab;
> {code}
> the output for every row number greater than 1 is the fixed value (the value at 
> row 2) and null otherwise, so we only need to compute the value once and check 
> whether the row number is less than 2.
> The UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING case is even simpler.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32934) Support a new window frame could improve the performance for NTH_VALUE,FIRST_VALUE,LAST_VALUE

2020-09-18 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-32934:
--

 Summary: Support a new window frame could improve the performance 
for NTH_VALUE,FIRST_VALUE,LAST_VALUE
 Key: SPARK-32934
 URL: https://issues.apache.org/jira/browse/SPARK-32934
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: jiaan.geng


Spark SQL supports some window functions such as NTH_VALUE, FIRST_VALUE and 
LAST_VALUE.
If we specify a window frame like

{code:java}
UNBOUNDED PRECEDING AND CURRENT ROW
{code}

or

{code:java}
UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
{code}

we can eliminate some calculations.
For example, if we execute the SQL shown below:

{code:java}
SELECT NTH_VALUE(col, 2)
  OVER(ORDER BY rank ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM tab;
{code}
the output for every row number greater than 1 is the fixed value (the value at 
row 2) and null otherwise, so we only need to compute the value once and check 
whether the row number is less than 2.
The UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING case is even simpler.
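For illustration, the same frame expressed through the DataFrame API (an active SparkSession named spark, a table named tab, and the NTH_VALUE function from SPARK-27951 are assumed; this shows the standard Window frame API, not the proposed optimization itself):

{code:java}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.expr

// Frame: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, ordered by `rank`.
val w = Window.orderBy("rank")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

// With this frame, nth_value(col, 2) can only ever produce one distinct non-null
// value per partition, which is what makes the proposed shortcut possible.
val result = spark.table("tab")
  .withColumn("nth", expr("nth_value(col, 2)").over(w))
result.show()
{code}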





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32905) ApplicationMaster fails to receive UpdateDelegationTokens message

2020-09-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32905.
-
Fix Version/s: 3.0.2
   3.1.0
   Resolution: Fixed

Issue resolved by pull request 29777
[https://github.com/apache/spark/pull/29777]

> ApplicationMaster fails to receive UpdateDelegationTokens message
> -
>
> Key: SPARK-32905
> URL: https://issues.apache.org/jira/browse/SPARK-32905
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.1.0, 3.0.2
>
>
> {code:java}
> 20-09-15 18:53:01 INFO yarn.YarnAllocator: Received 22 containers from YARN, 
> launching executors on 0 of them.
> 20-09-16 12:52:28 ERROR netty.Inbox: Ignoring error
> org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) 
> does not implement 'receive'
>   at 
> org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70)
>   at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>   at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20-09-17 06:52:28 ERROR netty.Inbox: Ignoring error
> org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) 
> does not implement 'receive'
>   at 
> org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70)
>   at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>   at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> With a long-running application in kerberized mode, the AMEndpoint handles 
> the token update message incorrectly.
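A minimal, generic illustration of the failure mode in the stack trace (simplified stand-ins, not Spark's real RPC classes): a one-way message is only handled if the endpoint's receive partial function has a case for it.

{code:java}
// Toy endpoint with the same shape as RpcEndpoint.receive; the message name is assumed.
case class UpdateDelegationTokens(tokens: Array[Byte])

trait ToyEndpoint {
  // Default behaviour mirrors the error in the log: unhandled one-way messages fail.
  def receive: PartialFunction[Any, Unit] = {
    case msg => throw new IllegalStateException(s"does not implement 'receive' for $msg")
  }
}

object AmEndpointSketch extends ToyEndpoint {
  override def receive: PartialFunction[Any, Unit] = {
    case UpdateDelegationTokens(tokens) =>
      // In the real fix this is where the refreshed tokens would be propagated.
      println(s"received ${tokens.length} bytes of refreshed delegation tokens")
  }
}

// AmEndpointSketch.receive(UpdateDelegationTokens(Array[Byte](1, 2, 3)))
{code}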



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32905) ApplicationMaster fails to receive UpdateDelegationTokens message

2020-09-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32905:
---

Assignee: Kent Yao

> ApplicationMaster fails to receive UpdateDelegationTokens message
> -
>
> Key: SPARK-32905
> URL: https://issues.apache.org/jira/browse/SPARK-32905
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> {code:java}
> 20-09-15 18:53:01 INFO yarn.YarnAllocator: Received 22 containers from YARN, 
> launching executors on 0 of them.
> 20-09-16 12:52:28 ERROR netty.Inbox: Ignoring error
> org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) 
> does not implement 'receive'
>   at 
> org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70)
>   at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>   at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20-09-17 06:52:28 ERROR netty.Inbox: Ignoring error
> org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) 
> does not implement 'receive'
>   at 
> org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70)
>   at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>   at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> With a long-running application in kerberized mode, the AMEndpoint handles 
> the token update message incorrectly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32933) Use keyword-only syntax for keyword_only methods

2020-09-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198179#comment-17198179
 ] 

Apache Spark commented on SPARK-32933:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/29799

> Use keyword-only syntax for keyword_only methods
> 
>
> Key: SPARK-32933
> URL: https://issues.apache.org/jira/browse/SPARK-32933
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> Since 3.0, Python provides syntax for indicating keyword-only arguments ([PEP 
> 3102|https://www.python.org/dev/peps/pep-3102/]).
> It is not a full replacement for our current usage of {{keyword_only}}, but 
> it would allow us to make our expectations explicit:
> {code:python}
> @keyword_only
> def __init__(self, degree=2, inputCol=None, outputCol=None):
> {code}
> {code:python}
> @keyword_only
> def __init__(self, *, degree=2, inputCol=None, outputCol=None):
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32933) Use keyword-only syntax for keyword_only methods

2020-09-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32933:


Assignee: Apache Spark

> Use keyword-only syntax for keyword_only methods
> 
>
> Key: SPARK-32933
> URL: https://issues.apache.org/jira/browse/SPARK-32933
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Minor
>
> Since 3.0, Python provides syntax for indicating keyword-only arguments ([PEP 
> 3102|https://www.python.org/dev/peps/pep-3102/]).
> It is not a full replacement for our current usage of {{keyword_only}}, but 
> it would allow us to make our expectations explicit:
> {code:python}
> @keyword_only
> def __init__(self, degree=2, inputCol=None, outputCol=None):
> {code}
> {code:python}
> @keyword_only
> def __init__(self, *, degree=2, inputCol=None, outputCol=None):
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32933) Use keyword-only syntax for keyword_only methods

2020-09-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32933:


Assignee: (was: Apache Spark)

> Use keyword-only syntax for keyword_only methods
> 
>
> Key: SPARK-32933
> URL: https://issues.apache.org/jira/browse/SPARK-32933
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> Since 3.0, Python provides syntax for indicating keyword-only arguments ([PEP 
> 3102|https://www.python.org/dev/peps/pep-3102/]).
> It is not a full replacement for our current usage of {{keyword_only}}, but 
> it would allow us to make our expectations explicit:
> {code:python}
> @keyword_only
> def __init__(self, degree=2, inputCol=None, outputCol=None):
> {code}
> {code:python}
> @keyword_only
> def __init__(self, *, degree=2, inputCol=None, outputCol=None):
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32933) Use keyword-only syntax for keyword_only methods

2020-09-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198178#comment-17198178
 ] 

Apache Spark commented on SPARK-32933:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/29799

> Use keyword-only syntax for keyword_only methods
> 
>
> Key: SPARK-32933
> URL: https://issues.apache.org/jira/browse/SPARK-32933
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> Since 3.0, Python provides syntax for indicating keyword-only arguments ([PEP 
> 3102|https://www.python.org/dev/peps/pep-3102/]).
> It is not a full replacement for our current usage of {{keyword_only}}, but 
> it would allow us to make our expectations explicit:
> {code:python}
> @keyword_only
> def __init__(self, degree=2, inputCol=None, outputCol=None):
> {code}
> {code:python}
> @keyword_only
> def __init__(self, *, degree=2, inputCol=None, outputCol=None):
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32933) Use keyword-only syntax for keyword_only methods

2020-09-18 Thread Maciej Szymkiewicz (Jira)
Maciej Szymkiewicz created SPARK-32933:
--

 Summary: Use keyword-only syntax for keyword_only methods
 Key: SPARK-32933
 URL: https://issues.apache.org/jira/browse/SPARK-32933
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.1.0
Reporter: Maciej Szymkiewicz


Since 3.0, Python provides syntax for indicating keyword-only arguments ([PEP 3102|https://www.python.org/dev/peps/pep-3102/]).

It is not a full replacement for our current usage of {{keyword_only}}, but it 
would allow us to make our expectations explicit:


{code:python}
@keyword_only
def __init__(self, degree=2, inputCol=None, outputCol=None):
{code}

{code:python}
@keyword_only
def __init__(self, *, degree=2, inputCol=None, outputCol=None):
{code}
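
For illustration only, a minimal standalone sketch of what the bare {{*}} buys us: positional calls fail fast with a {{TypeError}} instead of silently binding values to the wrong parameters. The {{Poly}} class and its parameters below are made up for this sketch and are not a real PySpark estimator.

{code:python}
# Hypothetical class; names are illustrative only, not a real PySpark API.
class Poly:
    def __init__(self, *, degree=2, inputCol=None, outputCol=None):
        self.degree = degree
        self.inputCol = inputCol
        self.outputCol = outputCol

Poly(degree=3, inputCol="features", outputCol="expanded")   # OK

try:
    Poly(3, "features", "expanded")   # positional arguments are rejected
except TypeError as e:
    print(e)   # e.g. "__init__() takes 1 positional argument but 4 were given"
{code}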



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27951) ANSI SQL: NTH_VALUE function

2020-09-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27951.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29604
[https://github.com/apache/spark/pull/29604]

> ANSI SQL: NTH_VALUE function
> 
>
> Key: SPARK-27951
> URL: https://issues.apache.org/jira/browse/SPARK-27951
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Zhu, Lipeng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.1.0
>
>
> ||Function||Return Type||Description||
> |{{nth_value(value any, nth integer)}}|same type as {{value}}|returns {{value}} evaluated at the row that is the {{nth}} row of the window frame (counting from 1); null if no such row|
> [https://www.postgresql.org/docs/8.4/functions-window.html]
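
For reference, a hedged usage sketch of the function through Spark SQL. The {{spark}} session, the {{sales}} table, and its columns are assumed here purely for illustration.

{code:python}
# Illustrative only: assumes an existing SparkSession `spark` and a table
# `sales(dept, ts, amount)`; names are made up for this sketch.
spark.sql("""
    SELECT dept, ts, amount,
           nth_value(amount, 2) OVER (
               PARTITION BY dept
               ORDER BY ts
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS second_amount
    FROM sales
""").show()
{code}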



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27951) ANSI SQL: NTH_VALUE function

2020-09-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27951:
---

Assignee: jiaan.geng

> ANSI SQL: NTH_VALUE function
> 
>
> Key: SPARK-27951
> URL: https://issues.apache.org/jira/browse/SPARK-27951
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Zhu, Lipeng
>Assignee: jiaan.geng
>Priority: Major
>
> ||Function||Return Type||Description||
> |{{nth_value(value any, nth integer)}}|same type as {{value}}|returns {{value}} evaluated at the row that is the {{nth}} row of the window frame (counting from 1); null if no such row|
> [https://www.postgresql.org/docs/8.4/functions-window.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32932) AQE local shuffle reader breaks repartitioning for dynamic partition overwrite

2020-09-18 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-32932:
---
Description: 
With AQE, local shuffle reader breaks users' repartitioning for dynamic 
partition overwrite as in the following case.
{code:java}
test("repartition with local reader") {
  withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> 
PartitionOverwriteMode.DYNAMIC.toString,
SQLConf.SHUFFLE_PARTITIONS.key -> "5",
SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
withTable("t") {
  val data = for (
i <- 1 to 10;
j <- 1 to 3
  ) yield (i, j)
  data.toDF("a", "b")
.repartition($"b")
.write
.partitionBy("b")
.mode("overwrite")
.saveAsTable("t")
  assert(spark.read.table("t").inputFiles.length == 3)
}
  }
}{code}
Coalescing shuffle partitions could also break it.

  was:
With AQE, local reader optimizer breaks users' repartitioning for dynamic 
partition overwrite as in the following case.
{code:java}
test("repartition with local reader") {
  withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> 
PartitionOverwriteMode.DYNAMIC.toString,
SQLConf.SHUFFLE_PARTITIONS.key -> "5",
SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
withTable("t") {
  val data = for (
i <- 1 to 10;
j <- 1 to 3
  ) yield (i, j)
  data.toDF("a", "b")
.repartition($"b")
.write
.partitionBy("b")
.mode("overwrite")
.saveAsTable("t")
  assert(spark.read.table("t").inputFiles.length == 3)
}
  }
}{code}
Coalescing shuffle partitions could also break it.


> AQE local shuffle reader breaks repartitioning for dynamic partition overwrite
> --
>
> Key: SPARK-32932
> URL: https://issues.apache.org/jira/browse/SPARK-32932
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Priority: Minor
>
> With AQE, local shuffle reader breaks users' repartitioning for dynamic 
> partition overwrite as in the following case.
> {code:java}
> test("repartition with local reader") {
>   withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> 
> PartitionOverwriteMode.DYNAMIC.toString,
> SQLConf.SHUFFLE_PARTITIONS.key -> "5",
> SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
> withTable("t") {
>   val data = for (
> i <- 1 to 10;
> j <- 1 to 3
>   ) yield (i, j)
>   data.toDF("a", "b")
> .repartition($"b")
> .write
> .partitionBy("b")
> .mode("overwrite")
> .saveAsTable("t")
>   assert(spark.read.table("t").inputFiles.length == 3)
> }
>   }
> }{code}
> Coalescing shuffle partitions could also break it.
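
A hedged workaround sketch, not part of this ticket's proposed fix: turning off the AQE local shuffle reader, and the partition coalescing mentioned above, should keep the user-requested repartitioning intact at the cost of those optimizations. The config keys below are the Spark 3.0 names, and the {{df}} DataFrame is assumed for illustration.

{code:python}
# Workaround sketch only; assumes an existing SparkSession `spark` and an
# input DataFrame `df` with columns "a" and "b".
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Disable the two AQE rules that can rewrite the user's repartition("b"):
spark.conf.set("spark.sql.adaptive.localShuffleReader.enabled", "false")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "false")

(df.repartition("b")
   .write
   .partitionBy("b")
   .mode("overwrite")
   .saveAsTable("t"))
{code}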



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32932) AQE local shuffle reader breaks repartitioning for dynamic partition overwrite

2020-09-18 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-32932:
---
Summary: AQE local shuffle reader breaks repartitioning for dynamic 
partition overwrite  (was: AQE local reader optimizer breaks repartitioning for 
dynamic partition overwrite)

> AQE local shuffle reader breaks repartitioning for dynamic partition overwrite
> --
>
> Key: SPARK-32932
> URL: https://issues.apache.org/jira/browse/SPARK-32932
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Priority: Minor
>
> With AQE, local reader optimizer breaks users' repartitioning for dynamic 
> partition overwrite as in the following case.
> {code:java}
> test("repartition with local reader") {
>   withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> 
> PartitionOverwriteMode.DYNAMIC.toString,
> SQLConf.SHUFFLE_PARTITIONS.key -> "5",
> SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
> withTable("t") {
>   val data = for (
> i <- 1 to 10;
> j <- 1 to 3
>   ) yield (i, j)
>   data.toDF("a", "b")
> .repartition($"b")
> .write
> .partitionBy("b")
> .mode("overwrite")
> .saveAsTable("t")
>   assert(spark.read.table("t").inputFiles.length == 3)
> }
>   }
> }{code}
> Coalescing shuffle partitions could also break it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org