[jira] [Resolved] (SPARK-31334) Use agg column in Having clause behave different with column type

2020-05-08 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu resolved SPARK-31334.
---
Resolution: Fixed

> Use agg column in Having clause behave different with column type 
> --
>
> Key: SPARK-31334
> URL: https://issues.apache.org/jira/browse/SPARK-31334
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> {code:java}
> test("") {
> Seq(
>   (1, 3),
>   (2, 3),
>   (3, 6),
>   (4, 7),
>   (5, 9),
>   (6, 9)
> ).toDF("a", "b").createOrReplaceTempView("testData")
> val x = sql(
>   """
> | SELECT b, sum(a) as a
> | FROM testData
> | GROUP BY b
> | HAVING sum(a) > 3
>   """.stripMargin)
> x.explain()
> x.show()
>   }
> [info] -  *** FAILED *** (508 milliseconds)
> [info]   org.apache.spark.sql.AnalysisException: Resolved attribute(s) a#184 
> missing from a#180,b#181 in operator !Aggregate [b#181], [b#181, 
> sum(cast(a#180 as double)) AS a#184, sum(a#184) AS sum(a#184)#188]. 
> Attribute(s) with the same name appear in the operation: a. Please check if 
> the right attribute(s) are used.;;
> [info] Project [b#181, a#184]
> [info] +- Filter (sum(a#184)#188 > cast(3 as double))
> [info]+- !Aggregate [b#181], [b#181, sum(cast(a#180 as double)) AS a#184, 
> sum(a#184) AS sum(a#184)#188]
> [info]   +- SubqueryAlias `testdata`
> [info]  +- Project [_1#177 AS a#180, _2#178 AS b#181]
> [info] +- LocalRelation [_1#177, _2#178]
> {code}
> {code:java}
> test("") {
> Seq(
>   ("1", "3"),
>   ("2", "3"),
>   ("3", "6"),
>   ("4", "7"),
>   ("5", "9"),
>   ("6", "9")
> ).toDF("a", "b").createOrReplaceTempView("testData")
> val x = sql(
>   """
> | SELECT b, sum(a) as a
> | FROM testData
> | GROUP BY b
> | HAVING sum(a) > 3
>   """.stripMargin)
> x.explain()
> x.show()
>   }
> == Physical Plan ==
> *(2) Project [b#181, a#184L]
> +- *(2) Filter (isnotnull(sum(cast(a#180 as bigint))#197L) && (sum(cast(a#180 
> as bigint))#197L > 3))
>+- *(2) HashAggregate(keys=[b#181], functions=[sum(cast(a#180 as bigint))])
>   +- Exchange hashpartitioning(b#181, 5)
>  +- *(1) HashAggregate(keys=[b#181], 
> functions=[partial_sum(cast(a#180 as bigint))])
> +- *(1) Project [_1#177 AS a#180, _2#178 AS b#181]
>+- LocalTableScan [_1#177, _2#178]
> {code}
> I spent a lot of time but could not find which analyzer rule makes these behave differently.
> When the column type is double, the query fails.
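> A hypothetical workaround sketch (not from this ticket): aliasing the aggregate to a name
> that does not collide with the input column a sidesteps the conflict flagged in the analysis
> error above; the alias name below is only illustrative.
> {code:java}
> val y = sql(
>   """
>     | SELECT b, sum(a) AS total
>     | FROM testData
>     | GROUP BY b
>     | HAVING sum(a) > 3
>   """.stripMargin)
> y.show()
> {code}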






[jira] [Issue Comment Deleted] (SPARK-31663) Grouping sets with having clause returns the wrong result

2020-05-08 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-31663:
--
Comment: was deleted

(was: cc [~dongjoon] [~XuanYuan]

This problem seems a little like
https://issues.apache.org/jira/browse/SPARK-31334)

> Grouping sets with having clause returns the wrong result
> -
>
> Key: SPARK-31663
> URL: https://issues.apache.org/jira/browse/SPARK-31663
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>  Labels: correctness
>
> Grouping sets with a HAVING clause return the wrong result when the HAVING
> condition contains conflicting naming. See the example below:
> {code:java}
> select sum(a) as b FROM VALUES (1, 10), (2, 20) AS T(a, b) group by GROUPING 
> SETS ((b), (a, b)) having b > 10{code}
> The `b` in `having b > 10` should be resolved as `T.b` not `sum(a)`, so the 
> right result should be
> {code:java}
> +---+
> |  b|
> +---+
> |  2|
> |  2|
> +---+{code}
> instead of an empty result.
> The root cause is similar to SPARK-31519: we parse HAVING as
> Filter(..., Agg(...)) and resolve these two parts in different rules. CUBE and
> ROLLUP have the same issue.
> Other systems work as expected; I checked PostgreSQL 9.6 and MS SQL Server 2017.
>  
> For Apache Spark 2.0.2 ~ 2.3.4, the following query is tested.
> {code:java}
> spark-sql> select sum(a) as b from t group by b grouping sets(b) having b > 
> 10;
> Time taken: 0.194 seconds
> hive> select sum(a) as b from t group by b grouping sets(b) having b > 10;
> 2
> Time taken: 1.605 seconds, Fetched: 1 row(s) {code}
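> A hypothetical sketch (not from this ticket): with an alias that does not collide with `T.b`,
> the `b` in the HAVING clause can only resolve to the grouping column, which should yield the
> two expected rows shown above (the alias name `s` is only illustrative):
> {code:java}
> spark.sql(
>   """SELECT sum(a) AS s
>     |FROM VALUES (1, 10), (2, 20) AS T(a, b)
>     |GROUP BY GROUPING SETS ((b), (a, b))
>     |HAVING b > 10""".stripMargin).show()
> // Expected: two rows with s = 2, matching the intended result above.
> {code}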






[jira] [Assigned] (SPARK-31668) Saving and loading HashingTF leads to hash function changed

2020-05-08 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu reassigned SPARK-31668:
--

Assignee: Weichen Xu

> Saving and loading HashingTF leads to hash function changed
> ---
>
> Key: SPARK-31668
> URL: https://issues.apache.org/jira/browse/SPARK-31668
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Blocker
>
> If we use Spark 2.x to save a HashingTF, then use Spark 3.0 to load it, save it
> again, and load it again, the hash function will be changed.
> This bug is hard to debug; we need to fix it.






[jira] [Created] (SPARK-31668) Saving and loading HashingTF leads to hash function changed

2020-05-08 Thread Weichen Xu (Jira)
Weichen Xu created SPARK-31668:
--

 Summary: Saving and loading HashingTF leads to hash function 
changed
 Key: SPARK-31668
 URL: https://issues.apache.org/jira/browse/SPARK-31668
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 3.0.0, 3.0.1, 3.1.0
Reporter: Weichen Xu


If we use Spark 2.x to save a HashingTF, then use Spark 3.0 to load it, save it
again, and load it again, the hash function will be changed.

This bug is hard to debug; we need to fix it.
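A minimal sketch of the round trip described above (the paths and column names are assumed,
not from this ticket), to make the reported sequence concrete:
{code:java}
import org.apache.spark.ml.feature.HashingTF

// Step 1 (Spark 2.x): save a configured HashingTF.
// new HashingTF().setInputCol("words").setOutputCol("tf").save("/tmp/htf-2x")

// Steps 2-4 (Spark 3.0): load the 2.x transformer, save it again, then load the re-saved copy.
val loadedFrom2x = HashingTF.load("/tmp/htf-2x")
loadedFrom2x.write.overwrite().save("/tmp/htf-3x")
val reloaded = HashingTF.load("/tmp/htf-3x") // reportedly ends up with a changed hash function
{code}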






[jira] [Commented] (SPARK-31663) Grouping sets with having clause returns the wrong result

2020-05-08 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17103039#comment-17103039
 ] 

angerszhu commented on SPARK-31663:
---

cc [~dongjoon] [~XuanYuan]

This problem seems a little like
https://issues.apache.org/jira/browse/SPARK-31334

> Grouping sets with having clause returns the wrong result
> -
>
> Key: SPARK-31663
> URL: https://issues.apache.org/jira/browse/SPARK-31663
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>  Labels: correctness
>
> Grouping sets with a HAVING clause return the wrong result when the HAVING
> condition contains conflicting naming. See the example below:
> {code:java}
> select sum(a) as b FROM VALUES (1, 10), (2, 20) AS T(a, b) group by GROUPING 
> SETS ((b), (a, b)) having b > 10{code}
> The `b` in `having b > 10` should be resolved as `T.b` not `sum(a)`, so the 
> right result should be
> {code:java}
> +---+
> |  b|
> +---+
> |  2|
> |  2|
> +---+{code}
> instead of an empty result.
> The root cause is similar to SPARK-31519: we parse HAVING as
> Filter(..., Agg(...)) and resolve these two parts in different rules. CUBE and
> ROLLUP have the same issue.
> Other systems work as expected; I checked PostgreSQL 9.6 and MS SQL Server 2017.
>  
> For Apache Spark 2.0.2 ~ 2.3.4, the following query is tested.
> {code:java}
> spark-sql> select sum(a) as b from t group by b grouping sets(b) having b > 
> 10;
> Time taken: 0.194 seconds
> hive> select sum(a) as b from t group by b grouping sets(b) having b > 10;
> 2
> Time taken: 1.605 seconds, Fetched: 1 row(s) {code}






[jira] [Updated] (SPARK-31610) Expose hashFunc property in HashingTF

2020-05-08 Thread Xiangrui Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-31610:
--
Issue Type: Bug  (was: Improvement)

> Expose hashFunc property in HashingTF
> -
>
> Key: SPARK-31610
> URL: https://issues.apache.org/jira/browse/SPARK-31610
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Weichen Xu
>Priority: Critical
>
> Expose hashFunc property in HashingTF
> Some third-party libraries, such as MLeap, need to access it.
> See background description here:
> https://github.com/combust/mleap/pull/665#issuecomment-621258623






[jira] [Updated] (SPARK-31610) Expose hashFunc property in HashingTF

2020-05-08 Thread Xiangrui Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-31610:
--
Priority: Critical  (was: Major)

> Expose hashFunc property in HashingTF
> -
>
> Key: SPARK-31610
> URL: https://issues.apache.org/jira/browse/SPARK-31610
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Weichen Xu
>Priority: Critical
>
> Expose hashFunc property in HashingTF
> Some third-party libraries, such as MLeap, need to access it.
> See background description here:
> https://github.com/combust/mleap/pull/665#issuecomment-621258623






[jira] [Resolved] (SPARK-31611) Register NettyMemoryMetrics into Node Manager's metrics system

2020-05-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31611.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28416
[https://github.com/apache/spark/pull/28416]

> Register NettyMemoryMetrics into Node Manager's metrics system
> --
>
> Key: SPARK-31611
> URL: https://issues.apache.org/jira/browse/SPARK-31611
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, YARN
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.1.0
>
>







[jira] [Assigned] (SPARK-31611) Register NettyMemoryMetrics into Node Manager's metrics system

2020-05-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31611:
-

Assignee: Manu Zhang

> Register NettyMemoryMetrics into Node Manager's metrics system
> --
>
> Key: SPARK-31611
> URL: https://issues.apache.org/jira/browse/SPARK-31611
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, YARN
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
>







[jira] [Commented] (SPARK-31667) Python side flatten the result dataframe of ANOVATest/ChisqTest/FValueTest

2020-05-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102948#comment-17102948
 ] 

Apache Spark commented on SPARK-31667:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/28483

> Python side flatten the result dataframe of ANOVATest/ChisqTest/FValueTest
> --
>
> Key: SPARK-31667
> URL: https://issues.apache.org/jira/browse/SPARK-31667
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Add Python version of
> {code:java}
> @Since("3.1.0")
> def test(
> dataset: DataFrame,
> featuresCol: String,
> labelCol: String,
> flatten: Boolean): DataFrame 
> {code}






[jira] [Assigned] (SPARK-31667) Python side flatten the result dataframe of ANOVATest/ChisqTest/FValueTest

2020-05-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31667:


Assignee: Apache Spark

> Python side flatten the result dataframe of ANOVATest/ChisqTest/FValueTest
> --
>
> Key: SPARK-31667
> URL: https://issues.apache.org/jira/browse/SPARK-31667
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Minor
>
> Add Python version of
> {code:java}
> @Since("3.1.0")
> def test(
> dataset: DataFrame,
> featuresCol: String,
> labelCol: String,
> flatten: Boolean): DataFrame 
> {code}






[jira] [Commented] (SPARK-31667) Python side flatten the result dataframe of ANOVATest/ChisqTest/FValueTest

2020-05-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102947#comment-17102947
 ] 

Apache Spark commented on SPARK-31667:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/28483

> Python side flatten the result dataframe of ANOVATest/ChisqTest/FValueTest
> --
>
> Key: SPARK-31667
> URL: https://issues.apache.org/jira/browse/SPARK-31667
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Add Python version of
> {code:java}
> @Since("3.1.0")
> def test(
> dataset: DataFrame,
> featuresCol: String,
> labelCol: String,
> flatten: Boolean): DataFrame 
> {code}






[jira] [Assigned] (SPARK-31667) Python side flatten the result dataframe of ANOVATest/ChisqTest/FValueTest

2020-05-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31667:


Assignee: (was: Apache Spark)

> Python side flatten the result dataframe of ANOVATest/ChisqTest/FValueTest
> --
>
> Key: SPARK-31667
> URL: https://issues.apache.org/jira/browse/SPARK-31667
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Add Python version of
> {code:java}
> @Since("3.1.0")
> def test(
> dataset: DataFrame,
> featuresCol: String,
> labelCol: String,
> flatten: Boolean): DataFrame 
> {code}






[jira] [Commented] (SPARK-31627) Font style of Spark SQL DAG-viz is broken in Chrome on macOS

2020-05-08 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102944#comment-17102944
 ] 

Dongjoon Hyun commented on SPARK-31627:
---

Thank you, [~sarutak] and [~hyukjin.kwon]. +1, too.

> Font style of Spark SQL DAG-viz is broken in Chrome on macOS
> 
>
> Key: SPARK-31627
> URL: https://issues.apache.org/jira/browse/SPARK-31627
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.0
> Environment: * macOS
> * Chrome 81
>Reporter: Kousuke Saruta
>Priority: Minor
> Attachments: font-weight-does-not-work.png
>
>
> If all the following conditions are true, the font style of the Spark SQL DAG-viz can
> be broken.
>  More specifically, the plan name will not be rendered in bold if the plan is
> WholeStageCodegen.
>  * macOS
>  * Chrome (version 81)
> The current master uses Bootstrap4, which defines the default font family as 
> follows.
> {code:java}
> -apple-system,BlinkMacSystemFont,"Segoe UI",Roboto,"Helvetica 
> Neue",Arial,"Noto Sans",sans-serif,"Apple Color Emoji","Segoe UI 
> Emoji","Segoe UI Symbol","Noto Color Emoji"
> {code}
> In Chrome, BlinkMacSystemFont is picked, but the font-weight property doesn't
> work with that font when it is used inside SVG tags.
>  This issue is reported
> [here|https://bugs.chromium.org/p/chromium/issues/detail?id=1057654].






[jira] [Created] (SPARK-31667) Python side flatten the result dataframe of ANOVATest/ChisqTest/FValueTest

2020-05-08 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-31667:
--

 Summary: Python side flatten the result dataframe of 
ANOVATest/ChisqTest/FValueTest
 Key: SPARK-31667
 URL: https://issues.apache.org/jira/browse/SPARK-31667
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Affects Versions: 3.1.0
Reporter: Huaxin Gao


Add Python version of

{code:java}
@Since("3.1.0")
def test(
dataset: DataFrame,
featuresCol: String,
labelCol: String,
flatten: Boolean): DataFrame 
{code}
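A hypothetical usage sketch of the Scala method above, illustrating what the Python wrapper
would mirror (the class name ChiSquareTest and the DataFrame `df` are assumptions for
illustration; the ticket covers the ANOVATest/ChisqTest/FValueTest helpers):
{code:java}
import org.apache.spark.ml.stat.ChiSquareTest

// `df` is assumed to have a vector "features" column and a "label" column.
// flatten = true asks for one row per feature instead of array-valued columns.
val result = ChiSquareTest.test(df, "features", "label", flatten = true)
result.show()
{code}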







[jira] [Updated] (SPARK-31666) Cannot map hostPath volumes to container

2020-05-08 Thread Stephen Hopper (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen Hopper updated SPARK-31666:
---
Description: 
I'm trying to mount additional hostPath directories as seen in a couple of 
places:

[https://aws.amazon.com/blogs/containers/optimizing-spark-performance-on-kubernetes/]

[https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#using-volume-for-scratch-space]

[https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes]

 

However, whenever I try to submit my job, I run into this error:


{code:java}
Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1 │
 io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: 
POST at: https://kubernetes.default.svc/api/v1/namespaces/my-spark-ns/pods. 
Message: Pod "spark-pi-1588970477877-exec-1" is invalid: 
spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be 
unique. Received status: Status(apiVersion=v1, code=422, 
details=StatusDetails(causes=[StatusCause(field=spec.containers[0].volumeMounts[1].mountPath,
 message=Invalid value: "/tmp1": must be unique, reason=FieldValueInvalid, 
additionalProperties={})], group=null, kind=Pod, 
name=spark-pi-1588970477877-exec-1, retryAfterSeconds=null, uid=null, 
additionalProperties={}), kind=Status, message=Pod 
"spark-pi-1588970477877-exec-1" is invalid: 
spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be 
unique, metadata=ListMeta(_continue=null, remainingItemCount=null, 
resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, 
status=Failure, additionalProperties={}).{code}
 

This is my spark-submit command (note: I've used my own build of spark for 
kubernetes as well as a few other images that I've seen floating around (such 
as this one seedjeffwan/spark:v2.4.5) and they all have this same issue):
{code:java}
bin/spark-submit \
 --master k8s://https://my-k8s-server:443 \
 --deploy-mode cluster \
 --name spark-pi \
 --class org.apache.spark.examples.SparkPi \
 --conf spark.executor.instances=2 \
 --conf spark.kubernetes.container.image=my-spark-image:my-tag \
 --conf spark.kubernetes.driver.pod.name=sparkpi-test-driver \
 --conf spark.kubernetes.namespace=my-spark-ns \
 --conf 
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.path=/tmp1 \
 --conf 
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.options.path=/tmp1 
\
 --conf spark.local.dir="/tmp1" \
 --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark
 local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar 2{code}
Any ideas on what's causing this?

 

  was:
I'm trying to mount additional hostPath directories as seen in a couple of 
places:

[https://aws.amazon.com/blogs/containers/optimizing-spark-performance-on-kubernetes/]

[https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#using-volume-for-scratch-space]

[https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes]

 

However, whenever I try to submit my job, I run into this error:
 ```

Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1 │
 io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: 
POST at: [https://kubernetes.default.svc/api/v1/namespaces/my-spark-ns/pods]. 
Message: Pod "spark-pi-1588970477877-exec-1" is invalid: 
spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be 
unique. Received status: Status(apiVersion=v1, code=422, 
details=StatusDetails(causes=[StatusCause(field=spec.containers[0].volumeMounts[1].mountPath,
 message=Invalid value: "/tmp1": must be unique, reason=FieldValueInvalid, 
additionalProperties={})], group=null, kind=Pod, 
name=spark-pi-1588970477877-exec-1, retryAfterSeconds=null, uid=null, 
additionalProperties={}), kind=Status, message=Pod 
"spark-pi-1588970477877-exec-1" is invalid: 
spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be 
unique, metadata=ListMeta(_continue=null, remainingItemCount=null, 
resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, 
status=Failure, additionalProperties={}).

```

 

This is my spark-submit command (note: I've used my own build of spark for 
kubernetes as well as a few other images that I've seen floating around (such 
as this one `seedjeffwan/spark:v2.4.5`) and they all have this same issue):

```

bin/spark-submit \
 --master k8s://[https://my-k8s-server:443|https://my-k8s-server/] \
 --deploy-mode cluster \
 --name spark-pi \
 --class org.apache.spark.examples.SparkPi \
 --conf spark.executor.instances=2 \
 --conf spark.kubernetes.container.image=my-spark-image:my-tag \
 --conf spark.kubernetes.driver.pod.name=sparkpi-test-driver \
 --conf spark.kubernetes.namespace=my-spark-ns \
 --conf 
spark.kuber

[jira] [Updated] (SPARK-31666) Cannot map hostPath volumes to container

2020-05-08 Thread Stephen Hopper (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen Hopper updated SPARK-31666:
---
Description: 
I'm trying to mount additional hostPath directories as seen in a couple of 
places:

[https://aws.amazon.com/blogs/containers/optimizing-spark-performance-on-kubernetes/]

[https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#using-volume-for-scratch-space]

[https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes]

 

However, whenever I try to submit my job, I run into this error:
 ```

Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1 │
 io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: 
POST at: [https://kubernetes.default.svc/api/v1/namespaces/my-spark-ns/pods]. 
Message: Pod "spark-pi-1588970477877-exec-1" is invalid: 
spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be 
unique. Received status: Status(apiVersion=v1, code=422, 
details=StatusDetails(causes=[StatusCause(field=spec.containers[0].volumeMounts[1].mountPath,
 message=Invalid value: "/tmp1": must be unique, reason=FieldValueInvalid, 
additionalProperties={})], group=null, kind=Pod, 
name=spark-pi-1588970477877-exec-1, retryAfterSeconds=null, uid=null, 
additionalProperties={}), kind=Status, message=Pod 
"spark-pi-1588970477877-exec-1" is invalid: 
spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be 
unique, metadata=ListMeta(_continue=null, remainingItemCount=null, 
resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, 
status=Failure, additionalProperties={}).

```

 

This is my spark-submit command (note: I've used my own build of spark for 
kubernetes as well as a few other images that I've seen floating around (such 
as this one `seedjeffwan/spark:v2.4.5`) and they all have this same issue):

```

bin/spark-submit \
 --master k8s://[https://my-k8s-server:443|https://my-k8s-server/] \
 --deploy-mode cluster \
 --name spark-pi \
 --class org.apache.spark.examples.SparkPi \
 --conf spark.executor.instances=2 \
 --conf spark.kubernetes.container.image=my-spark-image:my-tag \
 --conf spark.kubernetes.driver.pod.name=sparkpi-test-driver \
 --conf spark.kubernetes.namespace=my-spark-ns \
 --conf 
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.path=/tmp1 \
 --conf 
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.options.path=/tmp1 
\
 --conf spark.local.dir="/tmp1" \
 --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark
 local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar 2

```

Any ideas on what's causing this?

 

  was:
I'm trying to mount additional hostPath directories as seen in a couple of 
places:

[https://aws.amazon.com/blogs/containers/optimizing-spark-performance-on-kubernetes/]

[https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#using-volume-for-scratch-space]

[https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes]

 

However, whenever I try to submit my job, I run into this error:
```

Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1 │
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST 
at: https://kubernetes.default.svc/api/v1/namespaces/my-spark-ns/pods. Message: 
Pod "spark-pi-1588970477877-exec-1" is invalid: 
spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be 
unique. Received status: Status(apiVersion=v1, code=422, 
details=StatusDetails(causes=[StatusCause(field=spec.containers[0].volumeMounts[1].mountPath,
 message=Invalid value: "/tmp1": must be unique, reason=FieldValueInvalid, 
additionalProperties={})], group=null, kind=Pod, 
name=spark-pi-1588970477877-exec-1, retryAfterSeconds=null, uid=null, 
additionalProperties={}), kind=Status, message=Pod 
"spark-pi-1588970477877-exec-1" is invalid: 
spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be 
unique, metadata=ListMeta(_continue=null, remainingItemCount=null, 
resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, 
status=Failure, additionalProperties={}).

```

 

This is my spark-submit command (note: I've used my own build of spark for 
kubernetes as well as a few other images that I've seen floating around (such 
as this one `seedjeffwan/spark:v2.4.5`) and they all have this same issue):

```

bin/spark-submit \
--master k8s://https://my-k8s-server:443 \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image=my-spark-image:my-tag \
--conf spark.kubernetes.driver.pod.name=sparkpi-test-driver \
--conf spark.kubernetes.namespace=my-spark-ns \
--conf 
spark.kubernetes.executor.volumes.

[jira] [Created] (SPARK-31666) Cannot map hostPath volumes to container

2020-05-08 Thread Stephen Hopper (Jira)
Stephen Hopper created SPARK-31666:
--

 Summary: Cannot map hostPath volumes to container
 Key: SPARK-31666
 URL: https://issues.apache.org/jira/browse/SPARK-31666
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.4.5
Reporter: Stephen Hopper


I'm trying to mount additional hostPath directories as seen in a couple of 
places:

[https://aws.amazon.com/blogs/containers/optimizing-spark-performance-on-kubernetes/]

[https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#using-volume-for-scratch-space]

[https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes]

 

However, whenever I try to submit my job, I run into this error:
```

Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1 │
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST 
at: https://kubernetes.default.svc/api/v1/namespaces/my-spark-ns/pods. Message: 
Pod "spark-pi-1588970477877-exec-1" is invalid: 
spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be 
unique. Received status: Status(apiVersion=v1, code=422, 
details=StatusDetails(causes=[StatusCause(field=spec.containers[0].volumeMounts[1].mountPath,
 message=Invalid value: "/tmp1": must be unique, reason=FieldValueInvalid, 
additionalProperties={})], group=null, kind=Pod, 
name=spark-pi-1588970477877-exec-1, retryAfterSeconds=null, uid=null, 
additionalProperties={}), kind=Status, message=Pod 
"spark-pi-1588970477877-exec-1" is invalid: 
spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be 
unique, metadata=ListMeta(_continue=null, remainingItemCount=null, 
resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, 
status=Failure, additionalProperties={}).

```

 

This is my spark-submit command (note: I've used my own build of spark for 
kubernetes as well as a few other images that I've seen floating around (such 
as this one `seedjeffwan/spark:v2.4.5`) and they all have this same issue):

```

bin/spark-submit \
--master k8s://https://my-k8s-server:443 \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image=my-spark-image:my-tag \
--conf spark.kubernetes.driver.pod.name=sparkpi-test-driver \
--conf spark.kubernetes.namespace=my-spark-ns \
--conf 
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.path=/tmp1 \
--conf 
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.options.path=/tmp1 
\
--conf spark.local.dir="/tmp1" \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark
local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar 2

```

Any ideas on what's causing this?

 






[jira] [Commented] (SPARK-30267) avro deserializer: ArrayList cannot be cast to GenericData$Array

2020-05-08 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102919#comment-17102919
 ] 

Gengliang Wang commented on SPARK-30267:


Hi [~tashoyan], 
could you provide a simple reproduction step?

> avro deserializer: ArrayList cannot be cast to GenericData$Array
> 
>
> Key: SPARK-30267
> URL: https://issues.apache.org/jira/browse/SPARK-30267
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Steven Aerts
>Assignee: Steven Aerts
>Priority: Major
> Fix For: 3.0.0
>
>
> On some more complex avro objects, the Avro Deserializer fails with the 
> following stack trace:
> {code}
> java.lang.ClassCastException: java.util.ArrayList cannot be cast to 
> org.apache.avro.generic.GenericData$Array
>   at 
> org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$19(AvroDeserializer.scala:170)
>   at 
> org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$19$adapted(AvroDeserializer.scala:169)
>   at 
> org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1(AvroDeserializer.scala:314)
>   at 
> org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1$adapted(AvroDeserializer.scala:310)
>   at 
> org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2(AvroDeserializer.scala:332)
>   at 
> org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2$adapted(AvroDeserializer.scala:329)
>   at 
> org.apache.spark.sql.avro.AvroDeserializer.$anonfun$converter$3(AvroDeserializer.scala:56)
>   at 
> org.apache.spark.sql.avro.AvroDeserializer.deserialize(AvroDeserializer.scala:70)
> {code}
> This is because the Deserializer assumes that an array is always of the very 
> specific {{org.apache.avro.generic.GenericData$Array}} which is not always 
> the case.
> Making it a normal list works.
> A github PR is coming up to fix this.






[jira] [Reopened] (SPARK-30267) avro deserializer: ArrayList cannot be cast to GenericData$Array

2020-05-08 Thread Arseniy Tashoyan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arseniy Tashoyan reopened SPARK-30267:
--

With Spark 3.0.0 preview 2, I have the following failure here:
{code:java}
java.lang.ClassCastException: scala.collection.convert.Wrappers$SeqWrapper 
cannot be cast to org.apache.avro.generic.GenericData$Array
  at 
org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$19(AvroDeserializer.scala:170)
{code}
This means that the fix here [https://github.com/apache/spark/pull/26907] is 
not actually a fix, because Scala Seq cannot be cast to 
java.util.Collection[Any].

I have a Scala Seq because my Avro GenericRecord is generated from a case class
by Avro4s. We can expect that everybody using Avro4s (or another Scala-based
generator like Avrohugger) will face the same ClassCastException.
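A hypothetical construction sketch (not Tashoyan's actual code) of the kind of record that hits
this path: the array field holds a Scala Seq wrapped as a java.util.List
(scala.collection.convert.Wrappers$SeqWrapper) rather than a GenericData.Array:
{code:java}
import scala.collection.JavaConverters._
import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.avro.generic.{GenericData, GenericRecord}

val schema: Schema = SchemaBuilder.record("Rec").fields()
  .name("items").`type`().array().items().intType().noDefault()
  .endRecord()

val rec: GenericRecord = new GenericData.Record(schema)
rec.put("items", Seq(1, 2, 3).asJava) // a SeqWrapper, not a GenericData.Array

// Passing such a record through the Avro deserialization path used by the job would
// reportedly raise the ClassCastException quoted above.
{code}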

> avro deserializer: ArrayList cannot be cast to GenericData$Array
> 
>
> Key: SPARK-30267
> URL: https://issues.apache.org/jira/browse/SPARK-30267
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Steven Aerts
>Assignee: Steven Aerts
>Priority: Major
> Fix For: 3.0.0
>
>
> On some more complex avro objects, the Avro Deserializer fails with the 
> following stack trace:
> {code}
> java.lang.ClassCastException: java.util.ArrayList cannot be cast to 
> org.apache.avro.generic.GenericData$Array
>   at 
> org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$19(AvroDeserializer.scala:170)
>   at 
> org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$19$adapted(AvroDeserializer.scala:169)
>   at 
> org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1(AvroDeserializer.scala:314)
>   at 
> org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1$adapted(AvroDeserializer.scala:310)
>   at 
> org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2(AvroDeserializer.scala:332)
>   at 
> org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2$adapted(AvroDeserializer.scala:329)
>   at 
> org.apache.spark.sql.avro.AvroDeserializer.$anonfun$converter$3(AvroDeserializer.scala:56)
>   at 
> org.apache.spark.sql.avro.AvroDeserializer.deserialize(AvroDeserializer.scala:70)
> {code}
> This is because the Deserializer assumes that an array is always of the very 
> specific {{org.apache.avro.generic.GenericData$Array}} which is not always 
> the case.
> Making it a normal list works.
> A github PR is coming up to fix this.






[jira] [Commented] (SPARK-20732) Copy cache data when node is being shut down

2020-05-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102875#comment-17102875
 ] 

Apache Spark commented on SPARK-20732:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28482

> Copy cache data when node is being shut down
> 
>
> Key: SPARK-20732
> URL: https://issues.apache.org/jira/browse/SPARK-20732
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Holden Karau
>Assignee: Prakhar Jain
>Priority: Major
>







[jira] [Comment Edited] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun

2020-05-08 Thread Afroz Baig (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102837#comment-17102837
 ] 

Afroz Baig edited comment on SPARK-29037 at 5/8/20, 7:11 PM:
-

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 

Does this really stop duplication of data getting committed to the final 
location?

See the scenario below.
Basically, I hit 2 issues: one that was mentioned in
https://issues.apache.org/jira/browse/SPARK-18883
and the other that is being discussed here in this scenario.

There was an issue with one of the Spark jobs: it failed in the first
attempt with a file-not-found exception but succeeded in the second attempt.
The problem it caused was that the first attempt wrote the data to the final location
and then failed.

The second attempt re-wrote the same data and succeeded completely.
This caused business impact due to the duplication of data. The mode of
writing was "append".

First attempt of the Job failed with file not found exception : 2020-05-03 
21:32:22,237 [Thread-10] ERROR 
org.apache.spark.sql.execution.datasources.FileFormatWriter - Aborting job 
null. java.io.FileNotFoundException: File 
hdfs://nnproxies/insight_prod/rdf/output/forecast_revolution/uk/activation_fw_nws_calc/aggregate_expansion_data/_temporary/0/task_20200503213210_0012_m_24/calendar_date=2020-05-03
 does not exist.

Do you think setting up this conf in spark submit command will help in avoiding 
the duplication?


was (Author: afrozbaig):
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2

Does this really stop duplication of data getting committed to the final 
location?

See the below scenario,
There was an issue with one of the spark jobs and it failed in the first 
attempt with file not found exception but succeeded in the second attempt. 
The problem it caused was, the first attempt wrote the data to final location 
and failed.

The second attempt also re-wrote the same data again and succeeded completely. 
This caused business impact due to the duplication of data. The mode of writing 
was "append". 

First attempt of the Job failed with file not found exception : 2020-05-03 
21:32:22,237 [Thread-10] ERROR 
org.apache.spark.sql.execution.datasources.FileFormatWriter - Aborting job 
null. java.io.FileNotFoundException: File 
hdfs://nnproxies/insight_prod/rdf/output/forecast_revolution/uk/activation_fw_nws_calc/aggregate_expansion_data/_temporary/0/task_20200503213210_0012_m_24/calendar_date=2020-05-03
 does not exist. 

Do you think setting up this conf in spark submit command will help in avoiding 
the duplication?

> [Core] Spark gives duplicate result when an application was killed and rerun
> 
>
> Key: SPARK-29037
> URL: https://issues.apache.org/jira/browse/SPARK-29037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.3.3
>Reporter: feiwang
>Priority: Major
> Attachments: screenshot-1.png
>
>
> For InsertIntoHadoopFsRelation operations.
> Case A:
> Application appA insert overwrite table table_a with static partition 
> overwrite.
> But it was killed while committing tasks because one task hung.
> And parts of its committed task output were kept under
> /path/table_a/_temporary/0/.
> Then we rerun appA. It will reuse the staging dir /path/table_a/_temporary/0/.
> It executes successfully.
> But it also commits the data left behind by the killed application to the destination dir.
> Case B:
> Application appA inserts overwrite into table table_a.
> Application appB inserts overwrite into table table_a, too.
> They execute concurrently, and they may both use /path/table_a/_temporary/0/
> as workPath.
> And their results may be corrupted.






[jira] [Updated] (SPARK-31663) Grouping sets with having clause returns the wrong result

2020-05-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31663:
--
Description: 
Grouping sets with a HAVING clause return the wrong result when the HAVING condition
contains conflicting naming. See the example below:
{code:java}
select sum(a) as b FROM VALUES (1, 10), (2, 20) AS T(a, b) group by GROUPING 
SETS ((b), (a, b)) having b > 10{code}
The `b` in `having b > 10` should be resolved as `T.b` not `sum(a)`, so the 
right result should be
{code:java}
+---+
|  b|
+---+
|  2|
|  2|
+---+{code}
instead of an empty result.

The root cause is similar to SPARK-31519: we parse HAVING as Filter(..., Agg(...))
and resolve these two parts in different rules. CUBE and ROLLUP have the same issue.

Other systems work as expected; I checked PostgreSQL 9.6 and MS SQL Server 2017.

 

For Apache Spark 2.0.2 ~ 2.3.4, the following query is tested.
{code:java}
spark-sql> select sum(a) as b from t group by b grouping sets(b) having b > 10;
Time taken: 0.194 seconds

hive> select sum(a) as b from t group by b grouping sets(b) having b > 10;
2
Time taken: 1.605 seconds, Fetched: 1 row(s) {code}

  was:
Grouping sets with having clause returns the wrong result when the condition of 
having contained conflicting naming. See the below example:
{code:java}
select sum(a) as b FROM VALUES (1, 10), (2, 20) AS T(a, b) group by GROUPING 
SETS ((b), (a, b)) having b > 10{code}
The `b` in `having b > 10` should be resolved as `T.b` not `sum(a)`, so the 
right result should be
{code:java}
+---+
|  b|
+---+
|  2|
|  2|
+---+{code}
instead of an empty result.

The root cause is similar to SPARK-31519, it's caused by we parsed HAVING as 
Filter(..., Agg(...)) and resolved these two parts in different rules. The CUBE 
and ROLLUP have the same issue.

Other systems worked as expected, I checked PostgreSQL 9.6 and MS SQL Server 
2017.


> Grouping sets with having clause returns the wrong result
> -
>
> Key: SPARK-31663
> URL: https://issues.apache.org/jira/browse/SPARK-31663
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>  Labels: correctness
>
> Grouping sets with a HAVING clause return the wrong result when the HAVING
> condition contains conflicting naming. See the example below:
> {code:java}
> select sum(a) as b FROM VALUES (1, 10), (2, 20) AS T(a, b) group by GROUPING 
> SETS ((b), (a, b)) having b > 10{code}
> The `b` in `having b > 10` should be resolved as `T.b` not `sum(a)`, so the 
> right result should be
> {code:java}
> +---+
> |  b|
> +---+
> |  2|
> |  2|
> +---+{code}
> instead of an empty result.
> The root cause is similar to SPARK-31519: we parse HAVING as
> Filter(..., Agg(...)) and resolve these two parts in different rules. CUBE and
> ROLLUP have the same issue.
> Other systems work as expected; I checked PostgreSQL 9.6 and MS SQL Server 2017.
>  
> For Apache Spark 2.0.2 ~ 2.3.4, the following query is tested.
> {code:java}
> spark-sql> select sum(a) as b from t group by b grouping sets(b) having b > 
> 10;
> Time taken: 0.194 seconds
> hive> select sum(a) as b from t group by b grouping sets(b) having b > 10;
> 2
> Time taken: 1.605 seconds, Fetched: 1 row(s) {code}






[jira] [Commented] (SPARK-31663) Grouping sets with having clause returns the wrong result

2020-05-08 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102850#comment-17102850
 ] 

Dongjoon Hyun commented on SPARK-31663:
---

All 2.x versions are added into the affected versions. (cc [~holden])

> Grouping sets with having clause returns the wrong result
> -
>
> Key: SPARK-31663
> URL: https://issues.apache.org/jira/browse/SPARK-31663
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>  Labels: correctness
>
> Grouping sets with a HAVING clause return the wrong result when the HAVING
> condition contains conflicting naming. See the example below:
> {code:java}
> select sum(a) as b FROM VALUES (1, 10), (2, 20) AS T(a, b) group by GROUPING 
> SETS ((b), (a, b)) having b > 10{code}
> The `b` in `having b > 10` should be resolved as `T.b` not `sum(a)`, so the 
> right result should be
> {code:java}
> +---+
> |  b|
> +---+
> |  2|
> |  2|
> +---+{code}
> instead of an empty result.
> The root cause is similar to SPARK-31519: we parse HAVING as
> Filter(..., Agg(...)) and resolve these two parts in different rules. CUBE and
> ROLLUP have the same issue.
> Other systems work as expected; I checked PostgreSQL 9.6 and MS SQL Server 2017.






[jira] [Assigned] (SPARK-31665) Test parquet dictionary encoding of random dates/timestamps

2020-05-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31665:


Assignee: (was: Apache Spark)

> Test parquet dictionary encoding of random dates/timestamps
> ---
>
> Key: SPARK-31665
> URL: https://issues.apache.org/jira/browse/SPARK-31665
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, dictionary encoding is not tested in the ParquetHadoopFsRelationSuite
> test "test all data types" because the generated dates and timestamps are uniformly
> distributed, so dictionary encoding is in fact not applied to those types.
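> A hypothetical sketch of the idea (not the actual change): draw the random dates/timestamps
> from a small pool of distinct values so that Parquet's dictionary encoding actually kicks in
> for those columns:
> {code:java}
> import java.sql.Date
> import scala.util.Random
>
> // Only a handful of distinct values, so the Parquet writer can dictionary-encode the column.
> val pool = Seq(Date.valueOf("2020-01-01"), Date.valueOf("2020-05-08"), Date.valueOf("2020-12-31"))
> val dates = Seq.fill(1000)(pool(Random.nextInt(pool.length)))
> {code}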






[jira] [Updated] (SPARK-31663) Grouping sets with having clause returns the wrong result

2020-05-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31663:
--
Affects Version/s: 2.0.2
   2.1.3

> Grouping sets with having clause returns the wrong result
> -
>
> Key: SPARK-31663
> URL: https://issues.apache.org/jira/browse/SPARK-31663
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>  Labels: correctness
>
> Grouping sets with a HAVING clause return the wrong result when the HAVING
> condition contains conflicting naming. See the example below:
> {code:java}
> select sum(a) as b FROM VALUES (1, 10), (2, 20) AS T(a, b) group by GROUPING 
> SETS ((b), (a, b)) having b > 10{code}
> The `b` in `having b > 10` should be resolved as `T.b` not `sum(a)`, so the 
> right result should be
> {code:java}
> +---+
> |  b|
> +---+
> |  2|
> |  2|
> +---+{code}
> instead of an empty result.
> The root cause is similar to SPARK-31519: we parse HAVING as
> Filter(..., Agg(...)) and resolve these two parts in different rules. CUBE and
> ROLLUP have the same issue.
> Other systems work as expected; I checked PostgreSQL 9.6 and MS SQL Server 2017.






[jira] [Assigned] (SPARK-31665) Test parquet dictionary encoding of random dates/timestamps

2020-05-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31665:


Assignee: Apache Spark

> Test parquet dictionary encoding of random dates/timestamps
> ---
>
> Key: SPARK-31665
> URL: https://issues.apache.org/jira/browse/SPARK-31665
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Currently, dictionary encoding is not tested in the ParquetHadoopFsRelationSuite
> test "test all data types" because the generated dates and timestamps are uniformly
> distributed, so dictionary encoding is in fact not applied to those types.






[jira] [Commented] (SPARK-31665) Test parquet dictionary encoding of random dates/timestamps

2020-05-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102848#comment-17102848
 ] 

Apache Spark commented on SPARK-31665:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28481

> Test parquet dictionary encoding of random dates/timestamps
> ---
>
> Key: SPARK-31665
> URL: https://issues.apache.org/jira/browse/SPARK-31665
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, dictionary encoding is not tested in the ParquetHadoopFsRelationSuite
> test "test all data types" because the generated dates and timestamps are uniformly
> distributed, so dictionary encoding is in fact not applied to those types.






[jira] [Updated] (SPARK-31663) Grouping sets with having clause returns the wrong result

2020-05-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31663:
--
Affects Version/s: (was: 2.4.4)
   (was: 2.4.3)
   (was: 2.4.2)
   (was: 2.4.1)
   (was: 2.4.0)
   2.2.3
   2.3.4

> Grouping sets with having clause returns the wrong result
> -
>
> Key: SPARK-31663
> URL: https://issues.apache.org/jira/browse/SPARK-31663
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.3, 2.3.4, 2.4.5, 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>  Labels: correctness
>
> Grouping sets with a HAVING clause return the wrong result when the HAVING
> condition contains conflicting naming. See the example below:
> {code:java}
> select sum(a) as b FROM VALUES (1, 10), (2, 20) AS T(a, b) group by GROUPING 
> SETS ((b), (a, b)) having b > 10{code}
> The `b` in `having b > 10` should be resolved as `T.b` not `sum(a)`, so the 
> right result should be
> {code:java}
> +---+
> |  b|
> +---+
> |  2|
> |  2|
> +---+{code}
> instead of an empty result.
> The root cause is similar to SPARK-31519: we parse HAVING as
> Filter(..., Agg(...)) and resolve these two parts in different rules. CUBE and
> ROLLUP have the same issue.
> Other systems work as expected; I checked PostgreSQL 9.6 and MS SQL Server 2017.






[jira] [Commented] (SPARK-31663) Grouping sets with having clause returns the wrong result

2020-05-08 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102844#comment-17102844
 ] 

Dongjoon Hyun commented on SPARK-31663:
---

Apache Spark 2.3.4 follows Hive syntax, but its result is also wrong while
Hive's is correct.
{code:java}
spark-sql> select sum(a) as b from t group by b grouping sets(b) having b > 10;
Time taken: 0.194 seconds

hive> select sum(a) as b from t group by b grouping sets(b) having b > 10;
2
Time taken: 1.605 seconds, Fetched: 1 row(s){code}

> Grouping sets with having clause returns the wrong result
> -
>
> Key: SPARK-31663
> URL: https://issues.apache.org/jira/browse/SPARK-31663
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>  Labels: correctness
>
> Grouping sets with a HAVING clause return the wrong result when the HAVING
> condition contains conflicting naming. See the example below:
> {code:java}
> select sum(a) as b FROM VALUES (1, 10), (2, 20) AS T(a, b) group by GROUPING 
> SETS ((b), (a, b)) having b > 10{code}
> The `b` in `having b > 10` should be resolved as `T.b` not `sum(a)`, so the 
> right result should be
> {code:java}
> +---+
> |  b|
> +---+
> |  2|
> |  2|
> +---+{code}
> instead of an empty result.
> The root cause is similar to SPARK-31519: we parse HAVING as
> Filter(..., Agg(...)) and resolve these two parts in different rules. CUBE and
> ROLLUP have the same issue.
> Other systems work as expected; I checked PostgreSQL 9.6 and MS SQL Server 2017.






[jira] [Commented] (SPARK-31663) Grouping sets with having clause returns the wrong result

2020-05-08 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102841#comment-17102841
 ] 

Dongjoon Hyun commented on SPARK-31663:
---

Also, with a changed syntax, this is reproduced in older versions like 2.3.x, 
too.

> Grouping sets with having clause returns the wrong result
> -
>
> Key: SPARK-31663
> URL: https://issues.apache.org/jira/browse/SPARK-31663
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>  Labels: correctness
>
> Grouping sets with a having clause returns the wrong result when the having
> condition contains a conflicting name. See the example below:
> {code:java}
> select sum(a) as b FROM VALUES (1, 10), (2, 20) AS T(a, b) group by GROUPING 
> SETS ((b), (a, b)) having b > 10{code}
> The `b` in `having b > 10` should be resolved as `T.b` not `sum(a)`, so the 
> right result should be
> {code:java}
> +---+
> |  b|
> +---+
> |  2|
> |  2|
> +---+{code}
> instead of an empty result.
> The root cause is similar to SPARK-31519: we parse HAVING as
> Filter(..., Agg(...)) and resolve these two parts in different rules. CUBE and
> ROLLUP have the same issue.
> Other systems work as expected; I checked PostgreSQL 9.6 and MS SQL Server
> 2017.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun

2020-05-08 Thread Afroz Baig (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102837#comment-17102837
 ] 

Afroz Baig commented on SPARK-29037:


spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2

Does this really stop duplicate data from being committed to the final 
location?

Consider the scenario below.
One of our Spark jobs failed in the first attempt with a file-not-found exception 
but succeeded in the second attempt. The problem was that the first attempt had 
already written data to the final location before failing.

The second attempt re-wrote the same data and succeeded completely. This caused 
business impact due to the duplication of data. The mode of writing 
was "append". 

The first attempt of the job failed with a file-not-found exception: 2020-05-03 
21:32:22,237 [Thread-10] ERROR 
org.apache.spark.sql.execution.datasources.FileFormatWriter - Aborting job 
null. java.io.FileNotFoundException: File 
hdfs://nnproxies/insight_prod/rdf/output/forecast_revolution/uk/activation_fw_nws_calc/aggregate_expansion_data/_temporary/0/task_20200503213210_0012_m_24/calendar_date=2020-05-03
 does not exist. 

Do you think setting this conf in the spark-submit command will help avoid 
the duplication?
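
For reference, a minimal sketch of passing that setting when building the session 
(equivalently, --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 
on spark-submit). Algorithm version 2 mainly changes when task output is moved to 
the final directory; whether it avoids duplicates from a re-run application is 
exactly the open question here:
{code:scala}
// Sketch only; the application name is a placeholder.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("committer-v2-sketch")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()
{code}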

> [Core] Spark gives duplicate result when an application was killed and rerun
> 
>
> Key: SPARK-29037
> URL: https://issues.apache.org/jira/browse/SPARK-29037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.3.3
>Reporter: feiwang
>Priority: Major
> Attachments: screenshot-1.png
>
>
> For InsertIntoHadoopFsRelation operations.
> Case A:
> Application appA runs an insert overwrite of table table_a with a static
> partition overwrite.
> But it was killed while committing tasks, because one task hung.
> Parts of its committed task output are kept under
> /path/table_a/_temporary/0/.
> Then we rerun appA. It reuses the staging dir /path/table_a/_temporary/0/.
> It executes successfully.
> But it also commits the data left behind by the killed application to the
> destination dir.
> Case B:
> Application appA runs an insert overwrite of table table_a.
> Application appB runs an insert overwrite of table table_a, too.
> They execute concurrently, and they may both use /path/table_a/_temporary/0/
> as the work path.
> And their results may be corrupted.
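
A minimal sketch of the kind of statement described in case A (table and partition 
names are placeholders, not from the report):
{code:scala}
// Sketch only: a static-partition insert overwrite, which goes through
// InsertIntoHadoopFsRelation and stages output under <table path>/_temporary/0/.
spark.sql(
  """INSERT OVERWRITE TABLE table_a PARTITION (dt = '2019-09-01')
    |SELECT * FROM source_table""".stripMargin)
{code}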



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31665) Test parquet dictionary encoding of random dates/timestamps

2020-05-08 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-31665:
--

 Summary: Test parquet dictionary encoding of random 
dates/timestamps
 Key: SPARK-31665
 URL: https://issues.apache.org/jira/browse/SPARK-31665
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


Currently, dictionary encoding is not exercised by the ParquetHadoopFsRelationSuite 
test "test all data types" because the generated dates and timestamps are uniformly 
distributed, so dictionary encoding is in fact not applied to those types. 
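
A minimal sketch of data that would exercise dictionary encoding (an illustration 
of the idea, not the actual test change): draw the dates from a small set of 
distinct values so the Parquet writer actually builds a dictionary.
{code:scala}
// Sketch only; the output path and row count are placeholders. Repeating a handful
// of distinct dates is what makes Parquet choose dictionary encoding.
val df = spark.range(1000)
  .selectExpr("date_add(date'2020-01-01', cast(id % 10 as int)) AS date")

df.repartition(1)
  .write
  .option("parquet.enable.dictionary", true)
  .mode("overwrite")
  .parquet("/tmp/parquet-date-dict-test")
{code}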



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31663) Grouping sets with having clause returns the wrong result

2020-05-08 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102829#comment-17102829
 ] 

Dongjoon Hyun commented on SPARK-31663:
---

The issue is that `b` is interpreted differently in `group by b` and `group by 
grouping sets(b)`.
{code:java}
spark-sql> select sum(a) as b from t group by grouping sets(b) having b > 10;
Time taken: 0.231 seconds
spark-sql> select sum(a) as b from t group by b having b > 10;
2
Time taken: 0.174 seconds, Fetched 1 row(s) {code}

> Grouping sets with having clause returns the wrong result
> -
>
> Key: SPARK-31663
> URL: https://issues.apache.org/jira/browse/SPARK-31663
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>  Labels: correctness
>
> Grouping sets with a having clause returns the wrong result when the having
> condition contains a conflicting name. See the example below:
> {code:java}
> select sum(a) as b FROM VALUES (1, 10), (2, 20) AS T(a, b) group by GROUPING 
> SETS ((b), (a, b)) having b > 10{code}
> The `b` in `having b > 10` should be resolved as `T.b` not `sum(a)`, so the 
> right result should be
> {code:java}
> +---+
> |  b|
> +---+
> |  2|
> |  2|
> +---+{code}
> instead of an empty result.
> The root cause is similar to SPARK-31519: we parse HAVING as
> Filter(..., Agg(...)) and resolve these two parts in different rules. CUBE and
> ROLLUP have the same issue.
> Other systems work as expected; I checked PostgreSQL 9.6 and MS SQL Server
> 2017.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31663) Grouping sets with having clause returns the wrong result

2020-05-08 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102825#comment-17102825
 ] 

Dongjoon Hyun commented on SPARK-31663:
---

I confirmed that this is a correctness issue since 2.4.0.

> Grouping sets with having clause returns the wrong result
> -
>
> Key: SPARK-31663
> URL: https://issues.apache.org/jira/browse/SPARK-31663
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>  Labels: correctness
>
> Grouping sets with a having clause returns the wrong result when the having
> condition contains a conflicting name. See the example below:
> {code:java}
> select sum(a) as b FROM VALUES (1, 10), (2, 20) AS T(a, b) group by GROUPING 
> SETS ((b), (a, b)) having b > 10{code}
> The `b` in `having b > 10` should be resolved as `T.b` not `sum(a)`, so the 
> right result should be
> {code:java}
> +---+
> |  b|
> +---+
> |  2|
> |  2|
> +---+{code}
> instead of an empty result.
> The root cause is similar to SPARK-31519: we parse HAVING as
> Filter(..., Agg(...)) and resolve these two parts in different rules. CUBE and
> ROLLUP have the same issue.
> Other systems work as expected; I checked PostgreSQL 9.6 and MS SQL Server
> 2017.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31663) Grouping sets with having clause returns the wrong result

2020-05-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31663:
--
Affects Version/s: 2.4.0
   2.4.1
   2.4.2
   2.4.3
   2.4.4

> Grouping sets with having clause returns the wrong result
> -
>
> Key: SPARK-31663
> URL: https://issues.apache.org/jira/browse/SPARK-31663
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>  Labels: correctness
>
> Grouping sets with a having clause returns the wrong result when the having
> condition contains a conflicting name. See the example below:
> {code:java}
> select sum(a) as b FROM VALUES (1, 10), (2, 20) AS T(a, b) group by GROUPING 
> SETS ((b), (a, b)) having b > 10{code}
> The `b` in `having b > 10` should be resolved as `T.b` not `sum(a)`, so the 
> right result should be
> {code:java}
> +---+
> |  b|
> +---+
> |  2|
> |  2|
> +---+{code}
> instead of an empty result.
> The root cause is similar to SPARK-31519: we parse HAVING as
> Filter(..., Agg(...)) and resolve these two parts in different rules. CUBE and
> ROLLUP have the same issue.
> Other systems work as expected; I checked PostgreSQL 9.6 and MS SQL Server
> 2017.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31663) Grouping sets with having clause returns the wrong result

2020-05-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31663:
--
Labels: correctness  (was: )

> Grouping sets with having clause returns the wrong result
> -
>
> Key: SPARK-31663
> URL: https://issues.apache.org/jira/browse/SPARK-31663
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>  Labels: correctness
>
> Grouping sets with a having clause returns the wrong result when the having
> condition contains a conflicting name. See the example below:
> {code:java}
> select sum(a) as b FROM VALUES (1, 10), (2, 20) AS T(a, b) group by GROUPING 
> SETS ((b), (a, b)) having b > 10{code}
> The `b` in `having b > 10` should be resolved as `T.b` not `sum(a)`, so the 
> right result should be
> {code:java}
> +---+
> |  b|
> +---+
> |  2|
> |  2|
> +---+{code}
> instead of an empty result.
> The root cause is similar to SPARK-31519: we parse HAVING as
> Filter(..., Agg(...)) and resolve these two parts in different rules. CUBE and
> ROLLUP have the same issue.
> Other systems work as expected; I checked PostgreSQL 9.6 and MS SQL Server
> 2017.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31658) SQL UI doesn't show write commands of AQE plan

2020-05-08 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-31658.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

> SQL UI doesn't show write commands of AQE plan
> --
>
> Key: SPARK-31658
> URL: https://issues.apache.org/jira/browse/SPARK-31658
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31658) SQL UI doesn't show write commands of AQE plan

2020-05-08 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-31658:
---

Assignee: Manu Zhang

> SQL UI doesn't show write commands of AQE plan
> --
>
> Key: SPARK-31658
> URL: https://issues.apache.org/jira/browse/SPARK-31658
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31664) Race in YARN scheduler shutdown leads to uncaught SparkException "Could not find CoarseGrainedScheduler"

2020-05-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31664:


Assignee: Apache Spark

> Race in YARN scheduler shutdown leads to uncaught SparkException "Could not 
> find CoarseGrainedScheduler"
> 
>
> Key: SPARK-31664
> URL: https://issues.apache.org/jira/browse/SPARK-31664
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, YARN
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Baohe Zhang
>Assignee: Apache Spark
>Priority: Minor
>
> I used this command to run SparkPi on a yarn cluster with dynamicAllocation 
> enabled: "$SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster 
> --class org.apache.spark.examples.SparkPi ./spark-examples.jar 1000" and 
> received the error log below every time.
>  
> {code:java}
> 20/05/06 16:31:44 ERROR TransportRequestHandler: Error while invoking 
> RpcHandler#receive() for one-way message.
> org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:169)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:150)
>   at 
> org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:684)
>   at 
> org.apache.spark.network.server.AbstractAuthRpcHandler.receive(AbstractAuthRpcHandler.java:66)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:253)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>   at 

[jira] [Assigned] (SPARK-31664) Race in YARN scheduler shutdown leads to uncaught SparkException "Could not find CoarseGrainedScheduler"

2020-05-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31664:


Assignee: (was: Apache Spark)

> Race in YARN scheduler shutdown leads to uncaught SparkException "Could not 
> find CoarseGrainedScheduler"
> 
>
> Key: SPARK-31664
> URL: https://issues.apache.org/jira/browse/SPARK-31664
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, YARN
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Baohe Zhang
>Priority: Minor
>
> I used this command to run SparkPi on a yarn cluster with dynamicAllocation 
> enabled: "$SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster 
> --class org.apache.spark.examples.SparkPi ./spark-examples.jar 1000" and 
> received the error log below every time.
>  
> {code:java}
> 20/05/06 16:31:44 ERROR TransportRequestHandler: Error while invoking 
> RpcHandler#receive() for one-way message.
> org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:169)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:150)
>   at 
> org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:684)
>   at 
> org.apache.spark.network.server.AbstractAuthRpcHandler.receive(AbstractAuthRpcHandler.java:66)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:253)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>   at 
> io.netty.util.internal

[jira] [Commented] (SPARK-31664) Race in YARN scheduler shutdown leads to uncaught SparkException "Could not find CoarseGrainedScheduler"

2020-05-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102757#comment-17102757
 ] 

Apache Spark commented on SPARK-31664:
--

User 'baohe-zhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/28480

> Race in YARN scheduler shutdown leads to uncaught SparkException "Could not 
> find CoarseGrainedScheduler"
> 
>
> Key: SPARK-31664
> URL: https://issues.apache.org/jira/browse/SPARK-31664
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, YARN
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Baohe Zhang
>Priority: Minor
>
> I used this command to run SparkPi on a yarn cluster with dynamicAllocation 
> enabled: "$SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster 
> --class org.apache.spark.examples.SparkPi ./spark-examples.jar 1000" and 
> received the error log below every time.
>  
> {code:java}
> 20/05/06 16:31:44 ERROR TransportRequestHandler: Error while invoking 
> RpcHandler#receive() for one-way message.
> org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:169)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:150)
>   at 
> org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:684)
>   at 
> org.apache.spark.network.server.AbstractAuthRpcHandler.receive(AbstractAuthRpcHandler.java:66)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:253)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
>   at 
> io.netty.ut

[jira] [Commented] (SPARK-31640) Support SHOW PARTITIONS for DataSource V2 tables

2020-05-08 Thread Burak Yavuz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102746#comment-17102746
 ] 

Burak Yavuz commented on SPARK-31640:
-

Hi [~younggyuchun],

 

I'd take a look at how SHOW PARTITIONS works today:

 - 
[https://github.com/apache/spark/blob/36803031e850b08d689df90d15c75e1a1eeb28a8/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L1023]

which returns a list of string paths.

It would be great if, with [DataSourceV2 
tables|https://github.com/apache/spark/blob/36803031e850b08d689df90d15c75e1a1eeb28a8/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Table.java#L43],
 we could return the list of partitions 
([https://github.com/apache/spark/blob/36803031e850b08d689df90d15c75e1a1eeb28a8/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Table.java#L60]),
 where each Transform is a separate column.
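
A minimal sketch of the difference, assuming a partitioned V1 table `v1_tbl` exists 
(the behaviour shown in the comments is only illustrative):
{code:scala}
// Today, SHOW PARTITIONS on a V1 table returns a single string column with
// path-style values, e.g. "year=2020/month=5".
spark.sql("SHOW PARTITIONS v1_tbl").show(false)

// The proposal for V2 tables is to return each partition transform as its own
// column (e.g. separate `year` and `month` columns with their values).
{code}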

> Support SHOW PARTITIONS for DataSource V2 tables
> 
>
> Key: SPARK-31640
> URL: https://issues.apache.org/jira/browse/SPARK-31640
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Priority: Major
>
> SHOW PARTITIONS is supported for V1 Hive tables. We can also support it for 
> V2 tables, where they return the transforms and the values of those 
> transforms as separate columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31640) Support SHOW PARTITIONS for DataSource V2 tables

2020-05-08 Thread YoungGyu Chun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102738#comment-17102738
 ] 

YoungGyu Chun commented on SPARK-31640:
---

Hi [~brkyvz],

I am trying to sort this out, but I have a question about Hive. Can you explain 
what you mean by "they return the transforms and the values of those transforms 
as separate columns"?

I am looking at the Hive documentation, but I can't find the difference between 
a V1 and a V2 Hive table. 

> Support SHOW PARTITIONS for DataSource V2 tables
> 
>
> Key: SPARK-31640
> URL: https://issues.apache.org/jira/browse/SPARK-31640
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Priority: Major
>
> SHOW PARTITIONS is supported for V1 Hive tables. We can also support it for 
> V2 tables, where they return the transforms and the values of those 
> transforms as separate columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31663) Grouping sets with having clause returns the wrong result

2020-05-08 Thread Yuanjian Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-31663:

Description: 
Grouping sets with a having clause returns the wrong result when the having 
condition contains a conflicting name. See the example below:
{code:java}
select sum(a) as b FROM VALUES (1, 10), (2, 20) AS T(a, b) group by GROUPING 
SETS ((b), (a, b)) having b > 10{code}
The `b` in `having b > 10` should be resolved as `T.b` not `sum(a)`, so the 
right result should be
{code:java}
+---+
|  b|
+---+
|  2|
|  2|
+---+{code}
instead of an empty result.

The root cause is similar to SPARK-31519: we parse HAVING as Filter(..., Agg(...)) 
and resolve these two parts in different rules. CUBE and ROLLUP have the same 
issue.

Other systems work as expected; I checked PostgreSQL 9.6 and MS SQL Server 2017.

  was:
Grouping sets with a having clause returns the wrong result when the having 
condition contains a conflicting name. See the example below:
{code:java}
select sum(a) as b FROM VALUES (1, 10), (2, 20) AS T(a, b) group by GROUPING 
SETS ((b), (a, b)) having b > 10{code}
The `b` in `having b > 10` should be resolved as `T.b` not `sum(a)`, so the 
right result should be
{code:java}
+---+
|  b|
+---+
|  2|
|  2|
+---+{code}
instead of an empty result.

The root cause is similar to SPARK-31519: we parse HAVING as Filter(..., Agg(...)) 
and resolve these two parts in different rules. CUBE and ROLLUP have the same 
issue.
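
A minimal sketch (using only the VALUES clause above) of how to see the 
Filter-over-Aggregate shape described here, and where the conflicting `b` ends up 
being resolved:
{code:scala}
// Sketch only: print the analyzed plan of the failing query. The HAVING condition
// appears as a Filter on top of the Aggregate produced for the grouping sets.
val q = spark.sql(
  """SELECT sum(a) AS b FROM VALUES (1, 10), (2, 20) AS T(a, b)
    |GROUP BY GROUPING SETS ((b), (a, b)) HAVING b > 10""".stripMargin)
println(q.queryExecution.analyzed.numberedTreeString)
{code}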


> Grouping sets with having clause returns the wrong result
> -
>
> Key: SPARK-31663
> URL: https://issues.apache.org/jira/browse/SPARK-31663
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> Grouping sets with a having clause returns the wrong result when the having
> condition contains a conflicting name. See the example below:
> {code:java}
> select sum(a) as b FROM VALUES (1, 10), (2, 20) AS T(a, b) group by GROUPING 
> SETS ((b), (a, b)) having b > 10{code}
> The `b` in `having b > 10` should be resolved as `T.b` not `sum(a)`, so the 
> right result should be
> {code:java}
> +---+
> |  b|
> +---+
> |  2|
> |  2|
> +---+{code}
> instead of an empty result.
> The root cause is similar to SPARK-31519: we parse HAVING as
> Filter(..., Agg(...)) and resolve these two parts in different rules. CUBE and
> ROLLUP have the same issue.
> Other systems work as expected; I checked PostgreSQL 9.6 and MS SQL Server
> 2017.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31663) Grouping sets with having clause returns the wrong result

2020-05-08 Thread Yuanjian Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-31663:

Description: 
Grouping sets with a having clause returns the wrong result when the having 
condition contains a conflicting name. See the example below:
{code:java}
select sum(a) as b FROM VALUES (1, 10), (2, 20) AS T(a, b) group by GROUPING 
SETS ((b), (a, b)) having b > 10{code}
The `b` in `having b > 10` should be resolved as `T.b` not `sum(a)`, so the 
right result should be
{code:java}
+---+
|  b|
+---+
|  2|
|  2|
+---+{code}
instead of an empty result.

The root cause is similar to SPARK-31519: we parse HAVING as Filter(..., Agg(...)) 
and resolve these two parts in different rules. CUBE and ROLLUP have the same 
issue.

  was:
Grouping sets with a having clause returns the wrong result when the having 
condition contains a conflicting name. See the example below:

{quote}

select sum(a) as b FROM VALUES (1, 10), (2, 20) AS T(a, b) group by GROUPING 
SETS ((b), (a, b)) having b > 10

{quote}

The `b` in `having b > 10` should be resolved as `T.b` not `sum(a)`, so the 
right result should be

{quote}

+---+
| b|
+---+
| 2|
| 2|
+---+

{quote}

instead of an empty result.

The root cause is similar to SPARK-31519: we parse HAVING as Filter(..., Agg(...)) 
and resolve these two parts in different rules. CUBE and ROLLUP have the same 
issue.


> Grouping sets with having clause returns the wrong result
> -
>
> Key: SPARK-31663
> URL: https://issues.apache.org/jira/browse/SPARK-31663
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> Grouping sets with a having clause returns the wrong result when the having
> condition contains a conflicting name. See the example below:
> {code:java}
> select sum(a) as b FROM VALUES (1, 10), (2, 20) AS T(a, b) group by GROUPING 
> SETS ((b), (a, b)) having b > 10{code}
> The `b` in `having b > 10` should be resolved as `T.b` not `sum(a)`, so the 
> right result should be
> {code:java}
> +---+
> |  b|
> +---+
> |  2|
> |  2|
> +---+{code}
> instead of an empty result.
> The root cause is similar to SPARK-31519: we parse HAVING as
> Filter(..., Agg(...)) and resolve these two parts in different rules. CUBE and
> ROLLUP have the same issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31663) Grouping sets with having clause returns the wrong result

2020-05-08 Thread Yuanjian Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-31663:

Labels:   (was: correct)

> Grouping sets with having clause returns the wrong result
> -
>
> Key: SPARK-31663
> URL: https://issues.apache.org/jira/browse/SPARK-31663
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> Grouping sets with a having clause returns the wrong result when the having
> condition contains a conflicting name. See the example below:
> {quote}
> select sum(a) as b FROM VALUES (1, 10), (2, 20) AS T(a, b) group by GROUPING 
> SETS ((b), (a, b)) having b > 10
> {quote}
> The `b` in `having b > 10` should be resolved as `T.b` not `sum(a)`, so the 
> right result should be
> {quote}
> +---+
> | b|
> +---+
> | 2|
> | 2|
> +---+
> {quote}
> instead of an empty result.
> The root cause is similar to SPARK-31519: we parse HAVING as
> Filter(..., Agg(...)) and resolve these two parts in different rules. CUBE and
> ROLLUP have the same issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31654) sequence producing inconsistent intervals for month step

2020-05-08 Thread Ramesh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102712#comment-17102712
 ] 

Ramesh commented on SPARK-31654:


[~roman_y]  ,  [~Ankitraj] 

It is working as expected:

spark.sql("SELECT sequence(to_date('2018-01-01'), to_date('2019-01-01'), 
interval 1 month)").rdd.collect()


spark.sql("SELECT sequence(to_date('2018-01-01'), to_date('2019-01-01'), 
interval 1 month)").rdd.collect()
[Row(sequence(to_date('2018-01-01'), to_date('2019-01-01'), INTERVAL '1 
months')=[datetime.date(2018, 1, 1), datetime.date(2018, 2, 1), 
datetime.date(2018, 3, 1), datetime.date(2018, 4, 1), datetime.date(2018, 5, 
1), datetime.date(2018, 6, 1), datetime.date(2018, 7, 1), datetime.date(2018, 
8, 1), datetime.date(2018, 9, 1), datetime.date(2018, 10, 1), 
datetime.date(2018, 11, 1), datetime.date(2018, 12, 1), datetime.date(2019, 1, 
1)])]
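
On versions where the reported inconsistency is still observed, a possible 
workaround (an assumption, not something from the report) is to generate the 
month starts explicitly instead of stepping by a month interval:
{code:scala}
// Sketch only: add_months over an integer sequence always yields the first day
// of each month.
spark.sql(
  "SELECT transform(sequence(0, 12), i -> add_months(to_date('2018-01-01'), i)) AS months"
).show(false)
{code}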


> sequence producing inconsistent intervals for month step
> 
>
> Key: SPARK-31654
> URL: https://issues.apache.org/jira/browse/SPARK-31654
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Roman Yalki
>Priority: Major
>
> Taking an example from [https://spark.apache.org/docs/latest/api/sql/]
> {code:java}
> > SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 
> > month);{code}
> [2018-01-01,2018-02-01,2018-03-01]
> but if one extends `stop` to the end of the year, some intervals are returned as
> the last day of the month, whereas the first day of the month is expected:
> {code:java}
> > SELECT sequence(to_date('2018-01-01'), to_date('2019-01-01'), interval 1 
> > month){code}
> [2018-01-01, 2018-02-01, 2018-03-01, *2018-03-31, 2018-04-30, 2018-05-31, 
> 2018-06-30, 2018-07-31, 2018-08-31, 2018-09-30, 2018-10-31*, 2018-12-01, 
> 2019-01-01]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31663) Grouping sets with having clause returns the wrong result

2020-05-08 Thread Yuanjian Li (Jira)
Yuanjian Li created SPARK-31663:
---

 Summary: Grouping sets with having clause returns the wrong result
 Key: SPARK-31663
 URL: https://issues.apache.org/jira/browse/SPARK-31663
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.5, 3.0.0
Reporter: Yuanjian Li


Grouping sets with a having clause returns the wrong result when the having 
condition contains a conflicting name. See the example below:

{quote}

select sum(a) as b FROM VALUES (1, 10), (2, 20) AS T(a, b) group by GROUPING 
SETS ((b), (a, b)) having b > 10

{quote}

The `b` in `having b > 10` should be resolved as `T.b` not `sum(a)`, so the 
right result should be

{quote}

+---+
| b|
+---+
| 2|
| 2|
+---+

{quote}

instead of an empty result.

The root cause is similar to SPARK-31519: we parse HAVING as Filter(..., Agg(...)) 
and resolve these two parts in different rules. CUBE and ROLLUP have the same 
issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31663) Grouping sets with having clause returns the wrong result

2020-05-08 Thread Yuanjian Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-31663:

Labels: correct  (was: )

> Grouping sets with having clause returns the wrong result
> -
>
> Key: SPARK-31663
> URL: https://issues.apache.org/jira/browse/SPARK-31663
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>  Labels: correct
>
> Grouping sets with a having clause returns the wrong result when the having
> condition contains a conflicting name. See the example below:
> {quote}
> select sum(a) as b FROM VALUES (1, 10), (2, 20) AS T(a, b) group by GROUPING 
> SETS ((b), (a, b)) having b > 10
> {quote}
> The `b` in `having b > 10` should be resolved as `T.b` not `sum(a)`, so the 
> right result should be
> {quote}
> +---+
> | b|
> +---+
> | 2|
> | 2|
> +---+
> {quote}
> instead of an empty result.
> The root cause is similar to SPARK-31519: we parse HAVING as
> Filter(..., Agg(...)) and resolve these two parts in different rules. CUBE and
> ROLLUP have the same issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31664) Race in YARN scheduler shutdown leads to uncaught SparkException "Could not find CoarseGrainedScheduler"

2020-05-08 Thread Baohe Zhang (Jira)
Baohe Zhang created SPARK-31664:
---

 Summary: Race in YARN scheduler shutdown leads to uncaught 
SparkException "Could not find CoarseGrainedScheduler"
 Key: SPARK-31664
 URL: https://issues.apache.org/jira/browse/SPARK-31664
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, YARN
Affects Versions: 3.0.0, 3.0.1, 3.1.0
Reporter: Baohe Zhang


I used this command to run SparkPi on a yarn cluster with dynamicAllocation 
enabled: "$SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster 
--class org.apache.spark.examples.SparkPi ./spark-examples.jar 1000" and 
received the error log below every time.

 
{code:java}
20/05/06 16:31:44 ERROR TransportRequestHandler: Error while invoking 
RpcHandler#receive() for one-way message.
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
at 
org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:169)
at 
org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:150)
at 
org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:684)
at 
org.apache.spark.network.server.AbstractAuthRpcHandler.receive(AbstractAuthRpcHandler.java:66)
at 
org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:253)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at 
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
20/05/06 16:31:45 INFO MapOutputTrackerMasterEndpoint: 
MapOutputTrackerMasterEndpoint stopped!
20/05/06 16:31:45 INFO MemoryStore: MemoryStore cleared
20/05/06 16:31:45 INFO BlockManager

[jira] [Commented] (SPARK-31470) Introduce SORTED BY clause in CREATE TABLE statement

2020-05-08 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102635#comment-17102635
 ] 

Yuming Wang commented on SPARK-31470:
-

[~rakson] I'm not sure whether it will be fully accepted by the community. You can 
work on this if you like.

> Introduce SORTED BY clause in CREATE TABLE statement
> 
>
> Key: SPARK-31470
> URL: https://issues.apache.org/jira/browse/SPARK-31470
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> We usually sort on frequently filtered columns when writing data to improve 
> query performance, but this information is not recorded anywhere in the table 
> metadata.
>  
> {code:sql}
> CREATE TABLE t(day INT, hour INT, year INT, month INT)
> USING parquet
> PARTITIONED BY (year, month)
> SORTED BY (day, hour);
> {code}
>  
> Impala, Oracle and redshift support this clause:
> https://issues.apache.org/jira/browse/IMPALA-4166
> https://docs.oracle.com/database/121/DWHSG/attcluster.htm#GUID-DAECFBC5-FD1A-45A5-8C2C-DC9884D0857B
> https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data-compare-sort-styles.html
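
For context, a minimal sketch of what users do today (assuming a DataFrame `df` 
with the columns from the example above): the sort has to be applied manually at 
write time and is not recorded anywhere afterwards, which is the gap SORTED BY 
would close.
{code:scala}
// Sketch only: sort within each output partition before writing; nothing about
// this ordering ends up in the table metadata.
import org.apache.spark.sql.functions.col

df.repartition(col("year"), col("month"))
  .sortWithinPartitions("day", "hour")
  .write
  .format("parquet")
  .partitionBy("year", "month")
  .saveAsTable("t")
{code}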



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31662) Reading wrong dates from dictionary encoded columns in Parquet files

2020-05-08 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-31662:
---
Description: 
Write dates with dictionary encoding enabled to parquet files:
{code:scala}
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.0-SNAPSHOT
  /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", 
true)

scala> :paste
// Entering paste mode (ctrl-D to finish)

  Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS")
.select($"dateS".cast("date").as("date"))
.repartition(1)
.write
.option("parquet.enable.dictionary", true)
.mode("overwrite")
.parquet("/Users/maximgekk/tmp/parquet-date-dict")

// Exiting paste mode, now interpreting.
{code}

Read them back:
{code:scala}
scala> spark.read.parquet("/Users/maximgekk/tmp/parquet-date-dict").show(false)
+--+
|date  |
+--+
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
+--+
{code}

*Expected values must be 1000-01-01.*

I checked that the date column is dictionary-encoded via:
{code}
➜  parquet-date-dict java -jar ~/Downloads/parquet-tools-1.12.0.jar dump 
./part-0-84a77214-0c8c-45e9-ac41-5ca863b9dd94-c000.snappy.parquet
row group 0

date:  INT32 SNAPPY DO:0 FPO:4 SZ:74/70/0.95 VC:8 ENC:BIT_PACKED,RLE,P [more]...
date TV=8 RL=0 DL=1 DS: 1 DE:PLAIN_DICTIONARY

page 0:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY 
[more]... VC:8
INT32 date

*** row group 1 of 1, values 1 to 8 ***
value 1: R:0 D:1 V:1001-01-07
value 2: R:0 D:1 V:1001-01-07
value 3: R:0 D:1 V:1001-01-07
value 4: R:0 D:1 V:1001-01-07
value 5: R:0 D:1 V:1001-01-07
value 6: R:0 D:1 V:1001-01-07
value 7: R:0 D:1 V:1001-01-07
value 8: R:0 D:1 V:1001-01-07
{code}

  was:
Write dates with dictionary encoding enabled to parquet files:
{code:scala}
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.0-SNAPSHOT
  /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", 
true)

scala> :paste
// Entering paste mode (ctrl-D to finish)

  Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS")
.select($"dateS".cast("date").as("date"))
.repartition(1)
.write
.option("parquet.enable.dictionary", true)
.mode("overwrite")
.parquet("/Users/maximgekk/tmp/parquet-date-dict")

// Exiting paste mode, now interpreting.
{code}

Read them back:
{code:scala}
scala> spark.read.parquet("/Users/maximgekk/tmp/parquet-date-dict").show(false)
+--+
|date  |
+--+
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
+--+
{code}

*Expected values must be 1000-01-01.*
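
A small diagnostic sketch (an assumption, not part of the report): reading the same 
files with and without the vectorized Parquet reader can help narrow down whether 
the problem is specific to the dictionary-decoding path of one reader.
{code:scala}
// Sketch only: compare both read paths on the files written above.
Seq("true", "false").foreach { vectorized =>
  spark.conf.set("spark.sql.parquet.enableVectorizedReader", vectorized)
  println(s"vectorized reader enabled = $vectorized")
  spark.read.parquet("/Users/maximgekk/tmp/parquet-date-dict").show(false)
}
{code}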


> Reading wrong dates from dictionary encoded columns in Parquet files
> 
>
> Key: SPARK-31662
> URL: https://issues.apache.org/jira/browse/SPARK-31662
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Write dates with dictionary encoding enabled to parquet files:
> {code:scala}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.1.0-SNAPSHOT
>   /_/
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> 
> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
>   Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS")
> .select($"dateS".cast("date").as("date"))
> .repartition(1)
> .write
> .option("parquet.enable.dictionary", true)
> .mode("overwrite")
> .parquet("/Users/maximgekk/tmp/parquet-date-dict")
> // Exiting paste mode, now interpreting.
> {code}
> Read them back:
>

[jira] [Commented] (SPARK-31662) Reading wrong dates from dictionary encoded columns in Parquet files

2020-05-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102525#comment-17102525
 ] 

Apache Spark commented on SPARK-31662:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28479

> Reading wrong dates from dictionary encoded columns in Parquet files
> 
>
> Key: SPARK-31662
> URL: https://issues.apache.org/jira/browse/SPARK-31662
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Write dates with dictionary encoding enabled to parquet files:
> {code:scala}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.1.0-SNAPSHOT
>   /_/
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> 
> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
>   Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS")
> .select($"dateS".cast("date").as("date"))
> .repartition(1)
> .write
> .option("parquet.enable.dictionary", true)
> .mode("overwrite")
> .parquet("/Users/maximgekk/tmp/parquet-date-dict")
> // Exiting paste mode, now interpreting.
> {code}
> Read them back:
> {code:scala}
> scala> 
> spark.read.parquet("/Users/maximgekk/tmp/parquet-date-dict").show(false)
> +--+
> |date  |
> +--+
> |1001-01-07|
> |1001-01-07|
> |1001-01-07|
> |1001-01-07|
> |1001-01-07|
> |1001-01-07|
> |1001-01-07|
> |1001-01-07|
> +--+
> {code}
> *Expected values must be 1000-01-01.*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31662) Reading wrong dates from dictionary encoded columns in Parquet files

2020-05-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31662:


Assignee: (was: Apache Spark)

> Reading wrong dates from dictionary encoded columns in Parquet files
> 
>
> Key: SPARK-31662
> URL: https://issues.apache.org/jira/browse/SPARK-31662
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Write dates with dictionary encoding enabled to parquet files:
> {code:scala}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.1.0-SNAPSHOT
>   /_/
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> 
> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
>   Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS")
> .select($"dateS".cast("date").as("date"))
> .repartition(1)
> .write
> .option("parquet.enable.dictionary", true)
> .mode("overwrite")
> .parquet("/Users/maximgekk/tmp/parquet-date-dict")
> // Exiting paste mode, now interpreting.
> {code}
> Read them back:
> {code:scala}
> scala> 
> spark.read.parquet("/Users/maximgekk/tmp/parquet-date-dict").show(false)
> +----------+
> |date      |
> +----------+
> |1001-01-07|
> |1001-01-07|
> |1001-01-07|
> |1001-01-07|
> |1001-01-07|
> |1001-01-07|
> |1001-01-07|
> |1001-01-07|
> +----------+
> {code}
> *Expected values must be 1000-01-01.*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31662) Reading wrong dates from dictionary encoded columns in Parquet files

2020-05-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31662:


Assignee: Apache Spark

> Reading wrong dates from dictionary encoded columns in Parquet files
> 
>
> Key: SPARK-31662
> URL: https://issues.apache.org/jira/browse/SPARK-31662
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Write dates with dictionary encoding enabled to parquet files:
> {code:scala}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 3.1.0-SNAPSHOT
>       /_/
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> 
> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
>   Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS")
> .select($"dateS".cast("date").as("date"))
> .repartition(1)
> .write
> .option("parquet.enable.dictionary", true)
> .mode("overwrite")
> .parquet("/Users/maximgekk/tmp/parquet-date-dict")
> // Exiting paste mode, now interpreting.
> {code}
> Read them back:
> {code:scala}
> scala> 
> spark.read.parquet("/Users/maximgekk/tmp/parquet-date-dict").show(false)
> +----------+
> |date      |
> +----------+
> |1001-01-07|
> |1001-01-07|
> |1001-01-07|
> |1001-01-07|
> |1001-01-07|
> |1001-01-07|
> |1001-01-07|
> |1001-01-07|
> +----------+
> {code}
> *Expected values must be 1000-01-01.*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31662) Reading wrong dates from dictionary encoded columns in Parquet files

2020-05-08 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-31662:
--

 Summary: Reading wrong dates from dictionary encoded columns in 
Parquet files
 Key: SPARK-31662
 URL: https://issues.apache.org/jira/browse/SPARK-31662
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Maxim Gekk


Write dates with dictionary encoding enabled to parquet files:
{code:scala}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.0-SNAPSHOT
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", 
true)

scala> :paste
// Entering paste mode (ctrl-D to finish)

  Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS")
.select($"dateS".cast("date").as("date"))
.repartition(1)
.write
.option("parquet.enable.dictionary", true)
.mode("overwrite")
.parquet("/Users/maximgekk/tmp/parquet-date-dict")

// Exiting paste mode, now interpreting.
{code}

Read them back:
{code:scala}
scala> spark.read.parquet("/Users/maximgekk/tmp/parquet-date-dict").show(false)
+----------+
|date      |
+----------+
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
+----------+
{code}

*Expected values must be 1000-01-01.*
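
For completeness, a cross-check sketch (my addition, not part of the original report): writing the same dates with dictionary encoding disabled and reading them back should return the expected value, which would confirm that the corruption is specific to the dictionary-encoded read path. The output path below is an assumption.

{code:scala}
// Run in spark-shell, as in the repro above (implicits in scope).
// Same write as above, but with dictionary encoding turned off.
spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)

Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS")
  .select($"dateS".cast("date").as("date"))
  .repartition(1)
  .write
  .option("parquet.enable.dictionary", false)  // the only change vs. the repro
  .mode("overwrite")
  .parquet("/tmp/parquet-date-plain")

// Expected: every row reads back as the value the report expects.
spark.read.parquet("/tmp/parquet-date-plain").show(false)
{code}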



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31622) Test-jar in the Spark distribution

2020-05-08 Thread Ankit Raj Boudh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102466#comment-17102466
 ] 

Ankit Raj Boudh commented on SPARK-31622:
-

[~hyukjin.kwon], please confirm this, then I will raise a PR for it.

> Test-jar in the Spark distribution
> --
>
> Key: SPARK-31622
> URL: https://issues.apache.org/jira/browse/SPARK-31622
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Arseniy Tashoyan
>Priority: Minor
>
> The jar with classifier *tests* is delivered in the Spark distribution:
> {code:java}
> ls -1 spark-3.0.0-preview2-bin-hadoop2.7/jars/ | grep tests
> spark-tags_2.12-3.0.0-preview2-tests.jar
> {code}
> Normally, test-jars should not be used for production.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24193) Sort by disk when number of limit is big in TakeOrderedAndProjectExec

2020-05-08 Thread Xianjin YE (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102458#comment-17102458
 ] 

Xianjin YE commented on SPARK-24193:


I used `df.rdd.collect` intentionally to trigger the problem, since `df.collect` is 
converted to `SparkPlan.executeTake`, which returns the data correctly.

 

The problem can also be triggered with a slightly different version:
{code:java}
// Imports assumed for a self-contained repro; Utils is Spark's internal test
// helper for temp directories (any temp directory would do).
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.util.Utils

val spark = SparkSession
  .builder
  .appName("Spark TopK test")
  .master("local-cluster[8, 1, 1024]")
  .getOrCreate()
val temp1 = Utils.createTempDir()
val data = spark.range(10, 0, -1, 10).toDF("id").selectExpr("id + 1 as id")

// Threshold below the limit: the planner falls back from TakeOrderedAndProject
// to a Sort followed by CollectLimit.
spark.conf.set(SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD.key, 100)
data.orderBy("id").limit(200).write.mode("overwrite").parquet(temp1.toString)
val topKInSort = spark.read.parquet(temp1.toString).collect()

// Threshold above the limit: in-memory top-K via TakeOrderedAndProject.
spark.conf.set(SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD.key, Int.MaxValue)
data.orderBy("id").limit(200).write.mode("overwrite").parquet(temp1.toString)
val topKInMemory = spark.read.parquet(temp1.toString).collect()

println(topKInMemory.map(_.getLong(0)).mkString("[", ",", "]"))
println(topKInSort.map(_.getLong(0)).mkString("[", ",", "]"))
assert(topKInMemory sameElements topKInSort)

{code}
The real problem is that when I go on to use the ordered and limited data, for 
example by joining it or writing it to an external table, the data is incorrect 
once the plan falls back to CollectLimitExec.

> Sort by disk when number of limit is big in TakeOrderedAndProjectExec
> -
>
> Key: SPARK-24193
> URL: https://issues.apache.org/jira/browse/SPARK-24193
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jin Xing
>Assignee: Jin Xing
>Priority: Major
> Fix For: 2.4.0
>
>
> Physical plan of  "_select colA from t order by colB limit M_" is 
> _TakeOrderedAndProject_;
> Currently _TakeOrderedAndProject_ sorts data in memory, see 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala#L158
>  
> Shall we add a config so that if the limit (M) is too big, we sort on disk?
> That would resolve the memory issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24193) Sort by disk when number of limit is big in TakeOrderedAndProjectExec

2020-05-08 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102439#comment-17102439
 ] 

Wenchen Fan commented on SPARK-24193:
-

I think it's not a problem if you do `df.collect` instead of `df.rdd.collect`.

LIMIT only preserves the data order if it's the last operation. When you do 
`df.rdd`, it means you are going to add more operations.
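
A small sketch of the practical consequence (my addition, reusing the `data` and `temp1` names from the repro earlier in this thread): if more operations follow the limit, make the sort the last step before the sink, and sort again after reading when the consumer needs ordered rows.

{code:scala}
// Sketch only: do not rely on limit() to keep the order once more operations follow.
data.orderBy("id").limit(200)
  .orderBy("id")                                   // explicit final sort before the sink
  .write.mode("overwrite").parquet(temp1.toString)

// Parquet itself gives no row-order guarantee across files, so sort on read too.
val topK = spark.read.parquet(temp1.toString).orderBy("id").collect()
{code}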

> Sort by disk when number of limit is big in TakeOrderedAndProjectExec
> -
>
> Key: SPARK-24193
> URL: https://issues.apache.org/jira/browse/SPARK-24193
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jin Xing
>Assignee: Jin Xing
>Priority: Major
> Fix For: 2.4.0
>
>
> Physical plan of  "_select colA from t order by colB limit M_" is 
> _TakeOrderedAndProject_;
> Currently _TakeOrderedAndProject_ sorts data in memory, see 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala#L158
>  
> Shall we add a config so that if the limit (M) is too big, we sort on disk?
> That would resolve the memory issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31104) Add documentation for all new Json Functions

2020-05-08 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102438#comment-17102438
 ] 

Rakesh Raushan commented on SPARK-31104:


[~hyukjin.kwon] We can mark this as resolved as this task has already been 
completed.

> Add documentation for all new Json Functions
> 
>
> Key: SPARK-31104
> URL: https://issues.apache.org/jira/browse/SPARK-31104
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31470) Introduce SORTED BY clause in CREATE TABLE statement

2020-05-08 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102437#comment-17102437
 ] 

Rakesh Raushan commented on SPARK-31470:


If this is required by the community and [~yumwang] has not started working on it, I 
can take it up.

[~yumwang] What do you say?

> Introduce SORTED BY clause in CREATE TABLE statement
> 
>
> Key: SPARK-31470
> URL: https://issues.apache.org/jira/browse/SPARK-31470
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> We usually sort on frequently filtered columns when writing data to improve 
> query performance, but this information is not recorded in the table metadata.
>  
> {code:sql}
> CREATE TABLE t(day INT, hour INT, year INT, month INT)
> USING parquet
> PARTITIONED BY (year, month)
> SORTED BY (day, hour);
> {code}
>  
> Impala, Oracle and redshift support this clause:
> https://issues.apache.org/jira/browse/IMPALA-4166
> https://docs.oracle.com/database/121/DWHSG/attcluster.htm#GUID-DAECFBC5-FD1A-45A5-8C2C-DC9884D0857B
> https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data-compare-sort-styles.html
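
As an aside on the proposal above (my sketch, not part of it): the closest existing alternative is the DataFrameWriter bucketBy/sortBy API, which records the sort columns in the table metadata but only together with bucketing. The `df` DataFrame and the bucket count below are illustrative assumptions.

{code:scala}
// Existing write-side alternative: sortBy is only allowed together with bucketBy.
df.write
  .partitionBy("year", "month")
  .bucketBy(8, "day")              // illustrative bucket count
  .sortBy("day", "hour")
  .saveAsTable("t_sorted")
{code}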



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31654) sequence producing inconsistent intervals for month step

2020-05-08 Thread Ankit Raj Boudh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102421#comment-17102421
 ] 

Ankit Raj Boudh commented on SPARK-31654:
-

[~roman_y], I will raise a PR for this.

> sequence producing inconsistent intervals for month step
> 
>
> Key: SPARK-31654
> URL: https://issues.apache.org/jira/browse/SPARK-31654
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Roman Yalki
>Priority: Major
>
> Taking an example from [https://spark.apache.org/docs/latest/api/sql/]
> {code:java}
> > SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 
> > month);{code}
> [2018-01-01,2018-02-01,2018-03-01]
> If the `stop` date is extended to the end of the year, some of the returned 
> dates fall on the last day of the month, whereas the first day of the month 
> is expected:
> {code:java}
> > SELECT sequence(to_date('2018-01-01'), to_date('2019-01-01'), interval 1 
> > month){code}
> [2018-01-01, 2018-02-01, 2018-03-01, *2018-03-31, 2018-04-30, 2018-05-31, 
> 2018-06-30, 2018-07-31, 2018-08-31, 2018-09-30, 2018-10-31*, 2018-12-01, 
> 2019-01-01]
>  
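
A workaround sketch for the report above (my addition, assuming Spark 2.4's higher-order functions are available): generate the month starts explicitly with add_months over an integer sequence, so every element is anchored to the first day of the month.

{code:scala}
// Each element is add_months(start, i), so it always lands on the 1st of the month.
spark.sql("""
  SELECT transform(
           sequence(0, 12),
           i -> add_months(to_date('2018-01-01'), i)
         ) AS month_starts
""").show(false)
{code}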



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31657) CSV Writer writes no header for empty DataFrames

2020-05-08 Thread Ankit Raj Boudh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102420#comment-17102420
 ] 

Ankit Raj Boudh commented on SPARK-31657:
-

[~fpin], I will raise a PR for this.

> CSV Writer writes no header for empty DataFrames
> 
>
> Key: SPARK-31657
> URL: https://issues.apache.org/jira/browse/SPARK-31657
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.1
> Environment: Local pyspark 2.41
>Reporter: Furcy Pin
>Priority: Minor
>
> When writing a DataFrame as CSV with the header option set to true,
> the header is not written when the DataFrame is empty.
> This causes failures for processes that read the CSV back.
> Example (note the limit(0) in the second write):
> {code:java}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.4.1
>       /_/
> Using Python version 2.7.17 (default, Nov 7 2019 10:07:09)
> SparkSession available as 'spark'.
> >>> df1 = spark.sql("SELECT 1 as a")
> >>> df1.limit(1).write.mode("OVERWRITE").option("Header", 
> >>> True).csv("data/test/csv")
> >>> spark.read.option("Header", True).csv("data/test/csv").show()
> +---+
> | a|
> +---+
> | 1|
> +---+
> >>> 
> >>> df1.limit(0).write.mode("OVERWRITE").option("Header", 
> >>> True).csv("data/test/csv")
> >>> spark.read.option("Header", True).csv("data/test/csv").show()
> ++
> ||
> ++
> ++
> {code}
>  
> Expected behavior:
> {code:java}
> >>> df1.limit(0).write.mode("OVERWRITE").option("Header", 
> >>> True).csv("data/test/csv")
> >>> spark.read.option("Header", True).csv("data/test/csv").show()
> +---+
> | a|
> +---+
> +---+{code}
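
Until the writer is fixed, a reader-side workaround sketch (my addition, in Scala rather than the PySpark of the report): pass the expected schema explicitly so an empty, header-less file still yields the right columns.

{code:scala}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Supplying the schema means the read no longer depends on the header row.
val expectedSchema = StructType(Seq(StructField("a", IntegerType)))
spark.read
  .option("header", true)
  .schema(expectedSchema)
  .csv("data/test/csv")
  .show()
{code}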



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31588) merge small files may need more common setting

2020-05-08 Thread philipse (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102414#comment-17102414
 ] 

philipse commented on SPARK-31588:
--

Yes, the block size can be controlled in HDFS. I mean we would just take the block 
size as one of the conditions. If we can control the target file size in Spark, we 
can control the real data layout in HDFS, instead of using repartition as a hard 
limit.
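
For illustration, a manual sketch of that idea (my addition, not an existing Spark setting; `df`, `outputPath` and the 128 MB target are assumptions): derive the repartition count from the optimizer's size estimate and a target file size.

{code:scala}
// Manual version of a size-based target: estimate output size, pick a file count.
val targetFileSize = 128L * 1024 * 1024                        // desired bytes per file
val estimatedBytes = df.queryExecution.optimizedPlan.stats.sizeInBytes
val numFiles = math.max(1, (estimatedBytes / targetFileSize).toInt)

df.repartition(numFiles)
  .write.mode("overwrite")
  .parquet(outputPath)
{code}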

> merge small files may need more common setting
> --
>
> Key: SPARK-31588
> URL: https://issues.apache.org/jira/browse/SPARK-31588
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.5
> Environment: spark:2.4.5
> hdp:2.7
>Reporter: philipse
>Priority: Major
>
> Hi,
> Spark SQL now allows us to use repartition or coalesce hints to manually control 
> small files, like the following:
> /*+ REPARTITION(1) */
> /*+ COALESCE(1) */
> But this can only be tuned case by case: for each query we have to decide whether 
> to use COALESCE or REPARTITION. Can we try a more general way that removes this 
> decision by setting a target file size, as Hive does?
> *Good points:*
> 1) We also get the new number of partitions automatically.
> 2) With an ON/OFF parameter provided, users can disable it if needed.
> 3) The parameter can be set at the cluster level instead of on the user side, 
> which makes it much easier to control small files.
> 4) It greatly reduces the pressure on the NameNode.
>  
> *Not good points:*
> 1) It adds an extra step to calculate the target number of partitions from the 
> statistics of the output files.
>  
> I don't know whether this has been planned for the future.
>  
> Thanks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30385) WebUI occasionally throw IOException on stop()

2020-05-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30385.
-
Fix Version/s: 3.1.0
 Assignee: Kousuke Saruta
   Resolution: Fixed

> WebUI occasionally throw IOException on stop()
> --
>
> Key: SPARK-30385
> URL: https://issues.apache.org/jira/browse/SPARK-30385
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
> Environment: MacOS 10.14.6
> Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_231
> Scala version 2.12.10
>Reporter: wuyi
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.1.0
>
>
> While using ./bin/spark-shell recently, I have occasionally seen an IOException 
> when I try to quit:
> {code:java}
> 19/12/30 17:33:21 WARN AbstractConnector:
> java.io.IOException: No such file or directory
>  at sun.nio.ch.NativeThread.signal(Native Method)
>  at 
> sun.nio.ch.ServerSocketChannelImpl.implCloseSelectableChannel(ServerSocketChannelImpl.java:292)
>  at 
> java.nio.channels.spi.AbstractSelectableChannel.implCloseChannel(AbstractSelectableChannel.java:234)
>  at 
> java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:115)
>  at org.eclipse.jetty.server.ServerConnector.close(ServerConnector.java:368)
>  at 
> org.eclipse.jetty.server.AbstractNetworkConnector.shutdown(AbstractNetworkConnector.java:105)
>  at org.eclipse.jetty.server.Server.doStop(Server.java:439)
>  at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.stop(AbstractLifeCycle.java:89)
>  
>  at org.apache.spark.ui.ServerInfo.stop(JettyUtils.scala:499)
>  at org.apache.spark.ui.WebUI.$anonfun$stop$2(WebUI.scala:173)
>  at org.apache.spark.ui.WebUI.$anonfun$stop$2$adapted(WebUI.scala:173)
>  at scala.Option.foreach(Option.scala:407)
>  at org.apache.spark.ui.WebUI.stop(WebUI.scala:173)
>  at org.apache.spark.ui.SparkUI.stop(SparkUI.scala:101)
>  at org.apache.spark.SparkContext.$anonfun$stop$6(SparkContext.scala:1972)
>  at 
> org.apache.spark.SparkContext.$anonfun$stop$6$adapted(SparkContext.scala:1972)
>  at scala.Option.foreach(Option.scala:407)
>  at org.apache.spark.SparkContext.$anonfun$stop$5(SparkContext.scala:1972)
>  at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357)
>  at org.apache.spark.SparkContext.stop(SparkContext.scala:1972)
>  at org.apache.spark.repl.Main$.$anonfun$doMain$3(Main.scala:79)
>  at org.apache.spark.repl.Main$.$anonfun$doMain$3$adapted(Main.scala:79)
>  at scala.Option.foreach(Option.scala:407)
>  at org.apache.spark.repl.Main$.doMain(Main.scala:79)
>  at org.apache.spark.repl.Main$.main(Main.scala:58)
>  at org.apache.spark.repl.Main.main(Main.scala)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) 
>  at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>  at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
>  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>  at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>  at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) 
>  at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
>  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
>  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> I haven't found a way to reproduce it reliably, but the likelihood increases the 
> longer you stay in spark-shell.
> A possible way to reproduce it: start ./bin/spark-shell, wait for about 5 
> minutes, then use :q or :quit to quit.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16951) Alternative implementation of NOT IN to Anti-join

2020-05-08 Thread linna shuang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-16951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102336#comment-17102336
 ] 

linna shuang commented on SPARK-16951:
--

In a TPC-H test, we hit a performance issue with Q16, which uses a NOT IN subquery 
that is translated into a broadcast nested loop join. This single query takes almost 
half of the total time of the 22 queries. For example, on a 512 GB data set, the 
total execution time is 1400 seconds, of which Q16 alone takes 630 seconds.

TPC-H is a common Spark SQL performance benchmark, so this performance issue will be 
hit frequently. Do you have a plan to reopen and fix this issue?
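
As a query-level workaround sketch (my addition, with generic table and column names; whether it applies to Q16 depends on the schema's NOT NULL constraints): rewriting the NOT IN as NOT EXISTS lets Spark plan a left anti join instead of the null-aware anti join that currently ends up as a broadcast nested loop join.

{code:scala}
// NOT IN over a possibly-null column requires null-aware anti-join semantics,
// which currently falls back to BroadcastNestedLoopJoin.
val slow = spark.sql("""
  SELECT * FROM parent p
  WHERE p.key NOT IN (SELECT s.key FROM sub s WHERE s.flag = 1)
""")

// Equivalent only when s.key and p.key are never NULL; plans as a left anti join.
val fast = spark.sql("""
  SELECT * FROM parent p
  WHERE NOT EXISTS (SELECT 1 FROM sub s WHERE s.key = p.key AND s.flag = 1)
""")
{code}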

> Alternative implementation of NOT IN to Anti-join
> -
>
> Key: SPARK-16951
> URL: https://issues.apache.org/jira/browse/SPARK-16951
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>Priority: Major
>
> A transformation currently used to process {{NOT IN}} subquery is to rewrite 
> to a form of Anti-join with null-aware property in the Logical Plan and then 
> translate to a form of {{OR}} predicate joining the parent side and the 
> subquery side of the {{NOT IN}}. As a result, the presence of {{OR}} 
> predicate is limited to the nested-loop join execution plan, which will have 
> a major performance implication if both sides' results are large.
> This JIRA sketches an idea of changing the OR predicate to a form similar to 
> the technique used in the implementation of the Existence join that addresses 
> the problem of {{EXISTS (..) OR ..}} type of queries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org