[jira] [Updated] (SPARK-25067) Active tasks does not match the total cores of an executor in WebUI

2018-08-10 Thread StanZhai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-25067:
-
Attachment: WX20180810-144212.png

> Active tasks does not match the total cores of an executor in WebUI
> ---
>
> Key: SPARK-25067
> URL: https://issues.apache.org/jira/browse/SPARK-25067
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.2, 2.3.0, 2.3.1
>Reporter: StanZhai
>Priority: Major
> Attachments: WX20180810-144212.png, WechatIMG1.jpeg
>
>







[jira] [Updated] (SPARK-25067) Active tasks does not match the total cores of an executor in WebUI

2018-08-10 Thread StanZhai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-25067:
-
Summary: Active tasks does not match the total cores of an executor in 
WebUI  (was: Active tasks exceed total cores of an executor in WebUI)

> Active tasks does not match the total cores of an executor in WebUI
> ---
>
> Key: SPARK-25067
> URL: https://issues.apache.org/jira/browse/SPARK-25067
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.2, 2.3.0, 2.3.1
>Reporter: StanZhai
>Priority: Major
> Attachments: WechatIMG1.jpeg
>
>







[jira] [Updated] (SPARK-25067) Active tasks exceed total cores of an executor in WebUI

2018-08-09 Thread StanZhai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-25067:
-
Attachment: (was: 1533128203469_2.png)

> Active tasks exceed total cores of an executor in WebUI
> ---
>
> Key: SPARK-25067
> URL: https://issues.apache.org/jira/browse/SPARK-25067
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.2, 2.3.0, 2.3.1
>Reporter: StanZhai
>Priority: Major
> Attachments: WechatIMG1.jpeg
>
>







[jira] [Updated] (SPARK-25067) Active tasks exceed total cores of an executor in WebUI

2018-08-09 Thread StanZhai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-25067:
-
Attachment: WechatIMG1.jpeg

> Active tasks exceed total cores of an executor in WebUI
> ---
>
> Key: SPARK-25067
> URL: https://issues.apache.org/jira/browse/SPARK-25067
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.2, 2.3.0, 2.3.1
>Reporter: StanZhai
>Priority: Major
> Attachments: 1533128203469_2.png, WechatIMG1.jpeg
>
>







[jira] [Updated] (SPARK-25067) Active tasks exceed total cores of an executor in WebUI

2018-08-09 Thread StanZhai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-25067:
-
Attachment: 1533128203469_2.png

> Active tasks exceed total cores of an executor in WebUI
> ---
>
> Key: SPARK-25067
> URL: https://issues.apache.org/jira/browse/SPARK-25067
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.2, 2.3.0, 2.3.1
>Reporter: StanZhai
>Priority: Major
> Attachments: 1533128203469_2.png
>
>







[jira] [Created] (SPARK-25067) Active tasks exceed total cores of an executor in WebUI

2018-08-09 Thread StanZhai (JIRA)
StanZhai created SPARK-25067:


 Summary: Active tasks exceed total cores of an executor in WebUI
 Key: SPARK-25067
 URL: https://issues.apache.org/jira/browse/SPARK-25067
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.3.1, 2.3.0, 2.2.2
Reporter: StanZhai









[jira] [Updated] (SPARK-25064) Total Tasks in WebUI does not match Active+Failed+Complete Tasks

2018-08-09 Thread StanZhai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-25064:
-
Attachment: 1533128402933_3.png

> Total Tasks in WebUI does not match Active+Failed+Complete Tasks
> 
>
> Key: SPARK-25064
> URL: https://issues.apache.org/jira/browse/SPARK-25064
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.2, 2.3.0, 2.3.1
>Reporter: StanZhai
>Priority: Minor
> Attachments: 1533128402933_3.png
>
>







[jira] [Created] (SPARK-25064) Total Tasks in WebUI does not match Active+Failed+Complete Tasks

2018-08-09 Thread StanZhai (JIRA)
StanZhai created SPARK-25064:


 Summary: Total Tasks in WebUI does not match 
Active+Failed+Complete Tasks
 Key: SPARK-25064
 URL: https://issues.apache.org/jira/browse/SPARK-25064
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.3.1, 2.3.0, 2.2.2
Reporter: StanZhai









[jira] [Updated] (SPARK-24704) The order of stages in the DAG graph is incorrect

2018-06-30 Thread StanZhai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-24704:
-
Attachment: WX20180630-161907.png

> The order of stages in the DAG graph is incorrect
> -
>
> Key: SPARK-24704
> URL: https://issues.apache.org/jira/browse/SPARK-24704
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0, 2.3.1
>Reporter: StanZhai
>Priority: Minor
>  Labels: regression
> Attachments: WX20180630-161907.png
>
>
> This regression was introduced in Spark 2.3.0.






[jira] [Created] (SPARK-24704) The order of stages in the DAG graph is incorrect

2018-06-30 Thread StanZhai (JIRA)
StanZhai created SPARK-24704:


 Summary: The order of stages in the DAG graph is incorrect
 Key: SPARK-24704
 URL: https://issues.apache.org/jira/browse/SPARK-24704
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.3.1, 2.3.0
Reporter: StanZhai


This regression was introduced in Spark 2.3.0.






[jira] [Created] (SPARK-24680) spark.executorEnv.JAVA_HOME does not take effect in Standalone mode

2018-06-28 Thread StanZhai (JIRA)
StanZhai created SPARK-24680:


 Summary: spark.executorEnv.JAVA_HOME does not take effect in 
Standalone mode
 Key: SPARK-24680
 URL: https://issues.apache.org/jira/browse/SPARK-24680
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 2.3.1, 2.2.1, 2.1.1
Reporter: StanZhai


spark.executorEnv.JAVA_HOME does not take effect when a Worker starts an 
Executor process in Standalone mode.
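
For illustration only, a minimal sketch (not from the original report) of how the setting is expected to be applied when building a Standalone-mode application; the master URL and JDK path are placeholders:

{code}
// Minimal sketch, not from the original report: the master URL and JDK path below are
// placeholders. The expectation is that the Worker exports this value as JAVA_HOME when
// it launches the Executor process, which is what does not happen in Standalone mode.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("spark://master:7077")                      // hypothetical Standalone master URL
  .appName("java-home-check")
  .config("spark.executorEnv.JAVA_HOME", "/opt/jdk8") // hypothetical JDK path on the workers
  .getOrCreate()

// Check which JVM the executors actually run with.
spark.range(1).rdd
  .map(_ => System.getProperty("java.home"))
  .collect()
  .foreach(println)
{code}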






[jira] [Updated] (SPARK-22084) Performance regression in aggregation strategy

2017-09-20 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-22084:
-
Labels: performance  (was: )

> Performance regression in aggregation strategy
> --
>
> Key: SPARK-22084
> URL: https://issues.apache.org/jira/browse/SPARK-22084
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.2.0
>Reporter: StanZhai
>  Labels: performance
>
> {code:sql}
> SELECT a, SUM(b) AS b0, SUM(b) AS b1 
> FROM VALUES(1, 1), (2, 2) AS (a, b) 
> GROUP BY a
> {code}
> There are two identical SUM(b) expressions in the SQL, and the following is the 
> physical plan in Spark 2.x.
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[a#11], functions=[sum(cast(b#12 as bigint)), sum(cast(b#12 as bigint))])
> +- Exchange hashpartitioning(a#11, 200)
>    +- *HashAggregate(keys=[a#11], functions=[partial_sum(cast(b#12 as bigint)), partial_sum(cast(b#12 as bigint))])
>       +- LocalTableScan [a#11, b#12]
> {code}
> The functions list in the partial HashAggregate should contain only one entry: functions=[partial_sum(cast(b#12 as bigint))]






[jira] [Updated] (SPARK-22084) Performance regression in aggregation strategy

2017-09-20 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-22084:
-
Description: 
{code:sql}
SELECT a, SUM(b) AS b0, SUM(b) AS b1 
FROM VALUES(1, 1), (2, 2) AS (a, b) 
GROUP BY a
{code}

There are two identical SUM(b) expressions in the SQL, and the following is the 
physical plan in Spark 2.x.

{code}
== Physical Plan ==
*HashAggregate(keys=[a#11], functions=[sum(cast(b#12 as bigint)), sum(cast(b#12 as bigint))])
+- Exchange hashpartitioning(a#11, 200)
   +- *HashAggregate(keys=[a#11], functions=[partial_sum(cast(b#12 as bigint)), partial_sum(cast(b#12 as bigint))])
      +- LocalTableScan [a#11, b#12]
{code}

The functions list in the partial HashAggregate should contain only one entry: functions=[partial_sum(cast(b#12 as bigint))]

  was:
{code:sql}
SELECT a, SUM(b) AS b0, SUM(b) AS b1 
FROM VALUES(1, 1), (2, 2) AS (a, b) 
GROUP BY a
{code}

There are two identical SUM(b) expressions in the SQL, and the following is the 
physical plan in Spark 2.x.

== Physical Plan ==
*HashAggregate(keys=[a#11], functions=[sum(cast(b#12 as bigint)), sum(cast(b#12 as bigint))])
+- Exchange hashpartitioning(a#11, 200)
   +- *HashAggregate(keys=[a#11], functions=[partial_sum(cast(b#12 as bigint)), partial_sum(cast(b#12 as bigint))])
      +- LocalTableScan [a#11, b#12]

The functions list in the partial HashAggregate should contain only one entry: functions=[partial_sum(cast(b#12 as bigint))]


> Performance regression in aggregation strategy
> --
>
> Key: SPARK-22084
> URL: https://issues.apache.org/jira/browse/SPARK-22084
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.2.0
>Reporter: StanZhai
>
> {code:sql}
> SELECT a, SUM(b) AS b0, SUM(b) AS b1 
> FROM VALUES(1, 1), (2, 2) AS (a, b) 
> GROUP BY a
> {code}
> There are two identical SUM(b) expressions in the SQL, and the following is the 
> physical plan in Spark 2.x.
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[a#11], functions=[sum(cast(b#12 as bigint)), sum(cast(b#12 as bigint))])
> +- Exchange hashpartitioning(a#11, 200)
>    +- *HashAggregate(keys=[a#11], functions=[partial_sum(cast(b#12 as bigint)), partial_sum(cast(b#12 as bigint))])
>       +- LocalTableScan [a#11, b#12]
> {code}
> The functions list in the partial HashAggregate should contain only one entry: functions=[partial_sum(cast(b#12 as bigint))]






[jira] [Created] (SPARK-22084) Performance regression in aggregation strategy

2017-09-20 Thread StanZhai (JIRA)
StanZhai created SPARK-22084:


 Summary: Performance regression in aggregation strategy
 Key: SPARK-22084
 URL: https://issues.apache.org/jira/browse/SPARK-22084
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0, 2.1.1, 2.1.0, 2.0.2, 2.0.1, 2.0.0
Reporter: StanZhai


{code:sql}
SELECT a, SUM(b) AS b0, SUM(b) AS b1 
FROM VALUES(1, 1), (2, 2) AS (a, b) 
GROUP BY a
{code}

There are two identical SUM(b) expressions in the SQL, and the following is the 
physical plan in Spark 2.x.

== Physical Plan ==
*HashAggregate(keys=[a#11], functions=[sum(cast(b#12 as bigint)), sum(cast(b#12 as bigint))])
+- Exchange hashpartitioning(a#11, 200)
   +- *HashAggregate(keys=[a#11], functions=[partial_sum(cast(b#12 as bigint)), partial_sum(cast(b#12 as bigint))])
      +- LocalTableScan [a#11, b#12]

The functions list in the partial HashAggregate should contain only one entry: functions=[partial_sum(cast(b#12 as bigint))]
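
For reference, a minimal spark-shell sketch (assuming Spark 2.x) that reproduces the query above and makes the duplicated partial_sum visible in the plan:

{code}
// Minimal spark-shell sketch (Spark 2.x): reproduce the query above and inspect the plan.
// The partial HashAggregate lists partial_sum(cast(b as bigint)) twice, although the two
// SUM(b) expressions are identical and could be computed once.
val df = spark.sql(
  """SELECT a, SUM(b) AS b0, SUM(b) AS b1
    |FROM VALUES (1, 1), (2, 2) AS (a, b)
    |GROUP BY a""".stripMargin)

df.explain()
{code}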






[jira] [Commented] (SPARK-21774) The rule PromoteStrings cast string to a wrong data type

2017-08-18 Thread StanZhai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131921#comment-16131921
 ] 

StanZhai commented on SPARK-21774:
--

I've opened a PR (https://github.com/apache/spark/pull/18986) for this issue. 
The PR has not been automatically linked to this JIRA yet.

> The rule PromoteStrings cast string to a wrong data type
> 
>
> Key: SPARK-21774
> URL: https://issues.apache.org/jira/browse/SPARK-21774
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: StanZhai
>Priority: Critical
>  Labels: correctness
>
> Data
> {code}
> create temporary view tb as select * from values
> ("0", 1),
> ("-0.1", 2),
> ("1", 3)
> as grouping(a, b)
> {code}
> SQL:
> {code}
> select a, b from tb where a=0
> {code}
> The result which is wrong:
> {code}
> ++---+
> |   a|  b|
> ++---+
> |   0|  1|
> |-0.1|  2|
> ++---+
> {code}
> Logical Plan:
> {code}
> == Parsed Logical Plan ==
> 'Project ['a]
> +- 'Filter ('a = 0)
>+- 'UnresolvedRelation `src`
> == Analyzed Logical Plan ==
> a: string
> Project [a#8528]
> +- Filter (cast(a#8528 as int) = 0)
>+- SubqueryAlias src
>   +- Project [_1#8525 AS a#8528, _2#8526 AS b#8529]
>  +- LocalRelation [_1#8525, _2#8526]
> {code}






[jira] [Updated] (SPARK-21774) The rule PromoteStrings cast string to a wrong data type

2017-08-17 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-21774:
-
External issue URL:   (was: https://github.com/apache/spark/pull/18986)

> The rule PromoteStrings cast string to a wrong data type
> 
>
> Key: SPARK-21774
> URL: https://issues.apache.org/jira/browse/SPARK-21774
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: StanZhai
>Priority: Critical
>  Labels: correctness
>
> Data
> {code}
> create temporary view tb as select * from values
> ("0", 1),
> ("-0.1", 2),
> ("1", 3)
> as grouping(a, b)
> {code}
> SQL:
> {code}
> select a, b from tb where a=0
> {code}
> The result which is wrong:
> {code}
> ++---+
> |   a|  b|
> ++---+
> |   0|  1|
> |-0.1|  2|
> ++---+
> {code}
> Logical Plan:
> {code}
> == Parsed Logical Plan ==
> 'Project ['a]
> +- 'Filter ('a = 0)
>+- 'UnresolvedRelation `src`
> == Analyzed Logical Plan ==
> a: string
> Project [a#8528]
> +- Filter (cast(a#8528 as int) = 0)
>+- SubqueryAlias src
>   +- Project [_1#8525 AS a#8528, _2#8526 AS b#8529]
>  +- LocalRelation [_1#8525, _2#8526]
> {code}






[jira] [Updated] (SPARK-21774) The rule PromoteStrings cast string to a wrong data type

2017-08-17 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-21774:
-
External issue URL: https://github.com/apache/spark/pull/18986

> The rule PromoteStrings cast string to a wrong data type
> 
>
> Key: SPARK-21774
> URL: https://issues.apache.org/jira/browse/SPARK-21774
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: StanZhai
>Priority: Critical
>  Labels: correctness
>
> Data
> {code}
> create temporary view tb as select * from values
> ("0", 1),
> ("-0.1", 2),
> ("1", 3)
> as grouping(a, b)
> {code}
> SQL:
> {code}
> select a, b from tb where a=0
> {code}
> The result which is wrong:
> {code}
> ++---+
> |   a|  b|
> ++---+
> |   0|  1|
> |-0.1|  2|
> ++---+
> {code}
> Logical Plan:
> {code}
> == Parsed Logical Plan ==
> 'Project ['a]
> +- 'Filter ('a = 0)
>+- 'UnresolvedRelation `src`
> == Analyzed Logical Plan ==
> a: string
> Project [a#8528]
> +- Filter (cast(a#8528 as int) = 0)
>+- SubqueryAlias src
>   +- Project [_1#8525 AS a#8528, _2#8526 AS b#8529]
>  +- LocalRelation [_1#8525, _2#8526]
> {code}






[jira] [Updated] (SPARK-21774) The rule PromoteStrings cast string to a wrong data type

2017-08-17 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-21774:
-
External issue ID:   (was: SPARK-21646)

> The rule PromoteStrings cast string to a wrong data type
> 
>
> Key: SPARK-21774
> URL: https://issues.apache.org/jira/browse/SPARK-21774
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: StanZhai
>Priority: Critical
>  Labels: correctness
>
> Data
> {code}
> create temporary view tb as select * from values
> ("0", 1),
> ("-0.1", 2),
> ("1", 3)
> as grouping(a, b)
> {code}
> SQL:
> {code}
> select a, b from tb where a=0
> {code}
> The result which is wrong:
> {code}
> ++---+
> |   a|  b|
> ++---+
> |   0|  1|
> |-0.1|  2|
> ++---+
> {code}
> Logical Plan:
> {code}
> == Parsed Logical Plan ==
> 'Project ['a]
> +- 'Filter ('a = 0)
>+- 'UnresolvedRelation `src`
> == Analyzed Logical Plan ==
> a: string
> Project [a#8528]
> +- Filter (cast(a#8528 as int) = 0)
>+- SubqueryAlias src
>   +- Project [_1#8525 AS a#8528, _2#8526 AS b#8529]
>  +- LocalRelation [_1#8525, _2#8526]
> {code}






[jira] [Updated] (SPARK-21774) The rule PromoteStrings cast string to a wrong data type

2017-08-17 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-21774:
-
Description: 
Data
{code}
create temporary view tb as select * from values
("0", 1),
("-0.1", 2),
("1", 3)
as grouping(a, b)
{code}

SQL:
{code}
select a, b from tb where a=0
{code}

The result, which is wrong:
{code}
+----+---+
|   a|  b|
+----+---+
|   0|  1|
|-0.1|  2|
+----+---+
{code}

Logical Plan:
{code}
== Parsed Logical Plan ==
'Project ['a]
+- 'Filter ('a = 0)
   +- 'UnresolvedRelation `src`

== Analyzed Logical Plan ==
a: string
Project [a#8528]
+- Filter (cast(a#8528 as int) = 0)
   +- SubqueryAlias src
  +- Project [_1#8525 AS a#8528, _2#8526 AS b#8529]
 +- LocalRelation [_1#8525, _2#8526]
{code}

  was:
Data
{code}
create temporary view tb as select * from values
("0", 1),
("-0.1", 2),
("1", 3)
as grouping(a, b)
{code}

SQL:
{code}
select a, b from tb where a=0
{code}

The result which is wrong:
{code}
++---+
|   a|  b|
++---+
|   0|  1|
|-0.1|  2|
++---+
{code}


> The rule PromoteStrings cast string to a wrong data type
> 
>
> Key: SPARK-21774
> URL: https://issues.apache.org/jira/browse/SPARK-21774
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: StanZhai
>Priority: Critical
>  Labels: correctness
>
> Data
> {code}
> create temporary view tb as select * from values
> ("0", 1),
> ("-0.1", 2),
> ("1", 3)
> as grouping(a, b)
> {code}
> SQL:
> {code}
> select a, b from tb where a=0
> {code}
> The result which is wrong:
> {code}
> ++---+
> |   a|  b|
> ++---+
> |   0|  1|
> |-0.1|  2|
> ++---+
> {code}
> Logical Plan:
> {code}
> == Parsed Logical Plan ==
> 'Project ['a]
> +- 'Filter ('a = 0)
>+- 'UnresolvedRelation `src`
> == Analyzed Logical Plan ==
> a: string
> Project [a#8528]
> +- Filter (cast(a#8528 as int) = 0)
>+- SubqueryAlias src
>   +- Project [_1#8525 AS a#8528, _2#8526 AS b#8529]
>  +- LocalRelation [_1#8525, _2#8526]
> {code}






[jira] [Created] (SPARK-21774) The rule PromoteStrings cast string to a wrong data type

2017-08-17 Thread StanZhai (JIRA)
StanZhai created SPARK-21774:


 Summary: The rule PromoteStrings cast string to a wrong data type
 Key: SPARK-21774
 URL: https://issues.apache.org/jira/browse/SPARK-21774
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: StanZhai
Priority: Critical


Data
{code}
create temporary view tb as select * from values
("0", 1),
("-0.1", 2),
("1", 3)
as grouping(a, b)
{code}

SQL:
{code}
select a, b from tb where a=0
{code}

The result, which is wrong:
{code}
+----+---+
|   a|  b|
+----+---+
|   0|  1|
|-0.1|  2|
+----+---+
{code}
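
For reference, a minimal spark-shell sketch (not part of the original report) of a workaround: comparing through an explicit double cast instead of relying on the implicit cast chosen by PromoteStrings:

{code}
// Minimal sketch of a workaround, not a fix: compare via an explicit double cast instead
// of the implicit string-to-int cast added by PromoteStrings.
spark.sql("""create temporary view tb as select * from values
  ("0", 1),
  ("-0.1", 2),
  ("1", 3)
  as grouping(a, b)""")

// With the explicit cast, only the row where a = "0" should be returned.
spark.sql("select a, b from tb where cast(a as double) = 0").show()
{code}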






[jira] [Created] (SPARK-21758) `SHOW TBLPROPERTIES` can not get properties start with spark.sql.*

2017-08-16 Thread StanZhai (JIRA)
StanZhai created SPARK-21758:


 Summary: `SHOW TBLPROPERTIES` can not get properties start with 
spark.sql.*
 Key: SPARK-21758
 URL: https://issues.apache.org/jira/browse/SPARK-21758
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0, 2.1.1, 2.1.0
Reporter: StanZhai
Priority: Critical


SQL: SHOW TBLPROPERTIES test_tb("spark.sql.sources.schema.numParts")
Exception: Table test_db.test.tb does not have property: 
spark.sql.sources.schema.numParts

The `spark.sql.sources.schema.numParts` property does exist in the Hive 
Metastore.
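
For reference, a minimal sketch of the failing statement (the table name is taken from the report above; the error shown is the reported behaviour, not verified here):

{code}
// Minimal sketch of the reported statement; the table name comes from the report above.
// The property is present in the Hive Metastore, yet the per-key lookup fails:
spark.sql("""SHOW TBLPROPERTIES test_tb("spark.sql.sources.schema.numParts")""").show(false)
// Reported error: Table ... does not have property: spark.sql.sources.schema.numParts
{code}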






[jira] [Commented] (SPARK-21318) The exception message thrown by `lookupFunction` is ambiguous.

2017-07-05 Thread StanZhai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16074840#comment-16074840
 ] 

StanZhai commented on SPARK-21318:
--

Yes.
It has been registered in the `functionRegistry`, but it's not available for 
use.

> The exception message thrown by `lookupFunction` is ambiguous.
> --
>
> Key: SPARK-21318
> URL: https://issues.apache.org/jira/browse/SPARK-21318
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1
>Reporter: StanZhai
>Priority: Minor
>
> The function actually exists in current selected database, but the exception 
> message is: 
> {code}
> This function is neither a registered temporary function nor a permanent 
> function registered in the database 'default'.
> {code}
> My UDF has already been registered in the current database. But it's failed 
> to init during lookupFunction. 
> The exception message should be:
> {code}
> No handler for Hive UDF 'site.stanzhai.UDAFXXX': 
> org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException: Two arguments is 
> expected
> {code}
> This is not conducive to positioning problems.






[jira] [Updated] (SPARK-21318) The exception message thrown by `lookupFunction` is ambiguous.

2017-07-05 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-21318:
-
Description: 
The function actually exists in the currently selected database, but the exception 
message is: 
{code}
This function is neither a registered temporary function nor a permanent 
function registered in the database 'default'.
{code}

My UDF has already been registered in the current database, but it failed to 
initialize during lookupFunction. 

The exception message should be:
{code}
No handler for Hive UDF 'site.stanzhai.UDAFXXX': 
org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException: Two arguments is 
expected
{code}

This makes it difficult to locate the root cause of the problem.

  was:
The function actually exists, but the exception message is: 
{code}
This function is neither a registered temporary function nor a permanent 
function registered in the database 'default'.
{code}

My UDF has already been registered in the current database. But it's failed to 
init during lookupFunction. 

The exception message should be:
{code}
No handler for Hive UDF 'site.stanzhai.UDAFXXX': 
org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException: Two arguments is 
expected
{code}

This is not conducive to positioning problems.


> The exception message thrown by `lookupFunction` is ambiguous.
> --
>
> Key: SPARK-21318
> URL: https://issues.apache.org/jira/browse/SPARK-21318
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1
>Reporter: StanZhai
>Priority: Minor
>
> The function actually exists in current selected database, but the exception 
> message is: 
> {code}
> This function is neither a registered temporary function nor a permanent 
> function registered in the database 'default'.
> {code}
> My UDF has already been registered in the current database. But it's failed 
> to init during lookupFunction. 
> The exception message should be:
> {code}
> No handler for Hive UDF 'site.stanzhai.UDAFXXX': 
> org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException: Two arguments is 
> expected
> {code}
> This is not conducive to positioning problems.






[jira] [Updated] (SPARK-21318) The exception message thrown by `lookupFunction` is ambiguous.

2017-07-05 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-21318:
-
Description: 
The function actually exists, but the exception message is: 
{code}
This function is neither a registered temporary function nor a permanent 
function registered in the database 'default'.
{code}

My UDF has already been registered in the current database. But it's failed to 
init during lookupFunction. 

The exception message should be:
{code}
No handler for Hive UDF 'site.stanzhai.UDAFXXX': 
org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException: Two arguments is 
expected
{code}

This is not conducive to positioning problems.

  was:
The function actually exists, but the exception message is: 
{code}
This function is neither a registered temporary function nor a permanent 
function registered in the database 'default'.
{code}

My UDF has already been registered in the current database. But it's failed to 
init during lookupFunction. 

The exception message should be:
{code}
No handler for Hive UDF 'site.stanzhai.UDAFXXX': 
org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException: Two arguments is 
expected
{code}


> The exception message thrown by `lookupFunction` is ambiguous.
> --
>
> Key: SPARK-21318
> URL: https://issues.apache.org/jira/browse/SPARK-21318
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1
>Reporter: StanZhai
>Priority: Minor
>
> The function actually exists, but the exception message is: 
> {code}
> This function is neither a registered temporary function nor a permanent 
> function registered in the database 'default'.
> {code}
> My UDF has already been registered in the current database. But it's failed 
> to init during lookupFunction. 
> The exception message should be:
> {code}
> No handler for Hive UDF 'site.stanzhai.UDAFXXX': 
> org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException: Two arguments is 
> expected
> {code}
> This is not conducive to positioning problems.






[jira] [Created] (SPARK-21318) The exception message thrown by `lookupFunction` is ambiguous.

2017-07-05 Thread StanZhai (JIRA)
StanZhai created SPARK-21318:


 Summary: The exception message thrown by `lookupFunction` is 
ambiguous.
 Key: SPARK-21318
 URL: https://issues.apache.org/jira/browse/SPARK-21318
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1, 2.1.0, 2.0.2, 2.0.1, 2.0.0
Reporter: StanZhai
Priority: Minor


The function actually exists, but the exception message is: 
{code}
This function is neither a registered temporary function nor a permanent 
function registered in the database 'default'.
{code}

My UDF has already been registered in the current database, but it failed to 
initialize during lookupFunction. 

The exception message should be:
{code}
No handler for Hive UDF 'site.stanzhai.UDAFXXX': 
org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException: Two arguments is 
expected
{code}
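
For illustration, a minimal sketch of how this situation can arise (the class name is the reporter's placeholder; the function, table, and column names below are hypothetical):

{code}
// Minimal, hypothetical sketch: register the UDAF as a permanent function, then call it
// with the wrong number of arguments. Initialization fails inside lookupFunction, but the
// error surfaces as "function is neither a registered temporary function nor a permanent
// function" instead of the underlying UDFArgumentLengthException.
spark.sql("CREATE FUNCTION my_udaf AS 'site.stanzhai.UDAFXXX'")   // my_udaf is a hypothetical name
spark.sql("SELECT my_udaf(col1) FROM some_table").show()          // hypothetical table/column
{code}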






[jira] [Updated] (SPARK-18683) REST APIs for standalone Master、Workers and Applications

2017-05-15 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-18683:
-
Summary: REST APIs for standalone Master、Workers and Applications  (was: 
REST APIs for standalone Master and Workers)

> REST APIs for standalone Master、Workers and Applications
> 
>
> Key: SPARK-18683
> URL: https://issues.apache.org/jira/browse/SPARK-18683
> Project: Spark
>  Issue Type: Improvement
>Reporter: Shixiong Zhu
>
> It would be great to have some REST APIs for accessing Master, Workers, and 
> Applications information. Right now the only way to get it is through the Web 
> UI.






[jira] [Updated] (SPARK-18683) REST APIs for standalone Master and Workers

2017-05-15 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-18683:
-
Description: It would be great to have some REST APIs for accessing Master, 
Workers, and Applications information. Right now the only way to get it is 
through the Web UI.  (was: It would be great that we have some REST APIs to 
access Master and Workers information. Right now the only way to get them is 
using the Web UI.)

> REST APIs for standalone Master and Workers
> ---
>
> Key: SPARK-18683
> URL: https://issues.apache.org/jira/browse/SPARK-18683
> Project: Spark
>  Issue Type: Improvement
>Reporter: Shixiong Zhu
>
> It would be great to have some REST APIs for accessing Master, Workers, and 
> Applications information. Right now the only way to get it is through the Web 
> UI.






[jira] [Commented] (SPARK-20211) `1 > 0.0001` throws Decimal scale (0) cannot be greater than precision (-2) exception

2017-04-04 Thread StanZhai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15955285#comment-15955285
 ] 

StanZhai commented on SPARK-20211:
--

A workaround is difficult for me, because all of my SQL is generated by a 
higher-level system and I cannot cast every column to double.
FLOOR and CEIL are frequently used functions, and not all users will report 
this problem to the community when they encounter it.
We should pay attention to the correctness of the SQL.

> `1 > 0.0001` throws Decimal scale (0) cannot be greater than precision (-2) 
> exception
> -
>
> Key: SPARK-20211
> URL: https://issues.apache.org/jira/browse/SPARK-20211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1
>Reporter: StanZhai
>  Labels: correctness
>
> The following SQL:
> {code}
> select 1 > 0.0001 from tb
> {code}
> throws Decimal scale (0) cannot be greater than precision (-2) exception in 
> Spark 2.x.
> `floor(0.0001)` and `ceil(0.0001)` have the same problem in Spark 1.6.x and 
> Spark 2.x.






[jira] [Updated] (SPARK-20211) `1 > 0.0001` throws Decimal scale (0) cannot be greater than precision (-2) exception

2017-04-04 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-20211:
-
Priority: Major  (was: Minor)

> `1 > 0.0001` throws Decimal scale (0) cannot be greater than precision (-2) 
> exception
> -
>
> Key: SPARK-20211
> URL: https://issues.apache.org/jira/browse/SPARK-20211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1
>Reporter: StanZhai
>  Labels: correctness
>
> The following SQL:
> {code}
> select 1 > 0.0001 from tb
> {code}
> throws Decimal scale (0) cannot be greater than precision (-2) exception in 
> Spark 2.x.
> `floor(0.0001)` and `ceil(0.0001)` have the same problem in Spark 1.6.x and 
> Spark 2.x.






[jira] [Created] (SPARK-20211) `1 > 0.0001` throws Decimal scale (0) cannot be greater than precision (-2) exception

2017-04-04 Thread StanZhai (JIRA)
StanZhai created SPARK-20211:


 Summary: `1 > 0.0001` throws Decimal scale (0) cannot be greater 
than precision (-2) exception
 Key: SPARK-20211
 URL: https://issues.apache.org/jira/browse/SPARK-20211
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0, 2.0.2, 2.0.1, 2.0.0, 2.1.1
Reporter: StanZhai
Priority: Critical


The following SQL:
{code}
select 1 > 0.0001 from tb
{code}
throws Decimal scale (0) cannot be greater than precision (-2) exception in 
Spark 2.x.

`floor(0.0001)` and `ceil(0.0001)` have the same problem in Spark 1.6.x and 
Spark 2.x.
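
For reference, a minimal sketch (assuming a table `tb` exists) contrasting the failing comparison with a possible workaround that avoids the Decimal promotion; this follows the "cast to double" idea mentioned in the comments and is not a fix:

{code}
// Minimal sketch; `tb` is assumed to exist. The first query is the one reported to throw
// "Decimal scale (0) cannot be greater than precision (-2)"; casting the literal to double
// is a possible workaround because it avoids the Decimal type promotion.
spark.sql("select 1 > 0.0001 from tb").show()                  // reported to fail in Spark 2.x
spark.sql("select 1 > cast(0.0001 as double) from tb").show()  // possible workaround
{code}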






[jira] [Closed] (SPARK-19532) [Core]`DataStreamer for file` threads of DFSOutputStream leak if set `spark.speculation` to true

2017-04-03 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai closed SPARK-19532.

Resolution: Fixed

> [Core]`DataStreamer for file` threads of DFSOutputStream leak if set 
> `spark.speculation` to true
> 
>
> Key: SPARK-19532
> URL: https://issues.apache.org/jira/browse/SPARK-19532
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Priority: Critical
>
> When set `spark.speculation` to true, from thread dump page of Executor of 
> WebUI, I found that there are about 1300 threads named  "DataStreamer for 
> file 
> /test/data/test_temp/_temporary/0/_temporary/attempt_20170207172435_80750_m_69_1/part-00069-690407af-0900-46b1-9590-a6d6c696fe68.snappy.parquet"
>  in TIMED_WAITING state.
> {code}
> java.lang.Object.wait(Native Method)
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:564)
> {code}
> The off-heap memory exceeds a lot until Executor exited with OOM exception. 
> This problem occurs only when writing data to the Hadoop(tasks may be killed 
> by Executor during writing).
> Could this be related to [https://issues.apache.org/jira/browse/HDFS-9812]? 
> The version of Hadoop is 2.6.4.






[jira] [Updated] (SPARK-19766) INNER JOIN on constant alias columns return incorrect results

2017-02-28 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-19766:
-
Summary: INNER JOIN on constant alias columns return incorrect results  
(was: INNER JOIN on constant alias columns returns incorrect results)

> INNER JOIN on constant alias columns return incorrect results
> -
>
> Key: SPARK-19766
> URL: https://issues.apache.org/jira/browse/SPARK-19766
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Priority: Critical
>
> We can demonstrate the problem with the following data set and query:
> {code}
> val spark = 
> SparkSession.builder().appName("test").master("local").getOrCreate()
> val sql1 =
>   """
> |create temporary view t1 as select * from values
> |(1)
> |as grouping(a)
>   """.stripMargin
> val sql2 =
>   """
> |create temporary view t2 as select * from values
> |(1)
> |as grouping(a)
>   """.stripMargin
> val sql3 =
>   """
> |create temporary view t3 as select * from values
> |(1),
> |(1)
> |as grouping(a)
>   """.stripMargin
> val sql4 =
>   """
> |create temporary view t4 as select * from values
> |(1),
> |(1)
> |as grouping(a)
>   """.stripMargin
> val sqlA =
>   """
> |create temporary view ta as
> |select a, 'a' as tag from t1 union all
> |select a, 'b' as tag from t2
>   """.stripMargin
> val sqlB =
>   """
> |create temporary view tb as
> |select a, 'a' as tag from t3 union all
> |select a, 'b' as tag from t4
>   """.stripMargin
> val sql =
>   """
> |select tb.* from ta inner join tb on
> |ta.a = tb.a and
> |ta.tag = tb.tag
>   """.stripMargin
> spark.sql(sql1)
> spark.sql(sql2)
> spark.sql(sql3)
> spark.sql(sql4)
> spark.sql(sqlA)
> spark.sql(sqlB)
> spark.sql(sql).show()
> {code}
> The results which is incorrect:
> {code}
> +---+---+
> |  a|tag|
> +---+---+
> |  1|  b|
> |  1|  b|
> |  1|  a|
> |  1|  a|
> |  1|  b|
> |  1|  b|
> |  1|  a|
> |  1|  a|
> +---+---+
> {code}
> The correct results should be:
> {code}
> +---+---+
> |  a|tag|
> +---+---+
> |  1|  a|
> |  1|  a|
> |  1|  b|
> |  1|  b|
> +---+---+
> {code}






[jira] [Created] (SPARK-19766) INNER JOIN on constant alias columns returns incorrect results

2017-02-28 Thread StanZhai (JIRA)
StanZhai created SPARK-19766:


 Summary: INNER JOIN on constant alias columns returns incorrect 
results
 Key: SPARK-19766
 URL: https://issues.apache.org/jira/browse/SPARK-19766
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: StanZhai
Priority: Critical


We can demonstrate the problem with the following data set and query:

{code}
val spark = SparkSession.builder().appName("test").master("local").getOrCreate()

val sql1 =
  """
|create temporary view t1 as select * from values
|(1)
|as grouping(a)
  """.stripMargin

val sql2 =
  """
|create temporary view t2 as select * from values
|(1)
|as grouping(a)
  """.stripMargin

val sql3 =
  """
|create temporary view t3 as select * from values
|(1),
|(1)
|as grouping(a)
  """.stripMargin

val sql4 =
  """
|create temporary view t4 as select * from values
|(1),
|(1)
|as grouping(a)
  """.stripMargin

val sqlA =
  """
|create temporary view ta as
|select a, 'a' as tag from t1 union all
|select a, 'b' as tag from t2
  """.stripMargin

val sqlB =
  """
|create temporary view tb as
|select a, 'a' as tag from t3 union all
|select a, 'b' as tag from t4
  """.stripMargin

val sql =
  """
|select tb.* from ta inner join tb on
|ta.a = tb.a and
|ta.tag = tb.tag
  """.stripMargin

spark.sql(sql1)
spark.sql(sql2)
spark.sql(sql3)
spark.sql(sql4)
spark.sql(sqlA)
spark.sql(sqlB)
spark.sql(sql).show()
{code}

The results, which are incorrect:

{code}
+---+---+
|  a|tag|
+---+---+
|  1|  b|
|  1|  b|
|  1|  a|
|  1|  a|
|  1|  b|
|  1|  b|
|  1|  a|
|  1|  a|
+---+---+
{code}


The correct results should be:

{code}
+---+---+
|  a|tag|
+---+---+
|  1|  a|
|  1|  a|
|  1|  b|
|  1|  b|
+---+---+
{code}






[jira] [Updated] (SPARK-19622) Fix a http error in a paged table when using a `Go` button to search.

2017-02-15 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-19622:
-
Attachment: screenshot-1.png

> Fix a http error in a paged table when using a `Go` button to search.
> -
>
> Key: SPARK-19622
> URL: https://issues.apache.org/jira/browse/SPARK-19622
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> The search function of the paged table is not available because we don't skip 
> the hash data of the request path. 






[jira] [Created] (SPARK-19622) Fix a http error in a paged table when using a `Go` button to search.

2017-02-15 Thread StanZhai (JIRA)
StanZhai created SPARK-19622:


 Summary: Fix a http error in a paged table when using a `Go` 
button to search.
 Key: SPARK-19622
 URL: https://issues.apache.org/jira/browse/SPARK-19622
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.1.0
Reporter: StanZhai
Priority: Minor


The search function of the paged table is not available because we don't skip 
the hash data of the request path. 






[jira] [Commented] (SPARK-19532) [Core]`DataStreamer for file` threads of DFSOutputStream leak if set `spark.speculation` to true

2017-02-13 Thread StanZhai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15864953#comment-15864953
 ] 

StanZhai commented on SPARK-19532:
--

I can reproduce this by splitting our online data onto the production test cluster 
using our Spark application. 
Our application is a web service; SQL job requests are handled concurrently by 
it (like the hive-thriftserver).
It's really a bit difficult to reproduce in a development environment.

> [Core]`DataStreamer for file` threads of DFSOutputStream leak if set 
> `spark.speculation` to true
> 
>
> Key: SPARK-19532
> URL: https://issues.apache.org/jira/browse/SPARK-19532
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Priority: Critical
>
> When set `spark.speculation` to true, from thread dump page of Executor of 
> WebUI, I found that there are about 1300 threads named  "DataStreamer for 
> file 
> /test/data/test_temp/_temporary/0/_temporary/attempt_20170207172435_80750_m_69_1/part-00069-690407af-0900-46b1-9590-a6d6c696fe68.snappy.parquet"
>  in TIMED_WAITING state.
> {code}
> java.lang.Object.wait(Native Method)
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:564)
> {code}
> The off-heap memory exceeds a lot until Executor exited with OOM exception. 
> This problem occurs only when writing data to the Hadoop(tasks may be killed 
> by Executor during writing).
> Could this be related to [https://issues.apache.org/jira/browse/HDFS-9812]? 
> The version of Hadoop is 2.6.4.






[jira] [Updated] (SPARK-19532) [Core]`DataStreamer for file` threads of DFSOutputStream leak if set `spark.speculation` to true

2017-02-13 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-19532:
-
Description: 
When set `spark.speculation` to true, from thread dump page of Executor of 
WebUI, I found that there are about 1300 threads named  "DataStreamer for file 
/test/data/test_temp/_temporary/0/_temporary/attempt_20170207172435_80750_m_69_1/part-00069-690407af-0900-46b1-9590-a6d6c696fe68.snappy.parquet"
 in TIMED_WAITING state.

{code}
java.lang.Object.wait(Native Method)
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:564)
{code}

Off-heap memory usage grows significantly until the Executor exits with an OOM exception. 

This problem occurs only when writing data to Hadoop (tasks may be killed by the 
Executor during writing).

Could this be related to [https://issues.apache.org/jira/browse/HDFS-9812]? 

The version of Hadoop is 2.6.4.

  was:
When set `spark.speculation` to true, from thread dump page of Executor of 
WebUI, I found that there are about 1300 threads named  "DataStreamer for file 
/test/data/test_temp/_temporary/0/_temporary/attempt_20170207172435_80750_m_69_1/part-00069-690407af-0900-46b1-9590-a6d6c696fe68.snappy.parquet"
 in TIMED_WAITING state.

{code}
java.lang.Object.wait(Native Method)
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:564)
{code}

The off-heap memory exceeds a lot until Executor exited with OOM exception. 

This problem occurs only when writing data to the Hadoop(tasks may be killed by 
Executor during writing).



> [Core]`DataStreamer for file` threads of DFSOutputStream leak if set 
> `spark.speculation` to true
> 
>
> Key: SPARK-19532
> URL: https://issues.apache.org/jira/browse/SPARK-19532
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Priority: Blocker
>
> When set `spark.speculation` to true, from thread dump page of Executor of 
> WebUI, I found that there are about 1300 threads named  "DataStreamer for 
> file 
> /test/data/test_temp/_temporary/0/_temporary/attempt_20170207172435_80750_m_69_1/part-00069-690407af-0900-46b1-9590-a6d6c696fe68.snappy.parquet"
>  in TIMED_WAITING state.
> {code}
> java.lang.Object.wait(Native Method)
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:564)
> {code}
> The off-heap memory exceeds a lot until Executor exited with OOM exception. 
> This problem occurs only when writing data to the Hadoop(tasks may be killed 
> by Executor during writing).
> Could this be related to [https://issues.apache.org/jira/browse/HDFS-9812]? 
> The version of Hadoop is 2.6.4.






[jira] [Commented] (SPARK-19532) [Core]`DataStreamer for file` threads of DFSOutputStream leak if set `spark.speculation` to true

2017-02-11 Thread StanZhai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862625#comment-15862625
 ] 

StanZhai commented on SPARK-19532:
--

We have been trying to upgrade our Spark since the release of Spark 2.1.0.
This version is not usable for us because of these memory problems.

> [Core]`DataStreamer for file` threads of DFSOutputStream leak if set 
> `spark.speculation` to true
> 
>
> Key: SPARK-19532
> URL: https://issues.apache.org/jira/browse/SPARK-19532
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Priority: Blocker
>
> When set `spark.speculation` to true, from thread dump page of Executor of 
> WebUI, I found that there are about 1300 threads named  "DataStreamer for 
> file 
> /test/data/test_temp/_temporary/0/_temporary/attempt_20170207172435_80750_m_69_1/part-00069-690407af-0900-46b1-9590-a6d6c696fe68.snappy.parquet"
>  in TIMED_WAITING state.
> {code}
> java.lang.Object.wait(Native Method)
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:564)
> {code}
> The off-heap memory exceeds a lot until Executor exited with OOM exception. 
> This problem occurs only when writing data to the Hadoop(tasks may be killed 
> by Executor during writing).






[jira] [Created] (SPARK-19532) [Core]`DataStreamer for file` threads of DFSOutputStream leak if set `spark.speculation` to true

2017-02-09 Thread StanZhai (JIRA)
StanZhai created SPARK-19532:


 Summary: [Core]`DataStreamer for file` threads of DFSOutputStream 
leak if set `spark.speculation` to true
 Key: SPARK-19532
 URL: https://issues.apache.org/jira/browse/SPARK-19532
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 2.1.0
Reporter: StanZhai
Priority: Blocker


When `spark.speculation` is set to true, the thread dump page of an Executor in the 
WebUI shows about 1300 threads named "DataStreamer for file 
/test/data/test_temp/_temporary/0/_temporary/attempt_20170207172435_80750_m_69_1/part-00069-690407af-0900-46b1-9590-a6d6c696fe68.snappy.parquet"
in TIMED_WAITING state.

{code}
java.lang.Object.wait(Native Method)
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:564)
{code}

Off-heap memory usage grows significantly until the Executor exits with an OOM exception. 

This problem occurs only when writing data to Hadoop (tasks may be killed by the 
Executor during writing).
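
For reference, a minimal sketch of the kind of configuration and workload under which the leak was observed (the app name, dataset size, and output path are placeholders; this is not a guaranteed reproduction):

{code}
// Minimal sketch, placeholders only, and not a guaranteed reproduction: with speculation
// enabled, speculative task attempts can be killed while writing to HDFS, which is when the
// "DataStreamer for file ..." threads were observed to accumulate on executors.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("speculative-write")
  .config("spark.speculation", "true")
  .getOrCreate()

// Any job that writes a large dataset to HDFS; the path is a placeholder.
spark.range(0L, 100000000L).toDF("id")
  .write.mode("overwrite")
  .parquet("hdfs:///test/data/test_temp")
{code}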







[jira] [Updated] (SPARK-19509) [SQL]GROUPING SETS throws NullPointerException when use an empty column

2017-02-09 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-19509:
-
Description: 
{code:sql|title=A simple case}
select count(1) from test group by e grouping sets(e)
{code}

{code:title=Schema of the test table}
scala> spark.sql("desc test").show()
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|       e|   string|   null|
+--------+---------+-------+
{code}

{code:sql|title=The column `e` is empty}
scala> spark.sql("select e from test").show()
+----+
|   e|
+----+
|null|
|null|
+----+
{code}

{code:title=Exception}
Driver stacktrace:
  at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
  at scala.Option.foreach(Option.scala:257)
  at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333)
  at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at 
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
  at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112)
  at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2112)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2327)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:636)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:595)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:604)
  ... 48 elided
Caused by: java.lang.NullPointerException
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
 Source)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
  at org.apache.spark.scheduler.Task.run(Task.scala:99)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
{code}


  was:
{code:sql|title=A simple case}
select count(1) from test group by e grouping sets(e)
{code}

{code:sql|title=The column `e` is empty}
scala> spark.sql("select e from test").show()
++
|   e|
++
|null|
|null|
++
{code}

{code:title=Exception}
Driver stacktrace:
  at 

[jira] [Commented] (SPARK-19509) [SQL]GROUPING SETS throws NullPointerException when use a null column

2017-02-08 Thread StanZhai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15858009#comment-15858009
 ] 

StanZhai commented on SPARK-19509:
--

But there's another problem; I've modified the description.

> [SQL]GROUPING SETS throws NullPointerException when use a null column
> -
>
> Key: SPARK-19509
> URL: https://issues.apache.org/jira/browse/SPARK-19509
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Priority: Critical
>
> {code:sql|title=A simple case}
> select count(1) from test group by e grouping sets(e)
> {code}
> {code:sql|title=The column `e` is empty}
> scala> spark.sql("select e from test").show()
> ++
> |   e|
> ++
> |null|
> |null|
> ++
> {code}
> {code:title=Exception}
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112)
>   at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2112)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2327)
>   at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:636)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:595)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:604)
>   ... 48 elided
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> 

[jira] [Updated] (SPARK-19509) [SQL]GROUPING SETS throws NullPointerException when use an empty column

2017-02-08 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-19509:
-
Summary: [SQL]GROUPING SETS throws NullPointerException when use an empty 
column  (was: [SQL]GROUPING SETS throws NullPointerException when use a null 
column)

> [SQL]GROUPING SETS throws NullPointerException when use an empty column
> ---
>
> Key: SPARK-19509
> URL: https://issues.apache.org/jira/browse/SPARK-19509
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Priority: Critical
>
> {code:sql|title=A simple case}
> select count(1) from test group by e grouping sets(e)
> {code}
> {code:sql|title=The column `e` is empty}
> scala> spark.sql("select e from test").show()
> ++
> |   e|
> ++
> |null|
> |null|
> ++
> {code}
> {code:title=Exception}
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112)
>   at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2112)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2327)
>   at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:636)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:595)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:604)
>   ... 48 elided
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> 

[jira] [Updated] (SPARK-19509) [SQL]GROUPING SETS throws NullPointerException when use a null column

2017-02-08 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-19509:
-
Description: 
{code:sql|title=A simple case}
select count(1) from test group by e grouping sets(e)
{code}

{code:sql|title=The column `e` is empty}
scala> spark.sql("select e from test").show()
++
|   e|
++
|null|
|null|
++
{code}

{code:title=Exception}
Driver stacktrace:
  at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
  at scala.Option.foreach(Option.scala:257)
  at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333)
  at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at 
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
  at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112)
  at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2112)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2327)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:636)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:595)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:604)
  ... 48 elided
Caused by: java.lang.NullPointerException
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
 Source)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
  at org.apache.spark.scheduler.Task.run(Task.scala:99)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
{code}
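
For reference, a minimal reproduction sketch (not part of the original report): it builds the all-null column with plain SQL, so the table name `test` and column `e` match the report, but using a temporary view instead of a Hive table is an assumption.

{code:title=Hypothetical reproduction sketch}
// Assumes a Spark 2.1.x session named `spark`; whether a temporary view exercises
// the same generated aggregation path as a Hive table is an assumption.
spark.sql("select cast(null as string) as e union all select cast(null as string) as e")
  .createOrReplaceTempView("test")

spark.sql("select e from test").show()                                      // two null rows, as above
spark.sql("select count(1) from test group by e grouping sets(e)").show()   // NPE on affected versions
{code}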


  was:
To reproduce the issue, the CASE WHEN statement must return a STRING value and 
the grouping sets must be empty.

{code:sql|title=A simple case}
select case "0" when "0" then "a" else "b" end
from tb
group by case "0" when "0" then "a" else "b" end
grouping sets (())
{code}

{code:title=Exception}
Caused by: java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
 Source)
at 

[jira] [Updated] (SPARK-19509) [SQL]GROUPING SETS throws NullPointerException when use a null column

2017-02-08 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-19509:
-
Summary: [SQL]GROUPING SETS throws NullPointerException when use a null 
column  (was: [SQL]GROUPING SETS throws NullPointerException when use a CASE 
WHEN statement)

> [SQL]GROUPING SETS throws NullPointerException when use a null column
> -
>
> Key: SPARK-19509
> URL: https://issues.apache.org/jira/browse/SPARK-19509
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Priority: Critical
>
> To reproduce the issue, the CASE WHEN statement must return a STRING value 
> and the grouping sets must be empty.
> {code:sql|title=A simple case}
> select case "0" when "0" then "a" else "b" end
> from tb
> group by case "0" when "0" then "a" else "b" end
> grouping sets (())
> {code}
> {code:title=Exception}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-19509) [SQL]GROUPING SETS throws NullPointerException when use a CASE WHEN statement

2017-02-08 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai reopened SPARK-19509:
--

It doesn't look like the same problem.
I have tried to merge https://github.com/apache/spark/pull/15980 into 
branch-2.1.0. The problem still exists.

> [SQL]GROUPING SETS throws NullPointerException when use a CASE WHEN statement
> -
>
> Key: SPARK-19509
> URL: https://issues.apache.org/jira/browse/SPARK-19509
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Priority: Critical
>
> To reproduce the issue, the CASE WHEN statement must return a STRING value 
> and the grouping sets must be empty.
> {code:sql|title=A simple case}
> select case "0" when "0" then "a" else "b" end
> from tb
> group by case "0" when "0" then "a" else "b" end
> grouping sets (())
> {code}
> {code:title=Exception}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19509) [SQL]GROUPING SETS throws NullPointerException when use a CASE WHEN statement

2017-02-08 Thread StanZhai (JIRA)
StanZhai created SPARK-19509:


 Summary: [SQL]GROUPING SETS throws NullPointerException when use a 
CASE WHEN statement
 Key: SPARK-19509
 URL: https://issues.apache.org/jira/browse/SPARK-19509
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: StanZhai
Priority: Critical


To reproduce the issue, the CASE WHEN statement must return a STRING value and 
the grouping sets must be empty.

{code:sql|title=A simple case}
select case "0" when "0" then "a" else "b" end
from tb
group by case "0" when "0" then "a" else "b" end
grouping sets (())
{code}

{code:title=Exception}
Caused by: java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19472) [SQL]SQLParser fails to resolve nested CASE WHEN statement with parentheses

2017-02-05 Thread StanZhai (JIRA)
StanZhai created SPARK-19472:


 Summary: [SQL]SQLParser fails to resolve nested CASE WHEN 
statement with parentheses
 Key: SPARK-19472
 URL: https://issues.apache.org/jira/browse/SPARK-19472
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: StanZhai


SQLParser fails to resolve a nested CASE WHEN statement like this:

select case when
  (1) +
  case when 1>0 then 1 else 0 end = 2
then 1 else 0 end
from tb

 Exception 
Exception in thread "main" org.apache.spark.sql.catalyst.parser.ParseException: 
mismatched input 'then' expecting {'.', '[', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 
'LIKE', RLIKE, 'IS', 'WHEN', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', 
'-', '*', '/', '%', 'DIV', '&', '|', '^'}(line 5, pos 0)

== SQL ==

select case when
  (1) +
  case when 1>0 then 1 else 0 end = 2
then 1 else 0 end
^^^
from tb

But removing the parentheses works fine:

select case when
  1 +
  case when 1>0 then 1 else 0 end = 2
then 1 else 0 end
from tb
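
A minimal sketch (not from the report) that drives both forms through the SQL API; `tb` is a stand-in single-row view, and on affected versions the first call is expected to fail with the ParseException above before `tb` is even read.

{code:title=Hypothetical parser check}
// Assumes a Spark 2.1.x session named `spark`; `tb` only needs to exist so that
// the second (working) query can be analysed and executed.
spark.sql("select 1 as c").createOrReplaceTempView("tb")

// Parenthesised form: fails to parse on affected versions.
spark.sql(
  """select case when
    |  (1) +
    |  case when 1>0 then 1 else 0 end = 2
    |then 1 else 0 end
    |from tb""".stripMargin).show()

// Same expression without the parentheses around 1: parses and runs.
spark.sql(
  """select case when
    |  1 +
    |  case when 1>0 then 1 else 0 end = 2
    |then 1 else 0 end
    |from tb""".stripMargin).show()
{code}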




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19471) [SQL]A confusing NullPointerException when creating table

2017-02-05 Thread StanZhai (JIRA)
StanZhai created SPARK-19471:


 Summary: [SQL]A confusing NullPointerException when creating table
 Key: SPARK-19471
 URL: https://issues.apache.org/jira/browse/SPARK-19471
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: StanZhai
Priority: Critical


After upgrading our Spark from 1.6.2 to 2.1.0, I encounter a confusing 
NullPointerException when creating a table under Spark 2.1.0, but the problem 
does not exist in Spark 1.6.1. 

Environment: Hive 1.2.1, Hadoop 2.6.4 

 Code  
// spark is an instance of HiveContext 
// merge is a Hive UDF 
val df = spark.sql("SELECT merge(field_a, null) AS new_a, field_b AS new_b FROM tb_1 group by field_a, field_b") 
df.createTempView("tb_temp") 
spark.sql("create table tb_result stored as parquet as " + 
  "SELECT new_a " + 
  "FROM tb_temp " + 
  "LEFT JOIN `tb_2` ON " + 
  "if(((`tb_temp`.`new_b`) = '' OR (`tb_temp`.`new_b`) IS NULL), concat('GrLSRwZE_', cast((rand() * 200) AS int)), (`tb_temp`.`new_b`)) = `tb_2`.`fka6862f17`") 

 Physical Plan  
*Project [new_a] 
+- *BroadcastHashJoin [if (((new_b = ) || isnull(new_b))) concat(GrLSRwZE_, 
cast(cast((_nondeterministic * 200.0) as int) as string)) else new_b], 
[fka6862f17], LeftOuter, BuildRight 
   :- HashAggregate(keys=[field_a, field_b], functions=[], output=[new_a, 
new_b, _nondeterministic]) 
   :  +- Exchange(coordinator ) hashpartitioning(field_a, field_b, 180), 
coordinator[target post-shuffle partition size: 1024880] 
   : +- *HashAggregate(keys=[field_a, field_b], functions=[], 
output=[field_a, field_b]) 
   :+- *FileScan parquet bdp.tb_1[field_a,field_b] Batched: true, 
Format: Parquet, Location: InMemoryFileIndex[hdfs://hdcluster/data/tb_1, 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct 
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, 
true])) 
  +- *Project [fka6862f17] 
 +- *FileScan parquet bdp.tb_2[fka6862f17] Batched: true, Format: 
Parquet, Location: InMemoryFileIndex[hdfs://hdcluster/data/tb_2, 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct 

What does '*' mean before HashAggregate? 

 Exception  
org.apache.spark.SparkException: Task failed while writing rows 
... 
java.lang.NullPointerException 
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_2$(Unknown
 Source) 
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source) 
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$3.apply(AggregationIterator.scala:260)
 
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$3.apply(AggregationIterator.scala:259)
 
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:392)
 
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:79)
 
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source) 
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
 
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:252)
 
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:199)
 
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:197)
 
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
 
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:202)
 
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$4.apply(FileFormatWriter.scala:138)
 
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$4.apply(FileFormatWriter.scala:137)
 
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) 
at org.apache.spark.scheduler.Task.run(Task.scala:99) 
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) 
at 

[jira] [Created] (SPARK-19261) Support `ALTER TABLE table_name ADD COLUMNS(..)` statement

2017-01-17 Thread StanZhai (JIRA)
StanZhai created SPARK-19261:


 Summary: Support `ALTER TABLE table_name ADD COLUMNS(..)` statement
 Key: SPARK-19261
 URL: https://issues.apache.org/jira/browse/SPARK-19261
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0, 2.0.2, 2.0.1, 2.0.0
Reporter: StanZhai
 Fix For: 2.2.0


We should support the `ALTER TABLE table_name ADD COLUMNS(..)` statement, which 
was already supported in versions < 2.x.

This is very useful for those who want to upgrade their Spark version to 2.x.
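
A sketch of the proposed statement as it might be used, with a hypothetical table `logs`; the column-spec grammar (types, comments) is assumed to follow HiveQL.

{code:title=Proposed usage (sketch)}
// Hypothetical table and column names; assumes the HiveQL-style column spec.
spark.sql("""
  ALTER TABLE logs ADD COLUMNS (
    user_agent  STRING COMMENT 'added after the table was created',
    response_ms INT
  )
""")
{code}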



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9465) Could not read parquet table after recreating it with the same table name

2015-07-29 Thread StanZhai (JIRA)
StanZhai created SPARK-9465:
---

 Summary: Could not read parquet table after recreating it with the 
same table name
 Key: SPARK-9465
 URL: https://issues.apache.org/jira/browse/SPARK-9465
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
Reporter: StanZhai


I'm using Spark SQL in Spark 1.4.1. I encounter an error when using a Parquet 
table after recreating it; the error can be reproduced as follows: 

```scala 
// hc is an instance of HiveContext 
hc.sql("select * from b").show() // this is ok and b is a parquet table 
val df = hc.sql("select * from a") 
df.write.mode(SaveMode.Overwrite).saveAsTable("b") 
hc.sql("select * from b").show() // got error 
``` 
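
A possible mitigation sketch (not from the report, and orthogonal to the error shown below) is to refresh the table's cached metadata after the overwrite; that `refreshTable` behaves this way on HiveContext in 1.4.x is an assumption.

```scala
// hc is an instance of HiveContext, as above; refreshTable on HiveContext is assumed available in 1.4.x
val df = hc.sql("select * from a")
df.write.mode(SaveMode.Overwrite).saveAsTable("b")
hc.refreshTable("b")               // drop cached metadata/file listings for b before re-reading
hc.sql("select * from b").show()
```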

The error is: 

java.io.FileNotFoundException: File does not exist: 
/user/hive/warehouse/test.db/b/part-r-4-3abcbb07-e20a-4b5e-a6e5-59356c3d3149.gz.parquet
 
at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65) 
at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55) 
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1716)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1659)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1639)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1613)
 
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
 
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
 
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
 
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
 
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) 
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) 
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) 
at java.security.AccessController.doPrivileged(Native Method) 
at javax.security.auth.Subject.doAs(Subject.java:415) 
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
 
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) 
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
Method) 
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 
at java.lang.reflect.Constructor.newInstance(Constructor.java:526) 
at 
org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
 
at 
org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
 
at 
org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1144) 
at 
org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1132) 
at 
org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1182) 
at 
org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:218)
 
at 
org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:214)
 
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
 
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:214)
 
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:206)
 
at 
org.apache.spark.sql.parquet.FilteringParquetRowInputFormat$$anonfun$getTaskSideSplits$1.apply(ParquetTableOperations.scala:625)
 
at 
org.apache.spark.sql.parquet.FilteringParquetRowInputFormat$$anonfun$getTaskSideSplits$1.apply(ParquetTableOperations.scala:621)
 
at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) 
at scala.collection.AbstractIterable.foreach(Iterable.scala:54) 
at 
org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getTaskSideSplits(ParquetTableOperations.scala:621)
 
at 
org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:511)
 
at 
parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) 
at 

[jira] [Updated] (SPARK-9465) Could not read parquet table after recreating it with the same table name

2015-07-29 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-9465:

Description: 
I'm using Spark SQL in Spark 1.4.1. I encounter an error when using a Parquet 
table after recreating it; the error can be reproduced as follows: 

{code}
// hc is an instance of HiveContext 
hc.sql("select * from b").show() // this is ok and b is a parquet table 
val df = hc.sql("select * from a") 
df.write.mode(SaveMode.Overwrite).saveAsTable("b") 
hc.sql("select * from b").show() // got error 
{code}

The error is: 

{code}
java.io.FileNotFoundException: File does not exist: 
/user/hive/warehouse/test.db/b/part-r-4-3abcbb07-e20a-4b5e-a6e5-59356c3d3149.gz.parquet
 
at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65) 
at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55) 
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1716)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1659)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1639)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1613)
 
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
 
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
 
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
 
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
 
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) 
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) 
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) 
at java.security.AccessController.doPrivileged(Native Method) 
at javax.security.auth.Subject.doAs(Subject.java:415) 
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
 
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) 
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
Method) 
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 
at java.lang.reflect.Constructor.newInstance(Constructor.java:526) 
at 
org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
 
at 
org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
 
at 
org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1144) 
at 
org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1132) 
at 
org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1182) 
at 
org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:218)
 
at 
org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:214)
 
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
 
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:214)
 
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:206)
 
at 
org.apache.spark.sql.parquet.FilteringParquetRowInputFormat$$anonfun$getTaskSideSplits$1.apply(ParquetTableOperations.scala:625)
 
at 
org.apache.spark.sql.parquet.FilteringParquetRowInputFormat$$anonfun$getTaskSideSplits$1.apply(ParquetTableOperations.scala:621)
 
at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) 
at scala.collection.AbstractIterable.foreach(Iterable.scala:54) 
at 
org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getTaskSideSplits(ParquetTableOperations.scala:621)
 
at 
org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:511)
 
at 
parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) 
at 
org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:464)
 
at 

[jira] [Created] (SPARK-9010) Improve the Spark Configuration document about `spark.kryoserializer.buffer`

2015-07-13 Thread StanZhai (JIRA)
StanZhai created SPARK-9010:
---

 Summary: Improve the Spark Configuration document about 
`spark.kryoserializer.buffer`
 Key: SPARK-9010
 URL: https://issues.apache.org/jira/browse/SPARK-9010
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: StanZhai
Priority: Minor


The meaning of `spark.kryoserializer.buffer` should be: "Initial size of Kryo's 
serialization buffer. Note that there will be one buffer per core on each 
worker. This buffer will grow up to `spark.kryoserializer.buffer.max` if needed."

The `spark.kryoserializer.buffer.max.mb` setting is out-of-date in Spark 1.4.
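
For context, a sketch of how the two related settings are typically supplied; the values are illustrative, not recommendations.

{code:title=Illustrative configuration (sketch)}
// Illustrative values only; both keys accept size strings with units in Spark 1.4+.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer", "64k")      // initial size of Kryo's serialization buffer
  .set("spark.kryoserializer.buffer.max", "64m")  // upper bound the buffer may grow to
{code}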



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9010) Improve the Spark Configuration document about `spark.kryoserializer.buffer`

2015-07-13 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-9010:

Component/s: (was: SQL)
 Documentation

 Improve the Spark Configuration document about `spark.kryoserializer.buffer`
 

 Key: SPARK-9010
 URL: https://issues.apache.org/jira/browse/SPARK-9010
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.4.0
Reporter: StanZhai
Priority: Minor
  Labels: documentation

 The meaning of `spark.kryoserializer.buffer` should be: "Initial size of Kryo's 
 serialization buffer. Note that there will be one buffer per core on each 
 worker. This buffer will grow up to `spark.kryoserializer.buffer.max` if 
 needed."
 The `spark.kryoserializer.buffer.max.mb` setting is out-of-date in Spark 1.4.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8588) Could not use concat with UDF in where clause

2015-06-24 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-8588:

Priority: Critical  (was: Blocker)

 Could not use concat with UDF in where clause
 -

 Key: SPARK-8588
 URL: https://issues.apache.org/jira/browse/SPARK-8588
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
 Environment: CentOS 7, Java 1.7.0_67, Scala 2.10.5, running in a Spark 
 standalone cluster (or local).
Reporter: StanZhai
Priority: Critical

 After upgrading the cluster from Spark 1.3.1 to 1.4.0 (rc4), I encountered the 
 following exception when using concat with a UDF in a where clause: 
 {code}
 org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
 dataType on unresolved object, tree: 
 'concat(HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(date#1776),年) 
 at 
 org.apache.spark.sql.catalyst.analysis.UnresolvedFunction.dataType(unresolved.scala:82)
  
 at 
 org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5$$anonfun$applyOrElse$15.apply(HiveTypeCoercion.scala:299)
  
 at 
 org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5$$anonfun$applyOrElse$15.apply(HiveTypeCoercion.scala:299)
  
 at 
 scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80) 
 at scala.collection.immutable.List.exists(List.scala:84) 
 at 
 org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5.applyOrElse(HiveTypeCoercion.scala:299)
  
 at 
 org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5.applyOrElse(HiveTypeCoercion.scala:298)
  
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
  
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
  
 at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
  
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
  
 at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:75)
  
 at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:85)
  
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
 at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
 at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) 
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) 
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) 
 at 
 scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) 
 at scala.collection.AbstractIterator.to(Iterator.scala:1157) 
 at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) 
 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) 
 at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) 
 at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) 
 at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:94)
  
 at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:64)
  
 at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformAllExpressions$1.applyOrElse(QueryPlan.scala:136)
  
 at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformAllExpressions$1.applyOrElse(QueryPlan.scala:135)
  
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
  
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
  
 at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
  
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
  
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242)
  
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
 at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
 at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) 
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) 
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) 

[jira] [Created] (SPARK-8588) Could not use concat with UDF in where clause

2015-06-24 Thread StanZhai (JIRA)
StanZhai created SPARK-8588:
---

 Summary: Could not use concat with UDF in where clause
 Key: SPARK-8588
 URL: https://issues.apache.org/jira/browse/SPARK-8588
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
 Environment: CentOS 7, Java 1.7.0_67, Scala 2.10.5, running in a Spark 
standalone cluster (or local).
Reporter: StanZhai
Priority: Blocker


After upgrading the cluster from Spark 1.3.1 to 1.4.0 (rc4), I encountered the 
following exception when using concat with a UDF in a where clause: 

{code}
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
dataType on unresolved object, tree: 
'concat(HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(date#1776),年) 
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedFunction.dataType(unresolved.scala:82)
 
at 
org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5$$anonfun$applyOrElse$15.apply(HiveTypeCoercion.scala:299)
 
at 
org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5$$anonfun$applyOrElse$15.apply(HiveTypeCoercion.scala:299)
 
at 
scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80) 
at scala.collection.immutable.List.exists(List.scala:84) 
at 
org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5.applyOrElse(HiveTypeCoercion.scala:299)
 
at 
org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5.applyOrElse(HiveTypeCoercion.scala:298)
 
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
 
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
 
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
 
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221) 
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:75)
 
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:85)
 
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) 
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) 
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) 
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) 
at scala.collection.AbstractIterator.to(Iterator.scala:1157) 
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) 
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) 
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) 
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) 
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:94)
 
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:64)
 
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformAllExpressions$1.applyOrElse(QueryPlan.scala:136)
 
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformAllExpressions$1.applyOrElse(QueryPlan.scala:135)
 
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
 
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
 
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
 
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221) 
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242)
 
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) 
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) 
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) 
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) 
at scala.collection.AbstractIterator.to(Iterator.scala:1157) 
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) 
at