[jira] [Updated] (SPARK-25067) Active tasks do not match the total cores of an executor in WebUI
[ https://issues.apache.org/jira/browse/SPARK-25067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-25067: - Attachment: WX20180810-144212.png > Active tasks do not match the total cores of an executor in WebUI > --- > > Key: SPARK-25067 > URL: https://issues.apache.org/jira/browse/SPARK-25067 > Project: Spark > Issue Type: Bug > Components: Web UI > Affects Versions: 2.2.2, 2.3.0, 2.3.1 > Reporter: StanZhai > Priority: Major > Attachments: WX20180810-144212.png, WechatIMG1.jpeg
[jira] [Updated] (SPARK-25067) Active tasks do not match the total cores of an executor in WebUI
[ https://issues.apache.org/jira/browse/SPARK-25067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-25067: - Summary: Active tasks do not match the total cores of an executor in WebUI (was: Active tasks exceed total cores of an executor in WebUI) > Active tasks do not match the total cores of an executor in WebUI > --- > > Key: SPARK-25067 > URL: https://issues.apache.org/jira/browse/SPARK-25067 > Project: Spark > Issue Type: Bug > Components: Web UI > Affects Versions: 2.2.2, 2.3.0, 2.3.1 > Reporter: StanZhai > Priority: Major > Attachments: WechatIMG1.jpeg
[jira] [Updated] (SPARK-25067) Active tasks exceed total cores of an executor in WebUI
[ https://issues.apache.org/jira/browse/SPARK-25067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-25067: - Attachment: (was: 1533128203469_2.png) > Active tasks exceed total cores of an executor in WebUI > --- > > Key: SPARK-25067 > URL: https://issues.apache.org/jira/browse/SPARK-25067 > Project: Spark > Issue Type: Bug > Components: Web UI > Affects Versions: 2.2.2, 2.3.0, 2.3.1 > Reporter: StanZhai > Priority: Major > Attachments: WechatIMG1.jpeg
[jira] [Updated] (SPARK-25067) Active tasks exceed total cores of an executor in WebUI
[ https://issues.apache.org/jira/browse/SPARK-25067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-25067: - Attachment: WechatIMG1.jpeg > Active tasks exceed total cores of an executor in WebUI > --- > > Key: SPARK-25067 > URL: https://issues.apache.org/jira/browse/SPARK-25067 > Project: Spark > Issue Type: Bug > Components: Web UI > Affects Versions: 2.2.2, 2.3.0, 2.3.1 > Reporter: StanZhai > Priority: Major > Attachments: 1533128203469_2.png, WechatIMG1.jpeg
[jira] [Updated] (SPARK-25067) Active tasks exceed total cores of an executor in WebUI
[ https://issues.apache.org/jira/browse/SPARK-25067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-25067: - Attachment: 1533128203469_2.png > Active tasks exceed total cores of an executor in WebUI > --- > > Key: SPARK-25067 > URL: https://issues.apache.org/jira/browse/SPARK-25067 > Project: Spark > Issue Type: Bug > Components: Web UI > Affects Versions: 2.2.2, 2.3.0, 2.3.1 > Reporter: StanZhai > Priority: Major > Attachments: 1533128203469_2.png
[jira] [Created] (SPARK-25067) Active tasks exceed total cores of an executor in WebUI
StanZhai created SPARK-25067: Summary: Active tasks exceed total cores of an executor in WebUI Key: SPARK-25067 URL: https://issues.apache.org/jira/browse/SPARK-25067 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.3.1, 2.3.0, 2.2.2 Reporter: StanZhai
[jira] [Updated] (SPARK-25064) Total Tasks in WebUI does not match Active+Failed+Complete Tasks
[ https://issues.apache.org/jira/browse/SPARK-25064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-25064: - Attachment: 1533128402933_3.png > Total Tasks in WebUI does not match Active+Failed+Complete Tasks > > > Key: SPARK-25064 > URL: https://issues.apache.org/jira/browse/SPARK-25064 > Project: Spark > Issue Type: Bug > Components: Web UI > Affects Versions: 2.2.2, 2.3.0, 2.3.1 > Reporter: StanZhai > Priority: Minor > Attachments: 1533128402933_3.png
[jira] [Created] (SPARK-25064) Total Tasks in WebUI does not match Active+Failed+Complete Tasks
StanZhai created SPARK-25064: Summary: Total Tasks in WebUI does not match Active+Failed+Complete Tasks Key: SPARK-25064 URL: https://issues.apache.org/jira/browse/SPARK-25064 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.3.1, 2.3.0, 2.2.2 Reporter: StanZhai
[jira] [Updated] (SPARK-24704) The order of stages in the DAG graph is incorrect
[ https://issues.apache.org/jira/browse/SPARK-24704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-24704: - Attachment: WX20180630-161907.png > The order of stages in the DAG graph is incorrect > - > > Key: SPARK-24704 > URL: https://issues.apache.org/jira/browse/SPARK-24704 > Project: Spark > Issue Type: Bug > Components: Web UI > Affects Versions: 2.3.0, 2.3.1 > Reporter: StanZhai > Priority: Minor > Labels: regression > Attachments: WX20180630-161907.png > > > This regression was introduced in Spark 2.3.0.
[jira] [Created] (SPARK-24704) The order of stages in the DAG graph is incorrect
StanZhai created SPARK-24704: Summary: The order of stages in the DAG graph is incorrect Key: SPARK-24704 URL: https://issues.apache.org/jira/browse/SPARK-24704 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.3.1, 2.3.0 Reporter: StanZhai This regression was introduced in Spark 2.3.0.
[jira] [Created] (SPARK-24680) spark.executorEnv.JAVA_HOME does not take effect in Standalone mode
StanZhai created SPARK-24680: Summary: spark.executorEnv.JAVA_HOME does not take effect in Standalone mode Key: SPARK-24680 URL: https://issues.apache.org/jira/browse/SPARK-24680 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 2.3.1, 2.2.1, 2.1.1 Reporter: StanZhai spark.executorEnv.JAVA_HOME does not take effect when a Worker starts an Executor process in Standalone mode.
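For context, the setting follows Spark's documented spark.executorEnv.[EnvironmentVariableName] pattern; a minimal sketch of how one would expect to use it (app name and paths are hypothetical):
{code}
// Expected usage per the Spark configuration docs; per this report, a
// Standalone Worker ignores the variable when it launches the Executor process.
// spark-submit --master spark://master:7077 \
//   --conf spark.executorEnv.JAVA_HOME=/opt/jdk8 app.jar
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("executor-env-sketch")                      // hypothetical
  .config("spark.executorEnv.JAVA_HOME", "/opt/jdk8")  // hypothetical path
  .getOrCreate()
{code}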
[jira] [Updated] (SPARK-22084) Performance regression in aggregation strategy
[ https://issues.apache.org/jira/browse/SPARK-22084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-22084: - Labels: performance (was: ) > Performance regression in aggregation strategy > -- > > Key: SPARK-22084 > URL: https://issues.apache.org/jira/browse/SPARK-22084 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.2.0 > Reporter: StanZhai > Labels: performance > > {code:sql} > SELECT a, SUM(b) AS b0, SUM(b) AS b1 > FROM VALUES(1, 1), (2, 2) AS (a, b) > GROUP BY a > {code} > The SQL contains two identical SUM(b) expressions, and the following is the physical plan in Spark 2.x. > {code} > == Physical Plan == > *HashAggregate(keys=[a#11], functions=[sum(cast(b#12 as bigint)), sum(cast(b#12 as bigint))]) > +- Exchange hashpartitioning(a#11, 200) > +- *HashAggregate(keys=[a#11], functions=[partial_sum(cast(b#12 as bigint)), partial_sum(cast(b#12 as bigint))]) > +- LocalTableScan [a#11, b#12] > {code} > The functions in the Aggregate should be deduplicated to: functions=[partial_sum(cast(b#12 as bigint))]
[jira] [Updated] (SPARK-22084) Performance regression in aggregation strategy
[ https://issues.apache.org/jira/browse/SPARK-22084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-22084: - Description: {code:sql} SELECT a, SUM(b) AS b0, SUM(b) AS b1 FROM VALUES(1, 1), (2, 2) AS (a, b) GROUP BY a {code} The SQL contains two identical SUM(b) expressions, and the following is the physical plan in Spark 2.x. {code} == Physical Plan == *HashAggregate(keys=[a#11], functions=[sum(cast(b#12 as bigint)), sum(cast(b#12 as bigint))]) +- Exchange hashpartitioning(a#11, 200) +- *HashAggregate(keys=[a#11], functions=[partial_sum(cast(b#12 as bigint)), partial_sum(cast(b#12 as bigint))]) +- LocalTableScan [a#11, b#12] {code} The functions in the Aggregate should be deduplicated to: functions=[partial_sum(cast(b#12 as bigint))] was: {code:sql} SELECT a, SUM(b) AS b0, SUM(b) AS b1 FROM VALUES(1, 1), (2, 2) AS (a, b) GROUP BY a {code} The SQL contains two identical SUM(b) expressions, and the following is the physical plan in Spark 2.x. == Physical Plan == *HashAggregate(keys=[a#11], functions=[sum(cast(b#12 as bigint)), sum(cast(b#12 as bigint))]) +- Exchange hashpartitioning(a#11, 200) +- *HashAggregate(keys=[a#11], functions=[partial_sum(cast(b#12 as bigint)), partial_sum(cast(b#12 as bigint))]) +- LocalTableScan [a#11, b#12] The functions in the Aggregate should be deduplicated to: functions=[partial_sum(cast(b#12 as bigint))] > Performance regression in aggregation strategy > -- > > Key: SPARK-22084 > URL: https://issues.apache.org/jira/browse/SPARK-22084 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.2.0 > Reporter: StanZhai > > {code:sql} > SELECT a, SUM(b) AS b0, SUM(b) AS b1 > FROM VALUES(1, 1), (2, 2) AS (a, b) > GROUP BY a > {code} > The SQL contains two identical SUM(b) expressions, and the following is the physical plan in Spark 2.x. > {code} > == Physical Plan == > *HashAggregate(keys=[a#11], functions=[sum(cast(b#12 as bigint)), sum(cast(b#12 as bigint))]) > +- Exchange hashpartitioning(a#11, 200) > +- *HashAggregate(keys=[a#11], functions=[partial_sum(cast(b#12 as bigint)), partial_sum(cast(b#12 as bigint))]) > +- LocalTableScan [a#11, b#12] > {code} > The functions in the Aggregate should be deduplicated to: functions=[partial_sum(cast(b#12 as bigint))]
[jira] [Created] (SPARK-22084) Performance regression in aggregation strategy
StanZhai created SPARK-22084: Summary: Performance regression in aggregation strategy Key: SPARK-22084 URL: https://issues.apache.org/jira/browse/SPARK-22084 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0, 2.1.1, 2.1.0, 2.0.2, 2.0.1, 2.0.0 Reporter: StanZhai {code:sql} SELECT a, SUM(b) AS b0, SUM(b) AS b1 FROM VALUES(1, 1), (2, 2) AS (a, b) GROUP BY a {code} The SQL contains two identical SUM(b) expressions, and the following is the physical plan in Spark 2.x. == Physical Plan == *HashAggregate(keys=[a#11], functions=[sum(cast(b#12 as bigint)), sum(cast(b#12 as bigint))]) +- Exchange hashpartitioning(a#11, 200) +- *HashAggregate(keys=[a#11], functions=[partial_sum(cast(b#12 as bigint)), partial_sum(cast(b#12 as bigint))]) +- LocalTableScan [a#11, b#12] The functions in the Aggregate should be deduplicated to: functions=[partial_sum(cast(b#12 as bigint))]
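For reference, a minimal spark-shell sketch (assuming a Spark 2.x session bound to `spark`) that reproduces the duplicated partial aggregation described above:
{code}
// Both result columns come from the same SUM(b), yet the partial
// HashAggregate in the printed plan computes partial_sum twice.
val df = spark.sql(
  """SELECT a, SUM(b) AS b0, SUM(b) AS b1
    |FROM VALUES(1, 1), (2, 2) AS (a, b)
    |GROUP BY a""".stripMargin)
df.explain() // look for two partial_sum(cast(b as bigint)) entries
{code}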
[jira] [Commented] (SPARK-21774) The rule PromoteStrings casts string to a wrong data type
[ https://issues.apache.org/jira/browse/SPARK-21774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131921#comment-16131921 ] StanZhai commented on SPARK-21774: -- I've opened a PR (https://github.com/apache/spark/pull/18986) for this issue. This PR is not automatically associated with JIRA yet. > The rule PromoteStrings casts string to a wrong data type > > > Key: SPARK-21774 > URL: https://issues.apache.org/jira/browse/SPARK-21774 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.0 > Reporter: StanZhai > Priority: Critical > Labels: correctness > > Data: > {code} > create temporary view tb as select * from values > ("0", 1), > ("-0.1", 2), > ("1", 3) > as grouping(a, b) > {code} > SQL: > {code} > select a, b from tb where a=0 > {code} > The result, which is wrong: > {code} > +----+---+ > |   a|  b| > +----+---+ > |   0|  1| > |-0.1|  2| > +----+---+ > {code} > Logical Plan: > {code} > == Parsed Logical Plan == > 'Project ['a] > +- 'Filter ('a = 0) > +- 'UnresolvedRelation `src` > == Analyzed Logical Plan == > a: string > Project [a#8528] > +- Filter (cast(a#8528 as int) = 0) > +- SubqueryAlias src > +- Project [_1#8525 AS a#8528, _2#8526 AS b#8529] > +- LocalRelation [_1#8525, _2#8526] > {code}
[jira] [Updated] (SPARK-21774) The rule PromoteStrings casts string to a wrong data type
[ https://issues.apache.org/jira/browse/SPARK-21774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-21774: - External issue URL: (was: https://github.com/apache/spark/pull/18986) > The rule PromoteStrings casts string to a wrong data type > > > Key: SPARK-21774 > URL: https://issues.apache.org/jira/browse/SPARK-21774 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.0 > Reporter: StanZhai > Priority: Critical > Labels: correctness > > Data: > {code} > create temporary view tb as select * from values > ("0", 1), > ("-0.1", 2), > ("1", 3) > as grouping(a, b) > {code} > SQL: > {code} > select a, b from tb where a=0 > {code} > The result, which is wrong: > {code} > +----+---+ > |   a|  b| > +----+---+ > |   0|  1| > |-0.1|  2| > +----+---+ > {code} > Logical Plan: > {code} > == Parsed Logical Plan == > 'Project ['a] > +- 'Filter ('a = 0) > +- 'UnresolvedRelation `src` > == Analyzed Logical Plan == > a: string > Project [a#8528] > +- Filter (cast(a#8528 as int) = 0) > +- SubqueryAlias src > +- Project [_1#8525 AS a#8528, _2#8526 AS b#8529] > +- LocalRelation [_1#8525, _2#8526] > {code}
[jira] [Updated] (SPARK-21774) The rule PromoteStrings casts string to a wrong data type
[ https://issues.apache.org/jira/browse/SPARK-21774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-21774: - External issue URL: https://github.com/apache/spark/pull/18986 > The rule PromoteStrings casts string to a wrong data type > > > Key: SPARK-21774 > URL: https://issues.apache.org/jira/browse/SPARK-21774 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.0 > Reporter: StanZhai > Priority: Critical > Labels: correctness > > Data: > {code} > create temporary view tb as select * from values > ("0", 1), > ("-0.1", 2), > ("1", 3) > as grouping(a, b) > {code} > SQL: > {code} > select a, b from tb where a=0 > {code} > The result, which is wrong: > {code} > +----+---+ > |   a|  b| > +----+---+ > |   0|  1| > |-0.1|  2| > +----+---+ > {code} > Logical Plan: > {code} > == Parsed Logical Plan == > 'Project ['a] > +- 'Filter ('a = 0) > +- 'UnresolvedRelation `src` > == Analyzed Logical Plan == > a: string > Project [a#8528] > +- Filter (cast(a#8528 as int) = 0) > +- SubqueryAlias src > +- Project [_1#8525 AS a#8528, _2#8526 AS b#8529] > +- LocalRelation [_1#8525, _2#8526] > {code}
[jira] [Updated] (SPARK-21774) The rule PromoteStrings casts string to a wrong data type
[ https://issues.apache.org/jira/browse/SPARK-21774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-21774: - External issue ID: (was: SPARK-21646) > The rule PromoteStrings casts string to a wrong data type > > > Key: SPARK-21774 > URL: https://issues.apache.org/jira/browse/SPARK-21774 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.0 > Reporter: StanZhai > Priority: Critical > Labels: correctness > > Data: > {code} > create temporary view tb as select * from values > ("0", 1), > ("-0.1", 2), > ("1", 3) > as grouping(a, b) > {code} > SQL: > {code} > select a, b from tb where a=0 > {code} > The result, which is wrong: > {code} > +----+---+ > |   a|  b| > +----+---+ > |   0|  1| > |-0.1|  2| > +----+---+ > {code} > Logical Plan: > {code} > == Parsed Logical Plan == > 'Project ['a] > +- 'Filter ('a = 0) > +- 'UnresolvedRelation `src` > == Analyzed Logical Plan == > a: string > Project [a#8528] > +- Filter (cast(a#8528 as int) = 0) > +- SubqueryAlias src > +- Project [_1#8525 AS a#8528, _2#8526 AS b#8529] > +- LocalRelation [_1#8525, _2#8526] > {code}
[jira] [Updated] (SPARK-21774) The rule PromoteStrings casts string to a wrong data type
[ https://issues.apache.org/jira/browse/SPARK-21774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-21774: - Description: Data: {code} create temporary view tb as select * from values ("0", 1), ("-0.1", 2), ("1", 3) as grouping(a, b) {code} SQL: {code} select a, b from tb where a=0 {code} The result, which is wrong: {code} +----+---+ |   a|  b| +----+---+ |   0|  1| |-0.1|  2| +----+---+ {code} Logical Plan: {code} == Parsed Logical Plan == 'Project ['a] +- 'Filter ('a = 0) +- 'UnresolvedRelation `src` == Analyzed Logical Plan == a: string Project [a#8528] +- Filter (cast(a#8528 as int) = 0) +- SubqueryAlias src +- Project [_1#8525 AS a#8528, _2#8526 AS b#8529] +- LocalRelation [_1#8525, _2#8526] {code} was: Data: {code} create temporary view tb as select * from values ("0", 1), ("-0.1", 2), ("1", 3) as grouping(a, b) {code} SQL: {code} select a, b from tb where a=0 {code} The result, which is wrong: {code} +----+---+ |   a|  b| +----+---+ |   0|  1| |-0.1|  2| +----+---+ {code} > The rule PromoteStrings casts string to a wrong data type > > > Key: SPARK-21774 > URL: https://issues.apache.org/jira/browse/SPARK-21774 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.0 > Reporter: StanZhai > Priority: Critical > Labels: correctness > > Data: > {code} > create temporary view tb as select * from values > ("0", 1), > ("-0.1", 2), > ("1", 3) > as grouping(a, b) > {code} > SQL: > {code} > select a, b from tb where a=0 > {code} > The result, which is wrong: > {code} > +----+---+ > |   a|  b| > +----+---+ > |   0|  1| > |-0.1|  2| > +----+---+ > {code} > Logical Plan: > {code} > == Parsed Logical Plan == > 'Project ['a] > +- 'Filter ('a = 0) > +- 'UnresolvedRelation `src` > == Analyzed Logical Plan == > a: string > Project [a#8528] > +- Filter (cast(a#8528 as int) = 0) > +- SubqueryAlias src > +- Project [_1#8525 AS a#8528, _2#8526 AS b#8529] > +- LocalRelation [_1#8525, _2#8526] > {code}
[jira] [Created] (SPARK-21774) The rule PromoteStrings casts string to a wrong data type
StanZhai created SPARK-21774: Summary: The rule PromoteStrings casts string to a wrong data type Key: SPARK-21774 URL: https://issues.apache.org/jira/browse/SPARK-21774 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: StanZhai Priority: Critical Data: {code} create temporary view tb as select * from values ("0", 1), ("-0.1", 2), ("1", 3) as grouping(a, b) {code} SQL: {code} select a, b from tb where a=0 {code} The result, which is wrong: {code} +----+---+ |   a|  b| +----+---+ |   0|  1| |-0.1|  2| +----+---+ {code}
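For reference, a spark-shell sketch that reproduces the report; the last statement is a workaround suggestion of mine (not part of the report) that forces a double comparison instead of the int cast shown in the analyzed plan:
{code}
spark.sql("""create temporary view tb as select * from values
  ("0", 1), ("-0.1", 2), ("1", 3) as grouping(a, b)""")

// Buggy: PromoteStrings makes the string column compare as int,
// so "-0.1" matches 0 per this report.
spark.sql("select a, b from tb where a = 0").show()

// Workaround sketch: compare as double instead.
spark.sql("select a, b from tb where cast(a as double) = 0").show()
{code}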
[jira] [Created] (SPARK-21758) `SHOW TBLPROPERTIES` cannot get properties starting with spark.sql.*
StanZhai created SPARK-21758: Summary: `SHOW TBLPROPERTIES` cannot get properties starting with spark.sql.* Key: SPARK-21758 URL: https://issues.apache.org/jira/browse/SPARK-21758 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0, 2.1.1, 2.1.0 Reporter: StanZhai Priority: Critical SQL: SHOW TBLPROPERTIES test_tb("spark.sql.sources.schema.numParts") Exception: Table test_db.test.tb does not have property: spark.sql.sources.schema.numParts The `spark.sql.sources.schema.numParts` property does exist in the HiveMetastore.
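For reference, the failing call as a spark-shell sketch (table name as in the report; the expected-vs-actual behavior is taken from the description above):
{code}
// Fails in affected versions even though the key is stored in the HiveMetastore:
//   Table ... does not have property: spark.sql.sources.schema.numParts
spark.sql("""SHOW TBLPROPERTIES test_tb("spark.sql.sources.schema.numParts")""").show()
{code}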
[jira] [Commented] (SPARK-21318) The exception message thrown by `lookupFunction` is ambiguous.
[ https://issues.apache.org/jira/browse/SPARK-21318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16074840#comment-16074840 ] StanZhai commented on SPARK-21318: -- Yes. It has been registered in the `functionRegistry`, but it's not available for use. > The exception message thrown by `lookupFunction` is ambiguous. > -- > > Key: SPARK-21318 > URL: https://issues.apache.org/jira/browse/SPARK-21318 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1 > Reporter: StanZhai > Priority: Minor > > The function actually exists in the currently selected database, but the exception message is: > {code} > This function is neither a registered temporary function nor a permanent function registered in the database 'default'. > {code} > My UDF has already been registered in the current database, but it failed to initialize during lookupFunction. > The exception message should be: > {code} > No handler for Hive UDF 'site.stanzhai.UDAFXXX': > org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException: Two arguments is expected > {code} > This makes it difficult to locate the problem.
[jira] [Updated] (SPARK-21318) The exception message thrown by `lookupFunction` is ambiguous.
[ https://issues.apache.org/jira/browse/SPARK-21318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-21318: - Description: The function actually exists in the currently selected database, but the exception message is: {code} This function is neither a registered temporary function nor a permanent function registered in the database 'default'. {code} My UDF has already been registered in the current database, but it failed to initialize during lookupFunction. The exception message should be: {code} No handler for Hive UDF 'site.stanzhai.UDAFXXX': org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException: Two arguments is expected {code} This makes it difficult to locate the problem. was: The function actually exists, but the exception message is: {code} This function is neither a registered temporary function nor a permanent function registered in the database 'default'. {code} My UDF has already been registered in the current database, but it failed to initialize during lookupFunction. The exception message should be: {code} No handler for Hive UDF 'site.stanzhai.UDAFXXX': org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException: Two arguments is expected {code} This makes it difficult to locate the problem. > The exception message thrown by `lookupFunction` is ambiguous. > -- > > Key: SPARK-21318 > URL: https://issues.apache.org/jira/browse/SPARK-21318 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1 > Reporter: StanZhai > Priority: Minor > > The function actually exists in the currently selected database, but the exception message is: > {code} > This function is neither a registered temporary function nor a permanent function registered in the database 'default'. > {code} > My UDF has already been registered in the current database, but it failed to initialize during lookupFunction. > The exception message should be: > {code} > No handler for Hive UDF 'site.stanzhai.UDAFXXX': > org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException: Two arguments is expected > {code} > This makes it difficult to locate the problem.
[jira] [Updated] (SPARK-21318) The exception message thrown by `lookupFunction` is ambiguous.
[ https://issues.apache.org/jira/browse/SPARK-21318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-21318: - Description: The function actually exists, but the exception message is: {code} This function is neither a registered temporary function nor a permanent function registered in the database 'default'. {code} My UDF has already been registered in the current database, but it failed to initialize during lookupFunction. The exception message should be: {code} No handler for Hive UDF 'site.stanzhai.UDAFXXX': org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException: Two arguments is expected {code} This makes it difficult to locate the problem. was: The function actually exists, but the exception message is: {code} This function is neither a registered temporary function nor a permanent function registered in the database 'default'. {code} My UDF has already been registered in the current database, but it failed to initialize during lookupFunction. The exception message should be: {code} No handler for Hive UDF 'site.stanzhai.UDAFXXX': org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException: Two arguments is expected {code} > The exception message thrown by `lookupFunction` is ambiguous. > -- > > Key: SPARK-21318 > URL: https://issues.apache.org/jira/browse/SPARK-21318 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1 > Reporter: StanZhai > Priority: Minor > > The function actually exists, but the exception message is: > {code} > This function is neither a registered temporary function nor a permanent function registered in the database 'default'. > {code} > My UDF has already been registered in the current database, but it failed to initialize during lookupFunction. > The exception message should be: > {code} > No handler for Hive UDF 'site.stanzhai.UDAFXXX': > org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException: Two arguments is expected > {code} > This makes it difficult to locate the problem.
[jira] [Created] (SPARK-21318) The exception message thrown by `lookupFunction` is ambiguous.
StanZhai created SPARK-21318: Summary: The exception message thrown by `lookupFunction` is ambiguous. Key: SPARK-21318 URL: https://issues.apache.org/jira/browse/SPARK-21318 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.1, 2.1.0, 2.0.2, 2.0.1, 2.0.0 Reporter: StanZhai Priority: Minor The function actually exists, but the exception message is: {code} This function is neither a registered temporary function nor a permanent function registered in the database 'default'. {code} My UDF has already been registered in the current database, but it failed to initialize during lookupFunction. The exception message should be: {code} No handler for Hive UDF 'site.stanzhai.UDAFXXX': org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException: Two arguments is expected {code}
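To illustrate the distinction the reporter is asking for, here is a minimal, self-contained Scala sketch (a toy registry, not Spark's actual SessionCatalog code): a lookup should only claim the function is unregistered when it truly is, and should surface the initialization failure otherwise.
{code}
object FunctionLookupSketch {
  // Toy registry: a function name maps to a builder that may fail to
  // initialize, mimicking a UDAF whose constructor throws during lookup.
  private val registry: Map[String, () => Seq[Int] => Int] = Map(
    "udaf_xxx" -> (() => throw new IllegalArgumentException("Two arguments is expected"))
  )

  def lookupFunction(name: String): Seq[Int] => Int =
    registry.get(name) match {
      case None =>
        // Genuinely unregistered: the "neither temporary nor permanent"
        // wording fits this case.
        throw new NoSuchElementException(s"Undefined function: '$name'.")
      case Some(builder) =>
        try builder() catch {
          case e: Exception =>
            // Registered but broken: report the real cause, not "not found".
            throw new RuntimeException(
              s"Failed to initialize function '$name': ${e.getMessage}", e)
        }
    }

  def main(args: Array[String]): Unit =
    try lookupFunction("udaf_xxx")
    catch { case e: Exception => println(e.getMessage) }
}
{code}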
[jira] [Updated] (SPARK-18683) REST APIs for standalone Master, Workers and Applications
[ https://issues.apache.org/jira/browse/SPARK-18683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-18683: - Summary: REST APIs for standalone Master, Workers and Applications (was: REST APIs for standalone Master and Workers) > REST APIs for standalone Master, Workers and Applications > > > Key: SPARK-18683 > URL: https://issues.apache.org/jira/browse/SPARK-18683 > Project: Spark > Issue Type: Improvement > Reporter: Shixiong Zhu > > It would be great to have some REST APIs to access Master, Workers and Applications information. Right now the only way to get this information is through the Web UI.
[jira] [Updated] (SPARK-18683) REST APIs for standalone Master and Workers
[ https://issues.apache.org/jira/browse/SPARK-18683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-18683: - Description: It would be great to have some REST APIs to access Master, Workers and Applications information. Right now the only way to get this information is through the Web UI. (was: It would be great to have some REST APIs to access Master and Workers information. Right now the only way to get this information is through the Web UI.) > REST APIs for standalone Master and Workers > --- > > Key: SPARK-18683 > URL: https://issues.apache.org/jira/browse/SPARK-18683 > Project: Spark > Issue Type: Improvement > Reporter: Shixiong Zhu > > It would be great to have some REST APIs to access Master, Workers and Applications information. Right now the only way to get this information is through the Web UI.
[jira] [Commented] (SPARK-20211) `1 > 0.0001` throws Decimal scale (0) cannot be greater than precision (-2) exception
[ https://issues.apache.org/jira/browse/SPARK-20211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15955285#comment-15955285 ] StanZhai commented on SPARK-20211: -- A workaround is difficult for me: because all of my SQL is generated by a higher-level system, I cannot cast every column to double. FLOOR and CEIL are frequently used functions, and not all users will give feedback to the community when they encounter this problem. We should pay attention to the correctness of the SQL. > `1 > 0.0001` throws Decimal scale (0) cannot be greater than precision (-2) > exception > - > > Key: SPARK-20211 > URL: https://issues.apache.org/jira/browse/SPARK-20211 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1 > Reporter: StanZhai > Labels: correctness > > The following SQL: > {code} > select 1 > 0.0001 from tb > {code} > throws a "Decimal scale (0) cannot be greater than precision (-2)" exception in Spark 2.x. > `floor(0.0001)` and `ceil(0.0001)` have the same problem in Spark 1.6.x and Spark 2.x.
[jira] [Updated] (SPARK-20211) `1 > 0.0001` throws Decimal scale (0) cannot be greater than precision (-2) exception
[ https://issues.apache.org/jira/browse/SPARK-20211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-20211: - Priority: Major (was: Minor) > `1 > 0.0001` throws Decimal scale (0) cannot be greater than precision (-2) > exception > - > > Key: SPARK-20211 > URL: https://issues.apache.org/jira/browse/SPARK-20211 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1 > Reporter: StanZhai > Labels: correctness > > The following SQL: > {code} > select 1 > 0.0001 from tb > {code} > throws a "Decimal scale (0) cannot be greater than precision (-2)" exception in Spark 2.x. > `floor(0.0001)` and `ceil(0.0001)` have the same problem in Spark 1.6.x and Spark 2.x.
[jira] [Created] (SPARK-20211) `1 > 0.0001` throws Decimal scale (0) cannot be greater than precision (-2) exception
StanZhai created SPARK-20211: Summary: `1 > 0.0001` throws Decimal scale (0) cannot be greater than precision (-2) exception Key: SPARK-20211 URL: https://issues.apache.org/jira/browse/SPARK-20211 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0, 2.0.2, 2.0.1, 2.0.0, 2.1.1 Reporter: StanZhai Priority: Critical The following SQL: {code} select 1 > 0.0001 from tb {code} throws a "Decimal scale (0) cannot be greater than precision (-2)" exception in Spark 2.x. `floor(0.0001)` and `ceil(0.0001)` have the same problem in Spark 1.6.x and Spark 2.x.
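A spark-shell sketch of the failure and of the cast-to-double workaround discussed in the comment above (the one-row view is mine so the snippet is self-contained; whether the workaround is viable depends on the SQL generator, as the reporter notes):
{code}
spark.sql("create temporary view tb as select 1 as x")

// Fails in affected versions with:
//   Decimal scale (0) cannot be greater than precision (-2)
spark.sql("select 1 > 0.0001 from tb").show()

// Workaround sketch: compare against a double instead of a decimal literal,
// and apply floor/ceil to a double as well.
spark.sql("select 1 > cast(0.0001 as double) from tb").show()
spark.sql("select floor(cast(0.0001 as double)), ceil(cast(0.0001 as double)) from tb").show()
{code}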
[jira] [Closed] (SPARK-19532) [Core]`DataStreamer for file` threads of DFSOutputStream leak if `spark.speculation` is set to true
[ https://issues.apache.org/jira/browse/SPARK-19532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai closed SPARK-19532. Resolution: Fixed > [Core]`DataStreamer for file` threads of DFSOutputStream leak if `spark.speculation` is set to true > > > Key: SPARK-19532 > URL: https://issues.apache.org/jira/browse/SPARK-19532 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL > Affects Versions: 2.1.0 > Reporter: StanZhai > Priority: Critical > > When `spark.speculation` is set to true, the Executor thread dump page of the WebUI shows about 1300 threads named "DataStreamer for file /test/data/test_temp/_temporary/0/_temporary/attempt_20170207172435_80750_m_69_1/part-00069-690407af-0900-46b1-9590-a6d6c696fe68.snappy.parquet" in TIMED_WAITING state. > {code} > java.lang.Object.wait(Native Method) > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:564) > {code} > Off-heap memory usage grows continuously until the Executor exits with an OOM exception. > This problem occurs only when writing data to Hadoop (tasks may be killed by the Executor during writing). > Could this be related to [https://issues.apache.org/jira/browse/HDFS-9812]? > The version of Hadoop is 2.6.4.
[jira] [Updated] (SPARK-19766) INNER JOIN on constant alias columns return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-19766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-19766: - Summary: INNER JOIN on constant alias columns return incorrect results (was: INNER JOIN on constant alias columns returns incorrect results) > INNER JOIN on constant alias columns return incorrect results > - > > Key: SPARK-19766 > URL: https://issues.apache.org/jira/browse/SPARK-19766 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.1.0 > Reporter: StanZhai > Priority: Critical > > We can demonstrate the problem with the following data set and query: > {code} > val spark = SparkSession.builder().appName("test").master("local").getOrCreate() > val sql1 = > """ > |create temporary view t1 as select * from values > |(1) > |as grouping(a) > """.stripMargin > val sql2 = > """ > |create temporary view t2 as select * from values > |(1) > |as grouping(a) > """.stripMargin > val sql3 = > """ > |create temporary view t3 as select * from values > |(1), > |(1) > |as grouping(a) > """.stripMargin > val sql4 = > """ > |create temporary view t4 as select * from values > |(1), > |(1) > |as grouping(a) > """.stripMargin > val sqlA = > """ > |create temporary view ta as > |select a, 'a' as tag from t1 union all > |select a, 'b' as tag from t2 > """.stripMargin > val sqlB = > """ > |create temporary view tb as > |select a, 'a' as tag from t3 union all > |select a, 'b' as tag from t4 > """.stripMargin > val sql = > """ > |select tb.* from ta inner join tb on > |ta.a = tb.a and > |ta.tag = tb.tag > """.stripMargin > spark.sql(sql1) > spark.sql(sql2) > spark.sql(sql3) > spark.sql(sql4) > spark.sql(sqlA) > spark.sql(sqlB) > spark.sql(sql).show() > {code} > The results, which are incorrect: > {code} > +---+---+ > | a|tag| > +---+---+ > | 1| b| > | 1| b| > | 1| a| > | 1| a| > | 1| b| > | 1| b| > | 1| a| > | 1| a| > +---+---+ > {code} > The correct results should be: > {code} > +---+---+ > | a|tag| > +---+---+ > | 1| a| > | 1| a| > | 1| b| > | 1| b| > +---+---+ > {code}
[jira] [Created] (SPARK-19766) INNER JOIN on constant alias columns returns incorrect results
StanZhai created SPARK-19766: Summary: INNER JOIN on constant alias columns returns incorrect results Key: SPARK-19766 URL: https://issues.apache.org/jira/browse/SPARK-19766 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: StanZhai Priority: Critical We can demonstrate the problem with the following data set and query: {code} val spark = SparkSession.builder().appName("test").master("local").getOrCreate() val sql1 = """ |create temporary view t1 as select * from values |(1) |as grouping(a) """.stripMargin val sql2 = """ |create temporary view t2 as select * from values |(1) |as grouping(a) """.stripMargin val sql3 = """ |create temporary view t3 as select * from values |(1), |(1) |as grouping(a) """.stripMargin val sql4 = """ |create temporary view t4 as select * from values |(1), |(1) |as grouping(a) """.stripMargin val sqlA = """ |create temporary view ta as |select a, 'a' as tag from t1 union all |select a, 'b' as tag from t2 """.stripMargin val sqlB = """ |create temporary view tb as |select a, 'a' as tag from t3 union all |select a, 'b' as tag from t4 """.stripMargin val sql = """ |select tb.* from ta inner join tb on |ta.a = tb.a and |ta.tag = tb.tag """.stripMargin spark.sql(sql1) spark.sql(sql2) spark.sql(sql3) spark.sql(sql4) spark.sql(sqlA) spark.sql(sqlB) spark.sql(sql).show() {code} The results, which are incorrect: {code} +---+---+ | a|tag| +---+---+ | 1| b| | 1| b| | 1| a| | 1| a| | 1| b| | 1| b| | 1| a| | 1| a| +---+---+ {code} The correct results should be: {code} +---+---+ | a|tag| +---+---+ | 1| a| | 1| a| | 1| b| | 1| b| +---+---+ {code}
[jira] [Updated] (SPARK-19622) Fix an HTTP error in a paged table when using a `Go` button to search.
[ https://issues.apache.org/jira/browse/SPARK-19622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-19622: - Attachment: screenshot-1.png > Fix an HTTP error in a paged table when using a `Go` button to search. > - > > Key: SPARK-19622 > URL: https://issues.apache.org/jira/browse/SPARK-19622 > Project: Spark > Issue Type: Bug > Components: Web UI > Affects Versions: 2.1.0 > Reporter: StanZhai > Priority: Minor > Attachments: screenshot-1.png > > > The search function of the paged table is not available because we don't skip the hash data of the request path.
[jira] [Created] (SPARK-19622) Fix an HTTP error in a paged table when using a `Go` button to search.
StanZhai created SPARK-19622: Summary: Fix an HTTP error in a paged table when using a `Go` button to search. Key: SPARK-19622 URL: https://issues.apache.org/jira/browse/SPARK-19622 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.1.0 Reporter: StanZhai Priority: Minor The search function of the paged table is not available because we don't skip the hash data of the request path.
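A minimal Scala sketch of the idea (the actual Web UI patch may differ): strip the "#fragment" from the request path before appending the page parameter, so the URL the `Go` button builds stays well-formed. The parameter name "task.page" in the usage note is illustrative.
{code}
import java.net.URI

// Drop the fragment, then append the page parameter to the real query string.
def pageUrl(requestPath: String, pageParam: String, page: Int): String = {
  val uri = new URI(requestPath)
  val base = new URI(uri.getScheme, uri.getAuthority, uri.getPath, uri.getQuery, null)
  val sep = if (uri.getQuery == null) "?" else "&"
  s"$base$sep$pageParam=$page"
}

// pageUrl("http://host:4040/stages/stage/?id=0#tasks", "task.page", 2)
//   => http://host:4040/stages/stage/?id=0&task.page=2
{code}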
[jira] [Commented] (SPARK-19532) [Core]`DataStreamer for file` threads of DFSOutputStream leak if `spark.speculation` is set to true
[ https://issues.apache.org/jira/browse/SPARK-19532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15864953#comment-15864953 ] StanZhai commented on SPARK-19532: -- I can reproduce this by splitting our online data onto the production test cluster using our Spark application. Our application is a web service; SQL job requests are handled by it concurrently (like the hive-thriftserver). It's really a bit difficult to reproduce in a development environment. > [Core]`DataStreamer for file` threads of DFSOutputStream leak if `spark.speculation` is set to true > > > Key: SPARK-19532 > URL: https://issues.apache.org/jira/browse/SPARK-19532 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL > Affects Versions: 2.1.0 > Reporter: StanZhai > Priority: Critical > > When `spark.speculation` is set to true, the Executor thread dump page of the WebUI shows about 1300 threads named "DataStreamer for file /test/data/test_temp/_temporary/0/_temporary/attempt_20170207172435_80750_m_69_1/part-00069-690407af-0900-46b1-9590-a6d6c696fe68.snappy.parquet" in TIMED_WAITING state. > {code} > java.lang.Object.wait(Native Method) > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:564) > {code} > Off-heap memory usage grows continuously until the Executor exits with an OOM exception. > This problem occurs only when writing data to Hadoop (tasks may be killed by the Executor during writing). > Could this be related to [https://issues.apache.org/jira/browse/HDFS-9812]? > The version of Hadoop is 2.6.4.
[jira] [Updated] (SPARK-19532) [Core]`DataStreamer for file` threads of DFSOutputStream leak if `spark.speculation` is set to true
[ https://issues.apache.org/jira/browse/SPARK-19532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-19532: - Description: When `spark.speculation` is set to true, the Executor thread dump page of the WebUI shows about 1300 threads named "DataStreamer for file /test/data/test_temp/_temporary/0/_temporary/attempt_20170207172435_80750_m_69_1/part-00069-690407af-0900-46b1-9590-a6d6c696fe68.snappy.parquet" in TIMED_WAITING state. {code} java.lang.Object.wait(Native Method) org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:564) {code} Off-heap memory usage grows continuously until the Executor exits with an OOM exception. This problem occurs only when writing data to Hadoop (tasks may be killed by the Executor during writing). Could this be related to [https://issues.apache.org/jira/browse/HDFS-9812]? The version of Hadoop is 2.6.4. was: When `spark.speculation` is set to true, the Executor thread dump page of the WebUI shows about 1300 threads named "DataStreamer for file /test/data/test_temp/_temporary/0/_temporary/attempt_20170207172435_80750_m_69_1/part-00069-690407af-0900-46b1-9590-a6d6c696fe68.snappy.parquet" in TIMED_WAITING state. {code} java.lang.Object.wait(Native Method) org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:564) {code} Off-heap memory usage grows continuously until the Executor exits with an OOM exception. This problem occurs only when writing data to Hadoop (tasks may be killed by the Executor during writing). > [Core]`DataStreamer for file` threads of DFSOutputStream leak if `spark.speculation` is set to true > > > Key: SPARK-19532 > URL: https://issues.apache.org/jira/browse/SPARK-19532 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL > Affects Versions: 2.1.0 > Reporter: StanZhai > Priority: Blocker > > When `spark.speculation` is set to true, the Executor thread dump page of the WebUI shows about 1300 threads named "DataStreamer for file /test/data/test_temp/_temporary/0/_temporary/attempt_20170207172435_80750_m_69_1/part-00069-690407af-0900-46b1-9590-a6d6c696fe68.snappy.parquet" in TIMED_WAITING state. > {code} > java.lang.Object.wait(Native Method) > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:564) > {code} > Off-heap memory usage grows continuously until the Executor exits with an OOM exception. > This problem occurs only when writing data to Hadoop (tasks may be killed by the Executor during writing). > Could this be related to [https://issues.apache.org/jira/browse/HDFS-9812]? > The version of Hadoop is 2.6.4.
[jira] [Commented] (SPARK-19532) [Core]`DataStreamer for file` threads of DFSOutputStream leak if `spark.speculation` is set to true
[ https://issues.apache.org/jira/browse/SPARK-19532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862625#comment-15862625 ] StanZhai commented on SPARK-19532: -- We have been trying to upgrade our Spark since the release of Spark 2.1.0. This version is unusable for us because of these memory problems. > [Core]`DataStreamer for file` threads of DFSOutputStream leak if `spark.speculation` is set to true > > > Key: SPARK-19532 > URL: https://issues.apache.org/jira/browse/SPARK-19532 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL > Affects Versions: 2.1.0 > Reporter: StanZhai > Priority: Blocker > > When `spark.speculation` is set to true, the Executor thread dump page of the WebUI shows about 1300 threads named "DataStreamer for file /test/data/test_temp/_temporary/0/_temporary/attempt_20170207172435_80750_m_69_1/part-00069-690407af-0900-46b1-9590-a6d6c696fe68.snappy.parquet" in TIMED_WAITING state. > {code} > java.lang.Object.wait(Native Method) > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:564) > {code} > Off-heap memory usage grows continuously until the Executor exits with an OOM exception. > This problem occurs only when writing data to Hadoop (tasks may be killed by the Executor during writing).
[jira] [Created] (SPARK-19532) [Core]`DataStreamer for file` threads of DFSOutputStream leak if `spark.speculation` is set to true
StanZhai created SPARK-19532: Summary: [Core]`DataStreamer for file` threads of DFSOutputStream leak if `spark.speculation` is set to true Key: SPARK-19532 URL: https://issues.apache.org/jira/browse/SPARK-19532 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 2.1.0 Reporter: StanZhai Priority: Blocker When `spark.speculation` is set to true, the Executor thread dump page of the WebUI shows about 1300 threads named "DataStreamer for file /test/data/test_temp/_temporary/0/_temporary/attempt_20170207172435_80750_m_69_1/part-00069-690407af-0900-46b1-9590-a6d6c696fe68.snappy.parquet" in TIMED_WAITING state. {code} java.lang.Object.wait(Native Method) org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:564) {code} Off-heap memory usage grows continuously until the Executor exits with an OOM exception. This problem occurs only when writing data to Hadoop (tasks may be killed by the Executor during writing).
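A hedged sketch of one mitigation direction (an assumption of mine, not Spark's actual patch): make sure the HDFS output stream is closed even when a speculative attempt is killed mid-write, so its DataStreamer thread can exit. TaskContext.addTaskCompletionListener runs on success, failure, and kill alike.
{code}
import java.io.OutputStream
import org.apache.spark.TaskContext

// Close the stream when the task completes for any reason (including being
// killed as a lost speculative attempt), instead of relying on the normal
// write path reaching close().
def writeWithCleanup(out: OutputStream, bytes: Array[Byte]): Unit = {
  val ctx = TaskContext.get()
  if (ctx != null) {
    ctx.addTaskCompletionListener { _: TaskContext =>
      try out.close() catch { case _: Exception => () } // best-effort close
    }
  }
  out.write(bytes)
}
{code}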
[jira] [Updated] (SPARK-19509) [SQL]GROUPING SETS throws NullPointerException when using an empty column
[ https://issues.apache.org/jira/browse/SPARK-19509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-19509: - Description: {code:sql|title=A simple case} select count(1) from test group by e grouping sets(e) {code} {code:title=Schema of the test table} scala> spark.sql("desc test").show() +--------+---------+-------+ |col_name|data_type|comment| +--------+---------+-------+ |       e|   string|   null| +--------+---------+-------+ {code} {code:sql|title=The column `e` is empty} scala> spark.sql("select e from test").show() +----+ |   e| +----+ |null| |null| +----+ {code} {code:title=Exception} Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38) at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377) at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113) at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112) at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795) at org.apache.spark.sql.Dataset.head(Dataset.scala:2112) at org.apache.spark.sql.Dataset.take(Dataset.scala:2327) at org.apache.spark.sql.Dataset.showString(Dataset.scala:248) at org.apache.spark.sql.Dataset.show(Dataset.scala:636) at org.apache.spark.sql.Dataset.show(Dataset.scala:595) at org.apache.spark.sql.Dataset.show(Dataset.scala:604) ... 48 elided Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} was: {code:sql|title=A simple case} select count(1) from test group by e grouping sets(e) {code} {code:sql|title=The column `e` is empty} scala> spark.sql("select e from test").show() +----+ |   e| +----+ |null| |null| +----+ {code} {code:title=Exception} Driver stacktrace: at
[jira] [Commented] (SPARK-19509) [SQL]GROUPING SETS throws NullPointerException when using a null column
[ https://issues.apache.org/jira/browse/SPARK-19509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15858009#comment-15858009 ] StanZhai commented on SPARK-19509: -- But, these‘s another problem, I've modified the description. > [SQL]GROUPING SETS throws NullPointerException when use a null column > - > > Key: SPARK-19509 > URL: https://issues.apache.org/jira/browse/SPARK-19509 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: StanZhai >Priority: Critical > > {code:sql|title=A simple case} > select count(1) from test group by e grouping sets(e) > {code} > {code:sql|title=The column `e` is empty} > scala> spark.sql("select e from test").show() > ++ > | e| > ++ > |null| > |null| > ++ > {code} > {code:title=Exception} > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944) > at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) > at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377) > at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113) > at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112) > at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795) > at org.apache.spark.sql.Dataset.head(Dataset.scala:2112) > at org.apache.spark.sql.Dataset.take(Dataset.scala:2327) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:248) > at org.apache.spark.sql.Dataset.show(Dataset.scala:636) > at org.apache.spark.sql.Dataset.show(Dataset.scala:595) > at 
org.apache.spark.sql.Dataset.show(Dataset.scala:604) > ... 48 elided > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at >
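For context, a minimal, self-contained sketch of the reproduction described above. It assumes a Spark 2.1.x build on which the bug is present; the table name `test` and the all-null string column `e` mirror the snippets quoted in the issue.

{code:title=Repro sketch (Scala, spark-shell or a standalone app)}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("grouping-sets-npe-repro")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// A single nullable string column whose values are all null, as in the report.
Seq(Tuple1(Option.empty[String]), Tuple1(Option.empty[String]))
  .toDF("e")
  .createOrReplaceTempView("test")

// On affected versions this query is expected to fail with the
// java.lang.NullPointerException from the generated aggregate code above.
spark.sql("select count(1) from test group by e grouping sets(e)").show()
{code}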
[jira] [Updated] (SPARK-19509) [SQL]GROUPING SETS throws NullPointerException when using an empty column
[ https://issues.apache.org/jira/browse/SPARK-19509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-19509: - Summary: [SQL]GROUPING SETS throws NullPointerException when using an empty column (was: [SQL]GROUPING SETS throws NullPointerException when using a null column) > [SQL]GROUPING SETS throws NullPointerException when using an empty column > --- > > Key: SPARK-19509 > URL: https://issues.apache.org/jira/browse/SPARK-19509 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: StanZhai >Priority: Critical > > {code:sql|title=A simple case} > select count(1) from test group by e grouping sets(e) > {code} > {code:sql|title=The column `e` is empty} > scala> spark.sql("select e from test").show() > ++ > | e| > ++ > |null| > |null| > ++ > {code} > {code:title=Exception} > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944) > at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) > at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377) > at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113) > at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112) > at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795) > at org.apache.spark.sql.Dataset.head(Dataset.scala:2112) > at org.apache.spark.sql.Dataset.take(Dataset.scala:2327) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:248) > at org.apache.spark.sql.Dataset.show(Dataset.scala:636) > at 
org.apache.spark.sql.Dataset.show(Dataset.scala:595) > at org.apache.spark.sql.Dataset.show(Dataset.scala:604) > ... 48 elided > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at >
[jira] [Updated] (SPARK-19509) [SQL]GROUPING SETS throws NullPointerException when using a null column
[ https://issues.apache.org/jira/browse/SPARK-19509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-19509: - Description: {code:sql|title=A simple case} select count(1) from test group by e grouping sets(e) {code} {code:sql|title=The column `e` is empty} scala> spark.sql("select e from test").show() ++ | e| ++ |null| |null| ++ {code} {code:title=Exception} Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38) at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377) at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113) at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112) at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795) at org.apache.spark.sql.Dataset.head(Dataset.scala:2112) at org.apache.spark.sql.Dataset.take(Dataset.scala:2327) at org.apache.spark.sql.Dataset.showString(Dataset.scala:248) at org.apache.spark.sql.Dataset.show(Dataset.scala:636) at org.apache.spark.sql.Dataset.show(Dataset.scala:595) at org.apache.spark.sql.Dataset.show(Dataset.scala:604) ... 
48 elided Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} was: To reproduce the issue, a CASE WHEN statement must return a STRING value and the grouping sets must be empty. {code:sql|title=A simple case} select case "0" when "0" then "a" else "b" end from tb group by case "0" when "0" then "a" else "b" end grouping sets (()) {code} {code:title=Exception} Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source) at
[jira] [Updated] (SPARK-19509) [SQL]GROUPING SETS throws NullPointerException when using a null column
[ https://issues.apache.org/jira/browse/SPARK-19509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-19509: - Summary: [SQL]GROUPING SETS throws NullPointerException when using a null column (was: [SQL]GROUPING SETS throws NullPointerException when using a CASE WHEN statement) > [SQL]GROUPING SETS throws NullPointerException when using a null column > - > > Key: SPARK-19509 > URL: https://issues.apache.org/jira/browse/SPARK-19509 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: StanZhai >Priority: Critical > > To reproduce the issue, a CASE WHEN statement must return a STRING value > and the grouping sets must be empty. > {code:sql|title=A simple case} > select case "0" when "0" then "a" else "b" end > from tb > group by case "0" when "0" then "a" else "b" end > grouping sets (()) > {code} > {code:title=Exception} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-19509) [SQL]GROUPING SETS throws NullPointerException when using a CASE WHEN statement
[ https://issues.apache.org/jira/browse/SPARK-19509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai reopened SPARK-19509: -- It doesn't look like the same problem. I have tried to merge https://github.com/apache/spark/pull/15980 into branch-2.1.0. The problem still exists. > [SQL]GROUPING SETS throws NullPointerException when using a CASE WHEN statement > - > > Key: SPARK-19509 > URL: https://issues.apache.org/jira/browse/SPARK-19509 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: StanZhai >Priority: Critical > > To reproduce the issue, a CASE WHEN statement must return a STRING value > and the grouping sets must be empty. > {code:sql|title=A simple case} > select case "0" when "0" then "a" else "b" end > from tb > group by case "0" when "0" then "a" else "b" end > grouping sets (()) > {code} > {code:title=Exception} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19509) [SQL]GROUPING SETS throws NullPointerException when using a CASE WHEN statement
StanZhai created SPARK-19509: Summary: [SQL]GROUPING SETS throws NullPointerException when using a CASE WHEN statement Key: SPARK-19509 URL: https://issues.apache.org/jira/browse/SPARK-19509 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: StanZhai Priority: Critical To reproduce the issue, a CASE WHEN statement must return a STRING value and the grouping sets must be empty. {code:sql|title=A simple case} select case "0" when "0" then "a" else "b" end from tb group by case "0" when "0" then "a" else "b" end grouping sets (()) {code} {code:title=Exception} Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
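A compact sketch of this original reproduction, runnable in spark-shell against an affected 2.1.0 build; `tb` here stands in for any registered table or view, since the constant CASE WHEN key does not depend on its contents.

{code:title=CASE WHEN + empty grouping set sketch (Scala)}
// Any one-row view works as a stand-in for `tb`.
spark.range(1).createOrReplaceTempView("tb")

// Constant string-valued CASE WHEN key with an empty grouping set, as reported.
spark.sql(
  """select case "0" when "0" then "a" else "b" end
    |from tb
    |group by case "0" when "0" then "a" else "b" end
    |grouping sets (())""".stripMargin).show()
{code}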
[jira] [Created] (SPARK-19472) [SQL]SQLParser fails to resolve nested CASE WHEN statement with parentheses
StanZhai created SPARK-19472: Summary: [SQL]SQLParser fails to resolve nested CASE WHEN statement with parentheses Key: SPARK-19472 URL: https://issues.apache.org/jira/browse/SPARK-19472 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: StanZhai SQLParser fails to resolve a nested CASE WHEN statement like this: select case when (1) + case when 1>0 then 1 else 0 end = 2 then 1 else 0 end from tb Exception Exception in thread "main" org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'then' expecting {'.', '[', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 'IS', 'WHEN', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '^'}(line 5, pos 0) == SQL == select case when (1) + case when 1>0 then 1 else 0 end = 2 then 1 else 0 end ^^^ from tb But removing the parentheses parses fine: select case when 1 + case when 1>0 then 1 else 0 end = 2 then 1 else 0 end from tb -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
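A small sketch contrasting the two queries from the report; the only difference is the parentheses around the literal `1`. `tb` is illustrative, and the ParseException comment reflects the behavior described above on 2.1.0.

{code:title=Parser contrast sketch (Scala)}
spark.range(1).createOrReplaceTempView("tb")

val withParens =
  """select case when (1) + case when 1>0 then 1 else 0 end = 2 then 1 else 0 end
    |from tb""".stripMargin

val withoutParens =
  """select case when 1 + case when 1>0 then 1 else 0 end = 2 then 1 else 0 end
    |from tb""".stripMargin

// spark.sql(withParens)        // fails on affected versions with the ParseException above
spark.sql(withoutParens).show() // parses and runs fine
{code}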
[jira] [Created] (SPARK-19471) [SQL]A confusing NullPointerException when creating a table
StanZhai created SPARK-19471: Summary: [SQL]A confusing NullPointerException when creating a table Key: SPARK-19471 URL: https://issues.apache.org/jira/browse/SPARK-19471 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: StanZhai Priority: Critical After upgrading our Spark from 1.6.2 to 2.1.0, I encountered a confusing NullPointerException when creating a table under Spark 2.1.0, but the problem does not exist in Spark 1.6.1. Environment: Hive 1.2.1, Hadoop 2.6.4 Code // spark is an instance of HiveContext // merge is a Hive UDF val df = spark.sql("SELECT merge(field_a, null) AS new_a, field_b AS new_b FROM tb_1 group by field_a, field_b") df.createTempView("tb_temp") spark.sql("create table tb_result stored as parquet as " + "SELECT new_a " + "FROM tb_temp " + "LEFT JOIN `tb_2` ON " + "if(((`tb_temp`.`new_b`) = '' OR (`tb_temp`.`new_b`) IS NULL), concat('GrLSRwZE_', cast((rand() * 200) AS int)), (`tb_temp`.`new_b`)) = `tb_2`.`fka6862f17`") Physical Plan *Project [new_a] +- *BroadcastHashJoin [if (((new_b = ) || isnull(new_b))) concat(GrLSRwZE_, cast(cast((_nondeterministic * 200.0) as int) as string)) else new_b], [fka6862f17], LeftOuter, BuildRight :- HashAggregate(keys=[field_a, field_b], functions=[], output=[new_a, new_b, _nondeterministic]) : +- Exchange(coordinator ) hashpartitioning(field_a, field_b, 180), coordinator[target post-shuffle partition size: 1024880] : +- *HashAggregate(keys=[field_a, field_b], functions=[], output=[field_a, field_b]) :+- *FileScan parquet bdp.tb_1[field_a,field_b] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://hdcluster/data/tb_1, PartitionFilters: [], PushedFilters: [], ReadSchema: struct +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true])) +- *Project [fka6862f17] +- *FileScan parquet bdp.tb_2[fka6862f17] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://hdcluster/data/tb_2, PartitionFilters: [], PushedFilters: [], ReadSchema: struct What does '*' mean before HashAggregate? Exception org.apache.spark.SparkException: Task failed while writing rows ... 
java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_2$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$3.apply(AggregationIterator.scala:260) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$3.apply(AggregationIterator.scala:259) at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:392) at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:79) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:252) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:199) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:197) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:202) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$4.apply(FileFormatWriter.scala:138) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$4.apply(FileFormatWriter.scala:137) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at
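One way to narrow this down, sketched under the assumption that the nondeterministic rand()-based join key is implicated: materialize the salted key into a real column before the join. This is a diagnostic idea only, not a confirmed workaround, and `tb_keyed` / `join_key` are hypothetical names.

{code:title=Hypothetical narrowing sketch (Scala)}
// Materialize the salted key to disk first, so the join no longer sees
// a nondeterministic expression in its key...
spark.sql(
  """create table tb_keyed stored as parquet as
    |SELECT new_a,
    |       if(new_b = '' OR new_b IS NULL,
    |          concat('GrLSRwZE_', cast(rand() * 200 AS int)),
    |          new_b) AS join_key
    |FROM tb_temp""".stripMargin)

// ...then join on the plain, already-materialized column.
spark.sql(
  """create table tb_result stored as parquet as
    |SELECT new_a FROM tb_keyed
    |LEFT JOIN tb_2 ON tb_keyed.join_key = tb_2.fka6862f17""".stripMargin)
{code}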
[jira] [Created] (SPARK-19261) Support `ALTER TABLE table_name ADD COLUMNS(..)` statement
StanZhai created SPARK-19261: Summary: Support `ALTER TABLE table_name ADD COLUMNS(..)` statement Key: SPARK-19261 URL: https://issues.apache.org/jira/browse/SPARK-19261 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0, 2.0.2, 2.0.1, 2.0.0 Reporter: StanZhai Fix For: 2.2.0 We should support the `ALTER TABLE table_name ADD COLUMNS(..)` statement, which was already supported in versions < 2.x. This is very useful for those who want to upgrade their Spark version to 2.x. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
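The statement shape being requested (and, per the Fix Version above, available from Spark 2.2.0 on); the table and column names here are illustrative.

{code:title=ALTER TABLE ADD COLUMNS sketch (Scala)}
// Append new columns to an existing table's schema without rewriting its data.
spark.sql("ALTER TABLE logs ADD COLUMNS (ingest_ts TIMESTAMP, source STRING)")
{code}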
[jira] [Created] (SPARK-9465) Could not read parquet table after recreating it with the same table name
StanZhai created SPARK-9465: --- Summary: Could not read parquet table after recreating it with the same table name Key: SPARK-9465 URL: https://issues.apache.org/jira/browse/SPARK-9465 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Reporter: StanZhai I'm using Spark SQL in Spark 1.4.1. I encountered an error when reading a Parquet table after recreating it; the error can be reproduced as follows: ```scala // hc is an instance of HiveContext hc.sql("select * from b").show() // this is ok and b is a parquet table val df = hc.sql("select * from a") df.write.mode(SaveMode.Overwrite).saveAsTable("b") hc.sql("select * from b").show() // got error ``` The error is: java.io.FileNotFoundException: File does not exist: /user/hive/warehouse/test.db/b/part-r-4-3abcbb07-e20a-4b5e-a6e5-59356c3d3149.gz.parquet at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65) at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1716) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1659) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1639) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1613) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73) at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1144) at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1132) at org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1182) at org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:218) at org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:214) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:214) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:206) 
at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat$$anonfun$getTaskSideSplits$1.apply(ParquetTableOperations.scala:625) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat$$anonfun$getTaskSideSplits$1.apply(ParquetTableOperations.scala:621) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getTaskSideSplits(ParquetTableOperations.scala:621) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:511) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) at
[jira] [Updated] (SPARK-9465) Could not read parquet table after recreating it with the same table name
[ https://issues.apache.org/jira/browse/SPARK-9465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-9465: Description: I'm using Spark SQL in Spark 1.4.1. I encountered an error when reading a Parquet table after recreating it; the error can be reproduced as follows: {code} // hc is an instance of HiveContext hc.sql("select * from b").show() // this is ok and b is a parquet table val df = hc.sql("select * from a") df.write.mode(SaveMode.Overwrite).saveAsTable("b") hc.sql("select * from b").show() // got error {code} The error is: {code} java.io.FileNotFoundException: File does not exist: /user/hive/warehouse/test.db/b/part-r-4-3abcbb07-e20a-4b5e-a6e5-59356c3d3149.gz.parquet at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65) at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1716) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1659) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1639) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1613) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73) at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1144) at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1132) at org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1182) at org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:218) at org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:214) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:214) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:206) at 
org.apache.spark.sql.parquet.FilteringParquetRowInputFormat$$anonfun$getTaskSideSplits$1.apply(ParquetTableOperations.scala:625) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat$$anonfun$getTaskSideSplits$1.apply(ParquetTableOperations.scala:621) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getTaskSideSplits(ParquetTableOperations.scala:621) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:511) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:464) at
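A hedged mitigation sketch for this symptom, assuming the second read fails because of stale cached metadata for the overwritten table: explicitly refresh it after the overwrite. `HiveContext.refreshTable` exists in Spark 1.4, but whether it resolves this particular report is unverified.

{code:title=Refresh-after-overwrite sketch (Scala)}
import org.apache.spark.sql.SaveMode

// hc is an instance of HiveContext, as in the report.
val df = hc.sql("select * from a")
df.write.mode(SaveMode.Overwrite).saveAsTable("b")

// Invalidate any cached file listing/schema for `b` before reading it again.
hc.refreshTable("b")
hc.sql("select * from b").show()
{code}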
[jira] [Created] (SPARK-9010) Improve the Spark Configuration document about `spark.kryoserializer.buffer`
StanZhai created SPARK-9010: --- Summary: Improve the Spark Configuration document about `spark.kryoserializer.buffer` Key: SPARK-9010 URL: https://issues.apache.org/jira/browse/SPARK-9010 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: StanZhai Priority: Minor The documented meaning of `spark.kryoserializer.buffer` should be "Initial size of Kryo's serialization buffer. Note that there will be one buffer per core on each worker. This buffer will grow up to `spark.kryoserializer.buffer.max` if needed." The `spark.kryoserializer.buffer.max.mb` name is out-of-date in Spark 1.4. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9010) Improve the Spark Configuration document about `spark.kryoserializer.buffer`
[ https://issues.apache.org/jira/browse/SPARK-9010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-9010: Component/s: (was: SQL) Documentation Improve the Spark Configuration document about `spark.kryoserializer.buffer` Key: SPARK-9010 URL: https://issues.apache.org/jira/browse/SPARK-9010 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.4.0 Reporter: StanZhai Priority: Minor Labels: documentation The documented meaning of `spark.kryoserializer.buffer` should be "Initial size of Kryo's serialization buffer. Note that there will be one buffer per core on each worker. This buffer will grow up to `spark.kryoserializer.buffer.max` if needed." The `spark.kryoserializer.buffer.max.mb` name is out-of-date in Spark 1.4. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
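For reference, the post-1.4 configuration names in question, set on a SparkConf; the sizes shown are illustrative values, not recommendations.

{code:title=Kryo buffer config sketch (Scala)}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Initial per-core serialization buffer...
  .set("spark.kryoserializer.buffer", "64k")
  // ...and the ceiling it may grow to before Kryo gives up.
  .set("spark.kryoserializer.buffer.max", "64m")
{code}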
[jira] [Updated] (SPARK-8588) Could not use concat with UDF in where clause
[ https://issues.apache.org/jira/browse/SPARK-8588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-8588: Priority: Critical (was: Blocker) Could not use concat with UDF in where clause - Key: SPARK-8588 URL: https://issues.apache.org/jira/browse/SPARK-8588 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Environment: CentOS 7, Java 1.7.0_67, Scala 2.10.5, run in a Spark standalone cluster (or local). Reporter: StanZhai Priority: Critical After upgrading the cluster from Spark 1.3.1 to 1.4.0 (rc4), I encountered the following exception when using concat with a UDF in a WHERE clause: {code} org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'concat(HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(date#1776),年) at org.apache.spark.sql.catalyst.analysis.UnresolvedFunction.dataType(unresolved.scala:82) at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5$$anonfun$applyOrElse$15.apply(HiveTypeCoercion.scala:299) at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5$$anonfun$applyOrElse$15.apply(HiveTypeCoercion.scala:299) at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80) at scala.collection.immutable.List.exists(List.scala:84) at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5.applyOrElse(HiveTypeCoercion.scala:299) at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5.applyOrElse(HiveTypeCoercion.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:75) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:85) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:94) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:64) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformAllExpressions$1.applyOrElse(QueryPlan.scala:136) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformAllExpressions$1.applyOrElse(QueryPlan.scala:135) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
[jira] [Created] (SPARK-8588) Could not use concat with UDF in where clause
StanZhai created SPARK-8588: --- Summary: Could not use concat with UDF in where clause Key: SPARK-8588 URL: https://issues.apache.org/jira/browse/SPARK-8588 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Environment: CentOS 7, Java 1.7.0_67, Scala 2.10.5, run in a Spark standalone cluster (or local). Reporter: StanZhai Priority: Blocker After upgrading the cluster from Spark 1.3.1 to 1.4.0 (rc4), I encountered the following exception when using concat with a UDF in a WHERE clause: {code} org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'concat(HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(date#1776),年) at org.apache.spark.sql.catalyst.analysis.UnresolvedFunction.dataType(unresolved.scala:82) at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5$$anonfun$applyOrElse$15.apply(HiveTypeCoercion.scala:299) at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5$$anonfun$applyOrElse$15.apply(HiveTypeCoercion.scala:299) at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80) at scala.collection.immutable.List.exists(List.scala:84) at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5.applyOrElse(HiveTypeCoercion.scala:299) at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5.applyOrElse(HiveTypeCoercion.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:75) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:85) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:94) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:64) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformAllExpressions$1.applyOrElse(QueryPlan.scala:136) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformAllExpressions$1.applyOrElse(QueryPlan.scala:135) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) 
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at
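The shape of the failing pattern, reconstructed from the 'concat(HiveSimpleUdf#...UDFYear(date#1776),年)' fragment in the trace; the table name and literal are illustrative, and `year` resolves to the Hive UDF named in the stack.

{code:title=concat + Hive UDF in a WHERE clause (shape sketch, Scala)}
// hc is a HiveContext on the affected 1.4.0 build.
// On affected versions, analysis fails with the UnresolvedException above.
hc.sql("SELECT * FROM tb WHERE concat(year(`date`), '年') = '2015年'").show()
{code}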