[jira] [Updated] (SPARK-23532) [STANDALONE] Improve data locality when launching new executors for dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-23532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-23532: Description: Currently Spark on Yarn supports better data locality by considering the preferred locations of the pending tasks when dynamic allocation is enabled, refer to https://issues.apache.org/jira/browse/SPARK-4352. Mesos also supports data locality, refer to https://issues.apache.org/jira/browse/SPARK-16944 It would be better if Standalone could also support this feature. was: Currently Spark on Yarn supports better data locality by considering the preferred locations of the pending tasks when dynamic allocation is enabled, Refer to https://issues.apache.org/jira/browse/SPARK-4352. Mesos also supports data locality, Refer to https://issues.apache.org/jira/browse/SPARK-16944 It would be better if Standalone could also support this feature. > [STANDALONE] Improve data locality when launching new executors for dynamic > allocation > -- > > Key: SPARK-23532 > URL: https://issues.apache.org/jira/browse/SPARK-23532 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Major > > Currently Spark on Yarn supports better data locality by considering the > preferred locations of the pending tasks when dynamic allocation is enabled, > refer to https://issues.apache.org/jira/browse/SPARK-4352. > Mesos also supports data locality, refer to > https://issues.apache.org/jira/browse/SPARK-16944 > It would be better if Standalone could also support this feature. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23532) [STANDALONE] Improve data locality when launching new executors for dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-23532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23532: Assignee: (was: Apache Spark) > [STANDALONE] Improve data locality when launching new executors for dynamic > allocation > -- > > Key: SPARK-23532 > URL: https://issues.apache.org/jira/browse/SPARK-23532 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Major > > Currently Spark on Yarn supports better data locality by considering the > preferred locations of the pending tasks when dynamic allocation is enabled, > refer to https://issues.apache.org/jira/browse/SPARK-4352. > Mesos also supports data locality, refer to > https://issues.apache.org/jira/browse/SPARK-16944 > It would be better if Standalone could also support this feature.
[jira] [Commented] (SPARK-23532) [STANDALONE] Improve data locality when launching new executors for dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-23532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379906#comment-16379906 ] Apache Spark commented on SPARK-23532: -- User '10110346' has created a pull request for this issue: https://github.com/apache/spark/pull/20690 > [STANDALONE] Improve data locality when launching new executors for dynamic > allocation > -- > > Key: SPARK-23532 > URL: https://issues.apache.org/jira/browse/SPARK-23532 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Major > > Currently Spark on Yarn supports better data locality by considering the > preferred locations of the pending tasks when dynamic allocation is enabled, > refer to https://issues.apache.org/jira/browse/SPARK-4352. > Mesos also supports data locality, refer to > https://issues.apache.org/jira/browse/SPARK-16944 > It would be better if Standalone could also support this feature.
[jira] [Assigned] (SPARK-23532) [STANDALONE] Improve data locality when launching new executors for dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-23532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23532: Assignee: Apache Spark > [STANDALONE] Improve data locality when launching new executors for dynamic > allocation > -- > > Key: SPARK-23532 > URL: https://issues.apache.org/jira/browse/SPARK-23532 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Assignee: Apache Spark >Priority: Major > > Currently Spark on Yarn supports better data locality by considering the > preferred locations of the pending tasks when dynamic allocation is enabled, > refer to https://issues.apache.org/jira/browse/SPARK-4352. > Mesos also supports data locality, refer to > https://issues.apache.org/jira/browse/SPARK-16944 > It would be better if Standalone could also support this feature.
[jira] [Updated] (SPARK-23530) It's not appropriate to let the original master exit while the leader of zookeeper shutdown
[ https://issues.apache.org/jira/browse/SPARK-23530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxianjiao updated SPARK-23530: Priority: Critical (was: Major) > It's not appropriate to let the original master exit while the leader of > zookeeper shutdown > --- > > Key: SPARK-23530 > URL: https://issues.apache.org/jira/browse/SPARK-23530 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1, 2.3.0 >Reporter: liuxianjiao >Priority: Critical > > When the ZooKeeper leader shuts down, Spark's current approach is to let the > master exit in order to revoke its leadership. However, this sacrifices a > master node. Following the approach taken by Hadoop and Storm, we should let > the original active master become a standby, re-elect a Spark master, or find > some other way to revoke leadership gracefully.
[jira] [Updated] (SPARK-23532) [STANDALONE] Improve data locality when launching new executors for dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-23532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-23532: Description: Currently Spark on Yarn supports better data locality by considering the preferred locations of the pending tasks when dynamic allocation is enabled, refer to https://issues.apache.org/jira/browse/SPARK-4352. Mesos also supports data locality, refer to https://issues.apache.org/jira/browse/SPARK-16944 It would be better if Standalone could also support this feature. was: Currently Spark on Yarn supports better data locality by considering the preferred locations of the pending tasks when dynamic allocation is enabled, refer to https://issues.apache.org/jira/browse/SPARK-4352. Mesos also supports data locality, refer to https://issues.apache.org/jira/browse/SPARK-16944+ It would be better if Standalone could also support this feature. > [STANDALONE] Improve data locality when launching new executors for dynamic > allocation > -- > > Key: SPARK-23532 > URL: https://issues.apache.org/jira/browse/SPARK-23532 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Major > > Currently Spark on Yarn supports better data locality by considering the > preferred locations of the pending tasks when dynamic allocation is enabled, > refer to https://issues.apache.org/jira/browse/SPARK-4352. > Mesos also supports data locality, refer to > https://issues.apache.org/jira/browse/SPARK-16944 > It would be better if Standalone could also support this feature.
[jira] [Updated] (SPARK-23532) [STANDALONE] Improve data locality when launching new executors for dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-23532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-23532: Description: Currently Spark on Yarn supports better data locality by considering the preferred locations of the pending tasks when dynamic allocation is enabled, refer to https://issues.apache.org/jira/browse/SPARK-4352. Mesos also supports data locality, refer to https://issues.apache.org/jira/browse/SPARK-16944+ It would be better if Standalone could also support this feature. was: Currently Spark on Yarn supports better data locality by considering the preferred locations of the pending tasks when dynamic allocation is enabled, refer to https://issues.apache.org/jira/browse/SPARK-4352. Mesos also supports data locality, refer to +https://issues.apache.org/jira/browse/SPARK-16944+ It would be better if Standalone could also support this feature. > [STANDALONE] Improve data locality when launching new executors for dynamic > allocation > -- > > Key: SPARK-23532 > URL: https://issues.apache.org/jira/browse/SPARK-23532 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Major > > Currently Spark on Yarn supports better data locality by considering the > preferred locations of the pending tasks when dynamic allocation is enabled, > refer to https://issues.apache.org/jira/browse/SPARK-4352. > Mesos also supports data locality, refer to > https://issues.apache.org/jira/browse/SPARK-16944+ > It would be better if Standalone could also support this feature.
[jira] [Assigned] (SPARK-23533) Add support for changing ContinuousDataReader's startOffset
[ https://issues.apache.org/jira/browse/SPARK-23533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23533: Assignee: Apache Spark > Add support for changing ContinuousDataReader's startOffset > --- > > Key: SPARK-23533 > URL: https://issues.apache.org/jira/browse/SPARK-23533 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Li Yuanjian >Assignee: Apache Spark >Priority: Major > > As discussed in [https://github.com/apache/spark/pull/20675], we need to add a > new interface `ContinuousDataReaderFactory` to support setting the start > offset in Continuous Processing.
[jira] [Assigned] (SPARK-23533) Add support for changing ContinuousDataReader's startOffset
[ https://issues.apache.org/jira/browse/SPARK-23533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23533: Assignee: (was: Apache Spark) > Add support for changing ContinuousDataReader's startOffset > --- > > Key: SPARK-23533 > URL: https://issues.apache.org/jira/browse/SPARK-23533 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Li Yuanjian >Priority: Major > > As discussed in [https://github.com/apache/spark/pull/20675], we need to add a > new interface `ContinuousDataReaderFactory` to support setting the start > offset in Continuous Processing.
[jira] [Commented] (SPARK-23533) Add support for changing ContinuousDataReader's startOffset
[ https://issues.apache.org/jira/browse/SPARK-23533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379897#comment-16379897 ] Apache Spark commented on SPARK-23533: -- User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/20689 > Add support for changing ContinuousDataReader's startOffset > --- > > Key: SPARK-23533 > URL: https://issues.apache.org/jira/browse/SPARK-23533 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Li Yuanjian >Priority: Major > > As discussed in [https://github.com/apache/spark/pull/20675], we need to add a > new interface `ContinuousDataReaderFactory` to support setting the start > offset in Continuous Processing.
[jira] [Updated] (SPARK-23533) Add support for changing ContinuousDataReader's startOffset
[ https://issues.apache.org/jira/browse/SPARK-23533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Yuanjian updated SPARK-23533: Summary: Add support for changing ContinuousDataReader's startOffset (was: Add support for changing ContinousDataReader's startOffset) > Add support for changing ContinuousDataReader's startOffset > --- > > Key: SPARK-23533 > URL: https://issues.apache.org/jira/browse/SPARK-23533 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Li Yuanjian >Priority: Major > > As discussed in [https://github.com/apache/spark/pull/20675], we need to add a > new interface `ContinuousDataReaderFactory` to support setting the start > offset in Continuous Processing.
[jira] [Created] (SPARK-23533) Add support for changing ContinousDataReader's startOffset
Li Yuanjian created SPARK-23533: --- Summary: Add support for changing ContinousDataReader's startOffset Key: SPARK-23533 URL: https://issues.apache.org/jira/browse/SPARK-23533 Project: Spark Issue Type: Task Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Li Yuanjian As discussed in [https://github.com/apache/spark/pull/20675], we need to add a new interface `ContinuousDataReaderFactory` to support setting the start offset in Continuous Processing.
[jira] [Created] (SPARK-23532) [STANDALONE] Improve data locality when launching new executors for dynamic allocation
liuxian created SPARK-23532: --- Summary: [STANDALONE] Improve data locality when launching new executors for dynamic allocation Key: SPARK-23532 URL: https://issues.apache.org/jira/browse/SPARK-23532 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 2.3.0 Reporter: liuxian Currently Spark on Yarn supports better data locality by considering the preferred locations of the pending tasks when dynamic allocation is enabled, refer to https://issues.apache.org/jira/browse/SPARK-4352. Mesos also supports data locality, refer to +https://issues.apache.org/jira/browse/SPARK-16944+ It would be better if Standalone could also support this feature.
[jira] [Commented] (SPARK-14922) Alter Table Drop Partition Using Predicate-based Partition Spec
[ https://issues.apache.org/jira/browse/SPARK-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379807#comment-16379807 ] Astha Arya commented on SPARK-14922: Any updates? > Alter Table Drop Partition Using Predicate-based Partition Spec > --- > > Key: SPARK-14922 > URL: https://issues.apache.org/jira/browse/SPARK-14922 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.2, 2.2.1 >Reporter: Xiao Li >Priority: Major > > Below is allowed in Hive, but not allowed in Spark. > {noformat} > alter table ptestfilter drop partition (c='US', d<'2') > {noformat} > This example is copied from drop_partitions_filter.q
[jira] [Created] (SPARK-23531) When explain, plan's output should include attribute type info
Wenchen Fan created SPARK-23531: --- Summary: When explain, plan's output should include attribute type info Key: SPARK-23531 URL: https://issues.apache.org/jira/browse/SPARK-23531 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan
[jira] [Commented] (SPARK-23498) Accuracy problem in comparison with string and integer
[ https://issues.apache.org/jira/browse/SPARK-23498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379753#comment-16379753 ] Kevin Zhang commented on SPARK-23498: - Yes, thanks. But when we use Spark SQL to run existing Hive scripts, we expect Spark SQL to produce the same results as Hive, which is why I opened this JIRA. Now that [~q79969786] has marked this as a duplicate of [SPARK-21646|https://issues.apache.org/jira/browse/SPARK-21646], I will patch my own branch first. > Accuracy problem in comparison with string and integer > -- > > Key: SPARK-23498 > URL: https://issues.apache.org/jira/browse/SPARK-23498 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Kevin Zhang >Priority: Major > > While comparing a string column with an integer value, Spark SQL will > automatically cast the string operand to int; the following SQL will return > true in Hive but false in Spark > > {code:java} > select '1000.1'>1000 > {code} > > From the physical plan we can see the string operand was cast to int, which > caused the accuracy loss > {code:java} > *Project [false AS (CAST(1000.1 AS INT) > 1000)#4] > +- Scan OneRowRelation[] > {code} > To solve it, casting both operands of a binary operator to a wider common > type such as double may be safe.
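The truncation described in SPARK-23498 can be reproduced outside Spark. A minimal Python sketch (illustrative only, not Spark's cast implementation) contrasting the narrow int cast with the wider double cast:

```python
# Comparing '1000.1' > 1000 under two casting strategies.
s, n = "1000.1", 1000

# Narrow cast: coercing the string operand to int drops the fractional
# part, so 1000 > 1000 is false -- the accuracy loss the issue describes.
narrow = int(float(s)) > n

# Wide cast: promoting both operands to double preserves the .1,
# matching Hive's result.
wide = float(s) > float(n)

print(narrow, wide)  # False True
```

This is why the issue proposes a wider common type for the binary comparison rather than casting the string side down to int.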
[jira] [Updated] (SPARK-23530) It's not appropriate to let the original master exit while the leader of zookeeper shutdown
[ https://issues.apache.org/jira/browse/SPARK-23530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxianjiao updated SPARK-23530: Affects Version/s: 2.3.0 > It's not appropriate to let the original master exit while the leader of > zookeeper shutdown > --- > > Key: SPARK-23530 > URL: https://issues.apache.org/jira/browse/SPARK-23530 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1, 2.3.0 >Reporter: liuxianjiao >Priority: Major > > When the ZooKeeper leader shuts down, Spark's current approach is to let the > master exit in order to revoke its leadership. However, this sacrifices a > master node. Following the approach taken by Hadoop and Storm, we should let > the original active master become a standby, re-elect a Spark master, or find > some other way to revoke leadership gracefully.
[jira] [Issue Comment Deleted] (SPARK-23530) It's not appropriate to let the original master exit while the leader of zookeeper shutdown
[ https://issues.apache.org/jira/browse/SPARK-23530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxianjiao updated SPARK-23530: Comment: was deleted (was: When the leader of zookeeper shutdown,the current method of spark is letting the master exit to revoke the leadership.However,this sacrifice a master node.According the treatment of hadoop and storm ,we should let the origin active master to be standby ,or Re-election for spark master,or any other ways to revoke leadership gracefully.) > It's not appropriate to let the original master exit while the leader of > zookeeper shutdown > --- > > Key: SPARK-23530 > URL: https://issues.apache.org/jira/browse/SPARK-23530 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: liuxianjiao >Priority: Major > > When the ZooKeeper leader shuts down, Spark's current approach is to let the > master exit in order to revoke its leadership. However, this sacrifices a > master node. Following the approach taken by Hadoop and Storm, we should let > the original active master become a standby, re-elect a Spark master, or find > some other way to revoke leadership gracefully.
[jira] [Updated] (SPARK-23530) It's not appropriate to let the original master exit while the leader of zookeeper shutdown
[ https://issues.apache.org/jira/browse/SPARK-23530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxianjiao updated SPARK-23530: Description: When the ZooKeeper leader shuts down, Spark's current approach is to let the master exit in order to revoke its leadership. However, this sacrifices a master node. Following the approach taken by Hadoop and Storm, we should let the original active master become a standby, re-elect a Spark master, or find some other way to revoke leadership gracefully. > It's not appropriate to let the original master exit while the leader of > zookeeper shutdown > --- > > Key: SPARK-23530 > URL: https://issues.apache.org/jira/browse/SPARK-23530 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: liuxianjiao >Priority: Major > > When the ZooKeeper leader shuts down, Spark's current approach is to let the > master exit in order to revoke its leadership. However, this sacrifices a > master node. Following the approach taken by Hadoop and Storm, we should let > the original active master become a standby, re-elect a Spark master, or find > some other way to revoke leadership gracefully.
[jira] [Commented] (SPARK-23530) It's not appropriate to let the original master exit while the leader of zookeeper shutdown
[ https://issues.apache.org/jira/browse/SPARK-23530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379723#comment-16379723 ] liuxianjiao commented on SPARK-23530: - When the ZooKeeper leader shuts down, Spark's current approach is to let the master exit in order to revoke its leadership. However, this sacrifices a master node. Following the approach taken by Hadoop and Storm, we should let the original active master become a standby, re-elect a Spark master, or find some other way to revoke leadership gracefully. > It's not appropriate to let the original master exit while the leader of > zookeeper shutdown > --- > > Key: SPARK-23530 > URL: https://issues.apache.org/jira/browse/SPARK-23530 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: liuxianjiao >Priority: Major >
[jira] [Created] (SPARK-23530) It's not appropriate to let the original master exit while the leader of zookeeper shutdown
liuxianjiao created SPARK-23530: --- Summary: It's not appropriate to let the original master exit while the leader of zookeeper shutdown Key: SPARK-23530 URL: https://issues.apache.org/jira/browse/SPARK-23530 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.1 Reporter: liuxianjiao
[jira] [Assigned] (SPARK-23096) Migrate rate source to v2
[ https://issues.apache.org/jira/browse/SPARK-23096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23096: Assignee: Apache Spark > Migrate rate source to v2 > - > > Key: SPARK-23096 > URL: https://issues.apache.org/jira/browse/SPARK-23096 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-23096) Migrate rate source to v2
[ https://issues.apache.org/jira/browse/SPARK-23096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379665#comment-16379665 ] Apache Spark commented on SPARK-23096: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/20688 > Migrate rate source to v2 > - > > Key: SPARK-23096 > URL: https://issues.apache.org/jira/browse/SPARK-23096 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major >
[jira] [Assigned] (SPARK-23096) Migrate rate source to v2
[ https://issues.apache.org/jira/browse/SPARK-23096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23096: Assignee: (was: Apache Spark) > Migrate rate source to v2 > - > > Key: SPARK-23096 > URL: https://issues.apache.org/jira/browse/SPARK-23096 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major >
[jira] [Resolved] (SPARK-23448) Dataframe returns wrong result when column don't respect datatype
[ https://issues.apache.org/jira/browse/SPARK-23448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23448. -- Resolution: Fixed Fix Version/s: 2.3.1 Issue resolved by pull request 20666 [https://github.com/apache/spark/pull/20666] > Dataframe returns wrong result when column don't respect datatype > - > > Key: SPARK-23448 > URL: https://issues.apache.org/jira/browse/SPARK-23448 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 > Environment: Local >Reporter: Ahmed ZAROUI >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.3.1 > > > I have the following json file that contains some noisy data (String instead > of Array): > > {code:java} > {"attr1":"val1","attr2":"[\"val2\"]"} > {"attr1":"val1","attr2":["val2"]} > {code} > And I need to specify the schema programmatically like this: > > {code:java} > implicit val spark = SparkSession > .builder() > .master("local[*]") > .config("spark.ui.enabled", false) > .config("spark.sql.caseSensitive", "True") > .getOrCreate() > import spark.implicits._ > val schema = StructType( > Seq(StructField("attr1", StringType, true), > StructField("attr2", ArrayType(StringType, true), true))) > spark.read.schema(schema).json(input).collect().foreach(println) > {code} > The result given by this code is: > {code:java} > [null,null] > [val1,WrappedArray(val2)] > {code} > Instead of putting null only in the corrupted column, all columns of the > first record are null
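For contrast, the behavior the reporter expected (null only the mismatched field and keep the rest of the record) can be sketched with plain Python and the standard json module. This is an illustration of the desired semantics, not Spark's JSON parser:

```python
import json

# The two input lines from the issue: attr2 is a String in the first
# record but an Array in the second.
lines = [
    '{"attr1":"val1","attr2":"[\\"val2\\"]"}',
    '{"attr1":"val1","attr2":["val2"]}',
]

def parse(line):
    """Null only the field that violates the schema, keep the others."""
    rec = json.loads(line)
    # Schema says attr2 must be an array; null just that field on mismatch
    # instead of dropping the whole record.
    if not isinstance(rec.get("attr2"), list):
        rec["attr2"] = None
    return rec

for rec in map(parse, lines):
    print(rec)
```

Here the first record keeps `attr1` as "val1" with only `attr2` nulled, which is what the reporter expected instead of the all-null row Spark 2.0.2 produced.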
[jira] [Assigned] (SPARK-23448) Dataframe returns wrong result when column don't respect datatype
[ https://issues.apache.org/jira/browse/SPARK-23448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-23448: Assignee: Liang-Chi Hsieh > Dataframe returns wrong result when column don't respect datatype > - > > Key: SPARK-23448 > URL: https://issues.apache.org/jira/browse/SPARK-23448 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 > Environment: Local >Reporter: Ahmed ZAROUI >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.3.1 > > > I have the following json file that contains some noisy data (String instead > of Array): > > {code:java} > {"attr1":"val1","attr2":"[\"val2\"]"} > {"attr1":"val1","attr2":["val2"]} > {code} > And I need to specify the schema programmatically like this: > > {code:java} > implicit val spark = SparkSession > .builder() > .master("local[*]") > .config("spark.ui.enabled", false) > .config("spark.sql.caseSensitive", "True") > .getOrCreate() > import spark.implicits._ > val schema = StructType( > Seq(StructField("attr1", StringType, true), > StructField("attr2", ArrayType(StringType, true), true))) > spark.read.schema(schema).json(input).collect().foreach(println) > {code} > The result given by this code is: > {code:java} > [null,null] > [val1,WrappedArray(val2)] > {code} > Instead of putting null only in the corrupted column, all columns of the > first record are null
[jira] [Resolved] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2
[ https://issues.apache.org/jira/browse/SPARK-22968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jepson resolved SPARK-22968. Resolution: Fixed > java.lang.IllegalStateException: No current assignment for partition kssh-2 > --- > > Key: SPARK-22968 > URL: https://issues.apache.org/jira/browse/SPARK-22968 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 > Environment: Kafka: 0.10.0 (CDH5.12.0) > Apache Spark 2.1.1 > Spark streaming+Kafka >Reporter: Jepson >Priority: Major > Original Estimate: 96h > Remaining Estimate: 96h > > *Kafka Broker:* > {code:java} >message.max.bytes : 2621440 > {code} > *Spark Streaming+Kafka Code:* > {code:java} > , "max.partition.fetch.bytes" -> (5242880: java.lang.Integer) //default: > 1048576 > , "request.timeout.ms" -> (9: java.lang.Integer) //default: 6 > , "session.timeout.ms" -> (6: java.lang.Integer) //default: 3 > , "heartbeat.interval.ms" -> (5000: java.lang.Integer) > , "receive.buffer.bytes" -> (10485760: java.lang.Integer) > {code} > *Error message:* > {code:java} > 8/01/05 09:48:27 INFO internals.ConsumerCoordinator: Revoking previously > assigned partitions [kssh-7, kssh-4, kssh-3, kssh-6, kssh-5, kssh-0, kssh-2, > kssh-1] for group use_a_separate_group_id_for_each_stream > 18/01/05 09:48:27 INFO internals.AbstractCoordinator: (Re-)joining group > use_a_separate_group_id_for_each_stream > 18/01/05 09:48:27 INFO internals.AbstractCoordinator: Successfully joined > group use_a_separate_group_id_for_each_stream with generation 4 > 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Setting newly assigned > partitions [kssh-7, kssh-4, kssh-6, kssh-5] for group > use_a_separate_group_id_for_each_stream > 18/01/05 09:48:27 ERROR scheduler.JobScheduler: Error generating jobs for > time 1515116907000 ms > java.lang.IllegalStateException: No current assignment for partition kssh-2 > at > 
org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231) > at > org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295) > at > org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) > at > org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > at > org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247) > at scala.util.Try$.apply(Try.scala:192) > at > org.apache.spark.streaming.scheduler.JobGenerator.g
[jira] [Commented] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2
[ https://issues.apache.org/jira/browse/SPARK-22968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379639#comment-16379639 ] Jepson commented on SPARK-22968: It's been a long time, so I'm closing it for now. > java.lang.IllegalStateException: No current assignment for partition kssh-2 > --- > > Key: SPARK-22968 > URL: https://issues.apache.org/jira/browse/SPARK-22968 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 > Environment: Kafka: 0.10.0 (CDH5.12.0) > Apache Spark 2.1.1 > Spark streaming+Kafka >Reporter: Jepson >Priority: Major > Original Estimate: 96h > Remaining Estimate: 96h > > *Kafka Broker:* > {code:java} >message.max.bytes : 2621440 > {code} > *Spark Streaming+Kafka Code:* > {code:java} > , "max.partition.fetch.bytes" -> (5242880: java.lang.Integer) //default: > 1048576 > , "request.timeout.ms" -> (9: java.lang.Integer) //default: 6 > , "session.timeout.ms" -> (6: java.lang.Integer) //default: 3 > , "heartbeat.interval.ms" -> (5000: java.lang.Integer) > , "receive.buffer.bytes" -> (10485760: java.lang.Integer) > {code} > *Error message:* > {code:java} > 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Revoking previously > assigned partitions [kssh-7, kssh-4, kssh-3, kssh-6, kssh-5, kssh-0, kssh-2, > kssh-1] for group use_a_separate_group_id_for_each_stream > 18/01/05 09:48:27 INFO internals.AbstractCoordinator: (Re-)joining group > use_a_separate_group_id_for_each_stream > 18/01/05 09:48:27 INFO internals.AbstractCoordinator: Successfully joined > group use_a_separate_group_id_for_each_stream with generation 4 > 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Setting newly assigned > partitions [kssh-7, kssh-4, kssh-6, kssh-5] for group > use_a_separate_group_id_for_each_stream > 18/01/05 09:48:27 ERROR scheduler.JobScheduler: Error generating jobs for > time 1515116907000 ms > java.lang.IllegalStateException: No current assignment for partition kssh-2 > at > 
org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231) > at > org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295) > at > org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) > at > org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > at > org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247) > at scala.util.Try$.apply(Try.sc
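For context, the failing stream above is built with the spark-streaming-kafka-0-10 direct API. Below is a minimal sketch of that wiring (broker address, topic, and group id are placeholder assumptions, not from the report). The log's literal group name `use_a_separate_group_id_for_each_stream` points at the usual mitigation: give each stream its own consumer group, because a rebalance in a shared group revokes partitions the DStream still tracks, and the subsequent `seekToEnd` on a revoked partition raises exactly this "No current assignment for partition" error.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// Placeholder broker/topic/group names. The key point is a group.id
// that is unique to this stream: sharing one group across streams
// triggers rebalances that revoke partitions the DStream still tracks.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "kssh-stream-1",                    // unique per stream
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean),
  "max.partition.fetch.bytes" -> (5242880: java.lang.Integer)
)

// ssc is an existing org.apache.spark.streaming.StreamingContext
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Seq("kssh"), kafkaParams)
)
```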
[jira] [Comment Edited] (SPARK-13210) NPE in Sort
[ https://issues.apache.org/jira/browse/SPARK-13210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379614#comment-16379614 ] Lulu Cheng edited comment on SPARK-13210 at 2/28/18 1:23 AM: - [~Aspecting] Were you able to get around that? Have the same stack trace in 2.1.1 was (Author: l1990790120): [~Aspecting] Were you able to get around that, looking at a similar stack trace > NPE in Sort > --- > > Key: SPARK-13210 > URL: https://issues.apache.org/jira/browse/SPARK-13210 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > Fix For: 1.6.1, 2.0.0 > > > When run TPCDS query Q78 with scale 10: > {code} > 16/02/04 22:39:09 ERROR Executor: Managed memory leak detected; size = > 268435456 bytes, TID = 143 > 16/02/04 22:39:09 ERROR Executor: Exception in task 0.0 in stage 47.0 (TID > 143) > java.lang.NullPointerException > at > org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:333) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:60) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:39) > at > org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270) > at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142) > at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:239) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:415) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:116) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168) > at 
org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:87) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:60) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45) > at org.apache.spark.scheduler.Task.run(Task.scala:81) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13210) NPE in Sort
[ https://issues.apache.org/jira/browse/SPARK-13210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379614#comment-16379614 ] Lulu Cheng commented on SPARK-13210: [~Aspecting] Were you able to get around that, looking at a similar stack trace > NPE in Sort > --- > > Key: SPARK-13210 > URL: https://issues.apache.org/jira/browse/SPARK-13210 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > Fix For: 1.6.1, 2.0.0 > > > When run TPCDS query Q78 with scale 10: > {code} > 16/02/04 22:39:09 ERROR Executor: Managed memory leak detected; size = > 268435456 bytes, TID = 143 > 16/02/04 22:39:09 ERROR Executor: Exception in task 0.0 in stage 47.0 (TID > 143) > java.lang.NullPointerException > at > org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:333) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:60) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:39) > at > org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270) > at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142) > at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:239) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:415) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:116) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:87) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:60) > at > 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45) > at org.apache.spark.scheduler.Task.run(Task.scala:81) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23529) Specify hostpath volume and mount the volume in Spark driver and executor pods in Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-23529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suman Somasundar updated SPARK-23529: - Issue Type: Improvement (was: Sub-task) Parent: (was: SPARK-18278) > Specify hostpath volume and mount the volume in Spark driver and executor > pods in Kubernetes > > > Key: SPARK-23529 > URL: https://issues.apache.org/jira/browse/SPARK-23529 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Suman Somasundar >Assignee: Anirudh Ramanathan >Priority: Minor > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23417) pyspark tests give wrong sbt instructions
[ https://issues.apache.org/jira/browse/SPARK-23417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-23417: Assignee: Bruce Robbins > pyspark tests give wrong sbt instructions > - > > Key: SPARK-23417 > URL: https://issues.apache.org/jira/browse/SPARK-23417 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Bruce Robbins >Priority: Minor > Fix For: 2.4.0 > > > When running python/run-tests, the script indicates that I must run > "'build/sbt assembly/package streaming-kafka-0-8-assembly/assembly' or > 'build/mvn -Pkafka-0-8 package'". The sbt command fails: > > [error] Expected ID character > [error] Not a valid command: streaming-kafka-0-8-assembly > [error] Expected project ID > [error] Expected configuration > [error] Expected ':' (if selecting a configuration) > [error] Expected key > [error] Not a valid key: streaming-kafka-0-8-assembly > [error] streaming-kafka-0-8-assembly/assembly > [error] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23417) pyspark tests give wrong sbt instructions
[ https://issues.apache.org/jira/browse/SPARK-23417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23417. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20638 [https://github.com/apache/spark/pull/20638] > pyspark tests give wrong sbt instructions > - > > Key: SPARK-23417 > URL: https://issues.apache.org/jira/browse/SPARK-23417 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Bruce Robbins >Priority: Minor > Fix For: 2.4.0 > > > When running python/run-tests, the script indicates that I must run > "'build/sbt assembly/package streaming-kafka-0-8-assembly/assembly' or > 'build/mvn -Pkafka-0-8 package'". The sbt command fails: > > [error] Expected ID character > [error] Not a valid command: streaming-kafka-0-8-assembly > [error] Expected project ID > [error] Expected configuration > [error] Expected ':' (if selecting a configuration) > [error] Expected key > [error] Not a valid key: streaming-kafka-0-8-assembly > [error] streaming-kafka-0-8-assembly/assembly > [error] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
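The sbt parse errors quoted above are consistent with the kafka-0-8 subproject simply not being loaded: without its profile enabled, `streaming-kafka-0-8-assembly` is not a valid project ID. A likely working invocation (an assumption on my part, mirroring the `-Pkafka-0-8` flag that the mvn variant of the instruction already carries) is:

```shell
# Enable the kafka-0-8 profile first; without it sbt does not load the
# streaming-kafka-0-8-assembly project and reports "Not a valid project ID".
build/sbt -Pkafka-0-8 streaming-kafka-0-8-assembly/assembly
```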
[jira] [Commented] (SPARK-23471) RandomForestClassificationModel save() - incorrect metadata
[ https://issues.apache.org/jira/browse/SPARK-23471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379516#comment-16379516 ] Liang-Chi Hsieh commented on SPARK-23471: - I can't reproduce this. With `fit`, the params are copied and saved correctly. > RandomForestClassificationModel save() - incorrect metadata > --- > > Key: SPARK-23471 > URL: https://issues.apache.org/jira/browse/SPARK-23471 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: Keepun >Priority: Major > > RandomForestClassificationModel.load() does not work after save() > {code:java} > RandomForestClassifier rf = new RandomForestClassifier() > .setFeaturesCol("features") > .setLabelCol("result") > .setNumTrees(100) > .setMaxDepth(30) > .setMinInstancesPerNode(1) > //.setCacheNodeIds(true) > .setMaxMemoryInMB(500) > .setSeed(System.currentTimeMillis() + System.nanoTime()); > RandomForestClassificationModel rfmodel = rf.train(data); >try { > rfmodel.save(args[2] + "." 
+ System.currentTimeMillis()); >} catch (IOException e) { > LOG.error(e.getMessage(), e); > e.printStackTrace(); >} > {code} > File metadata\part-0: > {code:java} > {"class":"org.apache.spark.ml.classification.RandomForestClassificationModel", > "timestamp":1519136783983,"sparkVersion":"2.2.1","uid":"rfc_7c7e84ce7488", > "paramMap":{"featureSubsetStrategy":"auto","cacheNodeIds":false,"impurity":"gini", > "checkpointInterval":10, > "numTrees":20,"maxDepth":5, > "probabilityCol":"probability","labelCol":"label","featuresCol":"features", > "maxMemoryInMB":256,"minInstancesPerNode":1,"subsamplingRate":1.0, > "rawPredictionCol":"rawPrediction","predictionCol":"prediction","maxBins":32, > "minInfoGain":0.0,"seed":-491520797},"numFeatures":1354,"numClasses":2, > "numTrees":20} > {code} > should be: > {code:java} > "numTrees":100,"maxDepth":30,{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
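Following up on the comment above: the reporter's snippet calls `rf.train(data)` directly, which appears to build the model without copying the estimator's param map, which would explain the saved metadata showing defaults (numTrees=20, maxDepth=5). A minimal sketch of the path the commenter says works, going through the public `fit()` (column names and save path are placeholders):

```scala
import org.apache.spark.ml.classification.RandomForestClassifier

// Sketch with assumed column names: fit() copies the configured params
// (numTrees=100, maxDepth=30) into the resulting model, so save() writes
// them out correctly.
val rf = new RandomForestClassifier()
  .setFeaturesCol("features")
  .setLabelCol("result")
  .setNumTrees(100)
  .setMaxDepth(30)

val rfModel = rf.fit(data)   // not rf.train(data)
rfModel.save("/tmp/rf-model." + System.currentTimeMillis())
```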
[jira] [Created] (SPARK-23529) Specify hostpath volume and mount the volume in Spark driver and executor pods in Kubernetes
Suman Somasundar created SPARK-23529: Summary: Specify hostpath volume and mount the volume in Spark driver and executor pods in Kubernetes Key: SPARK-23529 URL: https://issues.apache.org/jira/browse/SPARK-23529 Project: Spark Issue Type: Sub-task Components: Kubernetes Affects Versions: 2.3.0 Reporter: Suman Somasundar Assignee: Anirudh Ramanathan Fix For: 2.3.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23529) Specify hostpath volume and mount the volume in Spark driver and executor pods in Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-23529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suman Somasundar updated SPARK-23529: - Priority: Minor (was: Blocker) > Specify hostpath volume and mount the volume in Spark driver and executor > pods in Kubernetes > > > Key: SPARK-23529 > URL: https://issues.apache.org/jira/browse/SPARK-23529 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Suman Somasundar >Assignee: Anirudh Ramanathan >Priority: Minor > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23528) Expose vital statistics of GaussianMixtureModel
Erich Schubert created SPARK-23528: -- Summary: Expose vital statistics of GaussianMixtureModel Key: SPARK-23528 URL: https://issues.apache.org/jira/browse/SPARK-23528 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.2.1 Reporter: Erich Schubert Spark ML should expose vital statistics of the GMM model: * *Number of iterations* (actual, not max) until the tolerance threshold was hit: we can set a maximum, but how do we know the limit was large enough, and how many iterations it really took? * Final *log likelihood* of the model: if we run multiple times with different starting conditions, how do we know which run converged to the better fit? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23333) SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame
[ https://issues.apache.org/jira/browse/SPARK-23333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379296#comment-16379296 ] Bago Amirbekian edited comment on SPARK-23333 at 2/27/18 9:24 PM: -- [~MBALearnsToCode] you can use a `VectorSizeHint` transformer to include `numAttributes` in the dataframe column metadata and avoid the call to `first`. https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSizeHint.scala#L42 was (Author: bago.amirbekian): [~MBALearnsToCode] you can use a `VectorSizeHint` transformer to include `numAttributes` in the dataframe column metadata and avoid the call to `first`. https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSizeHint.scala > SparkML VectorAssembler.transform slow when needing to invoke .first() on > sorted DataFrame > -- > > Key: SPARK-23333 > URL: https://issues.apache.org/jira/browse/SPARK-23333 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, SQL >Affects Versions: 2.2.1 >Reporter: V Luong >Priority: Minor > > Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes > oldDF.first() in order to establish some metadata/attributes: > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88.] > When oldDF is sorted, the above triggering of oldDF.first() can be very slow. > For the purpose of establishing metadata, taking an arbitrary row from oldDF > will be just as good as taking oldDF.first(). Is there hence a way we can > speed up a great deal by somehow grabbing a random row, instead of relying on > oldDF.first()? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23333) SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame
[ https://issues.apache.org/jira/browse/SPARK-23333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379296#comment-16379296 ] Bago Amirbekian commented on SPARK-23333: - [~MBALearnsToCode] you can use a `VectorSizeHint` transformer to include `numAttributes` in the dataframe column metadata and avoid the call to `first`. https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSizeHint.scala > SparkML VectorAssembler.transform slow when needing to invoke .first() on > sorted DataFrame > -- > > Key: SPARK-23333 > URL: https://issues.apache.org/jira/browse/SPARK-23333 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, SQL >Affects Versions: 2.2.1 >Reporter: V Luong >Priority: Minor > > Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes > oldDF.first() in order to establish some metadata/attributes: > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88.] > When oldDF is sorted, the above triggering of oldDF.first() can be very slow. > For the purpose of establishing metadata, taking an arbitrary row from oldDF > will be just as good as taking oldDF.first(). Is there hence a way we can > speed up a great deal by somehow grabbing a random row, instead of relying on > oldDF.first()? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
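The workaround suggested in the comment can be sketched as follows (column names and the size value are illustrative assumptions): declaring the vector size up front lets VectorAssembler read it from column metadata instead of calling first() on the sorted DataFrame.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}

// Placeholder column names. VectorSizeHint writes the size into the
// column metadata, so the downstream VectorAssembler no longer needs
// to materialize a first row to discover it.
val sizeHint = new VectorSizeHint()
  .setInputCol("userFeatures")
  .setSize(3)                    // known dimensionality of the vector column
  .setHandleInvalid("skip")      // drop rows whose vectors don't match

val assembler = new VectorAssembler()
  .setInputCols(Array("userFeatures", "hour"))
  .setOutputCol("features")

val pipeline = new Pipeline().setStages(Array(sizeHint, assembler))
```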
[jira] [Commented] (SPARK-19947) RFormulaModel always throws Exception on transforming data with NULL or Unseen labels
[ https://issues.apache.org/jira/browse/SPARK-19947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379282#comment-16379282 ] Bago Amirbekian commented on SPARK-19947: - I think this was resolved by [https://github.com/apache/spark/pull/18496] and [https://github.com/apache/spark/pull/18613]. > RFormulaModel always throws Exception on transforming data with NULL or > Unseen labels > - > > Key: SPARK-19947 > URL: https://issues.apache.org/jira/browse/SPARK-19947 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Andrey Yatsuk >Priority: Major > > I have a trained ML model and a big data table in Parquet. I want to add a new > column with predicted values to this table. I can't lose any data, but the new > column may contain null values. > RFormulaModel.fit() creates a new StringIndexer with the default > (handleInvalid="error") parameter, and VectorAssembler throws an Exception on > NULL values, so I must call df.na.drop() before transforming this DataFrame, > which I don't want to do. > RFormula needs a new parameter like StringIndexer's handleInvalid. > Or add a transform(Seq): Vector method that users can call as a UDF, > as in df.withColumn("predicted", functions.callUDF(rFormulaModel::transform, > Seq)) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
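For reference, the pull requests cited in the comment added handleInvalid support along this path; assuming Spark 2.3+, the behavior the reporter asks for can be sketched roughly as follows (formula, column names, and DataFrames are placeholders):

```scala
import org.apache.spark.ml.feature.RFormula

// Sketch assuming Spark 2.3+: RFormula exposes handleInvalid, so unseen
// labels can be kept as an extra category ("keep") or dropped ("skip")
// instead of throwing (the old "error" default).
val formula = new RFormula()
  .setFormula("label ~ f1 + f2")
  .setHandleInvalid("keep")        // don't throw on unseen/invalid levels

val model = formula.fit(trainingDF)
val scored = model.transform(testDF)   // rows with unseen labels no longer raise
```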
[jira] [Created] (SPARK-23527) Error with spark-submit and kerberos with TLS-enabled Hadoop cluster
Ron Gonzalez created SPARK-23527: Summary: Error with spark-submit and kerberos with TLS-enabled Hadoop cluster Key: SPARK-23527 URL: https://issues.apache.org/jira/browse/SPARK-23527 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 2.2.1 Environment: core-site.xml hadoop.security.key.provider.path kms://ht...@host1.domain.com;host2.domain.com:16000/kms hdfs-site.xml dfs.encryption.key.provider.uri kms://ht...@host1.domain.com;host2.domain.com:16000/kms Reporter: Ron Gonzalez For current configuration of our enterprise cluster, I submit using spark-submit: ./spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi --conf spark.yarn.jars=hdfs:/user/user1/spark/lib/*.jar ../examples/jars/spark-examples_2.11-2.2.1.jar 10 I am getting the following problem: 18/02/27 21:03:48 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 3351181 for svchdc236d on ha-hdfs:nameservice1 Exception in thread "main" java.lang.IllegalArgumentException: java.net.UnknownHostException: host1.domain.com;host2.domain.com at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374) at org.apache.hadoop.crypto.key.kms.KMSClientProvider.getDelegationTokenService(KMSClientProvider.java:825) at org.apache.hadoop.crypto.key.kms.KMSClientProvider.addDelegationTokens(KMSClientProvider.java:781) at org.apache.hadoop.crypto.key.KeyProviderDelegationTokenExtension.addDelegationTokens(KeyProviderDelegationTokenExtension.java:86) at org.apache.hadoop.hdfs.DistributedFileSystem.addDelegationTokens(DistributedFileSystem.java:2046) at org.apache.spark.deploy.yarn.security.HadoopFSCredentialProvider$$anonfun$obtainCredentials$1.apply(HadoopFSCredentialProvider.scala:52) If I get rid of the other host for the properties so instead of kms://ht...@host1.domain.com;host2.domain.com:16000/kms, I convert it to: kms://ht...@host1.domain.com:16000/kms it fails with a different error: java.io.IOException: 
javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target If I do the same thing using spark 1.6, it works so it seems like a regression... -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22915) ML test for StructuredStreaming: spark.ml.feature, N-Z
[ https://issues.apache.org/jira/browse/SPARK-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22915: Assignee: (was: Apache Spark) > ML test for StructuredStreaming: spark.ml.feature, N-Z > -- > > Key: SPARK-22915 > URL: https://issues.apache.org/jira/browse/SPARK-22915 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Priority: Major > > *For featurizers with names from N - Z* > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22915) ML test for StructuredStreaming: spark.ml.feature, N-Z
[ https://issues.apache.org/jira/browse/SPARK-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22915: Assignee: Apache Spark > ML test for StructuredStreaming: spark.ml.feature, N-Z > -- > > Key: SPARK-22915 > URL: https://issues.apache.org/jira/browse/SPARK-22915 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Major > > *For featurizers with names from N - Z* > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22915) ML test for StructuredStreaming: spark.ml.feature, N-Z
[ https://issues.apache.org/jira/browse/SPARK-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379214#comment-16379214 ] Apache Spark commented on SPARK-22915: -- User 'attilapiros' has created a pull request for this issue: https://github.com/apache/spark/pull/20686 > ML test for StructuredStreaming: spark.ml.feature, N-Z > -- > > Key: SPARK-22915 > URL: https://issues.apache.org/jira/browse/SPARK-22915 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Priority: Major > > *For featurizers with names from N - Z* > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23471) RandomForestClassificationModel save() - incorrect metadata
[ https://issues.apache.org/jira/browse/SPARK-23471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379202#comment-16379202 ] Bago Amirbekian edited comment on SPARK-23471 at 2/27/18 8:07 PM: -- [~Keepun], `train` is a protected API, it's called by `Predictor.fit` which also copies the values of Params to the newly created Model instance, [here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118]. Do you get this same issue if you use `RandomForestClassifier.fit`? was (Author: bago.amirbekian): [~Keepun] {noformat} train {noformat} is a protected API, it's called by {Predictor.fit} which also copies the values of Params to the newly created Model instance, [here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118]. Do you get this same issue if you use {RandomForestClassifier.fit}? > RandomForestClassificationModel save() - incorrect metadata > --- > > Key: SPARK-23471 > URL: https://issues.apache.org/jira/browse/SPARK-23471 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: Keepun >Priority: Major > > RandomForestClassificationMode.load() does not work after save() > {code:java} > RandomForestClassifier rf = new RandomForestClassifier() > .setFeaturesCol("features") > .setLabelCol("result") > .setNumTrees(100) > .setMaxDepth(30) > .setMinInstancesPerNode(1) > //.setCacheNodeIds(true) > .setMaxMemoryInMB(500) > .setSeed(System.currentTimeMillis() + System.nanoTime()); > RandomForestClassificationModel rfmodel = rf.train(data); >try { > rfmodel.save(args[2] + "." 
+ System.currentTimeMillis()); >} catch (IOException e) { > LOG.error(e.getMessage(), e); > e.printStackTrace(); >} > {code} > File metadata\part-0: > {code:java} > {"class":"org.apache.spark.ml.classification.RandomForestClassificationModel", > "timestamp":1519136783983,"sparkVersion":"2.2.1","uid":"rfc_7c7e84ce7488", > "paramMap":{"featureSubsetStrategy":"auto","cacheNodeIds":false,"impurity":"gini", > "checkpointInterval":10, > "numTrees":20,"maxDepth":5, > "probabilityCol":"probability","labelCol":"label","featuresCol":"features", > "maxMemoryInMB":256,"minInstancesPerNode":1,"subsamplingRate":1.0, > "rawPredictionCol":"rawPrediction","predictionCol":"prediction","maxBins":32, > "minInfoGain":0.0,"seed":-491520797},"numFeatures":1354,"numClasses":2, > "numTrees":20} > {code} > should be: > {code:java} > "numTrees":100,"maxDepth":30,{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23471) RandomForestClassificationModel save() - incorrect metadata
[ https://issues.apache.org/jira/browse/SPARK-23471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379202#comment-16379202 ] Bago Amirbekian edited comment on SPARK-23471 at 2/27/18 8:06 PM: -- [~Keepun] {code:java} train {code} is a protected API, it's called by {Predictor.fit} which also copies the values of Params to the newly created Model instance, [here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118]. Do you get this same issue if you use {RandomForestClassifier.fit}? was (Author: bago.amirbekian): [~Keepun] {train} is a protected API, it's called by {Predictor.fit} which also copies the values of Params to the newly created Model instance, [here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118]. Do you get this same issue if you use {RandomForestClassifier.fit}? > RandomForestClassificationModel save() - incorrect metadata > --- > > Key: SPARK-23471 > URL: https://issues.apache.org/jira/browse/SPARK-23471 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: Keepun >Priority: Major > > RandomForestClassificationMode.load() does not work after save() > {code:java} > RandomForestClassifier rf = new RandomForestClassifier() > .setFeaturesCol("features") > .setLabelCol("result") > .setNumTrees(100) > .setMaxDepth(30) > .setMinInstancesPerNode(1) > //.setCacheNodeIds(true) > .setMaxMemoryInMB(500) > .setSeed(System.currentTimeMillis() + System.nanoTime()); > RandomForestClassificationModel rfmodel = rf.train(data); >try { > rfmodel.save(args[2] + "." 
+ System.currentTimeMillis()); >} catch (IOException e) { > LOG.error(e.getMessage(), e); > e.printStackTrace(); >} > {code} > File metadata\part-0: > {code:java} > {"class":"org.apache.spark.ml.classification.RandomForestClassificationModel", > "timestamp":1519136783983,"sparkVersion":"2.2.1","uid":"rfc_7c7e84ce7488", > "paramMap":{"featureSubsetStrategy":"auto","cacheNodeIds":false,"impurity":"gini", > "checkpointInterval":10, > "numTrees":20,"maxDepth":5, > "probabilityCol":"probability","labelCol":"label","featuresCol":"features", > "maxMemoryInMB":256,"minInstancesPerNode":1,"subsamplingRate":1.0, > "rawPredictionCol":"rawPrediction","predictionCol":"prediction","maxBins":32, > "minInfoGain":0.0,"seed":-491520797},"numFeatures":1354,"numClasses":2, > "numTrees":20} > {code} > should be: > {code:java} > "numTrees":100,"maxDepth":30,{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23471) RandomForestClassificationModel save() - incorrect metadata
[ https://issues.apache.org/jira/browse/SPARK-23471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379202#comment-16379202 ] Bago Amirbekian edited comment on SPARK-23471 at 2/27/18 8:06 PM: -- [~Keepun] {noformat} train {noformat} is a protected API, it's called by {Predictor.fit} which also copies the values of Params to the newly created Model instance, [here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118]. Do you get this same issue if you use {RandomForestClassifier.fit}? was (Author: bago.amirbekian): [~Keepun] {code:java} train {code} is a protected API, it's called by {Predictor.fit} which also copies the values of Params to the newly created Model instance, [here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118]. Do you get this same issue if you use {RandomForestClassifier.fit}? > RandomForestClassificationModel save() - incorrect metadata > --- > > Key: SPARK-23471 > URL: https://issues.apache.org/jira/browse/SPARK-23471 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: Keepun >Priority: Major > > RandomForestClassificationMode.load() does not work after save() > {code:java} > RandomForestClassifier rf = new RandomForestClassifier() > .setFeaturesCol("features") > .setLabelCol("result") > .setNumTrees(100) > .setMaxDepth(30) > .setMinInstancesPerNode(1) > //.setCacheNodeIds(true) > .setMaxMemoryInMB(500) > .setSeed(System.currentTimeMillis() + System.nanoTime()); > RandomForestClassificationModel rfmodel = rf.train(data); >try { > rfmodel.save(args[2] + "." 
+ System.currentTimeMillis()); >} catch (IOException e) { > LOG.error(e.getMessage(), e); > e.printStackTrace(); >} > {code} > File metadata\part-0: > {code:java} > {"class":"org.apache.spark.ml.classification.RandomForestClassificationModel", > "timestamp":1519136783983,"sparkVersion":"2.2.1","uid":"rfc_7c7e84ce7488", > "paramMap":{"featureSubsetStrategy":"auto","cacheNodeIds":false,"impurity":"gini", > "checkpointInterval":10, > "numTrees":20,"maxDepth":5, > "probabilityCol":"probability","labelCol":"label","featuresCol":"features", > "maxMemoryInMB":256,"minInstancesPerNode":1,"subsamplingRate":1.0, > "rawPredictionCol":"rawPrediction","predictionCol":"prediction","maxBins":32, > "minInfoGain":0.0,"seed":-491520797},"numFeatures":1354,"numClasses":2, > "numTrees":20} > {code} > should be: > {code:java} > "numTrees":100,"maxDepth":30,{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23471) RandomForestClassificationModel save() - incorrect metadata
[ https://issues.apache.org/jira/browse/SPARK-23471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379202#comment-16379202 ] Bago Amirbekian edited comment on SPARK-23471 at 2/27/18 8:04 PM: -- [~Keepun] {train} is a protected API, it's called by {Predictor.fit} which also copies the values of Params to the newly created Model instance, [here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118]. Do you get this same issue if you use {RandomForestClassifier.fit|? was (Author: bago.amirbekian): [~Keepun] `train` is a protected API, it's called by `Predictor.fit` which also copies the values of Params to the newly created Model instance, [here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118]. Do you get this same issue if you use `RandomForestClassifier.fit`? > RandomForestClassificationModel save() - incorrect metadata > --- > > Key: SPARK-23471 > URL: https://issues.apache.org/jira/browse/SPARK-23471 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: Keepun >Priority: Major > > RandomForestClassificationMode.load() does not work after save() > {code:java} > RandomForestClassifier rf = new RandomForestClassifier() > .setFeaturesCol("features") > .setLabelCol("result") > .setNumTrees(100) > .setMaxDepth(30) > .setMinInstancesPerNode(1) > //.setCacheNodeIds(true) > .setMaxMemoryInMB(500) > .setSeed(System.currentTimeMillis() + System.nanoTime()); > RandomForestClassificationModel rfmodel = rf.train(data); >try { > rfmodel.save(args[2] + "." 
+ System.currentTimeMillis()); >} catch (IOException e) { > LOG.error(e.getMessage(), e); > e.printStackTrace(); >} > {code} > File metadata\part-0: > {code:java} > {"class":"org.apache.spark.ml.classification.RandomForestClassificationModel", > "timestamp":1519136783983,"sparkVersion":"2.2.1","uid":"rfc_7c7e84ce7488", > "paramMap":{"featureSubsetStrategy":"auto","cacheNodeIds":false,"impurity":"gini", > "checkpointInterval":10, > "numTrees":20,"maxDepth":5, > "probabilityCol":"probability","labelCol":"label","featuresCol":"features", > "maxMemoryInMB":256,"minInstancesPerNode":1,"subsamplingRate":1.0, > "rawPredictionCol":"rawPrediction","predictionCol":"prediction","maxBins":32, > "minInfoGain":0.0,"seed":-491520797},"numFeatures":1354,"numClasses":2, > "numTrees":20} > {code} > should be: > {code:java} > "numTrees":100,"maxDepth":30,{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23471) RandomForestClassificationModel save() - incorrect metadata
[ https://issues.apache.org/jira/browse/SPARK-23471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379202#comment-16379202 ] Bago Amirbekian edited comment on SPARK-23471 at 2/27/18 8:04 PM: -- [~Keepun] {train} is a protected API, it's called by {Predictor.fit} which also copies the values of Params to the newly created Model instance, [here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118]. Do you get this same issue if you use {RandomForestClassifier.fit}? was (Author: bago.amirbekian): [~Keepun] {train} is a protected API, it's called by {Predictor.fit} which also copies the values of Params to the newly created Model instance, [here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118]. Do you get this same issue if you use {RandomForestClassifier.fit|? > RandomForestClassificationModel save() - incorrect metadata > --- > > Key: SPARK-23471 > URL: https://issues.apache.org/jira/browse/SPARK-23471 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: Keepun >Priority: Major > > RandomForestClassificationMode.load() does not work after save() > {code:java} > RandomForestClassifier rf = new RandomForestClassifier() > .setFeaturesCol("features") > .setLabelCol("result") > .setNumTrees(100) > .setMaxDepth(30) > .setMinInstancesPerNode(1) > //.setCacheNodeIds(true) > .setMaxMemoryInMB(500) > .setSeed(System.currentTimeMillis() + System.nanoTime()); > RandomForestClassificationModel rfmodel = rf.train(data); >try { > rfmodel.save(args[2] + "." 
+ System.currentTimeMillis()); >} catch (IOException e) { > LOG.error(e.getMessage(), e); > e.printStackTrace(); >} > {code} > File metadata\part-0: > {code:java} > {"class":"org.apache.spark.ml.classification.RandomForestClassificationModel", > "timestamp":1519136783983,"sparkVersion":"2.2.1","uid":"rfc_7c7e84ce7488", > "paramMap":{"featureSubsetStrategy":"auto","cacheNodeIds":false,"impurity":"gini", > "checkpointInterval":10, > "numTrees":20,"maxDepth":5, > "probabilityCol":"probability","labelCol":"label","featuresCol":"features", > "maxMemoryInMB":256,"minInstancesPerNode":1,"subsamplingRate":1.0, > "rawPredictionCol":"rawPrediction","predictionCol":"prediction","maxBins":32, > "minInfoGain":0.0,"seed":-491520797},"numFeatures":1354,"numClasses":2, > "numTrees":20} > {code} > should be: > {code:java} > "numTrees":100,"maxDepth":30,{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23500) Filters on named_structs could be pushed into scans
[ https://issues.apache.org/jira/browse/SPARK-23500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379201#comment-16379201 ] Henry Robinson commented on SPARK-23500: Ok, I figured it out! {{SimplifyCreateStructOps}} does not get applied recursively across the whole plan, just across the expressions recursively reachable from the root. So even the following: {{df.filter("named_struct('id', id, 'id2', id2).id > 1").select("id2")}} doesn't trigger the rule because the {{named_struct}} is never seen. Changing the rule to walk the plan, and then walk the expression trees rooted at each node, caused the optimization to trigger. {code} object SimplifyCreateStructOps extends Rule[LogicalPlan] { override def apply(plan: LogicalPlan): LogicalPlan = plan transform { case p => p.transformExpressionsUp { case GetStructField(createNamedStructLike: CreateNamedStructLike, ordinal, _) => createNamedStructLike.valExprs(ordinal) } } } {code} > Filters on named_structs could be pushed into scans > --- > > Key: SPARK-23500 > URL: https://issues.apache.org/jira/browse/SPARK-23500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Henry Robinson >Priority: Major > > Simple filters on dataframes joined with {{joinWith()}} are missing an > opportunity to get pushed into the scan because they're written in terms of > {{named_struct}} that could be removed by the optimizer. 
> Given the following simple query over two dataframes: > {code:java} > scala> val df = spark.read.parquet("one_million") > df: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint] > scala> val df2 = spark.read.parquet("one_million") > df2: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint] > scala> df.joinWith(df2, df2.col("id") === df.col("id2")).filter("_2.id > > 30").explain > == Physical Plan == > *(2) BroadcastHashJoin [_1#94.id2], [_2#95.id], Inner, BuildRight > :- *(2) Project [named_struct(id, id#0L, id2, id2#1L) AS _1#94] > : +- *(2) FileScan parquet [id#0L,id2#1L] Batched: true, Format: Parquet, > Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct > +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, > struct, false].id)) >+- *(1) Project [named_struct(id, id#90L, id2, id2#91L) AS _2#95] > +- *(1) Filter (named_struct(id, id#90L, id2, id2#91L).id > 30) > +- *(1) FileScan parquet [id#90L,id2#91L] Batched: true, Format: > Parquet, Location: > InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: > [], PushedFilters: [], ReadSchema: struct > {code} > Using {{joinWith}} means that the filter is placed on a {{named_struct}}, and > is then not pushed down (note the empty {{PushedFilters: []}} above). When the filter is just above the scan, the > wrapping-and-projection of {{named_struct(id...).id}} is a no-op and could be > removed. Then the filter can be pushed down to Parquet. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
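To make the fix described in the comment above concrete, here is a toy Python model of the rewrite (class names are illustrative stand-ins, not Catalyst's actual API): the rule walks every plan node, then rewrites each node's expression tree bottom-up, collapsing named_struct(...).field into the underlying field expression.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Expr: pass

@dataclass
class Attr(Expr):
    name: str

@dataclass
class CreateNamedStruct(Expr):
    names: List[str]
    vals: List[Expr]

@dataclass
class GetStructField(Expr):
    child: Expr
    ordinal: int

def transform_up(e: Expr) -> Expr:
    # rewrite children first, then the node itself (like transformExpressionsUp)
    if isinstance(e, CreateNamedStruct):
        e = CreateNamedStruct(e.names, [transform_up(v) for v in e.vals])
    elif isinstance(e, GetStructField):
        e = GetStructField(transform_up(e.child), e.ordinal)
    if isinstance(e, GetStructField) and isinstance(e.child, CreateNamedStruct):
        return e.child.vals[e.ordinal]  # named_struct(...).field -> field expr
    return e

@dataclass
class PlanNode:
    exprs: List[Expr]
    children: List["PlanNode"] = field(default_factory=list)

def simplify(plan: PlanNode) -> PlanNode:
    # like `plan transform { case p => p.transformExpressionsUp { ... } }`:
    # visit every plan node, not only expressions reachable from the root
    return PlanNode([transform_up(x) for x in plan.exprs],
                    [simplify(c) for c in plan.children])

# named_struct('id', id, 'id2', id2).id below a select of id2: the struct
# access only appears in a *child* node, so a root-only rewrite misses it
filt = PlanNode([GetStructField(
    CreateNamedStruct(["id", "id2"], [Attr("id"), Attr("id2")]), 0)])
root = PlanNode([Attr("id2")], [filt])
assert simplify(root).children[0].exprs[0] == Attr("id")
```

The root-level select only references id2, so unless the rule visits the filter node itself, the GetStructField is never seen, matching the behavior reported above.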
[jira] [Commented] (SPARK-23471) RandomForestClassificationModel save() - incorrect metadata
[ https://issues.apache.org/jira/browse/SPARK-23471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379202#comment-16379202 ] Bago Amirbekian commented on SPARK-23471: - [~Keepun] `train` is a protected API, it's called by `Predictor.fit` which also copies the values of Params to the newly created Model instance, [here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118]. Do you get this same issue if you use `RandomForestClassifier.fit`? > RandomForestClassificationModel save() - incorrect metadata > --- > > Key: SPARK-23471 > URL: https://issues.apache.org/jira/browse/SPARK-23471 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: Keepun >Priority: Major > > RandomForestClassificationMode.load() does not work after save() > {code:java} > RandomForestClassifier rf = new RandomForestClassifier() > .setFeaturesCol("features") > .setLabelCol("result") > .setNumTrees(100) > .setMaxDepth(30) > .setMinInstancesPerNode(1) > //.setCacheNodeIds(true) > .setMaxMemoryInMB(500) > .setSeed(System.currentTimeMillis() + System.nanoTime()); > RandomForestClassificationModel rfmodel = rf.train(data); >try { > rfmodel.save(args[2] + "." 
+ System.currentTimeMillis()); >} catch (IOException e) { > LOG.error(e.getMessage(), e); > e.printStackTrace(); >} > {code} > File metadata\part-0: > {code:java} > {"class":"org.apache.spark.ml.classification.RandomForestClassificationModel", > "timestamp":1519136783983,"sparkVersion":"2.2.1","uid":"rfc_7c7e84ce7488", > "paramMap":{"featureSubsetStrategy":"auto","cacheNodeIds":false,"impurity":"gini", > "checkpointInterval":10, > "numTrees":20,"maxDepth":5, > "probabilityCol":"probability","labelCol":"label","featuresCol":"features", > "maxMemoryInMB":256,"minInstancesPerNode":1,"subsamplingRate":1.0, > "rawPredictionCol":"rawPrediction","predictionCol":"prediction","maxBins":32, > "minInfoGain":0.0,"seed":-491520797},"numFeatures":1354,"numClasses":2, > "numTrees":20} > {code} > should be: > {code:java} > "numTrees":100,"maxDepth":30,{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
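The distinction in the comment above can be sketched in a few lines of Python (illustrative names only, not Spark's actual API): fit() copies the estimator's param values onto the new model, while calling the internal train() directly skips that copy, leaving the model's defaults to be saved, which is what the metadata above shows.

```python
# Toy sketch, assuming fit = train + param copy, per the comment above.
class Model:
    def __init__(self):
        self.params = {"numTrees": 20, "maxDepth": 5}  # library defaults

class Estimator:
    def __init__(self):
        self.params = {"numTrees": 20, "maxDepth": 5}  # library defaults

    def set(self, **kwargs):
        self.params.update(kwargs)
        return self

    def _train(self, data):
        # builds the model but does NOT carry over the estimator's params
        return Model()

    def fit(self, data):
        model = self._train(data)
        model.params.update(self.params)  # the copy step fit() performs
        return model

est = Estimator().set(numTrees=100, maxDepth=30)
assert est._train(None).params["numTrees"] == 20   # train(): defaults saved
assert est.fit(None).params["numTrees"] == 100     # fit(): params copied
```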
[jira] [Assigned] (SPARK-23501) Refactor AllStagesPage in order to avoid redundant code
[ https://issues.apache.org/jira/browse/SPARK-23501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-23501: -- Assignee: Marco Gaido > Refactor AllStagesPage in order to avoid redundant code > --- > > Key: SPARK-23501 > URL: https://issues.apache.org/jira/browse/SPARK-23501 > Project: Spark > Issue Type: Task > Components: Web UI >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Trivial > Fix For: 2.4.0 > > > AllStagesPage contains a lot of copy-pasted code for the different statuses > of the stages. This can be improved in order to make it easier to add new > stages statuses here, make changes and make less error prone these tasks. > cc [~vanzin] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23501) Refactor AllStagesPage in order to avoid redundant code
[ https://issues.apache.org/jira/browse/SPARK-23501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23501. Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20663 [https://github.com/apache/spark/pull/20663] > Refactor AllStagesPage in order to avoid redundant code > --- > > Key: SPARK-23501 > URL: https://issues.apache.org/jira/browse/SPARK-23501 > Project: Spark > Issue Type: Task > Components: Web UI >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Trivial > Fix For: 2.4.0 > > > AllStagesPage contains a lot of copy-pasted code for the different statuses > of the stages. This can be improved in order to make it easier to add new > stages statuses here, make changes and make less error prone these tasks. > cc [~vanzin] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23365) DynamicAllocation with failure in straggler task can lead to a hung spark job
[ https://issues.apache.org/jira/browse/SPARK-23365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-23365: -- Assignee: Imran Rashid > DynamicAllocation with failure in straggler task can lead to a hung spark job > - > > Key: SPARK-23365 > URL: https://issues.apache.org/jira/browse/SPARK-23365 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.2, 2.2.1, 2.3.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > Dynamic Allocation can lead to a Spark app getting stuck with 0 executors > requested when the executors in the last tasks of a taskset fail (eg. with an > OOM). > This happens when {{ExecutorAllocationManager}}'s internal target number of > executors gets out of sync with {{CoarseGrainedSchedulerBackend}}'s target > number. {{EAM}} updates the {{CGSB}} in two ways: (1) it tracks how many > tasks are active or pending in submitted stages, and computes how many > executors would be needed for them. And as tasks finish, it will actively > decrease that count, informing the {{CGSB}} along the way. 
(2) When it > decides executors are inactive for long enough, then it requests that > {{CGSB}} kill the executors -- this also tells the {{CGSB}} to update its > target number of executors: > https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L622 > So when there is just one task left, you could have the following sequence of > events: > (1) the {{EAM}} sets the desired number of executors to 1, and updates the > {{CGSB}} too > (2) while that final task is still running, the other executors cross the > idle timeout, and the {{EAM}} requests the {{CGSB}} kill them > (3) now the {{EAM}} has a target of 1 executor, and the {{CGSB}} has a target > of 0 executors > If the final task completed normally now, everything would be OK; the next > taskset would get submitted, the {{EAM}} would increase the target number of > executors and it would update the {{CGSB}}. > But if the executor for that final task failed (eg. an OOM), then the {{EAM}} > thinks it [doesn't need to update > anything|https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L384-L386], > because its target is already 1, which is all it needs for that final task; > and the {{CGSB}} doesn't update anything either since its target is 0. > I think you can determine if this is the cause of a stuck app by looking for > {noformat} > yarn.YarnAllocator: Driver requested a total number of 0 executor(s). > {noformat} > in the logs of the ApplicationMaster (at least on yarn). 
> You can reproduce this with this test app, run with {{--conf > "spark.dynamicAllocation.minExecutors=1" --conf > "spark.dynamicAllocation.maxExecutors=5" --conf > "spark.dynamicAllocation.executorIdleTimeout=5s"}} > {code} > import org.apache.spark.SparkEnv > sc.setLogLevel("INFO") > sc.parallelize(1 to 1, 1).count() > val execs = sc.parallelize(1 to 1000, 1000).map { _ => > SparkEnv.get.executorId}.collect().toSet > val badExec = execs.head > println("will kill exec " + badExec) > new Thread() { > override def run(): Unit = { > Thread.sleep(1) > println("about to kill exec " + badExec) > sc.killExecutor(badExec) > } > }.start() > sc.parallelize(1 to 5, 5).mapPartitions { itr => > val exec = SparkEnv.get.executorId > if (exec == badExec) { > Thread.sleep(2) // long enough that all the other tasks finish, and > the executors cross the idle timeout > // meanwhile, something else should kill this executor > itr > } else { > itr > } > }.collect() > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
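The three-step sequence described above can be simulated with a toy Python sketch (EAM/CGSB here are stand-ins, not the real Spark classes); the guard that skips an update when the target appears unchanged is what leaves the two sides out of sync:

```python
# Toy simulation of the target-count desync, assuming the simplified
# update rules described in the issue text above.
class CGSB:
    def __init__(self):
        self.target = 0
    def request_total(self, n):
        self.target = n
    def kill_executors(self, n):
        # killing idle executors also lowers the backend's target
        self.target -= n

class EAM:
    def __init__(self, backend):
        self.backend, self.target = backend, 0
    def set_target(self, n):
        if n != self.target:           # skips the update when "unchanged" --
            self.target = n            # this guard is what causes the hang
            self.backend.request_total(n)

backend = CGSB()
eam = EAM(backend)
eam.set_target(1)            # (1) one task left: EAM wants 1, tells backend
backend.kill_executors(1)    # (2) idle executors killed: backend drops to 0
eam.set_target(1)            # (3) final task's executor dies; EAM still wants
                             #     1, sees no change, never tells the backend
assert eam.target == 1 and backend.target == 0  # stuck: 0 executors requested
```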
[jira] [Resolved] (SPARK-23365) DynamicAllocation with failure in straggler task can lead to a hung spark job
[ https://issues.apache.org/jira/browse/SPARK-23365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23365. Resolution: Fixed Fix Version/s: 2.3.1 2.4.0 Issue resolved by pull request 20604 [https://github.com/apache/spark/pull/20604] > DynamicAllocation with failure in straggler task can lead to a hung spark job > - > > Key: SPARK-23365 > URL: https://issues.apache.org/jira/browse/SPARK-23365 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.2, 2.2.1, 2.3.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Major > Fix For: 2.4.0, 2.3.1 > > > Dynamic Allocation can lead to a spark app getting stuck with 0 executors > requested when the executors in the last tasks of a taskset fail (eg. with an > OOM). > This happens when {{ExecutorAllocationManager}} s internal target number of > executors gets out of sync with {{CoarseGrainedSchedulerBackend}} s target > number. {{EAM}} updates the {{CGSB}} in two ways: (1) it tracks how many > tasks are active or pending in submitted stages, and computes how many > executors would be needed for them. And as tasks finish, it will actively > decrease that count, informing the {{CGSB}} along the way. 
(2) When it > decides executors are inactive for long enough, then it requests that > {{CGSB}} kill the executors -- this also tells the {{CGSB}} to update its > target number of executors: > https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L622 > So when there is just one task left, you could have the following sequence of > events: > (1) the {{EAM}} sets the desired number of executors to 1, and updates the > {{CGSB}} too > (2) while that final task is still running, the other executors cross the > idle timeout, and the {{EAM}} requests the {{CGSB}} kill them > (3) now the {{EAM}} has a target of 1 executor, and the {{CGSB}} has a target > of 0 executors > If the final task completed normally now, everything would be OK; the next > taskset would get submitted, the {{EAM}} would increase the target number of > executors and it would update the {{CGSB}}. > But if the executor for that final task failed (eg. an OOM), then the {{EAM}} > thinks it [doesn't need to update > anything|https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L384-L386], > because its target is already 1, which is all it needs for that final task; > and the {{CGSB}} doesn't update anything either since its target is 0. > I think you can determine if this is the cause of a stuck app by looking for > {noformat} > yarn.YarnAllocator: Driver requested a total number of 0 executor(s). > {noformat} > in the logs of the ApplicationMaster (at least on yarn). 
> You can reproduce this with this test app, run with {{--conf > "spark.dynamicAllocation.minExecutors=1" --conf > "spark.dynamicAllocation.maxExecutors=5" --conf > "spark.dynamicAllocation.executorIdleTimeout=5s"}} > {code} > import org.apache.spark.SparkEnv > sc.setLogLevel("INFO") > sc.parallelize(1 to 1, 1).count() > val execs = sc.parallelize(1 to 1000, 1000).map { _ => > SparkEnv.get.executorId}.collect().toSet > val badExec = execs.head > println("will kill exec " + badExec) > new Thread() { > override def run(): Unit = { > Thread.sleep(1) > println("about to kill exec " + badExec) > sc.killExecutor(badExec) > } > }.start() > sc.parallelize(1 to 5, 5).mapPartitions { itr => > val exec = SparkEnv.get.executorId > if (exec == badExec) { > Thread.sleep(2) // long enough that all the other tasks finish, and > the executors cross the idle timeout > // meanwhile, something else should kill this executor > itr > } else { > itr > } > }.collect() > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22915) ML test for StructuredStreaming: spark.ml.feature, N-Z
[ https://issues.apache.org/jira/browse/SPARK-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379056#comment-16379056 ] Attila Zsolt Piros commented on SPARK-22915: Thank you, as I already have a lot of work in it. I can do a quick PR right now, but I would prefer to do one more self-review and some test executions beforehand. I can promise to finish it within the following two days (as I am currently traveling). > ML test for StructuredStreaming: spark.ml.feature, N-Z > -- > > Key: SPARK-22915 > URL: https://issues.apache.org/jira/browse/SPARK-22915 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Priority: Major > > *For featurizers with names from N - Z* > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18062) ProbabilisticClassificationModel.normalizeToProbabilitiesInPlace should return probabilities when given all-0 vector
[ https://issues.apache.org/jira/browse/SPARK-18062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-18062. --- Resolution: Duplicate > ProbabilisticClassificationModel.normalizeToProbabilitiesInPlace should > return probabilities when given all-0 vector > > > Key: SPARK-18062 > URL: https://issues.apache.org/jira/browse/SPARK-18062 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Trivial > > {{ProbabilisticClassificationModel.normalizeToProbabilitiesInPlace}} returns > a vector of all-0 when given a rawPrediction vector of all-0. It should > return a valid probability vector with the uniform distribution. > Note: This will be a *behavior change* but it should be very minor and affect > few if any users. But we should note it in release notes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19416) Dataset.schema is inconsistent with Dataset in handling columns with periods
[ https://issues.apache.org/jira/browse/SPARK-19416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379034#comment-16379034 ] Joseph K. Bradley commented on SPARK-19416: --- [~rxin] Shall we close this as Won't Do, or shall we mark it as a thing to break in 3.0? > Dataset.schema is inconsistent with Dataset in handling columns with periods > > > Key: SPARK-19416 > URL: https://issues.apache.org/jira/browse/SPARK-19416 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0 >Reporter: Joseph K. Bradley >Priority: Minor > > When you have a DataFrame with a column with a period in its name, the API is > inconsistent about how to quote the column name. > Here's a reproduction: > {code} > import org.apache.spark.sql.functions.col > val rows = Seq( > ("foo", 1), > ("bar", 2) > ) > val df = spark.createDataFrame(rows).toDF("a.b", "id") > {code} > These methods are all consistent: > {code} > df.select("a.b") // fails > df.select("`a.b`") // succeeds > df.select(col("a.b")) // fails > df.select(col("`a.b`")) // succeeds > df("a.b") // fails > df("`a.b`") // succeeds > {code} > But {{schema}} is inconsistent: > {code} > df.schema("a.b") // succeeds > df.schema("`a.b`") // fails > {code} > "fails" produces error messages like: > {code} > org.apache.spark.sql.AnalysisException: cannot resolve '`a.b`' given input > columns: [a.b, id];; > 'Project ['a.b] > +- Project [_1#1511 AS a.b#1516, _2#1512 AS id#1517] >+- LocalRelation [_1#1511, _2#1512] > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:282) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:292) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:296) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:296) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$7.apply(QueryPlan.scala:301) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:301) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128) > at > 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2822) > at org.apache.spark.sql.Dataset.select(Dataset.scala:1121) > at org.apache.spark.sql.Dataset.select(Dataset.scala:1139) > at > line9667c6d14e79417280e5882aa52e0de727.$read$$iw$$iw$$iw$$
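The inconsistency above is that select() requires backticks around a column name containing a period, while schema() rejects the backticked form. Until the APIs agree, one caller-side workaround is to normalize the name before the schema lookup. The helper below is hypothetical, not part of Spark's API:

```scala
object ColumnName {
  // Strip one pair of surrounding backticks so that both `a.b` and a.b
  // can be passed to schema lookups, matching select()'s quoted form.
  def stripBackticks(name: String): String =
    if (name.length >= 2 && name.head == '`' && name.last == '`')
      name.substring(1, name.length - 1)
    else name
}

// Hypothetical usage: df.schema(ColumnName.stripBackticks("`a.b`"))
```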
[jira] [Commented] (SPARK-22915) ML test for StructuredStreaming: spark.ml.feature, N-Z
[ https://issues.apache.org/jira/browse/SPARK-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379035#comment-16379035 ] yogesh garg commented on SPARK-22915: - Ah, doesn't make sense for me to take it then. Thanks! Please go ahead. > ML test for StructuredStreaming: spark.ml.feature, N-Z > -- > > Key: SPARK-22915 > URL: https://issues.apache.org/jira/browse/SPARK-22915 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Priority: Major > > *For featurizers with names from N - Z* > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23523) Incorrect result caused by the rule OptimizeMetadataOnlyQuery
[ https://issues.apache.org/jira/browse/SPARK-23523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23523: -- Affects Version/s: 2.1.2 > Incorrect result caused by the rule OptimizeMetadataOnlyQuery > - > > Key: SPARK-23523 > URL: https://issues.apache.org/jira/browse/SPARK-23523 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.2, 2.2.1, 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > Fix For: 2.4.0 > > > {code:scala} > val tablePath = new File(s"${path.getCanonicalPath}/cOl3=c/cOl1=a/cOl5=e") > Seq(("a", "b", "c", "d", "e")).toDF("cOl1", "cOl2", "cOl3", "cOl4", "cOl5") > .write.json(tablePath.getCanonicalPath) > val df = spark.read.json(path.getCanonicalPath).select("CoL1", "CoL5", > "CoL3").distinct() > df.show() > {code} > This returns a wrong result > {{[c,e,a]}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22915) ML test for StructuredStreaming: spark.ml.feature, N-Z
[ https://issues.apache.org/jira/browse/SPARK-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378987#comment-16378987 ] Attila Zsolt Piros commented on SPARK-22915: I am also very close to a PR. > ML test for StructuredStreaming: spark.ml.feature, N-Z > -- > > Key: SPARK-22915 > URL: https://issues.apache.org/jira/browse/SPARK-22915 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Priority: Major > > *For featurizers with names from N - Z* > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22915) ML test for StructuredStreaming: spark.ml.feature, N-Z
[ https://issues.apache.org/jira/browse/SPARK-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378978#comment-16378978 ] yogesh garg commented on SPARK-22915: - I have started working on this and can raise a PR soon. Thanks for the help! > ML test for StructuredStreaming: spark.ml.feature, N-Z > -- > > Key: SPARK-22915 > URL: https://issues.apache.org/jira/browse/SPARK-22915 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Priority: Major > > *For featurizers with names from N - Z* > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19809) NullPointerException on zero-size ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378963#comment-16378963 ] Dongjoon Hyun commented on SPARK-19809: --- I meant 2.1.1 literally. :) > NullPointerException on zero-size ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1 >Reporter: Michał Dawid >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.3.0 > > Attachments: image-2018-02-26-20-29-49-410.png, > spark.sql.hive.convertMetastoreOrc.txt > > > When reading from hive ORC table if there are some 0 byte files we get > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at 
scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) > at > org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086) > at > 
org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456) >
[jira] [Resolved] (SPARK-23523) Incorrect result caused by the rule OptimizeMetadataOnlyQuery
[ https://issues.apache.org/jira/browse/SPARK-23523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23523. - Resolution: Fixed Fix Version/s: 2.3.0 > Incorrect result caused by the rule OptimizeMetadataOnlyQuery > - > > Key: SPARK-23523 > URL: https://issues.apache.org/jira/browse/SPARK-23523 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > Fix For: 2.3.0 > > > {code:scala} > val tablePath = new File(s"${path.getCanonicalPath}/cOl3=c/cOl1=a/cOl5=e") > Seq(("a", "b", "c", "d", "e")).toDF("cOl1", "cOl2", "cOl3", "cOl4", "cOl5") > .write.json(tablePath.getCanonicalPath) > val df = spark.read.json(path.getCanonicalPath).select("CoL1", "CoL5", > "CoL3").distinct() > df.show() > {code} > This returns a wrong result > {{[c,e,a]}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23523) Incorrect result caused by the rule OptimizeMetadataOnlyQuery
[ https://issues.apache.org/jira/browse/SPARK-23523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23523: Fix Version/s: (was: 2.3.0) 2.4.0 > Incorrect result caused by the rule OptimizeMetadataOnlyQuery > - > > Key: SPARK-23523 > URL: https://issues.apache.org/jira/browse/SPARK-23523 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > Fix For: 2.4.0 > > > {code:scala} > val tablePath = new File(s"${path.getCanonicalPath}/cOl3=c/cOl1=a/cOl5=e") > Seq(("a", "b", "c", "d", "e")).toDF("cOl1", "cOl2", "cOl3", "cOl4", "cOl5") > .write.json(tablePath.getCanonicalPath) > val df = spark.read.json(path.getCanonicalPath).select("CoL1", "CoL5", > "CoL3").distinct() > df.show() > {code} > This returns a wrong result > {{[c,e,a]}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10856) SQL Server dialect needs to map java.sql.Timestamp to DATETIME instead of TIMESTAMP
[ https://issues.apache.org/jira/browse/SPARK-10856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378904#comment-16378904 ] Greg Michalopoulos commented on SPARK-10856: I am facing an issue with one of the datatypes (bit) suggested by [~henrikbehrens] for patching. Was this addressed in another bug? I am using Spark 2.1.1. > SQL Server dialect needs to map java.sql.Timestamp to DATETIME instead of > TIMESTAMP > --- > > Key: SPARK-10856 > URL: https://issues.apache.org/jira/browse/SPARK-10856 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0, 1.4.1, 1.5.0 >Reporter: Henrik Behrens >Assignee: Liang-Chi Hsieh >Priority: Major > Labels: patch > Fix For: 1.6.0 > > > When saving a DataFrame to MS SQL Server, an error is thrown if there is more > than one TIMESTAMP column: > df.printSchema > root > |-- Id: string (nullable = false) > |-- TypeInformation_CreatedBy: string (nullable = false) > |-- TypeInformation_ModifiedBy: string (nullable = true) > |-- TypeInformation_TypeStatus: integer (nullable = false) > |-- TypeInformation_CreatedAtDatabase: timestamp (nullable = false) > |-- TypeInformation_ModifiedAtDatabase: timestamp (nullable = true) > df.write.mode("overwrite").jdbc(url, tablename, props) > com.microsoft.sqlserver.jdbc.SQLServerException: A table can only have one > timestamp column. Because table 'DebtorTypeSet1' already has one, the column > 'TypeInformation_ModifiedAtDatabase' cannot be added. 
> at > com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError > (SQLServerException.java:217) > at > com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServ > erStatement.java:1635) > at > com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePrep > aredStatement(SQLServerPreparedStatement.java:426) > at > com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecC > md.doExecute(SQLServerPreparedStatement.java:372) > at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:6276) > at > com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLSe > rverConnection.java:1793) > at > com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLSer > verStatement.java:184) > at > com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLS > erverStatement.java:159) > at > com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeUpdate > (SQLServerPreparedStatement.java:315) > I tested this on Windows and SQL Server 12 using Spark 1.4.1. > I think this can be fixed in a similar way to Spark-10419. > As a refererence, here is the type mapping according to the SQL Server JDBC > driver (basicDT.java, extracted from sqljdbc_4.2.6420.100_enu.exe): >private static void displayRow(String title, ResultSet rs) { > try { > System.out.println(title); > System.out.println(rs.getInt(1) + " , " +// SQL integer > type. >rs.getString(2) + " , " + // SQL char > type. >rs.getString(3) + " , " + // SQL varchar > type. >rs.getBoolean(4) + " , " + // SQL bit type. >rs.getDouble(5) + " , " + // SQL decimal > type. >rs.getDouble(6) + " , " + // SQL money > type. >rs.getTimestamp(7) + " , " + // SQL datetime > type. >rs.getDate(8) + " , " +// SQL date > type. >rs.getTime(9) + " , " +// SQL time > type. >rs.getTimestamp(10) + " , " + // SQL > datetime2 type. >((SQLServerResultSet)rs).getDateTimeOffset(11)); // SQL > datetimeoffset type. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
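A fix along the lines of SPARK-10419 would special-case the SQL Server dialect so that Catalyst timestamps map to DATETIME rather than TIMESTAMP, since SQL Server's TIMESTAMP is a rowversion and a table may have at most one. The mapping function below is an illustrative sketch of that idea, not Spark's actual JdbcDialect code:

```scala
object MsSqlTypeMapping {
  // Map Catalyst type names to SQL Server column types.
  // SQL Server's TIMESTAMP is a rowversion, not a point in time,
  // and a table may hold at most one such column -- hence DATETIME.
  def toMsSqlType(catalystType: String): String = catalystType match {
    case "TimestampType" => "DATETIME"
    case "StringType"    => "NVARCHAR(MAX)"
    case "BooleanType"   => "BIT"
    case "IntegerType"   => "INT"
    case other           => sys.error(s"unmapped type: $other")
  }
}
```

With such a mapping, a DataFrame with several timestamp columns would produce several DATETIME columns and avoid the "A table can only have one timestamp column" error quoted above.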
[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics
[ https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378879#comment-16378879 ] Edwina Lu commented on SPARK-23206: --- [~assia6], I haven't submitted the PR yet – the code is currently in Spark 2.1 and needs to be forward ported. I will update the ticket when the PR is available. [~felixcheung], I will let you know when we schedule the next meeting. Is there more information on what you are looking into for shuffle, and metric collections? > Additional Memory Tuning Metrics > > > Key: SPARK-23206 > URL: https://issues.apache.org/jira/browse/SPARK-23206 > Project: Spark > Issue Type: Umbrella > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Edwina Lu >Priority: Major > Attachments: ExecutorsTab.png, ExecutorsTab2.png, > MemoryTuningMetricsDesignDoc.pdf, StageTab.png > > > At LinkedIn, we have multiple clusters, running thousands of Spark > applications, and these numbers are growing rapidly. We need to ensure that > these Spark applications are well tuned – cluster resources, including > memory, should be used efficiently so that the cluster can support running > more applications concurrently, and applications should run quickly and > reliably. > Currently there is limited visibility into how much memory executors are > using, and users are guessing numbers for executor and driver memory sizing. > These estimates are often much larger than needed, leading to memory wastage. > Examining the metrics for one cluster for a month, the average percentage of > used executor memory (max JVM used memory across executors / > spark.executor.memory) is 35%, leading to an average of 591GB unused memory > per application (number of executors * (spark.executor.memory - max JVM used > memory)). 
Spark has multiple memory regions (user memory, execution memory, > storage memory, and overhead memory), and to understand how memory is being > used and fine-tune allocation between regions, it would be useful to have > information about how much memory is being used for the different regions. > To improve visibility into memory usage for the driver and executors and > different memory regions, the following additional memory metrics can be > tracked for each executor and driver: > * JVM used memory: the JVM heap size for the executor/driver. > * Execution memory: memory used for computation in shuffles, joins, sorts > and aggregations. > * Storage memory: memory used for caching and propagating internal data across > the cluster. > * Unified memory: sum of execution and storage memory. > The peak values for each memory metric can be tracked for each executor, and > also per stage. This information can be shown in the Spark UI and the REST > APIs. Information for peak JVM used memory can help with determining > appropriate values for spark.executor.memory and spark.driver.memory, and > information about the unified memory region can help with determining > appropriate values for spark.memory.fraction and > spark.memory.storageFraction. Stage memory information can help identify > which stages are most memory intensive, and users can look into the relevant > code to determine if it can be optimized. > The memory metrics can be gathered by adding the current JVM used memory, > execution memory and storage memory to the heartbeat. SparkListeners are > modified to collect the new metrics for the executors, stages and Spark > history log. Only interesting values (peak values per stage per executor) are > recorded in the Spark history log, to minimize the amount of additional > logging. > We have attached our design documentation with this ticket and would like to > receive feedback from the community for this proposal. 
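The wastage estimate quoted above is simple arithmetic over per-executor peaks; as a sketch (the numbers in the usage note are hypothetical, not from the LinkedIn data):

```scala
object MemoryWaste {
  // unused memory per app =
  //   numExecutors * (spark.executor.memory - max JVM used memory)
  def unusedGb(numExecutors: Int,
               executorMemoryGb: Double,
               peakJvmUsedGb: Double): Double =
    numExecutors * (executorMemoryGb - peakJvmUsedGb)
}
```

For example, 100 executors configured at 16 GB each but peaking at 5.6 GB (the 35% average utilization cited above) would leave roughly 1040 GB of the application's allocation unused.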
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)
[ https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17147. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20572 [https://github.com/apache/spark/pull/20572] > Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets > (i.e. Log Compaction) > -- > > Key: SPARK-17147 > URL: https://issues.apache.org/jira/browse/SPARK-17147 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: Robert Conrad >Assignee: Cody Koeninger >Priority: Major > Fix For: 2.4.0 > > > When Kafka does log compaction, offsets often end up with gaps, meaning the > next requested offset will frequently not be offset+1. The logic in > KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset > will always be just an increment of 1 above the previous offset. > I have worked around this problem by changing CachedKafkaConsumer to use the > returned record's offset, from: > {{nextOffset = offset + 1}} > to: > {{nextOffset = record.offset + 1}} > and changed KafkaRDD from: > {{requestOffset += 1}} > to: > {{requestOffset = r.offset() + 1}} > (I also had to change some assert logic in CachedKafkaConsumer). > There's a strong possibility that I have misconstrued how to use the > streaming kafka consumer, and I'm happy to close this out if that's the case. > If, however, it is supposed to support non-consecutive offsets (e.g. due to > log compaction) I am also happy to contribute a PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
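The workaround described in the ticket amounts to advancing the cursor from the record actually returned rather than by a fixed increment. A plain-Scala sketch of iterating a compacted log with offset gaps (the {{Record}} type here is a hypothetical stand-in for Kafka's ConsumerRecord):

```scala
// Hypothetical stand-in for a Kafka record: only the offset matters here.
final case class Record(offset: Long, value: String)

object CompactedIteration {
  // After compaction, offsets have gaps (e.g. 0, 3, 7). Advancing with
  // `nextOffset = nextOffset + 1` would request compacted-away offsets;
  // `nextOffset = record.offset + 1` skips over the gaps.
  def consume(log: Seq[Record]): (Seq[Long], Long) = {
    var nextOffset = log.head.offset
    val seen = log.map { record =>
      // a real consumer would check record.offset >= nextOffset, not ==
      nextOffset = record.offset + 1
      record.offset
    }
    (seen, nextOffset)
  }
}
```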
[jira] [Assigned] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)
[ https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-17147: - Assignee: Cody Koeninger > Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets > (i.e. Log Compaction) > -- > > Key: SPARK-17147 > URL: https://issues.apache.org/jira/browse/SPARK-17147 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: Robert Conrad >Assignee: Cody Koeninger >Priority: Major > Fix For: 2.4.0 > > > When Kafka does log compaction, offsets often end up with gaps, meaning the > next requested offset will frequently not be offset+1. The logic in > KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset > will always be just an increment of 1 above the previous offset. > I have worked around this problem by changing CachedKafkaConsumer to use the > returned record's offset, from: > {{nextOffset = offset + 1}} > to: > {{nextOffset = record.offset + 1}} > and changed KafkaRDD from: > {{requestOffset += 1}} > to: > {{requestOffset = r.offset() + 1}} > (I also had to change some assert logic in CachedKafkaConsumer). > There's a strong possibility that I have misconstrued how to use the > streaming kafka consumer, and I'm happy to close this out if that's the case. > If, however, it is supposed to support non-consecutive offsets (e.g. due to > log compaction) I am also happy to contribute a PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23509) Upgrade commons-net from 2.2 to 3.1
[ https://issues.apache.org/jira/browse/SPARK-23509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378636#comment-16378636 ] PandaMonkey commented on SPARK-23509: - [~srowen] Thx for your reply. > Upgrade commons-net from 2.2 to 3.1 > --- > > Key: SPARK-23509 > URL: https://issues.apache.org/jira/browse/SPARK-23509 > Project: Spark > Issue Type: Dependency upgrade > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: PandaMonkey >Assignee: Kazuaki Ishizaki >Priority: Minor > Fix For: 2.4.0 > > Attachments: spark.txt > > > Hi, after analyzing spark-master\core\pom.xml, we found that Spark-core > depends on org.apache.hadoop:hadoop-client:2.6.5, which transitively > introduces commons-net:3.1. At the same time, Spark-core directly depends on > an older version of commons-net:2.2. Looking further into the source code, > these two versions of commons-net have many different features. The > dependency conflict brings a high risk of "NoClassDefFoundError" or > "NoSuchMethodError" issues at runtime. Please note this problem. Maybe > upgrading commons-net from 2.2 to 3.1 is a good choice. Hope this report can > help you. Thanks! > > Regards, > Panda -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
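Besides bumping the direct dependency (the approach taken by the resolving PR), a conventional Maven way to resolve such a version conflict is to pin a single version via dependencyManagement. A hedged pom.xml sketch, using the coordinates named in the report:

```xml
<!-- Sketch: force one commons-net version for both the direct
     dependency and the copy pulled in transitively via hadoop-client. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>commons-net</groupId>
      <artifactId>commons-net</artifactId>
      <version>3.1</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```

Running {{mvn dependency:tree}} before and after makes it easy to verify that only one commons-net version remains on the classpath.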
[jira] [Updated] (SPARK-23509) Upgrade commons-net from 2.2 to 3.1
[ https://issues.apache.org/jira/browse/SPARK-23509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-23509: -- Priority: Minor (was: Major) > Upgrade commons-net from 2.2 to 3.1 > --- > > Key: SPARK-23509 > URL: https://issues.apache.org/jira/browse/SPARK-23509 > Project: Spark > Issue Type: Dependency upgrade > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: PandaMonkey >Assignee: Kazuaki Ishizaki >Priority: Minor > Fix For: 2.4.0 > > Attachments: spark.txt > > > Hi, after analyzing spark-master\core\pom.xml, we found that Spark-core > depends on org.apache.hadoop:hadoop-client:2.6.5, which transitively > introduces commons-net:3.1. At the same time, Spark-core directly depends on > an older version of commons-net:2.2. Looking further into the source code, > these two versions of commons-net have many different features. The > dependency conflict brings a high risk of "NoClassDefFoundError" or > "NoSuchMethodError" issues at runtime. Please note this problem. Maybe > upgrading commons-net from 2.2 to 3.1 is a good choice. Hope this report can > help you. Thanks! > > Regards, > Panda -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23509) Upgrade commons-net from 2.2 to 3.1
[ https://issues.apache.org/jira/browse/SPARK-23509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-23509: - Assignee: Kazuaki Ishizaki > Upgrade commons-net from 2.2 to 3.1 > --- > > Key: SPARK-23509 > URL: https://issues.apache.org/jira/browse/SPARK-23509 > Project: Spark > Issue Type: Dependency upgrade > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: PandaMonkey >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 2.4.0 > > Attachments: spark.txt > > > Hi, after analyzing spark-master\core\pom.xml, we found that Spark-core > depends on org.apache.hadoop:hadoop-client:2.6.5, which transitively > introduces commons-net:3.1. At the same time, Spark-core directly depends on > an older version, commons-net:2.2. A further look into the source code shows > that these two versions of commons-net have many different features. The > dependency conflict brings a high risk of "NoClassDefFoundError" or > "NoSuchMethodError" issues at runtime. Please note this problem. Maybe > upgrading commons-net from 2.2 to 3.1 is a good choice. Hope this report can > help you. Thanks! > > Regards, > Panda -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23509) Upgrade commons-net from 2.2 to 3.1
[ https://issues.apache.org/jira/browse/SPARK-23509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23509. --- Resolution: Fixed Issue resolved by pull request 20672 [https://github.com/apache/spark/pull/20672] > Upgrade commons-net from 2.2 to 3.1 > --- > > Key: SPARK-23509 > URL: https://issues.apache.org/jira/browse/SPARK-23509 > Project: Spark > Issue Type: Dependency upgrade > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: PandaMonkey >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 2.4.0 > > Attachments: spark.txt > > > Hi, after analyzing spark-master\core\pom.xml, we found that Spark-core > depends on org.apache.hadoop:hadoop-client:2.6.5, which transitively > introduces commons-net:3.1. At the same time, Spark-core directly depends on > an older version, commons-net:2.2. A further look into the source code shows > that these two versions of commons-net have many different features. The > dependency conflict brings a high risk of "NoClassDefFoundError" or > "NoSuchMethodError" issues at runtime. Please note this problem. Maybe > upgrading commons-net from 2.2 to 3.1 is a good choice. Hope this report can > help you. Thanks! > > Regards, > Panda -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
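The conflict described in this thread follows from Maven's "nearest wins" dependency mediation: a version declared closer to the root of the dependency tree shadows a transitive one, so spark-core's direct commons-net:2.2 wins over the 3.1 pulled in by hadoop-client, and code compiled against 3.1 APIs can then fail at runtime. A minimal sketch of that rule (not Maven's actual code):

```python
def mediate(declarations):
    """Pick one version per artifact the way Maven's "nearest wins"
    mediation does: the declaration at the smallest tree depth is
    selected; ties go to the first declaration seen."""
    chosen = {}
    # sorted() is stable, so equal depths keep their declaration order
    for depth, artifact, version in sorted(declarations, key=lambda d: d[0]):
        chosen.setdefault(artifact, version)
    return chosen

# spark-core declares commons-net:2.2 directly (depth 1), while
# hadoop-client:2.6.5 brings in commons-net:3.1 transitively (depth 2).
deps = [
    (1, "commons-net", "2.2"),   # direct dependency of spark-core
    (2, "commons-net", "3.1"),   # transitive, via hadoop-client
]
assert mediate(deps)["commons-net"] == "2.2"  # the older 2.2 shadows 3.1
```

This is why the report suggests bumping the direct declaration to 3.1: mediation itself cannot pick the newer transitive version as long as the nearer, older one is declared.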
[jira] [Commented] (SPARK-23525) ALTER TABLE CHANGE COLUMN doesn't work for external hive table
[ https://issues.apache.org/jira/browse/SPARK-23525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378550#comment-16378550 ] Pavlo Skliar commented on SPARK-23525: -- Unbelievably quick response as for an open-sourced project. so appreciated, thanks > ALTER TABLE CHANGE COLUMN doesn't work for external hive table > -- > > Key: SPARK-23525 > URL: https://issues.apache.org/jira/browse/SPARK-23525 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.3.0 >Reporter: Pavlo Skliar >Priority: Major > > {code:java} > print(spark.sql(""" > SHOW CREATE TABLE test.trends > """).collect()[0].createtab_stmt) > /// OUTPUT > CREATE EXTERNAL TABLE `test`.`trends`(`id` string COMMENT '', `metric` string > COMMENT '', `amount` bigint COMMENT '') > COMMENT '' > PARTITIONED BY (`date` string COMMENT '') > ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > WITH SERDEPROPERTIES ( > 'serialization.format' = '1' > ) > STORED AS > INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' > LOCATION 's3://x/x/' > TBLPROPERTIES ( > 'transient_lastDdlTime' = '1519729384', > 'last_modified_time' = '1519645652', > 'last_modified_by' = 'pavlo', > 'last_castor_run_ts' = '1513561658.0' > ) > spark.sql(""" > DESCRIBE test.trends > """).collect() > // OUTPUT > [Row(col_name='id', data_type='string', comment=''), > Row(col_name='metric', data_type='string', comment=''), > Row(col_name='amount', data_type='bigint', comment=''), > Row(col_name='date', data_type='string', comment=''), > Row(col_name='# Partition Information', data_type='', comment=''), > Row(col_name='# col_name', data_type='data_type', comment='comment'), > Row(col_name='date', data_type='string', comment='')] > spark.sql("""alter table test.trends change column id id string comment > 'unique identifier'""") > spark.sql(""" > DESCRIBE 
test.trends > """).collect() > // OUTPUT > [Row(col_name='id', data_type='string', comment=''), Row(col_name='metric', > data_type='string', comment=''), Row(col_name='amount', data_type='bigint', > comment=''), Row(col_name='date', data_type='string', comment=''), > Row(col_name='# Partition Information', data_type='', comment=''), > Row(col_name='# col_name', data_type='data_type', comment='comment'), > Row(col_name='date', data_type='string', comment='')] > {code} > The strange thing is that I've assigned a comment to the id field from Hive > successfully, and it's visible in the Hue UI, but it's still not visible > from Spark, and Spark requests have no effect on the comments. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23525) ALTER TABLE CHANGE COLUMN doesn't work for external hive table
[ https://issues.apache.org/jira/browse/SPARK-23525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378546#comment-16378546 ] Jiang Xingbo commented on SPARK-23525: -- I'm working on a fix for it, and will try to backport the fix to 2.2. > ALTER TABLE CHANGE COLUMN doesn't work for external hive table > -- > > Key: SPARK-23525 > URL: https://issues.apache.org/jira/browse/SPARK-23525 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.3.0 >Reporter: Pavlo Skliar >Priority: Major > > {code:java} > print(spark.sql(""" > SHOW CREATE TABLE test.trends > """).collect()[0].createtab_stmt) > /// OUTPUT > CREATE EXTERNAL TABLE `test`.`trends`(`id` string COMMENT '', `metric` string > COMMENT '', `amount` bigint COMMENT '') > COMMENT '' > PARTITIONED BY (`date` string COMMENT '') > ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > WITH SERDEPROPERTIES ( > 'serialization.format' = '1' > ) > STORED AS > INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' > LOCATION 's3://x/x/' > TBLPROPERTIES ( > 'transient_lastDdlTime' = '1519729384', > 'last_modified_time' = '1519645652', > 'last_modified_by' = 'pavlo', > 'last_castor_run_ts' = '1513561658.0' > ) > spark.sql(""" > DESCRIBE test.trends > """).collect() > // OUTPUT > [Row(col_name='id', data_type='string', comment=''), > Row(col_name='metric', data_type='string', comment=''), > Row(col_name='amount', data_type='bigint', comment=''), > Row(col_name='date', data_type='string', comment=''), > Row(col_name='# Partition Information', data_type='', comment=''), > Row(col_name='# col_name', data_type='data_type', comment='comment'), > Row(col_name='date', data_type='string', comment='')] > spark.sql("""alter table test.trends change column id id string comment > 'unique identifier'""") > spark.sql(""" > DESCRIBE test.trends > 
""").collect() > // OUTPUT > [Row(col_name='id', data_type='string', comment=''), Row(col_name='metric', > data_type='string', comment=''), Row(col_name='amount', data_type='bigint', > comment=''), Row(col_name='date', data_type='string', comment=''), > Row(col_name='# Partition Information', data_type='', comment=''), > Row(col_name='# col_name', data_type='data_type', comment='comment'), > Row(col_name='date', data_type='string', comment='')] > {code} > The strange is that I've assigned comment to the id field from hive > successfully, and it's visible in Hue UI, but it's still not visible in from > spark, and any spark requests doesn't have effect on the comments. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23525) ALTER TABLE CHANGE COLUMN doesn't work for external hive table
[ https://issues.apache.org/jira/browse/SPARK-23525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378533#comment-16378533 ] Jiang Xingbo commented on SPARK-23525: -- Thank you for reporting this. I believe the bug is caused by: https://github.com/apache/spark/blob/8077bb04f350fd35df83ef896135c0672dc3f7b0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L613 > ALTER TABLE CHANGE COLUMN doesn't work for external hive table > -- > > Key: SPARK-23525 > URL: https://issues.apache.org/jira/browse/SPARK-23525 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.3.0 >Reporter: Pavlo Skliar >Priority: Major > > {code:java} > print(spark.sql(""" > SHOW CREATE TABLE test.trends > """).collect()[0].createtab_stmt) > /// OUTPUT > CREATE EXTERNAL TABLE `test`.`trends`(`id` string COMMENT '', `metric` string > COMMENT '', `amount` bigint COMMENT '') > COMMENT '' > PARTITIONED BY (`date` string COMMENT '') > ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > WITH SERDEPROPERTIES ( > 'serialization.format' = '1' > ) > STORED AS > INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' > LOCATION 's3://x/x/' > TBLPROPERTIES ( > 'transient_lastDdlTime' = '1519729384', > 'last_modified_time' = '1519645652', > 'last_modified_by' = 'pavlo', > 'last_castor_run_ts' = '1513561658.0' > ) > spark.sql(""" > DESCRIBE test.trends > """).collect() > // OUTPUT > [Row(col_name='id', data_type='string', comment=''), > Row(col_name='metric', data_type='string', comment=''), > Row(col_name='amount', data_type='bigint', comment=''), > Row(col_name='date', data_type='string', comment=''), > Row(col_name='# Partition Information', data_type='', comment=''), > Row(col_name='# col_name', data_type='data_type', comment='comment'), > Row(col_name='date', data_type='string', 
comment='')] > spark.sql("""alter table test.trends change column id id string comment > 'unique identifier'""") > spark.sql(""" > DESCRIBE test.trends > """).collect() > // OUTPUT > [Row(col_name='id', data_type='string', comment=''), Row(col_name='metric', > data_type='string', comment=''), Row(col_name='amount', data_type='bigint', > comment=''), Row(col_name='date', data_type='string', comment=''), > Row(col_name='# Partition Information', data_type='', comment=''), > Row(col_name='# col_name', data_type='data_type', comment='comment'), > Row(col_name='date', data_type='string', comment='')] > {code} > The strange thing is that I've assigned a comment to the id field from Hive > successfully, and it's visible in the Hue UI, but it's still not visible > from Spark, and Spark requests have no effect on the comments. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23525) ALTER TABLE CHANGE COLUMN doesn't work for external hive table
[ https://issues.apache.org/jira/browse/SPARK-23525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiang Xingbo updated SPARK-23525: - Affects Version/s: 2.3.0 Priority: Major (was: Minor) Summary: ALTER TABLE CHANGE COLUMN doesn't work for external hive table (was: Update column comment doesn't work from spark) > ALTER TABLE CHANGE COLUMN doesn't work for external hive table > -- > > Key: SPARK-23525 > URL: https://issues.apache.org/jira/browse/SPARK-23525 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.3.0 >Reporter: Pavlo Skliar >Priority: Major > > {code:java} > print(spark.sql(""" > SHOW CREATE TABLE test.trends > """).collect()[0].createtab_stmt) > /// OUTPUT > CREATE EXTERNAL TABLE `test`.`trends`(`id` string COMMENT '', `metric` string > COMMENT '', `amount` bigint COMMENT '') > COMMENT '' > PARTITIONED BY (`date` string COMMENT '') > ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > WITH SERDEPROPERTIES ( > 'serialization.format' = '1' > ) > STORED AS > INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' > LOCATION 's3://x/x/' > TBLPROPERTIES ( > 'transient_lastDdlTime' = '1519729384', > 'last_modified_time' = '1519645652', > 'last_modified_by' = 'pavlo', > 'last_castor_run_ts' = '1513561658.0' > ) > spark.sql(""" > DESCRIBE test.trends > """).collect() > // OUTPUT > [Row(col_name='id', data_type='string', comment=''), > Row(col_name='metric', data_type='string', comment=''), > Row(col_name='amount', data_type='bigint', comment=''), > Row(col_name='date', data_type='string', comment=''), > Row(col_name='# Partition Information', data_type='', comment=''), > Row(col_name='# col_name', data_type='data_type', comment='comment'), > Row(col_name='date', data_type='string', comment='')] > spark.sql("""alter table test.trends change column id id string comment > 'unique 
identifier'""") > spark.sql(""" > DESCRIBE test.trends > """).collect() > // OUTPUT > [Row(col_name='id', data_type='string', comment=''), Row(col_name='metric', > data_type='string', comment=''), Row(col_name='amount', data_type='bigint', > comment=''), Row(col_name='date', data_type='string', comment=''), > Row(col_name='# Partition Information', data_type='', comment=''), > Row(col_name='# col_name', data_type='data_type', comment='comment'), > Row(col_name='date', data_type='string', comment='')] > {code} > The strange thing is that I've assigned a comment to the id field from Hive > successfully, and it's visible in the Hue UI, but it's still not visible > from Spark, and Spark requests have no effect on the comments. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
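The suspected cause pointed at in HiveExternalCatalog above suggests a two-copies-of-the-schema problem: Hive keeps its own column metadata (what Hue reads), while Spark serializes a separate copy of the schema into table properties (what DESCRIBE reads), and the ALTER path fails to refresh the Spark-side copy. A hypothetical model of that failure mode — the class and method names here are illustrative, not Spark's actual code:

```python
class TwoCopyCatalog:
    """Hypothetical catalog that stores column comments in two places:
    Hive's own metadata and a Spark-side copy in table properties."""

    def __init__(self, comments):
        self.hive_metadata = dict(comments)       # what Hive / Hue reads
        self.spark_table_props = dict(comments)   # what Spark's DESCRIBE reads

    def alter_column_comment_from_hive(self, col, comment):
        # Hive updates only its own metadata, so Hue sees the change ...
        self.hive_metadata[col] = comment

    def describe_from_spark(self, col):
        # ... but Spark rebuilds the schema from its table-property copy,
        # which the alter path never refreshed -- the suspected bug.
        return self.spark_table_props[col]

cat = TwoCopyCatalog({"id": ""})
cat.alter_column_comment_from_hive("id", "unique identifier")
assert cat.hive_metadata["id"] == "unique identifier"  # visible in Hue
assert cat.describe_from_spark("id") == ""             # still empty in Spark
```

This matches the symptom in the report: the comment set from Hive shows up in Hue but not in Spark's DESCRIBE output.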
[jira] [Commented] (SPARK-23526) KafkaMicroBatchV2SourceSuite.ensure stream-stream self-join generates only one offset in offset log
[ https://issues.apache.org/jira/browse/SPARK-23526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378468#comment-16378468 ] Wenchen Fan commented on SPARK-23526: - cc [~zsxwing] > KafkaMicroBatchV2SourceSuite.ensure stream-stream self-join generates only > one offset in offset log > --- > > Key: SPARK-23526 > URL: https://issues.apache.org/jira/browse/SPARK-23526 > Project: Spark > Issue Type: Test > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Priority: Major > Labels: flaky-test > > See it failed in PR builder with error message: > {code:java} > sbt.ForkMain$ForkError: > org.apache.spark.sql.streaming.StreamingQueryException: Query [id = > 676a8b08-c89b-450b-8cd8-fbf9868cd240, runId = > 46bb7aae-138b-420d-9b4f-44f42a2a4a0f] terminated with exception: Job aborted > due to stage failure: Task 0 in stage 163.0 failed 1 times, most recent > failure: Lost task 0.0 in stage 163.0 (TID 799, localhost, executor driver): > java.util.ConcurrentModificationException: KafkaConsumer is not safe for > multi-threaded access at > org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431) > at > org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132) > at > org.apache.spark.sql.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:305) > at > org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$apache$spark$sql$kafka010$CachedKafkaConsumer$$fetchData(CachedKafkaConsumer.scala:216) > at > org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:122) > at > org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:106) > at > org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) > at > org.apache.spark.sql.kafka010.CachedKafkaConsumer.runUninterruptiblyIfPossible(CachedKafkaConsumer.scala:68) > at > 
org.apache.spark.sql.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:106) > at > org.apache.spark.sql.kafka010.KafkaMicroBatchDataReader.next(KafkaMicroBatchReader.scala:353) > at > org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:49) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown > Source) at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:616) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at > org.apache.spark.scheduler.Task.run(Task.scala:109) at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23526) KafkaMicroBatchV2SourceSuite.ensure stream-stream self-join generates only one offset in offset log
Wenchen Fan created SPARK-23526: --- Summary: KafkaMicroBatchV2SourceSuite.ensure stream-stream self-join generates only one offset in offset log Key: SPARK-23526 URL: https://issues.apache.org/jira/browse/SPARK-23526 Project: Spark Issue Type: Test Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Wenchen Fan See it failed in PR builder with error message: {code:java} sbt.ForkMain$ForkError: org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 676a8b08-c89b-450b-8cd8-fbf9868cd240, runId = 46bb7aae-138b-420d-9b4f-44f42a2a4a0f] terminated with exception: Job aborted due to stage failure: Task 0 in stage 163.0 failed 1 times, most recent failure: Lost task 0.0 in stage 163.0 (TID 799, localhost, executor driver): java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access at org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431) at org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132) at org.apache.spark.sql.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:305) at org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$apache$spark$sql$kafka010$CachedKafkaConsumer$$fetchData(CachedKafkaConsumer.scala:216) at org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:122) at org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:106) at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) at org.apache.spark.sql.kafka010.CachedKafkaConsumer.runUninterruptiblyIfPossible(CachedKafkaConsumer.scala:68) at org.apache.spark.sql.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:106) at org.apache.spark.sql.kafka010.KafkaMicroBatchDataReader.next(KafkaMicroBatchReader.scala:353) at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:49) at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:616) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23525) Update column comment doesn't work from spark
Pavlo Skliar created SPARK-23525: Summary: Update column comment doesn't work from spark Key: SPARK-23525 URL: https://issues.apache.org/jira/browse/SPARK-23525 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Pavlo Skliar {code:java} print(spark.sql(""" SHOW CREATE TABLE test.trends """).collect()[0].createtab_stmt) /// OUTPUT CREATE EXTERNAL TABLE `test`.`trends`(`id` string COMMENT '', `metric` string COMMENT '', `amount` bigint COMMENT '') COMMENT '' PARTITIONED BY (`date` string COMMENT '') ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' WITH SERDEPROPERTIES ( 'serialization.format' = '1' ) STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3://x/x/' TBLPROPERTIES ( 'transient_lastDdlTime' = '1519729384', 'last_modified_time' = '1519645652', 'last_modified_by' = 'pavlo', 'last_castor_run_ts' = '1513561658.0' ) spark.sql(""" DESCRIBE test.trends """).collect() // OUTPUT [Row(col_name='id', data_type='string', comment=''), Row(col_name='metric', data_type='string', comment=''), Row(col_name='amount', data_type='bigint', comment=''), Row(col_name='date', data_type='string', comment=''), Row(col_name='# Partition Information', data_type='', comment=''), Row(col_name='# col_name', data_type='data_type', comment='comment'), Row(col_name='date', data_type='string', comment='')] spark.sql("""alter table test.trends change column id id string comment 'unique identifier'""") spark.sql(""" DESCRIBE test.trends """).collect() // OUTPUT [Row(col_name='id', data_type='string', comment=''), Row(col_name='metric', data_type='string', comment=''), Row(col_name='amount', data_type='bigint', comment=''), Row(col_name='date', data_type='string', comment=''), Row(col_name='# Partition Information', data_type='', comment=''), Row(col_name='# col_name', data_type='data_type', comment='comment'), 
Row(col_name='date', data_type='string', comment='')] {code} The strange thing is that I've assigned a comment to the id field from Hive successfully, and it's visible in the Hue UI, but it's still not visible from Spark, and Spark requests have no effect on the comments. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17859) persist should not impede with spark's ability to perform a broadcast join.
[ https://issues.apache.org/jira/browse/SPARK-17859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fernando Pereira updated SPARK-17859: - Fix Version/s: 2.2.1 > persist should not impede with spark's ability to perform a broadcast join. > --- > > Key: SPARK-17859 > URL: https://issues.apache.org/jira/browse/SPARK-17859 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.0.0 > Environment: spark 2.0.0, Linux RedHat >Reporter: Franck Tago >Priority: Major > Fix For: 2.0.2, 2.2.1 > > > I am using Spark 2.0.0 > My investigation leads me to conclude that calling persist could prevent > a broadcast join from happening. > Example > Case 1: No persist call > var df1 =spark.range(100).select($"id".as("id1")) > df1: org.apache.spark.sql.DataFrame = [id1: bigint] > var df2 =spark.range(1000).select($"id".as("id2")) > df2: org.apache.spark.sql.DataFrame = [id2: bigint] > df1.join(df2 , $"id1" === $"id2" ).explain > == Physical Plan == > *BroadcastHashJoin [id1#117L], [id2#123L], Inner, BuildRight > :- *Project [id#114L AS id1#117L] > : +- *Range (0, 100, splits=2) > +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, > false])) >+- *Project [id#120L AS id2#123L] > +- *Range (0, 1000, splits=2) > Case 2: persist call > df1.persist.join(df2 , $"id1" === $"id2" ).explain > 16/10/10 15:50:21 WARN CacheManager: Asked to cache already cached data.
> == Physical Plan == > *SortMergeJoin [id1#3L], [id2#9L], Inner > :- *Sort [id1#3L ASC], false, 0 > : +- Exchange hashpartitioning(id1#3L, 10) > : +- InMemoryTableScan [id1#3L] > :: +- InMemoryRelation [id1#3L], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > :: : +- *Project [id#0L AS id1#3L] > :: : +- *Range (0, 100, splits=2) > +- *Sort [id2#9L ASC], false, 0 >+- Exchange hashpartitioning(id2#9L, 10) > +- InMemoryTableScan [id2#9L] > : +- InMemoryRelation [id2#9L], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > : : +- *Project [id#6L AS id2#9L] > : : +- *Range (0, 1000, splits=2) > Why does the persist call prevent the broadcast join? > My opinion is that it should not. > I was made aware that the persist call is lazy and that might have something > to do with it, but I still contend that it should not. > Losing broadcast joins is really costly. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
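The plan change above is driven by a size-based decision: a join side is broadcast only while its estimated size stays under spark.sql.autoBroadcastJoinThreshold (a real Spark setting, 10 MB by default), and the InMemoryRelation introduced by persist() carries different size statistics than the original Range scan, which can push the estimate over the threshold. A rough model of that decision — a sketch, not Spark's planner code, with the cached estimate chosen hypothetically:

```python
# spark.sql.autoBroadcastJoinThreshold default: 10 MB
AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024

def choose_join(build_side_estimate_bytes):
    """Pick a join strategy from the build side's estimated size,
    mimicking the planner's threshold comparison."""
    if build_side_estimate_bytes <= AUTO_BROADCAST_THRESHOLD:
        return "BroadcastHashJoin"
    return "SortMergeJoin"

range_scan_estimate = 100 * 8           # tiny: 100 bigint rows
cached_estimate = 64 * 1024 * 1024      # hypothetical inflated InMemoryRelation stats

assert choose_join(range_scan_estimate) == "BroadcastHashJoin"
assert choose_join(cached_estimate) == "SortMergeJoin"
```

When the statistics mislead the planner like this, an explicit hint — e.g. `df1.persist.join(broadcast(df2), ...)` using `org.apache.spark.sql.functions.broadcast` — forces the broadcast regardless of the estimate.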
[jira] [Assigned] (SPARK-23524) Big local shuffle blocks should not be checked for corruption.
[ https://issues.apache.org/jira/browse/SPARK-23524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23524: Assignee: Apache Spark > Big local shuffle blocks should not be checked for corruption. > -- > > Key: SPARK-23524 > URL: https://issues.apache.org/jira/browse/SPARK-23524 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: jin xing >Assignee: Apache Spark >Priority: Major > > In the current code, all local blocks are checked for corruption no matter > whether they are big or not. The reasons are as below: > # The size in FetchResult for a local block is set to 0 > ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L327)] > # SPARK-4105 meant to check only the small blocks (below a size threshold), > but because of reason 1 the check below is invalid. > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L420 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23524) Big local shuffle blocks should not be checked for corruption.
[ https://issues.apache.org/jira/browse/SPARK-23524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23524: Assignee: (was: Apache Spark) > Big local shuffle blocks should not be checked for corruption. > -- > > Key: SPARK-23524 > URL: https://issues.apache.org/jira/browse/SPARK-23524 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: jin xing >Priority: Major > > In the current code, all local blocks are checked for corruption no matter > whether they are big or not. The reasons are as below: > # The size in FetchResult for a local block is set to 0 > ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L327)] > # SPARK-4105 meant to check only the small blocks (below a size threshold), > but because of reason 1 the check below is invalid. > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L420 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23524) Big local shuffle blocks should not be checked for corruption.
[ https://issues.apache.org/jira/browse/SPARK-23524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378324#comment-16378324 ] Apache Spark commented on SPARK-23524: -- User 'jinxing64' has created a pull request for this issue: https://github.com/apache/spark/pull/20685 > Big local shuffle blocks should not be checked for corruption. > -- > > Key: SPARK-23524 > URL: https://issues.apache.org/jira/browse/SPARK-23524 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: jin xing >Priority: Major > > In the current code, all local blocks are checked for corruption no matter > whether they are big or not. The reasons are as below: > # The size in FetchResult for a local block is set to 0 > ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L327)] > # SPARK-4105 meant to check only the small blocks (below a size threshold), > but because of reason 1 the check below is invalid. > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L420 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
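The two reasons in the SPARK-23524 description combine as follows: corruption detection is intended only for blocks small enough to buffer in memory, but since every local block reports size 0, the "is it small?" test always passes for local blocks, however large the underlying file. A sketch of that flawed interaction — not Spark's actual code, and the threshold value here is a placeholder:

```python
SMALL_BLOCK_THRESHOLD = 1 * 1024 * 1024  # placeholder; the real cutoff differs

def should_check_corruption(reported_size):
    """Only blocks believed to be small are buffered and verified,
    per the intent of SPARK-4105."""
    return reported_size < SMALL_BLOCK_THRESHOLD

# A big remote block reports its true size and is skipped, as intended:
assert not should_check_corruption(200 * 1024 * 1024)

# A big local block reports size 0 (reason 1 above), so it is
# (incorrectly) selected for the corruption check anyway:
assert should_check_corruption(0)
```

The fix proposed in the linked pull request is, accordingly, to exempt local blocks from this size-gated check rather than rely on their (always-zero) reported size.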
[jira] [Resolved] (SPARK-23351) checkpoint corruption in long running application
[ https://issues.apache.org/jira/browse/SPARK-23351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Ahern resolved SPARK-23351. - Resolution: Fixed Fix Version/s: 2.2.1 already fixed in future version... will test when available by Cloudera > checkpoint corruption in long running application > - > > Key: SPARK-23351 > URL: https://issues.apache.org/jira/browse/SPARK-23351 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: David Ahern >Priority: Major > Fix For: 2.2.1 > > > hi, after leaving my (somewhat high volume) Structured Streaming application > running for some time, i get the following exception. The same exception > also repeats when i try to restart the application. The only way to get the > application back running is to clear the checkpoint directory which is far > from ideal. > Maybe a stream is not being flushed/closed properly internally by Spark when > checkpointing? > > User class threw exception: > org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to > stage failure: Task 55 in stage 1.0 failed 4 times, most recent failure: Lost > task 55.3 in stage 1.0 (TID 240, gbslixaacspa04u.metis.prd, executor 2): > java.io.EOFException > at java.io.DataInputStream.readInt(DataInputStream.java:392) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:481) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358) > at 
scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:265) > at > org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:200) > at > org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
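The reporter's hypothesis above (a snapshot stream not flushed/closed properly) is consistent with the `java.io.EOFException` at the top of the stack trace. As a minimal illustrative sketch, not Spark's code: when a writer leaves fewer bytes on disk than `DataInputStream.readInt` expects, the reader fails exactly this way.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream, EOFException}

// Simulate a snapshot file truncated mid-write: only 2 bytes land in the
// buffer, but readInt will ask for 4.
val buf = new ByteArrayOutputStream()
val out = new DataOutputStream(buf)
out.writeShort(42) // 2 of the 4 bytes readInt needs
out.flush()

val in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray))
val sawEof =
  try { in.readInt(); false }
  catch { case _: EOFException => true }

// Reading past the truncation point raises java.io.EOFException, matching
// the first frame of the stack trace in the report.
assert(sawEof, "truncated input should raise EOFException")
```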
[jira] [Commented] (SPARK-23351) checkpoint corruption in long running application
[ https://issues.apache.org/jira/browse/SPARK-23351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378311#comment-16378311 ] David Ahern commented on SPARK-23351: - will close this assuming it's fixed in future version... will reopen if necessary > checkpoint corruption in long running application > - > > Key: SPARK-23351 > URL: https://issues.apache.org/jira/browse/SPARK-23351 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: David Ahern >Priority: Major >
[jira] [Created] (SPARK-23524) Big local shuffle blocks should not be checked for corruption.
jin xing created SPARK-23524: Summary: Big local shuffle blocks should not be checked for corruption. Key: SPARK-23524 URL: https://issues.apache.org/jira/browse/SPARK-23524 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.1 Reporter: jin xing In the current code, all local blocks are checked for corruption regardless of whether they are big or not. The reasons are as below: # The size in FetchResult for a local block is set to 0 (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L327) # SPARK-4105 meant to only check the small blocks (size < maxBytesInFlight / 3), but because of reason 1, the check at https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L420 does not work as intended
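The condition the issue argues for can be sketched as a small predicate. This is a hedged illustration, not Spark's actual `ShuffleBlockFetcherIterator` code; the names `shouldCheckCorruption`, `blockSize`, and `isLocal` are assumptions made for the example, and the `maxBytesInFlight / 3` threshold is the small-block cutoff the issue refers to.

```scala
// Hypothetical guard: corruption checks were intended only for small remote
// blocks that are fully buffered in memory. Local blocks (whose size is
// reported as 0) and big streamed blocks should be skipped.
def shouldCheckCorruption(blockSize: Long, isLocal: Boolean, maxBytesInFlight: Long): Boolean =
  !isLocal && blockSize > 0 && blockSize < maxBytesInFlight / 3

val maxBytesInFlight = 48L * 1024 * 1024

// Local block reported with size 0: skipped (avoids the bug the issue describes).
assert(!shouldCheckCorruption(0L, true, maxBytesInFlight))
// Small remote block, fully buffered: checked.
assert(shouldCheckCorruption(1024L, false, maxBytesInFlight))
// Big remote block, streamed rather than buffered: skipped.
assert(!shouldCheckCorruption(maxBytesInFlight, false, maxBytesInFlight))
```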
[jira] [Commented] (SPARK-23346) Failed tasks reported as success if the failure reason is not ExceptionFailure
[ https://issues.apache.org/jira/browse/SPARK-23346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378303#comment-16378303 ] 吴志龙 commented on SPARK-23346: - Has this bug been found and fixed? It will result in incorrect data. > Failed tasks reported as success if the failure reason is not ExceptionFailure > -- > > Key: SPARK-23346 > URL: https://issues.apache.org/jira/browse/SPARK-23346 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.0 > Environment: HADOOP 2.6 + JDK1.8 + SPARK 2.2.0 >Reporter: 吴志龙 >Priority: Critical > Attachments: 企业微信截图_15179714603307.png, 企业微信截图_15179715023606.png > > > !企业微信截图_15179715023606.png! !企业微信截图_15179714603307.png! We have many other > failure reasons, such as TaskResultLost, but the status is success. In the web > ui, we count non-ExceptionFailure failures as successful tasks, which is > highly misleading. > detail message: > Job aborted due to stage failure: Task 0 in stage 7.0 failed 10 times, most > recent failure: Lost task 0.9 in stage 7.0 (TID 35, 60.hadoop.com, executor > 27): TaskResultLost (result lost from block manager)
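The accounting bug SPARK-23346 describes can be sketched with a simplified model. This is a hedged illustration with hypothetical types, not Spark's actual `TaskEndReason` hierarchy or UI code: the point is that any end reason other than success, including `TaskResultLost`, should count as a failed task.

```scala
// Simplified stand-ins for task end reasons (illustrative only).
sealed trait TaskEndReason
case object Success extends TaskEndReason
final case class ExceptionFailure(message: String) extends TaskEndReason
case object TaskResultLost extends TaskEndReason

// A task failed unless it ended with Success; checking only for
// ExceptionFailure (as the report says the UI effectively does) would
// miscount TaskResultLost as a successful task.
def isFailure(reason: TaskEndReason): Boolean = reason != Success

val reasons = Seq(Success, ExceptionFailure("boom"), TaskResultLost)
val failedCount = reasons.count(isFailure)
assert(failedCount == 2) // both ExceptionFailure and TaskResultLost count as failures
```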