[jira] [Updated] (SPARK-23532) [STANDALONE] Improve data locality when launching new executors for dynamic allocation

2018-02-27 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-23532:

Description: 
Currently, Spark on YARN supports better data locality by considering the 
preferred locations of the pending tasks when dynamic allocation is enabled; 
refer to https://issues.apache.org/jira/browse/SPARK-4352.

Mesos also supports data locality; refer to 
https://issues.apache.org/jira/browse/SPARK-16944.

It would be good if Standalone mode also supported this feature.

  was:
Currently, Spark on YARN supports better data locality by considering the 
preferred locations of the pending tasks when dynamic allocation is enabled; 
refer to https://issues.apache.org/jira/browse/SPARK-4352.

Mesos also supports data locality; refer to 
https://issues.apache.org/jira/browse/SPARK-16944.

It would be good if Standalone mode also supported this feature.


> [STANDALONE] Improve data locality when launching new executors for dynamic 
> allocation
> --
>
> Key: SPARK-23532
> URL: https://issues.apache.org/jira/browse/SPARK-23532
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Major
>
> Currently, Spark on YARN supports better data locality by considering the 
> preferred locations of the pending tasks when dynamic allocation is enabled; 
> refer to https://issues.apache.org/jira/browse/SPARK-4352.
> Mesos also supports data locality; refer to 
> https://issues.apache.org/jira/browse/SPARK-16944.
> It would be good if Standalone mode also supported this feature.






[jira] [Assigned] (SPARK-23532) [STANDALONE] Improve data locality when launching new executors for dynamic allocation

2018-02-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23532:


Assignee: (was: Apache Spark)

> [STANDALONE] Improve data locality when launching new executors for dynamic 
> allocation
> --
>
> Key: SPARK-23532
> URL: https://issues.apache.org/jira/browse/SPARK-23532
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Major
>
> Currently, Spark on YARN supports better data locality by considering the 
> preferred locations of the pending tasks when dynamic allocation is enabled; 
> refer to https://issues.apache.org/jira/browse/SPARK-4352.
> Mesos also supports data locality; refer to 
> https://issues.apache.org/jira/browse/SPARK-16944.
> It would be good if Standalone mode also supported this feature.






[jira] [Commented] (SPARK-23532) [STANDALONE] Improve data locality when launching new executors for dynamic allocation

2018-02-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379906#comment-16379906
 ] 

Apache Spark commented on SPARK-23532:
--

User '10110346' has created a pull request for this issue:
https://github.com/apache/spark/pull/20690

> [STANDALONE] Improve data locality when launching new executors for dynamic 
> allocation
> --
>
> Key: SPARK-23532
> URL: https://issues.apache.org/jira/browse/SPARK-23532
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Major
>
> Currently, Spark on YARN supports better data locality by considering the 
> preferred locations of the pending tasks when dynamic allocation is enabled; 
> refer to https://issues.apache.org/jira/browse/SPARK-4352.
> Mesos also supports data locality; refer to 
> https://issues.apache.org/jira/browse/SPARK-16944.
> It would be good if Standalone mode also supported this feature.






[jira] [Assigned] (SPARK-23532) [STANDALONE] Improve data locality when launching new executors for dynamic allocation

2018-02-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23532:


Assignee: Apache Spark

> [STANDALONE] Improve data locality when launching new executors for dynamic 
> allocation
> --
>
> Key: SPARK-23532
> URL: https://issues.apache.org/jira/browse/SPARK-23532
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Assignee: Apache Spark
>Priority: Major
>
> Currently, Spark on YARN supports better data locality by considering the 
> preferred locations of the pending tasks when dynamic allocation is enabled; 
> refer to https://issues.apache.org/jira/browse/SPARK-4352.
> Mesos also supports data locality; refer to 
> https://issues.apache.org/jira/browse/SPARK-16944.
> It would be good if Standalone mode also supported this feature.






[jira] [Updated] (SPARK-23530) It's not appropriate to let the original master exit while the leader of zookeeper shutdown

2018-02-27 Thread liuxianjiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxianjiao updated SPARK-23530:

Priority: Critical  (was: Major)

> It's not appropriate to let the original master exit while the leader of 
> zookeeper shutdown
> ---
>
> Key: SPARK-23530
> URL: https://issues.apache.org/jira/browse/SPARK-23530
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1, 2.3.0
>Reporter: liuxianjiao
>Priority: Critical
>
> When the ZooKeeper leader shuts down, Spark currently revokes leadership by 
> letting the master exit. However, this sacrifices a master node. Following 
> the approach of Hadoop and Storm, we should let the originally active master 
> become a standby, re-elect a new Spark master, or find another way to revoke 
> leadership gracefully.






[jira] [Updated] (SPARK-23532) [STANDALONE] Improve data locality when launching new executors for dynamic allocation

2018-02-27 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-23532:

Description: 
Currently, Spark on YARN supports better data locality by considering the 
preferred locations of the pending tasks when dynamic allocation is enabled; 
refer to https://issues.apache.org/jira/browse/SPARK-4352.

Mesos also supports data locality; refer to 
https://issues.apache.org/jira/browse/SPARK-16944.

It would be good if Standalone mode also supported this feature.

  was:
Currently, Spark on YARN supports better data locality by considering the 
preferred locations of the pending tasks when dynamic allocation is enabled; 
refer to https://issues.apache.org/jira/browse/SPARK-4352.

Mesos also supports data locality; refer to 
https://issues.apache.org/jira/browse/SPARK-16944.

It would be good if Standalone mode also supported this feature.


> [STANDALONE] Improve data locality when launching new executors for dynamic 
> allocation
> --
>
> Key: SPARK-23532
> URL: https://issues.apache.org/jira/browse/SPARK-23532
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Major
>
> Currently, Spark on YARN supports better data locality by considering the 
> preferred locations of the pending tasks when dynamic allocation is enabled; 
> refer to https://issues.apache.org/jira/browse/SPARK-4352.
> Mesos also supports data locality; refer to 
> https://issues.apache.org/jira/browse/SPARK-16944.
> It would be good if Standalone mode also supported this feature.






[jira] [Updated] (SPARK-23532) [STANDALONE] Improve data locality when launching new executors for dynamic allocation

2018-02-27 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-23532:

Description: 
Currently, Spark on YARN supports better data locality by considering the 
preferred locations of the pending tasks when dynamic allocation is enabled; 
refer to https://issues.apache.org/jira/browse/SPARK-4352.

Mesos also supports data locality; refer to 
https://issues.apache.org/jira/browse/SPARK-16944.

It would be good if Standalone mode also supported this feature.

  was:
Currently, Spark on YARN supports better data locality by considering the 
preferred locations of the pending tasks when dynamic allocation is enabled; 
refer to https://issues.apache.org/jira/browse/SPARK-4352.

Mesos also supports data locality; refer to 
https://issues.apache.org/jira/browse/SPARK-16944.

It would be good if Standalone mode also supported this feature.


> [STANDALONE] Improve data locality when launching new executors for dynamic 
> allocation
> --
>
> Key: SPARK-23532
> URL: https://issues.apache.org/jira/browse/SPARK-23532
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Major
>
> Currently, Spark on YARN supports better data locality by considering the 
> preferred locations of the pending tasks when dynamic allocation is enabled; 
> refer to https://issues.apache.org/jira/browse/SPARK-4352.
> Mesos also supports data locality; refer to 
> https://issues.apache.org/jira/browse/SPARK-16944.
> It would be good if Standalone mode also supported this feature.






[jira] [Assigned] (SPARK-23533) Add support for changing ContinuousDataReader's startOffset

2018-02-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23533:


Assignee: Apache Spark

> Add support for changing ContinuousDataReader's startOffset
> ---
>
> Key: SPARK-23533
> URL: https://issues.apache.org/jira/browse/SPARK-23533
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Li Yuanjian
>Assignee: Apache Spark
>Priority: Major
>
> As discussed in [https://github.com/apache/spark/pull/20675], we need to add 
> a new interface `ContinuousDataReaderFactory` to support setting the start 
> offset in Continuous Processing.






[jira] [Assigned] (SPARK-23533) Add support for changing ContinuousDataReader's startOffset

2018-02-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23533:


Assignee: (was: Apache Spark)

> Add support for changing ContinuousDataReader's startOffset
> ---
>
> Key: SPARK-23533
> URL: https://issues.apache.org/jira/browse/SPARK-23533
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Li Yuanjian
>Priority: Major
>
> As discussed in [https://github.com/apache/spark/pull/20675], we need to add 
> a new interface `ContinuousDataReaderFactory` to support setting the start 
> offset in Continuous Processing.






[jira] [Commented] (SPARK-23533) Add support for changing ContinuousDataReader's startOffset

2018-02-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379897#comment-16379897
 ] 

Apache Spark commented on SPARK-23533:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/20689

> Add support for changing ContinuousDataReader's startOffset
> ---
>
> Key: SPARK-23533
> URL: https://issues.apache.org/jira/browse/SPARK-23533
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Li Yuanjian
>Priority: Major
>
> As discussed in [https://github.com/apache/spark/pull/20675], we need to add 
> a new interface `ContinuousDataReaderFactory` to support setting the start 
> offset in Continuous Processing.






[jira] [Updated] (SPARK-23533) Add support for changing ContinuousDataReader's startOffset

2018-02-27 Thread Li Yuanjian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Yuanjian updated SPARK-23533:

Summary: Add support for changing ContinuousDataReader's startOffset  (was: 
Add support for changing ContinousDataReader's startOffset)

> Add support for changing ContinuousDataReader's startOffset
> ---
>
> Key: SPARK-23533
> URL: https://issues.apache.org/jira/browse/SPARK-23533
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Li Yuanjian
>Priority: Major
>
> As discussed in [https://github.com/apache/spark/pull/20675], we need to add 
> a new interface `ContinuousDataReaderFactory` to support setting the start 
> offset in Continuous Processing.






[jira] [Created] (SPARK-23533) Add support for changing ContinousDataReader's startOffset

2018-02-27 Thread Li Yuanjian (JIRA)
Li Yuanjian created SPARK-23533:
---

 Summary: Add support for changing ContinousDataReader's startOffset
 Key: SPARK-23533
 URL: https://issues.apache.org/jira/browse/SPARK-23533
 Project: Spark
  Issue Type: Task
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Li Yuanjian


As discussed in [https://github.com/apache/spark/pull/20675], we need to add a 
new interface `ContinuousDataReaderFactory` to support setting the start offset 
in Continuous Processing.
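
A rough sketch of what such a factory could look like is below. The trait name 
comes from the description; the method name and shape are assumptions for 
illustration, not necessarily the API merged by the pull request.

{code:java}
// Illustrative only: the exact interface lives in the pull request above.
import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory}
import org.apache.spark.sql.sources.v2.reader.streaming.PartitionOffset

trait ContinuousDataReaderFactory[T] extends DataReaderFactory[T] {
  // Create a reader that resumes from the given per-partition offset instead
  // of the offset the factory was originally created with.
  def createDataReaderWithOffset(offset: PartitionOffset): DataReader[T]
}
{code}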






[jira] [Created] (SPARK-23532) [STANDALONE] Improve data locality when launching new executors for dynamic allocation

2018-02-27 Thread liuxian (JIRA)
liuxian created SPARK-23532:
---

 Summary: [STANDALONE] Improve data locality when launching new 
executors for dynamic allocation
 Key: SPARK-23532
 URL: https://issues.apache.org/jira/browse/SPARK-23532
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: liuxian


Currently, Spark on YARN supports better data locality by considering the 
preferred locations of the pending tasks when dynamic allocation is enabled; 
refer to https://issues.apache.org/jira/browse/SPARK-4352.

Mesos also supports data locality; refer to 
https://issues.apache.org/jira/browse/SPARK-16944.

It would be good if Standalone mode also supported this feature.
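
For context, a minimal sketch of the setting in which this matters is below: a 
standalone-mode application with dynamic allocation enabled, where newly 
requested executors are currently placed without regard to the preferred 
locations of pending tasks. The master URL, application name, and executor 
bound are placeholder values, not part of the original report.

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("spark://master:7077")                    // placeholder standalone master
  .appName("locality-aware-dynamic-allocation")
  .config("spark.dynamicAllocation.enabled", "true")
  // The external shuffle service is required when dynamic allocation is on.
  .config("spark.shuffle.service.enabled", "true")
  .config("spark.dynamicAllocation.maxExecutors", "20")
  .getOrCreate()
{code}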






[jira] [Commented] (SPARK-14922) Alter Table Drop Partition Using Predicate-based Partition Spec

2018-02-27 Thread Astha Arya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379807#comment-16379807
 ] 

Astha Arya commented on SPARK-14922:


Any updates?

 

> Alter Table Drop Partition Using Predicate-based Partition Spec
> ---
>
> Key: SPARK-14922
> URL: https://issues.apache.org/jira/browse/SPARK-14922
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.2, 2.2.1
>Reporter: Xiao Li
>Priority: Major
>
> Below is allowed in Hive, but not allowed in Spark.
> {noformat}
> alter table ptestfilter drop partition (c='US', d<'2')
> {noformat}
> This example is copied from drop_partitions_filter.q
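
For context, equality-based partition specs already work in Spark SQL; only the 
comparison operator in the spec is rejected. A small sketch, assuming a 
SparkSession named spark and the table from the Hive test above:

{code:java}
// Accepted by both Hive and Spark SQL: equality-only partition spec.
spark.sql("ALTER TABLE ptestfilter DROP PARTITION (c='US', d='1')")

// Rejected by Spark's parser (the subject of this issue): predicate-based spec.
// spark.sql("ALTER TABLE ptestfilter DROP PARTITION (c='US', d<'2')")
{code}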






[jira] [Created] (SPARK-23531) When explain, plan's output should include attribute type info

2018-02-27 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-23531:
---

 Summary: When explain, plan's output should include attribute type 
info
 Key: SPARK-23531
 URL: https://issues.apache.org/jira/browse/SPARK-23531
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan









[jira] [Commented] (SPARK-23498) Accuracy problem in comparison with string and integer

2018-02-27 Thread Kevin Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379753#comment-16379753
 ] 

Kevin Zhang commented on SPARK-23498:
-

Yes, thanks. But when we use Spark SQL to run existing Hive scripts, we expect 
Spark SQL to produce the same results as Hive, which is why I opened this JIRA. 
Now that [~q79969786] has marked this as a duplicate of 
[SPARK-21646|https://issues.apache.org/jira/browse/SPARK-21646], I will patch 
my own branch first.

> Accuracy problem in comparison with string and integer
> --
>
> Key: SPARK-23498
> URL: https://issues.apache.org/jira/browse/SPARK-23498
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
>Reporter: Kevin Zhang
>Priority: Major
>
> While comparing a string column with an integer value, Spark SQL will 
> automatically cast the string operand to int, so the following SQL returns 
> true in Hive but false in Spark:
>  
> {code:java}
> select '1000.1'>1000
> {code}
>  
> From the physical plan we can see that the string operand was cast to int, 
> which caused the accuracy loss:
> {code:java}
> *Project [false AS (CAST(1000.1 AS INT) > 1000)#4]
> +- Scan OneRowRelation[]
> {code}
> To solve this, casting both operands of the binary operator to a wider common 
> type such as double may be safe.
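
A hedged sketch of the workaround implied above, assuming a SparkSession named 
spark: casting explicitly to a wider type avoids the lossy string-to-int 
coercion.

{code:java}
// Implicit coercion: as reported above, Spark casts '1000.1' to int -> false.
spark.sql("SELECT '1000.1' > 1000").show()

// Explicit cast to a wider type keeps the fractional part -> true, as in Hive.
spark.sql("SELECT CAST('1000.1' AS DOUBLE) > 1000").show()
{code}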






[jira] [Updated] (SPARK-23530) It's not appropriate to let the original master exit while the leader of zookeeper shutdown

2018-02-27 Thread liuxianjiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxianjiao updated SPARK-23530:

Affects Version/s: 2.3.0

> It's not appropriate to let the original master exit while the leader of 
> zookeeper shutdown
> ---
>
> Key: SPARK-23530
> URL: https://issues.apache.org/jira/browse/SPARK-23530
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1, 2.3.0
>Reporter: liuxianjiao
>Priority: Major
>
> When the ZooKeeper leader shuts down, Spark currently revokes leadership by 
> letting the master exit. However, this sacrifices a master node. Following 
> the approach of Hadoop and Storm, we should let the originally active master 
> become a standby, re-elect a new Spark master, or find another way to revoke 
> leadership gracefully.






[jira] [Issue Comment Deleted] (SPARK-23530) It's not appropriate to let the original master exit while the leader of zookeeper shutdown

2018-02-27 Thread liuxianjiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxianjiao updated SPARK-23530:

Comment: was deleted

(was: When the ZooKeeper leader shuts down, Spark currently revokes leadership 
by letting the master exit. However, this sacrifices a master node. Following 
the approach of Hadoop and Storm, we should let the originally active master 
become a standby, re-elect a new Spark master, or find another way to revoke 
leadership gracefully.)

> It's not appropriate to let the original master exit while the leader of 
> zookeeper shutdown
> ---
>
> Key: SPARK-23530
> URL: https://issues.apache.org/jira/browse/SPARK-23530
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: liuxianjiao
>Priority: Major
>
> When the ZooKeeper leader shuts down, Spark currently revokes leadership by 
> letting the master exit. However, this sacrifices a master node. Following 
> the approach of Hadoop and Storm, we should let the originally active master 
> become a standby, re-elect a new Spark master, or find another way to revoke 
> leadership gracefully.






[jira] [Updated] (SPARK-23530) It's not appropriate to let the original master exit while the leader of zookeeper shutdown

2018-02-27 Thread liuxianjiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxianjiao updated SPARK-23530:

Description: When the ZooKeeper leader shuts down, Spark currently revokes 
leadership by letting the master exit. However, this sacrifices a master node. 
Following the approach of Hadoop and Storm, we should let the originally active 
master become a standby, re-elect a new Spark master, or find another way to 
revoke leadership gracefully.
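
To make the proposal concrete, here is a toy model contrasting the two 
behaviours. This is not Spark code; the names are invented for illustration 
only.

{code:java}
sealed trait MasterState
case object Alive extends MasterState
case object Standby extends MasterState

class ToyMaster {
  @volatile var state: MasterState = Alive

  // Current behaviour described above: losing ZooKeeper leadership terminates
  // the whole master process, sacrificing that node.
  def revokeByExit(): Unit = sys.exit(0)

  // Proposed behaviour: demote to standby and wait to be re-elected.
  def revokeGracefully(): Unit = { state = Standby }
}
{code}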

> It's not appropriate to let the original master exit while the leader of 
> zookeeper shutdown
> ---
>
> Key: SPARK-23530
> URL: https://issues.apache.org/jira/browse/SPARK-23530
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: liuxianjiao
>Priority: Major
>
> When the ZooKeeper leader shuts down, Spark currently revokes leadership by 
> letting the master exit. However, this sacrifices a master node. Following 
> the approach of Hadoop and Storm, we should let the originally active master 
> become a standby, re-elect a new Spark master, or find another way to revoke 
> leadership gracefully.






[jira] [Commented] (SPARK-23530) It's not appropriate to let the original master exit while the leader of zookeeper shutdown

2018-02-27 Thread liuxianjiao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379723#comment-16379723
 ] 

liuxianjiao commented on SPARK-23530:
-

When the ZooKeeper leader shuts down, Spark currently revokes leadership by 
letting the master exit. However, this sacrifices a master node. Following the 
approach of Hadoop and Storm, we should let the originally active master become 
a standby, re-elect a new Spark master, or find another way to revoke 
leadership gracefully.

> It's not appropriate to let the original master exit while the leader of 
> zookeeper shutdown
> ---
>
> Key: SPARK-23530
> URL: https://issues.apache.org/jira/browse/SPARK-23530
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: liuxianjiao
>Priority: Major
>







[jira] [Created] (SPARK-23530) It's not appropriate to let the original master exit while the leader of zookeeper shutdown

2018-02-27 Thread liuxianjiao (JIRA)
liuxianjiao created SPARK-23530:
---

 Summary: It's not appropriate to let the original master exit 
while the leader of zookeeper shutdown
 Key: SPARK-23530
 URL: https://issues.apache.org/jira/browse/SPARK-23530
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.1
Reporter: liuxianjiao









[jira] [Assigned] (SPARK-23096) Migrate rate source to v2

2018-02-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23096:


Assignee: Apache Spark

> Migrate rate source to v2
> -
>
> Key: SPARK-23096
> URL: https://issues.apache.org/jira/browse/SPARK-23096
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-23096) Migrate rate source to v2

2018-02-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379665#comment-16379665
 ] 

Apache Spark commented on SPARK-23096:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/20688

> Migrate rate source to v2
> -
>
> Key: SPARK-23096
> URL: https://issues.apache.org/jira/browse/SPARK-23096
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>







[jira] [Assigned] (SPARK-23096) Migrate rate source to v2

2018-02-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23096:


Assignee: (was: Apache Spark)

> Migrate rate source to v2
> -
>
> Key: SPARK-23096
> URL: https://issues.apache.org/jira/browse/SPARK-23096
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>







[jira] [Resolved] (SPARK-23448) Dataframe returns wrong result when column don't respect datatype

2018-02-27 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23448.
--
   Resolution: Fixed
Fix Version/s: 2.3.1

Issue resolved by pull request 20666
[https://github.com/apache/spark/pull/20666]

> Dataframe returns wrong result when column don't respect datatype
> -
>
> Key: SPARK-23448
> URL: https://issues.apache.org/jira/browse/SPARK-23448
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: Local
>Reporter: Ahmed ZAROUI
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.3.1
>
>
> I have the following JSON file that contains some noisy data (String instead 
> of Array):
>  
> {code:java}
> {"attr1":"val1","attr2":"[\"val2\"]"}
> {"attr1":"val1","attr2":["val2"]}
> {code}
> And I need to specify the schema programmatically like this:
>  
> {code:java}
> implicit val spark = SparkSession
>   .builder()
>   .master("local[*]")
>   .config("spark.ui.enabled", false)
>   .config("spark.sql.caseSensitive", "True")
>   .getOrCreate()
> import spark.implicits._
> val schema = StructType(
>   Seq(StructField("attr1", StringType, true),
>   StructField("attr2", ArrayType(StringType, true), true)))
> spark.read.schema(schema).json(input).collect().foreach(println)
> {code}
> The result given by this code is:
> {code:java}
> [null,null]
> [val1,WrappedArray(val2)]
> {code}
> Instead of putting null only in the corrupted column, all columns of the 
> first message are null.
>  
>  
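
A commonly used workaround, sketched below, is to declare a corrupt-record 
column so that rows which do not match the schema keep their raw text instead 
of being lost silently. The input path is a placeholder.

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val schema = StructType(Seq(
  StructField("attr1", StringType, true),
  StructField("attr2", ArrayType(StringType, true), true),
  StructField("_corrupt_record", StringType, true)))

spark.read
  .schema(schema)
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json("input.json")
  .show(false)
{code}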






[jira] [Assigned] (SPARK-23448) Dataframe returns wrong result when column don't respect datatype

2018-02-27 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-23448:


Assignee: Liang-Chi Hsieh

> Dataframe returns wrong result when column don't respect datatype
> -
>
> Key: SPARK-23448
> URL: https://issues.apache.org/jira/browse/SPARK-23448
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: Local
>Reporter: Ahmed ZAROUI
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.3.1
>
>
> I have the following JSON file that contains some noisy data (String instead 
> of Array):
>  
> {code:java}
> {"attr1":"val1","attr2":"[\"val2\"]"}
> {"attr1":"val1","attr2":["val2"]}
> {code}
> And I need to specify the schema programmatically like this:
>  
> {code:java}
> implicit val spark = SparkSession
>   .builder()
>   .master("local[*]")
>   .config("spark.ui.enabled", false)
>   .config("spark.sql.caseSensitive", "True")
>   .getOrCreate()
> import spark.implicits._
> val schema = StructType(
>   Seq(StructField("attr1", StringType, true),
>   StructField("attr2", ArrayType(StringType, true), true)))
> spark.read.schema(schema).json(input).collect().foreach(println)
> {code}
> The result given by this code is:
> {code:java}
> [null,null]
> [val1,WrappedArray(val2)]
> {code}
> Instead of putting null only in the corrupted column, all columns of the 
> first message are null.
>  
>  






[jira] [Resolved] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2

2018-02-27 Thread Jepson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jepson resolved SPARK-22968.

Resolution: Fixed

> java.lang.IllegalStateException: No current assignment for partition kssh-2
> ---
>
> Key: SPARK-22968
> URL: https://issues.apache.org/jira/browse/SPARK-22968
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
> Environment: Kafka:  0.10.0  (CDH5.12.0)
> Apache Spark 2.1.1
> Spark streaming+Kafka
>Reporter: Jepson
>Priority: Major
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> *Kafka Broker:*
> {code:java}
>message.max.bytes : 2621440  
> {code}
> *Spark Streaming+Kafka Code:*
> {code:java}
> , "max.partition.fetch.bytes" -> (5242880: java.lang.Integer) //default: 
> 1048576
> , "request.timeout.ms" -> (9: java.lang.Integer) //default: 6
> , "session.timeout.ms" -> (6: java.lang.Integer) //default: 3
> , "heartbeat.interval.ms" -> (5000: java.lang.Integer)
> , "receive.buffer.bytes" -> (10485760: java.lang.Integer)
> {code}
> *Error message:*
> {code:java}
> 8/01/05 09:48:27 INFO internals.ConsumerCoordinator: Revoking previously 
> assigned partitions [kssh-7, kssh-4, kssh-3, kssh-6, kssh-5, kssh-0, kssh-2, 
> kssh-1] for group use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 INFO internals.AbstractCoordinator: (Re-)joining group 
> use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 INFO internals.AbstractCoordinator: Successfully joined 
> group use_a_separate_group_id_for_each_stream with generation 4
> 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Setting newly assigned 
> partitions [kssh-7, kssh-4, kssh-6, kssh-5] for group 
> use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 ERROR scheduler.JobScheduler: Error generating jobs for 
> time 1515116907000 ms
> java.lang.IllegalStateException: No current assignment for partition kssh-2
>   at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231)
>   at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>   at 
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
>   at scala.util.Try$.apply(Try.scala:192)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator.g

[jira] [Commented] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2

2018-02-27 Thread Jepson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379639#comment-16379639
 ] 

Jepson commented on SPARK-22968:


It's been a long time, so I'll close it for now.

> java.lang.IllegalStateException: No current assignment for partition kssh-2
> ---
>
> Key: SPARK-22968
> URL: https://issues.apache.org/jira/browse/SPARK-22968
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
> Environment: Kafka:  0.10.0  (CDH5.12.0)
> Apache Spark 2.1.1
> Spark streaming+Kafka
>Reporter: Jepson
>Priority: Major
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> *Kafka Broker:*
> {code:java}
>message.max.bytes : 2621440  
> {code}
> *Spark Streaming+Kafka Code:*
> {code:java}
> , "max.partition.fetch.bytes" -> (5242880: java.lang.Integer) //default: 
> 1048576
> , "request.timeout.ms" -> (9: java.lang.Integer) //default: 6
> , "session.timeout.ms" -> (6: java.lang.Integer) //default: 3
> , "heartbeat.interval.ms" -> (5000: java.lang.Integer)
> , "receive.buffer.bytes" -> (10485760: java.lang.Integer)
> {code}
> *Error message:*
> {code:java}
> 8/01/05 09:48:27 INFO internals.ConsumerCoordinator: Revoking previously 
> assigned partitions [kssh-7, kssh-4, kssh-3, kssh-6, kssh-5, kssh-0, kssh-2, 
> kssh-1] for group use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 INFO internals.AbstractCoordinator: (Re-)joining group 
> use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 INFO internals.AbstractCoordinator: Successfully joined 
> group use_a_separate_group_id_for_each_stream with generation 4
> 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Setting newly assigned 
> partitions [kssh-7, kssh-4, kssh-6, kssh-5] for group 
> use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 ERROR scheduler.JobScheduler: Error generating jobs for 
> time 1515116907000 ms
> java.lang.IllegalStateException: No current assignment for partition kssh-2
>   at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231)
>   at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>   at 
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
>   at scala.util.Try$.apply(Try.sc

[jira] [Comment Edited] (SPARK-13210) NPE in Sort

2018-02-27 Thread Lulu Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379614#comment-16379614
 ] 

Lulu Cheng edited comment on SPARK-13210 at 2/28/18 1:23 AM:
-

[~Aspecting] Were you able to get around that? I have the same stack trace in 
2.1.1.


was (Author: l1990790120):
[~Aspecting] Were you able to get around that? I'm looking at a similar stack trace.

> NPE in Sort
> ---
>
> Key: SPARK-13210
> URL: https://issues.apache.org/jira/browse/SPARK-13210
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.6.1, 2.0.0
>
>
> When run TPCDS query Q78 with scale 10:
> {code}
> 16/02/04 22:39:09 ERROR Executor: Managed memory leak detected; size = 
> 268435456 bytes, TID = 143
> 16/02/04 22:39:09 ERROR Executor: Exception in task 0.0 in stage 47.0 (TID 
> 143)
> java.lang.NullPointerException
>   at 
> org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:333)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:60)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:39)
>   at 
> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:239)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:415)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:116)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:87)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:60)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45)
>   at org.apache.spark.scheduler.Task.run(Task.scala:81)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Commented] (SPARK-13210) NPE in Sort

2018-02-27 Thread Lulu Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379614#comment-16379614
 ] 

Lulu Cheng commented on SPARK-13210:


[~Aspecting] Were you able to get around that? I'm looking at a similar stack trace.

> NPE in Sort
> ---
>
> Key: SPARK-13210
> URL: https://issues.apache.org/jira/browse/SPARK-13210
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.6.1, 2.0.0
>
>
> When run TPCDS query Q78 with scale 10:
> {code}
> 16/02/04 22:39:09 ERROR Executor: Managed memory leak detected; size = 
> 268435456 bytes, TID = 143
> 16/02/04 22:39:09 ERROR Executor: Exception in task 0.0 in stage 47.0 (TID 
> 143)
> java.lang.NullPointerException
>   at 
> org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:333)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:60)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:39)
>   at 
> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:239)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:415)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:116)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:87)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:60)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45)
>   at org.apache.spark.scheduler.Task.run(Task.scala:81)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Updated] (SPARK-23529) Specify hostpath volume and mount the volume in Spark driver and executor pods in Kubernetes

2018-02-27 Thread Suman Somasundar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suman Somasundar updated SPARK-23529:
-
Issue Type: Improvement  (was: Sub-task)
Parent: (was: SPARK-18278)

> Specify hostpath volume and mount the volume in Spark driver and executor 
> pods in Kubernetes
> 
>
> Key: SPARK-23529
> URL: https://issues.apache.org/jira/browse/SPARK-23529
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Suman Somasundar
>Assignee: Anirudh Ramanathan
>Priority: Minor
> Fix For: 2.3.0
>
>







[jira] [Assigned] (SPARK-23417) pyspark tests give wrong sbt instructions

2018-02-27 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-23417:


Assignee: Bruce Robbins

> pyspark tests give wrong sbt instructions
> -
>
> Key: SPARK-23417
> URL: https://issues.apache.org/jira/browse/SPARK-23417
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Bruce Robbins
>Priority: Minor
> Fix For: 2.4.0
>
>
> When running python/run-tests, the script indicates that I must run 
> "'build/sbt assembly/package streaming-kafka-0-8-assembly/assembly' or 
> 'build/mvn -Pkafka-0-8 package'". The sbt command fails:
>  
> [error] Expected ID character
> [error] Not a valid command: streaming-kafka-0-8-assembly
> [error] Expected project ID
> [error] Expected configuration
> [error] Expected ':' (if selecting a configuration)
> [error] Expected key
> [error] Not a valid key: streaming-kafka-0-8-assembly
> [error] streaming-kafka-0-8-assembly/assembly
> [error] 






[jira] [Resolved] (SPARK-23417) pyspark tests give wrong sbt instructions

2018-02-27 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23417.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20638
[https://github.com/apache/spark/pull/20638]

> pyspark tests give wrong sbt instructions
> -
>
> Key: SPARK-23417
> URL: https://issues.apache.org/jira/browse/SPARK-23417
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Bruce Robbins
>Priority: Minor
> Fix For: 2.4.0
>
>
> When running python/run-tests, the script indicates that I must run 
> "'build/sbt assembly/package streaming-kafka-0-8-assembly/assembly' or 
> 'build/mvn -Pkafka-0-8 package'". The sbt command fails:
>  
> [error] Expected ID character
> [error] Not a valid command: streaming-kafka-0-8-assembly
> [error] Expected project ID
> [error] Expected configuration
> [error] Expected ':' (if selecting a configuration)
> [error] Expected key
> [error] Not a valid key: streaming-kafka-0-8-assembly
> [error] streaming-kafka-0-8-assembly/assembly
> [error] 






[jira] [Commented] (SPARK-23471) RandomForestClassificationModel save() - incorrect metadata

2018-02-27 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379516#comment-16379516
 ] 

Liang-Chi Hsieh commented on SPARK-23471:
-

I can't reproduce this. With `fit`, the params are copied and saved correctly.
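
A minimal sketch of that point, on toy data and with an assumed output path: 
calling fit() rather than the internal train() copies the configured params 
onto the fitted model, so save() records numTrees=100 and maxDepth=30 in the 
metadata.

{code:java}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Tiny toy dataset just to keep the example self-contained.
val data = Seq((0.0, Vectors.dense(0.0, 1.0)), (1.0, Vectors.dense(1.0, 0.0)))
  .toDF("result", "features")

val rf = new RandomForestClassifier()
  .setFeaturesCol("features")
  .setLabelCol("result")
  .setNumTrees(100)
  .setMaxDepth(30)

// fit() copies the estimator params into the model it produces.
val rfmodel = rf.fit(data)
rfmodel.write.overwrite().save("/tmp/rf-model")   // placeholder path
{code}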

> RandomForestClassificationModel save() - incorrect metadata
> ---
>
> Key: SPARK-23471
> URL: https://issues.apache.org/jira/browse/SPARK-23471
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Keepun
>Priority: Major
>
> RandomForestClassificationModel.load() does not work after save():
> {code:java}
> RandomForestClassifier rf = new RandomForestClassifier()
> .setFeaturesCol("features")
> .setLabelCol("result")
> .setNumTrees(100)
> .setMaxDepth(30)
> .setMinInstancesPerNode(1)
> //.setCacheNodeIds(true)
> .setMaxMemoryInMB(500)
> .setSeed(System.currentTimeMillis() + System.nanoTime());
> RandomForestClassificationModel rfmodel = rf.train(data);
>try {
>   rfmodel.save(args[2] + "." + System.currentTimeMillis());
>} catch (IOException e) {
>   LOG.error(e.getMessage(), e);
>   e.printStackTrace();
>}
> {code}
> File metadata\part-0: 
> {code:java}
> {"class":"org.apache.spark.ml.classification.RandomForestClassificationModel",
> "timestamp":1519136783983,"sparkVersion":"2.2.1","uid":"rfc_7c7e84ce7488",
> "paramMap":{"featureSubsetStrategy":"auto","cacheNodeIds":false,"impurity":"gini",
> "checkpointInterval":10,
> "numTrees":20,"maxDepth":5,
> "probabilityCol":"probability","labelCol":"label","featuresCol":"features",
> "maxMemoryInMB":256,"minInstancesPerNode":1,"subsamplingRate":1.0,
> "rawPredictionCol":"rawPrediction","predictionCol":"prediction","maxBins":32,
> "minInfoGain":0.0,"seed":-491520797},"numFeatures":1354,"numClasses":2,
> "numTrees":20}
> {code}
> should be:
> {code:java}
> "numTrees":100,"maxDepth":30,{code}
>  






[jira] [Created] (SPARK-23529) Specify hostpath volume and mount the volume in Spark driver and executor pods in Kubernetes

2018-02-27 Thread Suman Somasundar (JIRA)
Suman Somasundar created SPARK-23529:


 Summary: Specify hostpath volume and mount the volume in Spark 
driver and executor pods in Kubernetes
 Key: SPARK-23529
 URL: https://issues.apache.org/jira/browse/SPARK-23529
 Project: Spark
  Issue Type: Sub-task
  Components: Kubernetes
Affects Versions: 2.3.0
Reporter: Suman Somasundar
Assignee: Anirudh Ramanathan
 Fix For: 2.3.0









[jira] [Updated] (SPARK-23529) Specify hostpath volume and mount the volume in Spark driver and executor pods in Kubernetes

2018-02-27 Thread Suman Somasundar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suman Somasundar updated SPARK-23529:
-
Priority: Minor  (was: Blocker)

> Specify hostpath volume and mount the volume in Spark driver and executor 
> pods in Kubernetes
> 
>
> Key: SPARK-23529
> URL: https://issues.apache.org/jira/browse/SPARK-23529
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Suman Somasundar
>Assignee: Anirudh Ramanathan
>Priority: Minor
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23528) Expose vital statistics of GaussianMixtureModel

2018-02-27 Thread Erich Schubert (JIRA)
Erich Schubert created SPARK-23528:
--

 Summary: Expose vital statistics of GaussianMixtureModel
 Key: SPARK-23528
 URL: https://issues.apache.org/jira/browse/SPARK-23528
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.1
Reporter: Erich Schubert


Spark ML should expose vital statistics of the GMM model:
 * *Number of iterations* (actual, not max) until the tolerance threshold was 
hit: we can set a maximum, but how do we know the limit was large enough, and 
how many iterations it really took?
 * Final *log likelihood* of the model: if we run multiple times with different 
starting conditions, how do we know which run converged to the best fit? (See the 
sketch below.)
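
A minimal sketch of what this could look like from the user side. The 
{{summary}} accessor names in the comments are placeholders illustrating the 
request, not a claim about the existing API, and {{dataset}} is assumed to be a 
DataFrame with a "features" vector column:

{code:scala}
import org.apache.spark.ml.clustering.GaussianMixture

// Existing API: configure and fit the model.
val gmm = new GaussianMixture()
  .setK(5)
  .setMaxIter(200)
  .setTol(1e-4)
  .setSeed(42L)
val model = gmm.fit(dataset)

// Requested accessors (placeholder names, not existing API):
// model.summary.numIterations  -- iterations actually run before tol or maxIter was hit
// model.summary.logLikelihood  -- final log likelihood, to pick the best of several seeded runs
{code}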



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23333) SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame

2018-02-27 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379296#comment-16379296
 ] 

Bago Amirbekian edited comment on SPARK-23333 at 2/27/18 9:24 PM:
--

[~MBALearnsToCode] you can use a `VectorSizeHint` transformer to include 
`numAttributes` in the dataframe column metadata and avoid the call to `first`. 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSizeHint.scala#L42


was (Author: bago.amirbekian):
[~MBALearnsToCode] you can use a `VectorSizeHint` transformer to include 
`numAttributes` in the dataframe column metadata and avoid the call to `first`. 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSizeHint.scala

> SparkML VectorAssembler.transform slow when needing to invoke .first() on 
> sorted DataFrame
> --
>
> Key: SPARK-23333
> URL: https://issues.apache.org/jira/browse/SPARK-23333
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, SQL
>Affects Versions: 2.2.1
>Reporter: V Luong
>Priority: Minor
>
> Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes 
> oldDF.first() in order to establish some metadata/attributes: 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88.]
>  When oldDF is sorted, the above triggering of oldDF.first() can be very slow.
> For the purpose of establishing metadata, taking an arbitrary row from oldDF 
> will be just as good as taking oldDF.first(). Is there hence a way we can 
> speed up a great deal by somehow grabbing a random row, instead of relying on 
> oldDF.first()?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23333) SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame

2018-02-27 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379296#comment-16379296
 ] 

Bago Amirbekian commented on SPARK-23333:
-

[~MBALearnsToCode] you can use a `VectorSizeHint` transformer to include 
`numAttributes` in the dataframe column metadata and avoid the call to `first`. 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSizeHint.scala
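
For reference, a minimal sketch of that suggestion. It assumes a DataFrame 
{{df}} whose "features" {{Vector}} column has a known size of 3; the column 
names and the size are placeholders, not values from this ticket:

{code:scala}
import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}

// Attach the vector size to the column metadata up front, so downstream
// stages do not need to call first() on the (possibly sorted) DataFrame.
val sizeHint = new VectorSizeHint()
  .setInputCol("features")
  .setSize(3)                 // known number of attributes
  .setHandleInvalid("error")  // fail fast on rows whose vectors do not match

val withMetadata = sizeHint.transform(df)

// VectorAssembler can now read the size from the metadata instead of scanning a row.
val assembled = new VectorAssembler()
  .setInputCols(Array("features", "otherCol"))
  .setOutputCol("assembled")
  .transform(withMetadata)
{code}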

> SparkML VectorAssembler.transform slow when needing to invoke .first() on 
> sorted DataFrame
> --
>
> Key: SPARK-23333
> URL: https://issues.apache.org/jira/browse/SPARK-23333
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, SQL
>Affects Versions: 2.2.1
>Reporter: V Luong
>Priority: Minor
>
> Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes 
> oldDF.first() in order to establish some metadata/attributes: 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88.]
>  When oldDF is sorted, the above triggering of oldDF.first() can be very slow.
> For the purpose of establishing metadata, taking an arbitrary row from oldDF 
> will be just as good as taking oldDF.first(). Is there hence a way we can 
> speed up a great deal by somehow grabbing a random row, instead of relying on 
> oldDF.first()?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19947) RFormulaModel always throws Exception on transforming data with NULL or Unseen labels

2018-02-27 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379282#comment-16379282
 ] 

Bago Amirbekian commented on SPARK-19947:
-

I think this was resolved by [https://github.com/apache/spark/pull/18496] and 
[https://github.com/apache/spark/pull/18613].
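
A minimal sketch of the workaround this enables, assuming {{RFormula}} exposes 
a {{handleInvalid}} param as those PRs suggest (the param name, its values, and 
the {{trainingDF}}/{{scoringDF}} variables are assumptions, not taken from this 
ticket):

{code:scala}
import org.apache.spark.ml.feature.RFormula

// Assumption: RFormula forwards handleInvalid to its internal StringIndexer,
// so invalid labels no longer force a df.na.drop() before transform.
val formula = new RFormula()
  .setFormula("label ~ .")
  .setHandleInvalid("keep")   // assumed values: "error" (default), "skip", "keep"
  .setFeaturesCol("features")
  .setLabelCol("label")

val model = formula.fit(trainingDF)
val scored = model.transform(scoringDF)   // rows with invalid labels are handled instead of failing
{code}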

> RFormulaModel always throws Exception on transforming data with NULL or 
> Unseen labels
> -
>
> Key: SPARK-19947
> URL: https://issues.apache.org/jira/browse/SPARK-19947
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Andrey Yatsuk
>Priority: Major
>
> I have a trained ML model and a big data table in parquet. I want to add a new 
> column with predicted values to this table. I can't lose any data, but the 
> column can have null values in it.
> The RFormulaModel.fit() method creates a new StringIndexer with the default 
> (handleInvalid="error") parameter. VectorAssembler also throws an exception on 
> NULL values. So I must call df.na.drop() before transforming this DataFrame, 
> and I don't want to do this.
> RFormula needs a new parameter like handleInvalid in StringIndexer.
> Or add a transform(Seq): Vector method which the user can use as a UDF in 
> df.withColumn("predicted", functions.callUDF(rFormulaModel::transform, 
> Seq))



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23527) Error with spark-submit and kerberos with TLS-enabled Hadoop cluster

2018-02-27 Thread Ron Gonzalez (JIRA)
Ron Gonzalez created SPARK-23527:


 Summary: Error with spark-submit and kerberos with TLS-enabled 
Hadoop cluster
 Key: SPARK-23527
 URL: https://issues.apache.org/jira/browse/SPARK-23527
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.2.1
 Environment: core-site.xml

<property>
    <name>hadoop.security.key.provider.path</name>
    <value>kms://ht...@host1.domain.com;host2.domain.com:16000/kms</value>
</property>

hdfs-site.xml

<property>
    <name>dfs.encryption.key.provider.uri</name>
    <value>kms://ht...@host1.domain.com;host2.domain.com:16000/kms</value>
</property>

Reporter: Ron Gonzalez


With the current configuration of our enterprise cluster, I submit using 
spark-submit:

./spark-submit --master yarn --deploy-mode cluster --class 
org.apache.spark.examples.SparkPi --conf 
spark.yarn.jars=hdfs:/user/user1/spark/lib/*.jar 
../examples/jars/spark-examples_2.11-2.2.1.jar 10

I am getting the following problem:

 

18/02/27 21:03:48 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 
3351181 for svchdc236d on ha-hdfs:nameservice1

Exception in thread "main" java.lang.IllegalArgumentException: 
java.net.UnknownHostException: host1.domain.com;host2.domain.com

 at 
org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)

 at 
org.apache.hadoop.crypto.key.kms.KMSClientProvider.getDelegationTokenService(KMSClientProvider.java:825)

 at 
org.apache.hadoop.crypto.key.kms.KMSClientProvider.addDelegationTokens(KMSClientProvider.java:781)

 at 
org.apache.hadoop.crypto.key.KeyProviderDelegationTokenExtension.addDelegationTokens(KeyProviderDelegationTokenExtension.java:86)

 at 
org.apache.hadoop.hdfs.DistributedFileSystem.addDelegationTokens(DistributedFileSystem.java:2046)

 at 
org.apache.spark.deploy.yarn.security.HadoopFSCredentialProvider$$anonfun$obtainCredentials$1.apply(HadoopFSCredentialProvider.scala:52)

 

If I remove the second host from these properties, so that 
kms://ht...@host1.domain.com;host2.domain.com:16000/kms becomes:

kms://ht...@host1.domain.com:16000/kms

it fails with a different error:

java.io.IOException: javax.net.ssl.SSLHandshakeException: 
sun.security.validator.ValidatorException: PKIX path building failed: 
sun.security.provider.certpath.SunCertPathBuilderException: unable to find 
valid certification path to requested target

If I do the same thing using Spark 1.6, it works, so it seems like a 
regression...

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22915) ML test for StructuredStreaming: spark.ml.feature, N-Z

2018-02-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22915:


Assignee: (was: Apache Spark)

> ML test for StructuredStreaming: spark.ml.feature, N-Z
> --
>
> Key: SPARK-22915
> URL: https://issues.apache.org/jira/browse/SPARK-22915
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> *For featurizers with names from N - Z*
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22915) ML test for StructuredStreaming: spark.ml.feature, N-Z

2018-02-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22915:


Assignee: Apache Spark

> ML test for StructuredStreaming: spark.ml.feature, N-Z
> --
>
> Key: SPARK-22915
> URL: https://issues.apache.org/jira/browse/SPARK-22915
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Major
>
> *For featurizers with names from N - Z*
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22915) ML test for StructuredStreaming: spark.ml.feature, N-Z

2018-02-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379214#comment-16379214
 ] 

Apache Spark commented on SPARK-22915:
--

User 'attilapiros' has created a pull request for this issue:
https://github.com/apache/spark/pull/20686

> ML test for StructuredStreaming: spark.ml.feature, N-Z
> --
>
> Key: SPARK-22915
> URL: https://issues.apache.org/jira/browse/SPARK-22915
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> *For featurizers with names from N - Z*
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23471) RandomForestClassificationModel save() - incorrect metadata

2018-02-27 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379202#comment-16379202
 ] 

Bago Amirbekian edited comment on SPARK-23471 at 2/27/18 8:07 PM:
--

[~Keepun], `train` is a protected API, it's called by `Predictor.fit` which 
also copies the values of Params to the newly created Model instance, 
[here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118].
 Do you get this same issue if you use `RandomForestClassifier.fit`?


was (Author: bago.amirbekian):
[~Keepun] 
{noformat} train {noformat} is a protected API, it's called by {Predictor.fit} 
which also copies the values of Params to the newly created Model instance, 
[here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118].
 Do you get this same issue if you use {RandomForestClassifier.fit}?

> RandomForestClassificationModel save() - incorrect metadata
> ---
>
> Key: SPARK-23471
> URL: https://issues.apache.org/jira/browse/SPARK-23471
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Keepun
>Priority: Major
>
> RandomForestClassificationModel.load() does not work after save(): 
> {code:java}
> RandomForestClassifier rf = new RandomForestClassifier()
> .setFeaturesCol("features")
> .setLabelCol("result")
> .setNumTrees(100)
> .setMaxDepth(30)
> .setMinInstancesPerNode(1)
> //.setCacheNodeIds(true)
> .setMaxMemoryInMB(500)
> .setSeed(System.currentTimeMillis() + System.nanoTime());
> RandomForestClassificationModel rfmodel = rf.train(data);
>try {
>   rfmodel.save(args[2] + "." + System.currentTimeMillis());
>} catch (IOException e) {
>   LOG.error(e.getMessage(), e);
>   e.printStackTrace();
>}
> {code}
> File metadata\part-0: 
> {code:java}
> {"class":"org.apache.spark.ml.classification.RandomForestClassificationModel",
> "timestamp":1519136783983,"sparkVersion":"2.2.1","uid":"rfc_7c7e84ce7488",
> "paramMap":{"featureSubsetStrategy":"auto","cacheNodeIds":false,"impurity":"gini",
> "checkpointInterval":10,
> "numTrees":20,"maxDepth":5,
> "probabilityCol":"probability","labelCol":"label","featuresCol":"features",
> "maxMemoryInMB":256,"minInstancesPerNode":1,"subsamplingRate":1.0,
> "rawPredictionCol":"rawPrediction","predictionCol":"prediction","maxBins":32,
> "minInfoGain":0.0,"seed":-491520797},"numFeatures":1354,"numClasses":2,
> "numTrees":20}
> {code}
> should be:
> {code:java}
> "numTrees":100,"maxDepth":30,{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23471) RandomForestClassificationModel save() - incorrect metadata

2018-02-27 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379202#comment-16379202
 ] 

Bago Amirbekian edited comment on SPARK-23471 at 2/27/18 8:06 PM:
--

[~Keepun] 
{code:java}
train
{code}
 is a protected API, it's called by {Predictor.fit} which also copies the 
values of Params to the newly created Model instance, 
[here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118].
 Do you get this same issue if you use {RandomForestClassifier.fit}?


was (Author: bago.amirbekian):
[~Keepun] {train} is a protected API, it's called by {Predictor.fit} which also 
copies the values of Params to the newly created Model instance, 
[here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118].
 Do you get this same issue if you use {RandomForestClassifier.fit}?

> RandomForestClassificationModel save() - incorrect metadata
> ---
>
> Key: SPARK-23471
> URL: https://issues.apache.org/jira/browse/SPARK-23471
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Keepun
>Priority: Major
>
> RandomForestClassificationModel.load() does not work after save(): 
> {code:java}
> RandomForestClassifier rf = new RandomForestClassifier()
> .setFeaturesCol("features")
> .setLabelCol("result")
> .setNumTrees(100)
> .setMaxDepth(30)
> .setMinInstancesPerNode(1)
> //.setCacheNodeIds(true)
> .setMaxMemoryInMB(500)
> .setSeed(System.currentTimeMillis() + System.nanoTime());
> RandomForestClassificationModel rfmodel = rf.train(data);
>try {
>   rfmodel.save(args[2] + "." + System.currentTimeMillis());
>} catch (IOException e) {
>   LOG.error(e.getMessage(), e);
>   e.printStackTrace();
>}
> {code}
> File metadata\part-0: 
> {code:java}
> {"class":"org.apache.spark.ml.classification.RandomForestClassificationModel",
> "timestamp":1519136783983,"sparkVersion":"2.2.1","uid":"rfc_7c7e84ce7488",
> "paramMap":{"featureSubsetStrategy":"auto","cacheNodeIds":false,"impurity":"gini",
> "checkpointInterval":10,
> "numTrees":20,"maxDepth":5,
> "probabilityCol":"probability","labelCol":"label","featuresCol":"features",
> "maxMemoryInMB":256,"minInstancesPerNode":1,"subsamplingRate":1.0,
> "rawPredictionCol":"rawPrediction","predictionCol":"prediction","maxBins":32,
> "minInfoGain":0.0,"seed":-491520797},"numFeatures":1354,"numClasses":2,
> "numTrees":20}
> {code}
> should be:
> {code:java}
> "numTrees":100,"maxDepth":30,{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23471) RandomForestClassificationModel save() - incorrect metadata

2018-02-27 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379202#comment-16379202
 ] 

Bago Amirbekian edited comment on SPARK-23471 at 2/27/18 8:06 PM:
--

[~Keepun] 
{noformat} train {noformat} is a protected API, it's called by {Predictor.fit} 
which also copies the values of Params to the newly created Model instance, 
[here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118].
 Do you get this same issue if you use {RandomForestClassifier.fit}?


was (Author: bago.amirbekian):
[~Keepun] 
{code:java}
train
{code}
 is a protected API, it's called by {Predictor.fit} which also copies the 
values of Params to the newly created Model instance, 
[here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118].
 Do you get this same issue if you use {RandomForestClassifier.fit}?

> RandomForestClassificationModel save() - incorrect metadata
> ---
>
> Key: SPARK-23471
> URL: https://issues.apache.org/jira/browse/SPARK-23471
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Keepun
>Priority: Major
>
> RandomForestClassificationModel.load() does not work after save(): 
> {code:java}
> RandomForestClassifier rf = new RandomForestClassifier()
> .setFeaturesCol("features")
> .setLabelCol("result")
> .setNumTrees(100)
> .setMaxDepth(30)
> .setMinInstancesPerNode(1)
> //.setCacheNodeIds(true)
> .setMaxMemoryInMB(500)
> .setSeed(System.currentTimeMillis() + System.nanoTime());
> RandomForestClassificationModel rfmodel = rf.train(data);
>try {
>   rfmodel.save(args[2] + "." + System.currentTimeMillis());
>} catch (IOException e) {
>   LOG.error(e.getMessage(), e);
>   e.printStackTrace();
>}
> {code}
> File metadata\part-0: 
> {code:java}
> {"class":"org.apache.spark.ml.classification.RandomForestClassificationModel",
> "timestamp":1519136783983,"sparkVersion":"2.2.1","uid":"rfc_7c7e84ce7488",
> "paramMap":{"featureSubsetStrategy":"auto","cacheNodeIds":false,"impurity":"gini",
> "checkpointInterval":10,
> "numTrees":20,"maxDepth":5,
> "probabilityCol":"probability","labelCol":"label","featuresCol":"features",
> "maxMemoryInMB":256,"minInstancesPerNode":1,"subsamplingRate":1.0,
> "rawPredictionCol":"rawPrediction","predictionCol":"prediction","maxBins":32,
> "minInfoGain":0.0,"seed":-491520797},"numFeatures":1354,"numClasses":2,
> "numTrees":20}
> {code}
> should be:
> {code:java}
> "numTrees":100,"maxDepth":30,{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23471) RandomForestClassificationModel save() - incorrect metadata

2018-02-27 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379202#comment-16379202
 ] 

Bago Amirbekian edited comment on SPARK-23471 at 2/27/18 8:04 PM:
--

[~Keepun] {train} is a protected API, it's called by {Predictor.fit} which also 
copies the values of Params to the newly created Model instance, 
[here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118].
 Do you get this same issue if you use {RandomForestClassifier.fit|?


was (Author: bago.amirbekian):
[~Keepun] `train` is a protected API, it's called by `Predictor.fit` which also 
copies the values of Params to the newly created Model instance, 
[here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118].
 Do you get this same issue if you use `RandomForestClassifier.fit`?

> RandomForestClassificationModel save() - incorrect metadata
> ---
>
> Key: SPARK-23471
> URL: https://issues.apache.org/jira/browse/SPARK-23471
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Keepun
>Priority: Major
>
> RandomForestClassificationModel.load() does not work after save(): 
> {code:java}
> RandomForestClassifier rf = new RandomForestClassifier()
> .setFeaturesCol("features")
> .setLabelCol("result")
> .setNumTrees(100)
> .setMaxDepth(30)
> .setMinInstancesPerNode(1)
> //.setCacheNodeIds(true)
> .setMaxMemoryInMB(500)
> .setSeed(System.currentTimeMillis() + System.nanoTime());
> RandomForestClassificationModel rfmodel = rf.train(data);
>try {
>   rfmodel.save(args[2] + "." + System.currentTimeMillis());
>} catch (IOException e) {
>   LOG.error(e.getMessage(), e);
>   e.printStackTrace();
>}
> {code}
> File metadata\part-0: 
> {code:java}
> {"class":"org.apache.spark.ml.classification.RandomForestClassificationModel",
> "timestamp":1519136783983,"sparkVersion":"2.2.1","uid":"rfc_7c7e84ce7488",
> "paramMap":{"featureSubsetStrategy":"auto","cacheNodeIds":false,"impurity":"gini",
> "checkpointInterval":10,
> "numTrees":20,"maxDepth":5,
> "probabilityCol":"probability","labelCol":"label","featuresCol":"features",
> "maxMemoryInMB":256,"minInstancesPerNode":1,"subsamplingRate":1.0,
> "rawPredictionCol":"rawPrediction","predictionCol":"prediction","maxBins":32,
> "minInfoGain":0.0,"seed":-491520797},"numFeatures":1354,"numClasses":2,
> "numTrees":20}
> {code}
> should be:
> {code:java}
> "numTrees":100,"maxDepth":30,{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23471) RandomForestClassificationModel save() - incorrect metadata

2018-02-27 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379202#comment-16379202
 ] 

Bago Amirbekian edited comment on SPARK-23471 at 2/27/18 8:04 PM:
--

[~Keepun] {train} is a protected API, it's called by {Predictor.fit} which also 
copies the values of Params to the newly created Model instance, 
[here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118].
 Do you get this same issue if you use {RandomForestClassifier.fit}?


was (Author: bago.amirbekian):
[~Keepun] {train} is a protected API, it's called by {Predictor.fit} which also 
copies the values of Params to the newly created Model instance, 
[here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118].
 Do you get this same issue if you use {RandomForestClassifier.fit|?

> RandomForestClassificationModel save() - incorrect metadata
> ---
>
> Key: SPARK-23471
> URL: https://issues.apache.org/jira/browse/SPARK-23471
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Keepun
>Priority: Major
>
> RandomForestClassificationModel.load() does not work after save(): 
> {code:java}
> RandomForestClassifier rf = new RandomForestClassifier()
> .setFeaturesCol("features")
> .setLabelCol("result")
> .setNumTrees(100)
> .setMaxDepth(30)
> .setMinInstancesPerNode(1)
> //.setCacheNodeIds(true)
> .setMaxMemoryInMB(500)
> .setSeed(System.currentTimeMillis() + System.nanoTime());
> RandomForestClassificationModel rfmodel = rf.train(data);
>try {
>   rfmodel.save(args[2] + "." + System.currentTimeMillis());
>} catch (IOException e) {
>   LOG.error(e.getMessage(), e);
>   e.printStackTrace();
>}
> {code}
> File metadata\part-0: 
> {code:java}
> {"class":"org.apache.spark.ml.classification.RandomForestClassificationModel",
> "timestamp":1519136783983,"sparkVersion":"2.2.1","uid":"rfc_7c7e84ce7488",
> "paramMap":{"featureSubsetStrategy":"auto","cacheNodeIds":false,"impurity":"gini",
> "checkpointInterval":10,
> "numTrees":20,"maxDepth":5,
> "probabilityCol":"probability","labelCol":"label","featuresCol":"features",
> "maxMemoryInMB":256,"minInstancesPerNode":1,"subsamplingRate":1.0,
> "rawPredictionCol":"rawPrediction","predictionCol":"prediction","maxBins":32,
> "minInfoGain":0.0,"seed":-491520797},"numFeatures":1354,"numClasses":2,
> "numTrees":20}
> {code}
> should be:
> {code:java}
> "numTrees":100,"maxDepth":30,{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23500) Filters on named_structs could be pushed into scans

2018-02-27 Thread Henry Robinson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379201#comment-16379201
 ] 

Henry Robinson commented on SPARK-23500:


Ok, I figured it out! {{SimplifyCreateStructOps}} does not get applied 
recursively across the whole plan, just across the expressions recursively 
reachable from the root. So even the following:

{{df.filter("named_struct('id', id, 'id2', id2).id > 1").select("id2")}}

doesn't trigger the rule because the {{named_struct}} is never seen. 

Changing the rule to walk the plan, and then walk the expression trees rooted 
at each node, caused the optimization to trigger.

{code}
object SimplifyCreateStructOps extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform { case p =>
    // Walk the expression trees rooted at every plan node, not just the root.
    p.transformExpressionsUp {
      case GetStructField(createNamedStructLike: CreateNamedStructLike, ordinal, _) =>
        createNamedStructLike.valExprs(ordinal)
    }
  }
}
{code}



> Filters on named_structs could be pushed into scans
> ---
>
> Key: SPARK-23500
> URL: https://issues.apache.org/jira/browse/SPARK-23500
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Henry Robinson
>Priority: Major
>
> Simple filters on dataframes joined with {{joinWith()}} are missing an 
> opportunity to get pushed into the scan because they're written in terms of 
> {{named_struct}} that could be removed by the optimizer.
> Given the following simple query over two dataframes:
> {code:java}
> scala> val df = spark.read.parquet("one_million")
> df: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint]
> scala> val df2 = spark.read.parquet("one_million")
> df2: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint]
> scala> df.joinWith(df2, df2.col("id") === df.col("id2")).filter("_2.id > 
> 30").explain
> == Physical Plan ==
> *(2) BroadcastHashJoin [_1#94.id2], [_2#95.id], Inner, BuildRight
> :- *(2) Project [named_struct(id, id#0L, id2, id2#1L) AS _1#94]
> :  +- *(2) FileScan parquet [id#0L,id2#1L] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], 
> PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct<id:bigint,id2:bigint>
> +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, 
> struct<id:bigint,id2:bigint>, false].id))
>+- *(1) Project [named_struct(id, id#90L, id2, id2#91L) AS _2#95]
>   +- *(1) Filter (named_struct(id, id#90L, id2, id2#91L).id > 30)
>  +- *(1) FileScan parquet [id#90L,id2#91L] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: 
> [], PushedFilters: [], ReadSchema: struct<id:bigint,id2:bigint>
> {code}
> Using {{joinWith}} means that the filter is placed on a {{named_struct}}, and 
> is then not pushed down. When the filter is just above the scan, the 
> wrapping-and-projection of {{named_struct(id...).id}} is a no-op and could be 
> removed. Then the filter can be pushed down to Parquet.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23471) RandomForestClassificationModel save() - incorrect metadata

2018-02-27 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379202#comment-16379202
 ] 

Bago Amirbekian commented on SPARK-23471:
-

[~Keepun] `train` is a protected API, it's called by `Predictor.fit` which also 
copies the values of Params to the newly created Model instance, 
[here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118].
 Do you get this same issue if you use `RandomForestClassifier.fit`?
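
For reference, a minimal sketch of that check, reusing the reporter's settings 
({{data}} and {{modelPath}} are placeholders):

{code:scala}
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}

val rf = new RandomForestClassifier()
  .setFeaturesCol("features")
  .setLabelCol("result")
  .setNumTrees(100)
  .setMaxDepth(30)

// fit() copies the Params onto the returned model (unlike calling the protected
// train() directly), so the saved metadata should show numTrees=100 and maxDepth=30.
val model: RandomForestClassificationModel = rf.fit(data)
model.write.overwrite().save(modelPath)

val reloaded = RandomForestClassificationModel.load(modelPath)
println((reloaded.getNumTrees, reloaded.getMaxDepth))
{code}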

> RandomForestClassificationModel save() - incorrect metadata
> ---
>
> Key: SPARK-23471
> URL: https://issues.apache.org/jira/browse/SPARK-23471
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Keepun
>Priority: Major
>
> RandomForestClassificationModel.load() does not work after save(): 
> {code:java}
> RandomForestClassifier rf = new RandomForestClassifier()
> .setFeaturesCol("features")
> .setLabelCol("result")
> .setNumTrees(100)
> .setMaxDepth(30)
> .setMinInstancesPerNode(1)
> //.setCacheNodeIds(true)
> .setMaxMemoryInMB(500)
> .setSeed(System.currentTimeMillis() + System.nanoTime());
> RandomForestClassificationModel rfmodel = rf.train(data);
>try {
>   rfmodel.save(args[2] + "." + System.currentTimeMillis());
>} catch (IOException e) {
>   LOG.error(e.getMessage(), e);
>   e.printStackTrace();
>}
> {code}
> File metadata\part-0: 
> {code:java}
> {"class":"org.apache.spark.ml.classification.RandomForestClassificationModel",
> "timestamp":1519136783983,"sparkVersion":"2.2.1","uid":"rfc_7c7e84ce7488",
> "paramMap":{"featureSubsetStrategy":"auto","cacheNodeIds":false,"impurity":"gini",
> "checkpointInterval":10,
> "numTrees":20,"maxDepth":5,
> "probabilityCol":"probability","labelCol":"label","featuresCol":"features",
> "maxMemoryInMB":256,"minInstancesPerNode":1,"subsamplingRate":1.0,
> "rawPredictionCol":"rawPrediction","predictionCol":"prediction","maxBins":32,
> "minInfoGain":0.0,"seed":-491520797},"numFeatures":1354,"numClasses":2,
> "numTrees":20}
> {code}
> should be:
> {code:java}
> "numTrees":100,"maxDepth":30,{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23501) Refactor AllStagesPage in order to avoid redundant code

2018-02-27 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-23501:
--

Assignee: Marco Gaido

> Refactor AllStagesPage in order to avoid redundant code
> ---
>
> Key: SPARK-23501
> URL: https://issues.apache.org/jira/browse/SPARK-23501
> Project: Spark
>  Issue Type: Task
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Trivial
> Fix For: 2.4.0
>
>
> AllStagesPage contains a lot of copy-pasted code for the different statuses 
> of the stages. This can be improved in order to make it easier to add new 
> stage statuses, make changes, and make these tasks less error prone.
> cc [~vanzin]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23501) Refactor AllStagesPage in order to avoid redundant code

2018-02-27 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23501.

   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20663
[https://github.com/apache/spark/pull/20663]

> Refactor AllStagesPage in order to avoid redundant code
> ---
>
> Key: SPARK-23501
> URL: https://issues.apache.org/jira/browse/SPARK-23501
> Project: Spark
>  Issue Type: Task
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Trivial
> Fix For: 2.4.0
>
>
> AllStagesPage contains a lot of copy-pasted code for the different statuses 
> of the stages. This can be improved in order to make it easier to add new 
> stage statuses, make changes, and make these tasks less error prone.
> cc [~vanzin]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23365) DynamicAllocation with failure in straggler task can lead to a hung spark job

2018-02-27 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-23365:
--

Assignee: Imran Rashid

> DynamicAllocation with failure in straggler task can lead to a hung spark job
> -
>
> Key: SPARK-23365
> URL: https://issues.apache.org/jira/browse/SPARK-23365
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.2, 2.2.1, 2.3.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> Dynamic Allocation can lead to a spark app getting stuck with 0 executors 
> requested when the executors in the last tasks of a taskset fail (eg. with an 
> OOM).
> This happens when {{ExecutorAllocationManager}}'s internal target number of 
> executors gets out of sync with {{CoarseGrainedSchedulerBackend}}'s target 
> number.  {{EAM}} updates the {{CGSB}} in two ways: (1) it tracks how many 
> tasks are active or pending in submitted stages, and computes how many 
> executors would be needed for them.  And as tasks finish, it will actively 
> decrease that count, informing the {{CGSB}} along the way.  (2) When it 
> decides executors are inactive for long enough, then it requests that 
> {{CGSB}} kill the executors -- this also tells the {{CGSB}} to update its 
> target number of executors: 
> https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L622
> So when there is just one task left, you could have the following sequence of 
> events:
> (1) the {{EAM}} sets the desired number of executors to 1, and updates the 
> {{CGSB}} too
> (2) while that final task is still running, the other executors cross the 
> idle timeout, and the {{EAM}} requests the {{CGSB}} kill them
> (3) now the {{EAM}} has a target of 1 executor, and the {{CGSB}} has a target 
> of 0 executors
> If the final task completed normally now, everything would be OK; the next 
> taskset would get submitted, the {{EAM}} would increase the target number of 
> executors and it would update the {{CGSB}}.
> But if the executor for that final task failed (eg. an OOM), then the {{EAM}} 
> thinks it [doesn't need to update 
> anything|https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L384-L386],
>  because its target is already 1, which is all it needs for that final task; 
> and the {{CGSB}} doesn't update anything either since its target is 0.
> I think you can determine if this is the cause of a stuck app by looking for
> {noformat}
> yarn.YarnAllocator: Driver requested a total number of 0 executor(s).
> {noformat}
> in the logs of the ApplicationMaster (at least on yarn).
> You can reproduce this with this test app, run with {{--conf 
> "spark.dynamicAllocation.minExecutors=1" --conf 
> "spark.dynamicAllocation.maxExecutors=5" --conf 
> "spark.dynamicAllocation.executorIdleTimeout=5s"}}
> {code}
> import org.apache.spark.SparkEnv
> sc.setLogLevel("INFO")
> sc.parallelize(1 to 1, 1).count()
> val execs = sc.parallelize(1 to 1000, 1000).map { _ => 
> SparkEnv.get.executorId}.collect().toSet
> val badExec = execs.head
> println("will kill exec " + badExec)
> new Thread() {
>   override def run(): Unit = {
> Thread.sleep(1)
> println("about to kill exec " + badExec)
> sc.killExecutor(badExec)
>   }
> }.start()
> sc.parallelize(1 to 5, 5).mapPartitions { itr =>
>   val exec = SparkEnv.get.executorId
>   if (exec == badExec) {
> Thread.sleep(2) // long enough that all the other tasks finish, and 
> the executors cross the idle timeout
> // meanwhile, something else should kill this executor
> itr
>   } else {
> itr
>   }
> }.collect()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23365) DynamicAllocation with failure in straggler task can lead to a hung spark job

2018-02-27 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23365.

   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0

Issue resolved by pull request 20604
[https://github.com/apache/spark/pull/20604]

> DynamicAllocation with failure in straggler task can lead to a hung spark job
> -
>
> Key: SPARK-23365
> URL: https://issues.apache.org/jira/browse/SPARK-23365
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.2, 2.2.1, 2.3.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 2.4.0, 2.3.1
>
>
> Dynamic Allocation can lead to a spark app getting stuck with 0 executors 
> requested when the executors in the last tasks of a taskset fail (eg. with an 
> OOM).
> This happens when {{ExecutorAllocationManager}}'s internal target number of 
> executors gets out of sync with {{CoarseGrainedSchedulerBackend}}'s target 
> number.  {{EAM}} updates the {{CGSB}} in two ways: (1) it tracks how many 
> tasks are active or pending in submitted stages, and computes how many 
> executors would be needed for them.  And as tasks finish, it will actively 
> decrease that count, informing the {{CGSB}} along the way.  (2) When it 
> decides executors are inactive for long enough, then it requests that 
> {{CGSB}} kill the executors -- this also tells the {{CGSB}} to update its 
> target number of executors: 
> https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L622
> So when there is just one task left, you could have the following sequence of 
> events:
> (1) the {{EAM}} sets the desired number of executors to 1, and updates the 
> {{CGSB}} too
> (2) while that final task is still running, the other executors cross the 
> idle timeout, and the {{EAM}} requests the {{CGSB}} kill them
> (3) now the {{EAM}} has a target of 1 executor, and the {{CGSB}} has a target 
> of 0 executors
> If the final task completed normally now, everything would be OK; the next 
> taskset would get submitted, the {{EAM}} would increase the target number of 
> executors and it would update the {{CGSB}}.
> But if the executor for that final task failed (eg. an OOM), then the {{EAM}} 
> thinks it [doesn't need to update 
> anything|https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L384-L386],
>  because its target is already 1, which is all it needs for that final task; 
> and the {{CGSB}} doesn't update anything either since its target is 0.
> I think you can determine if this is the cause of a stuck app by looking for
> {noformat}
> yarn.YarnAllocator: Driver requested a total number of 0 executor(s).
> {noformat}
> in the logs of the ApplicationMaster (at least on yarn).
> You can reproduce this with this test app, run with {{--conf 
> "spark.dynamicAllocation.minExecutors=1" --conf 
> "spark.dynamicAllocation.maxExecutors=5" --conf 
> "spark.dynamicAllocation.executorIdleTimeout=5s"}}
> {code}
> import org.apache.spark.SparkEnv
> sc.setLogLevel("INFO")
> sc.parallelize(1 to 1, 1).count()
> val execs = sc.parallelize(1 to 1000, 1000).map { _ => 
> SparkEnv.get.executorId}.collect().toSet
> val badExec = execs.head
> println("will kill exec " + badExec)
> new Thread() {
>   override def run(): Unit = {
> Thread.sleep(1)
> println("about to kill exec " + badExec)
> sc.killExecutor(badExec)
>   }
> }.start()
> sc.parallelize(1 to 5, 5).mapPartitions { itr =>
>   val exec = SparkEnv.get.executorId
>   if (exec == badExec) {
> Thread.sleep(2) // long enough that all the other tasks finish, and 
> the executors cross the idle timeout
> // meanwhile, something else should kill this executor
> itr
>   } else {
> itr
>   }
> }.collect()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22915) ML test for StructuredStreaming: spark.ml.feature, N-Z

2018-02-27 Thread Attila Zsolt Piros (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379056#comment-16379056
 ] 

Attila Zsolt Piros commented on SPARK-22915:


Thank you, as I already have a lot of work in it. I could do a quick PR right now, 
but I would prefer to do one more self-review and a round of test executions 
beforehand. I can promise to finish it within the next two days (I am travelling 
at the moment).

> ML test for StructuredStreaming: spark.ml.feature, N-Z
> --
>
> Key: SPARK-22915
> URL: https://issues.apache.org/jira/browse/SPARK-22915
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> *For featurizers with names from N - Z*
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18062) ProbabilisticClassificationModel.normalizeToProbabilitiesInPlace should return probabilities when given all-0 vector

2018-02-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-18062.
---
Resolution: Duplicate

> ProbabilisticClassificationModel.normalizeToProbabilitiesInPlace should 
> return probabilities when given all-0 vector
> 
>
> Key: SPARK-18062
> URL: https://issues.apache.org/jira/browse/SPARK-18062
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Trivial
>
> {{ProbabilisticClassificationModel.normalizeToProbabilitiesInPlace}} returns 
> a vector of all-0 when given a rawPrediction vector of all-0.  It should 
> return a valid probability vector with the uniform distribution.
> Note: This will be a *behavior change* but it should be very minor and affect 
> few if any users.  But we should note it in release notes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19416) Dataset.schema is inconsistent with Dataset in handling columns with periods

2018-02-27 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379034#comment-16379034
 ] 

Joseph K. Bradley commented on SPARK-19416:
---

[~rxin] Shall we close this as Won't Do, or shall we mark it as a thing to 
break in 3.0?

> Dataset.schema is inconsistent with Dataset in handling columns with periods
> 
>
> Key: SPARK-19416
> URL: https://issues.apache.org/jira/browse/SPARK-19416
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> When you have a DataFrame with a column with a period in its name, the API is 
> inconsistent about how to quote the column name.
> Here's a reproduction:
> {code}
> import org.apache.spark.sql.functions.col
> val rows = Seq(
>   ("foo", 1),
>   ("bar", 2)
> )
> val df = spark.createDataFrame(rows).toDF("a.b", "id")
> {code}
> These methods are all consistent:
> {code}
> df.select("a.b") // fails
> df.select("`a.b`") // succeeds
> df.select(col("a.b")) // fails
> df.select(col("`a.b`")) // succeeds
> df("a.b") // fails
> df("`a.b`") // succeeds
> {code}
> But {{schema}} is inconsistent:
> {code}
> df.schema("a.b") // succeeds
> df.schema("`a.b`") // fails
> {code}
> "fails" produces error messages like:
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '`a.b`' given input 
> columns: [a.b, id];;
> 'Project ['a.b]
> +- Project [_1#1511 AS a.b#1516, _2#1512 AS id#1517]
>+- LocalRelation [_1#1511, _2#1512]
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:282)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:292)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:296)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:296)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$7.apply(QueryPlan.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:301)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2822)
>   at org.apache.spark.sql.Dataset.select(Dataset.scala:1121)
>   at org.apache.spark.sql.Dataset.select(Dataset.scala:1139)
>   at 
> line9667c6d14e79417280e5882aa52e0de727.$read$$iw$$iw$$iw$$

[jira] [Commented] (SPARK-22915) ML test for StructuredStreaming: spark.ml.feature, N-Z

2018-02-27 Thread yogesh garg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379035#comment-16379035
 ] 

yogesh garg commented on SPARK-22915:
-

Ah, doesn't make sense for me to take it then. Thanks! Please go ahead.

> ML test for StructuredStreaming: spark.ml.feature, N-Z
> --
>
> Key: SPARK-22915
> URL: https://issues.apache.org/jira/browse/SPARK-22915
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> *For featurizers with names from N - Z*
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23523) Incorrect result caused by the rule OptimizeMetadataOnlyQuery

2018-02-27 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23523:
--
Affects Version/s: 2.1.2

> Incorrect result caused by the rule OptimizeMetadataOnlyQuery
> -
>
> Key: SPARK-23523
> URL: https://issues.apache.org/jira/browse/SPARK-23523
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.2, 2.2.1, 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.4.0
>
>
> {code:scala}
>  val tablePath = new File(s"${path.getCanonicalPath}/cOl3=c/cOl1=a/cOl5=e")
>  Seq(("a", "b", "c", "d", "e")).toDF("cOl1", "cOl2", "cOl3", "cOl4", "cOl5")
>  .write.json(tablePath.getCanonicalPath)
>  val df = spark.read.json(path.getCanonicalPath).select("CoL1", "CoL5", 
> "CoL3").distinct()
>  df.show()
> {code}
> This returns a wrong result 
> {{[c,e,a]}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22915) ML test for StructuredStreaming: spark.ml.feature, N-Z

2018-02-27 Thread Attila Zsolt Piros (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378987#comment-16378987
 ] 

Attila Zsolt Piros commented on SPARK-22915:


I am also very close to a PR.

> ML test for StructuredStreaming: spark.ml.feature, N-Z
> --
>
> Key: SPARK-22915
> URL: https://issues.apache.org/jira/browse/SPARK-22915
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> *For featurizers with names from N - Z*
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22915) ML test for StructuredStreaming: spark.ml.feature, N-Z

2018-02-27 Thread yogesh garg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378978#comment-16378978
 ] 

yogesh garg commented on SPARK-22915:
-

I have started working on this and can raise a PR soon. Thanks for the help!

> ML test for StructuredStreaming: spark.ml.feature, N-Z
> --
>
> Key: SPARK-22915
> URL: https://issues.apache.org/jira/browse/SPARK-22915
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> *For featurizers with names from N - Z*
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19809) NullPointerException on zero-size ORC file

2018-02-27 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378963#comment-16378963
 ] 

Dongjoon Hyun commented on SPARK-19809:
---

I meant 2.1.1 literally. :)

> NullPointerException on zero-size ORC file
> --
>
> Key: SPARK-19809
> URL: https://issues.apache.org/jira/browse/SPARK-19809
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1
>Reporter: Michał Dawid
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: image-2018-02-26-20-29-49-410.png, 
> spark.sql.hive.convertMetastoreOrc.txt
>
>
> When reading from hive ORC table if there are some 0 byte files we get 
> NullPointerException:
> {code}java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
>   at 
> org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
>   at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
> 

[jira] [Resolved] (SPARK-23523) Incorrect result caused by the rule OptimizeMetadataOnlyQuery

2018-02-27 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23523.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Incorrect result caused by the rule OptimizeMetadataOnlyQuery
> -
>
> Key: SPARK-23523
> URL: https://issues.apache.org/jira/browse/SPARK-23523
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.3.0
>
>
> {code:scala}
>  val tablePath = new File(s"${path.getCanonicalPath}/cOl3=c/cOl1=a/cOl5=e")
>  Seq(("a", "b", "c", "d", "e")).toDF("cOl1", "cOl2", "cOl3", "cOl4", "cOl5")
>  .write.json(tablePath.getCanonicalPath)
>  val df = spark.read.json(path.getCanonicalPath).select("CoL1", "CoL5", 
> "CoL3").distinct()
>  df.show()
> {code}
> This returns a wrong result 
> {{[c,e,a]}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23523) Incorrect result caused by the rule OptimizeMetadataOnlyQuery

2018-02-27 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23523:

Fix Version/s: (was: 2.3.0)
   2.4.0

> Incorrect result caused by the rule OptimizeMetadataOnlyQuery
> -
>
> Key: SPARK-23523
> URL: https://issues.apache.org/jira/browse/SPARK-23523
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.4.0
>
>
> {code:scala}
>  val tablePath = new File(s"${path.getCanonicalPath}/cOl3=c/cOl1=a/cOl5=e")
>  Seq(("a", "b", "c", "d", "e")).toDF("cOl1", "cOl2", "cOl3", "cOl4", "cOl5")
>  .write.json(tablePath.getCanonicalPath)
>  val df = spark.read.json(path.getCanonicalPath).select("CoL1", "CoL5", 
> "CoL3").distinct()
>  df.show()
> {code}
> This returns a wrong result 
> {{[c,e,a]}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10856) SQL Server dialect needs to map java.sql.Timestamp to DATETIME instead of TIMESTAMP

2018-02-27 Thread Greg Michalopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378904#comment-16378904
 ] 

Greg Michalopoulos commented on SPARK-10856:


I am facing an issue with one of the datatypes (bit) suggested by 
[~henrikbehrens] for patching.  Was this addressed in another bug?  I am using 
Spark 2.1.1.

> SQL Server dialect needs to map java.sql.Timestamp to DATETIME instead of 
> TIMESTAMP
> ---
>
> Key: SPARK-10856
> URL: https://issues.apache.org/jira/browse/SPARK-10856
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
>Reporter: Henrik Behrens
>Assignee: Liang-Chi Hsieh
>Priority: Major
>  Labels: patch
> Fix For: 1.6.0
>
>
> When saving a DataFrame to MS SQL Server, an error is thrown if there is more 
> than one TIMESTAMP column:
> df.printSchema
> root
>  |-- Id: string (nullable = false)
>  |-- TypeInformation_CreatedBy: string (nullable = false)
>  |-- TypeInformation_ModifiedBy: string (nullable = true)
>  |-- TypeInformation_TypeStatus: integer (nullable = false)
>  |-- TypeInformation_CreatedAtDatabase: timestamp (nullable = false)
>  |-- TypeInformation_ModifiedAtDatabase: timestamp (nullable = true)
> df.write.mode("overwrite").jdbc(url, tablename, props)
> com.microsoft.sqlserver.jdbc.SQLServerException: A table can only have one 
> timestamp column. Because table 'DebtorTypeSet1' already has one, the column 
> 'TypeInformation_ModifiedAtDatabase' cannot be added.
> at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:217)
> at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1635)
> at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:426)
> at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:372)
> at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:6276)
> at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:1793)
> at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:184)
> at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:159)
> at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeUpdate(SQLServerPreparedStatement.java:315)
> I tested this on Windows and SQL Server 12 using Spark 1.4.1.
> I think this can be fixed in a similar way to Spark-10419.
> As a reference, here is the type mapping according to the SQL Server JDBC 
> driver (basicDT.java, extracted from sqljdbc_4.2.6420.100_enu.exe):
>    private static void displayRow(String title, ResultSet rs) {
>       try {
>          System.out.println(title);
>          System.out.println(rs.getInt(1) + " , " +                  // SQL integer type.
>            rs.getString(2) + " , " +                                // SQL char type.
>            rs.getString(3) + " , " +                                // SQL varchar type.
>            rs.getBoolean(4) + " , " +                               // SQL bit type.
>            rs.getDouble(5) + " , " +                                // SQL decimal type.
>            rs.getDouble(6) + " , " +                                // SQL money type.
>            rs.getTimestamp(7) + " , " +                             // SQL datetime type.
>            rs.getDate(8) + " , " +                                  // SQL date type.
>            rs.getTime(9) + " , " +                                  // SQL time type.
>            rs.getTimestamp(10) + " , " +                            // SQL datetime2 type.
>            ((SQLServerResultSet)rs).getDateTimeOffset(11));         // SQL datetimeoffset type. 
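
For illustration only (this is not the patch attached to this issue): a user-registered JDBC dialect can remap Spark's TimestampType to SQL Server DATETIME in the way described above, since SQL Server's TIMESTAMP is a row-version type rather than a point in time. A minimal sketch, assuming Spark's public JdbcDialect API:

{code:scala}
// Hedged sketch: an illustrative user-side dialect, not Spark's built-in
// MsSqlServerDialect and not the fix that shipped in 1.6.0.
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{DataType, TimestampType}

object SqlServerDatetimeDialect extends JdbcDialect {
  // Only handle SQL Server JDBC URLs.
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")

  // Write Spark timestamps as DATETIME instead of TIMESTAMP.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case TimestampType => Some(JdbcType("DATETIME", Types.TIMESTAMP))
    case _ => None
  }
}

// Register before calling df.write.jdbc(...).
JdbcDialects.registerDialect(SqlServerDatetimeDialect)
{code}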



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-02-27 Thread Edwina Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378879#comment-16378879
 ] 

Edwina Lu commented on SPARK-23206:
---

[~assia6], I haven't submitted the PR yet – the code is currently in Spark 2.1 
and needs to be forward ported. I will update the ticket when the PR is 
available.

 

[~felixcheung], I will let you know when we schedule the next meeting. Is there 
more information on what you are looking into for shuffle and metrics 
collection?

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorsTab.png, ExecutorsTab2.png, 
> MemoryTuningMetricsDesignDoc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.
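
As a small aside on the arithmetic above (an illustrative sketch with made-up placeholder numbers, not measurements from the clusters described in the ticket):

{code:scala}
// Unused memory per application, per the formula in the description:
//   numExecutors * (spark.executor.memory - max JVM used memory across executors)
object UnusedMemoryEstimate {
  private val Gb = 1024L * 1024 * 1024

  def unusedBytes(numExecutors: Int, executorMemoryBytes: Long, maxJvmUsedBytes: Long): Long =
    numExecutors.toLong * (executorMemoryBytes - maxJvmUsedBytes)

  def main(args: Array[String]): Unit = {
    // Placeholder inputs: 100 executors, 16 GB spark.executor.memory, ~35% peak JVM usage.
    val unused = unusedBytes(
      numExecutors = 100,
      executorMemoryBytes = 16 * Gb,
      maxJvmUsedBytes = (5.6 * Gb).toLong)
    println(f"Estimated unused memory: ${unused.toDouble / Gb}%.0f GB")
  }
}
{code}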



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)

2018-02-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17147.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20572
[https://github.com/apache/spark/pull/20572]

> Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets 
> (i.e. Log Compaction)
> --
>
> Key: SPARK-17147
> URL: https://issues.apache.org/jira/browse/SPARK-17147
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: Robert Conrad
>Assignee: Cody Koeninger
>Priority: Major
> Fix For: 2.4.0
>
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the 
> next requested offset will frequently not be offset+1. The logic in 
> KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset 
> will always be just an increment of 1 above the previous offset. 
> I have worked around this problem by changing CachedKafkaConsumer to use the 
> returned record's offset, from:
> {{nextOffset = offset + 1}}
> to:
> {{nextOffset = record.offset + 1}}
> and changed KafkaRDD from:
> {{requestOffset += 1}}
> to:
> {{requestOffset = r.offset() + 1}}
> (I also had to change some assert logic in CachedKafkaConsumer).
> There's a strong possibility that I have misconstrued how to use the 
> streaming kafka consumer, and I'm happy to close this out if that's the case. 
> If, however, it is supposed to support non-consecutive offsets (e.g. due to 
> log compaction) I am also happy to contribute a PR.
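
To make the offset-tracking change above concrete, here is a minimal standalone sketch of the same idea (a hypothetical stand-in record type, not Spark's actual KafkaRDD/CachedKafkaConsumer code):

{code:scala}
// Advance the expected offset from the record actually returned, instead of
// assuming offsets are consecutive. SimpleRecord is a stand-in for Kafka's
// ConsumerRecord.
final case class SimpleRecord(offset: Long, value: String)

def nextOffsetAfter(records: Iterator[SimpleRecord], startOffset: Long): Long = {
  var nextOffset = startOffset
  records.foreach { r =>
    // With log compaction, r.offset may jump past nextOffset; that is expected.
    nextOffset = r.offset + 1
  }
  nextOffset
}

// Offsets 0, 1, 5, 9 (gaps left by compaction) -> the next expected offset is 10.
val next = nextOffsetAfter(
  Iterator(SimpleRecord(0, "a"), SimpleRecord(1, "b"), SimpleRecord(5, "c"), SimpleRecord(9, "d")),
  startOffset = 0L)
{code}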



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)

2018-02-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-17147:
-

Assignee: Cody Koeninger

> Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets 
> (i.e. Log Compaction)
> --
>
> Key: SPARK-17147
> URL: https://issues.apache.org/jira/browse/SPARK-17147
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: Robert Conrad
>Assignee: Cody Koeninger
>Priority: Major
> Fix For: 2.4.0
>
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the 
> next requested offset will frequently not be offset+1. The logic in 
> KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset 
> will always be just an increment of 1 above the previous offset. 
> I have worked around this problem by changing CachedKafkaConsumer to use the 
> returned record's offset, from:
> {{nextOffset = offset + 1}}
> to:
> {{nextOffset = record.offset + 1}}
> and changed KafkaRDD from:
> {{requestOffset += 1}}
> to:
> {{requestOffset = r.offset() + 1}}
> (I also had to change some assert logic in CachedKafkaConsumer).
> There's a strong possibility that I have misconstrued how to use the 
> streaming kafka consumer, and I'm happy to close this out if that's the case. 
> If, however, it is supposed to support non-consecutive offsets (e.g. due to 
> log compaction) I am also happy to contribute a PR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23509) Upgrade commons-net from 2.2 to 3.1

2018-02-27 Thread PandaMonkey (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378636#comment-16378636
 ] 

PandaMonkey commented on SPARK-23509:
-

[~srowen] Thanks for your reply. 

> Upgrade commons-net from 2.2 to 3.1
> ---
>
> Key: SPARK-23509
> URL: https://issues.apache.org/jira/browse/SPARK-23509
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: PandaMonkey
>Assignee: Kazuaki Ishizaki
>Priority: Minor
> Fix For: 2.4.0
>
> Attachments: spark.txt
>
>
> Hi, after analyzing spark-master\core\pom.xml, we found that Spark-core 
> depends on org.apache.hadoop:hadoop-client:2.6.5, which transitively 
> introduces commons-net:3.1. At the same time, Spark-core directly depends on 
> an older version, commons-net:2.2. Looking further into the source code, 
> these two versions of commons-net have many different features. The 
> dependency conflict brings a high risk of "NoClassDefFoundError" or 
> "NoSuchMethodError" issues at runtime. Please take note of this problem. Maybe 
> upgrading commons-net from 2.2 to 3.1 is a good choice. Hope this report can 
> help you. Thanks!
>  
> Regards,
> Panda
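
For downstream builds hit by this kind of conflict, one hypothetical mitigation while the upgrade lands is to force a single commons-net version at resolution time. A build.sbt sketch (the coordinates and versions below are illustrative assumptions, not Spark's own build definition):

{code:scala}
// Pin commons-net so the copy pulled in transitively by hadoop-client and any
// direct 2.2 dependency cannot diverge on the runtime classpath.
ThisBuild / scalaVersion := "2.11.12"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.0" % "provided"

// Force dependency resolution to a single commons-net version.
dependencyOverrides += "commons-net" % "commons-net" % "3.1"
{code}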



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23509) Upgrade commons-net from 2.2 to 3.1

2018-02-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-23509:
--
Priority: Minor  (was: Major)

> Upgrade commons-net from 2.2 to 3.1
> ---
>
> Key: SPARK-23509
> URL: https://issues.apache.org/jira/browse/SPARK-23509
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: PandaMonkey
>Assignee: Kazuaki Ishizaki
>Priority: Minor
> Fix For: 2.4.0
>
> Attachments: spark.txt
>
>
> Hi, after analyzing spark-master\core\pom.xml, we found that Spark-core 
> depends on org.apache.hadoop:hadoop-client:2.6.5, which transitively 
> introduces commons-net:3.1. At the same time, Spark-core directly depends on 
> an older version, commons-net:2.2. Looking further into the source code, 
> these two versions of commons-net have many different features. The 
> dependency conflict brings a high risk of "NoClassDefFoundError" or 
> "NoSuchMethodError" issues at runtime. Please take note of this problem. Maybe 
> upgrading commons-net from 2.2 to 3.1 is a good choice. Hope this report can 
> help you. Thanks!
>  
> Regards,
> Panda



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23509) Upgrade commons-net from 2.2 to 3.1

2018-02-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-23509:
-

Assignee: Kazuaki Ishizaki

> Upgrade commons-net from 2.2 to 3.1
> ---
>
> Key: SPARK-23509
> URL: https://issues.apache.org/jira/browse/SPARK-23509
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: PandaMonkey
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: spark.txt
>
>
> Hi, after analyzing spark-master\core\pom.xml, we found that Spark-core 
> depends on org.apache.hadoop:hadoop-client:2.6.5, which transitively 
> introduces commons-net:3.1. At the same time, Spark-core directly depends on 
> an older version, commons-net:2.2. Looking further into the source code, 
> these two versions of commons-net have many different features. The 
> dependency conflict brings a high risk of "NoClassDefFoundError" or 
> "NoSuchMethodError" issues at runtime. Please take note of this problem. Maybe 
> upgrading commons-net from 2.2 to 3.1 is a good choice. Hope this report can 
> help you. Thanks!
>  
> Regards,
> Panda



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23509) Upgrade commons-net from 2.2 to 3.1

2018-02-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23509.
---
Resolution: Fixed

Issue resolved by pull request 20672
[https://github.com/apache/spark/pull/20672]

> Upgrade commons-net from 2.2 to 3.1
> ---
>
> Key: SPARK-23509
> URL: https://issues.apache.org/jira/browse/SPARK-23509
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: PandaMonkey
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: spark.txt
>
>
> Hi, after analyzing spark-master\core\pom.xml, we found that Spark-core 
> depends on org.apache.hadoop:hadoop-client:2.6.5, which transitively 
> introduces commons-net:3.1. At the same time, Spark-core directly depends on 
> an older version, commons-net:2.2. Looking further into the source code, 
> these two versions of commons-net have many different features. The 
> dependency conflict brings a high risk of "NoClassDefFoundError" or 
> "NoSuchMethodError" issues at runtime. Please take note of this problem. Maybe 
> upgrading commons-net from 2.2 to 3.1 is a good choice. Hope this report can 
> help you. Thanks!
>  
> Regards,
> Panda



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23525) ALTER TABLE CHANGE COLUMN doesn't work for external hive table

2018-02-27 Thread Pavlo Skliar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378550#comment-16378550
 ] 

Pavlo Skliar commented on SPARK-23525:
--

Unbelievably quick response for an open-source project. Much appreciated, 
thanks!

> ALTER TABLE CHANGE COLUMN doesn't work for external hive table
> --
>
> Key: SPARK-23525
> URL: https://issues.apache.org/jira/browse/SPARK-23525
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Pavlo Skliar
>Priority: Major
>
> {code:java}
> print(spark.sql("""
> SHOW CREATE TABLE test.trends
> """).collect()[0].createtab_stmt)
> /// OUTPUT
> CREATE EXTERNAL TABLE `test`.`trends`(`id` string COMMENT '', `metric` string 
> COMMENT '', `amount` bigint COMMENT '')
> COMMENT ''
> PARTITIONED BY (`date` string COMMENT '')
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES (
>   'serialization.format' = '1'
> )
> STORED AS
>   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
>   OUTPUTFORMAT 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION 's3://x/x/'
> TBLPROPERTIES (
>   'transient_lastDdlTime' = '1519729384',
>   'last_modified_time' = '1519645652',
>   'last_modified_by' = 'pavlo',
>   'last_castor_run_ts' = '1513561658.0'
> )
> spark.sql("""
> DESCRIBE test.trends
> """).collect()
> // OUTPUT
> [Row(col_name='id', data_type='string', comment=''),
>  Row(col_name='metric', data_type='string', comment=''),
>  Row(col_name='amount', data_type='bigint', comment=''),
>  Row(col_name='date', data_type='string', comment=''),
>  Row(col_name='# Partition Information', data_type='', comment=''),
>  Row(col_name='# col_name', data_type='data_type', comment='comment'),
>  Row(col_name='date', data_type='string', comment='')]
> spark.sql("""alter table test.trends change column id id string comment 
> 'unique identifier'""")
> spark.sql("""
> DESCRIBE test.trends
> """).collect()
> // OUTPUT
> [Row(col_name='id', data_type='string', comment=''), Row(col_name='metric', 
> data_type='string', comment=''), Row(col_name='amount', data_type='bigint', 
> comment=''), Row(col_name='date', data_type='string', comment=''), 
> Row(col_name='# Partition Information', data_type='', comment=''), 
> Row(col_name='# col_name', data_type='data_type', comment='comment'), 
> Row(col_name='date', data_type='string', comment='')]
> {code}
> The strange thing is that I've assigned a comment to the id field from Hive 
> successfully, and it's visible in the Hue UI, but it's still not visible from 
> Spark, and Spark requests don't have any effect on the comments.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23525) ALTER TABLE CHANGE COLUMN doesn't work for external hive table

2018-02-27 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378546#comment-16378546
 ] 

Jiang Xingbo commented on SPARK-23525:
--

I'm working on a fix for it, and will try to backport the fix to 2.2.

> ALTER TABLE CHANGE COLUMN doesn't work for external hive table
> --
>
> Key: SPARK-23525
> URL: https://issues.apache.org/jira/browse/SPARK-23525
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Pavlo Skliar
>Priority: Major
>
> {code:java}
> print(spark.sql("""
> SHOW CREATE TABLE test.trends
> """).collect()[0].createtab_stmt)
> /// OUTPUT
> CREATE EXTERNAL TABLE `test`.`trends`(`id` string COMMENT '', `metric` string 
> COMMENT '', `amount` bigint COMMENT '')
> COMMENT ''
> PARTITIONED BY (`date` string COMMENT '')
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES (
>   'serialization.format' = '1'
> )
> STORED AS
>   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
>   OUTPUTFORMAT 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION 's3://x/x/'
> TBLPROPERTIES (
>   'transient_lastDdlTime' = '1519729384',
>   'last_modified_time' = '1519645652',
>   'last_modified_by' = 'pavlo',
>   'last_castor_run_ts' = '1513561658.0'
> )
> spark.sql("""
> DESCRIBE test.trends
> """).collect()
> // OUTPUT
> [Row(col_name='id', data_type='string', comment=''),
>  Row(col_name='metric', data_type='string', comment=''),
>  Row(col_name='amount', data_type='bigint', comment=''),
>  Row(col_name='date', data_type='string', comment=''),
>  Row(col_name='# Partition Information', data_type='', comment=''),
>  Row(col_name='# col_name', data_type='data_type', comment='comment'),
>  Row(col_name='date', data_type='string', comment='')]
> spark.sql("""alter table test.trends change column id id string comment 
> 'unique identifier'""")
> spark.sql("""
> DESCRIBE test.trends
> """).collect()
> // OUTPUT
> [Row(col_name='id', data_type='string', comment=''), Row(col_name='metric', 
> data_type='string', comment=''), Row(col_name='amount', data_type='bigint', 
> comment=''), Row(col_name='date', data_type='string', comment=''), 
> Row(col_name='# Partition Information', data_type='', comment=''), 
> Row(col_name='# col_name', data_type='data_type', comment='comment'), 
> Row(col_name='date', data_type='string', comment='')]
> {code}
> The strange thing is that I've assigned a comment to the id field from Hive 
> successfully, and it's visible in the Hue UI, but it's still not visible from 
> Spark, and Spark requests don't have any effect on the comments.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23525) ALTER TABLE CHANGE COLUMN doesn't work for external hive table

2018-02-27 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378533#comment-16378533
 ] 

Jiang Xingbo commented on SPARK-23525:
--

Thank you for reporting this. I believe the bug is caused by: 
https://github.com/apache/spark/blob/8077bb04f350fd35df83ef896135c0672dc3f7b0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L613

> ALTER TABLE CHANGE COLUMN doesn't work for external hive table
> --
>
> Key: SPARK-23525
> URL: https://issues.apache.org/jira/browse/SPARK-23525
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Pavlo Skliar
>Priority: Major
>
> {code:java}
> print(spark.sql("""
> SHOW CREATE TABLE test.trends
> """).collect()[0].createtab_stmt)
> /// OUTPUT
> CREATE EXTERNAL TABLE `test`.`trends`(`id` string COMMENT '', `metric` string 
> COMMENT '', `amount` bigint COMMENT '')
> COMMENT ''
> PARTITIONED BY (`date` string COMMENT '')
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES (
>   'serialization.format' = '1'
> )
> STORED AS
>   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
>   OUTPUTFORMAT 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION 's3://x/x/'
> TBLPROPERTIES (
>   'transient_lastDdlTime' = '1519729384',
>   'last_modified_time' = '1519645652',
>   'last_modified_by' = 'pavlo',
>   'last_castor_run_ts' = '1513561658.0'
> )
> spark.sql("""
> DESCRIBE test.trends
> """).collect()
> // OUTPUT
> [Row(col_name='id', data_type='string', comment=''),
>  Row(col_name='metric', data_type='string', comment=''),
>  Row(col_name='amount', data_type='bigint', comment=''),
>  Row(col_name='date', data_type='string', comment=''),
>  Row(col_name='# Partition Information', data_type='', comment=''),
>  Row(col_name='# col_name', data_type='data_type', comment='comment'),
>  Row(col_name='date', data_type='string', comment='')]
> spark.sql("""alter table test.trends change column id id string comment 
> 'unique identifier'""")
> spark.sql("""
> DESCRIBE test.trends
> """).collect()
> // OUTPUT
> [Row(col_name='id', data_type='string', comment=''), Row(col_name='metric', 
> data_type='string', comment=''), Row(col_name='amount', data_type='bigint', 
> comment=''), Row(col_name='date', data_type='string', comment=''), 
> Row(col_name='# Partition Information', data_type='', comment=''), 
> Row(col_name='# col_name', data_type='data_type', comment='comment'), 
> Row(col_name='date', data_type='string', comment='')]
> {code}
> The strange thing is that I've assigned a comment to the id field from Hive 
> successfully, and it's visible in the Hue UI, but it's still not visible from 
> Spark, and Spark requests don't have any effect on the comments.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23525) ALTER TABLE CHANGE COLUMN doesn't work for external hive table

2018-02-27 Thread Jiang Xingbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiang Xingbo updated SPARK-23525:
-
Affects Version/s: 2.3.0
 Priority: Major  (was: Minor)
  Summary: ALTER TABLE CHANGE COLUMN doesn't work for external hive 
table  (was: Update column comment doesn't work from spark)

> ALTER TABLE CHANGE COLUMN doesn't work for external hive table
> --
>
> Key: SPARK-23525
> URL: https://issues.apache.org/jira/browse/SPARK-23525
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Pavlo Skliar
>Priority: Major
>
> {code:java}
> print(spark.sql("""
> SHOW CREATE TABLE test.trends
> """).collect()[0].createtab_stmt)
> /// OUTPUT
> CREATE EXTERNAL TABLE `test`.`trends`(`id` string COMMENT '', `metric` string 
> COMMENT '', `amount` bigint COMMENT '')
> COMMENT ''
> PARTITIONED BY (`date` string COMMENT '')
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES (
>   'serialization.format' = '1'
> )
> STORED AS
>   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
>   OUTPUTFORMAT 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION 's3://x/x/'
> TBLPROPERTIES (
>   'transient_lastDdlTime' = '1519729384',
>   'last_modified_time' = '1519645652',
>   'last_modified_by' = 'pavlo',
>   'last_castor_run_ts' = '1513561658.0'
> )
> spark.sql("""
> DESCRIBE test.trends
> """).collect()
> // OUTPUT
> [Row(col_name='id', data_type='string', comment=''),
>  Row(col_name='metric', data_type='string', comment=''),
>  Row(col_name='amount', data_type='bigint', comment=''),
>  Row(col_name='date', data_type='string', comment=''),
>  Row(col_name='# Partition Information', data_type='', comment=''),
>  Row(col_name='# col_name', data_type='data_type', comment='comment'),
>  Row(col_name='date', data_type='string', comment='')]
> spark.sql("""alter table test.trends change column id id string comment 
> 'unique identifier'""")
> spark.sql("""
> DESCRIBE test.trends
> """).collect()
> // OUTPUT
> [Row(col_name='id', data_type='string', comment=''), Row(col_name='metric', 
> data_type='string', comment=''), Row(col_name='amount', data_type='bigint', 
> comment=''), Row(col_name='date', data_type='string', comment=''), 
> Row(col_name='# Partition Information', data_type='', comment=''), 
> Row(col_name='# col_name', data_type='data_type', comment='comment'), 
> Row(col_name='date', data_type='string', comment='')]
> {code}
> The strange thing is that I've assigned a comment to the id field from Hive 
> successfully, and it's visible in the Hue UI, but it's still not visible from 
> Spark, and Spark requests don't have any effect on the comments.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23526) KafkaMicroBatchV2SourceSuite.ensure stream-stream self-join generates only one offset in offset log

2018-02-27 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378468#comment-16378468
 ] 

Wenchen Fan commented on SPARK-23526:
-

cc [~zsxwing]

> KafkaMicroBatchV2SourceSuite.ensure stream-stream self-join generates only 
> one offset in offset log
> ---
>
> Key: SPARK-23526
> URL: https://issues.apache.org/jira/browse/SPARK-23526
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>  Labels: flaky-test
>
> See it failed in PR builder with error message:
> {code:java}
> sbt.ForkMain$ForkError: 
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 676a8b08-c89b-450b-8cd8-fbf9868cd240, runId = 
> 46bb7aae-138b-420d-9b4f-44f42a2a4a0f] terminated with exception: Job aborted 
> due to stage failure: Task 0 in stage 163.0 failed 1 times, most recent 
> failure: Lost task 0.0 in stage 163.0 (TID 799, localhost, executor driver): 
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132) 
> at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:305)
>  at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$apache$spark$sql$kafka010$CachedKafkaConsumer$$fetchData(CachedKafkaConsumer.scala:216)
>  at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:122)
>  at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:106)
>  at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>  at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.runUninterruptiblyIfPossible(CachedKafkaConsumer.scala:68)
>  at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:106)
>  at 
> org.apache.spark.sql.kafka010.KafkaMicroBatchDataReader.next(KafkaMicroBatchReader.scala:353)
>  at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:49)
>  at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
>  Source) at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:616)
>  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>  at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:109) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23526) KafkaMicroBatchV2SourceSuite.ensure stream-stream self-join generates only one offset in offset log

2018-02-27 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-23526:
---

 Summary: KafkaMicroBatchV2SourceSuite.ensure stream-stream 
self-join generates only one offset in offset log
 Key: SPARK-23526
 URL: https://issues.apache.org/jira/browse/SPARK-23526
 Project: Spark
  Issue Type: Test
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Wenchen Fan


See it failed in PR builder with error message:
{code:java}
sbt.ForkMain$ForkError: org.apache.spark.sql.streaming.StreamingQueryException: 
Query [id = 676a8b08-c89b-450b-8cd8-fbf9868cd240, runId = 
46bb7aae-138b-420d-9b4f-44f42a2a4a0f] terminated with exception: Job aborted 
due to stage failure: Task 0 in stage 163.0 failed 1 times, most recent 
failure: Lost task 0.0 in stage 163.0 (TID 799, localhost, executor driver): 
java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
multi-threaded access at 
org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
 at 
org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132) 
at 
org.apache.spark.sql.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:305)
 at 
org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$apache$spark$sql$kafka010$CachedKafkaConsumer$$fetchData(CachedKafkaConsumer.scala:216)
 at 
org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:122)
 at 
org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:106)
 at 
org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
 at 
org.apache.spark.sql.kafka010.CachedKafkaConsumer.runUninterruptiblyIfPossible(CachedKafkaConsumer.scala:68)
 at 
org.apache.spark.sql.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:106)
 at 
org.apache.spark.sql.kafka010.KafkaMicroBatchDataReader.next(KafkaMicroBatchReader.scala:353)
 at 
org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:49)
 at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) 
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
 Source) at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:616)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) 
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) 
at org.apache.spark.scheduler.Task.run(Task.scala:109) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23525) Update column comment doesn't work from spark

2018-02-27 Thread Pavlo Skliar (JIRA)
Pavlo Skliar created SPARK-23525:


 Summary: Update column comment doesn't work from spark
 Key: SPARK-23525
 URL: https://issues.apache.org/jira/browse/SPARK-23525
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Pavlo Skliar


{code:java}
print(spark.sql("""
SHOW CREATE TABLE test.trends
""").collect()[0].createtab_stmt)

/// OUTPUT
CREATE EXTERNAL TABLE `test`.`trends`(`id` string COMMENT '', `metric` string 
COMMENT '', `amount` bigint COMMENT '')
COMMENT ''
PARTITIONED BY (`date` string COMMENT '')
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://x/x/'
TBLPROPERTIES (
  'transient_lastDdlTime' = '1519729384',
  'last_modified_time' = '1519645652',
  'last_modified_by' = 'pavlo',
  'last_castor_run_ts' = '1513561658.0'
)


spark.sql("""
DESCRIBE test.trends
""").collect()

// OUTPUT
[Row(col_name='id', data_type='string', comment=''),
 Row(col_name='metric', data_type='string', comment=''),
 Row(col_name='amount', data_type='bigint', comment=''),
 Row(col_name='date', data_type='string', comment=''),
 Row(col_name='# Partition Information', data_type='', comment=''),
 Row(col_name='# col_name', data_type='data_type', comment='comment'),
 Row(col_name='date', data_type='string', comment='')]


spark.sql("""alter table test.trends change column id id string comment 'unique 
identifier'""")


spark.sql("""
DESCRIBE test.trends
""").collect()

// OUTPUT
[Row(col_name='id', data_type='string', comment=''), Row(col_name='metric', 
data_type='string', comment=''), Row(col_name='amount', data_type='bigint', 
comment=''), Row(col_name='date', data_type='string', comment=''), 
Row(col_name='# Partition Information', data_type='', comment=''), 
Row(col_name='# col_name', data_type='data_type', comment='comment'), 
Row(col_name='date', data_type='string', comment='')]
{code}
The strange thing is that I've assigned a comment to the id field from Hive 
successfully, and it's visible in the Hue UI, but it's still not visible from 
Spark, and Spark requests don't have any effect on the comments.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17859) persist should not impede with spark's ability to perform a broadcast join.

2018-02-27 Thread Fernando Pereira (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fernando Pereira updated SPARK-17859:
-
Fix Version/s: 2.2.1

> persist should not impede with spark's ability to perform a broadcast join.
> ---
>
> Key: SPARK-17859
> URL: https://issues.apache.org/jira/browse/SPARK-17859
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.0.0
> Environment: spark 2.0.0 , Linux RedHat
>Reporter: Franck Tago
>Priority: Major
> Fix For: 2.0.2, 2.2.1
>
>
> I am using Spark 2.0.0 
> My investigation leads me to conclude that calling persist could prevent a 
> broadcast join from happening.
> Example
> Case1: No persist call 
> var  df1 =spark.range(100).select($"id".as("id1"))
> df1: org.apache.spark.sql.DataFrame = [id1: bigint]
>  var df2 =spark.range(1000).select($"id".as("id2"))
> df2: org.apache.spark.sql.DataFrame = [id2: bigint]
>  df1.join(df2 , $"id1" === $"id2" ).explain 
> == Physical Plan ==
> *BroadcastHashJoin [id1#117L], [id2#123L], Inner, BuildRight
> :- *Project [id#114L AS id1#117L]
> :  +- *Range (0, 100, splits=2)
> +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, 
> false]))
>+- *Project [id#120L AS id2#123L]
>   +- *Range (0, 1000, splits=2)
> Case 2:  persist call 
>  df1.persist.join(df2 , $"id1" === $"id2" ).explain 
> 16/10/10 15:50:21 WARN CacheManager: Asked to cache already cached data.
> == Physical Plan ==
> *SortMergeJoin [id1#3L], [id2#9L], Inner
> :- *Sort [id1#3L ASC], false, 0
> :  +- Exchange hashpartitioning(id1#3L, 10)
> : +- InMemoryTableScan [id1#3L]
> ::  +- InMemoryRelation [id1#3L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> :: :  +- *Project [id#0L AS id1#3L]
> :: : +- *Range (0, 100, splits=2)
> +- *Sort [id2#9L ASC], false, 0
>+- Exchange hashpartitioning(id2#9L, 10)
>   +- InMemoryTableScan [id2#9L]
>  :  +- InMemoryRelation [id2#9L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
>  : :  +- *Project [id#6L AS id2#9L]
>  : : +- *Range (0, 1000, splits=2)
> Why does the persist call prevent the broadcast join? 
> My opinion is that it should not.
> I was made aware that the persist call is lazy and that might have something 
> to do with it, but I still contend that it should not. 
> Losing broadcast joins is really costly.
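
As a side note for readers hitting this today (a workaround sketch, not the fix discussed in this ticket): an explicit broadcast hint keeps the broadcast strategy even when one side has been persisted, assuming Spark 2.x and the broadcast function from org.apache.spark.sql.functions:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-hint-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val df1 = spark.range(100).select($"id".as("id1")).persist()
val df2 = spark.range(1000).select($"id".as("id2"))

// Hint that the persisted (and small) side should be broadcast; explain() should
// show BroadcastHashJoin rather than SortMergeJoin.
df2.join(broadcast(df1), $"id1" === $"id2").explain()
{code}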



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23524) Big local shuffle blocks should not be checked for corruption.

2018-02-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23524:


Assignee: Apache Spark

> Big local shuffle blocks should not be checked for corruption.
> --
>
> Key: SPARK-23524
> URL: https://issues.apache.org/jira/browse/SPARK-23524
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: jin xing
>Assignee: Apache Spark
>Priority: Major
>
> In the current code, all local blocks will be checked for corruption no matter 
> whether they are big or not. The reasons are as below:
>  # Size in FetchResult for a local block is set to 0 
> ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L327])
>  # SPARK-4105 meant to only check the small blocks (size below a threshold), but because of reason 1, the check below will be invalid. 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L420
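
To make the mismatch above easier to see, here is a generic sketch of the intended guard (illustrative only, not Spark's actual ShuffleBlockFetcherIterator logic): corruption checking is only meant for blocks whose size is known and small, so a local block whose size is reported as 0 should not fall into the "small block" path.

{code:scala}
// Hypothetical stand-in for the fetch result metadata discussed above.
final case class FetchedBlock(blockId: String, reportedSize: Long, isLocal: Boolean)

// Only buffer-and-verify blocks with a known, small size; a reported size of 0
// (as happens for local blocks today) would otherwise always look "small".
def shouldVerify(block: FetchedBlock, smallBlockThreshold: Long): Boolean =
  block.reportedSize > 0 && block.reportedSize < smallBlockThreshold
{code}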



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23524) Big local shuffle blocks should not be checked for corruption.

2018-02-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23524:


Assignee: (was: Apache Spark)

> Big local shuffle blocks should not be checked for corruption.
> --
>
> Key: SPARK-23524
> URL: https://issues.apache.org/jira/browse/SPARK-23524
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: jin xing
>Priority: Major
>
> In the current code, all local blocks will be checked for corruption no matter 
> whether they are big or not. The reasons are as below:
>  # Size in FetchResult for a local block is set to 0 
> ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L327])
>  # SPARK-4105 meant to only check the small blocks (size below a threshold), but because of reason 1, the check below will be invalid. 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L420



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23524) Big local shuffle blocks should not be checked for corruption.

2018-02-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378324#comment-16378324
 ] 

Apache Spark commented on SPARK-23524:
--

User 'jinxing64' has created a pull request for this issue:
https://github.com/apache/spark/pull/20685

> Big local shuffle blocks should not be checked for corruption.
> --
>
> Key: SPARK-23524
> URL: https://issues.apache.org/jira/browse/SPARK-23524
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: jin xing
>Priority: Major
>
> In the current code, all local blocks will be checked for corruption no matter 
> whether they are big or not. The reasons are as below:
>  # Size in FetchResult for a local block is set to 0 
> ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L327])
>  # SPARK-4105 meant to only check the small blocks (size below a threshold), but because of reason 1, the check below will be invalid. 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L420



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23351) checkpoint corruption in long running application

2018-02-27 Thread David Ahern (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Ahern resolved SPARK-23351.
-
   Resolution: Fixed
Fix Version/s: 2.2.1

Already fixed in a future version... will test when it is made available by Cloudera.

> checkpoint corruption in long running application
> -
>
> Key: SPARK-23351
> URL: https://issues.apache.org/jira/browse/SPARK-23351
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: David Ahern
>Priority: Major
> Fix For: 2.2.1
>
>
> Hi, after leaving my (somewhat high-volume) Structured Streaming application 
> running for some time, I get the following exception. The same exception 
> also repeats when I try to restart the application. The only way to get the 
> application back running is to clear the checkpoint directory, which is far 
> from ideal.
> Maybe a stream is not being flushed/closed properly internally by Spark when 
> checkpointing?
>  
>  User class threw exception: 
> org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to 
> stage failure: Task 55 in stage 1.0 failed 4 times, most recent failure: Lost 
> task 55.3 in stage 1.0 (TID 240, gbslixaacspa04u.metis.prd, executor 2): 
> java.io.EOFException
>  at java.io.DataInputStream.readInt(DataInputStream.java:392)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:481)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358)
>  at scala.Option.getOrElse(Option.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:265)
>  at 
> org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:200)
>  at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:108)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23351) checkpoint corruption in long running application

2018-02-27 Thread David Ahern (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378311#comment-16378311
 ] 

David Ahern commented on SPARK-23351:
-

Will close this, assuming it's fixed in a future version... will reopen if 
necessary.

> checkpoint corruption in long running application
> -
>
> Key: SPARK-23351
> URL: https://issues.apache.org/jira/browse/SPARK-23351
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: David Ahern
>Priority: Major
>
> Hi, after leaving my (somewhat high-volume) Structured Streaming application 
> running for some time, I get the following exception.  The same exception 
> also repeats when I try to restart the application.  The only way to get the 
> application running again is to clear the checkpoint directory, which is far 
> from ideal.
> Maybe a stream is not being flushed/closed properly internally by Spark when 
> checkpointing?
>  
>  User class threw exception: 
> org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to 
> stage failure: Task 55 in stage 1.0 failed 4 times, most recent failure: Lost 
> task 55.3 in stage 1.0 (TID 240, gbslixaacspa04u.metis.prd, executor 2): 
> java.io.EOFException
>  at java.io.DataInputStream.readInt(DataInputStream.java:392)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:481)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358)
>  at scala.Option.getOrElse(Option.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:265)
>  at 
> org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:200)
>  at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:108)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23524) Big local shuffle blocks should not be checked for corruption.

2018-02-27 Thread jin xing (JIRA)
jin xing created SPARK-23524:


 Summary: Big local shuffle blocks should not be checked for 
corruption.
 Key: SPARK-23524
 URL: https://issues.apache.org/jira/browse/SPARK-23524
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.1
Reporter: jin xing


In the current code, all local blocks are checked for corruption no matter how 
big they are.  The reasons are as below (see the sketch after this list):
 # Size in FetchResult for a local block is set to 0 
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L327)
 # SPARK-4105 meant to only check the small blocks (size < maxBytesInFlight / 3), see 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L420
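To make the interaction of those two points concrete, here is a minimal Scala
sketch of a size-gated corruption check; the object and field names and the
hard-coded 48 MB default are illustrative assumptions, not Spark's actual
ShuffleBlockFetcherIterator code.

{code:scala}
// Simplified sketch of the size-gated corruption check described above.
object CorruptionCheckSketch {
  final case class FetchResult(blockId: String, isLocal: Boolean, size: Long)

  // Default spark.reducer.maxSizeInFlight is 48m.
  val maxBytesInFlight: Long = 48L * 1024 * 1024

  // Point 1 above: local blocks are recorded with size 0.
  def recordedSize(actualSize: Long, isLocal: Boolean): Long =
    if (isLocal) 0L else actualSize

  // Point 2 above: only "small" blocks were meant to be buffered and verified.
  def shouldCheckCorruption(result: FetchResult): Boolean =
    result.size < maxBytesInFlight / 3

  def main(args: Array[String]): Unit = {
    val bigLocal = FetchResult(
      blockId = "shuffle_0_0_0",
      isLocal = true,
      size = recordedSize(actualSize = 2L * 1024 * 1024 * 1024, isLocal = true))

    // Because the recorded size is 0, even a 2 GB local block passes the
    // "small block" gate and gets buffered for the corruption check.
    println(shouldCheckCorruption(bigLocal))  // prints: true
  }
}
{code}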



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23346) Failed tasks reported as success if the failure reason is not ExceptionFailure

2018-02-27 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-23346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378303#comment-16378303
 ] 

吴志龙 commented on SPARK-23346:
-

Has this bug been found and fixed? It will result in incorrect data.

> Failed tasks reported as success if the failure reason is not ExceptionFailure
> --
>
> Key: SPARK-23346
> URL: https://issues.apache.org/jira/browse/SPARK-23346
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
> Environment: HADOOP 2.6 + JDK1.8 + SPARK 2.2.0
>Reporter: 吴志龙
>Priority: Critical
> Attachments: 企业微信截图_15179714603307.png, 企业微信截图_15179715023606.png
>
>
>  !企业微信截图_15179715023606.png!  !企业微信截图_15179714603307.png! We have many other 
> failure reasons, such as TaskResultLost, but the status is reported as success. 
> In the web UI, we count non-ExceptionFailure failures as successful tasks, 
> which is highly misleading (see the sketch below).
> Detailed message:
> Job aborted due to stage failure: Task 0 in stage 7.0 failed 10 times, most 
> recent failure: Lost task 0.9 in stage 7.0 (TID 35, 60.hadoop.com, executor 
> 27): TaskResultLost (result lost from block manager)
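To illustrate the reporting problem schematically, here is a small Scala sketch;
the stand-in TaskEndReason case objects and the two classifier functions are
hypothetical, not Spark's actual listener or web UI code.

{code:scala}
// Schematic illustration of the status-classification problem described above.
object TaskStatusSketch {
  sealed trait TaskEndReason
  case object Success          extends TaskEndReason
  case object ExceptionFailure extends TaskEndReason
  case object TaskResultLost   extends TaskEndReason  // e.g. "result lost from block manager"

  // Buggy classification: anything that is not an ExceptionFailure is counted
  // as successful, so a TaskResultLost task shows up as SUCCESS in the UI.
  def buggyUiStatus(reason: TaskEndReason): String = reason match {
    case ExceptionFailure => "FAILED"
    case _                => "SUCCESS"
  }

  // What the issue asks for: everything other than Success is a failure.
  def fixedUiStatus(reason: TaskEndReason): String = reason match {
    case Success => "SUCCESS"
    case _       => "FAILED"
  }

  def main(args: Array[String]): Unit = {
    println(buggyUiStatus(TaskResultLost))  // SUCCESS -> misleading
    println(fixedUiStatus(TaskResultLost))  // FAILED
  }
}
{code}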



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org