[jira] [Assigned] (SPARK-10720) Add a java wrapper to create dataframe from a local list of Java Beans.

2015-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10720:


Assignee: (was: Apache Spark)

> Add a java wrapper to create dataframe from a local list of Java Beans.
> ---
>
> Key: SPARK-10720
> URL: https://issues.apache.org/jira/browse/SPARK-10720
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: holdenk
>Priority: Minor
>
> Similar to SPARK-10630, it would be nice if Java users didn't have to 
> parallelize their data explicitly (a step Scala users can already skip). The 
> issue came up in 
> http://stackoverflow.com/questions/32613413/apache-spark-machine-learning-cant-get-estimator-example-to-work?answertab=active#tab-top
>  
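For context, here is a minimal sketch of the difference using the existing Java API; the list-accepting call in the comment is only the proposed wrapper, not something that exists yet:

{code:java}
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class CreateDataFrameFromBeans {

  // A plain Java Bean; Spark infers the schema from its getters.
  public static class Person implements Serializable {
    private String name;
    private int age;
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
  }

  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext(
        new SparkConf().setAppName("beans-example").setMaster("local[*]"));
    SQLContext sqlContext = new SQLContext(jsc);

    Person p = new Person();
    p.setName("someone");
    p.setAge(30);
    List<Person> people = Arrays.asList(p);

    // Today a Java user has to parallelize the local list first:
    DataFrame df = sqlContext.createDataFrame(jsc.parallelize(people), Person.class);
    df.show();

    // The wrapper proposed here would accept the list directly, along the
    // lines of (assumed signature, not yet part of the API):
    // DataFrame df2 = sqlContext.createDataFrame(people, Person.class);

    jsc.stop();
  }
}
{code}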






[jira] [Assigned] (SPARK-10720) Add a java wrapper to create dataframe from a local list of Java Beans.

2015-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10720:


Assignee: Apache Spark

> Add a java wrapper to create dataframe from a local list of Java Beans.
> ---
>
> Key: SPARK-10720
> URL: https://issues.apache.org/jira/browse/SPARK-10720
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: holdenk
>Assignee: Apache Spark
>Priority: Minor
>
> Similar to SPARK-10630, it would be nice if Java users didn't have to 
> parallelize their data explicitly (a step Scala users can already skip). The 
> issue came up in 
> http://stackoverflow.com/questions/32613413/apache-spark-machine-learning-cant-get-estimator-example-to-work?answertab=active#tab-top
>  






[jira] [Commented] (SPARK-10720) Add a java wrapper to create dataframe from a local list of Java Beans.

2015-09-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904059#comment-14904059
 ] 

Apache Spark commented on SPARK-10720:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/8879

> Add a java wrapper to create dataframe from a local list of Java Beans.
> ---
>
> Key: SPARK-10720
> URL: https://issues.apache.org/jira/browse/SPARK-10720
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: holdenk
>Priority: Minor
>
> Similar to SPARK-10630, it would be nice if Java users didn't have to 
> parallelize their data explicitly (a step Scala users can already skip). The 
> issue came up in 
> http://stackoverflow.com/questions/32613413/apache-spark-machine-learning-cant-get-estimator-example-to-work?answertab=active#tab-top
>  






[jira] [Created] (SPARK-10771) Implement the shuffle encryption with AES-CTR crypto using JCE key provider.

2015-09-22 Thread Ferdinand Xu (JIRA)
Ferdinand Xu created SPARK-10771:


 Summary: Implement the shuffle encryption with AES-CTR crypto 
using JCE key provider.
 Key: SPARK-10771
 URL: https://issues.apache.org/jira/browse/SPARK-10771
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle
Reporter: Ferdinand Xu


We will use the credentials stored in the user group information to 
encrypt/decrypt shuffle data. We will use the JCE key provider to implement the 
AES-CTR crypto.
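As a rough illustration of the crypto primitive involved (not the actual Spark patch), the JCE already exposes AES-CTR through javax.crypto; the key and IV below are placeholders rather than credentials taken from the user group information:

{code:java}
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class AesCtrJceSketch {
  public static void main(String[] args) throws Exception {
    byte[] key = new byte[16];  // 128-bit AES key (placeholder material)
    byte[] iv = new byte[16];   // initial counter block for CTR mode
    SecureRandom random = new SecureRandom();
    random.nextBytes(key);
    random.nextBytes(iv);

    // Encrypt a stand-in shuffle block with AES/CTR via the JCE provider.
    Cipher encryptor = Cipher.getInstance("AES/CTR/NoPadding");
    encryptor.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
    byte[] cipherText = encryptor.doFinal("shuffle block bytes".getBytes(StandardCharsets.UTF_8));

    // Decrypt with the same key and counter block.
    Cipher decryptor = Cipher.getInstance("AES/CTR/NoPadding");
    decryptor.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
    byte[] plainText = decryptor.doFinal(cipherText);
    System.out.println(new String(plainText, StandardCharsets.UTF_8));
  }
}
{code}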






[jira] [Commented] (SPARK-10770) SparkPlan.executeCollect/executeTake should return InternalRow rather than external Row

2015-09-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904046#comment-14904046
 ] 

Apache Spark commented on SPARK-10770:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/8878

> SparkPlan.executeCollect/executeTake should return InternalRow rather than 
> external Row
> ---
>
> Key: SPARK-10770
> URL: https://issues.apache.org/jira/browse/SPARK-10770
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>







[jira] [Assigned] (SPARK-10770) SparkPlan.executeCollect/executeTake should return InternalRow rather than external Row

2015-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10770:


Assignee: Reynold Xin  (was: Apache Spark)

> SparkPlan.executeCollect/executeTake should return InternalRow rather than 
> external Row
> ---
>
> Key: SPARK-10770
> URL: https://issues.apache.org/jira/browse/SPARK-10770
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>







[jira] [Assigned] (SPARK-10770) SparkPlan.executeCollect/executeTake should return InternalRow rather than external Row

2015-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10770:


Assignee: Apache Spark  (was: Reynold Xin)

> SparkPlan.executeCollect/executeTake should return InternalRow rather than 
> external Row
> ---
>
> Key: SPARK-10770
> URL: https://issues.apache.org/jira/browse/SPARK-10770
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>







[jira] [Created] (SPARK-10770) SparkPlan.executeCollect/executeTake should return InternalRow rather than external Row

2015-09-22 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-10770:
---

 Summary: SparkPlan.executeCollect/executeTake should return 
InternalRow rather than external Row
 Key: SPARK-10770
 URL: https://issues.apache.org/jira/browse/SPARK-10770
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin









[jira] [Resolved] (SPARK-10742) Add the ability to embed HTML relative links in job descriptions

2015-09-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-10742.
---
   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

> Add the ability to embed HTML relative links in job descriptions
> 
>
> Key: SPARK-10742
> URL: https://issues.apache.org/jira/browse/SPARK-10742
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Minor
> Fix For: 1.6.0, 1.5.1
>
>
> This is to allow embedding links to other Spark UI tabs within the job 
> description. For example, streaming jobs could set descriptions with links 
> pointing to the corresponding details page of the batch that the job belongs 
> to.
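A minimal sketch of the usage this enables; the relative batch URL is purely illustrative, and rendering the link in the UI is what this ticket adds:

{code:java}
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class JobDescriptionLinkSketch {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext(
        new SparkConf().setAppName("desc-link-example").setMaster("local[*]"));

    // Illustrative relative link; the real batch-details path would be
    // filled in by the streaming scheduler.
    jsc.sc().setJobDescription(
        "Job for <a href=\"/streaming/batch/?id=1442945400000\">batch 1442945400000</a>");

    // Jobs run after this point carry the description shown in the UI.
    jsc.parallelize(Arrays.asList(1, 2, 3)).count();
    jsc.stop();
  }
}
{code}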






[jira] [Resolved] (SPARK-10652) Set meaningful job descriptions for streaming related jobs

2015-09-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-10652.
---
   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

> Set meaningful job descriptions for streaming related jobs
> --
>
> Key: SPARK-10652
> URL: https://issues.apache.org/jira/browse/SPARK-10652
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 1.6.0, 1.5.1
>
>
> Job descriptions will help distinguish jobs of one batch from the other in 
> the Jobs and Stages pages in the Spark UI






[jira] [Assigned] (SPARK-10769) Fix o.a.s.streaming.CheckpointSuite.maintains rate controller

2015-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10769:


Assignee: (was: Apache Spark)

> Fix o.a.s.streaming.CheckpointSuite.maintains rate controller
> -
>
> Key: SPARK-10769
> URL: https://issues.apache.org/jira/browse/SPARK-10769
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Shixiong Zhu
>  Labels: flaky-test
>
> Fixed the following failure in 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1787/testReport/junit/org.apache.spark.streaming/CheckpointSuite/recovery_maintains_rate_controller/
> {code}
> sbt.ForkMain$ForkError: The code passed to eventually never returned 
> normally. Attempted 660 times over 10.4439201 seconds. Last failure 
> message: 9223372036854775807 did not equal 200.
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
>   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:336)
>   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
>   at 
> org.apache.spark.streaming.CheckpointSuite$$anonfun$15.apply$mcV$sp(CheckpointSuite.scala:413)
>   at 
> org.apache.spark.streaming.CheckpointSuite$$anonfun$15.apply(CheckpointSuite.scala:396)
>   at 
> org.apache.spark.streaming.CheckpointSuite$$anonfun$15.apply(CheckpointSuite.scala:396)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
> {code}






[jira] [Assigned] (SPARK-10769) Fix o.a.s.streaming.CheckpointSuite.maintains rate controller

2015-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10769:


Assignee: Apache Spark

> Fix o.a.s.streaming.CheckpointSuite.maintains rate controller
> -
>
> Key: SPARK-10769
> URL: https://issues.apache.org/jira/browse/SPARK-10769
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>  Labels: flaky-test
>
> Fixed the following failure in 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1787/testReport/junit/org.apache.spark.streaming/CheckpointSuite/recovery_maintains_rate_controller/
> {code}
> sbt.ForkMain$ForkError: The code passed to eventually never returned 
> normally. Attempted 660 times over 10.4439201 seconds. Last failure 
> message: 9223372036854775807 did not equal 200.
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
>   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:336)
>   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
>   at 
> org.apache.spark.streaming.CheckpointSuite$$anonfun$15.apply$mcV$sp(CheckpointSuite.scala:413)
>   at 
> org.apache.spark.streaming.CheckpointSuite$$anonfun$15.apply(CheckpointSuite.scala:396)
>   at 
> org.apache.spark.streaming.CheckpointSuite$$anonfun$15.apply(CheckpointSuite.scala:396)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
> {code}






[jira] [Commented] (SPARK-10769) Fix o.a.s.streaming.CheckpointSuite.maintains rate controller

2015-09-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903998#comment-14903998
 ] 

Apache Spark commented on SPARK-10769:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/8877

> Fix o.a.s.streaming.CheckpointSuite.maintains rate controller
> -
>
> Key: SPARK-10769
> URL: https://issues.apache.org/jira/browse/SPARK-10769
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Shixiong Zhu
>  Labels: flaky-test
>
> Fixed the following failure in 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1787/testReport/junit/org.apache.spark.streaming/CheckpointSuite/recovery_maintains_rate_controller/
> {code}
> sbt.ForkMain$ForkError: The code passed to eventually never returned 
> normally. Attempted 660 times over 10.4439201 seconds. Last failure 
> message: 9223372036854775807 did not equal 200.
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
>   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:336)
>   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
>   at 
> org.apache.spark.streaming.CheckpointSuite$$anonfun$15.apply$mcV$sp(CheckpointSuite.scala:413)
>   at 
> org.apache.spark.streaming.CheckpointSuite$$anonfun$15.apply(CheckpointSuite.scala:396)
>   at 
> org.apache.spark.streaming.CheckpointSuite$$anonfun$15.apply(CheckpointSuite.scala:396)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
> {code}






[jira] [Commented] (SPARK-10000) Consolidate cache memory management and execution memory management

2015-09-22 Thread Bowen Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903995#comment-14903995
 ] 

Bowen Zhang commented on SPARK-10000:
-

[~rxin], I am very interested in this new story and am trying to understand it. 
In addition to consolidating these two parts of memory into one, are there other 
tricky things that could pose a challenge, or other use-case considerations that 
should be taken into account for this story?

> Consolidate cache memory management and execution memory management
> ---
>
> Key: SPARK-10000
> URL: https://issues.apache.org/jira/browse/SPARK-10000
> Project: Spark
>  Issue Type: Story
>  Components: Block Manager, Spark Core
>Reporter: Reynold Xin
>
> As a Spark user, I want Spark to manage the memory more intelligently so I do 
> not need to worry about how to statically partition the execution (shuffle) 
> memory fraction and cache memory fraction.
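For reference, the static split under discussion is the one users currently tune by hand with two separate fractions; the values below are just the 1.x defaults as far as I recall:

{code:java}
import org.apache.spark.SparkConf;

public class StaticMemorySplitExample {
  public static void main(String[] args) {
    // Execution (shuffle) memory and cache (storage) memory are sized by two
    // separate fractions today; the story is about letting Spark manage the
    // split between them dynamically.
    SparkConf conf = new SparkConf()
        .setAppName("static-memory-split")
        .set("spark.shuffle.memoryFraction", "0.2")   // execution side
        .set("spark.storage.memoryFraction", "0.6");  // cache side
    System.out.println(conf.toDebugString());
  }
}
{code}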






[jira] [Created] (SPARK-10769) Fix o.a.s.streaming.CheckpointSuite.maintains rate controller

2015-09-22 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-10769:


 Summary: Fix o.a.s.streaming.CheckpointSuite.maintains rate 
controller
 Key: SPARK-10769
 URL: https://issues.apache.org/jira/browse/SPARK-10769
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Shixiong Zhu


Fixed the following failure in 
https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1787/testReport/junit/org.apache.spark.streaming/CheckpointSuite/recovery_maintains_rate_controller/

{code}
sbt.ForkMain$ForkError: The code passed to eventually never returned normally. 
Attempted 660 times over 10.4439201 seconds. Last failure message: 
9223372036854775807 did not equal 200.
at 
org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:336)
at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
at 
org.apache.spark.streaming.CheckpointSuite$$anonfun$15.apply$mcV$sp(CheckpointSuite.scala:413)
at 
org.apache.spark.streaming.CheckpointSuite$$anonfun$15.apply(CheckpointSuite.scala:396)
at 
org.apache.spark.streaming.CheckpointSuite$$anonfun$15.apply(CheckpointSuite.scala:396)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
{code}






[jira] [Created] (SPARK-10768) How to access columns with "." dot in their name in Spark SQL

2015-09-22 Thread Harut Martirosyan (JIRA)
Harut Martirosyan created SPARK-10768:
-

 Summary: How to access columns with "." dot in their name in Spark 
SQL
 Key: SPARK-10768
 URL: https://issues.apache.org/jira/browse/SPARK-10768
 Project: Spark
  Issue Type: Question
Reporter: Harut Martirosyan
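The issue body is empty, so only as a hedged pointer: Spark SQL's parser accepts backticks around identifiers, which is the usual way to reach a column whose name literally contains a dot (the column name "a.b" below is hypothetical):

{code:java}
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class DottedColumnNameSketch {
  // df is assumed to have a top-level column literally named "a.b".
  static DataFrame selectDottedColumn(SQLContext sqlContext, DataFrame df) {
    df.registerTempTable("t");
    // Backticks make the parser treat "a.b" as a single identifier instead
    // of struct field access on a column named "a".
    DataFrame viaSql = sqlContext.sql("SELECT `a.b` FROM t");
    DataFrame viaExpr = df.selectExpr("`a.b`");
    return viaExpr.unionAll(viaSql);
  }
}
{code}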









[jira] [Created] (SPARK-10767) Make pyspark shared params codegen more consistent

2015-09-22 Thread holdenk (JIRA)
holdenk created SPARK-10767:
---

 Summary: Make pyspark shared params codegen more consistent 
 Key: SPARK-10767
 URL: https://issues.apache.org/jira/browse/SPARK-10767
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: holdenk
Priority: Minor









[jira] [Commented] (SPARK-10668) Use WeightedLeastSquares in LinearRegression with L2 regularization if the number of features is small

2015-09-22 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903935#comment-14903935
 ] 

Xiangrui Meng commented on SPARK-10668:
---

[~lewuathe] any progress?

> Use WeightedLeastSquares in LinearRegression with L2 regularization if the 
> number of features is small
> --
>
> Key: SPARK-10668
> URL: https://issues.apache.org/jira/browse/SPARK-10668
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Kai Sasaki
>Priority: Critical
>
> If the number of features is small (<=4096) and the regularization is L2, we 
> should use WeightedLeastSquares to solve the problem rather than L-BFGS. The 
> former requires only one pass over the data.
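The selection rule, written out as a hedged sketch; the 4096 threshold and the two solver names come from the description above, while the surrounding code is invented for illustration:

{code:java}
public class SolverSelectionSketch {
  // Hypothetical dispatch mirroring the rule in the description: a small
  // feature count plus a pure L2 penalty allows the one-pass
  // WeightedLeastSquares solver; everything else falls back to L-BFGS.
  static String chooseSolver(int numFeatures, double elasticNetParam) {
    final int MAX_FEATURES_FOR_WLS = 4096;
    boolean pureL2 = elasticNetParam == 0.0;  // elasticNetParam 0.0 means L2-only
    if (numFeatures <= MAX_FEATURES_FOR_WLS && pureL2) {
      return "WeightedLeastSquares";  // single pass over the data
    }
    return "L-BFGS";                  // iterative, multiple passes
  }
}
{code}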






[jira] [Updated] (SPARK-10663) Change test.toDF to test in Spark ML Programming Guide

2015-09-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10663:
--
Target Version/s: 1.6.0, 1.5.1

> Change test.toDF to test in Spark ML Programming Guide
> --
>
> Key: SPARK-10663
> URL: https://issues.apache.org/jira/browse/SPARK-10663
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Matt Hagen
>Assignee: Matt Hagen
>Priority: Trivial
> Fix For: 1.6.0, 1.5.1
>
>
> Spark 1.5.0 > Spark ML Programming Guide > Example: Pipeline
> I believe model.transform(test.toDF) should be model.transform(test).
> Note that "test" is already a DataFrame.






[jira] [Updated] (SPARK-10663) Change test.toDF to test in Spark ML Programming Guide

2015-09-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10663:
--
Assignee: Matt Hagen

> Change test.toDF to test in Spark ML Programming Guide
> --
>
> Key: SPARK-10663
> URL: https://issues.apache.org/jira/browse/SPARK-10663
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Matt Hagen
>Assignee: Matt Hagen
>Priority: Trivial
> Fix For: 1.6.0, 1.5.1
>
>
> Spark 1.5.0 > Spark ML Programming Guide > Example: Pipeline
> I believe model.transform(test.toDF) should be model.transform(test).
> Note that "test" is already a DataFrame.






[jira] [Resolved] (SPARK-10663) Change test.toDF to test in Spark ML Programming Guide

2015-09-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10663.
---
   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

Issue resolved by pull request 8875
[https://github.com/apache/spark/pull/8875]

> Change test.toDF to test in Spark ML Programming Guide
> --
>
> Key: SPARK-10663
> URL: https://issues.apache.org/jira/browse/SPARK-10663
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Matt Hagen
>Priority: Trivial
> Fix For: 1.6.0, 1.5.1
>
>
> Spark 1.5.0 > Spark ML Programming Guide > Example: Pipeline
> I believe model.transform(test.toDF) should be model.transform(test).
> Note that "test" is already a DataFrame.






[jira] [Created] (SPARK-10766) Add some configurations for the client process in yarn-cluster mode.

2015-09-22 Thread SaintBacchus (JIRA)
SaintBacchus created SPARK-10766:


 Summary: Add some configurations for the client process in 
yarn-cluster mode. 
 Key: SPARK-10766
 URL: https://issues.apache.org/jira/browse/SPARK-10766
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.6.0
Reporter: SaintBacchus


In yarn-cluster mode, it's hard to find the correct configuration for the 
client process. 
But this is sometimes needed, for example for the client process's classpath: if 
I want to use HBase on Spark, I have to include the HBase jars in the client's 
classpath. *spark.driver.extraClassPath* doesn't take effect there. The only way 
I have found is to put the HBase jars into the SPARK_CLASSPATH environment 
variable, which is not a good solution, so I want to add some configurations for 
this client process.







[jira] [Resolved] (SPARK-10310) [Spark SQL] All result records will be populated into ONE line during the script transform due to missing the correct line/field delimiter

2015-09-22 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-10310.
--
   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

Issue resolved by pull request 8860
[https://github.com/apache/spark/pull/8860]

> [Spark SQL] All result records will be populated into ONE line during the 
> script transform due to missing the correct line/field delimiter
> --
>
> Key: SPARK-10310
> URL: https://issues.apache.org/jira/browse/SPARK-10310
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yi Zhou
>Assignee: zhichao-li
>Priority: Critical
> Fix For: 1.6.0, 1.5.1
>
>
> There is a real case using a Python stream script in a Spark SQL query. We 
> found that all result records were written on ONE line when passed from the 
> "select" pipeline to the Python script, so the script could not identify 
> each record. Also, the field separator in Spark SQL is '^A' or '\001', which 
> is inconsistent/incompatible with the '\t' used in the Hive implementation.
> Key query:
> {code:sql}
> CREATE VIEW temp1 AS
> SELECT *
> FROM
> (
>   FROM
>   (
> SELECT
>   c.wcs_user_sk,
>   w.wp_type,
>   (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
> FROM web_clickstreams c, web_page w
> WHERE c.wcs_web_page_sk = w.wp_web_page_sk
> AND   c.wcs_web_page_sk IS NOT NULL
> AND   c.wcs_user_sk IS NOT NULL
> AND   c.wcs_sales_skIS NULL --abandoned implies: no sale
> DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
>   ) clicksAnWebPageType
>   REDUCE
> wcs_user_sk,
> tstamp_inSec,
> wp_type
>   USING 'python sessionize.py 3600'
>   AS (
> wp_type STRING,
> tstamp BIGINT, 
> sessionid STRING)
> ) sessionized
> {code}
> Key Python script:
> {noformat}
> for line in sys.stdin:
>  user_sk,  tstamp_str, value  = line.strip().split("\t")
> {noformat}
> Sample SELECT result:
> {noformat}
> ^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview
> {noformat}
> Expected result:
> {noformat}
> 31   3237764860   feedback
> 31   3237769106   dynamic
> 31   3237779027   review
> {noformat}






[jira] [Updated] (SPARK-8882) A New Receiver Scheduling Mechanism to solve unbalanced receivers

2015-09-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-8882:
-
Summary: A New Receiver Scheduling Mechanism to solve unbalanced receivers  
(was: A New Receiver Scheduling Mechanism)

> A New Receiver Scheduling Mechanism to solve unbalanced receivers
> -
>
> Key: SPARK-8882
> URL: https://issues.apache.org/jira/browse/SPARK-8882
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 1.5.0
>
>
> There are some problems in the current mechanism:
>  - If a task fails more than “spark.task.maxFailures” (default: 4) times, the 
> job will fail. For a long-running Streaming application, it's possible that 
> a Receiver task fails more than 4 times because of executor loss.
> - When an executor is lost, the Receiver tasks on it will be rescheduled. 
> However, because there may be many Spark jobs running at the same time, it's 
> possible that the TaskScheduler cannot schedule them so that Receivers are 
> distributed evenly.
> To solve such limitations, we need to change the receiver scheduling 
> mechanism. Here is the design doc: 
> https://docs.google.com/document/d/1ZsoRvHjpISPrDmSjsGzuSu8UjwgbtmoCTzmhgTurHJw/edit?usp=sharing






[jira] [Updated] (SPARK-8882) A New Receiver Scheduling Mechanism

2015-09-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-8882:
-
Description: 
There are some problems in the current mechanism:
 - If a task fails more than “spark.task.maxFailures” (default: 4) times, the 
job will fail. For a long-running Streaming application, it's possible that a 
Receiver task fails more than 4 times because of executor loss.
- When an executor is lost, the Receiver tasks on it will be rescheduled. 
However, because there may be many Spark jobs running at the same time, it's 
possible that the TaskScheduler cannot schedule them so that Receivers are 
distributed evenly.

To solve such limitations, we need to change the receiver scheduling mechanism. 
Here is the design doc: 
https://docs.google.com/document/d/1ZsoRvHjpISPrDmSjsGzuSu8UjwgbtmoCTzmhgTurHJw/edit?usp=sharing

  was:The design doc: 
https://docs.google.com/document/d/1ZsoRvHjpISPrDmSjsGzuSu8UjwgbtmoCTzmhgTurHJw/edit?usp=sharing


> A New Receiver Scheduling Mechanism
> ---
>
> Key: SPARK-8882
> URL: https://issues.apache.org/jira/browse/SPARK-8882
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 1.5.0
>
>
> There are some problems in the current mechanism:
>  - If a task fails more than “spark.task.maxFailures” (default: 4) times, the 
> job will fail. For a long-running Streaming application, it's possible that 
> a Receiver task fails more than 4 times because of executor loss.
> - When an executor is lost, the Receiver tasks on it will be rescheduled. 
> However, because there may be many Spark jobs running at the same time, it's 
> possible that the TaskScheduler cannot schedule them so that Receivers are 
> distributed evenly.
> To solve such limitations, we need to change the receiver scheduling 
> mechanism. Here is the design doc: 
> https://docs.google.com/document/d/1ZsoRvHjpISPrDmSjsGzuSu8UjwgbtmoCTzmhgTurHJw/edit?usp=sharing






[jira] [Commented] (SPARK-10731) The head() implementation of dataframe is very slow

2015-09-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903818#comment-14903818
 ] 

Apache Spark commented on SPARK-10731:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/8876

> The head() implementation of dataframe is very slow
> ---
>
> Key: SPARK-10731
> URL: https://issues.apache.org/jira/browse/SPARK-10731
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Jerry Lam
>  Labels: pyspark
>
> {code}
> df=sqlContext.read.parquet("someparquetfiles")
> df.head()
> {code}
> The above lines take over 15 minutes. It seems the DataFrame requires 3 
> stages to return the first row. It reads all the data (about 1 billion 
> rows) and runs Limit twice. The take(1) implementation in the RDD performs 
> much better.






[jira] [Commented] (SPARK-7442) Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access

2015-09-22 Thread Amey Ghadigaonkar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903750#comment-14903750
 ] 

Amey Ghadigaonkar commented on SPARK-7442:
--

Getting the same error with Spark 1.4.1 running on Hadoop 2.6.0. I am reverting 
to Hadoop 2.4.0 until this is fixed.

> Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access
> -
>
> Key: SPARK-7442
> URL: https://issues.apache.org/jira/browse/SPARK-7442
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.3.1
> Environment: OS X
>Reporter: Nicholas Chammas
>
> # Download Spark 1.3.1 pre-built for Hadoop 2.6 from the [Spark downloads 
> page|http://spark.apache.org/downloads.html].
> # Add {{localhost}} to your {{slaves}} file and {{start-all.sh}}
> # Fire up PySpark and try reading from S3 with something like this:
> {code}sc.textFile('s3n://bucket/file_*').count(){code}
> # You will get an error like this:
> {code}py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : java.io.IOException: No FileSystem for scheme: s3n{code}
> {{file:///...}} works. Spark 1.3.1 prebuilt for Hadoop 2.4 works. Spark 1.3.0 
> works.
> It's just the combination of Spark 1.3.1 prebuilt for Hadoop 2.6 accessing S3 
> that doesn't work.
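For what it's worth, the workaround commonly suggested at the time was to put the hadoop-aws and matching aws-java-sdk jars on the classpath and register the s3n implementation explicitly; treat this as an assumption about a fix rather than something confirmed in this thread (the config key and class name themselves are standard Hadoop ones):

{code:java}
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class S3nWorkaroundSketch {
  public static void main(String[] args) {
    // Assumes hadoop-aws and the matching aws-java-sdk jar are already on the
    // classpath (e.g. via --jars); without them the class below cannot load.
    JavaSparkContext jsc = new JavaSparkContext(
        new SparkConf().setAppName("s3n-workaround").setMaster("local[*]"));
    jsc.hadoopConfiguration().set(
        "fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");

    long count = jsc.textFile("s3n://bucket/file_*").count();
    System.out.println(count);
    jsc.stop();
  }
}
{code}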






[jira] [Assigned] (SPARK-10663) Change test.toDF to test in Spark ML Programming Guide

2015-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10663:


Assignee: (was: Apache Spark)

> Change test.toDF to test in Spark ML Programming Guide
> --
>
> Key: SPARK-10663
> URL: https://issues.apache.org/jira/browse/SPARK-10663
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Matt Hagen
>Priority: Trivial
>
> Spark 1.5.0 > Spark ML Programming Guide > Example: Pipeline
> I believe model.transform(test.toDF) should be model.transform(test).
> Note that "test" is already a DataFrame.






[jira] [Assigned] (SPARK-10663) Change test.toDF to test in Spark ML Programming Guide

2015-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10663:


Assignee: Apache Spark

> Change test.toDF to test in Spark ML Programming Guide
> --
>
> Key: SPARK-10663
> URL: https://issues.apache.org/jira/browse/SPARK-10663
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Matt Hagen
>Assignee: Apache Spark
>Priority: Trivial
>
> Spark 1.5.0 > Spark ML Programming Guide > Example: Pipeline
> I believe model.transform(test.toDF) should be model.transform(test).
> Note that "test" is already a DataFrame.






[jira] [Commented] (SPARK-10663) Change test.toDF to test in Spark ML Programming Guide

2015-09-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903749#comment-14903749
 ] 

Apache Spark commented on SPARK-10663:
--

User 'hagenhaus' has created a pull request for this issue:
https://github.com/apache/spark/pull/8875

> Change test.toDF to test in Spark ML Programming Guide
> --
>
> Key: SPARK-10663
> URL: https://issues.apache.org/jira/browse/SPARK-10663
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Matt Hagen
>Priority: Trivial
>
> Spark 1.5.0 > Spark ML Programming Guide > Example: Pipeline
> I believe model.transform(test.toDF) should be model.transform(test).
> Note that "test" is already a DataFrame.






[jira] [Commented] (SPARK-10733) TungstenAggregation cannot acquire page after switching to sort-based

2015-09-22 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903740#comment-14903740
 ] 

Yi Zhou commented on SPARK-10733:
-

Yes, I still got the error after applying the commit.

> TungstenAggregation cannot acquire page after switching to sort-based
> -
>
> Key: SPARK-10733
> URL: https://issues.apache.org/jira/browse/SPARK-10733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
>
> This is uncovered after fixing SPARK-10474. Stack trace:
> {code}
> 15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage 
> 152.0 (TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 
> bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Updated] (SPARK-10705) Stop converting internal rows to external rows in DataFrame.toJSON

2015-09-22 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-10705:
---
Assignee: Liang-Chi Hsieh

> Stop converting internal rows to external rows in DataFrame.toJSON
> --
>
> Key: SPARK-10705
> URL: https://issues.apache.org/jira/browse/SPARK-10705
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Cheng Lian
>Assignee: Liang-Chi Hsieh
>
> {{DataFrame.toJSON}} uses {{DataFrame.mapPartitions}}, which converts 
> internal rows to external rows. We can use 
> {{queryExecution.toRdd.mapPartitions}} instead for better performance.
> Another issue is that, for UDT values, {{serialize}} produces internal types. 
> So currently we must deal with both internal and external types within 
> {{toJSON}} (see 
> [here|https://github.com/apache/spark/pull/8806/files#diff-0f04c36e499d4dcf6931fbd62b3aa012R77]),
>  which is pretty weird.






[jira] [Resolved] (SPARK-10640) Spark history server fails to parse taskEndReasonFromJson TaskCommitDenied

2015-09-22 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-10640.
---
   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

> Spark history server fails to parse taskEndReasonFromJson TaskCommitDenied
> --
>
> Key: SPARK-10640
> URL: https://issues.apache.org/jira/browse/SPARK-10640
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0, 1.4.0, 1.5.0
>Reporter: Thomas Graves
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> I'm seeing an exception from the spark history server trying to read a 
> history file:
> {code}
> scala.MatchError: TaskCommitDenied (of class java.lang.String)
> at 
> org.apache.spark.util.JsonProtocol$.taskEndReasonFromJson(JsonProtocol.scala:775)
> at 
> org.apache.spark.util.JsonProtocol$.taskEndFromJson(JsonProtocol.scala:531)
> at 
> org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:488)
> at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:457)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:292)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:289)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:289)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$1$$anon$2.run(FsHistoryProvider.scala:210)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Commented] (SPARK-10748) Log error instead of crashing Spark Mesos dispatcher when a job is misconfigured

2015-09-22 Thread Alan Braithwaite (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903672#comment-14903672
 ] 

Alan Braithwaite commented on SPARK-10748:
--

This could be tricky with persistence too. I noticed that when bringing the 
dispatcher back while a misconfigured job was persisted to ZooKeeper, it 
would just keep crashing and would be unable to start up.

> Log error instead of crashing Spark Mesos dispatcher when a job is 
> misconfigured
> 
>
> Key: SPARK-10748
> URL: https://issues.apache.org/jira/browse/SPARK-10748
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Reporter: Timothy Chen
>
> Currently, when the dispatcher is submitting a new driver, it simply throws a 
> SparkException when a necessary configuration is not set. We should log the 
> error and keep the dispatcher running instead of crashing.
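A bare sketch of the proposed behavior change; the class and method names below are invented stand-ins, not the dispatcher's real internals:

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class DispatcherSubmissionSketch {
  private static final Logger LOG = LoggerFactory.getLogger(DispatcherSubmissionSketch.class);

  // Stub standing in for the dispatcher's real driver description type.
  interface DriverDescriptionStub {
    void validate();  // throws if a required configuration is missing
  }

  // Hypothetical submission hook: a bad submission is logged and rejected
  // instead of taking the whole dispatcher down.
  void submitDriver(DriverDescriptionStub desc) {
    try {
      desc.validate();
      launch(desc);
    } catch (RuntimeException e) {
      LOG.error("Rejecting misconfigured driver submission: " + e.getMessage(), e);
    }
  }

  private void launch(DriverDescriptionStub desc) {
    // hand the description off to Mesos (omitted)
  }
}
{code}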






[jira] [Commented] (SPARK-9798) CrossValidatorModel Documentation Improvements

2015-09-22 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903657#comment-14903657
 ] 

Feynman Liang commented on SPARK-9798:
--

The actual Scala doc.

> CrossValidatorModel Documentation Improvements
> --
>
> Key: SPARK-9798
> URL: https://issues.apache.org/jira/browse/SPARK-9798
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Feynman Liang
>Priority: Minor
>  Labels: starter
>
> CrossValidatorModel's avgMetrics and bestModel need documentation.






[jira] [Updated] (SPARK-9585) HiveHBaseTableInputFormat can't be cached

2015-09-22 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-9585:
-
Assignee: meiyoula

> HiveHBaseTableInputFormat can't be cached
> ---
>
> Key: SPARK-9585
> URL: https://issues.apache.org/jira/browse/SPARK-9585
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: meiyoula
>Assignee: meiyoula
> Fix For: 1.6.0
>
>
> The exception below occurs when using Spark on HBase.
> {quote}
> java.lang.RuntimeException: java.util.concurrent.RejectedExecutionException: 
> Task 
> org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture@11c6577
>  rejected from java.util.concurrent.ThreadPoolExecutor@3414350b[Terminated, 
> pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 17451]
> {quote}
> When an executor has many cores, the tasks belonging to the same RDD will use 
> the same InputFormat. But HiveHBaseTableInputFormat is not thread-safe.
> So I think we should add a config to enable or disable caching the InputFormat.






[jira] [Commented] (SPARK-9798) CrossValidatorModel Documentation Improvements

2015-09-22 Thread rerngvit yanggratoke (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903653#comment-14903653
 ] 

rerngvit yanggratoke commented on SPARK-9798:
-

[~fliang] By documentation, do you mean the Scala doc description on top of 
the class definition (in the file CrossValidator.scala), or the documentation in 
the /docs folder?

> CrossValidatorModel Documentation Improvements
> --
>
> Key: SPARK-9798
> URL: https://issues.apache.org/jira/browse/SPARK-9798
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Feynman Liang
>Priority: Minor
>  Labels: starter
>
> CrossValidatorModel's avgMetrics and bestModel need documentation.






[jira] [Assigned] (SPARK-10765) use new aggregate interface for hive UDAF

2015-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10765:


Assignee: (was: Apache Spark)

> use new aggregate interface for hive UDAF
> -
>
> Key: SPARK-10765
> URL: https://issues.apache.org/jira/browse/SPARK-10765
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Commented] (SPARK-10765) use new aggregate interface for hive UDAF

2015-09-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903639#comment-14903639
 ] 

Apache Spark commented on SPARK-10765:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/8874

> use new aggregate interface for hive UDAF
> -
>
> Key: SPARK-10765
> URL: https://issues.apache.org/jira/browse/SPARK-10765
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Assigned] (SPARK-10765) use new aggregate interface for hive UDAF

2015-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10765:


Assignee: Apache Spark

> use new aggregate interface for hive UDAF
> -
>
> Key: SPARK-10765
> URL: https://issues.apache.org/jira/browse/SPARK-10765
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>







[jira] [Created] (SPARK-10765) use new aggregate interface for hive UDAF

2015-09-22 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-10765:
---

 Summary: use new aggregate interface for hive UDAF
 Key: SPARK-10765
 URL: https://issues.apache.org/jira/browse/SPARK-10765
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan









[jira] [Created] (SPARK-10764) Add optional caching to Pipelines

2015-09-22 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-10764:
-

 Summary: Add optional caching to Pipelines
 Key: SPARK-10764
 URL: https://issues.apache.org/jira/browse/SPARK-10764
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Joseph K. Bradley


We need to explore how to cache DataFrames during the execution of Pipelines.  
It's a hard problem in general to handle automatically or manually, so we 
should start with some design discussions about:
* How to control it manually
* Whether & how to handle it automatically
* API changes needed for each
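As a starting point for the "control it manually" option, a rough sketch of what a user can already do by hand between stages; the stage objects are placeholders, and only Transformer.transform and DataFrame.cache()/unpersist() are assumed from the existing API:

{code:java}
import org.apache.spark.ml.Transformer;
import org.apache.spark.sql.DataFrame;

public class ManualPipelineCaching {
  // "featurizer" and "modelStage" stand in for real fitted pipeline stages.
  static DataFrame runWithManualCache(Transformer featurizer, Transformer modelStage, DataFrame raw) {
    // Cache the intermediate result that downstream stages (or iterative
    // estimators) would otherwise recompute from scratch.
    DataFrame features = featurizer.transform(raw).cache();
    DataFrame scored = modelStage.transform(features);
    scored.count();        // force evaluation while the cache is in place
    features.unpersist();  // release it once downstream work is done
    return scored;
  }
}
{code}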






[jira] [Commented] (SPARK-10409) Multilayer perceptron regression

2015-09-22 Thread Lauren Moos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903553#comment-14903553
 ] 

Lauren Moos commented on SPARK-10409:
-

no problem!

> Multilayer perceptron regression
> 
>
> Key: SPARK-10409
> URL: https://issues.apache.org/jira/browse/SPARK-10409
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Implement regression based on the multilayer perceptron (MLP). It should 
> support different kinds of outputs: binary, real in [0;1) and real in 
> [-inf; +inf]. The implementation might take advantage of an autoencoder. 
> Time-series forecasting for financial data might be one of the use cases, see 
> http://dl.acm.org/citation.cfm?id=561452. So there is a need for more 
> specific requirements from this (or another) area.






[jira] [Assigned] (SPARK-10403) UnsafeRowSerializer can't work with UnsafeShuffleManager (tungsten-sort)

2015-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10403:


Assignee: Josh Rosen  (was: Apache Spark)

> UnsafeRowSerializer can't work with UnsafeShuffleManager (tungsten-sort)
> 
>
> Key: SPARK-10403
> URL: https://issues.apache.org/jira/browse/SPARK-10403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Assignee: Josh Rosen
>
> UnsafeRowSerializer relies on EOF in the stream, but UnsafeRowWriter will not 
> write EOF between partitions.
> {code}
> java.io.EOFException
>   at java.io.DataInputStream.readInt(DataInputStream.java:392)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:122)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
>   at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:174)
>   at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$executePartition$1(sort.scala:160)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:99)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> {code}
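Editor's note: one way to make the deserializer independent of a raw EOFException 
(a sketch only; not necessarily the approach taken in the actual fix) is to frame 
the stream with an explicit end-of-stream sentinel, so the reader stops on the 
sentinel rather than on stream exhaustion:

{code}
import java.io.{DataInputStream, DataOutputStream}

// Each record is written as [length][bytes]; a -1 length terminates the stream,
// so the reader never depends on hitting EOF between partitions.
val END_OF_STREAM = -1

def writeRecords(out: DataOutputStream, records: Iterator[Array[Byte]]): Unit = {
  records.foreach { bytes =>
    out.writeInt(bytes.length)
    out.write(bytes)
  }
  out.writeInt(END_OF_STREAM)  // explicit terminator instead of relying on EOF
}

def readRecords(in: DataInputStream): Iterator[Array[Byte]] =
  Iterator.continually(in.readInt())
    .takeWhile(_ != END_OF_STREAM)
    .map { length =>
      val bytes = new Array[Byte](length)
      in.readFully(bytes)
      bytes
    }
{code}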



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10403) UnsafeRowSerializer can't work with UnsafeShuffleManager (tungsten-sort)

2015-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10403:


Assignee: Apache Spark  (was: Josh Rosen)

> UnsafeRowSerializer can't work with UnsafeShuffleManager (tungsten-sort)
> 
>
> Key: SPARK-10403
> URL: https://issues.apache.org/jira/browse/SPARK-10403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> UnsafeRowSerializer relies on an EOF marker in the stream, but UnsafeRowWriter 
> does not write an EOF marker between partitions.
> {code}
> java.io.EOFException
>   at java.io.DataInputStream.readInt(DataInputStream.java:392)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:122)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
>   at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:174)
>   at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$executePartition$1(sort.scala:160)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:99)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10762) GenericRowWithSchema exception in casting ArrayBuffer to HashSet in DataFrame to RDD from Hive table

2015-09-22 Thread Glenn Strycker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903551#comment-14903551
 ] 

Glenn Strycker commented on SPARK-10762:


Is this related?  
http://mail-archives.us.apache.org/mod_mbox/spark-user/201507.mbox/%3ccaaswr-6djsk5j7me42wvlfvt5zxry85tpfna2e6iwgm_fha...@mail.gmail.com%3E

> GenericRowWithSchema exception in casting ArrayBuffer to HashSet in DataFrame 
> to RDD from Hive table
> 
>
> Key: SPARK-10762
> URL: https://issues.apache.org/jira/browse/SPARK-10762
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
>
> I have a Hive table in parquet format that was generated using
> {code}
> create table myTable (var1 int, var2 string, var3 int, var4 string, var5 
> array>) stored as parquet;
> {code}
> I am able to verify that it was filled -- here is a sample value
> {code}
> [1, "abcdef", 2, "ghijkl", ArrayBuffer([1, "hello"])]
> {code}
> I wish to put this into a Spark RDD of the form
> {code}
> ((1,"abcdef"), ((2,"ghijkl"), Set((1,"hello"
> {code}
> Now, using spark-shell (I get the same problem in spark-submit), I made a 
> test RDD with these values
> {code}
> scala> val tempRDD = sc.parallelize(Seq(((1,"abcdef"),((2,"ghijkl"), 
> ArrayBuffer[(Int,String)]((1,"hello"))))))
> tempRDD: org.apache.spark.rdd.RDD[((Int, String), ((Int, String), 
> scala.collection.mutable.ArrayBuffer[(Int, String)]))] = 
> ParallelCollectionRDD[44] at parallelize at :85
> {code}
> using an iterator, I can cast the ArrayBuffer as a HashSet in the following 
> new RDD:
> {code}
> scala> val tempRDD2 = tempRDD.map(a => (a._1, (a._2._1, { var tempHashSet = 
> new HashSet[(Int,String)]; a._2._2.foreach(a => tempHashSet = tempHashSet ++ 
> HashSet(a)); tempHashSet } )))
> tempRDD2: org.apache.spark.rdd.RDD[((Int, String), ((Int, String), 
> scala.collection.immutable.HashSet[(Int, String)]))] = MapPartitionsRDD[46] 
> at map at :87
> scala> tempRDD2.collect.foreach(println)
> ((1,abcdef),((2,ghijkl),Set((1,hello))))
> {code}
> But when I attempt to do the EXACT SAME THING with a DataFrame with a 
> HiveContext / SQLContext, I get the following error:
> {code}
> scala> val hc = new HiveContext(sc)
> scala> import hc._
> scala> import hc.implicits._
> scala> val tempHiveQL = hc.sql("""select var1, var2, var3, var4, var5 from 
> myTable""")
> scala> val tempRDDfromHive = tempHiveQL.map(a => ((a(0).toString.toInt, 
> a(1).toString), ((a(2).toString.toInt, a(3).toString), 
> a(4).asInstanceOf[ArrayBuffer[(Int,String)]] )))
> scala> val tempRDD3 = tempRDDfromHive.map(a => (a._1, (a._2._1, { var 
> tempHashSet = new HashSet[(Int,String)]; a._2._2.foreach(a => tempHashSet = 
> tempHashSet ++ HashSet(a)); tempHashSet } )))
> tempRDD3: org.apache.spark.rdd.RDD[((Int, String), ((Int, String), 
> scala.collection.immutable.HashSet[(Int, String)]))] = MapPartitionsRDD[47] 
> at map at :91
> scala> tempRDD3.collect.foreach(println)
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 14.0 failed 1 times, most recent failure: Lost task 1.0 in stage 14.0 
> (TID 5211, localhost): java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
> to scala.Tuple2
>at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$anonfun$apply$1.apply(:91)
>at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:91)
>at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:91)
>at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>at scala.col

[jira] [Commented] (SPARK-10403) UnsafeRowSerializer can't work with UnsafeShuffleManager (tungsten-sort)

2015-09-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903552#comment-14903552
 ] 

Apache Spark commented on SPARK-10403:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/8873

> UnsafeRowSerializer can't work with UnsafeShuffleManager (tungsten-sort)
> 
>
> Key: SPARK-10403
> URL: https://issues.apache.org/jira/browse/SPARK-10403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Assignee: Josh Rosen
>
> UnsafeRowSerializer relies on an EOF marker in the stream, but UnsafeRowWriter 
> does not write an EOF marker between partitions.
> {code}
> java.io.EOFException
>   at java.io.DataInputStream.readInt(DataInputStream.java:392)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:122)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
>   at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:174)
>   at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$executePartition$1(sort.scala:160)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:99)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4489) JavaPairRDD.collectAsMap from checkpoint RDD may fail with ClassCastException

2015-09-22 Thread Glenn Strycker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903548#comment-14903548
 ] 

Glenn Strycker commented on SPARK-4489:
---

I believe I am getting this issue as well.  See ticket 
https://issues.apache.org/jira/browse/SPARK-10762

The error message is "java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
to scala.Tuple2" when attempting to use an ArrayBuffer in an RDD generated from 
a DataFrame from a HiveContext.  Although I persisted, checkpointed, and 
materialized between steps, the error still occurs during a DAG step.

{code}
scala> tempRDD3.collect.foreach(println)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 14.0 failed 1 times, most recent failure: Lost task 1.0 in stage 14.0 
(TID 5211, localhost): java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
to scala.Tuple2
   at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$anonfun$apply$1.apply(:91)
   at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:91)
   at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:91)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
   at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
   at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
   at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1503)
   at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1503)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:724)

Driver stacktrace:
   at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
   at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
   at scala.Option.foreach(Option.scala:236)
   at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
   at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
   at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
{code}
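Editor's note: the ClassCastException above typically means the nested 
array-of-struct column comes back as a Seq of Row rather than Scala tuples, so the 
deferred asInstanceOf cast only fails when the elements are first touched inside a 
task. A workaround sketch, assuming the SPARK-10762 schema with the struct array at 
column index 4, is to convert each nested Row explicitly:

{code}
import org.apache.spark.sql.Row
import scala.collection.immutable.HashSet

// Convert the nested Rows to tuples up front instead of casting the whole
// column to ArrayBuffer[(Int, String)], which only fails later inside the DAG.
val tempRDDfromHive = tempHiveQL.map { a =>
  val pairs = a.getAs[Seq[Row]](4).map(r => (r.getInt(0), r.getString(1)))
  ((a.getInt(0), a.getString(1)), ((a.getInt(2), a.getString(3)), HashSet(pairs: _*)))
}
{code}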


> JavaPairRDD.collectAsMap from checkpoint RDD may fail with Cla

[jira] [Created] (SPARK-10763) Update Java MLLIB/ML tests to use simplified dataframe construction

2015-09-22 Thread holdenk (JIRA)
holdenk created SPARK-10763:
---

 Summary: Update Java MLLIB/ML tests to use simplified dataframe 
construction
 Key: SPARK-10763
 URL: https://issues.apache.org/jira/browse/SPARK-10763
 Project: Spark
  Issue Type: Test
  Components: ML, MLlib
Reporter: holdenk
Priority: Minor


As introduced in https://issues.apache.org/jira/browse/SPARK-10630, we now have 
an easier way to create DataFrames from local Java lists. Let's update the tests 
to use it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2737) ClassCastExceptions when collect()ing JavaRDDs' underlying Scala RDDs

2015-09-22 Thread Glenn Strycker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903542#comment-14903542
 ] 

Glenn Strycker commented on SPARK-2737:
---

I am getting a similar error in Spark 1.3.0... see a new ticket I created:  
https://issues.apache.org/jira/browse/SPARK-10762

> ClassCastExceptions when collect()ing JavaRDDs' underlying Scala RDDs
> -
>
> Key: SPARK-2737
> URL: https://issues.apache.org/jira/browse/SPARK-2737
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 0.8.0, 0.9.0, 1.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.1.0
>
>
> The Java API's use of fake ClassTags doesn't seem to cause any problems for 
> Java users, but it can lead to issues when passing JavaRDDs' underlying RDDs 
> to Scala code (e.g. in the MLlib Java API wrapper code).  If we call 
> {{collect()}} on a Scala RDD with an incorrect ClassTag, this causes 
> ClassCastExceptions when we try to allocate an array of the wrong type (for 
> example, see SPARK-2197).
> There are a few possible fixes here.  An API-breaking fix would be to 
> completely remove the fake ClassTags and require Java API users to pass 
> {{java.lang.Class}} instances to all {{parallelize()}} calls and add 
> {{returnClass}} fields to all {{Function}} implementations.  This would be 
> extremely verbose.
> Instead, I propose that we add internal APIs to "repair" a Scala RDD with an 
> incorrect ClassTag by wrapping it and overriding its ClassTag.  This should 
> be okay for cases where the Scala code that calls {{collect()}} knows what 
> type of array should be allocated, which is the case in the MLlib wrappers.
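Editor's note: a minimal sketch of the proposed "repair" idea. `fixClassTag` is a 
hypothetical helper, not an existing Spark API; it relies on the fact that RDD.map 
takes an implicit ClassTag for its result type, so mapping with identity under the 
correct type rebuilds the tag without touching the data:

{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical helper: re-expose an RDD created with a fake ClassTag (e.g. from
// the Java API) under the correct element type so collect() allocates the right
// array type. The map(identity) carries the correct implicit ClassTag[T].
def fixClassTag[T: ClassTag](rdd: RDD[_]): RDD[T] =
  rdd.asInstanceOf[RDD[T]].map(identity)
{code}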



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1040) Collect as Map throws a casting exception when run on a JavaPairRDD object

2015-09-22 Thread Glenn Strycker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903541#comment-14903541
 ] 

Glenn Strycker commented on SPARK-1040:
---

I am getting a similar error in Spark 1.3.0... see a new ticket I created:  
https://issues.apache.org/jira/browse/SPARK-10762

> Collect as Map throws a casting exception when run on a JavaPairRDD object
> --
>
> Key: SPARK-1040
> URL: https://issues.apache.org/jira/browse/SPARK-1040
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 0.9.0
>Reporter: Kevin Mader
>Assignee: Josh Rosen
>Priority: Minor
> Fix For: 0.9.1
>
>
> The error that arises
> {code}
> Exception in thread "main" java.lang.ClassCastException: [Ljava.lang.Object; 
> cannot be cast to [Lscala.Tuple2;
>   at 
> org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:427)
>   at 
> org.apache.spark.api.java.JavaPairRDD.collectAsMap(JavaPairRDD.scala:409)
> {code}
> The code being executed
> {code:java}
> public static String ImageSummary(final JavaPairRDD inImg) {
>   final Set keyList=inImg.collectAsMap().keySet();
>   for(Integer cVal: keyList) outString+=cVal+",";
>   return outString;
>   }
> {code}
> Lines 426-427 of PairRDDFunctions.scala:
> {code:java}
>  def collectAsMap(): Map[K, V] = {
> val data = self.toArray()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10688) Python API for AFTSurvivalRegression

2015-09-22 Thread Kai Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903528#comment-14903528
 ] 

Kai Jiang commented on SPARK-10688:
---

May I give this a try? I'm pretty interested in this one.

> Python API for AFTSurvivalRegression
> 
>
> Key: SPARK-10688
> URL: https://issues.apache.org/jira/browse/SPARK-10688
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Xiangrui Meng
>  Labels: starter
>
> After SPARK-10686, we should add Python API for AFTSurvivalRegression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10762) GenericRowWithSchema exception in casting ArrayBuffer to HashSet in DataFrame to RDD from Hive table

2015-09-22 Thread Glenn Strycker (JIRA)
Glenn Strycker created SPARK-10762:
--

 Summary: GenericRowWithSchema exception in casting ArrayBuffer to 
HashSet in DataFrame to RDD from Hive table
 Key: SPARK-10762
 URL: https://issues.apache.org/jira/browse/SPARK-10762
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Glenn Strycker


I have a Hive table in parquet format that was generated using

{code}
create table myTable (var1 int, var2 string, var3 int, var4 string, var5 
array>) stored as parquet;
{code}

I am able to verify that it was filled -- here is a sample value

{code}
[1, "abcdef", 2, "ghijkl", ArrayBuffer([1, "hello"])]
{code}

I wish to put this into a Spark RDD of the form

{code}
((1,"abcdef"), ((2,"ghijkl"), Set((1,"hello"
{code}

Now, using spark-shell (I get the same problem in spark-submit), I made a test 
RDD with these values

{code}
scala> val tempRDD = sc.parallelize(Seq(((1,"abcdef"),((2,"ghijkl"), 
ArrayBuffer[(Int,String)]((1,"hello"))))))
tempRDD: org.apache.spark.rdd.RDD[((Int, String), ((Int, String), 
scala.collection.mutable.ArrayBuffer[(Int, String)]))] = 
ParallelCollectionRDD[44] at parallelize at :85
{code}

using an iterator, I can cast the ArrayBuffer as a HashSet in the following new 
RDD:

{code}
scala> val tempRDD2 = tempRDD.map(a => (a._1, (a._2._1, { var tempHashSet = new 
HashSet[(Int,String)]; a._2._2.foreach(a => tempHashSet = tempHashSet ++ 
HashSet(a)); tempHashSet } )))
tempRDD2: org.apache.spark.rdd.RDD[((Int, String), ((Int, String), 
scala.collection.immutable.HashSet[(Int, String)]))] = MapPartitionsRDD[46] at 
map at :87

scala> tempRDD2.collect.foreach(println)
((1,abcdef),((2,ghijkl),Set((1,hello))))
{code}

But when I attempt to do the EXACT SAME THING with a DataFrame with a 
HiveContext / SQLContext, I get the following error:

{code}
scala> val hc = new HiveContext(sc)
scala> import hc._
scala> import hc.implicits._

scala> val tempHiveQL = hc.sql("""select var1, var2, var3, var4, var5 from 
myTable""")

scala> val tempRDDfromHive = tempHiveQL.map(a => ((a(0).toString.toInt, 
a(1).toString), ((a(2).toString.toInt, a(3).toString), 
a(4).asInstanceOf[ArrayBuffer[(Int,String)]] )))

scala> val tempRDD3 = tempRDDfromHive.map(a => (a._1, (a._2._1, { var 
tempHashSet = new HashSet[(Int,String)]; a._2._2.foreach(a => tempHashSet = 
tempHashSet ++ HashSet(a)); tempHashSet } )))
tempRDD3: org.apache.spark.rdd.RDD[((Int, String), ((Int, String), 
scala.collection.immutable.HashSet[(Int, String)]))] = MapPartitionsRDD[47] at 
map at :91

scala> tempRDD3.collect.foreach(println)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 14.0 failed 1 times, most recent failure: Lost task 1.0 in stage 14.0 
(TID 5211, localhost): java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
to scala.Tuple2
   at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$anonfun$apply$1.apply(:91)
   at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:91)
   at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:91)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
   at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
   at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
   at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkC

[jira] [Updated] (SPARK-10759) Missing Python code example in ML Programming guide

2015-09-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10759:
--
Labels: starter  (was: )

> Missing Python code example in ML Programming guide
> ---
>
> Key: SPARK-10759
> URL: https://issues.apache.org/jira/browse/SPARK-10759
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Raela Wang
>Assignee: Lauren Moos
>Priority: Minor
>  Labels: starter
>
> http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation
> http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-train-validation-split



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10333) Add user guide for linear-methods.md columns

2015-09-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10333:
--
Assignee: Lauren Moos

> Add user guide for linear-methods.md columns
> 
>
> Key: SPARK-10333
> URL: https://issues.apache.org/jira/browse/SPARK-10333
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Feynman Liang
>Assignee: Lauren Moos
>Priority: Minor
>
> Add example code documenting input/output columns, based on feedback in 
> https://github.com/apache/spark/pull/8491



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10759) Missing Python code example in ML Programming guide

2015-09-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10759:
--
Assignee: Lauren Moos

> Missing Python code example in ML Programming guide
> ---
>
> Key: SPARK-10759
> URL: https://issues.apache.org/jira/browse/SPARK-10759
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Raela Wang
>Assignee: Lauren Moos
>Priority: Minor
>
> http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation
> http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-train-validation-split



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10685) Misaligned data with RDD.zip and DataFrame.withColumn after repartition

2015-09-22 Thread Dan Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903512#comment-14903512
 ] 

Dan Brown commented on SPARK-10685:
---

Thanks for fixing the Python UDF part of the issue!

What about the zip-after-repartition behavior? That still seems like a bug, since 
it's surprising and easy to hit without realizing it.

> Misaligned data with RDD.zip and DataFrame.withColumn after repartition
> ---
>
> Key: SPARK-10685
> URL: https://issues.apache.org/jira/browse/SPARK-10685
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.3.0, 1.4.1, 1.5.0
> Environment: - OSX 10.10.4, java 1.7.0_51, hadoop 2.6.0-cdh5.4.5
> - Ubuntu 12.04, java 1.7.0_80, hadoop 2.6.0-cdh5.4.5
>Reporter: Dan Brown
>Assignee: Reynold Xin
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> Here's a weird behavior where {{RDD.zip}} or {{DataFrame.withColumn}} after a 
> {{repartition}} produces "misaligned" data, meaning different column values 
> in the same row aren't matched, as if a zip shuffled the collections before 
> zipping them. It's difficult to reproduce because it's nondeterministic, 
> doesn't occur in local mode, and requires ≥2 workers (≥3 in one case). I was 
> able to repro it using pyspark 1.3.0 (cdh5.4.5), 1.4.1 (bin-without-hadoop), 
> and 1.5.0 (bin-without-hadoop).
> Here's the most similar issue I was able to find. It appears to not have been 
> repro'd and then closed optimistically, and it smells like it could have been 
> the same underlying cause that was never fixed:
> - https://issues.apache.org/jira/browse/SPARK-9131
> Also, this {{DataFrame.zip}} issue is related in spirit, since we were trying 
> to build it ourselves when we ran into this problem. Let me put in my vote 
> for reopening the issue and supporting {{DataFrame.zip}} in the standard lib.
> - https://issues.apache.org/jira/browse/SPARK-7460
> h3. Brief repro
> Fail: withColumn(udf) after DataFrame.repartition
> {code}
> df = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> [r for r in df.collect() if r.a != r.b][:3] # Should be []
> {code}
> Sample outputs (nondeterministic):
> {code}
> [Row(a=39, b=639), Row(a=139, b=739), Row(a=239, b=839)]
> [Row(a=639, b=39), Row(a=739, b=139), Row(a=839, b=239)]
> []
> [Row(a=641, b=41), Row(a=741, b=141), Row(a=841, b=241)]
> [Row(a=641, b=1343), Row(a=741, b=1443), Row(a=841, b=1543)]
> [Row(a=639, b=39), Row(a=739, b=139), Row(a=839, b=239)]
> {code}
> Fail: RDD.zip after DataFrame.repartition
> {code}
> df  = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df  = df.repartition(100)
> rdd = df.rdd.zip(df.map(lambda r: Row(b=r.a))).map(lambda (x,y): Row(a=x.a, 
> b=y.b))
> [r for r in rdd.collect() if r.a != r.b][:3] # Should be []
> {code}
> Sample outputs (nondeterministic):
> {code}
> []
> [Row(a=50, b=6947), Row(a=150, b=7047), Row(a=250, b=7147)]
> []
> []
> [Row(a=44, b=644), Row(a=144, b=744), Row(a=244, b=844)]
> []
> {code}
> Test setup:
> - local\[8]: {{MASTER=local\[8]}}
> - dist\[N]: 1 driver + 1 master + N workers
> {code}
> "Fail" tests pass?  cluster mode  spark version
> 
> yes local[8]  1.3.0-cdh5.4.5
> no  dist[4]   1.3.0-cdh5.4.5
> yes local[8]  1.4.1
> yes dist[1]   1.4.1
> no  dist[2]   1.4.1
> no  dist[4]   1.4.1
> yes local[8]  1.5.0
> yes dist[1]   1.5.0
> no  dist[2]   1.5.0
> no  dist[4]   1.5.0
> {code}
> h3. Detailed repro
> Start `pyspark` and run these imports:
> {code}
> from pyspark.sql import Row
> from pyspark.sql.functions import udf
> from pyspark.sql.types import IntegerType, StructType, StructField
> {code}
> Fail: withColumn(udf) after DataFrame.repartition
> {code}
> df = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> len([r for r in df.collect() if r.a != r.b]) # Should be 0
> {code}
> Ok: withColumn(udf) after DataFrame.repartition(100) after 1 starting 
> partition
> {code}
> df = sqlCtx.createDataFrame(sc.parallelize((Row(a=a) for a in xrange(1)), 
> numSlices=1))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> len([r for r in df.collect() if r.a != r.b]) # Should be 0
> {code}
> Fail: withColumn(udf) after DataFrame.repartition(100) after 100 starting 
> partitions
> {code}
> df = sqlCtx.createDataFrame(sc.parallelize((Row(a=a) for a 

[jira] [Assigned] (SPARK-10403) UnsafeRowSerializer can't work with UnsafeShuffleManager (tungsten-sort)

2015-09-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-10403:
--

Assignee: Josh Rosen

> UnsafeRowSerializer can't work with UnsafeShuffleManager (tungsten-sort)
> 
>
> Key: SPARK-10403
> URL: https://issues.apache.org/jira/browse/SPARK-10403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Assignee: Josh Rosen
>
> UnsafeRowSerializer relies on an EOF marker in the stream, but UnsafeRowWriter 
> does not write an EOF marker between partitions.
> {code}
> java.io.EOFException
>   at java.io.DataInputStream.readInt(DataInputStream.java:392)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:122)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
>   at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:174)
>   at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$executePartition$1(sort.scala:160)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:99)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10409) Multilayer perceptron regression

2015-09-22 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903480#comment-14903480
 ] 

Xiangrui Meng commented on SPARK-10409:
---

[~lmoos] This is a major feature. Could you work on SPARK-10759 and SPARK-10333 
first? 

> Multilayer perceptron regression
> 
>
> Key: SPARK-10409
> URL: https://issues.apache.org/jira/browse/SPARK-10409
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Implement regression based on a multilayer perceptron (MLP). It should support 
> different kinds of outputs: binary, real in [0, 1), and real in (-inf, +inf). 
> The implementation might take advantage of an autoencoder. Time-series 
> forecasting for financial data could be one use case (see 
> http://dl.acm.org/citation.cfm?id=561452), so more specific requirements from 
> that (or another) domain are needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10607) Scheduler should include defensive measures against infinite loops due to task commit denial

2015-09-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-10607:
---
Target Version/s:   (was: 1.3.2, 1.4.2, 1.5.1)

> Scheduler should include defensive measures against infinite loops due to 
> task commit denial
> 
>
> Key: SPARK-10607
> URL: https://issues.apache.org/jira/browse/SPARK-10607
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Josh Rosen
>Priority: Minor
>
> If OutputCommitter.commitTask() repeatedly fails because the 
> OutputCommitCoordinator denies the right to commit, the scheduler may get 
> stuck in an infinite task retry loop. This happens because DAGScheduler 
> treats CommitDenied failures separately from other failures: they don't count 
> towards the maximum task failure count that can trigger a job failure. The 
> fix is to add an upper bound on the number of times a commit can be denied, 
> as a last-ditch safety net against infinite loops.
> See SPARK-10381 for additional context. This is not a high priority issue to 
> fix right now, since the fix in SPARK-10381 should prevent this scenario from 
> happening in the first place. However, another layer of conservative 
> defensive limits / timeouts certainly would not hurt.
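Editor's note: a sketch of the proposed safety net (illustration only; the 
variable and config names below are hypothetical). The idea is simply to count 
commit denials per task and abort the retry loop once a configurable upper bound 
is exceeded:

{code}
import scala.collection.mutable

// Hypothetical upper bound, e.g. something like spark.scheduler.maxCommitDenials.
val maxCommitDenials = 20

val commitDenialCounts = mutable.Map[Long, Int]().withDefaultValue(0)

// Returns true when the task should be failed for real instead of retried,
// turning CommitDenied into a normal, countable failure past the threshold.
def onCommitDenied(taskId: Long): Boolean = {
  commitDenialCounts(taskId) += 1
  commitDenialCounts(taskId) > maxCommitDenials
}
{code}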



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10607) Scheduler should include defensive measures against infinite loops due to task commit denial

2015-09-22 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903476#comment-14903476
 ] 

Josh Rosen commented on SPARK-10607:


Retargeting; this enhancement doesn't need to be targeted at a specific point 
release.

> Scheduler should include defensive measures against infinite loops due to 
> task commit denial
> 
>
> Key: SPARK-10607
> URL: https://issues.apache.org/jira/browse/SPARK-10607
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Josh Rosen
>Priority: Minor
>
> If OutputCommitter.commitTask() repeatedly fails because the 
> OutputCommitCoordinator denies the right to commit, the scheduler may get 
> stuck in an infinite task retry loop. This happens because DAGScheduler 
> treats CommitDenied failures separately from other failures: they don't count 
> towards the maximum task failure count that can trigger a job failure. The 
> fix is to add an upper bound on the number of times a commit can be denied, 
> as a last-ditch safety net against infinite loops.
> See SPARK-10381 for additional context. This is not a high priority issue to 
> fix right now, since the fix in SPARK-10381 should prevent this scenario from 
> happening in the first place. However, another layer of conservative 
> defensive limits / timeouts certainly would not hurt.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10749) Support multiple roles with Spark Mesos dispatcher

2015-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10749:


Assignee: Apache Spark

> Support multiple roles with Spark Mesos dispatcher
> --
>
> Key: SPARK-10749
> URL: https://issues.apache.org/jira/browse/SPARK-10749
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>Assignee: Apache Spark
>
> Although you can currently set the framework role of the Mesos dispatcher, it 
> doesn't correctly use the offers given to it.
> It should follow how the coarse-/fine-grained schedulers work and accept 
> offers from multiple roles.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10749) Support multiple roles with Spark Mesos dispatcher

2015-09-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903474#comment-14903474
 ] 

Apache Spark commented on SPARK-10749:
--

User 'tnachen' has created a pull request for this issue:
https://github.com/apache/spark/pull/8872

> Support multiple roles with Spark Mesos dispatcher
> --
>
> Key: SPARK-10749
> URL: https://issues.apache.org/jira/browse/SPARK-10749
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>
> Although you can currently set the framework role of the Mesos dispatcher, it 
> doesn't correctly use the offers given to it.
> It should follow how the coarse-/fine-grained schedulers work and accept 
> offers from multiple roles.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10749) Support multiple roles with Spark Mesos dispatcher

2015-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10749:


Assignee: (was: Apache Spark)

> Support multiple roles with Spark Mesos dispatcher
> --
>
> Key: SPARK-10749
> URL: https://issues.apache.org/jira/browse/SPARK-10749
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>
> Although you can currently set the framework role of the Mesos dispatcher, it 
> doesn't correctly use the offers given to it.
> It should follow how the coarse-/fine-grained schedulers work and accept 
> offers from multiple roles.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8447) Test external shuffle service with all shuffle managers

2015-09-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-8447:
--
Target Version/s: 1.6.0  (was: 1.5.1)

> Test external shuffle service with all shuffle managers
> ---
>
> Key: SPARK-8447
> URL: https://issues.apache.org/jira/browse/SPARK-8447
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Priority: Critical
>
> There is a mismatch between the shuffle managers in Spark core and in the 
> external shuffle service. The latest unsafe shuffle manager is an example of 
> this (SPARK-8430). This issue arose because we apparently do not have 
> sufficient tests to make sure that these two components handle the same 
> set of shuffle managers.
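Editor's note: a sketch of what such coverage could look like (assumptions: a test 
environment where local-cluster mode and the external shuffle service are 
available, and a set of manager names matching Spark 1.5):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Run the same shuffling job once per shuffle manager, with the external
// shuffle service enabled, so fetches exercise the service for each manager.
for (shuffleManager <- Seq("hash", "sort", "tungsten-sort")) {
  val conf = new SparkConf()
    .setMaster("local-cluster[2,1,1024]")
    .setAppName(s"external-shuffle-$shuffleManager")
    .set("spark.shuffle.manager", shuffleManager)
    .set("spark.shuffle.service.enabled", "true")
  val sc = new SparkContext(conf)
  try {
    val counts = sc.parallelize(1 to 1000, 4)
      .map(i => (i % 10, 1))
      .reduceByKey(_ + _)
      .collect()
    assert(counts.length == 10)
  } finally {
    sc.stop()
  }
}
{code}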



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10058) Flaky test: HeartbeatReceiverSuite: normal heartbeat

2015-09-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-10058:
---
Target Version/s: 1.6.0  (was: 1.6.0, 1.5.1)

> Flaky test: HeartbeatReceiverSuite: normal heartbeat
> 
>
> Key: SPARK-10058
> URL: https://issues.apache.org/jira/browse/SPARK-10058
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Reporter: Davies Liu
>Assignee: Andrew Or
>Priority: Critical
>  Labels: flaky-test
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.5-SBT/116/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/testReport/junit/org.apache.spark/HeartbeatReceiverSuite/normal_heartbeat/
> {code}
> Error Message
> 3 did not equal 2
> Stacktrace
> sbt.ForkMain$ForkError: 3 did not equal 2
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
>   at 
> org.apache.spark.HeartbeatReceiverSuite$$anonfun$2.apply$mcV$sp(HeartbeatReceiverSuite.scala:104)
>   at 
> org.apache.spark.HeartbeatReceiverSuite$$anonfun$2.apply(HeartbeatReceiverSuite.scala:97)
>   at 
> org.apache.spark.HeartbeatReceiverSuite$$anonfun$2.apply(HeartbeatReceiverSuite.scala:97)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> org.apache.spark.HeartbeatReceiverSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(HeartbeatReceiverSuite.scala:41)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
>   at 
> org.apache.spark.HeartbeatReceiverSuite.runTest(HeartbeatReceiverSuite.scala:41)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at 
> org.apache.spark.HeartbeatReceiverSuite.org$scalatest$BeforeAndAfterAll$$super$run(HeartbeatReceiverSuite.scala:41)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
>   at 
> org.apache.spark.HeartbeatReceiverSuite.run(HeartbeatReceiverSuite.scala:41)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:294)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:284)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 

[jira] [Updated] (SPARK-6701) Flaky test: o.a.s.deploy.yarn.YarnClusterSuite Python application

2015-09-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-6701:
--
Target Version/s:   (was: 1.5.1)

> Flaky test: o.a.s.deploy.yarn.YarnClusterSuite Python application
> -
>
> Key: SPARK-6701
> URL: https://issues.apache.org/jira/browse/SPARK-6701
> Project: Spark
>  Issue Type: Bug
>  Components: Tests, YARN
>Affects Versions: 1.3.0
>Reporter: Andrew Or
>Priority: Critical
>
> Observed in Master and 1.3, both in SBT and in Maven (with YARN).
> {code}
> Process 
> List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
>  --master, yarn-cluster, --num-executors, 1, --properties-file, 
> /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/spark968020731409047027.properties,
>  --py-files, /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/test2.py, 
> /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/test.py, 
> /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/result961582960984674264.tmp) 
> exited with code 1
> sbt.ForkMain$ForkError: Process 
> List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
>  --master, yarn-cluster, --num-executors, 1, --properties-file, 
> /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/spark968020731409047027.properties,
>  --py-files, /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/test2.py, 
> /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/test.py, 
> /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/result961582960984674264.tmp) 
> exited with code 1
>   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1122)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite.org$apache$spark$deploy$yarn$YarnClusterSuite$$runSpark(YarnClusterSuite.scala:259)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply$mcV$sp(YarnClusterSuite.scala:160)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6484) Ganglia metrics xml reporter doesn't escape correctly

2015-09-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-6484:
--
Target Version/s:   (was: 1.5.1)

I'm going to untarget this from 1.5.1 because, as far as I know, this is not a 
problem with newer versions of our Ganglia dependency.

> Ganglia metrics xml reporter doesn't escape correctly
> -
>
> Key: SPARK-6484
> URL: https://issues.apache.org/jira/browse/SPARK-6484
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Michael Armbrust
>Assignee: Josh Rosen
>Priority: Critical
>
> The following should be escaped:
> {code}
> "   "
> '   '
> <   <
> >   >
> &   &
> {code}
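
For reference, a minimal Scala sketch of that escaping (illustrative only; this 
is not the Ganglia reporter's actual code):

{code}
// Replace the five special characters listed above with their XML entities.
def escapeXml(s: String): String = s.flatMap {
  case '"'  => "&quot;"
  case '\'' => "&apos;"
  case '<'  => "&lt;"
  case '>'  => "&gt;"
  case '&'  => "&amp;"
  case c    => c.toString
}
{code}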



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7420) Flaky test: o.a.s.streaming.JobGeneratorSuite "Do not clear received block data too soon"

2015-09-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-7420:
--
Target Version/s:   (was: 1.5.1)

> Flaky test: o.a.s.streaming.JobGeneratorSuite "Do not clear received block 
> data too soon"
> -
>
> Key: SPARK-7420
> URL: https://issues.apache.org/jira/browse/SPARK-7420
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.3.1, 1.4.0
>Reporter: Andrew Or
>Assignee: Tathagata Das
>Priority: Critical
>  Labels: flaky-test
>
> {code}
> The code passed to eventually never returned normally. Attempted 18 times 
> over 10.13803606001 seconds. Last failure message: 
> receiverTracker.hasUnallocatedBlocks was false.
> {code}
> It seems to be failing only in maven.
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-Maven-pre-YARN/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/458/
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/459/
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/2173/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10685) Misaligned data with RDD.zip and DataFrame.withColumn after repartition

2015-09-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-10685:
---
Assignee: Reynold Xin

> Misaligned data with RDD.zip and DataFrame.withColumn after repartition
> ---
>
> Key: SPARK-10685
> URL: https://issues.apache.org/jira/browse/SPARK-10685
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.3.0, 1.4.1, 1.5.0
> Environment: - OSX 10.10.4, java 1.7.0_51, hadoop 2.6.0-cdh5.4.5
> - Ubuntu 12.04, java 1.7.0_80, hadoop 2.6.0-cdh5.4.5
>Reporter: Dan Brown
>Assignee: Reynold Xin
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> Here's a weird behavior where {{RDD.zip}} or {{DataFrame.withColumn}} after a 
> {{repartition}} produces "misaligned" data, meaning different column values 
> in the same row aren't matched, as if a zip shuffled the collections before 
> zipping them. It's difficult to reproduce because it's nondeterministic, 
> doesn't occur in local mode, and requires ≥2 workers (≥3 in one case). I was 
> able to repro it using pyspark 1.3.0 (cdh5.4.5), 1.4.1 (bin-without-hadoop), 
> and 1.5.0 (bin-without-hadoop).
> Here's the most similar issue I was able to find. It appears to not have been 
> repro'd and then closed optimistically, and it smells like it could have been 
> the same underlying cause that was never fixed:
> - https://issues.apache.org/jira/browse/SPARK-9131
> Also, this {{DataFrame.zip}} issue is related in spirit, since we were trying 
> to build it ourselves when we ran into this problem. Let me put in my vote 
> for reopening the issue and supporting {{DataFrame.zip}} in the standard lib.
> - https://issues.apache.org/jira/browse/SPARK-7460
> h3. Brief repro
> Fail: withColumn(udf) after DataFrame.repartition
> {code}
> df = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> [r for r in df.collect() if r.a != r.b][:3] # Should be []
> {code}
> Sample outputs (nondeterministic):
> {code}
> [Row(a=39, b=639), Row(a=139, b=739), Row(a=239, b=839)]
> [Row(a=639, b=39), Row(a=739, b=139), Row(a=839, b=239)]
> []
> [Row(a=641, b=41), Row(a=741, b=141), Row(a=841, b=241)]
> [Row(a=641, b=1343), Row(a=741, b=1443), Row(a=841, b=1543)]
> [Row(a=639, b=39), Row(a=739, b=139), Row(a=839, b=239)]
> {code}
> Fail: RDD.zip after DataFrame.repartition
> {code}
> df  = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df  = df.repartition(100)
> rdd = df.rdd.zip(df.map(lambda r: Row(b=r.a))).map(lambda (x,y): Row(a=x.a, 
> b=y.b))
> [r for r in rdd.collect() if r.a != r.b][:3] # Should be []
> {code}
> Sample outputs (nondeterministic):
> {code}
> []
> [Row(a=50, b=6947), Row(a=150, b=7047), Row(a=250, b=7147)]
> []
> []
> [Row(a=44, b=644), Row(a=144, b=744), Row(a=244, b=844)]
> []
> {code}
> Test setup:
> - local\[8]: {{MASTER=local\[8]}}
> - dist\[N]: 1 driver + 1 master + N workers
> {code}
> "Fail" tests pass?  cluster mode  spark version
> 
> yes local[8]  1.3.0-cdh5.4.5
> no  dist[4]   1.3.0-cdh5.4.5
> yes local[8]  1.4.1
> yes dist[1]   1.4.1
> no  dist[2]   1.4.1
> no  dist[4]   1.4.1
> yes local[8]  1.5.0
> yes dist[1]   1.5.0
> no  dist[2]   1.5.0
> no  dist[4]   1.5.0
> {code}
> h3. Detailed repro
> Start `pyspark` and run these imports:
> {code}
> from pyspark.sql import Row
> from pyspark.sql.functions import udf
> from pyspark.sql.types import IntegerType, StructType, StructField
> {code}
> Fail: withColumn(udf) after DataFrame.repartition
> {code}
> df = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> len([r for r in df.collect() if r.a != r.b]) # Should be 0
> {code}
> Ok: withColumn(udf) after DataFrame.repartition(100) after 1 starting 
> partition
> {code}
> df = sqlCtx.createDataFrame(sc.parallelize((Row(a=a) for a in xrange(1)), 
> numSlices=1))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> len([r for r in df.collect() if r.a != r.b]) # Should be 0
> {code}
> Fail: withColumn(udf) after DataFrame.repartition(100) after 100 starting 
> partitions
> {code}
> df = sqlCtx.createDataFrame(sc.parallelize((Row(a=a) for a in xrange(1)), 
> numSlices=100))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> len([r for r in df.collect() if r.a != r.b]) # Should be 0
> {code}
> Fail: with

[jira] [Resolved] (SPARK-10685) Misaligned data with RDD.zip and DataFrame.withColumn after repartition

2015-09-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-10685.

   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

Issue resolved by pull request 8835
[https://github.com/apache/spark/pull/8835]

> Misaligned data with RDD.zip and DataFrame.withColumn after repartition
> ---
>
> Key: SPARK-10685
> URL: https://issues.apache.org/jira/browse/SPARK-10685
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.3.0, 1.4.1, 1.5.0
> Environment: - OSX 10.10.4, java 1.7.0_51, hadoop 2.6.0-cdh5.4.5
> - Ubuntu 12.04, java 1.7.0_80, hadoop 2.6.0-cdh5.4.5
>Reporter: Dan Brown
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> Here's a weird behavior where {{RDD.zip}} or {{DataFrame.withColumn}} after a 
> {{repartition}} produces "misaligned" data, meaning different column values 
> in the same row aren't matched, as if a zip shuffled the collections before 
> zipping them. It's difficult to reproduce because it's nondeterministic, 
> doesn't occur in local mode, and requires ≥2 workers (≥3 in one case). I was 
> able to repro it using pyspark 1.3.0 (cdh5.4.5), 1.4.1 (bin-without-hadoop), 
> and 1.5.0 (bin-without-hadoop).
> Here's the most similar issue I was able to find. It appears to not have been 
> repro'd and then closed optimistically, and it smells like it could have been 
> the same underlying cause that was never fixed:
> - https://issues.apache.org/jira/browse/SPARK-9131
> Also, this {{DataFrame.zip}} issue is related in spirit, since we were trying 
> to build it ourselves when we ran into this problem. Let me put in my vote 
> for reopening the issue and supporting {{DataFrame.zip}} in the standard lib.
> - https://issues.apache.org/jira/browse/SPARK-7460
> h3. Brief repro
> Fail: withColumn(udf) after DataFrame.repartition
> {code}
> df = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> [r for r in df.collect() if r.a != r.b][:3] # Should be []
> {code}
> Sample outputs (nondeterministic):
> {code}
> [Row(a=39, b=639), Row(a=139, b=739), Row(a=239, b=839)]
> [Row(a=639, b=39), Row(a=739, b=139), Row(a=839, b=239)]
> []
> [Row(a=641, b=41), Row(a=741, b=141), Row(a=841, b=241)]
> [Row(a=641, b=1343), Row(a=741, b=1443), Row(a=841, b=1543)]
> [Row(a=639, b=39), Row(a=739, b=139), Row(a=839, b=239)]
> {code}
> Fail: RDD.zip after DataFrame.repartition
> {code}
> df  = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df  = df.repartition(100)
> rdd = df.rdd.zip(df.map(lambda r: Row(b=r.a))).map(lambda (x,y): Row(a=x.a, 
> b=y.b))
> [r for r in rdd.collect() if r.a != r.b][:3] # Should be []
> {code}
> Sample outputs (nondeterministic):
> {code}
> []
> [Row(a=50, b=6947), Row(a=150, b=7047), Row(a=250, b=7147)]
> []
> []
> [Row(a=44, b=644), Row(a=144, b=744), Row(a=244, b=844)]
> []
> {code}
> Test setup:
> - local\[8]: {{MASTER=local\[8]}}
> - dist\[N]: 1 driver + 1 master + N workers
> {code}
> "Fail" tests pass?  cluster mode  spark version
> 
> yes local[8]  1.3.0-cdh5.4.5
> no  dist[4]   1.3.0-cdh5.4.5
> yes local[8]  1.4.1
> yes dist[1]   1.4.1
> no  dist[2]   1.4.1
> no  dist[4]   1.4.1
> yes local[8]  1.5.0
> yes dist[1]   1.5.0
> no  dist[2]   1.5.0
> no  dist[4]   1.5.0
> {code}
> h3. Detailed repro
> Start `pyspark` and run these imports:
> {code}
> from pyspark.sql import Row
> from pyspark.sql.functions import udf
> from pyspark.sql.types import IntegerType, StructType, StructField
> {code}
> Fail: withColumn(udf) after DataFrame.repartition
> {code}
> df = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> len([r for r in df.collect() if r.a != r.b]) # Should be 0
> {code}
> Ok: withColumn(udf) after DataFrame.repartition(100) after 1 starting 
> partition
> {code}
> df = sqlCtx.createDataFrame(sc.parallelize((Row(a=a) for a in xrange(1)), 
> numSlices=1))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> len([r for r in df.collect() if r.a != r.b]) # Should be 0
> {code}
> Fail: withColumn(udf) after DataFrame.repartition(100) after 100 starting 
> partitions
> {code}
> df = sqlCtx.createDataFrame(sc.parallelize((Row(a=a) for a in xrange(1)), 
> numSlices=100))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, Integer

[jira] [Resolved] (SPARK-10714) Refactor PythonRDD to decouple iterator computation from PythonRDD

2015-09-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-10714.

   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

Issue resolved by pull request 8835
[https://github.com/apache/spark/pull/8835]

> Refactor PythonRDD to decouple iterator computation from PythonRDD
> --
>
> Key: SPARK-10714
> URL: https://issues.apache.org/jira/browse/SPARK-10714
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.6.0, 1.5.1
>
>
> The idea is that most of the logic of calling Python actually has nothing to 
> do with RDD (it is really just communicating with a socket -- there is 
> nothing distributed about it); it currently depends on RDD only because it 
> was written that way.
> If we extract that functionality out, we can apply it to areas of the code 
> that don't depend on RDDs, and also make it easier to test.
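
A rough sketch of the kind of decoupling described above (hypothetical names 
only; this is not the actual refactoring from the linked pull request):

{code}
// Hypothetical sketch: the Python-invocation logic becomes a plain
// iterator-to-iterator transformation, with no dependency on RDD.
class PythonIteratorRunner(command: Array[Byte]) {
  // In the real code this would feed `input` to a Python worker over a socket
  // and stream the results back; here it is just a placeholder pass-through.
  def run(input: Iterator[Array[Byte]]): Iterator[Array[Byte]] = input
}

// An RDD-based caller is then only a thin wrapper, e.g.:
//   rdd.mapPartitions(iter => new PythonIteratorRunner(command).run(iter))
// and the same runner can be reused (and unit tested) without any RDD.
{code}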



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8632) Poor Python UDF performance because of RDD caching

2015-09-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-8632.
---
   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

Issue resolved by pull request 8835
[https://github.com/apache/spark/pull/8835]

> Poor Python UDF performance because of RDD caching
> --
>
> Key: SPARK-8632
> URL: https://issues.apache.org/jira/browse/SPARK-8632
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Justin Uang
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> {quote}
> We have been running into performance problems using Python UDFs with 
> DataFrames at large scale.
> From the implementation of BatchPythonEvaluation, it looks like the goal was 
> to reuse the PythonRDD code. It caches the entire child RDD so that it can do 
> two passes over the data. One to give to the PythonRDD, then one to join the 
> python lambda results with the original row (which may have java objects that 
> should be passed through).
> In addition, it caches all the columns, even the ones that don't need to be 
> processed by the Python UDF. In the cases I was working with, I had a 500 
> column table, and I wanted to use a Python UDF for one column, and it ended 
> up caching all 500 columns. 
> {quote}
> http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10409) Multilayer perceptron regression

2015-09-22 Thread Lauren Moos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903451#comment-14903451
 ] 

Lauren Moos commented on SPARK-10409:
-

I'd be happy to work on this

> Multilayer perceptron regression
> 
>
> Key: SPARK-10409
> URL: https://issues.apache.org/jira/browse/SPARK-10409
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Implement regression based on a multilayer perceptron (MLP). It should 
> support different kinds of outputs: binary, real values in [0; 1), and real 
> values in (-inf; +inf). The implementation might take advantage of an 
> autoencoder. Time-series forecasting for financial data might be one of the 
> use cases (see http://dl.acm.org/citation.cfm?id=561452), so there is a need 
> for more specific requirements from this (or another) area.
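
Purely as an illustration of the three output kinds mentioned above, they 
roughly correspond to different output-layer activations. The names below are 
made up and nothing here exists in Spark:

{code}
// Made-up sketch of output activations for an MLP regressor.
object OutputActivations {
  // real values in (-inf; +inf): identity output (typically with squared error)
  val identity: Double => Double = x => x

  // real values in [0; 1): logistic (sigmoid) output
  val sigmoid: Double => Double = x => 1.0 / (1.0 + math.exp(-x))

  // binary output: threshold the sigmoid at 0.5
  val binary: Double => Double = x => if (sigmoid(x) >= 0.5) 1.0 else 0.0
}
{code}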



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10761) Refactor DiskBlockObjectWriter to not require BlockId

2015-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10761:


Assignee: Josh Rosen  (was: Apache Spark)

> Refactor DiskBlockObjectWriter to not require BlockId
> -
>
> Key: SPARK-10761
> URL: https://issues.apache.org/jira/browse/SPARK-10761
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
>
> The DiskBlockObjectWriter constructor takes a BlockId parameter but never 
> uses it internally. We should try to clean this up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10761) Refactor DiskBlockObjectWriter to not require BlockId

2015-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10761:


Assignee: Apache Spark  (was: Josh Rosen)

> Refactor DiskBlockObjectWriter to not require BlockId
> -
>
> Key: SPARK-10761
> URL: https://issues.apache.org/jira/browse/SPARK-10761
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager
>Reporter: Josh Rosen
>Assignee: Apache Spark
>Priority: Minor
>
> The DiskBlockObjectWriter constructor takes a BlockId parameter but never 
> uses it internally. We should try to clean this up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10333) Add user guide for linear-methods.md columns

2015-09-22 Thread Lauren Moos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903448#comment-14903448
 ] 

Lauren Moos commented on SPARK-10333:
-

I'd be happy to work on this 

> Add user guide for linear-methods.md columns
> 
>
> Key: SPARK-10333
> URL: https://issues.apache.org/jira/browse/SPARK-10333
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Feynman Liang
>Priority: Minor
>
> Add example code to document input/output columns, based on feedback from 
> https://github.com/apache/spark/pull/8491



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10761) Refactor DiskBlockObjectWriter to not require BlockId

2015-09-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903450#comment-14903450
 ] 

Apache Spark commented on SPARK-10761:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/8871

> Refactor DiskBlockObjectWriter to not require BlockId
> -
>
> Key: SPARK-10761
> URL: https://issues.apache.org/jira/browse/SPARK-10761
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
>
> The DiskBlockObjectWriter constructor takes a BlockId parameter but never 
> uses it internally. We should try to clean this up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10761) Refactor DiskBlockObjectWriter to not require BlockId

2015-09-22 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-10761:
--

 Summary: Refactor DiskBlockObjectWriter to not require BlockId
 Key: SPARK-10761
 URL: https://issues.apache.org/jira/browse/SPARK-10761
 Project: Spark
  Issue Type: Sub-task
  Components: Block Manager
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Minor


The DiskBlockObjectWriter constructor takes a BlockId parameter but never uses 
it internally. We should try to clean this up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10381) Infinite loop when OutputCommitCoordination is enabled and OutputCommitter.commitTask throws exception

2015-09-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-10381:
---
Fix Version/s: 1.3.2

> Infinite loop when OutputCommitCoordination is enabled and 
> OutputCommitter.commitTask throws exception
> --
>
> Key: SPARK-10381
> URL: https://issues.apache.org/jira/browse/SPARK-10381
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.3.2, 1.4.2, 1.6.0, 1.5.1
>
>
> When speculative execution is enabled, consider a scenario where the 
> authorized committer of a particular output partition fails during the 
> OutputCommitter.commitTask() call. In this case, the OutputCommitCoordinator 
> is supposed to release that committer's exclusive lock on committing once 
> that task fails. However, due to a unit mismatch the lock will not be 
> released, causing Spark to go into an infinite retry loop.
> This bug was masked by the fact that the OutputCommitCoordinator does not 
> have enough end-to-end tests (the current tests use many mocks). Another 
> factor contributing to this bug is that we have many similarly-named 
> identifiers with different semantics but the same data types (e.g. 
> attemptNumber and taskAttemptId), along with inconsistent variable naming 
> that makes them difficult to distinguish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10737) When using UnsafeRows, SortMergeJoin may return wrong results

2015-09-22 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-10737.
--
   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

Issue resolved by pull request 8854
[https://github.com/apache/spark/pull/8854]

> When using UnsafeRows, SortMergeJoin may return wrong results
> -
>
> Key: SPARK-10737
> URL: https://issues.apache.org/jira/browse/SPARK-10737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> {code}
> val df1 = (1 to 10).map(i => (s"str_$i", i)).toDF("i", "j")
> val df2 =
>   df1
>   .join(df1.select(df1("i")), "i")
>   .select(df1("i"), df1("j"))
> val df3 = df2.withColumnRenamed("i", "i1").withColumnRenamed("j", "j1")
> val df4 =
>   df2
>   .join(df3, df2("i") === df3("i1"))
>   .withColumn("diff", $"j" - $"j1")
> df4.show(100, false)
> +--+---+--+---++
> |i |j  |i1|j1 |diff|
> +--+---+--+---++
> |str_2 |2  |str_2 |2  |0   |
> |str_7 |7  |str_2 |2  |5   |
> |str_10|10 |str_10|10 |0   |
> |str_3 |3  |str_3 |3  |0   |
> |str_8 |8  |str_3 |3  |5   |
> |str_4 |4  |str_4 |4  |0   |
> |str_9 |9  |str_4 |4  |5   |
> |str_5 |5  |str_5 |5  |0   |
> |str_1 |1  |str_1 |1  |0   |
> |str_6 |6  |str_1 |1  |5   |
> +--+---+--+---++
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10672) We should not fail to create a table If we cannot persist metadata of a data source table to metastore in a Hive compatible way

2015-09-22 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-10672.
--
   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

Issue resolved by pull request 8824
[https://github.com/apache/spark/pull/8824]

> We should not fail to create a table If we cannot persist metadata of a data 
> source table to metastore in a Hive compatible way
> ---
>
> Key: SPARK-10672
> URL: https://issues.apache.org/jira/browse/SPARK-10672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> It is possible that Hive has some internal restrictions on what kinds of 
> table metadata it accepts (e.g. Hive 0.13 does not support decimal stored in 
> parquet). If that is the case, we should not fail just because we cannot 
> persist the metadata in a Hive compatible way; we should fall back to saving 
> it in the Spark SQL specific format.
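
A minimal sketch of that fallback behaviour (hypothetical helper names; this 
is not the actual HiveMetastoreCatalog code):

{code}
// Try the Hive-compatible representation first; if Hive rejects it, fall back
// to the Spark SQL specific format instead of failing the whole CREATE TABLE.
object TablePersistenceSketch {
  def saveHiveCompatible(table: String): Unit =
    throw new UnsupportedOperationException(s"Hive rejected metadata for $table")

  def saveSparkSqlSpecific(table: String): Unit =
    println(s"Saved $table in the Spark SQL specific format")

  def persistTable(table: String): Unit =
    try saveHiveCompatible(table)
    catch {
      case e: Exception =>
        println(s"Could not persist $table in a Hive compatible way, " +
          s"falling back: ${e.getMessage}")
        saveSparkSqlSpecific(table)
    }
}
{code}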



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7129) Add generic boosting algorithm to spark.ml

2015-09-22 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903375#comment-14903375
 ] 

Meihua Wu commented on SPARK-7129:
--

[~sethah] Thank you very much for the write up! That is a very good starting 
point and I will get back to you. 

> Add generic boosting algorithm to spark.ml
> --
>
> Key: SPARK-7129
> URL: https://issues.apache.org/jira/browse/SPARK-7129
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> The Pipelines API will make it easier to create a generic Boosting algorithm 
> which can work with any Classifier or Regressor. Creating this feature will 
> require researching the possible variants and extensions of boosting which we 
> may want to support now and/or in the future, and planning an API which will 
> be properly extensible.
> In particular, it will be important to think about supporting:
> * multiple loss functions (for AdaBoost, LogitBoost, gradient boosting, etc.)
> * multiclass variants
> * multilabel variants (which will probably be in a separate class and JIRA)
> * For more esoteric variants, we should consider them but not design too much 
> around them: totally corrective boosting, cascaded models
> Note: This may interact some with the existing tree ensemble methods, but it 
> should be largely separate since the tree ensemble APIs and implementations 
> are specialized for trees.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10688) Python API for AFTSurvivalRegression

2015-09-22 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903357#comment-14903357
 ] 

Gayathri Murali commented on SPARK-10688:
-

If there isn't anyone working on it, I would like to work on this.

> Python API for AFTSurvivalRegression
> 
>
> Key: SPARK-10688
> URL: https://issues.apache.org/jira/browse/SPARK-10688
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Xiangrui Meng
>  Labels: starter
>
> After SPARK-10686, we should add Python API for AFTSurvivalRegression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

2015-09-22 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903350#comment-14903350
 ] 

Joseph K. Bradley commented on SPARK-8418:
--

New idea: We could allow transformers to leverage RFormula. That might be the 
nicest way to specify a bunch of columns and leverage existing code for 
assembling them.
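
For reference, a short sketch of what leveraging RFormula for several input 
columns looks like with the existing spark.ml API (the toy data and column 
names are made up; assumes a spark-shell style `sqlContext`):

{code}
import org.apache.spark.ml.feature.RFormula

// Toy data: assemble "country" and "hour" into one features vector and derive
// the label column from "clicked" in a single step.
val dataset = sqlContext.createDataFrame(Seq(
  (1.0, "US", 18),
  (0.0, "CA", 12)
)).toDF("clicked", "country", "hour")

val formula = new RFormula()
  .setFormula("clicked ~ country + hour")
  .setFeaturesCol("features")
  .setLabelCol("label")

formula.fit(dataset).transform(dataset).show()
{code}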

> Add single- and multi-value support to ML Transformers
> --
>
> Key: SPARK-8418
> URL: https://issues.apache.org/jira/browse/SPARK-8418
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> It would be convenient if all feature transformers supported transforming 
> columns of single values and multiple values, specifically:
> * one column with one value (e.g., type {{Double}})
> * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}})
> We could go as far as supporting multiple columns, but that may not be 
> necessary since VectorAssembler could be used to handle that.
> Estimators under {{ml.feature}} should also support this.
> This will likely require a short design doc to describe:
> * how input and output columns will be specified
> * schema validation
> * code sharing to reduce duplication



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10759) Missing Python code example in ML Programming guide

2015-09-22 Thread Lauren Moos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903263#comment-14903263
 ] 

Lauren Moos commented on SPARK-10759:
-

I can work on this 

> Missing Python code example in ML Programming guide
> ---
>
> Key: SPARK-10759
> URL: https://issues.apache.org/jira/browse/SPARK-10759
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Raela Wang
>Priority: Minor
>
> http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation
> http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-train-validation-split



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10740) handle nondeterministic expressions correctly for set operations

2015-09-22 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10740:
-
Assignee: Wenchen Fan

> handle nondeterministic expressions correctly for set operations
> 
>
> Key: SPARK-10740
> URL: https://issues.apache.org/jira/browse/SPARK-10740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 1.6.0, 1.5.1
>
>
> We should only push down deterministic filter conditions to set operators.
> For Union, say we apply a non-deterministic filter to 1...5 union 1...5: we 
> may keep 1,3 from the left side and 2,4 from the right side, so the result 
> should be 1,3,2,4. If we push this filter down, both sides keep 1,3 (each 
> side creates a new random object with the same seed) and the result would be 
> 1,3,1,3.
> For Intersect, say there is a non-deterministic condition with a 0.5 
> probability of accepting a row, and a row is present on both sides of the 
> Intersect. Once we push down this condition, the probability of accepting 
> that row drops to 0.25.
> For Except, say a row is present on both sides of the Except. That row 
> should not appear in the final output. However, if we push down a 
> non-deterministic condition, it is possible that the row is rejected on only 
> one side, and then we output a row that should not be part of the result.
> We should only push down deterministic projections to Union.
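
To make the Union case above concrete, a small sketch (assumes a 1.5-era 
spark-shell with `sqlContext`; the seed and threshold are arbitrary):

{code}
import org.apache.spark.sql.functions.rand

// Two identical sides, 1..5 each.
val left  = sqlContext.range(1, 6).toDF("x")
val right = sqlContext.range(1, 6).toDF("x")

// Correct plan: the filter is evaluated once per row of the union result, so
// the rows kept from the two sides are independent draws.
val filteredAfterUnion = left.unionAll(right).filter(rand(42) < 0.5)

// What an (incorrect) pushdown would compute: each side evaluates its own
// rand(42) stream from the same seed, so both sides keep the same rows.
val pushedDown =
  left.filter(rand(42) < 0.5).unionAll(right.filter(rand(42) < 0.5))
{code}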



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10740) handle nondeterministic expressions correctly for set operations

2015-09-22 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-10740.
--
   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

Issue resolved by pull request 8858
[https://github.com/apache/spark/pull/8858]

> handle nondeterministic expressions correctly for set operations
> 
>
> Key: SPARK-10740
> URL: https://issues.apache.org/jira/browse/SPARK-10740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 1.6.0, 1.5.1
>
>
> We should only push down deterministic filter conditions to set operators.
> For Union, say we apply a non-deterministic filter to 1...5 union 1...5: we 
> may keep 1,3 from the left side and 2,4 from the right side, so the result 
> should be 1,3,2,4. If we push this filter down, both sides keep 1,3 (each 
> side creates a new random object with the same seed) and the result would be 
> 1,3,1,3.
> For Intersect, say there is a non-deterministic condition with a 0.5 
> probability of accepting a row, and a row is present on both sides of the 
> Intersect. Once we push down this condition, the probability of accepting 
> that row drops to 0.25.
> For Except, say a row is present on both sides of the Except. That row 
> should not appear in the final output. However, if we push down a 
> non-deterministic condition, it is possible that the row is rejected on only 
> one side, and then we output a row that should not be part of the result.
> We should only push down deterministic projections to Union.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10740) handle nondeterministic expressions correctly for set operations

2015-09-22 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10740:
-
Target Version/s: 1.6.0, 1.5.1

> handle nondeterministic expressions correctly for set operations
> 
>
> Key: SPARK-10740
> URL: https://issues.apache.org/jira/browse/SPARK-10740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> We should only push down deterministic filter conditions to set operators.
> For Union, say we apply a non-deterministic filter to 1...5 union 1...5: we 
> may keep 1,3 from the left side and 2,4 from the right side, so the result 
> should be 1,3,2,4. If we push this filter down, both sides keep 1,3 (each 
> side creates a new random object with the same seed) and the result would be 
> 1,3,1,3.
> For Intersect, say there is a non-deterministic condition with a 0.5 
> probability of accepting a row, and a row is present on both sides of the 
> Intersect. Once we push down this condition, the probability of accepting 
> that row drops to 0.25.
> For Except, say a row is present on both sides of the Except. That row 
> should not appear in the final output. However, if we push down a 
> non-deterministic condition, it is possible that the row is rejected on only 
> one side, and then we output a row that should not be part of the result.
> We should only push down deterministic projections to Union.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9442) java.lang.ArithmeticException: / by zero when reading Parquet

2015-09-22 Thread Chris Heller (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903260#comment-14903260
 ] 

Chris Heller commented on SPARK-9442:
-

Curious if the issue seen here was with a parquet file created with a small 
block size? I just ran into a similar case with a nested schema, but had no 
problems on many of the files, except for a larger file -- all were written 
with a block size of 16MB.

> java.lang.ArithmeticException: / by zero when reading Parquet
> -
>
> Key: SPARK-9442
> URL: https://issues.apache.org/jira/browse/SPARK-9442
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: DB Tsai
>
> I am counting the records in my nested parquet file, which has this schema:
> {code}
> scala> u1aTesting.printSchema
> root
>  |-- profileId: long (nullable = true)
>  |-- country: string (nullable = true)
>  |-- data: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- videoId: long (nullable = true)
>  |||-- date: long (nullable = true)
>  |||-- label: double (nullable = true)
>  |||-- weight: double (nullable = true)
>  |||-- features: vector (nullable = true)
> {code}
> The number of records in the nested data array is around 10k, and each 
> parquet file is around 600MB. The total size is around 120GB. 
> I am doing a simple count:
> {code}
> scala> u1aTesting.count
> parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in 
> file 
> hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet
>   at 
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
>   at 
> parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
>   at 
> org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:129)
>   at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:126)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArithmeticException: / by zero
>   at 
> parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:109)
>   at 
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
>   ... 21 more
> {code}
> BTW, not all the tasks fail; some of them are successful. 
> Another note: counting by explicitly looping through the data works:
> {code}
> sqlContext.read.load(hdfsPath + s"/testing/u1snappy/${date}/").map(x => 
> 1L).reduce((x, y) => x + y) 
> {code}
> I think maybe some metadata in the parquet files is corrupted. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10740) handle nondeterministic expressions correctly for set operations

2015-09-22 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10740:
-
Priority: Blocker  (was: Major)

> handle nondeterministic expressions correctly for set operations
> 
>
> Key: SPARK-10740
> URL: https://issues.apache.org/jira/browse/SPARK-10740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> We should only push down deterministic filter conditions to set operators.
> For Union, say we apply a non-deterministic filter to 1...5 union 1...5: we 
> may keep 1,3 from the left side and 2,4 from the right side, so the result 
> should be 1,3,2,4. If we push this filter down, both sides keep 1,3 (each 
> side creates a new random object with the same seed) and the result would be 
> 1,3,1,3.
> For Intersect, say there is a non-deterministic condition with a 0.5 
> probability of accepting a row, and a row is present on both sides of the 
> Intersect. Once we push down this condition, the probability of accepting 
> that row drops to 0.25.
> For Except, say a row is present on both sides of the Except. That row 
> should not appear in the final output. However, if we push down a 
> non-deterministic condition, it is possible that the row is rejected on only 
> one side, and then we output a row that should not be part of the result.
> We should only push down deterministic projections to Union.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10732) Starting spark streaming from a specific point in time.

2015-09-22 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903231#comment-14903231
 ] 

Cody Koeninger edited comment on SPARK-10732 at 9/22/15 7:02 PM:
-

Yeah, even if that gets implemented it will likely be at 1 minute granularity, 
which might be ok for some failure recovery situations but is unlikely to work 
for your SPARK-10734 ticket


was (Author: c...@koeninger.org):
Yeah, even if that gets implemented it will likely be at 1 minute granularity, 
which might be ok for some failure recovery situations but is unlikely to work 
for your SPARK-1074 ticket

> Starting spark streaming from a specific point in time.
> ---
>
> Key: SPARK-10732
> URL: https://issues.apache.org/jira/browse/SPARK-10732
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.5.0
>Reporter: Bijay Singh Bisht
>
> Currently, spark streaming either starts from current time or from the latest 
> checkpoint. It would be extremely useful to start from any arbitrary point. 
> This would be useful in replay scenarios or in running regression tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10732) Starting spark streaming from a specific point in time.

2015-09-22 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903231#comment-14903231
 ] 

Cody Koeninger commented on SPARK-10732:


Yeah, even if that gets implemented it will likely be at 1 minute granularity, 
which might be ok for some failure recovery situations but is unlikely to work 
for your SPARK-1074 ticket

> Starting spark streaming from a specific point in time.
> ---
>
> Key: SPARK-10732
> URL: https://issues.apache.org/jira/browse/SPARK-10732
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.5.0
>Reporter: Bijay Singh Bisht
>
> Currently, spark streaming either starts from current time or from the latest 
> checkpoint. It would be extremely useful to start from any arbitrary point. 
> This would be useful in replay scenarios or in running regression tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10704) Rename HashShufflereader to BlockStoreShuffleReader

2015-09-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10704.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

> Rename HashShufflereader to BlockStoreShuffleReader
> ---
>
> Key: SPARK-10704
> URL: https://issues.apache.org/jira/browse/SPARK-10704
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.6.0
>
>
> The current shuffle code has an interface named ShuffleReader with only one 
> implementation, HashShuffleReader. This naming is confusing, since the same 
> read path code is used for both sort- and hash-based shuffle. -We should 
> consolidate these classes.- We should rename HashShuffleReader.
> --In addition, there are aspects of ShuffleManager.getReader()'s API which 
> don't make a lot of sense: it exposes the ability to request a contiguous 
> range of shuffle partitions, but this feature isn't supported by any 
> ShuffleReader implementations and isn't used anywhere in the existing code. 
> We should clean this up, too.--



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10485) IF expression is not correctly resolved when one of the options have NullType

2015-09-22 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-10485.
--
Resolution: Fixed

I tested on 1.5 and it seems fixed to me. Please reopen if you have a 
reproduction that still fails.

> IF expression is not correctly resolved when one of the options have NullType
> -
>
> Key: SPARK-10485
> URL: https://issues.apache.org/jira/browse/SPARK-10485
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Antonio Jesus Navarro
>
> If we have this query:
> {code}
> SELECT IF(column > 1, 1, NULL) FROM T1
> {code}
> On Spark 1.4.1 we have this:
> {code}
> override lazy val resolved = childrenResolved && trueValue.dataType == 
> falseValue.dataType
> {code}
> So if one of the types is NullType, the if expression is not resolved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10732) Starting spark streaming from a specific point in time.

2015-09-22 Thread Bijay Singh Bisht (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903183#comment-14903183
 ] 

Bijay Singh Bisht commented on SPARK-10732:
---

I get it. Apparently there is a discussion about adding explicit, granular 
timestamp-based support: 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index#KIP-33-Addatimebasedlogindex-Usecasediscussion.
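
Until something like that is available, the usual workaround with the direct 
stream is to resolve the desired offsets yourself and pass them explicitly; a 
sketch against the 1.x Scala API (the broker, topic, offsets and batch 
interval below are placeholders, and `sc` is assumed from a spark-shell):

{code}
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Placeholder values: in practice the offsets would come from an external
// time-to-offset mapping maintained by the application.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val fromOffsets = Map(TopicAndPartition("events", 0) -> 12345L)

val ssc = new StreamingContext(sc, Seconds(10))
val stream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))
{code}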

> Starting spark streaming from a specific point in time.
> ---
>
> Key: SPARK-10732
> URL: https://issues.apache.org/jira/browse/SPARK-10732
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.5.0
>Reporter: Bijay Singh Bisht
>
> Currently, spark streaming either starts from current time or from the latest 
> checkpoint. It would be extremely useful to start from any arbitrary point. 
> This would be useful in replay scenarios or in running regression tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7129) Add generic boosting algorithm to spark.ml

2015-09-22 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903180#comment-14903180
 ] 

Seth Hendrickson commented on SPARK-7129:
-

I had some time to give this topic some thought and started constructing [a 
document with notes on a generic boosting 
architecture|https://docs.google.com/document/d/1Zeoj99gwiJBF0JWL8170KicVB0U5xtUOk6VUeFj0Nz8/edit]
 and some of the concerns it raises. I don't think this is acceptable as a 
design doc because it's a bit wordy and it doesn't make an effort to follow the 
structure of other design docs, but hopefully [~meihuawu] can find something 
useful in it.

I found the [R mboost 
vignette|https://cran.r-project.org/web/packages/mboost/vignettes/mboost_tutorial.pdf]
 to be a good starting point. I'm still learning the ML package, but I'd love 
to be involved in the discussion and potentially take on some of the code tasks 
once we get there.

> Add generic boosting algorithm to spark.ml
> --
>
> Key: SPARK-7129
> URL: https://issues.apache.org/jira/browse/SPARK-7129
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> The Pipelines API will make it easier to create a generic Boosting algorithm 
> which can work with any Classifier or Regressor. Creating this feature will 
> require researching the possible variants and extensions of boosting which we 
> may want to support now and/or in the future, and planning an API which will 
> be properly extensible.
> In particular, it will be important to think about supporting:
> * multiple loss functions (for AdaBoost, LogitBoost, gradient boosting, etc.)
> * multiclass variants
> * multilabel variants (which will probably be in a separate class and JIRA)
> * For more esoteric variants, we should consider them but not design too much 
> around them: totally corrective boosting, cascaded models
> Note: This may interact some with the existing tree ensemble methods, but it 
> should be largely separate since the tree ensemble APIs and implementations 
> are specialized for trees.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7129) Add generic boosting algorithm to spark.ml

2015-09-22 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903136#comment-14903136
 ] 

Joseph K. Bradley commented on SPARK-7129:
--

It's not really on the roadmap for 1.6, so I shouldn't make promises. The main 
issues for me are some design questions:
* Should boosting depend on the prediction abstractions (Classifier, Regressor, 
etc.)?  If so, are those abstractions sufficient, or should they be turned into 
traits?

If you're interested, it would be valuable to get your input on designing the 
abstractions.  Would you be able to write a short design doc?  I figure we 
should:
* List the boosting algorithms of interest
* List what requirements those algorithms place on the base learner
* Design minimal abstractions which describe those requirements
* See how those abstractions compare with MLlib's current abstractions, and if 
we need to rethink them

If you have time for that, it'd be great if you could post it here as a Google 
doc or PDF to collect feedback.  Thanks!
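
To make the abstraction question a bit more concrete, here is one possible 
shape for what a boosting loop needs from its base learner, sketched with 
made-up trait names (not an actual spark.ml proposal):

{code}
// Made-up sketch of the minimal contract a generic boosting loop needs,
// independent of spark.ml's concrete Classifier/Regressor classes.
trait Loss {
  // Gradient of the loss at one example, e.g. logistic or squared error.
  def gradient(label: Double, prediction: Double): Double
}

trait WeakLearner[M] {
  // Fit against the weighted pseudo-residuals supplied by the boosting loop.
  def fit(features: Seq[Array[Double]], targets: Seq[Double],
          weights: Seq[Double]): M
  def predict(model: M, features: Array[Double]): Double
}

// One boosting iteration then only needs: current predictions -> gradients ->
// fit a weak learner on the gradients -> add it to the ensemble with a step
// size. The open question is how much of this maps onto Classifier/Regressor.
{code}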

> Add generic boosting algorithm to spark.ml
> --
>
> Key: SPARK-7129
> URL: https://issues.apache.org/jira/browse/SPARK-7129
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> The Pipelines API will make it easier to create a generic Boosting algorithm 
> which can work with any Classifier or Regressor. Creating this feature will 
> require researching the possible variants and extensions of boosting which we 
> may want to support now and/or in the future, and planning an API which will 
> be properly extensible.
> In particular, it will be important to think about supporting:
> * multiple loss functions (for AdaBoost, LogitBoost, gradient boosting, etc.)
> * multiclass variants
> * multilabel variants (which will probably be in a separate class and JIRA)
> * For more esoteric variants, we should consider them but not design too much 
> around them: totally corrective boosting, cascaded models
> Note: This may interact some with the existing tree ensemble methods, but it 
> should be largely separate since the tree ensemble APIs and implementations 
> are specialized for trees.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10593) sql lateral view same name gives wrong value

2015-09-22 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-10593.
--
   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

Issue resolved by pull request 8755
[https://github.com/apache/spark/pull/8755]

> sql lateral view same name gives wrong value
> 
>
> Key: SPARK-10593
> URL: https://issues.apache.org/jira/browse/SPARK-10593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> This query will return wrong result:
> {code}
> select 
> insideLayer1.json as json_insideLayer1, 
> insideLayer2.json as json_insideLayer2 
> from (select '1' id) creatives 
> lateral view json_tuple('{"layer1": {"layer2": "text inside layer 2"}}', 
> 'layer1') insideLayer1 as json 
> lateral view json_tuple(insideLayer1.json, 'layer2') insideLayer2 as json 
> {code}
> It got 
> {code}
> ( {"layer2": "text inside layer 2"},  {"layer2": "text inside layer 2"})
> {code}
> instead of
> {code}
> ( {"layer2": "text inside layer 2"},  text inside layer 2)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7129) Add generic boosting algorithm to spark.ml

2015-09-22 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903119#comment-14903119
 ] 

Joseph K. Bradley commented on SPARK-7129:
--

Hi, I'd recommend starting with smaller tasks before tackling a larger one like 
this.  Here's a link with a lot more info on contributing: 
[https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark]

You can also find the VM link from the MOOC here: 
[http://mail-archives.us.apache.org/mod_mbox/spark-user/201505.mbox/%3CCAG5NM6SsHmv-vet85=afdyyoafgoluicmka9ck77qv3f0pk...@mail.gmail.com%3E]

But exploring installation issues might be a good way to understand the Spark 
build a bit more!  : )

> Add generic boosting algorithm to spark.ml
> --
>
> Key: SPARK-7129
> URL: https://issues.apache.org/jira/browse/SPARK-7129
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> The Pipelines API will make it easier to create a generic Boosting algorithm 
> which can work with any Classifier or Regressor. Creating this feature will 
> require researching the possible variants and extensions of boosting which we 
> may want to support now and/or in the future, and planning an API which will 
> be properly extensible.
> In particular, it will be important to think about supporting:
> * multiple loss functions (for AdaBoost, LogitBoost, gradient boosting, etc.)
> * multiclass variants
> * multilabel variants (which will probably be in a separate class and JIRA)
> * For more esoteric variants, we should consider them but not design too much 
> around them: totally corrective boosting, cascaded models
> Note: This may interact some with the existing tree ensemble methods, but it 
> should be largely separate since the tree ensemble APIs and implementations 
> are specialized for trees.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


