[jira] [Updated] (SPARK-9719) spark.ml NaiveBayes doc cleanups

2015-08-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9719:
-
Shepherd: Joseph K. Bradley

 spark.ml NaiveBayes doc cleanups
 

 Key: SPARK-9719
 URL: https://issues.apache.org/jira/browse/SPARK-9719
 Project: Spark
  Issue Type: Documentation
  Components: ML, PySpark
Reporter: Joseph K. Bradley
Assignee: Feynman Liang
Priority: Minor

 spark.ml NaiveBayesModel: Add Scala and Python doc for pi, theta
 Add setParam tag to NaiveBayes setModelType
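 A sketch of the kind of Scaladoc the ticket asks for (wording illustrative, 
 following the spark.mllib NaiveBayesModel convention of C classes and D 
 features; not necessarily the text that was merged):
 {code}
 class NaiveBayesModelDocSketch {
   /** Log of class priors, whose dimension is C (number of classes). */
   val pi: Array[Double] = Array(math.log(0.5), math.log(0.5))

   /** Log of class conditional probabilities, whose dimension is C-by-D
    *  (number of classes by number of features). */
   val theta: Array[Array[Double]] = Array(Array(0.0), Array(0.0))
 }
 {code}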






[jira] [Updated] (SPARK-9719) spark.ml NaiveBayes doc cleanups

2015-08-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9719:
-
Assignee: Feynman Liang

 spark.ml NaiveBayes doc cleanups
 

 Key: SPARK-9719
 URL: https://issues.apache.org/jira/browse/SPARK-9719
 Project: Spark
  Issue Type: Documentation
  Components: ML, PySpark
Reporter: Joseph K. Bradley
Assignee: Feynman Liang
Priority: Minor

 spark.ml NaiveBayesModel: Add Scala and Python doc for pi, theta
 Add setParam tag to NaiveBayes setModelType






[jira] [Resolved] (SPARK-8890) Reduce memory consumption for dynamic partition insert

2015-08-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-8890.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8010
[https://github.com/apache/spark/pull/8010]

 Reduce memory consumption for dynamic partition insert
 --

 Key: SPARK-8890
 URL: https://issues.apache.org/jira/browse/SPARK-8890
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Michael Armbrust
Priority: Critical
 Fix For: 1.5.0


 Currently, InsertIntoHadoopFsRelation can run out of memory if the number of 
 table partitions is large. The problem is that we open one output writer per 
 partition, so when data are randomized and the number of partitions is large, 
 we end up with a large number of open output writers, leading to OOM.
 The solution here is to inject a sorting operation once the number of active 
 partitions goes beyond a certain point (e.g. 50?).
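 A minimal sketch of the idea (plain Scala; Row and the println calls are 
 hypothetical stand-ins for real rows and writers): once rows are sorted by 
 partition key, each partition's rows are contiguous, so only one writer ever 
 needs to be open at a time.
 {code}
 case class Row(partition: String, value: Int)

 def writeSorted(rows: Seq[Row]): Unit = {
   var open: Option[String] = None
   rows.sortBy(_.partition).foreach { row =>
     if (!open.contains(row.partition)) {
       open.foreach(p => println(s"close writer for $p"))  // close the previous writer
       println(s"open writer for ${row.partition}")        // before opening the next
       open = Some(row.partition)
     }
     println(s"write ${row.value}")
   }
   open.foreach(p => println(s"close writer for $p"))
 }
 {code}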






[jira] [Resolved] (SPARK-8160) Tungsten style external aggregation

2015-08-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8160.

Resolution: Fixed

 Tungsten style external aggregation
 ---

 Key: SPARK-8160
 URL: https://issues.apache.org/jira/browse/SPARK-8160
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Yin Huai
 Fix For: 1.5.0


 Support using external sorting to run aggregation so we can easily process 
 aggregates where each partition is much larger than memory.
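 A toy illustration of why sorting enables this (plain Scala collections, not 
 Spark's implementation): once records are sorted by key, each group can be 
 reduced with O(1) state, which is what makes spilling sorted runs to disk and 
 merge-aggregating them feasible.
 {code}
 def sortAggregate[K: Ordering, V](records: Seq[(K, V)])(merge: (V, V) => V): List[(K, V)] =
   records.sortBy(_._1).foldLeft(List.empty[(K, V)]) {
     case ((k0, v0) :: tail, (k, v)) if k0 == k => (k0, merge(v0, v)) :: tail  // same group
     case (acc, kv)                             => kv :: acc                   // new group
   }.reverse

 sortAggregate(Seq(("a", 1), ("b", 2), ("a", 3)))(_ + _)  // List((a,4), (b,2))
 {code}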






[jira] [Updated] (SPARK-9670) ML 1.5 QA: Examples: Check for new APIs requiring example code

2015-08-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9670:
-
Assignee: Ram Sriharsha

 ML 1.5 QA: Examples: Check for new APIs requiring example code
 --

 Key: SPARK-9670
 URL: https://issues.apache.org/jira/browse/SPARK-9670
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib
Reporter: Joseph K. Bradley
Assignee: Ram Sriharsha
Priority: Minor

 Audit list of new features added to MLlib, and see which major items are 
 missing example code (in the examples folder).  We do not need examples for 
 everything, only for major items such as new ML algorithms.
 For any such items:
 * Create a JIRA for that feature, and assign it to the author of the feature 
 (or yourself if interested).
 * Link it to (a) the original JIRA which introduced that feature (related 
 to) and (b) to this JIRA (requires).






[jira] [Updated] (SPARK-9756) Make auxillary constructors for ML decision trees private

2015-08-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9756:
-
Fix Version/s: (was: 1.5.0)

 Make auxillary constructors for ML decision trees private
 -

 Key: SPARK-9756
 URL: https://issues.apache.org/jira/browse/SPARK-9756
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Feynman Liang
Assignee: Feynman Liang
Priority: Minor

 These classes should not (and actually cannot) be instantiated directly 
 because there is currently no public constructor for {{Node}}.






[jira] [Updated] (SPARK-9756) Make auxillary constructors for ML decision trees private

2015-08-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9756:
-
Shepherd: Joseph K. Bradley
Assignee: Feynman Liang
Target Version/s: 1.5.0

 Make auxillary constructors for ML decision trees private
 -

 Key: SPARK-9756
 URL: https://issues.apache.org/jira/browse/SPARK-9756
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Feynman Liang
Assignee: Feynman Liang
Priority: Minor

 These classes should not (and actually cannot) be instantiated directly 
 because there is currently no public constructor for {{Node}}.






[jira] [Resolved] (SPARK-9756) Make auxillary constructors for ML decision trees private

2015-08-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-9756.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8046
[https://github.com/apache/spark/pull/8046]

 Make auxillary constructors for ML decision trees private
 -

 Key: SPARK-9756
 URL: https://issues.apache.org/jira/browse/SPARK-9756
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Feynman Liang
Assignee: Feynman Liang
Priority: Minor
 Fix For: 1.5.0


 These classes should not (and actually cannot) be instantiated directly 
 because there is currently no public constructor for {{Node}}.






[jira] [Resolved] (SPARK-9719) spark.ml NaiveBayes doc cleanups

2015-08-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-9719.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8047
[https://github.com/apache/spark/pull/8047]

 spark.ml NaiveBayes doc cleanups
 

 Key: SPARK-9719
 URL: https://issues.apache.org/jira/browse/SPARK-9719
 Project: Spark
  Issue Type: Documentation
  Components: ML, PySpark
Reporter: Joseph K. Bradley
Assignee: Feynman Liang
Priority: Minor
 Fix For: 1.5.0


 spark.ml NaiveBayesModel: Add Scala and Python doc for pi, theta
 Add setParam tag to NaiveBayes setModelType






[jira] [Commented] (SPARK-9066) Improve cartesian performance

2015-08-07 Thread Weizhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662707#comment-14662707
 ] 

Weizhong commented on SPARK-9066:
-

Yes, the root reason is the same: it is caused by scanning HDFS too many times. 
[PR#6454|https://github.com/apache/spark/pull/6454] uses coalesce to decrease 
the number of partitions, but that adds two shuffles; changing the cartesian 
order can also decrease the number of scans, which is what I have done in 
[PR#7417|https://github.com/apache/spark/pull/7417]

 Improve cartesian performance 
 --

 Key: SPARK-9066
 URL: https://issues.apache.org/jira/browse/SPARK-9066
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Weizhong
Priority: Minor

 Currently, for CartesianProduct, if the right plan's partition record count is 
 smaller than the left plan's, performance is bad because we need to scan the 
 right plan many times.
 For example:
 {noformat}
 with single_value as (
   select max(1) tpcds_val from date_dim
 )
 select sum(ss_quantity * ss_sales_price) ssales, tpcds_val
 from store_sales, single_value
 group by tpcds_val
 {noformat}
 In the SQL above, the right plan has only 1 record, while the left plan has 
 1823 partitions (in our test), each with more than 4000 records; for each left 
 plan partition we then need to scan the right plan's data from HDFS again.
 That is, the left plan is scanned _left_plan_partition_num_ times and the 
 right plan _left_plan_partition_num * right_plan_partition_num_ times, for a 
 total of _left_plan_partition_num * (1 + right_plan_partition_num)_ scans.
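 For concreteness, plugging the test's numbers into that formula shows the 
 payoff of swapping the cartesian order (the idea behind PR#7417); a quick 
 check in plain Scala:
 {code}
 val scans = (leftParts: Long, rightParts: Long) => leftParts * (1 + rightParts)

 scans(1823, 1)  // current order: 1823 * (1 + 1)    = 3646 HDFS scans
 scans(1, 1823)  // swapped order: 1    * (1 + 1823) = 1824 HDFS scans
 {code}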






[jira] [Resolved] (SPARK-9754) Remove TypeCheck in debug package

2015-08-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-9754.

   Resolution: Fixed
Fix Version/s: 1.5.0

 Remove TypeCheck in debug package
 -

 Key: SPARK-9754
 URL: https://issues.apache.org/jira/browse/SPARK-9754
 Project: Spark
  Issue Type: Task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.5.0


 TypeCheck no longer applies in the new Tungsten world.






[jira] [Updated] (SPARK-9666) ML 1.5 QA: model save/load audit

2015-08-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9666:
-
Assignee: yuhao yang

 ML 1.5 QA: model save/load audit
 

 Key: SPARK-9666
 URL: https://issues.apache.org/jira/browse/SPARK-9666
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Joseph K. Bradley
Assignee: yuhao yang

 We should check to make sure no changes broke model import/export in 
 spark.mllib.
 * If a model's name, data members, or constructors have changed _at all_, 
 then we likely need to support a new save/load format version.  Different 
 versions must be tested in unit tests to ensure backwards compatibility 
 (i.e., verify we can load old model formats).
 * Examples in the programming guide should include save/load when available.  
 It's important to try running each example in the guide whenever it is 
 modified (since there are no automated tests).






[jira] [Updated] (SPARK-9738) remove FromUnsafe and add its codegen version to GenerateSafe

2015-08-07 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-9738:
---
Description: In https://github.com/apache/spark/pull/7752 we added 
`FromUnsafe` to convert nested unsafe data like array/map/struct to safe 
versions. It was a quick solution, and we already have `GenerateSafe`, which 
does the conversion with codegen. So we should remove `FromUnsafe` and 
implement its codegen version in `GenerateSafe`.

 remove FromUnsafe and add its codegen version to GenerateSafe
 -

 Key: SPARK-9738
 URL: https://issues.apache.org/jira/browse/SPARK-9738
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan

 In https://github.com/apache/spark/pull/7752 we added `FromUnsafe` to convert 
 nested unsafe data like array/map/struct to safe versions. It was a quick 
 solution, and we already have `GenerateSafe`, which does the conversion with 
 codegen. So we should remove `FromUnsafe` and implement its codegen version 
 in `GenerateSafe`.






[jira] [Commented] (SPARK-9738) remove FromUnsafe and add its codegen version to GenerateSafe

2015-08-07 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662763#comment-14662763
 ] 

Wenchen Fan commented on SPARK-9738:


[~joshrosen] sorry about the rush, I've filled in the description now :)

 remove FromUnsafe and add its codegen version to GenerateSafe
 -

 Key: SPARK-9738
 URL: https://issues.apache.org/jira/browse/SPARK-9738
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan

 In https://github.com/apache/spark/pull/7752 we added `FromUnsafe` to convert 
 nested unsafe data like array/map/struct to safe versions. It was a quick 
 solution, and we already have `GenerateSafe`, which does the conversion with 
 codegen. So we should remove `FromUnsafe` and implement its codegen version 
 in `GenerateSafe`.






[jira] [Resolved] (SPARK-9753) TungstenAggregate should also accept InternalRow instead of just UnsafeRow

2015-08-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-9753.

   Resolution: Fixed
Fix Version/s: 1.5.0

 TungstenAggregate should also accept InternalRow instead of just UnsafeRow
 --

 Key: SPARK-9753
 URL: https://issues.apache.org/jira/browse/SPARK-9753
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Blocker
 Fix For: 1.5.0


 Since we need to project out the key and value anyway, there is no need to 
 accept only UnsafeRows.






[jira] [Resolved] (SPARK-9748) Centriod typo in KMeansModel

2015-08-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-9748.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8037
[https://github.com/apache/spark/pull/8037]

 Centriod typo in KMeansModel
 

 Key: SPARK-9748
 URL: https://issues.apache.org/jira/browse/SPARK-9748
 Project: Spark
  Issue Type: Task
  Components: MLlib
Affects Versions: 1.4.1
Reporter: Bertrand Dechoux
Assignee: Bertrand Dechoux
Priority: Trivial
  Labels: typo
 Fix For: 1.6.0


 A minor typo (centriod -> centroid). Readable variable names help all users.






[jira] [Updated] (SPARK-9748) Centriod typo in KMeansModel

2015-08-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9748:
-
Target Version/s: 1.6.0

 Centriod typo in KMeansModel
 

 Key: SPARK-9748
 URL: https://issues.apache.org/jira/browse/SPARK-9748
 Project: Spark
  Issue Type: Task
  Components: MLlib
Affects Versions: 1.4.1
Reporter: Bertrand Dechoux
Assignee: Bertrand Dechoux
Priority: Trivial
  Labels: typo
 Fix For: 1.6.0


 A minor typo (centriod -> centroid). Readable variable names help all users.






[jira] [Updated] (SPARK-9748) Centriod typo in KMeansModel

2015-08-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9748:
-
Assignee: Bertrand Dechoux

 Centriod typo in KMeansModel
 

 Key: SPARK-9748
 URL: https://issues.apache.org/jira/browse/SPARK-9748
 Project: Spark
  Issue Type: Task
  Components: MLlib
Affects Versions: 1.4.1
Reporter: Bertrand Dechoux
Assignee: Bertrand Dechoux
Priority: Trivial
  Labels: typo

 A minor typo (centriod -> centroid). Readable variable names help all users.






[jira] [Commented] (SPARK-9744) Add RDD method to map with lag and lead

2015-08-07 Thread Jerry Z (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662298#comment-14662298
 ] 

Jerry Z commented on SPARK-9744:


Fixed! Sorry, I didn't know you were referring to the title. For performance's 
sake, I think this would be a handy feature to have; it also saves me a lot of 
typing and keeps my code from wrapping around.

On a semi-related note, why does cogroup need an iterator of the class? join() 
doesn't.

 Add RDD method to map with lag and lead
 ---

 Key: SPARK-9744
 URL: https://issues.apache.org/jira/browse/SPARK-9744
 Project: Spark
  Issue Type: Wish
Reporter: Jerry Z
Priority: Minor

 To avoid zipping with index and doing numerous mappings and joins, add a 
 single map-like method with two additional parameters: (1) a list of offsets 
 (negative for lag, 0 for current, positive for lead) and (2) a default value. 
 The other difference is that the mapped function takes an argument of List[T] 
 and not just T.
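 A hedged sketch of the requested semantics in plain Scala (no Spark; 
 mapWithOffsets is a hypothetical name, and the RDD version would sit on top 
 of zipWithIndex plus joins, which is exactly the boilerplate this wish 
 avoids):
 {code}
 def mapWithOffsets[T, R](xs: Vector[T], offsets: List[Int], default: T)
                         (f: List[T] => R): Vector[R] =
   xs.indices.map { i =>
     // neighbors at the requested offsets, padded with `default` off either end
     f(offsets.map(o => xs.lift(i + o).getOrElse(default)))
   }.toVector

 // previous, current, and next element, padding with 0:
 mapWithOffsets(Vector(1, 2, 3), List(-1, 0, 1), 0)(_.sum)
 // Vector(0+1+2, 1+2+3, 2+3+0) = Vector(3, 6, 5)
 {code}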






[jira] [Issue Comment Deleted] (SPARK-9744) Add RDD method to map with lag and lead

2015-08-07 Thread Jerry Z (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Z updated SPARK-9744:
---
Comment: was deleted

(was: Fixed! Sorry, I didn't know you were referring to the title. For 
performance's sake, I think this would be a handy feature to have; it also 
saves me a lot of typing and keeps my code from wrapping around.

On a semi-related note, why does cogroup need an iterator of the class? join() 
doesn't.)

 Add RDD method to map with lag and lead
 ---

 Key: SPARK-9744
 URL: https://issues.apache.org/jira/browse/SPARK-9744
 Project: Spark
  Issue Type: Wish
Reporter: Jerry Z
Priority: Minor

 To avoid zipping with index and doing numerous mappings and joins, add a 
 single map-like method with two additional parameters: (1) a list of offsets 
 (negative for lag, 0 for current, positive for lead) and (2) a default value. 
 The other difference is that the mapped function takes an argument of List[T] 
 and not just T.






[jira] [Comment Edited] (SPARK-9660) ML 1.5 QA: API: New Scala APIs, docs

2015-08-07 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662303#comment-14662303
 ] 

Feynman Liang edited comment on SPARK-9660 at 8/7/15 7:23 PM:
--

Logistic regression [only supports binary 
classification|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L83],
 but various 
[scaladocs|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L50]
 assert that this is a backwards compatibility feature, suggesting that 
multiclass is supported.

This is made more confusing by the fact that inheriting from 
{{ProbabilisticClassifier}} exposes a {{setThresholds(Array[Double])}} public 
method, potentially allowing a user to set more than two thresholds on a binary 
classifier... It may make sense to consider adding {{numClasses}} to 
{{ClassifierParams}} and explicitly check that in {{HasThresholds}} (self type 
annotation?)


was (Author: fliang):
Logistic regression [only supports binary 
classification|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L83],
 but various 
[scaladocs|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L50]
 assert that this is a backwards compatibility feature, suggesting that 
multiclass is supported.

This is made more confusing by the fact that inheriting from 
{{ProbabilisticClassifier}} exposes a {{setThresholds(Array[Double])}} public 
method, potentially allowing a user to set more than two thresholds on a binary 
classifier... It may make sense to consider adding {{numClasses}} to 
{{ClassifierParams}} and explicitly check that when setting thresholds.

 ML 1.5 QA: API: New Scala APIs, docs
 

 Key: SPARK-9660
 URL: https://issues.apache.org/jira/browse/SPARK-9660
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, MLlib
Reporter: Joseph K. Bradley

 Audit new public Scala APIs added to MLlib.  Take note of:
 * Protected/public classes or methods.  If access can be more private, then 
 it should be.
 * Also look for non-sealed traits.
 * Documentation: Missing?  Bad links or formatting?
 *Make sure to check the object doc!*
 As you find issues, please comment here, or better yet create JIRAs and link 
 them.






[jira] [Issue Comment Deleted] (SPARK-9744) Add RDD method to map with lag and lead

2015-08-07 Thread Jerry Z (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Z updated SPARK-9744:
---
Comment: was deleted

(was: Fixed! Sorry, I didn't know you were referring to the title. For 
performance's sake, I think this would be a handy feature to have; it also 
saves me a lot of typing and keeps my code from wrapping around.

On a semi-related note, why does cogroup need an iterator of the class? join() 
doesn't.)

 Add RDD method to map with lag and lead
 ---

 Key: SPARK-9744
 URL: https://issues.apache.org/jira/browse/SPARK-9744
 Project: Spark
  Issue Type: Wish
Reporter: Jerry Z
Priority: Minor

 To avoid zipping with index and doing numerous mappings and joins, add a 
 single map-like method with two additional parameters: (1) a list of offsets 
 (negative for lag, 0 for current, positive for lead) and (2) a default value. 
 The other difference is that the mapped function takes an argument of List[T] 
 and not just T.






[jira] [Commented] (SPARK-9660) ML 1.5 QA: API: New Scala APIs, docs

2015-08-07 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662303#comment-14662303
 ] 

Feynman Liang commented on SPARK-9660:
--

Logistic regression [only supports binary 
classification|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L83],
 but various 
[scaladocs|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L50]
 assert that this is a backwards compatibility feature, suggesting that 
multiclass is supported.

This is made more confusing by the fact that inheriting from 
{{ProbabilisticClassifier}} exposes a {{setThresholds(Array[Double])}} public 
method, potentially allowing a user to set more than two thresholds on a binary 
classifier... It may make sense to consider adding {{numClasses}} to 
{{ClassifierParams}} and explicitly check that when setting thresholds.

 ML 1.5 QA: API: New Scala APIs, docs
 

 Key: SPARK-9660
 URL: https://issues.apache.org/jira/browse/SPARK-9660
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, MLlib
Reporter: Joseph K. Bradley

 Audit new public Scala APIs added to MLlib.  Take note of:
 * Protected/public classes or methods.  If access can be more private, then 
 it should be.
 * Also look for non-sealed traits.
 * Documentation: Missing?  Bad links or formatting?
 *Make sure to check the object doc!*
 As you find issues, please comment here, or better yet create JIRAs and link 
 them.






[jira] [Comment Edited] (SPARK-9660) ML 1.5 QA: API: New Scala APIs, docs

2015-08-07 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662317#comment-14662317
 ] 

Feynman Liang edited comment on SPARK-9660 at 8/7/15 7:58 PM:
--

Should [RandomForestClassificationModel's aux 
constructor|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala#L139]
 be private? Ditto for 
[DecisionTreeRegressionModel|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L110]


was (Author: fliang):
Should [RandomForestClassificationModel's aux 
constructor|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala#L139]
 be private?

 ML 1.5 QA: API: New Scala APIs, docs
 

 Key: SPARK-9660
 URL: https://issues.apache.org/jira/browse/SPARK-9660
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, MLlib
Reporter: Joseph K. Bradley

 Audit new public Scala APIs added to MLlib.  Take note of:
 * Protected/public classes or methods.  If access can be more private, then 
 it should be.
 * Also look for non-sealed traits.
 * Documentation: Missing?  Bad links or formatting?
 *Make sure to check the object doc!*
 As you find issues, please comment here, or better yet create JIRAs and link 
 them.






[jira] [Comment Edited] (SPARK-9720) spark.ml Identifiable types should have UID in toString methods

2015-08-07 Thread Bertrand Dechoux (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662333#comment-14662333
 ] 

Bertrand Dechoux edited comment on SPARK-9720 at 8/7/15 7:59 PM:
-

I could take care of it.

Here is the list (only in spark.ml) :
* DecisionTreeClassificationModel
* DecisionTreeRegressionModel
* GBTClassificationModel
* GBTRegressionModel
* NaiveBayesModel
* RFormula
* RFormulaModel
* RandomForestClassificationModel
* RandomForestRegressionModel

The question is : do we want to enforce that identifiable types should be 
identifiable by their toString.

It does make sense. The following question is : can we introduce potential API 
breaking change in the API in order to do it?

If the answer is yes, the easy way would be to set Identifiable.toString as 
final and compose it with an overridable empty suffix

{code}
private[spark] trait Identifiable {

  /**
   * An immutable unique ID for the object and its derivatives.
   */
  val uid: String
  
  def toStringSuffix: String = ""

  override final def toString: String = uid + toStringSuffix
}
{code}
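A usage sketch of what that buys (building on the trait above; ExampleModel 
and numFeatures are hypothetical): since toString is final, the UID can no 
longer be dropped, and subtypes only append a suffix.

{code}
class ExampleModel(override val uid: String, numFeatures: Int) extends Identifiable {
  override def toStringSuffix: String = s" (numFeatures=$numFeatures)"
}

new ExampleModel("example_4a3b", 10).toString  // "example_4a3b (numFeatures=10)"
{code}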

Is there a committer that could validate this proposal?


was (Author: bdechoux):
I could take care of it.

Here is the list (only in spark.ml) :
* DecisionTreeClassificationModel
* DecisionTreeRegressionModel
* GBTClassificationModel
* GBTRegressionModel
* NaiveBayesModel
* RFormula
* RFormulaModel
* RandomForestClassificationModel
* RandomForestRegressionModel

The question is do we want to enforce that identifiable types should be 
identifiable by their toString.
It does make sense. The following question is can we introduce potential API 
breaking change in the API in order to do it?

If the answer is yes, the easy way would be to set Identifiable.toString as 
final and compose it with an overridable empty suffix

private[spark] trait Identifiable {

  /**
   * An immutable unique ID for the object and its derivatives.
   */
  val uid: String
  
  def toStringSuffix: String = ""

  override final def toString: String = uid + toStringSuffix
}

Is there a committer that could validate this proposal?

 spark.ml Identifiable types should have UID in toString methods
 ---

 Key: SPARK-9720
 URL: https://issues.apache.org/jira/browse/SPARK-9720
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Priority: Minor
  Labels: starter

 It would be nice to print the UID (instance name) in toString methods.  
 That's the default behavior for Identifiable, but some types override the 
 default toString and do not print the UID.






[jira] [Commented] (SPARK-9746) PairRDDFunctions.countByKey: values/counts always 1

2015-08-07 Thread Andreas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662360#comment-14662360
 ] 

Andreas commented on SPARK-9746:


Maybe I'm too dumb, but the count for each key is always '1'.


 PairRDDFunctions.countByKey: values/counts always 1
 ---

 Key: SPARK-9746
 URL: https://issues.apache.org/jira/browse/SPARK-9746
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Andreas

 org.apache.spark.rdd.PairRDDFunctions#countByKey(): Map[K, Long] = 
 self.withScope {
 self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
   }
 obviously always returns count 1 for each key.
 If I understand the docs correctly I would expect this implementation:
 self.mapValues(_.size).reduceByKey(_ + _).collect().toMap
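 For reference, the same map-then-sum written out on plain Scala collections 
 (a sketch of the semantics only, not the Spark code path):
 {code}
 val pairs = List(("a", 1), ("a", 2), ("b", 3))
 pairs.map { case (k, _) => (k, 1L) }                      // every record becomes a 1
      .groupBy(_._1)
      .map { case (k, ones) => (k, ones.map(_._2).sum) }   // and the 1s are summed per key
 // Map(a -> 2, b -> 1)
 {code}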






[jira] [Commented] (SPARK-9660) ML 1.5 QA: API: New Scala APIs, docs

2015-08-07 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662367#comment-14662367
 ] 

Feynman Liang commented on SPARK-9660:
--

{{LogisticRegressionModel.toString()}} is missing a short description.

 ML 1.5 QA: API: New Scala APIs, docs
 

 Key: SPARK-9660
 URL: https://issues.apache.org/jira/browse/SPARK-9660
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, MLlib
Reporter: Joseph K. Bradley

 Audit new public Scala APIs added to MLlib.  Take note of:
 * Protected/public classes or methods.  If access can be more private, then 
 it should be.
 * Also look for non-sealed traits.
 * Documentation: Missing?  Bad links or formatting?
 *Make sure to check the object doc!*
 As you find issues, please comment here, or better yet create JIRAs and link 
 them.






[jira] [Created] (SPARK-9749) DenseMatrix equals does not account for isTransposed

2015-08-07 Thread Feynman Liang (JIRA)
Feynman Liang created SPARK-9749:


 Summary: DenseMatrix equals does not account for isTransposed
 Key: SPARK-9749
 URL: https://issues.apache.org/jira/browse/SPARK-9749
 Project: Spark
  Issue Type: Bug
Reporter: Feynman Liang
Priority: Blocker


A matrix is not always equal to its transpose, but the current implementation 
of {{equals}} in 
[DenseMatrix|https://github.com/apache/spark/blob/9897cc5e3d6c70f7e45e887e2c6fc24dfa1adada/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L261]
 does not account for the {{isTransposed}} flag.
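A minimal sketch of transpose-aware equality (plain Scala; DM is a 
hypothetical stand-in for DenseMatrix, using its column-major/row-major 
storage convention): compare logical (i, j) entries rather than the raw 
backing arrays.

{code}
case class DM(numRows: Int, numCols: Int, values: Array[Double], isTransposed: Boolean) {
  // column-major storage normally, row-major when transposed
  def apply(i: Int, j: Int): Double =
    if (isTransposed) values(j + numCols * i) else values(i + numRows * j)

  def sameAs(o: DM): Boolean =
    numRows == o.numRows && numCols == o.numCols &&
      (0 until numRows).forall(i => (0 until numCols).forall(j => this(i, j) == o(i, j)))
}

val m  = DM(2, 2, Array[Double](1, 2, 3, 4), isTransposed = false) // [[1,3],[2,4]]
val mt = DM(2, 2, Array[Double](1, 2, 3, 4), isTransposed = true)  // [[1,2],[3,4]]
m.sameAs(mt) // false, even though the backing arrays are element-wise equal
{code}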






[jira] [Created] (SPARK-9740) first/last aggregate NULL behavior

2015-08-07 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-9740:


 Summary: first/last aggregate NULL behavior
 Key: SPARK-9740
 URL: https://issues.apache.org/jira/browse/SPARK-9740
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.6.0
Reporter: Herman van Hovell
Priority: Minor


The FIRST/LAST aggregates implemented as part of the new UDAF interface return 
the first or last non-null value (if any) found. This is a departure from the 
behavior of the old FIRST/LAST aggregates and from the FIRST_VALUE/LAST_VALUE 
aggregates in Hive, which would return a null value if that happened to be the 
first/last value seen. SPARK-9592 tries to 'fix' this behavior for the old 
UDAF interface.

Hive makes this behavior configurable by adding a skipNulls flag. I would 
suggest doing the same, and making the default behavior compatible with Hive.
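A sketch of the two semantics on plain Scala collections (the skipNulls flag 
mirrors Hive's behavior as described above; this is not Spark's 
implementation):

{code}
def first[T](xs: Seq[Option[T]], skipNulls: Boolean): Option[T] =
  if (skipNulls) xs.flatten.headOption  // first non-null value, new-UDAF style
  else xs.headOption.flatten            // first value even if null, Hive's default

val vs = Seq(None, Some(1), Some(2))
first(vs, skipNulls = true)   // Some(1)
first(vs, skipNulls = false)  // None: the first value seen happened to be null
{code}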






[jira] [Updated] (SPARK-9340) ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch

2015-08-07 Thread Damian Guy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Damian Guy updated SPARK-9340:
--
Affects Version/s: 1.3.0

 ParquetTypeConverter incorrectly handling of repeated types results in schema 
 mismatch
 --

 Key: SPARK-9340
 URL: https://issues.apache.org/jira/browse/SPARK-9340
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.3.0, 1.4.0
Reporter: Damian Guy
 Attachments: ParquetTypesConverterTest.scala


 The way ParquetTypesConverter handles primitive repeated types results in an 
 incompatible schema being used for querying data. For example, given a schema 
 like so:
 message root {
repeated int32 repeated_field;
  }
 Spark produces a read schema like:
 message root {
optional int32 repeated_field;
  }
 These are incompatible and all attempts to read fail.
 In ParquetTypesConverter.toDataType:
  if (parquetType.isPrimitive) {
   toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, 
 isInt96AsTimestamp)
 } else {...}
 The if condition should also have 
 !parquetType.isRepetition(Repetition.REPEATED)
  
 And then this case will need to be handled in the else 






[jira] [Commented] (SPARK-9340) ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch

2015-08-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14661830#comment-14661830
 ] 

Apache Spark commented on SPARK-9340:
-

User 'dguy' has created a pull request for this issue:
https://github.com/apache/spark/pull/8032

 ParquetTypeConverter incorrectly handling of repeated types results in schema 
 mismatch
 --

 Key: SPARK-9340
 URL: https://issues.apache.org/jira/browse/SPARK-9340
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.3.0, 1.4.0
Reporter: Damian Guy
 Attachments: ParquetTypesConverterTest.scala


 The way ParquetTypesConverter handles primitive repeated types results in an 
 incompatible schema being used for querying data. For example, given a schema 
 like so:
 message root {
repeated int32 repeated_field;
  }
 Spark produces a read schema like:
 message root {
optional int32 repeated_field;
  }
 These are incompatible and all attempts to read fail.
 In ParquetTypesConverter.toDataType:
  if (parquetType.isPrimitive) {
   toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, 
 isInt96AsTimestamp)
 } else {...}
 The if condition should also have 
 !parquetType.isRepetition(Repetition.REPEATED)
  
 And then this case will need to be handled in the else 






[jira] [Commented] (SPARK-9340) ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch

2015-08-07 Thread Damian Guy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14661831#comment-14661831
 ] 

Damian Guy commented on SPARK-9340:
---

I created a pull request against the 1.3 branch (closest to what I am using): 
https://github.com/apache/spark/pull/8032

 ParquetTypeConverter incorrectly handling of repeated types results in schema 
 mismatch
 --

 Key: SPARK-9340
 URL: https://issues.apache.org/jira/browse/SPARK-9340
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.3.0, 1.4.0
Reporter: Damian Guy
 Attachments: ParquetTypesConverterTest.scala


 The way ParquetTypesConverter handles primitive repeated types results in an 
 incompatible schema being used for querying data. For example, given a schema 
 like so:
 message root {
repeated int32 repeated_field;
  }
 Spark produces a read schema like:
 message root {
optional int32 repeated_field;
  }
 These are incompatible and all attempts to read fail.
 In ParquetTypesConverter.toDataType:
  if (parquetType.isPrimitive) {
   toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, 
 isInt96AsTimestamp)
 } else {...}
 The if condition should also have 
 !parquetType.isRepetition(Repetition.REPEATED)
  
 And then this case will need to be handled in the else 






[jira] [Assigned] (SPARK-9340) ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch

2015-08-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9340:
---

Assignee: (was: Apache Spark)

 ParquetTypeConverter incorrectly handling of repeated types results in schema 
 mismatch
 --

 Key: SPARK-9340
 URL: https://issues.apache.org/jira/browse/SPARK-9340
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.3.0, 1.4.0
Reporter: Damian Guy
 Attachments: ParquetTypesConverterTest.scala


 The way ParquetTypesConverter handles primitive repeated types results in an 
 incompatible schema being used for querying data. For example, given a schema 
 like so:
 message root {
repeated int32 repeated_field;
  }
 Spark produces a read schema like:
 message root {
optional int32 repeated_field;
  }
 These are incompatible and all attempts to read fail.
 In ParquetTypesConverter.toDataType:
  if (parquetType.isPrimitive) {
   toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, 
 isInt96AsTimestamp)
 } else {...}
 The if condition should also have 
 !parquetType.isRepetition(Repetition.REPEATED)
  
 And then this case will need to be handled in the else 






[jira] [Assigned] (SPARK-9340) ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch

2015-08-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9340:
---

Assignee: Apache Spark

 ParquetTypeConverter incorrectly handling of repeated types results in schema 
 mismatch
 --

 Key: SPARK-9340
 URL: https://issues.apache.org/jira/browse/SPARK-9340
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.3.0, 1.4.0
Reporter: Damian Guy
Assignee: Apache Spark
 Attachments: ParquetTypesConverterTest.scala


 The way ParquetTypesConverter handles primitive repeated types results in an 
 incompatible schema being used for querying data. For example, given a schema 
 like so:
 message root {
repeated int32 repeated_field;
  }
 Spark produces a read schema like:
 message root {
optional int32 repeated_field;
  }
 These are incompatible and all attempts to read fail.
 In ParquetTypesConverter.toDataType:
  if (parquetType.isPrimitive) {
   toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, 
 isInt96AsTimestamp)
 } else {...}
 The if condition should also have 
 !parquetType.isRepetition(Repetition.REPEATED)
  
 And then this case will need to be handled in the else 






[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior

2015-08-07 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14661859#comment-14661859
 ] 

Herman van Hovell commented on SPARK-9740:
--

BTW: I encountered this while doing tests for SPARK-8641. Unfortunately it is 
kind of a PITA to create a proper test using an Aggregate: aggregates do not 
enforce sorting, so the result of FIRST/LAST is nondeterministic.

 first/last aggregate NULL behavior
 --

 Key: SPARK-9740
 URL: https://issues.apache.org/jira/browse/SPARK-9740
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.6.0
Reporter: Herman van Hovell
Priority: Minor

 The FIRST/LAST aggregates implemented as part of the new UDAF interface 
 return the first or last non-null value (if any) found. This is a departure 
 from the behavior of the old FIRST/LAST aggregates and from the 
 FIRST_VALUE/LAST_VALUE aggregates in Hive, which would return a null value 
 if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' 
 this behavior for the old UDAF interface.
 Hive makes this behavior configurable by adding a skipNulls flag. I would 
 suggest doing the same, and making the default behavior compatible with Hive.






[jira] [Comment Edited] (SPARK-9660) ML 1.5 QA: API: New Scala APIs, docs

2015-08-07 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662317#comment-14662317
 ] 

Feynman Liang edited comment on SPARK-9660 at 8/7/15 8:04 PM:
--

Should [RandomForestClassificationModel's aux 
constructor|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala#L139]
 be private? Ditto for 
[DecisionTreeRegressionModel|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L110],
 
[RandomForestRegressionModel|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L110]


was (Author: fliang):
Should [RandomForestClassificationModel's aux 
constructor|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala#L139]
 be private? Ditto for 
[DecisionTreeRegressionModel|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L110]

 ML 1.5 QA: API: New Scala APIs, docs
 

 Key: SPARK-9660
 URL: https://issues.apache.org/jira/browse/SPARK-9660
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, MLlib
Reporter: Joseph K. Bradley

 Audit new public Scala APIs added to MLlib.  Take note of:
 * Protected/public classes or methods.  If access can be more private, then 
 it should be.
 * Also look for non-sealed traits.
 * Documentation: Missing?  Bad links or formatting?
 *Make sure to check the object doc!*
 As you find issues, please comment here, or better yet create JIRAs and link 
 them.






[jira] [Comment Edited] (SPARK-9660) ML 1.5 QA: API: New Scala APIs, docs

2015-08-07 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662317#comment-14662317
 ] 

Feynman Liang edited comment on SPARK-9660 at 8/7/15 8:04 PM:
--

Should [RandomForestClassificationModel's aux 
constructor|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala#L139]
 be private? Ditto for 
[DecisionTreeRegressionModel|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L110],
 
[RandomForestRegressionModel|https://github.com/apache/spark/blob/9897cc5e3d6c70f7e45e887e2c6fc24dfa1adada/mllib/src/main/scala/org/apache/spark/ml/regression/RandomForestRegressor.scala#L128]


was (Author: fliang):
Should [RandomForestClassificationModel's aux 
constructor|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala#L139]
 be private? Ditto for 
[DecisionTreeRegressionModel|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L110],
 
[RandomForestRegressionModel|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L110]

 ML 1.5 QA: API: New Scala APIs, docs
 

 Key: SPARK-9660
 URL: https://issues.apache.org/jira/browse/SPARK-9660
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, MLlib
Reporter: Joseph K. Bradley

 Audit new public Scala APIs added to MLlib.  Take note of:
 * Protected/public classes or methods.  If access can be more private, then 
 it should be.
 * Also look for non-sealed traits.
 * Documentation: Missing?  Bad links or formatting?
 *Make sure to check the object doc!*
 As you find issues, please comment here, or better yet create JIRAs and link 
 them.






[jira] [Comment Edited] (SPARK-9746) PairRDDFunctions.countByKey: values/counts always 1

2015-08-07 Thread Andreas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662341#comment-14662341
 ] 

Andreas edited comment on SPARK-9746 at 8/7/15 8:04 PM:


Sorry, but I don't agree.

cntxt.parallelize(List(("a", 1), ("a", 2))).groupBy(_._1).countByKey()

returns 'Map(a -> 1)' but should in my opinion return 'Map(a -> 2)'

If the values (counts) are irrelevant, then why is this function called 
*count*ByKey, and why does it return a Map instead of a Set?
The current implementation has no added value compared to 
'pairRDD.keys.collect().toSet'



was (Author: agrothe1):
Sorry, but I don't agree.

cntxt.parallelize(List(("a", 1), ("a", 2))).groupBy(_._1).countByKey()

returns 'Map(a -> 1)' but should in my opinion return 'Map(a -> 2)'

If the values (counts) are irrelevant, then why is this function called 
*count*ByKey, and why does it return a Map instead of a Set?
The current implementation has no added value compared to 
'pairRDD.keys.collect().toSet'

cntxt.paralize

 PairRDDFunctions.countByKey: values/counts always 1
 ---

 Key: SPARK-9746
 URL: https://issues.apache.org/jira/browse/SPARK-9746
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Andreas

 org.apache.spark.rdd.PairRDDFunctions#countByKey(): Map[K, Long] = 
 self.withScope {
 self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
   }
 obviously always returns count 1 for each key.
 If I understand the docs correctly I would expect this implementation:
 self.mapValues(_.size).reduceByKey(_ + _).collect().toMap






[jira] [Commented] (SPARK-9660) ML 1.5 QA: API: New Scala APIs, docs

2015-08-07 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662356#comment-14662356
 ] 

Joseph K. Bradley commented on SPARK-9660:
--

Sure, sounds good.  (same as for DTClassificationModel)

 ML 1.5 QA: API: New Scala APIs, docs
 

 Key: SPARK-9660
 URL: https://issues.apache.org/jira/browse/SPARK-9660
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, MLlib
Reporter: Joseph K. Bradley

 Audit new public Scala APIs added to MLlib.  Take note of:
 * Protected/public classes or methods.  If access can be more private, then 
 it should be.
 * Also look for non-sealed traits.
 * Documentation: Missing?  Bad links or formatting?
 *Make sure to check the object doc!*
 As you find issues, please comment here, or better yet create JIRAs and link 
 them.






[jira] [Commented] (SPARK-9660) ML 1.5 QA: API: New Scala APIs, docs

2015-08-07 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662358#comment-14662358
 ] 

Joseph K. Bradley commented on SPARK-9660:
--

I want to add it as public for all PredictionModel types eventually, so I don't 
see harm in leaving it public.

 ML 1.5 QA: API: New Scala APIs, docs
 

 Key: SPARK-9660
 URL: https://issues.apache.org/jira/browse/SPARK-9660
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, MLlib
Reporter: Joseph K. Bradley

 Audit new public Scala APIs added to MLlib.  Take note of:
 * Protected/public classes or methods.  If access can be more private, then 
 it should be.
 * Also look for non-sealed traits.
 * Documentation: Missing?  Bad links or formatting?
 *Make sure to check the object doc!*
 As you find issues, please comment here, or better yet create JIRAs and link 
 them.






[jira] [Commented] (SPARK-9660) ML 1.5 QA: API: New Scala APIs, docs

2015-08-07 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662378#comment-14662378
 ] 

Feynman Liang commented on SPARK-9660:
--

[~josephkb] Don't users have to provide {{thresholds}} when configuring the 
model, which would require knowing the number of classes before training?

 ML 1.5 QA: API: New Scala APIs, docs
 

 Key: SPARK-9660
 URL: https://issues.apache.org/jira/browse/SPARK-9660
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, MLlib
Reporter: Joseph K. Bradley

 Audit new public Scala APIs added to MLlib.  Take note of:
 * Protected/public classes or methods.  If access can be more private, then 
 it should be.
 * Also look for non-sealed traits.
 * Documentation: Missing?  Bad links or formatting?
 *Make sure to check the object doc!*
 As you find issues, please comment here, or better yet create JIRAs and link 
 them.






[jira] [Comment Edited] (SPARK-9720) spark.ml Identifiable types should have UID in toString methods

2015-08-07 Thread Bertrand Dechoux (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662333#comment-14662333
 ] 

Bertrand Dechoux edited comment on SPARK-9720 at 8/7/15 8:41 PM:
-

I could take care of it.

Here is the list (only in spark.ml) :
* DecisionTreeClassificationModel
* DecisionTreeRegressionModel
* GBTClassificationModel
* GBTRegressionModel
* NaiveBayesModel
* RFormula
* RFormulaModel
* RandomForestClassificationModel
* RandomForestRegressionModel

The question is : do we want to enforce that identifiable types should be 
identifiable by their toString.

It does make sense. The following question is : can we introduce potential API 
breaking changes in order to do so?

If the answer is yes, the easy way would be to make Identifiable.toString 
final and compose it with an overridable, empty-by-default suffix:

{code}
private[spark] trait Identifiable {

  /**
   * An immutable unique ID for the object and its derivatives.
   */
  val uid: String
  
  def toStringSuffix: String = ""

  override final def toString: String = uid + toStringSuffix
}
{code}
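
For illustration, a minimal sketch of how a subclass could then append extra 
state after the mandatory UID prefix (a hypothetical class, just to show the shape):

{code}
class ToyModel extends Identifiable {
  override val uid: String = "toyModel_0"
  // Extra, human-readable state shown after the UID.
  override def toStringSuffix: String = " (numFeatures=3)"
}

// new ToyModel().toString == "toyModel_0 (numFeatures=3)"
{code}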

Is there a committer that could validate this proposal?


was (Author: bdechoux):
I could take care of it.

Here is the list (only in spark.ml):
* DecisionTreeClassificationModel
* DecisionTreeRegressionModel
* GBTClassificationModel
* GBTRegressionModel
* NaiveBayesModel
* RFormula
* RFormulaModel
* RandomForestClassificationModel
* RandomForestRegressionModel

The question is: do we want to enforce that identifiable types are 
identifiable by their toString?

It does make sense. The following question is: can we introduce a potentially 
API-breaking change in order to do it?

If the answer is yes, the easy way would be to make Identifiable.toString 
final and compose it with an overridable, empty-by-default suffix:

{code}
private[spark] trait Identifiable {

  /**
   * An immutable unique ID for the object and its derivatives.
   */
  val uid: String
  
  def toStringSuffix: String = ""

  override final def toString: String = uid + toStringSuffix
}
{code}

Is there a committer that could validate this proposal?

 spark.ml Identifiable types should have UID in toString methods
 ---

 Key: SPARK-9720
 URL: https://issues.apache.org/jira/browse/SPARK-9720
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Priority: Minor
  Labels: starter

 It would be nice to print the UID (instance name) in toString methods.  
 That's the default behavior for Identifiable, but some types override the 
 default toString and do not print the UID.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9720) spark.ml Identifiable types should have UID in toString methods

2015-08-07 Thread Bertrand Dechoux (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662333#comment-14662333
 ] 

Bertrand Dechoux edited comment on SPARK-9720 at 8/7/15 8:53 PM:
-

I could take care of it.

Here is the list (only in spark.ml):
* DecisionTreeClassificationModel
* DecisionTreeRegressionModel
* GBTClassificationModel
* GBTRegressionModel
* NaiveBayesModel
* RFormula
* RFormulaModel
* RandomForestClassificationModel
* RandomForestRegressionModel

The question is: do we want to enforce that identifiable types are 
identifiable by their toString?

It does make sense. The following question is: can we introduce potentially 
API-breaking changes in order to do so?

If the answer is yes, the easy way would be to make Identifiable.toString 
final and compose it with an overridable, empty-by-default suffix:

{code}
private[spark] trait Identifiable {

  /**
   * An immutable unique ID for the object and its derivatives.
   */
  val uid: String
  
  def toStringSuffix: String = ""

  override final def toString: String = uid + toStringSuffix
}
{code}

Could you, or a committer, validate this proposal?


was (Author: bdechoux):
I could take care of it.

Here is the list (only in spark.ml):
* DecisionTreeClassificationModel
* DecisionTreeRegressionModel
* GBTClassificationModel
* GBTRegressionModel
* NaiveBayesModel
* RFormula
* RFormulaModel
* RandomForestClassificationModel
* RandomForestRegressionModel

The question is: do we want to enforce that identifiable types are 
identifiable by their toString?

It does make sense. The following question is: can we introduce potentially 
API-breaking changes in order to do so?

If the answer is yes, the easy way would be to make Identifiable.toString 
final and compose it with an overridable, empty-by-default suffix:

{code}
private[spark] trait Identifiable {

  /**
   * An immutable unique ID for the object and its derivatives.
   */
  val uid: String
  
  def toStringSuffix: String = ""

  override final def toString: String = uid + toStringSuffix
}
{code}

Is there a committer that could validate this proposal?

 spark.ml Identifiable types should have UID in toString methods
 ---

 Key: SPARK-9720
 URL: https://issues.apache.org/jira/browse/SPARK-9720
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Priority: Minor
  Labels: starter

 It would be nice to print the UID (instance name) in toString methods.  
 That's the default behavior for Identifiable, but some types override the 
 default toString and do not print the UID.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9476) Kafka stream loses leader after 2h of operation

2015-08-07 Thread Ruben Ramalho (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662446#comment-14662446
 ] 

Ruben Ramalho commented on SPARK-9476:
--

Sorry for the late reply; I promise to keep my response delay much smaller from 
now on.

There aren't any error logs, but this problem compromises the normal operation 
of the analytics server.

Yes, simpler jobs do run in the same environment. This same setup runs 
correctly for two hours; it's after 2h of operation that this problem arises, 
which is strange.
Unfortunately I cannot share the relevant code in full, but I can share what I 
am doing. I am consuming data from Apache Kafka as positional updates, doing 
window operations over this data, and extracting features. These features are 
then fed to machine learning algorithms, and tips are generated and fed back to 
Kafka (a different topic). If you want specific parts of the code, I can 
provide them!

I was using Apache Kafka 0.8.2.0 when this issue appeared, then I updated to 
0.8.2.1 (in hopes of the problem being fixed), but the issue persists. I think 
Apache Spark is at some point corrupting the Apache Kafka topics; I cannot 
isolate why that is happening, though. I have used both the Kafka direct stream 
and the regular stream, and the problem persists with both.

Thank you,

R. Ramalho

 Kafka stream loses leader after 2h of operation 
 

 Key: SPARK-9476
 URL: https://issues.apache.org/jira/browse/SPARK-9476
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.1
 Environment: Docker, Centos, Spark standalone, core i7, 8Gb
Reporter: Ruben Ramalho

 This seems to happen every 2h, it happens both with the direct stream and 
 regular stream, I'm doing window operations over a 1h period (if that can 
 help).
 Here's part of the error message:
 2015-07-30 13:27:23 WARN  ClientUtils$:89 - Fetching topic metadata with 
 correlation id 10 for topics [Set(updates)] from broker 
 [id:0,host:192.168.3.23,port:3000] failed
 java.nio.channels.ClosedChannelException
   at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
   at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:73)
   at 
 kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:72)
   at kafka.producer.SyncProducer.send(SyncProducer.scala:113)
   at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:58)
   at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:93)
   at 
 kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:66)
   at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
 2015-07-30 13:27:23 INFO  SyncProducer:68 - Disconnecting from 
 192.168.3.23:3000
 2015-07-30 13:27:23 WARN  ConsumerFetcherManager$LeaderFinderThread:89 - 
 [spark-group_81563e123e9f-1438259236988-fc3d82bf-leader-finder-thread], 
 Failed to find leader for Set([updates,0])
 kafka.common.KafkaException: fetching topic metadata for topics 
 [Set(oversight-updates)] from broker 
 [ArrayBuffer(id:0,host:192.168.3.23,port:3000)] failed
   at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:72)
   at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:93)
   at 
 kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:66)
   at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
 Caused by: java.nio.channels.ClosedChannelException
   at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
   at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:73)
   at 
 kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:72)
   at kafka.producer.SyncProducer.send(SyncProducer.scala:113)
   at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:58)
 After the crash I tried to communicate with Kafka using a simple Scala 
 consumer and producer and had no problem at all. Spark, though, needs a Kafka 
 container restart to resume normal operation. There are no errors in the Kafka 
 log, apart from an improperly closed connection.
 I have been trying to solve this problem for days; I suspect it has 
 something to do with Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9746) PairRDDFunctions.countByKey: values/counts always 1

2015-08-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662440#comment-14662440
 ] 

Sean Owen commented on SPARK-9746:
--

RDDs are not maps. An RDD of (K,V) is merely a collection of (K,V) pairs; K is 
not unique.
Otherwise, what would countByKey mean? If K were unique, then all of the counts 
would be 1 and this method would make no sense.
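
A minimal sketch of the behavior (assuming an existing SparkContext {{sc}}):

{code}
val rdd = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
// countByKey counts how many elements carry each key, not distinct keys:
rdd.countByKey()  // Map(a -> 2, b -> 1)
{code}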

 PairRDDFunctions.countByKey: values/counts always 1
 ---

 Key: SPARK-9746
 URL: https://issues.apache.org/jira/browse/SPARK-9746
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Andreas

 org.apache.spark.rdd.PairRDDFunctions#countByKey(): Map[K, Long] = self.withScope {
   self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
 }
 obviously always returns count 1 for each key.
 If I understand the docs correctly I would expect this implementation:
 self.mapValues(_.size).reduceByKey(_ + _).collect().toMap



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9660) ML 1.5 QA: API: New Scala APIs, docs

2015-08-07 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662442#comment-14662442
 ] 

Feynman Liang commented on SPARK-9660:
--

{{GradientDescent$.runMiniBatchSGD}} should either use a default argument or 
specify the default convergence tolerance in the [method 
overload|https://github.com/apache/spark/blob/9897cc5e3d6c70f7e45e887e2c6fc24dfa1adada/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala#L267].
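
A minimal sketch of the default-argument alternative (a toy signature; the real 
method takes many more parameters, and the 0.001 default here is an assumption 
for illustration):

{code}
object ToyGradientDescent {
  // The fallback tolerance is visible in the signature instead of being
  // hidden inside a separate overload.
  def runMiniBatchSGD(stepSize: Double,
                      numIterations: Int,
                      convergenceTol: Double = 0.001): Unit = {
    // run SGD until convergenceTol or numIterations is reached
  }
}
{code}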

 ML 1.5 QA: API: New Scala APIs, docs
 

 Key: SPARK-9660
 URL: https://issues.apache.org/jira/browse/SPARK-9660
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, MLlib
Reporter: Joseph K. Bradley

 Audit new public Scala APIs added to MLlib.  Take note of:
 * Protected/public classes or methods.  If access can be more private, then 
 it should be.
 * Also look for non-sealed traits.
 * Documentation: Missing?  Bad links or formatting?
 *Make sure to check the object doc!*
 As you find issues, please comment here, or better yet create JIRAs and link 
 them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9660) ML 1.5 QA: API: New Scala APIs, docs

2015-08-07 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662455#comment-14662455
 ] 

Feynman Liang commented on SPARK-9660:
--

Most documentation in 
[MultivariateOnlineSummarizer|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala#L224]
 was lost and should be re-added.

 ML 1.5 QA: API: New Scala APIs, docs
 

 Key: SPARK-9660
 URL: https://issues.apache.org/jira/browse/SPARK-9660
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, MLlib
Reporter: Joseph K. Bradley

 Audit new public Scala APIs added to MLlib.  Take note of:
 * Protected/public classes or methods.  If access can be more private, then 
 it should be.
 * Also look for non-sealed traits.
 * Documentation: Missing?  Bad links or formatting?
 *Make sure to check the object doc!*
 As you find issues, please comment here, or better yet create JIRAs and link 
 them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9677) Enable SQLQuerySuite.aggregation with codegen updates peak execution memory

2015-08-07 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662494#comment-14662494
 ] 

Andrew Or commented on SPARK-9677:
--

Resolved by https://github.com/apache/spark/pull/8015

 Enable SQLQuerySuite.aggregation with codegen updates peak execution memory
 -

 Key: SPARK-9677
 URL: https://issues.apache.org/jira/browse/SPARK-9677
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Andrew Or
Priority: Blocker
 Fix For: 1.5.0


 It was disabled in https://github.com/apache/spark/pull/7983
 Looked like the test case was written against the old aggregate. We need to 
 rewrite it to work for the new aggregate (and make sure the memory usage 
 reporting works for the new aggregate).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-9677) Enable SQLQuerySuite.aggregation with codegen updates peak execution memory

2015-08-07 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-9677.

   Resolution: Fixed
Fix Version/s: 1.5.0

 Enable SQLQuerySuite.aggregation with codegen updates peak execution memory
 -

 Key: SPARK-9677
 URL: https://issues.apache.org/jira/browse/SPARK-9677
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Andrew Or
Priority: Blocker
 Fix For: 1.5.0


 It was disabled in https://github.com/apache/spark/pull/7983
 Looked like the test case was written against the old aggregate. We need to 
 rewrite it to work for the new aggregate (and make sure the memory usage 
 reporting works for the new aggregate).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-8481) GaussianMixtureModel predict accepting single vector

2015-08-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reopened SPARK-8481:
--

Reopening before merging the version-fix PR

 GaussianMixtureModel predict accepting single vector
 

 Key: SPARK-8481
 URL: https://issues.apache.org/jira/browse/SPARK-8481
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Dariusz Kobylarz
Assignee: Dariusz Kobylarz
Priority: Minor
  Labels: GaussianMixtureModel, MLlib
 Fix For: 1.5.0

   Original Estimate: 24h
  Remaining Estimate: 24h

 GaussianMixtureModel lacks a method to predict a cluster for a single input 
 vector where no spark context would be involved, i.e.
 /** Maps given point to its cluster index. */
 def predict(point: Vector): Int



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9746) PairRDDFunctions.countByKey: values/counts always 1

2015-08-07 Thread Andreas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662522#comment-14662522
 ] 

Andreas commented on SPARK-9746:


Many thanks for your responsiveness and patience.
I admire your contribution to this awesome project.

BR from a very thankful user.

 PairRDDFunctions.countByKey: values/counts always 1
 ---

 Key: SPARK-9746
 URL: https://issues.apache.org/jira/browse/SPARK-9746
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Andreas

 org.apache.spark.rdd.PairRDDFunctions#countByKey(): Map[K, Long] = self.withScope {
   self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
 }
 obviously always returns count 1 for each key.
 If I understand the docs correctly I would expect this implementation:
 self.mapValues(_.size).reduceByKey(_ + _).collect().toMap



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9745) Applications hangs when the last executor fails with dynamic allocation

2015-08-07 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-9745:
-
Priority: Blocker  (was: Critical)

 Applications hangs when the last executor fails with dynamic allocation
 ---

 Key: SPARK-9745
 URL: https://issues.apache.org/jira/browse/SPARK-9745
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Scheduler, YARN
Affects Versions: 1.5.0
 Environment: YARN + Pyspark + Dynamic Allocation
Reporter: Alex Angelini
Assignee: Andrew Or
Priority: Blocker
 Attachments: am_hung_job.png, executors_hung_job.png, 
 logs_hung_job.png, tasks_hung_job.png


 When a job has only a single executor remaining and that executor dies (due 
 to something like an OOM), the application fails to notice that there are no 
 executors left and it hangs indefinitely.
 This only happens when dynamic allocation is enabled.
 The following images were taken from a hung application with no executors:
 !logs_hung_job.png!
 ^^ *Notice how 1 executor was lost, but the application never requested it to 
 be removed*
 !am_hung_job.png!
 !executors_hung_job.png!
 !tasks_hung_job.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9375) The total number of executor(s) requested by the driver may be negative

2015-08-07 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-9375:
-
Priority: Critical  (was: Major)

 The total number of executor(s) requested by the driver may be negative
 -

 Key: SPARK-9375
 URL: https://issues.apache.org/jira/browse/SPARK-9375
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1
Reporter: KaiXinXIaoLei
Priority: Critical
 Attachments: The total number of executor(s) is negative in AM log.png


 I set "spark.dynamicAllocation.enabled = true". I ran a big job. I found a 
 problem in the ApplicationMaster log: the total number of executor(s) 
 requested by the driver is negative.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9754) Remove TypeCheck in debug package

2015-08-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9754:
---

Assignee: Apache Spark  (was: Reynold Xin)

 Remove TypeCheck in debug package
 -

 Key: SPARK-9754
 URL: https://issues.apache.org/jira/browse/SPARK-9754
 Project: Spark
  Issue Type: Task
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark

 TypeCheck no longer applies in the new Tungsten world.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9750) SparseMatrix should override equals

2015-08-07 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662538#comment-14662538
 ] 

Joseph K. Bradley commented on SPARK-9750:
--

[~fliang] Are you working on this?

 SparseMatrix should override equals
 ---

 Key: SPARK-9750
 URL: https://issues.apache.org/jira/browse/SPARK-9750
 Project: Spark
  Issue Type: Bug
Reporter: Feynman Liang
Priority: Blocker

 [SparseMatrix|https://github.com/apache/spark/blob/9897cc5e3d6c70f7e45e887e2c6fc24dfa1adada/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L479]
  should override equals to ensure that two instances of the same matrix are 
 equal.
 This implementation should take into account the {{isTransposed}} flag, since 
 {{values}} may not be in the same order for two otherwise equal matrices.
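
 A minimal sketch of one possible approach (illustrative only, and assuming 
 {{toArray}} yields a column-major dense copy regardless of {{isTransposed}}):

{code}
// Hypothetical override inside SparseMatrix: compare shape plus a
// canonical dense representation, sidestepping storage-order differences.
override def equals(other: Any): Boolean = other match {
  case m: Matrix =>
    numRows == m.numRows && numCols == m.numCols && toArray.sameElements(m.toArray)
  case _ => false
}
{code}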



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9750) SparseMatrix should override equals

2015-08-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662536#comment-14662536
 ] 

Apache Spark commented on SPARK-9750:
-

User 'feynmanliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/8042

 SparseMatrix should override equals
 ---

 Key: SPARK-9750
 URL: https://issues.apache.org/jira/browse/SPARK-9750
 Project: Spark
  Issue Type: Bug
Reporter: Feynman Liang
Priority: Blocker

 [SparseMatrix|https://github.com/apache/spark/blob/9897cc5e3d6c70f7e45e887e2c6fc24dfa1adada/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L479]
  should override equals to ensure that two instances of the same matrix are 
 equal.
 This implementation should take into account the {{isTransposed}} flag, since 
 {{values}} may not be in the same order for two otherwise equal matrices.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9750) SparseMatrix should override equals

2015-08-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9750:
---

Assignee: Apache Spark

 SparseMatrix should override equals
 ---

 Key: SPARK-9750
 URL: https://issues.apache.org/jira/browse/SPARK-9750
 Project: Spark
  Issue Type: Bug
Reporter: Feynman Liang
Assignee: Apache Spark
Priority: Blocker

 [SparseMatrix|https://github.com/apache/spark/blob/9897cc5e3d6c70f7e45e887e2c6fc24dfa1adada/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L479]
  should override equals to ensure that two instances of the same matrix are 
 equal.
 This implementation should take into account the {{isTransposed}} flag, since 
 {{values}} may not be in the same order for two otherwise equal matrices.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9750) SparseMatrix should override equals

2015-08-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9750:
---

Assignee: (was: Apache Spark)

 SparseMatrix should override equals
 ---

 Key: SPARK-9750
 URL: https://issues.apache.org/jira/browse/SPARK-9750
 Project: Spark
  Issue Type: Bug
Reporter: Feynman Liang
Priority: Blocker

 [SparseMatrix|https://github.com/apache/spark/blob/9897cc5e3d6c70f7e45e887e2c6fc24dfa1adada/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L479]
  should override equals to ensure that two instances of the same matrix are 
 equal.
 This implementation should take into account the {{isTransposed}} flag, since 
 {{values}} may not be in the same order for two otherwise equal matrices.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9754) Remove TypeCheck in debug package

2015-08-07 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-9754:
--

 Summary: Remove TypeCheck in debug package
 Key: SPARK-9754
 URL: https://issues.apache.org/jira/browse/SPARK-9754
 Project: Spark
  Issue Type: Task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


TypeCheck no longer applies in the new Tungsten world.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9568) Spark MLlib 1.5.0 testing umbrella

2015-08-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9568:
-
Description: 
h2. API

* Check binary API compatibility (SPARK-9658)
* Audit new public APIs (from the generated html doc)
** Scala (SPARK-9660)
** Java compatibility (SPARK-9661)
** Python coverage (SPARK-9662)
* Check Experimental, DeveloperApi tags (SPARK-9665)

h2. Algorithms and performance

*Performance*
* _List any other missing performance tests from spark-perf here_
* LDA online/EM (SPARK-7455)
* ElasticNet for linear regression and logistic regression (SPARK-7456)
* PIC (SPARK-7454)
* ALS.recommendAll (SPARK-7457)
* perf-tests in Python (SPARK-7539)

*Correctness*
* model save/load (SPARK-9666)

h2. Documentation and example code

* For new algorithms, create JIRAs for updating the user guide (SPARK-9668)
* For major components, create JIRAs for example code (SPARK-9670)
* Update Programming Guide for 1.4 (towards end of QA) (SPARK-9671)

  was:
h2. API

* Check binary API compatibility
* Audit new public APIs (from the generated html doc)
** Scala
** Java compatibility
** Python coverage
* Check Experimental, DeveloperApi tags

h2. Algorithms and performance

*Performance*
* _List any other missing performance tests from spark-perf here_
* LDA online/EM (SPARK-7455)
* ElasticNet for linear regression and logistic regression (SPARK-7456)
* PIC (SPARK-7454)
* ALS.recommendAll (SPARK-7457)
* perf-tests in Python (SPARK-7539)

*Correctness*
* model save/load (SPARK-9666)

h2. Documentation and example code

* For new algorithms, create JIRAs for updating the user guide (SPARK-9668)
* For major components, create JIRAs for example code (SPARK-9670)
* Update Programming Guide for 1.4 (towards end of QA) (SPARK-9671)


 Spark MLlib 1.5.0 testing umbrella
 --

 Key: SPARK-9568
 URL: https://issues.apache.org/jira/browse/SPARK-9568
 Project: Spark
  Issue Type: Umbrella
  Components: MLlib
Reporter: Reynold Xin
Assignee: Xiangrui Meng

 h2. API
 * Check binary API compatibility (SPARK-9658)
 * Audit new public APIs (from the generated html doc)
 ** Scala (SPARK-9660)
 ** Java compatibility (SPARK-9661)
 ** Python coverage (SPARK-9662)
 * Check Experimental, DeveloperApi tags (SPARK-9665)
 h2. Algorithms and performance
 *Performance*
 * _List any other missing performance tests from spark-perf here_
 * LDA online/EM (SPARK-7455)
 * ElasticNet for linear regression and logistic regression (SPARK-7456)
 * PIC (SPARK-7454)
 * ALS.recommendAll (SPARK-7457)
 * perf-tests in Python (SPARK-7539)
 *Correctness*
 * model save/load (SPARK-9666)
 h2. Documentation and example code
 * For new algorithms, create JIRAs for updating the user guide (SPARK-9668)
 * For major components, create JIRAs for example code (SPARK-9670)
 * Update Programming Guide for 1.4 (towards end of QA) (SPARK-9671)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9755) Add method documentation to MultivariateOnlineSummarizer

2015-08-07 Thread Feynman Liang (JIRA)
Feynman Liang created SPARK-9755:


 Summary: Add method documentation to MultivariateOnlineSummarizer
 Key: SPARK-9755
 URL: https://issues.apache.org/jira/browse/SPARK-9755
 Project: Spark
  Issue Type: Documentation
  Components: MLlib
Reporter: Feynman Liang
Priority: Minor


Docs present in 1.4 are lost in the current 1.5 branch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9755) Add method documentation to MultivariateOnlineSummarizer

2015-08-07 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662569#comment-14662569
 ] 

Feynman Liang commented on SPARK-9755:
--

Working on this.

 Add method documentation to MultivariateOnlineSummarizer
 

 Key: SPARK-9755
 URL: https://issues.apache.org/jira/browse/SPARK-9755
 Project: Spark
  Issue Type: Documentation
  Components: MLlib
Reporter: Feynman Liang
Priority: Minor

 Docs present in 1.4 are lost in the current 1.5 branch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9714) Cannot insert into a table using pySpark

2015-08-07 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-9714:

Description: 
This is a bug on the master branch. After creating the table ("yun" is the 
table name) with the corresponding fields, I ran the following command.

from pyspark.sql import *
sc.parallelize([Row(id=1, name="test", 
description="")]).toDF().write.mode("append").saveAsTable("yun")

I get the following error:

Py4JJavaError: An error occurred while calling o100.saveAsTable.
: org.apache.spark.SparkException: Task not serializable

Caused by: java.io.NotSerializableException: org.apache.hadoop.fs.Path
Serialization stack:
- object not serializable (class: org.apache.hadoop.fs.Path, value: 
/user/hive/warehouse/yun)
- field (class: org.apache.hadoop.hive.ql.metadata.Table, name: path, 
type: class org.apache.hadoop.fs.Path)
- object (class org.apache.hadoop.hive.ql.metadata.Table, yun)
- field (class: org.apache.hadoop.hive.ql.metadata.Partition, name: 
table, type: class org.apache.hadoop.hive.ql.metadata.Table)
- object (class org.apache.hadoop.hive.ql.metadata.Partition, yun())
- field (class: scala.collection.immutable.Stream$Cons, name: hd, type: 
class java.lang.Object)
- object (class scala.collection.immutable.Stream$Cons, Stream(yun()))
- field (class: scala.collection.immutable.Stream$$anonfun$map$1, name: 
$outer, type: class scala.collection.immutable.Stream)
- object (class scala.collection.immutable.Stream$$anonfun$map$1, 
function0)
- field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
interface scala.Function0)
- object (class scala.collection.immutable.Stream$Cons, 
Stream(HivePartition(List(),HiveStorageDescriptor(/user/hive/warehouse/yun,org.apache.hadoop.mapred.TextInputFormat,org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,Map(serialization.format
 -> 1)
- field (class: scala.collection.immutable.Stream$$anonfun$map$1, name: 
$outer, type: class scala.collection.immutable.Stream)
- object (class scala.collection.immutable.Stream$$anonfun$map$1, 
function0)
- field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
interface scala.Function0)
- object (class scala.collection.immutable.Stream$Cons, 
Stream(/user/hive/warehouse/yun))
- field (class: org.apache.spark.sql.hive.MetastoreRelation, name: 
paths, type: interface scala.collection.Seq)
- object (class org.apache.spark.sql.hive.MetastoreRelation, 
MetastoreRelation default, yun, None
)
- field (class: 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable, name: table, type: 
class org.apache.spark.sql.hive.MetastoreRelation)
- object (class 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable, InsertIntoHiveTable 
(MetastoreRelation default, yun, None), Map(), false, false
 ConvertToSafe
  TungstenProject [CAST(description#10, FloatType) AS 
description#16,CAST(id#11L, StringType) AS id#17,name#12]
   PhysicalRDD [description#10,id#11L,name#12], MapPartitionsRDD[17] at 
applySchemaToPythonRDD at NativeMethodAccessorImpl.java:-2
)
- field (class: 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3,
 name: $outer, type: class 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable)
- object (class 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3,
 function2)
at 
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at 
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
... 30 more


  was:
This is a bug on the master branch. After creating the table ("yun" is the 
table name) with the corresponding fields, I ran the following command.

from pyspark.sql import *
sc.parallelize([Row(id=1, name="test", 
description="")]).toDF().write.mode("append").saveAsTable("yun")

I get the following error:

Py4JJavaError: An error occurred while calling o100.saveAsTable.
: org.apache.spark.SparkException: Task not serializable

Caused by: java.io.NotSerializableException: org.apache.hadoop.fs.Path
Serialization stack:
- object not serializable (class: org.apache.hadoop.fs.Path, value: 
dbfs:/user/hive/warehouse/yun)
- field (class: org.apache.hadoop.hive.ql.metadata.Table, name: path, 
type: class org.apache.hadoop.fs.Path)
- object (class org.apache.hadoop.hive.ql.metadata.Table, yun)
- field (class: org.apache.hadoop.hive.ql.metadata.Partition, name: 
table, type: class org.apache.hadoop.hive.ql.metadata.Table)

[jira] [Created] (SPARK-9756) Make auxiliary constructors for ML decision trees private

2015-08-07 Thread Feynman Liang (JIRA)
Feynman Liang created SPARK-9756:


 Summary: Make auxiliary constructors for ML decision trees private
 Key: SPARK-9756
 URL: https://issues.apache.org/jira/browse/SPARK-9756
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Feynman Liang
Priority: Minor
 Fix For: 1.5.0


These classes should not (and actually cannot) be instantiated directly 
because there is currently no public constructor for {{Node}}.
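
A minimal sketch of the intended shape (hypothetical class and field names, 
with a placeholder {{Node}} to keep the sketch self-contained):

{code}
// Placeholder for the real tree node type.
class Node

// The private[ml] qualifier hides the auxiliary constructor from user code
// while keeping it callable inside the spark.ml package (the sketch assumes
// the class lives under a package ending in "ml").
class ToyTreeModel private[ml] (val uid: String, val rootNode: Node) {
  private[ml] def this(rootNode: Node) =
    this("toyTree_" + java.util.UUID.randomUUID(), rootNode)
}
{code}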



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9752) Sample operator should avoid row copying and support UnsafeRow

2015-08-07 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-9752:
--

 Summary: Sample operator should avoid row copying and support 
UnsafeRow
 Key: SPARK-9752
 URL: https://issues.apache.org/jira/browse/SPARK-9752
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9751) Audit operators to make sure they can support UnsafeRows

2015-08-07 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-9751:
--

 Summary: Audit operators to make sure they can support UnsafeRows
 Key: SPARK-9751
 URL: https://issues.apache.org/jira/browse/SPARK-9751
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


An umbrella ticket to track various operators that should be able to support 
UnsafeRow to avoid copying.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9746) PairRDDFunctions.countByKey: values/counts always 1

2015-08-07 Thread Andreas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662510#comment-14662510
 ] 

Andreas commented on SPARK-9746:


Still, the scaladoc "Count the number of elements for each key, collecting the 
results to a local Map" is misleading to me. Maybe it should read "Count the 
number of (distinct? or whatever) keys."
For whatever purpose this is needed.


 PairRDDFunctions.countByKey: values/counts always 1
 ---

 Key: SPARK-9746
 URL: https://issues.apache.org/jira/browse/SPARK-9746
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Andreas

 org.apache.spark.rdd.PairRDDFunctions#countByKey(): Map[K, Long] = self.withScope {
   self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
 }
 obviously always returns count 1 for each key.
 If I understand the docs correctly I would expect this implementation:
 self.mapValues(_.size).reduceByKey(_ + _).collect().toMap



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9720) spark.ml Identifiable types should have UID in toString methods

2015-08-07 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662514#comment-14662514
 ] 

Joseph K. Bradley commented on SPARK-9720:
--

I like the proposal, but I don't think we should break APIs...which 
unfortunately means we will need to stick with encouragement instead of 
enforcement.  Would you mind sending a PR to update those classes with issues?

 spark.ml Identifiable types should have UID in toString methods
 ---

 Key: SPARK-9720
 URL: https://issues.apache.org/jira/browse/SPARK-9720
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Priority: Minor
  Labels: starter

 It would be nice to print the UID (instance name) in toString methods.  
 That's the default behavior for Identifiable, but some types override the 
 default toString and do not print the UID.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9753) TungstenAggregate should also accept InternalRow instead of just UnsafeRow

2015-08-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9753:
---

Assignee: Yin Huai  (was: Apache Spark)

 TungstenAggregate should also accept InternalRow instead of just UnsafeRow
 --

 Key: SPARK-9753
 URL: https://issues.apache.org/jira/browse/SPARK-9753
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Blocker

 Since we need to project out the key and the value, there is no need to 
 accept only UnsafeRows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9753) TungstenAggregate should also accept InternalRow instead of just UnsafeRow

2015-08-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9753:
---

Assignee: Apache Spark  (was: Yin Huai)

 TungstenAggregate should also accept InternalRow instead of just UnsafeRow
 --

 Key: SPARK-9753
 URL: https://issues.apache.org/jira/browse/SPARK-9753
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Assignee: Apache Spark
Priority: Blocker

 Since we need to project out the key and the value, there is no need to 
 accept only UnsafeRows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9746) PairRDDFunctions.countByKey: values/counts always 1

2015-08-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662523#comment-14662523
 ] 

Sean Owen commented on SPARK-9746:
--

It does not count the number of distinct keys, nor does it count distinct 
values for the key, so I don't think that's accurate. It counts the number of 
times each key appears. I suppose there are many ways of saying this; here it 
says it counts the number of elements that include each key, which seems like a 
reasonable description of the behavior.

 PairRDDFunctions.countByKey: values/counts always 1
 ---

 Key: SPARK-9746
 URL: https://issues.apache.org/jira/browse/SPARK-9746
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Andreas

 org.apache.spark.rdd.PairRDDFunctions#countByKey(): Map[K, Long] = self.withScope {
   self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
 }
 obviously always returns count 1 for each key.
 If I understand the docs correctly I would expect this implementation:
 self.mapValues(_.size).reduceByKey(_ + _).collect().toMap



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7454) Perf test for power iteration clustering (PIC)

2015-08-07 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662534#comment-14662534
 ] 

Joseph K. Bradley commented on SPARK-7454:
--

[~javadba] I should have pinged you before, but could you please send a PR for 
that perf-test?  Thank you!

 Perf test for power iteration clustering (PIC)
 --

 Key: SPARK-7454
 URL: https://issues.apache.org/jira/browse/SPARK-7454
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Stephen Boesch





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9754) Remove TypeCheck in debug package

2015-08-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662540#comment-14662540
 ] 

Apache Spark commented on SPARK-9754:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/8043

 Remove TypeCheck in debug package
 -

 Key: SPARK-9754
 URL: https://issues.apache.org/jira/browse/SPARK-9754
 Project: Spark
  Issue Type: Task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 TypeCheck no longer applies in the new Tungsten world.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9754) Remove TypeCheck in debug package

2015-08-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9754:
---

Assignee: Reynold Xin  (was: Apache Spark)

 Remove TypeCheck in debug package
 -

 Key: SPARK-9754
 URL: https://issues.apache.org/jira/browse/SPARK-9754
 Project: Spark
  Issue Type: Task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 TypeCheck no longer applies in the new Tungsten world.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9750) SparseMatrix should override equals

2015-08-07 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662566#comment-14662566
 ] 

Feynman Liang commented on SPARK-9750:
--

Yep.

 SparseMatrix should override equals
 ---

 Key: SPARK-9750
 URL: https://issues.apache.org/jira/browse/SPARK-9750
 Project: Spark
  Issue Type: Bug
Reporter: Feynman Liang
Priority: Blocker

 [SparseMatrix|https://github.com/apache/spark/blob/9897cc5e3d6c70f7e45e887e2c6fc24dfa1adada/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L479]
  should override equals to ensure that two instances of the same matrix are 
 equal.
 This implementation should take into account the {{isTransposed}} flag, since 
 {{values}} may not be in the same order for two otherwise equal matrices.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9620) generated UnsafeProjection does not support many columns or large expressions

2015-08-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9620:
---

Assignee: (was: Apache Spark)

 generated UnsafeProjection does not support many columns or large expressions
 

 Key: SPARK-9620
 URL: https://issues.apache.org/jira/browse/SPARK-9620
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu
Priority: Critical

 We put all the expressions in one function of UnsafeProjection, which can 
 reach the JVM's 64KB method code size limit.
 We should split them into multiple functions, as we do for 
 MutableProjection and SafeProjection.
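
 A minimal sketch of the splitting idea (hypothetical helper names; the real 
 codegen has far more machinery):

{code}
// Group the generated per-expression statements into chunks and emit one
// private method per chunk, so no single generated method exceeds the JVM's
// 64KB-per-method bytecode limit.
def splitIntoMethods(statements: Seq[String], chunkSize: Int = 100): String = {
  val chunks  = statements.grouped(chunkSize).toSeq
  val methods = chunks.zipWithIndex.map { case (stmts, i) =>
    s"private void apply_$i(InternalRow row) {\n  ${stmts.mkString("\n  ")}\n}"
  }
  val calls = methods.indices.map(i => s"apply_$i(row);")
  methods.mkString("\n\n") + "\n\n" + calls.mkString("\n")
}
{code}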



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9620) generated UnsafeProjection does not support many columns or large expressions

2015-08-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9620:
---

Assignee: Apache Spark

 generated UnsafeProjection does not support many columns or large expressions
 

 Key: SPARK-9620
 URL: https://issues.apache.org/jira/browse/SPARK-9620
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu
Assignee: Apache Spark
Priority: Critical

 We put all the expressions in one function of UnsafeProjection, which can 
 reach the JVM's 64KB method code size limit.
 We should split them into multiple functions, as we do for 
 MutableProjection and SafeProjection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9620) generated UnsafeProjection does not support many columns or large expressions

2015-08-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662579#comment-14662579
 ] 

Apache Spark commented on SPARK-9620:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/8044

 generated UnsafeProjection does not support many columns or large expressions
 

 Key: SPARK-9620
 URL: https://issues.apache.org/jira/browse/SPARK-9620
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu
Priority: Critical

 We put all the expressions in one function of UnsafeProjection, which can 
 reach the JVM's 64KB method code size limit.
 We should split them into multiple functions, as we do for 
 MutableProjection and SafeProjection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9757) Can't create persistent data source tables with decimal

2015-08-07 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-9757:
---

 Summary: Can't create persistent data source tables with decimal
 Key: SPARK-9757
 URL: https://issues.apache.org/jira/browse/SPARK-9757
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Michael Armbrust
Priority: Blocker


{code}
Caused by: java.lang.UnsupportedOperationException: Parquet does not support 
decimal. See HIVE-6384
at 
org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.getObjectInspector(ArrayWritableObjectInspector.java:102)
at 
org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.init(ArrayWritableObjectInspector.java:60)
at 
org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.initialize(ParquetHiveSerDe.java:113)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:339)
at 
org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:288)
at 
org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:194)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:597)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:576)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply$mcV$sp(ClientWrapper.scala:358)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply(ClientWrapper.scala:356)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply(ClientWrapper.scala:356)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
at 
org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211)
at 
org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248)
at 
org.apache.spark.sql.hive.client.ClientWrapper.createTable(ClientWrapper.scala:356)
at 
org.apache.spark.sql.hive.HiveMetastoreCatalog.createDataSourceTable(HiveMetastoreCatalog.scala:351)
at 
org.apache.spark.sql.hive.HiveMetastoreCatalog.createDataSourceTable(HiveMetastoreCatalog.scala:198)
at 
org.apache.spark.sql.hive.execution.CreateMetastoreDataSource.run(commands.scala:152)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9719) spark.ml NaiveBayes doc cleanups

2015-08-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9719:
---

Assignee: (was: Apache Spark)

 spark.ml NaiveBayes doc cleanups
 

 Key: SPARK-9719
 URL: https://issues.apache.org/jira/browse/SPARK-9719
 Project: Spark
  Issue Type: Documentation
  Components: ML, PySpark
Reporter: Joseph K. Bradley
Priority: Minor

 spark.ml NaiveBayesModel: Add Scala and Python doc for pi, theta
 Add setParam tag to NaiveBayes setModelType



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9719) spark.ml NaiveBayes doc cleanups

2015-08-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662616#comment-14662616
 ] 

Apache Spark commented on SPARK-9719:
-

User 'feynmanliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/8047

 spark.ml NaiveBayes doc cleanups
 

 Key: SPARK-9719
 URL: https://issues.apache.org/jira/browse/SPARK-9719
 Project: Spark
  Issue Type: Documentation
  Components: ML, PySpark
Reporter: Joseph K. Bradley
Priority: Minor

 spark.ml NaiveBayesModel: Add Scala and Python doc for pi, theta
 Add setParam tag to NaiveBayes setModelType



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9719) spark.ml NaiveBayes doc cleanups

2015-08-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9719:
---

Assignee: Apache Spark

 spark.ml NaiveBayes doc cleanups
 

 Key: SPARK-9719
 URL: https://issues.apache.org/jira/browse/SPARK-9719
 Project: Spark
  Issue Type: Documentation
  Components: ML, PySpark
Reporter: Joseph K. Bradley
Assignee: Apache Spark
Priority: Minor

 spark.ml NaiveBayesModel: Add Scala and Python doc for pi, theta
 Add setParam tag to NaiveBayes setModelType



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-9746) PairRDDFunctions.countByKey: values/counts always 1

2015-08-07 Thread Andreas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas reopened SPARK-9746:


Sorry, but I don't agree.

cntxt.parallelize(List(("a", 1), ("a", 2))).groupBy(_._1).countByKey()

returns 'Map(a -> 1)' but should, in my opinion, return 'Map(a -> 2)'.

If the values (counts) are irrelevant, then why is this function called 
*count*ByKey, and why does it return a Map instead of a Set?
The current implementation has no added value compared to 
'pairRDD.keys.collect().toSet'.

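For comparison, a minimal sketch of what each call yields (assuming a 
SparkContext {{sc}}):

{code}
val pairs = sc.parallelize(List(("a", 1), ("a", 2)))
pairs.countByKey()                 // Map(a -> 2): two elements carry key "a"
pairs.groupBy(_._1).countByKey()   // Map(a -> 1): groupBy collapses "a" into one element
{code}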

 PairRDDFunctions.countByKey: values/counts always 1
 ---

 Key: SPARK-9746
 URL: https://issues.apache.org/jira/browse/SPARK-9746
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Andreas

 org.apache.spark.rdd.PairRDDFunctions#countByKey(): Map[K, Long] = self.withScope {
   self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
 }
 obviously always returns count 1 for each key.
 If I understand the docs correctly I would expect this implementation:
 self.mapValues(_.size).reduceByKey(_ + _).collect().toMap



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9660) ML 1.5 QA: API: New Scala APIs, docs

2015-08-07 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662364#comment-14662364
 ] 

Feynman Liang commented on SPARK-9660:
--

{{LogisticRegressionModel$.load}} missing short description.

 ML 1.5 QA: API: New Scala APIs, docs
 

 Key: SPARK-9660
 URL: https://issues.apache.org/jira/browse/SPARK-9660
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, MLlib
Reporter: Joseph K. Bradley

 Audit new public Scala APIs added to MLlib.  Take note of:
 * Protected/public classes or methods.  If access can be more private, then 
 it should be.
 * Also look for non-sealed traits (a brief illustration follows this list).
 * Documentation: Missing?  Bad links or formatting?
 *Make sure to check the object doc!*
 As you find issues, please comment here, or better yet create JIRAs and link 
 them.
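
As a minimal illustration of the non-sealed-traits point (a sketch, not code
from the audit): sealing a public trait keeps the set of implementations
closed, so user pattern matches can be checked for exhaustiveness.

sealed trait ModelType                   // only this source file may extend it
case object Multinomial extends ModelType
case object Bernoulli extends ModelType

def name(m: ModelType): String = m match {
  case Multinomial => "multinomial"
  case Bernoulli   => "bernoulli"        // compiler flags any missing case
}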



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9660) ML 1.5 QA: API: New Scala APIs, docs

2015-08-07 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662374#comment-14662374
 ] 

Feynman Liang edited comment on SPARK-9660 at 8/7/15 8:26 PM:
--

{{SVMModel}} missing short descriptions for {{save}}, {{load}}, and {{toString}}


was (Author: fliang):
{{SVMModel}} missing short descriptions for {{save}} and {{toString}}

 ML 1.5 QA: API: New Scala APIs, docs
 

 Key: SPARK-9660
 URL: https://issues.apache.org/jira/browse/SPARK-9660
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, MLlib
Reporter: Joseph K. Bradley

 Audit new public Scala APIs added to MLlib.  Take note of:
 * Protected/public classes or methods.  If access can be more private, then 
 it should be.
 * Also look for non-sealed traits.
 * Documentation: Missing?  Bad links or formatting?
 *Make sure to check the object doc!*
 As you find issues, please comment here, or better yet create JIRAs and link 
 them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9738) remove FromUnsafe and add its codegen version to GenerateSafe

2015-08-07 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662403#comment-14662403
 ] 

Josh Rosen commented on SPARK-9738:
---

[~davies], [~cloud_fan], [~rxin], should this JIRA be converted to a subtask or 
targeted in a Tungsten epic? Can we add a description explaining the motivation 
for this change?

 remove FromUnsafe and add its codegen version to GenerateSafe
 -

 Key: SPARK-9738
 URL: https://issues.apache.org/jira/browse/SPARK-9738
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9747) Avoid starving an unsafe operator in an aggregate

2015-08-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9747:
---

Assignee: Andrew Or  (was: Apache Spark)

 Avoid starving an unsafe operator in an aggregate
 -

 Key: SPARK-9747
 URL: https://issues.apache.org/jira/browse/SPARK-9747
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.5.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Blocker

 This mainly concerns TungstenAggregate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9747) Avoid starving an unsafe operator in an aggregate

2015-08-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9747:
---

Assignee: Apache Spark  (was: Andrew Or)

 Avoid starving an unsafe operator in an aggregate
 -

 Key: SPARK-9747
 URL: https://issues.apache.org/jira/browse/SPARK-9747
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.5.0
Reporter: Andrew Or
Assignee: Apache Spark
Priority: Blocker

 This mainly concerns TungstenAggregate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9747) Avoid starving an unsafe operator in an aggregate

2015-08-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662413#comment-14662413
 ] 

Apache Spark commented on SPARK-9747:
-

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/8038

 Avoid starving an unsafe operator in an aggregate
 -

 Key: SPARK-9747
 URL: https://issues.apache.org/jira/browse/SPARK-9747
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.5.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Blocker

 This mainly concerns TungstenAggregate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9746) PairRDDFunctions.countByKey: values/counts always 1

2015-08-07 Thread Andreas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662430#comment-14662430
 ] 

Andreas commented on SPARK-9746:


Sorry to waste your time.

But in my understanding, in a PairRDD[K,V] each key (K) should occur only once 
(it's like a Map[K,V]). It's by design that the keys in a map are unique (they 
occur only once); there is no sense in counting the number of occurrences of a 
key in a map (it is always one by design).
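
For what it's worth, a minimal sketch (assuming an existing SparkContext
{{sc}}) showing that a pair RDD, unlike a Map, does allow duplicate keys:

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2)))
pairs.keys.collect()   // Array(a, a) -- the key "a" occurs twice
pairs.countByKey()     // Map(a -> 2) -- counts pairs per key, not distinct keys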

 PairRDDFunctions.countByKey: values/counts always 1
 ---

 Key: SPARK-9746
 URL: https://issues.apache.org/jira/browse/SPARK-9746
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Andreas

 org.apache.spark.rdd.PairRDDFunctions:
 def countByKey(): Map[K, Long] = self.withScope {
   self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
 }
 obviously always returns count 1 for each key.
 If I understand the docs correctly I would expect this implementation:
 self.mapValues(_.size).reduceByKey(_ + _).collect().toMap



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9476) Kafka stream loses leader after 2h of operation

2015-08-07 Thread Ruben Ramalho (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662446#comment-14662446
 ] 

Ruben Ramalho edited comment on SPARK-9476 at 8/7/15 8:58 PM:
--

Sorry for the late reply; I promise to keep my response delay much smaller from 
now on.

There aren't any error logs, but this problem compromises the normal operation 
of the analytics server.

Yes, simpler jobs do run in the same environment. This same setup manages to 
run correctly for two hours; it's after 2h of operation that this problem 
arises, which is strange.
Unfortunately I cannot share the relevant code, at least not in its entirety, 
but I can share what I am doing. I am consuming data from Apache Kafka, as 
positional updates, doing window operations over this data, and extracting 
features. These features are then fed to machine learning algorithms, and tips 
are generated and fed back to Kafka (a different topic). If you want specific 
parts of the code, I can provide you with that!

I was using Apache Kafka 0.8.2.0 when this issue appeared; then I updated to 
0.8.2.1 (in hopes of the problem being fixed), but the issue persists. I think 
Apache Spark at some point is corrupting the Apache Kafka topics; I cannot 
isolate why that is happening, though. I have used both the Kafka direct stream 
and the regular stream, and the problem persists in both.

Thank you,

R. Ramalho


was (Author: r.ramalho):
Sorry for the late reply; I promise to keep my response delay much smaller from 
now on.

There aren't any error logs, but this problem compromises the normal operation 
of analytics server.

Yes, simpler jobs do run in the same environment. This same setup manages to 
run correctly for two hours; it's after 2h of operation that this problem 
arises, which is strange.
Unfortunately I cannot share the relevant code, at least not in its entirety, 
but I can share what I am doing. I am consuming data from Apache Kafka, as 
positional updates, doing window operations over this data, and extracting 
features. These features are then fed to machine learning algorithms, and tips 
are generated and fed back to Kafka (a different topic). If you want specific 
parts of the code, I can provide you with that!

I was using Apache Kafka 0.8.2.0 when this issue appeared; then I updated to 
0.8.2.1 (in hopes of the problem being fixed), but the issue persists. I think 
Apache Spark at some point is corrupting the Apache Kafka topics; I cannot 
isolate why that is happening, though. I have used both the Kafka direct stream 
and the regular stream, and the problem persists in both.

Thank you,

R. Ramalho
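
For reference, a minimal sketch of the pipeline shape described above, using
the Spark 1.4 direct-stream API (the broker address, topic name, window
durations, and the existing StreamingContext {{ssc}} are placeholders and
assumptions, not the reporter's code):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.Minutes
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "192.168.3.23:3000")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("updates"))

// 1h window over the positional updates, recomputed every 5 minutes.
val windowed = stream.map(_._2).window(Minutes(60), Minutes(5))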

 Kafka stream loses leader after 2h of operation 
 

 Key: SPARK-9476
 URL: https://issues.apache.org/jira/browse/SPARK-9476
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.1
 Environment: Docker, Centos, Spark standalone, core i7, 8Gb
Reporter: Ruben Ramalho

 This seems to happen every 2h, it happens both with the direct stream and 
 regular stream, I'm doing window operations over a 1h period (if that can 
 help).
 Here's part of the error message:
 2015-07-30 13:27:23 WARN  ClientUtils$:89 - Fetching topic metadata with 
 correlation id 10 for topics [Set(updates)] from broker 
 [id:0,host:192.168.3.23,port:3000] failed
 java.nio.channels.ClosedChannelException
   at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
   at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:73)
   at 
 kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:72)
   at kafka.producer.SyncProducer.send(SyncProducer.scala:113)
   at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:58)
   at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:93)
   at 
 kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:66)
   at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
 2015-07-30 13:27:23 INFO  SyncProducer:68 - Disconnecting from 
 192.168.3.23:3000
 2015-07-30 13:27:23 WARN  ConsumerFetcherManager$LeaderFinderThread:89 - 
 [spark-group_81563e123e9f-1438259236988-fc3d82bf-leader-finder-thread], 
 Failed to find leader for Set([updates,0])
 kafka.common.KafkaException: fetching topic metadata for topics 
 [Set(oversight-updates)] from broker 
 [ArrayBuffer(id:0,host:192.168.3.23,port:3000)] failed
   at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:72)
   at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:93)
   at 
 kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:66)
   at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
 Caused by: 

[jira] [Commented] (SPARK-9660) ML 1.5 QA: API: New Scala APIs, docs

2015-08-07 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662452#comment-14662452
 ] 

Feynman Liang commented on SPARK-9660:
--

StreamingLinearRegressionWithSGD's 
[setConvergenceTol|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingLinearRegressionWithSGD.scala#L88]
 and 
[setInitialWeights|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingLinearRegressionWithSGD.scala#L83]
 should document default values.
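
A hedged sketch of the setters in question; the concrete values below are
placeholders for illustration, not the (currently undocumented) defaults:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(3))  // hypothetical 3-feature model
  .setConvergenceTol(0.001)             // example tolerance for the SGD optimizer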

 ML 1.5 QA: API: New Scala APIs, docs
 

 Key: SPARK-9660
 URL: https://issues.apache.org/jira/browse/SPARK-9660
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, MLlib
Reporter: Joseph K. Bradley

 Audit new public Scala APIs added to MLlib.  Take note of:
 * Protected/public classes or methods.  If access can be more private, then 
 it should be.
 * Also look for non-sealed traits.
 * Documentation: Missing?  Bad links or formatting?
 *Make sure to check the object doc!*
 As you find issues, please comment here, or better yet create JIRAs and link 
 them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9749) DenseMatrix equals does not account for isTransposed

2015-08-07 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662475#comment-14662475
 ] 

Feynman Liang commented on SPARK-9749:
--

Working on this.

 DenseMatrix equals does not account for isTransposed
 

 Key: SPARK-9749
 URL: https://issues.apache.org/jira/browse/SPARK-9749
 Project: Spark
  Issue Type: Bug
Reporter: Feynman Liang
Priority: Blocker

 A matrix is not always equal to its transpose, but the current implementation 
 of {{equals}} in 
 [DenseMatrix|https://github.com/apache/spark/blob/9897cc5e3d6c70f7e45e887e2c6fc24dfa1adada/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L261]
  does not account for the {{isTransposed}} flag.
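
To illustrate the concern, a minimal sketch of two layouts of the same logical
matrix, using the public DenseMatrix constructor that takes isTransposed:

import org.apache.spark.mllib.linalg.DenseMatrix

// Column-major storage: values are read down each column.
val a = new DenseMatrix(2, 2, Array(1.0, 2.0, 3.0, 4.0))
// Row-major storage (isTransposed = true) of the same logical matrix.
val b = new DenseMatrix(2, 2, Array(1.0, 3.0, 2.0, 4.0), true)
// Both encode [[1.0, 3.0], [2.0, 4.0]]; equals must compare logical entries
// rather than the raw values arrays to treat a and b as equal.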



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9753) TungstenAggregate should also accept InternalRow instead of just UnsafeRow

2015-08-07 Thread Yin Huai (JIRA)
Yin Huai created SPARK-9753:
---

 Summary: TungstenAggregate should also accept InternalRow instead 
of just UnsafeRow
 Key: SPARK-9753
 URL: https://issues.apache.org/jira/browse/SPARK-9753
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Blocker


Since we need to project out the key and value anyway, there is no need to 
accept only UnsafeRows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9752) Sample operator should avoid row copying and support UnsafeRow

2015-08-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9752:
---

Assignee: Apache Spark  (was: Reynold Xin)

 Sample operator should avoid row copying and support UnsafeRow
 --

 Key: SPARK-9752
 URL: https://issues.apache.org/jira/browse/SPARK-9752
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9752) Sample operator should avoid row copying and support UnsafeRow

2015-08-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9752:
---

Assignee: Reynold Xin  (was: Apache Spark)

 Sample operator should avoid row copying and support UnsafeRow
 --

 Key: SPARK-9752
 URL: https://issues.apache.org/jira/browse/SPARK-9752
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9752) Sample operator should avoid row copying and support UnsafeRow

2015-08-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662506#comment-14662506
 ] 

Apache Spark commented on SPARK-9752:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/8040

 Sample operator should avoid row copying and support UnsafeRow
 --

 Key: SPARK-9752
 URL: https://issues.apache.org/jira/browse/SPARK-9752
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9753) TungstenAggregate should also accept InternalRow instead of just UnsafeRow

2015-08-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662513#comment-14662513
 ] 

Apache Spark commented on SPARK-9753:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/8041

 TungstenAggregate should also accept InternalRow instead of just UnsafeRow
 --

 Key: SPARK-9753
 URL: https://issues.apache.org/jira/browse/SPARK-9753
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Blocker

 Since we need to project out the key and value anyway, there is no need to 
 accept only UnsafeRows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8481) GaussianMixtureModel predict accepting single vector

2015-08-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-8481.
--
Resolution: Fixed

Issue resolved by pull request 8039
[https://github.com/apache/spark/pull/8039]

 GaussianMixtureModel predict accepting single vector
 

 Key: SPARK-8481
 URL: https://issues.apache.org/jira/browse/SPARK-8481
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Dariusz Kobylarz
Assignee: Dariusz Kobylarz
Priority: Minor
  Labels: GaussianMixtureModel, MLlib
 Fix For: 1.5.0

   Original Estimate: 24h
  Remaining Estimate: 24h

 GaussianMixtureModel lacks a method to predict the cluster for a single input 
 vector without involving a Spark context, i.e.
 /** Maps given point to its cluster index. */
 def predict(point: Vector): Int
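
A usage sketch of the requested method (assuming {{gmm}} is an already-trained
GaussianMixtureModel):

import org.apache.spark.mllib.linalg.Vectors

// Local call: no RDD or SparkContext involved.
val cluster: Int = gmm.predict(Vectors.dense(1.0, 2.0))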



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-9749) DenseMatrix equals does not account for isTransposed

2015-08-07 Thread Feynman Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang closed SPARK-9749.

Resolution: Not A Problem

 DenseMatrix equals does not account for isTransposed
 

 Key: SPARK-9749
 URL: https://issues.apache.org/jira/browse/SPARK-9749
 Project: Spark
  Issue Type: Bug
Reporter: Feynman Liang
Priority: Blocker

 A matrix is not always equal to its transpose, but the current implementation 
 of {{equals}} in 
 [DenseMatrix|https://github.com/apache/spark/blob/9897cc5e3d6c70f7e45e887e2c6fc24dfa1adada/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L261]
  does not account for the {{isTransposed}} flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


