[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training

2015-02-18 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325829#comment-14325829
 ] 

Manoj Kumar commented on SPARK-5436:


The idea sounds great. I shall come up with a Pull Request in a day or two.

 Validate GradientBoostedTrees during training
 -

 Key: SPARK-5436
 URL: https://issues.apache.org/jira/browse/SPARK-5436
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 For Gradient Boosting, it would be valuable to compute test error on a 
 separate validation set during training.  That way, training could stop early 
 based on the test error (or some other metric specified by the user).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k

2015-02-15 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322332#comment-14322332
 ] 

Manoj Kumar commented on SPARK-5016:


[~mengxr] Can you please clarify a few things?

1. How should breezeData be keyed in order to parallelize across the k 
Gaussians, given that the assignment is soft? (A rough sketch of what I have in 
mind follows below.)
2. Even if we are able to do so, there are a few lines of code for the 
log-likelihood computation, as pointed out by [~tgaloppo], which are 
interdependent. How can that be done?
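
To make question 1 concrete, here is a rough sketch of the kind of keying I 
have in mind (purely hypothetical: responsibilities is an assumed helper 
returning the k soft weights for a point, and breezeData is assumed to be an 
RDD of Breeze dense vectors):

{code}
import breeze.linalg.{DenseVector => BDV}

// Hypothetical sketch: with soft assignments, every point contributes to every
// component, so each point is emitted k times, keyed by the component index i.
val keyed = breezeData.flatMap { x: BDV[Double] =>
  val gamma = responsibilities(x)  // assumed helper: the k soft weights for x
  (0 until k).map(i => (i, (gamma(i), x * gamma(i))))
}
// The k per-component reductions (total weight, weighted sum) then run in parallel.
val perComponent = keyed.reduceByKey { case ((w1, s1), (w2, s2)) =>
  (w1 + w2, s1 + s2)
}
{code}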

 GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
 ---

 Key: SPARK-5016
 URL: https://issues.apache.org/jira/browse/SPARK-5016
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 If numFeatures or k are large, GMM EM should distribute the matrix inverse 
 computation for Gaussian initialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k

2015-02-14 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321466#comment-14321466
 ] 

Manoj Kumar commented on SPARK-5016:


[~tgaloppo] If I understand [~mengxr]'s description correctly, that seems to be 
the way, i.e. to have a keyed RDD of ExpectationSums, so that the k updates run 
in parallel. But why is it awkward that each entry should be operated on by 
every reducer?

 GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
 ---

 Key: SPARK-5016
 URL: https://issues.apache.org/jira/browse/SPARK-5016
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 If numFeatures or k are large, GMM EM should distribute the matrix inverse 
 computation for Gaussian initialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k

2015-02-14 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321611#comment-14321611
 ] 

Manoj Kumar commented on SPARK-5016:


Is there a possibility that memory is shared between all k reducers (I haven't 
tried anything, just speculating here)?

 GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
 ---

 Key: SPARK-5016
 URL: https://issues.apache.org/jira/browse/SPARK-5016
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 If numFeatures or k are large, GMM EM should distribute the matrix inverse 
 computation for Gaussian initialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k

2015-02-14 Thread Manoj Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Kumar updated SPARK-5016:
---
Comment: was deleted

(was: Is there a possibility that memory is shared between all k reducers (I 
haven't tried anything, just speculating here)?)

 GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
 ---

 Key: SPARK-5016
 URL: https://issues.apache.org/jira/browse/SPARK-5016
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 If numFeatures or k are large, GMM EM should distribute the matrix inverse 
 computation for Gaussian initialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training

2015-02-12 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318021#comment-14318021
 ] 

Manoj Kumar commented on SPARK-5436:


Hi, I would like to give this a go. [~ChrisT], are you still working on this? 
Otherwise, I would love to carry it forward.

 Validate GradientBoostedTrees during training
 -

 Key: SPARK-5436
 URL: https://issues.apache.org/jira/browse/SPARK-5436
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 For Gradient Boosting, it would be valuable to compute test error on a 
 separate validation set during training.  That way, training could stop early 
 based on the test error (or some other metric specified by the user).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k

2015-02-10 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14315695#comment-14315695
 ] 

Manoj Kumar commented on SPARK-5016:


[~tgaloppo] How about a method setParallelGaussianUpdate (defaulting to false) 
that would allow the user to decide whether or not to use this feature? A rough 
sketch follows below.

[~mengxr] I would like to know your thoughts on this as well.
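
For concreteness, this is the setter I am picturing, following the 
builder-style setters already present in the class (names hypothetical, only a 
sketch):

{code}
// Hypothetical sketch of the proposed flag and its setter.
private var parallelGaussianUpdate: Boolean = false

/** Whether to distribute the per-Gaussian updates across the cluster (default: false). */
def setParallelGaussianUpdate(value: Boolean): this.type = {
  parallelGaussianUpdate = value
  this
}
{code}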

 GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
 ---

 Key: SPARK-5016
 URL: https://issues.apache.org/jira/browse/SPARK-5016
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 If numFeatures or k are large, GMM EM should distribute the matrix inverse 
 computation for Gaussian initialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k

2015-02-09 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311956#comment-14311956
 ] 

Manoj Kumar commented on SPARK-5016:


[~tgaloppo] I would like your inputs on this as well.

 GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
 ---

 Key: SPARK-5016
 URL: https://issues.apache.org/jira/browse/SPARK-5016
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 If numFeatures or k are large, GMM EM should distribute the matrix inverse 
 computation for Gaussian initialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k

2015-02-09 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312284#comment-14312284
 ] 

Manoj Kumar commented on SPARK-5016:


Well, I was misled by the JIRA description, which says Gaussian initialization. 
I was thinking it was this block of code, which initializes the k Gaussian 
distributions, that needs to be parallelized.

{code}
val samples = breezeData.takeSample(withReplacement = true, k * nSamples, seed)
(Array.fill(k)(1.0 / k), Array.tabulate(k) { i =>
  val slice = samples.view(i * nSamples, (i + 1) * nSamples)
  new MultivariateGaussian(vectorMean(slice), initCovariance(slice))
})
{code}

And next time, please don't post the code (or at least give a spoiler alert); 
it spoils the fun of fixing it :P

 GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
 ---

 Key: SPARK-5016
 URL: https://issues.apache.org/jira/browse/SPARK-5016
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 If numFeatures or k are large, GMM EM should distribute the matrix inverse 
 computation for Gaussian initialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-02-08 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311532#comment-14311532
 ] 

Manoj Kumar commented on SPARK-5021:


I have created a working pull request. Let us please take the discussion there.

 GaussianMixtureEM should be faster for SparseVector input
 -

 Key: SPARK-5021
 URL: https://issues.apache.org/jira/browse/SPARK-5021
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar

 GaussianMixtureEM currently converts everything to dense vectors.  It would 
 be nice if it were faster for SparseVectors (running in time linear in the 
 number of non-zero values).
 However, this may not be too important since clustering should rarely be done 
 in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k

2015-02-05 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14306688#comment-14306688
 ] 

Manoj Kumar edited comment on SPARK-5016 at 2/5/15 8:09 AM:


Hi, I would like to fix this (since I'm familiar, to an extent, with this part 
of the code), and maybe we could merge this before the sparse-input issue.

1. As a heuristic, how large should k be?
2. By distribute, do you mean storing samples 
(https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L140)
 as a collection using sc.parallelize, so that they can be operated on in 
parallel across k? A rough sketch follows below. What role does numFeatures 
play?
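
Something like the following is what I am picturing for point 2 (only a 
sketch; it assumes the samples, nSamples, vectorMean, and initCovariance from 
the linked code, and that those helpers are usable inside closures):

{code}
// Hypothetical sketch: build the k initial Gaussians (and hence compute the k
// matrix inverses inside MultivariateGaussian) in parallel rather than locally.
val samplesBc = sc.broadcast(samples)
val gaussians = sc.parallelize(0 until k, k).map { i =>
  val slice = samplesBc.value.view(i * nSamples, (i + 1) * nSamples)
  new MultivariateGaussian(vectorMean(slice), initCovariance(slice))
}.collect()
{code}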

Thanks.


was (Author: mechcoder):
Hi, I would like to fix this (since I'm familiar, to an extent, with this part 
of the code), and maybe we could merge this before the sparse-input issue.

1. As a heuristic, how large should k be?
2. By distribute, do you mean storing samples 
(https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L140)
 as a collection using sc.parallelize, so that they can be operated on in 
parallel across k?

Thanks.

 GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
 ---

 Key: SPARK-5016
 URL: https://issues.apache.org/jira/browse/SPARK-5016
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 If numFeatures or k are large, GMM EM should distribute the matrix inverse 
 computation for Gaussian initialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-02-04 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14305605#comment-14305605
 ] 

Manoj Kumar commented on SPARK-5021:


Thanks for the comment. That also seems to fail, since I use properties like 
index and valueAt, which are exclusive to BSV:

error: value index is not a member of breeze.linalg.Vector[Double]

How about method overloading?

{code}
// Dense case
def vectorMean(x: IndexedSeq[BDV[Double]]): BDV[Double] = { ... }

// Sparse case
def vectorMean(x: IndexedSeq[BSV[Double]]): BDV[Double] = { ... }
{code}

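Filled in, the two overloads could look something like this (just a sketch, 
not tested against the codebase; the sparse body uses the BSV-only index, 
valueAt, and activeSize accessors mentioned above):

{code}
// Dense case: plain accumulation over whole vectors.
def vectorMean(x: IndexedSeq[BDV[Double]]): BDV[Double] = {
  val v = BDV.zeros[Double](x(0).length)
  x.foreach(xi => v += xi)
  v / x.length.toDouble
}

// Sparse case: touch only the stored (index, value) pairs of each vector.
def vectorMean(x: IndexedSeq[BSV[Double]]): BDV[Double] = {
  val v = BDV.zeros[Double](x(0).length)
  x.foreach { xi =>
    var offset = 0
    while (offset < xi.activeSize) {
      v(xi.index(offset)) += xi.valueAt(offset)
      offset += 1
    }
  }
  v / x.length.toDouble
}
{code}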

 GaussianMixtureEM should be faster for SparseVector input
 -

 Key: SPARK-5021
 URL: https://issues.apache.org/jira/browse/SPARK-5021
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar

 GaussianMixtureEM currently converts everything to dense vectors.  It would 
 be nice if it were faster for SparseVectors (running in time linear in the 
 number of non-zero values).
 However, this may not be too important since clustering should rarely be done 
 in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-02-04 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14305703#comment-14305703
 ] 

Manoj Kumar commented on SPARK-5021:


Oops, I was thinking along completely the wrong lines :/ I was rewriting 
SparseVector addition and subtraction.

On a side note, does it help to keep the mean sparse in your code? Isn't the 
mean typically dense for a large number of SparseVectors?

In that case, we can remove the matching and just do

{code}
private def vectorMean(x: IndexedSeq[BV[Double]]): BDV[Double] = {
  val v = BDV.zeros[Double](x(0).length)
  x.foreach(xi => v += xi)
  v / x.length.toDouble
}
{code}

wdyt?

 GaussianMixtureEM should be faster for SparseVector input
 -

 Key: SPARK-5021
 URL: https://issues.apache.org/jira/browse/SPARK-5021
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar

 GaussianMixtureEM currently converts everything to dense vectors.  It would 
 be nice if it were faster for SparseVectors (running in time linear in the 
 number of non-zero values).
 However, this may not be too important since clustering should rarely be done 
 in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-02-04 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14306538#comment-14306538
 ] 

Manoj Kumar commented on SPARK-5021:


Can you please explain what you mean by soft assignments?

Anyhow, maybe it would not be beneficial to keep the means sparse, as you said; 
however, we might benefit from not converting the original sample points to 
dense while making the calculations (updating the means, the covariance matrix, 
etc.). What do you say?



 GaussianMixtureEM should be faster for SparseVector input
 -

 Key: SPARK-5021
 URL: https://issues.apache.org/jira/browse/SPARK-5021
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar

 GaussianMixtureEM currently converts everything to dense vectors.  It would 
 be nice if it were faster for SparseVectors (running in time linear in the 
 number of non-zero values).
 However, this may not be too important since clustering should rarely be done 
 in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-02-04 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14306549#comment-14306549
 ] 

Manoj Kumar commented on SPARK-5021:


Ah, I see what you mean (Google helped me); I never knew that was called soft 
assignment. But I still think there would be benefits if we do not convert the 
input vectors to dense and keep everything else dense.

 GaussianMixtureEM should be faster for SparseVector input
 -

 Key: SPARK-5021
 URL: https://issues.apache.org/jira/browse/SPARK-5021
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar

 GaussianMixtureEM currently converts everything to dense vectors.  It would 
 be nice if it were faster for SparseVectors (running in time linear in the 
 number of non-zero values).
 However, this may not be too important since clustering should rarely be done 
 in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-02-04 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14305032#comment-14305032
 ] 

Manoj Kumar commented on SPARK-5021:


I fixed it up, and it works for sparse input. However, refactoring the code 
seems to be a huge pain, and I get a lot of unrelated errors.

Would you like to have a look at the working pre-refactoring code, or at the 
refactored one, which is cleaner but has an error I'm unable to figure out 
myself? You might be able to help.

Thanks.

 GaussianMixtureEM should be faster for SparseVector input
 -

 Key: SPARK-5021
 URL: https://issues.apache.org/jira/browse/SPARK-5021
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar

 GaussianMixtureEM currently converts everything to dense vectors.  It would 
 be nice if it were faster for SparseVectors (running in time linear in the 
 number of non-zero values).
 However, this may not be too important since clustering should rarely be done 
 in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-02-04 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14305058#comment-14305058
 ] 

Manoj Kumar edited comment on SPARK-5021 at 2/4/15 1:48 PM:


Hmm. I figured it out; it is because I have something like this:

{code}
val trainData = {
  if (isSparse)
    data.map(sample => sample.asInstanceOf[SparseVector]).cache()
  else
    data.map(u => u.toBreeze.toDenseVector).cache()
}
{code}

Now, since trainData can have two possible types, this statement returns an 
error:

{code}
val sums = {
  if (isSparse) {
    val compute = sc.broadcast(ExpectationSum.addSparse(weights, gaussians) _)
    trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
  } else {
    val compute = sc.broadcast(ExpectationSum.add(weights, gaussians) _)
    trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
  }
}
{code}

{code}
[error]  found   : (org.apache.spark.mllib.clustering.ExpectationSum, 
org.apache.spark.mllib.linalg.SparseVector) => 
org.apache.spark.mllib.clustering.ExpectationSum
[error]  required: (org.apache.spark.mllib.clustering.ExpectationSum, _0) => 
org.apache.spark.mllib.clustering.ExpectationSum
[error]   trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, 
_ += _)
{code}

What is the best way to overcome this, i.e. without separating the sparse and 
dense cases completely?
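
One possible way out (a sketch, under the assumption that ExpectationSum.add 
can be written against breeze.linalg.Vector[Double] instead of a concrete 
subtype): type the cached RDD as the common supertype BV[Double], so a single 
aggregate call typechecks for both cases, and toBreeze still preserves 
sparsity:

{code}
// Hypothetical sketch: one code path via the supertype BV[Double].
// toBreeze keeps SparseVectors sparse, so nothing is densified here.
val trainData: RDD[BV[Double]] = data.map(_.toBreeze).cache()

val compute = sc.broadcast(ExpectationSum.add(weights, gaussians) _)
val sums = trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
{code}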


was (Author: mechcoder):
Hmm. I figured it out; it is because I have something like this:

{code}
val trainData = {
  if (isSparse)
    data.map(sample => sample.asInstanceOf[SparseVector]).cache()
  else
    data.map(u => u.toBreeze.toDenseVector).cache()
}
{code}

Now, since trainData can have two possible types, this statement returns an 
error:

{code}
val sums = {
  if (isSparse) {
    val compute = sc.broadcast(ExpectationSum.addSparse(weights, gaussians) _)
    trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
  } else {
    val compute = sc.broadcast(ExpectationSum.add(weights, gaussians) _)
    trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
  }
}
{code}

{code}
[error]  found   : (org.apache.spark.mllib.clustering.ExpectationSum, 
org.apache.spark.mllib.linalg.SparseVector) => 
org.apache.spark.mllib.clustering.ExpectationSum
[error]  required: (org.apache.spark.mllib.clustering.ExpectationSum, _0) => 
org.apache.spark.mllib.clustering.ExpectationSum
[error]   trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, 
_ += _)
{code}

What is the best way to overcome this?

 GaussianMixtureEM should be faster for SparseVector input
 -

 Key: SPARK-5021
 URL: https://issues.apache.org/jira/browse/SPARK-5021
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar

 GaussianMixtureEM currently converts everything to dense vectors.  It would 
 be nice if it were faster for SparseVectors (running in time linear in the 
 number of non-zero values).
 However, this may not be too important since clustering should rarely be done 
 in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-02-04 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14305058#comment-14305058
 ] 

Manoj Kumar edited comment on SPARK-5021 at 2/4/15 1:19 PM:


Hmm. I figured it out; it is because I have something like this:

{code}
val trainData = {
  if (isSparse)
    data.map(sample => sample.asInstanceOf[SparseVector]).cache()
  else
    data.map(u => u.toBreeze.toDenseVector).cache()
}
{code}

Now, since trainData can have two possible types, this statement returns an 
error:

{code}
val sums = {
  if (isSparse) {
    val compute = sc.broadcast(ExpectationSum.addSparse(weights, gaussians) _)
    trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
  } else {
    val compute = sc.broadcast(ExpectationSum.add(weights, gaussians) _)
    trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
  }
}
{code}

{code}
[error]  found   : (org.apache.spark.mllib.clustering.ExpectationSum, 
org.apache.spark.mllib.linalg.SparseVector) => 
org.apache.spark.mllib.clustering.ExpectationSum
[error]  required: (org.apache.spark.mllib.clustering.ExpectationSum, _0) => 
org.apache.spark.mllib.clustering.ExpectationSum
[error]   trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, 
_ += _)
{code}

What is the best way to overcome this?


was (Author: mechcoder):
Hmm. I figured it out; it is because I have something like this:

{code}
val trainData = {
  if (isSparse)
    data.map(sample => sample.asInstanceOf[SparseVector]).cache()
  else
    data.map(u => u.toBreeze.toDenseVector).cache()
}
{code}

Now, since trainData can have two possible types, this statement returns an 
error:

{code}
val sums = {
  if (isSparse) {
    val compute = sc.broadcast(ExpectationSum.addSparse(weights, gaussians) _)
    trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
  } else {
    val compute = sc.broadcast(ExpectationSum.add(weights, gaussians) _)
    trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
  }
}
{code}

{code}
[error]  found   : (org.apache.spark.mllib.clustering.ExpectationSum, 
org.apache.spark.mllib.linalg.SparseVector) => 
org.apache.spark.mllib.clustering.ExpectationSum
[error]  required: (org.apache.spark.mllib.clustering.ExpectationSum, _0) => 
org.apache.spark.mllib.clustering.ExpectationSum
[error]   trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, 
_ += _)
{code}

What is the best way to overcome this?

 GaussianMixtureEM should be faster for SparseVector input
 -

 Key: SPARK-5021
 URL: https://issues.apache.org/jira/browse/SPARK-5021
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar

 GaussianMixtureEM currently converts everything to dense vectors.  It would 
 be nice if it were faster for SparseVectors (running in time linear in the 
 number of non-zero values).
 However, this may not be too important since clustering should rarely be done 
 in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-02-04 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14305066#comment-14305066
 ] 

Manoj Kumar commented on SPARK-5021:


I just realized that it renders badly. Here is the code that causes the error.

https://gist.github.com/MechCoder/b015fdd266584ba6b8ff

 GaussianMixtureEM should be faster for SparseVector input
 -

 Key: SPARK-5021
 URL: https://issues.apache.org/jira/browse/SPARK-5021
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar

 GaussianMixtureEM currently converts everything to dense vectors.  It would 
 be nice if it were faster for SparseVectors (running in time linear in the 
 number of non-zero values).
 However, this may not be too important since clustering should rarely be done 
 in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-02-04 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14305058#comment-14305058
 ] 

Manoj Kumar commented on SPARK-5021:


Hmm. I figured it out; it is because I have something like this:

{code}
val trainData = {
  if (isSparse)
    data.map(sample => sample.asInstanceOf[SparseVector]).cache()
  else
    data.map(u => u.toBreeze.toDenseVector).cache()
}
{code}

Now, since trainData can have two possible types, this statement returns an 
error:

{code}
val sums = {
  if (isSparse) {
    val compute = sc.broadcast(ExpectationSum.addSparse(weights, gaussians) _)
    trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
  } else {
    val compute = sc.broadcast(ExpectationSum.add(weights, gaussians) _)
    trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
  }
}
{code}

{code}
[error]  found   : (org.apache.spark.mllib.clustering.ExpectationSum, 
org.apache.spark.mllib.linalg.SparseVector) => 
org.apache.spark.mllib.clustering.ExpectationSum
[error]  required: (org.apache.spark.mllib.clustering.ExpectationSum, _0) => 
org.apache.spark.mllib.clustering.ExpectationSum
[error]   trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, 
_ += _)
{code}

What is the best way to overcome this?

 GaussianMixtureEM should be faster for SparseVector input
 -

 Key: SPARK-5021
 URL: https://issues.apache.org/jira/browse/SPARK-5021
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar

 GaussianMixtureEM currently converts everything to dense vectors.  It would 
 be nice if it were faster for SparseVectors (running in time linear in the 
 number of non-zero values).
 However, this may not be too important since clustering should rarely be done 
 in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-02-04 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14305058#comment-14305058
 ] 

Manoj Kumar edited comment on SPARK-5021 at 2/4/15 1:18 PM:


Hmm. I figured it out; it is because I have something like this:

{code}
val trainData = {
  if (isSparse)
    data.map(sample => sample.asInstanceOf[SparseVector]).cache()
  else
    data.map(u => u.toBreeze.toDenseVector).cache()
}
{code}

Now, since trainData can have two possible types, this statement returns an 
error:

{code}
val sums = {
  if (isSparse) {
    val compute = sc.broadcast(ExpectationSum.addSparse(weights, gaussians) _)
    trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
  } else {
    val compute = sc.broadcast(ExpectationSum.add(weights, gaussians) _)
    trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
  }
}
{code}

{code}
[error]  found   : (org.apache.spark.mllib.clustering.ExpectationSum, 
org.apache.spark.mllib.linalg.SparseVector) => 
org.apache.spark.mllib.clustering.ExpectationSum
[error]  required: (org.apache.spark.mllib.clustering.ExpectationSum, _0) => 
org.apache.spark.mllib.clustering.ExpectationSum
[error]   trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, 
_ += _)
{code}

What is the best way to overcome this?


was (Author: mechcoder):
Hmm. I figured it out; it is because I have something like this:

{code}
val trainData = {
  if (isSparse)
    data.map(sample => sample.asInstanceOf[SparseVector]).cache()
  else
    data.map(u => u.toBreeze.toDenseVector).cache()
}
{code}

Now, since trainData can have two possible types, this statement returns an 
error:

{code}
val sums = {
  if (isSparse) {
    val compute = sc.broadcast(ExpectationSum.addSparse(weights, gaussians) _)
    trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
  } else {
    val compute = sc.broadcast(ExpectationSum.add(weights, gaussians) _)
    trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
  }
}
{code}

{code}
[error]  found   : (org.apache.spark.mllib.clustering.ExpectationSum, 
org.apache.spark.mllib.linalg.SparseVector) => 
org.apache.spark.mllib.clustering.ExpectationSum
[error]  required: (org.apache.spark.mllib.clustering.ExpectationSum, _0) => 
org.apache.spark.mllib.clustering.ExpectationSum
[error]   trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, 
_ += _)
{code}

What is the best way to overcome this?

 GaussianMixtureEM should be faster for SparseVector input
 -

 Key: SPARK-5021
 URL: https://issues.apache.org/jira/browse/SPARK-5021
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar

 GaussianMixtureEM currently converts everything to dense vectors.  It would 
 be nice if it were faster for SparseVectors (running in time linear in the 
 number of non-zero values).
 However, this may not be too important since clustering should rarely be done 
 in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k

2015-02-04 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14306688#comment-14306688
 ] 

Manoj Kumar commented on SPARK-5016:


Hi, I would like to fix this (since I'm familiar, to an extent, with this part 
of the code), and maybe we could merge this before the sparse-input issue.

1. As a heuristic, how large should k be?
2. By distribute, do you mean storing samples 
(https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L140)
 as a collection using sc.parallelize, so that they can be operated on in 
parallel across k?

Thanks.

 GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
 ---

 Key: SPARK-5016
 URL: https://issues.apache.org/jira/browse/SPARK-5016
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 If numFeatures or k are large, GMM EM should distribute the matrix inverse 
 computation for Gaussian initialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-02-04 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14306691#comment-14306691
 ] 

Manoj Kumar commented on SPARK-5021:


[~tgaloppo] Is there any way we could have a quick 3-5 minute chat on this 
issue, so that we can clear up the way forward (maybe on IRC)?

 GaussianMixtureEM should be faster for SparseVector input
 -

 Key: SPARK-5021
 URL: https://issues.apache.org/jira/browse/SPARK-5021
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar

 GaussianMixtureEM currently converts everything to dense vectors.  It would 
 be nice if it were faster for SparseVectors (running in time linear in the 
 number of non-zero values).
 However, this may not be too important since clustering should rarely be done 
 in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-02-04 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14306549#comment-14306549
 ] 

Manoj Kumar edited comment on SPARK-5021 at 2/5/15 4:05 AM:


Ah, I see what you mean (Google helped me); I never knew that was called soft 
assignment. But I still think there would be benefits if we do not convert the 
input vectors to dense and keep everything else dense, i.e. prevent converting 
the input to a dense form, which is what the original issue was about.


was (Author: mechcoder):
Ah, I see what you mean (Google helped me); I never knew that was called soft 
assignment. But I still think there would be benefits if we do not convert the 
input vectors to dense and keep everything else dense.

 GaussianMixtureEM should be faster for SparseVector input
 -

 Key: SPARK-5021
 URL: https://issues.apache.org/jira/browse/SPARK-5021
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar

 GaussianMixtureEM currently converts everything to dense vectors.  It would 
 be nice if it were faster for SparseVectors (running in time linear in the 
 number of non-zero values).
 However, this may not be too important since clustering should rarely be done 
 in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-02-03 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14304099#comment-14304099
 ] 

Manoj Kumar commented on SPARK-5021:


Hi. I'm almost there. I have one last question.

In this line, 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L223
 I'm not sure how to do this other than writing our own implementation that 
does not depend on NativeBlas. Is that okay? A rough sketch of what I mean 
follows below.
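
By our own implementation I mean something like this (a hypothetical helper, 
not actual code): a plain-Scala matrix-vector product that avoids the 
NativeBlas gemv and only touches the active entries of a sparse x:

{code}
import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV, SparseVector => BSV}

// Hypothetical sketch: y = a * x for sparse x, as plain JVM loops instead of a
// NativeBlas gemv; work scales with a.rows * x.activeSize, not a.rows * a.cols.
def gemvSparse(a: BDM[Double], x: BSV[Double]): BDV[Double] = {
  val y = BDV.zeros[Double](a.rows)
  var nz = 0
  while (nz < x.activeSize) {
    val j = x.index(nz)     // column index of the nz-th stored entry
    val v = x.valueAt(nz)
    var i = 0
    while (i < a.rows) {
      y(i) += a(i, j) * v
      i += 1
    }
    nz += 1
  }
  y
}
{code}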

 GaussianMixtureEM should be faster for SparseVector input
 -

 Key: SPARK-5021
 URL: https://issues.apache.org/jira/browse/SPARK-5021
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar

 GaussianMixtureEM currently converts everything to dense vectors.  It would 
 be nice if it were faster for SparseVectors (running in time linear in the 
 number of non-zero values).
 However, this may not be too important since clustering should rarely be done 
 in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-02-03 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14304099#comment-14304099
 ] 

Manoj Kumar edited comment on SPARK-5021 at 2/3/15 10:01 PM:
-

Hi. I'm almost there. I have one last question.

In this line, 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L223
 I'm not sure how to do this other than writing our own implementation that 
does not depend on NativeBlas for sparse data. Is that okay?


was (Author: mechcoder):
Hi. I'm almost there. I have one last question.

In this line, 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L223
 I'm not sure how to do this other than writing our own implementation that 
does not depend on NativeBlas. Is that okay?

 GaussianMixtureEM should be faster for SparseVector input
 -

 Key: SPARK-5021
 URL: https://issues.apache.org/jira/browse/SPARK-5021
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar

 GaussianMixtureEM currently converts everything to dense vectors.  It would 
 be nice if it were faster for SparseVectors (running in time linear in the 
 number of non-zero values).
 However, this may not be too important since clustering should rarely be done 
 in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-02-03 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14304099#comment-14304099
 ] 

Manoj Kumar edited comment on SPARK-5021 at 2/3/15 10:02 PM:
-

Hi. I'm almost there. I have one last question.

In this line, 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L223
 I'm not sure how to do this other than writing our own implementation that 
does not depend on NativeBlas for a SparseVector. Is that okay?


was (Author: mechcoder):
Hi. I'm almost there. I have one last question.

In this line, 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L223
 I'm not sure how to do this other than writing our own implementation that 
does not depend on NativeBlas for sparse data. Is that okay?

 GaussianMixtureEM should be faster for SparseVector input
 -

 Key: SPARK-5021
 URL: https://issues.apache.org/jira/browse/SPARK-5021
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar

 GaussianMixtureEM currently converts everything to dense vectors.  It would 
 be nice if it were faster for SparseVectors (running in time linear in the 
 number of non-zero values).
 However, this may not be too important since clustering should rarely be done 
 in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-02-01 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14300889#comment-14300889
 ] 

Manoj Kumar commented on SPARK-5021:


Sorry for the delay; I just started going through the source. Just a random 
question: why is this model named GaussianMixtureEM? Shouldn't it be renamed to 
just GaussianMixtureModel, since EM is just an optimization algorithm?

 GaussianMixtureEM should be faster for SparseVector input
 -

 Key: SPARK-5021
 URL: https://issues.apache.org/jira/browse/SPARK-5021
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar

 GaussianMixtureEM currently converts everything to dense vectors.  It would 
 be nice if it were faster for SparseVectors (running in time linear in the 
 number of non-zero values).
 However, this may not be too important since clustering should rarely be done 
 in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-02-01 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14300927#comment-14300927
 ] 

Manoj Kumar commented on SPARK-5021:


I see that it is resolved in master.

What do you think should be the preferred datatype to handle an array of 
SparseVectors? Do we use CoordinateMatrix? This might involve improving 
CoordinateMatrix to add additional functionality.

 GaussianMixtureEM should be faster for SparseVector input
 -

 Key: SPARK-5021
 URL: https://issues.apache.org/jira/browse/SPARK-5021
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar

 GaussianMixtureEM currently converts everything to dense vectors.  It would 
 be nice if it were faster for SparseVectors (running in time linear in the 
 number of non-zero values).
 However, this may not be too important since clustering should rarely be done 
 in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-01-29 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14296948#comment-14296948
 ] 

Manoj Kumar commented on SPARK-5021:


Sorry for being dense, but how do I access the GaussianMixtureEM docs? They 
should be out in the most recent version, but I'm not sure how to view them.

 GaussianMixtureEM should be faster for SparseVector input
 -

 Key: SPARK-5021
 URL: https://issues.apache.org/jira/browse/SPARK-5021
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar

 GaussianMixtureEM currently converts everything to dense vectors.  It would 
 be nice if it were faster for SparseVectors (running in time linear in the 
 number of non-zero values).
 However, this may not be too important since clustering should rarely be done 
 in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5257) SparseVector indices must be non-negative

2015-01-21 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285336#comment-14285336
 ] 

Manoj Kumar commented on SPARK-5257:


Sure. I thought it was something I could patch up quickly. Next time, I will 
ask first.

 SparseVector indices must be non-negative
 -

 Key: SPARK-5257
 URL: https://issues.apache.org/jira/browse/SPARK-5257
 Project: Spark
  Issue Type: Documentation
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Derrick Burns
Priority: Minor
   Original Estimate: 0.25h
  Remaining Estimate: 0.25h

 The description of SparseVector suggests only that the indices have to be 
 distinct integers.  However the code for the constructor that takes an array 
 of (index, value) tuples assumes that the indices are non-negative.
 Either the code must be changed or the description should be changed.  
 This arose when I generated indices via hashing and converting the hash 
 values to (signed) integers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5257) SparseVector indices must be non-negative

2015-01-20 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285139#comment-14285139
 ] 

Manoj Kumar commented on SPARK-5257:


[~mengxr] Can you please mark this as resolved? It sometimes creates confusion 
for new people who are trying to contribute.

 SparseVector indices must be non-negative
 -

 Key: SPARK-5257
 URL: https://issues.apache.org/jira/browse/SPARK-5257
 Project: Spark
  Issue Type: Documentation
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Derrick Burns
Priority: Minor
   Original Estimate: 0.25h
  Remaining Estimate: 0.25h

 The description of SparseVector suggests only that the indices have to be 
 distinct integers.  However the code for the constructor that takes an array 
 of (index, value) tuples assumes that the indices are non-negative.
 Either the code must be changed or the description should be changed.  
 This arose when I generated indices via hashing and converting the hash 
 values to (signed) integers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-01-20 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284299#comment-14284299
 ] 

Manoj Kumar commented on SPARK-5021:


[~josephkb] Can you please assign this to me? I can work on this in the coming 
week.

 GaussianMixtureEM should be faster for SparseVector input
 -

 Key: SPARK-5021
 URL: https://issues.apache.org/jira/browse/SPARK-5021
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 GaussianMixtureEM currently converts everything to dense vectors.  It would 
 be nice if it were faster for SparseVectors (running in time linear in the 
 number of non-zero values).
 However, this may not be too important since clustering should rarely be done 
 in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3381) DecisionTree: eliminate bins for unordered features

2015-01-18 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281820#comment-14281820
 ] 

Manoj Kumar commented on SPARK-3381:


Hi, I would like to work on this, but preferably after the sampling_rate PR is 
merged, because I do not want to clutter the PR queue.

 DecisionTree: eliminate bins for unordered features
 ---

 Key: SPARK-3381
 URL: https://issues.apache.org/jira/browse/SPARK-3381
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Joseph K. Bradley
Priority: Trivial

 Code simplification: DecisionTree currently allocates bins for unordered 
 features (in findSplitsBins).  However, those bins are not needed; only the 
 splits are required.  This change will require modifying findSplitsBins, as 
 well as modifying a few other functions to use splits instead of bins.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3726) RandomForest: Support for bootstrap options

2015-01-16 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14280658#comment-14280658
 ] 

Manoj Kumar commented on SPARK-3726:


Ah, I see. I had my doubts when I started looking at the code, but I was in a 
hurry to send a Pull Request. So this can be closed?

 RandomForest: Support for bootstrap options
 ---

 Key: SPARK-3726
 URL: https://issues.apache.org/jira/browse/SPARK-3726
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar
Priority: Minor

 RandomForest uses BaggedPoint to simulate bootstrapped samples of the data.  
 The expected size of each sample is the same as the original data (sampling 
 rate = 1.0), and sampling is done with replacement.  Adding support for other 
 sampling rates and for sampling without replacement would be useful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3726) RandomForest: Support for bootstrap options

2015-01-14 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14277643#comment-14277643
 ] 

Manoj Kumar commented on SPARK-3726:


[~josephkb] You seem to report issues that I always think I can have a decent 
shot at :) I would like to submit a PR for this by the end of the week.

 RandomForest: Support for bootstrap options
 ---

 Key: SPARK-3726
 URL: https://issues.apache.org/jira/browse/SPARK-3726
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor

 RandomForest uses BaggedPoint to simulate bootstrapped samples of the data.  
 The expected size of each sample is the same as the original data (sampling 
 rate = 1.0), and sampling is done with replacement.  Adding support for other 
 sampling rates and for sampling without replacement would be useful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2909) Indexing for SparseVector in pyspark

2015-01-12 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273880#comment-14273880
 ] 

Manoj Kumar commented on SPARK-2909:


[~josephkb] Sorry for spamming your inbox, but just a heads up that I'm working 
on this. Will mostly submit a Pull Request by tomorrow.

 Indexing for SparseVector in pyspark
 

 Key: SPARK-2909
 URL: https://issues.apache.org/jira/browse/SPARK-2909
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: Joseph K. Bradley
Priority: Minor

 SparseVector in pyspark does not currently support indexing, except by 
 examining the internal representation.  Though indexing is a pricy operation, 
 it would be useful for, e.g., iterating through a dataset (RDD[LabeledPoint]) 
 and operating on a single feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5022) Change VectorUDT to object

2015-01-09 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270817#comment-14270817
 ] 

Manoj Kumar edited comment on SPARK-5022 at 1/9/15 9:50 AM:


[~josephkb] I want to have a go at this one. Should I wait for my other PR to 
get merged, or is it ok if I submit one here, before it gets merged?


was (Author: mechcoder):
@josephkb I want to have a go at this one. Should I wait for my other PR to get 
merged, or is it ok if I submit one here, before it gets merged?

 Change VectorUDT to object
 --

 Key: SPARK-5022
 URL: https://issues.apache.org/jira/browse/SPARK-5022
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, SQL
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Minor

 VectorUDT DataTypes are all identical, so VectorUDT should probably be an 
 object instead of a class.
 Once this is done, we can remove equals().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5022) Change VectorUDT to object

2015-01-09 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270817#comment-14270817
 ] 

Manoj Kumar commented on SPARK-5022:


@josephkb I want to have a go at this one. Should I wait for my other PR to get 
merged, or is it ok if I submit one here, before it gets merged?

 Change VectorUDT to object
 --

 Key: SPARK-5022
 URL: https://issues.apache.org/jira/browse/SPARK-5022
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, SQL
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Minor

 VectorUDT DataTypes are all identical, so VectorUDT should probably be an 
 object instead of a class.
 Once this is done, we can remove equals().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5022) Change VectorUDT to object

2015-01-09 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14271608#comment-14271608
 ] 

Manoj Kumar commented on SPARK-5022:


Hi, thanks for the reply. I have worked on implementing linear models, 
clustering models, and a number of metrics in scikit-learn. Do you have any 
specific issues or feature requests in mind that you would like to see done, or 
should I keep searching?

 Change VectorUDT to object
 --

 Key: SPARK-5022
 URL: https://issues.apache.org/jira/browse/SPARK-5022
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, SQL
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Minor

 VectorUDT DataTypes are all identical, so VectorUDT should probably be an 
 object instead of a class.
 Once this is done, we can remove equals().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4406) SVD should check for k < 1

2015-01-07 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268180#comment-14268180
 ] 

Manoj Kumar commented on SPARK-4406:


Hi Joseph, I believe this issue would be simple enough for me to start working 
on? Does it require you to assign it to me, or can I send a Pull Request right 
away?

 SVD should check for k < 1
 --

 Key: SPARK-4406
 URL: https://issues.apache.org/jira/browse/SPARK-4406
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Minor

 When SVD is called with k < 1, it still tries to compute the SVD, causing a 
 lower-level error.  It should fail early.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4406) SVD should check for k < 1

2015-01-07 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268180#comment-14268180
 ] 

Manoj Kumar edited comment on SPARK-4406 at 1/7/15 8:40 PM:


Hi Joseph, I believe this issue would be simple enough for me to start working 
on. Does it require you to assign it to me, or can I send a Pull Request right 
away?


was (Author: mechcoder):
Hi Joseph, I believe this issue would be simple enough for me to start working 
on? Does it require you to assign it to me, or can I send a Pull Request right 
away?

 SVD should check for k < 1
 --

 Key: SPARK-4406
 URL: https://issues.apache.org/jira/browse/SPARK-4406
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Minor

 When SVD is called with k < 1, it still tries to compute the SVD, causing a 
 lower-level error.  It should fail early.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


