[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325829#comment-14325829 ]

Manoj Kumar commented on SPARK-5436:

The idea sounds great. I shall come up with a pull request in a day or two.

Validate GradientBoostedTrees during training
---------------------------------------------
Key: SPARK-5436
URL: https://issues.apache.org/jira/browse/SPARK-5436
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

For Gradient Boosting, it would be valuable to compute test error on a separate validation set during training. That way, training could stop early based on the test error (or some other metric specified by the user).

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
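The early-stopping idea in the issue description can be sketched as a plain-Scala loop; this is a hypothetical simplification, not the MLlib API, and `validationError` stands in for evaluating the partially built ensemble on the held-out set:

```scala
// Hedged sketch of validation-based early stopping for boosting.
// `validationError(iter)` is assumed to return the validation-set error
// of the ensemble after `iter + 1` trees; it is a placeholder, not MLlib code.
def boostWithValidation(
    maxIterations: Int,
    tolerance: Double,
    validationError: Int => Double): Int = {
  var bestError = Double.MaxValue
  var iter = 0
  while (iter < maxIterations) {
    val err = validationError(iter)
    // Stop early once adding a tree no longer improves validation error
    // by at least `tolerance`.
    if (bestError - err < tolerance) return iter
    bestError = err
    iter += 1
  }
  maxIterations
}
```

The key design point is that the stopping decision uses the held-out validation error rather than the training error, which keeps boosting from overfitting as trees accumulate.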
[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322332#comment-14322332 ]

Manoj Kumar commented on SPARK-5016:

[~mengxr] Can you please clarify a few things?
1. How should the BreezeData be keyed in order to parallelize across the k Gaussians (given that the assignments are soft)?
2. Even if we can do that, there are a few lines of code corresponding to the log-likelihood computation, as pointed out by [~tgaloppo], which are interdependent. How can those be parallelized?

GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
----------------------------------------------------------------------------
Key: SPARK-5016
URL: https://issues.apache.org/jira/browse/SPARK-5016
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

If numFeatures or k are large, GMM EM should distribute the matrix inverse computation for Gaussian initialization.
[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321466#comment-14321466 ]

Manoj Kumar commented on SPARK-5016:

[~tgaloppo] If I understand [~mengxr]'s description correctly, that seems to be the way, i.e. to have a keyed RDD of expectation sums, so that the k updates run in parallel. But why is it awkward that each entry should be operated on by every reducer?
[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321611#comment-14321611 ]

Manoj Kumar commented on SPARK-5016:

Is there a possibility that memory is shared between all k reducers? (I haven't tried anything, just speculating here.)
[jira] [Issue Comment Deleted] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Manoj Kumar updated SPARK-5016:

Comment: was deleted (was: Is there a possibility that memory is shared between all k reducers? (I haven't tried anything, just speculating here.))
[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318021#comment-14318021 ]

Manoj Kumar commented on SPARK-5436:

Hi, I would like to give this a go. [~ChrisT], are you still working on this? Otherwise I would love to carry it forward.
[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14315695#comment-14315695 ]

Manoj Kumar commented on SPARK-5016:

[~tgaloppo] How about a method setParallelGaussianUpdate(bool) (defaulting to false), which would let the user decide whether or not to use this feature? [~mengxr] I would like to know your thoughts on this as well.
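A setter like the one proposed would presumably follow MLlib's chainable builder convention; this is a hypothetical sketch (the name and the enclosing class body are assumptions, not the actual GaussianMixtureEM code):

```scala
// Hypothetical builder-style setter in MLlib's usual convention.
class GaussianMixtureEM {
  // Whether to distribute the per-Gaussian update across the cluster;
  // defaults to false so existing behavior is unchanged.
  private var parallelGaussianUpdate: Boolean = false

  def setParallelGaussianUpdate(value: Boolean): this.type = {
    parallelGaussianUpdate = value
    this // return this for method chaining, as other MLlib setters do
  }

  def getParallelGaussianUpdate: Boolean = parallelGaussianUpdate
}
```

Returning `this.type` keeps the setter chainable with the estimator's other configuration methods.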
[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311956#comment-14311956 ]

Manoj Kumar commented on SPARK-5016:

[~tgaloppo] I would like your input on this as well.
[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312284#comment-14312284 ]

Manoj Kumar commented on SPARK-5016:

Well, I got misled by the JIRA description, which says "Gaussian initialization". I was thinking it was this block of code, which initializes the k Gaussian distributions, that needed to be parallelized.

{code}
val samples = breezeData.takeSample(withReplacement = true, k * nSamples, seed)
(Array.fill(k)(1.0 / k), Array.tabulate(k) { i =>
  val slice = samples.view(i * nSamples, (i + 1) * nSamples)
  new MultivariateGaussian(vectorMean(slice), initCovariance(slice))
})
{code}

And next time, please don't post the code (or at least give a spoiler alert); it spoils the fun of fixing it :P
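The slicing scheme in the snippet above can be illustrated with a self-contained, simplified sketch; plain `Array[Double]` vectors stand in for the Breeze types, and covariance is omitted to keep the example short:

```scala
// Simplified sketch of the initialization referenced above: draw
// k * nSamples points, carve them into k contiguous slices, and
// compute one mean per slice (one per Gaussian component).
def initMeans(samples: Array[Array[Double]], k: Int): Array[Array[Double]] = {
  val nSamples = samples.length / k
  Array.tabulate(k) { i =>
    val slice = samples.slice(i * nSamples, (i + 1) * nSamples)
    val dim = slice.head.length
    val mean = new Array[Double](dim)
    // Coordinate-wise mean of the slice.
    for (x <- slice; j <- 0 until dim) mean(j) += x(j) / nSamples
    mean
  }
}
```

Each of the k slice computations is independent, which is what makes this initialization a candidate for distribution when k or the dimensionality is large.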
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311532#comment-14311532 ]

Manoj Kumar commented on SPARK-5021:

I have created a working pull request. Let us please take the discussion there.

GaussianMixtureEM should be faster for SparseVector input
----------------------------------------------------------
Key: SPARK-5021
URL: https://issues.apache.org/jira/browse/SPARK-5021
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar

GaussianMixtureEM currently converts everything to dense vectors. It would be nice if it were faster for SparseVectors (running in time linear in the number of non-zero values). However, this may not be too important since clustering should rarely be done in high dimensions.
[jira] [Comment Edited] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14306688#comment-14306688 ]

Manoj Kumar edited comment on SPARK-5016 at 2/5/15 8:09 AM:

Hi, I would like to fix this (since I am familiar to an extent with this part of the code), and maybe we could merge it before the sparse-input issue.
1. As a heuristic, how large should k be?
2. By "distribute", do you mean storing samples (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L140) as a collection using sc.parallelize, so that it can be operated on in parallel across k? What role does numFeatures have?
Thanks.
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14305605#comment-14305605 ]

Manoj Kumar commented on SPARK-5021:

Thanks for the comment. That also seems to fail, since I use properties like index and valueAt, which are exclusive to BSV:

{code}
error: value index is not a member of breeze.linalg.Vector[Double]
{code}

How about method overloading?

{code}
// Dense case
def vectorMean(x: IndexedSeq[BDV[Double]]): BDV[Double] = {

// Sparse case
def vectorMean(x: IndexedSeq[BSV[Double]]): BDV[Double] = {
{code}
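The overloading idea can be sketched self-contained with plain Scala types in place of Breeze's BDV/BSV; note the sketch deliberately uses `Array` parameters, since two overloads on `IndexedSeq[...]` would collide after JVM type erasure. The `Sparse` case class here is a stand-in invented for the example:

```scala
// Minimal stand-in for a sparse vector: parallel (index, value) arrays.
case class Sparse(indices: Array[Int], values: Array[Double], size: Int)

// Dense case: visit every coordinate of every vector.
def vectorMean(xs: Array[Array[Double]]): Array[Double] = {
  val mean = new Array[Double](xs.head.length)
  for (x <- xs; i <- x.indices) mean(i) += x(i) / xs.length
  mean
}

// Sparse case: visit only the stored entries, so the cost is linear
// in the number of non-zeros, matching the goal of this JIRA.
def vectorMean(xs: Array[Sparse]): Array[Double] = {
  val mean = new Array[Double](xs.head.size)
  for (x <- xs; k <- x.indices.indices)
    mean(x.indices(k)) += x.values(k) / xs.length
  mean
}
```

The mean itself stays dense in both overloads, since an average of many sparse vectors is generally dense anyway.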
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14305703#comment-14305703 ]

Manoj Kumar commented on SPARK-5021:

Oops, I was thinking along completely the wrong lines :/ I was rewriting SparseVector addition and subtraction. On a side note, does keeping the mean sparse help in your code? Typically the mean is dense for a large number of SparseVectors. In that case, we can remove the matching and just do

{code}
private def vectorMean(x: IndexedSeq[BV[Double]]): BDV[Double] = {
  val v = BDV.zeros[Double](x(0).length)
  x.foreach(xi => v += xi)
  v / x.length.toDouble
}
{code}

wdyt?
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14306538#comment-14306538 ]

Manoj Kumar commented on SPARK-5021:

Can you please explain what you mean by soft assignments? Anyhow, it might not be beneficial to keep the means sparse, as you said; however, we might benefit from not converting the original sample points to dense while making the calculations (updating the means, covariance matrix, etc.). What do you say?
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14306549#comment-14306549 ]

Manoj Kumar commented on SPARK-5021:

Ah, I see what you mean (Google helped me); I never knew that was called soft assignment. But I still think there would be benefits if we do not convert the input vectors to dense, while keeping everything else dense.
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14305032#comment-14305032 ]

Manoj Kumar commented on SPARK-5021:

I fixed it up, and it works for sparse input. However, refactoring the code is proving a huge pain and I get a lot of unrelated errors. Would you like to look at the working pre-refactoring code, or at the refactored one, which is cleaner but has an error I am unable to figure out myself and which you might be able to help with? Thanks.
[jira] [Comment Edited] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14305058#comment-14305058 ]

Manoj Kumar edited comment on SPARK-5021 at 2/4/15 1:48 PM:

Hmm, I figured it out; it is because I have something like this:

{code}
val trainData =
  if (isSparse) data.map(sample => sample.asInstanceOf[SparseVector]).cache()
  else data.map(u => u.toBreeze.toDenseVector).cache()
{code}

Since trainData can have two possible element types, this statement returns an error:

{code}
val sums =
  if (isSparse) {
    val compute = sc.broadcast(ExpectationSum.addSparse(weights, gaussians)_)
    trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
  } else {
    val compute = sc.broadcast(ExpectationSum.add(weights, gaussians)_)
    trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
  }
{code}

{code}
[error] found   : (org.apache.spark.mllib.clustering.ExpectationSum, org.apache.spark.mllib.linalg.SparseVector) => org.apache.spark.mllib.clustering.ExpectationSum
[error] required: (org.apache.spark.mllib.clustering.ExpectationSum, _0) => org.apache.spark.mllib.clustering.ExpectationSum
[error]   trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
{code}

What is the best way to overcome this, i.e. without separating the sparse and dense cases completely?
[jira] [Comment Edited] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14305058#comment-14305058 ]

Manoj Kumar edited comment on SPARK-5021 at 2/4/15 1:19 PM:

Hmm, I figured it out; it is because I have something like this:

{code}
val trainData =
  if (isSparse) data.map(sample => sample.asInstanceOf[SparseVector]).cache()
  else data.map(u => u.toBreeze.toDenseVector).cache()
{code}

Since trainData can have two possible element types, this statement returns an error:

{code}
val sums =
  if (isSparse) {
    val compute = sc.broadcast(ExpectationSum.addSparse(weights, gaussians)_)
    trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
  } else {
    val compute = sc.broadcast(ExpectationSum.add(weights, gaussians)_)
    trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
  }
{code}

{code}
[error] found   : (org.apache.spark.mllib.clustering.ExpectationSum, org.apache.spark.mllib.linalg.SparseVector) => org.apache.spark.mllib.clustering.ExpectationSum
[error] required: (org.apache.spark.mllib.clustering.ExpectationSum, _0) => org.apache.spark.mllib.clustering.ExpectationSum
[error]   trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
{code}

What is the best way to overcome this?
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14305066#comment-14305066 ]

Manoj Kumar commented on SPARK-5021:

I just realized that it renders badly. Here is the code that causes the error: https://gist.github.com/MechCoder/b015fdd266584ba6b8ff
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14305058#comment-14305058 ]

Manoj Kumar commented on SPARK-5021:

Hmm, I figured it out; it is because I have something like this:

{code}
val trainData =
  if (isSparse) data.map(sample => sample.asInstanceOf[SparseVector]).cache()
  else data.map(u => u.toBreeze.toDenseVector).cache()
{code}

Since trainData can have two possible element types, this statement returns an error:

{code}
val sums =
  if (isSparse) {
    val compute = sc.broadcast(ExpectationSum.addSparse(weights, gaussians)_)
    trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
  } else {
    val compute = sc.broadcast(ExpectationSum.add(weights, gaussians)_)
    trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
  }
{code}

{code}
[error] found   : (org.apache.spark.mllib.clustering.ExpectationSum, org.apache.spark.mllib.linalg.SparseVector) => org.apache.spark.mllib.clustering.ExpectationSum
[error] required: (org.apache.spark.mllib.clustering.ExpectationSum, _0) => org.apache.spark.mllib.clustering.ExpectationSum
[error]   trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
{code}

What is the best way to overcome this?
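The compiler error above can be reproduced without Spark at all; this pure-Scala sketch (collections standing in for RDDs) shows the cause: an if/else whose branches yield different element types is inferred at their least upper bound, so a fold written for one concrete element type no longer matches. One fix, sketched here as an assumption about how the JIRA code could be restructured, is to keep the type-specific aggregation inside the branch where the element type is still concrete:

```scala
val isSparse = true
val data: Seq[Any] = Seq(1, 2, 3)

// The branches produce Seq[Int] and Seq[String]; the whole expression is
// inferred as their least upper bound, Seq[Any], so an Int-specific fold
// applied to `xs` afterwards would not type-check.
val xs = if (isSparse) data.map(_.asInstanceOf[Int]) else data.map(_.toString)

// Fix: run the aggregation inside each branch, where the element type
// is concrete, and only let the (common-typed) result escape.
val sum =
  if (isSparse) data.map(_.asInstanceOf[Int]).foldLeft(0)(_ + _)
  else data.map(_.toString.length).foldLeft(0)(_ + _)
```

The same restructuring applies to the RDD case: call `aggregate` on `trainData` while it is still typed as `RDD[SparseVector]` or `RDD[BDV[Double]]`, rather than after the if/else has widened its element type.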
[jira] [Comment Edited] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14305058#comment-14305058 ]

Manoj Kumar edited comment on SPARK-5021 at 2/4/15 1:18 PM:

Hmm, I figured it out; it is because I have something like this:

{code}
val trainData =
  if (isSparse) data.map(sample => sample.asInstanceOf[SparseVector]).cache()
  else data.map(u => u.toBreeze.toDenseVector).cache()
{code}

Since trainData can have two possible element types, this statement returns an error:

{code}
val sums =
  if (isSparse) {
    val compute = sc.broadcast(ExpectationSum.addSparse(weights, gaussians)_)
    trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
  } else {
    val compute = sc.broadcast(ExpectationSum.add(weights, gaussians)_)
    trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
  }
{code}

{code}
[error] found   : (org.apache.spark.mllib.clustering.ExpectationSum, org.apache.spark.mllib.linalg.SparseVector) => org.apache.spark.mllib.clustering.ExpectationSum
[error] required: (org.apache.spark.mllib.clustering.ExpectationSum, _0) => org.apache.spark.mllib.clustering.ExpectationSum
[error]   trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
{code}

What is the best way to overcome this?
[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14306688#comment-14306688 ] Manoj Kumar commented on SPARK-5016: Hi, I would like to fix this (since I'm familiar to an extent with this part of the code), and maybe we could merge this before the sparse-input issue. 1. As a heuristic, how large should k be? 2. By distribute, do you mean storing the samples (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L140) as a collection using sc.parallelize, so that they can be operated on in parallel across the k Gaussians? Thanks. GaussianMixtureEM should distribute matrix inverse for large numFeatures, k --- Key: SPARK-5016 URL: https://issues.apache.org/jira/browse/SPARK-5016 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley If numFeatures or k are large, GMM EM should distribute the matrix inverse computation for Gaussian initialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
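For context, a toy illustration of the structure being discussed (hypothetical names, not Spark's code): the k per-component matrix inversions are independent of one another, so they can be keyed by component index and mapped over; in Spark that map would run over sc.parallelize(0 until k). Here a plain sequential map over 2x2 matrices stands in.

```scala
// Row-major 2x2 matrix as (a, b, c, d); purely illustrative.
type Mat2 = (Double, Double, Double, Double)

// Closed-form 2x2 inverse; fails fast on (near-)singular input.
def invert(m: Mat2): Mat2 = {
  val (a, b, c, d) = m
  val det = a * d - b * c
  require(math.abs(det) > 1e-12, s"singular matrix, det=$det")
  (d / det, -b / det, -c / det, a / det)
}

val k = 3
val covariances: IndexedSeq[Mat2] = IndexedSeq.fill(k)((2.0, 0.0, 0.0, 2.0))

// The independent per-component work; this is the map worth distributing
// when k or numFeatures grows (e.g. sc.parallelize(0 until k).map(...)).
val inverses = (0 until k).map(i => invert(covariances(i)))
```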
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14306691#comment-14306691 ] Manoj Kumar commented on SPARK-5021: [~tgaloppo] Is there any way we could have a quick 3 - 5 minute chat on this issue, so that we can agree on the way forward (maybe IRC?). GaussianMixtureEM should be faster for SparseVector input - Key: SPARK-5021 URL: https://issues.apache.org/jira/browse/SPARK-5021 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Manoj Kumar GaussianMixtureEM currently converts everything to dense vectors. It would be nice if it were faster for SparseVectors (running in time linear in the number of non-zero values). However, this may not be too important since clustering should rarely be done in high dimensions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14306549#comment-14306549 ] Manoj Kumar edited comment on SPARK-5021 at 2/5/15 4:05 AM: Ah, I see what you mean (Google helped me); I never knew that was called soft assignment. But I still think there would be benefits if we keep the input vectors sparse while keeping everything else dense, i.e., prevent converting the input to a dense form, which is what the original issue was about. was (Author: mechcoder): Ah, I see what you mean (Google helped me), I never knew that was called soft assignment. But I still think there would be benefits if we do not convert the input vectors to dense and keep everything else dense. GaussianMixtureEM should be faster for SparseVector input - Key: SPARK-5021 URL: https://issues.apache.org/jira/browse/SPARK-5021 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Manoj Kumar GaussianMixtureEM currently converts everything to dense vectors. It would be nice if it were faster for SparseVectors (running in time linear in the number of non-zero values). However, this may not be too important since clustering should rarely be done in high dimensions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14304099#comment-14304099 ] Manoj Kumar commented on SPARK-5021: Hi. I'm almost there. I have one last question. In this line, https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L223 I'm not sure how to do this, other than writing our own implementation that does not depend on NativeBlas. Is that okay? GaussianMixtureEM should be faster for SparseVector input - Key: SPARK-5021 URL: https://issues.apache.org/jira/browse/SPARK-5021 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Manoj Kumar GaussianMixtureEM currently converts everything to dense vectors. It would be nice if it were faster for SparseVectors (running in time linear in the number of non-zero values). However, this may not be too important since clustering should rarely be done in high dimensions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
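On the NativeBlas question, the kind of hand-rolled kernel being proposed is short. Here is a hedged sketch (illustrative, not the code that was actually merged) of a sparse-dense dot product that touches only the stored entries, so it runs in time linear in the number of non-zeros and needs no BLAS call.

```scala
// indices and values are the parallel arrays of a SparseVector;
// dense is the other operand. Only the nnz stored entries are visited.
def sparseDot(indices: Array[Int], values: Array[Double], dense: Array[Double]): Double = {
  var sum = 0.0
  var i = 0
  while (i < indices.length) {
    sum += values(i) * dense(indices(i))
    i += 1
  }
  sum
}
```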
[jira] [Comment Edited] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14304099#comment-14304099 ] Manoj Kumar edited comment on SPARK-5021 at 2/3/15 10:01 PM: Hi. I'm almost there. I have one last question. In this line, https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L223 I'm not sure how to do this, other than writing our own implementation that does not depend on NativeBlas for sparse data. Is that okay? was (Author: mechcoder): Hi. I'm almost there. I have one last question. In this line, https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L223 I'm not sure how to do this, other than writing our own implementation that does not depend on NativeBlas. Is that okay? GaussianMixtureEM should be faster for SparseVector input - Key: SPARK-5021 URL: https://issues.apache.org/jira/browse/SPARK-5021 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Manoj Kumar GaussianMixtureEM currently converts everything to dense vectors. It would be nice if it were faster for SparseVectors (running in time linear in the number of non-zero values). However, this may not be too important since clustering should rarely be done in high dimensions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14304099#comment-14304099 ] Manoj Kumar edited comment on SPARK-5021 at 2/3/15 10:02 PM: Hi. I'm almost there. I have one last question. In this line, https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L223 I'm not sure how to do this, other than writing our own implementation that does not depend on NativeBlas for a SparseVector. Is that okay? was (Author: mechcoder): Hi. I'm almost there. I have one last question. In this line, https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L223 I'm not sure how to do this, other than writing our own implementation that does not depend on NativeBlas for sparse data. Is that okay? GaussianMixtureEM should be faster for SparseVector input - Key: SPARK-5021 URL: https://issues.apache.org/jira/browse/SPARK-5021 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Manoj Kumar GaussianMixtureEM currently converts everything to dense vectors. It would be nice if it were faster for SparseVectors (running in time linear in the number of non-zero values). However, this may not be too important since clustering should rarely be done in high dimensions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14300889#comment-14300889 ] Manoj Kumar commented on SPARK-5021: Sorry for the delay, I just started going through the source. Just a random question: why is this model named GaussianMixtureEM? Shouldn't it be renamed to just GaussianMixtureModel, since EM is just the optimization algorithm? GaussianMixtureEM should be faster for SparseVector input - Key: SPARK-5021 URL: https://issues.apache.org/jira/browse/SPARK-5021 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Manoj Kumar GaussianMixtureEM currently converts everything to dense vectors. It would be nice if it were faster for SparseVectors (running in time linear in the number of non-zero values). However, this may not be too important since clustering should rarely be done in high dimensions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14300927#comment-14300927 ] Manoj Kumar commented on SPARK-5021: I see that it is resolved in master. What do you think should be the preferred datatype for handling an array of SparseVectors? Should we use CoordinateMatrix? This might involve extending CoordinateMatrix with additional functionality. GaussianMixtureEM should be faster for SparseVector input - Key: SPARK-5021 URL: https://issues.apache.org/jira/browse/SPARK-5021 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Manoj Kumar GaussianMixtureEM currently converts everything to dense vectors. It would be nice if it were faster for SparseVectors (running in time linear in the number of non-zero values). However, this may not be too important since clustering should rarely be done in high dimensions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14296948#comment-14296948 ] Manoj Kumar commented on SPARK-5021: Sorry for being dense, but how do I access the GaussianMixtureEM docs? They should be out in the recent version, but I'm not sure how to view them. GaussianMixtureEM should be faster for SparseVector input - Key: SPARK-5021 URL: https://issues.apache.org/jira/browse/SPARK-5021 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Manoj Kumar GaussianMixtureEM currently converts everything to dense vectors. It would be nice if it were faster for SparseVectors (running in time linear in the number of non-zero values). However, this may not be too important since clustering should rarely be done in high dimensions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5257) SparseVector indices must be non-negative
[ https://issues.apache.org/jira/browse/SPARK-5257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285336#comment-14285336 ] Manoj Kumar commented on SPARK-5257: Sure. I thought it was something I could patch up quickly, hence went ahead. Next time, I will ask first. SparseVector indices must be non-negative - Key: SPARK-5257 URL: https://issues.apache.org/jira/browse/SPARK-5257 Project: Spark Issue Type: Documentation Components: MLlib Affects Versions: 1.2.0 Reporter: Derrick Burns Priority: Minor Original Estimate: 0.25h Remaining Estimate: 0.25h The description of SparseVector suggests only that the indices have to be distinct integers. However the code for the constructor that takes an array of (index, value) tuples assumes that the indices are non-negative. Either the code must be changed or the description should be changed. This arose when I generated indices via hashing and converting the hash values to (signed) integers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
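The documentation fix aside, the constraint itself is easy to enforce. A hedged sketch (illustrative, not the actual Spark constructor) of validating that SparseVector indices are non-negative, strictly increasing (hence distinct), and in bounds:

```scala
// Validates the indices array of a would-be SparseVector of the given size.
// Hashed indices that came out negative (as in the reporter's case) are
// rejected with an explicit message rather than a later, opaque failure.
def validateIndices(indices: Array[Int], size: Int): Unit = {
  var prev = -1
  for (idx <- indices) {
    require(idx >= 0, s"index $idx is negative; indices must be mapped into [0, $size)")
    require(idx > prev, s"indices must be strictly increasing, got $idx after $prev")
    require(idx < size, s"index $idx is out of bounds for vector size $size")
    prev = idx
  }
}
```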
[jira] [Commented] (SPARK-5257) SparseVector indices must be non-negative
[ https://issues.apache.org/jira/browse/SPARK-5257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285139#comment-14285139 ] Manoj Kumar commented on SPARK-5257: [~mengxr] Can you please mark this as resolved. It sometimes creates confusion for new people who are trying to contribute. SparseVector indices must be non-negative - Key: SPARK-5257 URL: https://issues.apache.org/jira/browse/SPARK-5257 Project: Spark Issue Type: Documentation Components: MLlib Affects Versions: 1.2.0 Reporter: Derrick Burns Priority: Minor Original Estimate: 0.25h Remaining Estimate: 0.25h The description of SparseVector suggests only that the indices have to be distinct integers. However the code for the constructor that takes an array of (index, value) tuples assumes that the indices are non-negative. Either the code must be changed or the description should be changed. This arose when I generated indices via hashing and converting the hash values to (signed) integers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284299#comment-14284299 ] Manoj Kumar commented on SPARK-5021: [~josephkb] Can you please assign this to me? I can work on this in the coming week. GaussianMixtureEM should be faster for SparseVector input - Key: SPARK-5021 URL: https://issues.apache.org/jira/browse/SPARK-5021 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley GaussianMixtureEM currently converts everything to dense vectors. It would be nice if it were faster for SparseVectors (running in time linear in the number of non-zero values). However, this may not be too important since clustering should rarely be done in high dimensions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3381) DecisionTree: eliminate bins for unordered features
[ https://issues.apache.org/jira/browse/SPARK-3381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281820#comment-14281820 ] Manoj Kumar commented on SPARK-3381: Hi, I would like to work on this, but preferably after the sampling_rate PR is merged, because I do not want to clutter the PR queue. DecisionTree: eliminate bins for unordered features --- Key: SPARK-3381 URL: https://issues.apache.org/jira/browse/SPARK-3381 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.1.0 Reporter: Joseph K. Bradley Priority: Trivial Code simplification: DecisionTree currently allocates bins for unordered features (in findSplitsBins). However, those bins are not needed; only the splits are required. This change will require modifying findSplitsBins, as well as modifying a few other functions to use splits instead of bins. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3726) RandomForest: Support for bootstrap options
[ https://issues.apache.org/jira/browse/SPARK-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14280658#comment-14280658 ] Manoj Kumar commented on SPARK-3726: Ah I see. I had my doubts when I started looking at the code, but was in a hurry to send a Pull Request. So this can be closed? RandomForest: Support for bootstrap options --- Key: SPARK-3726 URL: https://issues.apache.org/jira/browse/SPARK-3726 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Manoj Kumar Priority: Minor RandomForest uses BaggedPoint to simulate bootstrapped samples of the data. The expected size of each sample is the same as the original data (sampling rate = 1.0), and sampling is done with replacement. Adding support for other sampling rates and for sampling without replacement would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3726) RandomForest: Support for bootstrap options
[ https://issues.apache.org/jira/browse/SPARK-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14277643#comment-14277643 ] Manoj Kumar commented on SPARK-3726: [~josephkb] You seem to report issues that I always think I can have a decent shot at :) I would like to submit a PR for this by the end of the week. RandomForest: Support for bootstrap options --- Key: SPARK-3726 URL: https://issues.apache.org/jira/browse/SPARK-3726 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Priority: Minor RandomForest uses BaggedPoint to simulate bootstrapped samples of the data. The expected size of each sample is the same as the original data (sampling rate = 1.0), and sampling is done with replacement. Adding support for other sampling rates and for sampling without replacement would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
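A hedged sketch of the sampling semantics under discussion (illustrative; BaggedPoint's real implementation differs): with replacement, each point's multiplicity in a bootstrap sample is Poisson(rate)-distributed, while without replacement it is Bernoulli(rate), i.e. 0 or 1.

```scala
import scala.util.Random

// Per-point weight for one bootstrap sample. Knuth's multiplicative
// Poisson sampler is used for the with-replacement case; it is fine
// for the small sampling rates relevant here.
def sampleWeight(rng: Random, rate: Double, withReplacement: Boolean): Int =
  if (withReplacement) {
    val limit = math.exp(-rate)
    var k = 0
    var p = 1.0
    while ({ p *= rng.nextDouble(); p > limit }) k += 1
    k
  } else {
    if (rng.nextDouble() < rate) 1 else 0
  }
```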
[jira] [Commented] (SPARK-2909) Indexing for SparseVector in pyspark
[ https://issues.apache.org/jira/browse/SPARK-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273880#comment-14273880 ] Manoj Kumar commented on SPARK-2909: [~josephkb] Sorry for spamming your inbox, but just a heads up that I'm working on this. I will most likely submit a Pull Request by tomorrow. Indexing for SparseVector in pyspark Key: SPARK-2909 URL: https://issues.apache.org/jira/browse/SPARK-2909 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Joseph K. Bradley Priority: Minor SparseVector in pyspark does not currently support indexing, except by examining the internal representation. Though indexing is a pricy operation, it would be useful for, e.g., iterating through a dataset (RDD[LabeledPoint]) and operating on a single feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
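A hedged sketch of the operation being requested (illustrative, not the pyspark code; written in Scala to match the rest of this thread): binary-search the sorted indices array and return 0.0 when the position is not stored, giving O(log nnz) access per element.

```scala
// indices must be sorted ascending, as SparseVector guarantees.
// A negative result from binarySearch means "not stored", i.e. zero.
def sparseApply(indices: Array[Int], values: Array[Double], i: Int): Double = {
  val j = java.util.Arrays.binarySearch(indices, i)
  if (j >= 0) values(j) else 0.0
}
```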
[jira] [Comment Edited] (SPARK-5022) Change VectorUDT to object
[ https://issues.apache.org/jira/browse/SPARK-5022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270817#comment-14270817 ] Manoj Kumar edited comment on SPARK-5022 at 1/9/15 9:50 AM: [~josephkb] I want to have a go at this one. Should I wait for my other PR to get merged, or is it ok if I submit one here, before it gets merged? was (Author: mechcoder): @josephkb I want to have a go at this one. Should I wait for my other PR to get merged, or is it ok if I submit one here, before it gets merged? Change VectorUDT to object -- Key: SPARK-5022 URL: https://issues.apache.org/jira/browse/SPARK-5022 Project: Spark Issue Type: Improvement Components: MLlib, SQL Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor VectorUDT DataTypes are all identical, so VectorUDT should probably be an object instead of a class. Once this is done, we can remove equals(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5022) Change VectorUDT to object
[ https://issues.apache.org/jira/browse/SPARK-5022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270817#comment-14270817 ] Manoj Kumar commented on SPARK-5022: @josephkb I want to have a go at this one. Should I wait for my other PR to get merged, or is it ok if I submit one here, before it gets merged? Change VectorUDT to object -- Key: SPARK-5022 URL: https://issues.apache.org/jira/browse/SPARK-5022 Project: Spark Issue Type: Improvement Components: MLlib, SQL Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor VectorUDT DataTypes are all identical, so VectorUDT should probably be an object instead of a class. Once this is done, we can remove equals(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5022) Change VectorUDT to object
[ https://issues.apache.org/jira/browse/SPARK-5022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14271608#comment-14271608 ] Manoj Kumar commented on SPARK-5022: Hi, thanks for the reply. I have worked on implementing linear models, clustering models, and a number of metrics in scikit-learn. Do you have any specific issues or feature requests in mind that you would like to see done, or do I keep searching? Change VectorUDT to object -- Key: SPARK-5022 URL: https://issues.apache.org/jira/browse/SPARK-5022 Project: Spark Issue Type: Improvement Components: MLlib, SQL Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor VectorUDT DataTypes are all identical, so VectorUDT should probably be an object instead of a class. Once this is done, we can remove equals(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4406) SVD should check for k < 1
[ https://issues.apache.org/jira/browse/SPARK-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268180#comment-14268180 ] Manoj Kumar commented on SPARK-4406: Hi Joseph, I believe this issue would be simple enough for me to start working on. Does it require you to assign it to me, or can I send a Pull Request right away? SVD should check for k < 1 -- Key: SPARK-4406 URL: https://issues.apache.org/jira/browse/SPARK-4406 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor When SVD is called with k < 1, it still tries to compute the SVD, causing a lower-level error. It should fail early. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
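The fail-fast guard itself is essentially a one-liner. A hedged sketch (illustrative, not the merged patch) of validating k before any low-level SVD work starts:

```scala
// Reject an invalid k up front with a readable message instead of letting
// the request surface as an opaque lower-level linear-algebra error.
def checkK(k: Int, numCols: Int): Unit =
  require(k >= 1 && k <= numCols,
    s"Requested k=$k singular values, but k must be between 1 and numCols=$numCols.")
```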
[jira] [Comment Edited] (SPARK-4406) SVD should check for k < 1
[ https://issues.apache.org/jira/browse/SPARK-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268180#comment-14268180 ] Manoj Kumar edited comment on SPARK-4406 at 1/7/15 8:40 PM: Hi Joseph, I believe this issue would be simple enough for me to start working on. Does it require you to assign it to me, or can I send a Pull Request right away? was (Author: mechcoder): Hi Joseph, I believe this issue would be simple enough for me to start working on? Does it require you to assign it to me, or can I send a Pull Request right away? SVD should check for k < 1 -- Key: SPARK-4406 URL: https://issues.apache.org/jira/browse/SPARK-4406 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor When SVD is called with k < 1, it still tries to compute the SVD, causing a lower-level error. It should fail early. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org