[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

2015-02-20 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/4654#issuecomment-75236020 @MechCoder I mean making sure this is run on a cluster and not just on a single machine. My hypothesis is that cost of distributing the tasks to the cluster nodes

[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

2015-02-17 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/4654#discussion_r24845959 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala --- @@ -135,25 +135,39 @@ class GaussianMixture private

[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

2015-02-17 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/4654#discussion_r24848991 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala --- @@ -135,25 +135,39 @@ class GaussianMixture private

[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

2015-02-17 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/4654#discussion_r24846194 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala --- @@ -135,25 +135,39 @@ class GaussianMixture private

[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

2015-02-17 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/4654#issuecomment-74737757 Please be sure to test in cluster setting, not just on a multicore machine... I believe the computation/communication ratio is going to be too low to make

[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

2015-02-17 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/4654#discussion_r24850641 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala --- @@ -168,16 +182,26 @@ class GaussianMixture private

[GitHub] spark pull request: [SPARK-5021] [MLlib] Gaussian Mixture now supp...

2015-02-10 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/4459#issuecomment-73666468 LGTM cc: @jkbradley @mengxr --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project

[GitHub] spark pull request: [SPARK-5021] [MLlib] Gaussian Mixture now supp...

2015-02-10 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/4459#discussion_r24397429 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/GaussianMixtureSuite.scala --- @@ -80,4 +81,60 @@ class GaussianMixtureSuite extends

[GitHub] spark pull request: [SPARK-5021] [MLlib] Gaussian Mixture now supp...

2015-02-09 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/4459#discussion_r24355867 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/GaussianMixtureSuite.scala --- @@ -80,4 +81,60 @@ class GaussianMixtureSuite extends

[GitHub] spark pull request: [SPARK-5021] [MLlib] Gaussian Mixture now supp...

2015-02-09 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/4459#issuecomment-73568637 @MechCoder Do you mean the negative values in the covariance (sigma) matrices? Negative covariance indicates, roughly speaking, that variables move in opposite
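The point about negative entries in a covariance matrix can be illustrated with a minimal Python sketch. The data and the helper name `sample_covariance` are invented for this illustration and do not come from the pull request. Two variables that move in opposite directions produce a negative off-diagonal covariance, while each variable's own variance (the diagonal entry) stays non-negative.

```python
def sample_covariance(xs, ys):
    """Unbiased sample covariance of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical data: y falls as x rises.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [8.0, 6.0, 4.0, 2.0]

cov_xy = sample_covariance(xs, ys)  # off-diagonal entry, negative for this data
var_x = sample_covariance(xs, xs)   # diagonal entry, a variance, never negative

print(cov_xy, var_x)
```

A negative `cov_xy` here is therefore expected behaviour rather than a sign of a numerical problem; only a negative value on the diagonal would be suspicious.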

[GitHub] spark pull request: [SPARK-5021] [MLlib] Gaussian Mixture now supp...

2015-02-09 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/4459#discussion_r24326410 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala --- @@ -215,20 +217,29 @@ private object ExpectationSum { def

[GitHub] spark pull request: [SPARK-5021] [MLlib] Gaussian Mixture now supp...

2015-02-09 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/4459#issuecomment-73510029 @MechCoder Nothing else stands out to me... I will give it another look after your next commit.

[GitHub] spark pull request: [SPARK-5021] [MLlib] Gaussian Mixture now supp...

2015-02-09 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/4459#issuecomment-73571213 @MechCoder Getting close; just need to finish up the sparse single cluster test.

[GitHub] spark pull request: [SPARK-5021] [MLlib] Gaussian Mixture now supp...

2015-02-09 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/4459#discussion_r24326951 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -255,6 +255,20 @@ private[spark] object BLAS extends Serializable

[GitHub] spark pull request: [SPARK-5021] [MLlib] Gaussian Mixture now supp...

2015-02-09 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/4459#discussion_r24326687 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/GaussianMixtureSuite.scala --- @@ -40,10 +41,15 @@ class GaussianMixtureSuite extends

[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...

2015-02-08 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/4459#discussion_r24302522 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala --- @@ -19,10 +19,12 @@ package org.apache.spark.mllib.clustering

[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...

2015-02-08 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/4459#discussion_r24302656 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala --- @@ -215,20 +217,29 @@ private object ExpectationSum { def

[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...

2015-02-08 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/4459#discussion_r24302677 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/GaussianMixtureSuite.scala --- @@ -40,10 +41,15 @@ class GaussianMixtureSuite extends

[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...

2015-02-08 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/4459#discussion_r24302764 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -255,6 +255,20 @@ private[spark] object BLAS extends Serializable

[GitHub] spark pull request: [SPARK-5013] [MLlib] Added documentation and s...

2015-02-06 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/4401#issuecomment-73243150 @mengxr I was able to build the docs (I had to do a clean build on my source tree for some reason). I have checked all links (that I added, anyway). I also updated

[GitHub] spark pull request: [SPARK-5013] [MLlib] [WIP] Added documentation...

2015-02-05 Thread tgaloppo
GitHub user tgaloppo opened a pull request: https://github.com/apache/spark/pull/4401 [SPARK-5013] [MLlib] [WIP] Added documentation and sample data file for GaussianMixture Simple description and code samples (and sample data) for GaussianMixture You can merge this pull request

[GitHub] spark pull request: [SPARK-5013] [MLlib] [WIP] Added documentation...

2015-02-05 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/4401#issuecomment-73166676 @mengxr I was considering adding a graphic of the test data with the recovered 2-d gaussians... I'm not sure if it would really be beneficial or not.

[GitHub] spark pull request: [SPARK-5013] [MLlib] [WIP] Added documentation...

2015-02-05 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/4401#issuecomment-73171324 @mengxr I have made the fixes you pointed out. I am having trouble building the API docs, so I cannot verify that the link to the Python GaussianMixture class resolves

[GitHub] spark pull request: SPARK-5400 [MLlib] Changed name of GaussianMix...

2015-01-30 Thread tgaloppo
GitHub user tgaloppo opened a pull request: https://github.com/apache/spark/pull/4290 SPARK-5400 [MLlib] Changed name of GaussianMixtureEM to GaussianMixture Decoupling the model and the algorithm

[GitHub] spark pull request: SPARK-5019 [MLlib] - GaussianMixtureModel expo...

2015-01-19 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/4088#issuecomment-70595369 @jkbradley I considered making those plural for the initial commit. I guess I should have. Update has been made.

[GitHub] spark pull request: SPARK-5019 - GaussianMixtureModel exposes inst...

2015-01-17 Thread tgaloppo
GitHub user tgaloppo opened a pull request: https://github.com/apache/spark/pull/4088 SPARK-5019 - GaussianMixtureModel exposes instances of MultivariateGauss... This PR modifies GaussianMixtureModel to expose instances of MultivariateGaussian rather than separate mean

[GitHub] spark pull request: SPARK-5018 [MLlib] [WIP] Make MultivariateGaus...

2015-01-10 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3923#issuecomment-69461424 I have made the requested changes and resolved the merge conflicts. Question: MultivariateGaussian now keeps a private Breeze version of the mean vector rather

[GitHub] spark pull request: SPARK-5018 [MLlib] [WIP] Make MultivariateGaus...

2015-01-10 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3923#issuecomment-69479478 Thanks, @jkbradley I have made the style correction.

[GitHub] spark pull request: [SPARK-5015] [mllib] Random seed for GMM + mak...

2015-01-09 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/3981#discussion_r22735493 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximizationSuite.scala --- @@ -35,12 +35,14 @@ class

[GitHub] spark pull request: [SPARK-5015] [mllib] Random seed for GMM + mak...

2015-01-09 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/3981#discussion_r22738055 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximizationSuite.scala --- @@ -35,12 +35,14 @@ class

[GitHub] spark pull request: SPARK-5018 [MLlib] [WIP] Make MultivariateGaus...

2015-01-07 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/3923#discussion_r22628165 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala --- @@ -91,7 +127,7 @@ private[mllib] class MultivariateGaussian

[GitHub] spark pull request: SPARK-5018 [MLlib] [WIP] Make MultivariateGaus...

2015-01-07 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3923#issuecomment-69122845 @jkbradley Thanks! I have made the requested changes. Are there any other public methods that you think would be useful to add at this time?

[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-06 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/3871#discussion_r22521704 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala --- @@ -17,23 +17,84 @@ package

[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-06 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/3871#discussion_r22545409 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala --- @@ -17,23 +17,84 @@ package

[GitHub] spark pull request: SPARK-5018 [MLlib] [WIP] Make MultivariateGaus...

2015-01-06 Thread tgaloppo
GitHub user tgaloppo opened a pull request: https://github.com/apache/spark/pull/3923 SPARK-5018 [MLlib] [WIP] Make MultivariateGaussian public Moving MultivariateGaussian from private[mllib] to public. The class uses Breeze vectors internally, so this involves creating a public

[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-05 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3871#issuecomment-68806762 @mengxr Changes made.

[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-05 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/3871#discussion_r22504722 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala --- @@ -17,23 +17,84 @@ package

[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-05 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/3871#discussion_r22508610 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala --- @@ -17,23 +17,84 @@ package

[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-04 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3871#issuecomment-68649845 @jkbradley Good call on the test suite; I have added some non-center points to the tests. I also added the brackets to the in-comment link.

[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-03 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/3871#discussion_r22434487 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussianSuite.scala --- @@ -0,0 +1,61 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-02 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3871#issuecomment-68581355 @jkbradley I used Octave's mvnpdf from the statistics package for the non-singular cases; it cannot handle singular covariance matrices, so I was only able to recreate

[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-02 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3871#issuecomment-68575728 @jkbradley I think performing the pdf calculation in log-space (and providing a logpdf() method) is a good idea. Perhaps we can make this part of transitioning
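The motivation for a log-space calculation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a diagonal covariance (a simplification; the MultivariateGaussian discussed in the PR works with a full covariance matrix), and the function names `diag_gaussian_logpdf` and `diag_gaussian_pdf` are hypothetical. In high dimensions the plain density underflows to zero in double precision, while the log-density remains a finite, usable number.

```python
import math


def diag_gaussian_logpdf(x, mean, variances):
    """Log-density of a Gaussian with diagonal covariance (illustrative simplification)."""
    if not (len(x) == len(mean) == len(variances)):
        raise ValueError("dimension mismatch")
    log_density = 0.0
    for xi, mi, vi in zip(x, mean, variances):
        # Each independent dimension contributes one univariate log-density term.
        log_density += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return log_density


def diag_gaussian_pdf(x, mean, variances):
    """Plain density, obtained by exponentiating the log-density."""
    return math.exp(diag_gaussian_logpdf(x, mean, variances))


# Hypothetical high-dimensional point, three standard deviations out in every dimension.
dim = 1000
x = [3.0] * dim
mean = [0.0] * dim
variances = [1.0] * dim

log_p = diag_gaussian_logpdf(x, mean, variances)  # finite, large negative number
p = diag_gaussian_pdf(x, mean, variances)         # underflows to 0.0

print(log_p, p)
```

Working with `log_p` directly (for example when normalizing responsibilities in EM) avoids dividing by a density that has already collapsed to zero.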

[GitHub] spark pull request: SPARK-5017 - Use SVD to compute determinant an...

2015-01-01 Thread tgaloppo
GitHub user tgaloppo opened a pull request: https://github.com/apache/spark/pull/3871 SPARK-5017 - Use SVD to compute determinant and inverse of covariance matrix

[GitHub] spark pull request: SPARK-5020 [MLlib] GaussianMixtureModel.predic...

2014-12-31 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3854#issuecomment-68465266 @jkbradley I am not crazy about the name predictMembership() ... to me it implies the hard assignment; a simple change like predictMemberships() might be more clear

[GitHub] spark pull request: SPARK-5020 [MLlib] GaussianMixtureModel.predic...

2014-12-31 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3854#issuecomment-68467812 Ok, I will rename the method to predictSoft(). Not sure what to make of the streaming failure.

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-12-31 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3022#issuecomment-68475553 @jkbradley Please assign me SPARK-5017, and I will take care of this in preparation for 5018 and 5019.

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-12-30 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3022#issuecomment-68401923 @jkbradley Please assign 5017, 5018, 5019, and 5020 to me. Regarding 5018, can you refer me to other PRs that are bringing in common distributions? I can work toward

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-12-30 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3022#issuecomment-68415536 @jkbradley No problem. Let's start with 5020, and I'll move on from there.

[GitHub] spark pull request: SPARK-5020 [MLlib] GaussianMixtureModel.predic...

2014-12-30 Thread tgaloppo
GitHub user tgaloppo opened a pull request: https://github.com/apache/spark/pull/3854 SPARK-5020 [MLlib] GaussianMixtureModel.predictMembership() should take an RDD only Removed unnecessary parameters to predictMembership() CC: @jkbradley

[GitHub] spark pull request: SPARK-5020 [MLlib] GaussianMixtureModel.predic...

2014-12-30 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3854#issuecomment-68425054 My repo seems to have some lingering commit tags.

[GitHub] spark pull request: SPARK-5020 [MLlib] GaussianMixtureModel.predic...

2014-12-30 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3854#issuecomment-68427096 @jkbradley No, private modifier was a gaffe on my part. This is corrected. I think I have corrected the lingering commits... rebase did not work for me

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-12-29 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3022#issuecomment-68313864 @jkbradley Thank you for your help and feedback along the way. Please assign some (or all) of those tickets to me and I will continue to improve the implementation

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-12-22 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3022#issuecomment-67885369 Ok. I changed the privacy of EPSILON and am now using it in this code. I changed the name from GaussianMixtureModelEM to GaussianMixtureEM. I've changed

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-12-19 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/3022#discussion_r22136408 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala --- @@ -0,0 +1,248 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-12-19 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/3022#discussion_r22136877 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala --- @@ -0,0 +1,248 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-12-19 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3022#issuecomment-67717252 I've performed most of the requested changes. I do not see the BLAS function mentioned (dsyr), so I left this as a TODO. Also, I could not find EPSILON in MLUtils

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-12-18 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3022#issuecomment-67486366 Ok, I have addressed (I think) all of those issues, with the exception of modifying GaussianMixtureModel to carry instances of MultivariateGaussian. I do like

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-12-18 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/3022#discussion_r22059411 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala --- @@ -0,0 +1,284 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-12-18 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/3022#discussion_r22066566 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala --- @@ -0,0 +1,39 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-12-18 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3022#issuecomment-67550199 Sorry, I forgot to comment on this issue. That would be fine with me. The prediction methods were contributed by @FlytxtRnD , so perhaps we can solicit their opinion

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-12-18 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3022#issuecomment-67582826 Excellent. 100 features is probably a bit of a stretch for the algorithm... the density at any point (especially with respect to the initial random gaussians) is going

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-12-18 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/3022#discussion_r22084037 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala --- @@ -0,0 +1,244 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-12-18 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/3022#discussion_r22084185 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DenseGmmEM.scala --- @@ -0,0 +1,65 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-12-18 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3022#issuecomment-67586420 Great! I've pushed the requested changes. I will open a ticket on Jira about making the MultivariateGaussian more widely applicable.

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-12-17 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3022#issuecomment-67387864 I have replaced the accumulators with RDD.aggregate functionality. I added functionality allowing the user to provide their own initial GMM, bypassing the random
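The aggregate pattern mentioned here can be sketched locally in plain Python, without Spark. Everything in this block is an assumption made for illustration: the `aggregate` function is a local stand-in that mimics the shape of the pattern (a zero value, a per-element fold within each partition, and a merge of per-partition results), the partitions are ordinary lists, and the statistic is a simple count and sum rather than the expectation sums the PR actually accumulates.

```python
from functools import reduce


def aggregate(partitions, zero, seq_op, comb_op):
    """Local stand-in for the aggregate pattern: fold each partition, then merge."""
    # Step 1: each partition is folded independently (on a cluster, on its own worker).
    partials = [reduce(seq_op, part, zero) for part in partitions]
    # Step 2: the small per-partition results are merged into one value.
    return reduce(comb_op, partials, zero)


def seq_op(acc, x):
    """Fold one data point into a (count, sum) accumulator."""
    return (acc[0] + 1, acc[1] + x)


def comb_op(a, b):
    """Merge two (count, sum) accumulators."""
    return (a[0] + b[0], a[1] + b[1])


# Hypothetical data split across three partitions.
partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]

total = aggregate(partitions, (0, 0.0), seq_op, comb_op)
mean = total[1] / total[0]

print(total, mean)
```

The design point is that only the compact accumulator travels between partitions, and the result is returned as a value instead of being written into shared accumulator state as a side effect.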

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-12-17 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/3022#discussion_r22017550 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala --- @@ -0,0 +1,284 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-12-17 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/3022#discussion_r22018162 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModel.scala --- @@ -0,0 +1,50 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-12-17 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3022#issuecomment-67445452 Working on these changes; still a few left. Great feedback; really helping to improve my Scala!

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-12-16 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3022#issuecomment-67158536 I've merged in the predict() method from @FlytxtRnD. I am working on the changeover from accumulators to RDD.aggregate; I should have this up soon.

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-12-15 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3022#issuecomment-67076315 No worries; it'll get there. I appreciate the comments and pointers. On Dec 15, 2014, at 4:52 PM, jkbradley notificati...@github.com wrote

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-12-15 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3022#issuecomment-67097150 Ok, I will look into swapping the accumulators out for aggregate(). In the meantime I have worked to correct some of the style issues.

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-12-12 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3022#issuecomment-66822813 @jkbradley I have pushed commits addressing [hopefully] all of the issues you pointed out. Of particular concern to me is the movement of the utility MultivariateGaussian

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-12-11 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/3022#discussion_r21683030 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximization.scala --- @@ -0,0 +1,283 @@ +/* + * Licensed

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-12-11 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/3022#discussion_r21683119 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximization.scala --- @@ -0,0 +1,283 @@ +/* + * Licensed

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-12-11 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3022#issuecomment-66636308 @jkbradley Thank you for your comments. I am working to resolve these issues and will push these changes in a day or two.

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-11-17 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3022#issuecomment-63405397 Merged with the latest master branch to hopefully fix any merge issues. Updated Scala test suite to use new MLlibSparkTestContext. Improved cluster initialization

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-11-13 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3022#issuecomment-62929422 Thanks, @squito ... while I expect the array to only have a few elements, I have made changes according to your advice.

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-11-10 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3022#issuecomment-62386233 Please advise how to resolve merge issues.

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-11-08 Thread tgaloppo
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3022#issuecomment-62256712 This test appeared to fail due to some form of timeout during the pull; is there any action I need to take?

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

2014-10-30 Thread tgaloppo
GitHub user tgaloppo opened a pull request: https://github.com/apache/spark/pull/3022 SPARK-4156 [MLLIB] EM algorithm for GMMs Implementation of Expectation-Maximization for Gaussian Mixture Models. This is my maiden contribution to Apache Spark, so I apologize now if I