Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/4654#issuecomment-75236020
@MechCoder I mean making sure this is run on a cluster and not just on a
single machine. My hypothesis is that cost of distributing the tasks to the
cluster nodes (and
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/4654#discussion_r24850641
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala ---
@@ -168,16 +182,26 @@ class GaussianMixture private
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/4654#discussion_r24848991
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala ---
@@ -135,25 +135,39 @@ class GaussianMixture private
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/4654#issuecomment-74737757
Please be sure to test in a cluster setting, not just on a multicore
machine... I believe the computation/communication ratio is going to be too low
to make this
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/4654#discussion_r24846194
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala ---
@@ -135,25 +135,39 @@ class GaussianMixture private
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/4654#discussion_r24845959
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala ---
@@ -135,25 +135,39 @@ class GaussianMixture private
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/4459#issuecomment-73666468
LGTM
cc: @jkbradley @mengxr
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/4459#discussion_r24397429
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/clustering/GaussianMixtureSuite.scala
---
@@ -80,4 +81,60 @@ class GaussianMixtureSuite extends
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/4459#issuecomment-73571213
@MechCoder Getting close; just need to finish up the sparse single cluster
test.
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/4459#discussion_r24355867
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/clustering/GaussianMixtureSuite.scala
---
@@ -80,4 +81,60 @@ class GaussianMixtureSuite extends
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/4459#issuecomment-73568637
@MechCoder Do you mean the negative values in the covariance (sigma)
matrices? Negative covariance indicates, roughly speaking, that variables move
in opposite
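To illustrate the point about negative entries in a covariance matrix, here is a minimal sketch in plain Python. The data and the helper name `sample_covariance` are made up for illustration and do not come from the pull request.

```python
def sample_covariance(xs, ys):
    """Unbiased sample covariance of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical data: y falls as x rises, so the two move in opposite directions.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [8.0, 6.0, 4.0, 2.0]

cov_xy = sample_covariance(xs, ys)  # off-diagonal entry: negative here
var_x = sample_covariance(xs, xs)   # diagonal entry: a variance, never negative
```

Only the off-diagonal entries of a covariance matrix can be negative; the diagonal holds variances.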
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/4459#issuecomment-73510029
@MechCoder Nothing else stands out to me... I will give it another look
after your next commit.
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/4459#discussion_r24326951
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -255,6 +255,20 @@ private[spark] object BLAS extends Serializable with
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/4459#discussion_r24326687
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/clustering/GaussianMixtureSuite.scala
---
@@ -40,10 +41,15 @@ class GaussianMixtureSuite extends
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/4459#discussion_r24326410
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala ---
@@ -215,20 +217,29 @@ private object ExpectationSum {
def
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/4459#discussion_r24302764
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -255,6 +255,20 @@ private[spark] object BLAS extends Serializable with
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/4459#discussion_r24302677
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/clustering/GaussianMixtureSuite.scala
---
@@ -40,10 +41,15 @@ class GaussianMixtureSuite extends
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/4459#discussion_r24302656
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala ---
@@ -215,20 +217,29 @@ private object ExpectationSum {
def
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/4459#discussion_r24302522
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala ---
@@ -19,10 +19,12 @@ package org.apache.spark.mllib.clustering
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/4401#issuecomment-73243150
@mengxr I was able to build the docs (I had to do a clean build on my
source tree for some reason). I have checked all links (that I added, anyway).
I also updated
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/4401#issuecomment-73171324
@mengxr I have made the fixes you pointed out. I am having trouble
building the API docs so I cannot verify that the link to the Python
GaussianMixture class resolves
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/4401#issuecomment-73166676
@mengxr I was considering adding a graphic of the test data with the
recovered 2-D Gaussians... I'm not sure if it would really be beneficial or not.
GitHub user tgaloppo opened a pull request:
https://github.com/apache/spark/pull/4401
[SPARK-5013] [MLlib] [WIP] Added documentation and sample data file for
GaussianMixture
Simple description and code samples (and sample data) for GaussianMixture
You can merge this pull request
GitHub user tgaloppo opened a pull request:
https://github.com/apache/spark/pull/4290
SPARK-5400 [MLlib] Changed name of GaussianMixtureEM to GaussianMixture
Decoupling the model and the algorithm
You can merge this pull request into a Git repository by running:
$ git pull
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/4088#issuecomment-70595369
@jkbradley I considered making those plural for the initial commit. I
guess I should have. Update has been made.
GitHub user tgaloppo opened a pull request:
https://github.com/apache/spark/pull/4088
SPARK-5019 - GaussianMixtureModel exposes instances of MultivariateGauss...
This PR modifies GaussianMixtureModel to expose instances of
MultivariateGaussian rather than separate mean and
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3923#issuecomment-69479478
Thanks, @jkbradley
I have made the style correction.
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3923#issuecomment-69461424
I have made the requested changes and resolved the merge conflicts.
Question: MultivariateGaussian now keeps a private Breeze version of the
mean vector rather
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/3981#discussion_r22738055
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximizationSuite.scala
---
@@ -35,12 +35,14 @@ class
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/3981#discussion_r22735493
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximizationSuite.scala
---
@@ -35,12 +35,14 @@ class
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3923#issuecomment-69245914
@jkbradley Changes made. Thanks!
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3923#issuecomment-69122845
@jkbradley Thanks! I have made the requested changes. Are there any other
public methods that you think would be useful to add at this time?
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/3923#discussion_r22628165
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala
---
@@ -91,7 +127,7 @@ private[mllib] class MultivariateGaussian
GitHub user tgaloppo opened a pull request:
https://github.com/apache/spark/pull/3923
SPARK-5018 [MLlib] [WIP] Make MultivariateGaussian public
Moving MultivariateGaussian from private[mllib] to public. The class uses
Breeze vectors internally, so this involves creating a public
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/3871#discussion_r22545409
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala
---
@@ -17,23 +17,84 @@
package
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/3871#discussion_r22521704
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala
---
@@ -17,23 +17,84 @@
package
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/3871#discussion_r22508610
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala
---
@@ -17,23 +17,84 @@
package
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/3871#discussion_r22504722
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala
---
@@ -17,23 +17,84 @@
package
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3871#issuecomment-68806762
@mengxr Changes made.
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3871#issuecomment-68649845
@jkbradley Good call on the test suite; I have added some non-center points
to the tests. I also added the brackets to the in-comment link.
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/3871#discussion_r22434487
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussianSuite.scala
---
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3871#issuecomment-68581355
@jkbradley I used Octave's mvnpdf from the statistics package for the
non-singular cases; it cannot handle singular covariance matrices, so I was
only able to rec
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3871#issuecomment-68575728
@jkbradley I think performing the pdf calculation in log-space (and
providing a logpdf() method) is a good idea. Perhaps we can make this part of
transitioning
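The motivation for a log-space density can be sketched in a few lines of plain Python. This is not MLlib's implementation: it assumes a diagonal covariance matrix to stay dependency-free, and the function names and the 100-dimensional test point are hypothetical.

```python
import math


def diag_gaussian_logpdf(x, mean, variances):
    """Log-density of a Gaussian with diagonal covariance (a simplifying assumption)."""
    d = len(x)
    log_det = sum(math.log(v) for v in variances)
    mahalanobis = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mean, variances))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + mahalanobis)


def diag_gaussian_pdf(x, mean, variances):
    """Plain density, obtained by exponentiating the log-density."""
    return math.exp(diag_gaussian_logpdf(x, mean, variances))


# Hypothetical point four standard deviations from the mean in each of 100 dimensions.
d = 100
mean = [0.0] * d
variances = [1.0] * d
x = [4.0] * d

log_density = diag_gaussian_logpdf(x, mean, variances)  # finite, large negative number
density = diag_gaussian_pdf(x, mean, variances)         # underflows to 0.0 in double precision
```

Working with the log-density keeps points with very small likelihood comparable, whereas the plain density collapses them all to zero.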
GitHub user tgaloppo opened a pull request:
https://github.com/apache/spark/pull/3871
SPARK-5017 - Use SVD to compute determinant and inverse of covariance matrix
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tgaloppo/spark
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3022#issuecomment-68475553
@jkbradley Please assign me SPARK-5017, and I will take care of this in
preparation for 5018 and 5019.
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3854#issuecomment-68467812
Ok, I will rename the method to predictSoft()
Not sure what to make of the streaming failure ??
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3854#issuecomment-68465266
@jkbradley I am not crazy about the name predictMembership() ... to me it
implies the hard assignment; a simple change like predictMemberships() might
be more clear
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3854#issuecomment-68427096
@jkbradley No, private modifier was a gaffe on my part. This is corrected.
I think I have corrected the lingering commits... rebase did not work for
me, but a
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3854#issuecomment-68425054
My repo seems to have some lingering commit tags.
GitHub user tgaloppo opened a pull request:
https://github.com/apache/spark/pull/3854
SPARK-5020 [MLlib] GaussianMixtureModel.predictMembership() should take an
RDD only
Removed unnecessary parameters to predictMembership()
CC: @jkbradley
You can merge this pull request
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3022#issuecomment-68415536
@jkbradley No problem. Let's start with 5020, and I'll move on from there.
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3022#issuecomment-68401923
@jkbradley Please assign 5017, 5018, 5019, and 5020 to me. Regarding 5018,
can you refer me to other PRs that are bringing in common distributions? I
can work t
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3022#issuecomment-68313864
@jkbradley Thank you for your help and feedback along the way. Please
assign some (or all) of those tickets to me and I will continue to improve the
implementation
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3022#issuecomment-67885369
Ok. I changed the privacy of EPSILON and am now using it in this code.
I changed the name from GaussianMixtureModelEM to GaussianMixtureEM.
I've ch
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3022#issuecomment-67717252
I've performed most of the requested changes. I do not see the BLAS
function mentioned (dsyr), so I left this as a TODO. Also, I could not find
EPSILON in ML
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/3022#discussion_r22136877
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala
---
@@ -0,0 +1,248 @@
+/*
+ * Licensed to the Apache
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/3022#discussion_r22136408
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala
---
@@ -0,0 +1,248 @@
+/*
+ * Licensed to the Apache
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3022#issuecomment-67586420
Great! I've pushed the requested changes. I will open a ticket on Jira
about making the MultivariateGaussian more widely applicable.
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/3022#discussion_r22084185
--- Diff:
examples/src/main/scala/org/apache/spark/examples/mllib/DenseGmmEM.scala ---
@@ -0,0 +1,65 @@
+/*
+ * Licensed to the Apache Software
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/3022#discussion_r22084037
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala
---
@@ -0,0 +1,244 @@
+/*
+ * Licensed to the Apache
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3022#issuecomment-67582826
Excellent. 100 features is probably a bit of a stretch for the
algorithm... the density at any point (especially with respect to the initial
random Gaussians) is going
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3022#issuecomment-67550199
Sorry, I forgot to comment on this issue. That would be fine with me. The
prediction methods were contributed by @FlytxtRnD , so perhaps we can solicit
their opinion
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/3022#discussion_r22066566
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala
---
@@ -0,0 +1,39 @@
+/*
+ * Licensed to the Apache
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/3022#discussion_r22059411
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala
---
@@ -0,0 +1,284 @@
+/*
+ * Licensed to the Apache
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3022#issuecomment-67486366
Ok, I have addressed (I think) all of those issues, with the exception of
modifying GaussianMixtureModel to carry instances of MultivariateGaussian. I
do like that
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3022#issuecomment-67445452
Working on these changes; still a few left.
Great feedback; really helping to improve my Scala!
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/3022#discussion_r22018162
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModel.scala
---
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/3022#discussion_r22017550
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala
---
@@ -0,0 +1,284 @@
+/*
+ * Licensed to the Apache
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3022#issuecomment-67387864
I have replaced the accumulators with RDD.aggregate functionality.
I added functionality allowing the user to provide their own initial GMM,
bypassing the random
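As a rough picture of the aggregate pattern mentioned here, the sketch below imitates it locally in plain Python without Spark. The `aggregate` helper, the list-of-lists "partitions", and the (count, sum) statistic are all illustrative assumptions, not code from the pull request.

```python
from functools import reduce


def aggregate(partitions, zero, seq_op, comb_op):
    """Local stand-in for the aggregate pattern: fold each partition, then merge the partials."""
    partials = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, partials, zero)


# Hypothetical data split into three partitions.
partitions = [[1.0, 2.0], [3.0], [4.0, 5.0]]

# The running statistic is an immutable (count, sum) pair.
zero = (0, 0.0)


def seq_op(acc, x):
    """Fold one data point into a partition-local partial result."""
    return (acc[0] + 1, acc[1] + x)


def comb_op(a, b):
    """Merge two partial results from different partitions."""
    return (a[0] + b[0], a[1] + b[1])


count, total = aggregate(partitions, zero, seq_op, comb_op)
```

Unlike accumulators, which are updated as a side effect, this pattern returns the combined statistics as the value of the computation.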
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3022#issuecomment-67158536
I've merged in the predict() method from @FlytxtRnD
I am working on the changeover from accumulators to RDD.aggregate; I should
have this up soon.
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3022#issuecomment-67097150
Ok, I will look into swapping the accumulators out for aggregate(). In the
meantime I have worked to correct some of the style issues.
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3022#issuecomment-67076315
No worries; it'll get there. I appreciate the comments and pointers.
> On Dec 15, 2014, at 4:52 PM, jkbradley wrote:
>
> @tgaloppo
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3022#issuecomment-66822813
@jkbradley I have pushed commits addressing [hopefully] all of the issues
you pointed out. Of particular concern to me is the movement of the utility
MultivariateGaussian
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3022#issuecomment-66636308
@jkbradley Thank you for your comments. I am working to resolve these
issues and will push these changes in a day or two.
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/3022#discussion_r21683119
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximization.scala
---
@@ -0,0 +1,283 @@
+/*
+ * Licensed to the
Github user tgaloppo commented on a diff in the pull request:
https://github.com/apache/spark/pull/3022#discussion_r21683030
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximization.scala
---
@@ -0,0 +1,283 @@
+/*
+ * Licensed to the
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3022#issuecomment-63405397
Merged with the latest master branch to hopefully fix any merge issues.
Updated Scala test suite to use new MLlibSparkTestContext
Improved cluster initialization
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3022#issuecomment-62929422
Thanks, @squito ... while I expect the array to only have a few elements, I
have made changes according to your advice.
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3022#issuecomment-62386233
Please advise how to resolve merge issues.
Github user tgaloppo commented on the pull request:
https://github.com/apache/spark/pull/3022#issuecomment-62256712
This test appeared to fail due to some form of timeout during the pull; is
there any action I need to take?
GitHub user tgaloppo opened a pull request:
https://github.com/apache/spark/pull/3022
SPARK-4156 [MLLIB] EM algorithm for GMMs
Implementation of Expectation-Maximization for Gaussian Mixture Models.
This is my maiden contribution to Apache Spark, so I apologize now if I