Github user etrain commented on a diff in the pull request:
https://github.com/apache/spark/pull/3637#discussion_r22693295
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
---
@@ -80,69 +50,157 @@ class LogisticRegression extends
Github user etrain commented on a diff in the pull request:
https://github.com/apache/spark/pull/3637#discussion_r22692953
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala ---
@@ -0,0 +1,198 @@
+/*
+ * Licensed to the Apache Software
Github user etrain commented on a diff in the pull request:
https://github.com/apache/spark/pull/3637#discussion_r22693428
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala
---
@@ -0,0 +1,143 @@
+/*
+ * Licensed to the Apache
Github user etrain commented on a diff in the pull request:
https://github.com/apache/spark/pull/3637#discussion_r22693403
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
---
@@ -80,69 +50,157 @@ class LogisticRegression extends
Github user etrain commented on a diff in the pull request:
https://github.com/apache/spark/pull/3637#discussion_r22693073
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala ---
@@ -0,0 +1,198 @@
+/*
+ * Licensed to the Apache Software
Github user etrain commented on a diff in the pull request:
https://github.com/apache/spark/pull/3637#discussion_r22693524
--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/sharedParams.scala
---
@@ -17,6 +17,10 @@
package org.apache.spark.ml.param
Github user etrain commented on a diff in the pull request:
https://github.com/apache/spark/pull/3637#discussion_r22693364
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
---
@@ -80,69 +50,157 @@ class LogisticRegression extends
Github user etrain commented on the pull request:
https://github.com/apache/spark/pull/3637#issuecomment-69275360
I've left a set of comments on the changes as they are. At a high level
these APIs feel heavy and there seem to be a lot of things you have to get just
right to add a new
Github user etrain commented on a diff in the pull request:
https://github.com/apache/spark/pull/3637#discussion_r22691031
--- Diff:
examples/src/main/scala/org/apache/spark/examples/ml/DeveloperApiExample.scala
---
@@ -0,0 +1,195 @@
+/*
+ * Licensed to the Apache
Github user etrain commented on a diff in the pull request:
https://github.com/apache/spark/pull/3637#discussion_r22693206
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
---
@@ -80,69 +50,157 @@ class LogisticRegression extends
Github user etrain commented on the pull request:
https://github.com/apache/spark/pull/3637#issuecomment-69275417
I'll plan to try using this new API for a couple of examples tomorrow and
see if anything stands out.
---
If your project is set up for it, you can reply to this email
Github user etrain commented on a diff in the pull request:
https://github.com/apache/spark/pull/3637#discussion_r22679713
--- Diff:
examples/src/main/scala/org/apache/spark/examples/ml/CrossValidatorExample.scala
---
@@ -101,10 +102,10 @@ object CrossValidatorExample
Github user etrain commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-62497848
Hi guys - I spent a good deal of time today writing a new pipeline against
this PR, and currently find it pretty difficult to use:
Some issues:
1. LOTS
Github user etrain commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61977531
I've got several other comments on this PR - mostly good, and will leave
some more detailed comments in a bit. TL;DR - What's wrong with Java
Serialization
Github user etrain commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61978539
Oh, and I'm not saying let's not support PMML - I'm saying let's have a
sensible way of handing off models to other JVM programs that doesn't involve
writing to XML
Github user etrain commented on the pull request:
https://github.com/apache/spark/pull/2919#issuecomment-60969356
I've looked at the code here, and it basically seems reasonable. One
high-level concern I have is around the programming pattern that this
encourages: complex nesting
Github user etrain commented on a diff in the pull request:
https://github.com/apache/spark/pull/886#discussion_r14836561
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -768,104 +973,157 @@ object DecisionTree extends Serializable
Github user etrain commented on the pull request:
https://github.com/apache/spark/pull/886#issuecomment-48767401
I've gone through this in some depth, and aside from a couple of minor
style nits - the logic looks good to me. Manish - have you compared output vs.
scikit-learn
Github user etrain commented on a diff in the pull request:
https://github.com/apache/spark/pull/886#discussion_r13982852
--- Diff:
examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala
---
@@ -49,6 +49,7 @@ object DecisionTreeRunner {
case class
Github user etrain commented on a diff in the pull request:
https://github.com/apache/spark/pull/886#discussion_r13926351
--- Diff:
examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala
---
@@ -49,6 +49,7 @@ object DecisionTreeRunner {
case class
Github user etrain commented on a diff in the pull request:
https://github.com/apache/spark/pull/886#discussion_r13926460
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -45,7 +46,7 @@ class DecisionTree (private val strategy: Strategy
Github user etrain commented on a diff in the pull request:
https://github.com/apache/spark/pull/886#discussion_r13926555
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -212,7 +211,9 @@ object DecisionTree extends Serializable with Logging
Github user etrain commented on a diff in the pull request:
https://github.com/apache/spark/pull/886#discussion_r13926606
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -233,13 +234,73 @@ object DecisionTree extends Serializable with Logging
Github user etrain commented on the pull request:
https://github.com/apache/spark/pull/886#issuecomment-46465872
I've taken a first pass at this and at a high level it looks good.
The main two things I'd say are
1) I think an implicit that converts LabeledPoint
Github user etrain commented on the pull request:
https://github.com/apache/spark/pull/886#issuecomment-44360446
I am worried that exponential growth in the number of split possibilities
kills us when we check for all splits when we get to even 20-30
categorical values. That's
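The concern above can be made concrete: a binary tree split on a categorical feature with K unordered values partitions those values into two non-empty groups, giving 2^(K-1) - 1 candidate splits (a standard counting argument for CART-style trees; this is an illustrative sketch, not code from the PR):

```python
def num_binary_splits(k: int) -> int:
    """Number of distinct binary partitions of k unordered categorical
    values into two non-empty subsets, each of which is one candidate
    split: (2**k - 2) / 2 = 2**(k - 1) - 1."""
    return 2 ** (k - 1) - 1

# The count roughly doubles with every added category value.
for k in (5, 10, 20, 30):
    print(k, num_binary_splits(k))
```

At 20 categories there are already over half a million candidate splits, and at 30 over half a billion, which is the exponential blow-up the comment warns about.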
Github user etrain commented on a diff in the pull request:
https://github.com/apache/spark/pull/476#discussion_r12125916
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
---
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation
Github user etrain commented on a diff in the pull request:
https://github.com/apache/spark/pull/476#discussion_r12126002
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
---
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation
Github user etrain commented on a diff in the pull request:
https://github.com/apache/spark/pull/476#discussion_r12125991
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
---
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation
Github user etrain commented on a diff in the pull request:
https://github.com/apache/spark/pull/476#discussion_r12126007
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
---
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation
Github user etrain commented on a diff in the pull request:
https://github.com/apache/spark/pull/476#discussion_r12126093
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
---
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation
Github user etrain commented on a diff in the pull request:
https://github.com/apache/spark/pull/476#discussion_r12126088
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
---
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation
Github user etrain commented on a diff in the pull request:
https://github.com/apache/spark/pull/476#discussion_r12126094
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
---
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation
Github user etrain commented on a diff in the pull request:
https://github.com/apache/spark/pull/476#discussion_r12126106
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
---
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation
Github user etrain commented on a diff in the pull request:
https://github.com/apache/spark/pull/476#discussion_r12126176
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/expectation/GibbsSampling.scala ---
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software
Github user etrain commented on a diff in the pull request:
https://github.com/apache/spark/pull/476#discussion_r12126191
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/expectation/GibbsSampling.scala ---
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software
Github user etrain commented on a diff in the pull request:
https://github.com/apache/spark/pull/476#discussion_r12126247
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/expectation/GibbsSampling.scala ---
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software
Github user etrain commented on a diff in the pull request:
https://github.com/apache/spark/pull/476#discussion_r12126271
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala ---
@@ -20,15 +20,17 @@ package org.apache.spark.mllib.util
import
Github user etrain commented on the pull request:
https://github.com/apache/spark/pull/476#issuecomment-41753574
Before I get too deep into this review - I want to step back and think
about whether we expect the model in this case to be on the order of the size
of the data - I think
Github user etrain commented on the pull request:
https://github.com/apache/spark/pull/476#issuecomment-41753636
Also, speaking of @jegonzal maybe this is a natural first point of
integration between MLlib and GraphX - I know GraphX has an implementation of
LDA built in, and maybe
Github user etrain commented on the pull request:
https://github.com/apache/spark/pull/475#issuecomment-41200739
Can one of the admins take a look at this? The Travis CI error seems to be
in StreamingContext tests, which have nothing to do with this change.
Github user etrain commented on the pull request:
https://github.com/apache/spark/pull/458#issuecomment-41079198
Right - the pattern is virtually identical except for an update function
call. Can we abstract this away so that we can deliver the first 3 algorithms
of ADMM with a few
Github user etrain commented on the pull request:
https://github.com/apache/spark/pull/458#issuecomment-40994263
Hey, this looks awesome! One high-level issue I see is that the ADMM
optimizer has embedded in it knowledge of the loss function it's trying to
minimize. ADMM is much more
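The decoupling suggested in the two comments above can be sketched as a generic ADMM loop in which the loss and regularizer enter only through their proximal operators, so the optimizer itself carries no knowledge of the function being minimized (a hypothetical scalar illustration, not the PR's actual API):

```python
def admm(prox_f, prox_g, rho=1.0, iters=100):
    """Generic ADMM for min f(x) + g(z) subject to x = z.
    The optimizer knows nothing about f and g beyond their proximal
    operators, which is the abstraction suggested above."""
    x = z = u = 0.0
    for _ in range(iters):
        x = prox_f(z - u, rho)   # x-update: prox of the loss term
        z = prox_g(x + u, rho)   # z-update: prox of the regularizer
        u += x - z               # dual (scaled multiplier) update
    return z

# Example problem: min 0.5*(x - 3)**2 + |x| (a scalar lasso).
# prox of f(x) = 0.5*(x - 3)**2 is (3 + rho*v) / (1 + rho);
# prox of g(z) = lam*|z| is soft-thresholding at lam/rho.
lam = 1.0
prox_f = lambda v, rho: (3.0 + rho * v) / (1.0 + rho)
prox_g = lambda v, rho: max(abs(v) - lam / rho, 0.0) * (1.0 if v >= 0 else -1.0)
print(admm(prox_f, prox_g))  # analytic solution is soft-threshold(3, 1) = 2
```

Swapping in a different loss or penalty then means supplying different prox functions, with no change to the ADMM loop itself.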
Github user etrain commented on the pull request:
https://github.com/apache/spark/pull/79#issuecomment-39392123
Hi Hirakendu - thanks for all the detailed suggestions and information. I
will reply to that separately.
One question - you say there are 500,000 examples
Github user etrain commented on a diff in the pull request:
https://github.com/apache/spark/pull/79#discussion_r10934747
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software