[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-08 Thread etrain
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/3637#discussion_r22693295 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -80,69 +50,157 @@ class LogisticRegression extends

[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-08 Thread etrain
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/3637#discussion_r22692953 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala --- @@ -0,0 +1,198 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-08 Thread etrain
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/3637#discussion_r22693428 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala --- @@ -0,0 +1,143 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-08 Thread etrain
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/3637#discussion_r22693403 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -80,69 +50,157 @@ class LogisticRegression extends

[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-08 Thread etrain
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/3637#discussion_r22693073 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala --- @@ -0,0 +1,198 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-08 Thread etrain
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/3637#discussion_r22693524 --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/sharedParams.scala --- @@ -17,6 +17,10 @@ package org.apache.spark.ml.param

[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-08 Thread etrain
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/3637#discussion_r22693364 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -80,69 +50,157 @@ class LogisticRegression extends

[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-08 Thread etrain
Github user etrain commented on the pull request: https://github.com/apache/spark/pull/3637#issuecomment-69275360 I've left a set of comments on the changes as they are. At a high level these APIs feel heavy and there seem to be a lot of things you have to get just right to add a new

[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-08 Thread etrain
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/3637#discussion_r22691031 --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/DeveloperApiExample.scala --- @@ -0,0 +1,195 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-08 Thread etrain
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/3637#discussion_r22693206 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -80,69 +50,157 @@ class LogisticRegression extends

[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-08 Thread etrain
Github user etrain commented on the pull request: https://github.com/apache/spark/pull/3637#issuecomment-69275417 I'll plan to try using this new API for a couple of examples tomorrow and see if anything stands out. --- If your project is set up for it, you can reply to this email

[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-08 Thread etrain
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/3637#discussion_r22679713 --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/CrossValidatorExample.scala --- @@ -101,10 +102,10 @@ object CrossValidatorExample

[GitHub] spark pull request: [SPARK-3530][MLLIB] pipeline and parameters wi...

2014-11-10 Thread etrain
Github user etrain commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62497848 Hi guys - I spent a good deal of time today writing a new pipeline against this PR, and currently find it pretty difficult to use: Some issues: 1. LOTS

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-06 Thread etrain
Github user etrain commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61977531 I've got several other comments on this PR - mostly good, and will leave some more detailed comments in a bit. TL;DR - What's wrong with Java Serialization

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-06 Thread etrain
Github user etrain commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61978539 Oh, and I'm not saying let's not support PMML - I'm saying let's have a sensible way of handing off models to other JVM programs that doesn't involve writing to XML

[GitHub] spark pull request: [SPARK-3572] [sql] [mllib] User-Defined Types ...

2014-10-29 Thread etrain
Github user etrain commented on the pull request: https://github.com/apache/spark/pull/2919#issuecomment-60969356 I've looked at the code here, and basically seems reasonable. One high-level concern I have is around the programming pattern that this encourages: complex nesting

[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-11 Thread etrain
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/886#discussion_r14836561 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -768,104 +973,157 @@ object DecisionTree extends Serializable

[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-11 Thread etrain
Github user etrain commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48767401 I've gone through this in some depth, and aside from a couple of minor style nits - the logic looks good to me. Manish - have you compared output vs. scikit-learn

[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-06-19 Thread etrain
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/886#discussion_r13982852 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala --- @@ -49,6 +49,7 @@ object DecisionTreeRunner { case class

[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-06-18 Thread etrain
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/886#discussion_r13926351 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala --- @@ -49,6 +49,7 @@ object DecisionTreeRunner { case class

[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-06-18 Thread etrain
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/886#discussion_r13926460 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -45,7 +46,7 @@ class DecisionTree (private val strategy: Strategy

[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-06-18 Thread etrain
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/886#discussion_r13926555 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -212,7 +211,9 @@ object DecisionTree extends Serializable with Logging

[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-06-18 Thread etrain
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/886#discussion_r13926606 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -233,13 +234,73 @@ object DecisionTree extends Serializable with Logging

[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-06-18 Thread etrain
Github user etrain commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-46465872 I've taken a first pass at this and at a high level it looks good. The main two things I'd say are 1) I think an implicit that converts LabeledPoint

[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-05-27 Thread etrain
Github user etrain commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-44360446 I am worried that exponential growth in the number of split possibilities kills us when we check for all splits when we get to even 20-30 categorical values. That's

[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread etrain
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/476#discussion_r12125916 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,169 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread etrain
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/476#discussion_r12126002 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,169 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread etrain
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/476#discussion_r12125991 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,169 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread etrain
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/476#discussion_r12126007 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,169 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread etrain
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/476#discussion_r12126093 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,169 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread etrain
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/476#discussion_r12126088 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,169 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread etrain
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/476#discussion_r12126094 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,169 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread etrain
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/476#discussion_r12126106 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,169 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread etrain
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/476#discussion_r12126176 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/expectation/GibbsSampling.scala --- @@ -0,0 +1,219 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread etrain
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/476#discussion_r12126191 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/expectation/GibbsSampling.scala --- @@ -0,0 +1,219 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread etrain
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/476#discussion_r12126247 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/expectation/GibbsSampling.scala --- @@ -0,0 +1,219 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread etrain
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/476#discussion_r12126271 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala --- @@ -20,15 +20,17 @@ package org.apache.spark.mllib.util import

[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread etrain
Github user etrain commented on the pull request: https://github.com/apache/spark/pull/476#issuecomment-41753574 Before I get too deep into this review - I want to step back and think about whether we expect the model in this case to be on the order of the size of the data - I think

[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread etrain
Github user etrain commented on the pull request: https://github.com/apache/spark/pull/476#issuecomment-41753636 Also, speaking of @jegonzal maybe this is a natural first point of integration between MLlib and GraphX - I know GraphX has an implementation of LDA built in, and maybe

[GitHub] spark pull request: SPARK-1544 Add support for deep decision trees...

2014-04-23 Thread etrain
Github user etrain commented on the pull request: https://github.com/apache/spark/pull/475#issuecomment-41200739 Can one of the admins take a look at this? The Travis CI error seems to be in StreamingContext tests, which have nothing to do with this change. --- If your project

[GitHub] spark pull request: [SPARK-1543][MLlib] Add ADMM for solving Lasso...

2014-04-22 Thread etrain
Github user etrain commented on the pull request: https://github.com/apache/spark/pull/458#issuecomment-41079198 Right - the pattern is virtually identical except for an update function call. Can we abstract this away so that we can deliver the first 3 algorithms of ADMM with a few

[GitHub] spark pull request: [SPARK-1543][MLlib] Add ADMM for solving Lasso...

2014-04-21 Thread etrain
Github user etrain commented on the pull request: https://github.com/apache/spark/pull/458#issuecomment-40994263 Hey, this looks awesome! One high-level issue I see is that the ADMM optimizer has embedded in it knowledge of the loss function it's trying to minimize. ADMM is much more

[GitHub] spark pull request: MLI-1 Decision Trees

2014-04-02 Thread etrain
Github user etrain commented on the pull request: https://github.com/apache/spark/pull/79#issuecomment-39392123 Hi Hirakendu - thanks for all the detailed suggestions and information. I will reply to that separately. One question - you say there are 500,000 examples

[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-25 Thread etrain
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/79#discussion_r10934747 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala --- @@ -0,0 +1,42 @@ +/* + * Licensed to the Apache Software