[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/2868#issuecomment-59871504
  
Jenkins, please start the test!





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-10-28 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-60813678
  
@BigCrunsh I'm working on this. Let's see if we can merge it in Spark 1.2.





[GitHub] spark pull request: [SPARK-4129][MLlib] Performance tuning in Mult...

2014-10-28 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/2992

[SPARK-4129][MLlib] Performance tuning in MultivariateOnlineSummarizer

In MultivariateOnlineSummarizer, Breeze's activeIterator is used to loop
through the nonzero elements in the vector. However, activeIterator doesn't
perform well because of its per-element overhead. In this PR, a native while
loop is used for both DenseVector and SparseVector.

The benchmark result with 20 executors using the mnist8m dataset:

Before:
  DenseVector: 48.2 seconds
  SparseVector: 16.3 seconds

After:
  DenseVector: 17.8 seconds
  SparseVector: 11.2 seconds

Since MultivariateOnlineSummarizer is used in several places, the overall
performance gain across the MLlib library will be significant with this PR.
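
For illustration, here is a minimal sketch of the pattern described above: a
plain while loop over a vector's stored entries instead of Breeze's
activeIterator. The foreachActive helper is hypothetical and not necessarily
the code in this PR.

    import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}

    // Hypothetical helper: apply f(index, value) to every stored entry,
    // using a while loop instead of Breeze's activeIterator.
    def foreachActive(v: Vector)(f: (Int, Double) => Unit): Unit = v match {
      case dv: DenseVector =>
        val values = dv.values
        var i = 0
        while (i < values.length) {
          f(i, values(i))
          i += 1
        }
      case sv: SparseVector =>
        val indices = sv.indices
        val values = sv.values
        var k = 0
        while (k < values.length) {
          f(indices(k), values(k))
          k += 1
        }
    }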


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/AlpineNow/spark SPARK-4129

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2992.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2992


commit ebe3e74df70eb424aecc3170fc55008cfb6a76ec
Author: DB Tsai 
Date:   2014-10-29T05:42:50Z

First commit







[GitHub] spark pull request: [SPARK-3317][MLlib] The loss of regularization...

2014-08-29 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/2207

[SPARK-3317][MLlib] The loss of regularization in Updater should use the 
oldWeights

The regularization loss is currently computed from the newWeights, which is
not correct. The loss, R(w) = 1/2 ||w||^2, should be computed with the
oldWeights.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/AlpineNow/spark dbtsai-updater

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2207.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2207


commit 1447c234092339f67d1887bfc75731665264b770
Author: DB Tsai 
Date:   2014-08-29T21:13:11Z

Fixed updater bug







[GitHub] spark pull request: [SPARK-3317][MLlib] The loss of regularization...

2014-08-29 Thread dbtsai
GitHub user dbtsai reopened a pull request:

https://github.com/apache/spark/pull/2207

[SPARK-3317][MLlib] The loss of regularization in Updater should use the 
oldWeights

The regularization loss is currently computed from the newWeights, which is
not correct. The loss, R(w) = 1/2 ||w||^2, should be computed with the
oldWeights.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/AlpineNow/spark dbtsai-updater

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2207.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2207


commit 1447c234092339f67d1887bfc75731665264b770
Author: DB Tsai 
Date:   2014-08-29T21:13:11Z

Fixed updater bug







[GitHub] spark pull request: [SPARK-3317][MLlib] The loss of regularization...

2014-08-29 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/2207#issuecomment-53933078
  
L-BFGS needs the correct loss to find the next weights (its line search uses the objective value), while SGD doesn't since it only uses the gradient.






[GitHub] spark pull request: [SPARK-3317][MLlib] The loss of regularization...

2014-08-29 Thread dbtsai
Github user dbtsai closed the pull request at:

https://github.com/apache/spark/pull/2207





[GitHub] spark pull request: [SPARK-3317][MLlib] The loss of regularization...

2014-08-31 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/2207#issuecomment-54002680
  
@srowen @mengxr 

I was working on OWLQN for L1 in my company, and I hadn't followed the L-BFGS
code, so I was confused. The current code in MLlib actually gives the correct
result.

The Updater API is a little confusing. After rereading the notes I took when I
implemented L-BFGS, I see that I actually use the existing Updater API to get
the current regularization loss correctly with a trick: set the gradient to
the zero vector, the stepSize to zero, and the iteration to one.

For SGD, we compute the regularization loss after the weights are updated,
keep that value, and add it to the total loss in the next iteration. I now
remember that I fixed a bug caused by this updater design a couple of months
ago - the regularization loss in the first iteration was not properly computed.

Hopefully the whole design issue will be addressed by #1518 [SPARK-2505][MLlib]
Weighted Regularizer for Generalized Linear Model once it is finished.
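
As a concrete illustration, here is a minimal sketch of that trick, assuming
the Updater.compute signature (weightsOld, gradient, stepSize, iter, regParam)
=> (newWeights, regVal) used in MLlib at the time; the helper name is
hypothetical.

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.mllib.optimization.Updater

    // Hypothetical helper: with a zero gradient and zero step size, the
    // updater leaves the weights unchanged and returns the regularization
    // loss evaluated at the current weights as the second tuple element.
    def currentRegVal(updater: Updater, weights: Vector, regParam: Double): Double = {
      val zeroGradient = Vectors.zeros(weights.size)
      updater.compute(weights, zeroGradient, 0.0, 1, regParam)._2
    }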





[GitHub] spark pull request: [SPARK-3317][MLlib] The loss of regularization...

2014-08-31 Thread dbtsai
Github user dbtsai closed the pull request at:

https://github.com/apache/spark/pull/2207





[GitHub] spark pull request: [SPARK-3317][MLlib] The loss of regularization...

2014-08-31 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/2207#issuecomment-54002773
  
PS, it seems that I cannot close
https://issues.apache.org/jira/browse/SPARK-3317 myself. Can any of you close
it for me? Thanks.





[GitHub] spark pull request: [SPARK-3317][MLlib] The loss of regularization...

2014-08-31 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/2207#issuecomment-54002970
  
You are right. I was using my desktop without a login session. Thanks.





[GitHub] spark pull request: SPARK-1157 L-BFGS Optimizer based on Breeze L-...

2014-04-07 Thread dbtsai
Github user dbtsai closed the pull request at:

https://github.com/apache/spark/pull/53




[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

2014-04-07 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/353

SPARK-1157: L-BFGS Optimizer based on Breeze's implementation.

This PR uses Breeze's L-BFGS implementation, and the Breeze dependency has
already been introduced by Xiangrui's sparse input format work in SPARK-1212.
Nice work, @mengxr!

When used with a regularized updater, we need to compute the regVal and
regGradient (the gradient of the regularization part of the cost function),
and with the current updater design, we can compute those two values in the
following way.

Let's review how the updater returns newWeights given the input parameters.

    w' = w - thisIterStepSize * (gradient + regGradient(w))

Note that regGradient is a function of w! If we set gradient = 0 and
thisIterStepSize = 1, then

    regGradient(w) = w - w'

As a result, regVal can be computed by

    val regVal = updater.compute(
      weights,
      new DoubleMatrix(initialWeights.length, 1), 0, 1, regParam)._2

and regGradient can be obtained by

    val regGradient = weights.sub(
      updater.compute(weights,
        new DoubleMatrix(initialWeights.length, 1), 1, 1, regParam)._1)

The PR includes tests that compare the result with SGD, with and without
regularization.

We did a comparison between L-BFGS and SGD, and we often saw 10x fewer
steps with L-BFGS while the cost per step is the same (just computing
the gradient).

The following paper from Prof. Ng's group at Stanford compares different
optimizers, including L-BFGS and SGD. They use them in the context of
deep learning, but it's worth reading as a reference.
http://cs.stanford.edu/~jngiam/papers/LeNgiamCoatesLahiriProchnowNg2011.pdf

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark dbtsai-LBFGS

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/353.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #353


commit 60c83350bb77aa640edd290a26e2a20281b7a3a8
Author: DB Tsai 
Date:   2014-04-05T00:06:50Z

L-BFGS Optimizer based on Breeze's implementation.






[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

2014-04-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11404094
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -0,0 +1,251 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import scala.Array
+import scala.collection.mutable.ArrayBuffer
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, DiffFunction}
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+/**
+ * Class used to solve an optimization problem using Limited-memory BFGS.
+ * @param gradient Gradient function to be used.
+ * @param updater Updater to be used to update weights after every 
iteration.
+ */
+class LBFGS(var gradient: Gradient, var updater: Updater)
+  extends Optimizer with Logging
+{
+  private var numCorrections: Int = 10
+  private var lineSearchTolerance: Double = 0.9
+  private var convTolerance: Double = 1E-4
+  private var maxNumIterations: Int = 100
+  private var regParam: Double = 0.0
+  private var miniBatchFraction: Double = 1.0
+
+  /**
+   * Set the number of corrections used in the LBFGS update. Default 10.
+   * Values of m less than 3 are not recommended; large values of m
+   * will result in excessive computing time. 3 < m < 10 is recommended.
+   * Restriction: m > 0
+   */
+  def setNumCorrections(corrections: Int): this.type = {
+assert(corrections > 0)
+this.numCorrections = corrections
+this
+  }
+
+  /**
+   * Set the tolerance to control the accuracy of the line search in 
mcsrch step. Default 0.9.
+   * If the function and gradient evaluations are inexpensive with respect 
to the cost of
+   * the iteration (which is sometimes the case when solving very large 
problems) it may
+   * be advantageous to set to a small value. A typical small value is 0.1.
+   * Restriction: should be greater than 1e-4.
+   */
+  def setLineSearchTolerance(tolerance: Double): this.type = {
+this.lineSearchTolerance = tolerance
+this
+  }
+
+  /**
+   * Set fraction of data to be used for each L-BFGS iteration. Default 
1.0.
+   */
+  def setMiniBatchFraction(fraction: Double): this.type = {
+this.miniBatchFraction = fraction
+this
+  }
+
+  /**
+   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
+   * Smaller value will lead to higher accuracy with the cost of more 
iterations.
+   */
+  def setConvTolerance(tolerance: Int): this.type = {
+this.convTolerance = tolerance
+this
+  }
+
+  /**
+   * Set the maximal number of iterations for L-BFGS. Default 100.
+   */
+  def setMaxNumIterations(iters: Int): this.type = {
+this.maxNumIterations = iters
+this
+  }
+
+  /**
+   * Set the regularization parameter. Default 0.0.
+   */
+  def setRegParam(regParam: Double): this.type = {
+this.regParam = regParam
+this
+  }
+
+  /**
+   * Set the gradient function (of the loss function of one single data 
example)
+   * to be used for L-BFGS.
+   */
+  def setGradient(gradient: Gradient): this.type = {
+this.gradient = gradient
+this
+  }
+
+  /**
+   * Set the updater function to actually perform a gradient step in a 
given direction.
+   * The updater is responsible to perform the update from the 
regularization term as well,
+   * and therefore determines what kind or regularization is used, if any.
+   */
+  def setUpdater(updater: Updater): this.type = {
+this.updater = updater
+this
+  }
+
+  def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): 
Vector = {
+val (weights, _) = LBFGS

[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

2014-04-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11404515
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -0,0 +1,251 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import scala.Array
+import scala.collection.mutable.ArrayBuffer
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, DiffFunction}
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+/**
+ * Class used to solve an optimization problem using Limited-memory BFGS.
+ * @param gradient Gradient function to be used.
+ * @param updater Updater to be used to update weights after every 
iteration.
+ */
+class LBFGS(var gradient: Gradient, var updater: Updater)
+  extends Optimizer with Logging
+{
+  private var numCorrections: Int = 10
+  private var lineSearchTolerance: Double = 0.9
+  private var convTolerance: Double = 1E-4
+  private var maxNumIterations: Int = 100
+  private var regParam: Double = 0.0
+  private var miniBatchFraction: Double = 1.0
+
+  /**
+   * Set the number of corrections used in the LBFGS update. Default 10.
+   * Values of m less than 3 are not recommended; large values of m
+   * will result in excessive computing time. 3 < m < 10 is recommended.
+   * Restriction: m > 0
+   */
+  def setNumCorrections(corrections: Int): this.type = {
+assert(corrections > 0)
+this.numCorrections = corrections
+this
+  }
+
+  /**
+   * Set the tolerance to control the accuracy of the line search in 
mcsrch step. Default 0.9.
+   * If the function and gradient evaluations are inexpensive with respect 
to the cost of
+   * the iteration (which is sometimes the case when solving very large 
problems) it may
+   * be advantageous to set to a small value. A typical small value is 0.1.
+   * Restriction: should be greater than 1e-4.
+   */
+  def setLineSearchTolerance(tolerance: Double): this.type = {
+this.lineSearchTolerance = tolerance
+this
+  }
+
+  /**
+   * Set fraction of data to be used for each L-BFGS iteration. Default 
1.0.
+   */
+  def setMiniBatchFraction(fraction: Double): this.type = {
+this.miniBatchFraction = fraction
+this
+  }
+
+  /**
+   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
+   * Smaller value will lead to higher accuracy with the cost of more 
iterations.
+   */
+  def setConvTolerance(tolerance: Int): this.type = {
+this.convTolerance = tolerance
+this
+  }
+
+  /**
+   * Set the maximal number of iterations for L-BFGS. Default 100.
+   */
+  def setMaxNumIterations(iters: Int): this.type = {
+this.maxNumIterations = iters
+this
+  }
+
+  /**
+   * Set the regularization parameter. Default 0.0.
+   */
+  def setRegParam(regParam: Double): this.type = {
+this.regParam = regParam
+this
+  }
+
+  /**
+   * Set the gradient function (of the loss function of one single data 
example)
+   * to be used for L-BFGS.
+   */
+  def setGradient(gradient: Gradient): this.type = {
+this.gradient = gradient
+this
+  }
+
+  /**
+   * Set the updater function to actually perform a gradient step in a 
given direction.
+   * The updater is responsible to perform the update from the 
regularization term as well,
+   * and therefore determines what kind or regularization is used, if any.
+   */
+  def setUpdater(updater: Updater): this.type = {
+this.updater = updater
+this
+  }
+
+  def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): 
Vector = {
+val (weights, _) = LBFGS

[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

2014-04-08 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/353#issuecomment-39895140
  
@mengxr As you suggested, I moved the costFun to a private CostFun class.
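
For context, here is a rough sketch of what a private cost function against
Breeze's DiffFunction interface can look like; the dataset handling and loss
below are illustrative (a simple in-memory squared loss), not the PR's actual
CostFun, which aggregates gradients over an RDD.

    import breeze.linalg.{norm, DenseVector => BDV}
    import breeze.optimize.DiffFunction

    // Illustrative cost function: squared loss over an in-memory dataset
    // plus an L2 penalty. Breeze's LBFGS only needs calculate(weights),
    // which returns the loss and the gradient at the given weights.
    private class CostFun(
        data: Array[(Double, BDV[Double])],
        regParam: Double) extends DiffFunction[BDV[Double]] {

      override def calculate(weights: BDV[Double]): (Double, BDV[Double]) = {
        var loss = 0.0
        val gradient = BDV.zeros[Double](weights.length)
        data.foreach { case (label, features) =>
          val diff = (weights dot features) - label
          loss += 0.5 * diff * diff
          gradient += features * diff
        }
        val n = data.length.toDouble
        val w2 = norm(weights)
        (loss / n + 0.5 * regParam * w2 * w2,
          gradient / n + weights * regParam)
      }
    }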




[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11460767
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -0,0 +1,263 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import scala.Array
+import scala.collection.mutable.ArrayBuffer
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, DiffFunction}
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+/**
+ * Class used to solve an optimization problem using Limited-memory BFGS.
+ * @param gradient Gradient function to be used.
+ * @param updater Updater to be used to update weights after every 
iteration.
+ */
+class LBFGS(var gradient: Gradient, var updater: Updater)
+  extends Optimizer with Logging
+{
+  private var numCorrections: Int = 10
--- End diff --

@mengxr
I know. I pretty much followed the existing coding style in
GradientDescent.scala. Should I also change the ones in the other places?




[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11461398
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -0,0 +1,263 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import scala.Array
+import scala.collection.mutable.ArrayBuffer
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, DiffFunction}
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+/**
+ * Class used to solve an optimization problem using Limited-memory BFGS.
+ * @param gradient Gradient function to be used.
+ * @param updater Updater to be used to update weights after every 
iteration.
+ */
+class LBFGS(var gradient: Gradient, var updater: Updater)
+  extends Optimizer with Logging
+{
+  private var numCorrections: Int = 10
+  private var lineSearchTolerance: Double = 0.9
+  private var convTolerance: Double = 1E-4
+  private var maxNumIterations: Int = 100
+  private var regParam: Double = 0.0
+  private var miniBatchFraction: Double = 1.0
+
+  /**
+   * Set the number of corrections used in the LBFGS update. Default 10.
+   * Values of m less than 3 are not recommended; large values of m
+   * will result in excessive computing time. 3 < m < 10 is recommended.
+   * Restriction: m > 0
+   */
+  def setNumCorrections(corrections: Int): this.type = {
+assert(corrections > 0)
+this.numCorrections = corrections
+this
+  }
+
+  /**
+   * Set the tolerance to control the accuracy of the line search in 
mcsrch step. Default 0.9.
+   * If the function and gradient evaluations are inexpensive with respect 
to the cost of
+   * the iteration (which is sometimes the case when solving very large 
problems) it may
+   * be advantageous to set to a small value. A typical small value is 0.1.
+   * Restriction: should be greater than 1e-4.
+   */
+  def setLineSearchTolerance(tolerance: Double): this.type = {
--- End diff --

Good catch! It was used in the RISO implementation. I'll just remove them. Thanks.




[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11463764
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
@@ -0,0 +1,217 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.FunSuite
+import org.scalatest.matchers.ShouldMatchers
+
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+class LBFGSSuite extends FunSuite with BeforeAndAfterAll with 
ShouldMatchers {
+  @transient private var sc: SparkContext = _
+  var dataRDD:RDD[(Double, Vector)] = _
+
+  val nPoints = 1
+  val A = 2.0
+  val B = -1.5
+
+  val initialB = -1.0
+  val initialWeights = Array(initialB)
+
+  val gradient = new LogisticGradient()
+  val numCorrections = 10
+  val lineSearchTolerance = 0.9
+  var convTolerance = 1e-12
+  var maxNumIterations = 10
+  val miniBatchFrac = 1.0
+
+  val simpleUpdater = new SimpleUpdater()
+  val squaredL2Updater = new SquaredL2Updater()
+
+  // Add a extra variable consisting of all 1.0's for the intercept.
+  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
+  val data = testData.map { case LabeledPoint(label, features) =>
+label -> Vectors.dense(1.0, features.toArray: _*)
+  }
+
+  override def beforeAll() {
+sc = new SparkContext("local", "test")
+dataRDD = sc.parallelize(data, 2).cache()
+  }
+
+  override def afterAll() {
+sc.stop()
+System.clearProperty("spark.driver.port")
+  }
+
+  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
+math.abs(x - y) / (math.abs(y) + 1e-15) < tol
+  }
+
+  test("Assert LBFGS loss is decreasing and matches the result of Gradient 
Descent.") {
+val updater = new SimpleUpdater()
+val regParam = 0
+
+val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: 
_*)
+
+val (_, loss) = LBFGS.runMiniBatchLBFGS(
+  dataRDD,
+  gradient,
+  updater,
+  numCorrections,
+  lineSearchTolerance,
+  convTolerance,
+  maxNumIterations,
+  regParam,
+  miniBatchFrac,
+  initialWeightsWithIntercept)
+
+assert(loss.last - loss.head < 0, "loss isn't decreasing.")
+
+val lossDiff = loss.init.zip(loss.tail).map {
+  case (lhs, rhs) => lhs - rhs
+}
+assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
--- End diff --

This 0.8 bound is copied from GradientDescentSuite, and L-BFGS should at
least have the same performance.





[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11464013
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
@@ -0,0 +1,217 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.FunSuite
+import org.scalatest.matchers.ShouldMatchers
+
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+class LBFGSSuite extends FunSuite with BeforeAndAfterAll with 
ShouldMatchers {
+  @transient private var sc: SparkContext = _
+  var dataRDD:RDD[(Double, Vector)] = _
+
+  val nPoints = 1
+  val A = 2.0
+  val B = -1.5
+
+  val initialB = -1.0
+  val initialWeights = Array(initialB)
+
+  val gradient = new LogisticGradient()
+  val numCorrections = 10
+  val lineSearchTolerance = 0.9
+  var convTolerance = 1e-12
+  var maxNumIterations = 10
+  val miniBatchFrac = 1.0
+
+  val simpleUpdater = new SimpleUpdater()
+  val squaredL2Updater = new SquaredL2Updater()
+
+  // Add a extra variable consisting of all 1.0's for the intercept.
+  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
+  val data = testData.map { case LabeledPoint(label, features) =>
+label -> Vectors.dense(1.0, features.toArray: _*)
+  }
+
+  override def beforeAll() {
+sc = new SparkContext("local", "test")
+dataRDD = sc.parallelize(data, 2).cache()
+  }
+
+  override def afterAll() {
+sc.stop()
+System.clearProperty("spark.driver.port")
+  }
+
+  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
+math.abs(x - y) / (math.abs(y) + 1e-15) < tol
+  }
+
+  test("Assert LBFGS loss is decreasing and matches the result of Gradient 
Descent.") {
+val updater = new SimpleUpdater()
+val regParam = 0
+
+val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: 
_*)
+
+val (_, loss) = LBFGS.runMiniBatchLBFGS(
+  dataRDD,
+  gradient,
+  updater,
+  numCorrections,
+  lineSearchTolerance,
+  convTolerance,
+  maxNumIterations,
+  regParam,
+  miniBatchFrac,
+  initialWeightsWithIntercept)
+
+assert(loss.last - loss.head < 0, "loss isn't decreasing.")
+
+val lossDiff = loss.init.zip(loss.tail).map {
+  case (lhs, rhs) => lhs - rhs
+}
+assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
+
+val stepSize = 1.0
+// Well, GD converges slower, so it requires more iterations!
+val numGDIterations = 50
+val (_, lossGD) = GradientDescent.runMiniBatchSGD(
+  dataRDD,
+  gradient,
+  updater,
+  stepSize,
+  numGDIterations,
+  regParam,
+  miniBatchFrac,
+  initialWeightsWithIntercept)
+
+assert(Math.abs((lossGD.last - loss.last) / loss.last) < 0.05,
+  "LBFGS should match GD result within 5% error.")
--- End diff --

It's a number off the top of my head. I just did a quick comparison: with
10 iterations of L-BFGS, SGD needs 40 iterations to get within a 2%
difference, and SGD needs 90 iterations to get within a 1% difference.
In all of the tests, L-BFGS gives a smaller loss.
As a result, you can see how slowly SGD converges when the number of
iterations is high.
Here, I'll use 2% to make the test run faster.



[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11464121
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
@@ -0,0 +1,217 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.FunSuite
+import org.scalatest.matchers.ShouldMatchers
+
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+class LBFGSSuite extends FunSuite with BeforeAndAfterAll with 
ShouldMatchers {
+  @transient private var sc: SparkContext = _
+  var dataRDD:RDD[(Double, Vector)] = _
+
+  val nPoints = 1
+  val A = 2.0
+  val B = -1.5
+
+  val initialB = -1.0
+  val initialWeights = Array(initialB)
+
+  val gradient = new LogisticGradient()
+  val numCorrections = 10
+  val lineSearchTolerance = 0.9
+  var convTolerance = 1e-12
+  var maxNumIterations = 10
+  val miniBatchFrac = 1.0
+
+  val simpleUpdater = new SimpleUpdater()
+  val squaredL2Updater = new SquaredL2Updater()
+
+  // Add a extra variable consisting of all 1.0's for the intercept.
+  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
+  val data = testData.map { case LabeledPoint(label, features) =>
+label -> Vectors.dense(1.0, features.toArray: _*)
+  }
+
+  override def beforeAll() {
+sc = new SparkContext("local", "test")
+dataRDD = sc.parallelize(data, 2).cache()
+  }
+
+  override def afterAll() {
+sc.stop()
+System.clearProperty("spark.driver.port")
+  }
+
+  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
+math.abs(x - y) / (math.abs(y) + 1e-15) < tol
+  }
+
+  test("Assert LBFGS loss is decreasing and matches the result of Gradient 
Descent.") {
+val updater = new SimpleUpdater()
+val regParam = 0
+
+val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: 
_*)
+
+val (_, loss) = LBFGS.runMiniBatchLBFGS(
+  dataRDD,
+  gradient,
+  updater,
+  numCorrections,
+  lineSearchTolerance,
+  convTolerance,
+  maxNumIterations,
+  regParam,
+  miniBatchFrac,
+  initialWeightsWithIntercept)
+
+assert(loss.last - loss.head < 0, "loss isn't decreasing.")
+
+val lossDiff = loss.init.zip(loss.tail).map {
+  case (lhs, rhs) => lhs - rhs
+}
+assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
+
+val stepSize = 1.0
+// Well, GD converges slower, so it requires more iterations!
+val numGDIterations = 50
+val (_, lossGD) = GradientDescent.runMiniBatchSGD(
+  dataRDD,
+  gradient,
+  updater,
+  stepSize,
+  numGDIterations,
+  regParam,
+  miniBatchFrac,
+  initialWeightsWithIntercept)
+
+assert(Math.abs((lossGD.last - loss.last) / loss.last) < 0.05,
+  "LBFGS should match GD result within 5% error.")
--- End diff --

I added the comment in the code as:
// GD converges way slower than L-BFGS. To achieve a 1% difference,
// it requires 90 iterations in GD. No matter how much we increase
// the number of iterations in GD here, lossGD will always be
// larger than lossLBFGS.





[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11464280
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
@@ -0,0 +1,217 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.FunSuite
+import org.scalatest.matchers.ShouldMatchers
+
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+class LBFGSSuite extends FunSuite with BeforeAndAfterAll with 
ShouldMatchers {
+  @transient private var sc: SparkContext = _
+  var dataRDD:RDD[(Double, Vector)] = _
+
+  val nPoints = 1
+  val A = 2.0
+  val B = -1.5
+
+  val initialB = -1.0
+  val initialWeights = Array(initialB)
+
+  val gradient = new LogisticGradient()
+  val numCorrections = 10
+  val lineSearchTolerance = 0.9
+  var convTolerance = 1e-12
+  var maxNumIterations = 10
+  val miniBatchFrac = 1.0
+
+  val simpleUpdater = new SimpleUpdater()
+  val squaredL2Updater = new SquaredL2Updater()
+
+  // Add a extra variable consisting of all 1.0's for the intercept.
+  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
+  val data = testData.map { case LabeledPoint(label, features) =>
+label -> Vectors.dense(1.0, features.toArray: _*)
+  }
+
+  override def beforeAll() {
+sc = new SparkContext("local", "test")
+dataRDD = sc.parallelize(data, 2).cache()
+  }
+
+  override def afterAll() {
+sc.stop()
+System.clearProperty("spark.driver.port")
+  }
+
+  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
+math.abs(x - y) / (math.abs(y) + 1e-15) < tol
+  }
+
+  test("Assert LBFGS loss is decreasing and matches the result of Gradient 
Descent.") {
+val updater = new SimpleUpdater()
+val regParam = 0
+
+val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: 
_*)
+
+val (_, loss) = LBFGS.runMiniBatchLBFGS(
+  dataRDD,
+  gradient,
+  updater,
+  numCorrections,
+  lineSearchTolerance,
+  convTolerance,
+  maxNumIterations,
+  regParam,
+  miniBatchFrac,
+  initialWeightsWithIntercept)
+
+assert(loss.last - loss.head < 0, "loss isn't decreasing.")
+
+val lossDiff = loss.init.zip(loss.tail).map {
+  case (lhs, rhs) => lhs - rhs
+}
+assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
+
+val stepSize = 1.0
+// Well, GD converges slower, so it requires more iterations!
+val numGDIterations = 50
+val (_, lossGD) = GradientDescent.runMiniBatchSGD(
+  dataRDD,
+  gradient,
+  updater,
+  stepSize,
+  numGDIterations,
+  regParam,
+  miniBatchFrac,
+  initialWeightsWithIntercept)
+
+assert(Math.abs((lossGD.last - loss.last) / loss.last) < 0.05,
+  "LBFGS should match GD result within 5% error.")
+  }
+
+  test("Assert that LBFGS and Gradient Descent with L2 regularization get 
the same result.") {
+val regParam = 0.2
+
+// Prepare another non-zero weights to compare the loss in the first 
iteration.
+val initialWeightsWithIntercept = Vectors.dense(0.3, 0.12)
+
+val (weightLBFGS, lossLBFGS) = LBFGS.runMiniBatchLBFGS(
+  dataRDD,
+  gradient,
+  squaredL2Updater,
+  numCorrections,
+  lineSearchTolerance,
+  convTolerance,
+  maxNumIterations,
+  regParam,
+  miniBatchFrac,
  

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11464736
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
@@ -0,0 +1,217 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.FunSuite
+import org.scalatest.matchers.ShouldMatchers
+
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+class LBFGSSuite extends FunSuite with BeforeAndAfterAll with 
ShouldMatchers {
+  @transient private var sc: SparkContext = _
+  var dataRDD:RDD[(Double, Vector)] = _
+
+  val nPoints = 1
+  val A = 2.0
+  val B = -1.5
+
+  val initialB = -1.0
+  val initialWeights = Array(initialB)
+
+  val gradient = new LogisticGradient()
+  val numCorrections = 10
+  val lineSearchTolerance = 0.9
+  var convTolerance = 1e-12
+  var maxNumIterations = 10
+  val miniBatchFrac = 1.0
+
+  val simpleUpdater = new SimpleUpdater()
+  val squaredL2Updater = new SquaredL2Updater()
+
+  // Add a extra variable consisting of all 1.0's for the intercept.
+  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
+  val data = testData.map { case LabeledPoint(label, features) =>
+label -> Vectors.dense(1.0, features.toArray: _*)
+  }
+
+  override def beforeAll() {
+sc = new SparkContext("local", "test")
+dataRDD = sc.parallelize(data, 2).cache()
+  }
+
+  override def afterAll() {
+sc.stop()
+System.clearProperty("spark.driver.port")
+  }
+
+  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
+math.abs(x - y) / (math.abs(y) + 1e-15) < tol
+  }
+
+  test("Assert LBFGS loss is decreasing and matches the result of Gradient 
Descent.") {
+val updater = new SimpleUpdater()
+val regParam = 0
+
+val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: 
_*)
+
+val (_, loss) = LBFGS.runMiniBatchLBFGS(
+  dataRDD,
+  gradient,
+  updater,
+  numCorrections,
+  lineSearchTolerance,
+  convTolerance,
+  maxNumIterations,
+  regParam,
+  miniBatchFrac,
+  initialWeightsWithIntercept)
+
+assert(loss.last - loss.head < 0, "loss isn't decreasing.")
+
+val lossDiff = loss.init.zip(loss.tail).map {
+  case (lhs, rhs) => lhs - rhs
+}
+assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
+
+val stepSize = 1.0
+// Well, GD converges slower, so it requires more iterations!
+val numGDIterations = 50
+val (_, lossGD) = GradientDescent.runMiniBatchSGD(
+  dataRDD,
+  gradient,
+  updater,
+  stepSize,
+  numGDIterations,
+  regParam,
+  miniBatchFrac,
+  initialWeightsWithIntercept)
+
+assert(Math.abs((lossGD.last - loss.last) / loss.last) < 0.05,
+  "LBFGS should match GD result within 5% error.")
+  }
+
+  test("Assert that LBFGS and Gradient Descent with L2 regularization get 
the same result.") {
+val regParam = 0.2
+
+// Prepare another non-zero weights to compare the loss in the first 
iteration.
+val initialWeightsWithIntercept = Vectors.dense(0.3, 0.12)
+
+val (weightLBFGS, lossLBFGS) = LBFGS.runMiniBatchLBFGS(
+  dataRDD,
+  gradient,
+  squaredL2Updater,
+  numCorrections,
+  lineSearchTolerance,
+  convTolerance,
+  maxNumIterations,
+  regParam,
+  miniBatchFrac,
  

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-10 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11521070
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
@@ -0,0 +1,209 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.FunSuite
+import org.scalatest.matchers.ShouldMatchers
+
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.util.LocalSparkContext
+
+class LBFGSSuite extends FunSuite with BeforeAndAfterAll with 
LocalSparkContext with ShouldMatchers {
+
+  val nPoints = 1
+  val A = 2.0
+  val B = -1.5
+
+  val initialB = -1.0
+  val initialWeights = Array(initialB)
+
+  val gradient = new LogisticGradient()
+  val numCorrections = 10
+  var convergenceTol = 1e-12
+  var maxNumIterations = 10
+  val miniBatchFrac = 1.0
+
+  val simpleUpdater = new SimpleUpdater()
+  val squaredL2Updater = new SquaredL2Updater()
+
+  // Add an extra variable consisting of all 1.0's for the intercept.
+  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
+  val data = testData.map { case LabeledPoint(label, features) =>
+label -> Vectors.dense(1.0, features.toArray: _*)
+  }
+
+  lazy val dataRDD = sc.parallelize(data, 2).cache()
+
+  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
+math.abs(x - y) / (math.abs(y) + 1e-15) < tol
+  }
+
+  test("LBFGS loss should be decreasing and match the result of Gradient 
Descent.") {
+val updater = new SimpleUpdater()
+val regParam = 0
+
+val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: 
_*)
+
+val (_, loss) = LBFGS.runMiniBatchLBFGS(
+  dataRDD,
+  gradient,
+  updater,
+  numCorrections,
+  convergenceTol,
+  maxNumIterations,
+  regParam,
+  miniBatchFrac,
+  initialWeightsWithIntercept)
+
+assert(loss.last - loss.head < 0, "loss isn't decreasing.")
+
+val lossDiff = loss.init.zip(loss.tail).map {
+  case (lhs, rhs) => lhs - rhs
+}
+// This 0.8 bound is copying from GradientDescentSuite, and L-BFGS 
should
+// at least have the same performance. It's based on observation, no 
theoretically guaranteed.
+assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
--- End diff --

You are right. Since the cost function is convex, the loss is guaranteed
to decrease monotonically with the L-BFGS optimizer. (SGD doesn't guarantee
this, and the loss may fluctuate during the optimization process.) I'll add
a test for this property.
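
A minimal sketch of such a check, assuming loss is the per-iteration loss
array returned by runMiniBatchLBFGS as in the tests above:

    // Each consecutive loss should be no larger than the previous one.
    assert(
      loss.init.zip(loss.tail).forall { case (prev, next) => next <= prev },
      "L-BFGS loss should be monotonically decreasing.")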





[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-14 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11604731
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -0,0 +1,259 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import scala.collection.mutable.ArrayBuffer
+
+import breeze.linalg.{DenseVector => BDV, axpy}
+import breeze.optimize.{CachedDiffFunction, DiffFunction}
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+/**
+ * Class used to solve an optimization problem using Limited-memory BFGS.
+ * Reference: [[http://en.wikipedia.org/wiki/Limited-memory_BFGS]]
+ * @param gradient Gradient function to be used.
+ * @param updater Updater to be used to update weights after every 
iteration.
+ */
+class LBFGS(private var gradient: Gradient, private var updater: Updater)
+  extends Optimizer with Logging {
+
+  private var numCorrections = 10
+  private var convergenceTol = 1E-4
+  private var maxNumIterations = 100
+  private var regParam = 0.0
+  private var miniBatchFraction = 1.0
+
+  /**
+   * Set the number of corrections used in the LBFGS update. Default 10.
+   * Values of numCorrections less than 3 are not recommended; large values
+   * of numCorrections will result in excessive computing time.
+   * 3 < numCorrections < 10 is recommended.
+   * Restriction: numCorrections > 0
+   */
+  def setNumCorrections(corrections: Int): this.type = {
+assert(corrections > 0)
+this.numCorrections = corrections
+this
+  }
+
+  /**
+   * Set fraction of data to be used for each L-BFGS iteration. Default 
1.0.
+   */
+  def setMiniBatchFraction(fraction: Double): this.type = {
+this.miniBatchFraction = fraction
+this
+  }
+
+  /**
+   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
+   * Smaller value will lead to higher accuracy with the cost of more 
iterations.
+   */
+  def setConvergenceTol(tolerance: Int): this.type = {
+this.convergenceTol = tolerance
+this
+  }
+
+  /**
+   * Set the maximal number of iterations for L-BFGS. Default 100.
+   */
+  def setMaxNumIterations(iters: Int): this.type = {
+this.maxNumIterations = iters
+this
+  }
+
+  /**
+   * Set the regularization parameter. Default 0.0.
+   */
+  def setRegParam(regParam: Double): this.type = {
+this.regParam = regParam
+this
+  }
+
+  /**
+   * Set the gradient function (of the loss function of one single data 
example)
+   * to be used for L-BFGS.
+   */
+  def setGradient(gradient: Gradient): this.type = {
+this.gradient = gradient
+this
+  }
+
+  /**
+   * Set the updater function to actually perform a gradient step in a 
given direction.
+   * The updater is responsible to perform the update from the 
regularization term as well,
+   * and therefore determines what kind or regularization is used, if any.
+   */
+  def setUpdater(updater: Updater): this.type = {
+this.updater = updater
+this
+  }
+
+  override def optimize(data: RDD[(Double, Vector)], initialWeights: 
Vector): Vector = {
+val (weights, _) = LBFGS.runMiniBatchLBFGS(
+  data,
+  gradient,
+  updater,
+  numCorrections,
+  convergenceTol,
+  maxNumIterations,
+  regParam,
+  miniBatchFraction,
+  initialWeights)
+weights
+  }
+
+}
+
+/**
+ * Top-level method to run LBFGS.
+ */
+object LBFGS extends Logging {
+  /**
+   * Run Limited-memory BFGS (L-BFGS) in parallel using mini batches.
+   * In each iteration, we sample a subset (fraction miniBatchFracti

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-14 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11605030
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -0,0 +1,259 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import scala.collection.mutable.ArrayBuffer
+
+import breeze.linalg.{DenseVector => BDV, axpy}
+import breeze.optimize.{CachedDiffFunction, DiffFunction}
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+/**
+ * Class used to solve an optimization problem using Limited-memory BFGS.
+ * Reference: [[http://en.wikipedia.org/wiki/Limited-memory_BFGS]]
+ * @param gradient Gradient function to be used.
+ * @param updater Updater to be used to update weights after every 
iteration.
+ */
+class LBFGS(private var gradient: Gradient, private var updater: Updater)
+  extends Optimizer with Logging {
+
+  private var numCorrections = 10
+  private var convergenceTol = 1E-4
+  private var maxNumIterations = 100
+  private var regParam = 0.0
+  private var miniBatchFraction = 1.0
+
+  /**
+   * Set the number of corrections used in the LBFGS update. Default 10.
+   * Values of numCorrections less than 3 are not recommended; large values
+   * of numCorrections will result in excessive computing time.
+   * 3 < numCorrections < 10 is recommended.
+   * Restriction: numCorrections > 0
+   */
+  def setNumCorrections(corrections: Int): this.type = {
+assert(corrections > 0)
+this.numCorrections = corrections
+this
+  }
+
+  /**
+   * Set fraction of data to be used for each L-BFGS iteration. Default 
1.0.
+   */
+  def setMiniBatchFraction(fraction: Double): this.type = {
+this.miniBatchFraction = fraction
+this
+  }
+
+  /**
+   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
+   * Smaller value will lead to higher accuracy with the cost of more 
iterations.
+   */
+  def setConvergenceTol(tolerance: Int): this.type = {
+this.convergenceTol = tolerance
+this
+  }
+
+  /**
+   * Set the maximal number of iterations for L-BFGS. Default 100.
+   */
+  def setMaxNumIterations(iters: Int): this.type = {
+this.maxNumIterations = iters
+this
+  }
+
+  /**
+   * Set the regularization parameter. Default 0.0.
+   */
+  def setRegParam(regParam: Double): this.type = {
+this.regParam = regParam
+this
+  }
+
+  /**
+   * Set the gradient function (of the loss function of one single data 
example)
+   * to be used for L-BFGS.
+   */
+  def setGradient(gradient: Gradient): this.type = {
+this.gradient = gradient
+this
+  }
+
+  /**
+   * Set the updater function to actually perform a gradient step in a 
given direction.
+   * The updater is responsible to perform the update from the 
regularization term as well,
+   * and therefore determines what kind or regularization is used, if any.
+   */
+  def setUpdater(updater: Updater): this.type = {
+this.updater = updater
+this
+  }
+
+  override def optimize(data: RDD[(Double, Vector)], initialWeights: 
Vector): Vector = {
+val (weights, _) = LBFGS.runMiniBatchLBFGS(
+  data,
+  gradient,
+  updater,
+  numCorrections,
+  convergenceTol,
+  maxNumIterations,
+  regParam,
+  miniBatchFraction,
+  initialWeights)
+weights
+  }
+
+}
+
+/**
+ * Top-level method to run LBFGS.
+ */
+object LBFGS extends Logging {
+  /**
+   * Run Limited-memory BFGS (L-BFGS) in parallel using mini batches.
+   * In each iteration, we sample a subset (fraction miniBatchFracti

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-14 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11605070
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -0,0 +1,259 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import scala.collection.mutable.ArrayBuffer
+
+import breeze.linalg.{DenseVector => BDV, axpy}
+import breeze.optimize.{CachedDiffFunction, DiffFunction}
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+/**
+ * Class used to solve an optimization problem using Limited-memory BFGS.
+ * Reference: [[http://en.wikipedia.org/wiki/Limited-memory_BFGS]]
+ * @param gradient Gradient function to be used.
+ * @param updater Updater to be used to update weights after every 
iteration.
+ */
+class LBFGS(private var gradient: Gradient, private var updater: Updater)
+  extends Optimizer with Logging {
+
+  private var numCorrections = 10
+  private var convergenceTol = 1E-4
+  private var maxNumIterations = 100
+  private var regParam = 0.0
+  private var miniBatchFraction = 1.0
+
+  /**
+   * Set the number of corrections used in the LBFGS update. Default 10.
+   * Values of numCorrections less than 3 are not recommended; large values
+   * of numCorrections will result in excessive computing time.
+   * 3 < numCorrections < 10 is recommended.
+   * Restriction: numCorrections > 0
+   */
+  def setNumCorrections(corrections: Int): this.type = {
+assert(corrections > 0)
+this.numCorrections = corrections
+this
+  }
+
+  /**
+   * Set fraction of data to be used for each L-BFGS iteration. Default 
1.0.
+   */
+  def setMiniBatchFraction(fraction: Double): this.type = {
+this.miniBatchFraction = fraction
+this
+  }
+
+  /**
+   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
+   * Smaller value will lead to higher accuracy with the cost of more 
iterations.
+   */
+  def setConvergenceTol(tolerance: Int): this.type = {
+this.convergenceTol = tolerance
+this
+  }
+
+  /**
+   * Set the maximal number of iterations for L-BFGS. Default 100.
+   */
+  def setMaxNumIterations(iters: Int): this.type = {
+this.maxNumIterations = iters
+this
+  }
+
+  /**
+   * Set the regularization parameter. Default 0.0.
+   */
+  def setRegParam(regParam: Double): this.type = {
+this.regParam = regParam
+this
+  }
+
+  /**
+   * Set the gradient function (of the loss function of one single data 
example)
+   * to be used for L-BFGS.
+   */
+  def setGradient(gradient: Gradient): this.type = {
+this.gradient = gradient
+this
+  }
+
+  /**
+   * Set the updater function to actually perform a gradient step in a 
given direction.
+   * The updater is responsible to perform the update from the 
regularization term as well,
+   * and therefore determines what kind or regularization is used, if any.
+   */
+  def setUpdater(updater: Updater): this.type = {
+this.updater = updater
+this
+  }
+
+  override def optimize(data: RDD[(Double, Vector)], initialWeights: 
Vector): Vector = {
+val (weights, _) = LBFGS.runMiniBatchLBFGS(
+  data,
+  gradient,
+  updater,
+  numCorrections,
+  convergenceTol,
+  maxNumIterations,
+  regParam,
+  miniBatchFraction,
+  initialWeights)
+weights
+  }
+
+}
+
+/**
+ * Top-level method to run LBFGS.
+ */
+object LBFGS extends Logging {
+  /**
+   * Run Limited-memory BFGS (L-BFGS) in parallel using mini batches.
+   * In each iteration, we sample a subset (fraction miniBatchFracti

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-14 Thread dbtsai
Github user dbtsai closed the pull request at:

https://github.com/apache/spark/pull/353


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-14 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/353#issuecomment-40434555
  
Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-14 Thread dbtsai
GitHub user dbtsai reopened a pull request:

https://github.com/apache/spark/pull/353

[SPARK-1157][MLlib] L-BFGS Optimizer based on Breeze's implementation.

This PR uses Breeze's L-BFGS implementation, and the Breeze dependency has 
already been introduced by Xiangrui's sparse input format work in SPARK-1212. 
Nice work, @mengxr!

When used with a regularized updater, we need to compute the regVal and the 
regGradient (the gradient of the regularization part of the cost function), and 
with the current updater design we can compute those two values in the 
following way.

Let's review how the updater works when returning newWeights given the input 
parameters.

w' = w - thisIterStepSize * (gradient + regGradient(w))  Note that 
regGradient is a function of w!
If we set gradient = 0 and thisIterStepSize = 1, then
regGradient(w) = w - w'

As a result, for regVal, it can be computed by

    val regVal = updater.compute(
      weights,
      new DoubleMatrix(initialWeights.length, 1), 0, 1, regParam)._2

and for regGradient, it can be obtained by

    val regGradient = weights.sub(
      updater.compute(weights, new DoubleMatrix(initialWeights.length, 1), 1, 1, regParam)._1)
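
As a quick numerical sanity check of that identity (purely illustrative; it 
just simulates the updater step with plain L2 regularization, where 
regGradient(w) = regParam * w):

    val regParam = 0.1
    val w = Array(1.0, -2.0, 3.0)
    // Simulate the updater with gradient = 0 and thisIterStepSize = 1:
    //   w' = w - 1.0 * (0 + regParam * w)
    val wPrime = w.map(wi => wi - regParam * wi)
    // Recover the regularization gradient as w - w'; it equals regParam * w.
    val regGradient = w.zip(wPrime).map { case (wi, wpi) => wi - wpi }
    // regGradient is approximately Array(0.1, -0.2, 0.3), i.e. 0.1 * w (up to floating point)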

The PR includes tests which compare the results with SGD, with and without 
regularization.

We did a comparison between LBFGS and SGD, and we often saw 10x fewer
steps with LBFGS while the cost per step is the same (just computing
the gradient).

The following is the paper by Prof. Ng at Stanford comparing different
optimizers, including LBFGS and SGD. They use them in the context of
deep learning, but it is worth reading as a reference.
http://cs.stanford.edu/~jngiam/papers/LeNgiamCoatesLahiriProchnowNg2011.pdf

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark dbtsai-LBFGS

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/353.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #353


commit 984b18e21396eae84656e15da3539ff3b5f3bf4a
Author: DB Tsai 
Date:   2014-04-05T00:06:50Z

L-BFGS Optimizer based on Breeze's implementation. Also fixed indentation 
issue in GradientDescent optimizer.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-14 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/353#issuecomment-40434626
  
Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-14 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/353#issuecomment-40434691
  
Timeout for lastest jenkins run. It seems that CI is not stable now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-15 Thread dbtsai
Github user dbtsai closed the pull request at:

https://github.com/apache/spark/pull/353


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-03 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/955

[SPARK-1969][MLlib] Public available online summarizer for mean, variance, 
min, and max

It basically moved the private ColumnStatisticsAggregator class from 
RowMatrix to a publicly available DeveloperApi.

Changes:
1) Moved the trait from 
org.apache.spark.mllib.stat.MultivariateStatisticalSummary to 
org.apache.spark.mllib.stats.Summarizer
2) Moved the private implementation from 
org.apache.spark.mllib.linalg.ColumnStatisticsAggregator to 
org.apache.spark.mllib.stats.OnlineSummarizer
3) When creating an OnlineSummarizer object, the number of columns is not 
needed in the constructor. It's determined when users add the first sample 
(see the usage sketch below).
4) Added the API documentation for OnlineSummarizer
5) Added the unit tests for OnlineSummarizer
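
A rough usage sketch of the resulting public class (the method names here are 
assumptions for illustration and may not match the final signatures exactly):

    // samples: Seq[Vector] of equal-length feature vectors
    val summarizer = new OnlineSummarizer       // no column count in the constructor
    samples.foreach(v => summarizer.add(v))     // dimension is fixed by the first sample
    println(s"mean=${summarizer.mean}, variance=${summarizer.variance}, " +
      s"min=${summarizer.min}, max=${summarizer.max}")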

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark dbtsai-summarizer

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/955.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #955


commit 6d0e596a71b44c21b86ba3407d6dc62b0b684198
Author: DB Tsai 
Date:   2014-06-03T03:01:16Z

First version.

commit 1bd8e0c7ded84049371b29bc47c666957f07d091
Author: DB Tsai 
Date:   2014-06-03T20:53:50Z

Some cleanup.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-03 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45023171
  
Since the "Statistical" in MultivariateStatisticalSummary is already in the 
package name as "stat", I think it worths to have a concise name. Also, most 
people spell the abbreviation of statistics as "stats", so I changed it from 
"stat" to "stats".

Since it's already a public API, I've no problem to change it back.





---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1870][branch-0.9] Jars added by sc.addJ...

2014-06-03 Thread dbtsai
Github user dbtsai closed the pull request at:

https://github.com/apache/spark/pull/834


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-03 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45026777
  
Don't know why Jenkins is not happy with removing "private class 
ColumnStatisticsAggregator(private val n: Int)". After all, it's a private 
class.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Fixed a typo

2014-06-03 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/959

Fixed a typo

in RowMatrix.scala

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark dbtsai-typo

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/959.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #959


commit fab0e0e77ff63a67868d7f3d8f5434b113ee48fd
Author: DB Tsai 
Date:   2014-06-03T23:14:18Z

Fixed typo




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-04 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45124672
  
@mengxr Got you. It's a false-positive error. Do you have any comments or 
feedback on moving it out as a public API? I'm building a feature scaling API 
in MLUtils which depends on this. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1516]Throw exception in yarn client ins...

2014-06-05 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/490#issuecomment-45277558
  
This looks good to me. 

However, we still have more System.exit calls in other parts of the deployment 
code; we probably want to review and fix them. This can be a good step!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1177] Allow SPARK_JAR to be set program...

2014-06-05 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/987

[SPARK-1177] Allow SPARK_JAR to be set programmatically in system properties



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark 
dbtsai-yarn-spark-jar-from-java-property

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/987.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #987


commit 196df1c9fa0c423a30f3b118bf1dd58480cb2fee
Author: DB Tsai 
Date:   2014-05-27T23:07:27Z

Allow users to programmatically set the spark jar.

commit bdff88ac46bff5aea63e23c24d5d5f00a4e83023
Author: DB Tsai 
Date:   2014-06-05T22:43:09Z

Doc update




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1177] Allow SPARK_JAR to be set program...

2014-06-05 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/987#issuecomment-45286460
  
@chesterxgchen 

#560 Agreed, it's a more thorough way to handle this issue. In your code, it 
seems that the spark jar setting is moved to conf: SparkConf in favor of 
CONF_SPARK_JAR. But it will make it difficult for users to set up, since 
Client.scala also has to be changed. Simple question: with your change, how 
can users submit a job with their own spark jar by passing CONF_SPARK_JAR 
correctly?

    def sparkJar(conf: SparkConf) = {
      if (conf.contains(CONF_SPARK_JAR)) {
        conf.get(CONF_SPARK_JAR)
      } else if (System.getenv(ENV_SPARK_JAR) != null) {
        logWarning(
          s"$ENV_SPARK_JAR detected in the system environment. This variable has been deprecated " +
            s"in favor of the $CONF_SPARK_JAR configuration variable.")
        System.getenv(ENV_SPARK_JAR)
      } else {
        SparkContext.jarOfClass(this.getClass).head
      }
    }




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1177] Allow SPARK_JAR to be set program...

2014-06-05 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/987#issuecomment-45292804
  
The app's code will only run in the application master in yarn-cluster mode, 
so how can the yarn client know which jar will be submitted to the distributed 
cache if we set it in the app's spark conf?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1177] Allow SPARK_JAR to be set program...

2014-06-05 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/987#issuecomment-45296471
  
We launched Spark jobs inside our Tomcat, and we use the Client.scala API 
directly. With my patch, I can set up the spark jar using System.setProperty() 
before

      val sparkConf = new SparkConf
      val args = getArgsFromConf(conf)
      new Client(new ClientArguments(args, sparkConf), hadoopConfig, sparkConf).run

Do you mean that with your work, I can set up the jar location in the 
sparkConf which will be passed into the new Client?

Can we have the following in the sparkJar method?

    def sparkJar(conf: SparkConf) = {
      if (conf.contains(CONF_SPARK_JAR)) {
        conf.get(CONF_SPARK_JAR)
      } else if (System.getProperty(ENV_SPARK_JAR) != null) {
        logWarning(
          s"$ENV_SPARK_JAR detected in the system property. This variable has been deprecated " +
            s"in favor of the $CONF_SPARK_JAR configuration variable.")
        System.getProperty(ENV_SPARK_JAR)
      } else if (System.getenv(ENV_SPARK_JAR) != null) {
        logWarning(
          s"$ENV_SPARK_JAR detected in the system environment. This variable has been deprecated " +
            s"in favor of the $CONF_SPARK_JAR configuration variable.")
        System.getenv(ENV_SPARK_JAR)
      } else {
        SparkContext.jarOfClass(this.getClass).head
      }
    }




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-05 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45297396
  
k... it would be better to have MiMa exclude the private class automatically, 
or we could have an annotation for the private class.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1177] Allow SPARK_JAR to be set program...

2014-06-06 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/987#issuecomment-45363846
  
Got you. Looking forward to having your patch merged. Thanks.

Sent from my Google Nexus 5
On Jun 6, 2014 9:35 AM, "Marcelo Vanzin"  wrote:

> I mean you can set system properties the same way. SparkConf initializes
> its configuration from system properties, so my patch covers not only your
> case, but also others (like using a spark-defaults.conf file for
> spark-submit users).
>
> —
> Reply to this email directly or view it on GitHub
> <https://github.com/apache/spark/pull/987#issuecomment-45357297>.
>


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1870] Ported from 1.0 branch to 0.9 bra...

2014-06-08 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/1013

[SPARK-1870] Ported from 1.0 branch to 0.9 branch. 

Made deployment with --jars work in yarn-standalone mode. Secondary jars are 
sent to the distributed cache of all containers, and the cached jars are added 
to the classpath before the executors start.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark branch-0.9

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1013.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1013


commit 0956af95e24bc37303525fde6f85e0b3aeebd946
Author: DB Tsai 
Date:   2014-06-08T23:16:53Z

Ported SPARK-1870 from 1.0 branch to 0.9 branch




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1870] Ported from 1.0 branch to 0.9 bra...

2014-06-08 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1013#issuecomment-45451719
  
CC: @mengxr and @sryza


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1870] Ported from 1.0 branch to 0.9 bra...

2014-06-08 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1013#issuecomment-45459920
  
Works in my local VM. It should work in a real yarn cluster; I will test it 
tomorrow in the office.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1870] Ported from 1.0 branch to 0.9 bra...

2014-06-09 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1013#issuecomment-45551414
  
Tested on a 4-node PivotalHD 1.1 YARN cluster. With --addjars 
file:///somePath/to/jar, launching a Spark application works.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1870] Made deployment with --jars work ...

2014-06-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1013#discussion_r13573407
  
--- Diff: 
yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -281,18 +280,19 @@ class Client(args: ClientArguments, conf: 
Configuration, sparkConf: SparkConf)
 }
 
 // Handle jars local to the ApplicationMaster.
+var cachedSecondaryJarLinks = ListBuffer.empty[String]
--- End diff --

thks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1870] Made deployment with --jars work ...

2014-06-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1013#discussion_r13573544
  
--- Diff: 
yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -507,12 +508,19 @@ object Client {
   Apps.addToEnvironment(env, Environment.CLASSPATH.name, 
Environment.PWD.$() +
 Path.SEPARATOR + LOG4J_PROP)
 }
+
+val cachedSecondaryJarLinks =
+  
sparkConf.getOption(CONF_SPARK_YARN_SECONDARY_JARS).getOrElse("").split(",")
--- End diff --

Thanks, you are right. It will add an empty string to the array, and then add 
the folder without a file to the classpath. Will fix it in master as well.
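
A minimal sketch of such a fix (illustrative; it just filters the split result 
before the links are cached):

    val cachedSecondaryJarLinks =
      sparkConf.getOption(CONF_SPARK_YARN_SECONDARY_JARS)
        .getOrElse("")
        .split(",")
        .filter(_.nonEmpty)   // "".split(",") yields Array(""), so drop empty entries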


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Make sure that empty string is filtered out wh...

2014-06-09 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/1027

Make sure that empty string is filtered out when we get the secondary jars 
from conf



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark dbtsai-classloader

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1027.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1027


commit c9c7ad7fc6a2cf03503fe7b19ea1da92247196c6
Author: DB Tsai 
Date:   2014-06-10T01:29:04Z

Make sure that empty string is filtered out when we get the secondary jars 
from conf.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1516]Throw exception in yarn client ins...

2014-06-10 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/490#discussion_r13624385
  
--- Diff: 
yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -95,15 +96,18 @@ trait ClientBase extends Logging {
 
 // If we have requested more then the clusters max for a single 
resource then exit.
 if (args.executorMemory > maxMem) {
-  logError("Required executor memory (%d MB), is above the max 
threshold (%d MB) of this cluster.".
-format(args.executorMemory, maxMem))
-  System.exit(1)
+  val errorMessage =
+"Required executor memory (%d MB), is above the max threshold (%d 
MB) of this cluster.".
+format(args.executorMemory, maxMem)
+  logError(errorMessage)
+  throw new IllegalArgumentException(errorMessage)
 }
 val amMem = args.amMemory + YarnAllocationHandler.MEMORY_OVERHEAD
 if (amMem > maxMem) {
-  logError("Required AM memory (%d) is above the max threshold (%d) of 
this cluster".
-format(args.amMemory, maxMem))
-  System.exit(1)
+  val errorMessage ="Required AM memory (%d) is above the max 
threshold (%d) of this cluster".
--- End diff --

Please add a space after =


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1516]Throw exception in yarn client ins...

2014-06-10 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/490#discussion_r13624580
  
--- Diff: 
yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -95,15 +96,18 @@ trait ClientBase extends Logging {
 
 // If we have requested more then the clusters max for a single 
resource then exit.
 if (args.executorMemory > maxMem) {
-  logError("Required executor memory (%d MB), is above the max 
threshold (%d MB) of this cluster.".
-format(args.executorMemory, maxMem))
-  System.exit(1)
+  val errorMessage =
+"Required executor memory (%d MB), is above the max threshold (%d 
MB) of this cluster.".
+format(args.executorMemory, maxMem)
--- End diff --

Move the . to the new line


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1516]Throw exception in yarn client ins...

2014-06-10 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/490#discussion_r13624615
  
--- Diff: 
yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -95,15 +96,18 @@ trait ClientBase extends Logging {
 
 // If we have requested more then the clusters max for a single 
resource then exit.
 if (args.executorMemory > maxMem) {
-  logError("Required executor memory (%d MB), is above the max 
threshold (%d MB) of this cluster.".
-format(args.executorMemory, maxMem))
-  System.exit(1)
+  val errorMessage =
+"Required executor memory (%d MB), is above the max threshold (%d 
MB) of this cluster.".
+format(args.executorMemory, maxMem)
+  logError(errorMessage)
+  throw new IllegalArgumentException(errorMessage)
 }
 val amMem = args.amMemory + YarnAllocationHandler.MEMORY_OVERHEAD
 if (amMem > maxMem) {
-  logError("Required AM memory (%d) is above the max threshold (%d) of 
this cluster".
-format(args.amMemory, maxMem))
-  System.exit(1)
+  val errorMessage ="Required AM memory (%d) is above the max 
threshold (%d) of this cluster".
--- End diff --

move the . to the newline 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1870] Made deployment with --jars work ...

2014-06-11 Thread dbtsai
Github user dbtsai closed the pull request at:

https://github.com/apache/spark/pull/1013


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1516]Throw exception in yarn client ins...

2014-06-11 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/490#issuecomment-45835283
  
@mengxr Do you think it's in good shape now? This is the only issue 
blocking us from using vanilla Spark. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2163] class LBFGS optimize with Double ...

2014-06-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1104#discussion_r13897737
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -38,10 +38,10 @@ import org.apache.spark.mllib.linalg.{Vectors, Vector}
 class LBFGS(private var gradient: Gradient, private var updater: Updater)
   extends Optimizer with Logging {
 
-  private var numCorrections = 10
-  private var convergenceTol = 1E-4
-  private var maxNumIterations = 100
-  private var regParam = 0.0
+  private var numCorrections: Int = 10
+  private var convergenceTol: Double = 1E-4
+  private var maxNumIterations: Int = 100
+  private var regParam: Double = 0.0
 
--- End diff --

In most of the mllib codebase, we don't specify variable types explicitly. Can 
you remove them?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2163] class LBFGS optimize with Double ...

2014-06-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1104#discussion_r13897825
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
@@ -195,4 +195,39 @@ class LBFGSSuite extends FunSuite with 
LocalSparkContext with Matchers {
 assert(lossLBFGS3.length == 6)
 assert((lossLBFGS3(4) - lossLBFGS3(5)) / lossLBFGS3(4) < 
convergenceTol)
   }
+
--- End diff --

The bug wasn't caught because we only test the static runLBFGS method instead 
of the class. We could probably change all the existing tests to use the class 
API, so we don't need to add another test.

@mengxr what do you think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2163] class LBFGS optimize with Double ...

2014-06-17 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1104#issuecomment-46393840
  
I think it's for legacy reasons that there are two different ways to access 
the API. As far as I know, @mengxr is working on consolidating the interface. 
He can probably say more on this topic.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2163] class LBFGS optimize with Double ...

2014-06-18 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1104#discussion_r13905548
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
@@ -195,4 +195,39 @@ class LBFGSSuite extends FunSuite with 
LocalSparkContext with Matchers {
 assert(lossLBFGS3.length == 6)
 assert((lossLBFGS3(4) - lossLBFGS3(5)) / lossLBFGS3(4) < 
convergenceTol)
   }
+
--- End diff --

We may add the same test to SGD as well. My bad, our internal one is right; 
I probably didn't copy and paste it correctly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2163] class LBFGS optimize with Double ...

2014-06-18 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1104#issuecomment-46412293
  
I think changing the signature will be a problem for MiMa.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-06-24 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/1207

SPARK-2272 [MLlib] Feature scaling which standardizes the range of 
independent variables or features of data

Feature scaling is a method used to standardize the range of independent 
variables or features of data. In data processing, it is also known as data 
normalization and is generally performed during the data preprocessing step.

In this work, a trait called `VectorTransformer` is defined for generic 
transformation of a vector. It contains two methods: `apply`, which applies the 
transformation to a vector, and `unapply`, which applies the inverse 
transformation to a vector.

There are three concrete implementations of `VectorTransformer`, and they can 
all be easily extended with PMML transformation support.

1) `VectorStandardizer` - Standardizes a vector given the mean and variance. 
Since the standardization will densify the output, the output is always in 
dense vector format.

2) `VectorRescaler` - Rescales a vector into a target range specified either 
by a tuple of two double values or by two vectors giving the new target 
minimum and maximum. Since the rescaling will subtract the minimum of each 
column first, the output will always be a dense vector regardless of the 
input vector type.

3) `VectorDivider` - Transforms a vector by dividing by a constant or by 
dividing element-wise by another vector. This transformation preserves the 
type of the input vector without densifying the result.

Utility helper methods are implemented that take an RDD[Vector] as input and 
return the transformed RDD[Vector] together with the transformer, for 
dividing, rescaling, normalization, and standardization. A rough sketch of 
the trait is shown below.
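
A simplified sketch of the trait and one concrete transformer (illustrative 
only; the names follow the description above, but the bodies are assumptions 
rather than the actual patch):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    trait VectorTransformer extends Serializable {
      def apply(vector: Vector): Vector     // apply the transformation
      def unapply(vector: Vector): Vector   // apply the inverse transformation
    }

    // Element-wise division by a constant. Simplified here: it always returns a
    // dense vector, whereas the actual VectorDivider preserves the input type.
    class VectorDivider(divisor: Double) extends VectorTransformer {
      require(divisor != 0.0)
      override def apply(vector: Vector): Vector =
        Vectors.dense(vector.toArray.map(_ / divisor))
      override def unapply(vector: Vector): Vector =
        Vectors.dense(vector.toArray.map(_ * divisor))
    }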


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark dbtsai-feature-scaling

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1207.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1207


commit d3daa997c9a51a4af8f67cbcdb3738e5ba8c4b56
Author: DB Tsai 
Date:   2014-06-25T02:30:16Z

Feature scaling which standardizes the range of independent variables or 
features of data.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2281 [MLlib] Simplify the duplicate code...

2014-06-25 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/1215

SPARK-2281 [MLlib] Simplify the duplicate code in Gradient.scala

The Gradient.compute variant that returns a new tuple of (gradient: Vector, 
loss: Double) can be built on top of the in-place version of Gradient.compute, 
so we don't need to maintain the duplicated code.
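
A minimal sketch of the idea (illustrative; the parameter names and the exact 
Gradient signature are assumed, not copied from the patch):

    // The allocating variant delegates to the in-place variant, so the gradient
    // math lives in only one place.
    def compute(data: Vector, label: Double, weights: Vector): (Vector, Double) = {
      val gradient = Vectors.dense(new Array[Double](weights.size))
      val loss = compute(data, label, weights, gradient)   // in-place version fills `gradient`
      (gradient, loss)
    }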


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark dbtsai-gradient-simplification

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1215.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1215


commit b2595d334c0d6246fe904b8c00ca3d51dc88f71a
Author: DB Tsai 
Date:   2014-06-25T22:08:30Z

Simplify the gradient




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1516]Throw exception in yarn client ins...

2014-06-26 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1099#issuecomment-47250277
  
It seems that Jenkins is missing the Python runtime.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [WIP][SPARK-2174][MLLIB] treeReduce and treeAg...

2014-07-01 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1110#issuecomment-47683286
  
We benchmarked treeReduce in our random forest implementation, and since the 
trees generated from each partition are fairly large (more than 100MB), we 
found that treeReduce can significantly reduce the shuffle time, from 6 minutes 
to 2 minutes. Nice work!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Upgrade junit_xml_listener to 0.5.1 which fixe...

2014-07-08 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/1333

Upgrade junit_xml_listener to 0.5.1, which fixes the following issues:

1) fix the class name to be the fully qualified classpath
2) make sure the reporting time is in seconds, not milliseconds, which was 
causing the JUnit HTML report to show incorrect numbers
3) make sure the durations of the tests are cumulative.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark dbtsai-junit

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1333.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1333


commit bbeac4b1bb8635eec2b046f1c4cfd15b64d0
Author: DB Tsai 
Date:   2014-07-08T18:44:47Z

Upgrade junit_xml_listener to 0.5.1, which fixes the following issues:

1) fix the class name to be the fully qualified classpath
2) make sure the reporting time is in seconds, not milliseconds, which was 
causing the JUnit HTML report to show incorrect numbers
3) make sure the durations of the tests are cumulative.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Upgrade junit_xml_listener to 0.5.1 which fixe...

2014-07-08 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1333#issuecomment-48417558
  
done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2281 [MLlib] Simplify the duplicate code...

2014-07-09 Thread dbtsai
Github user dbtsai closed the pull request at:

https://github.com/apache/spark/pull/1215


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-08-02 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-50982699
  
@mengxr Is there any problem with asfgit? This is not finished yet; why did 
asfgit say it's merged into apache:master?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-02 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15733217
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala ---
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+
+/**
+ * :: DeveloperApi ::
+ * Normalizes samples individually to unit L^n norm
+ *
+ * @param n  L^2 norm by default. Normalization in L^n space.
+ */
+@DeveloperApi
+class Normalizer(n: Int) extends VectorTransformer with Serializable {
+
+  def this() = this(2)
+
+  require(n > 0)
--- End diff --

This is an Int. As long as we require p > 0, it implies p >= 0.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-02 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15733221
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala ---
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+
+/**
+ * :: DeveloperApi ::
+ * Normalizes samples individually to unit L^n norm
+ *
+ * @param n  L^2 norm by default. Normalization in L^n space.
+ */
+@DeveloperApi
+class Normalizer(n: Int) extends VectorTransformer with Serializable {
+
+  def this() = this(2)
+
+  require(n > 0)
--- End diff --

I made it more explicit rather than trying to save one CPU cycle.





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-02 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15733244
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala ---
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+/**
+ * :: DeveloperApi ::
+ * Standardizes features by removing the mean and scaling to unit variance 
using column summary
+ * statistics on the samples in the training set.
+ *
+ * @param withMean True by default. Centers the data with mean before 
scaling. It will build a dense
+ * output, so this does not work on sparse input and will 
raise an exception.
+ * @param withStd True by default. Scales the data to unit standard 
deviation.
--- End diff --

sklearn.preprocessing.StandardScaler has this API. If we want to minimize the set of parameters for now, we can remove it in this release.


http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler
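
For context, a minimal spark-shell-style usage sketch of the parameters under discussion, mirroring the test suite later in this thread; the `features` RDD is an assumed input, not something from the PR:

```scala
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// features: RDD[Vector] is assumed to already exist.
val scaler = new StandardScaler(withMean = true, withStd = true)
scaler.fit(features)                          // compute column means and variances
val standardized = scaler.transform(features) // zero mean, unit variance per column
```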





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-02 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15733248
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/VectorTransformer.scala ---
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.rdd.RDD
+
+/**
+ * :: DeveloperApi ::
+ * Trait for transformation of a vector
+ */
+@DeveloperApi
+trait VectorTransformer {
+
+  /**
+   * Applies transformation on a vector.
+   *
+   * @param vector vector to be transformed.
+   * @return transformed vector.
+   */
+  def transform(vector: Vector): Vector
+
+  /**
+   * Applies transformation on a RDD[Vector].
+   *
+   * @param data RDD[Vector] to be transformed.
+   * @return transformed RDD[Vector].
+   */
+  def transform(data: RDD[Vector]): RDD[Vector] = data.map(x => 
this.transform(x))
--- End diff --

Can you elaborate on this?





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-03 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1207#issuecomment-50986024
  
TODO
1) Support p = Double.PositiveInfinity so that the 1, 2, and inf norms are all covered (see the sketch below).
2) Add withStd back.
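
A minimal sketch of what item 1 amounts to for a plain array, assuming p >= 1 or p = Double.PositiveInfinity (the helper name is mine, not part of the PR):

```scala
def normalizeLp(values: Array[Double], p: Double): Array[Double] = {
  val norm =
    if (p == Double.PositiveInfinity) values.map(math.abs).max
    else math.pow(values.map(v => math.pow(math.abs(v), p)).sum, 1.0 / p)
  // Leave an all-zero vector unchanged rather than dividing by zero.
  if (norm == 0.0) values.clone() else values.map(_ / norm)
}
```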





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-03 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15738936
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala ---
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import breeze.linalg.{DenseVector => BDV, SparseVector => BSV}
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+
+/**
+ * :: DeveloperApi ::
+ * Normalizes samples individually to unit L^p norm
+ *
+ * For any 1 <= p < Double.Infinity, normalizes samples using 
sum(abs(vector).^p)^(1/p) as norm.
+ * For p = Double.Infinity, max(abs(vector)) will be used as norm for 
normalization.
+ * For p = Double.NegativeInfinity, min(abs(vector)) will be used as norm 
for normalization.
--- End diff --

MATLAB has an L_{-inf} norm (http://www.mathworks.com/help/matlab/ref/norm.html) defined as min(abs(X)). I agree that it's not useful for sparse data, so I'm going to remove it.






[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-03 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15740021
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala ---
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import breeze.linalg.{DenseVector => BDV, SparseVector => BSV}
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+
+/**
+ * :: DeveloperApi ::
+ * Normalizes samples individually to unit L^p^ norm
--- End diff --

lol...





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-03 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15740240
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/feature/StandardScalerSuite.scala 
---
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import org.scalatest.FunSuite
+
+import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, 
Vectors}
+import org.apache.spark.mllib.util.LocalSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+import org.apache.spark.mllib.rdd.RDDFunctions._
+import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, 
MultivariateOnlineSummarizer}
+import org.apache.spark.rdd.RDD
+
+class StandardScalerSuite extends FunSuite with LocalSparkContext {
+
+  private def computeSummary(data: RDD[Vector]): 
MultivariateStatisticalSummary = {
+data.treeAggregate(new MultivariateOnlineSummarizer)(
+  (aggregator, data) => aggregator.add(data),
+  (aggregator1, aggregator2) => aggregator1.merge(aggregator2))
+  }
+
+  test("Standardization with dense input") {
+val data = Array(
+  Vectors.dense(-2.0, 2.3, 0),
+  Vectors.dense(0.0, -1.0, -3.0),
+  Vectors.dense(0.0, -5.1, 0.0),
+  Vectors.dense(3.8, 0.0, 1.9),
+  Vectors.dense(1.7, -0.6, 0.0),
+  Vectors.dense(0.0, 1.9, 0.0)
+)
+
+val dataRDD = sc.parallelize(data, 3)
+
+val standardizer1 = new StandardScaler(withMean = true, withStd = true)
+val standardizer2 = new StandardScaler()
+val standardizer3 = new StandardScaler(withMean = true, withStd = 
false)
+
+withClue("Using a standardizer before fitting the model should throw 
exception.") {
+  intercept[IllegalStateException] {
+data.map(standardizer1.transform)
+  }
+}
+
+standardizer1.fit(dataRDD)
+standardizer2.fit(dataRDD)
+standardizer3.fit(dataRDD)
+
+val data1 = data.map(standardizer1.transform)
+val data2 = data.map(standardizer2.transform)
+val data3 = data.map(standardizer3.transform)
+
+val data1RDD = standardizer1.transform(dataRDD)
+val data2RDD = standardizer2.transform(dataRDD)
+val data3RDD = standardizer3.transform(dataRDD)
+
+val summary = computeSummary(dataRDD)
+val summary1 = computeSummary(data1RDD)
+val summary2 = computeSummary(data2RDD)
+val summary3 = computeSummary(data3RDD)
+
+assert((data, data1, data1RDD.collect()).zipped.forall(
+(v1, v2, v3) => (v1, v2, v3) match {
+  case (v1: DenseVector, v2: DenseVector, v3: DenseVector) => true
+  case (v1: SparseVector, v2: SparseVector, v3: SparseVector) => 
true
+  case _ => false
+}
+  ), "The vector type should be preserved after standardization.")
+
+assert((data, data2, data2RDD.collect()).zipped.forall(
+(v1, v2, v3) => (v1, v2, v3) match {
+  case (v1: DenseVector, v2: DenseVector, v3: DenseVector) => true
+  case (v1: SparseVector, v2: SparseVector, v3: SparseVector) => 
true
+  case _ => false
+}
+  ), "The vector type should be preserved after standardization.")
+
+assert((data, data3, data3RDD.collect()).zipped.forall(
+(v1, v2, v3) => (v1, v2, v3) match {
+  case (v1: DenseVector, v2: DenseVector, v3: DenseVector) => true
+  case (v1: SparseVector, v2: SparseVector, v3: SparseVector) => 
true
+  case _ => false
+}
+  ), "The vector type should be preserved after standardization.")
+
+assert((data1, data1RDD.collect()).zipped.forall((v1, v2) => v1 ~== v2 
absTol 1E-5))
--- End diff --

For each RDD, I just call twice of 

[GitHub] spark pull request: [SPARK-2505][MLlib] Weighted Regularizer for G...

2014-08-04 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1518#issuecomment-51151346
  
It's too late to get into 1.1, but I'll try to make it happen in 1.2. We'll use this in Alpine's implementation first.





[GitHub] spark pull request: [MLlib] Use this.type as return type in k-mean...

2014-08-05 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/1796

[MLlib] Use this.type as return type in k-means' builder pattern

to ensure that the returned object is the instance itself, so setter chains keep the most specific type.
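
A generic illustration of why `this.type` matters here (a sketch, not the k-means code itself): with a concrete return type, a subclass would lose its own type as soon as a base-class setter is chained.

```scala
class Builder {
  private var k: Int = 2
  // Returning this.type keeps the most specific type when setters are chained.
  def setK(value: Int): this.type = { this.k = value; this }
}

class FancyBuilder extends Builder {
  def setFancy(flag: Boolean): this.type = this
}

// Compiles only because setK returns FancyBuilder (this.type) rather than Builder:
val b = new FancyBuilder().setK(10).setFancy(true)
```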

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/AlpineNow/spark dbtsai-kmeans

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1796.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1796


commit 658989ef591ad28f891b275ccdc8137c5c180f46
Author: DB Tsai 
Date:   2014-08-06T01:30:32Z

Alpine Data Labs







[GitHub] spark pull request: [SPARK-2852][MLLIB] Separate model from IDF/St...

2014-08-06 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1814#discussion_r15908219
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala ---
@@ -35,38 +35,47 @@ import org.apache.spark.rdd.RDD
  * @param withStd True by default. Scales the data to unit standard 
deviation.
  */
 @Experimental
-class StandardScaler(withMean: Boolean, withStd: Boolean) extends 
VectorTransformer {
+class StandardScaler(withMean: Boolean, withStd: Boolean) {
 
   def this() = this(false, true)
 
   require(withMean || withStd, s"withMean and withStd both equal to false. 
Doing nothing.")
 
-  private var mean: BV[Double] = _
-  private var factor: BV[Double] = _
-
   /**
* Computes the mean and variance and stores as a model to be used for 
later scaling.
*
* @param data The data used to compute the mean and variance to build 
the transformation model.
-   * @return This StandardScalar object.
+   * @return a StandardScalarModel
*/
-  def fit(data: RDD[Vector]): this.type = {
+  def fit(data: RDD[Vector]): StandardScalerModel = {
 val summary = data.treeAggregate(new MultivariateOnlineSummarizer)(
   (aggregator, data) => aggregator.add(data),
   (aggregator1, aggregator2) => aggregator1.merge(aggregator2))
 
-mean = summary.mean.toBreeze
-factor = summary.variance.toBreeze
-require(mean.length == factor.length)
+val mean = summary.mean.toBreeze
+val factor = summary.variance.toBreeze
+require(mean.size == factor.size)
 
 var i = 0
-while (i < factor.length) {
+while (i < factor.size) {
   factor(i) = if (factor(i) != 0.0) 1.0 / math.sqrt(factor(i)) else 0.0
   i += 1
 }
 
-this
+new StandardScalerModel(withMean, withStd, mean, factor)
   }
+}
+
+/**
+ * :: Experimental ::
+ * Represents a StandardScaler model that can transform vectors.
+ */
+@Experimental
+class StandardScalerModel private[mllib] (
+val withMean: Boolean,
+val withStd: Boolean,
+val mean: BV[Double],
+val factor: BV[Double]) extends VectorTransformer {
 
--- End diff --

Since users may want to know the variance of the training set, should we have a constructor like this?

class StandardScalerModel private[mllib] (
    val withMean: Boolean,
    val withStd: Boolean,
    val mean: BV[Double],
    val variance: BV[Double]) {

  lazy val factor = {
    // 1 / sqrt(variance); zero-variance columns are mapped to 0.0.
    val temp = variance.copy
    var i = 0
    while (i < temp.size) {
      temp(i) = if (temp(i) != 0.0) 1.0 / math.sqrt(temp(i)) else 0.0
      i += 1
    }
    temp
  }
}





[GitHub] spark pull request: [SPARK-2852][MLLIB] Separate model from IDF/St...

2014-08-06 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1814#discussion_r15908318
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala ---
@@ -35,38 +35,47 @@ import org.apache.spark.rdd.RDD
  * @param withStd True by default. Scales the data to unit standard 
deviation.
  */
 @Experimental
-class StandardScaler(withMean: Boolean, withStd: Boolean) extends 
VectorTransformer {
+class StandardScaler(withMean: Boolean, withStd: Boolean) {
 
--- End diff --

This class is only used to keep the state of withMean and withStd. Is it possible to move those flags into the fit function by overloading it, and make this an object?





[GitHub] spark pull request: [SPARK-2852][MLLIB] Separate model from IDF/St...

2014-08-06 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1814#discussion_r15908504
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala ---
@@ -177,18 +115,72 @@ private object IDF {
 private def isEmpty: Boolean = m == 0L
 
 /** Returns the current IDF vector. */
-def idf(): BDV[Double] = {
+def idf(): Vector = {
   if (isEmpty) {
 throw new IllegalStateException("Haven't seen any document yet.")
   }
   val n = df.length
-  val inv = BDV.zeros[Double](n)
+  val inv = new Array[Double](n)
   var j = 0
   while (j < n) {
 inv(j) = math.log((m + 1.0)/ (df(j) + 1.0))
 j += 1
   }
-  inv
+  Vectors.dense(inv)
 }
   }
 }
+
+/**
+ * :: Experimental ::
+ * Represents an IDF model that can transform term frequency vectors.
+ */
+@Experimental
+class IDFModel private[mllib] (val idf: Vector) extends Serializable {
+
+  /**
+   * Transforms term frequency (TF) vectors to TF-IDF vectors.
+   * @param dataset an RDD of term frequency vectors
+   * @return an RDD of TF-IDF vectors
+   */
+  def transform(dataset: RDD[Vector]): RDD[Vector] = {
+val bcIdf = dataset.context.broadcast(idf)
+dataset.mapPartitions { iter =>
+  val thisIdf = bcIdf.value
+  iter.map { v =>
+val n = v.size
+v match {
+  case sv: SparseVector =>
+val nnz = sv.indices.size
+val newValues = new Array[Double](nnz)
+var k = 0
+while (k < nnz) {
+  newValues(k) = sv.values(k) * thisIdf(sv.indices(k))
+  k += 1
+}
+Vectors.sparse(n, sv.indices, newValues)
+  case dv: DenseVector =>
+val newValues = new Array[Double](n)
+var j = 0
+while (j < n) {
+  newValues(j) = dv.values(j) * thisIdf(j)
+  j += 1
+}
+Vectors.dense(newValues)
+  case other =>
+throw new UnsupportedOperationException(
--- End diff --

The following exception is used for unsupported vector types in appendBias and StandardScaler; maybe we could have a global definition of this in util.

case v => throw new IllegalArgumentException("Do not support vector type " + v.getClass)





[GitHub] spark pull request: [SPARK-2852][MLLIB] Separate model from IDF/St...

2014-08-07 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1814#issuecomment-51511617
  
LGTM. Merged into both master and branch-1.1. Thanks! 





[GitHub] spark pull request: [SPARK-2934][MLlib] Adding LogisticRegressionW...

2014-08-08 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/1862

[SPARK-2934][MLlib] Adding LogisticRegressionWithLBFGS Interface

for training with the L-BFGS optimizer, which converges faster than SGD.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/AlpineNow/spark dbtsai-lbfgs-lor

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1862.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1862


commit 3cf50c207e79c5f67cd5d06ff3f85f3538c23081
Author: DB Tsai 
Date:   2014-08-08T23:23:21Z

LogisticRegressionWithLBFGS interface







[GitHub] spark pull request: [SPARK-2934][MLlib] Adding LogisticRegressionW...

2014-08-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1862#discussion_r16022431
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -188,3 +188,98 @@ object LogisticRegressionWithSGD {
 train(input, numIterations, 1.0, 1.0)
   }
 }
+
+/**
+ * Train a classification model for Logistic Regression using 
Limited-memory BFGS.
+ * NOTE: Labels used in Logistic Regression should be {0, 1}
+ */
+class LogisticRegressionWithLBFGS private (
+private var convergenceTol: Double,
+private var maxNumIterations: Int,
+private var regParam: Double)
+  extends GeneralizedLinearAlgorithm[LogisticRegressionModel] with 
Serializable {
+
+  private val gradient = new LogisticGradient()
+  private val updater = new SimpleUpdater()
+  override val optimizer = new LBFGS(gradient, updater)
+.setNumCorrections(10)
+.setConvergenceTol(convergenceTol)
+.setMaxNumIterations(maxNumIterations)
+.setRegParam(regParam)
+
+  override protected val validators = 
List(DataValidators.binaryLabelValidator)
+
+  /**
+   * Construct a LogisticRegression object with default parameters
+   */
+  def this() = this(1E-4, 100, 0.0)
+
+  override protected def createModel(weights: Vector, intercept: Double) = 
{
+new LogisticRegressionModel(weights, intercept)
+  }
+}
+
+/**
+ * Top-level methods for calling Logistic Regression using Limited-memory 
BFGS.
+ * NOTE: Labels used in Logistic Regression should be {0, 1}
+ */
+object LogisticRegressionWithLBFGS {
--- End diff --

I don't mind this. However, it will cause an inconsistent API compared with LogisticRegressionWithSGD.
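
To make the consistency point concrete, here is a rough sketch of a static-style entry point mirroring LogisticRegressionWithSGD.train, using the setters proposed later in this thread; the helper name, signature, and defaults are my assumptions, not necessarily what was merged:

```scala
import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

object LogisticRegressionWithLBFGSHelper {
  // Configure the algorithm, run it on the input, and return the fitted model.
  def train(
      input: RDD[LabeledPoint],
      convergenceTol: Double = 1E-4,
      maxNumIterations: Int = 100): LogisticRegressionModel = {
    new LogisticRegressionWithLBFGS()
      .setConvergenceTol(convergenceTol)
      .setMaxNumIterations(maxNumIterations)
      .run(input)
  }
}
```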





[GitHub] spark pull request: [SPARK-2934][MLlib] Adding LogisticRegressionW...

2014-08-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1862#discussion_r16023077
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -188,3 +188,54 @@ object LogisticRegressionWithSGD {
 train(input, numIterations, 1.0, 1.0)
   }
 }
+
+/**
+ * Train a classification model for Logistic Regression using 
Limited-memory BFGS.
+ * NOTE: Labels used in Logistic Regression should be {0, 1}
+ */
+class LogisticRegressionWithLBFGS private (
+private var convergenceTol: Double,
+private var maxNumIterations: Int,
+private var regParam: Double)
+  extends GeneralizedLinearAlgorithm[LogisticRegressionModel] with 
Serializable {
+
+  private val gradient = new LogisticGradient()
+  private val updater = new SimpleUpdater()
+  // Have to be lazy since users can change the parameters after the class 
is created.
+  // PS, after the first train, the optimizer variable will be computed, 
so the parameters
+  // can not be changed anymore.
+  override lazy val optimizer = new LBFGS(gradient, updater)
+.setNumCorrections(10)
+.setConvergenceTol(convergenceTol)
+.setMaxNumIterations(maxNumIterations)
+.setRegParam(regParam)
+
+  override protected val validators = 
List(DataValidators.binaryLabelValidator)
+
+  /**
+   * Construct a LogisticRegression object with default parameters
+   */
+  def this() = this(1E-4, 100, 0.0)
+
+  /**
+   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
+   * Smaller value will lead to higher accuracy with the cost of more 
iterations.
+   */
+  def setConvergenceTol(tolerance: Double): this.type = {
+this.convergenceTol = tolerance
+this
+  }
+
+  /**
+   * Set the maximal number of iterations for L-BFGS. Default 100.
+   */
+  def setMaxNumIterations(iters: Int): this.type = {
--- End diff --

Agreed! Should we also change the API in the optimizer?





[GitHub] spark pull request: [SPARK-2934][MLlib] Adding LogisticRegressionW...

2014-08-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1862#discussion_r16023299
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -188,3 +188,54 @@ object LogisticRegressionWithSGD {
 train(input, numIterations, 1.0, 1.0)
   }
 }
+
+/**
+ * Train a classification model for Logistic Regression using 
Limited-memory BFGS.
+ * NOTE: Labels used in Logistic Regression should be {0, 1}
+ */
+class LogisticRegressionWithLBFGS private (
+private var convergenceTol: Double,
+private var maxNumIterations: Int,
+private var regParam: Double)
+  extends GeneralizedLinearAlgorithm[LogisticRegressionModel] with 
Serializable {
+
+  private val gradient = new LogisticGradient()
+  private val updater = new SimpleUpdater()
+  // Have to be lazy since users can change the parameters after the class 
is created.
+  // PS, after the first train, the optimizer variable will be computed, 
so the parameters
+  // can not be changed anymore.
+  override lazy val optimizer = new LBFGS(gradient, updater)
+.setNumCorrections(10)
+.setConvergenceTol(convergenceTol)
+.setMaxNumIterations(maxNumIterations)
+.setRegParam(regParam)
+
+  override protected val validators = 
List(DataValidators.binaryLabelValidator)
+
+  /**
+   * Construct a LogisticRegression object with default parameters
+   */
+  def this() = this(1E-4, 100, 0.0)
+
+  /**
+   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
+   * Smaller value will lead to higher accuracy with the cost of more 
iterations.
+   */
+  def setConvergenceTol(tolerance: Double): this.type = {
+this.convergenceTol = tolerance
+this
+  }
+
+  /**
+   * Set the maximal number of iterations for L-BFGS. Default 100.
+   */
+  def setMaxNumIterations(iters: Int): this.type = {
--- End diff --

LBFGS.setMaxNumIterations





[GitHub] spark pull request: [SPARK-2979][MLlib ]Improve the convergence ra...

2014-08-11 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/1897

[SPARK-2979][MLlib] Improve the convergence rate by minimizing the condition number

Scaling to minimize the condition number:

During the optimization process, the convergence rate depends on the condition number of the training dataset. Scaling the variables often reduces this condition number, thus improving the convergence rate dramatically. Without reducing the condition number, some training datasets that mix columns with very different scales may not be able to converge.

GLMNET and LIBSVM perform this scaling to reduce the condition number, and return the weights in the original scale. See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf

Here, if useFeatureScaling is enabled, we standardize the training features by dividing each column by its standard deviation (without subtracting the mean), and train the model in the scaled space. Then we transform the coefficients from the scaled space back to the original scale, as GLMNET and LIBSVM do; a minimal sketch of this back-transformation follows below.

Currently, it's only enabled in LogisticRegressionWithLBFGS.
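
A minimal sketch of the back-transformation step, assuming the features were scaled by dividing each column by its standard deviation; the helper name and signature are mine, not the PR's:

```scala
// If the model is trained on x_i / sigma_i, then w_scaled . (x / sigma) = (w_scaled / sigma) . x,
// so dividing each scaled weight by sigma_i recovers the weights in the original feature space.
def toOriginalScale(scaledWeights: Array[Double], featureStd: Array[Double]): Array[Double] =
  scaledWeights.zip(featureStd).map { case (w, sigma) =>
    if (sigma != 0.0) w / sigma else 0.0
  }
```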


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/AlpineNow/spark dbtsai-feature-scaling

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1897.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1897


commit 5257751cda9cd0cb284af06c81e1282e1bfb53f7
Author: DB Tsai 
Date:   2014-08-08T23:23:21Z

Improve the convergence rate by minimize the condition number in LOR with 
LBFGS







[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

2014-08-12 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1897#discussion_r16153527
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala
 ---
@@ -137,11 +154,45 @@ abstract class GeneralizedLinearAlgorithm[M <: 
GeneralizedLinearModel]
   throw new SparkException("Input validation failed.")
 }
 
+/**
+ * Scaling to minimize the condition number:
+ *
+ * During the optimization process, the convergence (rate) depends on 
the condition number of
+ * the training dataset. Scaling the variables often reduces this 
condition number, thus
+ * improving the convergence rate dramatically. Without reducing the 
condition number,
+ * some training datasets mixing the columns with different scales may 
not be able to converge.
+ *
+ * GLMNET and LIBSVM packages perform the scaling to reduce the 
condition number, and return
+ * the weights in the original scale.
+ * See page 9 in 
http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
+ *
+ * Here, if useFeatureScaling is enabled, we will standardize the 
training features by dividing
+ * the variance of each column (without subtracting the mean), and 
train the model in the
+ * scaled space. Then we transform the coefficients from the scaled 
space to the original scale
+ * as GLMNET and LIBSVM do.
+ *
+ * Currently, it's only enabled in LogisticRegressionWithLBFGS
+ */
+val scaler = if (useFeatureScaling) {
+  (new StandardScaler).fit(input.map(x => x.features))
+} else {
+  null
+}
+
 // Prepend an extra variable consisting of all 1.0's for the intercept.
 val data = if (addIntercept) {
-  input.map(labeledPoint => (labeledPoint.label, 
appendBias(labeledPoint.features)))
+  if(useFeatureScaling) {
+input.map(labeledPoint =>
+  (labeledPoint.label, 
appendBias(scaler.transform(labeledPoint.features
+  } else {
+input.map(labeledPoint => (labeledPoint.label, 
appendBias(labeledPoint.features)))
+  }
 } else {
-  input.map(labeledPoint => (labeledPoint.label, 
labeledPoint.features))
+  if (useFeatureScaling) {
+input.map(labeledPoint => (labeledPoint.label, 
scaler.transform(labeledPoint.features)))
+  } else {
+input.map(labeledPoint => (labeledPoint.label, 
labeledPoint.features))
--- End diff --

It's not an identity map. It's converting each LabeledPoint into a (label, feature vector) tuple for the optimizer.





[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

2014-08-13 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1897#issuecomment-52149135
  
Jenkins, test this please.





[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

2014-08-13 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1897#issuecomment-52149162
  
It seems that Jenkins is not stable; it's failing on issues related to Akka.





[GitHub] spark pull request: [SPARK-3078][MLLIB] Make LRWithLBFGS API consi...

2014-08-15 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1973#discussion_r16319946
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -69,8 +69,17 @@ class LBFGS(private var gradient: Gradient, private var 
updater: Updater)
 
   /**
* Set the maximal number of iterations for L-BFGS. Default 100.
+   * @deprecated use [[setNumIterations()]] instead
*/
+  @deprecated("use setNumIterations instead", "1.1.0")
   def setMaxNumIterations(iters: Int): this.type = {
+this.setNumCorrections(iters)
--- End diff --

Should it be 

this.setNumIterations(iters)





[GitHub] spark pull request: [SPARK-3078][MLLIB] Make LRWithLBFGS API consi...

2014-08-15 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1973#issuecomment-52381503
  
LGTM. Merged into both master and branch-1.1. Thanks!!





[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

2014-08-20 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/2068

[SPARK-2841][MLlib] Documentation for feature transformations

Documentation for newly added feature transformations:
1. TF-IDF
2. StandardScaler
3. Normalizer

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/AlpineNow/spark transformer-documentation

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2068.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2068


commit e339f64fbc35ad97a1ba021a6bf03bb6d0e06f31
Author: DB Tsai 
Date:   2014-08-20T22:21:26Z

documentation







[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

2014-08-21 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/2068#discussion_r16561045
  
--- Diff: docs/mllib-feature-extraction.md ---
@@ -70,4 +70,110 @@ for((synonym, cosineSimilarity) <- synonyms) {
 
 
 
-## TFIDF
\ No newline at end of file
+## TFIDF
+
+## StandardScaler
+
+Standardizes features by scaling to unit variance and/or removing the mean 
using column summary
+statistics on the samples in the training set. For example, RBF kernel of 
Support Vector Machines
+or the L1 and L2 regularized linear models typically assume that all 
features have unit variance
+and/or zero mean.
--- End diff --

How about I say:
"For example, the RBF kernel of Support Vector Machines or the L1 and L2 regularized linear models typically work better when all features have unit variance and/or zero mean."

I actually took this statement from the scikit-learn documentation:

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html








[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

2014-08-22 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/2068#issuecomment-53138329
  
@atalwalkar and @mengxr I just addressed the merge conflict. I think it's 
ready to merge. Thanks.





[GitHub] spark pull request #17078: [SPARK-19746][ML] Faster indexing for logistic ag...

2017-02-27 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/17078#discussion_r103154658
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1447,7 +1447,7 @@ private class LogisticAggregator(
   label: Double): Unit = {
 
 val localFeaturesStd = bcFeaturesStd.value
-val localCoefficients = bcCoefficients.value
+val localCoefficients = bcCoefficients.value.toArray
--- End diff --

In the first version of LOR, we had the following code, which avoids the issue you pointed out.

```scala
  private val weightsArray = weights match {
    case dv: DenseVector => dv.values
    case _ =>
      throw new IllegalArgumentException(
        s"weights only supports dense vector but got type ${weights.getClass}.")
  }
```

I think the older approach will be more efficient, since `toArray` is only called once (you can add a case for sparse), and for sparse initial coefficients we will not convert from sparse to dense again and again.

This can be future work. With L1 applied, the coefficients can be very sparse, so we could compress the coefficients at each iteration and have a specialized implementation for `UpdateInPlace`.
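
A rough sketch of the one-time extraction with a sparse case added; the object name is mine and the sparse branch is an assumption, not part of the quoted diff:

```scala
import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector}

object CoefficientUtils {
  // Pull out the backing array once per aggregator instead of on every add() call.
  def asArray(coefficients: Vector): Array[Double] = coefficients match {
    case dv: DenseVector => dv.values
    case sv: SparseVector => sv.toArray   // densify once up front
    case other =>
      throw new IllegalArgumentException(
        s"coefficients only supports dense or sparse vectors but got ${other.getClass}.")
  }
}
```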





[GitHub] spark issue #15349: [SPARK-17239][ML][DOC] Update user guide for multiclass ...

2016-10-05 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/15349
  
LGTM. Merged into master. Thanks.




