[GitHub] spark pull request: [SPARK-2852][MLLIB] Separate model from IDF/St...

2014-08-07 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1814#issuecomment-51511617
  
LGTM. Merged into both master and branch-1.1. Thanks! 





[GitHub] spark pull request: [SPARK-2934][MLlib] Adding LogisticRegressionW...

2014-08-08 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/1862

[SPARK-2934][MLlib] Adding LogisticRegressionWithLBFGS Interface

This adds an interface for training with the L-BFGS optimizer, which converges faster than SGD.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/AlpineNow/spark dbtsai-lbfgs-lor

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1862.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1862


commit 3cf50c207e79c5f67cd5d06ff3f85f3538c23081
Author: DB Tsai 
Date:   2014-08-08T23:23:21Z

LogisticRegressionWithLBFGS interface







[GitHub] spark pull request: [SPARK-2934][MLlib] Adding LogisticRegressionW...

2014-08-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1862#discussion_r16022431
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -188,3 +188,98 @@ object LogisticRegressionWithSGD {
 train(input, numIterations, 1.0, 1.0)
   }
 }
+
+/**
+ * Train a classification model for Logistic Regression using 
Limited-memory BFGS.
+ * NOTE: Labels used in Logistic Regression should be {0, 1}
+ */
+class LogisticRegressionWithLBFGS private (
+private var convergenceTol: Double,
+private var maxNumIterations: Int,
+private var regParam: Double)
+  extends GeneralizedLinearAlgorithm[LogisticRegressionModel] with 
Serializable {
+
+  private val gradient = new LogisticGradient()
+  private val updater = new SimpleUpdater()
+  override val optimizer = new LBFGS(gradient, updater)
+.setNumCorrections(10)
+.setConvergenceTol(convergenceTol)
+.setMaxNumIterations(maxNumIterations)
+.setRegParam(regParam)
+
+  override protected val validators = 
List(DataValidators.binaryLabelValidator)
+
+  /**
+   * Construct a LogisticRegression object with default parameters
+   */
+  def this() = this(1E-4, 100, 0.0)
+
+  override protected def createModel(weights: Vector, intercept: Double) = 
{
+new LogisticRegressionModel(weights, intercept)
+  }
+}
+
+/**
+ * Top-level methods for calling Logistic Regression using Limited-memory 
BFGS.
+ * NOTE: Labels used in Logistic Regression should be {0, 1}
+ */
+object LogisticRegressionWithLBFGS {
--- End diff --

I don't mind this. However, it will make the API inconsistent with 
LogisticRegressionWithSGD.





[GitHub] spark pull request: [SPARK-2934][MLlib] Adding LogisticRegressionW...

2014-08-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1862#discussion_r16023077
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -188,3 +188,54 @@ object LogisticRegressionWithSGD {
 train(input, numIterations, 1.0, 1.0)
   }
 }
+
+/**
+ * Train a classification model for Logistic Regression using 
Limited-memory BFGS.
+ * NOTE: Labels used in Logistic Regression should be {0, 1}
+ */
+class LogisticRegressionWithLBFGS private (
+private var convergenceTol: Double,
+private var maxNumIterations: Int,
+private var regParam: Double)
+  extends GeneralizedLinearAlgorithm[LogisticRegressionModel] with 
Serializable {
+
+  private val gradient = new LogisticGradient()
+  private val updater = new SimpleUpdater()
+  // Have to be lazy since users can change the parameters after the class 
is created.
+  // PS, after the first train, the optimizer variable will be computed, 
so the parameters
+  // can not be changed anymore.
+  override lazy val optimizer = new LBFGS(gradient, updater)
+.setNumCorrections(10)
+.setConvergenceTol(convergenceTol)
+.setMaxNumIterations(maxNumIterations)
+.setRegParam(regParam)
+
+  override protected val validators = 
List(DataValidators.binaryLabelValidator)
+
+  /**
+   * Construct a LogisticRegression object with default parameters
+   */
+  def this() = this(1E-4, 100, 0.0)
+
+  /**
+   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
+   * Smaller value will lead to higher accuracy with the cost of more 
iterations.
+   */
+  def setConvergenceTol(tolerance: Double): this.type = {
+this.convergenceTol = tolerance
+this
+  }
+
+  /**
+   * Set the maximal number of iterations for L-BFGS. Default 100.
+   */
+  def setMaxNumIterations(iters: Int): this.type = {
--- End diff --

Agreed! Should we also change the API in the optimizer?





[GitHub] spark pull request: [SPARK-2934][MLlib] Adding LogisticRegressionW...

2014-08-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1862#discussion_r16023299
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -188,3 +188,54 @@ object LogisticRegressionWithSGD {
 train(input, numIterations, 1.0, 1.0)
   }
 }
+
+/**
+ * Train a classification model for Logistic Regression using 
Limited-memory BFGS.
+ * NOTE: Labels used in Logistic Regression should be {0, 1}
+ */
+class LogisticRegressionWithLBFGS private (
+private var convergenceTol: Double,
+private var maxNumIterations: Int,
+private var regParam: Double)
+  extends GeneralizedLinearAlgorithm[LogisticRegressionModel] with 
Serializable {
+
+  private val gradient = new LogisticGradient()
+  private val updater = new SimpleUpdater()
+  // Have to be lazy since users can change the parameters after the class 
is created.
+  // PS, after the first train, the optimizer variable will be computed, 
so the parameters
+  // can not be changed anymore.
+  override lazy val optimizer = new LBFGS(gradient, updater)
+.setNumCorrections(10)
+.setConvergenceTol(convergenceTol)
+.setMaxNumIterations(maxNumIterations)
+.setRegParam(regParam)
+
+  override protected val validators = 
List(DataValidators.binaryLabelValidator)
+
+  /**
+   * Construct a LogisticRegression object with default parameters
+   */
+  def this() = this(1E-4, 100, 0.0)
+
+  /**
+   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
+   * Smaller value will lead to higher accuracy with the cost of more 
iterations.
+   */
+  def setConvergenceTol(tolerance: Double): this.type = {
+this.convergenceTol = tolerance
+this
+  }
+
+  /**
+   * Set the maximal number of iterations for L-BFGS. Default 100.
+   */
+  def setMaxNumIterations(iters: Int): this.type = {
--- End diff --

LBFGS.setMaxNumIterations





[GitHub] spark pull request: [SPARK-2979][MLlib ]Improve the convergence ra...

2014-08-11 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/1897

[SPARK-2979][MLlib] Improve the convergence rate by minimizing the condition 
number

Scaling to minimize the condition number:

During the optimization process, the convergence (rate) depends on the 
condition number of the training dataset. Scaling the variables often reduces 
this condition number, thus improving the convergence rate dramatically. Without 
reducing the condition number, some training datasets mixing columns with 
different scales may not be able to converge.

GLMNET and LIBSVM packages perform this scaling to reduce the condition 
number, and return the weights in the original scale.
See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf

Here, if useFeatureScaling is enabled, we will standardize the training 
features by dividing by the variance of each column (without subtracting the 
mean), and train the model in the scaled space. Then we transform the 
coefficients from the scaled space back to the original scale, as GLMNET and 
LIBSVM do.

Currently, it's only enabled in LogisticRegressionWithLBFGS.
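For illustration, a minimal sketch of the idea (not this PR's actual code): 
`trainInScaledSpace` stands in for the L-BFGS call, the per-column standard 
deviations come from Statistics.colStats, and all names are illustrative.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

// Sketch: scale each column (no mean subtraction), train in the scaled space,
// then map the learned weights back to the original scale, as GLMNET/LIBSVM do.
def trainWithFeatureScaling(
    data: RDD[LabeledPoint],
    trainInScaledSpace: RDD[LabeledPoint] => Array[Double]): Array[Double] = {
  val std = Statistics.colStats(data.map(_.features)).variance.toArray.map(math.sqrt)

  val scaled = data.map { p =>
    val v = p.features.toArray.zip(std).map { case (x, s) => if (s != 0.0) x / s else x }
    LabeledPoint(p.label, Vectors.dense(v))
  }

  val scaledWeights = trainInScaledSpace(scaled)

  // x_scaled(i) = x(i) / std(i), so w_original(i) = w_scaled(i) / std(i).
  scaledWeights.zip(std).map { case (w, s) => if (s != 0.0) w / s else w }
}
```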


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/AlpineNow/spark dbtsai-feature-scaling

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1897.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1897


commit 5257751cda9cd0cb284af06c81e1282e1bfb53f7
Author: DB Tsai 
Date:   2014-08-08T23:23:21Z

Improve the convergence rate by minimize the condition number in LOR with 
LBFGS







[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

2014-08-12 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1897#discussion_r16153527
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala
 ---
@@ -137,11 +154,45 @@ abstract class GeneralizedLinearAlgorithm[M <: 
GeneralizedLinearModel]
   throw new SparkException("Input validation failed.")
 }
 
+/**
+ * Scaling to minimize the condition number:
+ *
+ * During the optimization process, the convergence (rate) depends on 
the condition number of
+ * the training dataset. Scaling the variables often reduces this 
condition number, thus
+ * improving the convergence rate dramatically. Without reducing the 
condition number,
+ * some training datasets mixing the columns with different scales may 
not be able to converge.
+ *
+ * GLMNET and LIBSVM packages perform the scaling to reduce the 
condition number, and return
+ * the weights in the original scale.
+ * See page 9 in 
http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
+ *
+ * Here, if useFeatureScaling is enabled, we will standardize the 
training features by dividing
+ * the variance of each column (without subtracting the mean), and 
train the model in the
+ * scaled space. Then we transform the coefficients from the scaled 
space to the original scale
+ * as GLMNET and LIBSVM do.
+ *
+ * Currently, it's only enabled in LogisticRegressionWithLBFGS
+ */
+val scaler = if (useFeatureScaling) {
+  (new StandardScaler).fit(input.map(x => x.features))
+} else {
+  null
+}
+
 // Prepend an extra variable consisting of all 1.0's for the intercept.
 val data = if (addIntercept) {
-  input.map(labeledPoint => (labeledPoint.label, 
appendBias(labeledPoint.features)))
+  if(useFeatureScaling) {
+input.map(labeledPoint =>
+  (labeledPoint.label, 
appendBias(scaler.transform(labeledPoint.features
+  } else {
+input.map(labeledPoint => (labeledPoint.label, 
appendBias(labeledPoint.features)))
+  }
 } else {
-  input.map(labeledPoint => (labeledPoint.label, 
labeledPoint.features))
+  if (useFeatureScaling) {
+input.map(labeledPoint => (labeledPoint.label, 
scaler.transform(labeledPoint.features)))
+  } else {
+input.map(labeledPoint => (labeledPoint.label, 
labeledPoint.features))
--- End diff --

It's not an identity map. It converts each labeledPoint into a tuple of the 
response and the feature vector for the optimizer.
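In other words (a hedged sketch; `input` is the method's RDD[LabeledPoint] and 
the helper name is illustrative):

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Not an identity map: each LabeledPoint is reshaped into the (response, features)
// pair that optimize(data: RDD[(Double, Vector)], ...) expects.
def toOptimizerInput(input: RDD[LabeledPoint]): RDD[(Double, Vector)] =
  input.map(lp => (lp.label, lp.features))
```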





[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

2014-08-13 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1897#issuecomment-52149135
  
Jenkins, test this please.





[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

2014-08-13 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1897#issuecomment-52149162
  
It seems that Jenkins is not stable; it is failing on issues related to Akka.





[GitHub] spark pull request: [SPARK-3078][MLLIB] Make LRWithLBFGS API consi...

2014-08-15 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1973#discussion_r16319946
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -69,8 +69,17 @@ class LBFGS(private var gradient: Gradient, private var 
updater: Updater)
 
   /**
* Set the maximal number of iterations for L-BFGS. Default 100.
+   * @deprecated use [[setNumIterations()]] instead
*/
+  @deprecated("use setNumIterations instead", "1.1.0")
   def setMaxNumIterations(iters: Int): this.type = {
+this.setNumCorrections(iters)
--- End diff --

Should it be

this.setNumIterations(iters)
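For reference, a sketch of the suggested fix: the deprecated setter forwards 
to the new one (an illustrative stand-in class, not the actual LBFGS source).

```scala
// Illustrative stand-in for the optimizer's setters, showing the forwarding fix.
class IterationSettings {
  private var numIterations: Int = 100

  def setNumIterations(iters: Int): this.type = {
    require(iters >= 0, "Maximum number of iterations must be nonnegative")
    this.numIterations = iters
    this
  }

  /** Set the maximal number of iterations for L-BFGS. Default 100. */
  @deprecated("use setNumIterations instead", "1.1.0")
  def setMaxNumIterations(iters: Int): this.type = {
    this.setNumIterations(iters)  // forward to the new setter, not setNumCorrections
  }
}
```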





[GitHub] spark pull request: [SPARK-3078][MLLIB] Make LRWithLBFGS API consi...

2014-08-15 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1973#issuecomment-52381503
  
LGTM. Merged into both master and branch-1.1. Thanks!!





[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

2014-08-20 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/2068

[SPARK-2841][MLlib] Documentation for feature transformations

Documentation for newly added feature transformations:
1. TF-IDF
2. StandardScaler
3. Normalizer

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/AlpineNow/spark transformer-documentation

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2068.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2068


commit e339f64fbc35ad97a1ba021a6bf03bb6d0e06f31
Author: DB Tsai 
Date:   2014-08-20T22:21:26Z

documentation







[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

2014-08-21 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/2068#discussion_r16561045
  
--- Diff: docs/mllib-feature-extraction.md ---
@@ -70,4 +70,110 @@ for((synonym, cosineSimilarity) <- synonyms) {
 
 
 
-## TFIDF
\ No newline at end of file
+## TFIDF
+
+## StandardScaler
+
+Standardizes features by scaling to unit variance and/or removing the mean 
using column summary
+statistics on the samples in the training set. For example, RBF kernel of 
Support Vector Machines
+or the L1 and L2 regularized linear models typically assume that all 
features have unit variance
+and/or zero mean.
--- End diff --

How about I say:

"For example, the RBF kernel of Support Vector Machines or the L1 and L2 
regularized linear models typically work better when all features have unit 
variance and/or zero mean."

I actually took this statement from the scikit-learn documentation:

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
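A usage sketch of the StandardScaler being documented (the SparkContext, file 
path, and variable names are assumptions for illustration):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils

val sc: SparkContext = ???  // assumed to exist
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Unit variance only (default), and unit variance plus zero mean.
val scaler1 = new StandardScaler().fit(data.map(x => x.features))
val scaler2 = new StandardScaler(withMean = true, withStd = true).fit(data.map(x => x.features))

val data1 = data.map(x => (x.label, scaler1.transform(x.features)))
// Subtracting the mean produces dense output, so densify the sparse features first.
val data2 = data.map(x => (x.label, scaler2.transform(Vectors.dense(x.features.toArray))))
```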








[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

2014-08-22 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/2068#issuecomment-53138329
  
@atalwalkar and @mengxr I just addressed the merge conflict. I think it's 
ready to merge. Thanks.





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-10-28 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-60813678
  
@BigCrunsh I'm working on this. Let's see if we can merge it into Spark 1.2.





[GitHub] spark pull request: [SPARK-4129][MLlib] Performance tuning in Mult...

2014-10-28 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/2992

[SPARK-4129][MLlib] Performance tuning in MultivariateOnlineSummarizer

In MultivariateOnlineSummarizer, Breeze's activeIterator is used to loop 
through the non-zero elements in the vector. However, activeIterator doesn't 
perform well due to lots of overhead. In this PR, a native while loop is used for 
both DenseVector and SparseVector.

The benchmark result with 20 executors using the mnist8m dataset:

Before:
DenseVector: 48.2 seconds
SparseVector: 16.3 seconds

After:
DenseVector: 17.8 seconds
SparseVector: 11.2 seconds

Since MultivariateOnlineSummarizer is used in several places, the overall 
performance gain in the MLlib library will be significant with this PR.
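An illustrative sketch of the approach (not the PR's exact code): branch on the 
concrete vector type and walk the backing arrays with plain while loops instead 
of Breeze's activeIterator.

```scala
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}

def addSample(v: Vector, add: (Int, Double) => Unit): Unit = v match {
  case dv: DenseVector =>
    var i = 0
    val values = dv.values
    while (i < values.length) {
      if (values(i) != 0.0) add(i, values(i))  // skip explicit zeros
      i += 1
    }
  case sv: SparseVector =>
    var j = 0
    val indices = sv.indices
    val values = sv.values
    while (j < values.length) {
      if (values(j) != 0.0) add(indices(j), values(j))
      j += 1
    }
}
```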


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/AlpineNow/spark SPARK-4129

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2992.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2992


commit ebe3e74df70eb424aecc3170fc55008cfb6a76ec
Author: DB Tsai 
Date:   2014-10-29T05:42:50Z

First commit







[GitHub] spark pull request: [SPARK-4355][MLLIB] fix OnlineSummarizer.merge...

2014-11-12 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/3220#discussion_r20206271
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala
 ---
@@ -50,6 +50,29 @@ class MultivariateOnlineSummarizer extends 
MultivariateStatisticalSummary with S
   private var currMin: BDV[Double] = _
 
   /**
+   * Adds input value to position i.
+   */
+  private[this] def add(i: Int, value: Double) = {
+if (value != 0.0) {
--- End diff --

You can add it and get the same result. However, it's computationally 
cheaper if we don't add zeros into the summarizer.





[GitHub] spark pull request: [SPARK-4355][MLLIB] fix OnlineSummarizer.merge...

2014-11-12 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/3220#discussion_r20207949
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala
 ---
@@ -124,37 +128,28 @@ class MultivariateOnlineSummarizer extends 
MultivariateStatisticalSummary with S
   require(n == other.n, s"Dimensions mismatch when merging with 
another summarizer. " +
 s"Expecting $n but got ${other.n}.")
   totalCnt += other.totalCnt
-  val deltaMean: BDV[Double] = currMean - other.currMean
   var i = 0
   while (i < n) {
-// merge mean together
-if (other.currMean(i) != 0.0) {
-  currMean(i) = (currMean(i) * nnz(i) + other.currMean(i) * 
other.nnz(i)) /
-(nnz(i) + other.nnz(i))
-}
-// merge m2n together
-if (nnz(i) + other.nnz(i) != 0.0) {
-  currM2n(i) += other.currM2n(i) + deltaMean(i) * deltaMean(i) * 
nnz(i) * other.nnz(i) /
-(nnz(i) + other.nnz(i))
-}
-// merge m2 together
-if (nnz(i) + other.nnz(i) != 0.0) {
+val thisNnz = nnz(i)
+val otherNnz = other.nnz(i)
+val totalNnz = thisNnz + otherNnz
+if (totalNnz != 0.0) {
+  val deltaMean = other.currMean(i) - currMean(i)
+  // merge mean together
+  currMean(i) += deltaMean * otherNnz / totalNnz
--- End diff --

This looks good. It's more consistent with the previous notation for when we 
add a single sample.





[GitHub] spark pull request: [SPARK-4355][MLLIB] fix OnlineSummarizer.merge...

2014-11-12 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/3220#discussion_r20208266
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala
 ---
@@ -50,6 +50,29 @@ class MultivariateOnlineSummarizer extends 
MultivariateStatisticalSummary with S
   private var currMin: BDV[Double] = _
 
   /**
+   * Adds input value to position i.
+   */
+  private[this] def add(i: Int, value: Double) = {
+if (value != 0.0) {
--- End diff --

Yes. However, we know the total number of samples and the number of nonzeros in 
each column, so if the number of samples and the number of nonzeros differ, and 
the min we find is some positive number, then the actual min is zero, since 
there is a zero somewhere that we never added into the summarizer.

For max, the same logic applies.

For mean, we can correct for this effect with realMean(i) = currMean(i) * 
(nnz(i) / totalCnt)

As a result, for a sparse dataset, we only need to add the nonzero values into 
the summarizer, and it will be O(\bar{n}) instead of O(n), where \bar{n} is the 
average number of nonzero elements per sample.
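A self-contained sketch of those corrections (the field names mirror the 
summarizer's internals but are assumptions here):

```scala
// Per-column statistics accumulated over the nonzero entries only.
case class ColumnStats(currMean: Double, currMin: Double, currMax: Double,
                       nnz: Long, totalCnt: Long) {
  // currMean was averaged over the nnz nonzero values; rescale to all samples.
  def realMean: Double = currMean * (nnz.toDouble / totalCnt)

  // If some samples were zero and never added, zero is a candidate min/max.
  def realMin: Double = if (nnz < totalCnt && currMin > 0.0) 0.0 else currMin
  def realMax: Double = if (nnz < totalCnt && currMax < 0.0) 0.0 else currMax
}
```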





[GitHub] spark pull request: [SPARK-4355][MLLIB] fix OnlineSummarizer.merge...

2014-11-12 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/3220#issuecomment-62689770
  
LGTM. Thanks. 





[GitHub] spark pull request: [SPARK-4355][MLLIB] fix OnlineSummarizer.merge...

2014-11-12 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/3220#issuecomment-62694226
  
Thanks.





[GitHub] spark pull request: [SPARK-4348] [PySpark] [MLlib] rename random.p...

2014-11-13 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/3216#issuecomment-62856261
  
It works for me as well.

᚛ |activeIterator *|$ ./bin/pyspark
Python 2.7.6 (default, Sep  9 2014, 15:04:36) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Warning: SPARK_MEM is deprecated, please use a more specific config 
option
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.2.0-SNAPSHOT
  /_/

Using Python version 2.7.6 (default, Sep  9 2014 15:04:36)
SparkContext available as sc.
>>> from pyspark.mllib.feature import Word2Vec
>>> from pyspark.mllib.random import RandomRDDs






[GitHub] spark pull request: [SPARK-4431][MLlib] Implement efficient active...

2014-11-15 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/3288

[SPARK-4431][MLlib] Implement efficient activeIterator for dense and sparse 
vector

Previously, we were using Breeze's activeIterator to access the non-zero 
elements in sparse vectors, and explicitly skipping the zeros in dense/sparse 
vectors using pattern matching. Due to the overhead, we switched back to a 
native `while loop` in SPARK-4129.

However, SPARK-4129 requires de-referencing dv.values/sv.values on each access 
to a value, and the zeros in dense and sparse vectors, if they exist, are only 
skipped inside the add function call; the overall penalty is around 10% 
compared with de-referencing once outside the while block and checking for zero 
before calling the add function. The code is branched for dense and sparse 
vectors, and it's not easy to maintain in the long term.

Not only does this activeIterator implementation increase the performance, 
but the abstraction for accessing the non-zero elements of different 
vector types also helps the maintainability of the codebase. In this PR, 
only MultivariateOnlineSummarizer uses the new API as an example; 
others can be migrated to activeIterator later.

Benchmarking with the mnist8m dataset on a single JVM, 
with the first 200 samples loaded in memory, repeated 5000 times:

Before the change: 
Sparse Vector - 30.02
Dense Vector - 38.27

After this optimization:
Sparse Vector - 27.54
Dense Vector - 35.13
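A rough caller-side sketch of the abstraction (written here as a standalone 
helper, since the actual proposal adds a private[spark] method on the Vector 
trait; names are illustrative):

```scala
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}

// Expose the active (index, value) pairs of either vector type.
def activeIterator(v: Vector): Iterator[(Int, Double)] = v match {
  case dv: DenseVector  => dv.values.iterator.zipWithIndex.map { case (value, i) => (i, value) }
  case sv: SparseVector => sv.indices.iterator.zip(sv.values.iterator)
}

// Callers such as MultivariateOnlineSummarizer then stay vector-type agnostic:
//   activeIterator(sample).foreach { case (i, value) => add(i, value) }
```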


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/AlpineNow/spark activeIterator

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3288.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3288


commit 101c2eafb250b428f1b244e7f8057e63400f8f4e
Author: DB Tsai 
Date:   2014-11-13T07:08:13Z

Finished SPARK-4431







[GitHub] spark pull request: [SPARK-4431][MLlib] Implement efficient active...

2014-11-18 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/3288#discussion_r20532934
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala 
---
@@ -76,6 +76,22 @@ sealed trait Vector extends Serializable {
   def copy: Vector = {
 throw new NotImplementedError(s"copy is not implemented for 
${this.getClass}.")
   }
+
+  /**
+   * It will return the iterator for the active elements of dense and 
sparse vector as
+   * (index, value) pair. Note that foreach method can be overridden for 
better performance
+   * in different vector implementation.
+   *
+   * @param skippingZeros Skipping zero elements explicitly if true. It 
will be useful when we
+   *  iterator through dense vector having lots of 
zero elements which
+   *  we want to skip. Default is false.
+   * @return Iterator[(Int, Double)] where the first element in the tuple 
is the index,
+   * and the second element is the corresponding value.
+   */
+  private[spark] def activeIterator(skippingZeros: Boolean): 
Iterator[(Int, Double)]
--- End diff --

`skippingZeros` will be very useful in the `foreach` operation; if you use 
iterator -> filter -> foreach, it will not use the optimized `foreach`, which is 
implemented with a native while loop.
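A generic illustration of that point (a standalone sketch, not the PR's code): 
once `filter` wraps the iterator, the overridden while-loop `foreach` on the 
original object is bypassed and the generic Iterator.foreach runs instead.

```scala
val base: Iterator[(Int, Double)] = new Iterator[(Int, Double)] {
  private val values = Array(1.0, 0.0, 3.0)
  private var i = 0
  override def hasNext: Boolean = i < values.length
  override def next(): (Int, Double) = { val r = (i, values(i)); i += 1; r }
  // Optimized path: plain while loop, used only when foreach is called directly.
  override def foreach[U](f: ((Int, Double)) => U): Unit = {
    var j = 0
    while (j < values.length) { f((j, values(j))); j += 1 }
  }
}

base.filter(_._2 != 0.0).foreach(println)  // generic Iterator.foreach, override bypassed
```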





[GitHub] spark pull request: [SPARK-4431][MLlib] Implement efficient active...

2014-11-18 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/3288#discussion_r20533260
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala 
---
@@ -273,6 +289,47 @@ class DenseVector(val values: Array[Double]) extends 
Vector {
   override def copy: DenseVector = {
 new DenseVector(values.clone())
   }
+
+  private[spark] override def activeIterator(skippingZeros: Boolean) = new 
Iterator[(Int, Double)] {
+private var i = 0
+private val valuesSize = values.size
+
+// If zeros are asked to be explicitly skipped, the parent `size` 
method is called to count
+// the number of nonzero elements using `hasNext` and `next` methods.
+override lazy val size: Int = if (skippingZeros) super.size else 
valuesSize
+
+override def hasNext = {
+  if (skippingZeros) {
+var found = false
+while (!found && i < valuesSize) if (values(i) != 0.0) found = 
true else i += 1
+  }
+  i < valuesSize
+}
+
+override def next = {
+  val result = (i, values(i))
+  i += 1
+  result
+}
+
+override def foreach[@specialized(Unit) U](f: ((Int, Double)) => U) {
--- End diff --

Interesting. In Scala's Range code, they have

@inline final override def foreach[@specialized(Unit) U](f: Int => U)

I'll do a bytecode analysis and see if it generates the same bytecode.





[GitHub] spark pull request: [SPARK-4431][MLlib] Implement efficient active...

2014-11-18 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/3288#discussion_r20544650
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala 
---
@@ -273,6 +289,47 @@ class DenseVector(val values: Array[Double]) extends 
Vector {
   override def copy: DenseVector = {
 new DenseVector(values.clone())
   }
+
+  private[spark] override def activeIterator(skippingZeros: Boolean) = new 
Iterator[(Int, Double)] {
+private var i = 0
+private val valuesSize = values.size
+
+// If zeros are asked to be explicitly skipped, the parent `size` 
method is called to count
+// the number of nonzero elements using `hasNext` and `next` methods.
+override lazy val size: Int = if (skippingZeros) super.size else 
valuesSize
+
+override def hasNext = {
+  if (skippingZeros) {
+var found = false
+while (!found && i < valuesSize) if (values(i) != 0.0) found = 
true else i += 1
+  }
+  i < valuesSize
+}
+
+override def next = {
+  val result = (i, values(i))
+  i += 1
+  result
+}
+
+override def foreach[@specialized(Unit) U](f: ((Int, Double)) => U) {
--- End diff --

Okay, the generated bytecode of both approaches is the same.





[GitHub] spark pull request: [SPARK-4431][MLlib] Implement efficient active...

2014-11-18 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/3288#issuecomment-63566328
  
(PS: When I did the bytecode analysis, I found that accessing the 
member variables values and values.size requires two operations. 
By keeping a local copy of the reference so each access becomes a single call, 
there is another 8% performance gain. See 

http://stackoverflow.com/questions/6602922/is-it-faster-to-access-final-local-variables-than-class-variables-in-java
 for details.)






[GitHub] spark pull request: [SPARK-4431][MLlib] Implement efficient active...

2014-11-18 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/3288#discussion_r20553260
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala 
---
@@ -76,6 +76,22 @@ sealed trait Vector extends Serializable {
   def copy: Vector = {
 throw new NotImplementedError(s"copy is not implemented for 
${this.getClass}.")
   }
+
+  /**
+   * It will return the iterator for the active elements of dense and 
sparse vector as
+   * (index, value) pair. Note that foreach method can be overridden for 
better performance
+   * in different vector implementation.
+   *
+   * @param skippingZeros Skipping zero elements explicitly if true. It 
will be useful when we
+   *  iterator through dense vector having lots of 
zero elements which
+   *  we want to skip. Default is false.
+   * @return Iterator[(Int, Double)] where the first element in the tuple 
is the index,
+   * and the second element is the corresponding value.
+   */
+  private[spark] def activeIterator(skippingZeros: Boolean): 
Iterator[(Int, Double)]
--- End diff --

With the following code, 

    sample.activeIterator(false).foreach {
      case (index, value) => if (value != 0.0) add(index, value)
    }

it takes 61.809 for the dense vector and 54.626 for the sparse vector. 

The most expensive part is calling the anonymous function, even when the 
values are zero.






[GitHub] spark pull request: [SPARK-4431][MLlib] Implement efficient active...

2014-11-18 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/3288#discussion_r20554090
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala 
---
@@ -76,6 +76,22 @@ sealed trait Vector extends Serializable {
   def copy: Vector = {
 throw new NotImplementedError(s"copy is not implemented for 
${this.getClass}.")
   }
+
+  /**
+   * It will return the iterator for the active elements of dense and 
sparse vector as
+   * (index, value) pair. Note that foreach method can be overridden for 
better performance
+   * in different vector implementation.
+   *
+   * @param skippingZeros Skipping zero elements explicitly if true. It 
will be useful when we
+   *  iterator through dense vector having lots of 
zero elements which
+   *  we want to skip. Default is false.
+   * @return Iterator[(Int, Double)] where the first element in the tuple 
is the index,
+   * and the second element is the corresponding value.
+   */
+  private[spark] def activeIterator(skippingZeros: Boolean): 
Iterator[(Int, Double)]
--- End diff --

Okay, the issue is in the anonymous function. Basically, Scala will convert 
the primitive index: Int and value: Double into a boxed object in order to put 
them into a tuple. In my testing dataset, there are many explicit zeros, and 
even the zero values have to be converted into tuples before we reach the `if 
statement`. That's why it's dramatically faster if we do the `if statement` 
before calling the anonymous function. 

Changing the signature of `foreach` into 

    def foreach[@specialized(Unit) U](f: (Int, Double) => U)

to take two primitive variables would solve this problem, but it will not 
comply with the interface of `foreach`.
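Two standalone stand-ins to make the trade-off concrete (assumptions for 
illustration, not the actual Vector API):

```scala
// (a) Tuple-taking foreach: a Tuple2 is allocated for every element, even for
//     the entries the function body then ignores.
def foreachTuple(values: Array[Double])(f: ((Int, Double)) => Unit): Unit = {
  var i = 0
  while (i < values.length) { f((i, values(i))); i += 1 }
}

// (b) Two-argument foreach: the index and value are passed as primitives
//     (Function2 is specialized on Int/Double), so checking for zero inside f
//     costs no allocation; the downside is that the signature no longer matches
//     the standard foreach over (Int, Double) pairs.
def foreachPair(values: Array[Double])(f: (Int, Double) => Unit): Unit = {
  var i = 0
  while (i < values.length) { f(i, values(i)); i += 1 }
}
```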





[GitHub] spark pull request: [SPARK-4431][MLlib] Implement efficient active...

2014-11-19 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/3288#discussion_r20615000
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala 
---
@@ -76,6 +76,22 @@ sealed trait Vector extends Serializable {
   def copy: Vector = {
 throw new NotImplementedError(s"copy is not implemented for 
${this.getClass}.")
   }
+
+  /**
+   * It will return the iterator for the active elements of dense and 
sparse vector as
+   * (index, value) pair. Note that foreach method can be overridden for 
better performance
+   * in different vector implementation.
+   *
+   * @param skippingZeros Skipping zero elements explicitly if true. It 
will be useful when we
+   *  iterator through dense vector having lots of 
zero elements which
+   *  we want to skip. Default is false.
+   * @return Iterator[(Int, Double)] where the first element in the tuple 
is the index,
+   * and the second element is the corresponding value.
+   */
+  private[spark] def activeIterator(skippingZeros: Boolean): 
Iterator[(Int, Double)]
--- End diff --

You are right; the `Tuple2[Int, Double]` is specialized, and I mistakenly 
interpreted the bytecode. 
For the following Scala code,
```scala
def foreach[@specialized(Unit) U](f: ((Int, Double)) => U) {
  var i = 0
  val localValuesSize = values.size
  val localValues = values
  while (i < localValuesSize) {
f(i, localValues(i))
i += 1
  }
}
```
the generated bytecode will be
```
  public foreach(Lscala/Function1;)V
   L0
LINENUMBER 296 L0
ICONST_0
ISTORE 2
   L1
LINENUMBER 297 L1
GETSTATIC scala/Predef$.MODULE$ : Lscala/Predef$;
ALOAD 0
INVOKEVIRTUAL org/apache/spark/mllib/linalg/DenseVector.values ()[D
INVOKEVIRTUAL scala/Predef$.doubleArrayOps ([D)Lscala/collection/mutable/ArrayOps;
INVOKEINTERFACE scala/collection/mutable/ArrayOps.size ()I
ISTORE 3
   L2
LINENUMBER 298 L2
ALOAD 0
INVOKEVIRTUAL org/apache/spark/mllib/linalg/DenseVector.values ()[D
ASTORE 4
   L3
LINENUMBER 299 L3
   FRAME APPEND [I I [D]
ILOAD 2
ILOAD 3
IF_ICMPGE L4
   L5
LINENUMBER 300 L5
ALOAD 1
NEW scala/Tuple2$mcID$sp
DUP
ILOAD 2
ALOAD 4
ILOAD 2
DALOAD
INVOKESPECIAL scala/Tuple2$mcID$sp.<init> (ID)V
INVOKEINTERFACE scala/Function1.apply (Ljava/lang/Object;)Ljava/lang/Object;
POP
   L6
LINENUMBER 301 L6
ILOAD 2
ICONST_1
IADD
ISTORE 2
GOTO L3
```

However, 
```
INVOKESPECIAL scala/Tuple2$mcID$sp.<init> (ID)V
INVOKEINTERFACE scala/Function1.apply (Ljava/lang/Object;)Ljava/lang/Object;
```
is expensive, so that's why checking for zero inside the anonymous function 
slows down the whole thing. 

I agree with you: the iterator is slow by nature, and we are only 
interested in the foreach implementation. I'll remove the iterator and just have 
a foreach method in Vector.
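A foreach-only sketch of where this could land (the method and helper names are 
illustrative, and it is written as a standalone function rather than a method 
on Vector): take a two-argument function so no tuple is allocated per element, 
and keep local copies of the backing arrays as noted above.

```scala
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}

def foreachActive(v: Vector)(f: (Int, Double) => Unit): Unit = v match {
  case dv: DenseVector =>
    var i = 0
    val localValues = dv.values            // de-reference the field once
    while (i < localValues.length) { f(i, localValues(i)); i += 1 }
  case sv: SparseVector =>
    var j = 0
    val localIndices = sv.indices
    val localValues = sv.values
    while (j < localValues.length) { f(localIndices(j), localValues(j)); j += 1 }
}
```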





[GitHub] spark pull request: SPARK-1157 L-BFGS Optimizer based on Breeze L-...

2014-04-07 Thread dbtsai
Github user dbtsai closed the pull request at:

https://github.com/apache/spark/pull/53




[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

2014-04-07 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/353

SPARK-1157: L-BFGS Optimizer based on Breeze's implementation.

This PR uses Breeze's L-BFGS implementation, and the Breeze dependency has already 
been introduced by Xiangrui's sparse input format work in SPARK-1212. Nice 
work, @mengxr!

When used with a regularized updater, we need to compute regVal and 
regGradient (the gradient of the regularized part of the cost function), and with the 
current updater design, we can compute those two values in the following way.

Let's review how the updater works when returning newWeights given the input 
parameters:

    w' = w - thisIterStepSize * (gradient + regGradient(w))    // note that regGradient is a function of w!

If we set gradient = 0 and thisIterStepSize = 1, then

    regGradient(w) = w - w'

As a result, regVal can be computed by 

    val regVal = updater.compute(
      weights,
      new DoubleMatrix(initialWeights.length, 1), 0, 1, regParam)._2

and regGradient can be obtained by

    val regGradient = weights.sub(
      updater.compute(weights, new DoubleMatrix(initialWeights.length, 1), 1, 1, regParam)._1)
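The same trick, sketched against the Vector-based Updater API for readers of 
the current codebase (the snippet above uses the older DoubleMatrix interface; 
this version is an assumption-labeled rewrite, not this PR's code):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.optimization.Updater

def regValAndGradient(updater: Updater, weights: Vector, regParam: Double): (Double, Vector) = {
  val zeroGradient = Vectors.zeros(weights.size)
  // Step size 0 leaves the weights unchanged; the second return value is regVal(w).
  val regVal = updater.compute(weights, zeroGradient, 0, 1, regParam)._2
  // Step size 1 with a zero gradient yields w' = w - regGradient(w).
  val shifted = updater.compute(weights, zeroGradient, 1, 1, regParam)._1
  val regGradient = Vectors.dense(
    weights.toArray.zip(shifted.toArray).map { case (w, wPrime) => w - wPrime })
  (regVal, regGradient)
}
```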

The PR includes tests that compare the result with SGD, with and without 
regularization.

We did a comparison between L-BFGS and SGD, and we often saw 10x fewer
steps in L-BFGS while the cost per step is the same (just computing
the gradient).

The following is the paper by Prof. Ng at Stanford comparing different
optimizers including L-BFGS and SGD. They use them in the context of
deep learning, but it is worth referencing.
http://cs.stanford.edu/~jngiam/papers/LeNgiamCoatesLahiriProchnowNg2011.pdf

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark dbtsai-LBFGS

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/353.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #353


commit 60c83350bb77aa640edd290a26e2a20281b7a3a8
Author: DB Tsai 
Date:   2014-04-05T00:06:50Z

L-BFGS Optimizer based on Breeze's implementation.






[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

2014-04-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11404094
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -0,0 +1,251 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import scala.Array
+import scala.collection.mutable.ArrayBuffer
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, DiffFunction}
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+/**
+ * Class used to solve an optimization problem using Limited-memory BFGS.
+ * @param gradient Gradient function to be used.
+ * @param updater Updater to be used to update weights after every 
iteration.
+ */
+class LBFGS(var gradient: Gradient, var updater: Updater)
+  extends Optimizer with Logging
+{
+  private var numCorrections: Int = 10
+  private var lineSearchTolerance: Double = 0.9
+  private var convTolerance: Double = 1E-4
+  private var maxNumIterations: Int = 100
+  private var regParam: Double = 0.0
+  private var miniBatchFraction: Double = 1.0
+
+  /**
+   * Set the number of corrections used in the LBFGS update. Default 10.
+   * Values of m less than 3 are not recommended; large values of m
+   * will result in excessive computing time. 3 < m < 10 is recommended.
+   * Restriction: m > 0
+   */
+  def setNumCorrections(corrections: Int): this.type = {
+assert(corrections > 0)
+this.numCorrections = corrections
+this
+  }
+
+  /**
+   * Set the tolerance to control the accuracy of the line search in 
mcsrch step. Default 0.9.
+   * If the function and gradient evaluations are inexpensive with respect 
to the cost of
+   * the iteration (which is sometimes the case when solving very large 
problems) it may
+   * be advantageous to set to a small value. A typical small value is 0.1.
+   * Restriction: should be greater than 1e-4.
+   */
+  def setLineSearchTolerance(tolerance: Double): this.type = {
+this.lineSearchTolerance = tolerance
+this
+  }
+
+  /**
+   * Set fraction of data to be used for each L-BFGS iteration. Default 
1.0.
+   */
+  def setMiniBatchFraction(fraction: Double): this.type = {
+this.miniBatchFraction = fraction
+this
+  }
+
+  /**
+   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
+   * Smaller value will lead to higher accuracy with the cost of more 
iterations.
+   */
+  def setConvTolerance(tolerance: Int): this.type = {
+this.convTolerance = tolerance
+this
+  }
+
+  /**
+   * Set the maximal number of iterations for L-BFGS. Default 100.
+   */
+  def setMaxNumIterations(iters: Int): this.type = {
+this.maxNumIterations = iters
+this
+  }
+
+  /**
+   * Set the regularization parameter. Default 0.0.
+   */
+  def setRegParam(regParam: Double): this.type = {
+this.regParam = regParam
+this
+  }
+
+  /**
+   * Set the gradient function (of the loss function of one single data 
example)
+   * to be used for L-BFGS.
+   */
+  def setGradient(gradient: Gradient): this.type = {
+this.gradient = gradient
+this
+  }
+
+  /**
+   * Set the updater function to actually perform a gradient step in a 
given direction.
+   * The updater is responsible to perform the update from the 
regularization term as well,
+   * and therefore determines what kind or regularization is used, if any.
+   */
+  def setUpdater(updater: Updater): this.type = {
+this.updater = updater
+this
+  }
+
+  def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): 
Vector = {
+val (weights, _) = LBFGS

[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

2014-04-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11404515
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -0,0 +1,251 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import scala.Array
+import scala.collection.mutable.ArrayBuffer
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, DiffFunction}
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+/**
+ * Class used to solve an optimization problem using Limited-memory BFGS.
+ * @param gradient Gradient function to be used.
+ * @param updater Updater to be used to update weights after every iteration.
+ */
+class LBFGS(var gradient: Gradient, var updater: Updater)
+  extends Optimizer with Logging
+{
+  private var numCorrections: Int = 10
+  private var lineSearchTolerance: Double = 0.9
+  private var convTolerance: Double = 1E-4
+  private var maxNumIterations: Int = 100
+  private var regParam: Double = 0.0
+  private var miniBatchFraction: Double = 1.0
+
+  /**
+   * Set the number of corrections used in the LBFGS update. Default 10.
+   * Values of m less than 3 are not recommended; large values of m
+   * will result in excessive computing time. 3 < m < 10 is recommended.
+   * Restriction: m > 0
+   */
+  def setNumCorrections(corrections: Int): this.type = {
+assert(corrections > 0)
+this.numCorrections = corrections
+this
+  }
+
+  /**
+   * Set the tolerance to control the accuracy of the line search in mcsrch step. Default 0.9.
+   * If the function and gradient evaluations are inexpensive with respect to the cost of
+   * the iteration (which is sometimes the case when solving very large problems) it may
+   * be advantageous to set to a small value. A typical small value is 0.1.
+   * Restriction: should be greater than 1e-4.
+   */
+  def setLineSearchTolerance(tolerance: Double): this.type = {
+this.lineSearchTolerance = tolerance
+this
+  }
+
+  /**
+   * Set fraction of data to be used for each L-BFGS iteration. Default 1.0.
+   */
+  def setMiniBatchFraction(fraction: Double): this.type = {
+this.miniBatchFraction = fraction
+this
+  }
+
+  /**
+   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
+   * Smaller value will lead to higher accuracy with the cost of more iterations.
+   */
+  def setConvTolerance(tolerance: Double): this.type = {
+this.convTolerance = tolerance
+this
+  }
+
+  /**
+   * Set the maximal number of iterations for L-BFGS. Default 100.
+   */
+  def setMaxNumIterations(iters: Int): this.type = {
+this.maxNumIterations = iters
+this
+  }
+
+  /**
+   * Set the regularization parameter. Default 0.0.
+   */
+  def setRegParam(regParam: Double): this.type = {
+this.regParam = regParam
+this
+  }
+
+  /**
+   * Set the gradient function (of the loss function of one single data example)
+   * to be used for L-BFGS.
+   */
+  def setGradient(gradient: Gradient): this.type = {
+this.gradient = gradient
+this
+  }
+
+  /**
+   * Set the updater function to actually perform a gradient step in a given direction.
+   * The updater is responsible for performing the update from the regularization term as well,
+   * and therefore determines what kind of regularization is used, if any.
+   */
+  def setUpdater(updater: Updater): this.type = {
+this.updater = updater
+this
+  }
+
+  def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
+val (weights, _) = LBFGS

[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

2014-04-08 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/353#issuecomment-39895140
  
@mengxr As you suggested, I moved costFun to a private CostFun class.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

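For reference, a minimal sketch of what such a private CostFun wrapping the distributed loss/gradient computation in Breeze's DiffFunction can look like (this is not the PR's actual code; regularization and mini-batch sampling are omitted, and numExamples is assumed to be known):

    import breeze.linalg.{DenseVector => BDV}
    import breeze.optimize.DiffFunction
    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.mllib.optimization.Gradient

    // In the real code this class is nested inside the LBFGS object.
    private class CostFun(
        data: RDD[(Double, Vector)],
        gradient: Gradient,
        numExamples: Long) extends DiffFunction[BDV[Double]] {

      // Breeze's LBFGS calls calculate(weights) and expects (loss, gradient) back.
      override def calculate(weights: BDV[Double]): (Double, BDV[Double]) = {
        val w = Vectors.dense(weights.toArray)
        val (gradSum, lossSum) = data.map { case (label, features) =>
          val cumGradient = Vectors.dense(new Array[Double](weights.length))
          val loss = gradient.compute(features, label, w, cumGradient)
          (new BDV(cumGradient.toArray), loss)
        }.reduce { case ((g1, l1), (g2, l2)) => (g1 + g2, l1 + l2) }
        // Average the loss and gradient over the number of examples.
        (lossSum / numExamples, gradSum * (1.0 / numExamples))
      }
    }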

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11460767
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -0,0 +1,263 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import scala.Array
+import scala.collection.mutable.ArrayBuffer
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, DiffFunction}
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+/**
+ * Class used to solve an optimization problem using Limited-memory BFGS.
+ * @param gradient Gradient function to be used.
+ * @param updater Updater to be used to update weights after every iteration.
+ */
+class LBFGS(var gradient: Gradient, var updater: Updater)
+  extends Optimizer with Logging
+{
+  private var numCorrections: Int = 10
--- End diff --

@mengxr
I know. I pretty much followed the existing coding style in GradientDescent.scala.
Should I also change the one in the other place?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11461398
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -0,0 +1,263 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import scala.Array
+import scala.collection.mutable.ArrayBuffer
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, DiffFunction}
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+/**
+ * Class used to solve an optimization problem using Limited-memory BFGS.
+ * @param gradient Gradient function to be used.
+ * @param updater Updater to be used to update weights after every iteration.
+ */
+class LBFGS(var gradient: Gradient, var updater: Updater)
+  extends Optimizer with Logging
+{
+  private var numCorrections: Int = 10
+  private var lineSearchTolerance: Double = 0.9
+  private var convTolerance: Double = 1E-4
+  private var maxNumIterations: Int = 100
+  private var regParam: Double = 0.0
+  private var miniBatchFraction: Double = 1.0
+
+  /**
+   * Set the number of corrections used in the LBFGS update. Default 10.
+   * Values of m less than 3 are not recommended; large values of m
+   * will result in excessive computing time. 3 < m < 10 is recommended.
+   * Restriction: m > 0
+   */
+  def setNumCorrections(corrections: Int): this.type = {
+assert(corrections > 0)
+this.numCorrections = corrections
+this
+  }
+
+  /**
+   * Set the tolerance to control the accuracy of the line search in mcsrch step. Default 0.9.
+   * If the function and gradient evaluations are inexpensive with respect to the cost of
+   * the iteration (which is sometimes the case when solving very large problems) it may
+   * be advantageous to set to a small value. A typical small value is 0.1.
+   * Restriction: should be greater than 1e-4.
+   */
+  def setLineSearchTolerance(tolerance: Double): this.type = {
--- End diff --

Good catch! It's used in the RISO implementation. I'll just remove them. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11463764
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
@@ -0,0 +1,217 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.FunSuite
+import org.scalatest.matchers.ShouldMatchers
+
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+class LBFGSSuite extends FunSuite with BeforeAndAfterAll with ShouldMatchers {
+  @transient private var sc: SparkContext = _
+  var dataRDD:RDD[(Double, Vector)] = _
+
+  val nPoints = 1
+  val A = 2.0
+  val B = -1.5
+
+  val initialB = -1.0
+  val initialWeights = Array(initialB)
+
+  val gradient = new LogisticGradient()
+  val numCorrections = 10
+  val lineSearchTolerance = 0.9
+  var convTolerance = 1e-12
+  var maxNumIterations = 10
+  val miniBatchFrac = 1.0
+
+  val simpleUpdater = new SimpleUpdater()
+  val squaredL2Updater = new SquaredL2Updater()
+
+  // Add an extra variable consisting of all 1.0's for the intercept.
+  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
+  val data = testData.map { case LabeledPoint(label, features) =>
+label -> Vectors.dense(1.0, features.toArray: _*)
+  }
+
+  override def beforeAll() {
+sc = new SparkContext("local", "test")
+dataRDD = sc.parallelize(data, 2).cache()
+  }
+
+  override def afterAll() {
+sc.stop()
+System.clearProperty("spark.driver.port")
+  }
+
+  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
+math.abs(x - y) / (math.abs(y) + 1e-15) < tol
+  }
+
+  test("Assert LBFGS loss is decreasing and matches the result of Gradient 
Descent.") {
+val updater = new SimpleUpdater()
+val regParam = 0
+
+val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: _*)
+
+val (_, loss) = LBFGS.runMiniBatchLBFGS(
+  dataRDD,
+  gradient,
+  updater,
+  numCorrections,
+  lineSearchTolerance,
+  convTolerance,
+  maxNumIterations,
+  regParam,
+  miniBatchFrac,
+  initialWeightsWithIntercept)
+
+assert(loss.last - loss.head < 0, "loss isn't decreasing.")
+
+val lossDiff = loss.init.zip(loss.tail).map {
+  case (lhs, rhs) => lhs - rhs
+}
+assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
--- End diff --

This 0.8 bound is copied from GradientDescentSuite, and L-BFGS should at least have the same performance.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11464013
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
@@ -0,0 +1,217 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.FunSuite
+import org.scalatest.matchers.ShouldMatchers
+
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+class LBFGSSuite extends FunSuite with BeforeAndAfterAll with 
ShouldMatchers {
+  @transient private var sc: SparkContext = _
+  var dataRDD:RDD[(Double, Vector)] = _
+
+  val nPoints = 1
+  val A = 2.0
+  val B = -1.5
+
+  val initialB = -1.0
+  val initialWeights = Array(initialB)
+
+  val gradient = new LogisticGradient()
+  val numCorrections = 10
+  val lineSearchTolerance = 0.9
+  var convTolerance = 1e-12
+  var maxNumIterations = 10
+  val miniBatchFrac = 1.0
+
+  val simpleUpdater = new SimpleUpdater()
+  val squaredL2Updater = new SquaredL2Updater()
+
+  // Add an extra variable consisting of all 1.0's for the intercept.
+  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
+  val data = testData.map { case LabeledPoint(label, features) =>
+label -> Vectors.dense(1.0, features.toArray: _*)
+  }
+
+  override def beforeAll() {
+sc = new SparkContext("local", "test")
+dataRDD = sc.parallelize(data, 2).cache()
+  }
+
+  override def afterAll() {
+sc.stop()
+System.clearProperty("spark.driver.port")
+  }
+
+  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
+math.abs(x - y) / (math.abs(y) + 1e-15) < tol
+  }
+
+  test("Assert LBFGS loss is decreasing and matches the result of Gradient 
Descent.") {
+val updater = new SimpleUpdater()
+val regParam = 0
+
+val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: 
_*)
+
+val (_, loss) = LBFGS.runMiniBatchLBFGS(
+  dataRDD,
+  gradient,
+  updater,
+  numCorrections,
+  lineSearchTolerance,
+  convTolerance,
+  maxNumIterations,
+  regParam,
+  miniBatchFrac,
+  initialWeightsWithIntercept)
+
+assert(loss.last - loss.head < 0, "loss isn't decreasing.")
+
+val lossDiff = loss.init.zip(loss.tail).map {
+  case (lhs, rhs) => lhs - rhs
+}
+assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
+
+val stepSize = 1.0
+// Well, GD converges slower, so it requires more iterations!
+val numGDIterations = 50
+val (_, lossGD) = GradientDescent.runMiniBatchSGD(
+  dataRDD,
+  gradient,
+  updater,
+  stepSize,
+  numGDIterations,
+  regParam,
+  miniBatchFrac,
+  initialWeightsWithIntercept)
+
+assert(Math.abs((lossGD.last - loss.last) / loss.last) < 0.05,
+  "LBFGS should match GD result within 5% error.")
--- End diff --

It's a rough number off the top of my head; I just did a quick comparison.
With 10 iterations of L-BFGS, SGD needs 40 iterations to get within a 2% difference,
and 90 iterations to get within a 1% difference.
In all of the tests, L-BFGS gives a smaller loss.
As a result, you can see how slowly SGD converges when the number of iterations is high.
Here, I'll use 2% to make the test run faster.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

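In other words, the check under discussion is a relative-difference assertion on the final losses; a sketch (assuming loss and lossGD are the L-BFGS and GD loss histories from the quoted test), where 0.02 encodes the 2% bound mentioned here and 0.05 the 5% bound actually used:

    val relDiff = math.abs((lossGD.last - loss.last) / loss.last)
    assert(relDiff < 0.02, s"relative difference $relDiff between GD and L-BFGS exceeds 2%")
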
[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11464121
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
@@ -0,0 +1,217 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.FunSuite
+import org.scalatest.matchers.ShouldMatchers
+
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+class LBFGSSuite extends FunSuite with BeforeAndAfterAll with 
ShouldMatchers {
+  @transient private var sc: SparkContext = _
+  var dataRDD:RDD[(Double, Vector)] = _
+
+  val nPoints = 1
+  val A = 2.0
+  val B = -1.5
+
+  val initialB = -1.0
+  val initialWeights = Array(initialB)
+
+  val gradient = new LogisticGradient()
+  val numCorrections = 10
+  val lineSearchTolerance = 0.9
+  var convTolerance = 1e-12
+  var maxNumIterations = 10
+  val miniBatchFrac = 1.0
+
+  val simpleUpdater = new SimpleUpdater()
+  val squaredL2Updater = new SquaredL2Updater()
+
+  // Add an extra variable consisting of all 1.0's for the intercept.
+  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
+  val data = testData.map { case LabeledPoint(label, features) =>
+label -> Vectors.dense(1.0, features.toArray: _*)
+  }
+
+  override def beforeAll() {
+sc = new SparkContext("local", "test")
+dataRDD = sc.parallelize(data, 2).cache()
+  }
+
+  override def afterAll() {
+sc.stop()
+System.clearProperty("spark.driver.port")
+  }
+
+  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
+math.abs(x - y) / (math.abs(y) + 1e-15) < tol
+  }
+
+  test("Assert LBFGS loss is decreasing and matches the result of Gradient 
Descent.") {
+val updater = new SimpleUpdater()
+val regParam = 0
+
+val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: 
_*)
+
+val (_, loss) = LBFGS.runMiniBatchLBFGS(
+  dataRDD,
+  gradient,
+  updater,
+  numCorrections,
+  lineSearchTolerance,
+  convTolerance,
+  maxNumIterations,
+  regParam,
+  miniBatchFrac,
+  initialWeightsWithIntercept)
+
+assert(loss.last - loss.head < 0, "loss isn't decreasing.")
+
+val lossDiff = loss.init.zip(loss.tail).map {
+  case (lhs, rhs) => lhs - rhs
+}
+assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
+
+val stepSize = 1.0
+// Well, GD converges slower, so it requires more iterations!
+val numGDIterations = 50
+val (_, lossGD) = GradientDescent.runMiniBatchSGD(
+  dataRDD,
+  gradient,
+  updater,
+  stepSize,
+  numGDIterations,
+  regParam,
+  miniBatchFrac,
+  initialWeightsWithIntercept)
+
+assert(Math.abs((lossGD.last - loss.last) / loss.last) < 0.05,
+  "LBFGS should match GD result within 5% error.")
--- End diff --

I added the comment in the code as
// GD converges much more slowly than L-BFGS. To achieve a 1% difference,
// it requires 90 iterations in GD. No matter how much we increase
// the number of iterations in GD here, lossGD will always be
// larger than lossLBFGS.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11464280
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
@@ -0,0 +1,217 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.FunSuite
+import org.scalatest.matchers.ShouldMatchers
+
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+class LBFGSSuite extends FunSuite with BeforeAndAfterAll with 
ShouldMatchers {
+  @transient private var sc: SparkContext = _
+  var dataRDD:RDD[(Double, Vector)] = _
+
+  val nPoints = 1
+  val A = 2.0
+  val B = -1.5
+
+  val initialB = -1.0
+  val initialWeights = Array(initialB)
+
+  val gradient = new LogisticGradient()
+  val numCorrections = 10
+  val lineSearchTolerance = 0.9
+  var convTolerance = 1e-12
+  var maxNumIterations = 10
+  val miniBatchFrac = 1.0
+
+  val simpleUpdater = new SimpleUpdater()
+  val squaredL2Updater = new SquaredL2Updater()
+
+  // Add an extra variable consisting of all 1.0's for the intercept.
+  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
+  val data = testData.map { case LabeledPoint(label, features) =>
+label -> Vectors.dense(1.0, features.toArray: _*)
+  }
+
+  override def beforeAll() {
+sc = new SparkContext("local", "test")
+dataRDD = sc.parallelize(data, 2).cache()
+  }
+
+  override def afterAll() {
+sc.stop()
+System.clearProperty("spark.driver.port")
+  }
+
+  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
+math.abs(x - y) / (math.abs(y) + 1e-15) < tol
+  }
+
+  test("Assert LBFGS loss is decreasing and matches the result of Gradient 
Descent.") {
+val updater = new SimpleUpdater()
+val regParam = 0
+
+val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: 
_*)
+
+val (_, loss) = LBFGS.runMiniBatchLBFGS(
+  dataRDD,
+  gradient,
+  updater,
+  numCorrections,
+  lineSearchTolerance,
+  convTolerance,
+  maxNumIterations,
+  regParam,
+  miniBatchFrac,
+  initialWeightsWithIntercept)
+
+assert(loss.last - loss.head < 0, "loss isn't decreasing.")
+
+val lossDiff = loss.init.zip(loss.tail).map {
+  case (lhs, rhs) => lhs - rhs
+}
+assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
+
+val stepSize = 1.0
+// Well, GD converges slower, so it requires more iterations!
+val numGDIterations = 50
+val (_, lossGD) = GradientDescent.runMiniBatchSGD(
+  dataRDD,
+  gradient,
+  updater,
+  stepSize,
+  numGDIterations,
+  regParam,
+  miniBatchFrac,
+  initialWeightsWithIntercept)
+
+assert(Math.abs((lossGD.last - loss.last) / loss.last) < 0.05,
+  "LBFGS should match GD result within 5% error.")
+  }
+
+  test("Assert that LBFGS and Gradient Descent with L2 regularization get 
the same result.") {
+val regParam = 0.2
+
+// Prepare another non-zero weights to compare the loss in the first 
iteration.
+val initialWeightsWithIntercept = Vectors.dense(0.3, 0.12)
+
+val (weightLBFGS, lossLBFGS) = LBFGS.runMiniBatchLBFGS(
+  dataRDD,
+  gradient,
+  squaredL2Updater,
+  numCorrections,
+  lineSearchTolerance,
+  convTolerance,
+  maxNumIterations,
+  regParam,
+  miniBatchFrac,
  

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11464736
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
@@ -0,0 +1,217 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.FunSuite
+import org.scalatest.matchers.ShouldMatchers
+
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+class LBFGSSuite extends FunSuite with BeforeAndAfterAll with 
ShouldMatchers {
+  @transient private var sc: SparkContext = _
+  var dataRDD:RDD[(Double, Vector)] = _
+
+  val nPoints = 1
+  val A = 2.0
+  val B = -1.5
+
+  val initialB = -1.0
+  val initialWeights = Array(initialB)
+
+  val gradient = new LogisticGradient()
+  val numCorrections = 10
+  val lineSearchTolerance = 0.9
+  var convTolerance = 1e-12
+  var maxNumIterations = 10
+  val miniBatchFrac = 1.0
+
+  val simpleUpdater = new SimpleUpdater()
+  val squaredL2Updater = new SquaredL2Updater()
+
+  // Add an extra variable consisting of all 1.0's for the intercept.
+  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
+  val data = testData.map { case LabeledPoint(label, features) =>
+label -> Vectors.dense(1.0, features.toArray: _*)
+  }
+
+  override def beforeAll() {
+sc = new SparkContext("local", "test")
+dataRDD = sc.parallelize(data, 2).cache()
+  }
+
+  override def afterAll() {
+sc.stop()
+System.clearProperty("spark.driver.port")
+  }
+
+  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
+math.abs(x - y) / (math.abs(y) + 1e-15) < tol
+  }
+
+  test("Assert LBFGS loss is decreasing and matches the result of Gradient 
Descent.") {
+val updater = new SimpleUpdater()
+val regParam = 0
+
+val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: 
_*)
+
+val (_, loss) = LBFGS.runMiniBatchLBFGS(
+  dataRDD,
+  gradient,
+  updater,
+  numCorrections,
+  lineSearchTolerance,
+  convTolerance,
+  maxNumIterations,
+  regParam,
+  miniBatchFrac,
+  initialWeightsWithIntercept)
+
+assert(loss.last - loss.head < 0, "loss isn't decreasing.")
+
+val lossDiff = loss.init.zip(loss.tail).map {
+  case (lhs, rhs) => lhs - rhs
+}
+assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
+
+val stepSize = 1.0
+// Well, GD converges slower, so it requires more iterations!
+val numGDIterations = 50
+val (_, lossGD) = GradientDescent.runMiniBatchSGD(
+  dataRDD,
+  gradient,
+  updater,
+  stepSize,
+  numGDIterations,
+  regParam,
+  miniBatchFrac,
+  initialWeightsWithIntercept)
+
+assert(Math.abs((lossGD.last - loss.last) / loss.last) < 0.05,
+  "LBFGS should match GD result within 5% error.")
+  }
+
+  test("Assert that LBFGS and Gradient Descent with L2 regularization get 
the same result.") {
+val regParam = 0.2
+
+// Prepare another non-zero weights to compare the loss in the first 
iteration.
+val initialWeightsWithIntercept = Vectors.dense(0.3, 0.12)
+
+val (weightLBFGS, lossLBFGS) = LBFGS.runMiniBatchLBFGS(
+  dataRDD,
+  gradient,
+  squaredL2Updater,
+  numCorrections,
+  lineSearchTolerance,
+  convTolerance,
+  maxNumIterations,
+  regParam,
+  miniBatchFrac,
  

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-10 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11521070
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
@@ -0,0 +1,209 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.FunSuite
+import org.scalatest.matchers.ShouldMatchers
+
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.util.LocalSparkContext
+
+class LBFGSSuite extends FunSuite with BeforeAndAfterAll with LocalSparkContext with ShouldMatchers {
+
+  val nPoints = 1
+  val A = 2.0
+  val B = -1.5
+
+  val initialB = -1.0
+  val initialWeights = Array(initialB)
+
+  val gradient = new LogisticGradient()
+  val numCorrections = 10
+  var convergenceTol = 1e-12
+  var maxNumIterations = 10
+  val miniBatchFrac = 1.0
+
+  val simpleUpdater = new SimpleUpdater()
+  val squaredL2Updater = new SquaredL2Updater()
+
+  // Add an extra variable consisting of all 1.0's for the intercept.
+  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
+  val data = testData.map { case LabeledPoint(label, features) =>
+label -> Vectors.dense(1.0, features.toArray: _*)
+  }
+
+  lazy val dataRDD = sc.parallelize(data, 2).cache()
+
+  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
+math.abs(x - y) / (math.abs(y) + 1e-15) < tol
+  }
+
+  test("LBFGS loss should be decreasing and match the result of Gradient 
Descent.") {
+val updater = new SimpleUpdater()
+val regParam = 0
+
+val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: _*)
+
+val (_, loss) = LBFGS.runMiniBatchLBFGS(
+  dataRDD,
+  gradient,
+  updater,
+  numCorrections,
+  convergenceTol,
+  maxNumIterations,
+  regParam,
+  miniBatchFrac,
+  initialWeightsWithIntercept)
+
+assert(loss.last - loss.head < 0, "loss isn't decreasing.")
+
+val lossDiff = loss.init.zip(loss.tail).map {
+  case (lhs, rhs) => lhs - rhs
+}
+// This 0.8 bound is copied from GradientDescentSuite, and L-BFGS should
+// at least have the same performance. It's based on observation, not theoretically guaranteed.
+assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
--- End diff --

You are right. Since the cost function is convex, the loss is guaranteed to decrease monotonically with the L-BFGS optimizer. (SGD doesn't guarantee this, and the loss may fluctuate during the optimization process.) I will add a test for this property.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

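A sketch of what that monotonicity test can look like (assuming loss is the loss history returned by LBFGS.runMiniBatchLBFGS as quoted above):

    // Every consecutive pair of losses should be non-increasing for a convex
    // objective optimized with L-BFGS.
    val monotone = loss.init.zip(loss.tail).forall { case (prev, next) => prev >= next }
    assert(monotone, "loss should decrease monotonically with L-BFGS")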

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-14 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11604731
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -0,0 +1,259 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import scala.collection.mutable.ArrayBuffer
+
+import breeze.linalg.{DenseVector => BDV, axpy}
+import breeze.optimize.{CachedDiffFunction, DiffFunction}
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+/**
+ * Class used to solve an optimization problem using Limited-memory BFGS.
+ * Reference: [[http://en.wikipedia.org/wiki/Limited-memory_BFGS]]
+ * @param gradient Gradient function to be used.
+ * @param updater Updater to be used to update weights after every iteration.
+ */
+class LBFGS(private var gradient: Gradient, private var updater: Updater)
+  extends Optimizer with Logging {
+
+  private var numCorrections = 10
+  private var convergenceTol = 1E-4
+  private var maxNumIterations = 100
+  private var regParam = 0.0
+  private var miniBatchFraction = 1.0
+
+  /**
+   * Set the number of corrections used in the LBFGS update. Default 10.
+   * Values of numCorrections less than 3 are not recommended; large values
+   * of numCorrections will result in excessive computing time.
+   * 3 < numCorrections < 10 is recommended.
+   * Restriction: numCorrections > 0
+   */
+  def setNumCorrections(corrections: Int): this.type = {
+assert(corrections > 0)
+this.numCorrections = corrections
+this
+  }
+
+  /**
+   * Set fraction of data to be used for each L-BFGS iteration. Default 1.0.
+   */
+  def setMiniBatchFraction(fraction: Double): this.type = {
+this.miniBatchFraction = fraction
+this
+  }
+
+  /**
+   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
+   * Smaller value will lead to higher accuracy with the cost of more iterations.
+   */
+  def setConvergenceTol(tolerance: Double): this.type = {
+this.convergenceTol = tolerance
+this
+  }
+
+  /**
+   * Set the maximal number of iterations for L-BFGS. Default 100.
+   */
+  def setMaxNumIterations(iters: Int): this.type = {
+this.maxNumIterations = iters
+this
+  }
+
+  /**
+   * Set the regularization parameter. Default 0.0.
+   */
+  def setRegParam(regParam: Double): this.type = {
+this.regParam = regParam
+this
+  }
+
+  /**
+   * Set the gradient function (of the loss function of one single data example)
+   * to be used for L-BFGS.
+   */
+  def setGradient(gradient: Gradient): this.type = {
+this.gradient = gradient
+this
+  }
+
+  /**
+   * Set the updater function to actually perform a gradient step in a given direction.
+   * The updater is responsible for performing the update from the regularization term as well,
+   * and therefore determines what kind of regularization is used, if any.
+   */
+  def setUpdater(updater: Updater): this.type = {
+this.updater = updater
+this
+  }
+
+  override def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
+val (weights, _) = LBFGS.runMiniBatchLBFGS(
+  data,
+  gradient,
+  updater,
+  numCorrections,
+  convergenceTol,
+  maxNumIterations,
+  regParam,
+  miniBatchFraction,
+  initialWeights)
+weights
+  }
+
+}
+
+/**
+ * Top-level method to run LBFGS.
+ */
+object LBFGS extends Logging {
+  /**
+   * Run Limited-memory BFGS (L-BFGS) in parallel using mini batches.
+   * In each iteration, we sample a subset (fraction miniBatchFracti

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-14 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11605030
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -0,0 +1,259 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import scala.collection.mutable.ArrayBuffer
+
+import breeze.linalg.{DenseVector => BDV, axpy}
+import breeze.optimize.{CachedDiffFunction, DiffFunction}
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+/**
+ * Class used to solve an optimization problem using Limited-memory BFGS.
+ * Reference: [[http://en.wikipedia.org/wiki/Limited-memory_BFGS]]
+ * @param gradient Gradient function to be used.
+ * @param updater Updater to be used to update weights after every 
iteration.
+ */
+class LBFGS(private var gradient: Gradient, private var updater: Updater)
+  extends Optimizer with Logging {
+
+  private var numCorrections = 10
+  private var convergenceTol = 1E-4
+  private var maxNumIterations = 100
+  private var regParam = 0.0
+  private var miniBatchFraction = 1.0
+
+  /**
+   * Set the number of corrections used in the LBFGS update. Default 10.
+   * Values of numCorrections less than 3 are not recommended; large values
+   * of numCorrections will result in excessive computing time.
+   * 3 < numCorrections < 10 is recommended.
+   * Restriction: numCorrections > 0
+   */
+  def setNumCorrections(corrections: Int): this.type = {
+assert(corrections > 0)
+this.numCorrections = corrections
+this
+  }
+
+  /**
+   * Set fraction of data to be used for each L-BFGS iteration. Default 
1.0.
+   */
+  def setMiniBatchFraction(fraction: Double): this.type = {
+this.miniBatchFraction = fraction
+this
+  }
+
+  /**
+   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
+   * Smaller value will lead to higher accuracy with the cost of more 
iterations.
+   */
+  def setConvergenceTol(tolerance: Double): this.type = {
+this.convergenceTol = tolerance
+this
+  }
+
+  /**
+   * Set the maximal number of iterations for L-BFGS. Default 100.
+   */
+  def setMaxNumIterations(iters: Int): this.type = {
+this.maxNumIterations = iters
+this
+  }
+
+  /**
+   * Set the regularization parameter. Default 0.0.
+   */
+  def setRegParam(regParam: Double): this.type = {
+this.regParam = regParam
+this
+  }
+
+  /**
+   * Set the gradient function (of the loss function of one single data 
example)
+   * to be used for L-BFGS.
+   */
+  def setGradient(gradient: Gradient): this.type = {
+this.gradient = gradient
+this
+  }
+
+  /**
+   * Set the updater function to actually perform a gradient step in a given direction.
+   * The updater is responsible for performing the update from the regularization term as well,
+   * and therefore determines what kind of regularization is used, if any.
+   */
+  def setUpdater(updater: Updater): this.type = {
+this.updater = updater
+this
+  }
+
+  override def optimize(data: RDD[(Double, Vector)], initialWeights: 
Vector): Vector = {
+val (weights, _) = LBFGS.runMiniBatchLBFGS(
+  data,
+  gradient,
+  updater,
+  numCorrections,
+  convergenceTol,
+  maxNumIterations,
+  regParam,
+  miniBatchFraction,
+  initialWeights)
+weights
+  }
+
+}
+
+/**
+ * Top-level method to run LBFGS.
+ */
+object LBFGS extends Logging {
+  /**
+   * Run Limited-memory BFGS (L-BFGS) in parallel using mini batches.
+   * In each iteration, we sample a subset (fraction miniBatchFracti

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-14 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11605070
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -0,0 +1,259 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import scala.collection.mutable.ArrayBuffer
+
+import breeze.linalg.{DenseVector => BDV, axpy}
+import breeze.optimize.{CachedDiffFunction, DiffFunction}
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+/**
+ * Class used to solve an optimization problem using Limited-memory BFGS.
+ * Reference: [[http://en.wikipedia.org/wiki/Limited-memory_BFGS]]
+ * @param gradient Gradient function to be used.
+ * @param updater Updater to be used to update weights after every 
iteration.
+ */
+class LBFGS(private var gradient: Gradient, private var updater: Updater)
+  extends Optimizer with Logging {
+
+  private var numCorrections = 10
+  private var convergenceTol = 1E-4
+  private var maxNumIterations = 100
+  private var regParam = 0.0
+  private var miniBatchFraction = 1.0
+
+  /**
+   * Set the number of corrections used in the LBFGS update. Default 10.
+   * Values of numCorrections less than 3 are not recommended; large values
+   * of numCorrections will result in excessive computing time.
+   * 3 < numCorrections < 10 is recommended.
+   * Restriction: numCorrections > 0
+   */
+  def setNumCorrections(corrections: Int): this.type = {
+assert(corrections > 0)
+this.numCorrections = corrections
+this
+  }
+
+  /**
+   * Set fraction of data to be used for each L-BFGS iteration. Default 
1.0.
+   */
+  def setMiniBatchFraction(fraction: Double): this.type = {
+this.miniBatchFraction = fraction
+this
+  }
+
+  /**
+   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
+   * Smaller value will lead to higher accuracy with the cost of more 
iterations.
+   */
+  def setConvergenceTol(tolerance: Double): this.type = {
+this.convergenceTol = tolerance
+this
+  }
+
+  /**
+   * Set the maximal number of iterations for L-BFGS. Default 100.
+   */
+  def setMaxNumIterations(iters: Int): this.type = {
+this.maxNumIterations = iters
+this
+  }
+
+  /**
+   * Set the regularization parameter. Default 0.0.
+   */
+  def setRegParam(regParam: Double): this.type = {
+this.regParam = regParam
+this
+  }
+
+  /**
+   * Set the gradient function (of the loss function of one single data 
example)
+   * to be used for L-BFGS.
+   */
+  def setGradient(gradient: Gradient): this.type = {
+this.gradient = gradient
+this
+  }
+
+  /**
+   * Set the updater function to actually perform a gradient step in a given direction.
+   * The updater is responsible for performing the update from the regularization term as well,
+   * and therefore determines what kind of regularization is used, if any.
+   */
+  def setUpdater(updater: Updater): this.type = {
+this.updater = updater
+this
+  }
+
+  override def optimize(data: RDD[(Double, Vector)], initialWeights: 
Vector): Vector = {
+val (weights, _) = LBFGS.runMiniBatchLBFGS(
+  data,
+  gradient,
+  updater,
+  numCorrections,
+  convergenceTol,
+  maxNumIterations,
+  regParam,
+  miniBatchFraction,
+  initialWeights)
+weights
+  }
+
+}
+
+/**
+ * Top-level method to run LBFGS.
+ */
+object LBFGS extends Logging {
+  /**
+   * Run Limited-memory BFGS (L-BFGS) in parallel using mini batches.
+   * In each iteration, we sample a subset (fraction miniBatchFracti

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-14 Thread dbtsai
Github user dbtsai closed the pull request at:

https://github.com/apache/spark/pull/353


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-14 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/353#issuecomment-40434555
  
Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-14 Thread dbtsai
GitHub user dbtsai reopened a pull request:

https://github.com/apache/spark/pull/353

[SPARK-1157][MLlib] L-BFGS Optimizer based on Breeze's implementation.

This PR uses Breeze's L-BFGS implementation, and the Breeze dependency has already
been introduced by Xiangrui's sparse input format work in SPARK-1212. Nice
work, @mengxr!

When used with a regularized updater, we need to compute regVal and regGradient
(the gradient of the regularization term in the cost function), and with the
current updater design we can compute those two values in the following way.

Let's review how the updater works when returning newWeights given the input parameters:

    w' = w - thisIterStepSize * (gradient + regGradient(w))

Note that regGradient is a function of w! If we set gradient = 0 and thisIterStepSize = 1, then

    regGradient(w) = w - w'

As a result, regVal can be computed by

    val regVal = updater.compute(
      weights,
      new DoubleMatrix(initialWeights.length, 1), 0, 1, regParam)._2

and regGradient can be obtained by

    val regGradient = weights.sub(
      updater.compute(weights, new DoubleMatrix(initialWeights.length, 1), 1, 1, regParam)._1)

The PR includes tests which compare the result with SGD, with and without
regularization.

We did a comparison between L-BFGS and SGD, and we often saw 10x fewer
steps with L-BFGS while the cost per step is the same (just computing
the gradient).

The following is a paper by Prof. Ng's group at Stanford comparing different
optimizers, including L-BFGS and SGD. They use them in the context of
deep learning, but it is worth reading as a reference.
http://cs.stanford.edu/~jngiam/papers/LeNgiamCoatesLahiriProchnowNg2011.pdf

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark dbtsai-LBFGS

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/353.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #353


commit 984b18e21396eae84656e15da3539ff3b5f3bf4a
Author: DB Tsai 
Date:   2014-04-05T00:06:50Z

L-BFGS Optimizer based on Breeze's implementation. Also fixed indentation 
issue in GradientDescent optimizer.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

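The identity above can be sanity-checked with a toy L2 updater (a self-contained sketch, not MLlib's Updater API): with gradient = 0 and step size 1, w - w' recovers regParam * w, the gradient of the 0.5 * regParam * ||w||^2 regularization term.

    // Toy L2 "updater": w' = w - step * (grad + regParam * w), returns (w', regVal).
    def l2Update(w: Array[Double], grad: Array[Double], step: Double, regParam: Double)
      : (Array[Double], Double) = {
      val wNew = w.zip(grad).map { case (wi, gi) => wi - step * (gi + regParam * wi) }
      val regVal = 0.5 * regParam * wNew.map(x => x * x).sum
      (wNew, regVal)
    }

    val w = Array(0.3, 0.12)
    val zeroGradient = Array(0.0, 0.0)
    val (wPrime, regVal) = l2Update(w, zeroGradient, 1.0, 0.2)
    // regGradient == regParam * w, i.e. Array(0.06, 0.024)
    val regGradient = w.zip(wPrime).map { case (wi, wpi) => wi - wpi }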

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-14 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/353#issuecomment-40434626
  
Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-14 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/353#issuecomment-40434691
  
Timeout for the latest Jenkins run. It seems that CI is not stable right now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-15 Thread dbtsai
Github user dbtsai closed the pull request at:

https://github.com/apache/spark/pull/353


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: MLlib doc update for breeze dependency

2014-04-22 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/481

MLlib doc update for breeze dependency

MLlib is now using the Breeze linear algebra library instead of jblas; this PR
updates the docs to help users install the native BLAS libraries to get better
performance out of netlib-java, which Breeze depends on.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark dbtsai-LBFGSdocs

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/481.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #481


commit eddb3ddfd036035b4b8c639450e4d48db6afd4d4
Author: DB Tsai 
Date:   2014-04-22T07:35:44Z

Fixed MLlib doc




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

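For what it's worth, a quick way to check which BLAS backend is actually picked up at runtime (a sketch assuming netlib-java's standard entry point, which Breeze uses; not part of this PR):

    // Prints e.g. "com.github.fommil.netlib.NativeSystemBLAS" when a native
    // BLAS is found, or "com.github.fommil.netlib.F2jBLAS" for the Java fallback.
    println(com.github.fommil.netlib.BLAS.getInstance().getClass.getName)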

[GitHub] spark pull request: MLlib doc update for breeze dependency

2014-04-22 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/481#issuecomment-41012728
  
Oh. I don't know about that. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: MLlib doc update for breeze dependency

2014-04-22 Thread dbtsai
Github user dbtsai closed the pull request at:

https://github.com/apache/spark/pull/481


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1506][MLLIB] Documentation improvements...

2014-04-22 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/422#discussion_r11841916
  
--- Diff: docs/mllib-guide.md ---
@@ -3,63 +3,120 @@ layout: global
 title: Machine Learning Library (MLlib)
 ---
 
+MLlib is a Spark implementation of some common machine learning algorithms and utilities,
+including classification, regression, clustering, collaborative
+filtering, dimensionality reduction, as well as underlying optimization primitives:
 
-MLlib is a Spark implementation of some common machine learning (ML)
-functionality, as well associated tests and data generators.  MLlib
-currently supports four common types of machine learning problem settings,
-namely classification, regression, clustering and collaborative filtering,
-as well as an underlying gradient descent optimization primitive and several
-linear algebra methods.
-
-# Available Methods
-The following links provide a detailed explanation of the methods and usage examples for each of them:
-
-* Classification and Regression
-  * Binary Classification
-* SVM (L1 and L2 regularized)
-* Logistic Regression (L1 and L2 regularized)
-  * Linear Regression
-* Least Squares
-* Lasso
-* Ridge Regression
-  * Decision Tree (for classification and regression)
-* Clustering
-  * k-Means
-* Collaborative Filtering
-  * Matrix Factorization using Alternating Least Squares
-* Optimization
-  * Gradient Descent and Stochastic Gradient Descent
-* Linear Algebra
-  * Singular Value Decomposition
-  * Principal Component Analysis
-
-# Data Types
-
-Most MLlib algorithms operate on RDDs containing vectors. In Java and 
Scala, the
-[Vector](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector) class 
is used to
-represent vectors. You can create either dense or sparse vectors using the
-[Vectors](api/mllib/index.html#org.apache.spark.mllib.linalg.Vectors$) 
factory.
-
-In Python, MLlib can take the following vector types:
-
-* [NumPy](http://www.numpy.org) arrays
-* Standard Python lists (e.g. `[1, 2, 3]`)
-* The MLlib 
[SparseVector](api/pyspark/pyspark.mllib.linalg.SparseVector-class.html) class
-* [SciPy sparse 
matrices](http://docs.scipy.org/doc/scipy/reference/sparse.html)
-
-For efficiency, we recommend using NumPy arrays over lists, and using the
-[CSC 
format](http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix)
-for SciPy matrices, or MLlib's own SparseVector class.
-
-Several other simple data types are used throughout the library, e.g. the 
LabeledPoint
-class 
([Java/Scala](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint),
-[Python](api/pyspark/pyspark.mllib.regression.LabeledPoint-class.html)) 
for labeled data.
-
-# Dependencies
-MLlib uses the [jblas](https://github.com/mikiobraun/jblas) linear algebra 
library, which itself
-depends on native Fortran routines. You may need to install the
-[gfortran runtime 
library](https://github.com/mikiobraun/jblas/wiki/Missing-Libraries)
-if it is not already present on your nodes. MLlib will throw a linking 
error if it cannot
-detect these libraries automatically.
+* [Basics](mllib-basics.html)
+  * data types 
+  * summary statistics
+* Classification and regression
+  * [linear support vector machine 
(SVM)](mllib-linear-methods.html#linear-support-vector-machine-svm)
+  * [logistic regression](mllib-linear-methods.html#logistic-regression)
+  * [linear least squares, Lasso, and ridge 
regression](mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression)
+  * [decision tree](mllib-decision-tree.html)
+  * [naive Bayes](mllib-naive-bayes.html)
+* [Collaborative filtering](mllib-collaborative-filtering.html)
+  * alternating least squares (ALS)
+* [Clustering](mllib-clustering.html)
+  * k-means
+* [Dimensionality reduction](mllib-dimensionality-reduction.html)
+  * singular value decomposition (SVD)
+  * principal component analysis (PCA)
+* [Optimization](mllib-optimization.html)
+  * stochastic gradient descent
+  * limited-memory BFGS (L-BFGS)
+
+MLlib is currently a beta component under active development.
+The APIs may be changed in the future releases, and we will provide 
migration guide between releases.
+
+## Dependencies
+
+MLlib uses linear algebra packages [Breeze](http://www.scalanlp.org/), 
which depends on
+[netlib-java](https://github.com/fommil/netlib-java), and
+[jblas](https://github.com/mikiobraun/jblas).  `jblas` depend on native 
Fortran routines. You need
+to install the
+[gfortran runtime 
library](https://github.com/mikiobraun/jblas/wiki/Mi

[GitHub] spark pull request: [SPARK-1516]Throw exception in yarn client ins...

2014-04-22 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/490#discussion_r11883381
  
--- Diff: 
yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -77,7 +78,8 @@ trait ClientBase extends Logging {
 ).foreach { case(cond, errStr) =>
   if (cond) {
 logError(errStr)
-args.printUsageAndExit(1)
+throw new IllegalArgumentException(args.getUsageMessage())
+
--- End diff --

Remove this empty line.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1516]Throw exception in yarn client ins...

2014-04-22 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/490#issuecomment-41114289
  
Jenkins, add to whitelist.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1157][MLlib] Bug fix: lossHistory shoul...

2014-04-28 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/582

[SPARK-1157][MLlib] Bug fix: lossHistory should be monotonically decreasing

Instead of recording the loss inside costFun every time the optimizer calls it, we now read the loss from the state API provided by Breeze. This avoids recording the rejected line-search steps, which are what cause the loss history to bump up.
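
A minimal sketch of the idea, assuming a Breeze DiffFunction named costFun and an initial weight vector initialWeights (both hypothetical names here, not the actual patch):

    // Sketch only: read the loss from Breeze's optimizer states instead of
    // logging it inside the cost function, so rejected line-search evaluations
    // never enter the history.
    import breeze.linalg.{DenseVector => BDV}
    import breeze.optimize.{CachedDiffFunction, LBFGS => BreezeLBFGS}

    // args: maxIter, numCorrections (m), tolerance
    val lbfgs = new BreezeLBFGS[BDV[Double]](100, 10, 1e-4)
    val states = lbfgs.iterations(new CachedDiffFunction(costFun), initialWeights)

    // Each state corresponds to one accepted iteration, so mapping over the
    // states yields a monotonically decreasing loss history.
    val lossHistory = states.map(_.value).toArray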

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark dbtsai-lbfgs-bug

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/582.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #582


commit d72c67908abc8546b9713c41101aa6c685ce31eb
Author: DB Tsai 
Date:   2014-04-28T20:36:13Z

Using Breeze's states to get the loss.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1157][MLlib] Bug fix: lossHistory shoul...

2014-04-29 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/582#issuecomment-41740842
  
@mengxr  I just did a quick hack at implementing a proper "stochastic" L-BFGS, and it kind of works as long as we don't change the objective function. But there is no good way to know which L-BFGS step we are in so that the objective function can be kept fixed during the line-search steps, so I would need to do some injection as David suggested. See 
https://github.com/dbtsai/spark/commit/0c699f259af7c1bd630d033a3ce960771efaf66c

What do you think now? Should we just remove miniBatchFraction, since RDD sampling is not even efficient?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1157][MLlib] Bug fix: lossHistory shoul...

2014-04-29 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/582#issuecomment-41751464
  
Makes sense from the inverse-Hessian point of view. Just remove it!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1543][MLlib] Add ADMM for solving Lasso...

2014-05-04 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/458#issuecomment-42160096
  
L-BFGS is not good for L1 problems. I'm working on and preparing to benchmark OWL-QN, a BFGS variant for L1 regularization, which is the natural method to compare against ADMM.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: MLlib documentation fix

2014-05-10 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/703

MLlib documentation fix

Fixed the documentation: `loadLibSVMData` has been renamed to `loadLibSVMFile`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark dbtsai-docfix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/703.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #703


commit 71dd50804e7fd2cd4c441333020811898a6d8a30
Author: DB Tsai 
Date:   2014-05-09T00:00:41Z

loadLibSVMData is changed to loadLibSVMFile




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: L-BFGS Documentation

2014-05-10 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/702#discussion_r12502968
  
--- Diff: docs/mllib-optimization.md ---
@@ -163,3 +171,108 @@ each iteration, to compute the gradient direction.
 Available algorithms for gradient descent:
 
 * 
[GradientDescent.runMiniBatchSGD](api/mllib/index.html#org.apache.spark.mllib.optimization.GradientDescent)
+
+### Limited-memory BFGS
+L-BFGS is currently only a low-level optimization primitive in `MLlib`. If 
you want to use L-BFGS in various 
+ML algorithms such as Linear Regression, and Logistic Regression, you have 
to pass the gradient of objective
+function, and updater into optimizer yourself instead of using the 
training APIs like 

+[LogisticRegression.LogisticRegressionWithSGD](api/mllib/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithSGD).
+See the example below. It will be addressed in the next release. 
+
+The L1 regularization by using 

+[L1Updater](api/mllib/index.html#org.apache.spark.mllib.optimization.L1Updater)
 will not work since the 
+soft-thresholding logic in L1Updater is designed for gradient descent. See 
the developer's note.
+
+The L-BFGS method

+[LBFGS.runLBFGS](api/scala/index.html#org.apache.spark.mllib.optimization.LBFGS)
+has the following parameters:
+
+* `gradient` is a class that computes the gradient of the objective 
function
+being optimized, i.e., with respect to a single training example, at the
+current parameter value. MLlib includes gradient classes for common loss
+functions, e.g., hinge, logistic, least-squares.  The gradient class takes 
as
+input a training example, its label, and the current parameter value. 
+* `updater` is a class that computes the gradient and loss of objective 
function 
+of the regularization part for L-BFGS. MLlib includes updaters for cases 
without 
+regularization, as well as L2 regularizer. 
+* `numCorrections` is the number of corrections used in the L-BFGS update. 
10 is 
+recommended.
+* `maxNumIterations` is the maximal number of iterations that L-BFGS can 
be run.
+* `regParam` is the regularization parameter when using regularization.
+
+
+The `return` is a tuple containing two elements. The first element is a 
column matrix
+containing weights for every feature, and the second element is an array 
containing 
+the loss computed for every iteration.
+
+Here is an example to train binary logistic regression with L2 
regularization using
+L-BFGS optimizer. 
+{% highlight scala %}
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.mllib.classification.LogisticRegressionModel
+
+val data = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt")
+val numFeatures = data.take(1)(0).features.size
+
+// Split data into training (60%) and test (40%).
+val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
+
+// Prepend 1 into the training data as intercept.
+val training = splits(0).map(x => (x.label, 
MLUtils.appendBias(x.features))).cache()
+
+val test = splits(1)
+
+// Run training algorithm to build the model
+val numCorrections = 10
+val convergenceTol = 1e-4
+val maxNumIterations = 20
+val regParam = 0.1
+val initialWeightsWithIntercept = Vectors.dense(new 
Array[Double](numFeatures + 1))
+
+val (weightsWithIntercept, loss) = LBFGS.runLBFGS(
+  training,
+  new LogisticGradient(),
+  new SquaredL2Updater(),
+  numCorrections,
+  convergenceTol,
+  maxNumIterations,
+  regParam,
+  initialWeightsWithIntercept)
+
+val model = new LogisticRegressionModel(
+  Vectors.dense(weightsWithIntercept.toArray.slice(1, 
weightsWithIntercept.size)),
--- End diff --

Why don't we have prependOne in MLUtils as well? Because of its scope, users cannot use prependOne. It's more intuitive to have the intercept as the first element.
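
For illustration only, a hypothetical prependOne helper (it did not exist in MLUtils at the time) that would put the bias term first instead of appending it:

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // Hypothetical helper: prepend the bias/intercept term so it is element 0,
    // mirroring MLUtils.appendBias, which appends it at the end.
    def prependOne(vector: Vector): Vector =
      Vectors.dense(Array(1.0) ++ vector.toArray)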


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: L-BFGS Documentation

2014-05-14 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/702#discussion_r12499609
  
--- Diff: docs/mllib-optimization.md ---
@@ -163,3 +177,100 @@ each iteration, to compute the gradient direction.
 Available algorithms for gradient descent:
 
 * 
[GradientDescent.runMiniBatchSGD](api/mllib/index.html#org.apache.spark.mllib.optimization.GradientDescent)
+
+### Limited-memory BFGS
+L-BFGS is currently only a low-level optimization primitive in `MLlib`. If 
you want to use L-BFGS in various 
+ML algorithms such as Linear Regression, and Logistic Regression, you have 
to pass the gradient of objective
+function, and updater into optimizer yourself instead of using the 
training APIs like 

+[LogisticRegression.LogisticRegressionWithSGD](api/mllib/index.html#org.apache.spark.mllib.classification.LogisticRegression).
+See the example below. It will be addressed in the next release. 
+
+The L1 regularization by using 

+[Updater.L1Updater](api/mllib/index.html#org.apache.spark.mllib.optimization.Updater)
 will not work since the 
+soft-thresholding logic in L1Updater is designed for gradient descent.
+
+The L-BFGS method

+[LBFGS.runLBFGS](api/scala/index.html#org.apache.spark.mllib.optimization.LBFGS)
+has the following parameters:
+
+* `gradient` is a class that computes the gradient of the objective 
function
+being optimized, i.e., with respect to a single training example, at the
+current parameter value. MLlib includes gradient classes for common loss
+functions, e.g., hinge, logistic, least-squares.  The gradient class takes 
as
+input a training example, its label, and the current parameter value. 
+* `updater` is a class originally designed for gradient decent which 
computes 
--- End diff --

Agreed. I will move it into a comment in the code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: L-BFGS Documentation

2014-05-15 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/702#discussion_r12499183
  
--- Diff: docs/mllib-optimization.md ---
@@ -128,10 +128,24 @@ is sampled, i.e. `$|S|=$ miniBatchFraction $\cdot n = 
1$`, then the algorithm is
 standard SGD. In that case, the step direction depends from the uniformly 
random sampling of the
 point.
 
+### Limited-memory BFGS
+[Limited-memory BFGS 
(L-BFGS)](http://en.wikipedia.org/wiki/Limited-memory_BFGS) is an optimization 
+algorithm in the family of quasi-Newton methods to solve the optimization 
problems of the form 
+`$\min_{\wv \in\R^d} \; f(\wv)$`. The L-BFGS approximates the objective 
function locally as a quadratic
+without evaluating the second partial derivatives of the objective 
function to construct the 
+Hessian matrix. The Hessian matrix is approximated by previous gradient 
evaluations, so there is no 
+vertical scalability issue (the number of training features) when 
computing the Hessian matrix 
+explicitly in Newton method. As a result, L-BFGS often achieves rapider 
convergence compared with 
+other first-order optimization. 
 
+Since the Hessian is constructed approximately from previous gradient 
evaluations, the objective 
+function can not be changed during the optimization process. As a result, 
Stochastic L-BFGS will 
+not work naively by just using miniBatch; therefore, we don't provide this 
until we have better 
+understanding.  
--- End diff --

Do we have a `Developer` section for this type of content?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: L-BFGS Documentation

2014-05-15 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/702

L-BFGS Documentation

Documentation for L-BFGS, and an example of training binary L2 logistic 
regression using L-BFGS.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark dbtsai-lbfgs-doc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/702.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #702


commit bbd9db595aa845ea82f79a0b888e7479f5b4b0af
Author: DB Tsai 
Date:   2014-05-08T01:42:18Z

L-BFGS Documentation




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: L-BFGS Documentation

2014-05-16 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/702#discussion_r12499273
  
--- Diff: docs/mllib-optimization.md ---
@@ -128,10 +128,24 @@ is sampled, i.e. `$|S|=$ miniBatchFraction $\cdot n = 
1$`, then the algorithm is
 standard SGD. In that case, the step direction depends from the uniformly 
random sampling of the
 point.
 
+### Limited-memory BFGS
+[Limited-memory BFGS 
(L-BFGS)](http://en.wikipedia.org/wiki/Limited-memory_BFGS) is an optimization 
+algorithm in the family of quasi-Newton methods to solve the optimization 
problems of the form 
+`$\min_{\wv \in\R^d} \; f(\wv)$`. The L-BFGS approximates the objective 
function locally as a quadratic
+without evaluating the second partial derivatives of the objective 
function to construct the 
+Hessian matrix. The Hessian matrix is approximated by previous gradient 
evaluations, so there is no 
+vertical scalability issue (the number of training features) when 
computing the Hessian matrix 
+explicitly in Newton method. As a result, L-BFGS often achieves rapider 
convergence compared with 
+other first-order optimization. 
 
+Since the Hessian is constructed approximately from previous gradient 
evaluations, the objective 
+function can not be changed during the optimization process. As a result, 
Stochastic L-BFGS will 
+not work naively by just using miniBatch; therefore, we don't provide this 
until we have better 
+understanding.  
--- End diff --

I decided to move those messages into the code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1870][branch-0.9] Jars added by sc.addJ...

2014-05-19 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/834

[SPARK-1870][branch-0.9] Jars added by sc.addJar are not in the default 
classLoader in executor for YARN

The summary is copied from Sandy's comment in the mailing list.

The relevant difference between YARN and standalone is that, on YARN, the 
app jar is loaded by the system classloader instead of Spark's custom URL
classloader.

On YARN, the system classloader knows about [the classes in the spark jars,
the classes in the primary app jar].   The custom classloader knows about
[the classes in secondary app jars] and has the system classloader as its
parent.

A few relevant facts (mostly redundant with what Sean pointed out):
* Every class has a classloader that loaded it.
* When an object of class B is instantiated inside of class A, the
classloader used for loading B is the classloader that was used for loading 
A.
* When a classloader fails to load a class, it lets its parent classloader
try.  If its parent succeeds, its parent becomes the "classloader that
loaded it".

So suppose class B is in a secondary app jar and class A is in the primary
app jar:
1. The custom classloader will try to load class A.
2. It will fail, because it only knows about the secondary jars.
3. It will delegate to its parent, the system classloader.
4. The system classloader will succeed, because it knows about the primary
app jar.
5. A's classloader will be the system classloader.
6. A tries to instantiate an instance of class B.
7. B will be loaded with A's classloader, which is the system classloader.
8. Loading B will fail, because A's classloader, which is the system
classloader, doesn't know about the secondary app jars.

In Spark standalone, A and B are both loaded by the custom classloader, so
this issue doesn't come up.

In this PR, we don't use the custom classloader anymore. Instead, we add the URL to the current classloader. Since addURL is a protected method in URLClassLoader, we call it through reflection.
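
A minimal sketch of the reflection trick, under the assumption that the current context classloader is a URLClassLoader (as it is for the YARN executors here); the helper name is illustrative:

    import java.net.{URL, URLClassLoader}

    def addJarToClassLoader(jarUrl: URL): Unit = {
      val loader = Thread.currentThread().getContextClassLoader.asInstanceOf[URLClassLoader]
      // URLClassLoader.addURL is protected, so open it up via reflection.
      val addURL = classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
      addURL.setAccessible(true)
      addURL.invoke(loader, jarUrl)
    }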


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark branch-0.9-dbtsai-classloader

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/834.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #834


commit 474ef2c936b8f659521a519c103bc7fdb116353b
Author: DB Tsai 
Date:   2014-05-20T04:34:58Z

Fixed the classLoader issue in 0.9 branch.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1870] Make spark-submit --jars work in ...

2014-05-21 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/848#discussion_r12921552
  
--- Diff: 
yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -479,37 +485,24 @@ object ClientBase {
 
 extraClassPath.foreach(addClasspathEntry)
 
-addClasspathEntry(Environment.PWD.$())
+val cachedSecondaryJarLinks =
+  
sparkConf.getOption(CONF_SPARK_YARN_SECONDARY_JARS).getOrElse("").split(",")
 // Normally the users app.jar is last in case conflicts with spark jars
 if (sparkConf.get("spark.yarn.user.classpath.first", 
"false").toBoolean) {
--- End diff --

What's the difference between `spark.yarn.user.classpath.first` and `spark.files.userClassPathFirst`? To me, they seem to be the same thing under two different configuration names.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1870] Make spark-submit --jars work in ...

2014-05-21 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/848#discussion_r12921709
  
--- Diff: 
yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -479,37 +485,24 @@ object ClientBase {
 
 extraClassPath.foreach(addClasspathEntry)
 
-addClasspathEntry(Environment.PWD.$())
+val cachedSecondaryJarLinks =
+  
sparkConf.getOption(CONF_SPARK_YARN_SECONDARY_JARS).getOrElse("").split(",")
 // Normally the users app.jar is last in case conflicts with spark jars
 if (sparkConf.get("spark.yarn.user.classpath.first", 
"false").toBoolean) {
--- End diff --

PS: in line 47, `* 1. In standalone mode, it will launch an [[org.apache.spark.deploy.yarn.ApplicationMaster]]` -- should it say cluster mode now?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1870] Make spark-submit --jars work in ...

2014-05-21 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/848#issuecomment-43812877
  
Thanks. It looks great to me, and better than my patch.

cachedSecondaryJarLinks.foreach(addPwdClasspathEntry) is not needed since we already have addPwdClasspathEntry("*"). But later we may want to change the priority of the jars, since we now add them explicitly.

This patch also works for me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1870] Make spark-submit --jars work in ...

2014-05-21 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/848#issuecomment-43814642
  
It worked on the driver before, so the major issue is that those files are not in the executors' distributed cache. But I like the idea of adding them explicitly, so we won't miss anything.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-03 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/955

[SPARK-1969][MLlib] Public available online summarizer for mean, variance, 
min, and max

It basically moves the private ColumnStatisticsAggregator class out of RowMatrix and makes it publicly available as a DeveloperApi.

Changes:
1) Moved the trait from org.apache.spark.mllib.stat.MultivariateStatisticalSummary to org.apache.spark.mllib.stats.Summarizer
2) Moved the private implementation from org.apache.spark.mllib.linalg.ColumnStatisticsAggregator to org.apache.spark.mllib.stats.OnlineSummarizer
3) When creating an OnlineSummarizer object, the number of columns is no longer needed in the constructor; it is determined when users add the first sample (see the usage sketch below)
4) Added the API documentation for OnlineSummarizer
5) Added the unit test for OnlineSummarizer
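
A minimal usage sketch, with the caveat that the method names below (add, mean, variance, min, max) are assumptions about the new public API rather than a quote of it:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stats.OnlineSummarizer

    // No column count in the constructor; the dimension is fixed by the first sample.
    val summarizer = new OnlineSummarizer()
    summarizer.add(Vectors.dense(1.0, 2.0, 3.0))
    summarizer.add(Vectors.dense(4.0, 5.0, 6.0))

    println(summarizer.mean)      // column-wise mean
    println(summarizer.variance)  // column-wise variance
    println(summarizer.min)
    println(summarizer.max)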

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark dbtsai-summarizer

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/955.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #955


commit 6d0e596a71b44c21b86ba3407d6dc62b0b684198
Author: DB Tsai 
Date:   2014-06-03T03:01:16Z

First version.

commit 1bd8e0c7ded84049371b29bc47c666957f07d091
Author: DB Tsai 
Date:   2014-06-03T20:53:50Z

Some cleanup.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-03 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45023171
  
Since the "Statistical" in MultivariateStatisticalSummary is already in the 
package name as "stat", I think it worths to have a concise name. Also, most 
people spell the abbreviation of statistics as "stats", so I changed it from 
"stat" to "stats".

Since it's already a public API, I've no problem to change it back.





---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1870][branch-0.9] Jars added by sc.addJ...

2014-06-03 Thread dbtsai
Github user dbtsai closed the pull request at:

https://github.com/apache/spark/pull/834


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-03 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45026777
  
I don't know why Jenkins is unhappy about removing "private class ColumnStatisticsAggregator(private val n: Int)". After all, it's a private class.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Fixed a typo

2014-06-03 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/959

Fixed a typo

in RowMatrix.scala

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark dbtsai-typo

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/959.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #959


commit fab0e0e77ff63a67868d7f3d8f5434b113ee48fd
Author: DB Tsai 
Date:   2014-06-03T23:14:18Z

Fixed typo




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-04 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45124672
  
@mengxr Got it. It's a false-positive error. Do you have any comments or feedback on moving it out as a public API? I'm building a feature-scaling API in MLUtils which depends on this. Thanks. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1516]Throw exception in yarn client ins...

2014-06-05 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/490#issuecomment-45277558
  
This looks good to me. 

However, we still have more System.exit calls in different parts of the deployment code; we probably want to review and fix those as well. This can be a good first step!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1177] Allow SPARK_JAR to be set program...

2014-06-05 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/987

[SPARK-1177] Allow SPARK_JAR to be set programmatically in system properties



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark 
dbtsai-yarn-spark-jar-from-java-property

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/987.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #987


commit 196df1c9fa0c423a30f3b118bf1dd58480cb2fee
Author: DB Tsai 
Date:   2014-05-27T23:07:27Z

Allow users to programmatically set the spark jar.

commit bdff88ac46bff5aea63e23c24d5d5f00a4e83023
Author: DB Tsai 
Date:   2014-06-05T22:43:09Z

Doc update




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1177] Allow SPARK_JAR to be set program...

2014-06-05 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/987#issuecomment-45286460
  
@chesterxgchen 

#560 Agreed, it's a more thorough way to handle this issue. In your code, it seems that the Spark jar setting is moved into conf: SparkConf in favor of CONF_SPARK_JAR. But it will make it difficult for users to set up, since Client.scala also has to change. Simple question: with your change, how can users submit a job with their own Spark jar by passing CONF_SPARK_JAR correctly?

    def sparkJar(conf: SparkConf) = {
      if (conf.contains(CONF_SPARK_JAR)) {
        conf.get(CONF_SPARK_JAR)
      } else if (System.getenv(ENV_SPARK_JAR) != null) {
        logWarning(
          s"$ENV_SPARK_JAR detected in the system environment. This variable has been deprecated " +
            s"in favor of the $CONF_SPARK_JAR configuration variable.")
        System.getenv(ENV_SPARK_JAR)
      } else {
        SparkContext.jarOfClass(this.getClass).head
      }
    }




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1177] Allow SPARK_JAR to be set program...

2014-06-05 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/987#issuecomment-45292804
  
The app's code only runs in the application master in yarn-cluster mode, so how can the YARN client know which jar should be submitted to the distributed cache if we set it in the app's SparkConf?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1177] Allow SPARK_JAR to be set program...

2014-06-05 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/987#issuecomment-45296471
  
We launch Spark jobs inside our Tomcat, and we use the Client.scala API directly. With my patch, I can set up the Spark jar using System.setProperty() before:

      val sparkConf = new SparkConf
      val args = getArgsFromConf(conf)
      new Client(new ClientArguments(args, sparkConf), hadoopConfig, sparkConf).run

Do you mean that with your work, I can set the jar location in the sparkConf which will be passed into the new Client?

Can we have the following in the sparkJar method?

    def sparkJar(conf: SparkConf) = {
      if (conf.contains(CONF_SPARK_JAR)) {
        conf.get(CONF_SPARK_JAR)
      } else if (System.getProperty(ENV_SPARK_JAR) != null) {
        logWarning(
          s"$ENV_SPARK_JAR detected in the system property. This variable has been deprecated " +
            s"in favor of the $CONF_SPARK_JAR configuration variable.")
        System.getProperty(ENV_SPARK_JAR)
      } else if (System.getenv(ENV_SPARK_JAR) != null) {
        logWarning(
          s"$ENV_SPARK_JAR detected in the system environment. This variable has been deprecated " +
            s"in favor of the $CONF_SPARK_JAR configuration variable.")
        System.getenv(ENV_SPARK_JAR)
      } else {
        SparkContext.jarOfClass(this.getClass).head
      }
    }




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-05 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45297396
  
k... better to have Mima exclude the private class automatically, or we can 
have annotation for the private class.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1177] Allow SPARK_JAR to be set program...

2014-06-06 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/987#issuecomment-45363846
  
Got you. Looking forward to having your patch merged. Thanks.

Sent from my Google Nexus 5
On Jun 6, 2014 9:35 AM, "Marcelo Vanzin"  wrote:

> I mean you can set system properties the same way. SparkConf initializes
> its configuration from system properties, so my patch covers not only your
> case, but also others (like using a spark-defaults.conf file for
> spark-submit users).
>
> —
> Reply to this email directly or view it on GitHub
> <https://github.com/apache/spark/pull/987#issuecomment-45357297>.
>


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1870] Ported from 1.0 branch to 0.9 bra...

2014-06-08 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/1013

[SPARK-1870] Ported from 1.0 branch to 0.9 branch. 

Made deployment with --jars work in yarn-standalone mode. Secondary jars are sent to the distributed cache of all containers, and the cached jars are added to the classpath before the executors start.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark branch-0.9

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1013.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1013


commit 0956af95e24bc37303525fde6f85e0b3aeebd946
Author: DB Tsai 
Date:   2014-06-08T23:16:53Z

Ported SPARK-1870 from 1.0 branch to 0.9 branch




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1870] Ported from 1.0 branch to 0.9 bra...

2014-06-08 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1013#issuecomment-45451719
  
CC: @mengxr and @sryza


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1870] Ported from 1.0 branch to 0.9 bra...

2014-06-08 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1013#issuecomment-45459920
  
Works in my local VM. It should work on a real YARN cluster; I will test it tomorrow in the office.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1870] Ported from 1.0 branch to 0.9 bra...

2014-06-09 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1013#issuecomment-45551414
  
Tested on a PivotalHD 1.1 YARN 4-node cluster. With --addjars file:///somePath/to/jar, launching the Spark application works.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1870] Made deployment with --jars work ...

2014-06-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1013#discussion_r13573407
  
--- Diff: 
yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -281,18 +280,19 @@ class Client(args: ClientArguments, conf: 
Configuration, sparkConf: SparkConf)
 }
 
 // Handle jars local to the ApplicationMaster.
+var cachedSecondaryJarLinks = ListBuffer.empty[String]
--- End diff --

Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1870] Made deployment with --jars work ...

2014-06-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1013#discussion_r13573544
  
--- Diff: 
yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -507,12 +508,19 @@ object Client {
   Apps.addToEnvironment(env, Environment.CLASSPATH.name, 
Environment.PWD.$() +
 Path.SEPARATOR + LOG4J_PROP)
 }
+
+val cachedSecondaryJarLinks =
+  
sparkConf.getOption(CONF_SPARK_YARN_SECONDARY_JARS).getOrElse("").split(",")
--- End diff --

Thanks, you are right. It will add an empty string to the array, and then add the folder without a file onto the classpath. Will fix this in master as well.
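
A minimal sketch of the fix, assuming the same configuration key; the point is that "".split(",") yields Array(""), so empty entries have to be filtered out before they reach the classpath:

    val cachedSecondaryJarLinks =
      sparkConf.getOption(CONF_SPARK_YARN_SECONDARY_JARS)
        .getOrElse("")
        .split(",")
        .filter(_.nonEmpty)  // drop the empty string produced when no secondary jars are set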


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Make sure that empty string is filtered out wh...

2014-06-09 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/1027

Make sure that empty string is filtered out when we get the secondary jars 
from conf



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark dbtsai-classloader

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1027.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1027


commit c9c7ad7fc6a2cf03503fe7b19ea1da92247196c6
Author: DB Tsai 
Date:   2014-06-10T01:29:04Z

Make sure that empty string is filtered out when we get the secondary jars 
from conf.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1516]Throw exception in yarn client ins...

2014-06-10 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/490#discussion_r13624385
  
--- Diff: 
yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -95,15 +96,18 @@ trait ClientBase extends Logging {
 
 // If we have requested more then the clusters max for a single 
resource then exit.
 if (args.executorMemory > maxMem) {
-  logError("Required executor memory (%d MB), is above the max 
threshold (%d MB) of this cluster.".
-format(args.executorMemory, maxMem))
-  System.exit(1)
+  val errorMessage =
+"Required executor memory (%d MB), is above the max threshold (%d 
MB) of this cluster.".
+format(args.executorMemory, maxMem)
+  logError(errorMessage)
+  throw new IllegalArgumentException(errorMessage)
 }
 val amMem = args.amMemory + YarnAllocationHandler.MEMORY_OVERHEAD
 if (amMem > maxMem) {
-  logError("Required AM memory (%d) is above the max threshold (%d) of 
this cluster".
-format(args.amMemory, maxMem))
-  System.exit(1)
+  val errorMessage ="Required AM memory (%d) is above the max 
threshold (%d) of this cluster".
--- End diff --

Please add a space after =


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1516]Throw exception in yarn client ins...

2014-06-10 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/490#discussion_r13624580
  
--- Diff: 
yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -95,15 +96,18 @@ trait ClientBase extends Logging {
 
 // If we have requested more then the clusters max for a single 
resource then exit.
 if (args.executorMemory > maxMem) {
-  logError("Required executor memory (%d MB), is above the max 
threshold (%d MB) of this cluster.".
-format(args.executorMemory, maxMem))
-  System.exit(1)
+  val errorMessage =
+"Required executor memory (%d MB), is above the max threshold (%d 
MB) of this cluster.".
+format(args.executorMemory, maxMem)
--- End diff --

Move the . to the new line


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1516]Throw exception in yarn client ins...

2014-06-10 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/490#discussion_r13624615
  
--- Diff: 
yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -95,15 +96,18 @@ trait ClientBase extends Logging {
 
 // If we have requested more then the clusters max for a single 
resource then exit.
 if (args.executorMemory > maxMem) {
-  logError("Required executor memory (%d MB), is above the max 
threshold (%d MB) of this cluster.".
-format(args.executorMemory, maxMem))
-  System.exit(1)
+  val errorMessage =
+"Required executor memory (%d MB), is above the max threshold (%d 
MB) of this cluster.".
+format(args.executorMemory, maxMem)
+  logError(errorMessage)
+  throw new IllegalArgumentException(errorMessage)
 }
 val amMem = args.amMemory + YarnAllocationHandler.MEMORY_OVERHEAD
 if (amMem > maxMem) {
-  logError("Required AM memory (%d) is above the max threshold (%d) of 
this cluster".
-format(args.amMemory, maxMem))
-  System.exit(1)
+  val errorMessage ="Required AM memory (%d) is above the max 
threshold (%d) of this cluster".
--- End diff --

move the . to the newline 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1870] Made deployment with --jars work ...

2014-06-11 Thread dbtsai
Github user dbtsai closed the pull request at:

https://github.com/apache/spark/pull/1013


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1516]Throw exception in yarn client ins...

2014-06-11 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/490#issuecomment-45835283
  
@mengxr Do you think it's in good shape now? This is the only issue blocking us from using vanilla Spark. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2163] class LBFGS optimize with Double ...

2014-06-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1104#discussion_r13897737
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -38,10 +38,10 @@ import org.apache.spark.mllib.linalg.{Vectors, Vector}
 class LBFGS(private var gradient: Gradient, private var updater: Updater)
   extends Optimizer with Logging {
 
-  private var numCorrections = 10
-  private var convergenceTol = 1E-4
-  private var maxNumIterations = 100
-  private var regParam = 0.0
+  private var numCorrections: Int = 10
+  private var convergenceTol: Double = 1E-4
+  private var maxNumIterations: Int = 100
+  private var regParam: Double = 0.0
 
--- End diff --

In most of the MLlib codebase, we don't specify the types of these variables explicitly. Can you remove them?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2163] class LBFGS optimize with Double ...

2014-06-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1104#discussion_r13897825
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
@@ -195,4 +195,39 @@ class LBFGSSuite extends FunSuite with 
LocalSparkContext with Matchers {
 assert(lossLBFGS3.length == 6)
 assert((lossLBFGS3(4) - lossLBFGS3(5)) / lossLBFGS3(4) < 
convergenceTol)
   }
+
--- End diff --

The bug wasn't caught because we only test the static runLBFGS method instead of the class. We could probably change all the existing tests to use the class-based API, so we wouldn't need to add another test.

@mengxr what do you think? 
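
For illustration, a minimal sketch of exercising the LBFGS class (rather than the static runLBFGS helper) so that bugs in the setters would also be covered; the data and initial-weight values below are assumptions, not code from the existing suite:

    val lbfgs = new LBFGS(new LogisticGradient(), new SquaredL2Updater())
      .setNumCorrections(10)
      .setConvergenceTol(1e-4)
      .setMaxNumIterations(20)
      .setRegParam(0.1)

    // optimize() goes through the instance fields, so the setter behavior
    // is exercised end to end.
    val weights = lbfgs.optimize(dataRDD, initialWeights)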


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

