[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

2014-08-22 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/2068#issuecomment-53138329 @atalwalkar and @mengxr I just addressed the merge conflict. I think it's ready to merge. Thanks. --- If your project is set up for it, you can reply to this email

[GitHub] spark pull request: [SPARK-3317][MLlib] The loss of regularization...

2014-08-29 Thread dbtsai
GitHub user dbtsai reopened a pull request: https://github.com/apache/spark/pull/2207 [SPARK-3317][MLlib] The loss of regularization in Updater should use the oldWeights The current loss of the regularization is computed from the newWeights which is not correct. The loss, R(w) = 1

[GitHub] spark pull request: [SPARK-3317][MLlib] The loss of regularization...

2014-08-29 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/2207#issuecomment-53933078 LBFGS needs the correct loss to find the next weights, while SGD doesn't.

[GitHub] spark pull request: [SPARK-3317][MLlib] The loss of regularization...

2014-08-29 Thread dbtsai
Github user dbtsai closed the pull request at: https://github.com/apache/spark/pull/2207

[GitHub] spark pull request: [SPARK-3317][MLlib] The loss of regularization...

2014-08-31 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/2207#issuecomment-54002680 @srowen @mengxr I was working on OWLQN for L1 in my company, and I didn't follow the LBFGS code so I was confused. The current code in MLlib actually gives

[GitHub] spark pull request: [SPARK-3317][MLlib] The loss of regularization...

2014-08-31 Thread dbtsai
Github user dbtsai closed the pull request at: https://github.com/apache/spark/pull/2207

[GitHub] spark pull request: [SPARK-3317][MLlib] The loss of regularization...

2014-08-31 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/2207#issuecomment-54002773 PS, it seems that I cannot close https://issues.apache.org/jira/browse/SPARK-3317 myself. Can any of you close it for me? Thanks.

[GitHub] spark pull request: [SPARK-3317][MLlib] The loss of regularization...

2014-08-31 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/2207#issuecomment-54002970 You are right. Using my desktop without login session. Thanks

[GitHub] spark pull request: [SPARK-4355][MLLIB] fix OnlineSummarizer.merge...

2014-11-12 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/3220#discussion_r20206271 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala --- @@ -50,6 +50,29 @@ class MultivariateOnlineSummarizer

[GitHub] spark pull request: [SPARK-4355][MLLIB] fix OnlineSummarizer.merge...

2014-11-12 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/3220#discussion_r20207949 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala --- @@ -124,37 +128,28 @@ class MultivariateOnlineSummarizer

[GitHub] spark pull request: [SPARK-4355][MLLIB] fix OnlineSummarizer.merge...

2014-11-12 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/3220#discussion_r20208266 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala --- @@ -50,6 +50,29 @@ class MultivariateOnlineSummarizer

[GitHub] spark pull request: [SPARK-4355][MLLIB] fix OnlineSummarizer.merge...

2014-11-12 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/3220#issuecomment-62689770 LGTM. Thanks.

[GitHub] spark pull request: [SPARK-4355][MLLIB] fix OnlineSummarizer.merge...

2014-11-12 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/3220#issuecomment-62694226 Thanks.

[GitHub] spark pull request: [SPARK-4348] [PySpark] [MLlib] rename random.p...

2014-11-13 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/3216#issuecomment-62856261 It works for me as well. ᚛ |activeIterator *|$ ./bin/pyspark Python 2.7.6 (default, Sep 9 2014, 15:04:36) [GCC 4.2.1 Compatible Apple LLVM

[GitHub] spark pull request: [SPARK-4431][MLlib] Implement efficient active...

2014-11-15 Thread dbtsai
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/3288 [SPARK-4431][MLlib] Implement efficient activeIterator for dense and sparse vector Previously, we were using Breeze's activeIterator to access the non-zero elements in sparse vector

[GitHub] spark pull request: [SPARK-4431][MLlib] Implement efficient active...

2014-11-18 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/3288#discussion_r20532934 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala --- @@ -76,6 +76,22 @@ sealed trait Vector extends Serializable { def copy

[GitHub] spark pull request: [SPARK-4431][MLlib] Implement efficient active...

2014-11-18 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/3288#discussion_r20533260 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala --- @@ -273,6 +289,47 @@ class DenseVector(val values: Array[Double]) extends

[GitHub] spark pull request: [SPARK-4431][MLlib] Implement efficient active...

2014-11-18 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/3288#discussion_r20544650 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala --- @@ -273,6 +289,47 @@ class DenseVector(val values: Array[Double]) extends

[GitHub] spark pull request: [SPARK-4431][MLlib] Implement efficient active...

2014-11-18 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/3288#issuecomment-63566328 (PS, when I did the bytecode analysis, I found that accessing the member variables of values and values.size requires two operations. By having a local copy

[GitHub] spark pull request: [SPARK-4431][MLlib] Implement efficient active...

2014-11-18 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/3288#discussion_r20553260 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala --- @@ -76,6 +76,22 @@ sealed trait Vector extends Serializable { def copy

[GitHub] spark pull request: [SPARK-4431][MLlib] Implement efficient active...

2014-11-18 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/3288#discussion_r20554090 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala --- @@ -76,6 +76,22 @@ sealed trait Vector extends Serializable { def copy

[GitHub] spark pull request: [SPARK-4431][MLlib] Implement efficient active...

2014-11-19 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/3288#discussion_r20615000 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala --- @@ -76,6 +76,22 @@ sealed trait Vector extends Serializable { def copy

[GitHub] spark pull request: [SPARK-4431][MLlib] Implement efficient active...

2014-11-20 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/3288#discussion_r20687461 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala --- @@ -95,22 +93,7 @@ class MultivariateOnlineSummarizer

[GitHub] spark pull request: [SPARK-4431][MLlib] Implement efficient active...

2014-11-20 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/3288#discussion_r20688070 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/VectorsSuite.scala --- @@ -173,4 +173,63 @@ class VectorsSuite extends FunSuite { val v

[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-11-20 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1379#issuecomment-63904113 @avulanov I will merge this in Spark 1.3, and sorry for the delay; I have been very busy recently. Yes, the branch you found should work, but it cannot be cleanly merged

[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-11-20 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1379#issuecomment-63906768 no, in the algorithm, I already model the problem http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297/24 , so there will always be only (num_features + 1

[GitHub] spark pull request: [SPARK-4431][MLlib] Implement efficient foreac...

2014-11-23 Thread dbtsai
Github user dbtsai closed the pull request at: https://github.com/apache/spark/pull/3288

[GitHub] spark pull request: Minor change in the comment of spark-defaults....

2014-11-24 Thread dbtsai
Github user dbtsai closed the pull request at: https://github.com/apache/spark/pull/2709

[GitHub] spark pull request: [SPARK-4581][MLlib] Refactorize StandardScaler...

2014-11-24 Thread dbtsai
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/3435 [SPARK-4581][MLlib] Refactorize StandardScaler to improve the transformation performance The following optimizations are done to improve the StandardScaler model transformation performance

[GitHub] spark pull request: [SPARK-4581][MLlib] Refactorize StandardScaler...

2014-11-24 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/3435#issuecomment-64304769 @mengxr Without the local reference copy of `factor` and `shift` arrays, the runtime is almost three times slower. DenseVector withMean and withStd

[GitHub] spark pull request: [SPARK-4581][MLlib] Refactorize StandardScaler...

2014-11-24 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/3435#issuecomment-64304881 PS, we may want to go through the mllib codebase, and find things like this. This issue impacts the performance quite a lot.

[GitHub] spark pull request: [SPARK-4581][MLlib] Refactorize StandardScaler...

2014-11-24 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/3435#issuecomment-64308394 Wow, with ```scala private[this] val factor: Array[Double] = { val f = Array.ofDim[Double](variance.size) var i = 0 while (i < f.size

[GitHub] spark pull request: [SPARK-4596][MLLib] Refactorize Normalizer to ...

2014-11-24 Thread dbtsai
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/3446 [SPARK-4596][MLLib] Refactorize Normalizer to make code cleaner In this refactoring, the performance will be slightly increased due to removing the overhead from breeze vector. The bottleneck

[GitHub] spark pull request: [SPARK-4581][MLlib] Refactorize StandardScaler...

2014-11-24 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/3435#discussion_r20847415 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala --- @@ -97,30 +97,57 @@ class StandardScalerModel private[mllib

[GitHub] spark pull request: [SPARK-4596][MLLib] Refactorize Normalizer to ...

2014-11-25 Thread dbtsai
Github user dbtsai closed the pull request at: https://github.com/apache/spark/pull/3446

[GitHub] spark pull request: [SPARK-4581][MLlib] Refactorize StandardScaler...

2014-11-25 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/3435#discussion_r20885451 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala --- @@ -97,30 +97,57 @@ class StandardScalerModel private[mllib

[GitHub] spark pull request: Implement the efficient vector norm

2014-11-25 Thread dbtsai
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/3462 Implement the efficient vector norm The vector norm in breeze is implemented by `activeIterator` which is known to be very slow. In this PR, an efficient vector norm is implemented
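
The snippet above is cut off before the implementation, so the following is only a minimal sketch of the general idea, not the code from the pull request. It assumes the norm can be computed with plain `while` loops over the array of stored values (for a sparse vector, the non-zero values alone are enough, because zeros do not contribute to any p-norm). The object name `NormSketch` and its signature are hypothetical.

```scala
object NormSketch {
  /** p-norm over the stored values of a vector, for p >= 1 (illustrative only). */
  def norm(values: Array[Double], p: Double): Double = {
    require(p >= 1.0, s"p must be >= 1, but got $p")
    val n = values.length
    var i = 0
    if (p == 1.0) {
      // L1 norm: sum of absolute values.
      var sum = 0.0
      while (i < n) { sum += math.abs(values(i)); i += 1 }
      sum
    } else if (p == 2.0) {
      // L2 norm: square root of the sum of squares.
      var sum = 0.0
      while (i < n) { sum += values(i) * values(i); i += 1 }
      math.sqrt(sum)
    } else if (p == Double.PositiveInfinity) {
      // Infinity norm: largest absolute value.
      var max = 0.0
      while (i < n) {
        val a = math.abs(values(i))
        if (a > max) max = a
        i += 1
      }
      max
    } else {
      // General case: (sum of |x|^p)^(1/p).
      var sum = 0.0
      while (i < n) { sum += math.pow(math.abs(values(i)), p); i += 1 }
      math.pow(sum, 1.0 / p)
    }
  }
}
```

Special-casing p = 1, p = 2 and the infinity norm avoids `math.pow` in the most common cases; whether the pull request makes the same choice is not visible from the truncated text.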

[GitHub] spark pull request: Implement the efficient vector norm

2014-11-25 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/3462#issuecomment-64505454 Using `foreachActive` instead of `while loop` DenseVector: 12.95secs SparseVector: 2.89secs ```scala private[spark] def norm(p: Double): Double

[GitHub] spark pull request: [SPARK-4611][MLlib] Implement the efficient ve...

2014-11-25 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/3462#discussion_r20919934 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala --- @@ -85,6 +85,52 @@ sealed trait Vector extends Serializable

[GitHub] spark pull request: [SPARK-4611][MLlib] Implement the efficient ve...

2014-11-26 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/3462#discussion_r20921838 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala --- @@ -261,6 +261,57 @@ object Vectors { sys.error(Unsupported Breeze

[GitHub] spark pull request: [SPARK-4611][MLlib] Implement the efficient ve...

2014-11-26 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/3462#discussion_r20921892 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala --- @@ -261,6 +261,57 @@ object Vectors { sys.error(Unsupported Breeze

[GitHub] spark pull request: [SPARK-4611][MLlib] Implement the efficient ve...

2014-11-26 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/3462#discussion_r20921916 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala --- @@ -261,6 +261,57 @@ object Vectors { sys.error(Unsupported Breeze

[GitHub] spark pull request: [SPARK-4611][MLlib] Implement the efficient ve...

2014-11-26 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/3462#discussion_r20961188 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala --- @@ -261,6 +261,57 @@ object Vectors { sys.error(Unsupported Breeze

[GitHub] spark pull request: [SPARK-4611][MLlib] Implement the efficient ve...

2014-11-26 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/3462#discussion_r20967444 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala --- @@ -261,6 +261,57 @@ object Vectors { sys.error(Unsupported Breeze

[GitHub] spark pull request: [SPARK-4611][MLlib] Implement the efficient ve...

2014-11-26 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/3462#discussion_r20968353 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala --- @@ -261,6 +261,57 @@ object Vectors { sys.error(Unsupported Breeze

[GitHub] spark pull request: [SPARK-4611][MLlib] Implement the efficient ve...

2014-11-26 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/3462#discussion_r20970806 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala --- @@ -261,6 +261,57 @@ object Vectors { sys.error(Unsupported Breeze

[GitHub] spark pull request: [SPARK-1157][MLlib] Bug fix: lossHistory shoul...

2014-04-28 Thread dbtsai
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/582 [SPARK-1157][MLlib] Bug fix: lossHistory should be monotonically decreasing Instead of recording the loss in the costFun each time the optimizer calls costFun, we get the loss from the api

[GitHub] spark pull request: [SPARK-1157][MLlib] Bug fix: lossHistory shoul...

2014-04-29 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/582#issuecomment-41740842 @mengxr Just did some hack on trying to implement the right stochastic L-BFGS, and it kind of works as long as we don't change the objective function

[GitHub] spark pull request: [SPARK-1157][MLlib] Bug fix: lossHistory shoul...

2014-04-29 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/582#issuecomment-41751464 Makes sense from the inverse-Hessian point of view. Just remove it!

[GitHub] spark pull request: [SPARK-1543][MLlib] Add ADMM for solving Lasso...

2014-05-04 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/458#issuecomment-42160096 L-BFGS is not good for L1 problems. I'm working on the BFGS variant OWL-QN for L1 and preparing to benchmark it; it is an ideal candidate to compare with ADMM.

[GitHub] spark pull request: MLlib documentation fix

2014-05-10 Thread dbtsai
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/703 MLlib documentation fix Fixed the documentation to reflect that `loadLibSVMData` has been changed to `loadLibSVMFile`. You can merge this pull request into a Git repository by running: $ git pull https

[GitHub] spark pull request: L-BFGS Documentation

2014-05-10 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/702#discussion_r12502968 --- Diff: docs/mllib-optimization.md --- @@ -163,3 +171,108 @@ each iteration, to compute the gradient direction. Available algorithms for gradient descent

[GitHub] spark pull request: L-BFGS Documentation

2014-05-14 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/702#discussion_r12499609 --- Diff: docs/mllib-optimization.md --- @@ -163,3 +177,100 @@ each iteration, to compute the gradient direction. Available algorithms for gradient descent

[GitHub] spark pull request: L-BFGS Documentation

2014-05-15 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/702#discussion_r12499183 --- Diff: docs/mllib-optimization.md --- @@ -128,10 +128,24 @@ is sampled, i.e. `$|S|=$ miniBatchFraction $\cdot n = 1$`, then the algorithm is standard

[GitHub] spark pull request: L-BFGS Documentation

2014-05-15 Thread dbtsai
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/702 L-BFGS Documentation Documentation for L-BFGS, and an example of training binary L2 logistic regression using L-BFGS. You can merge this pull request into a Git repository by running: $ git

[GitHub] spark pull request: L-BFGS Documentation

2014-05-16 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/702#discussion_r12499273 --- Diff: docs/mllib-optimization.md --- @@ -128,10 +128,24 @@ is sampled, i.e. `$|S|=$ miniBatchFraction $\cdot n = 1$`, then the algorithm is standard

[GitHub] spark pull request: [SPARK-1870][branch-0.9] Jars added by sc.addJ...

2014-05-19 Thread dbtsai
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/834 [SPARK-1870][branch-0.9] Jars added by sc.addJar are not in the default classLoader in executor for YARN The summary is copied from Sandy's comment in the mailing list. The relevant

[GitHub] spark pull request: [SPARK-1870] Make spark-submit --jars work in ...

2014-05-21 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/848#discussion_r12921552 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala --- @@ -479,37 +485,24 @@ object ClientBase

[GitHub] spark pull request: [SPARK-1870] Make spark-submit --jars work in ...

2014-05-21 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/848#discussion_r12921709 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala --- @@ -479,37 +485,24 @@ object ClientBase

[GitHub] spark pull request: [SPARK-1870] Make spark-submit --jars work in ...

2014-05-21 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/848#issuecomment-43812877 Thanks. It looks great to me, and better than my patch. cachedSecondaryJarLinks.foreach(addPwdClasspathEntry) is not needed since we have

[GitHub] spark pull request: [SPARK-1870] Make spark-submit --jars work in ...

2014-05-21 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/848#issuecomment-43814642 It worked in the driver before, so the major issue is that those files are not in the executor's distributed cache. But I like the idea to add them explicitly so we'll not miss

[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-03 Thread dbtsai
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/955 [SPARK-1969][MLlib] Public available online summarizer for mean, variance, min, and max It basically moved the private ColumnStatisticsAggregator class from RowMatrix to public available

[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-03 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/955#issuecomment-45023171 Since the Statistical in MultivariateStatisticalSummary is already in the package name as stat, I think it's worth having a concise name. Also, most people spell

[GitHub] spark pull request: [SPARK-1870][branch-0.9] Jars added by sc.addJ...

2014-06-03 Thread dbtsai
Github user dbtsai closed the pull request at: https://github.com/apache/spark/pull/834

[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-03 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/955#issuecomment-45026777 Don't know why jenkins is not happy with removing private class ColumnStatisticsAggregator(private val n: Int). After all, it's a private class.

[GitHub] spark pull request: Fixed a typo

2014-06-03 Thread dbtsai
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/959 Fixed a typo in RowMatrix.scala You can merge this pull request into a Git repository by running: $ git pull https://github.com/dbtsai/spark dbtsai-typo Alternatively you can review and apply

[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-04 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/955#issuecomment-45124672 @mengxr Got it. It's a false-positive error. Do you have any comments or feedback on moving it out as a public API? I'm building a feature scaling API in MlUtils which depends

[GitHub] spark pull request: [SPARK-1177] Allow SPARK_JAR to be set program...

2014-06-05 Thread dbtsai
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/987 [SPARK-1177] Allow SPARK_JAR to be set programmatically in system properties You can merge this pull request into a Git repository by running: $ git pull https://github.com/dbtsai/spark dbtsai

[GitHub] spark pull request: [SPARK-1177] Allow SPARK_JAR to be set program...

2014-06-05 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/987#issuecomment-45286460 @chesterxgchen #560 Agree, it's a more thorough way to handle this issue. In the code you have, it seems that the spark jar setting is moved to conf: SparkConf

[GitHub] spark pull request: [SPARK-1177] Allow SPARK_JAR to be set program...

2014-06-05 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/987#issuecomment-45292804 The app's code will only run in the application master in yarn-cluster mode, how can yarn client know which jar will be submitted to distributed cache if we set

[GitHub] spark pull request: [SPARK-1177] Allow SPARK_JAR to be set program...

2014-06-05 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/987#issuecomment-45296471 We launched a Spark job inside our Tomcat, and we directly use the Client.scala API. With my patch, I can set up the spark jar using System.setProperty() before val

[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-05 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/955#issuecomment-45297396 k... better to have Mima exclude the private class automatically, or we can have annotation for the private class.

[GitHub] spark pull request: [SPARK-1177] Allow SPARK_JAR to be set program...

2014-06-06 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/987#issuecomment-45363846 Got you. Looking forward to having your patch merged. Thanks.

[GitHub] spark pull request: [SPARK-1870] Ported from 1.0 branch to 0.9 bra...

2014-06-08 Thread dbtsai
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/1013 [SPARK-1870] Ported from 1.0 branch to 0.9 branch. Made deployment with --jars work in yarn-standalone mode. Sent secondary jars to distributed cache of all containers and add the cached jars

[GitHub] spark pull request: [SPARK-1870] Ported from 1.0 branch to 0.9 bra...

2014-06-08 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1013#issuecomment-45451719 CC: @mengxr and @sryza

[GitHub] spark pull request: [SPARK-1870] Ported from 1.0 branch to 0.9 bra...

2014-06-08 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1013#issuecomment-45459920 Works in my local VM. Should work in a real YARN cluster. Will test it tomorrow in the office.

[GitHub] spark pull request: [SPARK-4611][MLlib] Implement the efficient ve...

2014-12-01 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/3462#discussion_r21076434 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala --- @@ -261,6 +261,57 @@ object Vectors { sys.error(Unsupported Breeze

[GitHub] spark pull request: [SPARK-4708][MLLib] Make k-mean runs two/three...

2014-12-02 Thread dbtsai
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/3565 [SPARK-4708][MLLib] Make k-mean runs two/three times faster with dense/sparse sample Note that the usage of `breezeSquaredDistance` in `org.apache.spark.mllib.util.MLUtils.fastSquaredDistance

[GitHub] spark pull request: [SPARK-4708][MLLib] Make k-mean runs two/three...

2014-12-02 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/3565#issuecomment-65340272 Calling BLAS will add a very small extra overhead. The benchmark will now be DenseVector: 33.19secs SparseVector: 22.05secs

[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-02 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1379#issuecomment-65340600 @avulanov Sure, it's interesting to see the comparison. Let me know the result once you have it. I'm going to get it merged in 1.3, so it will be easier to use

[GitHub] spark pull request: [SPARK-4717][MLlib] Optimize BLAS library to a...

2014-12-03 Thread dbtsai
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/3577 [SPARK-4717][MLlib] Optimize BLAS library to avoid de-reference multiple times in loop Have a local reference to `values` and `indices` array in the `Vector` object so JVM can locate the value
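
A minimal sketch of the pattern described above, assuming a simplified stand-in class rather than MLlib's real `Vector` types: the names `SimpleDenseVector`, `LocalRefSketch`, `dotViaField` and `dotViaLocals` are hypothetical and exist only for illustration. The second method hoists the array references and the length into local `val`s before the loop, which is the change the description refers to; any actual speed-up depends on the JVM and should be measured rather than assumed.

```scala
// Hypothetical stand-in for a dense vector; not the MLlib class.
final class SimpleDenseVector(val values: Array[Double])

object LocalRefSketch {
  // Goes through the accessor x.values (and y.values) on every iteration.
  def dotViaField(x: SimpleDenseVector, y: SimpleDenseVector): Double = {
    var sum = 0.0
    var i = 0
    while (i < x.values.length) {
      sum += x.values(i) * y.values(i)
      i += 1
    }
    sum
  }

  // Takes local references to the arrays and the length once, before the loop.
  def dotViaLocals(x: SimpleDenseVector, y: SimpleDenseVector): Double = {
    val xs = x.values
    val ys = y.values
    val n = xs.length
    var sum = 0.0
    var i = 0
    while (i < n) {
      sum += xs(i) * ys(i)
      i += 1
    }
    sum
  }
}
```

Both methods return the same result; only the way the arrays are reached inside the loop differs.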

[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-08 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1379#issuecomment-66192930 @avulanov I did a couple of performance tunings in the MLOR gradient calculation in my company's proprietary implementation, which make it 4x faster than the open source one

[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-09 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1379#issuecomment-66336110 @avulanov 1. I did the same optimization for MLlib in [my recent PRs](https://github.com/apache/spark/commits/master?author=dbtsai). * Accessing

[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-10 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1379#issuecomment-66513731 @avulanov I remembered CJ Lin said he posted the 600GB dataset on his website.

[GitHub] spark pull request: [SPARK-4887][MLlib] Fix a bad unittest in Logi...

2014-12-18 Thread dbtsai
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/3735 [SPARK-4887][MLlib] Fix a bad unittest in LogisticRegressionSuite The original test doesn't make sense since if you step in, the lossSum is already NaN, and the coefficients are diverging

[GitHub] spark pull request: [SPARK-4887][MLlib] Fix a bad unittest in Logi...

2014-12-18 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/3735#issuecomment-67562831 I agree. The test is not good. I think we can probably add a couple of well-known datasets, such as the iris or prostate cancer datasets, to the test resources, and we can compare

[GitHub] spark pull request: [SPARK-4907][MLlib] Inconsistent loss and grad...

2014-12-19 Thread dbtsai
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/3746 [SPARK-4907][MLlib] Inconsistent loss and gradient in LeastSquaresGradient compared with R In most academic papers and algorithm implementations, people use L = 1/(2n) ||A weights - y||^2
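
As a worked example of the convention quoted in this PR description, the loss L = 1/(2n) ||A w - y||^2 has gradient (1/n) A^T (A w - y). The sketch below computes both in plain Python; the function name `least_squares` and the tiny dataset are illustrative assumptions, not MLlib's `LeastSquaresGradient`.

```python
from typing import List, Tuple


def least_squares(A: List[List[float]], w: List[float],
                  y: List[float]) -> Tuple[float, List[float]]:
    """Return (loss, gradient) for L = 1/(2n) * ||A w - y||^2."""
    n = len(A)
    # Residuals r_i = a_i . w - y_i for each row a_i of A.
    residuals = [sum(a_ij * w_j for a_ij, w_j in zip(row, w)) - y_i
                 for row, y_i in zip(A, y)]
    loss = sum(r * r for r in residuals) / (2.0 * n)
    # Gradient is (1/n) * A^T r.
    grad = [sum(row[j] * r for row, r in zip(A, residuals)) / n
            for j in range(len(w))]
    return loss, grad


A = [[1.0, 0.0], [0.0, 1.0]]
loss, grad = least_squares(A, w=[1.0, 1.0], y=[0.0, 3.0])
```

With the 1/2 in the loss, the 2 from differentiating the square cancels, so the gradient carries no extra constant factor.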

[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-19 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1379#issuecomment-67694284 @avulanov I haven't checked your implementation yet, but I have the optimized MLOR ready for you to test. Can you try the `LogisticGradient` in https://github.com

[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-19 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1379#issuecomment-67716565 @avulanov PS, you can just replace the gradient function without making any other change. Let me know how much performance gain you see; I'm very interested in this. Thanks

[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-19 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1379#issuecomment-67718128 Yes, `foreachActive` is the new API in Spark 1.2.

[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-19 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1379#issuecomment-67720689 @avulanov The new branch is not finished yet. You need to rebase https://github.com/dbtsai/spark/tree/dbtsai-mlor onto master, and just replace the gradient function

[GitHub] spark pull request: [SPARK-4907][MLlib] Inconsistent loss and grad...

2014-12-22 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/3746#issuecomment-67842962 @bryanyang0528 The learning rate issue here is a different story. With modern optimization algorithms like LBFGS and OWLQN, the learning rate is not required

[GitHub] spark pull request: [SPARK-2505][MLlib] Weighted Regularizer for G...

2014-12-22 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/1518#discussion_r22173571 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/Regularizer.scala --- @@ -0,0 +1,140 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-23 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1379#issuecomment-68029618 @avulanov That's a very encouraging benchmark result you saw in a real-world cluster setup. Since I've been on vacation recently, I haven't actually deployed the new code and benchmarked

[GitHub] spark pull request: [SPARK-4972][MLlib] Updated the scala doc for ...

2014-12-26 Thread dbtsai
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/3808 [SPARK-4972][MLlib] Updated the scala doc for lasso and ridge regression for the change of LeastSquaresGradient In #SPARK-4907, we added a factor of 2 to the LeastSquaresGradient. We updated

[GitHub] spark pull request: [SPARK-2309][MLlib] Multinomial Logistic Regre...

2014-12-29 Thread dbtsai
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/3833 [SPARK-2309][MLlib] Multinomial Logistic Regression #1379 was automatically closed by asfgit, and GitHub cannot reopen it once it's closed, so this will be the new PR. Binary Logistic

[GitHub] spark pull request: [Spark-4995] Replace Vector.toBreeze.activeIte...

2014-12-30 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/3846#issuecomment-68397022 LGTM

[GitHub] spark pull request: [SPARK-5207] [MLLIB] StandardScalerModel mean ...

2015-01-23 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/4140#discussion_r23486231 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala --- @@ -61,20 +61,30 @@ class StandardScaler(withMean: Boolean, withStd

[GitHub] spark pull request: [SPARK-5207] [MLLIB] StandardScalerModel mean ...

2015-01-23 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/4140#issuecomment-71281849 For the unit-test part, is it possible not to change too much? Also, it will be easier to debug if the assertion is in the test instead of being abstracted out. For example

[GitHub] spark pull request: [SPARK-5207] [MLLIB] StandardScalerModel mean ...

2015-01-23 Thread dbtsai
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/4140#discussion_r23485163 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala --- @@ -61,20 +61,30 @@ class StandardScaler(withMean: Boolean, withStd
