[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2015-01-07 Thread avulanov
Github user avulanov commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-69123928
  
@dbtsai BTW, have you thought about batch processing of input vectors,
i.e. stacking N vectors into a matrix and performing the optimization on this
matrix instead of on individual vectors? With native BLAS enabled this might
improve performance.
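
As a rough illustration of the batching idea, the sketch below stacks N input vectors as rows of a matrix and computes all class margins as one matrix product instead of N separate passes. It is plain Scala, and the names (`BatchMargins`, `margins`, `batchMargins`) are hypothetical. A real implementation would presumably hand the stacked matrix to a native BLAS `gemm` call, which is where any speedup would come from; this sketch only shows that both forms give the same numbers.

```scala
object BatchMargins {
  // Dot product of two equally sized arrays.
  def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length)
    var s = 0.0
    var i = 0
    while (i < a.length) { s += a(i) * b(i); i += 1 }
    s
  }

  // One vector at a time: margins(k) = w_k . x
  def margins(weights: Array[Array[Double]], x: Array[Double]): Array[Double] =
    weights.map(w => dot(w, x))

  // Batched: rows of `batch` are input vectors and result(n)(k) = w_k . x_n,
  // i.e. the matrix product batch * weights^T.
  def batchMargins(
      weights: Array[Array[Double]],
      batch: Array[Array[Double]]): Array[Array[Double]] = {
    val out = Array.ofDim[Double](batch.length, weights.length)
    var n = 0
    while (n < batch.length) {
      var k = 0
      while (k < weights.length) {
        out(n)(k) = dot(weights(k), batch(n))
        k += 1
      }
      n += 1
    }
    out
  }
}
```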
 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2015-01-07 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-69125554
  
@avulanov  I've thought about that. However, @mengxr told me that they
had an intern try this type of experiment last year, and they didn't see a
significant performance gain. I'm thinking of implementing the whole gradient
function in native code/SIMD by batching the input vectors into a matrix,
since for MLOR the computation of the objective function is very expensive.





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2015-01-26 Thread avulanov
Github user avulanov commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-71531266
  
@dbtsai I did batching for artificial neural networks and the performance 
improved ~5x https://github.com/apache/spark/pull/1290#issuecomment-70313952





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-07-11 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/1379

[SPARK-2309][MLlib] Generalize the binary logistic regression into 
multinomial logistic regression

Currently, there is no multi-class classifier in MLlib. Logistic regression
can be extended to a multinomial classifier straightforwardly.
The following formula will be implemented:
http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297/25

Note: In multi-class mode, there will be multiple intercepts, so we
don't use the single intercept in `GeneralizedLinearModel`, and instead fold
all the intercepts into the weights. This introduces some inconsistency. For
example, in binary mode the intercept cannot be specified by users, but in
multinomial mode the intercepts are combined into the weights, so users can
specify them.
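
To illustrate what folding an intercept into the weights can look like, here is a minimal sketch in plain Scala. The names (`InterceptFolding`, `appendBias`, `marginFolded`) and the choice to store the intercept as the last weight are assumptions made for this example, not necessarily the layout used in this PR.

```scala
object InterceptFolding {
  // Append a constant 1.0 feature so the intercept becomes an ordinary weight.
  def appendBias(features: Array[Double]): Array[Double] = features :+ 1.0

  // Margin w . x + b computed with an explicit, separate intercept.
  def marginWithIntercept(w: Array[Double], b: Double, x: Array[Double]): Double =
    w.zip(x).map { case (wi, xi) => wi * xi }.sum + b

  // The same margin with the intercept stored as the last entry of the weights.
  def marginFolded(wWithBias: Array[Double], x: Array[Double]): Double =
    wWithBias.zip(appendBias(x)).map { case (wi, xi) => wi * xi }.sum
}
```

With K classes, each class's weight block would carry its own trailing intercept entry in the same way.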

@mengxr Should we just deprecate the intercept, and have everything in the
weights? It makes sense from an optimization point of view, and it also makes
the interface cleaner. Thanks.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark dbtsai-mlor

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1379.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1379


commit 82dae74135bafa5d1adeef4b2b421693c05b2778
Author: DB Tsai 
Date:   2014-06-27T21:47:15Z

Multinomial Logistic Regression






[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-07-11 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-48796056
  
QA results for PR 1379:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  * as used in multi-class classification (it is also used in binary
    logistic regression).

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16579/consoleFull




[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-07-11 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-48796052
  
QA tests have started for PR 1379. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16579/consoleFull




[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-07-21 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-49681981
  
Jenkins, retest this please.




[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-07-21 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-49682150
  
I think it fails because the Apache license header is not in the test file.
As you suggested, I'll change it so the data is generated at runtime. I would
like to hear general feedback. I'll make the tests pass tomorrow.




[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-49682455
  
QA results for PR 1379:
- This patch FAILED unit tests.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16937/consoleFull




[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-49682447
  
QA tests have started for PR 1379. This patch DID NOT merge cleanly! 
View progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16937/consoleFull




[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-07-21 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-49682997
  
It is easier to review if it passes the tests. @SparkQA shows new public 
classes and interface changes. Could you remove the data file and generate some 
synthetic data for unit tests? Thanks!




[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-08-02 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1379





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-08-02 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-50982699
  
@mengxr  Is there any problem with asfgit? This is not finished yet, so why
did asfgit say it's merged into apache:master?





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-08-02 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-50983381
  
... I have no idea. Let me check.





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-08-02 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-50983421
  
@pwendell I didn't see `Closes #1379` in the merged commit. Is something 
wrong with asfgit?





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-10-28 Thread BigCrunsh
Github user BigCrunsh commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-60792386
  
What is the current state of the PR? Can't see any changes in the code...





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-10-28 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-60813678
  
@BigCrunsh I'm working on this. Let's see if we can merge in Spark 1.2





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-11-18 Thread avulanov
Github user avulanov commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-63577201
  
@dbtsai Hi! What is the current state of the PR? I would like to download and
test it. Could you point me to the sources?





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-11-19 Thread avulanov
Github user avulanov commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-63748972
  
Apparently, I've found this implementation
https://github.com/dbtsai/spark/tree/dbtsai-mlor. It worked on my examples,
producing reasonable results. Could you comment on the following? Why is the
number of parameters (weights) equal to (num_features + 1)*(num_classes-1)?
I would expect (num_features + 1)*(num_classes), as it is here for example:
http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-11-20 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-63904113
  
@avulanov I will merge this in Spark 1.3; sorry for the delay, I have been
very busy recently. Yes, the branch you found should work, but it cannot be
cleanly merged upstream, and I'm working on it. You can try that branch for
now. Also, the branch doesn't use LBFGS as the optimizer, so the convergence
rate will be slow.

Basically, you can model the whole problem using (num_features +
1)(num_classes) parameters, but the solution will not be unique. You can choose
one of the classes as the base class to make the solution unique, and I chose
the first class as the base class. See `Properties of softmax regression
parameterization` in the wiki page you refer to, or my presentation
http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297 for more technical
detail. Think about binary logistic regression: you only have
(num_features + 1) coefficients instead of 2 * (num_features + 1).
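
The non-uniqueness of the full (num_classes)-block parameterization can be checked numerically. The following is a minimal sketch in plain Scala with hypothetical names (`SoftmaxShift`, `probabilities`, `shift`): subtracting the same vector from every class's weights leaves the softmax probabilities unchanged, so one class's weights can always be fixed to zero.

```scala
object SoftmaxShift {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  // Softmax probabilities for a full set of per-class weight vectors.
  def probabilities(weights: Array[Array[Double]], x: Array[Double]): Array[Double] = {
    val margins = weights.map(dot(_, x))
    val maxMargin = margins.max // subtract the max for numerical stability
    val exps = margins.map(m => math.exp(m - maxMargin))
    val total = exps.sum
    exps.map(_ / total)
  }

  // Subtract the same vector `psi` from every class's weights.
  def shift(weights: Array[Array[Double]], psi: Array[Double]): Array[Array[Double]] =
    weights.map(w => w.zip(psi).map { case (wi, p) => wi - p })
}
```

Shifting by the first class's own weights, `SoftmaxShift.shift(w, w(0))`, zeroes out that class's block and leaves num_classes - 1 free blocks, which corresponds to choosing the first class as the base class.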





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-11-20 Thread avulanov
Github user avulanov commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-63906173
  
@dbtsai Thanks for the explanation! Do I understand correctly that if I want
to get (num_features+1)*(num_classes) parameters from your model, I need to
prepend a vector of (num_features+1) zeros to the vector that your model
returns with `model.weights`?





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-11-20 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-63906768
  
No, in the algorithm I already model the problem as in
http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297/24 , so there will
always be only (num_features + 1)(num_classes-1) parameters. Of course, you can
choose any transformation to make it over-parameterized; see the `Properties of
softmax regression parameterization` section in the wiki for details.
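
For readers who want to see the (num_classes-1)-block model concretely, here is a small sketch in plain Scala. The object name `PivotSoftmax` and the flat block layout are assumptions for illustration (the features are taken to already include the constant 1 for the intercept); it is not the code from the branch. The base class has implicit zero weights, so its unnormalized score is exp(0) = 1.

```scala
object PivotSoftmax {
  // `weights` holds (numClasses - 1) consecutive blocks of length x.length,
  // one per non-base class. No overflow protection: this is only a sketch.
  def probabilities(weights: Array[Double], x: Array[Double]): Array[Double] = {
    require(weights.length % x.length == 0)
    val n = weights.length / x.length // numClasses - 1
    val margins = Array.tabulate(n) { i =>
      var sum = 0.0
      var j = 0
      while (j < x.length) { sum += x(j) * weights(i * x.length + j); j += 1 }
      sum
    }
    val denominator = 1.0 + margins.map(math.exp).sum
    // Base class first, then the n explicitly modeled classes.
    (1.0 +: margins.map(math.exp)).map(_ / denominator)
  }
}
```

With a single block this reduces to binary logistic regression: the probability of the non-base class is the sigmoid of the margin.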





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-02 Thread avulanov
Github user avulanov commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-65339343
  
@dbtsai I've tried your implementation with the `LBFGS` optimizer and it seems
to have similar performance, in terms of running time and accuracy, to the SGD
that you have right now. Do you think it is worth testing it against our
implementation of an artificial neural network with no hidden layer,
https://github.com/apache/spark/pull/1290? It uses a different cost function
but it still might be interesting to compare.





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-02 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-65340600
  
@avulanov Sure, it's interesting to see the comparison. Let me know the
result once you have it. I'm going to get it merged in 1.3, so it will be
easier to use in the future.





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-05 Thread avulanov
Github user avulanov commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-65879536
  
@dbtsai Here are the results of my tests:
- Settings:
 - Spark: latest Spark merged with 
https://github.com/dbtsai/spark/tree/dbtsai-mlor (manual merge) and 
https://github.com/avulanov/spark/tree/annclassifier. Optimizer in MLOR was 
changed to LBFGS to make a correct comparison with ANN which uses LBFGS.
 - Hadoop 1.2.1, dataset is loaded from hdfs
 - Cluster: 6 machines Xeon 3.3GHz, 16GB RAM, each machine has 2 Spark
Workers with maximum 8GB of RAM and 2GB used, total 16 workers
 - Dataset: mnist8m; classes: 10;  data: 8,100,000 instances; features: 
784; random split 99% train, 1% test
 - Link to the dataset: 
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2
 - Learning settings: 40 iterations, tolerance=1e-4 (both); ANN 
classifier: hidden layer `Array[Int]()` (no hidden layer - the same as 
regression)
- Result
 - ANN classifier: training time: 00:47:55; accuracy: 0.848
 - MLOR: training time: 01:30:45; accuracy: 0.864
- Average gradient compute time (`mapPartitionsWithIndex at 
RDDFunctions.scala:108`)
 - ANN classifier: 51 seconds
 - MLOR: 2.1 minutes
- Average update time (`reduce at RDDFunctions.scala:112`)
 - ANN classifier: 90 ms
 - MLOR: 90 ms

It seems that ANN is almost 2x faster (with the mentioned settings), though
its accuracy is 1.6 percentage points lower. The difference in accuracy can be
explained by the fact that ANN uses a (half) squared error cost function
instead of cross entropy, and no softmax. Cross entropy and softmax are
supposed to be better for classification.
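
The "almost 2x" figure can be re-derived from the numbers reported above. The short sketch below only redoes that arithmetic; the names (`BenchmarkRatios`, `toSeconds`) are introduced here for illustration.

```scala
object BenchmarkRatios {
  // Convert an hh:mm:ss string to seconds.
  def toSeconds(hms: String): Int = {
    val Array(h, m, s) = hms.split(":").map(_.toInt)
    h * 3600 + m * 60 + s
  }

  val annTrain: Int = toSeconds("00:47:55")  // 2875 s
  val mlorTrain: Int = toSeconds("01:30:45") // 5445 s

  // Ratio of total training times: about 1.89.
  val trainSpeedup: Double = mlorTrain.toDouble / annTrain

  // Ratio of average gradient compute times, 2.1 min vs 51 s: about 2.47.
  val gradientSpeedup: Double = (2.1 * 60) / 51.0
}
```

The per-gradient gap is larger than the end-to-end gap, which suggests the difference sits in the gradient computation rather than in the update step (90 ms for both).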





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-08 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-66192930
  
@avulanov I did a couple of performance tunings in the MLOR gradient
calculation in my company's proprietary implementation, which make it 4x faster
than the open source one on GitHub that you tested. I'm trying to open-source
it and merge it into Spark soon. (PS: simple polynomial expansion with MLOR can
increase the mnist8m accuracy from 86% to 94% in my experiment. See Prof. CJ
Lin's talk - https://www.youtube.com/watch?v=GCIJP0cLSmU )
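
As an illustration of what a simple polynomial expansion could look like, the sketch below maps a feature vector to its original entries plus all pairwise products. This is an assumption about the kind of expansion meant: the comment does not state the degree or the exact mapping used in the experiment, and the names (`PolyExpansion`, `expandDegree2`) are hypothetical.

```scala
object PolyExpansion {
  // Degree-2 expansion: the original features followed by all products
  // x_i * x_j with i <= j.
  def expandDegree2(x: Array[Double]): Array[Double] = {
    val products = for {
      i <- x.indices
      j <- i until x.length
    } yield x(i) * x(j)
    x ++ products
  }
}
```

A dense expansion like this grows quadratically: d features become d + d(d+1)/2, so 784 features would become 308,504. In practice the input would usually be kept sparse.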





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-08 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-66208868
  
@avulanov  Nice tests!  A few comments:
* Computing accuracy: It would be good to test on the original MNIST test 
set, rather than a subset of the training set.  The training set includes a 
bunch of duplicates of images with slight modifications, so results on it might 
be misleading.
* The timing tests look pretty convincing for ANN!  Can you please confirm 
whether both algorithms did all 40 iterations?  Or did they sometimes stop 
early b/c of the convergence tolerance?






[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-09 Thread avulanov
Github user avulanov commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-66318526
  
@dbtsai 1) Could you elaborate on what kind of optimizations you did?
They could probably be applied to the broader MLlib, which would be beneficial.
2) Do you know the reason why our ANN implementation worked faster than the
MLOR you shared? This could also be interesting in terms of MLlib optimization.
3) Did you mean fitting an n-th degree polynomial instead of a linear function?
Thanks for the link, it seems very interesting!





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-09 Thread avulanov
Github user avulanov commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-66320878
  
@jkbradley Thank you! They took some time. 
   - I totally agree with you, I need to perform tests on the original test
set. It contains fewer attributes, i.e. 778 vs 784 in mnist8m, so one needs to
add zeros to it to make it compatible.
   - They both did all 40 iterations.





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-09 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-66336110
  
@avulanov 

1. I did the same optimization for MLlib in [my recent
PRs](https://github.com/apache/spark/commits/master?author=dbtsai).

* Accessing the values in a dense/sparse vector directly is very slow without
having a local reference to the primitive array, due to the dereference. See
#3577 and #3435. There is a bytecode analysis of this issue in #3435.
* Breeze's foreachActive is very slow, so I implemented a 4x faster version
in #3288. My experience is that one has to be cautious when Breeze is used in
a critical code path.

2. I haven't checked out your ANN implementation yet, but I will check today.
I'll send you our optimized gradient computation code for MLOR. It will be
interesting to see the new benchmark compared with the one you tested.

3. See page 27 of Prof. CJ Lin's slides,
http://www.csie.ntu.edu.tw/~cjlin/talks/SFmeetup.pdf. It's just doing
feature expansion by mapping the data into a higher-dimensional space.
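
The first optimization in point 1 (keeping a local reference to the primitive array) can be sketched as follows. `DenseVec` is a hypothetical stand-in for a dense vector class, not MLlib's `DenseVector`. Both methods compute the same dot product; the second takes the array references once, outside the loop. Whether and how much this helps depends on the JVM and JIT, so treat it as an illustration of the pattern rather than a benchmark.

```scala
final class DenseVec(val values: Array[Double]) {
  def size: Int = values.length
  def apply(i: Int): Double = values(i)
}

object LocalArrayRef {
  // Goes through the vector's accessors on every iteration.
  def dotViaAccessor(a: DenseVec, b: DenseVec): Double = {
    var sum = 0.0
    var i = 0
    while (i < a.size) { sum += a(i) * b(i); i += 1 }
    sum
  }

  // Takes local references to the primitive arrays once, then indexes them
  // directly inside the loop.
  def dotViaLocalArrays(a: DenseVec, b: DenseVec): Double = {
    val av = a.values
    val bv = b.values
    val n = av.length
    var sum = 0.0
    var i = 0
    while (i < n) { sum += av(i) * bv(i); i += 1 }
    sum
  }
}
```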





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-10 Thread avulanov
Github user avulanov commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-66490270
  
@dbtsai Thank you, I look forward to your code so I can run benchmarks.
Thanks again for the video! I enjoyed it, especially the Q&A after the talk. At
51:23 Prof. CJ Lin mentions that "we released dataset of about 600 Gigabytes".
Do you know where I can download it? It should be quite a challenging workload
for classification in Spark!





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-10 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-66513731
  
@avulanov I remember CJ Lin saying he posted the 600GB dataset on his
website.





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-16 Thread avulanov
Github user avulanov commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-67257821
  
@dbtsai Hi! Did you have a chance to check our implementation and send me 
the optimized one?





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-19 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-67694284
  
@avulanov I haven't checked your implementation yet, but I have the 
optimized MLOR ready for you to test. Can you try the `LogisticGradient` in 
https://github.com/AlpineNow/spark/commits/mlor

```scala
import org.apache.spark.annotation.DeveloperApi
import org.apache.spark.mllib.linalg.{DenseVector, Vector, Vectors}
import org.apache.spark.mllib.optimization.Gradient

@DeveloperApi
class LogisticGradient extends Gradient {
  override def compute(data: Vector, label: Double, weights: Vector): (Vector, Double) = {
    val gradient = Vectors.zeros(weights.size)
    val loss = compute(data, label, weights, gradient)
    (gradient, loss)
  }

  override def compute(
      data: Vector,
      label: Double,
      weights: Vector,
      cumGradient: Vector): Double = {
    assert((weights.size % data.size) == 0)
    val dataSize = data.size
    // (n + 1) is number of classes
    val n = weights.size / dataSize
    val numerators = Array.ofDim[Double](n)

    var denominator = 0.0
    var margin = 0.0

    // Work on the raw arrays to avoid per-element vector overhead.
    val weightsArray = weights match {
      case dv: DenseVector => dv.values
      case _ =>
        throw new IllegalArgumentException(
          s"weights only supports dense vector but got type ${weights.getClass}.")
    }
    val cumGradientArray = cumGradient match {
      case dv: DenseVector => dv.values
      case _ =>
        throw new IllegalArgumentException(
          s"cumGradient only supports dense vector but got type ${cumGradient.getClass}.")
    }

    // First pass: margin of each non-pivot class and the softmax denominator.
    var i = 0
    while (i < n) {
      var sum = 0.0
      data.foreachActive { (index, value) =>
        if (value != 0.0) sum += value * weightsArray((i * dataSize) + index)
      }
      if (i == label.toInt - 1) margin = sum
      numerators(i) = math.exp(sum)
      denominator += numerators(i)
      i += 1
    }

    // Second pass: accumulate the gradient in place.
    i = 0
    while (i < n) {
      val multiplier = numerators(i) / (denominator + 1.0) - {
        if (label != 0.0 && label == i + 1) 1.0 else 0.0
      }
      data.foreachActive { (index, value) =>
        if (value != 0.0) cumGradientArray(i * dataSize + index) += multiplier * value
      }
      i += 1
    }

    // Loss: the pivot class (label 0) has zero margin.
    if (label > 0.0) {
      math.log1p(denominator) - margin
    } else {
      math.log1p(denominator)
    }
  }
}
```
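
For reference, the quantities computed by the second `compute` method can be written out as below. This is a reading of the code above rather than a statement from the author: the symbols $K$, $x$, $y$, $w_i$ and $m_i$ are notation introduced here, with $K = n + 1$ classes, class 0 taken as the pivot class whose weights are implicitly zero, and $w_i$ the slice of `weights` belonging to class $i$.

```latex
\begin{aligned}
m_i &= w_i^{\top} x, \qquad i = 1, \dots, K-1, \\
\ell(w; x, y) &= \log\Bigl(1 + \sum_{j=1}^{K-1} e^{m_j}\Bigr) - \mathbb{1}[y > 0]\, m_y, \\
\frac{\partial \ell}{\partial w_i} &= \Bigl(\frac{e^{m_i}}{1 + \sum_{j=1}^{K-1} e^{m_j}} - \mathbb{1}[y = i]\Bigr)\, x .
\end{aligned}
```

In the code, `numerators(i)` holds $e^{m_{i+1}}$, `denominator` holds the sum over $j$, `margin` holds $m_y$, and `multiplier` is the bracketed factor in the last line.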





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-19 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-67716565
  
@avulanov PS: you can just replace the gradient function without making any 
other changes. Let me know how much performance gain you see; I'm very 
interested in the result. Thanks.





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-19 Thread avulanov
Github user avulanov commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-67718021
  
@dbtsai Thank you! Should I use the latest Spark with this Gradient?





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-19 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-67718128
  
Yes, `foreachActive` is a new API in Spark 1.2.





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-19 Thread avulanov
Github user avulanov commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-67719973
  
@dbtsai `GeneralizedLinearAlgorithm` throws the exception 
`org.apache.spark.SparkException: Input validation failed.`. Moreover, there is 
no regression with LBFGS. I probably need to use some of your other files, as I 
did before. Should I clone https://github.com/AlpineNow/spark/commits/mlor 
and merge it with the latest Spark?





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-19 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-67720689
  
@avulanov The new branch is not finished yet. You need to rebase 
https://github.com/dbtsai/spark/tree/dbtsai-mlor onto master, and then just 
replace the gradient function. 





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-22 Thread avulanov
Github user avulanov commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-67872100
  
@dbtsai I ran a local experiment on mnist, and your new implementation seems 
to be more than 2x faster than the previous one! I am going to run bigger 
experiments. In the meantime, could you tell me whether the optimizations you 
made are applicable to the ANN Gradient? That would be extremely helpful for us. 
https://github.com/bgreeven/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/ann/ArtificialNeuralNetwork.scala#L467





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-23 Thread avulanov
Github user avulanov commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-68002991
  
New results of experiments with the optimized ANN and MLOR are below. I used 
the same cluster of 6 machines with 12 workers in total, the mnist8m dataset for 
training, and the standard mnist test set converted to 784 attributes.
  - Results:
    - ANN classifier: training time: 00:16:58 (was 00:47:55); accuracy: 0.9021
    - MLOR: training time: 00:09:46 (was 01:30:45); accuracy: 0.9084
  - Average step time (reduce at RDDFunctions.scala:112):
    - ANN classifier: 23 seconds (was 51 seconds)
    - MLOR: 14 seconds (was 2.1 minutes)

The ANN became ~3x and MLOR ~10x faster (!) than before. The current MLOR 
is ~60% faster than the current ANN. I assume that the ANN has the following 
overheads: 1) it uses back-propagation, so there are two matrix-vector 
multiplications, one on the forward and one on the backward pass; 2) it rolls 
the parameters stored in matrices into vector form. I will be happy to know how 
these overheads can be reduced. We can't compare with the previously obtained 
accuracy because I used different test sets. 
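
As a quick cross-check of the rounded speedup figures, the ratios can be recomputed from the timings reported in this comment. The sketch below is only arithmetic on those numbers; the object and helper names (`SpeedupCheck`, `toSeconds`, `speedup`) are made up for illustration.

```scala
object SpeedupCheck {
  // Convert an "hh:mm:ss" duration into seconds.
  def toSeconds(hms: String): Int = {
    val parts = hms.split(":").map(_.toInt)
    parts(0) * 3600 + parts(1) * 60 + parts(2)
  }

  // How many times faster the new run is than the old one.
  def speedup(oldSeconds: Double, newSeconds: Double): Double =
    oldSeconds / newSeconds

  def main(args: Array[String]): Unit = {
    // Total training time, old vs. new (values taken from the comment).
    val annTotal = speedup(toSeconds("00:47:55"), toSeconds("00:16:58"))
    val mlorTotal = speedup(toSeconds("01:30:45"), toSeconds("00:09:46"))
    // Average step time, old vs. new ("2.1 mins" is itself a rounded value).
    val annStep = speedup(51.0, 23.0)
    val mlorStep = speedup(2.1 * 60, 14.0)
    // Current ANN step time relative to current MLOR step time.
    val annOverMlor = 23.0 / 14.0

    println(f"ANN total speedup:  $annTotal%.2fx")
    println(f"MLOR total speedup: $mlorTotal%.2fx")
    println(f"ANN step speedup:   $annStep%.2fx")
    println(f"MLOR step speedup:  $mlorStep%.2fx")
    println(f"ANN step / MLOR step: $annOverMlor%.2f")
  }
}
```

The total training times give roughly 2.8x for ANN and 9.3x for MLOR, and the step times give roughly 2.2x and 9.0x. The ANN step takes about 1.64 times as long as the MLOR step, which is where a figure of around 60% comes from.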





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-23 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-68029618
  
@avulanov It's a very encouraging benchmark result, especially since you saw 
it in a real-world cluster setup. Since I've been on vacation recently, I 
haven't actually deployed the new code and benchmarked it on our cluster. Great 
to see such a huge 10x performance gain (actually bigger than I expected; in my 
local single-machine testing, I only saw a 2~4x difference).

What optimizations did you make in your ANN implementation? The same things 
as in MLOR?

@mengxr Is it possible to reopen this closed PR on GitHub? There is a lot of 
useful discussion here, so I don't want to open another PR. I think I'm mostly 
done except for the unit tests, and I can push the code for review now, before 
our meeting. (PS: the new code is more general than the binary one, and has 
the same performance in the binary special case in my local testing.)





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2015-01-05 Thread avulanov
Github user avulanov commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-68741897
  
@dbtsai 
Just back from vacation too :) 

I used my old implementation of the matrix form of back-propagation and 
made sure that it properly uses the stride of matrices in breeze. Also, I 
optimized the rolling of parameters into a vector by combining it with the 
in-place update of the cumulative sum. 
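
The last point can be illustrated with a minimal sketch. It is not the ANN code from the linked branch: it assumes the per-layer gradients are held as plain `Array[Double]` buffers rather than breeze matrices, and the names `RollSketch`, `rollNaive` and `accumulateInPlace` are hypothetical. The naive version first builds a temporary rolled vector and then adds it to the cumulative gradient; the combined version adds each layer straight into its slice of the cumulative array.

```scala
object RollSketch {
  // Naive: roll all layer gradients into a new vector, then add it.
  def rollNaive(layers: Seq[Array[Double]], cum: Array[Double]): Unit = {
    val rolled = Array.concat(layers: _*) // temporary allocation on every call
    var i = 0
    while (i < rolled.length) {
      cum(i) += rolled(i)
      i += 1
    }
  }

  // Combined: add each layer directly into its slice of the cumulative array,
  // so no intermediate rolled vector is created.
  def accumulateInPlace(layers: Seq[Array[Double]], cum: Array[Double]): Unit = {
    var offset = 0
    for (layer <- layers) {
      var i = 0
      while (i < layer.length) {
        cum(offset + i) += layer(i)
        i += 1
      }
      offset += layer.length
    }
  }

  def main(args: Array[String]): Unit = {
    // Two toy "layers" with 4 and 2 parameters.
    val layers = Seq(Array(1.0, 2.0, 3.0, 4.0), Array(5.0, 6.0))
    val a = Array.fill(6)(10.0)
    val b = Array.fill(6)(10.0)
    rollNaive(layers, a)
    accumulateInPlace(layers, b)
    println(a.sameElements(b)) // both variants produce the same cumulative sum
  }
}
```

Both variants compute the same result; the second one avoids allocating and copying a full parameter-sized vector for every training example.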
 

