[jira] [Commented] (SPARK-7253) Add example of belief propagation with GraphX

2016-12-13 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15746841#comment-15746841
 ] 

Alexander Ulanov commented on SPARK-7253:
-

Here is an implementation of the belief propagation algorithm for factor graphs,
with examples: https://github.com/HewlettPackard/sandpiper
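For a flavor of how such message passing maps onto GraphX, below is a skeletal
sketch that runs sum-product-style updates on a pairwise binary MRF via Pregel.
It deliberately omits the per-neighbor message exclusion that exact belief
propagation requires (the sandpiper project handles that part), so treat it as
an illustration of the Pregel pattern involved, not as the sandpiper code.

{code}
import org.apache.spark.graphx._

// Vertex state: unnormalized log-beliefs over the two states {0, 1}.
type Belief = Array[Double]

// Skeletal message passing on a pairwise binary MRF. Edge attribute is an
// Ising-style coupling strength. Exact sum-product BP would additionally
// divide out the recipient's previous message; this sketch does not.
def propagate(graph: Graph[Belief, Double], iters: Int): Graph[Belief, Double] =
  Pregel(graph, initialMsg = Array(0.0, 0.0), maxIterations = iters)(
    // vprog: fold the aggregated incoming message into the belief
    (id, belief, msg) => Array(belief(0) + msg(0), belief(1) + msg(1)),
    // sendMsg: sum-product message from the source belief and the coupling
    triplet => {
      val b = triplet.srcAttr
      val m = Array(
        math.log(math.exp(b(0) + triplet.attr) + math.exp(b(1))),
        math.log(math.exp(b(0)) + math.exp(b(1) + triplet.attr)))
      Iterator((triplet.dstId, m))
    },
    // mergeMsg: combine log-messages from different neighbors
    (a, b) => Array(a(0) + b(0), a(1) + b(1)))
{code}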

> Add example of belief propagation with GraphX
> -
>
> Key: SPARK-7253
> URL: https://issues.apache.org/jira/browse/SPARK-7253
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: Joseph K. Bradley
>
> It would be nice to document (via an example) how to use GraphX to do belief 
> propagation.  It's probably too much right now to talk about a full-fledged 
> graphical model library (and that would belong in MLlib anyway), but a 
> simple example of a graphical model + BP would be nice to add to GraphX.






[jira] [Commented] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566467#comment-15566467
 ] 

Alexander Ulanov commented on SPARK-17870:
--

[`SelectKBest`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest)
 works with "a Function taking two arrays X and y, and returning a pair of 
arrays (scores, pvalues) or a single array with scores". According to what you 
observe, it uses the p-values for sorting the `chi2` outputs. Indeed, this is 
the case for all functions that return two arrays: 
https://github.com/scikit-learn/scikit-learn/blob/412996f/sklearn/feature_selection/univariate_selection.py#L331.
 Alternatively, one can use the raw `chi2` scores for sorting, by passing only 
the first array returned by `chi2` to `SelectKBest`. As far as I remember, 
using raw chi2 scores is the default in Weka's 
[ChiSquaredAttributeEval](http://weka.sourceforge.net/doc.stable/weka/attributeSelection/ChiSquaredAttributeEval.html).
 So I would not claim that either approach is incorrect. According to 
[Introduction to 
IR](http://nlp.stanford.edu/IR-book/html/htmledition/assessing-as-a-feature-selection-methodassessing-chi-square-as-a-feature-selection-method-1.html),
 there might be an issue with computing p-values, because the chi2 test is then 
applied multiple times. Using plain chi2 values does not involve a statistical 
test, so it can be treated as just a ranking with no statistical implications.
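To make the two rankings concrete in Spark terms, here is a small sketch
against MLlib's Statistics.chiSqTest (the `data` RDD of LabeledPoint is
assumed): ranking by the raw statistic and ranking by p-value only agree when
all features have the same degrees of freedom.

{code}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

def rankings(data: RDD[LabeledPoint]): (Seq[Int], Seq[Int]) = {
  // One ChiSqTestResult per feature, each with its own degrees of freedom.
  val results = Statistics.chiSqTest(data)
  // Weka-style: rank by raw chi-squared statistic (largest first).
  val byStatistic = results.zipWithIndex.sortBy(-_._1.statistic).map(_._2).toSeq
  // scikit-learn SelectKBest-style: rank by p-value (smallest first), which
  // accounts for differing degrees of freedom across features.
  val byPValue = results.zipWithIndex.sortBy(_._1.pValue).map(_._2).toSeq
  (byStatistic, byPValue)
}
{code}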

> ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong 
> 
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Priority: Critical
>
> The method to compute the ChiSqTestResult in mllib/feature/ChiSqSelector.scala 
> (line 233) is wrong.
> The feature selection method ChiSquareSelector selects features based on 
> ChiSquareTestResult.statistic (the chi-square value): it selects the features 
> with the largest chi-square values. But the degrees of freedom (df) of the 
> chi-square values differ across features in Statistics.chiSqTest(RDD), and 
> for different df you cannot select features based on the chi-square value 
> alone.
> Because of this wrong way of computing the chi-square value, the feature 
> selection results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If selectKBest is used, feature 3 will be selected.
> If selectFpr is used, features 1 and 2 will be selected.
> This is strange.
> I used scikit-learn to test the same data with the same parameters.
> When selectKBest is used, feature 1 is selected.
> When selectFpr is used, features 1 and 2 are selected.
> This result makes sense, because the df of each feature in scikit-learn is 
> the same.
> I plan to submit a PR for this problem.






[jira] [Commented] (SPARK-10408) Autoencoder

2017-05-09 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16003110#comment-16003110
 ] 

Alexander Ulanov commented on SPARK-10408:
--

An autoencoder is implemented in the referenced pull request. I will be glad to 
follow up on the code review if anyone can take it on.

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Assignee: Alexander Ulanov
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1) Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1], real in [-inf, +inf] 
> 2) Sparse autoencoder, i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3) Denoising autoencoder 
> 4) Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers
> References: 
> 1. Vincent, Pascal, et al. "Extracting and composing robust features with 
> denoising autoencoders." Proceedings of the 25th international conference on 
> Machine learning. ACM, 2008. 
> http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf
>  
> 2. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
> 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
> (2010). Stacked denoising autoencoders: Learning useful representations in a 
> deep network with a local denoising criterion. Journal of Machine Learning 
> Research 11: 3371–3408. 
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484&rep=rep1&type=pdf
> 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep 
> networks." Advances in neural information processing systems 19 (2007): 153. 
> http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf






[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning

2015-10-01 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940446#comment-14940446
 ] 

Alexander Ulanov commented on SPARK-5575:
-

Hi, Weide,

Sounds good! What kind of feature are you planning to add?

> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> Goal: Implement various types of artificial neural networks
> Motivation: deep learning trend
> Requirements: 
> 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward 
> and Backpropagation etc. should be implemented as traits or interfaces, so 
> they can be easily extended or reused
> 2) Implement complex abstractions, such as feed forward and recurrent networks
> 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
> autoencoder (sparse and denoising), stacked autoencoder, restricted  
> boltzmann machines (RBM), deep belief networks (DBN) etc.
> 4) Implement or reuse supporting constructs, such as classifiers, normalizers, 
> poolers, etc.






[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning

2015-10-05 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944173#comment-14944173
 ] 

Alexander Ulanov commented on SPARK-5575:
-

Weide,

These are major features and some of them are under development. You can check 
their status in the linked issues. Could you work on something smaller as a 
first step? [~mengxr], do you have any suggestions?

> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> Goal: Implement various types of artificial neural networks
> Motivation: deep learning trend
> Requirements: 
> 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward 
> and Backpropagation etc. should be implemented as traits or interfaces, so 
> they can be easily extended or reused
> 2) Implement complex abstractions, such as feed forward and recurrent networks
> 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
> autoencoder (sparse and denoising), stacked autoencoder, restricted  
> boltzmann machines (RBM), deep belief networks (DBN) etc.
> 4) Implement or reuse supporting constructs, such as classifiers, normalizers, 
> poolers, etc.






[jira] [Created] (SPARK-11262) Unit test for gradient, loss layers, memory management for multilayer perceptron

2015-10-22 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-11262:


 Summary: Unit test for gradient, loss layers, memory management 
for multilayer perceptron
 Key: SPARK-11262
 URL: https://issues.apache.org/jira/browse/SPARK-11262
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.5.1
Reporter: Alexander Ulanov
 Fix For: 1.5.1


The multilayer perceptron requires more rigorous tests and a refactoring of the 
layer interfaces to accommodate the development of new features.
1) Implement unit tests for the gradient and loss, e.g. via a finite-difference 
gradient check like the sketch below
2) Refactor the internal layer interface to extract the "loss function"
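For reference on point 1, gradient correctness is typically unit-tested by
comparing the analytic gradient against central finite differences. A generic
sketch in plain Scala (independent of Spark's test suite):

{code}
// Compare an analytic gradient with central finite differences:
// dL/dw_i ~= (loss(w + eps*e_i) - loss(w - eps*e_i)) / (2*eps).
def checkGradient(
    loss: Array[Double] => Double,
    gradient: Array[Double] => Array[Double],
    w: Array[Double],
    eps: Double = 1e-6,
    tol: Double = 1e-4): Boolean = {
  val analytic = gradient(w)
  w.indices.forall { i =>
    val wPlus = w.clone();  wPlus(i) += eps
    val wMinus = w.clone(); wMinus(i) -= eps
    val numeric = (loss(wPlus) - loss(wMinus)) / (2 * eps)
    // Relative tolerance guards against badly scaled gradients.
    math.abs(numeric - analytic(i)) <= tol * math.max(1.0, math.abs(numeric))
  }
}
{code}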






[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning

2015-11-03 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14988705#comment-14988705
 ] 

Alexander Ulanov commented on SPARK-5575:
-

Hi Disha,

RNN is a major feature. I suggest starting with a smaller contribution. Spark 
has included an implementation of the multilayer perceptron since version 1.5. 
New features are supposed to reuse its code and follow the internal API that it 
introduced.
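For reference, the existing MLP can be used through the spark.ml API roughly
like this (a minimal sketch; the `train` and `test` DataFrames with "features"
and "label" columns are assumptions):

{code}
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

// Layer sizes: input features, one hidden layer, output classes.
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 5, 3))
  .setBlockSize(128)
  .setMaxIter(100)
  .setSeed(1234L)

val model = mlp.fit(train)           // train: DataFrame("features", "label")
val predictions = model.transform(test)
{code}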

> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> Goal: Implement various types of artificial neural networks
> Motivation: deep learning trend
> Requirements: 
> 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward 
> and Backpropagation etc. should be implemented as traits or interfaces, so 
> they can be easily extended or reused
> 2) Implement complex abstractions, such as feed forward and recurrent networks
> 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
> autoencoder (sparse and denoising), stacked autoencoder, restricted  
> boltzmann machines (RBM), deep belief networks (DBN) etc.
> 4) Implement or reuse supporting constructs, such as classifiers, normalizers, 
> poolers, etc.






[jira] [Commented] (SPARK-9273) Add Convolutional Neural network to Spark MLlib

2015-11-05 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992447#comment-14992447
 ] 

Alexander Ulanov commented on SPARK-9273:
-

Hi Yuhao. Sounds good! Thanks for refactoring the code to support the ANN 
internal interface. Also, I was able to run your example. It shows increasing 
accuracy during training, but it is not very fast.

There is a good explanation of how to use matrix multiplication for convolution: 
http://cs231n.github.io/convolutional-networks/. Basically, one needs to roll 
all image patches (the regions that will be convolved) into vectors and stack 
them together in a matrix. The weights of the convolutional layer should also 
be rolled into vectors and stacked. Multiplying the two matrices yields the 
convolution result, which can be unrolled into a 3D tensor, though that is not 
necessary for this implementation. We can discuss it offline if you wish.
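A minimal sketch of that rolling step (im2col) in Breeze, which Spark's ANN
already uses internally; this is illustrative only, not the code in the pull
request:

{code}
import breeze.linalg.DenseMatrix

// Roll every k x k patch of an image into one column, so that convolving
// with a bank of vectorized filters becomes a single matrix multiply.
def im2col(image: DenseMatrix[Double], k: Int): DenseMatrix[Double] = {
  val outRows = image.rows - k + 1
  val outCols = image.cols - k + 1
  val patches = DenseMatrix.zeros[Double](k * k, outRows * outCols)
  for (i <- 0 until outRows; j <- 0 until outCols) {
    var c = 0
    for (pj <- 0 until k; pi <- 0 until k) {
      patches(c, i * outCols + j) = image(i + pi, j + pj)
      c += 1
    }
  }
  patches
}

// filters: one vectorized k*k filter per row. Each row of the result is a
// feature map that could be unrolled back to 2D (not needed here).
def convolve(filters: DenseMatrix[Double],
             image: DenseMatrix[Double], k: Int): DenseMatrix[Double] =
  filters * im2col(image, k)
{code}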

Besides the optimization, there are a few more things to be done: unit tests 
for the new layers, a gradient test, representing the pooling layer as a 
functional layer, and a performance comparison with other CNN implementations. 
You can take a look at the tests I've added for the MLP in 
https://issues.apache.org/jira/browse/SPARK-11262 and at the MLP benchmark at 
https://github.com/avulanov/ann-benchmark. A separate branch/repo for these 
developments might be a good idea. I'll be happy to help you with this.

> Add Convolutional Neural network to Spark MLlib
> ---
>
> Key: SPARK-9273
> URL: https://issues.apache.org/jira/browse/SPARK-9273
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: yuhao yang
>
> Add Convolutional Neural network to Spark MLlib






[jira] [Comment Edited] (SPARK-9273) Add Convolutional Neural network to Spark MLlib

2015-11-05 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992447#comment-14992447
 ] 

Alexander Ulanov edited comment on SPARK-9273 at 11/5/15 8:50 PM:
--

Hi Yuhao. Sounds good! Thanks for refactoring the code to support the ANN 
internal interface. Also, I was able to run your example. It shows increasing 
accuracy during training, but it is not very fast. Does it work with LBFGS?

There is a good explanation of how to use matrix multiplication for convolution: 
http://cs231n.github.io/convolutional-networks/. Basically, one needs to roll 
all image patches (the regions that will be convolved) into vectors and stack 
them together in a matrix. The weights of the convolutional layer should also 
be rolled into vectors and stacked. Multiplying the two matrices yields the 
convolution result, which can be unrolled into a 3D tensor, though that is not 
necessary for this implementation. We can discuss it offline if you wish.

Besides the optimization, there are a few more things to be done: unit tests 
for the new layers, a gradient test, representing the pooling layer as a 
functional layer, and a performance comparison with other CNN implementations. 
You can take a look at the tests I've added for the MLP in 
https://issues.apache.org/jira/browse/SPARK-11262 and at the MLP benchmark at 
https://github.com/avulanov/ann-benchmark. A separate branch/repo for these 
developments might be a good idea. I'll be happy to help you with this.


was (Author: avulanov):
Hi Yuhao. Sounds good! Thanks for refactoring the code to support ANN internal 
interface. Also, I was able to run your example. It shows increasing accuracy 
while training however it is not very fast. 

There is a good explanation how to use matrices multiplication in convolution: 
http://cs231n.github.io/convolutional-networks/. Basically, one needs to roll 
all image patches (regions that will be convolved) the into vectors and stack 
them together in a matrix. The weights of convolutional layer also should be 
rolled into vectors and stacked. Multiplying two mentioned matrices provides 
the convolution result that can be unrolled to 3d matrix, however it would not 
be necessary for this implementation. We can discuss it offline if you wish.

Besides the optimization, there are few more things to be done. It includes 
unit tests for new layers, gradient test, representing pooling layer as 
functional layer, and performance comparison with the other implementation of 
CNN. You can take a look at the tests I've added for MLP 
https://issues.apache.org/jira/browse/SPARK-11262 and MLP benchmark at 
https://github.com/avulanov/ann-benchmark. A separate branch/repo for these 
developments might be a good thing to do. I'll be happy to help you with this.

> Add Convolutional Neural network to Spark MLlib
> ---
>
> Key: SPARK-9273
> URL: https://issues.apache.org/jira/browse/SPARK-9273
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: yuhao yang
>
> Add Convolutional Neural network to Spark MLlib






[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning

2015-11-09 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14997300#comment-14997300
 ] 

Alexander Ulanov commented on SPARK-5575:
-

Hi Narine,

Thank you for your observation. It seems that such information would be useful 
to know. Indeed, LBFGS in Spark does not print any information during 
execution, and the ANN uses Spark's LBFGS. You might want to add the needed 
output to the LBFGS code: 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala#L185
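A sketch of what that change could look like around the referenced line. It
assumes the surrounding code of LBFGS.runLBFGS in the linked file (the Breeze
`states` iterator, the `lossHistory` buffer, and the Logging trait's logInfo),
so treat it as an illustration rather than a patch:

{code}
// Spark drains the Breeze optimizer states into lossHistory; logging each
// state's value as it arrives would print the loss per iteration.
var state = states.next()
while (states.hasNext) {
  lossHistory += state.value
  logInfo(s"LBFGS iteration ${state.iter}: loss ${state.value}")
  state = states.next()
}
lossHistory += state.value
{code}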
 

Best regards, Alexander 


> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> Goal: Implement various types of artificial neural networks
> Motivation: deep learning trend
> Requirements: 
> 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward 
> and Backpropagation etc. should be implemented as traits or interfaces, so 
> they can be easily extended or reused
> 2) Implement complex abstractions, such as feed forward and recurrent networks
> 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
> autoencoder (sparse and denoising), stacked autoencoder, restricted  
> boltzmann machines (RBM), deep belief networks (DBN) etc.
> 4) Implement or reuse supporting constucts, such as classifiers, normalizers, 
> poolers,  etc.






[jira] [Commented] (SPARK-10627) Regularization for artificial neural networks

2016-07-27 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15395379#comment-15395379
 ] 

Alexander Ulanov commented on SPARK-10627:
--

[~RubenJanssen] These are major features. Could you work on something smaller 
as a first step?

> Regularization for artificial neural networks
> -
>
> Key: SPARK-10627
> URL: https://issues.apache.org/jira/browse/SPARK-10627
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Add regularization for artificial neural networks, including, but not limited 
> to:
> 1) L1 and L2 regularization
> 2) Dropout http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf
> 3) Dropconnect 
> http://machinelearning.wustl.edu/mlpapers/paper_files/icml2013_wan13.pdf
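For context on items 1 in the description above: L1/L2 regularization amounts
to adding a penalty term to the loss and its (sub)gradient. A minimal sketch in
plain Scala (just the math, not Spark's Updater API):

{code}
// L2: penalty lambda/2 * ||w||^2, gradient contribution lambda * w.
def l2Penalty(w: Array[Double], lambda: Double): Double =
  0.5 * lambda * w.map(x => x * x).sum
def l2Gradient(w: Array[Double], lambda: Double): Array[Double] =
  w.map(lambda * _)

// L1: penalty lambda * ||w||_1, subgradient contribution lambda * sign(w).
def l1Penalty(w: Array[Double], lambda: Double): Double =
  lambda * w.map(math.abs).sum
def l1Subgradient(w: Array[Double], lambda: Double): Array[Double] =
  w.map(x => lambda * math.signum(x))
{code}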






[jira] [Updated] (SPARK-9120) Add multivariate regression (or prediction) interface

2016-07-27 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-9120:

Description: 
org.apache.spark.ml.regression.RegressionModel supports prediction only for a 
single variable, with a method "predict:Double", by extending the Predictor. 
There is a need for multivariate prediction, at least for regression. I propose 
to modify the "RegressionModel" interface similarly to how it is done in 
"ClassificationModel", which supports multiclass classification: it has 
"predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" should 
have something like "predictMultivariate:Vector".

Update: After reading the design docs, adding "predictMultivariate" to 
RegressionModel no longer seems reasonable to me. The issue is as follows. 
RegressionModel has "predict:Double". Its "train" method uses "predict:Double" 
for prediction, i.e. PredictionModel (and RegressionModel) is hard-coded to 
have only one output. There exists a similar problem in MLlib 
(https://issues.apache.org/jira/browse/SPARK-5362). 

A possible solution might require redesigning the class hierarchy or adding a 
separate interface that extends the model, though the latter means code 
duplication.
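To make the proposal concrete, a hypothetical sketch of the separate-interface
option (the name predictMultivariate comes from the proposal above; nothing
like this exists in Spark's API):

{code}
import org.apache.spark.mllib.linalg.Vector

// Hypothetical trait: keeps the existing single-output contract and adds
// the proposed multivariate prediction alongside it.
trait MultivariateRegressionModel {
  def predict(features: Vector): Double              // existing behavior
  def predictMultivariate(features: Vector): Vector  // proposed addition
}
{code}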


  was:
org.apache.spark.ml.regression.RegressionModel supports prediction only for a 
single variable with a method "predict:Double" by extending the Predictor. 
There is a need for multivariate prediction, at least for regression. I propose 
to modify "RegressionModel" interface similarly to how it is done in 
"ClassificationModel", which supports multiclass classification. It has 
"predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" should 
have something like "predictMultivariate:Vector".

Update: After reading the design docs, adding "predictMultivariate" to 
RegressionModel does not seem reasonable to me anymore. The issue is as 
follows. RegressionModel extends PredictionModel which has "predict:Double". 
Its "train" method uses "predict:Double" for prediction, i.e. PredictionModel 
(and RegressionModel) is hard-coded to have only one output. There exist a 
similar problem in MLLib (https://issues.apache.org/jira/browse/SPARK-5362). 

The possible solution for this problem might require to redesign the class 
hierarchy or addition of a separate interface that extends model. Though the 
latter means code duplication.



> Add multivariate regression (or prediction) interface
> -
>
> Key: SPARK-9120
> URL: https://issues.apache.org/jira/browse/SPARK-9120
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Alexander Ulanov
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> org.apache.spark.ml.regression.RegressionModel supports prediction only for a 
> single variable, with a method "predict:Double", by extending the Predictor. 
> There is a need for multivariate prediction, at least for regression. I 
> propose to modify the "RegressionModel" interface similarly to how it is done 
> in "ClassificationModel", which supports multiclass classification: it has 
> "predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" 
> should have something like "predictMultivariate:Vector".
> Update: After reading the design docs, adding "predictMultivariate" to 
> RegressionModel no longer seems reasonable to me. The issue is as follows. 
> RegressionModel has "predict:Double". Its "train" method uses "predict:Double" 
> for prediction, i.e. PredictionModel (and RegressionModel) is hard-coded to 
> have only one output. There exists a similar problem in MLlib 
> (https://issues.apache.org/jira/browse/SPARK-5362). 
> A possible solution might require redesigning the class hierarchy or adding a 
> separate interface that extends the model, though the latter means code 
> duplication.






[jira] [Commented] (SPARK-9120) Add multivariate regression (or prediction) interface

2016-07-27 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15395434#comment-15395434
 ] 

Alexander Ulanov commented on SPARK-9120:
-

Thanks for the comment; indeed, RegressionModel does not extend that trait. 
However, it is designed to handle a single output variable, as mentioned in the 
description, which prevents its use for multivariate regression.

> Add multivariate regression (or prediction) interface
> -
>
> Key: SPARK-9120
> URL: https://issues.apache.org/jira/browse/SPARK-9120
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Alexander Ulanov
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> org.apache.spark.ml.regression.RegressionModel supports prediction only for a 
> single variable, with a method "predict:Double", by extending the Predictor. 
> There is a need for multivariate prediction, at least for regression. I 
> propose to modify the "RegressionModel" interface similarly to how it is done 
> in "ClassificationModel", which supports multiclass classification: it has 
> "predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" 
> should have something like "predictMultivariate:Vector".
> Update: After reading the design docs, adding "predictMultivariate" to 
> RegressionModel no longer seems reasonable to me. The issue is as follows. 
> RegressionModel has "predict:Double". Its "train" method uses "predict:Double" 
> for prediction, i.e. PredictionModel (and RegressionModel) is hard-coded to 
> have only one output. There exists a similar problem in MLlib 
> (https://issues.apache.org/jira/browse/SPARK-5362). 
> A possible solution might require redesigning the class hierarchy or adding a 
> separate interface that extends the model, though the latter means code 
> duplication.






[jira] [Updated] (SPARK-5575) Artificial neural networks for MLlib deep learning

2016-07-27 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-5575:

Description: 
*Goal:* Implement various types of artificial neural networks

*Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581)
Having deep learning within Spark's ML library is a question of convenience. 
Spark has broad analytic capabilities, and it is useful to have deep learning 
as one of these tools at hand. Deep learning is a model of choice for several 
important modern use cases, and Spark ML might want to cover them. Ultimately, 
it is hard to explain why we have PCA in ML but do not provide an autoencoder. 
To summarize, Spark should have at least the most widely used deep learning 
models, such as the fully connected artificial neural network, the 
convolutional network and the autoencoder. Advanced and experimental deep 
learning features might reside within packages or as pluggable external tools. 
These three will provide a comprehensive deep learning set for Spark ML. We 
might also include recurrent networks.

*Requirements:*
# Implement an extensible API compatible with Spark ML. Basic abstractions such 
as Neuron, Layer, Error, Regularization, Forward and Backpropagation etc. 
should be implemented as traits or interfaces, so they can be easily extended 
or reused (a hypothetical sketch follows this list). 
# Performance. The current implementation of the multilayer perceptron in Spark 
is less than 2x slower than Caffe, both measured on CPU. The main overhead 
sources are the JVM and Spark's communication layer. For more details, please 
refer to https://github.com/avulanov/ann-benchmark. Having said that, an 
efficient implementation of deep learning in Spark should be only a few times 
slower than a specialized tool. This is very reasonable for a platform that 
does much more than deep learning, and I believe the community understands this.

# Implement efficient distributed training. It relies heavily on efficient 
communication and scheduling mechanisms. The default implementation is based on 
Spark. More efficient implementations might include external libraries but 
would use the same defined interface.
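A hypothetical sketch of what the trait-based abstractions in requirement 1 
could look like (illustrative only; Spark's actual ann package is private and 
differs in detail):

{code}
// Hypothetical trait-based design: layers are specifications, layer models
// hold the weights and do the actual math.
trait Layer {
  def weightSize: Int                                   // number of parameters
  def createModel(weights: Array[Double]): LayerModel
}

trait LayerModel {
  def eval(input: Array[Double]): Array[Double]         // forward pass
  def prevDelta(delta: Array[Double],
                output: Array[Double]): Array[Double]   // backpropagated error
  def grad(delta: Array[Double],
           input: Array[Double]): Array[Double]         // weight gradient
}
{code}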

The additional benefit of implementing deep learning for Spark is that we 
define the Spark ML API for deep learning. This interface is similar to the 
other analytics tools in Spark and supports ML pipelines. This makes deep 
learning easy to use and to plug into analytics workloads for Spark users. 

One can wrap other deep learning implementations with this interface, allowing 
users to pick a particular back-end, e.g. Caffe or TensorFlow, along with the 
default one. The interface has to provide a few architectures for deep learning 
that are widely used in practice, such as AlexNet. The main motivation for 
using specialized libraries for deep learning is to fully take advantage of the 
hardware where Spark runs, in particular GPUs. Having the default interface in 
Spark, we would need to wrap only a subset of functions from a given 
specialized library. This does require effort, but it is not the same as 
wrapping all functions. Wrappers can be provided as packages without the need 
to pull new dependencies into Spark.




*Requirements:* 
1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward and 
Backpropagation etc. should be implemented as traits or interfaces, so they can 
be easily extended or reused
2) Implement complex abstractions, such as feed forward and recurrent networks
3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
autoencoder (sparse and denoising), stacked autoencoder, restricted  boltzmann 
machines (RBM), deep belief networks (DBN) etc.
4) Implement or reuse supporting constructs, such as classifiers, normalizers, 
poolers, etc.

  was:
Goal: Implement various types of artificial neural networks

Motivation: deep learning trend

Requirements: 
1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward and 
Backpropagation etc. should be implemented as traits or interfaces, so they can 
be easily extended or reused
2) Implement complex abstractions, such as feed forward and recurrent networks
3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
autoencoder (sparse and denoising), stacked autoencoder, restricted  boltzmann 
machines (RBM), deep belief networks (DBN) etc.
4) Implement or reuse supporting constucts, such as classifiers, normalizers, 
poolers,  etc.


> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> *Goal:* Implement various types of artificial neural netwo

[jira] [Updated] (SPARK-5575) Artificial neural networks for MLlib deep learning

2016-07-27 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-5575:

Description: 
*Goal:* Implement various types of artificial neural networks

*Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581)
Having deep learning within Spark's ML library is a question of convenience. 
Spark has broad analytic capabilities, and it is useful to have deep learning 
as one of these tools at hand. Deep learning is a model of choice for several 
important modern use cases, and Spark ML might want to cover them. Ultimately, 
it is hard to explain why we have PCA in ML but do not provide an autoencoder. 
To summarize, Spark should have at least the most widely used deep learning 
models, such as the fully connected artificial neural network, the 
convolutional network and the autoencoder. Advanced and experimental deep 
learning features might reside within packages or as pluggable external tools. 
These three will provide a comprehensive deep learning set for Spark ML. We 
might also include recurrent networks.

*Requirements:*
# Extensible API compatible with Spark ML. Basic abstractions such as Neuron, 
Layer, Error, Regularization, Forward and Backpropagation etc. should be 
implemented as traits or interfaces, so they can be easily extended or reused. 
Define the Spark ML API for deep learning. This interface is similar to the 
other analytics tools in Spark and supports ML pipelines. This makes deep 
learning easy to use and to plug into analytics workloads for Spark users. 
# Efficiency. The current implementation of the multilayer perceptron in Spark 
is less than 2x slower than Caffe, both measured on CPU. The main overhead 
sources are the JVM and Spark's communication layer. For more details, please 
refer to https://github.com/avulanov/ann-benchmark. Having said that, an 
efficient implementation of deep learning in Spark should be only a few times 
slower than a specialized tool. This is very reasonable for a platform that 
does much more than deep learning, and I believe the community understands this.
# Scalability. Implement efficient distributed training. It relies heavily on 
efficient communication and scheduling mechanisms. The default implementation 
is based on Spark. More efficient implementations might include external 
libraries but would use the same defined interface.

*Main features:* 
# Multilayer perceptron
# Autoencoder
# Convolutional neural networks. The interface has to provide a few 
architectures for deep learning that are widely used in practice, such as 
AlexNet.

*Additional features:* (lower priority)
# The internal API of Spark ANN is designed to be flexible and can handle 
different types of layers. However, only a part of the API is made public. We 
have to limit the number of public classes in order to make it simpler to 
support other languages. This forces us to use (String or Number) parameters 
instead of introducing new public classes. One of the options to specify the 
architecture of an ANN is to use a text configuration with a layer-wise 
description (see the sketch after this list). We have considered using the 
Caffe format for this. It gives the benefit of compatibility with a well-known 
deep learning tool and simplifies the support of other languages in Spark. 
Implementation of a parser for a subset of the Caffe format might be the first 
step towards the support of general ANN architectures in Spark. 
# Hardware-specific optimization. One can wrap other deep learning 
implementations with this interface, allowing users to pick a particular 
back-end, e.g. Caffe or TensorFlow, along with the default one. The main 
motivation for using specialized libraries for deep learning is to fully take 
advantage of the hardware where Spark runs, in particular GPUs. Having the 
default interface in Spark, we would need to wrap only a subset of functions 
from a given specialized library. This does require effort, but it is not the 
same as wrapping all functions. Wrappers can be provided as packages without 
the need to pull new dependencies into Spark.
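A hypothetical illustration of such a layer-wise text configuration and its 
parser (an invented format purely for illustration; not the Caffe format or 
any existing Spark code):

{code}
// Example config, one layer per line:
//   affine in=784 out=100
//   sigmoid
//   affine in=100 out=10
//   softmax
sealed trait LayerSpec
case class Affine(in: Int, out: Int) extends LayerSpec
case object Sigmoid extends LayerSpec
case object Softmax extends LayerSpec

def parseLayer(line: String): LayerSpec = line.trim.split("\\s+") match {
  case Array("affine", in, out) =>
    Affine(in.stripPrefix("in=").toInt, out.stripPrefix("out=").toInt)
  case Array("sigmoid") => Sigmoid
  case Array("softmax") => Softmax
  case other => sys.error(s"unknown layer: ${other.mkString(" ")}")
}

def parseNetwork(config: String): Seq[LayerSpec] =
  config.split("\n").map(_.trim).filter(_.nonEmpty).map(parseLayer).toSeq
{code}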




*Requirements:* 
1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward and 
Backpropagation etc. should be implemented as traits or interfaces, so they can 
be easily extended or reused
2) Implement complex abstractions, such as feed forward and recurrent networks
3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
autoencoder (sparse and denoising), stacked autoencoder, restricted  boltzmann 
machines (RBM), deep belief networks (DBN) etc.
4) Implement or reuse supporting constructs, such as classifiers, normalizers, 
poolers, etc.

  was:
*Goal:* Implement various types of artificial neural networks

*Motivation:* (from https://issues.apache.org/jira/b

[jira] [Updated] (SPARK-5575) Artificial neural networks for MLlib deep learning

2016-07-27 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-5575:

Description: 
*Goal:* Implement various types of artificial neural networks

*Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581)
Having deep learning within Spark's ML library is a question of convenience. 
Spark has broad analytic capabilities, and it is useful to have deep learning 
as one of these tools at hand. Deep learning is a model of choice for several 
important modern use cases, and Spark ML might want to cover them. Ultimately, 
it is hard to explain why we have PCA in ML but do not provide an autoencoder. 
To summarize, Spark should have at least the most widely used deep learning 
models, such as the fully connected artificial neural network, the 
convolutional network and the autoencoder. Advanced and experimental deep 
learning features might reside within packages or as pluggable external tools. 
These three will provide a comprehensive deep learning set for Spark ML. We 
might also include recurrent networks.

*Requirements:*
# Extensible API compatible with Spark ML. Basic abstractions such as Neuron, 
Layer, Error, Regularization, Forward and Backpropagation etc. should be 
implemented as traits or interfaces, so they can be easily extended or reused. 
Define the Spark ML API for deep learning. This interface is similar to the 
other analytics tools in Spark and supports ML pipelines. This makes deep 
learning easy to use and to plug into analytics workloads for Spark users. 
# Efficiency. The current implementation of the multilayer perceptron in Spark 
is less than 2x slower than Caffe, both measured on CPU. The main overhead 
sources are the JVM and Spark's communication layer. For more details, please 
refer to https://github.com/avulanov/ann-benchmark. Having said that, an 
efficient implementation of deep learning in Spark should be only a few times 
slower than a specialized tool. This is very reasonable for a platform that 
does much more than deep learning, and I believe the community understands this.
# Scalability. Implement efficient distributed training. It relies heavily on 
efficient communication and scheduling mechanisms. The default implementation 
is based on Spark. More efficient implementations might include external 
libraries but would use the same defined interface.

*Main features:* 
# Multilayer perceptron classifier (MLP)
# Autoencoder
# Convolutional neural networks for computer vision. The interface has to 
provide a few architectures for deep learning that are widely used in practice, 
such as AlexNet

*Additional features:*
# Other architectures, such as the recurrent neural network (RNN), long 
short-term memory (LSTM), the restricted Boltzmann machine (RBM), the deep 
belief network (DBN), and MLP multivariate regression
# Regularizers, such as L1, L2 and dropout
# Normalizers
# Network customization. The internal API of Spark ANN is designed to be 
flexible and can handle different types of layers. However, only a part of the 
API is made public. We have to limit the number of public classes in order to 
make it simpler to support other languages. This forces us to use (String or 
Number) parameters instead of introducing new public classes. One of the 
options to specify the architecture of an ANN is to use a text configuration 
with a layer-wise description. We have considered using the Caffe format for 
this. It gives the benefit of compatibility with a well-known deep learning 
tool and simplifies the support of other languages in Spark. Implementation of 
a parser for a subset of the Caffe format might be the first step towards the 
support of general ANN architectures in Spark. 
# Hardware-specific optimization. One can wrap other deep learning 
implementations with this interface, allowing users to pick a particular 
back-end, e.g. Caffe or TensorFlow, along with the default one. The main 
motivation for using specialized libraries for deep learning is to fully take 
advantage of the hardware where Spark runs, in particular GPUs. Having the 
default interface in Spark, we would need to wrap only a subset of functions 
from a given specialized library. This does require effort, but it is not the 
same as wrapping all functions. Wrappers can be provided as packages without 
the need to pull new dependencies into Spark.

  was:
*Goal:* Implement various types of artificial neural networks

*Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581)
Having deep learning within Spark's ML library is a question of convenience. 
Spark has broad analytic capabilities and it is useful to have deep learning as 
one of these tools at hand. Deep learning is a model of choice for several 
important modern use-cases, and Spark ML might want

[jira] [Updated] (SPARK-5575) Artificial neural networks for MLlib deep learning

2016-07-27 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-5575:

Description: 
*Goal:* Implement various types of artificial neural networks

*Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581)
Having deep learning within Spark's ML library is a question of convenience. 
Spark has broad analytic capabilities, and it is useful to have deep learning 
as one of these tools at hand. Deep learning is a model of choice for several 
important modern use cases, and Spark ML might want to cover them. Ultimately, 
it is hard to explain why we have PCA in ML but do not provide an autoencoder. 
To summarize, Spark should have at least the most widely used deep learning 
models, such as the fully connected artificial neural network, the 
convolutional network and the autoencoder. Advanced and experimental deep 
learning features might reside within packages or as pluggable external tools. 
These three will provide a comprehensive deep learning set for Spark ML. We 
might also include recurrent networks.

*Requirements:*
# Extensible API compatible with Spark ML. Basic abstractions such as Neuron, 
Layer, Error, Regularization, Forward and Backpropagation etc. should be 
implemented as traits or interfaces, so they can be easily extended or reused. 
Define the Spark ML API for deep learning. This interface is similar to the 
other analytics tools in Spark and supports ML pipelines. This makes deep 
learning easy to use and to plug into analytics workloads for Spark users. 
# Efficiency. The current implementation of the multilayer perceptron in Spark 
is less than 2x slower than Caffe, both measured on CPU. The main overhead 
sources are the JVM and Spark's communication layer. For more details, please 
refer to https://github.com/avulanov/ann-benchmark. Having said that, an 
efficient implementation of deep learning in Spark should be only a few times 
slower than a specialized tool. This is very reasonable for a platform that 
does much more than deep learning, and I believe the community understands this.
# Scalability. Implement efficient distributed training. It relies heavily on 
efficient communication and scheduling mechanisms. The default implementation 
is based on Spark. More efficient implementations might include external 
libraries but would use the same defined interface.

*Main features:* 
# Multilayer perceptron classifier (MLP)
# Autoencoder
# Convolutional neural networks for computer vision. The interface has to 
provide a few architectures for deep learning that are widely used in practice, 
such as AlexNet

*Additional features:*
# Other architectures, such as the recurrent neural network (RNN), long 
short-term memory (LSTM), the restricted Boltzmann machine (RBM), the deep 
belief network (DBN), and MLP multivariate regression
# Regularizers, such as L1, L2 and dropout
# Normalizers
# Network customization. The internal API of Spark ANN is designed to be 
flexible and can handle different types of layers. However, only a part of the 
API is made public. We have to limit the number of public classes in order to 
make it simpler to support other languages. This forces us to use (String or 
Number) parameters instead of introducing new public classes. One of the 
options to specify the architecture of an ANN is to use a text configuration 
with a layer-wise description. We have considered using the Caffe format for 
this. It gives the benefit of compatibility with a well-known deep learning 
tool and simplifies the support of other languages in Spark. Implementation of 
a parser for a subset of the Caffe format might be the first step towards the 
support of general ANN architectures in Spark. 
# Hardware-specific optimization. One can wrap other deep learning 
implementations with this interface, allowing users to pick a particular 
back-end, e.g. Caffe or TensorFlow, along with the default one. The main 
motivation for using specialized libraries for deep learning is to fully take 
advantage of the hardware where Spark runs, in particular GPUs. Having the 
default interface in Spark, we would need to wrap only a subset of functions 
from a given specialized library. This does require effort, but it is not the 
same as wrapping all functions. Wrappers can be provided as packages without 
the need to pull new dependencies into Spark.

*Progress:*
# Requirements: done
# Features:
## Multilayer perceptron classifier

  was:
*Goal:* Implement various types of artificial neural networks

*Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581)
Having deep learning within Spark's ML library is a question of convenience. 
Spark has broad analytic capabilities and it is useful to have deep learning as 
one of these tools at hand. Deep learning is 

[jira] [Updated] (SPARK-5575) Artificial neural networks for MLlib deep learning

2016-07-27 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-5575:

Description: 
*Goal:* Implement various types of artificial neural networks

*Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581)
Having deep learning within Spark's ML library is a question of convenience. 
Spark has broad analytic capabilities, and it is useful to have deep learning 
as one of these tools at hand. Deep learning is a model of choice for several 
important modern use cases, and Spark ML might want to cover them. Ultimately, 
it is hard to explain why we have PCA in ML but do not provide an autoencoder. 
To summarize, Spark should have at least the most widely used deep learning 
models, such as the fully connected artificial neural network, the 
convolutional network and the autoencoder. Advanced and experimental deep 
learning features might reside within packages or as pluggable external tools. 
These three will provide a comprehensive deep learning set for Spark ML. We 
might also include recurrent networks.

*Requirements:*
# Extensible API compatible with Spark ML. Basic abstractions such as Neuron, 
Layer, Error, Regularization, Forward and Backpropagation etc. should be 
implemented as traits or interfaces, so they can be easily extended or reused. 
Define the Spark ML API for deep learning. This interface is similar to the 
other analytics tools in Spark and supports ML pipelines. This makes deep 
learning easy to use and to plug into analytics workloads for Spark users. 
# Efficiency. The current implementation of the multilayer perceptron in Spark 
is less than 2x slower than Caffe, both measured on CPU. The main overhead 
sources are the JVM and Spark's communication layer. For more details, please 
refer to https://github.com/avulanov/ann-benchmark. Having said that, an 
efficient implementation of deep learning in Spark should be only a few times 
slower than a specialized tool. This is very reasonable for a platform that 
does much more than deep learning, and I believe the community understands this.
# Scalability. Implement efficient distributed training. It relies heavily on 
efficient communication and scheduling mechanisms. The default implementation 
is based on Spark. More efficient implementations might include external 
libraries but would use the same defined interface.

*Main features:* 
# Multilayer perceptron classifier (MLP)
# Autoencoder
# Convolutional neural networks for computer vision. The interface has to 
provide a few architectures for deep learning that are widely used in practice, 
such as AlexNet

*Additional features:*
# Other architectures, such as the recurrent neural network (RNN), long 
short-term memory (LSTM), the restricted Boltzmann machine (RBM), the deep 
belief network (DBN), and MLP multivariate regression
# Regularizers, such as L1, L2 and dropout
# Normalizers
# Network customization. The internal API of Spark ANN is designed to be 
flexible and can handle different types of layers. However, only a part of the 
API is made public. We have to limit the number of public classes in order to 
make it simpler to support other languages. This forces us to use (String or 
Number) parameters instead of introducing new public classes. One of the 
options to specify the architecture of an ANN is to use a text configuration 
with a layer-wise description. We have considered using the Caffe format for 
this. It gives the benefit of compatibility with a well-known deep learning 
tool and simplifies the support of other languages in Spark. Implementation of 
a parser for a subset of the Caffe format might be the first step towards the 
support of general ANN architectures in Spark. 
# Hardware-specific optimization. One can wrap other deep learning 
implementations with this interface, allowing users to pick a particular 
back-end, e.g. Caffe or TensorFlow, along with the default one. The main 
motivation for using specialized libraries for deep learning is to fully take 
advantage of the hardware where Spark runs, in particular GPUs. Having the 
default interface in Spark, we would need to wrap only a subset of functions 
from a given specialized library. This does require effort, but it is not the 
same as wrapping all functions. Wrappers can be provided as packages without 
the need to pull new dependencies into Spark.

*Completed (merged to the main Spark branch):*
# Requirements: https://issues.apache.org/jira/browse/SPARK-9471
## API 
https://spark-summit.org/eu-2015/events/a-scalable-implementation-of-deep-learning-on-spark/
## Efficiency & Scalability: https://github.com/avulanov/ann-benchmark
# Features:
## Multilayer perceptron classifier 
https://issues.apache.org/jira/browse/SPARK-9471

*In progress (pull request):*
# Features:
##

[jira] [Updated] (SPARK-5575) Artificial neural networks for MLlib deep learning

2016-07-27 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-5575:

Description: 
*Goal:* Implement various types of artificial neural networks

*Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581)
Having deep learning within Spark's ML library is a question of convenience. 
Spark has broad analytic capabilities, and it is useful to have deep learning 
as one of these tools at hand. Deep learning is a model of choice for several 
important modern use cases, and Spark ML might want to cover them. Ultimately, 
it is hard to explain why we have PCA in ML but do not provide an autoencoder. 
To summarize, Spark should have at least the most widely used deep learning 
models, such as the fully connected artificial neural network, the 
convolutional network and the autoencoder. Advanced and experimental deep 
learning features might reside within packages or as pluggable external tools. 
These three will provide a comprehensive deep learning set for Spark ML. We 
might also include recurrent networks.

*Requirements:*
# Extensible API compatible with Spark ML. Basic abstractions such as Neuron, 
Layer, Error, Regularization, Forward and Backpropagation etc. should be 
implemented as traits or interfaces, so they can be easily extended or reused. 
Define the Spark ML API for deep learning. This interface is similar to the 
other analytics tools in Spark and supports ML pipelines. This makes deep 
learning easy to use and to plug into analytics workloads for Spark users. 
# Efficiency. The current implementation of the multilayer perceptron in Spark 
is less than 2x slower than Caffe, both measured on CPU. The main overhead 
sources are the JVM and Spark's communication layer. For more details, please 
refer to https://github.com/avulanov/ann-benchmark. Having said that, an 
efficient implementation of deep learning in Spark should be only a few times 
slower than a specialized tool. This is very reasonable for a platform that 
does much more than deep learning, and I believe the community understands this.
# Scalability. Implement efficient distributed training. It relies heavily on 
efficient communication and scheduling mechanisms. The default implementation 
is based on Spark. More efficient implementations might include external 
libraries but would use the same defined interface.

*Main features:* 
# Multilayer perceptron classifier (MLP)
# Autoencoder
# Convolutional neural networks for computer vision. The interface has to 
provide a few architectures for deep learning that are widely used in practice, 
such as AlexNet

*Additional features:*
# Other architectures, such as the recurrent neural network (RNN), long 
short-term memory (LSTM), the restricted Boltzmann machine (RBM), the deep 
belief network (DBN), and MLP multivariate regression
# Regularizers, such as L1, L2 and dropout
# Normalizers
# Network customization. The internal API of Spark ANN is designed to be 
flexible and can handle different types of layers. However, only a part of the 
API is made public. We have to limit the number of public classes in order to 
make it simpler to support other languages. This forces us to use (String or 
Number) parameters instead of introducing new public classes. One of the 
options to specify the architecture of an ANN is to use a text configuration 
with a layer-wise description. We have considered using the Caffe format for 
this. It gives the benefit of compatibility with a well-known deep learning 
tool and simplifies the support of other languages in Spark. Implementation of 
a parser for a subset of the Caffe format might be the first step towards the 
support of general ANN architectures in Spark. 
# Hardware specific optimization. One can wrap other deep learning 
implementations with this interface allowing users to pick a particular 
back-end, e.g. Caffe or TensorFlow, along with the default one. The interface 
has to provide few architectures for deep learning that are widely used in 
practice, such as AlexNet. The main motivation for using specialized libraries 
for deep learning would be to fully take advantage of the hardware where Spark 
runs, in particular GPUs. Having the default interface in Spark, we will need 
to wrap only a subset of functions from a given specialized library. It does 
require an effort, however it is not the same as wrapping all functions. 
Wrappers can be provided as packages without the need to pull new dependencies 
into Spark.
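
As a toy illustration of the text-configuration idea above, here is a sketch 
of a parser for an invented layer-wise format; both the format and the names 
are assumptions made for illustration, not the actual Caffe grammar.

{code}
// Invented one-line-per-layer format, e.g.:
//   affine 784 100
//   sigmoid
//   affine 100 10
//   softmax
sealed trait LayerSpec
case class Affine(in: Int, out: Int) extends LayerSpec
case object SigmoidSpec extends LayerSpec
case object SoftmaxSpec extends LayerSpec

def parseNetwork(config: String): Seq[LayerSpec] =
  config.split("\n").map(_.trim).filter(_.nonEmpty).map { line =>
    line.split("\\s+").toList match {
      case "affine" :: in :: out :: Nil => Affine(in.toInt, out.toInt)
      case "sigmoid" :: Nil => SigmoidSpec
      case "softmax" :: Nil => SoftmaxSpec
      case _ => throw new IllegalArgumentException(s"Cannot parse layer: $line")
    }
  }.toSeq
{code}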

*Completed (merged to the main Spark branch):*
# Requirements: https://issues.apache.org/jira/browse/SPARK-9471
## API 
https://spark-summit.org/eu-2015/events/a-scalable-implementation-of-deep-learning-on-spark/
## Efficiency & Scalability: https://github.com/avulanov/ann-benchmark
# Features:
## Multilayer perceptron classifier 
https://issues.apache.org/jira/browse/SPARK-9471
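
For reference, a minimal usage sketch of the merged multilayer perceptron 
classifier; the input path is a placeholder, and any DataFrame with "label" 
and "features" columns would work.

{code}
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("MLPSketch").getOrCreate()
// Placeholder path to a LIBSVM-format dataset with 4 features and 3 classes.
val data = spark.read.format("libsvm")
  .load("data/sample_multiclass_classification_data.txt")

val trainer = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 5, 4, 3)) // 4 inputs, two hidden layers, 3 output classes
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)

val model = trainer.fit(data)
val predictions = model.transform(data)
{code}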

*In progress (pull request):*
# Features:
##

[jira] [Updated] (SPARK-5575) Artificial neural networks for MLlib deep learning

2016-07-27 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-5575:

Description: 
*Goal:* Implement various types of artificial neural networks

*Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581)
Having deep learning within Spark's ML library is a question of convenience. 
Spark has broad analytic capabilities, and it is useful to have deep learning 
as one of these tools at hand. Deep learning is a model of choice for several 
important modern use cases, and Spark ML might want to cover them. After all, 
it is hard to explain why we have PCA in ML but do not provide an autoencoder. 
To summarize, Spark should have at least the most widely used deep learning 
models, such as the fully connected artificial neural network, the 
convolutional network and the autoencoder. Advanced and experimental deep 
learning features might reside within packages or as pluggable external tools. 
These three will provide a comprehensive deep learning set for Spark ML. We 
might include recurrent networks as well.

*Requirements:*
# Extensible API compatible with Spark ML. Basic abstractions such as Neuron, 
Layer, Error, Regularization, Forward and Backpropagation etc. should be 
implemented as traits or interfaces, so they can be easily extended or reused. 
Define the Spark ML API for deep learning. This interface is similar to the 
other analytics tools in Spark and supports ML pipelines. This makes deep 
learning easy to use and to plug into analytics workloads for Spark users. 
# Efficiency. The current implementation of the multilayer perceptron in Spark 
is less than 2x slower than Caffe, both measured on CPU. The main overhead 
sources are the JVM and Spark's communication layer. For more details, please 
refer to https://github.com/avulanov/ann-benchmark. Having said that, an 
efficient implementation of deep learning in Spark should be only a few times 
slower than a specialized tool. This is very reasonable for a platform that 
does much more than deep learning, and I believe it is understood by the 
community.
# Scalability. Implement efficient distributed training. This relies heavily 
on efficient communication and scheduling mechanisms. The default 
implementation is based on Spark. More efficient implementations might include 
some external libraries but should use the same defined interface.

*Main features:* 
# Multilayer perceptron classifier (MLP)
# Autoencoder
# Convolutional neural networks for computer vision. The interface has to 
provide a few architectures for deep learning that are widely used in 
practice, such as AlexNet

*Additional features:*
# Other architectures, such as the recurrent neural network (RNN), long 
short-term memory (LSTM), the restricted Boltzmann machine (RBM), the deep 
belief network (DBN) and MLP multivariate regression
# Regularizers, such as L1, L2 and dropout
# Normalizers
# Network customization. The internal API of Spark ANN is designed to be 
flexible and can handle different types of layers. However, only a part of the 
API is made public. We have to limit the number of public classes in order to 
make it simpler to support other languages. This forces us to use (String or 
Number) parameters instead of introducing new public classes. One of the 
options for specifying the architecture of an ANN is a text configuration with 
a layer-wise description. We have considered using the Caffe format for this. 
It gives the benefit of compatibility with a well-known deep learning tool and 
simplifies the support of other languages in Spark. Implementing a parser for 
a subset of the Caffe format might be the first step towards the support of 
general ANN architectures in Spark. 
# Hardware-specific optimization. One can wrap other deep learning 
implementations with this interface, allowing users to pick a particular 
back-end, e.g. Caffe or TensorFlow, along with the default one. The main 
motivation for using specialized libraries for deep learning would be to take 
full advantage of the hardware that Spark runs on, in particular GPUs. Having 
the default interface in Spark, we will need to wrap only a subset of 
functions from a given specialized library. This does require effort; however, 
it is not the same as wrapping all functions. Wrappers can be provided as 
packages without the need to pull new dependencies into Spark.

*Completed (merged to the main Spark branch):*
* Requirements: https://issues.apache.org/jira/browse/SPARK-9471
** API 
https://spark-summit.org/eu-2015/events/a-scalable-implementation-of-deep-learning-on-spark/
** Efficiency & Scalability: https://github.com/avulanov/ann-benchmark
* Features:
** Multilayer perceptron classifier 
https://issues.apache.org/jira/browse/SPARK-9471

*In progress (pull request):*
* Features:
**

[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap

2016-07-27 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15395489#comment-15395489
 ] 

Alexander Ulanov commented on SPARK-15581:
--

[~bordaw] sounds great! Just in case, I have summarized the above discussion 
related to the DNN in the main DNN jira: 
https://issues.apache.org/jira/browse/SPARK-5575

> MLlib 2.1 Roadmap
> -
>
> Key: SPARK-15581
> URL: https://issues.apache.org/jira/browse/SPARK-15581
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>
> This is a master list for MLlib improvements we are working on for the next 
> release. Please view this as a wish list rather than a definite plan, for we 
> don't have an accurate estimate of available resources. Due to limited review 
> bandwidth, features appearing on this list will get higher priority during 
> code review. But feel free to suggest new items to the list in comments. We 
> are experimenting with this process. Your feedback would be greatly 
> appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a medium/big feature. Based on our experience, mixing the development 
> process with a big feature usually causes long delays in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start working on some features. This is to avoid duplicate work. For 
> small features, you don't need to wait to get JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned 
> first before coding and keep the ETA updated on the JIRA. If there is no 
> activity on the JIRA page for a certain amount of time, the JIRA should be 
> released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Remember to add the `@Since("VERSION")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps to improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link 
> them properly.
> * Add a "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on 
> JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
> please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and 
> documentation if applicable.
> h1. Roadmap (*WIP*)
> This is NOT [a complete list of MLlib JIRAs for 2.1| 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority].
>  We only include umbrella JIRAs and high-level tasks.
> Major efforts in this release:
> * Feature parity for the DataFrames-based API (`spark.ml`), relative to the 
> RDD-based API
> * ML persistence
> * Python API feature parity and test coverage
> * R API expansion and improvements
> * Note about new features: As usual, we expect to expand the feature set of 
> MLlib.  However, we will prioritize API parity, bug fixes, and improvements 
> over new features.
> Note `spark.mllib` is in maintenance mode now.  We will accept bug fixes for 
> it, but new features, APIs, and improvements will only be added to `spark.ml`.
> h2. Critical feature parity in DataFrame-based API
> * Umbrella JIRA: [SPARK-4591]
> h2. Persistence
> * Complete persistence within MLlib
> ** Python tuning (SPARK-13786)
> * MLlib in R format: compatibility with other languages (SPARK-15572)
> * Impose backwards compatibility for persistence (SPARK-15573)
> h2. Python API
> * Standardize unit tests for Scala and Python to improve and consolidate test 
> coverage for Params, persistence, and other common functionality (SPARK-15571)
> * Improve Python API handling of Params, persistence (SPARK-14771) 
> (SPARK-14706)
> ** Note: The linked JIRAs for this are incomplete.  More to be created...
> ** Related: Implement Python meta-algorithms in Scala (to simplify 
> persistence) (SPARK-15574)
> * Feature parity: The main goal of 

[jira] [Created] (SPARK-4752) Classifier based on artificial neural network

2014-12-04 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-4752:
---

 Summary: Classifier based on artificial neural network
 Key: SPARK-4752
 URL: https://issues.apache.org/jira/browse/SPARK-4752
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Alexander Ulanov
 Fix For: 1.3.0


Implement classifier based on artificial neural network (ANN). Requirements:
1) Use the existing artificial neural network implementation 
https://issues.apache.org/jira/browse/SPARK-2352, 
https://github.com/apache/spark/pull/1290
2) Extend MLlib ClassificationModel trait, 
3) Like other classifiers in MLlib, accept RDD[LabeledPoint] for training,
4) Be able to return the ANN model






[jira] [Comment Edited] (SPARK-4752) Classifier based on artificial neural network

2014-12-04 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234855#comment-14234855
 ] 

Alexander Ulanov edited comment on SPARK-4752 at 12/5/14 12:51 AM:
---

The initial implementation can be found here: 
https://github.com/avulanov/spark/tree/annclassifier. It encodes the class 
label as a binary vector in the ANN output and selects the class based on the 
biggest output value. The implementation contains unit tests as well. 
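
For illustration, that encoding/decoding scheme amounts to something like the 
following sketch (the function names are illustrative):

```
// Encode a class label as a one-hot target vector for the ANN output layer.
def encodeLabel(label: Int, numClasses: Int): Array[Double] = {
  val out = Array.fill(numClasses)(0.0)
  out(label) = 1.0
  out
}

// Decode a prediction by picking the index of the biggest output value.
def decodeOutput(output: Array[Double]): Int =
  output.indices.maxBy(i => output(i))
```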

The mentioned code uses the following PR: 
https://github.com/apache/spark/pull/1290. It is not yet merged into the main 
branch. I think that I should not make a pull request until then.


was (Author: avulanov):
The initial implementation can be found here: 
https://github.com/avulanov/spark/tree/annclassifier. It codes the class label 
as a binary vector in the ANN output and selects the class based on biggest 
output value. The implementation contains unit tests as well. 

The mentioned code uses the following PR: 
https://github.com/apache/spark/pull/1290. It is not yet merged into the main 
branch. I think that I should not make a pull request until then.

> Classifier based on artificial neural network
> -
>
> Key: SPARK-4752
> URL: https://issues.apache.org/jira/browse/SPARK-4752
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Alexander Ulanov
> Fix For: 1.3.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Implement classifier based on artificial neural network (ANN). Requirements:
> 1) Use the existing artificial neural network implementation 
> https://issues.apache.org/jira/browse/SPARK-2352, 
> https://github.com/apache/spark/pull/1290
> 2) Extend MLlib ClassificationModel trait, 
> 3) Like other classifiers in MLlib, accept RDD[LabeledPoint] for training,
> 4) Be able to return the ANN model






[jira] [Commented] (SPARK-4752) Classifier based on artificial neural network

2014-12-04 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234855#comment-14234855
 ] 

Alexander Ulanov commented on SPARK-4752:
-

The initial implementation can be found here: 
https://github.com/avulanov/spark/tree/annclassifier. It encodes the class 
label as a binary vector in the ANN output and selects the class based on the 
biggest output value. The implementation contains unit tests as well. 

The mentioned code uses the following PR: 
https://github.com/apache/spark/pull/1290. It is not yet merged into the main 
branch. I think that I should not make a pull request until then.

> Classifier based on artificial neural network
> -
>
> Key: SPARK-4752
> URL: https://issues.apache.org/jira/browse/SPARK-4752
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Alexander Ulanov
> Fix For: 1.3.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Implement classifier based on artificial neural network (ANN). Requirements:
> 1) Use the existing artificial neural network implementation 
> https://issues.apache.org/jira/browse/SPARK-2352, 
> https://github.com/apache/spark/pull/1290
> 2) Extend MLlib ClassificationModel trait, 
> 3) Like other classifiers in MLlib, accept RDD[LabeledPoint] for training,
> 4) Be able to return the ANN model






[jira] [Commented] (SPARK-2623) Stacked Auto Encoder (Deep Learning )

2014-12-12 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14244651#comment-14244651
 ] 

Alexander Ulanov commented on SPARK-2623:
-

Relevant PR: artificial neural networks 
https://github.com/apache/spark/pull/1290. Also, I've implemented an alpha 
version of the stacked autoencoder here: 
https://github.com/avulanov/spark/tree/autoencoder

> Stacked Auto Encoder (Deep Learning )
> -
>
> Key: SPARK-2623
> URL: https://issues.apache.org/jira/browse/SPARK-2623
> Project: Spark
>  Issue Type: New Feature
>Reporter: Victor Fang
>Assignee: Victor Fang
>  Labels: deeplearning, machine_learning
>
> We would like to add a parallel implementation of the Stacked Auto Encoder 
> (Deep Learning) algorithm to Spark MLLib.
> SAE is one of the most popular Deep Learning algorithms. It has achieved 
> successful benchmarks in MNIST handwritten digit classification, Google's 
> ICML2012 "cat face" paper (http://icml.cc/2012/papers/73.pdf), etc.
> Our focus is to leverage the RDD and get the SAE with the following 
> capabilities, with ease of use for both beginners and advanced researchers:
> 1, multi layer SAE deep network training and scoring.
> 2, unsupervised feature learning.
> 3, supervised learning with multinomial logistic regression (softmax). 






[jira] [Comment Edited] (SPARK-2623) Stacked Auto Encoder (Deep Learning )

2014-12-12 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14244651#comment-14244651
 ] 

Alexander Ulanov edited comment on SPARK-2623 at 12/12/14 7:31 PM:
---

Relevant PR https://issues.apache.org/jira/browse/SPARK-2352: artificial 
neural networks https://github.com/apache/spark/pull/1290. Also, I've 
implemented an alpha version of the stacked autoencoder here: 
https://github.com/avulanov/spark/tree/autoencoder


was (Author: avulanov):
Relevant PR: artificial neural networks 
https://github.com/apache/spark/pull/1290. Also, I've implemented an alpha 
version of the stacked autoencoder here: 
https://github.com/avulanov/spark/tree/autoencoder

> Stacked Auto Encoder (Deep Learning )
> -
>
> Key: SPARK-2623
> URL: https://issues.apache.org/jira/browse/SPARK-2623
> Project: Spark
>  Issue Type: New Feature
>Reporter: Victor Fang
>Assignee: Victor Fang
>  Labels: deeplearning, machine_learning
>
> We would like to add a parallel implementation of the Stacked Auto Encoder 
> (Deep Learning) algorithm to Spark MLLib.
> SAE is one of the most popular Deep Learning algorithms. It has achieved 
> successful benchmarks in MNIST handwritten digit classification, Google's 
> ICML2012 "cat face" paper (http://icml.cc/2012/papers/73.pdf), etc.
> Our focus is to leverage the RDD and get the SAE with the following 
> capabilities, with ease of use for both beginners and advanced researchers:
> 1, multi layer SAE deep network training and scoring.
> 2, unsupervised feature learning.
> 3, supervised learning with multinomial logistic regression (softmax). 






[jira] [Commented] (SPARK-6673) spark-shell.cmd can't start even when spark was built in Windows

2015-04-03 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395001#comment-14395001
 ] 

Alexander Ulanov commented on SPARK-6673:
-

Probably a similar issue: I am trying to execute unit tests in MLlib with 
LocalClusterSparkContext on Windows 7. I am getting a bunch of errors in the 
log saying: "Cannot find any assembly build directories." If I set 
SPARK_SCALA_VERSION=2.10, then I get "No assemblies found in 
'C:\dev\spark\mllib\.\assembly\target\scala-2.10'"

> spark-shell.cmd can't start even when spark was built in Windows
> 
>
> Key: SPARK-6673
> URL: https://issues.apache.org/jira/browse/SPARK-6673
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 1.3.0
>Reporter: Masayoshi TSUZUKI
>Assignee: Masayoshi TSUZUKI
>Priority: Blocker
>
> spark-shell.cmd can't start.
> {code}
> bin\spark-shell.cmd --master local
> {code}
> will get
> {code}
> Failed to find Spark assembly JAR.
> You need to build Spark before running this program.
> {code}
> even when we have built spark.
> This is because of the lack of the environment {{SPARK_SCALA_VERSION}} which 
> is used in {{spark-class2.cmd}}.
> In Linux scripts, this value is set to {{2.10}} or {{2.11}} by default in 
> {{load-spark-env.sh}}, but there is no equivalent script for Windows.
> As a workaround, by executing
> {code}
> set SPARK_SCALA_VERSION=2.10
> {code}
> before executing spark-shell.cmd, we can successfully start it.






[jira] [Commented] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop

2015-04-03 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395185#comment-14395185
 ] 

Alexander Ulanov commented on SPARK-2356:
-

The following worked for me:
Download http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe and 
put it into DISK:\FOLDERS\bin\
Set HADOOP_CONF=DISK:\FOLDERS

> Exception: Could not locate executable null\bin\winutils.exe in the Hadoop 
> ---
>
> Key: SPARK-2356
> URL: https://issues.apache.org/jira/browse/SPARK-2356
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 1.0.0
>Reporter: Kostiantyn Kudriavtsev
>Priority: Critical
>
> I'm trying to run some transformations on Spark. They work fine on a cluster 
> (YARN, Linux machines). However, when I try to run them on a local machine 
> (Windows 7) in a unit test, I get errors (I don't use Hadoop, I read files 
> from the local filesystem):
> {code}
> 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the 
> hadoop binary path
> java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
> Hadoop binaries.
>   at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
>   at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
>   at org.apache.hadoop.util.Shell.(Shell.java:326)
>   at org.apache.hadoop.util.StringUtils.(StringUtils.java:76)
>   at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
>   at org.apache.hadoop.security.Groups.(Groups.java:77)
>   at 
> org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
>   at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
>   at 
> org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:36)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala:109)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala)
>   at org.apache.spark.SparkContext.(SparkContext.scala:228)
>   at org.apache.spark.SparkContext.(SparkContext.scala:97)
> {code}
> This happens because the Hadoop config is initialized each time a Spark 
> context is created, regardless of whether Hadoop is required or not.
> I propose adding a special flag to indicate whether the Hadoop config is 
> required (or starting this configuration manually)






[jira] [Comment Edited] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java

2015-04-07 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14483729#comment-14483729
 ] 

Alexander Ulanov edited comment on SPARK-6682 at 4/7/15 6:35 PM:
-

This is a very good idea. Please note, though, that there are a few issues here:
1) Setting the optimizer: optimizers (LBFGS and SGD) have Gradient and Updater 
as constructor parameters. I don't think it is a good idea to force users to 
create a Gradient and an Updater separately just to be able to create an 
Optimizer. So one has to explicitly implement methods like setLBFGSOptimizer 
or setSGDOptimizer and return them, so that the user will be able to set their 
parameters.

```
  def LBFGSOptimizer: LBFGS = {
    val lbfgs = new LBFGS(_gradient, _updater)
    optimizer = lbfgs
    lbfgs
  }
```

Another downside is that if someone implements a new Optimizer, then one has 
to add "setMyOptimizer" to the builder. The above problems might be solved by 
figuring out a better interface for Optimizer that allows setting its 
parameters without actually creating it.

2) Setting parameters after setting the optimizer: what if the user sets the 
Updater after setting the Optimizer? The Optimizer takes the Updater as a 
constructor parameter! So one has to recreate the corresponding Optimizer.

```
  private[this] def updateGradient(gradient: Gradient): Unit = {
    optimizer match {
      case lbfgs: LBFGS => lbfgs.setGradient(gradient)
      case sgd: GradientDescent => sgd.setGradient(gradient)
      case other => throw new UnsupportedOperationException(
        s"Only LBFGS and GradientDescent are supported but got ${other.getClass}.")
    }
  }
```

So it is essential to work out the Optimizer interface first.


was (Author: avulanov):
This is a very good idea. Please note, though, that there are a few issues here:
1) Setting the optimizer: optimizers (LBFGS and SGD) have Gradient and Updater 
as constructor parameters. I don't think it is a good idea to force users to 
create a Gradient and an Updater separately just to be able to create an 
Optimizer. So one has to explicitly implement methods like setLBFGSOptimizer 
or setSGDOptimizer and return them, so that the user will be able to set their 
parameters.

```
  def LBFGSOptimizer: LBFGS = {
    val lbfgs = new LBFGS(_gradient, _updater)
    optimizer = lbfgs
    lbfgs
  }
```

Another downside is that if someone implements a new Optimizer, then one has 
to add "setMyOptimizer" to the builder. The above problems might be solved by 
figuring out a better interface for Optimizer that allows setting its 
parameters without actually creating it.

2) Setting parameters after setting the optimizer: what if the user sets the 
Updater after setting the Optimizer? The Optimizer takes the Updater as a 
constructor parameter! So one has to recreate the corresponding Optimizer.

```
  private[this] def updateGradient(gradient: Gradient): Unit = {
    optimizer match {
      case lbfgs: LBFGS => lbfgs.setGradient(gradient)
      case sgd: GradientDescent => sgd.setGradient(gradient)
      case other => throw new UnsupportedOperationException(
        s"Only LBFGS and GradientDescent are supported but got ${other.getClass}.")
    }
  }
```

So it is essential to work out the Optimizer interface first.

> Deprecate static train and use builder instead for Scala/Java
> -
>
> Key: SPARK-6682
> URL: https://issues.apache.org/jira/browse/SPARK-6682
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> In MLlib, we have for some time been unofficially moving away from the old 
> static train() methods and moving towards builder patterns.  This JIRA is to 
> discuss this move and (hopefully) make it official.
> "Old static train()" API:
> {code}
> val myModel = NaiveBayes.train(myData, ...)
> {code}
> "New builder pattern" API:
> {code}
> val nb = new NaiveBayes().setLambda(0.1)
> val myModel = nb.train(myData)
> {code}
> Pros of the builder pattern:
> * Much less code when algorithms have many parameters.  Since Java does not 
> support default arguments, we required *many* duplicated static train() 
> methods (for each prefix set of arguments).
> * Helps to enforce default parameters.  Users should ideally not have to even 
> think about setting parameters if they just want to try an algorithm quickly.
> * Matches spark.ml API
> Cons of the builder pattern:
> * In Python APIs, static train methods are more "Pythonic."
> Proposal:
> * Scala/Java: We should start deprecating the old static train() methods.  We 
> must keep them for API stability, but deprecating will help with API 
> consistency, making it clear that everyone should use the builder pattern.  
> As we deprecate them, we should make sure that the bui

[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java

2015-04-07 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14483729#comment-14483729
 ] 

Alexander Ulanov commented on SPARK-6682:
-

This is a very good idea. Please note, though, that there are a few issues here:
1) Setting the optimizer: optimizers (LBFGS and SGD) have Gradient and Updater 
as constructor parameters. I don't think it is a good idea to force users to 
create a Gradient and an Updater separately just to be able to create an 
Optimizer. So one has to explicitly implement methods like setLBFGSOptimizer 
or setSGDOptimizer and return them, so that the user will be able to set their 
parameters.

```
  def LBFGSOptimizer: LBFGS = {
    val lbfgs = new LBFGS(_gradient, _updater)
    optimizer = lbfgs
    lbfgs
  }
```

Another downside is that if someone implements a new Optimizer, then one has 
to add "setMyOptimizer" to the builder. The above problems might be solved by 
figuring out a better interface for Optimizer that allows setting its 
parameters without actually creating it.

2) Setting parameters after setting the optimizer: what if the user sets the 
Updater after setting the Optimizer? The Optimizer takes the Updater as a 
constructor parameter! So one has to recreate the corresponding Optimizer.

```
  private[this] def updateGradient(gradient: Gradient): Unit = {
    optimizer match {
      case lbfgs: LBFGS => lbfgs.setGradient(gradient)
      case sgd: GradientDescent => sgd.setGradient(gradient)
      case other => throw new UnsupportedOperationException(
        s"Only LBFGS and GradientDescent are supported but got ${other.getClass}.")
    }
  }
```

So it is essential to work out the Optimizer interface first.

> Deprecate static train and use builder instead for Scala/Java
> -
>
> Key: SPARK-6682
> URL: https://issues.apache.org/jira/browse/SPARK-6682
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> In MLlib, we have for some time been unofficially moving away from the old 
> static train() methods and moving towards builder patterns.  This JIRA is to 
> discuss this move and (hopefully) make it official.
> "Old static train()" API:
> {code}
> val myModel = NaiveBayes.train(myData, ...)
> {code}
> "New builder pattern" API:
> {code}
> val nb = new NaiveBayes().setLambda(0.1)
> val myModel = nb.train(myData)
> {code}
> Pros of the builder pattern:
> * Much less code when algorithms have many parameters.  Since Java does not 
> support default arguments, we required *many* duplicated static train() 
> methods (for each prefix set of arguments).
> * Helps to enforce default parameters.  Users should ideally not have to even 
> think about setting parameters if they just want to try an algorithm quickly.
> * Matches spark.ml API
> Cons of the builder pattern:
> * In Python APIs, static train methods are more "Pythonic."
> Proposal:
> * Scala/Java: We should start deprecating the old static train() methods.  We 
> must keep them for API stability, but deprecating will help with API 
> consistency, making it clear that everyone should use the builder pattern.  
> As we deprecate them, we should make sure that the builder pattern supports 
> all parameters.
> * Python: Keep static train methods.
> CC: [~mengxr]






[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java

2015-04-08 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485554#comment-14485554
 ] 

Alexander Ulanov commented on SPARK-6682:
-

[~yuu.ishik...@gmail.com] 
They reside in the package org.apache.spark.mllib.optimization: class 
LBFGS(private var gradient: Gradient, private var updater: Updater) and class 
GradientDescent private[mllib] (private var gradient: Gradient, private var 
updater: Updater). They extend the Optimizer trait, which has only one 
function: def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): 
Vector. This function is limited to only one type of input: vectors and their 
labels. I have submitted a separate issue regarding this: 
https://issues.apache.org/jira/browse/SPARK-5362. 

1. Right now the static methods work with hard-coded optimizers, such as 
LogisticRegressionWithSGD. This is not very convenient. I think moving away 
from static methods and using builders implies that optimizers could also be 
set by users. This will be a problem because the current optimizers require an 
Updater and a Gradient at creation time. 
2. The workaround I suggested in the previous post addresses this.
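
One possible way around this coupling, sketched under the assumption that the 
code lives inside the org.apache.spark.mllib package (GradientDescent's 
constructor is private[mllib]): keep the gradient and updater as plain builder 
fields and construct the optimizer lazily. The TrainerBuilder and 
OptimizerSpec names are illustrative, not a proposal from this thread.

```
import org.apache.spark.mllib.optimization._

// Illustrative sketch only. The optimizer is built lazily from parts that can
// be set independently, so changing the Updater never forces the user to
// recreate the Optimizer by hand.
sealed trait OptimizerSpec
case object UseLBFGS extends OptimizerSpec
case object UseSGD extends OptimizerSpec

class TrainerBuilder {
  private var gradient: Gradient = new LogisticGradient()
  private var updater: Updater = new SimpleUpdater()
  private var spec: OptimizerSpec = UseLBFGS

  def setGradient(g: Gradient): this.type = { gradient = g; this }
  def setUpdater(u: Updater): this.type = { updater = u; this }
  def setOptimizer(s: OptimizerSpec): this.type = { spec = s; this }

  // The optimizer is instantiated here, with the latest gradient and updater.
  def build(): Optimizer = spec match {
    case UseLBFGS => new LBFGS(gradient, updater)
    case UseSGD => new GradientDescent(gradient, updater)
  }
}
```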


> Deprecate static train and use builder instead for Scala/Java
> -
>
> Key: SPARK-6682
> URL: https://issues.apache.org/jira/browse/SPARK-6682
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> In MLlib, we have for some time been unofficially moving away from the old 
> static train() methods and moving towards builder patterns.  This JIRA is to 
> discuss this move and (hopefully) make it official.
> "Old static train()" API:
> {code}
> val myModel = NaiveBayes.train(myData, ...)
> {code}
> "New builder pattern" API:
> {code}
> val nb = new NaiveBayes().setLambda(0.1)
> val myModel = nb.train(myData)
> {code}
> Pros of the builder pattern:
> * Much less code when algorithms have many parameters.  Since Java does not 
> support default arguments, we required *many* duplicated static train() 
> methods (for each prefix set of arguments).
> * Helps to enforce default parameters.  Users should ideally not have to even 
> think about setting parameters if they just want to try an algorithm quickly.
> * Matches spark.ml API
> Cons of the builder pattern:
> * In Python APIs, static train methods are more "Pythonic."
> Proposal:
> * Scala/Java: We should start deprecating the old static train() methods.  We 
> must keep them for API stability, but deprecating will help with API 
> consistency, making it clear that everyone should use the builder pattern.  
> As we deprecate them, we should make sure that the builder pattern supports 
> all parameters.
> * Python: Keep static train methods.
> CC: [~mengxr]






[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs

2015-04-14 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494518#comment-14494518
 ] 

Alexander Ulanov commented on SPARK-5256:
-

Probably the main issue for MLlib is that iterative algorithms are implemented 
with the aggregate function. It has a fixed overhead of around half a second, 
which limits its applicability when one needs to run a large number of 
iterations. This is the case for the bigger data that Spark is intended for. 
The problem gets worse with stochastic algorithms because there is no good way 
to randomly pick data from an RDD, and one needs to look through it 
sequentially.
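
For concreteness, a minimal sketch of the pattern in question: each iteration 
launches a separate Spark job through treeAggregate, so the fixed per-job 
overhead is paid on every pass. The function names and parameters here are 
illustrative placeholders.

```
import org.apache.spark.rdd.RDD

// Batch gradient descent where every iteration is one aggregate job.
def gradientDescent(
    data: RDD[(Double, Array[Double])],
    gradientAt: (Array[Double], (Double, Array[Double])) => Array[Double],
    numIterations: Int,
    stepSize: Double,
    dim: Int): Array[Double] = {
  val weights = Array.fill(dim)(0.0)
  for (_ <- 1 to numIterations) {
    // One Spark job per iteration; its fixed scheduling overhead dominates
    // when iterations are numerous and each one does little work.
    val (gradSum, count) = data.treeAggregate((Array.fill(dim)(0.0), 0L))(
      seqOp = { case ((g, n), point) =>
        val gi = gradientAt(weights, point)
        for (i <- 0 until dim) g(i) += gi(i)
        (g, n + 1)
      },
      combOp = { case ((g1, n1), (g2, n2)) =>
        for (i <- 0 until dim) g1(i) += g2(i)
        (g1, n1 + n2)
      })
    for (i <- 0 until dim) weights(i) -= stepSize * gradSum(i) / count
  }
  weights
}
```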

> Improving MLlib optimization APIs
> -
>
> Key: SPARK-5256
> URL: https://issues.apache.org/jira/browse/SPARK-5256
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> *Goal*: Improve APIs for optimization
> *Motivation*: There have been several disjoint mentions of improving the 
> optimization APIs to make them more pluggable, extensible, etc.  This JIRA is 
> a place to discuss what API changes are necessary for the long term, and to 
> provide links to other relevant JIRAs.
> Eventually, I hope this leads to a design doc outlining:
> * current issues
> * requirements such as supporting many types of objective functions, 
> optimization algorithms, and parameters to those algorithms
> * ideal API
> * breakdown of smaller JIRAs needed to achieve that API
> I will soon create an initial design doc, and I will try to watch this JIRA 
> and include ideas from JIRA comments.






[jira] [Comment Edited] (SPARK-5256) Improving MLlib optimization APIs

2015-04-14 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494568#comment-14494568
 ] 

Alexander Ulanov edited comment on SPARK-5256 at 4/14/15 6:43 PM:
--

The size of data that requires using Spark suggests that the learning 
algorithm will be limited by time rather than by data. According to the paper 
"The tradeoffs of large scale learning", SGD converges significantly faster 
than batch GD in this case. My use case is machine learning on large data, in 
particular, time series. 

Just in case, a link to the paper: 
http://papers.nips.cc/paper/3323-the-tradeoffs-of-large-scale-learning.pdf


was (Author: avulanov):
The size of data that requires using Spark suggests that the learning 
algorithm will be limited by time rather than by data. According to the paper 
"The tradeoffs of large scale learning", SGD converges significantly faster 
than batch GD in this case. My use case is machine learning on large data, in 
particular, time series.

> Improving MLlib optimization APIs
> -
>
> Key: SPARK-5256
> URL: https://issues.apache.org/jira/browse/SPARK-5256
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> *Goal*: Improve APIs for optimization
> *Motivation*: There have been several disjoint mentions of improving the 
> optimization APIs to make them more pluggable, extensible, etc.  This JIRA is 
> a place to discuss what API changes are necessary for the long term, and to 
> provide links to other relevant JIRAs.
> Eventually, I hope this leads to a design doc outlining:
> * current issues
> * requirements such as supporting many types of objective functions, 
> optimization algorithms, and parameters to those algorithms
> * ideal API
> * breakdown of smaller JIRAs needed to achieve that API
> I will soon create an initial design doc, and I will try to watch this JIRA 
> and include ideas from JIRA comments.






[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs

2015-04-14 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494568#comment-14494568
 ] 

Alexander Ulanov commented on SPARK-5256:
-

The size of data that requires using Spark suggests that the learning 
algorithm will be limited by time rather than by data. According to the paper 
"The tradeoffs of large scale learning", SGD converges significantly faster 
than batch GD in this case. My use case is machine learning on large data, in 
particular, time series.

> Improving MLlib optimization APIs
> -
>
> Key: SPARK-5256
> URL: https://issues.apache.org/jira/browse/SPARK-5256
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> *Goal*: Improve APIs for optimization
> *Motivation*: There have been several disjoint mentions of improving the 
> optimization APIs to make them more pluggable, extensible, etc.  This JIRA is 
> a place to discuss what API changes are necessary for the long term, and to 
> provide links to other relevant JIRAs.
> Eventually, I hope this leads to a design doc outlining:
> * current issues
> * requirements such as supporting many types of objective functions, 
> optimization algorithms, and parameters to those algorithms
> * ideal API
> * breakdown of smaller JIRAs needed to achieve that API
> I will soon create an initial design doc, and I will try to watch this JIRA 
> and include ideas from JIRA comments.






[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs

2015-04-14 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494579#comment-14494579
 ] 

Alexander Ulanov commented on SPARK-5256:
-

[~shivaram] Indeed, performance is orthogonal to the API design. Though 
well-designed things should work efficiently, don't you think? :)

> Improving MLlib optimization APIs
> -
>
> Key: SPARK-5256
> URL: https://issues.apache.org/jira/browse/SPARK-5256
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> *Goal*: Improve APIs for optimization
> *Motivation*: There have been several disjoint mentions of improving the 
> optimization APIs to make them more pluggable, extensible, etc.  This JIRA is 
> a place to discuss what API changes are necessary for the long term, and to 
> provide links to other relevant JIRAs.
> Eventually, I hope this leads to a design doc outlining:
> * current issues
> * requirements such as supporting many types of objective functions, 
> optimization algorithms, and parameters to those algorithms
> * ideal API
> * breakdown of smaller JIRAs needed to achieve that API
> I will soon create an initial design doc, and I will try to watch this JIRA 
> and include ideas from JIRA comments.






[jira] [Comment Edited] (SPARK-5256) Improving MLlib optimization APIs

2015-04-14 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494579#comment-14494579
 ] 

Alexander Ulanov edited comment on SPARK-5256 at 4/14/15 6:48 PM:
--

[~shivaram] Indeed, performance is orthogonal to the API design. Though 
well-designed things should work efficiently, shouldn't they? :)


was (Author: avulanov):
[~shivaram] Indeed, performance is orthogonal to the API design. Though 
well-designed things should work efficiently, don't you think? :)

> Improving MLlib optimization APIs
> -
>
> Key: SPARK-5256
> URL: https://issues.apache.org/jira/browse/SPARK-5256
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> *Goal*: Improve APIs for optimization
> *Motivation*: There have been several disjoint mentions of improving the 
> optimization APIs to make them more pluggable, extensible, etc.  This JIRA is 
> a place to discuss what API changes are necessary for the long term, and to 
> provide links to other relevant JIRAs.
> Eventually, I hope this leads to a design doc outlining:
> * current issues
> * requirements such as supporting many types of objective functions, 
> optimization algorithms, and parameters to those algorithms
> * ideal API
> * breakdown of smaller JIRAs needed to achieve that API
> I will soon create an initial design doc, and I will try to watch this JIRA 
> and include ideas from JIRA comments.






[jira] [Created] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2014-09-04 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-3403:
---

 Summary: NaiveBayes crashes with blas/lapack native libraries for 
breeze (netlib-java)
 Key: SPARK-3403
 URL: https://issues.apache.org/jira/browse/SPARK-3403
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.2
 Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and 
MinGW64 precompiled dlls.
Reporter: Alexander Ulanov
 Fix For: 1.1.0


Code:
val model = NaiveBayes.train(train)
val predictionAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
predictionAndLabels.foreach(println)

Result: 
program crashes with: "Process finished with exit code -1073741819 
(0xC0000005)" after displaying the first prediction






[jira] [Updated] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2014-09-04 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-3403:

Attachment: NativeNN.scala

The attached file contains an example that reproduces the same issue

> NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
> -
>
> Key: SPARK-3403
> URL: https://issues.apache.org/jira/browse/SPARK-3403
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.2
> Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
> described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and 
> MinGW64 precompiled dlls.
>Reporter: Alexander Ulanov
> Fix For: 1.1.0
>
> Attachments: NativeNN.scala
>
>
> Code:
> val model = NaiveBayes.train(train)
> val predictionAndLabels = test.map { point =>
>   val score = model.predict(point.features)
>   (score, point.label)
> }
> predictionAndLabels.foreach(println)
> Result: 
> program crashes with: "Process finished with exit code -1073741819 
> (0xC0000005)" after displaying the first prediction






[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2014-09-04 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121563#comment-14121563
 ] 

Alexander Ulanov commented on SPARK-3403:
-

Yes, I tried using netlib-java separately with the same OpenBLAS setup and it 
worked properly, even within several threads. However, I didn't mimic the same 
multi-threading setup as MLlib has because it is complicated. Do you want me 
to send you all the DLLs that I used? I had trouble compiling OpenBLAS for 
Windows, so I used precompiled x64 versions from the OpenBLAS and MinGW64 
websites.


> NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
> -
>
> Key: SPARK-3403
> URL: https://issues.apache.org/jira/browse/SPARK-3403
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.2
> Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
> described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and 
> MinGW64 precompiled dlls.
>Reporter: Alexander Ulanov
> Fix For: 1.1.0
>
> Attachments: NativeNN.scala
>
>
> Code:
> val model = NaiveBayes.train(train)
> val predictionAndLabels = test.map { point =>
>   val score = model.predict(point.features)
>   (score, point.label)
> }
> predictionAndLabels.foreach(println)
> Result: 
> program crashes with: "Process finished with exit code -1073741819 
> (0xC0000005)" after displaying the first prediction






[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2014-09-05 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122699#comment-14122699
 ] 

Alexander Ulanov commented on SPARK-3403:
-

I managed to compile OpenBLAS with MinGW64 and `USE_THREAD=0`, and got a 
single-threaded dll. With this dll my tests didn't fail and seem to execute 
properly. Thank you for the suggestion! 
1) Do you think the same issue will remain on Linux?
2) What are the performance implications of using single-threaded OpenBLAS 
through breeze?


> NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
> -
>
> Key: SPARK-3403
> URL: https://issues.apache.org/jira/browse/SPARK-3403
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.2
> Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
> described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and 
> MinGW64 precompiled dlls.
>Reporter: Alexander Ulanov
> Fix For: 1.1.0
>
> Attachments: NativeNN.scala
>
>
> Code:
> val model = NaiveBayes.train(train)
> val predictionAndLabels = test.map { point =>
>   val score = model.predict(point.features)
>   (score, point.label)
> }
> predictionAndLabels.foreach(println)
> Result: 
> program crashes with: "Process finished with exit code -1073741819 
> (0xC0000005)" after displaying the first prediction






[jira] [Comment Edited] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2014-09-05 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122699#comment-14122699
 ] 

Alexander Ulanov edited comment on SPARK-3403 at 9/5/14 9:53 AM:
-

I managed to compile OpenBLAS with MinGW64 and `USE_THREAD=0`, and got a 
single-threaded dll. With this dll my tests didn't fail and seem to execute 
properly. Thank you for the suggestion! 
1) Do you think the same issue will remain on Linux?
2) What are the performance implications of using single-threaded OpenBLAS 
through breeze?
3) I didn't get any performance improvement with native libraries versus Java 
arrays. My matrices are of size up to 10K-20K. Is it supposed to be like that?


was (Author: avulanov):
I managed to compile OpenBLAS with MinGW64 and `USE_THREAD=0`, and got a 
single-threaded dll. With this dll my tests didn't fail and seem to execute 
properly. Thank you for the suggestion! 
1) Do you think the same issue will remain on Linux?
2) What are the performance implications of using single-threaded OpenBLAS 
through breeze?


> NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
> -
>
> Key: SPARK-3403
> URL: https://issues.apache.org/jira/browse/SPARK-3403
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.2
> Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
> described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and 
> MinGW64 precompiled dlls.
>Reporter: Alexander Ulanov
> Fix For: 1.1.0
>
> Attachments: NativeNN.scala
>
>
> Code:
> val model = NaiveBayes.train(train)
> val predictionAndLabels = test.map { point =>
>   val score = model.predict(point.features)
>   (score, point.label)
> }
> predictionAndLabels.foreach(println)
> Result: 
> program crashes with: "Process finished with exit code -1073741819 
> (0xC0000005)" after displaying the first prediction






[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2014-09-18 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138829#comment-14138829
 ] 

Alexander Ulanov commented on SPARK-3403:
-

Thank you, your answers are really helpful. Should I submit this issue to 
OpenBLAS (https://github.com/xianyi/OpenBLAS) or netlib-java 
(https://github.com/fommil/netlib-java)? I thought the latter has the JNI 
implementation. Is it OK to submit it as is?

> NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
> -
>
> Key: SPARK-3403
> URL: https://issues.apache.org/jira/browse/SPARK-3403
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.2
> Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
> described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and 
> MinGW64 precompiled dlls.
>Reporter: Alexander Ulanov
> Fix For: 1.2.0
>
> Attachments: NativeNN.scala
>
>
> Code:
> val model = NaiveBayes.train(train)
> val predictionAndLabels = test.map { point =>
>   val score = model.predict(point.features)
>   (score, point.label)
> }
> predictionAndLabels.foreach(println)
> Result: 
> program crashes with: "Process finished with exit code -1073741819 
> (0xC0000005)" after displaying the first prediction






[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2014-09-19 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140128#comment-14140128
 ] 

Alexander Ulanov commented on SPARK-3403:
-

Posted to netlib-java: https://github.com/fommil/netlib-java/issues/72

> NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
> -
>
> Key: SPARK-3403
> URL: https://issues.apache.org/jira/browse/SPARK-3403
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.2
> Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
> described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and 
> MinGW64 precompiled dlls.
>Reporter: Alexander Ulanov
> Fix For: 1.2.0
>
> Attachments: NativeNN.scala
>
>
> Code:
> val model = NaiveBayes.train(train)
> val predictionAndLabels = test.map { point =>
>   val score = model.predict(point.features)
>   (score, point.label)
> }
> predictionAndLabels.foreach(println)
> Result: 
> program crashes with: "Process finished with exit code -1073741819 
> (0xC0000005)" after displaying the first prediction






[jira] [Comment Edited] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2014-09-19 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138829#comment-14138829
 ] 

Alexander Ulanov edited comment on SPARK-3403 at 9/19/14 7:16 AM:
--

Thank you, your answers are really helpful. Should I submit this issue to 
OpenBLAS ( https://github.com/xianyi/OpenBLAS ) or netlib-java ( 
https://github.com/fommil/netlib-java )? I thought the latter has a JNI 
implementation. Is it ok to submit it as is?


was (Author: avulanov):
Thank you, your answers are really helpful. Should I submit this issue to 
OpenBLAS (https://github.com/xianyi/OpenBLAS) or netlib-java 
(https://github.com/fommil/netlib-java)? I thought the latter has jni 
implementation. I it ok to submit it as is?

> NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
> -
>
> Key: SPARK-3403
> URL: https://issues.apache.org/jira/browse/SPARK-3403
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.2
> Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
> described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and 
> MinGW64 precompiled dlls.
>Reporter: Alexander Ulanov
> Fix For: 1.2.0
>
> Attachments: NativeNN.scala
>
>
> Code:
> val model = NaiveBayes.train(train)
> val predictionAndLabels = test.map { point =>
>   val score = model.predict(point.features)
>   (score, point.label)
> }
> predictionAndLabels.foreach(println)
> Result: 
> program crashes with: "Process finished with exit code -1073741819 
> (0xC0000005)" after displaying the first prediction



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2014-09-19 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140396#comment-14140396
 ] 

Alexander Ulanov commented on SPARK-3403:
-

Thanks, Sam! Posted to OpenBLAS: https://github.com/xianyi/OpenBLAS/issues/452

> NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
> -
>
> Key: SPARK-3403
> URL: https://issues.apache.org/jira/browse/SPARK-3403
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.2
> Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
> described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and 
> MinGW64 precompiled dlls.
>Reporter: Alexander Ulanov
> Fix For: 1.2.0
>
> Attachments: NativeNN.scala
>
>
> Code:
> val model = NaiveBayes.train(train)
> val predictionAndLabels = test.map { point =>
>   val score = model.predict(point.features)
>   (score, point.label)
> }
> predictionAndLabels.foreach(println)
> Result: 
> program crashes with: "Process finished with exit code -1073741819 
> (0xC0000005)" after displaying the first prediction



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning

2015-06-11 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582768#comment-14582768
 ] 

Alexander Ulanov commented on SPARK-5575:
-

Hi Janani, 

There is already an implementation of DBN (and RBM) by [~gq]. You can find it 
here: https://github.com/witgo/spark/tree/ann-interface-gemm-dbn

> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> Goal: Implement various types of artificial neural networks
> Motivation: deep learning trend
> Requirements: 
> 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward 
> and Backpropagation etc. should be implemented as traits or interfaces, so 
> they can be easily extended or reused
> 2) Implement complex abstractions, such as feed forward and recurrent networks
> 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
> autoencoder (sparse and denoising), stacked autoencoder, restricted  
> boltzmann machines (RBM), deep belief networks (DBN) etc.
> 4) Implement or reuse supporting constucts, such as classifiers, normalizers, 
> poolers,  etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8449) HDF5 read/write support for Spark MLlib

2015-06-18 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-8449:
---

 Summary: HDF5 read/write support for Spark MLlib
 Key: SPARK-8449
 URL: https://issues.apache.org/jira/browse/SPARK-8449
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
 Fix For: 1.4.1


Add support for reading and writing the HDF5 file format to/from LabeledPoint. 
HDFS and the local file system have to be supported. Other Spark formats are to 
be discussed. 

Interface proposal:
/* path - directory path in any Hadoop-supported file system URI */
MLUtils.saveAsHDF5(sc: SparkContext, path: String, data: RDD[LabeledPoint]): Unit
/* path - file or directory path in any Hadoop-supported file system URI */
MLUtils.loadHDF5(sc: SparkContext, path: String): RDD[LabeledPoint]
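
Below is a minimal Scala sketch of the proposed interface (signatures only, 
with stub bodies; the object name MLUtilsHDF5Proposal is hypothetical, since 
these methods do not exist in MLlib):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical container for the proposed methods; not an existing MLlib API.
object MLUtilsHDF5Proposal {
  /* path - directory path in any Hadoop-supported file system URI */
  def saveAsHDF5(sc: SparkContext, path: String, data: RDD[LabeledPoint]): Unit = ???
  /* path - file or directory path in any Hadoop-supported file system URI */
  def loadHDF5(sc: SparkContext, path: String): RDD[LabeledPoint] = ???
}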




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8449) HDF5 read/write support for Spark MLlib

2015-06-18 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592398#comment-14592398
 ] 

Alexander Ulanov commented on SPARK-8449:
-

It seems that using the official HDF5 reader is not a viable choice for Spark 
due to its platform-dependent binaries. We need to look for a pure Java 
implementation. Apparently, there is one called netCDF: 
http://www.unidata.ucar.edu/blogs/news/entry/netcdf_java_library_version_44. It 
might be tricky to use because the license is not Apache. However, it is worth 
a look.

> HDF5 read/write support for Spark MLlib
> ---
>
> Key: SPARK-8449
> URL: https://issues.apache.org/jira/browse/SPARK-8449
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: Alexander Ulanov
> Fix For: 1.4.1
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Add support for reading and writing HDF5 file format to/from LabeledPoint. 
> HDFS and local file system have to be supported. Other Spark formats to be 
> discussed. 
> Interface proposal:
> /* path - directory path in any Hadoop-supported file system URI */
> MLUtils.saveAsHDF5(sc: SparkContext, path: String, RDD[LabeledPoint]): Unit
> /* path - file or directory path in any Hadoop-supported file system URI */
> MLUtils.loadHDF5(sc: SparkContext, path: String): RDD[LabeledPoint]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8449) HDF5 read/write support for Spark MLlib

2015-06-18 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592398#comment-14592398
 ] 

Alexander Ulanov edited comment on SPARK-8449 at 6/18/15 7:53 PM:
--

It seems that using the official HDF5 reader is not a viable choice for Spark 
due to its platform-dependent binaries. We need to look for a pure Java 
implementation. Apparently, there is one called netCDF: 
http://www.unidata.ucar.edu/blogs/news/entry/netcdf_java_library_version_44. It 
might be tricky to use because the license is not Apache. However, it is worth 
a look.


was (Author: avulanov):
It seems that using the official HDF5 reader is not a viable choice for Spark 
due to platform dependent binaries. We need to look for pure Java 
implementation. Apparently, there is one called netCDF: 
http://www.unidata.ucar.edu/blogs/news/entry/netcdf_java_library_version_44. It 
might be tricky to use it because the license is not Apache. However it worth a 
look.

> HDF5 read/write support for Spark MLlib
> ---
>
> Key: SPARK-8449
> URL: https://issues.apache.org/jira/browse/SPARK-8449
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: Alexander Ulanov
> Fix For: 1.4.1
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Add support for reading and writing HDF5 file format to/from LabeledPoint. 
> HDFS and local file system have to be supported. Other Spark formats to be 
> discussed. 
> Interface proposal:
> /* path - directory path in any Hadoop-supported file system URI */
> MLUtils.saveAsHDF5(sc: SparkContext, path: String, RDD[LabeledPoint]): Unit
> /* path - file or directory path in any Hadoop-supported file system URI */
> MLUtils.loadHDF5(sc: SparkContext, path: String): RDD[LabeledPoint]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2222) Add multiclass evaluation metrics

2014-06-20 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-2222:
---

 Summary: Add multiclass evaluation metrics
 Key: SPARK-2222
 URL: https://issues.apache.org/jira/browse/SPARK-2222
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Alexander Ulanov


There is no class in Spark MLlib for measuring the performance of multiclass 
classifiers. This task involves adding such class and unit tests. The following 
measures are to be implemented: per class, micro averaged and weighted averaged 
Precision, Recall and F1-Measure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2222) Add multiclass evaluation metrics

2014-06-21 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040029#comment-14040029
 ] 

Alexander Ulanov commented on SPARK-2222:
-

Hi Jun,

I've already implemented this feature and made a pull request. You can view it 
on https://github.com/apache/spark/pull/1155#issuecomment-46683617

Best regards, Alexander

> Add multiclass evaluation metrics
> -
>
> Key: SPARK-2222
> URL: https://issues.apache.org/jira/browse/SPARK-2222
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Alexander Ulanov
>
> There is no class in Spark MLlib for measuring the performance of multiclass 
> classifiers. This task involves adding such class and unit tests. The 
> following measures are to be implemented: per class, micro averaged and 
> weighted averaged Precision, Recall and F1-Measure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2222) Add multiclass evaluation metrics

2014-06-22 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040101#comment-14040101
 ] 

Alexander Ulanov commented on SPARK-2222:
-

The micro-averaged Precision and Recall are equal for a multiclass classifier 
because sum(fn(i)) = sum(fp(i)): both are the sum of all off-diagonal elements 
of the confusion matrix. For more details, please refer to the book 
"Introduction to IR" by Manning.

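A quick numeric check of this equality in Scala (the confusion matrix below is 
a made-up example, not data from the pull request):

// Rows are actual classes, columns are predicted classes (hypothetical counts).
val cm = Array(
  Array(5, 1, 0),
  Array(2, 3, 1),
  Array(0, 1, 7)
)
val classes = cm.indices
val tp = classes.map(i => cm(i)(i)).sum                                   // 15
val fp = classes.map(j => classes.map(i => cm(i)(j)).sum - cm(j)(j)).sum  // 5
val fn = classes.map(i => cm(i).sum - cm(i)(i)).sum                       // 5
val microPrecision = tp.toDouble / (tp + fp)  // 0.75
val microRecall = tp.toDouble / (tp + fn)     // 0.75, equal as claimed
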
> Add multiclass evaluation metrics
> -
>
> Key: SPARK-2222
> URL: https://issues.apache.org/jira/browse/SPARK-2222
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Alexander Ulanov
>
> There is no class in Spark MLlib for measuring the performance of multiclass 
> classifiers. This task involves adding such class and unit tests. The 
> following measures are to be implemented: per class, micro averaged and 
> weighted averaged Precision, Recall and F1-Measure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2329) Add multi-label evaluation metrics

2014-06-30 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-2329:
---

 Summary: Add multi-label evaluation metrics
 Key: SPARK-2329
 URL: https://issues.apache.org/jira/browse/SPARK-2329
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Alexander Ulanov
 Fix For: 1.1.0


There is no class in Spark MLlib for measuring the performance of multi-label 
classifiers. Multi-label classification is when a document is labeled with 
several labels (classes).

This task involves adding a class for multi-label evaluation and unit tests. 
The following measures are to be implemented: Precision, Recall and F1-measure 
(1) document-based, averaged over the number of documents; (2) per label; (3) 
label-based, micro and macro averaged; (4) Hamming loss. Reference: 
Tsoumakas, Grigorios, Ioannis Katakis, and Ioannis Vlahavas. "Mining 
multi-label data." Data mining and knowledge discovery handbook. Springer US, 
2010. 667-685.
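
For illustration, a minimal Scala sketch of measure (4), Hamming loss, over 
made-up label sets (the documents, label sets and numLabels below are 
hypothetical):

// Per-document actual and predicted label sets (hypothetical data).
val actual    = Seq(Set(1, 2), Set(3), Set(1, 3))
val predicted = Seq(Set(1),    Set(3), Set(2, 3))
val numLabels = 3
// Hamming loss: fraction of label slots that disagree, averaged over documents.
val hammingLoss = actual.zip(predicted).map { case (a, p) =>
  (a.diff(p).size + p.diff(a).size).toDouble / numLabels
}.sum / actual.size
// = (1/3 + 0/3 + 2/3) / 3 = 0.333...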



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1473) Feature selection for high dimensional datasets

2014-07-02 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049939#comment-14049939
 ] 

Alexander Ulanov commented on SPARK-1473:
-

Is anybody working on this issue?

> Feature selection for high dimensional datasets
> ---
>
> Key: SPARK-1473
> URL: https://issues.apache.org/jira/browse/SPARK-1473
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Ignacio Zendejas
>Priority: Minor
>  Labels: features
> Fix For: 1.1.0
>
>
> For classification tasks involving large feature spaces in the order of tens 
> of thousands or higher (e.g., text classification with n-grams, where n > 1), 
> it is often useful to rank and filter features that are irrelevant thereby 
> reducing the feature space by at least one or two orders of magnitude without 
> impacting performance on key evaluation metrics (accuracy/precision/recall).
> A feature evaluation interface which is flexible needs to be designed and at 
> least two methods should be implemented with Information Gain being a 
> priority as it has been shown to be amongst the most reliable.
> Special consideration should be taken in the design to account for wrapper 
> methods (see research papers below) which are more practical for lower 
> dimensional data.
> Relevant research:
> * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional
> likelihood maximisation: a unifying framework for information theoretic
> feature selection.*The Journal of Machine Learning Research*, *13*, 27-66.
> * Forman, George. "An extensive empirical study of feature selection metrics 
> for text classification." The Journal of machine learning research 3 (2003): 
> 1289-1305.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1473) Feature selection for high dimensional datasets

2014-08-08 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090473#comment-14090473
 ] 

Alexander Ulanov commented on SPARK-1473:
-

I've implemented Chi-Squared and added a pull request

> Feature selection for high dimensional datasets
> ---
>
> Key: SPARK-1473
> URL: https://issues.apache.org/jira/browse/SPARK-1473
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Ignacio Zendejas
>Priority: Minor
>  Labels: features
> Fix For: 1.1.0
>
>
> For classification tasks involving large feature spaces in the order of tens 
> of thousands or higher (e.g., text classification with n-grams, where n > 1), 
> it is often useful to rank and filter features that are irrelevant thereby 
> reducing the feature space by at least one or two orders of magnitude without 
> impacting performance on key evaluation metrics (accuracy/precision/recall).
> A feature evaluation interface which is flexible needs to be designed and at 
> least two methods should be implemented with Information Gain being a 
> priority as it has been shown to be amongst the most reliable.
> Special consideration should be taken in the design to account for wrapper 
> methods (see research papers below) which are more practical for lower 
> dimensional data.
> Relevant research:
> * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional
> likelihood maximisation: a unifying framework for information theoretic
> feature selection.*The Journal of Machine Learning Research*, *13*, 27-66.
> * Forman, George. "An extensive empirical study of feature selection metrics 
> for text classification." The Journal of machine learning research 3 (2003): 
> 1289-1305.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-1473) Feature selection for high dimensional datasets

2014-08-08 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090473#comment-14090473
 ] 

Alexander Ulanov edited comment on SPARK-1473 at 8/8/14 8:27 AM:
-

I've implemented Chi-Squared and added a pull request 
https://github.com/apache/spark/pull/1484


was (Author: avulanov):
I've implemented Chi-Squared and added a pull request

> Feature selection for high dimensional datasets
> ---
>
> Key: SPARK-1473
> URL: https://issues.apache.org/jira/browse/SPARK-1473
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Ignacio Zendejas
>Priority: Minor
>  Labels: features
> Fix For: 1.1.0
>
>
> For classification tasks involving large feature spaces in the order of tens 
> of thousands or higher (e.g., text classification with n-grams, where n > 1), 
> it is often useful to rank and filter features that are irrelevant thereby 
> reducing the feature space by at least one or two orders of magnitude without 
> impacting performance on key evaluation metrics (accuracy/precision/recall).
> A feature evaluation interface which is flexible needs to be designed and at 
> least two methods should be implemented with Information Gain being a 
> priority as it has been shown to be amongst the most reliable.
> Special consideration should be taken in the design to account for wrapper 
> methods (see research papers below) which are more practical for lower 
> dimensional data.
> Relevant research:
> * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional
> likelihood maximisation: a unifying framework for information theoretic
> feature selection.*The Journal of Machine Learning Research*, *13*, 27-66.
> * Forman, George. "An extensive empirical study of feature selection metrics 
> for text classification." The Journal of machine learning research 3 (2003): 
> 1289-1305.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs

2015-01-14 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277986#comment-14277986
 ] 

Alexander Ulanov commented on SPARK-5256:
-

I would like to improve the Gradient interface so that it can process 
something more general than `label` (which is relevant only to classifiers but 
not to other machine learning methods) and also allow batch processing. The 
simplest way of doing this is to add another function to the `Gradient` 
interface:

def compute(data: Vector, output: Vector, weights: Vector, cumGradient: 
Vector): Double

In the `Gradient` trait, the existing method should call this `compute` with 
`label`. Of course, one needs to make some adjustments to the LBFGS and 
GradientDescent optimizers, replacing label: Double with output: Vector. 

For batch processing, one can put data and output points stacked into a long 
vector (matrices are stored this way in Breeze) and pass them through the 
proposed interface, as sketched below.
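
A minimal sketch of the proposed change (GradientProposal is a simplified 
stand-in for MLlib's Gradient trait; the bridging default implementation is my 
assumption of how the scalar-label signature would delegate):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

trait GradientProposal {
  // Existing-style signature with a scalar label, expressed via the new one.
  def compute(data: Vector, label: Double, weights: Vector,
              cumGradient: Vector): Double =
    compute(data, Vectors.dense(label), weights, cumGradient)

  // Proposed signature: a generic Vector output, which can also hold a
  // stacked batch of outputs.
  def compute(data: Vector, output: Vector, weights: Vector,
              cumGradient: Vector): Double
}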

> Improving MLlib optimization APIs
> -
>
> Key: SPARK-5256
> URL: https://issues.apache.org/jira/browse/SPARK-5256
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> *Goal*: Improve APIs for optimization
> *Motivation*: There have been several disjoint mentions of improving the 
> optimization APIs to make them more pluggable, extensible, etc.  This JIRA is 
> a place to discuss what API changes are necessary for the long term, and to 
> provide links to other relevant JIRAs.
> Eventually, I hope this leads to a design doc outlining:
> * current issues
> * requirements such as supporting many types of objective functions, 
> optimization algorithms, and parameters to those algorithms
> * ideal API
> * breakdown of smaller JIRAs needed to achieve that API
> I will soon create an initial design doc, and I will try to watch this JIRA 
> and include ideas from JIRA comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs

2015-01-14 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277988#comment-14277988
 ] 

Alexander Ulanov commented on SPARK-5256:
-

Also, asynchronous gradient updates might be a good thing to have.

> Improving MLlib optimization APIs
> -
>
> Key: SPARK-5256
> URL: https://issues.apache.org/jira/browse/SPARK-5256
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> *Goal*: Improve APIs for optimization
> *Motivation*: There have been several disjoint mentions of improving the 
> optimization APIs to make them more pluggable, extensible, etc.  This JIRA is 
> a place to discuss what API changes are necessary for the long term, and to 
> provide links to other relevant JIRAs.
> Eventually, I hope this leads to a design doc outlining:
> * current issues
> * requirements such as supporting many types of objective functions, 
> optimization algorithms, and parameters to those algorithms
> * ideal API
> * breakdown of smaller JIRAs needed to achieve that API
> I will soon create an initial design doc, and I will try to watch this JIRA 
> and include ideas from JIRA comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5362) Gradient and Optimizer to support generic output (instead of label) and data batches

2015-01-21 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-5362:
---

 Summary: Gradient and Optimizer to support generic output (instead 
of label) and data batches
 Key: SPARK-5362
 URL: https://issues.apache.org/jira/browse/SPARK-5362
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Alexander Ulanov
 Fix For: 1.3.0


Currently, the Gradient and Optimizer interfaces support data in the form of 
RDD[(Double, Vector)], which refers to a label and features. This limits their 
application to classification problems. For example, an artificial neural 
network demands a Vector as output (instead of label: Double). Moreover, the 
current interface does not support data batches. I propose to replace label: 
Double with output: Vector. This enables passing a generic output instead of a 
label, and also passing data and output batches stored in the corresponding 
vectors (see the sketch below).
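
A small Breeze sketch of the batch idea (matrix sizes are illustrative): a 
batch stored one example per column stacks into a single long vector and can 
be recovered without copying.

import breeze.linalg.{DenseMatrix, DenseVector}

val featureSize = 4
val batchSize = 8
// One example per column; Breeze stores matrices column-major.
val batch = DenseMatrix.rand[Double](featureSize, batchSize)
// Stack the batch into one long vector, as the proposed interface would accept...
val stacked = new DenseVector(batch.data)
// ...and recover the matrix view on the other side, reusing the same array.
val unstacked = new DenseMatrix(featureSize, batchSize, stacked.data)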



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5362) Gradient and Optimizer to support generic output (instead of label) and data batches

2015-01-21 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14286703#comment-14286703
 ] 

Alexander Ulanov commented on SPARK-5362:
-

https://github.com/apache/spark/pull/4152

> Gradient and Optimizer to support generic output (instead of label) and data 
> batches
> 
>
> Key: SPARK-5362
> URL: https://issues.apache.org/jira/browse/SPARK-5362
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
> Fix For: 1.3.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently, Gradient and Optimizer interfaces support data in form of 
> RDD[Double, Vector] which refers to label and features. This limits its 
> application to classification problems. For example, artificial neural 
> network demands Vector as output (instead of label: Double). Moreover, 
> current interface does not support data batches. I propose to replace label: 
> Double with output: Vector. It enables passing generic output instead of 
> label and also passing data and output batches stored in corresponding 
> vectors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs

2015-01-21 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14286706#comment-14286706
 ] 

Alexander Ulanov commented on SPARK-5256:
-

I've implemented my proposal with Vector as output in 
https://issues.apache.org/jira/browse/SPARK-5362

> Improving MLlib optimization APIs
> -
>
> Key: SPARK-5256
> URL: https://issues.apache.org/jira/browse/SPARK-5256
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> *Goal*: Improve APIs for optimization
> *Motivation*: There have been several disjoint mentions of improving the 
> optimization APIs to make them more pluggable, extensible, etc.  This JIRA is 
> a place to discuss what API changes are necessary for the long term, and to 
> provide links to other relevant JIRAs.
> Eventually, I hope this leads to a design doc outlining:
> * current issues
> * requirements such as supporting many types of objective functions, 
> optimization algorithms, and parameters to those algorithms
> * ideal API
> * breakdown of smaller JIRAs needed to achieve that API
> I will soon create an initial design doc, and I will try to watch this JIRA 
> and include ideas from JIRA comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5386) Reduce fails with vectors of big length

2015-01-23 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-5386:
---

 Summary: Reduce fails with vectors of big length
 Key: SPARK-5386
 URL: https://issues.apache.org/jira/browse/SPARK-5386
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
 Environment: 6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, 
Ubuntu), each runs 2 Workers
./spark-shell --executor-memory 8G --driver-memory 8G

Reporter: Alexander Ulanov
 Fix For: 1.3.0


Code:

import org.apache.spark.mllib.rdd.RDDFunctions._
import breeze.linalg._
import org.apache.log4j._
Logger.getRootLogger.setLevel(Level.OFF)
val n = 6000
val p = 12
val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double](n))
vv.reduce(_ + _)

When executed in the shell, it crashes after some period of time. One of the 
nodes contains the following in stdout:
Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
os::commit_memory(0x00075550, 2863661056, 0) failed; error='Cannot 
allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
committing reserved memory.
# An error report file with more information is saved as:
# /datac/spark/app-20150123091936-/89/hs_err_pid2247.log

During the execution there is a message: Job aborted due to stage failure: 
Exception while getting task result: java.io.IOException: Connection from 
server-12.net/10.10.10.10:54701 closed




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5386) Reduce fails with vectors of big length

2015-01-23 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-5386:

Description: 
Code:

import org.apache.spark.mllib.rdd.RDDFunctions._
import breeze.linalg._
import org.apache.log4j._
Logger.getRootLogger.setLevel(Level.OFF)
val n = 6000
val p = 12
val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double]( n ))
vv.reduce(_ + _)

When executed in the shell, it crashes after some period of time. One of the 
nodes contains the following in stdout:
Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
os::commit_memory(0x00075550, 2863661056, 0) failed; error='Cannot 
allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
committing reserved memory.
# An error report file with more information is saved as:
# /datac/spark/app-20150123091936-/89/hs_err_pid2247.log

During the execution there is a message: Job aborted due to stage failure: 
Exception while getting task result: java.io.IOException: Connection from 
server-12.net/10.10.10.10:54701 closed


  was:
Code:

import org.apache.spark.mllib.rdd.RDDFunctions._
import breeze.linalg._
import org.apache.log4j._
Logger.getRootLogger.setLevel(Level.OFF)
val n = 6000
val p = 12
val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double](n))
vv.reduce(_ + _)

When executing in shell it crashes after some period of time. One of the node 
contain the following in stdout:
Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
os::commit_memory(0x00075550, 2863661056, 0) failed; error='Cannot 
allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
committing reserved memory.
# An error report file with more information is saved as:
# /datac/spark/app-20150123091936-/89/hs_err_pid2247.log

During the execution there is a message: Job aborted due to stage failure: 
Exception while getting task result: java.io.IOException: Connection from 
server-12.net/10.10.10.10:54701 closed



> Reduce fails with vectors of big length
> ---
>
> Key: SPARK-5386
> URL: https://issues.apache.org/jira/browse/SPARK-5386
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
> Environment: 6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, 
> Ubuntu), each runs 2 Workers
> ./spark-shell --executor-memory 8G --driver-memory 8G
>Reporter: Alexander Ulanov
> Fix For: 1.3.0
>
>
> Code:
> import org.apache.spark.mllib.rdd.RDDFunctions._
> import breeze.linalg._
> import org.apache.log4j._
> Logger.getRootLogger.setLevel(Level.OFF)
> val n = 6000
> val p = 12
> val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double]( n ))
> vv.reduce(_ + _)
> When executing in shell it crashes after some period of time. One of the node 
> contain the following in stdout:
> Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
> os::commit_memory(0x00075550, 2863661056, 0) failed; error='Cannot 
> allocate memory' (errno=12)
> #
> # There is insufficient memory for the Java Runtime Environment to continue.
> # Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
> committing reserved memory.
> # An error report file with more information is saved as:
> # /datac/spark/app-20150123091936-/89/hs_err_pid2247.log
> During the execution there is a message: Job aborted due to stage failure: 
> Exception while getting task result: java.io.IOException: Connection from 
> server-12.net/10.10.10.10:54701 closed



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5386) Reduce fails with vectors of big length

2015-01-23 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-5386:

Environment: 
Overall:
6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each runs 2 Workers
Spark:
./spark-shell --executor-memory 8G --driver-memory 8G
spark.driver.maxResultSize 0
"java.io.tmpdir" and "spark.local.dir" set to a disk with a lot of free space

  was:
6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each runs 2 Workers
./spark-shell --executor-memory 8G --driver-memory 8G



> Reduce fails with vectors of big length
> ---
>
> Key: SPARK-5386
> URL: https://issues.apache.org/jira/browse/SPARK-5386
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
> Environment: Overall:
> 6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each runs 2 Workers
> Spark:
> ./spark-shell --executor-memory 8G --driver-memory 8G
> spark.driver.maxResultSize 0
> "java.io.tmpdir" and "spark.local.dir" set to a disk with a lot of free space
>Reporter: Alexander Ulanov
> Fix For: 1.3.0
>
>
> Code:
> import org.apache.spark.mllib.rdd.RDDFunctions._
> import breeze.linalg._
> import org.apache.log4j._
> Logger.getRootLogger.setLevel(Level.OFF)
> val n = 6000
> val p = 12
> val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double]( n ))
> vv.reduce(_ + _)
> When executing in shell it crashes after some period of time. One of the node 
> contain the following in stdout:
> Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
> os::commit_memory(0x00075550, 2863661056, 0) failed; error='Cannot 
> allocate memory' (errno=12)
> #
> # There is insufficient memory for the Java Runtime Environment to continue.
> # Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
> committing reserved memory.
> # An error report file with more information is saved as:
> # /datac/spark/app-20150123091936-/89/hs_err_pid2247.log
> During the execution there is a message: Job aborted due to stage failure: 
> Exception while getting task result: java.io.IOException: Connection from 
> server-12.net/10.10.10.10:54701 closed



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length

2015-01-23 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289621#comment-14289621
 ] 

Alexander Ulanov commented on SPARK-5386:
-

I allocate 8G for the driver and for each worker. Could you suggest why it is 
not enough for handling a reduce operation with a 60M vector of Doubles?

> Reduce fails with vectors of big length
> ---
>
> Key: SPARK-5386
> URL: https://issues.apache.org/jira/browse/SPARK-5386
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
> Environment: Overall:
> 6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each runs 2 Workers
> Spark:
> ./spark-shell --executor-memory 8G --driver-memory 8G
> spark.driver.maxResultSize 0
> "java.io.tmpdir" and "spark.local.dir" set to a disk with a lot of free space
>Reporter: Alexander Ulanov
> Fix For: 1.3.0
>
>
> Code:
> import org.apache.spark.mllib.rdd.RDDFunctions._
> import breeze.linalg._
> import org.apache.log4j._
> Logger.getRootLogger.setLevel(Level.OFF)
> val n = 6000
> val p = 12
> val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double]( n ))
> vv.reduce(_ + _)
> When executing in shell it crashes after some period of time. One of the node 
> contain the following in stdout:
> Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
> os::commit_memory(0x00075550, 2863661056, 0) failed; error='Cannot 
> allocate memory' (errno=12)
> #
> # There is insufficient memory for the Java Runtime Environment to continue.
> # Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
> committing reserved memory.
> # An error report file with more information is saved as:
> # /datac/spark/app-20150123091936-/89/hs_err_pid2247.log
> During the execution there is a message: Job aborted due to stage failure: 
> Exception while getting task result: java.io.IOException: Connection from 
> server-12.net/10.10.10.10:54701 closed



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length

2015-01-23 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289677#comment-14289677
 ] 

Alexander Ulanov commented on SPARK-5386:
-

My spark-env.sh contains:
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=8g
export SPARK_WORKER_INSTANCES=2
I run spark-shell with ./spark-shell --executor-memory 8G --driver-memory 8G. 
In the Spark UI, each worker has 8GB of memory. 

Btw, I ran this code once again and this time it does not crash but keeps 
trying to schedule the job on the failing node, which tries to allocate memory, 
fails, and so on. Is that normal behavior?

> Reduce fails with vectors of big length
> ---
>
> Key: SPARK-5386
> URL: https://issues.apache.org/jira/browse/SPARK-5386
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
> Environment: Overall:
> 6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each runs 2 Workers
> Spark:
> ./spark-shell --executor-memory 8G --driver-memory 8G
> spark.driver.maxResultSize 0
> "java.io.tmpdir" and "spark.local.dir" set to a disk with a lot of free space
>Reporter: Alexander Ulanov
> Fix For: 1.3.0
>
>
> Code:
> import org.apache.spark.mllib.rdd.RDDFunctions._
> import breeze.linalg._
> import org.apache.log4j._
> Logger.getRootLogger.setLevel(Level.OFF)
> val n = 6000
> val p = 12
> val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double]( n ))
> vv.reduce(_ + _)
> When executing in shell it crashes after some period of time. One of the node 
> contain the following in stdout:
> Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
> os::commit_memory(0x00075550, 2863661056, 0) failed; error='Cannot 
> allocate memory' (errno=12)
> #
> # There is insufficient memory for the Java Runtime Environment to continue.
> # Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
> committing reserved memory.
> # An error report file with more information is saved as:
> # /datac/spark/app-20150123091936-/89/hs_err_pid2247.log
> During the execution there is a message: Job aborted due to stage failure: 
> Exception while getting task result: java.io.IOException: Connection from 
> server-12.net/10.10.10.10:54701 closed



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length

2015-01-23 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289708#comment-14289708
 ] 

Alexander Ulanov commented on SPARK-5386:
-

Thank you for the suggestions.
1. count() does work; it returns 12.
2. It failed with p = 2. However, in some of my previous experiments it did not 
fail even for p up to 5 or 7 (in different runs).

> Reduce fails with vectors of big length
> ---
>
> Key: SPARK-5386
> URL: https://issues.apache.org/jira/browse/SPARK-5386
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
> Environment: Overall:
> 6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each runs 2 Workers
> Spark:
> ./spark-shell --executor-memory 8G --driver-memory 8G
> spark.driver.maxResultSize 0
> "java.io.tmpdir" and "spark.local.dir" set to a disk with a lot of free space
>Reporter: Alexander Ulanov
> Fix For: 1.3.0
>
>
> Code:
> import org.apache.spark.mllib.rdd.RDDFunctions._
> import breeze.linalg._
> import org.apache.log4j._
> Logger.getRootLogger.setLevel(Level.OFF)
> val n = 6000
> val p = 12
> val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double]( n ))
> vv.reduce(_ + _)
> When executing in shell it crashes after some period of time. One of the node 
> contain the following in stdout:
> Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
> os::commit_memory(0x00075550, 2863661056, 0) failed; error='Cannot 
> allocate memory' (errno=12)
> #
> # There is insufficient memory for the Java Runtime Environment to continue.
> # Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
> committing reserved memory.
> # An error report file with more information is saved as:
> # /datac/spark/app-20150123091936-/89/hs_err_pid2247.log
> During the execution there is a message: Job aborted due to stage failure: 
> Exception while getting task result: java.io.IOException: Connection from 
> server-12.net/10.10.10.10:54701 closed



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5386) Reduce fails with vectors of big length

2015-01-23 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-5386:

Description: 
Code:

import org.apache.spark.mllib.rdd.RDDFunctions._
import breeze.linalg._
import org.apache.log4j._
Logger.getRootLogger.setLevel(Level.OFF)
val n = 6000
val p = 12
val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double]( n ))
vv.count()
vv.reduce(_ + _)

When executed in the shell, it crashes after some period of time. One of the 
nodes contains the following in stdout:
Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
os::commit_memory(0x00075550, 2863661056, 0) failed; error='Cannot 
allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
committing reserved memory.
# An error report file with more information is saved as:
# /datac/spark/app-20150123091936-/89/hs_err_pid2247.log

During the execution there is a message: Job aborted due to stage failure: 
Exception while getting task result: java.io.IOException: Connection from 
server-12.net/10.10.10.10:54701 closed


  was:
Code:

import org.apache.spark.mllib.rdd.RDDFunctions._
import breeze.linalg._
import org.apache.log4j._
Logger.getRootLogger.setLevel(Level.OFF)
val n = 6000
val p = 12
val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double]( n ))
vv.reduce(_ + _)

When executing in shell it crashes after some period of time. One of the node 
contain the following in stdout:
Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
os::commit_memory(0x00075550, 2863661056, 0) failed; error='Cannot 
allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
committing reserved memory.
# An error report file with more information is saved as:
# /datac/spark/app-20150123091936-/89/hs_err_pid2247.log

During the execution there is a message: Job aborted due to stage failure: 
Exception while getting task result: java.io.IOException: Connection from 
server-12.net/10.10.10.10:54701 closed



> Reduce fails with vectors of big length
> ---
>
> Key: SPARK-5386
> URL: https://issues.apache.org/jira/browse/SPARK-5386
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
> Environment: Overall:
> 6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each runs 2 Workers
> Spark:
> ./spark-shell --executor-memory 8G --driver-memory 8G
> spark.driver.maxResultSize 0
> "java.io.tmpdir" and "spark.local.dir" set to a disk with a lot of free space
>Reporter: Alexander Ulanov
> Fix For: 1.3.0
>
>
> Code:
> import org.apache.spark.mllib.rdd.RDDFunctions._
> import breeze.linalg._
> import org.apache.log4j._
> Logger.getRootLogger.setLevel(Level.OFF)
> val n = 6000
> val p = 12
> val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double]( n ))
> vv.count()
> vv.reduce(_ + _)
> When executing in shell it crashes after some period of time. One of the node 
> contain the following in stdout:
> Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
> os::commit_memory(0x00075550, 2863661056, 0) failed; error='Cannot 
> allocate memory' (errno=12)
> #
> # There is insufficient memory for the Java Runtime Environment to continue.
> # Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
> committing reserved memory.
> # An error report file with more information is saved as:
> # /datac/spark/app-20150123091936-/89/hs_err_pid2247.log
> During the execution there is a message: Job aborted due to stage failure: 
> Exception while getting task result: java.io.IOException: Connection from 
> server-12.net/10.10.10.10:54701 closed



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5386) Reduce fails with vectors of big length

2015-01-23 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289708#comment-14289708
 ] 

Alexander Ulanov edited comment on SPARK-5386 at 1/23/15 6:52 PM:
--

Thank you for the suggestions.
1. count() does work; it returns 12.
2. The full script failed with p = 2. However, in some of my previous 
experiments it did not fail even for p up to 5 or 7 (in different runs).


was (Author: avulanov):
Thank you for suggestions.
1. count() does work, it returns 12
2. It failed with p = 2. However, in some of my previous experiments it did not 
fail even for p up to 5 or 7 (in different runs)

> Reduce fails with vectors of big length
> ---
>
> Key: SPARK-5386
> URL: https://issues.apache.org/jira/browse/SPARK-5386
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
> Environment: Overall:
> 6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each runs 2 Workers
> Spark:
> ./spark-shell --executor-memory 8G --driver-memory 8G
> spark.driver.maxResultSize 0
> "java.io.tmpdir" and "spark.local.dir" set to a disk with a lot of free space
>Reporter: Alexander Ulanov
> Fix For: 1.3.0
>
>
> Code:
> import org.apache.spark.mllib.rdd.RDDFunctions._
> import breeze.linalg._
> import org.apache.log4j._
> Logger.getRootLogger.setLevel(Level.OFF)
> val n = 6000
> val p = 12
> val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double]( n ))
> vv.reduce(_ + _)
> When executing in shell it crashes after some period of time. One of the node 
> contain the following in stdout:
> Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
> os::commit_memory(0x00075550, 2863661056, 0) failed; error='Cannot 
> allocate memory' (errno=12)
> #
> # There is insufficient memory for the Java Runtime Environment to continue.
> # Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
> committing reserved memory.
> # An error report file with more information is saved as:
> # /datac/spark/app-20150123091936-/89/hs_err_pid2247.log
> During the execution there is a message: Job aborted due to stage failure: 
> Exception while getting task result: java.io.IOException: Connection from 
> server-12.net/10.10.10.10:54701 closed



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length

2015-01-23 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289880#comment-14289880
 ] 

Alexander Ulanov commented on SPARK-5386:
-

Thank you, that might be the problem. I tried running GC before each operation, 
but it did not help. Probably it takes a lot of memory to initialize the Breeze 
DenseVector. Assuming that the problem is due to insufficient memory on the 
Worker node, I am curious what will happen on the Driver. Will it receive 12 
vectors of 60M Doubles each and then do the aggregation? Is that feasible? 
(P.S. I know that there is a treeReduce function that forces partial 
aggregation on the Workers, as sketched below. However, for a big number of 
Workers the problem will remain in treeReduce as well, as far as I understand.) 
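
For reference, a treeReduce variant of the same snippet (depth = 2 is just an 
illustration; treeReduce comes from the same RDDFunctions import used in the 
original script):

import org.apache.spark.mllib.rdd.RDDFunctions._
import breeze.linalg._

val n = 60000000  // a 60M-element vector, per the discussion above
val p = 12
val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double](n))
// Partial sums are combined on the executors first, so fewer large vectors
// have to reach the driver at the same time.
vv.treeReduce(_ + _, 2)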

> Reduce fails with vectors of big length
> ---
>
> Key: SPARK-5386
> URL: https://issues.apache.org/jira/browse/SPARK-5386
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
> Environment: Overall:
> 6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each runs 2 Workers
> Spark:
> ./spark-shell --executor-memory 8G --driver-memory 8G
> spark.driver.maxResultSize 0
> "java.io.tmpdir" and "spark.local.dir" set to a disk with a lot of free space
>Reporter: Alexander Ulanov
> Fix For: 1.3.0
>
>
> Code:
> import org.apache.spark.mllib.rdd.RDDFunctions._
> import breeze.linalg._
> import org.apache.log4j._
> Logger.getRootLogger.setLevel(Level.OFF)
> val n = 6000
> val p = 12
> val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double]( n ))
> vv.count()
> vv.reduce(_ + _)
> When executing in shell it crashes after some period of time. One of the node 
> contain the following in stdout:
> Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
> os::commit_memory(0x00075550, 2863661056, 0) failed; error='Cannot 
> allocate memory' (errno=12)
> #
> # There is insufficient memory for the Java Runtime Environment to continue.
> # Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
> committing reserved memory.
> # An error report file with more information is saved as:
> # /datac/spark/app-20150123091936-/89/hs_err_pid2247.log
> During the execution there is a message: Job aborted due to stage failure: 
> Exception while getting task result: java.io.IOException: Connection from 
> server-12.net/10.10.10.10:54701 closed



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5575) Artificial neural networks for MLlib deep learning

2015-02-03 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-5575:
---

 Summary: Artificial neural networks for MLlib deep learning
 Key: SPARK-5575
 URL: https://issues.apache.org/jira/browse/SPARK-5575
 Project: Spark
  Issue Type: Umbrella
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Alexander Ulanov


Goal: Implement various types of artificial neural networks

Motivation: the deep learning trend

Requirements: 
1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward and 
Backpropagation, etc. should be implemented as traits or interfaces so that 
they can be easily extended or reused (see the sketch after this list)
2) Implement complex abstractions, such as feed-forward and recurrent networks
3) Implement the multilayer perceptron (MLP), convolutional networks (LeNet), 
autoencoders (sparse and denoising), stacked autoencoders, restricted Boltzmann 
machines (RBM), deep belief networks (DBN), etc.
4) Implement or reuse supporting constructs, such as classifiers, normalizers, 
poolers, etc.
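
A minimal sketch of what the requirement 1 abstractions might look like as 
Scala traits (names and signatures are illustrative only, not a committed 
design):

import org.apache.spark.mllib.linalg.Vector

// Illustrative trait shapes; not an actual MLlib API.
trait Layer {
  def forward(input: Vector): Vector                  // forward pass
  def backward(input: Vector, delta: Vector): Vector  // backpropagated error
}

trait ErrorFunction {
  def error(output: Vector, target: Vector): Double
}

trait Regularization {
  def penalty(weights: Vector): Double
}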



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5912) Programming guide for feature selection

2015-02-19 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328246#comment-14328246
 ] 

Alexander Ulanov commented on SPARK-5912:
-

Sure, I can. Could you point me to some template or a good example of a 
programming guide?

> Programming guide for feature selection
> ---
>
> Key: SPARK-5912
> URL: https://issues.apache.org/jira/browse/SPARK-5912
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> The new ChiSqSelector for feature selection should have a section in the 
> Programming Guide.  It should probably be under the feature extraction and 
> transformation section as a new subsection for feature selection.
> If we get more feature selection methods later on, we could expand it to a 
> larger section of the guide.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5912) Programming guide for feature selection

2015-02-20 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329685#comment-14329685
 ] 

Alexander Ulanov commented on SPARK-5912:
-

I've almost finished writing the ChiSquared section in the corresponding file. 
I was able to generate the API docs with `build/sbt doc`; however, I don't see 
that the "mllib-*-*" pages are generated too. Could you suggest how I should 
generate them?

> Programming guide for feature selection
> ---
>
> Key: SPARK-5912
> URL: https://issues.apache.org/jira/browse/SPARK-5912
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> The new ChiSqSelector for feature selection should have a section in the 
> Programming Guide.  It should probably be under the feature extraction and 
> transformation section as a new subsection for feature selection.
> If we get more feature selection methods later on, we could expand it to a 
> larger section of the guide.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15851) Spark 2.0 does not compile in Windows 7

2016-06-09 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-15851:


 Summary: Spark 2.0 does not compile in Windows 7
 Key: SPARK-15851
 URL: https://issues.apache.org/jira/browse/SPARK-15851
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.0.0
 Environment: Windows 7
Reporter: Alexander Ulanov


Spark does not compile on Windows 7.
"mvn compile" fails on spark-core because it tries to execute the bash script 
spark-build-info.

Workaround:
1) Install win-bash and put it on the PATH
2) Change line 350 of core/pom.xml to invoke the script through bash (sketched 
below)
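
A plausible form of the change, assuming the stock maven-antrun-plugin <exec> 
task is redirected through bash (a reconstruction for illustration; the exact 
element may have differed):

<!-- Run spark-build-info through bash instead of executing it directly, -->
<!-- so that Windows does not try to launch it as a Win32 binary. -->
<exec executable="bash">
  <arg value="${project.basedir}/../build/spark-build-info"/>
  <arg value="${project.build.directory}/extra-resources"/>
  <arg value="${project.version}"/>
</exec>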

Error trace:
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project 
spark-core_2.11: An Ant BuildException has occured: Execute failed: 
java.io.IOException: Cannot run program 
"C:\dev\spark\core\..\build\spark-build-info" (in directory 
"C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 
application
[ERROR] around Ant part ...<exec executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in 
C:\dev\spark\core\target\antrun\build-main.xml




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15851) Spark 2.0 does not compile in Windows 7

2016-06-09 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-15851:
-
Target Version/s: 2.0.0
   Fix Version/s: (was: 2.0.0)

> Spark 2.0 does not compile in Windows 7
> ---
>
> Key: SPARK-15851
> URL: https://issues.apache.org/jira/browse/SPARK-15851
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
> Environment: Windows 7
>Reporter: Alexander Ulanov
>
> Spark does not compile on Windows 7.
> "mvn compile" fails on spark-core because it tries to execute the bash script 
> spark-build-info.
> Workaround:
> 1) Install win-bash and put it on the PATH
> 2) Change line 350 of core/pom.xml to invoke the script through bash
> Error trace:
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project 
> spark-core_2.11: An Ant BuildException has occured: Execute failed: 
> java.io.IOException: Cannot run program 
> "C:\dev\spark\core\..\build\spark-build-info" (in directory 
> "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 
> application
> [ERROR] around Ant part ...<exec executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in 
> C:\dev\spark\core\target\antrun\build-main.xml



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15851) Spark 2.0 does not compile in Windows 7

2016-06-09 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-15851:
-
Fix Version/s: 2.0.0

> Spark 2.0 does not compile in Windows 7
> ---
>
> Key: SPARK-15851
> URL: https://issues.apache.org/jira/browse/SPARK-15851
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
> Environment: Windows 7
>Reporter: Alexander Ulanov
>
> Spark does not compile on Windows 7.
> "mvn compile" fails on spark-core because it tries to execute the bash script 
> spark-build-info.
> Workaround:
> 1) Install win-bash and put it on the PATH
> 2) Change line 350 of core/pom.xml to invoke the script through bash
> Error trace:
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project 
> spark-core_2.11: An Ant BuildException has occured: Execute failed: 
> java.io.IOException: Cannot run program 
> "C:\dev\spark\core\..\build\spark-build-info" (in directory 
> "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 
> application
> [ERROR] around Ant part ...<exec executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in 
> C:\dev\spark\core\target\antrun\build-main.xml



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15851) Spark 2.0 does not compile in Windows 7

2016-06-09 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323624#comment-15323624
 ] 

Alexander Ulanov commented on SPARK-15851:
--

This does not work because Ant uses a Java Process to run the executable, which 
fails with "not a valid Win32 application". In order to run it, one needs to 
run "bash" and pass the bash script as a parameter, which is the approach I 
proposed as a workaround. For more details please refer to: 
http://stackoverflow.com/questions/20883212/how-can-i-use-ant-exec-to-execute-commands-on-linux

> Spark 2.0 does not compile in Windows 7
> ---
>
> Key: SPARK-15851
> URL: https://issues.apache.org/jira/browse/SPARK-15851
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
> Environment: Windows 7
>Reporter: Alexander Ulanov
>
> Spark does not compile on Windows 7.
> "mvn compile" fails on spark-core because it tries to execute the bash script 
> spark-build-info.
> Workaround:
> 1) Install win-bash and put it on the PATH
> 2) Change line 350 of core/pom.xml to invoke the script through bash
> Error trace:
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project 
> spark-core_2.11: An Ant BuildException has occured: Execute failed: 
> java.io.IOException: Cannot run program 
> "C:\dev\spark\core\..\build\spark-build-info" (in directory 
> "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 
> application
> [ERROR] around Ant part ...<exec executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in 
> C:\dev\spark\core\target\antrun\build-main.xml



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15851) Spark 2.0 does not compile in Windows 7

2016-06-09 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323638#comment-15323638
 ] 

Alexander Ulanov commented on SPARK-15851:
--

I can do that. However, it seems that "spark-build-info" can be rewritten as a 
plain shell script. That would remove the need to install bash for Windows 
users who compile Spark with Maven. What do you think?
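
To make the suggestion concrete, a hypothetical POSIX-sh sketch of such a 
script; the argument order, property names and output file here are 
assumptions for illustration, not the actual build script:

#!/bin/sh
# Hypothetical sketch: write build metadata for Spark version $2 into the
# resource directory $1, using POSIX sh constructs only (no bash needed).
RESOURCE_DIR="$1"
SPARK_VERSION="$2"
mkdir -p "$RESOURCE_DIR"
{
  echo "version=$SPARK_VERSION"
  echo "revision=$(git rev-parse HEAD 2>/dev/null)"
  echo "branch=$(git rev-parse --abbrev-ref HEAD 2>/dev/null)"
  echo "date=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
} > "$RESOURCE_DIR/spark-version-info.properties"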

> Spark 2.0 does not compile in Windows 7
> ---
>
> Key: SPARK-15851
> URL: https://issues.apache.org/jira/browse/SPARK-15851
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
> Environment: Windows 7
>Reporter: Alexander Ulanov
>
> Spark does not compile on Windows 7.
> "mvn compile" fails on spark-core because it tries to execute the bash script 
> spark-build-info.
> Workaround:
> 1) Install win-bash and put it on the PATH
> 2) Change line 350 of core/pom.xml to invoke the script through bash
> Error trace:
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project 
> spark-core_2.11: An Ant BuildException has occured: Execute failed: 
> java.io.IOException: Cannot run program 
> "C:\dev\spark\core\..\build\spark-build-info" (in directory 
> "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 
> application
> [ERROR] around Ant part ...<exec executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in 
> C:\dev\spark\core\target\antrun\build-main.xml



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15851) Spark 2.0 does not compile in Windows 7

2016-06-09 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323704#comment-15323704
 ] 

Alexander Ulanov commented on SPARK-15851:
--

Sorry for the confusion, I meant the shell that is "/bin/sh". A Windows version 
of it comes with Git.

> Spark 2.0 does not compile in Windows 7
> ---
>
> Key: SPARK-15851
> URL: https://issues.apache.org/jira/browse/SPARK-15851
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
> Environment: Windows 7
>Reporter: Alexander Ulanov
>
> Spark does not compile on Windows 7.
> "mvn compile" fails on spark-core because it tries to execute the bash script 
> spark-build-info.
> Workaround:
> 1) Install win-bash and put it on the PATH
> 2) Change line 350 of core/pom.xml to invoke the script through bash
> Error trace:
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project 
> spark-core_2.11: An Ant BuildException has occured: Execute failed: 
> java.io.IOException: Cannot run program 
> "C:\dev\spark\core\..\build\spark-build-info" (in directory 
> "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 
> application
> [ERROR] around Ant part ...<exec executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in 
> C:\dev\spark\core\target\antrun\build-main.xml



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap

2016-06-10 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325377#comment-15325377
 ] 

Alexander Ulanov commented on SPARK-15581:
--

I would like to comment on the Breeze and deep learning parts, because I have 
been implementing the multilayer perceptron for Spark and have used Breeze a 
lot.

Breeze provides convenient abstractions for dense and sparse vectors and 
matrices and allows performing linear algebra backed by netlib-java and native 
BLAS. At the same time, Spark "linalg" has its own abstractions for that. This 
might be confusing to users and developers; obviously, Spark should have a 
single library for linear algebra. Having said that, Breeze is more convenient 
and flexible than linalg, though it misses some features such as in-place 
matrix multiplication and multidimensional arrays. Breeze cannot be removed 
from Spark because "linalg" does not have enough functionality to fully replace 
it. To address this, I have implemented a Scala tensor library on top of 
netlib-java, which "linalg" can be wrapped around. It also provides functions 
similar to Breeze and allows working with multi-dimensional arrays. [~mengxr], 
[~dbtsai] and I were planning to discuss this after the 2.0 release, and I am 
posting these considerations here since you raised this question too. Could 
you take a look at this library and tell me what you think? The source code is 
here: https://github.com/avulanov/scala-tensor
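
As a small illustration of the duplication, assuming breeze and spark-mllib on 
the classpath; the conversion helpers are written out by hand here, since 
Spark's own equivalents are private to the project:

import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.ml.linalg.{Vector, Vectors}

// The same vector has to live in two worlds: Breeze for the math,
// Spark linalg for the public ML API.
def toBreeze(v: Vector): BDV[Double] = new BDV(v.toArray)
def fromBreeze(bv: BDV[Double]): Vector = Vectors.dense(bv.toArray)

val v = Vectors.dense(1.0, 2.0, 3.0)
val doubled = fromBreeze(toBreeze(v) * 2.0)  // the algebra itself runs in Breeze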

With regards to deep learning, I believe that having deep learning within 
Spark's ML library is a question of convenience. Spark has broad analytic 
capabilities, and it is useful to have deep learning as one of these tools at 
hand. Deep learning is a model of choice for several important modern 
use-cases, and Spark ML might want to cover them. After all, it is hard to 
explain why we have PCA in ML but do not provide an autoencoder. To summarize 
this, I think that Spark should have at least the most widely used deep 
learning models, such as the fully connected artificial neural network, the 
convolutional network and the autoencoder. Advanced and experimental deep 
learning features might reside within packages or as pluggable external tools. 
Spark ML already has fully connected networks in place. A stacked autoencoder 
is implemented but not merged yet. The only remaining piece is the 
convolutional network. These 3 will provide a comprehensive deep learning set 
for Spark ML.
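
For reference, a minimal usage sketch of the fully connected network that is 
already in Spark ML, MultilayerPerceptronClassifier; the toy data is made up 
for illustration:

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("mlp-sketch").getOrCreate()
import spark.implicits._

// Toy data: 4 features, 3 classes.
val train = Seq(
  (0.0, Vectors.dense(0.0, 0.0, 0.0, 0.0)),
  (1.0, Vectors.dense(0.0, 1.0, 1.0, 0.0)),
  (2.0, Vectors.dense(1.0, 1.0, 0.0, 1.0))
).toDF("label", "features")

// Layer sizes: 4 inputs, two hidden layers of 5 and 4 units, 3 output classes.
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 5, 4, 3))
  .setBlockSize(128)
  .setMaxIter(100)

val model = mlp.fit(train)
model.transform(train).select("features", "prediction").show()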

> MLlib 2.1 Roadmap
> -
>
> Key: SPARK-15581
> URL: https://issues.apache.org/jira/browse/SPARK-15581
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>
> This is a master list for MLlib improvements we are working on for the next 
> release. Please view this as a wish list rather than a definite plan, for we 
> don't have an accurate estimate of available resources. Due to limited review 
> bandwidth, features appearing on this list will get higher priority during 
> code review. But feel free to suggest new items to the list in comments. We 
> are experimenting with this process. Your feedback would be greatly 
> appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a medium/big feature. Based on our experience, mixing the development 
> process with a big feature usually causes long delay in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start working on some features. This is to avoid duplicate work. For 
> small features, you don't need to wait to get JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned 
> first before coding and keep the ETA updated on the JIRA. If there is no 
> activity on the JIRA page for a certain amount of time, the JIRA should be 
> released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Remember to add the `@Since("VERSION")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps to improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link 
> them properly.
> * Add a "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on

[jira] [Created] (SPARK-15893) spark.createDataFrame raises an exception in Spark 2.0 tests on Windows

2016-06-10 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-15893:


 Summary: spark.createDataFrame raises an exception in Spark 2.0 
tests on Windows
 Key: SPARK-15893
 URL: https://issues.apache.org/jira/browse/SPARK-15893
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.0.0
Reporter: Alexander Ulanov


spark.createDataFrame raises an exception in Spark 2.0 tests on Windows

For example, LogisticRegressionSuite fails at Line 46:
Exception encountered when invoking run on a nested suite - 
java.net.URISyntaxException: Relative path in absolute URI: 
file:C:/dev/spark/external/flume-assembly/spark-warehouse
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path 
in absolute URI: file:C:/dev/spark/external/flume-assembly/spark-warehouse
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:172)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:109)


Another example, DataFrameSuite raises:
java.net.URISyntaxException: Relative path in absolute URI: 
file:C:/dev/spark/external/flume-assembly/spark-warehouse
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path 
in absolute URI: file:C:/dev/spark/external/flume-assembly/spark-warehouse
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:172)





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15893) spark.createDataFrame raises an exception in Spark 2.0 tests on Windows

2016-06-14 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15330065#comment-15330065
 ] 

Alexander Ulanov commented on SPARK-15893:
--

Actually, the code that I am trying to run does not have explicit paths in it. 
It is the Spark unit tests, which ran properly on 1.6 (and earlier versions) on 
Windows. It seems that a recent change in 2.0 broke that. Could you propose a 
way to debug this?

> spark.createDataFrame raises an exception in Spark 2.0 tests on Windows
> ---
>
> Key: SPARK-15893
> URL: https://issues.apache.org/jira/browse/SPARK-15893
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.0.0
>Reporter: Alexander Ulanov
>
> spark.createDataFrame raises an exception in Spark 2.0 tests on Windows
> For example, LogisticRegressionSuite fails at Line 46:
> Exception encountered when invoking run on a nested suite - 
> java.net.URISyntaxException: Relative path in absolute URI: 
> file:C:/dev/spark/external/flume-assembly/spark-warehouse
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: 
> file:C:/dev/spark/external/flume-assembly/spark-warehouse
>   at org.apache.hadoop.fs.Path.initialize(Path.java:206)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:172)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:109)
> Another example, DataFrameSuite raises:
> java.net.URISyntaxException: Relative path in absolute URI: 
> file:C:/dev/spark/external/flume-assembly/spark-warehouse
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: 
> file:C:/dev/spark/external/flume-assembly/spark-warehouse
>   at org.apache.hadoop.fs.Path.initialize(Path.java:206)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:172)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15581) MLlib 2.1 Roadmap

2016-06-16 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325377#comment-15325377
 ] 

Alexander Ulanov edited comment on SPARK-15581 at 6/17/16 1:18 AM:
---

I would like to comment on the Breeze and deep learning parts, because I have 
been implementing the multilayer perceptron for Spark and have used Breeze a 
lot.

Breeze provides convenient abstractions for dense and sparse vectors and 
matrices and allows performing linear algebra backed by netlib-java and native 
BLAS. At the same time, Spark "linalg" has its own abstractions for that. This 
might be confusing to users and developers; obviously, Spark should have a 
single library for linear algebra. Having said that, Breeze is more convenient 
and flexible than linalg, though it misses some features such as in-place 
matrix multiplication and multidimensional arrays. Breeze cannot be removed 
from Spark because "linalg" does not have enough functionality to fully replace 
it. To address this, I have implemented a Scala tensor library on top of 
netlib-java, which "linalg" can be wrapped around. It also provides functions 
similar to Breeze and allows working with multi-dimensional arrays. [~mengxr], 
[~dbtsai] and I were planning to discuss this after the 2.0 release, and I am 
posting these considerations here since you raised this question too. Could 
you take a look at this library and tell me what you think? The source code is 
here: https://github.com/avulanov/scala-tensor

With regards to deep learning, I believe that having deep learning within 
Spark's ML library is a question of convenience. Spark has broad analytic 
capabilities, and it is useful to have deep learning as one of these tools at 
hand. Deep learning is a model of choice for several important modern 
use-cases, and Spark ML might want to cover them. After all, it is hard to 
explain why we have PCA in ML but do not provide an autoencoder. To summarize 
this, I think that Spark should have at least the most widely used deep 
learning models, such as the fully connected artificial neural network, the 
convolutional network and the autoencoder. Advanced and experimental deep 
learning features might reside within packages or as pluggable external tools. 
Spark ML already has fully connected networks in place. A stacked autoencoder 
is implemented but not merged yet. The only remaining piece is the 
convolutional network. These 3 will provide a comprehensive deep learning set 
for Spark ML. We might also include recurrent networks as well.

The additional benefit of implementing deep learning for Spark is that we 
define the Spark ML API for deep learning. This interface is similar to the 
other analytics tools in Spark and supports ML pipelines, which makes deep 
learning easy to use and plug into analytics workloads for Spark users. One 
can wrap other deep learning implementations with this interface, allowing 
users to pick a particular back-end, e.g. Caffe or TensorFlow, along with the 
default one. The interface has to provide a few architectures for deep 
learning that are widely used in practice, such as AlexNet.

The ultimate goal is to provide efficient distributed training, which relies 
heavily on efficient communication and scheduling mechanisms. The default 
implementation is based on Spark. More efficient implementations might include 
some external libraries but would use the same defined interface.


was (Author: avulanov):
I would like to comment on Breeze and deep learning parts, because I have been 
implementing multilayer perceptron for Spark and have used Breeze a lot.

Breeze provides convenient abstraction for dense and sparse vectors and 
matrices and allows performing linear algebra backed by netlib-java and native 
BLAS. At the same time Spark "linalg" has its own abstractions for that. This 
might be confusing to users and developers. Obviously, Spark should have a 
single library for linear algebra. Having said that, Breeze is more convenient 
and flexible than linalg, though it misses some features such as in-place 
matrix multiplications and multidimensional arrays. Breeze cannot be removed 
from Spark because "linalg" does not have enough functionality to fully replace 
it. To address this, I have implemented a Scala tensor library on top of 
netlib-java. "linalg" can be wrapped around it. It also provides functions 
similar to Breeze and allows working with multi-dimensional arrays. [~mengxr], 
[~dbtsai] and myself were planning to discuss this after the 2.0 release, and I 
am posting these considerations here since you raised this question too. Could 
you take a look at this library and tell me what you think? The source code is 
here https://github.com/avulanov/scala-tensor

With regards to deep learning, I believe that having deep learning within 
Spark's ML library is a question of convenience. Spark has broad analyt

[jira] [Comment Edited] (SPARK-15581) MLlib 2.1 Roadmap

2016-06-16 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325377#comment-15325377
 ] 

Alexander Ulanov edited comment on SPARK-15581 at 6/17/16 1:18 AM:
---

I would like to comment on the Breeze and deep learning parts, because I have 
been implementing the multilayer perceptron for Spark and have used Breeze a 
lot.

Breeze provides convenient abstractions for dense and sparse vectors and 
matrices and allows performing linear algebra backed by netlib-java and native 
BLAS. At the same time, Spark "linalg" has its own abstractions for that. This 
might be confusing to users and developers; obviously, Spark should have a 
single library for linear algebra. Having said that, Breeze is more convenient 
and flexible than linalg, though it misses some features such as in-place 
matrix multiplication and multidimensional arrays. Breeze cannot be removed 
from Spark because "linalg" does not have enough functionality to fully replace 
it. To address this, I have implemented a Scala tensor library on top of 
netlib-java, which "linalg" can be wrapped around. It also provides functions 
similar to Breeze and allows working with multi-dimensional arrays. [~mengxr], 
[~dbtsai] and I were planning to discuss this after the 2.0 release, and I am 
posting these considerations here since you raised this question too. Could 
you take a look at this library and tell me what you think? The source code is 
here: https://github.com/avulanov/scala-tensor

With regards to deep learning, I believe that having deep learning within 
Spark's ML library is a question of convenience. Spark has broad analytic 
capabilities, and it is useful to have deep learning as one of these tools at 
hand. Deep learning is a model of choice for several important modern 
use-cases, and Spark ML might want to cover them. After all, it is hard to 
explain why we have PCA in ML but do not provide an autoencoder. To summarize 
this, I think that Spark should have at least the most widely used deep 
learning models, such as the fully connected artificial neural network, the 
convolutional network and the autoencoder. Advanced and experimental deep 
learning features might reside within packages or as pluggable external tools. 
Spark ML already has fully connected networks in place. A stacked autoencoder 
is implemented but not merged yet. The only remaining piece is the 
convolutional network. These 3 will provide a comprehensive deep learning set 
for Spark ML. We might also include recurrent networks as well.

Update (6/16) based on our conversation with Ben Lorica:

The additional benefit of implementing deep learning for Spark is that we 
define the Spark ML API for deep learning. This interface is similar to the 
other analytics tools in Spark and supports ML pipelines, which makes deep 
learning easy to use and plug into analytics workloads for Spark users. One 
can wrap other deep learning implementations with this interface, allowing 
users to pick a particular back-end, e.g. Caffe or TensorFlow, along with the 
default one. The interface has to provide a few architectures for deep 
learning that are widely used in practice, such as AlexNet.

The ultimate goal is to provide efficient distributed training, which relies 
heavily on efficient communication and scheduling mechanisms. The default 
implementation is based on Spark. More efficient implementations might include 
some external libraries but would use the same defined interface.


was (Author: avulanov):
I would like to comment on Breeze and deep learning parts, because I have been 
implementing multilayer perceptron for Spark and have used Breeze a lot.

Breeze provides convenient abstraction for dense and sparse vectors and 
matrices and allows performing linear algebra backed by netlib-java and native 
BLAS. At the same time Spark "linalg" has its own abstractions for that. This 
might be confusing to users and developers. Obviously, Spark should have a 
single library for linear algebra. Having said that, Breeze is more convenient 
and flexible than linalg, though it misses some features such as in-place 
matrix multiplications and multidimensional arrays. Breeze cannot be removed 
from Spark because "linalg" does not have enough functionality to fully replace 
it. To address this, I have implemented a Scala tensor library on top of 
netlib-java. "linalg" can be wrapped around it. It also provides functions 
similar to Breeze and allows working with multi-dimensional arrays. [~mengxr], 
[~dbtsai] and myself were planning to discuss this after the 2.0 release, and I 
am posting these considerations here since you raised this question too. Could 
you take a look at this library and tell me what you think? The source code is 
here https://github.com/avulanov/scala-tensor

With regards to deep learning, I believe that having deep learning within 
Spark's ML li

[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap

2016-06-22 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15345264#comment-15345264
 ] 

Alexander Ulanov commented on SPARK-15581:
--

The current implementation of the multilayer perceptron in Spark is less than 
2x slower than Caffe, both measured on CPU. The main overhead sources are the 
JVM and Spark's communication layer. For more details, please refer to 
https://github.com/avulanov/ann-benchmark. Having said that, I expect that an 
efficient implementation of deep learning in Spark will be only a few times 
slower than in a specialized tool. This is very reasonable for a platform that 
does much more than deep learning, and I believe it is understood by the 
community.

The main motivation for using specialized libraries for deep learning would be 
to take full advantage of the hardware where Spark runs, in particular GPUs. 
Having the default interface in Spark, we will need to wrap only a subset of 
functions from a given specialized library. That does require an effort; 
however, it is not the same as wrapping all functions. Wrappers can be provided 
as packages without the need to pull new dependencies into Spark.

> MLlib 2.1 Roadmap
> -
>
> Key: SPARK-15581
> URL: https://issues.apache.org/jira/browse/SPARK-15581
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>
> This is a master list for MLlib improvements we are working on for the next 
> release. Please view this as a wish list rather than a definite plan, for we 
> don't have an accurate estimate of available resources. Due to limited review 
> bandwidth, features appearing on this list will get higher priority during 
> code review. But feel free to suggest new items to the list in comments. We 
> are experimenting with this process. Your feedback would be greatly 
> appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a medium/big feature. Based on our experience, mixing the development 
> process with a big feature usually causes long delay in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start working on some features. This is to avoid duplicate work. For 
> small features, you don't need to wait to get JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned 
> first before coding and keep the ETA updated on the JIRA. If there is no 
> activity on the JIRA page for a certain amount of time, the JIRA should be 
> released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Remember to add the `@Since("VERSION")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps to improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link 
> them properly.
> * Add a "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on 
> JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
> please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and 
> documentation if applicable.
> h1. Roadmap (*WIP*)
> This is NOT [a complete list of MLlib JIRAs for 2.1| 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority].
>  We only include umbrella JIRAs and high-level tasks.
> Major efforts in this release:
> * Feature parity for the DataFrames-based API (`spark.ml`), relative to the 
> RDD-based API
> * ML persistence
> * Python API feature parity and test coverage
> * R API expansion and improvements
> * Note about new features: As usual, we expect to expand the feature set of 
> MLlib.  However, we will prioritize API parity, bug fixes, and improvements 
> over new features.
> Note `spark.mllib` is in maintenance mode now.  We will accept bug fixes for 
> it, but new features, APIs, and improvement

[jira] [Commented] (SPARK-15899) file scheme should be used correctly

2016-06-22 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15345611#comment-15345611
 ] 

Alexander Ulanov commented on SPARK-15899:
--

`user.dir` on Windows starts with a drive letter:
scala> System.getProperty("user.dir")
res0: String = C:\Program Files (x86)\scala\bin

On Linux it starts with a slash:
scala> System.getProperty("user.dir")
res0: String = /home/hduser

It seems that java.io.File could convert it to a proper URI:
Windows:
scala> new File("c:/myfile").toURI
res6: java.net.URI = file:/c:/myfile
Linux:
scala> new File("/home/myfile").toURI
res3: java.net.URI = file:/home/myfile

We can remove "file:" from 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58
 and add toURI conversion in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L694
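
A minimal sketch of that direction; the helper name is hypothetical, and the 
actual change would live in SQLConf:

import java.io.File

// Build a proper file: URI from a platform-specific path instead of
// string-concatenating a "file:" prefix onto it.
def defaultWarehouseUri(): String =
  new File(System.getProperty("user.dir"), "spark-warehouse").toURI.toString

println(defaultWarehouseUri())  // Windows: file:/C:/...  Linux: file:/home/...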





> file scheme should be used correctly
> 
>
> Key: SPARK-15899
> URL: https://issues.apache.org/jira/browse/SPARK-15899
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> [A RFC|https://www.ietf.org/rfc/rfc1738.txt] defines file scheme as 
> {{file://host/}} or {{file:///}}. 
> [Wikipedia|https://en.wikipedia.org/wiki/File_URI_scheme]
> [Some code 
> stuffs|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58]
>  use different prefix such as {{file:}}.
> It would be good to prepare a utility method to correctly add {{file://host}} 
> or {{file:///}} prefix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10408) Autoencoder

2016-06-27 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351922#comment-15351922
 ] 

Alexander Ulanov commented on SPARK-10408:
--

Here is the PR https://github.com/apache/spark/pull/13621

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Assignee: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1], real in [-inf, +inf] 
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3)Denoising autoencoder 
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers
> References: 
> 1. Vincent, Pascal, et al. "Extracting and composing robust features with 
> denoising autoencoders." Proceedings of the 25th international conference on 
> Machine learning. ACM, 2008. 
> http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf
>  
> 2. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
> 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
> (2010). Stacked denoising autoencoders: Learning useful representations in a 
> deep network with a local denoising criterion. Journal of Machine Learning 
> Research, 11(3371–3408). 
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484&rep=rep1&type=pdf
> 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep 
> networks." Advances in neural information processing systems 19 (2007): 153. 
> http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf
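
For intuition on requirement 3) above: a denoising autoencoder is trained to 
reconstruct the clean input from a corrupted copy. A sketch of the 
masking-noise corruption step in plain Scala (names illustrative):

import scala.util.Random

// Masking noise: zero out each input component with probability p.
// The autoencoder then learns to reproduce the original, uncorrupted x.
def corrupt(x: Array[Double], p: Double, rng: Random): Array[Double] =
  x.map(v => if (rng.nextDouble() < p) 0.0 else v)

val rng = new Random(42)
val clean = Array(0.2, 0.7, 0.5, 0.9)
val noisy = corrupt(clean, 0.3, rng)  // network input; the training target stays clean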



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5575) Artificial neural networks for MLlib deep learning

2016-09-12 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-5575:

Description: 
*Goal:* Implement various types of artificial neural networks

*Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581)
Having deep learning within Spark's ML library is a question of convenience. 
Spark has broad analytic capabilities, and it is useful to have deep learning 
as one of these tools at hand. Deep learning is a model of choice for several 
important modern use-cases, and Spark ML might want to cover them. After all, 
it is hard to explain why we have PCA in ML but do not provide an autoencoder. 
To summarize this, Spark should have at least the most widely used deep 
learning models, such as the fully connected artificial neural network, the 
convolutional network and the autoencoder. Advanced and experimental deep 
learning features might reside within packages or as pluggable external tools. 
These 3 will provide a comprehensive deep learning set for Spark ML. We might 
also include recurrent networks as well.

*Requirements:*
# Extensible API compatible with Spark ML. Basic abstractions such as Neuron, 
Layer, Error, Regularization, Forward and Backpropagation etc. should be 
implemented as traits or interfaces, so they can be easily extended or reused. 
Define the Spark ML API for deep learning. This interface is similar to the 
other analytics tools in Spark and supports ML pipelines, which makes deep 
learning easy to use and plug into analytics workloads for Spark users. 
# Efficiency. The current implementation of the multilayer perceptron in Spark 
is less than 2x slower than Caffe, both measured on CPU. The main overhead 
sources are the JVM and Spark's communication layer. For more details, please 
refer to https://github.com/avulanov/ann-benchmark. Having said that, an 
efficient implementation of deep learning in Spark should be only a few times 
slower than in a specialized tool. This is very reasonable for a platform that 
does much more than deep learning, and I believe it is understood by the 
community.
# Scalability. Implement efficient distributed training. It relies heavily on 
efficient communication and scheduling mechanisms. The default implementation 
is based on Spark. More efficient implementations might include some external 
libraries but would use the same defined interface.

*Main features:* 
# Multilayer perceptron classifier (MLP)
# Autoencoder
# Convolutional neural networks for computer vision. The interface has to 
provide a few architectures for deep learning that are widely used in 
practice, such as AlexNet

*Additional features:*
# Other architectures, such as the Recurrent neural network (RNN), Long 
short-term memory (LSTM), Restricted Boltzmann machine (RBM), deep belief 
network (DBN), and MLP multivariate regression
# Regularizers, such as L1, L2, drop-out
# Normalizers
# Network customization. The internal API of Spark ANN is designed to be 
flexible and can handle different types of layers. However, only a part of the 
API is made public. We have to limit the number of public classes in order to 
make it simpler to support other languages. This forces us to use (String or 
Number) parameters instead of introducing new public classes. One of the 
options to specify the architecture of an ANN is to use a text configuration 
with a layer-wise description. We have considered using the Caffe format for 
this. It gives the benefit of compatibility with a well-known deep learning 
tool and simplifies the support of other languages in Spark. Implementation of 
a parser for a subset of the Caffe format might be the first step towards the 
support of general ANN architectures in Spark. 
# Hardware-specific optimization. One can wrap other deep learning 
implementations with this interface, allowing users to pick a particular 
back-end, e.g. Caffe or TensorFlow, along with the default one. The main 
motivation for using specialized libraries for deep learning would be to take 
full advantage of the hardware where Spark runs, in particular GPUs. Having 
the default interface in Spark, we will need to wrap only a subset of 
functions from a given specialized library. That does require an effort; 
however, it is not the same as wrapping all functions. Wrappers can be 
provided as packages without the need to pull new dependencies into Spark.

*Completed (merged to the main Spark branch):*
* Requirements: https://issues.apache.org/jira/browse/SPARK-9471
** API 
https://spark-summit.org/eu-2015/events/a-scalable-implementation-of-deep-learning-on-spark/
** Efficiency & Scalability: https://github.com/avulanov/ann-benchmark
* Features:
** Multilayer perceptron classifier 
https://issues.apache.org/jira/browse/SPARK-9471

*In progress (pull request):*
* Features:
**

[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning

2016-09-30 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15536321#comment-15536321
 ] 

Alexander Ulanov commented on SPARK-5575:
-

I recently released a package to handle new features that are not yet merged in 
Spark: https://spark-packages.org/package/avulanov/scalable-deeplearning

> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> *Goal:* Implement various types of artificial neural networks
> *Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581)
> Having deep learning within Spark's ML library is a question of convenience. 
> Spark has broad analytic capabilities and it is useful to have deep learning 
> as one of these tools at hand. Deep learning is a model of choice for several 
> important modern use-cases, and Spark ML might want to cover them. 
> After all, it is hard to explain why we have PCA in ML but don't provide 
> Autoencoder. To summarize this, Spark should have at least the most widely 
> used deep learning models, such as fully connected artificial neural network, 
> convolutional network and autoencoder. Advanced and experimental deep 
> learning features might reside within packages or as pluggable external 
> tools. These 3 will provide a comprehensive deep learning set for Spark ML. 
> We might also include recurrent networks as well.
> *Requirements:*
> # Extensible API compatible with Spark ML. Basic abstractions such as Neuron, 
> Layer, Error, Regularization, Forward and Backpropagation etc. should be 
> implemented as traits or interfaces, so they can be easily extended or 
> reused. Define the Spark ML API for deep learning. This interface is similar 
> to the other analytics tools in Spark and supports ML pipelines. This makes 
> deep learning easy to use and plug into analytics workloads for Spark 
> users. 
> # Efficiency. The current implementation of multilayer perceptron in Spark is 
> less than 2x slower than Caffe, both measured on CPU. The main overhead 
> sources are JVM and Spark's communication layer. For more details, please 
> refer to https://github.com/avulanov/ann-benchmark. Having said that, the 
> efficient implementation of deep learning in Spark should be only a few times 
> slower than in a specialized tool. This is very reasonable for a platform 
> that does much more than deep learning, and I believe it is understood by the 
> community.
> # Scalability. Implement efficient distributed training. It relies heavily on 
> the efficient communication and scheduling mechanisms. The default 
> implementation is based on Spark. More efficient implementations might 
> include some external libraries but use the same interface defined.
> *Main features:* 
> # Multilayer perceptron classifier (MLP)
> # Autoencoder
> # Convolutional neural networks for computer vision. The interface has to 
> provide few architectures for deep learning that are widely used in practice, 
> such as AlexNet
> *Additional features:*
> # Other architectures, such as Recurrent neural network (RNN), Long 
> short-term memory (LSTM), Restricted Boltzmann machine (RBM), deep belief 
> network (DBN), and MLP multivariate regression
> # Regularizers, such as L1, L2, drop-out
> # Normalizers
> # Network customization. The internal API of Spark ANN is designed to be 
> flexible and can handle different types of layers. However, only a part of 
> the API is made public. We have to limit the number of public classes in 
> order to make it simpler to support other languages. This forces us to use 
> (String or Number) parameters instead of introducing new public classes. 
> One of the options to specify the architecture of ANN is to use text 
> configuration with layer-wise description. We have considered using Caffe 
> format for this. It gives the benefit of compatibility with well known deep 
> learning tool and simplifies the support of other languages in Spark. 
> Implementation of a parser for the subset of Caffe format might be the first 
> step towards the support of general ANN architectures in Spark. 
> # Hardware specific optimization. One can wrap other deep learning 
> implementations with this interface allowing users to pick a particular 
> back-end, e.g. Caffe or TensorFlow, along with the default one. The interface 
> has to provide a few architectures for deep learning that are widely used in 
> practice, such as AlexNet. The main motivation for using specialized 
> libraries for deep learning would be to fully take advantage of the hardware 
> where Spark runs, in particular GPUs. Having the default interface in

[jira] [Commented] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.

2016-01-23 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15114080#comment-15114080
 ] 

Alexander Ulanov commented on SPARK-10528:
--

Hi! I'm getting the same problem on Windows 7 x64 with Spark 1.6.0. It worked 
with earlier versions of Spark. Changing permissions does not help. Is there 
a workaround? 

> spark-shell throws java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable.
> --
>
> Key: SPARK-10528
> URL: https://issues.apache.org/jira/browse/SPARK-10528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.0
> Environment: Windows 7 x64
>Reporter: Aliaksei Belablotski
>Priority: Minor
>
> Starting spark-shell throws
> java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.

2016-01-23 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15114080#comment-15114080
 ] 

Alexander Ulanov edited comment on SPARK-10528 at 1/24/16 1:30 AM:
---

Hi! I'm getting the same problem on Windows 7 x64 with Spark 1.6.0. It worked 
with earlier versions of Spark. Changing permissions does not help. Spark 
eventually launches with that error but does not provide a sqlContext. I've 
checked Spark 1.4.1 and it worked fine.

Is there a workaround? 


was (Author: avulanov):
Hi! I'm getting the same problem on Windows 7 x64 with Spark 1.6.0. It worked 
with earlier versions of Spark. Changing permissions does not help. Is there 
a workaround? 

> spark-shell throws java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable.
> --
>
> Key: SPARK-10528
> URL: https://issues.apache.org/jira/browse/SPARK-10528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.0
> Environment: Windows 7 x64
>Reporter: Aliaksei Belablotski
>Priority: Minor
>
> Starting spark-shell throws
> java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10408) Autoencoder

2015-11-11 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-10408:
-
Description: 
Goal: Implement various types of autoencoders 
Requirements:
1)Basic (deep) autoencoder that supports different types of inputs: binary, 
real in [0..1], real in [-inf, +inf] 
2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to 
the MLP and then used here 
3)Denoising autoencoder 
4)Stacked autoencoder for pre-training of deep networks. It should support 
arbitrary network layers


References: 
1, 2. 
http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
(2010). Stacked denoising autoencoders: Learning useful representations in a 
deep network with a local denoising criterion. Journal of Machine Learning 
Research, 11(3371–3408). 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484&rep=rep1&type=pdf
4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep networks." 
Advances in neural information processing systems 19 (2007): 153. 
http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf

  was:
Goal: Implement various types of autoencoders 
Requirements:
1)Basic (deep) autoencoder that supports different types of inputs: binary, 
real in [0..1], real in [-inf, +inf] 
2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to 
the MLP and then used here 
3)Denoising autoencoder 
4)Stacked autoencoder for pre-training of deep networks. It should support 
arbitrary network layers: 

References: 
1-3. http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf
4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf


> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1], real in [-inf, +inf] 
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3)Denoising autoencoder 
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers
> References: 
> 1, 2. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
> 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
> (2010). Stacked denoising autoencoders: Learning useful representations in a 
> deep network with a local denoising criterion. Journal of Machine Learning 
> Research, 11(3371–3408). 
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484&rep=rep1&type=pdf
> 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep 
> networks." Advances in neural information processing systems 19 (2007): 153. 
> http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10408) Autoencoder

2015-11-13 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-10408:
-
Description: 
Goal: Implement various types of autoencoders 
Requirements:
1)Basic (deep) autoencoder that supports different types of inputs: binary, 
real in [0..1], real in [-inf, +inf] 
2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to 
the MLP and then used here 
3)Denoising autoencoder 
4)Stacked autoencoder for pre-training of deep networks. It should support 
arbitrary network layers


References: 
1. Vincent, Pascal, et al. "Extracting and composing robust features with 
denoising autoencoders." Proceedings of the 25th international conference on 
Machine learning. ACM, 2008. 
http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf
 
2. http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
(2010). Stacked denoising autoencoders: Learning useful representations in a 
deep network with a local denoising criterion. Journal of Machine Learning 
Research, 11(3371–3408). 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484&rep=rep1&type=pdf
4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep networks." 
Advances in neural information processing systems 19 (2007): 153. 
http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf

  was:
Goal: Implement various types of autoencoders 
Requirements:
1)Basic (deep) autoencoder that supports different types of inputs: binary, 
real in [0..1], real in [-inf, +inf] 
2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to 
the MLP and then used here 
3)Denoising autoencoder 
4)Stacked autoencoder for pre-training of deep networks. It should support 
arbitrary network layers


References: 
1, 2. 
http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
(2010). Stacked denoising autoencoders: Learning useful representations in a 
deep network with a local denoising criterion. Journal of Machine Learning 
Research, 11(3371–3408). 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484&rep=rep1&type=pdf
4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep networks." 
Advances in neural information processing systems 19 (2007): 153. 
http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf


> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1) Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1], real in [-inf, +inf] (see the sketch after this notification) 
> 2) Sparse autoencoder, i.e. L1 regularization. It should be added as a 
> feature to the MLP and then used here 
> 3) Denoising autoencoder 
> 4) Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers
> References: 
> 1. Vincent, Pascal, et al. "Extracting and composing robust features with 
> denoising autoencoders." Proceedings of the 25th international conference on 
> Machine learning. ACM, 2008. 
> http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf
>  
> 2. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
> 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
> (2010). Stacked denoising autoencoders: Learning useful representations in a 
> deep network with a local denoising criterion. Journal of Machine Learning 
> Research 11: 3371–3408. 
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484&rep=rep1&type=pdf
> 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep 
> networks." Advances in neural information processing systems 19 (2007): 153. 
> http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf
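
As a companion to requirement 1 above, here is a minimal, self-contained 
sketch: a single-hidden-layer autoencoder for inputs in [0..1], trained by 
stochastic gradient descent on squared reconstruction error. It is plain Scala 
with no Spark APIs; all names and hyperparameters are illustrative, and it 
sketches the technique only, not any actual Spark implementation.

{code:scala}
// Hedged sketch: a tiny single-hidden-layer autoencoder with sigmoid units.
// Plain Scala, no external dependencies; sizes and data are illustrative.
object TinyAutoencoder {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  def main(args: Array[String]): Unit = {
    val rng = new scala.util.Random(42)
    val (nIn, nHidden) = (4, 2)
    // Encoder/decoder weights with small random init; biases start at zero.
    val w1 = Array.fill(nHidden, nIn)(rng.nextGaussian() * 0.1)
    val b1 = Array.fill(nHidden)(0.0)
    val w2 = Array.fill(nIn, nHidden)(rng.nextGaussian() * 0.1)
    val b2 = Array.fill(nIn)(0.0)
    // Toy data in [0..1].
    val data = Array(Array(1.0, 0.0, 0.0, 1.0), Array(0.0, 1.0, 1.0, 0.0))
    val lr = 0.5
    for (_ <- 1 to 2000; x <- data) {
      // Forward pass: hidden code h, reconstruction y.
      val h = Array.tabulate(nHidden)(j =>
        sigmoid(b1(j) + (0 until nIn).map(i => w1(j)(i) * x(i)).sum))
      val y = Array.tabulate(nIn)(i =>
        sigmoid(b2(i) + (0 until nHidden).map(j => w2(i)(j) * h(j)).sum))
      // Backpropagation for the loss 0.5 * sum_i (y_i - x_i)^2.
      val dy = Array.tabulate(nIn)(i => (y(i) - x(i)) * y(i) * (1.0 - y(i)))
      val dh = Array.tabulate(nHidden)(j =>
        (0 until nIn).map(i => dy(i) * w2(i)(j)).sum * h(j) * (1.0 - h(j)))
      for (i <- 0 until nIn; j <- 0 until nHidden) w2(i)(j) -= lr * dy(i) * h(j)
      for (i <- 0 until nIn) b2(i) -= lr * dy(i)
      for (j <- 0 until nHidden; i <- 0 until nIn) w1(j)(i) -= lr * dh(j) * x(i)
      for (j <- 0 until nHidden) b1(j) -= lr * dh(j)
    }
    // Print reconstructions; after training they should be close to the inputs.
    for (x <- data) {
      val h = Array.tabulate(nHidden)(j =>
        sigmoid(b1(j) + (0 until nIn).map(i => w1(j)(i) * x(i)).sum))
      val y = Array.tabulate(nIn)(i =>
        sigmoid(b2(i) + (0 until nHidden).map(j => w2(i)(j) * h(j)).sum))
      println(x.mkString(",") + " -> " + y.map(v => f"$v%.2f").mkString(","))
    }
  }
}
{code}

A sparse variant (requirement 2) would add an L1 penalty on the hidden 
activations to the same loss; a stacked variant (requirement 4) would train 
such layers one at a time, feeding each learned code into the next layer.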



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9897) User Guide for Multilayer Perceptron Classifier

2015-08-12 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694356#comment-14694356
 ] 

Alexander Ulanov commented on SPARK-9897:
-

We already have an issue for MLP classifier docs: 
https://issues.apache.org/jira/browse/SPARK-9846. I plan to resolve it soon. 
Could you close this one?

> User Guide for Multilayer Perceptron Classifier
> ---
>
> Key: SPARK-9897
> URL: https://issues.apache.org/jira/browse/SPARK-9897
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Feynman Liang
>
> SPARK-9471 adds MLPs to ML Pipelines, an algorithm family not covered by the 
> MLlib docs. We should update the user guide to include it under the 
> {{Algorithm Guides > Algorithms in spark.ml}} section of {{ml-guide}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-9897) User Guide for Multilayer Perceptron Classifier

2015-08-12 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-9897:

Comment: was deleted

(was: We already have an issue for MLP classifier docs: 
https://issues.apache.org/jira/browse/SPARK-9846. I plan to resolve it soon. 
Could you close this one?)

> User Guide for Multilayer Perceptron Classifier
> ---
>
> Key: SPARK-9897
> URL: https://issues.apache.org/jira/browse/SPARK-9897
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Feynman Liang
>
> SPARK-9471 adds MLPs to ML Pipelines, an algorithm family not covered by the 
> MLlib docs. We should update the user guide to include it under the 
> {{Algorithm Guides > Algorithms in spark.ml}} section of {{ml-guide}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9897) User Guide for Multilayer Perceptron Classifier

2015-08-12 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694355#comment-14694355
 ] 

Alexander Ulanov commented on SPARK-9897:
-

We already have an issue for MLP classifier docs: 
https://issues.apache.org/jira/browse/SPARK-9846. I plan to resolve it soon. 
Could you close this one?

> User Guide for Multilayer Perceptron Classifier
> ---
>
> Key: SPARK-9897
> URL: https://issues.apache.org/jira/browse/SPARK-9897
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Feynman Liang
>
> SPARK-9471 adds MLPs to ML Pipelines, an algorithm family not covered by the 
> MLlib docs. We should update the user guide to include it under the 
> {{Algorithm Guides > Algorithms in spark.ml}} section of {{ml-guide}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9951) Example code for Multilayer Perceptron Classifier

2015-08-14 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697902#comment-14697902
 ] 

Alexander Ulanov commented on SPARK-9951:
-

I already have this example and plan to use it for the User Guide. Should we 
provide a different example in the examples/ folder?

> Example code for Multilayer Perceptron Classifier
> -
>
> Key: SPARK-9951
> URL: https://issues.apache.org/jira/browse/SPARK-9951
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> Add an example to the examples/ code folder for Scala.
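
For context, here is roughly what such a Scala entry in examples/ might look 
like, based on the public MultilayerPerceptronClassifier API. This is a hedged 
sketch, not the eventual committed example: the data path is the standard 
sample file, the layer sizes assume that dataset, and it is written against 
the later SparkSession entry point (on Spark 1.x one would use SQLContext, and 
MulticlassClassificationEvaluator exposes "precision" rather than "accuracy").

{code:scala}
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.SparkSession

object MultilayerPerceptronExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("MLPExample").getOrCreate()
    // Load LIBSVM-formatted data as a DataFrame with "label" and "features".
    val data = spark.read.format("libsvm")
      .load("data/mllib/sample_multiclass_classification_data.txt")
    val Array(train, test) = data.randomSplit(Array(0.6, 0.4), seed = 1234L)
    // Layer sizes: 4 input features, two hidden layers, 3 output classes.
    val layers = Array[Int](4, 5, 4, 3)
    val trainer = new MultilayerPerceptronClassifier()
      .setLayers(layers)
      .setBlockSize(128)
      .setSeed(1234L)
      .setMaxIter(100)
    val model = trainer.fit(train)
    val result = model.transform(test)
    val evaluator = new MulticlassClassificationEvaluator()
      .setMetricName("accuracy") // use "precision" on Spark 1.x
    println(s"Test set accuracy = ${evaluator.evaluate(result)}")
    spark.stop()
  }
}
{code}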



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9951) Example code for Multilayer Perceptron Classifier

2015-08-17 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700567#comment-14700567
 ] 

Alexander Ulanov commented on SPARK-9951:
-

I've submitted a PR for the user guide. Could you confirm whether the example 
code in the PR can be used for this issue? https://github.com/apache/spark/pull/8262

> Example code for Multilayer Perceptron Classifier
> -
>
> Key: SPARK-9951
> URL: https://issues.apache.org/jira/browse/SPARK-9951
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> Add an example to the examples/ code folder for Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10408) Autoencoder

2015-09-01 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-10408:


 Summary: Autoencoder
 Key: SPARK-10408
 URL: https://issues.apache.org/jira/browse/SPARK-10408
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.5.0
Reporter: Alexander Ulanov
Priority: Minor


Goal: Implement various types of autoencoders 
Requirements:
1) Basic (deep) autoencoder that supports different types of inputs: binary, 
real in [0..1], real in [-inf, +inf]
2) Sparse autoencoder, i.e. L1 regularization. It should be added as a feature 
to the MLP and then used here
3) Denoising autoencoder (a corruption sketch follows below)
4) Stacked autoencoder for pre-training of deep networks. It should support 
arbitrary network layers
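
On requirement 3: a denoising autoencoder trains the network to reconstruct 
the clean input from a corrupted copy (Vincent et al., cited in the references 
of this umbrella issue). Below is a minimal sketch of masking-noise corruption 
in plain Scala; the names are illustrative and no Spark APIs are assumed.

{code:scala}
import scala.util.Random

object MaskingNoise {
  // Masking noise: each component is independently forced to 0.0 with
  // probability p. The reconstruction loss is still computed against the
  // original, uncorrupted input x.
  def corrupt(x: Array[Double], p: Double, rng: Random): Array[Double] =
    x.map(v => if (rng.nextDouble() < p) 0.0 else v)
}

// Usage sketch: train on (MaskingNoise.corrupt(x, 0.3, rng), x) pairs
// instead of (x, x).
{code}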



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10408) Autoencoder

2015-09-01 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-10408:
-
Issue Type: Umbrella  (was: Improvement)

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1) Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1], real in [-inf, +inf]
> 2) Sparse autoencoder, i.e. L1 regularization. It should be added as a 
> feature to the MLP and then used here
> 3) Denoising autoencoder
> 4) Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


