[jira] [Commented] (SPARK-10408) Autoencoder

2017-05-09 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16003110#comment-16003110
 ] 

Alexander Ulanov commented on SPARK-10408:
--

The autoencoder is implemented in the referenced pull request. I would be glad 
to follow up on the code review if anyone can take it on.
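
For readers new to the terminology, the denoising criterion in requirement 3 
of the quoted description below boils down to: corrupt the input, then train 
the network to reconstruct the clean input. A minimal single-layer sketch in 
plain Scala (illustrative only; this is not the pull request's code):

{code:scala}
import scala.util.Random

object DenoisingSketch {
  val rng = new Random(42)

  // Masking noise: randomly zero out a fraction p of the inputs.
  def corrupt(x: Array[Double], p: Double): Array[Double] =
    x.map(v => if (rng.nextDouble() < p) 0.0 else v)

  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Encode with W (hidden x visible) and bias b; decode with the transposed
  // weights (tied) and bias c.
  def reconstruct(x: Array[Double], w: Array[Array[Double]],
                  b: Array[Double], c: Array[Double]): Array[Double] = {
    val h = w.zip(b).map { case (row, bi) =>
      sigmoid(row.zip(x).map { case (wij, xj) => wij * xj }.sum + bi)
    }
    c.indices.map { j =>
      sigmoid(w.map(_(j)).zip(h).map { case (wj, hi) => wj * hi }.sum + c(j))
    }.toArray
  }

  // Denoising objective: reconstruction error against the *clean* input.
  def loss(x: Array[Double], w: Array[Array[Double]],
           b: Array[Double], c: Array[Double]): Double = {
    val xHat = reconstruct(corrupt(x, 0.3), w, b, c)
    x.zip(xHat).map { case (a, bb) => (a - bb) * (a - bb) }.sum / 2
  }
}
{code}

Training minimizes this loss over the data; the stacked variant in requirement 
4 repeats the procedure layer by layer, feeding each trained encoder's hidden 
activations to the next autoencoder.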

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Assignee: Alexander Ulanov
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1) Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1], and real in [-inf, +inf] 
> 2) Sparse autoencoder, i.e. L1 regularization. It should be added as a 
> feature to the MLP and then used here 
> 3) Denoising autoencoder 
> 4) Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers
> References: 
> 1. Vincent, Pascal, et al. "Extracting and composing robust features with 
> denoising autoencoders." Proceedings of the 25th international conference on 
> Machine learning. ACM, 2008. 
> http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf
> 2. Rifai, Salah, et al. "Contractive auto-encoders: explicit invariance 
> during feature extraction." Proceedings of the 28th International Conference 
> on Machine Learning. 2011. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf
> 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
> (2010). Stacked denoising autoencoders: Learning useful representations in a 
> deep network with a local denoising criterion. Journal of Machine Learning 
> Research, 11:3371–3408. 
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484&rep=rep1&type=pdf
> 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep 
> networks." Advances in neural information processing systems 19 (2007): 153. 
> http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf






[jira] [Commented] (SPARK-7253) Add example of belief propagation with GraphX

2016-12-13 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15746841#comment-15746841
 ] 

Alexander Ulanov commented on SPARK-7253:
-

Here is an implementation of the belief propagation algorithm for factor 
graphs, with examples: https://github.com/HewlettPackard/sandpiper

> Add example of belief propagation with GraphX
> -
>
> Key: SPARK-7253
> URL: https://issues.apache.org/jira/browse/SPARK-7253
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: Joseph K. Bradley
>
> It would be nice to document (via an example) how to use GraphX to do belief 
> propagation.  It's probably too much right now to talk about a full-fledged 
> graphical model library (and that would belong in MLlib anyway), but a 
> simple example of a graphical model + BP would be nice to add to GraphX.






[jira] [Commented] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566467#comment-15566467
 ] 

Alexander Ulanov commented on SPARK-17870:
--

[`SelectKBest`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest)
 works with "a Function taking two arrays X and y, and returning a pair of 
arrays (scores, pvalues) or a single array with scores". According to what you 
observe, it uses p-values for sorting `chi2` outputs. Indeed, this is the case 
for all functions that return two arrays: 
https://github.com/scikit-learn/scikit-learn/blob/412996f/sklearn/feature_selection/univariate_selection.py#L331.
 Alternatively, one can use raw `chi2` scores for sorting by passing only the 
first array from `chi2` to `SelectKBest`. As far as I remember, using raw chi2 
scores is the default in Weka's 
[ChiSquaredAttributeEval](http://weka.sourceforge.net/doc.stable/weka/attributeSelection/ChiSquaredAttributeEval.html).
 So I would not claim that either approach is incorrect. According to 
[Introduction to 
IR](http://nlp.stanford.edu/IR-book/html/htmledition/assessing-as-a-feature-selection-methodassessing-chi-square-as-a-feature-selection-method-1.html),
 there might be an issue with computing p-values, because the chi-squared test 
is then applied multiple times. Using plain chi2 values does not involve a 
statistical test, so it can be treated as a ranking with no statistical 
implications.
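
To make the degrees-of-freedom point concrete: with different df, ranking by 
the raw statistic and ranking by the p-value can disagree. A small sketch 
using commons-math3 (the numbers are invented; this is not Spark's 
ChiSqSelector code):

{code:scala}
import org.apache.commons.math3.distribution.ChiSquaredDistribution

object Chi2RankingSketch {
  // p-value of a chi-squared statistic for the given degrees of freedom.
  def pValue(stat: Double, df: Int): Double =
    1.0 - new ChiSquaredDistribution(df.toDouble).cumulativeProbability(stat)

  def main(args: Array[String]): Unit = {
    // Feature A: statistic 10.0 with df = 2; feature B: statistic 12.0 with df = 8.
    val features = Seq(("A", 10.0, 2), ("B", 12.0, 8))

    val byRawStat = features.sortBy { case (_, stat, _) => -stat }.map(_._1)
    val byPValue  = features.sortBy { case (_, stat, df) => pValue(stat, df) }.map(_._1)

    println(s"by raw statistic: $byRawStat") // List(B, A)
    println(s"by p-value:       $byPValue")  // List(A, B)
  }
}
{code}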

> ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong 
> 
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Priority: Critical
>
> The method that computes the ChiSquareTestResult in 
> mllib/feature/ChiSqSelector.scala (line 233) is wrong.
> The feature selection method ChiSquareSelector uses 
> ChiSquareTestResult.statistic (the chi-square value) to select features: it 
> selects the features with the largest chi-square values. But the degrees of 
> freedom (df) of the chi-square values differ in Statistics.chiSqTest(RDD), 
> and with different df you cannot rank features by their chi-square values.
> Because of this wrong way of computing the chi-square values, the feature 
> selection results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If selectKBest is used, feature 3 is selected.
> If selectFpr is used, features 1 and 2 are selected. 
> This is strange. 
> I used scikit-learn to test the same data with the same parameters. 
> With selectKBest, feature 1 is selected. 
> With selectFpr, features 1 and 2 are selected. 
> This result makes sense, because the df of each feature in scikit-learn is 
> the same.
> I plan to submit a PR for this problem.
>  
>  






[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning

2016-09-30 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15536321#comment-15536321
 ] 

Alexander Ulanov commented on SPARK-5575:
-

I recently released a package that provides new features not yet merged into 
Spark: https://spark-packages.org/package/avulanov/scalable-deeplearning

> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> *Goal:* Implement various types of artificial neural networks
> *Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581)
> Having deep learning within Spark's ML library is a question of convenience. 
> Spark has broad analytic capabilities and it is useful to have deep learning 
> as one of these tools at hand. Deep learning is a model of choice for 
> several important modern use cases, and Spark ML might want to cover them. 
> After all, it is hard to explain why we have PCA in ML but do not provide an 
> autoencoder. To summarize, Spark should have at least the most widely used 
> deep learning models, such as the fully connected artificial neural network, 
> the convolutional network, and the autoencoder. Advanced and experimental 
> deep learning features might reside within packages or as pluggable external 
> tools. These three will provide a comprehensive deep learning set for Spark 
> ML. We might include recurrent networks as well.
> *Requirements:*
> # Extensible API compatible with Spark ML. Basic abstractions such as 
> Neuron, Layer, Error, Regularization, Forward and Backpropagation, etc., 
> should be implemented as traits or interfaces, so they can be easily 
> extended or reused. Define the Spark ML API for deep learning. This 
> interface is similar to the other analytics tools in Spark and supports ML 
> pipelines. This makes deep learning easy to use and plug into analytics 
> workloads for Spark users. 
> # Efficiency. The current implementation of the multilayer perceptron in 
> Spark is less than 2x slower than Caffe, both measured on CPU. The main 
> overhead sources are the JVM and Spark's communication layer. For more 
> details, please refer to https://github.com/avulanov/ann-benchmark. Having 
> said that, an efficient implementation of deep learning in Spark should be 
> only a few times slower than a specialized tool. This is very reasonable for 
> a platform that does much more than deep learning, and I believe it is 
> understood by the community.
> # Scalability. Implement efficient distributed training. It relies heavily 
> on efficient communication and scheduling mechanisms. The default 
> implementation is based on Spark. More efficient implementations might rely 
> on external libraries while using the same interface.
> *Main features:* 
> # Multilayer perceptron classifier (MLP)
> # Autoencoder
> # Convolutional neural networks for computer vision. The interface has to 
> provide a few architectures for deep learning that are widely used in 
> practice, such as AlexNet
> *Additional features:*
> # Other architectures, such as the recurrent neural network (RNN), long 
> short-term memory (LSTM), the restricted Boltzmann machine (RBM), the deep 
> belief network (DBN), and MLP multivariate regression
> # Regularizers, such as L1, L2, drop-out
> # Normalizers
> # Network customization. The internal API of Spark ANN is designed to be 
> flexible and can handle different types of layers. However, only a part of 
> the API is made public. We have to limit the number of public classes in 
> order to make it simpler to support other languages. This forces us to use 
> (String or Number) parameters instead of introducing new public classes. One 
> option for specifying the architecture of an ANN is a text configuration 
> with a layer-wise description. We have considered using the Caffe format for 
> this. It gives the benefit of compatibility with a well-known deep learning 
> tool and simplifies the support of other languages in Spark. Implementing a 
> parser for a subset of the Caffe format might be the first step toward 
> supporting general ANN architectures in Spark. 
> # Hardware-specific optimization. One can wrap other deep learning 
> implementations with this interface, allowing users to pick a particular 
> back-end, e.g. Caffe or TensorFlow, along with the default one. The 
> interface has to provide a few architectures for deep learning that are 
> widely used in practice, such as AlexNet. The main motivation for using 
> specialized libraries for deep learning would be to fully take advantage of 
> the hardware where Spark runs, in particular GPUs. Having the default 
> interface in

[jira] [Updated] (SPARK-5575) Artificial neural networks for MLlib deep learning

2016-09-12 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-5575:

Description: 
*Goal:* Implement various types of artificial neural networks

*Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581)
Having deep learning within Spark's ML library is a question of convenience. 
Spark has broad analytic capabilities and it is useful to have deep learning as 
one of these tools at hand. Deep learning is a model of choice for several 
important modern use cases, and Spark ML might want to cover them. After all, 
it is hard to explain why we have PCA in ML but do not provide an autoencoder. 
To summarize, Spark should have at least the most widely used deep learning 
models, such as the fully connected artificial neural network, the 
convolutional network, and the autoencoder. Advanced and experimental deep 
learning features might reside within packages or as pluggable external tools. 
These three will provide a comprehensive deep learning set for Spark ML. We 
might include recurrent networks as well.

*Requirements:*
# Extensible API compatible with Spark ML. Basic abstractions such as Neuron, 
Layer, Error, Regularization, Forward and Backpropagation, etc., should be 
implemented as traits or interfaces, so they can be easily extended or reused 
(a rough sketch follows this list). Define the Spark ML API for deep learning. 
This interface is similar to the other analytics tools in Spark and supports 
ML pipelines. This makes deep learning easy to use and plug into analytics 
workloads for Spark users. 
# Efficiency. The current implementation of the multilayer perceptron in Spark 
is less than 2x slower than Caffe, both measured on CPU. The main overhead 
sources are the JVM and Spark's communication layer. For more details, please 
refer to https://github.com/avulanov/ann-benchmark. Having said that, an 
efficient implementation of deep learning in Spark should be only a few times 
slower than a specialized tool. This is very reasonable for a platform that 
does much more than deep learning, and I believe it is understood by the 
community.
# Scalability. Implement efficient distributed training. It relies heavily on 
efficient communication and scheduling mechanisms. The default implementation 
is based on Spark. More efficient implementations might rely on external 
libraries while using the same interface.
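
The trait-based abstraction from requirement 1 could look roughly like the 
following (a minimal sketch; the names are invented here and are not Spark's 
actual internal ANN classes):

{code:scala}
// Sketch only: a layer knows how to propagate activations forward and
// gradients backward.
trait Layer {
  def forward(input: Array[Double]): Array[Double]
  // Given the gradient w.r.t. this layer's output, return the gradient
  // w.r.t. its input.
  def backward(input: Array[Double], outputGrad: Array[Double]): Array[Double]
}

// Example implementation: an element-wise sigmoid activation layer.
class SigmoidLayer extends Layer {
  private def s(z: Double) = 1.0 / (1.0 + math.exp(-z))
  def forward(input: Array[Double]): Array[Double] = input.map(s)
  def backward(input: Array[Double], outputGrad: Array[Double]): Array[Double] =
    input.zip(outputGrad).map { case (z, g) => g * s(z) * (1.0 - s(z)) }
}
{code}

A network is then a sequence of such layers, with forward passes chained in 
order and backward passes chained in reverse.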

*Main features:* 
# Multilayer perceptron classifier (MLP)
# Autoencoder
# Convolutional neural networks for computer vision. The interface has to 
provide a few architectures for deep learning that are widely used in 
practice, such as AlexNet

*Additional features:*
# Other architectures, such as the recurrent neural network (RNN), long 
short-term memory (LSTM), the restricted Boltzmann machine (RBM), the deep 
belief network (DBN), and MLP multivariate regression
# Regularizers, such as L1, L2, drop-out
# Normalizers
# Network customization. The internal API of Spark ANN is designed to be 
flexible and can handle different types of layers. However, only a part of the 
API is made public. We have to limit the number of public classes in order to 
make it simpler to support other languages. This forces us to use (String or 
Number) parameters instead of introducing new public classes. One option for 
specifying the architecture of an ANN is a text configuration with a 
layer-wise description. We have considered using the Caffe format for this. It 
gives the benefit of compatibility with a well-known deep learning tool and 
simplifies the support of other languages in Spark. Implementing a parser for 
a subset of the Caffe format might be the first step toward supporting general 
ANN architectures in Spark. 
# Hardware-specific optimization. One can wrap other deep learning 
implementations with this interface, allowing users to pick a particular 
back-end, e.g. Caffe or TensorFlow, along with the default one. The interface 
has to provide a few architectures for deep learning that are widely used in 
practice, such as AlexNet. The main motivation for using specialized libraries 
for deep learning would be to fully take advantage of the hardware where Spark 
runs, in particular GPUs. Having the default interface in Spark, we will need 
to wrap only a subset of functions from a given specialized library. This does 
require effort; however, it is not the same as wrapping all functions. 
Wrappers can be provided as packages without the need to pull new dependencies 
into Spark.
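
For the scalability requirement above, the standard Spark pattern is to sum 
per-partition gradients with treeAggregate and then update the weights on the 
driver. A schematic example with a toy linear model (not the actual ANN 
gradient code):

{code:scala}
import org.apache.spark.sql.SparkSession

object GradientAveragingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("grad-sketch").master("local[*]").getOrCreate()
    // Toy data: (features, label) pairs for a two-weight linear model.
    val data = spark.sparkContext.parallelize(Seq(
      (Array(1.0, 2.0), 1.0), (Array(2.0, 1.0), 0.0), (Array(0.5, 0.5), 1.0)))
    val weights = Array(0.0, 0.0)

    // Squared-error gradient, summed across partitions in one pass.
    val (gradSum, count) = data.treeAggregate((Array(0.0, 0.0), 0L))(
      seqOp = { case ((g, n), (x, y)) =>
        val err = x.zip(weights).map { case (xi, wi) => xi * wi }.sum - y
        (g.zip(x).map { case (gi, xi) => gi + err * xi }, n + 1)
      },
      combOp = { case ((g1, n1), (g2, n2)) =>
        (g1.zip(g2).map { case (a, b) => a + b }, n1 + n2)
      })

    // One gradient-descent step with the averaged gradient.
    val updated = weights.zip(gradSum).map { case (w, g) => w - 0.1 * g / count }
    println(updated.mkString("updated weights: [", ", ", "]"))
    spark.stop()
  }
}
{code}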

*Completed (merged to the main Spark branch):*
* Requirements: https://issues.apache.org/jira/browse/SPARK-9471
** API 
https://spark-summit.org/eu-2015/events/a-scalable-implementation-of-deep-learning-on-spark/
** Efficiency & Scalability: https://github.com/avulanov/ann-benchmark
* Features:
** Multilayer perceptron classifier 
https://issues.apache.org/jira/browse/SPARK-9471

*In progress (pull request):*
* Features:
**

[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap

2016-07-27 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15395489#comment-15395489
 ] 

Alexander Ulanov commented on SPARK-15581:
--

[~bordaw] Sounds great! Just in case, I have summarized the above discussion 
related to DNNs in the main DNN JIRA: 
https://issues.apache.org/jira/browse/SPARK-5575

> MLlib 2.1 Roadmap
> -
>
> Key: SPARK-15581
> URL: https://issues.apache.org/jira/browse/SPARK-15581
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>
> This is a master list for MLlib improvements we are working on for the next 
> release. Please view this as a wish list rather than a definite plan, for we 
> don't have an accurate estimate of available resources. Due to limited review 
> bandwidth, features appearing on this list will get higher priority during 
> code review. But feel free to suggest new items to the list in comments. We 
> are experimenting with this process. Your feedback would be greatly 
> appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a medium/big feature. Based on our experience, mixing the development 
> process with a big feature usually causes long delays in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start working on some features. This is to avoid duplicate work. For 
> small features, you don't need to wait for the JIRA to be assigned.
> * For medium/big features or features with dependencies, please get assigned 
> first before coding and keep the ETA updated on the JIRA. If there is no 
> activity on the JIRA page for a certain amount of time, the JIRA should be 
> released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Remember to add the `@Since("VERSION")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps to improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link 
> them properly.
> * Add a "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on 
> JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
> please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and 
> documentation if applicable.
> h1. Roadmap (*WIP*)
> This is NOT [a complete list of MLlib JIRAs for 2.1| 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority].
>  We only include umbrella JIRAs and high-level tasks.
> Major efforts in this release:
> * Feature parity for the DataFrames-based API (`spark.ml`), relative to the 
> RDD-based API
> * ML persistence
> * Python API feature parity and test coverage
> * R API expansion and improvements
> * Note about new features: As usual, we expect to expand the feature set of 
> MLlib.  However, we will prioritize API parity, bug fixes, and improvements 
> over new features.
> Note `spark.mllib` is in maintenance mode now.  We will accept bug fixes for 
> it, but new features, APIs, and improvements will only be added to `spark.ml`.
> h2. Critical feature parity in DataFrame-based API
> * Umbrella JIRA: [SPARK-4591]
> h2. Persistence
> * Complete persistence within MLlib
> ** Python tuning (SPARK-13786)
> * MLlib in R format: compatibility with other languages (SPARK-15572)
> * Impose backwards compatibility for persistence (SPARK-15573)
> h2. Python API
> * Standardize unit tests for Scala and Python to improve and consolidate test 
> coverage for Params, persistence, and other common functionality (SPARK-15571)
> * Improve Python API handling of Params, persistence (SPARK-14771) 
> (SPARK-14706)
> ** Note: The linked JIRAs for this are incomplete.  More to be created...
> ** Related: Implement Python meta-algorithms in Scala (to simplify 
> persistence) (SPARK-15574)
> * Feature parity: The main goal of 

[jira] [Updated] (SPARK-5575) Artificial neural networks for MLlib deep learning

2016-07-27 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-5575:

Description: 
*Goal:* Implement various types of artificial neural networks

*Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581)
Having deep learning within Spark's ML library is a question of convenience. 
Spark has broad analytic capabilities and it is useful to have deep learning as 
one of these tools at hand. Deep learning is a model of choice for several 
important modern use cases, and Spark ML might want to cover them. After all, 
it is hard to explain why we have PCA in ML but do not provide an autoencoder. 
To summarize, Spark should have at least the most widely used deep learning 
models, such as the fully connected artificial neural network, the 
convolutional network, and the autoencoder. Advanced and experimental deep 
learning features might reside within packages or as pluggable external tools. 
These three will provide a comprehensive deep learning set for Spark ML. We 
might include recurrent networks as well.

*Requirements:*
# Extensible API compatible with Spark ML. Basic abstractions such as Neuron, 
Layer, Error, Regularization, Forward and Backpropagation, etc., should be 
implemented as traits or interfaces, so they can be easily extended or reused. 
Define the Spark ML API for deep learning. This interface is similar to the 
other analytics tools in Spark and supports ML pipelines. This makes deep 
learning easy to use and plug into analytics workloads for Spark users. 
# Efficiency. The current implementation of the multilayer perceptron in Spark 
is less than 2x slower than Caffe, both measured on CPU. The main overhead 
sources are the JVM and Spark's communication layer. For more details, please 
refer to https://github.com/avulanov/ann-benchmark. Having said that, an 
efficient implementation of deep learning in Spark should be only a few times 
slower than a specialized tool. This is very reasonable for a platform that 
does much more than deep learning, and I believe it is understood by the 
community.
# Scalability. Implement efficient distributed training. It relies heavily on 
efficient communication and scheduling mechanisms. The default implementation 
is based on Spark. More efficient implementations might rely on external 
libraries while using the same interface.

*Main features:* 
# Multilayer perceptron classifier (MLP)
# Autoencoder
# Convolutional neural networks for computer vision. The interface has to 
provide a few architectures for deep learning that are widely used in 
practice, such as AlexNet

*Additional features:*
# Other architectures, such as the recurrent neural network (RNN), long 
short-term memory (LSTM), the restricted Boltzmann machine (RBM), the deep 
belief network (DBN), and MLP multivariate regression
# Regularizers, such as L1, L2, drop-out
# Normalizers
# Network customization. The internal API of Spark ANN is designed to be 
flexible and can handle different types of layers. However, only a part of the 
API is made public. We have to limit the number of public classes in order to 
make it simpler to support other languages. This forces us to use (String or 
Number) parameters instead of introducing new public classes. One option for 
specifying the architecture of an ANN is a text configuration with a 
layer-wise description. We have considered using the Caffe format for this. It 
gives the benefit of compatibility with a well-known deep learning tool and 
simplifies the support of other languages in Spark. Implementing a parser for 
a subset of the Caffe format might be the first step toward supporting general 
ANN architectures in Spark (a toy illustration follows below). 
# Hardware-specific optimization. One can wrap other deep learning 
implementations with this interface, allowing users to pick a particular 
back-end, e.g. Caffe or TensorFlow, along with the default one. The interface 
has to provide a few architectures for deep learning that are widely used in 
practice, such as AlexNet. The main motivation for using specialized libraries 
for deep learning would be to fully take advantage of the hardware where Spark 
runs, in particular GPUs. Having the default interface in Spark, we will need 
to wrap only a subset of functions from a given specialized library. This does 
require effort; however, it is not the same as wrapping all functions. 
Wrappers can be provided as packages without the need to pull new dependencies 
into Spark.
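
As a toy illustration of the layer-wise text configuration idea above (the 
format here is invented for this sketch and is far simpler than Caffe's 
prototxt), parsed layer sizes can feed directly into the existing public MLP 
API:

{code:scala}
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

object LayerConfigSketch {
  // One-line config listing layer sizes: input, hidden layers, output.
  def parseLayers(config: String): Array[Int] =
    config.stripPrefix("layers:").trim.split("\\s+").map(_.toInt)

  def main(args: Array[String]): Unit = {
    val layers = parseLayers("layers: 784 300 100 10")
    val mlp = new MultilayerPerceptronClassifier()
      .setLayers(layers)
      .setMaxIter(100)
    println(mlp.explainParams())
  }
}
{code}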

*Completed (merged to the main Spark branch):*
* Requirements: https://issues.apache.org/jira/browse/SPARK-9471
** API 
https://spark-summit.org/eu-2015/events/a-scalable-implementation-of-deep-learning-on-spark/
** Efficiency & Scalability: https://github.com/avulanov/ann-benchmark
* Features:
** Multilayer perceptron classifier 
https://issues.apache.org/jira/browse/SPARK-9471

*In progress (pull request):*
* Features:
**

[jira] [Updated] (SPARK-5575) Artificial neural networks for MLlib deep learning

2016-07-27 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-5575:

Description: 
*Goal:* Implement various types of artificial neural networks

*Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581)
Having deep learning within Spark's ML library is a question of convenience. 
Spark has broad analytic capabilities and it is useful to have deep learning as 
one of these tools at hand. Deep learning is a model of choice for several 
important modern use cases, and Spark ML might want to cover them. After all, 
it is hard to explain why we have PCA in ML but do not provide an autoencoder. 
To summarize, Spark should have at least the most widely used deep learning 
models, such as the fully connected artificial neural network, the 
convolutional network, and the autoencoder. Advanced and experimental deep 
learning features might reside within packages or as pluggable external tools. 
These three will provide a comprehensive deep learning set for Spark ML. We 
might include recurrent networks as well.

*Requirements:*
# Extensible API compatible with Spark ML. Basic abstractions such as Neuron, 
Layer, Error, Regularization, Forward and Backpropagation, etc., should be 
implemented as traits or interfaces, so they can be easily extended or reused. 
Define the Spark ML API for deep learning. This interface is similar to the 
other analytics tools in Spark and supports ML pipelines. This makes deep 
learning easy to use and plug into analytics workloads for Spark users. 
# Efficiency. The current implementation of the multilayer perceptron in Spark 
is less than 2x slower than Caffe, both measured on CPU. The main overhead 
sources are the JVM and Spark's communication layer. For more details, please 
refer to https://github.com/avulanov/ann-benchmark. Having said that, an 
efficient implementation of deep learning in Spark should be only a few times 
slower than a specialized tool. This is very reasonable for a platform that 
does much more than deep learning, and I believe it is understood by the 
community.
# Scalability. Implement efficient distributed training. It relies heavily on 
efficient communication and scheduling mechanisms. The default implementation 
is based on Spark. More efficient implementations might rely on external 
libraries while using the same interface.

*Main features:* 
# Multilayer perceptron classifier (MLP)
# Autoencoder
# Convolutional neural networks for computer vision. The interface has to 
provide a few architectures for deep learning that are widely used in 
practice, such as AlexNet

*Additional features:*
# Other architectures, such as the recurrent neural network (RNN), long 
short-term memory (LSTM), the restricted Boltzmann machine (RBM), the deep 
belief network (DBN), and MLP multivariate regression
# Regularizers, such as L1, L2, drop-out
# Normalizers
# Network customization. The internal API of Spark ANN is designed to be 
flexible and can handle different types of layers. However, only a part of the 
API is made public. We have to limit the number of public classes in order to 
make it simpler to support other languages. This forces us to use (String or 
Number) parameters instead of introducing new public classes. One option for 
specifying the architecture of an ANN is a text configuration with a 
layer-wise description. We have considered using the Caffe format for this. It 
gives the benefit of compatibility with a well-known deep learning tool and 
simplifies the support of other languages in Spark. Implementing a parser for 
a subset of the Caffe format might be the first step toward supporting general 
ANN architectures in Spark. 
# Hardware-specific optimization. One can wrap other deep learning 
implementations with this interface, allowing users to pick a particular 
back-end, e.g. Caffe or TensorFlow, along with the default one. The interface 
has to provide a few architectures for deep learning that are widely used in 
practice, such as AlexNet. The main motivation for using specialized libraries 
for deep learning would be to fully take advantage of the hardware where Spark 
runs, in particular GPUs. Having the default interface in Spark, we will need 
to wrap only a subset of functions from a given specialized library. This does 
require effort; however, it is not the same as wrapping all functions. 
Wrappers can be provided as packages without the need to pull new dependencies 
into Spark.

*Progress:*
# Requirements: done
# Features:
## Multilayer perceptron classifier

  was:
*Goal:* Implement various types of artificial neural networks

*Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581)
Having deep learning within Spark's ML library is a question of convenience. 
Spark has broad analytic capabilities and it is useful to have deep learning as 
one of these tools at hand. Deep learning is 

[jira] [Updated] (SPARK-5575) Artificial neural networks for MLlib deep learning

2016-07-27 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-5575:

Description: 
*Goal:* Implement various types of artificial neural networks

*Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581)
Having deep learning within Spark's ML library is a question of convenience. 
Spark has broad analytic capabilities and it is useful to have deep learning as 
one of these tools at hand. Deep learning is a model of choice for several 
important modern use cases, and Spark ML might want to cover them. After all, 
it is hard to explain why we have PCA in ML but do not provide an autoencoder. 
To summarize, Spark should have at least the most widely used deep learning 
models, such as the fully connected artificial neural network, the 
convolutional network, and the autoencoder. Advanced and experimental deep 
learning features might reside within packages or as pluggable external tools. 
These three will provide a comprehensive deep learning set for Spark ML. We 
might include recurrent networks as well.

*Requirements:*
# Extensible API compatible with Spark ML. Basic abstractions such as Neuron, 
Layer, Error, Regularization, Forward and Backpropagation, etc., should be 
implemented as traits or interfaces, so they can be easily extended or reused. 
Define the Spark ML API for deep learning. This interface is similar to the 
other analytics tools in Spark and supports ML pipelines. This makes deep 
learning easy to use and plug into analytics workloads for Spark users. 
# Efficiency. The current implementation of the multilayer perceptron in Spark 
is less than 2x slower than Caffe, both measured on CPU. The main overhead 
sources are the JVM and Spark's communication layer. For more details, please 
refer to https://github.com/avulanov/ann-benchmark. Having said that, an 
efficient implementation of deep learning in Spark should be only a few times 
slower than a specialized tool. This is very reasonable for a platform that 
does much more than deep learning, and I believe it is understood by the 
community.
# Scalability. Implement efficient distributed training. It relies heavily on 
efficient communication and scheduling mechanisms. The default implementation 
is based on Spark. More efficient implementations might rely on external 
libraries while using the same interface.

*Main features:* 
# Multilayer perceptron classifier (MLP)
# Autoencoder
# Convolutional neural networks for computer vision. The interface has to 
provide a few architectures for deep learning that are widely used in 
practice, such as AlexNet

*Additional features:*
# Other architectures, such as the recurrent neural network (RNN), long 
short-term memory (LSTM), the restricted Boltzmann machine (RBM), the deep 
belief network (DBN), and MLP multivariate regression
# Regularizers, such as L1, L2, drop-out
# Normalizers
# Network customization. The internal API of Spark ANN is designed to be 
flexible and can handle different types of layers. However, only a part of the 
API is made public. We have to limit the number of public classes in order to 
make it simpler to support other languages. This forces us to use (String or 
Number) parameters instead of introducing new public classes. One option for 
specifying the architecture of an ANN is a text configuration with a 
layer-wise description. We have considered using the Caffe format for this. It 
gives the benefit of compatibility with a well-known deep learning tool and 
simplifies the support of other languages in Spark. Implementing a parser for 
a subset of the Caffe format might be the first step toward supporting general 
ANN architectures in Spark. 
# Hardware-specific optimization. One can wrap other deep learning 
implementations with this interface, allowing users to pick a particular 
back-end, e.g. Caffe or TensorFlow, along with the default one. The interface 
has to provide a few architectures for deep learning that are widely used in 
practice, such as AlexNet. The main motivation for using specialized libraries 
for deep learning would be to fully take advantage of the hardware where Spark 
runs, in particular GPUs. Having the default interface in Spark, we will need 
to wrap only a subset of functions from a given specialized library. This does 
require effort; however, it is not the same as wrapping all functions. 
Wrappers can be provided as packages without the need to pull new dependencies 
into Spark.

  was:
*Goal:* Implement various types of artificial neural networks

*Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581)
Having deep learning within Spark's ML library is a question of convenience. 
Spark has broad analytic capabilities and it is useful to have deep learning as 
one of these tools at hand. Deep learning is a model of choice for several 
important modern use-cases, and Spark ML might want

[jira] [Updated] (SPARK-5575) Artificial neural networks for MLlib deep learning

2016-07-27 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-5575:

Description: 
*Goal:* Implement various types of artificial neural networks

*Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581)
Having deep learning within Spark's ML library is a question of convenience. 
Spark has broad analytic capabilities and it is useful to have deep learning as 
one of these tools at hand. Deep learning is a model of choice for several 
important modern use cases, and Spark ML might want to cover them. After all, 
it is hard to explain why we have PCA in ML but do not provide an autoencoder. 
To summarize, Spark should have at least the most widely used deep learning 
models, such as the fully connected artificial neural network, the 
convolutional network, and the autoencoder. Advanced and experimental deep 
learning features might reside within packages or as pluggable external tools. 
These three will provide a comprehensive deep learning set for Spark ML. We 
might include recurrent networks as well.

*Requirements:*
# Extensible API compatible with Spark ML. Basic abstractions such as Neuron, 
Layer, Error, Regularization, Forward and Backpropagation, etc., should be 
implemented as traits or interfaces, so they can be easily extended or reused. 
Define the Spark ML API for deep learning. This interface is similar to the 
other analytics tools in Spark and supports ML pipelines. This makes deep 
learning easy to use and plug into analytics workloads for Spark users. 
# Efficiency. The current implementation of the multilayer perceptron in Spark 
is less than 2x slower than Caffe, both measured on CPU. The main overhead 
sources are the JVM and Spark's communication layer. For more details, please 
refer to https://github.com/avulanov/ann-benchmark. Having said that, an 
efficient implementation of deep learning in Spark should be only a few times 
slower than a specialized tool. This is very reasonable for a platform that 
does much more than deep learning, and I believe it is understood by the 
community.
# Scalability. Implement efficient distributed training. It relies heavily on 
efficient communication and scheduling mechanisms. The default implementation 
is based on Spark. More efficient implementations might rely on external 
libraries while using the same interface.

*Main features:* 
# Multilayer perceptron.
# Autoencoder 
# Convolutional neural networks. The interface has to provide a few 
architectures for deep learning that are widely used in practice, such as 
AlexNet.

*Additional features:* (lower priority)
# The internal API of Spark ANN is designed to be flexible and can handle 
different types of layers. However, only a part of the API is made public. We 
have to limit the number of public classes in order to make it simpler to 
support other languages. This forces us to use (String or Number) parameters 
instead of introducing new public classes. One option for specifying the 
architecture of an ANN is a text configuration with a layer-wise description. 
We have considered using the Caffe format for this. It gives the benefit of 
compatibility with a well-known deep learning tool and simplifies the support 
of other languages in Spark. Implementing a parser for a subset of the Caffe 
format might be the first step toward supporting general ANN architectures in 
Spark. 
# Hardware-specific optimization. One can wrap other deep learning 
implementations with this interface, allowing users to pick a particular 
back-end, e.g. Caffe or TensorFlow, along with the default one. The interface 
has to provide a few architectures for deep learning that are widely used in 
practice, such as AlexNet. The main motivation for using specialized libraries 
for deep learning would be to fully take advantage of the hardware where Spark 
runs, in particular GPUs. Having the default interface in Spark, we will need 
to wrap only a subset of functions from a given specialized library. This does 
require effort; however, it is not the same as wrapping all functions. 
Wrappers can be provided as packages without the need to pull new dependencies 
into Spark.




*Requirements:* 
1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward 
and Backpropagation, etc., should be implemented as traits or interfaces, so 
they can be easily extended or reused
2) Implement complex abstractions, such as feed-forward and recurrent networks
3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
autoencoder (sparse and denoising), stacked autoencoder, restricted Boltzmann 
machines (RBM), deep belief networks (DBN), etc.
4) Implement or reuse supporting constructs, such as classifiers, normalizers, 
poolers, etc.

  was:
*Goal:* Implement various types of artificial neural networks

*Motivation:* (from https://issues.apache.org/jira/b

[jira] [Updated] (SPARK-5575) Artificial neural networks for MLlib deep learning

2016-07-27 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-5575:

Description: 
*Goal:* Implement various types of artificial neural networks

*Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581)
Having deep learning within Spark's ML library is a question of convenience. 
Spark has broad analytic capabilities and it is useful to have deep learning as 
one of these tools at hand. Deep learning is a model of choice for several 
important modern use cases, and Spark ML might want to cover them. After all, 
it is hard to explain why we have PCA in ML but do not provide an autoencoder. 
To summarize, Spark should have at least the most widely used deep learning 
models, such as the fully connected artificial neural network, the 
convolutional network, and the autoencoder. Advanced and experimental deep 
learning features might reside within packages or as pluggable external tools. 
These three will provide a comprehensive deep learning set for Spark ML. We 
might include recurrent networks as well.

*Requirements:*
# Implement an extensible API compatible with Spark ML. Basic abstractions 
such as Neuron, Layer, Error, Regularization, Forward and Backpropagation, 
etc., should be implemented as traits or interfaces, so they can be easily 
extended or reused. 
# Performance. The current implementation of the multilayer perceptron in 
Spark is less than 2x slower than Caffe, both measured on CPU. The main 
overhead sources are the JVM and Spark's communication layer. For more 
details, please refer to https://github.com/avulanov/ann-benchmark. Having 
said that, an efficient implementation of deep learning in Spark should be 
only a few times slower than a specialized tool. This is very reasonable for 
a platform that does much more than deep learning, and I believe it is 
understood by the community.
# Implement efficient distributed training. It relies heavily on efficient 
communication and scheduling mechanisms. The default implementation is based 
on Spark. More efficient implementations might rely on external libraries 
while using the same interface.

The additional benefit of implementing deep learning for Spark is that we 
define the Spark ML API for deep learning. This interface is similar to the 
other analytics tools in Spark and supports ML pipelines. This makes deep 
learning easy to use and plug into analytics workloads for Spark users. 

One can wrap other deep learning implementations with this interface, allowing 
users to pick a particular back-end, e.g. Caffe or TensorFlow, along with the 
default one. The interface has to provide a few architectures for deep 
learning that are widely used in practice, such as AlexNet. The main 
motivation for using specialized libraries for deep learning would be to fully 
take advantage of the hardware where Spark runs, in particular GPUs. Having 
the default interface in Spark, we will need to wrap only a subset of 
functions from a given specialized library. This does require effort; however, 
it is not the same as wrapping all functions. Wrappers can be provided as 
packages without the need to pull new dependencies into Spark.




*Requirements:* 
1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward 
and Backpropagation, etc., should be implemented as traits or interfaces, so 
they can be easily extended or reused
2) Implement complex abstractions, such as feed-forward and recurrent networks
3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
autoencoder (sparse and denoising), stacked autoencoder, restricted Boltzmann 
machines (RBM), deep belief networks (DBN), etc.
4) Implement or reuse supporting constructs, such as classifiers, normalizers, 
poolers, etc.

  was:
Goal: Implement various types of artificial neural networks

Motivation: deep learning trend

Requirements: 
1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward 
and Backpropagation, etc., should be implemented as traits or interfaces, so 
they can be easily extended or reused
2) Implement complex abstractions, such as feed-forward and recurrent networks
3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
autoencoder (sparse and denoising), stacked autoencoder, restricted Boltzmann 
machines (RBM), deep belief networks (DBN), etc.
4) Implement or reuse supporting constructs, such as classifiers, normalizers, 
poolers, etc.


> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> *Goal:* Implement various types of artificial neural networks

[jira] [Commented] (SPARK-9120) Add multivariate regression (or prediction) interface

2016-07-27 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15395434#comment-15395434
 ] 

Alexander Ulanov commented on SPARK-9120:
-

Thanks for the comment; indeed, RegressionModel does not extend that trait. 
However, it is designed to handle one output variable, as mentioned in the 
description. This prevents its use in multivariate regression.
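
For illustration, the separate-interface option mentioned in the description 
below could look like this sketch (the trait name is hypothetical; 
"predictMultivariate" is the method suggested in the issue):

import org.apache.spark.ml.linalg.Vector

// A separate trait for models with vector-valued predictions, by analogy
// with "predictRaw: Vector" in ClassificationModel.
trait MultivariateRegressionModel {
  def predictMultivariate(features: Vector): Vector
}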

> Add multivariate regression (or prediction) interface
> -
>
> Key: SPARK-9120
> URL: https://issues.apache.org/jira/browse/SPARK-9120
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Alexander Ulanov
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> org.apache.spark.ml.regression.RegressionModel supports prediction only for a 
> single variable with a method "predict:Double" by extending the Predictor. 
> There is a need for multivariate prediction, at least for regression. I 
> propose to modify the "RegressionModel" interface similarly to how it is 
> done in "ClassificationModel", which supports multiclass classification and 
> has "predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" 
> should have something like "predictMultivariate:Vector".
> Update: After reading the design docs, adding "predictMultivariate" to 
> RegressionModel does not seem reasonable to me anymore. The issue is as 
> follows. RegressionModel has "predict:Double". Its "train" method uses 
> "predict:Double" for prediction, i.e. PredictionModel (and RegressionModel) 
> is hard-coded to have only one output. There exists a similar problem in 
> MLlib (https://issues.apache.org/jira/browse/SPARK-5362). 
> A possible solution might require redesigning the class hierarchy or adding 
> a separate interface that extends the model, though the latter means code 
> duplication.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9120) Add multivariate regression (or prediction) interface

2016-07-27 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-9120:

Description: 
org.apache.spark.ml.regression.RegressionModel supports prediction only for a 
single variable with a method "predict:Double" by extending the Predictor. 
There is a need for multivariate prediction, at least for regression. I 
propose to modify the "RegressionModel" interface similarly to how it is done 
in "ClassificationModel", which supports multiclass classification and has 
"predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" 
should have something like "predictMultivariate:Vector".

Update: After reading the design docs, adding "predictMultivariate" to 
RegressionModel does not seem reasonable to me anymore. The issue is as 
follows. RegressionModel has "predict:Double". Its "train" method uses 
"predict:Double" for prediction, i.e. PredictionModel (and RegressionModel) 
is hard-coded to have only one output. There exists a similar problem in 
MLlib (https://issues.apache.org/jira/browse/SPARK-5362). 

A possible solution might require redesigning the class hierarchy or adding a 
separate interface that extends the model, though the latter means code 
duplication.


  was:
org.apache.spark.ml.regression.RegressionModel supports prediction only for a 
single variable with a method "predict:Double" by extending the Predictor. 
There is a need for multivariate prediction, at least for regression. I 
propose to modify the "RegressionModel" interface similarly to how it is done 
in "ClassificationModel", which supports multiclass classification and has 
"predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" 
should have something like "predictMultivariate:Vector".

Update: After reading the design docs, adding "predictMultivariate" to 
RegressionModel does not seem reasonable to me anymore. The issue is as 
follows. RegressionModel extends PredictionModel, which has "predict:Double". 
Its "train" method uses "predict:Double" for prediction, i.e. PredictionModel 
(and RegressionModel) is hard-coded to have only one output. There exists a 
similar problem in MLlib (https://issues.apache.org/jira/browse/SPARK-5362). 

A possible solution might require redesigning the class hierarchy or adding a 
separate interface that extends the model, though the latter means code 
duplication.



> Add multivariate regression (or prediction) interface
> -
>
> Key: SPARK-9120
> URL: https://issues.apache.org/jira/browse/SPARK-9120
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Alexander Ulanov
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> org.apache.spark.ml.regression.RegressionModel supports prediction only for a 
> single variable with a method "predict:Double" by extending the Predictor. 
> There is a need for multivariate prediction, at least for regression. I 
> propose to modify the "RegressionModel" interface similarly to how it is 
> done in "ClassificationModel", which supports multiclass classification and 
> has "predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" 
> should have something like "predictMultivariate:Vector".
> Update: After reading the design docs, adding "predictMultivariate" to 
> RegressionModel does not seem reasonable to me anymore. The issue is as 
> follows. RegressionModel has "predict:Double". Its "train" method uses 
> "predict:Double" for prediction, i.e. PredictionModel (and RegressionModel) 
> is hard-coded to have only one output. There exists a similar problem in 
> MLlib (https://issues.apache.org/jira/browse/SPARK-5362). 
> A possible solution might require redesigning the class hierarchy or adding 
> a separate interface that extends the model, though the latter means code 
> duplication.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10627) Regularization for artificial neural networks

2016-07-27 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15395379#comment-15395379
 ] 

Alexander Ulanov commented on SPARK-10627:
--

[~RubenJanssen] These are major features. Could you work on something smaller 
as a first step?

> Regularization for artificial neural networks
> -
>
> Key: SPARK-10627
> URL: https://issues.apache.org/jira/browse/SPARK-10627
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Add regularization for artificial neural networks. This includes, but is 
> not limited to:
> 1)L1 and L2 regularization
> 2)Dropout http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf
> 3)Dropconnect 
> http://machinelearning.wustl.edu/mlpapers/paper_files/icml2013_wan13.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10408) Autoencoder

2016-06-27 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351922#comment-15351922
 ] 

Alexander Ulanov commented on SPARK-10408:
--

Here is the PR https://github.com/apache/spark/pull/13621

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Assignee: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1]. real in [-inf, +inf] 
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3)Denoising autoencoder 
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers
> References: 
> 1. Vincent, Pascal, et al. "Extracting and composing robust features with 
> denoising autoencoders." Proceedings of the 25th international conference on 
> Machine learning. ACM, 2008. 
> http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf
>  
> 2. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
> 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
> (2010). Stacked denoising autoencoders: Learning useful representations in a 
> deep network with a local denoising criterion. Journal of Machine Learning 
> Research, 11(3371–3408). 
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484&rep=rep1&type=pdf
> 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep 
> networks." Advances in neural information processing systems 19 (2007): 153. 
> http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15899) file scheme should be used correctly

2016-06-22 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15345611#comment-15345611
 ] 

Alexander Ulanov commented on SPARK-15899:
--

`user.dir` on Windows starts with a letter:
scala> System.getProperty("user.dir")
res0: String = C:\Program Files (x86)\scala\bin

On Linux it starts with a slash:
scala> System.getProperty("user.dir")
res0: String = /home/hduser

It seems that java.io.File could convert it to a proper URI:
Windows:
scala> new File("c:/myfile").toURI
res6: java.net.URI = file:/c:/myfile
Linux:
scala> new File("/home/myfile").toURI
res3: java.net.URI = file:/home/myfile

We can remove "file:" from 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58
 and add a toURI conversion in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L694
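
A minimal sketch of such a conversion (the helper name "toFileUri" is 
hypothetical):

import java.io.File
import java.net.URI

// Normalizes a local path to a well-formed file: URI on both Windows and
// Linux; File.toURI also percent-encodes characters such as spaces.
def toFileUri(path: String): URI = new File(path).toURI

// e.g. toFileUri("C:\\dev\\spark") => file:/C:/dev/spark on Windows,
// toFileUri("/home/hduser") => file:/home/hduser on Linux (a trailing
// slash is appended when the path is an existing directory).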





> file scheme should be used correctly
> 
>
> Key: SPARK-15899
> URL: https://issues.apache.org/jira/browse/SPARK-15899
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> [An RFC|https://www.ietf.org/rfc/rfc1738.txt] defines the file scheme as 
> {{file://host/}} or {{file:///}}. 
> [Wikipedia|https://en.wikipedia.org/wiki/File_URI_scheme]
> [Some code 
> stuffs|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58]
>  use different prefix such as {{file:}}.
> It would be good to prepare a utility method to correctly add the 
> {{file://host/}} or {{file:///}} prefix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap

2016-06-22 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15345264#comment-15345264
 ] 

Alexander Ulanov commented on SPARK-15581:
--

The current implementation of the multilayer perceptron in Spark is less than 
2x slower than Caffe, both measured on CPU. The main overhead sources are the 
JVM and Spark's communication layer. For more details, please refer to 
https://github.com/avulanov/ann-benchmark. Having said that, I expect that an 
efficient implementation of deep learning in Spark will be only a few times 
slower than a specialized tool. This is very reasonable for a platform that 
does much more than deep learning, and I believe it is understood by the 
community.

The main motivation for using specialized libraries for deep learning would 
be to fully take advantage of the hardware where Spark runs, in particular 
GPUs. Having the default interface in Spark, we will need to wrap only a 
subset of functions from a given specialized library. It does require effort; 
however, it is not the same as wrapping all functions. Wrappers can be 
provided as packages without the need to pull new dependencies into Spark.

> MLlib 2.1 Roadmap
> -
>
> Key: SPARK-15581
> URL: https://issues.apache.org/jira/browse/SPARK-15581
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>
> This is a master list for MLlib improvements we are working on for the next 
> release. Please view this as a wish list rather than a definite plan, for we 
> don't have an accurate estimate of available resources. Due to limited review 
> bandwidth, features appearing on this list will get higher priority during 
> code review. But feel free to suggest new items to the list in comments. We 
> are experimenting with this process. Your feedback would be greatly 
> appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a medium/big feature. Based on our experience, mixing the development 
> process with a big feature usually causes a long delay in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start working on some features. This is to avoid duplicate work. For 
> small features, you don't need to wait to get JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned 
> first before coding and keep the ETA updated on the JIRA. If there is no 
> activity on the JIRA page for a certain amount of time, the JIRA should be 
> released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Remember to add the `@Since("VERSION")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps to improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link 
> them properly.
> * Add a "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on 
> JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
> please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and 
> documentation if applicable.
> h1. Roadmap (*WIP*)
> This is NOT [a complete list of MLlib JIRAs for 2.1| 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority].
>  We only include umbrella JIRAs and high-level tasks.
> Major efforts in this release:
> * Feature parity for the DataFrames-based API (`spark.ml`), relative to the 
> RDD-based API
> * ML persistence
> * Python API feature parity and test coverage
> * R API expansion and improvements
> * Note about new features: As usual, we expect to expand the feature set of 
> MLlib.  However, we will prioritize API parity, bug fixes, and improvements 
> over new features.
> Note `spark.mllib` is in maintenance mode now.  We will accept bug fixes for 
> it, but new features, APIs, and improvement

[jira] [Comment Edited] (SPARK-15581) MLlib 2.1 Roadmap

2016-06-16 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325377#comment-15325377
 ] 

Alexander Ulanov edited comment on SPARK-15581 at 6/17/16 1:18 AM:
---

I would like to comment on Breeze and deep learning parts, because I have been 
implementing multilayer perceptron for Spark and have used Breeze a lot.

Breeze provides convenient abstraction for dense and sparse vectors and 
matrices and allows performing linear algebra backed by netlib-java and native 
BLAS. At the same time Spark "linalg" has its own abstractions for that. This 
might be confusing to users and developers. Obviously, Spark should have a 
single library for linear algebra. Having said that, Breeze is more convenient 
and flexible than linalg, though it misses some features such as in-place 
matrix multiplications and multidimensional arrays. Breeze cannot be removed 
from Spark because "linalg" does not have enough functionality to fully replace 
it. To address this, I have implemented a Scala tensor library on top of 
netlib-java. "linalg" can be wrapped around it. It also provides functions 
similar to Breeze and allows working with multi-dimensional arrays. 
[~mengxr], [~dbtsai] and I were planning to discuss this after the 2.0 
release, and I am posting these considerations here since you raised this 
question too. Could you take a look at this library and tell me what you 
think? The source code is here: https://github.com/avulanov/scala-tensor

With regards to deep learning, I believe that having deep learning within 
Spark's ML library is a question of convenience. Spark has broad analytic 
capabilities, and it is useful to have deep learning as one of these tools at 
hand. Deep learning is a model of choice for several important modern 
use-cases, and Spark ML might want to cover them. After all, it is hard to 
explain why we have PCA in ML but do not provide an autoencoder. To 
summarize, I think that Spark should have at least the most widely used deep 
learning models, such as the fully connected artificial neural network, 
convolutional network, and autoencoder. Advanced and experimental deep 
learning features might reside within packages or as pluggable external 
tools. Spark ML already has fully connected networks in place. Stacked 
autoencoder is implemented but not merged yet. The only thing that remains is 
the convolutional network. These three will provide a comprehensive deep 
learning set for Spark ML. We might also include recurrent networks.

The additional benefit of implementing deep learning for Spark is that we 
define the Spark ML API for deep learning. This interface is similar to the 
other analytics tools in Spark and supports ML pipelines. This makes deep 
learning easy to use and to plug into analytics workloads for Spark users. 
One can wrap other deep learning implementations with this interface, 
allowing users to pick a particular back-end, e.g. Caffe or TensorFlow, along 
with the default one. The interface has to provide a few architectures for 
deep learning that are widely used in practice, such as AlexNet.

The ultimate goal will be to provide efficient distributed training. It 
relies heavily on efficient communication and scheduling mechanisms. The 
default implementation is based on Spark. More efficient implementations 
might include some external libraries but would use the same defined 
interface.


was (Author: avulanov):
I would like to comment on Breeze and deep learning parts, because I have been 
implementing multilayer perceptron for Spark and have used Breeze a lot.

Breeze provides convenient abstraction for dense and sparse vectors and 
matrices and allows performing linear algebra backed by netlib-java and native 
BLAS. At the same time Spark "linalg" has its own abstractions for that. This 
might be confusing to users and developers. Obviously, Spark should have a 
single library for linear algebra. Having said that, Breeze is more convenient 
and flexible than linalg, though it misses some features such as in-place 
matrix multiplications and multidimensional arrays. Breeze cannot be removed 
from Spark because "linalg" does not have enough functionality to fully replace 
it. To address this, I have implemented a Scala tensor library on top of 
netlib-java. "linalg" can be wrapped around it. It also provides functions 
similar to Breeze and allows working with multi-dimensional arrays. 
[~mengxr], [~dbtsai] and I were planning to discuss this after the 2.0 
release, and I am posting these considerations here since you raised this 
question too. Could you take a look at this library and tell me what you 
think? The source code is here: https://github.com/avulanov/scala-tensor

With regards to deep learning, I believe that having deep learning within 
Spark's ML library is a question of convenience. Spark has broad analyt

[jira] [Comment Edited] (SPARK-15581) MLlib 2.1 Roadmap

2016-06-16 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325377#comment-15325377
 ] 

Alexander Ulanov edited comment on SPARK-15581 at 6/17/16 1:18 AM:
---

I would like to comment on Breeze and deep learning parts, because I have been 
implementing multilayer perceptron for Spark and have used Breeze a lot.

Breeze provides convenient abstraction for dense and sparse vectors and 
matrices and allows performing linear algebra backed by netlib-java and native 
BLAS. At the same time Spark "linalg" has its own abstractions for that. This 
might be confusing to users and developers. Obviously, Spark should have a 
single library for linear algebra. Having said that, Breeze is more convenient 
and flexible than linalg, though it misses some features such as in-place 
matrix multiplications and multidimensional arrays. Breeze cannot be removed 
from Spark because "linalg" does not have enough functionality to fully replace 
it. To address this, I have implemented a Scala tensor library on top of 
netlib-java. "linalg" can be wrapped around it. It also provides functions 
similar to Breeze and allows working with multi-dimensional arrays. 
[~mengxr], [~dbtsai] and I were planning to discuss this after the 2.0 
release, and I am posting these considerations here since you raised this 
question too. Could you take a look at this library and tell me what you 
think? The source code is here: https://github.com/avulanov/scala-tensor

With regards to deep learning, I believe that having deep learning within 
Spark's ML library is a question of convenience. Spark has broad analytic 
capabilities, and it is useful to have deep learning as one of these tools at 
hand. Deep learning is a model of choice for several important modern 
use-cases, and Spark ML might want to cover them. After all, it is hard to 
explain why we have PCA in ML but do not provide an autoencoder. To 
summarize, I think that Spark should have at least the most widely used deep 
learning models, such as the fully connected artificial neural network, 
convolutional network, and autoencoder. Advanced and experimental deep 
learning features might reside within packages or as pluggable external 
tools. Spark ML already has fully connected networks in place. Stacked 
autoencoder is implemented but not merged yet. The only thing that remains is 
the convolutional network. These three will provide a comprehensive deep 
learning set for Spark ML. We might also include recurrent networks.

Update (6/16) based on our conversation with Ben Lorica:

The additional benefit of implementing deep learning for Spark is that we 
define the Spark ML API for deep learning. This interface is similar to the 
other analytics tools in Spark and supports ML pipelines. This makes deep 
learning easy to use and to plug into analytics workloads for Spark users. 
One can wrap other deep learning implementations with this interface, 
allowing users to pick a particular back-end, e.g. Caffe or TensorFlow, along 
with the default one. The interface has to provide a few architectures for 
deep learning that are widely used in practice, such as AlexNet.

The ultimate goal will be to provide efficient distributed training. It 
relies heavily on efficient communication and scheduling mechanisms. The 
default implementation is based on Spark. More efficient implementations 
might include some external libraries but would use the same defined 
interface.


was (Author: avulanov):
I would like to comment on Breeze and deep learning parts, because I have been 
implementing multilayer perceptron for Spark and have used Breeze a lot.

Breeze provides convenient abstraction for dense and sparse vectors and 
matrices and allows performing linear algebra backed by netlib-java and native 
BLAS. At the same time Spark "linalg" has its own abstractions for that. This 
might be confusing to users and developers. Obviously, Spark should have a 
single library for linear algebra. Having said that, Breeze is more convenient 
and flexible than linalg, though it misses some features such as in-place 
matrix multiplications and multidimensional arrays. Breeze cannot be removed 
from Spark because "linalg" does not have enough functionality to fully replace 
it. To address this, I have implemented a Scala tensor library on top of 
netlib-java. "linalg" can be wrapped around it. It also provides functions 
similar to Breeze and allows working with multi-dimensional arrays. 
[~mengxr], [~dbtsai] and I were planning to discuss this after the 2.0 
release, and I am posting these considerations here since you raised this 
question too. Could you take a look at this library and tell me what you 
think? The source code is here: https://github.com/avulanov/scala-tensor

With regards to deep learning, I believe that having deep learning within 
Spark's ML li

[jira] [Commented] (SPARK-15893) spark.createDataFrame raises an exception in Spark 2.0 tests on Windows

2016-06-14 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15330065#comment-15330065
 ] 

Alexander Ulanov commented on SPARK-15893:
--

Actually, the code that I am trying to run does not have explicit paths in 
it. These are Spark unit tests that ran properly on 1.6 (and earlier 
versions) on Windows. It seems that a recent change in 2.0 broke that. Could 
you propose a way to debug this?

> spark.createDataFrame raises an exception in Spark 2.0 tests on Windows
> ---
>
> Key: SPARK-15893
> URL: https://issues.apache.org/jira/browse/SPARK-15893
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.0.0
>Reporter: Alexander Ulanov
>
> spark.createDataFrame raises an exception in Spark 2.0 tests on Windows
> For example, LogisticRegressionSuite fails at Line 46:
> Exception encountered when invoking run on a nested suite - 
> java.net.URISyntaxException: Relative path in absolute URI: 
> file:C:/dev/spark/external/flume-assembly/spark-warehouse
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: 
> file:C:/dev/spark/external/flume-assembly/spark-warehouse
>   at org.apache.hadoop.fs.Path.initialize(Path.java:206)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:172)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:109)
> Another example, DataFrameSuite raises:
> java.net.URISyntaxException: Relative path in absolute URI: 
> file:C:/dev/spark/external/flume-assembly/spark-warehouse
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: 
> file:C:/dev/spark/external/flume-assembly/spark-warehouse
>   at org.apache.hadoop.fs.Path.initialize(Path.java:206)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:172)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15893) spark.createDataFrame raises an exception in Spark 2.0 tests on Windows

2016-06-10 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-15893:


 Summary: spark.createDataFrame raises an exception in Spark 2.0 
tests on Windows
 Key: SPARK-15893
 URL: https://issues.apache.org/jira/browse/SPARK-15893
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.0.0
Reporter: Alexander Ulanov


spark.createDataFrame raises an exception in Spark 2.0 tests on Windows

For example, LogisticRegressionSuite fails at Line 46:
Exception encountered when invoking run on a nested suite - 
java.net.URISyntaxException: Relative path in absolute URI: 
file:C:/dev/spark/external/flume-assembly/spark-warehouse
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path 
in absolute URI: file:C:/dev/spark/external/flume-assembly/spark-warehouse
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:172)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:109)


Another example, DataFrameSuite raises:
java.net.URISyntaxException: Relative path in absolute URI: 
file:C:/dev/spark/external/flume-assembly/spark-warehouse
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path 
in absolute URI: file:C:/dev/spark/external/flume-assembly/spark-warehouse
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:172)
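
The failure can be reproduced with the JDK alone: Hadoop's Path builds a 
hierarchical URI from the raw string, and on Windows the path component 
starts with a drive letter rather than "/". A minimal sketch of the mechanism 
(assuming nothing beyond java.io and java.net):

import java.io.File
import java.net.URI

// Roughly what org.apache.hadoop.fs.Path does internally; a path component
// that does not start with '/' makes this constructor throw
// java.net.URISyntaxException: Relative path in absolute URI.
// new URI("file", null, "C:/dev/spark/spark-warehouse", null)

// java.io.File.toURI produces a well-formed URI instead:
val warehouseUri: URI = new File("C:/dev/spark/spark-warehouse").toURI
// => file:/C:/dev/spark/spark-warehouse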





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap

2016-06-10 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325377#comment-15325377
 ] 

Alexander Ulanov commented on SPARK-15581:
--

I would like to comment on Breeze and deep learning parts, because I have been 
implementing multilayer perceptron for Spark and have used Breeze a lot.

Breeze provides convenient abstraction for dense and sparse vectors and 
matrices and allows performing linear algebra backed by netlib-java and native 
BLAS. At the same time Spark "linalg" has its own abstractions for that. This 
might be confusing to users and developers. Obviously, Spark should have a 
single library for linear algebra. Having said that, Breeze is more convenient 
and flexible than linalg, though it misses some features such as in-place 
matrix multiplications and multidimensional arrays. Breeze cannot be removed 
from Spark because "linalg" does not have enough functionality to fully replace 
it. To address this, I have implemented a Scala tensor library on top of 
netlib-java. "linalg" can be wrapped around it. It also provides functions 
similar to Breeze and allows working with multi-dimensional arrays. 
[~mengxr], [~dbtsai] and I were planning to discuss this after the 2.0 
release, and I am posting these considerations here since you raised this 
question too. Could you take a look at this library and tell me what you 
think? The source code is here: https://github.com/avulanov/scala-tensor

With regards to deep learning, I believe that having deep learning within 
Spark's ML library is a question of convenience. Spark has broad analytic 
capabilities, and it is useful to have deep learning as one of these tools at 
hand. Deep learning is a model of choice for several important modern 
use-cases, and Spark ML might want to cover them. After all, it is hard to 
explain why we have PCA in ML but do not provide an autoencoder. To 
summarize, I think that Spark should have at least the most widely used deep 
learning models, such as the fully connected artificial neural network, 
convolutional network, and autoencoder. Advanced and experimental deep 
learning features might reside within packages or as pluggable external 
tools. Spark ML already has fully connected networks in place. Stacked 
autoencoder is implemented but not merged yet. The only thing that remains is 
the convolutional network. These three will provide a comprehensive deep 
learning set for Spark ML.

> MLlib 2.1 Roadmap
> -
>
> Key: SPARK-15581
> URL: https://issues.apache.org/jira/browse/SPARK-15581
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>
> This is a master list for MLlib improvements we are working on for the next 
> release. Please view this as a wish list rather than a definite plan, for we 
> don't have an accurate estimate of available resources. Due to limited review 
> bandwidth, features appearing on this list will get higher priority during 
> code review. But feel free to suggest new items to the list in comments. We 
> are experimenting with this process. Your feedback would be greatly 
> appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a medium/big feature. Based on our experience, mixing the development 
> process with a big feature usually causes a long delay in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start working on some features. This is to avoid duplicate work. For 
> small features, you don't need to wait to get JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned 
> first before coding and keep the ETA updated on the JIRA. If there is no 
> activity on the JIRA page for a certain amount of time, the JIRA should be 
> released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Remember to add the `@Since("VERSION")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps to improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link 
> them properly.
> * Add a "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on

[jira] [Commented] (SPARK-15851) Spark 2.0 does not compile in Windows 7

2016-06-09 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323704#comment-15323704
 ] 

Alexander Ulanov commented on SPARK-15851:
--

Sorry for the confusion; I meant the shell, that is, "/bin/sh". A Windows 
version of it comes with Git.

> Spark 2.0 does not compile in Windows 7
> ---
>
> Key: SPARK-15851
> URL: https://issues.apache.org/jira/browse/SPARK-15851
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
> Environment: Windows 7
>Reporter: Alexander Ulanov
>
> Spark does not compile in Windows 7.
> "mvn compile" fails on spark-core because it tries to execute the bash 
> script spark-build-info.
> Workaround:
> 1)Install win-bash and put it in the path
> 2)Change line 350 of core/pom.xml
> Error trace:
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project 
> spark-core_2.11: An Ant BuildException has occured: Execute failed: 
> java.io.IOException: Cannot run program 
> "C:\dev\spark\core\..\build\spark-build-info" (in directory 
> "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 
> application
> [ERROR] around Ant part ...<exec 
> executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in 
> C:\dev\spark\core\target\antrun\build-main.xml



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15851) Spark 2.0 does not compile in Windows 7

2016-06-09 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323638#comment-15323638
 ] 

Alexander Ulanov commented on SPARK-15851:
--

I can do that. However, it seems that "spark-build-info" can be rewritten as 
a shell script. This would remove the need for Windows users who compile 
Spark with Maven to install bash. What do you think?

> Spark 2.0 does not compile in Windows 7
> ---
>
> Key: SPARK-15851
> URL: https://issues.apache.org/jira/browse/SPARK-15851
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
> Environment: Windows 7
>Reporter: Alexander Ulanov
>
> Spark does not compile in Windows 7.
> "mvn compile" fails on spark-core because it tries to execute the bash 
> script spark-build-info.
> Workaround:
> 1)Install win-bash and put it in the path
> 2)Change line 350 of core/pom.xml
> Error trace:
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project 
> spark-core_2.11: An Ant BuildException has occured: Execute failed: 
> java.io.IOException: Cannot run program 
> "C:\dev\spark\core\..\build\spark-build-info" (in directory 
> "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 
> application
> [ERROR] around Ant part ...<exec 
> executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in 
> C:\dev\spark\core\target\antrun\build-main.xml



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15851) Spark 2.0 does not compile in Windows 7

2016-06-09 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323624#comment-15323624
 ] 

Alexander Ulanov commented on SPARK-15851:
--

This does not work because Ant uses a Java Process to run the executable, 
which returns "not a valid Win32 application". In order to run it, one needs 
to run "bash" and provide the bash file as a parameter. This is the approach 
I proposed as a workaround. For more details, please refer to: 
http://stackoverflow.com/questions/20883212/how-can-i-use-ant-exec-to-execute-commands-on-linux
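
For illustration, the same trick expressed in Scala (the script path is 
hypothetical):

import scala.sys.process._

// Invoke bash explicitly and pass the script as its argument, instead of
// executing the script directly (which Windows cannot do):
val exitCode: Int = Seq("bash", "C:/dev/spark/build/spark-build-info").!

The Ant work-around is equivalent: the executable becomes "bash" and the 
script path becomes its first argument.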

> Spark 2.0 does not compile in Windows 7
> ---
>
> Key: SPARK-15851
> URL: https://issues.apache.org/jira/browse/SPARK-15851
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
> Environment: Windows 7
>Reporter: Alexander Ulanov
>
> Spark does not compile in Windows 7.
> "mvn compile" fails on spark-core because it tries to execute the bash 
> script spark-build-info.
> Workaround:
> 1)Install win-bash and put it in the path
> 2)Change line 350 of core/pom.xml
> Error trace:
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project 
> spark-core_2.11: An Ant BuildException has occured: Execute failed: 
> java.io.IOException: Cannot run program 
> "C:\dev\spark\core\..\build\spark-build-info" (in directory 
> "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 
> application
> [ERROR] around Ant part ...<exec 
> executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in 
> C:\dev\spark\core\target\antrun\build-main.xml



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15851) Spark 2.0 does not compile in Windows 7

2016-06-09 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-15851:
-
Fix Version/s: 2.0.0

> Spark 2.0 does not compile in Windows 7
> ---
>
> Key: SPARK-15851
> URL: https://issues.apache.org/jira/browse/SPARK-15851
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
> Environment: Windows 7
>Reporter: Alexander Ulanov
>
> Spark does not compile in Windows 7.
> "mvn compile" fails on spark-core because it tries to execute the bash 
> script spark-build-info.
> Workaround:
> 1)Install win-bash and put it in the path
> 2)Change line 350 of core/pom.xml
> Error trace:
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project 
> spark-core_2.11: An Ant BuildException has occured: Execute failed: 
> java.io.IOException: Cannot run program 
> "C:\dev\spark\core\..\build\spark-build-info" (in directory 
> "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 
> application
> [ERROR] around Ant part ...<exec 
> executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in 
> C:\dev\spark\core\target\antrun\build-main.xml



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15851) Spark 2.0 does not compile in Windows 7

2016-06-09 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-15851:
-
Target Version/s: 2.0.0
   Fix Version/s: (was: 2.0.0)

> Spark 2.0 does not compile in Windows 7
> ---
>
> Key: SPARK-15851
> URL: https://issues.apache.org/jira/browse/SPARK-15851
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
> Environment: Windows 7
>Reporter: Alexander Ulanov
>
> Spark does not compile in Windows 7.
> "mvn compile" fails on spark-core because it tries to execute the bash 
> script spark-build-info.
> Workaround:
> 1)Install win-bash and put it in the path
> 2)Change line 350 of core/pom.xml
> Error trace:
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project 
> spark-core_2.11: An Ant BuildException has occured: Execute failed: 
> java.io.IOException: Cannot run program 
> "C:\dev\spark\core\..\build\spark-build-info" (in directory 
> "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 
> application
> [ERROR] around Ant part ...<exec 
> executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in 
> C:\dev\spark\core\target\antrun\build-main.xml



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15851) Spark 2.0 does not compile in Windows 7

2016-06-09 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-15851:


 Summary: Spark 2.0 does not compile in Windows 7
 Key: SPARK-15851
 URL: https://issues.apache.org/jira/browse/SPARK-15851
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.0.0
 Environment: Windows 7
Reporter: Alexander Ulanov


Spark does not compile in Windows 7.
"mvn compile" fails on spark-core because it tries to execute the bash script 
spark-build-info.

Workaround:
1)Install win-bash and put it in the path
2)Change line 350 of core/pom.xml

Error trace:
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project 
spark-core_2.11: An Ant BuildException has occured: Execute failed: 
java.io.IOException: Cannot run program 
"C:\dev\spark\core\..\build\spark-build-info" (in directory 
"C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 
application
[ERROR] around Ant part ...<exec 
executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in 
C:\dev\spark\core\target\antrun\build-main.xml




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9273) Add Convolutional Neural network to Spark MLlib

2016-02-17 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15150960#comment-15150960
 ] 

Alexander Ulanov commented on SPARK-9273:
-

[~srowen] Do you mean that CNN will never be merged into Spark ML?

> Add Convolutional Neural network to Spark MLlib
> ---
>
> Key: SPARK-9273
> URL: https://issues.apache.org/jira/browse/SPARK-9273
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: yuhao yang
>Assignee: yuhao yang
>
> Add Convolutional Neural network to Spark MLlib



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9273) Add Convolutional Neural network to Spark MLlib

2016-02-16 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149139#comment-15149139
 ] 

Alexander Ulanov commented on SPARK-9273:
-

Hi [~gsateesh110],

Besides the one mentioned by Yuhao, there is SparkNet, which allows using 
Caffe. 

In the future, I plan to switch the present neural network implementation in 
Spark to tensors, and probably implement CNN, which is easier with tensors: 
https://github.com/avulanov/spark/tree/mlp-tensor
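
To illustrate why tensors help, a toy "valid" 2D convolution over plain 
nested arrays looks like the sketch below (illustrative only, not the planned 
implementation):

def conv2d(image: Array[Array[Double]],
           kernel: Array[Array[Double]]): Array[Array[Double]] = {
  val (h, w)   = (image.length, image(0).length)
  val (kh, kw) = (kernel.length, kernel(0).length)
  // Each output cell is the dot product of the kernel with the
  // corresponding image patch (no padding).
  Array.tabulate(h - kh + 1, w - kw + 1) { (i, j) =>
    var sum = 0.0
    for (di <- 0 until kh; dj <- 0 until kw)
      sum += image(i + di)(j + dj) * kernel(di)(dj)
    sum
  }
}

With a tensor type, the same operation becomes a single batched primitive 
instead of nested index arithmetic.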

Best regards, Alexander

> Add Convolutional Neural network to Spark MLlib
> ---
>
> Key: SPARK-9273
> URL: https://issues.apache.org/jira/browse/SPARK-9273
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: yuhao yang
>Assignee: yuhao yang
>
> Add Convolutional Neural network to Spark MLlib



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.

2016-01-23 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15114080#comment-15114080
 ] 

Alexander Ulanov edited comment on SPARK-10528 at 1/24/16 1:30 AM:
---

Hi! I'm getting the same problem on Windows 7 x64 with Spark 1.6.0. It worked 
with earlier versions of Spark. Changing permissions does not help. Spark 
eventually launches with that error and does not provide a sqlContext. I've 
checked Spark 1.4.1 and it worked fine.

Is there a workaround? 


was (Author: avulanov):
Hi! I'm getting the same problem on Windows 7 x64 with Spark 1.6.0. It worked 
with earlier versions of Spark. Changing permissions does not help. Is there 
a workaround? 

> spark-shell throws java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable.
> --
>
> Key: SPARK-10528
> URL: https://issues.apache.org/jira/browse/SPARK-10528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.0
> Environment: Windows 7 x64
>Reporter: Aliaksei Belablotski
>Priority: Minor
>
> Starting spark-shell throws
> java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.

2016-01-23 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15114080#comment-15114080
 ] 

Alexander Ulanov commented on SPARK-10528:
--

Hi! I'm getting the same problem on Windows 7 x64 with Spark 1.6.0. It worked 
with earlier versions of Spark. Changing permissions does not help. Is there 
a workaround? 

> spark-shell throws java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable.
> --
>
> Key: SPARK-10528
> URL: https://issues.apache.org/jira/browse/SPARK-10528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.0
> Environment: Windows 7 x64
>Reporter: Aliaksei Belablotski
>Priority: Minor
>
> Starting spark-shell throws
> java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10408) Autoencoder

2015-11-13 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-10408:
-
Description: 
Goal: Implement various types of autoencoders 
Requirements:
1)Basic (deep) autoencoder that supports different types of inputs: binary, 
real in [0..1], real in [-inf, +inf] 
2)Sparse autoencoder, i.e. L1 regularization. It should be added as a feature 
to the MLP and then used here 
3)Denoising autoencoder 
4)Stacked autoencoder for pre-training of deep networks. It should support 
arbitrary network layers


References: 
1. Vincent, Pascal, et al. "Extracting and composing robust features with 
denoising autoencoders." Proceedings of the 25th international conference on 
Machine learning. ACM, 2008. 
http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf
 
2. http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
(2010). Stacked denoising autoencoders: Learning useful representations in a 
deep network with a local denoising criterion. Journal of Machine Learning 
Research, 11(3371–3408). 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484&rep=rep1&type=pdf
4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep networks." 
Advances in neural information processing systems 19 (2007): 153. 
http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf

  was:
Goal: Implement various types of autoencoders 
Requirements:
1)Basic (deep) autoencoder that supports different types of inputs: binary, 
real in [0..1], real in [-inf, +inf] 
2)Sparse autoencoder, i.e. L1 regularization. It should be added as a feature 
to the MLP and then used here 
3)Denoising autoencoder 
4)Stacked autoencoder for pre-training of deep networks. It should support 
arbitrary network layers


References: 
1, 2. 
http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
(2010). Stacked denoising autoencoders: Learning useful representations in a 
deep network with a local denoising criterion. Journal of Machine Learning 
Research, 11(3371–3408). 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484&rep=rep1&type=pdf
4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep networks." 
Advances in neural information processing systems 19 (2007): 153. 
http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf


> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1], real in [-inf, +inf] 
> 2)Sparse autoencoder, i.e. L1 regularization. It should be added as a 
> feature to the MLP and then used here 
> 3)Denoising autoencoder 
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers
> References: 
> 1. Vincent, Pascal, et al. "Extracting and composing robust features with 
> denoising autoencoders." Proceedings of the 25th international conference on 
> Machine learning. ACM, 2008. 
> http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf
>  
> 2. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
> 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
> (2010). Stacked denoising autoencoders: Learning useful representations in a 
> deep network with a local denoising criterion. Journal of Machine Learning 
> Research, 11(3371–3408). 
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484&rep=rep1&type=pdf
> 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep 
> networks." Advances in neural information processing systems 19 (2007): 153. 
> http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10408) Autoencoder

2015-11-11 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-10408:
-
Description: 
Goal: Implement various types of autoencoders 
Requirements:
1)Basic (deep) autoencoder that supports different types of inputs: binary, 
real in [0..1], real in [-inf, +inf] 
2)Sparse autoencoder, i.e. L1 regularization. It should be added as a feature 
to the MLP and then used here 
3)Denoising autoencoder 
4)Stacked autoencoder for pre-training of deep networks. It should support 
arbitrary network layers


References: 
1, 2. 
http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
(2010). Stacked denoising autoencoders: Learning useful representations in a 
deep network with a local denoising criterion. Journal of Machine Learning 
Research, 11(3371–3408). 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484&rep=rep1&type=pdf
4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep networks." 
Advances in neural information processing systems 19 (2007): 153. 
http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf

  was:
Goal: Implement various types of autoencoders 
Requirements:
1)Basic (deep) autoencoder that supports different types of inputs: binary, 
real in [0..1], real in [-inf, +inf] 
2)Sparse autoencoder, i.e. L1 regularization. It should be added as a feature 
to the MLP and then used here 
3)Denoising autoencoder 
4)Stacked autoencoder for pre-training of deep networks. It should support 
arbitrary network layers: 

References: 
1-3. http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf
4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf


> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1], real in [-inf, +inf] 
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3)Denoising autoencoder 
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers
> References: 
> 1, 2. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
> 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
> (2010). Stacked denoising autoencoders: Learning useful representations in a 
> deep network with a local denoising criterion. Journal of Machine Learning 
> Research, 11:3371–3408. 
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484&rep=rep1&type=pdf
> 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep 
> networks." Advances in neural information processing systems 19 (2007): 153. 
> http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning

2015-11-09 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14997300#comment-14997300
 ] 

Alexander Ulanov commented on SPARK-5575:
-

Hi Narine,

Thank you for your observation. It seems that such information is useful to 
know. Indeed, LBFGS in Spark does not print any information during 
execution. The ANN implementation uses Spark's LBFGS. You might want to add the needed output to 
the LBFGS code 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala#L185.
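
For illustration, a self-contained sketch of the kind of per-iteration output one 
could add there: breeze's LBFGS exposes an iterator of optimizer states (this is 
what runLBFGS consumes), so the loss can be printed each iteration. The toy 
quadratic cost below is a stand-in for Spark's distributed CostFun.

```
import breeze.linalg.{DenseVector => BDV}
import breeze.optimize.{DiffFunction, LBFGS}

// Stand-in for Spark's CostFun: f(w) = ||w - 3||^2, gradient 2 (w - 3).
val costFun = new DiffFunction[BDV[Double]] {
  def calculate(w: BDV[Double]): (Double, BDV[Double]) = {
    val diff = w - 3.0
    (diff dot diff, diff * 2.0)
  }
}
val lbfgs = new LBFGS[BDV[Double]](maxIter = 50, m = 10, tolerance = 1e-9)
val states = lbfgs.iterations(costFun, BDV.zeros[Double](5))
while (states.hasNext) {
  val state = states.next()
  // The kind of progress line one could emit from LBFGS.scala:
  println(s"LBFGS iteration ${state.iter}: loss = ${state.value}")
}
```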
 

Best regards, Alexander 


> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> Goal: Implement various types of artificial neural networks
> Motivation: deep learning trend
> Requirements: 
> 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward 
> and Backpropagation etc. should be implemented as traits or interfaces, so 
> they can be easily extended or reused
> 2) Implement complex abstractions, such as feed forward and recurrent networks
> 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
> autoencoder (sparse and denoising), stacked autoencoder, restricted  
> boltzmann machines (RBM), deep belief networks (DBN) etc.
> 4) Implement or reuse supporting constructs, such as classifiers, normalizers, 
> poolers, etc.
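
To make requirement 1) concrete, the abstractions might be sketched along these 
lines (illustrative only; the internal ANN API that later landed in spark.ml 
differs in detail):

```
import breeze.linalg.{DenseMatrix => BDM}

// Batch-oriented layer: one example per column of the input matrix.
trait Layer {
  def forward(input: BDM[Double]): BDM[Double]
  // Propagate the error and return the delta for the previous layer.
  def backward(input: BDM[Double], delta: BDM[Double]): BDM[Double]
}

trait LossFunction {
  // Returns the loss value and the delta at the output layer.
  def loss(output: BDM[Double], target: BDM[Double]): (Double, BDM[Double])
}
```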



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9273) Add Convolutional Neural network to Spark MLlib

2015-11-05 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992447#comment-14992447
 ] 

Alexander Ulanov edited comment on SPARK-9273 at 11/5/15 8:50 PM:
--

Hi Yuhao. Sounds good! Thanks for refactoring the code to support the ANN internal 
interface. Also, I was able to run your example. It shows increasing accuracy 
while training; however, it is not very fast. Does it work with LBFGS?

There is a good explanation of how to use matrix multiplication for convolution: 
http://cs231n.github.io/convolutional-networks/. Basically, one needs to roll 
all image patches (the regions that will be convolved) into vectors and stack 
them together in a matrix. The weights of the convolutional layer should also be 
rolled into vectors and stacked. Multiplying the two resulting matrices gives 
the convolution result, which can be unrolled into a 3D matrix, although that 
would not be necessary for this implementation. We can discuss it offline if you wish.
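
A small sketch of that unrolling (the im2col trick), assuming a single channel, 
stride 1, and no padding; names and sizes are illustrative:

```
import breeze.linalg.{DenseMatrix => BDM}

// Unroll every k x k patch of `image` into one column of the result.
def im2col(image: BDM[Double], k: Int): BDM[Double] = {
  val outRows = image.rows - k + 1
  val outCols = image.cols - k + 1
  val patches = BDM.zeros[Double](k * k, outRows * outCols)
  for (i <- 0 until outRows; j <- 0 until outCols; di <- 0 until k; dj <- 0 until k) {
    // Column-major layout within each patch; filters must be rolled the same way.
    patches(dj * k + di, i * outCols + j) = image(i + di, j + dj)
  }
  patches
}

val image = BDM.rand(8, 8)
// 4 filters of size 3 x 3, each rolled (column-major) into a row of length 9.
val filters = BDM.rand(4, 9)
// One BLAS GEMM yields all 4 feature maps: a 4 x 36 matrix, one map per row.
val featureMaps = filters * im2col(image, 3)
```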

Besides the optimization, there are a few more things to be done. These include 
unit tests for the new layers, a gradient test, representing the pooling layer as 
a functional layer, and a performance comparison with the other implementation of 
CNN. You can take a look at the tests I've added for the MLP 
https://issues.apache.org/jira/browse/SPARK-11262 and the MLP benchmark at 
https://github.com/avulanov/ann-benchmark. A separate branch/repo for these 
developments might be a good thing to do. I'll be happy to help you with this.


was (Author: avulanov):
Hi Yuhao. Sounds good! Thanks for refactoring the code to support the ANN internal 
interface. Also, I was able to run your example. It shows increasing accuracy 
while training; however, it is not very fast. 

There is a good explanation of how to use matrix multiplication for convolution: 
http://cs231n.github.io/convolutional-networks/. Basically, one needs to roll 
all image patches (the regions that will be convolved) into vectors and stack 
them together in a matrix. The weights of the convolutional layer should also be 
rolled into vectors and stacked. Multiplying the two resulting matrices gives 
the convolution result, which can be unrolled into a 3D matrix, although that 
would not be necessary for this implementation. We can discuss it offline if you wish.

Besides the optimization, there are a few more things to be done. These include 
unit tests for the new layers, a gradient test, representing the pooling layer as 
a functional layer, and a performance comparison with the other implementation of 
CNN. You can take a look at the tests I've added for the MLP 
https://issues.apache.org/jira/browse/SPARK-11262 and the MLP benchmark at 
https://github.com/avulanov/ann-benchmark. A separate branch/repo for these 
developments might be a good thing to do. I'll be happy to help you with this.

> Add Convolutional Neural network to Spark MLlib
> ---
>
> Key: SPARK-9273
> URL: https://issues.apache.org/jira/browse/SPARK-9273
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: yuhao yang
>
> Add Convolutional Neural network to Spark MLlib



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9273) Add Convolutional Neural network to Spark MLlib

2015-11-05 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992447#comment-14992447
 ] 

Alexander Ulanov commented on SPARK-9273:
-

Hi Yuhao. Sounds good! Thanks for refactoring the code to support the ANN internal 
interface. Also, I was able to run your example. It shows increasing accuracy 
while training; however, it is not very fast. 

There is a good explanation of how to use matrix multiplication for convolution: 
http://cs231n.github.io/convolutional-networks/. Basically, one needs to roll 
all image patches (the regions that will be convolved) into vectors and stack 
them together in a matrix. The weights of the convolutional layer should also be 
rolled into vectors and stacked. Multiplying the two resulting matrices gives 
the convolution result, which can be unrolled into a 3D matrix, although that 
would not be necessary for this implementation. We can discuss it offline if you wish.

Besides the optimization, there are a few more things to be done. These include 
unit tests for the new layers, a gradient test, representing the pooling layer as 
a functional layer, and a performance comparison with the other implementation of 
CNN. You can take a look at the tests I've added for the MLP 
https://issues.apache.org/jira/browse/SPARK-11262 and the MLP benchmark at 
https://github.com/avulanov/ann-benchmark. A separate branch/repo for these 
developments might be a good thing to do. I'll be happy to help you with this.

> Add Convolutional Neural network to Spark MLlib
> ---
>
> Key: SPARK-9273
> URL: https://issues.apache.org/jira/browse/SPARK-9273
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: yuhao yang
>
> Add Convolutional Neural network to Spark MLlib



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning

2015-11-03 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14988705#comment-14988705
 ] 

Alexander Ulanov commented on SPARK-5575:
-

Hi Disha,

RNN is a major feature. I suggest starting with a smaller contribution. Spark 
has contained an implementation of the multilayer perceptron since version 1.5. 
New features are supposed to reuse its code and follow the internal API that it 
has introduced. 

> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> Goal: Implement various types of artificial neural networks
> Motivation: deep learning trend
> Requirements: 
> 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward 
> and Backpropagation etc. should be implemented as traits or interfaces, so 
> they can be easily extended or reused
> 2) Implement complex abstractions, such as feed forward and recurrent networks
> 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
> autoencoder (sparse and denoising), stacked autoencoder, restricted  
> boltzmann machines (RBM), deep belief networks (DBN) etc.
> 4) Implement or reuse supporting constructs, such as classifiers, normalizers, 
> poolers, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11262) Unit test for gradient, loss layers, memory management for multilayer perceptron

2015-10-22 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-11262:


 Summary: Unit test for gradient, loss layers, memory management 
for multilayer perceptron
 Key: SPARK-11262
 URL: https://issues.apache.org/jira/browse/SPARK-11262
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.5.1
Reporter: Alexander Ulanov
 Fix For: 1.5.1


Multi-layer perceptron requires more rigorous tests and refactoring of layer 
interfaces to accommodate development of new features.
1)Implement unit test for gradient and loss
2)Refactor the internal layer interface to extract "loss function" 
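
For context on 1), the standard test compares the analytic gradient against 
central finite differences; a minimal sketch, where the `costAndGrad` closure 
stands in for the network's loss/gradient computation:

```
import breeze.linalg.{DenseVector => BDV}

def checkGradient(
    costAndGrad: BDV[Double] => (Double, BDV[Double]),
    w: BDV[Double],
    eps: Double = 1e-6,
    tol: Double = 1e-4): Boolean = {
  val analytic = costAndGrad(w)._2
  (0 until w.length).forall { i =>
    val wPlus = w.copy
    wPlus(i) = w(i) + eps
    val wMinus = w.copy
    wMinus(i) = w(i) - eps
    // Central difference approximation of df/dw_i.
    val numeric = (costAndGrad(wPlus)._1 - costAndGrad(wMinus)._1) / (2 * eps)
    math.abs(numeric - analytic(i)) <= tol * math.max(1.0, math.abs(numeric))
  }
}
```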



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning

2015-10-05 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944173#comment-14944173
 ] 

Alexander Ulanov commented on SPARK-5575:
-

Weide,

These are major features and some of them are under development. You can check 
their status in the linked issues. Could you work on something smaller as a 
first step? [~mengxr], do you have any suggestions?

> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> Goal: Implement various types of artificial neural networks
> Motivation: deep learning trend
> Requirements: 
> 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward 
> and Backpropagation etc. should be implemented as traits or interfaces, so 
> they can be easily extended or reused
> 2) Implement complex abstractions, such as feed forward and recurrent networks
> 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
> autoencoder (sparse and denoising), stacked autoencoder, restricted  
> boltzmann machines (RBM), deep belief networks (DBN) etc.
> 4) Implement or reuse supporting constructs, such as classifiers, normalizers, 
> poolers, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning

2015-10-01 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940446#comment-14940446
 ] 

Alexander Ulanov commented on SPARK-5575:
-

Hi, Weide,

Sounds good! What kind of feature are you planning to add?

> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> Goal: Implement various types of artificial neural networks
> Motivation: deep learning trend
> Requirements: 
> 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward 
> and Backpropagation etc. should be implemented as traits or interfaces, so 
> they can be easily extended or reused
> 2) Implement complex abstractions, such as feed forward and recurrent networks
> 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
> autoencoder (sparse and denoising), stacked autoencoder, restricted  
> boltzmann machines (RBM), deep belief networks (DBN) etc.
> 4) Implement or reuse supporting constructs, such as classifiers, normalizers, 
> poolers, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10627) Regularization for artificial neural networks

2015-09-15 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14746636#comment-14746636
 ] 

Alexander Ulanov commented on SPARK-10627:
--

Dropout WIP refactoring for the new ML API 
https://github.com/avulanov/spark/tree/dropout-mlp. 
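
For reference, the core of dropout is small; a sketch of the inverted variant at 
training time (this is not the code in the WIP branch above): keep each 
activation with probability p and rescale by 1/p, so the forward pass needs no 
change at test time.

```
import scala.util.Random

def dropout(activations: Array[Double], p: Double, rng: Random): Array[Double] =
  activations.map(a => if (rng.nextDouble() < p) a / p else 0.0)

val layerOutput = Array(0.5, 1.2, -0.3, 0.8)
val dropped = dropout(layerOutput, p = 0.5, rng = new Random(42))
```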

> Regularization for artificial neural networks
> -
>
> Key: SPARK-10627
> URL: https://issues.apache.org/jira/browse/SPARK-10627
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Add regularization for artificial neural networks. This includes, but is not 
> limited to:
> 1)L1 and L2 regularization
> 2)Dropout http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf
> 3)Dropconnect 
> http://machinelearning.wustl.edu/mlpapers/paper_files/icml2013_wan13.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10627) Regularization for artificial neural networks

2015-09-15 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-10627:


 Summary: Regularization for artificial neural networks
 Key: SPARK-10627
 URL: https://issues.apache.org/jira/browse/SPARK-10627
 Project: Spark
  Issue Type: Umbrella
  Components: ML
Affects Versions: 1.5.0
Reporter: Alexander Ulanov
Priority: Minor


Add regularization for artificial neural networks. This includes, but is not limited to:
1)L1 and L2 regularization
2)Dropout http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf
3)Dropconnect 
http://machinelearning.wustl.edu/mlpapers/paper_files/icml2013_wan13.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9273) Add Convolutional Neural network to Spark MLlib

2015-09-09 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737948#comment-14737948
 ] 

Alexander Ulanov edited comment on SPARK-9273 at 9/10/15 1:18 AM:
--

Hi Yuhao! I have a few comments regarding the interface and the optimization of 
your implementation. There are two options for optimizing convolutions: using 
matrix-matrix multiplication and using FFTs. The latter seems a bit more 
complicated, since we don't have an optimized parallel FFT in Spark, and it also 
has to support batch data processing. Instead, if one uses matrix-matrix 
multiplication for convolution, then it can take advantage of native BLAS, and 
batch computations can be supported straightforwardly. Another benefit is that we 
would not need to change the current Layer's input/output type (matrix) to a 
tensor. We can store the unwrapped inputs/outputs as vectors within the 
input/output matrix. Does it make sense to you?



was (Author: avulanov):
Hi Yuhao! I have a few comments regarding the interface and the optimization of 
your implementation. There are two options for optimizing convolutions: using 
matrix-matrix multiplication and using FFTs. The latter seems a bit more 
complicated, since we don't have an optimized parallel FFT in Spark, and it also 
has to support batch data processing. Instead, if one uses matrix-matrix 
multiplication for convolution, then it can take advantage of native BLAS, and 
batch computations can be supported straightforwardly. Another benefit is that we 
would not need to change the current Layer's input/output type (matrix) to a 
tensor. We can store the unwrapped inputs/outputs as vectors within the 
input/output matrix. Do you think that it is reasonable?


> Add Convolutional Neural network to Spark MLlib
> ---
>
> Key: SPARK-9273
> URL: https://issues.apache.org/jira/browse/SPARK-9273
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: yuhao yang
>
> Add Convolutional Neural network to Spark MLlib



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9273) Add Convolutional Neural network to Spark MLlib

2015-09-09 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737948#comment-14737948
 ] 

Alexander Ulanov commented on SPARK-9273:
-

Hi Yuhao! I have a few comments regarding the interface and the optimization of 
your implementation. There are two options for optimizing convolutions: using 
matrix-matrix multiplication and using FFTs. The latter seems a bit more 
complicated, since we don't have an optimized parallel FFT in Spark, and it also 
has to support batch data processing. Instead, if one uses matrix-matrix 
multiplication for convolution, then it can take advantage of native BLAS, and 
batch computations can be supported straightforwardly. Another benefit is that we 
would not need to change the current Layer's input/output type (matrix) to a 
tensor. We can store the unwrapped inputs/outputs as vectors within the 
input/output matrix. Do you think that it is reasonable?


> Add Convolutional Neural network to Spark MLlib
> ---
>
> Key: SPARK-9273
> URL: https://issues.apache.org/jira/browse/SPARK-9273
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: yuhao yang
>
> Add Convolutional Neural network to Spark MLlib



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4752) Classifier based on artificial neural network

2015-09-08 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov closed SPARK-4752.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

> Classifier based on artificial neural network
> -
>
> Key: SPARK-4752
> URL: https://issues.apache.org/jira/browse/SPARK-4752
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Alexander Ulanov
> Fix For: 1.5.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Implement classifier based on artificial neural network (ANN). Requirements:
> 1) Use the existing artificial neural network implementation 
> https://issues.apache.org/jira/browse/SPARK-2352, 
> https://github.com/apache/spark/pull/1290
> 2) Extend MLlib ClassificationModel trait, 
> 3) Like other classifiers in MLlib, accept RDD[LabeledPoint] for training,
> 4) Be able to return the ANN model



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10324) MLlib 1.6 Roadmap

2015-09-02 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-10324:
-
Description: 
Following SPARK-8445, we created this master list for MLlib features we plan to 
have in Spark 1.6. Please view this list as a wish list rather than a concrete 
plan, because we don't have an accurate estimate of available resources. Due to 
limited review bandwidth, features appearing on this list will get higher 
priority during code review. But feel free to suggest new items to the list in 
comments. We are experimenting with this process. Your feedback would be 
greatly appreciated.

h1. Instructions

h2. For contributors:

* Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
carefully. Code style, documentation, and unit tests are important.
* If you are a first-time Spark contributor, please always start with a 
[starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
than a medium/big feature. Based on our experience, mixing the development 
process with a big feature usually causes a long delay in code review.
* Never work silently. Let everyone know on the corresponding JIRA page when 
you start working on some features. This is to avoid duplicate work. For small 
features, you don't need to wait to get JIRA assigned.
* For medium/big features or features with dependencies, please get assigned 
first before coding and keep the ETA updated on the JIRA. If there is no 
activity on the JIRA page for a certain amount of time, the JIRA should be 
released for other contributors.
* Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
after another.
* Remember to add `@Since("1.6.0")` annotation to new public APIs.
* Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review 
greatly helps improve others' code as well as yours.

h2. For committers:

* Try to break down big features into small and specific JIRA tasks and link 
them properly.
* Add "starter" label to starter tasks.
* Put a rough estimate for medium/big features and track the progress.
* If you start reviewing a PR, please add yourself to the Shepherd field on 
JIRA.
* If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
please ping a maintainer to make a final pass.
* After merging a PR, create and link JIRAs for Python, example code, and 
documentation if necessary.

h1. Roadmap (WIP)

This is NOT [a complete list of MLlib JIRAs for 
1.6|https://issues.apache.org/jira/issues/?filter=12333208]. We only include 
umbrella JIRAs and high-level tasks.

h2. Algorithms and performance

* log-linear model for survival analysis (SPARK-8518)
* normal equation approach for linear regression (SPARK-9834)
* iteratively re-weighted least squares (IRLS) for GLMs (SPARK-9835)
* robust linear regression with Huber loss (SPARK-3181)
* vector-free L-BFGS (SPARK-10078)
* tree partition by features (SPARK-3717)
* bisecting k-means (SPARK-6517)
* weighted instance support (SPARK-9610)
** logistic regression (SPARK-7685)
** linear regression (SPARK-9642)
** random forest (SPARK-9478)
* locality sensitive hashing (LSH) (SPARK-5992)
* deep learning (SPARK-5575)
** autoencoder (SPARK-10408)
** restricted Boltzmann machine (RBM) (SPARK-4251)
** convolutional neural network (stretch)
* factorization machine (SPARK-7008)
* local linear algebra (SPARK-6442)
* distributed LU decomposition (SPARK-8514)

h2. Statistics

* univariate statistics as UDAFs (SPARK-10384)
* bivariate statistics as UDAFs (SPARK-10385)
* R-like statistics for GLMs (SPARK-9835)
* online hypothesis testing (SPARK-3147)

h2. Pipeline API

* pipeline persistence (SPARK-6725)
* ML attribute API improvements (SPARK-8515)
* feature transformers (SPARK-9930)
** feature interaction (SPARK-9698)
** SQL transformer (SPARK-8345)
** ??
* predict single instance (SPARK-10413)
* test Kaggle datasets (SPARK-9941)

h2. Model persistence

* PMML export
** naive Bayes (SPARK-8546)
** decision tree (SPARK-8542)
* model save/load
** FPGrowth (SPARK-6724)
** PrefixSpan (SPARK-10386)
* code generation
** decision tree and tree ensembles (SPARK-10387)

h2. Data sources

* LIBSVM data source (SPARK-10117)
* public dataset loader (SPARK-10388)

h2. Python API for ML

The main goal of the Python API is to have feature parity with the Scala/Java API. You 
can find a complete list 
[here|https://issues.apache.org/jira/issues/?filter=12333214]. The tasks fall 
into two major categories:

* Python API for new algorithms
* Python API for missing methods

h2. SparkR API for ML

* support more families and link functions in SparkR::glm (SPARK-9838, 
SPARK-9839, SPARK-9840)
* better R formula support (SPARK-9681)
* model summary with R-like statistics for GLMs (SPARK-9836, SPARK-9837)

h2. Documentation

* re-organize user guide (SPARK-8517)
* @Since versions in spark.ml, pyspark.mllib, and pyspark.m

[jira] [Comment Edited] (SPARK-10408) Autoencoder

2015-09-01 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726423#comment-14726423
 ] 

Alexander Ulanov edited comment on SPARK-10408 at 9/1/15 11:55 PM:
---

Added an implementation for (1), the basic deep autoencoder: 
https://github.com/avulanov/spark/tree/autoencoder-mlp 
(https://github.com/avulanov/spark/blob/autoencoder-mlp/mllib/src/main/scala/org/apache/spark/ml/feature/Autoencoder.scala)


was (Author: avulanov):
Added an implementation for (1), the basic deep autoencoder: 
https://github.com/avulanov/spark/tree/autoencoder-mlp (

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1], real in [-inf, +inf] 
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3)Denoising autoencoder 
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers: 
> References: 
> 1-3. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf
> 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10408) Autoencoder

2015-09-01 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726423#comment-14726423
 ] 

Alexander Ulanov edited comment on SPARK-10408 at 9/1/15 11:55 PM:
---

Added an implementation for (1), the basic deep autoencoder: 
https://github.com/avulanov/spark/tree/autoencoder-mlp (


was (Author: avulanov):
Added an implementation for (1), the basic deep autoencoder: 
https://github.com/avulanov/spark/tree/autoencoder-mlp 
(https://github.com/avulanov/spark/blob/ann-auto-rbm-mlor/mllib/src/main/scala/org/apache/spark/mllib/ann/Autoencoder.scala)

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1], real in [-inf, +inf] 
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3)Denoising autoencoder 
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers: 
> References: 
> 1-3. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf
> 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10408) Autoencoder

2015-09-01 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726423#comment-14726423
 ] 

Alexander Ulanov edited comment on SPARK-10408 at 9/1/15 11:54 PM:
---

Added an implementation for (1), the basic deep autoencoder: 
https://github.com/avulanov/spark/tree/autoencoder-mlp 
(https://github.com/avulanov/spark/blob/ann-auto-rbm-mlor/mllib/src/main/scala/org/apache/spark/mllib/ann/Autoencoder.scala)


was (Author: avulanov):
Added an implementation for (1), the basic deep autoencoder: 
https://github.com/avulanov/spark/tree/autoencoder-mlp

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1], real in [-inf, +inf] 
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3)Denoising autoencoder 
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers: 
> References: 
> 1-3. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf
> 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10408) Autoencoder

2015-09-01 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-10408:
-
Description: 
Goal: Implement various types of autoencoders 
Requirements:
1)Basic (deep) autoencoder that supports different types of inputs: binary, 
real in [0..1], real in [-inf, +inf] 
2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to 
the MLP and then used here 
3)Denoising autoencoder 
4)Stacked autoencoder for pre-training of deep networks. It should support 
arbitrary network layers: 

References: 
1-3. http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf
4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf

  was:
Goal: Implement various types of autoencoders 
Requirements:
1)Basic (deep) autoencoder that supports different types of inputs: binary, 
real in [0..1], real in [-inf, +inf]
2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to 
the MLP and then used here
3)Denoising autoencoder
4)Stacked autoencoder for pre-training of deep networks. It should support 
arbitrary network layers


> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1], real in [-inf, +inf] 
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3)Denoising autoencoder 
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers: 
> References: 
> 1-3. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf
> 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10409) Multilayer perceptron regression

2015-09-01 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-10409:


 Summary: Multilayer perceptron regression
 Key: SPARK-10409
 URL: https://issues.apache.org/jira/browse/SPARK-10409
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.5.0
Reporter: Alexander Ulanov
Priority: Minor


Implement regression based on the multilayer perceptron (MLP). It should support 
different kinds of outputs: binary, real in [0;1), and real in [-inf; +inf]. The 
implementation might take advantage of the autoencoder. Time-series forecasting 
for financial data might be one of the use cases; see 
http://dl.acm.org/citation.cfm?id=561452. So there is a need for more specific 
requirements from this (or another) area.
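
One hypothetical way to organize the different output kinds is to select the 
output-layer activation from the target range; the names below are illustrative, 
not a Spark API.

```
sealed trait OutputRange
case object UnitInterval extends OutputRange // real in [0;1): sigmoid output
case object Unbounded extends OutputRange    // real in [-inf; +inf]: linear output

def outputActivation(range: OutputRange): Double => Double = range match {
  case UnitInterval => x => 1.0 / (1.0 + math.exp(-x))
  case Unbounded    => x => x
}
```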



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10409) Multilayer perceptron regression

2015-09-01 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726435#comment-14726435
 ] 

Alexander Ulanov commented on SPARK-10409:
--

A basic implementation with the current ML API can be found here: 
https://github.com/avulanov/spark/blob/a2261330c227be8ef26172dbe355a617d653553a/mllib/src/main/scala/org/apache/spark/ml/regression/MultilayerPerceptronRegressor.scala

> Multilayer perceptron regression
> 
>
> Key: SPARK-10409
> URL: https://issues.apache.org/jira/browse/SPARK-10409
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Implement regression based on the multilayer perceptron (MLP). It should support 
> different kinds of outputs: binary, real in [0;1), and real in [-inf; +inf]. 
> The implementation might take advantage of the autoencoder. Time-series 
> forecasting for financial data might be one of the use cases; see 
> http://dl.acm.org/citation.cfm?id=561452. So there is a need for more 
> specific requirements from this (or another) area.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10408) Autoencoder

2015-09-01 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726423#comment-14726423
 ] 

Alexander Ulanov commented on SPARK-10408:
--

Added an implementation for (1), the basic deep autoencoder: 
https://github.com/avulanov/spark/tree/autoencoder-mlp

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1], real in [-inf, +inf]
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here
> 3)Denoising autoencoder
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10408) Autoencoder

2015-09-01 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-10408:
-
Issue Type: Umbrella  (was: Improvement)

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1], real in [-inf, +inf]
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here
> 3)Denoising autoencoder
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10408) Autoencoder

2015-09-01 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-10408:


 Summary: Autoencoder
 Key: SPARK-10408
 URL: https://issues.apache.org/jira/browse/SPARK-10408
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.5.0
Reporter: Alexander Ulanov
Priority: Minor


Goal: Implement various types of autoencoders 
Requirements:
1)Basic (deep) autoencoder that supports different types of inputs: binary, 
real in [0..1], real in [-inf, +inf]
2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to 
the MLP and then used here
3)Denoising autoencoder
4)Stacked autoencoder for pre-training of deep networks. It should support 
arbitrary network layers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9951) Example code for Multilayer Perceptron Classifier

2015-08-17 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700567#comment-14700567
 ] 

Alexander Ulanov commented on SPARK-9951:
-

I've submitted a PR for the user guide. Could you suggest whether the example code 
in the PR can be used for this issue? https://github.com/apache/spark/pull/8262

> Example code for Multilayer Perceptron Classifier
> -
>
> Key: SPARK-9951
> URL: https://issues.apache.org/jira/browse/SPARK-9951
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> Add an example to the examples/ code folder for Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9951) Example code for Multilayer Perceptron Classifier

2015-08-14 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697902#comment-14697902
 ] 

Alexander Ulanov commented on SPARK-9951:
-

I have this already; I plan to use it for the User Guide. Should we have 
different example code in the examples folder?

> Example code for Multilayer Perceptron Classifier
> -
>
> Key: SPARK-9951
> URL: https://issues.apache.org/jira/browse/SPARK-9951
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> Add an example to the examples/ code folder for Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-9897) User Guide for Multilayer Perceptron Classifier

2015-08-12 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-9897:

Comment: was deleted

(was: We already have an issue for MLP classifier docs: 
https://issues.apache.org/jira/browse/SPARK-9846. I plan to resolve it soon. 
Could you close this one?)

> User Guide for Multilayer Perceptron Classifier
> ---
>
> Key: SPARK-9897
> URL: https://issues.apache.org/jira/browse/SPARK-9897
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Feynman Liang
>
> SPARK-9471 adds MLPs to ML Pipelines, an algorithm not covered by the MLlib 
> docs. We should update the user guide to include this under the {{Algorithm 
> Guides > Algorithms in spark.ml}} section of {{ml-guide}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9897) User Guide for Multilayer Perceptron Classifier

2015-08-12 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694355#comment-14694355
 ] 

Alexander Ulanov commented on SPARK-9897:
-

We already have an issue for MLP classifier docs: 
https://issues.apache.org/jira/browse/SPARK-9846. I plan to resolve it soon. 
Could you close this one?

> User Guide for Multilayer Perceptron Classifier
> ---
>
> Key: SPARK-9897
> URL: https://issues.apache.org/jira/browse/SPARK-9897
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Feynman Liang
>
> SPARK-9471 adds MLPs to ML Pipelines, an algorithm not covered by the MLlib 
> docs. We should update the user guide to include this under the {{Algorithm 
> Guides > Algorithms in spark.ml}} section of {{ml-guide}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9897) User Guide for Multilayer Perceptron Classifier

2015-08-12 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694356#comment-14694356
 ] 

Alexander Ulanov commented on SPARK-9897:
-

We already have an issue for MLP classifier docs: 
https://issues.apache.org/jira/browse/SPARK-9846. I plan to resolve it soon. 
Could you close this one?

> User Guide for Multilayer Perceptron Classifier
> ---
>
> Key: SPARK-9897
> URL: https://issues.apache.org/jira/browse/SPARK-9897
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Feynman Liang
>
> SPARK-9471 adds MLPs to ML Pipelines, an algorithm not covered by the MLlib 
> docs. We should update the user guide to include this under the {{Algorithm 
> Guides > Algorithms in spark.ml}} section of {{ml-guide}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9508) Align graphx programming guide with the updated Pregel code

2015-07-31 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-9508:
---

 Summary: Align graphx programming guide with the updated Pregel 
code
 Key: SPARK-9508
 URL: https://issues.apache.org/jira/browse/SPARK-9508
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
Priority: Minor
 Fix For: 1.4.0


SPARK-9436 simplifies the Pregel code. graphx-programming-guide needs to be 
modified accordingly, since it lists the old Pregel code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9471) Multilayer perceptron

2015-07-30 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-9471:
---

 Summary: Multilayer perceptron 
 Key: SPARK-9471
 URL: https://issues.apache.org/jira/browse/SPARK-9471
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
 Fix For: 1.4.0


Implement Multilayer Perceptron for Spark ML. Requirements:
1) ML pipelines interface
2) Extensible internal interface for further development of artificial neural 
networks for ML
3) Efficient and scalable: use vectors and BLAS
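
For reference, a short usage sketch against the spark.ml estimator this work led 
to (parameter values are illustrative; `train` is assumed to be a DataFrame with 
"features" and "label" columns):

```
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

val trainer = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 5, 4, 3)) // 4 inputs, two hidden layers, 3 classes
  .setBlockSize(128)            // stack examples into matrices for BLAS (requirement 3)
  .setSeed(1234L)
  .setMaxIter(100)
val model = trainer.fit(train)
```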



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9380) Pregel example fix in graphx-programming-guide

2015-07-29 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646853#comment-14646853
 ] 

Alexander Ulanov commented on SPARK-9380:
-

It seems that I did not name the PR correctly. I renamed it and resolved this 
issue. Sorry for the inconvenience.


> Pregel example fix in graphx-programming-guide
> --
>
> Key: SPARK-9380
> URL: https://issues.apache.org/jira/browse/SPARK-9380
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.4.0
>Reporter: Alexander Ulanov
> Fix For: 1.4.0
>
>
> The Pregel operator example expressing single-source
> shortest paths does not work due to an incorrect graph type: Graph[Int, 
> Double] should be Graph[Long, Double]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-9380) Pregel example fix in graphx-programming-guide

2015-07-29 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-9380:

Comment: was deleted

(was: It seems that I did not name the PR correctly. I renamed it and resolved 
this issue. Sorry for the inconvenience.
)

> Pregel example fix in graphx-programming-guide
> --
>
> Key: SPARK-9380
> URL: https://issues.apache.org/jira/browse/SPARK-9380
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.4.0
>Reporter: Alexander Ulanov
> Fix For: 1.4.0
>
>
> The Pregel operator example expressing single-source
> shortest paths does not work due to an incorrect graph type: Graph[Int, 
> Double] should be Graph[Long, Double]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9380) Pregel example fix in graphx-programming-guide

2015-07-29 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646854#comment-14646854
 ] 

Alexander Ulanov commented on SPARK-9380:
-

It seems that I did not name the PR correctly. I renamed it and resolved this 
issue. Sorry for the inconvenience.


> Pregel example fix in graphx-programming-guide
> --
>
> Key: SPARK-9380
> URL: https://issues.apache.org/jira/browse/SPARK-9380
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.4.0
>Reporter: Alexander Ulanov
> Fix For: 1.4.0
>
>
> The Pregel operator example expressing single-source
> shortest paths does not work due to an incorrect graph type: Graph[Int, 
> Double] should be Graph[Long, Double]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9380) Pregel example fix in graphx-programming-guide

2015-07-29 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov resolved SPARK-9380.
-
Resolution: Fixed

> Pregel example fix in graphx-programming-guide
> --
>
> Key: SPARK-9380
> URL: https://issues.apache.org/jira/browse/SPARK-9380
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.4.0
>Reporter: Alexander Ulanov
> Fix For: 1.4.0
>
>
> The Pregel operator example expressing single-source
> shortest paths does not work due to an incorrect graph type: Graph[Int, 
> Double] should be Graph[Long, Double]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9436) Simplify Pregel by merging joins

2015-07-29 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-9436:

Summary: Simplify Pregel by merging joins  (was: Merge joins in Pregel )

> Simplify Pregel by merging joins
> 
>
> Key: SPARK-9436
> URL: https://issues.apache.org/jira/browse/SPARK-9436
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.4.0
>Reporter: Alexander Ulanov
>Priority: Minor
> Fix For: 1.4.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Pregel code contains two consecutive joins: 
> ```
> g.vertices.innerJoin(messages)(vprog)
> ...
> g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => 
> newOpt.getOrElse(old) }
> ```
> They can be replaced by one join. Ankur Dave proposed a patch based on our 
> discussion on the mailing list: 
> https://www.mail-archive.com/dev@spark.apache.org/msg10316.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9436) Merge joins in Pregel

2015-07-29 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-9436:
---

 Summary: Merge joins in Pregel 
 Key: SPARK-9436
 URL: https://issues.apache.org/jira/browse/SPARK-9436
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
Priority: Minor
 Fix For: 1.4.0


Pregel code contains two consecutive joins: 
```
g.vertices.innerJoin(messages)(vprog)
...
g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => newOpt.getOrElse(old) 
}
```
They can be replaced by one join. Ankur Dave proposed a patch based on our 
discussion on the mailing list: 
https://www.mail-archive.com/dev@spark.apache.org/msg10316.html
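
One way to express the single join, shown as a hedged sketch (joinVertices keeps 
the old attribute for vertices that received no message):
```
g = g.joinVertices(messages)(vprog)
```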



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9380) Pregel example fix in graphx-programming-guide

2015-07-27 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-9380:
---

 Summary: Pregel example fix in graphx-programming-guide
 Key: SPARK-9380
 URL: https://issues.apache.org/jira/browse/SPARK-9380
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
 Fix For: 1.4.0


The Pregel operator example expressing single-source
shortest paths does not work due to an incorrect graph type: Graph[Int, 
Double] should be Graph[Long, Double]
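
The corrected declaration, sketched under the assumption that the example builds 
the graph with GraphGenerators (which produces Long vertex attributes) and that 
`sc` is an existing SparkContext:

```
import org.apache.spark.graphx._
import org.apache.spark.graphx.util.GraphGenerators

// A graph whose edge attributes hold distances; vertex attributes are Long.
val graph: Graph[Long, Double] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)
```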



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9273) Add Convolutional Neural network to Spark MLlib

2015-07-27 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642513#comment-14642513
 ] 

Alexander Ulanov edited comment on SPARK-9273 at 7/27/15 9:54 AM:
--

I have not heard about the PR until it was submitted. It would be useful to 
look at the code, benchmark it and see if it fits our API. I've added the link 
to the umbrella issue for deep learning 
https://issues.apache.org/jira/browse/SPARK-5575


was (Author: avulanov):
I have not heard about the PR until it was submitted. It would be useful to 
look at the code, benchmark it and see if it fits our API.

> Add Convolutional Neural network to Spark MLlib
> ---
>
> Key: SPARK-9273
> URL: https://issues.apache.org/jira/browse/SPARK-9273
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: yuhao yang
>
> Add Convolutional Neural network to Spark MLlib



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9273) Add Convolutional Neural network to Spark MLlib

2015-07-27 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642513#comment-14642513
 ] 

Alexander Ulanov commented on SPARK-9273:
-

I have not heard about the PR until it was submitted. It would be useful to 
look at the code, benchmark it and see if it fits our API.

> Add Convolutional Neural network to Spark MLlib
> ---
>
> Key: SPARK-9273
> URL: https://issues.apache.org/jira/browse/SPARK-9273
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: yuhao yang
>
> Add Convolutional Neural network to Spark MLlib



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9120) Add multivariate regression (or prediction) interface

2015-07-21 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-9120:

Description: 
org.apache.spark.ml.regression.RegressionModel supports prediction only for a 
single variable with a method "predict:Double" by extending the Predictor. 
There is a need for multivariate prediction, at least for regression. I propose 
to modify the "RegressionModel" interface similarly to how it is done in 
"ClassificationModel", which supports multiclass classification. It has 
"predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" should 
have something like "predictMultivariate:Vector".

Update: After reading the design docs, adding "predictMultivariate" to 
RegressionModel no longer seems reasonable to me. The issue is as follows. 
RegressionModel extends PredictionModel, which has "predict:Double". Its 
"train" method uses "predict:Double" for prediction, i.e. PredictionModel (and 
RegressionModel) is hard-coded to have only one output. A similar problem 
exists in MLlib (https://issues.apache.org/jira/browse/SPARK-5362). 

A possible solution might require redesigning the class hierarchy or adding a 
separate interface that extends Model, though the latter means code 
duplication.
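
A hypothetical sketch of what the proposed interface could look like (this is 
not an existing Spark API; the trait and method names are only illustrative):

{code}
import org.apache.spark.mllib.linalg.Vector

// Hypothetical: a multivariate analogue of RegressionModel's predict:Double.
trait MultivariateRegressionModel {
  /** Predict a vector of outputs for a given feature vector. */
  def predictMultivariate(features: Vector): Vector
}
{code}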


  was:
org.apache.spark.ml.regression.RegressionModel supports prediction only for a 
single variable with a method "predict:Double" by extending the Predictor. 
There is a need for multivariate prediction, at least for regression. I propose 
to modify "RegressionModel" interface similarly to how it is done in 
"ClassificationModel", which supports multiclass classification. It has 
"predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" should 
have something like "predictMultivariate:Vector".

Update:After reading the design docs, adding "predictMultivariate" to 
RegressionModel does not seem reasonable to me anymore. The issue is as 
follows. RegressionModel extends PredictionModel which has "predict:Double". 
Its "train" method uses "predict:Double" for prediction, i.e. PredictionModel 
is hard-coded to have only one output. It is the same problem that I pointed 
out long time ago in MLLib (



> Add multivariate regression (or prediction) interface
> -
>
> Key: SPARK-9120
> URL: https://issues.apache.org/jira/browse/SPARK-9120
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Alexander Ulanov
> Fix For: 1.4.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> org.apache.spark.ml.regression.RegressionModel supports prediction only for a 
> single variable with a method "predict:Double" by extending the Predictor. 
> There is a need for multivariate prediction, at least for regression. I 
> propose to modify "RegressionModel" interface similarly to how it is done in 
> "ClassificationModel", which supports multiclass classification. It has 
> "predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" 
> should have something like "predictMultivariate:Vector".
> Update: After reading the design docs, adding "predictMultivariate" to 
> RegressionModel does not seem reasonable to me anymore. The issue is as 
> follows. RegressionModel extends PredictionModel which has "predict:Double". 
> Its "train" method uses "predict:Double" for prediction, i.e. PredictionModel 
> (and RegressionModel) is hard-coded to have only one output. There exist a 
> similar problem in MLLib (https://issues.apache.org/jira/browse/SPARK-5362). 
> The possible solution for this problem might require to redesign the class 
> hierarchy or addition of a separate interface that extends model. Though the 
> latter means code duplication.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9120) Add multivariate regression (or prediction) interface

2015-07-21 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-9120:

Description: 
org.apache.spark.ml.regression.RegressionModel supports prediction only for a 
single variable with a method "predict:Double" by extending the Predictor. 
There is a need for multivariate prediction, at least for regression. I propose 
to modify the "RegressionModel" interface similarly to how it is done in 
"ClassificationModel", which supports multiclass classification. It has 
"predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" should 
have something like "predictMultivariate:Vector".

Update: After reading the design docs, adding "predictMultivariate" to 
RegressionModel no longer seems reasonable to me. The issue is as follows. 
RegressionModel extends PredictionModel, which has "predict:Double". Its 
"train" method uses "predict:Double" for prediction, i.e. PredictionModel 
is hard-coded to have only one output. It is the same problem that I pointed 
out a long time ago in MLlib (


  was:
org.apache.spark.ml.regression.RegressionModel supports prediction only for a 
single variable with a method "predict:Double" by extending the Predictor. 
There is a need for multivariate prediction, at least for regression. I propose 
to modify "RegressionModel" interface similarly to how it is done in 
"ClassificationModel", which supports multiclass classification. It has 
"predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" should 
have something like "predictMultivariate:Vector".



> Add multivariate regression (or prediction) interface
> -
>
> Key: SPARK-9120
> URL: https://issues.apache.org/jira/browse/SPARK-9120
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Alexander Ulanov
> Fix For: 1.4.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> org.apache.spark.ml.regression.RegressionModel supports prediction only for a 
> single variable with a method "predict:Double" by extending the Predictor. 
> There is a need for multivariate prediction, at least for regression. I 
> propose to modify "RegressionModel" interface similarly to how it is done in 
> "ClassificationModel", which supports multiclass classification. It has 
> "predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" 
> should have something like "predictMultivariate:Vector".
> Update:After reading the design docs, adding "predictMultivariate" to 
> RegressionModel does not seem reasonable to me anymore. The issue is as 
> follows. RegressionModel extends PredictionModel which has "predict:Double". 
> Its "train" method uses "predict:Double" for prediction, i.e. PredictionModel 
> is hard-coded to have only one output. It is the same problem that I pointed 
> out long time ago in MLLib (



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9120) Add multivariate regression (or prediction) interface

2015-07-16 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630654#comment-14630654
 ] 

Alexander Ulanov commented on SPARK-9120:
-

Thank you, it sounds doable.

> Add multivariate regression (or prediction) interface
> -
>
> Key: SPARK-9120
> URL: https://issues.apache.org/jira/browse/SPARK-9120
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Alexander Ulanov
> Fix For: 1.4.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> org.apache.spark.ml.regression.RegressionModel supports prediction only for a 
> single variable with a method "predict:Double" by extending the Predictor. 
> There is a need for multivariate prediction, at least for regression. I 
> propose to modify "RegressionModel" interface similarly to how it is done in 
> "ClassificationModel", which supports multiclass classification. It has 
> "predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" 
> should have something like "predictMultivariate:Vector".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9120) Add multivariate regression (or prediction) interface

2015-07-16 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630596#comment-14630596
 ] 

Alexander Ulanov commented on SPARK-9120:
-

I think it should work for train (aka fit), which has to return the model; I 
am not sure about the model itself. The common ancestor Model does not contain 
anything that can be called for prediction; its direct successor 
PredictionModel has "predict:Double". Is there another way that you were 
mentioning?

> Add multivariate regression (or prediction) interface
> -
>
> Key: SPARK-9120
> URL: https://issues.apache.org/jira/browse/SPARK-9120
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Alexander Ulanov
> Fix For: 1.4.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> org.apache.spark.ml.regression.RegressionModel supports prediction only for a 
> single variable with a method "predict:Double" by extending the Predictor. 
> There is a need for multivariate prediction, at least for regression. I 
> propose to modify "RegressionModel" interface similarly to how it is done in 
> "ClassificationModel", which supports multiclass classification. It has 
> "predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" 
> should have something like "predictMultivariate:Vector".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9120) Add multivariate regression (or prediction) interface

2015-07-16 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630560#comment-14630560
 ] 

Alexander Ulanov commented on SPARK-9120:
-

Thank you for sharing your thoughts. Do you mean that an algorithm that does 
multivariate regression should not be implemented within ML, since ML does not 
support multivariate prediction, and so the algorithm should live within MLlib 
for a while until you figure out a generic interface? By support I mean 
handling the ".fit" and ".transform" stuff, etc.

> Add multivariate regression (or prediction) interface
> -
>
> Key: SPARK-9120
> URL: https://issues.apache.org/jira/browse/SPARK-9120
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Alexander Ulanov
> Fix For: 1.4.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> org.apache.spark.ml.regression.RegressionModel supports prediction only for a 
> single variable with a method "predict:Double" by extending the Predictor. 
> There is a need for multivariate prediction, at least for regression. I 
> propose to modify "RegressionModel" interface similarly to how it is done in 
> "ClassificationModel", which supports multiclass classification. It has 
> "predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" 
> should have something like "predictMultivariate:Vector".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3702) Standardize MLlib classes for learners, models

2015-07-16 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630453#comment-14630453
 ] 

Alexander Ulanov commented on SPARK-3702:
-

[~josephkb] Hi, Joseph! Do you plan to add support for multivariate regression? 
I need it for the multi-layer perceptron, and a multivariate regression 
interface might be useful for other tasks as well. I've added an issue: 
https://issues.apache.org/jira/browse/SPARK-9120. I also wonder if you plan to 
add integer array parameters: https://issues.apache.org/jira/browse/SPARK-9118. 
Both seem to be relatively easy to implement; the question is whether you plan 
to merge these features in the near future.

> Standardize MLlib classes for learners, models
> --
>
> Key: SPARK-3702
> URL: https://issues.apache.org/jira/browse/SPARK-3702
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> Summary: Create a class hierarchy for learning algorithms and the models 
> those algorithms produce.
> This is a super-task of several sub-tasks (but JIRA does not allow subtasks 
> of subtasks).  See the "requires" links below for subtasks.
> Goals:
> * give intuitive structure to API, both for developers and for generated 
> documentation
> * support meta-algorithms (e.g., boosting)
> * support generic functionality (e.g., evaluation)
> * reduce code duplication across classes
> [Design doc for class hierarchy | 
> https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9120) Add multivariate regression (or prediction) interface

2015-07-16 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-9120:
---

 Summary: Add multivariate regression (or prediction) interface
 Key: SPARK-9120
 URL: https://issues.apache.org/jira/browse/SPARK-9120
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
 Fix For: 1.4.0


org.apache.spark.ml.regression.RegressionModel supports prediction only for a 
single variable with a method "predict:Double" by extending the Predictor. 
There is a need for multivariate prediction, at least for regression. I propose 
to modify the "RegressionModel" interface similarly to how it is done in 
"ClassificationModel", which supports multiclass classification. It has 
"predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" should 
have something like "predictMultivariate:Vector".




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9118) Implement integer array parameters for ml.param as IntArrayParam

2015-07-16 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-9118:
---

 Summary: Implement integer array parameters for ml.param as 
IntArrayParam
 Key: SPARK-9118
 URL: https://issues.apache.org/jira/browse/SPARK-9118
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
Priority: Minor
 Fix For: 1.4.0


ml/param/params.scala lacks an integer array parameter, which is needed for 
some models, such as the multilayer perceptron, to specify layer sizes. I 
suggest implementing it as IntArrayParam, similarly to DoubleArrayParam.
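
A sketch of what this could look like, modeled on the existing DoubleArrayParam 
(the final implementation may differ):

{code}
import org.apache.spark.ml.param.{Param, ParamPair, Params}

// Sketch: an Array[Int] analogue of DoubleArrayParam.
class IntArrayParam(parent: Params, name: String, doc: String,
    isValid: Array[Int] => Boolean)
  extends Param[Array[Int]](parent, name, doc, isValid) {

  def this(parent: Params, name: String, doc: String) =
    this(parent, name, doc, (_: Array[Int]) => true)

  /** Creates a param pair with the given value (for Java). */
  override def w(value: Array[Int]): ParamPair[Array[Int]] = super.w(value)
}
{code}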



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8449) HDF5 read/write support for Spark MLlib

2015-06-18 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592398#comment-14592398
 ] 

Alexander Ulanov edited comment on SPARK-8449 at 6/18/15 7:53 PM:
--

It seems that using the official HDF5 reader is not a viable choice for Spark 
due to its platform-dependent binaries. We need to look for a pure Java 
implementation. Apparently, there is one called netCDF: 
http://www.unidata.ucar.edu/blogs/news/entry/netcdf_java_library_version_44. It 
might be tricky to use because the license is not Apache. However, it is worth 
a look.


was (Author: avulanov):
It seems that using the official HDF5 reader is not a viable choice for Spark 
due to platform dependent binaries. We need to look for pure Java 
implementation. Apparently, there is one called netCDF: 
http://www.unidata.ucar.edu/blogs/news/entry/netcdf_java_library_version_44. It 
might be tricky to use it because the license is not Apache. However it worth a 
look.

> HDF5 read/write support for Spark MLlib
> ---
>
> Key: SPARK-8449
> URL: https://issues.apache.org/jira/browse/SPARK-8449
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: Alexander Ulanov
> Fix For: 1.4.1
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Add support for reading and writing HDF5 file format to/from LabeledPoint. 
> HDFS and local file system have to be supported. Other Spark formats to be 
> discussed. 
> Interface proposal:
> /* path - directory path in any Hadoop-supported file system URI */
> MLUtils.saveAsHDF5(sc: SparkContext, path: String, RDD[LabeledPoint]): Unit
> /* path - file or directory path in any Hadoop-supported file system URI */
> MLUtils.loadHDF5(sc: SparkContext, path: String): RDD[LabeledPoint]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8449) HDF5 read/write support for Spark MLlib

2015-06-18 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592398#comment-14592398
 ] 

Alexander Ulanov commented on SPARK-8449:
-

It seems that using the official HDF5 reader is not a viable choice for Spark 
due to its platform-dependent binaries. We need to look for a pure Java 
implementation. Apparently, there is one called netCDF: 
http://www.unidata.ucar.edu/blogs/news/entry/netcdf_java_library_version_44. It 
might be tricky to use because the license is not Apache. However, it is worth 
a look.

> HDF5 read/write support for Spark MLlib
> ---
>
> Key: SPARK-8449
> URL: https://issues.apache.org/jira/browse/SPARK-8449
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: Alexander Ulanov
> Fix For: 1.4.1
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Add support for reading and writing HDF5 file format to/from LabeledPoint. 
> HDFS and local file system have to be supported. Other Spark formats to be 
> discussed. 
> Interface proposal:
> /* path - directory path in any Hadoop-supported file system URI */
> MLUtils.saveAsHDF5(sc: SparkContext, path: String, RDD[LabeledPoint]): Unit
> /* path - file or directory path in any Hadoop-supported file system URI */
> MLUtils.loadHDF5(sc: SparkContext, path: String): RDD[LabeledPoint]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8449) HDF5 read/write support for Spark MLlib

2015-06-18 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-8449:
---

 Summary: HDF5 read/write support for Spark MLlib
 Key: SPARK-8449
 URL: https://issues.apache.org/jira/browse/SPARK-8449
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
 Fix For: 1.4.1


Add support for reading and writing the HDF5 file format to/from LabeledPoint. 
HDFS and the local file system have to be supported. Other Spark formats are to 
be discussed. 

Interface proposal:
/* path - directory path in any Hadoop-supported file system URI */
MLUtils.saveAsHDF5(sc: SparkContext, path: String, RDD[LabeledPoint]): Unit
/* path - file or directory path in any Hadoop-supported file system URI */
MLUtils.loadHDF5(sc: SparkContext, path: String): RDD[LabeledPoint]
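
A hypothetical usage sketch of the proposed interface (these MLUtils methods do 
not exist; the names and signatures come from the proposal above, and sc is 
assumed to be the SparkContext):

{code}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

// Hypothetical calls: loadHDF5/saveAsHDF5 are only proposed, not implemented.
val points: RDD[LabeledPoint] = MLUtils.loadHDF5(sc, "hdfs:///data/train.h5")
MLUtils.saveAsHDF5(sc, "hdfs:///data/train-copy", points)
{code}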




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning

2015-06-11 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582768#comment-14582768
 ] 

Alexander Ulanov commented on SPARK-5575:
-

Hi Janani, 

There is already an implementation of DBN (and RBM) by [~gq]. You can find it 
here: https://github.com/witgo/spark/tree/ann-interface-gemm-dbn

> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> Goal: Implement various types of artificial neural networks
> Motivation: deep learning trend
> Requirements: 
> 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward 
> and Backpropagation etc. should be implemented as traits or interfaces, so 
> they can be easily extended or reused
> 2) Implement complex abstractions, such as feed forward and recurrent networks
> 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
> autoencoder (sparse and denoising), stacked autoencoder, restricted 
> Boltzmann machines (RBM), deep belief networks (DBN), etc.
> 4) Implement or reuse supporting constructs, such as classifiers, 
> normalizers, poolers, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning

2015-05-11 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14538209#comment-14538209
 ] 

Alexander Ulanov commented on SPARK-5575:
-

Current implementation: 
https://github.com/avulanov/spark/tree/ann-interface-gemm

> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> Goal: Implement various types of artificial neural networks
> Motivation: deep learning trend
> Requirements: 
> 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward 
> and Backpropagation etc. should be implemented as traits or interfaces, so 
> they can be easily extended or reused
> 2) Implement complex abstractions, such as feed forward and recurrent networks
> 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
> autoencoder (sparse and denoising), stacked autoencoder, restricted 
> Boltzmann machines (RBM), deep belief networks (DBN), etc.
> 4) Implement or reuse supporting constructs, such as classifiers, 
> normalizers, poolers, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7316) Add step capability to RDD sliding window

2015-05-06 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531229#comment-14531229
 ] 

Alexander Ulanov commented on SPARK-7316:
-

I would say that the major use case is practical considerations :)

In my case it is time-series analysis of sensor data. It does not make sense to 
analyze time windows with step 1 because it is a high-frequency sensor (1024 
Hz). Also, even if we wanted to do that, the size of the resulting data gets 
enormous. For example, I have 2B data points (542 hours) stored as 23 GB of 
binary data. If I apply a sliding window of size 1024 with step 1, it will 
result in 1024*23 GB = 23.5 TB of data, which I am not able to process with 
Spark currently (honestly speaking, my disk space is only 10 TB). If you store 
the data in HDFS, it will be tripled, i.e. ~70 TB. 


> Add step capability to RDD sliding window
> -
>
> Key: SPARK-7316
> URL: https://issues.apache.org/jira/browse/SPARK-7316
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Alexander Ulanov
> Fix For: 1.4.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> RDDFunctions in MLlib contains sliding window implementation with step 1. 
> User should be able to define step. This capability should be implemented.
> Although one can generate sliding windows with step 1 and then filter every 
> Nth window, it might take much more time and disk space depending on the step 
> size. For example, if your window is 1000 then you will generate the amount 
> of data thousand times bigger than your initial dataset. It does not make 
> sense if you need just every Nth window, so the data generated will be 1000/N 
> smaller. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7316) Add step capability to RDD sliding window

2015-05-04 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-7316:

Description: 
RDDFunctions in MLlib contains a sliding window implementation with step 1. 
The user should be able to define the step; this capability should be 
implemented.

Although one can generate sliding windows with step 1 and then filter every Nth 
window, it might take much more time and disk space depending on the step size. 
For example, if your window size is 1000, you will generate a thousand times 
more data than your initial dataset. This does not make sense if you only need 
every Nth window, in which case the generated data would be N times smaller. A 
sketch of the filtering workaround is shown below. 
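
A minimal sketch of the step-1-then-filter workaround, assuming MLlib's 
sliding API in org.apache.spark.mllib.rdd.RDDFunctions (every window is still 
materialized first, which is exactly the overhead a built-in step parameter 
would avoid):

{code}
import org.apache.spark.mllib.rdd.RDDFunctions._
import org.apache.spark.rdd.RDD

// Sketch: keep every `step`-th window produced by the step-1 sliding API.
def windowsWithStep(data: RDD[Double], window: Int, step: Int): RDD[Array[Double]] =
  data.sliding(window)
    .zipWithIndex()
    .filter { case (_, i) => i % step == 0 }
    .map { case (w, _) => w }
{code}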



  was:RDDFunctions in MLlib contains sliding window implementation with step 1. 
User should be able to define step. This capability should be implemented.


> Add step capability to RDD sliding window
> -
>
> Key: SPARK-7316
> URL: https://issues.apache.org/jira/browse/SPARK-7316
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Alexander Ulanov
> Fix For: 1.4.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> RDDFunctions in MLlib contains sliding window implementation with step 1. 
> User should be able to define step. This capability should be implemented.
> Although one can generate sliding windows with step 1 and then filter every 
> Nth window, it might take much more time and disk space depending on the step 
> size. For example, if your window is 1000 then you will generate the amount 
> of data thousand times bigger than your initial dataset. It does not make 
> sense if you need just every Nth window, so the data generated will be 1000/N 
> smaller. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7316) Add step capability to RDD sliding window

2015-05-01 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-7316:
---

 Summary: Add step capability to RDD sliding window
 Key: SPARK-7316
 URL: https://issues.apache.org/jira/browse/SPARK-7316
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Alexander Ulanov
 Fix For: 1.4.0


RDDFunctions in MLlib contains a sliding window implementation with step 1. 
The user should be able to define the step; this capability should be 
implemented.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5256) Improving MLlib optimization APIs

2015-04-14 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494579#comment-14494579
 ] 

Alexander Ulanov edited comment on SPARK-5256 at 4/14/15 6:48 PM:
--

[~shivaram] Indeed, performance is orthogonal to the API design. Though 
well-designed things should work efficiently, shouldn't they? :)


was (Author: avulanov):
[~shivaram] Indeed, performance is orthogonal to the API design. Though 
well-designed things should work efficient, don't you think? :)

> Improving MLlib optimization APIs
> -
>
> Key: SPARK-5256
> URL: https://issues.apache.org/jira/browse/SPARK-5256
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> *Goal*: Improve APIs for optimization
> *Motivation*: There have been several disjoint mentions of improving the 
> optimization APIs to make them more pluggable, extensible, etc.  This JIRA is 
> a place to discuss what API changes are necessary for the long term, and to 
> provide links to other relevant JIRAs.
> Eventually, I hope this leads to a design doc outlining:
> * current issues
> * requirements such as supporting many types of objective functions, 
> optimization algorithms, and parameters to those algorithms
> * ideal API
> * breakdown of smaller JIRAs needed to achieve that API
> I will soon create an initial design doc, and I will try to watch this JIRA 
> and include ideas from JIRA comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs

2015-04-14 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494579#comment-14494579
 ] 

Alexander Ulanov commented on SPARK-5256:
-

[~shivaram] Indeed, performance is orthogonal to the API design. Though 
well-designed things should work efficiently, don't you think? :)

> Improving MLlib optimization APIs
> -
>
> Key: SPARK-5256
> URL: https://issues.apache.org/jira/browse/SPARK-5256
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> *Goal*: Improve APIs for optimization
> *Motivation*: There have been several disjoint mentions of improving the 
> optimization APIs to make them more pluggable, extensible, etc.  This JIRA is 
> a place to discuss what API changes are necessary for the long term, and to 
> provide links to other relevant JIRAs.
> Eventually, I hope this leads to a design doc outlining:
> * current issues
> * requirements such as supporting many types of objective functions, 
> optimization algorithms, and parameters to those algorithms
> * ideal API
> * breakdown of smaller JIRAs needed to achieve that API
> I will soon create an initial design doc, and I will try to watch this JIRA 
> and include ideas from JIRA comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs

2015-04-14 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494568#comment-14494568
 ] 

Alexander Ulanov commented on SPARK-5256:
-

The size of data that warrants using Spark suggests that the learning 
algorithm will be limited by time rather than by data. According to the paper 
"The tradeoffs of large scale learning", SGD converges significantly faster 
than batch GD in this case. My use case is machine learning on large data, in 
particular time series.

> Improving MLlib optimization APIs
> -
>
> Key: SPARK-5256
> URL: https://issues.apache.org/jira/browse/SPARK-5256
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> *Goal*: Improve APIs for optimization
> *Motivation*: There have been several disjoint mentions of improving the 
> optimization APIs to make them more pluggable, extensible, etc.  This JIRA is 
> a place to discuss what API changes are necessary for the long term, and to 
> provide links to other relevant JIRAs.
> Eventually, I hope this leads to a design doc outlining:
> * current issues
> * requirements such as supporting many types of objective functions, 
> optimization algorithms, and parameters to those algorithms
> * ideal API
> * breakdown of smaller JIRAs needed to achieve that API
> I will soon create an initial design doc, and I will try to watch this JIRA 
> and include ideas from JIRA comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5256) Improving MLlib optimization APIs

2015-04-14 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494568#comment-14494568
 ] 

Alexander Ulanov edited comment on SPARK-5256 at 4/14/15 6:43 PM:
--

The size of data that warrants using Spark suggests that the learning 
algorithm will be limited by time rather than by data. According to the paper 
"The tradeoffs of large scale learning", SGD converges significantly faster 
than batch GD in this case. My use case is machine learning on large data, in 
particular time series. 

Just in case, a link to the paper: 
http://papers.nips.cc/paper/3323-the-tradeoffs-of-large-scale-learning.pdf


was (Author: avulanov):
The size of data that requires to use Spark suggests that learning algorithm 
will be limited by time versus data. According to the paper "The tradeoffs of 
large scale learning", SGD has significantly faster convergence than batch GD 
in this case. My use case is machine learning on large data, in particular, 
time series.

> Improving MLlib optimization APIs
> -
>
> Key: SPARK-5256
> URL: https://issues.apache.org/jira/browse/SPARK-5256
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> *Goal*: Improve APIs for optimization
> *Motivation*: There have been several disjoint mentions of improving the 
> optimization APIs to make them more pluggable, extensible, etc.  This JIRA is 
> a place to discuss what API changes are necessary for the long term, and to 
> provide links to other relevant JIRAs.
> Eventually, I hope this leads to a design doc outlining:
> * current issues
> * requirements such as supporting many types of objective functions, 
> optimization algorithms, and parameters to those algorithms
> * ideal API
> * breakdown of smaller JIRAs needed to achieve that API
> I will soon create an initial design doc, and I will try to watch this JIRA 
> and include ideas from JIRA comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs

2015-04-14 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494518#comment-14494518
 ] 

Alexander Ulanov commented on SPARK-5256:
-

Probably the main issue for MLlib is that iterative algorithms are implemented 
with the aggregate function. It has a fixed overhead of around half a second, 
which limits its applicability when one needs to run a large number of 
iterations, as is the case for the bigger data Spark is intended for. The 
problem gets worse with stochastic algorithms, because there is no good way to 
randomly pick data from an RDD; one has to scan through it sequentially.
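
A simplified sketch of the per-iteration aggregate pattern in question, loosely 
modeled on MLlib's GradientDescent (the `gradient` function and the `add` 
helper are hypothetical, for illustration only). Each iteration launches a 
separate Spark job over `data`, so the fixed per-job overhead is paid 
numIterations times:

{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical element-wise sum of two gradient arrays.
def add(a: Array[Double], b: Array[Double]): Array[Double] =
  a.zip(b).map { case (x, y) => x + y }

def runGD(
    data: RDD[(Double, Vector)],
    gradient: (Array[Double], Double, Vector) => Array[Double], // hypothetical
    init: Array[Double],
    stepSize: Double,
    numIterations: Int): Array[Double] = {
  var weights = init
  for (i <- 1 to numIterations) {
    val bcWeights = data.context.broadcast(weights)
    // One Spark job per iteration: this is where the fixed overhead comes from.
    val (gradSum, count) = data.aggregate((new Array[Double](weights.length), 0L))(
      { case ((acc, c), (label, features)) =>
          (add(acc, gradient(bcWeights.value, label, features)), c + 1) },
      { case ((a1, c1), (a2, c2)) => (add(a1, a2), c1 + c2) })
    weights = weights.zip(gradSum).map { case (w, g) => w - stepSize * g / count }
  }
  weights
}
{code}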

> Improving MLlib optimization APIs
> -
>
> Key: SPARK-5256
> URL: https://issues.apache.org/jira/browse/SPARK-5256
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> *Goal*: Improve APIs for optimization
> *Motivation*: There have been several disjoint mentions of improving the 
> optimization APIs to make them more pluggable, extensible, etc.  This JIRA is 
> a place to discuss what API changes are necessary for the long term, and to 
> provide links to other relevant JIRAs.
> Eventually, I hope this leads to a design doc outlining:
> * current issues
> * requirements such as supporting many types of objective functions, 
> optimization algorithms, and parameters to those algorithms
> * ideal API
> * breakdown of smaller JIRAs needed to achieve that API
> I will soon create an initial design doc, and I will try to watch this JIRA 
> and include ideas from JIRA comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java

2015-04-08 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485554#comment-14485554
 ] 

Alexander Ulanov commented on SPARK-6682:
-

[~yuu.ishik...@gmail.com] 
They reside in the package org.apache.spark.mllib.optimization: class 
LBFGS(private var gradient: Gradient, private var updater: Updater) and class 
GradientDescent private[mllib] (private var gradient: Gradient, private var 
updater: Updater). They extend the Optimizer trait, which has only one 
function: def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): 
Vector. This function is limited to only one type of input: vectors and their 
labels. I have submitted a separate issue regarding this: 
https://issues.apache.org/jira/browse/SPARK-5362. 

1. Right now the static methods work with hard-coded optimizers, such as 
LogisticRegressionWithSGD. This is not very convenient. I think that moving 
away from static methods and using builders implies that optimizers could also 
be set by users. This will be a problem because current optimizers require an 
Updater and a Gradient at creation time. 
2. The workaround I suggested in the previous post addresses this.


> Deprecate static train and use builder instead for Scala/Java
> -
>
> Key: SPARK-6682
> URL: https://issues.apache.org/jira/browse/SPARK-6682
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> In MLlib, we have for some time been unofficially moving away from the old 
> static train() methods and moving towards builder patterns.  This JIRA is to 
> discuss this move and (hopefully) make it official.
> "Old static train()" API:
> {code}
> val myModel = NaiveBayes.train(myData, ...)
> {code}
> "New builder pattern" API:
> {code}
> val nb = new NaiveBayes().setLambda(0.1)
> val myModel = nb.train(myData)
> {code}
> Pros of the builder pattern:
> * Much less code when algorithms have many parameters.  Since Java does not 
> support default arguments, we required *many* duplicated static train() 
> methods (for each prefix set of arguments).
> * Helps to enforce default parameters.  Users should ideally not have to even 
> think about setting parameters if they just want to try an algorithm quickly.
> * Matches spark.ml API
> Cons of the builder pattern:
> * In Python APIs, static train methods are more "Pythonic."
> Proposal:
> * Scala/Java: We should start deprecating the old static train() methods.  We 
> must keep them for API stability, but deprecating will help with API 
> consistency, making it clear that everyone should use the builder pattern.  
> As we deprecate them, we should make sure that the builder pattern supports 
> all parameters.
> * Python: Keep static train methods.
> CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java

2015-04-07 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14483729#comment-14483729
 ] 

Alexander Ulanov edited comment on SPARK-6682 at 4/7/15 6:35 PM:
-

This is a very good idea. Please note, though, that there are a few issues 
here:
1) Setting the optimizer: optimizers (LBFGS and SGD) take a Gradient and an 
Updater as constructor parameters. I don't think it is a good idea to force 
users to create the Gradient and Updater separately just to be able to create 
an Optimizer. So one has to explicitly implement methods like setLBFGSOptimizer 
or setSGDOptimizer and return them so that the user can set their parameters.

```
  def LBFGSOptimizer: LBFGS = {
    val lbfgs = new LBFGS(_gradient, _updater)
    optimizer = lbfgs
    lbfgs
  }
```

Another downside is that if someone implements a new Optimizer, one has to add 
a "setMyOptimizer" to the builder. The above problems might be solved by 
figuring out a better Optimizer interface that allows setting its parameters 
without actually creating it.

2) Setting parameters after setting the optimizer: what if the user sets the 
Updater after setting the Optimizer? The Optimizer takes the Updater as a 
constructor parameter! So one has to recreate the corresponding Optimizer.

```
  private[this] def updateGradient(gradient: Gradient): Unit = {
    optimizer match {
      case lbfgs: LBFGS => lbfgs.setGradient(gradient)
      case sgd: GradientDescent => sgd.setGradient(gradient)
      case other => throw new UnsupportedOperationException(
        s"Only LBFGS and GradientDescent are supported but got ${other.getClass}.")
    }
  }
```

So it is essential to work out the Optimizer interface first.
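
One hypothetical direction (purely illustrative, not an existing or proposed 
Spark API) would be an Optimizer trait whose dependencies are settable rather 
than constructor-bound, so a builder can configure them in any order:

{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.optimization.{Gradient, Updater}
import org.apache.spark.rdd.RDD

// Hypothetical: Gradient and Updater become settable properties, so the
// builder never has to recreate the optimizer when a parameter changes.
trait ConfigurableOptimizer {
  def setGradient(gradient: Gradient): this.type
  def setUpdater(updater: Updater): this.type
  def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector
}
{code}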


was (Author: avulanov):
This is a very good idea. Please note though, that there are few issues here
1) Setting optimizer: optimizers (LBFGS and SGD) have Gradient and Updater as 
constructor parameters. I don't think it is a good idea to force users to 
create Gradient and Updater separately and to be able to create Optimizer. So 
one have to explicitly implement methods like setLBFGSOptimizer or set 
SGDOptimizer and return them so the user will be able to set their parameters.

```
  def LBFGSOptimizer: LBFGS = {
val lbfgs = new LBFGS(_gradient, _updater)
optimizer = lbfgs
lbfgs
  }
```

 Another downside of it is that if someone implements new Optimizer then one 
have to add "setMyOptimizer" to the builder. The above problems might be solved 
by figuring out a better interface of Optimizer that allows setting its 
parameters without actually creating it.

2) Setting parameters after setting the optimizer: what if user sets the 
Updater after setting the Optimizer? Optimizer takes Updater as a constructor 
parameter! So one has to recreate the corresponding Optimizer.

```
  private[this] def updateGradient(gradient: Gradient): Unit = {
optimizer match {
  case lbfgs: LBFGS => lbfgs.setGradient(gradient)
  case sgd: GradientDescent => sgd.setGradient(gradient)
  case other => throw new UnsupportedOperationException(
s"Only LBFGS and GradientDescent are supported but got 
${other.getClass}.")
}
  }
```

So it is essential to work out the Optimizer interface first.

> Deprecate static train and use builder instead for Scala/Java
> -
>
> Key: SPARK-6682
> URL: https://issues.apache.org/jira/browse/SPARK-6682
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> In MLlib, we have for some time been unofficially moving away from the old 
> static train() methods and moving towards builder patterns.  This JIRA is to 
> discuss this move and (hopefully) make it official.
> "Old static train()" API:
> {code}
> val myModel = NaiveBayes.train(myData, ...)
> {code}
> "New builder pattern" API:
> {code}
> val nb = new NaiveBayes().setLambda(0.1)
> val myModel = nb.train(myData)
> {code}
> Pros of the builder pattern:
> * Much less code when algorithms have many parameters.  Since Java does not 
> support default arguments, we required *many* duplicated static train() 
> methods (for each prefix set of arguments).
> * Helps to enforce default parameters.  Users should ideally not have to even 
> think about setting parameters if they just want to try an algorithm quickly.
> * Matches spark.ml API
> Cons of the builder pattern:
> * In Python APIs, static train methods are more "Pythonic."
> Proposal:
> * Scala/Java: We should start deprecating the old static train() methods.  We 
> must keep them for API stability, but deprecating will help with API 
> consistency, making it clear that everyone should use the builder pattern.  
> As we deprecate them, we should make sure that the builder pattern supports 
> all parameters.
> * Python: Keep static train methods.
> CC: [~mengxr]

[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java

2015-04-07 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14483729#comment-14483729
 ] 

Alexander Ulanov commented on SPARK-6682:
-

This is a very good idea. Please note, though, that there are a few issues 
here:
1) Setting the optimizer: optimizers (LBFGS and SGD) take a Gradient and an 
Updater as constructor parameters. I don't think it is a good idea to force 
users to create the Gradient and Updater separately just to be able to create 
an Optimizer. So one has to explicitly implement methods like setLBFGSOptimizer 
or setSGDOptimizer and return them so that the user can set their parameters.

```
  def LBFGSOptimizer: LBFGS = {
    val lbfgs = new LBFGS(_gradient, _updater)
    optimizer = lbfgs
    lbfgs
  }
```

Another downside is that if someone implements a new Optimizer, one has to add 
a "setMyOptimizer" to the builder. The above problems might be solved by 
figuring out a better Optimizer interface that allows setting its parameters 
without actually creating it.

2) Setting parameters after setting the optimizer: what if the user sets the 
Updater after setting the Optimizer? The Optimizer takes the Updater as a 
constructor parameter! So one has to recreate the corresponding Optimizer.

```
  private[this] def updateGradient(gradient: Gradient): Unit = {
    optimizer match {
      case lbfgs: LBFGS => lbfgs.setGradient(gradient)
      case sgd: GradientDescent => sgd.setGradient(gradient)
      case other => throw new UnsupportedOperationException(
        s"Only LBFGS and GradientDescent are supported but got ${other.getClass}.")
    }
  }
```

So it is essential to work out the Optimizer interface first.

> Deprecate static train and use builder instead for Scala/Java
> -
>
> Key: SPARK-6682
> URL: https://issues.apache.org/jira/browse/SPARK-6682
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> In MLlib, we have for some time been unofficially moving away from the old 
> static train() methods and moving towards builder patterns.  This JIRA is to 
> discuss this move and (hopefully) make it official.
> "Old static train()" API:
> {code}
> val myModel = NaiveBayes.train(myData, ...)
> {code}
> "New builder pattern" API:
> {code}
> val nb = new NaiveBayes().setLambda(0.1)
> val myModel = nb.train(myData)
> {code}
> Pros of the builder pattern:
> * Much less code when algorithms have many parameters.  Since Java does not 
> support default arguments, we required *many* duplicated static train() 
> methods (for each prefix set of arguments).
> * Helps to enforce default parameters.  Users should ideally not have to even 
> think about setting parameters if they just want to try an algorithm quickly.
> * Matches spark.ml API
> Cons of the builder pattern:
> * In Python APIs, static train methods are more "Pythonic."
> Proposal:
> * Scala/Java: We should start deprecating the old static train() methods.  We 
> must keep them for API stability, but deprecating will help with API 
> consistency, making it clear that everyone should use the builder pattern.  
> As we deprecate them, we should make sure that the builder pattern supports 
> all parameters.
> * Python: Keep static train methods.
> CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop

2015-04-03 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395185#comment-14395185
 ] 

Alexander Ulanov commented on SPARK-2356:
-

The following worked for me:
Download http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe and 
put it into DISK:\FOLDERS\bin\
Set HADOOP_HOME=DISK:\FOLDERS (Hadoop's Shell resolves the binary as 
%HADOOP_HOME%\bin\winutils.exe)

> Exception: Could not locate executable null\bin\winutils.exe in the Hadoop 
> ---
>
> Key: SPARK-2356
> URL: https://issues.apache.org/jira/browse/SPARK-2356
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 1.0.0
>Reporter: Kostiantyn Kudriavtsev
>Priority: Critical
>
> I'm trying to run some transformation on Spark, it works fine on cluster 
> (YARN, linux machines). However, when I'm trying to run it on local machine 
> (Windows 7) under unit test, I got errors (I don't use Hadoop, I'm read file 
> from local filesystem):
> {code}
> 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the 
> hadoop binary path
> java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
> Hadoop binaries.
>   at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
>   at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
>   at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
>   at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
>   at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
>   at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
>   at 
> org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
>   at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
>   at 
> org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:97)
> {code}
> This happens because the Hadoop config is initialized each time a Spark 
> context is created, regardless of whether Hadoop is required or not.
> I propose to add a special flag to indicate whether the Hadoop config is 
> required (or to allow starting this configuration manually)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6673) spark-shell.cmd can't start even when spark was built in Windows

2015-04-03 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395001#comment-14395001
 ] 

Alexander Ulanov commented on SPARK-6673:
-

Probably a similar issue: I am trying to execute unit tests in MLlib with 
LocalClusterSparkContext on Windows 7. I am getting a bunch of errors in the 
log saying: "Cannot find any assembly build directories." If I set 
SPARK_SCALA_VERSION=2.10, then I get "No assemblies found in 
'C:\dev\spark\mllib\.\assembly\target\scala-2.10'"

> spark-shell.cmd can't start even when spark was built in Windows
> 
>
> Key: SPARK-6673
> URL: https://issues.apache.org/jira/browse/SPARK-6673
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 1.3.0
>Reporter: Masayoshi TSUZUKI
>Assignee: Masayoshi TSUZUKI
>Priority: Blocker
>
> spark-shell.cmd can't start.
> {code}
> bin\spark-shell.cmd --master local
> {code}
> will get
> {code}
> Failed to find Spark assembly JAR.
> You need to build Spark before running this program.
> {code}
> even when we have built Spark.
> This is because of the lack of the environment variable 
> {{SPARK_SCALA_VERSION}}, which is used in {{spark-class2.cmd}}.
> In the Linux scripts, this value is set to {{2.10}} or {{2.11}} by default in 
> {{load-spark-env.sh}}, but there is no equivalent script on Windows.
> As a workaround, by executing
> {code}
> set SPARK_SCALA_VERSION=2.10
> {code}
> before executing spark-shell.cmd, we can successfully start it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


