[jira] [Commented] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16003110#comment-16003110 ]

Alexander Ulanov commented on SPARK-10408:
-------------------------------------------

The autoencoder is implemented in the referenced pull request. I will be glad to follow up on the code review if anyone can do it.

> Autoencoder
> -----------
>
>                 Key: SPARK-10408
>                 URL: https://issues.apache.org/jira/browse/SPARK-10408
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 1.5.0
>            Reporter: Alexander Ulanov
>            Assignee: Alexander Ulanov
>
> Goal: Implement various types of autoencoders
> Requirements:
> 1) Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1], real in [-inf, +inf]
> 2) Sparse autoencoder, i.e. L1 regularization. It should be added as a feature to the MLP and then used here
> 3) Denoising autoencoder
> 4) Stacked autoencoder for pre-training of deep networks. It should support arbitrary network layers
> References:
> 1. Vincent, Pascal, et al. "Extracting and composing robust features with denoising autoencoders." Proceedings of the 25th International Conference on Machine Learning. ACM, 2008. http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf
> 2. http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf
> 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484&rep=rep1&type=pdf
> 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep networks." Advances in Neural Information Processing Systems 19 (2007): 153. http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf
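For readers unfamiliar with what such an estimator could look like in a Spark ML pipeline, here is a minimal usage sketch. The `Autoencoder` class, its parameters and the column names are illustrative assumptions only; they are not the API of the referenced pull request.

```scala
// Hypothetical sketch: the `Autoencoder` estimator and its setters are assumed
// names for illustration, not existing Spark or pull-request API.
val autoencoder = new Autoencoder()        // assumed estimator class
  .setLayers(Array(784, 32, 784))          // encoder/decoder sizes: input -> code -> reconstruction
  .setMaxIter(100)
  .setInputCol("features")
  .setOutputCol("encoding")

// Fitting would minimise reconstruction error; transform would append the learned
// low-dimensional "encoding" column that downstream pipeline stages can consume:
// val model   = autoencoder.fit(trainDF)
// val encoded = model.transform(testDF)
```

The point of the sketch is the pipeline shape: an autoencoder fits naturally as an unsupervised feature transformer, analogous to PCA in spark.ml.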
[jira] [Commented] (SPARK-7253) Add example of belief propagation with GraphX
[ https://issues.apache.org/jira/browse/SPARK-7253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15746841#comment-15746841 ]

Alexander Ulanov commented on SPARK-7253:
------------------------------------------

Here is an implementation of the belief propagation algorithm for factor graphs, with examples: https://github.com/HewlettPackard/sandpiper

> Add example of belief propagation with GraphX
> ----------------------------------------------
>
>                 Key: SPARK-7253
>                 URL: https://issues.apache.org/jira/browse/SPARK-7253
>             Project: Spark
>          Issue Type: New Feature
>          Components: GraphX
>            Reporter: Joseph K. Bradley
>
> It would be nice to document (via an example) how to use GraphX to do belief propagation. It's probably too much right now to talk about a full-fledged graphical model library (and that would belong in MLlib anyway), but a simple example of a graphical model + BP would be nice to add to GraphX.
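To give a feel for how GraphX primitives map onto message passing, here is a deliberately simplified sketch in the spirit of loopy belief propagation on a binary pairwise model. It assumes a single shared potential for all edges and does not divide out the recipient's previous message, as a faithful sum-product implementation (such as the sandpiper code linked above) would.

```scala
import org.apache.spark.graphx._

object BeliefPropagationSketch {
  // Vertex state: unnormalised belief over the two states of a binary variable.
  type Belief = Array[Double]

  // Shared pairwise potential psi(x_i, x_j): favours equal neighbouring states.
  private val psi = Array(Array(0.9, 0.1), Array(0.1, 0.9))

  def run(graph: Graph[Belief, Unit], iterations: Int): Graph[Belief, Unit] = {
    var g = graph
    for (_ <- 1 to iterations) {
      // Every vertex pushes its belief through the potential to both neighbours;
      // messages arriving at a vertex are combined component-wise by product.
      val messages: VertexRDD[Belief] = g.aggregateMessages[Belief](
        ctx => {
          def msg(from: Belief): Belief = Array(
            psi(0)(0) * from(0) + psi(1)(0) * from(1),
            psi(0)(1) * from(0) + psi(1)(1) * from(1))
          ctx.sendToDst(msg(ctx.srcAttr))
          ctx.sendToSrc(msg(ctx.dstAttr))
        },
        (a, b) => Array(a(0) * b(0), a(1) * b(1)))

      // New belief = old belief * product of incoming messages, renormalised.
      g = g.joinVertices(messages) { (_, prior, incoming) =>
        val raw = Array(prior(0) * incoming(0), prior(1) * incoming(1))
        val z = raw.sum
        Array(raw(0) / z, raw(1) / z)
      }
    }
    g
  }
}
```

A real factor-graph version would carry per-edge potentials and per-direction messages in the edge attributes, but the iterate-aggregate-join structure stays the same.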
[jira] [Commented] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong
[ https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566467#comment-15566467 ]

Alexander Ulanov commented on SPARK-17870:
-------------------------------------------

[`SelectKBest`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest) works with "a Function taking two arrays X and y, and returning a pair of arrays (scores, pvalues) or a single array with scores". According to what you observe, it uses the p-values to sort the `chi2` outputs. Indeed, this is the case for all functions that return two arrays: https://github.com/scikit-learn/scikit-learn/blob/412996f/sklearn/feature_selection/univariate_selection.py#L331. Alternatively, one can use the raw `chi2` scores for sorting; then only the first array returned by `chi2` needs to be passed to `SelectKBest`. As far as I remember, using raw chi2 scores is the default in Weka's [ChiSquaredAttributeEval](http://weka.sourceforge.net/doc.stable/weka/attributeSelection/ChiSquaredAttributeEval.html). So I would not claim that either approach is incorrect. According to [Introduction to IR](http://nlp.stanford.edu/IR-book/html/htmledition/assessing-as-a-feature-selection-methodassessing-chi-square-as-a-feature-selection-method-1.html), there might be an issue with computing p-values, because the chi-squared test is then applied multiple times. Using plain chi2 values does not involve a statistical test, so it can be treated as just a ranking with no statistical implications.

> ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong
> -------------------------------------------------------------------------
>
>                 Key: SPARK-17870
>                 URL: https://issues.apache.org/jira/browse/SPARK-17870
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>            Reporter: Peng Meng
>            Priority: Critical
>
> The method used to compute the ChiSquareTestResult in mllib/feature/ChiSqSelector.scala (line 233) is wrong.
> The feature selection method ChiSquareSelector selects features based on ChiSquareTestResult.statistic (the chi-square value): it keeps the features with the largest chi-square values. But the degrees of freedom (df) of the chi-square values returned by Statistics.chiSqTest(RDD) differ across features, and with different df you cannot select features based on the chi-square value alone.
> Because of this, the feature selection results are strange. Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If selectKBest is used: feature 3 is selected.
> If selectFpr is used: features 1 and 2 are selected.
> This is strange. I used scikit-learn to test the same data with the same parameters:
> With selectKBest: feature 1 is selected.
> With selectFpr: features 1 and 2 are selected.
> This result makes sense, because the df of each feature in scikit-learn is the same.
> I plan to submit a PR for this problem.
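To make the two rankings concrete, here is a small Spark (Scala) sketch that computes per-feature chi-squared results and orders features by raw statistic versus by p-value. The tiny dataset and the `sc` SparkContext (as in spark-shell) are assumed for illustration.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

// Toy data: labels with three categorical-valued features.
val data: RDD[LabeledPoint] = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(8.0, 7.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 9.0, 6.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 9.0, 8.0)),
  LabeledPoint(2.0, Vectors.dense(8.0, 9.0, 5.0))))

// One ChiSqTestResult per feature, carrying statistic, pValue and degreesOfFreedom.
val results = Statistics.chiSqTest(data)

// Ranking by raw statistic (Weka-style) vs. by p-value (what SelectKBest + chi2 does).
val byStatistic = results.zipWithIndex.sortBy { case (r, _) => -r.statistic }.map(_._2)
val byPValue    = results.zipWithIndex.sortBy { case (r, _) => r.pValue }.map(_._2)

println(s"ranked by statistic: ${byStatistic.mkString(", ")}")
println(s"ranked by p-value:   ${byPValue.mkString(", ")}")
// The two orderings can disagree when features have different degrees of freedom,
// which is exactly the concern raised in this issue.
```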
[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15536321#comment-15536321 ]

Alexander Ulanov commented on SPARK-5575:
------------------------------------------

I recently released a package that provides the new features that are not yet merged into Spark: https://spark-packages.org/package/avulanov/scalable-deeplearning

> Artificial neural networks for MLlib deep learning
> ----------------------------------------------------
>
>                 Key: SPARK-5575
>                 URL: https://issues.apache.org/jira/browse/SPARK-5575
>             Project: Spark
>          Issue Type: Umbrella
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: Alexander Ulanov
>
> *Goal:* Implement various types of artificial neural networks
>
> *Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581) Having deep learning within Spark's ML library is a question of convenience. Spark has broad analytic capabilities and it is useful to have deep learning as one of these tools at hand. Deep learning is a model of choice for several important modern use-cases, and Spark ML might want to cover them. After all, it is hard to explain why we have PCA in ML but don't provide an autoencoder. To summarize, Spark should have at least the most widely used deep learning models, such as the fully connected artificial neural network, the convolutional network and the autoencoder. Advanced and experimental deep learning features might reside within packages or as pluggable external tools. These three will provide a comprehensive deep learning set for Spark ML. We might also include recurrent networks.
>
> *Requirements:*
> # Extensible API compatible with Spark ML. Basic abstractions such as Neuron, Layer, Error, Regularization, Forward and Backpropagation etc. should be implemented as traits or interfaces, so they can be easily extended or reused. Define the Spark ML API for deep learning. This interface is similar to the other analytics tools in Spark and supports ML pipelines. This makes deep learning easy to use and plug into analytics workloads for Spark users.
> # Efficiency. The current implementation of the multilayer perceptron in Spark is less than 2x slower than Caffe, both measured on CPU. The main overhead sources are the JVM and Spark's communication layer. For more details, please refer to https://github.com/avulanov/ann-benchmark. Having said that, an efficient implementation of deep learning in Spark should be only a few times slower than a specialized tool. This is very reasonable for a platform that does much more than deep learning, and I believe it is understood by the community.
> # Scalability. Implement efficient distributed training. It relies heavily on efficient communication and scheduling mechanisms. The default implementation is based on Spark. More efficient implementations might include external libraries but use the same defined interface.
>
> *Main features:*
> # Multilayer perceptron classifier (MLP)
> # Autoencoder
> # Convolutional neural networks for computer vision. The interface has to provide a few architectures for deep learning that are widely used in practice, such as AlexNet
>
> *Additional features:*
> # Other architectures, such as the Recurrent neural network (RNN), Long short-term memory (LSTM), Restricted Boltzmann machine (RBM), deep belief network (DBN), and MLP multivariate regression
> # Regularizers, such as L1, L2, drop-out
> # Normalizers
> # Network customization. The internal API of Spark ANN is designed to be flexible and can handle different types of layers. However, only a part of the API is made public. We have to limit the number of public classes in order to make it simpler to support other languages. This forces us to use (String or Number) parameters instead of introducing new public classes. One of the options for specifying the architecture of an ANN is a text configuration with a layer-wise description. We have considered using the Caffe format for this. It gives the benefit of compatibility with a well-known deep learning tool and simplifies the support of other languages in Spark. Implementation of a parser for a subset of the Caffe format might be the first step towards the support of general ANN architectures in Spark.
> # Hardware-specific optimization. One can wrap other deep learning implementations with this interface, allowing users to pick a particular back-end, e.g. Caffe or TensorFlow, along with the default one. The interface has to provide a few architectures for deep learning that are widely used in practice, such as AlexNet. The main motivation for using specialized libraries for deep learning would be to take full advantage of the hardware where Spark runs, in particular GPUs. Having the default interface in
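For context, the part of this work that is already merged can be used today through the standard spark.ml pipeline API. The snippet below mirrors the stock multilayer perceptron example; it assumes a Spark 2.x `SparkSession` named `spark` and the sample LibSVM file shipped with the Spark distribution.

```scala
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Multiclass sample data: 4 features, 3 classes.
val data = spark.read.format("libsvm")
  .load("data/mllib/sample_multiclass_classification_data.txt")
val Array(train, test) = data.randomSplit(Array(0.7, 0.3), seed = 1234L)

// Layers: 4 inputs, two hidden layers of 5 and 4 units, 3 output classes.
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 5, 4, 3))
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)

val model = mlp.fit(train)
val predictions = model.transform(test)

val accuracy = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy")
  .evaluate(predictions)
println(s"Test accuracy = $accuracy")
```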
[jira] [Updated] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Ulanov updated SPARK-5575:
------------------------------------
    Description:

*Goal:* Implement various types of artificial neural networks

*Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581) Having deep learning within Spark's ML library is a question of convenience. Spark has broad analytic capabilities and it is useful to have deep learning as one of these tools at hand. Deep learning is a model of choice for several important modern use-cases, and Spark ML might want to cover them. After all, it is hard to explain why we have PCA in ML but don't provide an autoencoder. To summarize, Spark should have at least the most widely used deep learning models, such as the fully connected artificial neural network, the convolutional network and the autoencoder. Advanced and experimental deep learning features might reside within packages or as pluggable external tools. These three will provide a comprehensive deep learning set for Spark ML. We might also include recurrent networks.

*Requirements:*
# Extensible API compatible with Spark ML. Basic abstractions such as Neuron, Layer, Error, Regularization, Forward and Backpropagation etc. should be implemented as traits or interfaces, so they can be easily extended or reused. Define the Spark ML API for deep learning. This interface is similar to the other analytics tools in Spark and supports ML pipelines. This makes deep learning easy to use and plug into analytics workloads for Spark users.
# Efficiency. The current implementation of the multilayer perceptron in Spark is less than 2x slower than Caffe, both measured on CPU. The main overhead sources are the JVM and Spark's communication layer. For more details, please refer to https://github.com/avulanov/ann-benchmark. Having said that, an efficient implementation of deep learning in Spark should be only a few times slower than a specialized tool. This is very reasonable for a platform that does much more than deep learning, and I believe it is understood by the community.
# Scalability. Implement efficient distributed training. It relies heavily on efficient communication and scheduling mechanisms. The default implementation is based on Spark. More efficient implementations might include external libraries but use the same defined interface.

*Main features:*
# Multilayer perceptron classifier (MLP)
# Autoencoder
# Convolutional neural networks for computer vision. The interface has to provide a few architectures for deep learning that are widely used in practice, such as AlexNet

*Additional features:*
# Other architectures, such as the Recurrent neural network (RNN), Long short-term memory (LSTM), Restricted Boltzmann machine (RBM), deep belief network (DBN), and MLP multivariate regression
# Regularizers, such as L1, L2, drop-out
# Normalizers
# Network customization. The internal API of Spark ANN is designed to be flexible and can handle different types of layers. However, only a part of the API is made public. We have to limit the number of public classes in order to make it simpler to support other languages. This forces us to use (String or Number) parameters instead of introducing new public classes. One of the options for specifying the architecture of an ANN is a text configuration with a layer-wise description. We have considered using the Caffe format for this. It gives the benefit of compatibility with a well-known deep learning tool and simplifies the support of other languages in Spark. Implementation of a parser for a subset of the Caffe format might be the first step towards the support of general ANN architectures in Spark.
# Hardware-specific optimization. One can wrap other deep learning implementations with this interface, allowing users to pick a particular back-end, e.g. Caffe or TensorFlow, along with the default one. The interface has to provide a few architectures for deep learning that are widely used in practice, such as AlexNet. The main motivation for using specialized libraries for deep learning would be to take full advantage of the hardware where Spark runs, in particular GPUs. Having the default interface in Spark, we will need to wrap only a subset of functions from a given specialized library. It does require an effort, but it is not the same as wrapping all functions. Wrappers can be provided as packages without the need to pull new dependencies into Spark.

*Completed (merged to the main Spark branch):*
* Requirements: https://issues.apache.org/jira/browse/SPARK-9471
** API: https://spark-summit.org/eu-2015/events/a-scalable-implementation-of-deep-learning-on-spark/
** Efficiency & Scalability: https://github.com/avulanov/ann-benchmark
* Features:
** Multilayer perceptron classifier https://issues.apache.org/jira/browse/SPARK-9471

*In progress (pull request):*
* Features:
**
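As a rough illustration of the back-end wrapping idea in the last item of the description above, the traits below are hypothetical (they do not exist in Spark); they only sketch how the default implementation and an external library could sit behind one interface.

```scala
import org.apache.spark.ml.linalg.Vector

// Hypothetical sketch -- none of these traits are Spark API.

// Minimal model contract: forward pass only; training state lives in the back-end.
trait LayerModel extends Serializable {
  def forward(input: Vector): Vector
}

// A back-end turns a layer specification plus (input, target) data into a trained model.
// The default back-end would reuse Spark's existing ANN code; a GPU-oriented wrapper
// (Caffe, TensorFlow, ...) would implement the same trait and be selected by name,
// leaving user-facing pipelines unchanged.
trait TopologyBackend {
  def name: String
  def fit(layers: Array[Int], data: Iterator[(Vector, Vector)]): LayerModel
}
```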
[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15395489#comment-15395489 ]

Alexander Ulanov commented on SPARK-15581:
-------------------------------------------

[~bordaw] Sounds great! Just in case, I have summarized the above discussion related to the DNN in the main DNN JIRA: https://issues.apache.org/jira/browse/SPARK-5575

> MLlib 2.1 Roadmap
> -----------------
>
>                 Key: SPARK-15581
>                 URL: https://issues.apache.org/jira/browse/SPARK-15581
>             Project: Spark
>          Issue Type: Umbrella
>          Components: ML, MLlib
>            Reporter: Joseph K. Bradley
>            Priority: Blocker
>              Labels: roadmap
>
> This is a master list of the MLlib improvements we are working on for the next release. Please view this as a wish list rather than a definite plan, for we don't have an accurate estimate of available resources. Due to limited review bandwidth, features appearing on this list will get higher priority during code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delays in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get the JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there is no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another.
> * Remember to add the `@Since("VERSION")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps to improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link them properly.
> * Add a "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs, please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and documentation if applicable.
> h1. Roadmap (*WIP*)
> This is NOT [a complete list of MLlib JIRAs for 2.1|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority]. We only include umbrella JIRAs and high-level tasks.
> Major efforts in this release:
> * Feature parity for the DataFrames-based API (`spark.ml`), relative to the RDD-based API
> * ML persistence
> * Python API feature parity and test coverage
> * R API expansion and improvements
> * Note about new features: As usual, we expect to expand the feature set of MLlib. However, we will prioritize API parity, bug fixes, and improvements over new features.
> Note that `spark.mllib` is in maintenance mode now. We will accept bug fixes for it, but new features, APIs, and improvements will only be added to `spark.ml`.
> h2. Critical feature parity in the DataFrame-based API
> * Umbrella JIRA: [SPARK-4591]
> h2. Persistence
> * Complete persistence within MLlib
> ** Python tuning (SPARK-13786)
> * MLlib in R format: compatibility with other languages (SPARK-15572)
> * Impose backwards compatibility for persistence (SPARK-15573)
> h2. Python API
> * Standardize unit tests for Scala and Python to improve and consolidate test coverage for Params, persistence, and other common functionality (SPARK-15571)
> * Improve Python API handling of Params, persistence (SPARK-14771) (SPARK-14706)
> ** Note: The linked JIRAs for this are incomplete. More to be created...
> ** Related: Implement Python meta-algorithms in Scala (to simplify persistence) (SPARK-15574)
> * Feature parity: The main goal of
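As a reminder of the annotation mentioned in the contributor instructions above, this is how `@Since` is typically attached to new public APIs inside the Spark code base itself (the class name and version number here are made up for illustration; the annotation is intended for code within the Spark project).

```scala
import org.apache.spark.annotation.Since

// Illustrative only: a new public class and method annotated with the release
// in which they first appear.
@Since("2.1.0")
class ExampleSelector {

  @Since("2.1.0")
  def setThreshold(value: Double): this.type = {
    // parameter handling omitted in this sketch
    this
  }
}
```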
[jira] [Updated] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Ulanov updated SPARK-5575:
------------------------------------
    Description:

*Goal:* Implement various types of artificial neural networks

*Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581) Having deep learning within Spark's ML library is a question of convenience. Spark has broad analytic capabilities and it is useful to have deep learning as one of these tools at hand. Deep learning is a model of choice for several important modern use-cases, and Spark ML might want to cover them. After all, it is hard to explain why we have PCA in ML but don't provide an autoencoder. To summarize, Spark should have at least the most widely used deep learning models, such as the fully connected artificial neural network, the convolutional network and the autoencoder. Advanced and experimental deep learning features might reside within packages or as pluggable external tools. These three will provide a comprehensive deep learning set for Spark ML. We might also include recurrent networks.

*Requirements:*
# Extensible API compatible with Spark ML. Basic abstractions such as Neuron, Layer, Error, Regularization, Forward and Backpropagation etc. should be implemented as traits or interfaces, so they can be easily extended or reused. Define the Spark ML API for deep learning. This interface is similar to the other analytics tools in Spark and supports ML pipelines. This makes deep learning easy to use and plug into analytics workloads for Spark users.
# Efficiency. The current implementation of the multilayer perceptron in Spark is less than 2x slower than Caffe, both measured on CPU. The main overhead sources are the JVM and Spark's communication layer. For more details, please refer to https://github.com/avulanov/ann-benchmark. Having said that, an efficient implementation of deep learning in Spark should be only a few times slower than a specialized tool. This is very reasonable for a platform that does much more than deep learning, and I believe it is understood by the community.
# Scalability. Implement efficient distributed training. It relies heavily on efficient communication and scheduling mechanisms. The default implementation is based on Spark. More efficient implementations might include external libraries but use the same defined interface.

*Main features:*
# Multilayer perceptron classifier (MLP)
# Autoencoder
# Convolutional neural networks for computer vision. The interface has to provide a few architectures for deep learning that are widely used in practice, such as AlexNet

*Additional features:*
# Other architectures, such as the Recurrent neural network (RNN), Long short-term memory (LSTM), Restricted Boltzmann machine (RBM), deep belief network (DBN), and MLP multivariate regression
# Regularizers, such as L1, L2, drop-out
# Normalizers
# Network customization. The internal API of Spark ANN is designed to be flexible and can handle different types of layers. However, only a part of the API is made public. We have to limit the number of public classes in order to make it simpler to support other languages. This forces us to use (String or Number) parameters instead of introducing new public classes. One of the options for specifying the architecture of an ANN is a text configuration with a layer-wise description. We have considered using the Caffe format for this. It gives the benefit of compatibility with a well-known deep learning tool and simplifies the support of other languages in Spark. Implementation of a parser for a subset of the Caffe format might be the first step towards the support of general ANN architectures in Spark.
# Hardware-specific optimization. One can wrap other deep learning implementations with this interface, allowing users to pick a particular back-end, e.g. Caffe or TensorFlow, along with the default one. The interface has to provide a few architectures for deep learning that are widely used in practice, such as AlexNet. The main motivation for using specialized libraries for deep learning would be to take full advantage of the hardware where Spark runs, in particular GPUs. Having the default interface in Spark, we will need to wrap only a subset of functions from a given specialized library. It does require an effort, but it is not the same as wrapping all functions. Wrappers can be provided as packages without the need to pull new dependencies into Spark.

*Progress:*
# Requirements: done
# Features:
## Multilayer perceptron classifier

  was:

*Goal:* Implement various types of artificial neural networks

*Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581) Having deep learning within Spark's ML library is a question of convenience. Spark has broad analytic capabilities and it is useful to have deep learning as one of these tools at hand. Deep learning is
[jira] [Updated] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Ulanov updated SPARK-5575:
------------------------------------
    Description:

*Goal:* Implement various types of artificial neural networks

*Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581) Having deep learning within Spark's ML library is a question of convenience. Spark has broad analytic capabilities and it is useful to have deep learning as one of these tools at hand. Deep learning is a model of choice for several important modern use-cases, and Spark ML might want to cover them. After all, it is hard to explain why we have PCA in ML but don't provide an autoencoder. To summarize, Spark should have at least the most widely used deep learning models, such as the fully connected artificial neural network, the convolutional network and the autoencoder. Advanced and experimental deep learning features might reside within packages or as pluggable external tools. These three will provide a comprehensive deep learning set for Spark ML. We might also include recurrent networks.

*Requirements:*
# Extensible API compatible with Spark ML. Basic abstractions such as Neuron, Layer, Error, Regularization, Forward and Backpropagation etc. should be implemented as traits or interfaces, so they can be easily extended or reused. Define the Spark ML API for deep learning. This interface is similar to the other analytics tools in Spark and supports ML pipelines. This makes deep learning easy to use and plug into analytics workloads for Spark users.
# Efficiency. The current implementation of the multilayer perceptron in Spark is less than 2x slower than Caffe, both measured on CPU. The main overhead sources are the JVM and Spark's communication layer. For more details, please refer to https://github.com/avulanov/ann-benchmark. Having said that, an efficient implementation of deep learning in Spark should be only a few times slower than a specialized tool. This is very reasonable for a platform that does much more than deep learning, and I believe it is understood by the community.
# Scalability. Implement efficient distributed training. It relies heavily on efficient communication and scheduling mechanisms. The default implementation is based on Spark. More efficient implementations might include external libraries but use the same defined interface.

*Main features:*
# Multilayer perceptron classifier (MLP)
# Autoencoder
# Convolutional neural networks for computer vision. The interface has to provide a few architectures for deep learning that are widely used in practice, such as AlexNet

*Additional features:*
# Other architectures, such as the Recurrent neural network (RNN), Long short-term memory (LSTM), Restricted Boltzmann machine (RBM), deep belief network (DBN), and MLP multivariate regression
# Regularizers, such as L1, L2, drop-out
# Normalizers
# Network customization. The internal API of Spark ANN is designed to be flexible and can handle different types of layers. However, only a part of the API is made public. We have to limit the number of public classes in order to make it simpler to support other languages. This forces us to use (String or Number) parameters instead of introducing new public classes. One of the options for specifying the architecture of an ANN is a text configuration with a layer-wise description. We have considered using the Caffe format for this. It gives the benefit of compatibility with a well-known deep learning tool and simplifies the support of other languages in Spark. Implementation of a parser for a subset of the Caffe format might be the first step towards the support of general ANN architectures in Spark.
# Hardware-specific optimization. One can wrap other deep learning implementations with this interface, allowing users to pick a particular back-end, e.g. Caffe or TensorFlow, along with the default one. The interface has to provide a few architectures for deep learning that are widely used in practice, such as AlexNet. The main motivation for using specialized libraries for deep learning would be to take full advantage of the hardware where Spark runs, in particular GPUs. Having the default interface in Spark, we will need to wrap only a subset of functions from a given specialized library. It does require an effort, but it is not the same as wrapping all functions. Wrappers can be provided as packages without the need to pull new dependencies into Spark.

  was:

*Goal:* Implement various types of artificial neural networks

*Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581) Having deep learning within Spark's ML library is a question of convenience. Spark has broad analytic capabilities and it is useful to have deep learning as one of these tools at hand. Deep learning is a model of choice for several important modern use-cases, and Spark ML might want
[jira] [Updated] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Ulanov updated SPARK-5575:
------------------------------------
    Description:

*Goal:* Implement various types of artificial neural networks

*Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581) Having deep learning within Spark's ML library is a question of convenience. Spark has broad analytic capabilities and it is useful to have deep learning as one of these tools at hand. Deep learning is a model of choice for several important modern use-cases, and Spark ML might want to cover them. After all, it is hard to explain why we have PCA in ML but don't provide an autoencoder. To summarize, Spark should have at least the most widely used deep learning models, such as the fully connected artificial neural network, the convolutional network and the autoencoder. Advanced and experimental deep learning features might reside within packages or as pluggable external tools. These three will provide a comprehensive deep learning set for Spark ML. We might also include recurrent networks.

*Requirements:*
# Extensible API compatible with Spark ML. Basic abstractions such as Neuron, Layer, Error, Regularization, Forward and Backpropagation etc. should be implemented as traits or interfaces, so they can be easily extended or reused. Define the Spark ML API for deep learning. This interface is similar to the other analytics tools in Spark and supports ML pipelines. This makes deep learning easy to use and plug into analytics workloads for Spark users.
# Efficiency. The current implementation of the multilayer perceptron in Spark is less than 2x slower than Caffe, both measured on CPU. The main overhead sources are the JVM and Spark's communication layer. For more details, please refer to https://github.com/avulanov/ann-benchmark. Having said that, an efficient implementation of deep learning in Spark should be only a few times slower than a specialized tool. This is very reasonable for a platform that does much more than deep learning, and I believe it is understood by the community.
# Scalability. Implement efficient distributed training. It relies heavily on efficient communication and scheduling mechanisms. The default implementation is based on Spark. More efficient implementations might include external libraries but use the same defined interface.

*Main features:*
# Multilayer perceptron.
# Autoencoder
# Convolutional neural networks. The interface has to provide a few architectures for deep learning that are widely used in practice, such as AlexNet.

*Additional features:* (lower priority)
# The internal API of Spark ANN is designed to be flexible and can handle different types of layers. However, only a part of the API is made public. We have to limit the number of public classes in order to make it simpler to support other languages. This forces us to use (String or Number) parameters instead of introducing new public classes. One of the options for specifying the architecture of an ANN is a text configuration with a layer-wise description. We have considered using the Caffe format for this. It gives the benefit of compatibility with a well-known deep learning tool and simplifies the support of other languages in Spark. Implementation of a parser for a subset of the Caffe format might be the first step towards the support of general ANN architectures in Spark.
# Hardware-specific optimization. One can wrap other deep learning implementations with this interface, allowing users to pick a particular back-end, e.g. Caffe or TensorFlow, along with the default one. The interface has to provide a few architectures for deep learning that are widely used in practice, such as AlexNet. The main motivation for using specialized libraries for deep learning would be to take full advantage of the hardware where Spark runs, in particular GPUs. Having the default interface in Spark, we will need to wrap only a subset of functions from a given specialized library. It does require an effort, but it is not the same as wrapping all functions. Wrappers can be provided as packages without the need to pull new dependencies into Spark.

*Requirements:*
1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward and Backpropagation etc. should be implemented as traits or interfaces, so they can be easily extended or reused
2) Implement complex abstractions, such as feed forward and recurrent networks
3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), autoencoder (sparse and denoising), stacked autoencoder, restricted Boltzmann machines (RBM), deep belief networks (DBN) etc.
4) Implement or reuse supporting constructs, such as classifiers, normalizers, poolers, etc.

  was:

*Goal:* Implement various types of artificial neural networks

*Motivation:* (from https://issues.apache.org/jira/b
[jira] [Updated] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Ulanov updated SPARK-5575:
------------------------------------
    Description:

*Goal:* Implement various types of artificial neural networks

*Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581) Having deep learning within Spark's ML library is a question of convenience. Spark has broad analytic capabilities and it is useful to have deep learning as one of these tools at hand. Deep learning is a model of choice for several important modern use-cases, and Spark ML might want to cover them. After all, it is hard to explain why we have PCA in ML but don't provide an autoencoder. To summarize, Spark should have at least the most widely used deep learning models, such as the fully connected artificial neural network, the convolutional network and the autoencoder. Advanced and experimental deep learning features might reside within packages or as pluggable external tools. These three will provide a comprehensive deep learning set for Spark ML. We might also include recurrent networks.

*Requirements:*
# Implement an extensible API compatible with Spark ML. Basic abstractions such as Neuron, Layer, Error, Regularization, Forward and Backpropagation etc. should be implemented as traits or interfaces, so they can be easily extended or reused.
# Performance. The current implementation of the multilayer perceptron in Spark is less than 2x slower than Caffe, both measured on CPU. The main overhead sources are the JVM and Spark's communication layer. For more details, please refer to https://github.com/avulanov/ann-benchmark. Having said that, an efficient implementation of deep learning in Spark should be only a few times slower than a specialized tool. This is very reasonable for a platform that does much more than deep learning, and I believe it is understood by the community.
# Implement efficient distributed training. It relies heavily on efficient communication and scheduling mechanisms. The default implementation is based on Spark. More efficient implementations might include external libraries but use the same defined interface.

The additional benefit of implementing deep learning for Spark is that we define the Spark ML API for deep learning. This interface is similar to the other analytics tools in Spark and supports ML pipelines. This makes deep learning easy to use and plug into analytics workloads for Spark users. One can wrap other deep learning implementations with this interface, allowing users to pick a particular back-end, e.g. Caffe or TensorFlow, along with the default one. The interface has to provide a few architectures for deep learning that are widely used in practice, such as AlexNet. The main motivation for using specialized libraries for deep learning would be to take full advantage of the hardware where Spark runs, in particular GPUs. Having the default interface in Spark, we will need to wrap only a subset of functions from a given specialized library. It does require an effort, but it is not the same as wrapping all functions. Wrappers can be provided as packages without the need to pull new dependencies into Spark.

*Requirements:*
1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward and Backpropagation etc. should be implemented as traits or interfaces, so they can be easily extended or reused
2) Implement complex abstractions, such as feed forward and recurrent networks
3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), autoencoder (sparse and denoising), stacked autoencoder, restricted Boltzmann machines (RBM), deep belief networks (DBN) etc.
4) Implement or reuse supporting constructs, such as classifiers, normalizers, poolers, etc.

  was:

Goal: Implement various types of artificial neural networks

Motivation: deep learning trend

Requirements:
1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward and Backpropagation etc. should be implemented as traits or interfaces, so they can be easily extended or reused
2) Implement complex abstractions, such as feed forward and recurrent networks
3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), autoencoder (sparse and denoising), stacked autoencoder, restricted Boltzmann machines (RBM), deep belief networks (DBN) etc.
4) Implement or reuse supporting constructs, such as classifiers, normalizers, poolers, etc.

> Artificial neural networks for MLlib deep learning
> ----------------------------------------------------
>
>                 Key: SPARK-5575
>                 URL: https://issues.apache.org/jira/browse/SPARK-5575
>             Project: Spark
>          Issue Type: Umbrella
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: Alexander Ulanov
>
> *Goal:* Implement various types of artificial neural netwo
[jira] [Commented] (SPARK-9120) Add multivariate regression (or prediction) interface
[ https://issues.apache.org/jira/browse/SPARK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15395434#comment-15395434 ]

Alexander Ulanov commented on SPARK-9120:
------------------------------------------

Thanks for the comment. RegressionModel indeed does not extend that trait. However, it is designed to handle one output variable, as mentioned in the description. This prevents its use in multivariate regression.

> Add multivariate regression (or prediction) interface
> -------------------------------------------------------
>
>                 Key: SPARK-9120
>                 URL: https://issues.apache.org/jira/browse/SPARK-9120
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 1.4.0
>            Reporter: Alexander Ulanov
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> org.apache.spark.ml.regression.RegressionModel supports prediction only for a single variable with the method "predict:Double", by extending the Predictor. There is a need for multivariate prediction, at least for regression. I propose to modify the "RegressionModel" interface similarly to how it is done in "ClassificationModel", which supports multiclass classification: it has "predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" should have something like "predictMultivariate:Vector".
> Update: After reading the design docs, adding "predictMultivariate" to RegressionModel does not seem reasonable to me anymore. The issue is as follows. RegressionModel has "predict:Double". Its "train" method uses "predict:Double" for prediction, i.e. PredictionModel (and RegressionModel) is hard-coded to have only one output. A similar problem exists in MLlib (https://issues.apache.org/jira/browse/SPARK-5362).
> A possible solution might require redesigning the class hierarchy or adding a separate interface that extends the model, though the latter means code duplication.
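To illustrate the interface question, the sketch below contrasts the current single-output contract with a hypothetical vector-valued one, analogous to how ClassificationModel exposes "predictRaw:Vector". None of these traits are existing Spark API; they are simplified stand-ins for discussion.

```scala
import org.apache.spark.ml.linalg.Vector

// Simplified stand-in for today's contract: one Double per row.
trait UnivariateRegressionModel {
  def predict(features: Vector): Double
}

// Hypothetical addition: a Vector of outputs per row, mirroring how
// ClassificationModel complements predict:Double with predictRaw:Vector.
trait MultivariateRegressionModel extends UnivariateRegressionModel {
  def predictMultivariate(features: Vector): Vector

  // A univariate prediction can default to the first component of the
  // multivariate one, keeping the old contract usable.
  override def predict(features: Vector): Double = predictMultivariate(features)(0)
}
```

The difficulty noted in the description remains: PredictionModel's training path is hard-coded around a single Double output, so a clean solution likely needs a class-hierarchy change rather than just an extra method.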
[jira] [Updated] (SPARK-9120) Add multivariate regression (or prediction) interface
[ https://issues.apache.org/jira/browse/SPARK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-9120: Description: org.apache.spark.ml.regression.RegressionModel supports prediction only for a single variable with a method "predict:Double" by extending the Predictor. There is a need for multivariate prediction, at least for regression. I propose to modify "RegressionModel" interface similarly to how it is done in "ClassificationModel", which supports multiclass classification. It has "predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" should have something like "predictMultivariate:Vector". Update: After reading the design docs, adding "predictMultivariate" to RegressionModel does not seem reasonable to me anymore. The issue is as follows. RegressionModel has "predict:Double". Its "train" method uses "predict:Double" for prediction, i.e. PredictionModel (and RegressionModel) is hard-coded to have only one output. There exist a similar problem in MLLib (https://issues.apache.org/jira/browse/SPARK-5362). The possible solution for this problem might require to redesign the class hierarchy or addition of a separate interface that extends model. Though the latter means code duplication. was: org.apache.spark.ml.regression.RegressionModel supports prediction only for a single variable with a method "predict:Double" by extending the Predictor. There is a need for multivariate prediction, at least for regression. I propose to modify "RegressionModel" interface similarly to how it is done in "ClassificationModel", which supports multiclass classification. It has "predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" should have something like "predictMultivariate:Vector". Update: After reading the design docs, adding "predictMultivariate" to RegressionModel does not seem reasonable to me anymore. The issue is as follows. RegressionModel extends PredictionModel which has "predict:Double". Its "train" method uses "predict:Double" for prediction, i.e. PredictionModel (and RegressionModel) is hard-coded to have only one output. There exist a similar problem in MLLib (https://issues.apache.org/jira/browse/SPARK-5362). The possible solution for this problem might require to redesign the class hierarchy or addition of a separate interface that extends model. Though the latter means code duplication. > Add multivariate regression (or prediction) interface > - > > Key: SPARK-9120 > URL: https://issues.apache.org/jira/browse/SPARK-9120 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Alexander Ulanov > Original Estimate: 1h > Remaining Estimate: 1h > > org.apache.spark.ml.regression.RegressionModel supports prediction only for a > single variable with a method "predict:Double" by extending the Predictor. > There is a need for multivariate prediction, at least for regression. I > propose to modify "RegressionModel" interface similarly to how it is done in > "ClassificationModel", which supports multiclass classification. It has > "predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" > should have something like "predictMultivariate:Vector". > Update: After reading the design docs, adding "predictMultivariate" to > RegressionModel does not seem reasonable to me anymore. The issue is as > follows. RegressionModel has "predict:Double". Its "train" method uses > "predict:Double" for prediction, i.e. 
PredictionModel (and RegressionModel) > is hard-coded to have only one output. There exist a similar problem in MLLib > (https://issues.apache.org/jira/browse/SPARK-5362). > The possible solution for this problem might require to redesign the class > hierarchy or addition of a separate interface that extends model. Though the > latter means code duplication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10627) Regularization for artificial neural networks
[ https://issues.apache.org/jira/browse/SPARK-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15395379#comment-15395379 ] Alexander Ulanov commented on SPARK-10627: -- [~RubenJanssen] These are major features. Could you work on something smaller as a first step? > Regularization for artificial neural networks > - > > Key: SPARK-10627 > URL: https://issues.apache.org/jira/browse/SPARK-10627 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Priority: Minor > > Add regularization for artificial neural networks. Includes, but not limited > to: > 1)L1 and L2 regularization > 2)Dropout http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf > 3)Dropconnect > http://machinelearning.wustl.edu/mlpapers/paper_files/icml2013_wan13.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
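As a rough illustration of the first item (L1/L2 regularization), below is a minimal sketch of how an L2 weight-decay term folds into a plain gradient step; the function and parameter names are made up for illustration and this is not the updater used by the MLP code.

{code:scala}
import breeze.linalg.DenseVector

// Minimal sketch of an L2-regularized gradient step:
//   loss' = loss + (lambda / 2) * ||w||^2   =>   grad' = grad + lambda * w
def l2Step(
    weights: DenseVector[Double],
    gradient: DenseVector[Double],
    stepSize: Double,
    lambda: Double): DenseVector[Double] = {
  val regularizedGradient = gradient + (weights * lambda)
  weights - (regularizedGradient * stepSize)
}
{code}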
[jira] [Commented] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351922#comment-15351922 ] Alexander Ulanov commented on SPARK-10408: -- Here is the PR https://github.com/apache/spark/pull/13621 > Autoencoder > --- > > Key: SPARK-10408 > URL: https://issues.apache.org/jira/browse/SPARK-10408 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Assignee: Alexander Ulanov >Priority: Minor > > Goal: Implement various types of autoencoders > Requirements: > 1)Basic (deep) autoencoder that supports different types of inputs: binary, > real in [0..1]. real in [-inf, +inf] > 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature > to the MLP and then used here > 3)Denoising autoencoder > 4)Stacked autoencoder for pre-training of deep networks. It should support > arbitrary network layers > References: > 1. Vincent, Pascal, et al. "Extracting and composing robust features with > denoising autoencoders." Proceedings of the 25th international conference on > Machine learning. ACM, 2008. > http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf > > 2. > http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, > 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. > (2010). Stacked denoising autoencoders: Learning useful representations in a > deep network with a local denoising criterion. Journal of Machine Learning > Research, 11(3371–3408). > http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484&rep=rep1&type=pdf > 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep > networks." Advances in neural information processing systems 19 (2007): 153. > http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15899) file scheme should be used correctly
[ https://issues.apache.org/jira/browse/SPARK-15899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15345611#comment-15345611 ] Alexander Ulanov commented on SPARK-15899: -- `user.dir` on Windows starts with a drive letter: scala> System.getProperty("user.dir") res0: String = C:\Program Files (x86)\scala\bin On Linux it starts with a slash: scala> System.getProperty("user.dir") res0: String = /home/hduser It seems that java.io.File could convert it to a proper URI: Windows: scala> new File("c:/myfile").toURI res6: java.net.URI = file:/c:/myfile Linux: scala> new File("/home/myfile").toURI res3: java.net.URI = file:/home/myfile We can remove "file:" from https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58 and add a toURI conversion in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L694 > file scheme should be used correctly > > > Key: SPARK-15899 > URL: https://issues.apache.org/jira/browse/SPARK-15899 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Kazuaki Ishizaki >Priority: Minor > > [An RFC|https://www.ietf.org/rfc/rfc1738.txt] defines the file scheme as > {{file://host/}} or {{file:///}}. > [Wikipedia|https://en.wikipedia.org/wiki/File_URI_scheme] > [Some code > stuffs|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58] > use a different prefix such as {{file:}}. > It would be good to prepare a utility method to correctly add the {{file://host/}} > or {{file:///}} prefix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
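A minimal sketch of the kind of utility suggested in the comment above: normalizing a platform-specific local path (Windows drive letter or Unix absolute path) into a well-formed file: URI via java.io.File. The method name is illustrative, not the actual helper in SQLConf.

{code:scala}
import java.io.File
import java.net.URI

// Illustrative helper: "C:\dev\spark" or "/home/hduser" becomes a proper
// file: URI such as file:/C:/dev/spark or file:/home/hduser.
def toFileUri(path: String): URI = new File(path).getAbsoluteFile.toURI

// Examples:
// toFileUri("""C:\dev\spark\spark-warehouse""")
// toFileUri("/home/hduser/spark-warehouse")
{code}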
[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15345264#comment-15345264 ] Alexander Ulanov commented on SPARK-15581: -- The current implementation of multilayer perceptron in Spark is less than 2x slower than Caffe, both measured on CPU. The main overhead sources are JVM and Spark's communication layer. For more details, please refer to https://github.com/avulanov/ann-benchmark. Having said that, I expect that efficient implementation of deep learning in Spark will be only few times slower than in specialized tool. This is very reasonable for the platform that does much more than deep learning and I believe it is understood by the community. The main motivation for using specialized libraries for deep learning would be to fully take advantage of the hardware where Spark runs, in particular GPUs. Having the default interface in Spark, we will need to wrap only a subset of functions from a given specialized library. It does require an effort, however it is not the same as wrapping all functions. Wrappers can be provided as packages without the need to pull new dependencies into Spark. > MLlib 2.1 Roadmap > - > > Key: SPARK-15581 > URL: https://issues.apache.org/jira/browse/SPARK-15581 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > Labels: roadmap > > This is a master list for MLlib improvements we are working on for the next > release. Please view this as a wish list rather than a definite plan, for we > don't have an accurate estimate of available resources. Due to limited review > bandwidth, features appearing on this list will get higher priority during > code review. But feel free to suggest new items to the list in comments. We > are experimenting with this process. Your feedback would be greatly > appreciated. > h1. Instructions > h2. For contributors: > * Please read > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > carefully. Code style, documentation, and unit tests are important. > * If you are a first-time Spark contributor, please always start with a > [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather > than a medium/big feature. Based on our experience, mixing the development > process with a big feature usually causes long delay in code review. > * Never work silently. Let everyone know on the corresponding JIRA page when > you start working on some features. This is to avoid duplicate work. For > small features, you don't need to wait to get JIRA assigned. > * For medium/big features or features with dependencies, please get assigned > first before coding and keep the ETA updated on the JIRA. If there exist no > activity on the JIRA page for a certain amount of time, the JIRA should be > released for other contributors. > * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one > after another. > * Remember to add the `@Since("VERSION")` annotation to new public APIs. > * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code > review greatly helps to improve others' code as well as yours. > h2. For committers: > * Try to break down big features into small and specific JIRA tasks and link > them properly. > * Add a "starter" label to starter tasks. > * Put a rough estimate for medium/big features and track the progress. > * If you start reviewing a PR, please add yourself to the Shepherd field on > JIRA. 
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs, > please ping a maintainer to make a final pass. > * After merging a PR, create and link JIRAs for Python, example code, and > documentation if applicable. > h1. Roadmap (*WIP*) > This is NOT [a complete list of MLlib JIRAs for 2.1| > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority]. > We only include umbrella JIRAs and high-level tasks. > Major efforts in this release: > * Feature parity for the DataFrames-based API (`spark.ml`), relative to the > RDD-based API > * ML persistence > * Python API feature parity and test coverage > * R API expansion and improvements > * Note about new features: As usual, we expect to expand the feature set of > MLlib. However, we will prioritize API parity, bug fixes, and improvements > over new features. > Note `spark.mllib` is in maintenance mode now. We will accept bug fixes for > it, but new features, APIs, and improvement
[jira] [Comment Edited] (SPARK-15581) MLlib 2.1 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325377#comment-15325377 ] Alexander Ulanov edited comment on SPARK-15581 at 6/17/16 1:18 AM: --- I would like to comment on Breeze and deep learning parts, because I have been implementing multilayer perceptron for Spark and have used Breeze a lot. Breeze provides convenient abstraction for dense and sparse vectors and matrices and allows performing linear algebra backed by netlib-java and native BLAS. At the same time Spark "linalg" has its own abstractions for that. This might be confusing to users and developers. Obviously, Spark should have a single library for linear algebra. Having said that, Breeze is more convenient and flexible than linalg, though it misses some features such as in-place matrix multiplications and multidimensional arrays. Breeze cannot be removed from Spark because "linalg" does not have enough functionality to fully replace it. To address this, I have implemented a Scala tensor library on top of netlib-java. "linalg" can be wrapped around it. It also provides functions similar to Breeze and allows working with multi-dimensional arrays. [~mengxr], [~dbtsai] and myself were planning to discuss this after the 2.0 release, and I am posting these considerations here since you raised this question too. Could you take a look on this library and tell what do you think? The source code is here https://github.com/avulanov/scala-tensor With regards to deep learning, I believe that having deep learning within Spark's ML library is a question of convenience. Spark has broad analytic capabilities and it is useful to have deep learning as one of these tools at hand. Deep learning is a model of choice for several important modern use-cases, and Spark ML might want to cover them. Eventually, it is hard to explain, why do we have PCA in ML but don't provide Autoencoder. To summarize this, I think that Spark should have at least the most widely used deep learning models, such as fully connected artificial neural network, convolutional network and autoencoder. Advanced and experimental deep learning features might reside within packages or as pluggable external tools. Spark ML already has fully connected networks in place. Stacked autoencoder is implemented but not merged yet. The only thing that remains is convolutional network. These 3 will provide a comprehensive deep learning set for Spark ML. We might also include recurrent networks as well. The additional benefit of implementing deep learning for Spark is that we define the Spark ML API for deep learning. This interface is similar to the other analytics tools in Spark and supports ML pipelines. This makes deep learning easy to use and plug in into analytics workloads for Spark users. One can wrap other deep learning implementations with this interface allowing users to pick a particular back-end, e.g. Caffe or TensorFlow, along with the default one. The interface has to provide few architectures for deep learning that are widely used in practice, such as AlexNet. The ultimate goal will be to provide efficient distributed training. It relies heavily on the efficient communication and scheduling mechanisms. The default implementation is based on Spark. More efficient implementations might include some external libraries but use the same interface defined. 
was (Author: avulanov): I would like to comment on Breeze and deep learning parts, because I have been implementing multilayer perceptron for Spark and have used Breeze a lot. Breeze provides convenient abstraction for dense and sparse vectors and matrices and allows performing linear algebra backed by netlib-java and native BLAS. At the same time Spark "linalg" has its own abstractions for that. This might be confusing to users and developers. Obviously, Spark should have a single library for linear algebra. Having said that, Breeze is more convenient and flexible than linalg, though it misses some features such as in-place matrix multiplications and multidimensional arrays. Breeze cannot be removed from Spark because "linalg" does not have enough functionality to fully replace it. To address this, I have implemented a Scala tensor library on top of netlib-java. "linalg" can be wrapped around it. It also provides functions similar to Breeze and allows working with multi-dimensional arrays. [~mengxr], [~dbtsai] and myself were planning to discuss this after the 2.0 release, and I am posting these considerations here since you raised this question too. Could you take a look on this library and tell what do you think? The source code is here https://github.com/avulanov/scala-tensor With regards to deep learning, I believe that having deep learning within Spark's ML library is a question of convenience. Spark has broad analyt
[jira] [Comment Edited] (SPARK-15581) MLlib 2.1 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325377#comment-15325377 ] Alexander Ulanov edited comment on SPARK-15581 at 6/17/16 1:18 AM: --- I would like to comment on Breeze and deep learning parts, because I have been implementing multilayer perceptron for Spark and have used Breeze a lot. Breeze provides convenient abstraction for dense and sparse vectors and matrices and allows performing linear algebra backed by netlib-java and native BLAS. At the same time Spark "linalg" has its own abstractions for that. This might be confusing to users and developers. Obviously, Spark should have a single library for linear algebra. Having said that, Breeze is more convenient and flexible than linalg, though it misses some features such as in-place matrix multiplications and multidimensional arrays. Breeze cannot be removed from Spark because "linalg" does not have enough functionality to fully replace it. To address this, I have implemented a Scala tensor library on top of netlib-java. "linalg" can be wrapped around it. It also provides functions similar to Breeze and allows working with multi-dimensional arrays. [~mengxr], [~dbtsai] and myself were planning to discuss this after the 2.0 release, and I am posting these considerations here since you raised this question too. Could you take a look on this library and tell what do you think? The source code is here https://github.com/avulanov/scala-tensor With regards to deep learning, I believe that having deep learning within Spark's ML library is a question of convenience. Spark has broad analytic capabilities and it is useful to have deep learning as one of these tools at hand. Deep learning is a model of choice for several important modern use-cases, and Spark ML might want to cover them. Eventually, it is hard to explain, why do we have PCA in ML but don't provide Autoencoder. To summarize this, I think that Spark should have at least the most widely used deep learning models, such as fully connected artificial neural network, convolutional network and autoencoder. Advanced and experimental deep learning features might reside within packages or as pluggable external tools. Spark ML already has fully connected networks in place. Stacked autoencoder is implemented but not merged yet. The only thing that remains is convolutional network. These 3 will provide a comprehensive deep learning set for Spark ML. We might also include recurrent networks as well. Update (6/16) based on our conversation with Ben Lorica: The additional benefit of implementing deep learning for Spark is that we define the Spark ML API for deep learning. This interface is similar to the other analytics tools in Spark and supports ML pipelines. This makes deep learning easy to use and plug in into analytics workloads for Spark users. One can wrap other deep learning implementations with this interface allowing users to pick a particular back-end, e.g. Caffe or TensorFlow, along with the default one. The interface has to provide few architectures for deep learning that are widely used in practice, such as AlexNet. The ultimate goal will be to provide efficient distributed training. It relies heavily on the efficient communication and scheduling mechanisms. The default implementation is based on Spark. More efficient implementations might include some external libraries but use the same interface defined. 
was (Author: avulanov): I would like to comment on Breeze and deep learning parts, because I have been implementing multilayer perceptron for Spark and have used Breeze a lot. Breeze provides convenient abstraction for dense and sparse vectors and matrices and allows performing linear algebra backed by netlib-java and native BLAS. At the same time Spark "linalg" has its own abstractions for that. This might be confusing to users and developers. Obviously, Spark should have a single library for linear algebra. Having said that, Breeze is more convenient and flexible than linalg, though it misses some features such as in-place matrix multiplications and multidimensional arrays. Breeze cannot be removed from Spark because "linalg" does not have enough functionality to fully replace it. To address this, I have implemented a Scala tensor library on top of netlib-java. "linalg" can be wrapped around it. It also provides functions similar to Breeze and allows working with multi-dimensional arrays. [~mengxr], [~dbtsai] and myself were planning to discuss this after the 2.0 release, and I am posting these considerations here since you raised this question too. Could you take a look on this library and tell what do you think? The source code is here https://github.com/avulanov/scala-tensor With regards to deep learning, I believe that having deep learning within Spark's ML li
[jira] [Commented] (SPARK-15893) spark.createDataFrame raises an exception in Spark 2.0 tests on Windows
[ https://issues.apache.org/jira/browse/SPARK-15893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15330065#comment-15330065 ] Alexander Ulanov commented on SPARK-15893: -- Actually, the code that I am trying to run does not have explicit paths in it. It is Spark unit tests that were running properly on 1.6 (and with earlier versions) on Windows. It seems that the recent change in 2.0 broke that. Could you propose a way to debug this? > spark.createDataFrame raises an exception in Spark 2.0 tests on Windows > --- > > Key: SPARK-15893 > URL: https://issues.apache.org/jira/browse/SPARK-15893 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.0.0 >Reporter: Alexander Ulanov > > spark.createDataFrame raises an exception in Spark 2.0 tests on Windows > For example, LogisticRegressionSuite fails at Line 46: > Exception encountered when invoking run on a nested suite - > java.net.URISyntaxException: Relative path in absolute URI: > file:C:/dev/spark/external/flume-assembly/spark-warehouse > java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative > path in absolute URI: > file:C:/dev/spark/external/flume-assembly/spark-warehouse > at org.apache.hadoop.fs.Path.initialize(Path.java:206) > at org.apache.hadoop.fs.Path.(Path.java:172) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:109) > Another example, DataFrameSuite raises: > java.net.URISyntaxException: Relative path in absolute URI: > file:C:/dev/spark/external/flume-assembly/spark-warehouse > java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative > path in absolute URI: > file:C:/dev/spark/external/flume-assembly/spark-warehouse > at org.apache.hadoop.fs.Path.initialize(Path.java:206) > at org.apache.hadoop.fs.Path.(Path.java:172) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
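One way to narrow this down (a hedged sketch, not a confirmed fix) is to set spark.sql.warehouse.dir explicitly to a well-formed file: URI when the test session is created, assuming the failure comes from the default warehouse path that Hadoop's Path rejects on Windows:

{code:scala}
import java.io.File
import org.apache.spark.sql.SparkSession

// Sketch of a possible workaround for the tests: pass an explicit, well-formed
// file: URI as the warehouse directory instead of the default "file:" + path.
val warehouseUri = new File("target/spark-warehouse").getAbsoluteFile.toURI.toString

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("windows-warehouse-test")
  .config("spark.sql.warehouse.dir", warehouseUri)
  .getOrCreate()
{code}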
[jira] [Created] (SPARK-15893) spark.createDataFrame raises an exception in Spark 2.0 tests on Windows
Alexander Ulanov created SPARK-15893: Summary: spark.createDataFrame raises an exception in Spark 2.0 tests on Windows Key: SPARK-15893 URL: https://issues.apache.org/jira/browse/SPARK-15893 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 2.0.0 Reporter: Alexander Ulanov spark.createDataFrame raises an exception in Spark 2.0 tests on Windows For example, LogisticRegressionSuite fails at Line 46: Exception encountered when invoking run on a nested suite - java.net.URISyntaxException: Relative path in absolute URI: file:C:/dev/spark/external/flume-assembly/spark-warehouse java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:/dev/spark/external/flume-assembly/spark-warehouse at org.apache.hadoop.fs.Path.initialize(Path.java:206) at org.apache.hadoop.fs.Path.(Path.java:172) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:109) Another example, DataFrameSuite raises: java.net.URISyntaxException: Relative path in absolute URI: file:C:/dev/spark/external/flume-assembly/spark-warehouse java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:/dev/spark/external/flume-assembly/spark-warehouse at org.apache.hadoop.fs.Path.initialize(Path.java:206) at org.apache.hadoop.fs.Path.(Path.java:172) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325377#comment-15325377 ] Alexander Ulanov commented on SPARK-15581: -- I would like to comment on the Breeze and deep learning parts, because I have been implementing the multilayer perceptron for Spark and have used Breeze a lot. Breeze provides a convenient abstraction for dense and sparse vectors and matrices and allows performing linear algebra backed by netlib-java and native BLAS. At the same time, Spark "linalg" has its own abstractions for that. This might be confusing to users and developers. Obviously, Spark should have a single library for linear algebra. Having said that, Breeze is more convenient and flexible than linalg, though it misses some features such as in-place matrix multiplications and multidimensional arrays. Breeze cannot be removed from Spark because "linalg" does not have enough functionality to fully replace it. To address this, I have implemented a Scala tensor library on top of netlib-java. "linalg" can be wrapped around it. It also provides functions similar to Breeze and allows working with multi-dimensional arrays. [~mengxr], [~dbtsai] and I were planning to discuss this after the 2.0 release, and I am posting these considerations here since you raised this question too. Could you take a look at this library and tell me what you think? The source code is here: https://github.com/avulanov/scala-tensor With regard to deep learning, I believe that having deep learning within Spark's ML library is a question of convenience. Spark has broad analytic capabilities and it is useful to have deep learning as one of these tools at hand. Deep learning is a model of choice for several important modern use cases, and Spark ML might want to cover them. After all, it is hard to explain why we have PCA in ML but do not provide an autoencoder. To summarize, I think that Spark should have at least the most widely used deep learning models, such as the fully connected artificial neural network, the convolutional network and the autoencoder. Advanced and experimental deep learning features might reside within packages or as pluggable external tools. Spark ML already has fully connected networks in place. A stacked autoencoder is implemented but not merged yet. The only thing that remains is the convolutional network. These three will provide a comprehensive deep learning set for Spark ML. > MLlib 2.1 Roadmap > - > > Key: SPARK-15581 > URL: https://issues.apache.org/jira/browse/SPARK-15581 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > Labels: roadmap > > This is a master list for MLlib improvements we are working on for the next > release. Please view this as a wish list rather than a definite plan, for we > don't have an accurate estimate of available resources. Due to limited review > bandwidth, features appearing on this list will get higher priority during > code review. But feel free to suggest new items to the list in comments. We > are experimenting with this process. Your feedback would be greatly > appreciated. > h1. Instructions > h2. For contributors: > * Please read > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > carefully. Code style, documentation, and unit tests are important. 
> * If you are a first-time Spark contributor, please always start with a > [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather > than a medium/big feature. Based on our experience, mixing the development > process with a big feature usually causes long delay in code review. > * Never work silently. Let everyone know on the corresponding JIRA page when > you start working on some features. This is to avoid duplicate work. For > small features, you don't need to wait to get JIRA assigned. > * For medium/big features or features with dependencies, please get assigned > first before coding and keep the ETA updated on the JIRA. If there exist no > activity on the JIRA page for a certain amount of time, the JIRA should be > released for other contributors. > * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one > after another. > * Remember to add the `@Since("VERSION")` annotation to new public APIs. > * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code > review greatly helps to improve others' code as well as yours. > h2. For committers: > * Try to break down big features into small and specific JIRA tasks and link > them properly. > * Add a "starter" label to starter tasks. > * Put a rough estimate for medium/big features and track the progress. > * If you start reviewing a PR, please add yourself to the Shepherd field on
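To illustrate the point in the SPARK-15581 comment above about fully connected networks and the pipeline-style API already available in Spark ML, here is a small usage sketch of MultilayerPerceptronClassifier on a toy XOR-like dataset; the layer sizes and data are placeholders, and an existing SparkSession named `spark` is assumed.

{code:scala}
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.linalg.Vectors

// Toy dataset with "label" and "features" columns, as expected by the pipeline API.
val data = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(0.0, 0.0)),
  (1.0, Vectors.dense(0.0, 1.0)),
  (1.0, Vectors.dense(1.0, 0.0)),
  (0.0, Vectors.dense(1.0, 1.0))
)).toDF("label", "features")

// Layers: 2 inputs, one hidden layer of 5 units, 2 output classes.
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(2, 5, 2))
  .setBlockSize(128)
  .setMaxIter(200)
  .setSeed(1234L)

val model = mlp.fit(data)
val predictions = model.transform(data)

val evaluator = new MulticlassClassificationEvaluator().setMetricName("accuracy")
println(s"Training accuracy: ${evaluator.evaluate(predictions)}")
{code}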
[jira] [Commented] (SPARK-15851) Spark 2.0 does not compile in Windows 7
[ https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323704#comment-15323704 ] Alexander Ulanov commented on SPARK-15851: -- Sorry for confusion, I mean the shell that is "/bin/sh". Windows version of it comes with Git. > Spark 2.0 does not compile in Windows 7 > --- > > Key: SPARK-15851 > URL: https://issues.apache.org/jira/browse/SPARK-15851 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 > Environment: Windows 7 >Reporter: Alexander Ulanov > > Spark does not compile in Windows 7. > "mvn compile" fails on spark-core due to trying to execute a bash script > spark-build-info. > Work around: > 1)Install win-bash and put in path > 2)Change line 350 of core/pom.xml > > > > > > Error trace: > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project > spark-core_2.11: An Ant BuildException has occured: Execute failed: > java.io.IOException: Cannot run program > "C:\dev\spark\core\..\build\spark-build-info" (in directory > "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 > application > [ERROR] around Ant part ... executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in > C:\dev\spark\core\target\antrun\build-main.xml -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15851) Spark 2.0 does not compile in Windows 7
[ https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323638#comment-15323638 ] Alexander Ulanov commented on SPARK-15851: -- I can do that. However, it seems that "spark-build-info" can be rewritten as a shell script. This will remove the need to install bash for Windows users that compile Spark with maven. What do you think? > Spark 2.0 does not compile in Windows 7 > --- > > Key: SPARK-15851 > URL: https://issues.apache.org/jira/browse/SPARK-15851 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 > Environment: Windows 7 >Reporter: Alexander Ulanov > > Spark does not compile in Windows 7. > "mvn compile" fails on spark-core due to trying to execute a bash script > spark-build-info. > Work around: > 1)Install win-bash and put in path > 2)Change line 350 of core/pom.xml > > > > > > Error trace: > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project > spark-core_2.11: An Ant BuildException has occured: Execute failed: > java.io.IOException: Cannot run program > "C:\dev\spark\core\..\build\spark-build-info" (in directory > "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 > application > [ERROR] around Ant part ... executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in > C:\dev\spark\core\target\antrun\build-main.xml -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15851) Spark 2.0 does not compile in Windows 7
[ https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323624#comment-15323624 ] Alexander Ulanov commented on SPARK-15851: -- This does not work because Ant uses Java Process to run executable which returns "not a valid Win32 application". In order to run it, one need to run "bash" and provide bash file as a param. This approach I proposed as a work-around. For more details please refer to: http://stackoverflow.com/questions/20883212/how-can-i-use-ant-exec-to-execute-commands-on-linux > Spark 2.0 does not compile in Windows 7 > --- > > Key: SPARK-15851 > URL: https://issues.apache.org/jira/browse/SPARK-15851 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 > Environment: Windows 7 >Reporter: Alexander Ulanov > > Spark does not compile in Windows 7. > "mvn compile" fails on spark-core due to trying to execute a bash script > spark-build-info. > Work around: > 1)Install win-bash and put in path > 2)Change line 350 of core/pom.xml > > > > > > Error trace: > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project > spark-core_2.11: An Ant BuildException has occured: Execute failed: > java.io.IOException: Cannot run program > "C:\dev\spark\core\..\build\spark-build-info" (in directory > "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 > application > [ERROR] around Ant part ... executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in > C:\dev\spark\core\target\antrun\build-main.xml -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15851) Spark 2.0 does not compile in Windows 7
[ https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-15851: - Fix Version/s: 2.0.0 > Spark 2.0 does not compile in Windows 7 > --- > > Key: SPARK-15851 > URL: https://issues.apache.org/jira/browse/SPARK-15851 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 > Environment: Windows 7 >Reporter: Alexander Ulanov > > Spark does not compile in Windows 7. > "mvn compile" fails on spark-core due to trying to execute a bash script > spark-build-info. > Work around: > 1)Install win-bash and put in path > 2)Change line 350 of core/pom.xml > > > > > > Error trace: > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project > spark-core_2.11: An Ant BuildException has occured: Execute failed: > java.io.IOException: Cannot run program > "C:\dev\spark\core\..\build\spark-build-info" (in directory > "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 > application > [ERROR] around Ant part ... executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in > C:\dev\spark\core\target\antrun\build-main.xml -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15851) Spark 2.0 does not compile in Windows 7
[ https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-15851: - Target Version/s: 2.0.0 Fix Version/s: (was: 2.0.0) > Spark 2.0 does not compile in Windows 7 > --- > > Key: SPARK-15851 > URL: https://issues.apache.org/jira/browse/SPARK-15851 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 > Environment: Windows 7 >Reporter: Alexander Ulanov > > Spark does not compile in Windows 7. > "mvn compile" fails on spark-core due to trying to execute a bash script > spark-build-info. > Work around: > 1)Install win-bash and put in path > 2)Change line 350 of core/pom.xml > > > > > > Error trace: > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project > spark-core_2.11: An Ant BuildException has occured: Execute failed: > java.io.IOException: Cannot run program > "C:\dev\spark\core\..\build\spark-build-info" (in directory > "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 > application > [ERROR] around Ant part ... executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in > C:\dev\spark\core\target\antrun\build-main.xml -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15851) Spark 2.0 does not compile in Windows 7
Alexander Ulanov created SPARK-15851: Summary: Spark 2.0 does not compile in Windows 7 Key: SPARK-15851 URL: https://issues.apache.org/jira/browse/SPARK-15851 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.0.0 Environment: Windows 7 Reporter: Alexander Ulanov Spark does not compile in Windows 7. "mvn compile" fails on spark-core due to trying to execute a bash script spark-build-info. Work around: 1)Install win-bash and put in path 2)Change line 350 of core/pom.xml Error trace: [ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project spark-core_2.11: An Ant BuildException has occured: Execute failed: java.io.IOException: Cannot run program "C:\dev\spark\core\..\build\spark-build-info" (in directory "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 application [ERROR] around Ant part .. @ 4:73 in C:\dev\spark\core\target\antrun\build-main.xml -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9273) Add Convolutional Neural network to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15150960#comment-15150960 ] Alexander Ulanov commented on SPARK-9273: - [~srowen] Do you mean that CNN will never be merged into Spark ML? > Add Convolutional Neural network to Spark MLlib > --- > > Key: SPARK-9273 > URL: https://issues.apache.org/jira/browse/SPARK-9273 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: yuhao yang >Assignee: yuhao yang > > Add Convolutional Neural network to Spark MLlib -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9273) Add Convolutional Neural network to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149139#comment-15149139 ] Alexander Ulanov commented on SPARK-9273: - Hi [~gsateesh110], Besides the one mentioned by Yuhao, there is SparkNet, which allows using Caffe. In the future, I plan to switch the present neural network implementation in Spark to tensors and probably implement CNN, which is easier with tensors: https://github.com/avulanov/spark/tree/mlp-tensor Best regards, Alexander > Add Convolutional Neural network to Spark MLlib > --- > > Key: SPARK-9273 > URL: https://issues.apache.org/jira/browse/SPARK-9273 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: yuhao yang >Assignee: yuhao yang > > Add Convolutional Neural network to Spark MLlib -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.
[ https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15114080#comment-15114080 ] Alexander Ulanov edited comment on SPARK-10528 at 1/24/16 1:30 AM: --- Hi! I'm getting the same problem on Windows 7 64x with Spark 1.6.0. It worked with the earlier versions of Spark. Changing permissions do not help. Spark launches eventually with that error and does not provide sqlContext. I've checked Spark 1.4.1 and it worked fine. Is there a workaround? was (Author: avulanov): Hi! I'm getting the same problem on Windows 7 64x with Spark 1.6.0. It worked with the earlier versions of Spark. Changing permissions do not help. Is there a workaround? > spark-shell throws java.lang.RuntimeException: The root scratch dir: > /tmp/hive on HDFS should be writable. > -- > > Key: SPARK-10528 > URL: https://issues.apache.org/jira/browse/SPARK-10528 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.5.0 > Environment: Windows 7 x64 >Reporter: Aliaksei Belablotski >Priority: Minor > > Starting spark-shell throws > java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: > /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw- -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.
[ https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15114080#comment-15114080 ] Alexander Ulanov commented on SPARK-10528: -- Hi! I'm getting the same problem on Windows 7 64x with Spark 1.6.0. It worked with the earlier versions of Spark. Changing permissions do not help. Is there a workaround? > spark-shell throws java.lang.RuntimeException: The root scratch dir: > /tmp/hive on HDFS should be writable. > -- > > Key: SPARK-10528 > URL: https://issues.apache.org/jira/browse/SPARK-10528 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.5.0 > Environment: Windows 7 x64 >Reporter: Aliaksei Belablotski >Priority: Minor > > Starting spark-shell throws > java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: > /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw- -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-10408: - Description: Goal: Implement various types of autoencoders Requirements: 1)Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1]. real in [-inf, +inf] 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to the MLP and then used here 3)Denoising autoencoder 4)Stacked autoencoder for pre-training of deep networks. It should support arbitrary network layers References: 1. Vincent, Pascal, et al. "Extracting and composing robust features with denoising autoencoders." Proceedings of the 25th international conference on Machine learning. ACM, 2008. http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf 2. http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(3371–3408). http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484&rep=rep1&type=pdf 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep networks." Advances in neural information processing systems 19 (2007): 153. http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf was: Goal: Implement various types of autoencoders Requirements: 1)Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1]. real in [-inf, +inf] 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to the MLP and then used here 3)Denoising autoencoder 4)Stacked autoencoder for pre-training of deep networks. It should support arbitrary network layers References: 1, 2. http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(3371–3408). http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484&rep=rep1&type=pdf 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep networks." Advances in neural information processing systems 19 (2007): 153. http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf > Autoencoder > --- > > Key: SPARK-10408 > URL: https://issues.apache.org/jira/browse/SPARK-10408 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Priority: Minor > > Goal: Implement various types of autoencoders > Requirements: > 1)Basic (deep) autoencoder that supports different types of inputs: binary, > real in [0..1]. real in [-inf, +inf] > 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature > to the MLP and then used here > 3)Denoising autoencoder > 4)Stacked autoencoder for pre-training of deep networks. It should support > arbitrary network layers > References: > 1. Vincent, Pascal, et al. "Extracting and composing robust features with > denoising autoencoders." Proceedings of the 25th international conference on > Machine learning. ACM, 2008. > http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf > > 2. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, > 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. > (2010). Stacked denoising autoencoders: Learning useful representations in a > deep network with a local denoising criterion. Journal of Machine Learning > Research, 11(3371–3408). > http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484&rep=rep1&type=pdf > 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep > networks." Advances in neural information processing systems 19 (2007): 153. > http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-10408: - Description: Goal: Implement various types of autoencoders Requirements: 1)Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1]. real in [-inf, +inf] 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to the MLP and then used here 3)Denoising autoencoder 4)Stacked autoencoder for pre-training of deep networks. It should support arbitrary network layers References: 1, 2. http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(3371–3408). http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484&rep=rep1&type=pdf 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep networks." Advances in neural information processing systems 19 (2007): 153. http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf was: Goal: Implement various types of autoencoders Requirements: 1)Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1]. real in [-inf, +inf] 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to the MLP and then used here 3)Denoising autoencoder 4)Stacked autoencoder for pre-training of deep networks. It should support arbitrary network layers: References: 1-3. http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf > Autoencoder > --- > > Key: SPARK-10408 > URL: https://issues.apache.org/jira/browse/SPARK-10408 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Priority: Minor > > Goal: Implement various types of autoencoders > Requirements: > 1)Basic (deep) autoencoder that supports different types of inputs: binary, > real in [0..1]. real in [-inf, +inf] > 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature > to the MLP and then used here > 3)Denoising autoencoder > 4)Stacked autoencoder for pre-training of deep networks. It should support > arbitrary network layers > References: > 1, 2. > http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, > 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. > (2010). Stacked denoising autoencoders: Learning useful representations in a > deep network with a local denoising criterion. Journal of Machine Learning > Research, 11(3371–3408). > http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484&rep=rep1&type=pdf > 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep > networks." Advances in neural information processing systems 19 (2007): 153. > http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14997300#comment-14997300 ] Alexander Ulanov commented on SPARK-5575: - Hi Narine, Thank you for your observation. It seems that such information is useful to know. Indeed, LBFGS in Spark does not print any information during the execution. ANN uses Spark's LBFGS. You might want to add the needed output to the LBFGS code https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala#L185. Best regards, Alexander > Artificial neural networks for MLlib deep learning > -- > > Key: SPARK-5575 > URL: https://issues.apache.org/jira/browse/SPARK-5575 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Alexander Ulanov > > Goal: Implement various types of artificial neural networks > Motivation: deep learning trend > Requirements: > 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward > and Backpropagation etc. should be implemented as traits or interfaces, so > they can be easily extended or reused > 2) Implement complex abstractions, such as feed forward and recurrent networks > 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), > autoencoder (sparse and denoising), stacked autoencoder, restricted > boltzmann machines (RBM), deep belief networks (DBN) etc. > 4) Implement or reuse supporting constucts, such as classifiers, normalizers, > poolers, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
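As a rough illustration of where such per-iteration output could be added, here is a hedged, self-contained sketch against Breeze's LBFGS (the optimizer that Spark's implementation wraps); it is not the actual code in LBFGS.scala, and the toy objective is made up purely to drive the loop.

{code:scala}
import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, LBFGS}

// Toy objective f(x) = ||x - 3||^2 with its gradient, just to drive the loop.
val costFun = new DiffFunction[DenseVector[Double]] {
  def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) = {
    val target = DenseVector.fill(x.length)(3.0)
    val diff = x - target
    (diff.dot(diff), diff * 2.0)
  }
}

val lbfgs = new LBFGS[DenseVector[Double]](maxIter = 50, m = 10, tolerance = 1e-9)
val states = lbfgs.iterations(costFun, DenseVector.zeros[Double](5))

// Print the loss at every iteration -- the kind of progress output the
// comment above suggests adding around Spark's own LBFGS loop.
while (states.hasNext) {
  val state = states.next()
  println(s"iteration ${state.iter}: loss ${state.value}")
}
{code}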
[jira] [Comment Edited] (SPARK-9273) Add Convolutional Neural network to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992447#comment-14992447 ] Alexander Ulanov edited comment on SPARK-9273 at 11/5/15 8:50 PM: -- Hi Yuhao. Sounds good! Thanks for refactoring the code to support ANN internal interface. Also, I was able to run your example. It shows increasing accuracy while training however it is not very fast. Does it work with LBFGS? There is a good explanation how to use matrices multiplication in convolution: http://cs231n.github.io/convolutional-networks/. Basically, one needs to roll all image patches (regions that will be convolved) the into vectors and stack them together in a matrix. The weights of convolutional layer also should be rolled into vectors and stacked. Multiplying two mentioned matrices provides the convolution result that can be unrolled to 3d matrix, however it would not be necessary for this implementation. We can discuss it offline if you wish. Besides the optimization, there are few more things to be done. It includes unit tests for new layers, gradient test, representing pooling layer as functional layer, and performance comparison with the other implementation of CNN. You can take a look at the tests I've added for MLP https://issues.apache.org/jira/browse/SPARK-11262 and MLP benchmark at https://github.com/avulanov/ann-benchmark. A separate branch/repo for these developments might be a good thing to do. I'll be happy to help you with this. was (Author: avulanov): Hi Yuhao. Sounds good! Thanks for refactoring the code to support ANN internal interface. Also, I was able to run your example. It shows increasing accuracy while training however it is not very fast. There is a good explanation how to use matrices multiplication in convolution: http://cs231n.github.io/convolutional-networks/. Basically, one needs to roll all image patches (regions that will be convolved) the into vectors and stack them together in a matrix. The weights of convolutional layer also should be rolled into vectors and stacked. Multiplying two mentioned matrices provides the convolution result that can be unrolled to 3d matrix, however it would not be necessary for this implementation. We can discuss it offline if you wish. Besides the optimization, there are few more things to be done. It includes unit tests for new layers, gradient test, representing pooling layer as functional layer, and performance comparison with the other implementation of CNN. You can take a look at the tests I've added for MLP https://issues.apache.org/jira/browse/SPARK-11262 and MLP benchmark at https://github.com/avulanov/ann-benchmark. A separate branch/repo for these developments might be a good thing to do. I'll be happy to help you with this. > Add Convolutional Neural network to Spark MLlib > --- > > Key: SPARK-9273 > URL: https://issues.apache.org/jira/browse/SPARK-9273 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: yuhao yang > > Add Convolutional Neural network to Spark MLlib -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9273) Add Convolutional Neural network to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992447#comment-14992447 ] Alexander Ulanov commented on SPARK-9273: - Hi Yuhao. Sounds good! Thanks for refactoring the code to support the ANN internal interface. Also, I was able to run your example. It shows increasing accuracy while training, however it is not very fast. There is a good explanation of how to use matrix multiplication for convolution: http://cs231n.github.io/convolutional-networks/. Basically, one needs to roll all image patches (regions that will be convolved) into vectors and stack them together in a matrix. The weights of the convolutional layer should also be rolled into vectors and stacked. Multiplying the two mentioned matrices gives the convolution result, which can be unrolled into a 3D matrix, however that would not be necessary for this implementation. We can discuss it offline if you wish. Besides the optimization, there are a few more things to be done. They include unit tests for the new layers, a gradient test, representing the pooling layer as a functional layer, and a performance comparison with other implementations of CNN. You can take a look at the tests I've added for MLP https://issues.apache.org/jira/browse/SPARK-11262 and the MLP benchmark at https://github.com/avulanov/ann-benchmark. A separate branch/repo for these developments might be a good thing to do. I'll be happy to help you with this. > Add Convolutional Neural network to Spark MLlib > --- > > Key: SPARK-9273 > URL: https://issues.apache.org/jira/browse/SPARK-9273 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: yuhao yang > > Add Convolutional Neural network to Spark MLlib -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14988705#comment-14988705 ] Alexander Ulanov commented on SPARK-5575: - Hi Disha, RNN is a major feature. I suggest starting with a smaller contribution. Spark has included a multilayer perceptron implementation since version 1.5. New features are supposed to re-use its code and follow the internal API that it has introduced. > Artificial neural networks for MLlib deep learning > -- > > Key: SPARK-5575 > URL: https://issues.apache.org/jira/browse/SPARK-5575 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Alexander Ulanov > > Goal: Implement various types of artificial neural networks > Motivation: deep learning trend > Requirements: > 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward > and Backpropagation etc. should be implemented as traits or interfaces, so > they can be easily extended or reused > 2) Implement complex abstractions, such as feed forward and recurrent networks > 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), > autoencoder (sparse and denoising), stacked autoencoder, restricted > boltzmann machines (RBM), deep belief networks (DBN) etc. > 4) Implement or reuse supporting constucts, such as classifiers, normalizers, > poolers, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
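For reference, the existing multilayer perceptron is exposed in spark.ml roughly as in the sketch below (based on the standard example; the data path and layer sizes are placeholders, and a Spark 2.x SparkSession named `spark` is assumed — in 1.5/1.6 the data is loaded through sqlContext instead):

```scala
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

// Sample multiclass dataset shipped with Spark (4 features, 3 classes)
val data = spark.read.format("libsvm")
  .load("data/mllib/sample_multiclass_classification_data.txt")
val Array(train, test) = data.randomSplit(Array(0.7, 0.3), seed = 1234L)

// Layers: 4 inputs, two hidden layers of 5 and 4 units, 3 output classes
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 5, 4, 3))
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)

val model = mlp.fit(train)
val predictions = model.transform(test)

// Accuracy computed directly to stay independent of evaluator metric names across versions
val accuracy = predictions.filter("prediction = label").count.toDouble / test.count
println(s"Test accuracy = $accuracy")
```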
[jira] [Created] (SPARK-11262) Unit test for gradient, loss layers, memory management for multilayer perceptron
Alexander Ulanov created SPARK-11262: Summary: Unit test for gradient, loss layers, memory management for multilayer perceptron Key: SPARK-11262 URL: https://issues.apache.org/jira/browse/SPARK-11262 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.5.1 Reporter: Alexander Ulanov Fix For: 1.5.1 Multi-layer perceptron requires more rigorous tests and refactoring of layer interfaces to accommodate development of new features. 1)Implement unit test for gradient and loss 2)Refactor the internal layer interface to extract "loss function" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944173#comment-14944173 ] Alexander Ulanov commented on SPARK-5575: - Weide, These are major features and some of them are under development. You can check their status in the linked issues. Could you work on something smaller as a first step? [~mengxr], do you have any suggestions? > Artificial neural networks for MLlib deep learning > -- > > Key: SPARK-5575 > URL: https://issues.apache.org/jira/browse/SPARK-5575 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Alexander Ulanov > > Goal: Implement various types of artificial neural networks > Motivation: deep learning trend > Requirements: > 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward > and Backpropagation etc. should be implemented as traits or interfaces, so > they can be easily extended or reused > 2) Implement complex abstractions, such as feed forward and recurrent networks > 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), > autoencoder (sparse and denoising), stacked autoencoder, restricted > boltzmann machines (RBM), deep belief networks (DBN) etc. > 4) Implement or reuse supporting constucts, such as classifiers, normalizers, > poolers, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940446#comment-14940446 ] Alexander Ulanov commented on SPARK-5575: - Hi, Weide, Sounds good! What kind of feature are you planning to add? > Artificial neural networks for MLlib deep learning > -- > > Key: SPARK-5575 > URL: https://issues.apache.org/jira/browse/SPARK-5575 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Alexander Ulanov > > Goal: Implement various types of artificial neural networks > Motivation: deep learning trend > Requirements: > 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward > and Backpropagation etc. should be implemented as traits or interfaces, so > they can be easily extended or reused > 2) Implement complex abstractions, such as feed forward and recurrent networks > 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), > autoencoder (sparse and denoising), stacked autoencoder, restricted > boltzmann machines (RBM), deep belief networks (DBN) etc. > 4) Implement or reuse supporting constucts, such as classifiers, normalizers, > poolers, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10627) Regularization for artificial neural networks
[ https://issues.apache.org/jira/browse/SPARK-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14746636#comment-14746636 ] Alexander Ulanov commented on SPARK-10627: -- Dropout WIP refactoring for the new ML API https://github.com/avulanov/spark/tree/dropout-mlp. > Regularization for artificial neural networks > - > > Key: SPARK-10627 > URL: https://issues.apache.org/jira/browse/SPARK-10627 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Priority: Minor > > Add regularization for artificial neural networks. Includes, but not limited > to: > 1)L1 and L2 regularization > 2)Dropout http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf > 3)Dropconnect > http://machinelearning.wustl.edu/mlpapers/paper_files/icml2013_wan13.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
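The linked branch is work in progress; independent of that code, the effect of dropout at the level of a single layer can be illustrated with a tiny sketch (inverted dropout, so no rescaling is needed at prediction time; this is not the code in the branch):

```scala
import scala.util.Random

// Inverted dropout on one layer's activations: each unit is zeroed with probability p
// during training and the survivors are scaled by 1 / (1 - p), so the expected
// activation matches what the layer produces when dropout is switched off.
def dropout(activations: Array[Double], p: Double, rng: Random): Array[Double] =
  activations.map(a => if (rng.nextDouble() < p) 0.0 else a / (1.0 - p))

// Example: roughly half of the activations are dropped
val out = dropout(Array(0.2, 0.7, 1.5, 0.9), 0.5, new Random(42))
```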
[jira] [Created] (SPARK-10627) Regularization for artificial neural networks
Alexander Ulanov created SPARK-10627: Summary: Regularization for artificial neural networks Key: SPARK-10627 URL: https://issues.apache.org/jira/browse/SPARK-10627 Project: Spark Issue Type: Umbrella Components: ML Affects Versions: 1.5.0 Reporter: Alexander Ulanov Priority: Minor Add regularization for artificial neural networks. Includes, but not limited to: 1)L1 and L2 regularization 2)Dropout http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf 3)Dropconnect http://machinelearning.wustl.edu/mlpapers/paper_files/icml2013_wan13.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9273) Add Convolutional Neural network to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737948#comment-14737948 ] Alexander Ulanov edited comment on SPARK-9273 at 9/10/15 1:18 AM: -- Hi Yuhao! I have a few comments regarding the interface and the optimization of your implementation. There are two options for optimizing convolutions: using matrix-matrix multiplication and using FFTs. The latter seems a bit more complicated since we don't have an optimized parallel FFT in Spark. It also has to support batch data processing. Instead, if one uses matrix-matrix multiplication for convolution, then it can take advantage of native BLAS and batch computations can be supported straightforwardly. Another benefit is that we would not need to change the current Layer's input/output type (matrix) to a tensor. We can store the unrolled inputs/outputs as vectors within the input/output matrix. Does it make sense to you? was (Author: avulanov): Hi Yuhao! I have a few comments regarding the interface and the optimization of your implementation. There are two options for optimizing convolutions: using matrix-matrix multiplication and using FFTs. The latter seems a bit more complicated since we don't have an optimized parallel FFT in Spark. It also has to support batch data processing. Instead, if one uses matrix-matrix multiplication for convolution, then it can take advantage of native BLAS and batch computations can be supported straightforwardly. Another benefit is that we would not need to change the current Layer's input/output type (matrix) to a tensor. We can store the unrolled inputs/outputs as vectors within the input/output matrix. Do you think that it is reasonable? > Add Convolutional Neural network to Spark MLlib > --- > > Key: SPARK-9273 > URL: https://issues.apache.org/jira/browse/SPARK-9273 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: yuhao yang > > Add Convolutional Neural network to Spark MLlib -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9273) Add Convolutional Neural network to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737948#comment-14737948 ] Alexander Ulanov commented on SPARK-9273: - Hi Yuhao! I have a few comments regarding the interface and the optimization of your implementation. There are two options for optimizing convolutions: using matrix-matrix multiplication and using FFTs. The latter seems a bit more complicated since we don't have an optimized parallel FFT in Spark. It also has to support batch data processing. Instead, if one uses matrix-matrix multiplication for convolution, then it can take advantage of native BLAS and batch computations can be supported straightforwardly. Another benefit is that we would not need to change the current Layer's input/output type (matrix) to a tensor. We can store the unrolled inputs/outputs as vectors within the input/output matrix. Do you think that it is reasonable? > Add Convolutional Neural network to Spark MLlib > --- > > Key: SPARK-9273 > URL: https://issues.apache.org/jira/browse/SPARK-9273 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: yuhao yang > > Add Convolutional Neural network to Spark MLlib -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4752) Classifier based on artificial neural network
[ https://issues.apache.org/jira/browse/SPARK-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov closed SPARK-4752. --- Resolution: Fixed Fix Version/s: 1.5.0 > Classifier based on artificial neural network > - > > Key: SPARK-4752 > URL: https://issues.apache.org/jira/browse/SPARK-4752 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Alexander Ulanov > Fix For: 1.5.0 > > Original Estimate: 168h > Remaining Estimate: 168h > > Implement classifier based on artificial neural network (ANN). Requirements: > 1) Use the existing artificial neural network implementation > https://issues.apache.org/jira/browse/SPARK-2352, > https://github.com/apache/spark/pull/1290 > 2) Extend MLlib ClassificationModel trait, > 3) Like other classifiers in MLlib, accept RDD[LabeledPoint] for training, > 4) Be able to return the ANN model -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10324) MLlib 1.6 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-10324: - Description: Following SPARK-8445, we created this master list for MLlib features we plan to have in Spark 1.6. Please view this list as a wish list rather than a concrete plan, because we don't have an accurate estimate of available resources. Due to limited review bandwidth, features appearing on this list will get higher priority during code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delay in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get JIRA assigned. * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there exist no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors. * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another. * Remember to add `@Since("1.6.0")` annotation to new public APIs. * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps improve others' code as well as yours. h2. For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add "starter" label to starter tasks. * Put a rough estimate for medium/big features and track the progress. * If you start reviewing a PR, please add yourself to the Shepherd field on JIRA. * If the code looks good to you, please comment "LGTM". For non-trivial PRs, please ping a maintainer to make a final pass. * After merging a PR, create and link JIRAs for Python, example code, and documentation if necessary. h1. Roadmap (WIP) This is NOT [a complete list of MLlib JIRAs for 1.6|https://issues.apache.org/jira/issues/?filter=12333208]. We only include umbrella JIRAs and high-level tasks. h2. Algorithms and performance * log-linear model for survival analysis (SPARK-8518) * normal equation approach for linear regression (SPARK-9834) * iteratively re-weighted least squares (IRLS) for GLMs (SPARK-9835) * robust linear regression with Huber loss (SPARK-3181) * vector-free L-BFGS (SPARK-10078) * tree partition by features (SPARK-3717) * bisecting k-means (SPARK-6517) * weighted instance support (SPARK-9610) ** logistic regression (SPARK-7685) ** linear regression (SPARK-9642) ** random forest (SPARK-9478) * locality sensitive hashing (LSH) (SPARK-5992) * deep learning (SPARK-5575) ** autoencoder (SPARK-10408) ** restricted Boltzmann machine (RBM) (SPARK-4251) ** convolutional neural network (stretch) * factorization machine (SPARK-7008) * local linear algebra (SPARK-6442) * distributed LU decomposition (SPARK-8514) h2. 
Statistics * univariate statistics as UDAFs (SPARK-10384) * bivariate statistics as UDAFs (SPARK-10385) * R-like statistics for GLMs (SPARK-9835) * online hypothesis testing (SPARK-3147) h2. Pipeline API * pipeline persistence (SPARK-6725) * ML attribute API improvements (SPARK-8515) * feature transformers (SPARK-9930) ** feature interaction (SPARK-9698) ** SQL transformer (SPARK-8345) ** ?? * predict single instance (SPARK-10413) * test Kaggle datasets (SPARK-9941) h2. Model persistence * PMML export ** naive Bayes (SPARK-8546) ** decision tree (SPARK-8542) * model save/load ** FPGrowth (SPARK-6724) ** PrefixSpan (SPARK-10386) * code generation ** decision tree and tree ensembles (SPARK-10387) h2. Data sources * LIBSVM data source (SPARK-10117) * public dataset loader (SPARK-10388) h2. Python API for ML The main goal of Python API is to have feature parity with Scala/Java API. You can find a complete list [here|https://issues.apache.org/jira/issues/?filter=12333214]. The tasks fall into two major categories: * Python API for new algorithms * Python API for missing methods h2. SparkR API for ML * support more families and link functions in SparkR::glm (SPARK-9838, SPARK-9839, SPARK-9840) * better R formula support (SPARK-9681) * model summary with R-like statistics for GLMs (SPARK-9836, SPARK-9837) h2. Documentation * re-organize user guide (SPARK-8517) * @Since versions in spark.ml, pyspark.mllib, and pyspark.m
[jira] [Comment Edited] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726423#comment-14726423 ] Alexander Ulanov edited comment on SPARK-10408 at 9/1/15 11:55 PM: --- Added implementation for (1) that is basic deep autoencoder https://github.com/avulanov/spark/tree/autoencoder-mlp (https://github.com/avulanov/spark/blob/autoencoder-mlp/mllib/src/main/scala/org/apache/spark/ml/feature/Autoencoder.scala) was (Author: avulanov): Added implementation for (1) that is basic deep autoencoder https://github.com/avulanov/spark/tree/autoencoder-mlp ( > Autoencoder > --- > > Key: SPARK-10408 > URL: https://issues.apache.org/jira/browse/SPARK-10408 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Priority: Minor > > Goal: Implement various types of autoencoders > Requirements: > 1)Basic (deep) autoencoder that supports different types of inputs: binary, > real in [0..1]. real in [-inf, +inf] > 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature > to the MLP and then used here > 3)Denoising autoencoder > 4)Stacked autoencoder for pre-training of deep networks. It should support > arbitrary network layers: > References: > 1-3. > http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf > 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726423#comment-14726423 ] Alexander Ulanov edited comment on SPARK-10408 at 9/1/15 11:55 PM: --- Added implementation for (1) that is basic deep autoencoder https://github.com/avulanov/spark/tree/autoencoder-mlp ( was (Author: avulanov): Added implementation for (1) that is basic deep autoencoder https://github.com/avulanov/spark/tree/autoencoder-mlp (https://github.com/avulanov/spark/blob/ann-auto-rbm-mlor/mllib/src/main/scala/org/apache/spark/mllib/ann/Autoencoder.scala) > Autoencoder > --- > > Key: SPARK-10408 > URL: https://issues.apache.org/jira/browse/SPARK-10408 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Priority: Minor > > Goal: Implement various types of autoencoders > Requirements: > 1)Basic (deep) autoencoder that supports different types of inputs: binary, > real in [0..1]. real in [-inf, +inf] > 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature > to the MLP and then used here > 3)Denoising autoencoder > 4)Stacked autoencoder for pre-training of deep networks. It should support > arbitrary network layers: > References: > 1-3. > http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf > 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726423#comment-14726423 ] Alexander Ulanov edited comment on SPARK-10408 at 9/1/15 11:54 PM: --- Added implementation for (1) that is basic deep autoencoder https://github.com/avulanov/spark/tree/autoencoder-mlp (https://github.com/avulanov/spark/blob/ann-auto-rbm-mlor/mllib/src/main/scala/org/apache/spark/mllib/ann/Autoencoder.scala) was (Author: avulanov): Added implementation for (1) that is basic deep autoencoder https://github.com/avulanov/spark/tree/autoencoder-mlp > Autoencoder > --- > > Key: SPARK-10408 > URL: https://issues.apache.org/jira/browse/SPARK-10408 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Priority: Minor > > Goal: Implement various types of autoencoders > Requirements: > 1)Basic (deep) autoencoder that supports different types of inputs: binary, > real in [0..1]. real in [-inf, +inf] > 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature > to the MLP and then used here > 3)Denoising autoencoder > 4)Stacked autoencoder for pre-training of deep networks. It should support > arbitrary network layers: > References: > 1-3. > http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf > 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-10408: - Description: Goal: Implement various types of autoencoders Requirements: 1)Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1]. real in [-inf, +inf] 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to the MLP and then used here 3)Denoising autoencoder 4)Stacked autoencoder for pre-training of deep networks. It should support arbitrary network layers: References: 1-3. http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf was: Goal: Implement various types of autoencoders Requirements: 1)Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1]. real in [-inf, +inf] 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to the MLP and then used here 3)Denoising autoencoder 4)Stacked autoencoder for pre-training of deep networks. It should support arbitrary network layers > Autoencoder > --- > > Key: SPARK-10408 > URL: https://issues.apache.org/jira/browse/SPARK-10408 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Priority: Minor > > Goal: Implement various types of autoencoders > Requirements: > 1)Basic (deep) autoencoder that supports different types of inputs: binary, > real in [0..1]. real in [-inf, +inf] > 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature > to the MLP and then used here > 3)Denoising autoencoder > 4)Stacked autoencoder for pre-training of deep networks. It should support > arbitrary network layers: > References: > 1-3. > http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf > 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10409) Multilayer perceptron regression
Alexander Ulanov created SPARK-10409: Summary: Multilayer perceptron regression Key: SPARK-10409 URL: https://issues.apache.org/jira/browse/SPARK-10409 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.5.0 Reporter: Alexander Ulanov Priority: Minor Implement regression based on multilayer perceptron (MLP). It should support different kinds of outputs: binary, real in [0;1) and real in [-inf; +inf]. The implementation might take advantage of autoencoder. Time-series forecasting for financial data might be one of the use cases, see http://dl.acm.org/citation.cfm?id=561452. So there is the need for more specific requirements from this (or other) area. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
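A sketch of how the three output ranges in the description could map onto output activations (illustrative only; the type and function names below are not from any Spark branch):

```scala
// Output unit choices for MLP regression over the three ranges in the description.
sealed trait OutputType
case object Binary       extends OutputType // {0, 1}: sigmoid output
case object UnitInterval extends OutputType // [0, 1): sigmoid output
case object RealLine     extends OutputType // (-inf, +inf): identity (linear) output

// Sigmoid squashes the pre-activation into (0, 1); the identity keeps the full real line.
def outputActivation(t: OutputType): Double => Double = t match {
  case Binary | UnitInterval => x => 1.0 / (1.0 + math.exp(-x))
  case RealLine              => x => x
}
```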
[jira] [Commented] (SPARK-10409) Multilayer perceptron regression
[ https://issues.apache.org/jira/browse/SPARK-10409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726435#comment-14726435 ] Alexander Ulanov commented on SPARK-10409: -- Basic implementation with the current ML api can be found here: https://github.com/avulanov/spark/blob/a2261330c227be8ef26172dbe355a617d653553a/mllib/src/main/scala/org/apache/spark/ml/regression/MultilayerPerceptronRegressor.scala > Multilayer perceptron regression > > > Key: SPARK-10409 > URL: https://issues.apache.org/jira/browse/SPARK-10409 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Priority: Minor > > Implement regression based on multilayer perceptron (MLP). It should support > different kinds of outputs: binary, real in [0;1) and real in [-inf; +inf]. > The implementation might take advantage of autoencoder. Time-series > forecasting for financial data might be one of the use cases, see > http://dl.acm.org/citation.cfm?id=561452. So there is the need for more > specific requirements from this (or other) area. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726423#comment-14726423 ] Alexander Ulanov commented on SPARK-10408: -- Added implementation for (1) that is basic deep autoencoder https://github.com/avulanov/spark/tree/autoencoder-mlp > Autoencoder > --- > > Key: SPARK-10408 > URL: https://issues.apache.org/jira/browse/SPARK-10408 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Priority: Minor > > Goal: Implement various types of autoencoders > Requirements: > 1)Basic (deep) autoencoder that supports different types of inputs: binary, > real in [0..1]. real in [-inf, +inf] > 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature > to the MLP and then used here > 3)Denoising autoencoder > 4)Stacked autoencoder for pre-training of deep networks. It should support > arbitrary network layers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-10408: - Issue Type: Umbrella (was: Improvement) > Autoencoder > --- > > Key: SPARK-10408 > URL: https://issues.apache.org/jira/browse/SPARK-10408 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Priority: Minor > > Goal: Implement various types of autoencoders > Requirements: > 1)Basic (deep) autoencoder that supports different types of inputs: binary, > real in [0..1]. real in [-inf, +inf] > 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature > to the MLP and then used here > 3)Denoising autoencoder > 4)Stacked autoencoder for pre-training of deep networks. It should support > arbitrary network layers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10408) Autoencoder
Alexander Ulanov created SPARK-10408: Summary: Autoencoder Key: SPARK-10408 URL: https://issues.apache.org/jira/browse/SPARK-10408 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.5.0 Reporter: Alexander Ulanov Priority: Minor Goal: Implement various types of autoencoders Requirements: 1)Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1]. real in [-inf, +inf] 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to the MLP and then used here 3)Denoising autoencoder 4)Stacked autoencoder for pre-training of deep networks. It should support arbitrary network layers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9951) Example code for Multilayer Perceptron Classifier
[ https://issues.apache.org/jira/browse/SPARK-9951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700567#comment-14700567 ] Alexander Ulanov commented on SPARK-9951: - I've submitted a PR for the user guide. Could you suggest whether the example code in the PR can be used for this issue? https://github.com/apache/spark/pull/8262 > Example code for Multilayer Perceptron Classifier > - > > Key: SPARK-9951 > URL: https://issues.apache.org/jira/browse/SPARK-9951 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley > > Add an example to the examples/ code folder for Scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9951) Example code for Multilayer Perceptron Classifier
[ https://issues.apache.org/jira/browse/SPARK-9951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697902#comment-14697902 ] Alexander Ulanov commented on SPARK-9951: - I have this already, I plan to use it for the User Guide. Should we have a different example code in the examples? > Example code for Multilayer Perceptron Classifier > - > > Key: SPARK-9951 > URL: https://issues.apache.org/jira/browse/SPARK-9951 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley > > Add an example to the examples/ code folder for Scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-9897) User Guide for Multilayer Perceptron Classifier
[ https://issues.apache.org/jira/browse/SPARK-9897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-9897: Comment: was deleted (was: We already have an issue for MLP classifier docs: https://issues.apache.org/jira/browse/SPARK-9846. I plan to resolve it soon. Could you close this one?) > User Guide for Multilayer Perceptron Classifier > --- > > Key: SPARK-9897 > URL: https://issues.apache.org/jira/browse/SPARK-9897 > Project: Spark > Issue Type: Documentation > Components: ML >Reporter: Feynman Liang > > SPARK-9471 adds MLPs to ML Pipelines, an algorithm not covered by the MLlib > docs. We should update the user guide to include this under the {{Algorithm > Guides > Algorithms in spark.ml}} section of {{ml-guide}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9897) User Guide for Multilayer Perceptron Classifier
[ https://issues.apache.org/jira/browse/SPARK-9897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694355#comment-14694355 ] Alexander Ulanov commented on SPARK-9897: - We already have an issue for MLP classifier docs: https://issues.apache.org/jira/browse/SPARK-9846. I plan to resolve it soon. Could you close this one? > User Guide for Multilayer Perceptron Classifier > --- > > Key: SPARK-9897 > URL: https://issues.apache.org/jira/browse/SPARK-9897 > Project: Spark > Issue Type: Documentation > Components: ML >Reporter: Feynman Liang > > SPARK-9471 adds MLPs to ML Pipelines, an algorithm not covered by the MLlib > docs. We should update the user guide to include this under the {{Algorithm > Guides > Algorithms in spark.ml}} section of {{ml-guide}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9897) User Guide for Multilayer Perceptron Classifier
[ https://issues.apache.org/jira/browse/SPARK-9897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694356#comment-14694356 ] Alexander Ulanov commented on SPARK-9897: - We already have an issue for MLP classifier docs: https://issues.apache.org/jira/browse/SPARK-9846. I plan to resolve it soon. Could you close this one? > User Guide for Multilayer Perceptron Classifier > --- > > Key: SPARK-9897 > URL: https://issues.apache.org/jira/browse/SPARK-9897 > Project: Spark > Issue Type: Documentation > Components: ML >Reporter: Feynman Liang > > SPARK-9471 adds MLPs to ML Pipelines, an algorithm not covered by the MLlib > docs. We should update the user guide to include this under the {{Algorithm > Guides > Algorithms in spark.ml}} section of {{ml-guide}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9508) Align graphx programming guide with the updated Pregel code
Alexander Ulanov created SPARK-9508: --- Summary: Align graphx programming guide with the updated Pregel code Key: SPARK-9508 URL: https://issues.apache.org/jira/browse/SPARK-9508 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: Alexander Ulanov Priority: Minor Fix For: 1.4.0 SPARK-9436 simplifies the Pregel code. graphx-programming-guide needs to be modified accordingly since it lists the old Pregel code -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9471) Multilayer perceptron
Alexander Ulanov created SPARK-9471: --- Summary: Multilayer perceptron Key: SPARK-9471 URL: https://issues.apache.org/jira/browse/SPARK-9471 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 1.4.0 Reporter: Alexander Ulanov Fix For: 1.4.0 Implement Multilayer Perceptron for Spark ML. Requirements: 1) ML pipelines interface 2) Extensible internal interface for further development of artificial neural networks for ML 3) Efficient and scalable: use vectors and BLAS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9380) Pregel example fix in graphx-programming-guide
[ https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646853#comment-14646853 ] Alexander Ulanov commented on SPARK-9380: - It seems that I did not name the PR correctly. I renamed it and resolved this issue. Sorry for inconvenience. > Pregel example fix in graphx-programming-guide > -- > > Key: SPARK-9380 > URL: https://issues.apache.org/jira/browse/SPARK-9380 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.4.0 >Reporter: Alexander Ulanov > Fix For: 1.4.0 > > > Pregel operator to express single source > shortest path does not work due to incorrect type of the graph: Graph[Int, > Double] should be Graph[Long, Double] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-9380) Pregel example fix in graphx-programming-guide
[ https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-9380: Comment: was deleted (was: It seems that I did not name the PR correctly. I renamed it and resolved this issue. Sorry for inconvenience. ) > Pregel example fix in graphx-programming-guide > -- > > Key: SPARK-9380 > URL: https://issues.apache.org/jira/browse/SPARK-9380 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.4.0 >Reporter: Alexander Ulanov > Fix For: 1.4.0 > > > Pregel operator to express single source > shortest path does not work due to incorrect type of the graph: Graph[Int, > Double] should be Graph[Long, Double] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9380) Pregel example fix in graphx-programming-guide
[ https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646854#comment-14646854 ] Alexander Ulanov commented on SPARK-9380: - It seems that I did not name the PR correctly. I renamed it and resolved this issue. Sorry for inconvenience. > Pregel example fix in graphx-programming-guide > -- > > Key: SPARK-9380 > URL: https://issues.apache.org/jira/browse/SPARK-9380 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.4.0 >Reporter: Alexander Ulanov > Fix For: 1.4.0 > > > Pregel operator to express single source > shortest path does not work due to incorrect type of the graph: Graph[Int, > Double] should be Graph[Long, Double] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9380) Pregel example fix in graphx-programming-guide
[ https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov resolved SPARK-9380. - Resolution: Fixed > Pregel example fix in graphx-programming-guide > -- > > Key: SPARK-9380 > URL: https://issues.apache.org/jira/browse/SPARK-9380 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.4.0 >Reporter: Alexander Ulanov > Fix For: 1.4.0 > > > Pregel operator to express single source > shortest path does not work due to incorrect type of the graph: Graph[Int, > Double] should be Graph[Long, Double] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9436) Simplify Pregel by merging joins
[ https://issues.apache.org/jira/browse/SPARK-9436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-9436: Summary: Simplify Pregel by merging joins (was: Merge joins in Pregel ) > Simplify Pregel by merging joins > > > Key: SPARK-9436 > URL: https://issues.apache.org/jira/browse/SPARK-9436 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.4.0 >Reporter: Alexander Ulanov >Priority: Minor > Fix For: 1.4.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > Pregel code contains two consecutive joins: > ``` > g.vertices.innerJoin(messages)(vprog) > ... > g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => > newOpt.getOrElse(old) } > ``` > They can be replaced by one join. Ankur Dave proposed a patch based on our > discussion in mailing list: > https://www.mail-archive.com/dev@spark.apache.org/msg10316.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
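Conceptually, the two joins in the snippet collapse into a single vertex join that applies the vertex program only where a message arrived and keeps the old attribute otherwise. A rough sketch of the idea, reusing the names from the snippet (g, messages, vprog); it is not the exact text of the merged patch:

```scala
// messages: VertexRDD[A], vprog: (VertexId, VD, A) => VD
g = g.outerJoinVertices(messages) { (vid, attr, msgOpt) =>
  msgOpt.map(msg => vprog(vid, attr, msg)).getOrElse(attr)
}
// GraphOps.joinVertices(messages)(vprog) expresses the same thing more compactly.
```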
[jira] [Created] (SPARK-9436) Merge joins in Pregel
Alexander Ulanov created SPARK-9436: --- Summary: Merge joins in Pregel Key: SPARK-9436 URL: https://issues.apache.org/jira/browse/SPARK-9436 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.4.0 Reporter: Alexander Ulanov Priority: Minor Fix For: 1.4.0 Pregel code contains two consecutive joins: ``` g.vertices.innerJoin(messages)(vprog) ... g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => newOpt.getOrElse(old) } ``` They can be replaced by one join. Ankur Dave proposed a patch based on our discussion in mailing list: https://www.mail-archive.com/dev@spark.apache.org/msg10316.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9380) Pregel example fix in graphx-programming-guide
Alexander Ulanov created SPARK-9380: --- Summary: Pregel example fix in graphx-programming-guide Key: SPARK-9380 URL: https://issues.apache.org/jira/browse/SPARK-9380 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: Alexander Ulanov Fix For: 1.4.0 Pregel operator to express single source shortest path does not work due to incorrect type of the graph: Graph[Int, Double] should be Graph[Long, Double] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
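For reference, the corrected single-source shortest path example reads roughly as follows (assuming an active SparkContext `sc`; the generator produces Long vertex attributes, hence Graph[Long, Double] rather than Graph[Int, Double]):

```scala
import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.graphx.util.GraphGenerators

// A random graph with edge attributes converted to Double distances
val graph: Graph[Long, Double] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)
val sourceId: VertexId = 42

// Initialize all vertices except the source to infinite distance
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),     // vertex program
  triplet => {                                        // send message
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else
      Iterator.empty
  },
  (a, b) => math.min(a, b)                            // merge messages
)
println(sssp.vertices.collect.mkString("\n"))
```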
[jira] [Comment Edited] (SPARK-9273) Add Convolutional Neural network to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642513#comment-14642513 ] Alexander Ulanov edited comment on SPARK-9273 at 7/27/15 9:54 AM: -- I have not heard about the PR until it was submitted. It would be useful to look at the code, benchmark it and see if it fits our API. I've added the link to the umbrella issue for deep learning https://issues.apache.org/jira/browse/SPARK-5575 was (Author: avulanov): I have not heard about the PR until it was submitted. It would be useful to look at the code, benchmark it and see if it fits our API. > Add Convolutional Neural network to Spark MLlib > --- > > Key: SPARK-9273 > URL: https://issues.apache.org/jira/browse/SPARK-9273 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: yuhao yang > > Add Convolutional Neural network to Spark MLlib -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9273) Add Convolutional Neural network to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642513#comment-14642513 ] Alexander Ulanov commented on SPARK-9273: - I have not heard about the PR until it was submitted. It would be useful to look at the code, benchmark it and see if it fits our API. > Add Convolutional Neural network to Spark MLlib > --- > > Key: SPARK-9273 > URL: https://issues.apache.org/jira/browse/SPARK-9273 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: yuhao yang > > Add Convolutional Neural network to Spark MLlib -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9120) Add multivariate regression (or prediction) interface
[ https://issues.apache.org/jira/browse/SPARK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-9120: Description: org.apache.spark.ml.regression.RegressionModel supports prediction only for a single variable with a method "predict:Double" by extending the Predictor. There is a need for multivariate prediction, at least for regression. I propose to modify "RegressionModel" interface similarly to how it is done in "ClassificationModel", which supports multiclass classification. It has "predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" should have something like "predictMultivariate:Vector". Update: After reading the design docs, adding "predictMultivariate" to RegressionModel does not seem reasonable to me anymore. The issue is as follows. RegressionModel extends PredictionModel which has "predict:Double". Its "train" method uses "predict:Double" for prediction, i.e. PredictionModel (and RegressionModel) is hard-coded to have only one output. There exist a similar problem in MLLib (https://issues.apache.org/jira/browse/SPARK-5362). The possible solution for this problem might require to redesign the class hierarchy or addition of a separate interface that extends model. Though the latter means code duplication. was: org.apache.spark.ml.regression.RegressionModel supports prediction only for a single variable with a method "predict:Double" by extending the Predictor. There is a need for multivariate prediction, at least for regression. I propose to modify "RegressionModel" interface similarly to how it is done in "ClassificationModel", which supports multiclass classification. It has "predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" should have something like "predictMultivariate:Vector". Update:After reading the design docs, adding "predictMultivariate" to RegressionModel does not seem reasonable to me anymore. The issue is as follows. RegressionModel extends PredictionModel which has "predict:Double". Its "train" method uses "predict:Double" for prediction, i.e. PredictionModel is hard-coded to have only one output. It is the same problem that I pointed out long time ago in MLLib ( > Add multivariate regression (or prediction) interface > - > > Key: SPARK-9120 > URL: https://issues.apache.org/jira/browse/SPARK-9120 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Alexander Ulanov > Fix For: 1.4.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > org.apache.spark.ml.regression.RegressionModel supports prediction only for a > single variable with a method "predict:Double" by extending the Predictor. > There is a need for multivariate prediction, at least for regression. I > propose to modify "RegressionModel" interface similarly to how it is done in > "ClassificationModel", which supports multiclass classification. It has > "predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" > should have something like "predictMultivariate:Vector". > Update: After reading the design docs, adding "predictMultivariate" to > RegressionModel does not seem reasonable to me anymore. The issue is as > follows. RegressionModel extends PredictionModel which has "predict:Double". > Its "train" method uses "predict:Double" for prediction, i.e. PredictionModel > (and RegressionModel) is hard-coded to have only one output. There exist a > similar problem in MLLib (https://issues.apache.org/jira/browse/SPARK-5362). 
> The possible solution for this problem might require to redesign the class > hierarchy or addition of a separate interface that extends model. Though the > latter means code duplication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
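To make the proposal concrete, the intended shape of the API is roughly the following; this is a sketch only, and neither the trait nor predictMultivariate exists in Spark:

```scala
import org.apache.spark.mllib.linalg.Vector

// A sketch of the proposed contract: keep the existing scalar predict and add a
// vector-valued analogue, mirroring how ClassificationModel adds predictRaw.
trait MultivariateRegressionModel {
  /** Existing single-output contract, as in PredictionModel. */
  def predict(features: Vector): Double
  /** Proposed multi-output prediction, e.g. all outputs of an MLP regressor. */
  def predictMultivariate(features: Vector): Vector
}
```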
[jira] [Updated] (SPARK-9120) Add multivariate regression (or prediction) interface
[ https://issues.apache.org/jira/browse/SPARK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-9120: Description: org.apache.spark.ml.regression.RegressionModel supports prediction only for a single variable with a method "predict:Double" by extending the Predictor. There is a need for multivariate prediction, at least for regression. I propose to modify "RegressionModel" interface similarly to how it is done in "ClassificationModel", which supports multiclass classification. It has "predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" should have something like "predictMultivariate:Vector". Update:After reading the design docs, adding "predictMultivariate" to RegressionModel does not seem reasonable to me anymore. The issue is as follows. RegressionModel extends PredictionModel which has "predict:Double". Its "train" method uses "predict:Double" for prediction, i.e. PredictionModel is hard-coded to have only one output. It is the same problem that I pointed out long time ago in MLLib ( was: org.apache.spark.ml.regression.RegressionModel supports prediction only for a single variable with a method "predict:Double" by extending the Predictor. There is a need for multivariate prediction, at least for regression. I propose to modify "RegressionModel" interface similarly to how it is done in "ClassificationModel", which supports multiclass classification. It has "predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" should have something like "predictMultivariate:Vector". > Add multivariate regression (or prediction) interface > - > > Key: SPARK-9120 > URL: https://issues.apache.org/jira/browse/SPARK-9120 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Alexander Ulanov > Fix For: 1.4.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > org.apache.spark.ml.regression.RegressionModel supports prediction only for a > single variable with a method "predict:Double" by extending the Predictor. > There is a need for multivariate prediction, at least for regression. I > propose to modify "RegressionModel" interface similarly to how it is done in > "ClassificationModel", which supports multiclass classification. It has > "predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" > should have something like "predictMultivariate:Vector". > Update:After reading the design docs, adding "predictMultivariate" to > RegressionModel does not seem reasonable to me anymore. The issue is as > follows. RegressionModel extends PredictionModel which has "predict:Double". > Its "train" method uses "predict:Double" for prediction, i.e. PredictionModel > is hard-coded to have only one output. It is the same problem that I pointed > out long time ago in MLLib ( -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9120) Add multivariate regression (or prediction) interface
[ https://issues.apache.org/jira/browse/SPARK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630654#comment-14630654 ] Alexander Ulanov commented on SPARK-9120: - Thank you, it sounds doable. > Add multivariate regression (or prediction) interface > - > > Key: SPARK-9120 > URL: https://issues.apache.org/jira/browse/SPARK-9120 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Alexander Ulanov > Fix For: 1.4.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > org.apache.spark.ml.regression.RegressionModel supports prediction only for a > single variable with a method "predict:Double" by extending the Predictor. > There is a need for multivariate prediction, at least for regression. I > propose to modify "RegressionModel" interface similarly to how it is done in > "ClassificationModel", which supports multiclass classification. It has > "predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" > should have something like "predictMultivariate:Vector". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9120) Add multivariate regression (or prediction) interface
[ https://issues.apache.org/jira/browse/SPARK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630596#comment-14630596 ] Alexander Ulanov commented on SPARK-9120: - I think it should work for the train (aka fit) that has to return the model, not sure about the model itself. The common ancestor Model does not contain anything that can be called for prediction, its direct successor PredictionModel has predict:Double. Is there another way that you were mentioning? > Add multivariate regression (or prediction) interface > - > > Key: SPARK-9120 > URL: https://issues.apache.org/jira/browse/SPARK-9120 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Alexander Ulanov > Fix For: 1.4.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > org.apache.spark.ml.regression.RegressionModel supports prediction only for a > single variable with a method "predict:Double" by extending the Predictor. > There is a need for multivariate prediction, at least for regression. I > propose to modify "RegressionModel" interface similarly to how it is done in > "ClassificationModel", which supports multiclass classification. It has > "predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" > should have something like "predictMultivariate:Vector". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9120) Add multivariate regression (or prediction) interface
[ https://issues.apache.org/jira/browse/SPARK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630560#comment-14630560 ] Alexander Ulanov commented on SPARK-9120: - Thank you for sharing your thoughts. Do you mean that the algorithm that does multivariate regression should not be implemented within ML since ML does not support multivariate prediction, so the algorithm should live within MLlib for a while until you figure out a generic interface? By support I mean handling the ".fit" and ".transform" stuff etc. > Add multivariate regression (or prediction) interface > - > > Key: SPARK-9120 > URL: https://issues.apache.org/jira/browse/SPARK-9120 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Alexander Ulanov > Fix For: 1.4.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > org.apache.spark.ml.regression.RegressionModel supports prediction only for a > single variable with a method "predict:Double" by extending the Predictor. > There is a need for multivariate prediction, at least for regression. I > propose to modify "RegressionModel" interface similarly to how it is done in > "ClassificationModel", which supports multiclass classification. It has > "predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" > should have something like "predictMultivariate:Vector". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3702) Standardize MLlib classes for learners, models
[ https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630453#comment-14630453 ] Alexander Ulanov commented on SPARK-3702: - [~josephkb] Hi, Joseph! Do you plan to add support for multivariate regression? I need this for the multi-layer perceptron, and a multivariate regression interface might be useful for other tasks as well. I've added an issue: https://issues.apache.org/jira/browse/SPARK-9120. Also, I wonder if you plan to add integer array parameters: https://issues.apache.org/jira/browse/SPARK-9118. Both seem to be relatively easy to implement; the question is whether you plan to merge these features in the near future. > Standardize MLlib classes for learners, models > -- > > Key: SPARK-3702 > URL: https://issues.apache.org/jira/browse/SPARK-3702 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > Summary: Create a class hierarchy for learning algorithms and the models > those algorithms produce. > This is a super-task of several sub-tasks (but JIRA does not allow subtasks > of subtasks). See the "requires" links below for subtasks. > Goals: > * give intuitive structure to API, both for developers and for generated > documentation > * support meta-algorithms (e.g., boosting) > * support generic functionality (e.g., evaluation) > * reduce code duplication across classes > [Design doc for class hierarchy | > https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9120) Add multivariate regression (or prediction) interface
Alexander Ulanov created SPARK-9120: --- Summary: Add multivariate regression (or prediction) interface Key: SPARK-9120 URL: https://issues.apache.org/jira/browse/SPARK-9120 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Alexander Ulanov Fix For: 1.4.0 org.apache.spark.ml.regression.RegressionModel supports prediction only for a single variable with a method "predict:Double" by extending the Predictor. There is a need for multivariate prediction, at least for regression. I propose to modify "RegressionModel" interface similarly to how it is done in "ClassificationModel", which supports multiclass classification. It has "predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" should have something like "predictMultivariate:Vector". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
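To make the proposed multivariate prediction contract concrete, here is a minimal, hypothetical Scala sketch; the trait and member names below (MultivariateRegressionModel, predictMultivariate, outputDim) are illustrative only and are not part of the actual spark.ml API, which is hard-coded to predict a single Double.
{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical sketch only: illustrates the kind of contract the ticket asks for,
// not an existing spark.ml interface.
trait MultivariateRegressionModel {
  /** Number of output variables produced by the model. */
  def outputDim: Int

  /** Predict a vector of outputs for a single feature vector. */
  def predictMultivariate(features: Vector): Vector
}

// Trivial example implementation: returns the first `outputDim` features unchanged.
class IdentityRegressionModel(override val outputDim: Int) extends MultivariateRegressionModel {
  override def predictMultivariate(features: Vector): Vector =
    Vectors.dense(features.toArray.take(outputDim))
}

object MultivariateExample {
  def main(args: Array[String]): Unit = {
    val model = new IdentityRegressionModel(2)
    println(model.predictMultivariate(Vectors.dense(1.0, 2.0, 3.0))) // prints [1.0,2.0]
  }
}
{code}
A multi-layer perceptron with a vector-valued output layer could then implement such a trait directly instead of being squeezed into predict:Double.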
[jira] [Created] (SPARK-9118) Implement integer array parameters for ml.param as IntArrayParam
Alexander Ulanov created SPARK-9118: --- Summary: Implement integer array parameters for ml.param as IntArrayParam Key: SPARK-9118 URL: https://issues.apache.org/jira/browse/SPARK-9118 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Alexander Ulanov Priority: Minor Fix For: 1.4.0 ml/param/params.scala lacks an integer array parameter. It is needed for some models, such as the multilayer perceptron, to specify the layer sizes. I suggest implementing it as IntArrayParam, similarly to DoubleArrayParam. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
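For illustration, here is a simplified, self-contained sketch of such an array-valued parameter. Note that the real org.apache.spark.ml.param classes have a different constructor (they take the parent Params instance, a name and a doc string), so SimpleParam below is only a stand-in used to show how an Array[Int] specialization could mirror DoubleArrayParam.
{code}
// Self-contained sketch, not the actual spark.ml param classes.
class SimpleParam[T](val name: String, val doc: String, val isValid: T => Boolean) {
  def validate(value: T): Unit =
    require(isValid(value), s"Parameter $name was given an invalid value.")
}

// Array[Int] specialization, mirroring how DoubleArrayParam specializes Param[Array[Double]].
class IntArrayParam(name: String, doc: String, isValid: Array[Int] => Boolean)
  extends SimpleParam[Array[Int]](name, doc, isValid) {

  // Convenience constructor without validation, as the existing array params provide.
  def this(name: String, doc: String) = this(name, doc, (_: Array[Int]) => true)
}

object IntArrayParamExample {
  def main(args: Array[String]): Unit = {
    // For example, the layer sizes of a multilayer perceptron: 784 inputs, 100 hidden units, 10 outputs.
    val layers = new IntArrayParam("layers", "sizes of the network layers", _.forall(_ > 0))
    layers.validate(Array(784, 100, 10)) // passes; Array(784, -1) would throw
  }
}
{code}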
[jira] [Comment Edited] (SPARK-8449) HDF5 read/write support for Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592398#comment-14592398 ] Alexander Ulanov edited comment on SPARK-8449 at 6/18/15 7:53 PM: -- It seems that using the official HDF5 reader is not a viable choice for Spark due to platform-dependent binaries. We need to look for a pure Java implementation. Apparently, there is one called netCDF: http://www.unidata.ucar.edu/blogs/news/entry/netcdf_java_library_version_44. It might be tricky to use because the license is not Apache. However, it is worth a look. was (Author: avulanov): It seems that using the official HDF5 reader is not a viable choice for Spark due to platform dependent binaries. We need to look for pure Java implementation. Apparently, there is one called netCDF: http://www.unidata.ucar.edu/blogs/news/entry/netcdf_java_library_version_44. It might be tricky to use it because the license is not Apache. However it worth a look. > HDF5 read/write support for Spark MLlib > --- > > Key: SPARK-8449 > URL: https://issues.apache.org/jira/browse/SPARK-8449 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.4.0 >Reporter: Alexander Ulanov > Fix For: 1.4.1 > > Original Estimate: 96h > Remaining Estimate: 96h > > Add support for reading and writing HDF5 file format to/from LabeledPoint. > HDFS and local file system have to be supported. Other Spark formats to be > discussed. > Interface proposal: > /* path - directory path in any Hadoop-supported file system URI */ > MLUtils.saveAsHDF5(sc: SparkContext, path: String, RDD[LabeledPoint]): Unit > /* path - file or directory path in any Hadoop-supported file system URI */ > MLUtils.loadHDF5(sc: SparkContext, path: String): RDD[LabeledPoint] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8449) HDF5 read/write support for Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592398#comment-14592398 ] Alexander Ulanov commented on SPARK-8449: - It seems that using the official HDF5 reader is not a viable choice for Spark due to platform-dependent binaries. We need to look for a pure Java implementation. Apparently, there is one called netCDF: http://www.unidata.ucar.edu/blogs/news/entry/netcdf_java_library_version_44. It might be tricky to use because the license is not Apache. However, it is worth a look. > HDF5 read/write support for Spark MLlib > --- > > Key: SPARK-8449 > URL: https://issues.apache.org/jira/browse/SPARK-8449 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.4.0 >Reporter: Alexander Ulanov > Fix For: 1.4.1 > > Original Estimate: 96h > Remaining Estimate: 96h > > Add support for reading and writing HDF5 file format to/from LabeledPoint. > HDFS and local file system have to be supported. Other Spark formats to be > discussed. > Interface proposal: > /* path - directory path in any Hadoop-supported file system URI */ > MLUtils.saveAsHDF5(sc: SparkContext, path: String, RDD[LabeledPoint]): Unit > /* path - file or directory path in any Hadoop-supported file system URI */ > MLUtils.loadHDF5(sc: SparkContext, path: String): RDD[LabeledPoint] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8449) HDF5 read/write support for Spark MLlib
Alexander Ulanov created SPARK-8449: --- Summary: HDF5 read/write support for Spark MLlib Key: SPARK-8449 URL: https://issues.apache.org/jira/browse/SPARK-8449 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Alexander Ulanov Fix For: 1.4.1 Add support for reading and writing HDF5 file format to/from LabeledPoint. HDFS and local file system have to be supported. Other Spark formats to be discussed. Interface proposal: /* path - directory path in any Hadoop-supported file system URI */ MLUtils.saveAsHDF5(sc: SparkContext, path: String, RDD[LabeledPoint]): Unit /* path - file or directory path in any Hadoop-supported file system URI */ MLUtils.loadHDF5(sc: SparkContext, path: String): RDD[LabeledPoint] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582768#comment-14582768 ] Alexander Ulanov commented on SPARK-5575: - Hi Janani, There is already an implementation of DBN (and RBM) by [~gq]. You can find it here: https://github.com/witgo/spark/tree/ann-interface-gemm-dbn > Artificial neural networks for MLlib deep learning > -- > > Key: SPARK-5575 > URL: https://issues.apache.org/jira/browse/SPARK-5575 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Alexander Ulanov > > Goal: Implement various types of artificial neural networks > Motivation: deep learning trend > Requirements: > 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward > and Backpropagation etc. should be implemented as traits or interfaces, so > they can be easily extended or reused > 2) Implement complex abstractions, such as feed forward and recurrent networks > 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), > autoencoder (sparse and denoising), stacked autoencoder, restricted > boltzmann machines (RBM), deep belief networks (DBN) etc. > 4) Implement or reuse supporting constructs, such as classifiers, normalizers, > poolers, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14538209#comment-14538209 ] Alexander Ulanov commented on SPARK-5575: - Current implementation: https://github.com/avulanov/spark/tree/ann-interface-gemm > Artificial neural networks for MLlib deep learning > -- > > Key: SPARK-5575 > URL: https://issues.apache.org/jira/browse/SPARK-5575 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Alexander Ulanov > > Goal: Implement various types of artificial neural networks > Motivation: deep learning trend > Requirements: > 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward > and Backpropagation etc. should be implemented as traits or interfaces, so > they can be easily extended or reused > 2) Implement complex abstractions, such as feed forward and recurrent networks > 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), > autoencoder (sparse and denoising), stacked autoencoder, restricted > boltzmann machines (RBM), deep belief networks (DBN) etc. > 4) Implement or reuse supporting constructs, such as classifiers, normalizers, > poolers, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7316) Add step capability to RDD sliding window
[ https://issues.apache.org/jira/browse/SPARK-7316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531229#comment-14531229 ] Alexander Ulanov commented on SPARK-7316: - I would say that the major use case is practical considerations :) In my case it is time series analysis of sensor data. It does not make sense to analyze time windows with step 1 because it is a high-frequency sensor (1024 Hz). Also, even if we wanted to do it, the size of the resulting data gets enormous. For example, I have 2B data points (542 hours), which is 23GB of binary data. If I apply a sliding window with size 1024 and step 1, it will result in 1024*23GB, i.e. about 23.5TB of data, which I am not able to process with Spark currently (honestly speaking, my disk space is only 10TB). If you store the data in HDFS then it will be tripled, i.e. about 70TB. > Add step capability to RDD sliding window > - > > Key: SPARK-7316 > URL: https://issues.apache.org/jira/browse/SPARK-7316 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Alexander Ulanov > Fix For: 1.4.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > RDDFunctions in MLlib contains sliding window implementation with step 1. > User should be able to define step. This capability should be implemented. > Although one can generate sliding windows with step 1 and then filter every > Nth window, it might take much more time and disk space depending on the step > size. For example, if your window is 1000 then you will generate the amount > of data thousand times bigger than your initial dataset. It does not make > sense if you need just every Nth window, so the data generated will be 1000/N > smaller. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7316) Add step capability to RDD sliding window
[ https://issues.apache.org/jira/browse/SPARK-7316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-7316: Description: RDDFunctions in MLlib contains a sliding window implementation with step 1. The user should be able to define the step. This capability should be implemented. Although one can generate sliding windows with step 1 and then filter every Nth window, it might take much more time and disk space depending on the step size. For example, if your window is 1000 then you will generate an amount of data a thousand times bigger than your initial dataset. This does not make sense if you need just every Nth window: with step N the generated data would be only 1000/N times the size of the initial dataset. was:RDDFunctions in MLlib contains sliding window implementation with step 1. User should be able to define step. This capability should be implemented. > Add step capability to RDD sliding window > - > > Key: SPARK-7316 > URL: https://issues.apache.org/jira/browse/SPARK-7316 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Alexander Ulanov > Fix For: 1.4.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > RDDFunctions in MLlib contains a sliding window implementation with step 1. > The user should be able to define the step. This capability should be implemented. > Although one can generate sliding windows with step 1 and then filter every > Nth window, it might take much more time and disk space depending on the step > size. For example, if your window is 1000 then you will generate an amount > of data a thousand times bigger than your initial dataset. This does not make > sense if you need just every Nth window: with step N the generated data would > be only 1000/N times the size of the initial dataset. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
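As an illustration of the workaround mentioned in the description (generate all step-1 windows, then keep every Nth one), here is a sketch in Scala. It assumes the developer-API sliding(windowSize) method provided by org.apache.spark.mllib.rdd.RDDFunctions; the intermediate RDD of all windows it materializes is exactly the blow-up the ticket wants to avoid, which is why a native step parameter would be preferable.
{code}
import scala.reflect.ClassTag

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.rdd.RDDFunctions._ // adds sliding(windowSize) to RDDs

object SlidingWithStepSketch {
  // Emulate sliding(window, step): build all step-1 windows, then keep every `step`-th one.
  // The intermediate RDD still holds roughly `window` times the original data,
  // which is the problem described in this ticket.
  def slidingWithStep[T: ClassTag](rdd: RDD[T], window: Int, step: Int): RDD[Array[T]] =
    rdd.sliding(window)
      .zipWithIndex()
      .filter { case (_, i) => i % step == 0 }
      .map { case (w, _) => w }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sliding-step").setMaster("local[2]"))
    val data = sc.parallelize(1 to 20)
    slidingWithStep(data, window = 5, step = 5).collect().foreach(w => println(w.mkString(",")))
    sc.stop()
  }
}
{code}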
[jira] [Created] (SPARK-7316) Add step capability to RDD sliding window
Alexander Ulanov created SPARK-7316: --- Summary: Add step capability to RDD sliding window Key: SPARK-7316 URL: https://issues.apache.org/jira/browse/SPARK-7316 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Alexander Ulanov Fix For: 1.4.0 RDDFunctions in MLlib contains sliding window implementation with step 1. User should be able to define step. This capability should be implemented. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494579#comment-14494579 ] Alexander Ulanov edited comment on SPARK-5256 at 4/14/15 6:48 PM: -- [~shivaram] Indeed, performance is orthogonal to the API design. Though well-designed things should work efficiently, shouldn't they? :) was (Author: avulanov): [~shivaram] Indeed, performance is orthogonal to the API design. Though well-designed things should work efficient, don't you think? :) > Improving MLlib optimization APIs > - > > Key: SPARK-5256 > URL: https://issues.apache.org/jira/browse/SPARK-5256 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > > *Goal*: Improve APIs for optimization > *Motivation*: There have been several disjoint mentions of improving the > optimization APIs to make them more pluggable, extensible, etc. This JIRA is > a place to discuss what API changes are necessary for the long term, and to > provide links to other relevant JIRAs. > Eventually, I hope this leads to a design doc outlining: > * current issues > * requirements such as supporting many types of objective functions, > optimization algorithms, and parameters to those algorithms > * ideal API > * breakdown of smaller JIRAs needed to achieve that API > I will soon create an initial design doc, and I will try to watch this JIRA > and include ideas from JIRA comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494579#comment-14494579 ] Alexander Ulanov commented on SPARK-5256: - [~shivaram] Indeed, performance is orthogonal to the API design. Though well-designed things should work efficiently, don't you think? :) > Improving MLlib optimization APIs > - > > Key: SPARK-5256 > URL: https://issues.apache.org/jira/browse/SPARK-5256 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > > *Goal*: Improve APIs for optimization > *Motivation*: There have been several disjoint mentions of improving the > optimization APIs to make them more pluggable, extensible, etc. This JIRA is > a place to discuss what API changes are necessary for the long term, and to > provide links to other relevant JIRAs. > Eventually, I hope this leads to a design doc outlining: > * current issues > * requirements such as supporting many types of objective functions, > optimization algorithms, and parameters to those algorithms > * ideal API > * breakdown of smaller JIRAs needed to achieve that API > I will soon create an initial design doc, and I will try to watch this JIRA > and include ideas from JIRA comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494568#comment-14494568 ] Alexander Ulanov commented on SPARK-5256: - The size of data that requires using Spark suggests that the learning algorithm will be limited by time rather than by data. According to the paper "The tradeoffs of large scale learning", SGD has significantly faster convergence than batch GD in this case. My use case is machine learning on large data, in particular time series. > Improving MLlib optimization APIs > - > > Key: SPARK-5256 > URL: https://issues.apache.org/jira/browse/SPARK-5256 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > > *Goal*: Improve APIs for optimization > *Motivation*: There have been several disjoint mentions of improving the > optimization APIs to make them more pluggable, extensible, etc. This JIRA is > a place to discuss what API changes are necessary for the long term, and to > provide links to other relevant JIRAs. > Eventually, I hope this leads to a design doc outlining: > * current issues > * requirements such as supporting many types of objective functions, > optimization algorithms, and parameters to those algorithms > * ideal API > * breakdown of smaller JIRAs needed to achieve that API > I will soon create an initial design doc, and I will try to watch this JIRA > and include ideas from JIRA comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494568#comment-14494568 ] Alexander Ulanov edited comment on SPARK-5256 at 4/14/15 6:43 PM: -- The size of data that requires using Spark suggests that the learning algorithm will be limited by time rather than by data. According to the paper "The tradeoffs of large scale learning", SGD has significantly faster convergence than batch GD in this case. My use case is machine learning on large data, in particular time series. Just in case, here is a link to the paper: http://papers.nips.cc/paper/3323-the-tradeoffs-of-large-scale-learning.pdf was (Author: avulanov): The size of data that requires to use Spark suggests that learning algorithm will be limited by time versus data. According to the paper "The tradeoffs of large scale learning", SGD has significantly faster convergence than batch GD in this case. My use case is machine learning on large data, in particular, time series. > Improving MLlib optimization APIs > - > > Key: SPARK-5256 > URL: https://issues.apache.org/jira/browse/SPARK-5256 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > > *Goal*: Improve APIs for optimization > *Motivation*: There have been several disjoint mentions of improving the > optimization APIs to make them more pluggable, extensible, etc. This JIRA is > a place to discuss what API changes are necessary for the long term, and to > provide links to other relevant JIRAs. > Eventually, I hope this leads to a design doc outlining: > * current issues > * requirements such as supporting many types of objective functions, > optimization algorithms, and parameters to those algorithms > * ideal API > * breakdown of smaller JIRAs needed to achieve that API > I will soon create an initial design doc, and I will try to watch this JIRA > and include ideas from JIRA comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494518#comment-14494518 ] Alexander Ulanov commented on SPARK-5256: - Probably the main issue for MLlib is that iterative algorithms are implemented with the aggregate function. It has a fixed overhead of around half a second, which limits its applicability when one needs to run a large number of iterations. This is the case for the bigger data that Spark is intended for. The problem gets worse with stochastic algorithms because there is no good way to randomly pick data from an RDD, so one needs to scan it sequentially. > Improving MLlib optimization APIs > - > > Key: SPARK-5256 > URL: https://issues.apache.org/jira/browse/SPARK-5256 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > > *Goal*: Improve APIs for optimization > *Motivation*: There have been several disjoint mentions of improving the > optimization APIs to make them more pluggable, extensible, etc. This JIRA is > a place to discuss what API changes are necessary for the long term, and to > provide links to other relevant JIRAs. > Eventually, I hope this leads to a design doc outlining: > * current issues > * requirements such as supporting many types of objective functions, > optimization algorithms, and parameters to those algorithms > * ideal API > * breakdown of smaller JIRAs needed to achieve that API > I will soon create an initial design doc, and I will try to watch this JIRA > and include ideas from JIRA comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
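To make the per-iteration cost concrete, here is a minimal sketch (assuming a plain least-squares gradient and Array[Double] weights; this is not MLlib's actual GradientDescent code) of the common pattern in which every iteration runs one treeAggregate pass over the RDD, so the fixed per-job scheduling overhead is paid once per iteration.
{code}
import org.apache.spark.rdd.RDD

object BatchGradientDescentSketch {
  // data: (label, features) pairs; the weight vector has the same length as features.
  def run(data: RDD[(Double, Array[Double])], numIterations: Int, stepSize: Double, dim: Int): Array[Double] = {
    var weights = Array.fill(dim)(0.0)
    val n = data.count().toDouble
    for (_ <- 1 to numIterations) {
      val w = weights // stable snapshot captured by the closures below
      // Each treeAggregate call launches a separate Spark job; its fixed scheduling
      // overhead multiplies with the number of iterations.
      val gradientSum = data.treeAggregate(Array.fill(dim)(0.0))(
        seqOp = { case (acc, (label, features)) =>
          val err = features.zip(w).map { case (x, wi) => x * wi }.sum - label
          features.zip(acc).map { case (x, a) => a + err * x }
        },
        combOp = (a, b) => a.zip(b).map { case (x, y) => x + y })
      weights = w.zip(gradientSum).map { case (wi, g) => wi - stepSize * g / n }
    }
    weights
  }
}
{code}
MLlib's own gradient descent follows a similar one-aggregate-per-iteration structure (with mini-batch sampling and BLAS-backed vectors), which matches the point above about the fixed overhead of the aggregate call.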
[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485554#comment-14485554 ] Alexander Ulanov commented on SPARK-6682: - [~yuu.ishik...@gmail.com] They reside in the package org.apache.spark.mllib.optimization: class LBFGS(private var gradient: Gradient, private var updater: Updater) and class GradientDescent private[mllib] (private var gradient: Gradient, private var updater: Updater). They extend the Optimizer trait, which has only one function: def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector. This function is limited to only one type of input: vectors and their labels. I have submitted a separate issue regarding this: https://issues.apache.org/jira/browse/SPARK-5362. 1. Right now the static methods work with hard-coded optimizers, such as LogisticRegressionWithSGD. This is not very convenient. I think moving away from static methods and using builders implies that optimizers could also be set by users. It will be a problem because the current optimizers require an Updater and a Gradient at creation time. 2. The workaround I suggested in the previous post addresses this. > Deprecate static train and use builder instead for Scala/Java > - > > Key: SPARK-6682 > URL: https://issues.apache.org/jira/browse/SPARK-6682 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley > > In MLlib, we have for some time been unofficially moving away from the old > static train() methods and moving towards builder patterns. This JIRA is to > discuss this move and (hopefully) make it official. > "Old static train()" API: > {code} > val myModel = NaiveBayes.train(myData, ...) > {code} > "New builder pattern" API: > {code} > val nb = new NaiveBayes().setLambda(0.1) > val myModel = nb.train(myData) > {code} > Pros of the builder pattern: > * Much less code when algorithms have many parameters. Since Java does not > support default arguments, we required *many* duplicated static train() > methods (for each prefix set of arguments). > * Helps to enforce default parameters. Users should ideally not have to even > think about setting parameters if they just want to try an algorithm quickly. > * Matches spark.ml API > Cons of the builder pattern: > * In Python APIs, static train methods are more "Pythonic." > Proposal: > * Scala/Java: We should start deprecating the old static train() methods. We > must keep them for API stability, but deprecating will help with API > consistency, making it clear that everyone should use the builder pattern. > As we deprecate them, we should make sure that the builder pattern supports > all parameters. > * Python: Keep static train methods. > CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
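For illustration, here is a hypothetical sketch of a more generic optimizer contract that is not tied to (Double, Vector) samples; GenericOptimizer and LabeledVectorOptimizer are made-up names, with the existing behaviour shown as just one instantiation of the generic trait.
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.optimization.Optimizer

// Hypothetical: the sample type becomes a type parameter instead of being fixed
// to (label: Double, features: Vector) as in the current mllib Optimizer trait.
trait GenericOptimizer[Sample] {
  def optimize(data: RDD[Sample], initialWeights: Vector): Vector
}

// The current single-label case is then just one instantiation, written here as a
// thin adapter around any existing mllib Optimizer implementation.
class LabeledVectorOptimizer(underlying: Optimizer) extends GenericOptimizer[(Double, Vector)] {
  override def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector =
    underlying.optimize(data, initialWeights)
}
{code}
A multivariate or multi-label optimizer could then be expressed as, for example, GenericOptimizer[(Vector, Vector)] without changing the trait.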
[jira] [Comment Edited] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14483729#comment-14483729 ] Alexander Ulanov edited comment on SPARK-6682 at 4/7/15 6:35 PM: - This is a very good idea. Please note, though, that there are a few issues here. 1) Setting the optimizer: the optimizers (LBFGS and SGD) have Gradient and Updater as constructor parameters. I don't think it is a good idea to force users to create a Gradient and an Updater separately just to be able to create an Optimizer. So one has to explicitly implement methods like setLBFGSOptimizer or setSGDOptimizer and return them so that the user is able to set their parameters.
```
def LBFGSOptimizer: LBFGS = {
  val lbfgs = new LBFGS(_gradient, _updater)
  optimizer = lbfgs
  lbfgs
}
```
Another downside is that if someone implements a new Optimizer then one has to add a "setMyOptimizer" method to the builder. The above problems might be solved by figuring out a better Optimizer interface that allows setting its parameters without actually creating it. 2) Setting parameters after setting the optimizer: what if the user sets the Updater after setting the Optimizer? The Optimizer takes the Updater as a constructor parameter! So one has to recreate the corresponding Optimizer.
```
private[this] def updateGradient(gradient: Gradient): Unit = {
  optimizer match {
    case lbfgs: LBFGS => lbfgs.setGradient(gradient)
    case sgd: GradientDescent => sgd.setGradient(gradient)
    case other => throw new UnsupportedOperationException(
      s"Only LBFGS and GradientDescent are supported but got ${other.getClass}.")
  }
}
```
So it is essential to work out the Optimizer interface first. was (Author: avulanov): This is a very good idea. Please note though, that there are few issues here 1) Setting optimizer: optimizers (LBFGS and SGD) have Gradient and Updater as constructor parameters. I don't think it is a good idea to force users to create Gradient and Updater separately and to be able to create Optimizer. So one have to explicitly implement methods like setLBFGSOptimizer or set SGDOptimizer and return them so the user will be able to set their parameters. ``` def LBFGSOptimizer: LBFGS = { val lbfgs = new LBFGS(_gradient, _updater) optimizer = lbfgs lbfgs } ``` Another downside of it is that if someone implements new Optimizer then one have to add "setMyOptimizer" to the builder. The above problems might be solved by figuring out a better interface of Optimizer that allows setting its parameters without actually creating it. 2) Setting parameters after setting the optimizer: what if user sets the Updater after setting the Optimizer? Optimizer takes Updater as a constructor parameter! So one has to recreate the corresponding Optimizer. ``` private[this] def updateGradient(gradient: Gradient): Unit = { optimizer match { case lbfgs: LBFGS => lbfgs.setGradient(gradient) case sgd: GradientDescent => sgd.setGradient(gradient) case other => throw new UnsupportedOperationException( s"Only LBFGS and GradientDescent are supported but got ${other.getClass}.") } } ``` So it is essential to work out the Optimizer interface first. > Deprecate static train and use builder instead for Scala/Java > - > > Key: SPARK-6682 > URL: https://issues.apache.org/jira/browse/SPARK-6682 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley > > In MLlib, we have for some time been unofficially moving away from the old > static train() methods and moving towards builder patterns. 
This JIRA is to > discuss this move and (hopefully) make it official. > "Old static train()" API: > {code} > val myModel = NaiveBayes.train(myData, ...) > {code} > "New builder pattern" API: > {code} > val nb = new NaiveBayes().setLambda(0.1) > val myModel = nb.train(myData) > {code} > Pros of the builder pattern: > * Much less code when algorithms have many parameters. Since Java does not > support default arguments, we required *many* duplicated static train() > methods (for each prefix set of arguments). > * Helps to enforce default parameters. Users should ideally not have to even > think about setting parameters if they just want to try an algorithm quickly. > * Matches spark.ml API > Cons of the builder pattern: > * In Python APIs, static train methods are more "Pythonic." > Proposal: > * Scala/Java: We should start deprecating the old static train() methods. We > must keep them for API stability, but deprecating will help with API > consistency, making it clear that everyone should use the builder pattern. > As we deprecate them, we should make sure that the builder pattern supports > all parameters. > * Python: Keep static train methods. > CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14483729#comment-14483729 ] Alexander Ulanov commented on SPARK-6682: - This is a very good idea. Please note, though, that there are a few issues here. 1) Setting the optimizer: the optimizers (LBFGS and SGD) have Gradient and Updater as constructor parameters. I don't think it is a good idea to force users to create a Gradient and an Updater separately just to be able to create an Optimizer. So one has to explicitly implement methods like setLBFGSOptimizer or setSGDOptimizer and return them so that the user is able to set their parameters.
```
def LBFGSOptimizer: LBFGS = {
  val lbfgs = new LBFGS(_gradient, _updater)
  optimizer = lbfgs
  lbfgs
}
```
Another downside is that if someone implements a new Optimizer then one has to add a "setMyOptimizer" method to the builder. The above problems might be solved by figuring out a better Optimizer interface that allows setting its parameters without actually creating it. 2) Setting parameters after setting the optimizer: what if the user sets the Updater after setting the Optimizer? The Optimizer takes the Updater as a constructor parameter! So one has to recreate the corresponding Optimizer.
```
private[this] def updateGradient(gradient: Gradient): Unit = {
  optimizer match {
    case lbfgs: LBFGS => lbfgs.setGradient(gradient)
    case sgd: GradientDescent => sgd.setGradient(gradient)
    case other => throw new UnsupportedOperationException(
      s"Only LBFGS and GradientDescent are supported but got ${other.getClass}.")
  }
}
```
So it is essential to work out the Optimizer interface first. > Deprecate static train and use builder instead for Scala/Java > - > > Key: SPARK-6682 > URL: https://issues.apache.org/jira/browse/SPARK-6682 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley > > In MLlib, we have for some time been unofficially moving away from the old > static train() methods and moving towards builder patterns. This JIRA is to > discuss this move and (hopefully) make it official. > "Old static train()" API: > {code} > val myModel = NaiveBayes.train(myData, ...) > {code} > "New builder pattern" API: > {code} > val nb = new NaiveBayes().setLambda(0.1) > val myModel = nb.train(myData) > {code} > Pros of the builder pattern: > * Much less code when algorithms have many parameters. Since Java does not > support default arguments, we required *many* duplicated static train() > methods (for each prefix set of arguments). > * Helps to enforce default parameters. Users should ideally not have to even > think about setting parameters if they just want to try an algorithm quickly. > * Matches spark.ml API > Cons of the builder pattern: > * In Python APIs, static train methods are more "Pythonic." > Proposal: > * Scala/Java: We should start deprecating the old static train() methods. We > must keep them for API stability, but deprecating will help with API > consistency, making it clear that everyone should use the builder pattern. > As we deprecate them, we should make sure that the builder pattern supports > all parameters. > * Python: Keep static train methods. > CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
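To illustrate the kind of Optimizer interface change suggested in point 1 above, here is a hypothetical sketch in which the gradient and updater become settable properties with defaults instead of constructor arguments, so a builder can configure them in any order. ConfigurableOptimizer is a made-up name and the real LBFGS/GradientDescent classes do not currently work this way.
{code}
import org.apache.spark.mllib.optimization.{Gradient, LogisticGradient, SquaredL2Updater, Updater}

// Hypothetical sketch: an optimizer that can be created first and configured later,
// which is what a builder needs. This is not the actual mllib optimizer API.
abstract class ConfigurableOptimizer {
  protected var gradient: Gradient = new LogisticGradient()
  protected var updater: Updater = new SquaredL2Updater()

  def setGradient(g: Gradient): this.type = { gradient = g; this }
  def setUpdater(u: Updater): this.type = { updater = u; this }
}
{code}
With such an interface the builder would not need the match/case dispatch quoted in the comment above; it could simply call setGradient/setUpdater on whatever concrete optimizer is currently selected.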
[jira] [Commented] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop
[ https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395185#comment-14395185 ] Alexander Ulanov commented on SPARK-2356: - The following worked for me: Download http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe and put it to DISK:\FOLDERS\bin\ Set HADOOP_CONF=DISK:\FOLDERS > Exception: Could not locate executable null\bin\winutils.exe in the Hadoop > --- > > Key: SPARK-2356 > URL: https://issues.apache.org/jira/browse/SPARK-2356 > Project: Spark > Issue Type: Bug > Components: Windows >Affects Versions: 1.0.0 >Reporter: Kostiantyn Kudriavtsev >Priority: Critical > > I'm trying to run some transformation on Spark, it works fine on cluster > (YARN, linux machines). However, when I'm trying to run it on local machine > (Windows 7) under unit test, I got errors (I don't use Hadoop, I'm read file > from local filesystem): > {code} > 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the > hadoop binary path > java.io.IOException: Could not locate executable null\bin\winutils.exe in the > Hadoop binaries. > at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318) > at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333) > at org.apache.hadoop.util.Shell.(Shell.java:326) > at org.apache.hadoop.util.StringUtils.(StringUtils.java:76) > at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93) > at org.apache.hadoop.security.Groups.(Groups.java:77) > at > org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240) > at > org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255) > at > org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283) > at > org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:36) > at > org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala:109) > at > org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala) > at org.apache.spark.SparkContext.(SparkContext.scala:228) > at org.apache.spark.SparkContext.(SparkContext.scala:97) > {code} > It's happened because Hadoop config is initialized each time when spark > context is created regardless is hadoop required or not. > I propose to add some special flag to indicate if hadoop config is required > (or start this configuration manually) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6673) spark-shell.cmd can't start even when spark was built in Windows
[ https://issues.apache.org/jira/browse/SPARK-6673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395001#comment-14395001 ] Alexander Ulanov commented on SPARK-6673: - Probably a similar issue: I am trying to execute unit tests in MLlib with LocalClusterSparkContext on Windows 7. I am getting a bunch of errors in the log saying: "Cannot find any assembly build directories." If I run set SPARK_SCALA_VERSION=2.10 then I get "No assemblies found in 'C:\dev\spark\mllib\.\assembly\target\scala-2.10'" > spark-shell.cmd can't start even when spark was built in Windows > > > Key: SPARK-6673 > URL: https://issues.apache.org/jira/browse/SPARK-6673 > Project: Spark > Issue Type: Bug > Components: Windows >Affects Versions: 1.3.0 >Reporter: Masayoshi TSUZUKI >Assignee: Masayoshi TSUZUKI >Priority: Blocker > > spark-shell.cmd can't start. > {code} > bin\spark-shell.cmd --master local > {code} > will get > {code} > Failed to find Spark assembly JAR. > You need to build Spark before running this program. > {code} > even when we have built spark. > This is because of the lack of the environment {{SPARK_SCALA_VERSION}} which > is used in {{spark-class2.cmd}}. > In linux scripts, this value is set as {{2.10}} or {{2.11}} by default in > {{load-spark-env.sh}}, but there are no equivalent script in Windows. > As workaround, by executing > {code} > set SPARK_SCALA_VERSION=2.10 > {code} > before execute spark-shell.cmd, we can successfully start it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org