[jira] [Updated] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gokhan Capan updated MAHOUT-1329:
    Resolution: Fixed
    Status: Resolved (was: Patch Available)

Mahout for hadoop 2
    Key: MAHOUT-1329
    URL: https://issues.apache.org/jira/browse/MAHOUT-1329
    Project: Mahout
    Issue Type: Task
    Components: build
    Affects Versions: 0.9
    Reporter: Sergey Svinarchuk
    Assignee: Gokhan Capan
    Labels: patch
    Fix For: 1.0
    Attachments: 1329-2.patch, 1329-3.patch, 1329.patch

Update Mahout to work with Hadoop 2.x, targeting this for Mahout 1.0.
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911436#comment-13911436 ]

Gokhan Capan commented on MAHOUT-1329:

I committed this to trunk.
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911451#comment-13911451 ]

Hudson commented on MAHOUT-1329:

SUCCESS: Integrated in Mahout-Quality #2490 (See [https://builds.apache.org/job/Mahout-Quality/2490/])
MAHOUT-1329: Mahout for hadoop 2 (gcapan: rev 1571637)
* /mahout/trunk/core/pom.xml
* /mahout/trunk/integration/pom.xml
* /mahout/trunk/pom.xml
[jira] [Updated] (MAHOUT-1419) Random decision forest is excessively slow on numeric features
[ https://issues.apache.org/jira/browse/MAHOUT-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-1419:
    Resolution: Fixed
    Fix Version/s: 1.0
    Assignee: Sean Owen
    Status: Resolved (was: Patch Available)

OK, the core patch is in. I think additional test scripts can be added separately as desired.

Random decision forest is excessively slow on numeric features
    Key: MAHOUT-1419
    URL: https://issues.apache.org/jira/browse/MAHOUT-1419
    Project: Mahout
    Issue Type: Bug
    Components: Classification
    Affects Versions: 0.7, 0.8, 0.9
    Reporter: Sean Owen
    Assignee: Sean Owen
    Fix For: 1.0
    Attachments: MAHOUT-1419.patch, create-rf-data.sh, run-rf.sh

Follow-up to MAHOUT-1417. A customer running this observed it taking an unreasonably long time on about 2GB of data: roughly 24 hours, where other random decision forest M/R implementations take 9 minutes. The difference is big enough to be considered a defect. MAHOUT-1417 got that down to about 5 hours; I am trying to improve it further.

One key issue is how splits are evaluated over numeric features. A split is tried for every distinct numeric value of the feature in the whole data set. Since these are floating-point values, they can be (and in the customer's case are) all distinct, so 200K rows means 200K splits to evaluate every time a node is built on the feature. A better approach is to sample percentiles from the feature and evaluate only those as splits. Doing that fully efficiently would require a large rewrite, but there are some modest changes that capture some of the benefit and appear to make it run about 3x faster -- on a data set that exhibits this problem, meaning one using numeric features that are generally distinct, which is not exotic. There are comparable but different problems with the handling of categorical features, but that's for a different patch.

I have a patch, but it changes behavior to some extent, since it evaluates only a sample of splits instead of every possible one. In particular, it makes the output of OptIgSplit no longer match DefaultIgSplit. I think the point is that the optimized version may choose different splits here, which can yield different trees, so that test probably has to go. (Along the way I found a number of micro-optimizations in this part of the code that added up to maybe a 3% speedup, and fixed an NPE too.) I will propose a patch shortly with all of this for thoughts.
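A minimal sketch, in plain Java, of the percentile-sampling idea described above. This is not the actual OptIgSplit code; the class, method, and parameter names are invented for illustration:

```java
import java.util.Arrays;

public class SplitSampling {

  // Instead of evaluating a split at every distinct value of a numeric
  // feature, sort the values once and keep a fixed number of evenly
  // spaced percentiles as candidate thresholds.
  static double[] candidateSplits(double[] featureValues, int numCandidates) {
    double[] sorted = featureValues.clone();
    Arrays.sort(sorted);
    double[] candidates = new double[numCandidates];
    for (int i = 0; i < numCandidates; i++) {
      // position of the (i+1)/(numCandidates+1) quantile in the sorted array
      int idx = (int) (((long) (i + 1) * sorted.length) / (numCandidates + 1));
      candidates[i] = sorted[Math.min(idx, sorted.length - 1)];
    }
    return candidates;
  }
}
```

With 200K distinct values, this cuts the information-gain evaluations per node from 200K down to numCandidates, which is consistent in spirit with the ~3x speedup reported above.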
[jira] [Commented] (MAHOUT-1419) Random decision forest is excessively slow on numeric features
[ https://issues.apache.org/jira/browse/MAHOUT-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911662#comment-13911662 ]

Hudson commented on MAHOUT-1419:

SUCCESS: Integrated in Mahout-Quality #2492 (See [https://builds.apache.org/job/Mahout-Quality/2492/])
MAHOUT-1419: Random decision forest is excessively slow on numeric features (srowen: rev 1571704)
* /mahout/trunk/CHANGELOG
* /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/split/OptIgSplit.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/classifier/df/split/OptIgSplitTest.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/classifier/df/tools/VisualizerTest.java
* /mahout/trunk/examples/src/main/java/org/apache/mahout/classifier/df/mapreduce/BuildForest.java
[jira] [Updated] (MAHOUT-1346) Spark Bindings (DRM)
[ https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-1346:
    Attachment: ScalaSparkBindings.pdf

WIP manual and working notes.

Spark Bindings (DRM)
    Key: MAHOUT-1346
    URL: https://issues.apache.org/jira/browse/MAHOUT-1346
    Project: Mahout
    Issue Type: Improvement
    Affects Versions: 0.8
    Reporter: Dmitriy Lyubimov
    Assignee: Dmitriy Lyubimov
    Fix For: 1.0
    Attachments: ScalaSparkBindings.pdf

Spark bindings for Mahout DRM, with a DRM DSL. Disclaimer: this will all be experimental at this point. The idea is to wrap a DRM in a Spark RDD with support for some basic functionality, and perhaps a humble beginning of a cost-based optimizer:
(0) Spark serialization support for Vector, Matrix
(1) Bagel transposition
(2) slim X'X
(2a) not-so-slim X'X
(3) blockify() (compose an RDD containing vertical blocks of the original input)
(4) read/write Mahout DRM off HDFS
(5) A'B
...
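The bindings themselves are Scala over Spark RDDs (see the attached notes); the plain-Java sketch below only illustrates the arithmetic behind the "slim X'X" item in the list above, assuming the tall-and-skinny case where the k-by-k result fits in memory:

```java
public class SlimGram {

  // X'X for a tall, skinny matrix X is the sum of the k-by-k outer
  // products of its rows, so each worker can accumulate a small dense
  // matrix locally and the partial results can be summed in one reduce.
  static double[][] slimXtX(double[][] rows, int k) {
    double[][] acc = new double[k][k];
    for (double[] x : rows) {        // distributed over RDD partitions in the real bindings
      for (int i = 0; i < k; i++) {
        for (int j = 0; j < k; j++) {
          acc[i][j] += x[i] * x[j];  // rank-one update x * x'
        }
      }
    }
    return acc;
  }
}
```

The attached PDF is the authoritative description of the actual operators and of the not-so-slim variant.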
[jira] [Created] (MAHOUT-1426) GSOC 2013 Neural network algorithms
Maciej Mazur created MAHOUT-1426:
    Summary: GSOC 2013 Neural network algorithms
    Key: MAHOUT-1426
    URL: https://issues.apache.org/jira/browse/MAHOUT-1426
    Project: Mahout
    Issue Type: Improvement
    Components: Classification
    Reporter: Maciej Mazur

I would like to ask about the possibilities of implementing neural network algorithms in Mahout during GSoC. There is a classifier.mlp package with a neural network, but I can see neither an RBM nor an Autoencoder in those classes; there is only a single mention of Autoencoders in the NeuralNetwork class. As far as I know, Mahout doesn't support convolutional networks. Is it a good idea to implement one of these algorithms? Is it a reasonable amount of work?
[jira] [Updated] (MAHOUT-1426) GSOC 2013 Neural network algorithms
[ https://issues.apache.org/jira/browse/MAHOUT-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maciej Mazur updated MAHOUT-1426:
    Description:

I would like to ask about the possibilities of implementing neural network algorithms in Mahout during GSoC. There is a classifier.mlp package with a neural network, but I can see neither an RBM nor an Autoencoder in those classes; there is only a single mention of Autoencoders in the NeuralNetwork class. As far as I know, Mahout doesn't support convolutional networks. Is it a good idea to implement one of these algorithms? Is it a reasonable amount of work? How hard is it to get a GSoC slot in Mahout? Did anyone succeed last year?

    (was: the same description without the final two questions)
Re: [jira] [Created] (MAHOUT-1426) GSOC 2013 Neural network algorithms
Since training a neural network generally requires many iterations, it is not perfectly suited to the MapReduce style. Currently the NeuralNetwork is implemented as an online learning model, and training is conducted via stochastic gradient descent. Moreover, the current version of NeuralNetwork is mainly used for supervised learning, so there is no RBM or Autoencoder.

Regards,
Yexi

2014-02-25 10:34 GMT-05:00 Maciej Mazur (JIRA) j...@apache.org:

> Maciej Mazur created MAHOUT-1426 (GSOC 2013 Neural network algorithms); see the issue description above.

--
Yexi Jiang, ECS 251, yjian...@cs.fiu.edu
School of Computer and Information Science, Florida International University
Homepage: http://users.cis.fiu.edu/~yjian004/
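For readers unfamiliar with the terms, here is a minimal sketch of what "online learning via stochastic gradient descent" means: one example at a time, with a gradient step after each. The names below are illustrative, not the classifier.mlp API, and a single linear unit stands in for the full MLP forward pass:

```java
public class OnlineSgd {

  // Train by visiting examples one at a time; each example triggers an
  // immediate gradient update, which is why the model is "online" and
  // why the loop is inherently sequential (hard to map onto MapReduce).
  static void train(double[][] examples, double[] labels,
                    double[] weights, double learningRate, int epochs) {
    for (int e = 0; e < epochs; e++) {
      for (int n = 0; n < examples.length; n++) {
        double[] x = examples[n];
        // forward pass of a single linear unit (stand-in for the MLP)
        double pred = 0.0;
        for (int i = 0; i < weights.length; i++) {
          pred += weights[i] * x[i];
        }
        double error = pred - labels[n];
        // gradient of squared error w.r.t. each weight; step against it
        for (int i = 0; i < weights.length; i++) {
          weights[i] -= learningRate * error * x[i];
        }
      }
    }
  }
}
```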
[jira] [Comment Edited] (MAHOUT-1426) GSOC 2013 Neural network algorithms
[ https://issues.apache.org/jira/browse/MAHOUT-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911680#comment-13911680 ]

Suneel Marthi edited comment on MAHOUT-1426 at 2/25/14 3:59 PM:

The classifier.mlp is a supervised classifier trained online using SGD. There are old JIRAs that had an RBM implementation (not MapReduce), MAHOUT-968, and one for Autoencoders, MAHOUT-732; neither ever made it into the codebase.

    was (Author: smarthi): the same comment with "supervised" misspelled as "supercised"
[jira] [Commented] (MAHOUT-1426) GSOC 2013 Neural network algorithms
[ https://issues.apache.org/jira/browse/MAHOUT-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911680#comment-13911680 ]

Suneel Marthi commented on MAHOUT-1426:

The classifier.mlp is a supervised classifier trained online using SGD. There are old JIRAs that had an RBM implementation (not MapReduce), MAHOUT-968, and one for Autoencoders, MAHOUT-732; neither ever made it into the codebase.
Re: [jira] [Commented] (MAHOUT-1426) GSOC 2013 Neural network algorithms
I understand that neural networks aren't perfectly suitable for MapReduce, but with a very large network and a large training set, MapReduce seems like a good solution. RBMs and Autoencoders are used for pretraining: they allow learning a better representation for deep architectures (according to http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf). Deep supervised multi-layer neural networks are very hard to train starting from random initialization.

On Tue, Feb 25, 2014 at 5:01 PM, Suneel Marthi (JIRA) j...@apache.org wrote:

> Suneel Marthi commented on MAHOUT-1426 (see the comment above).
Re: [jira] [Commented] (MAHOUT-1426) GSOC 2013 Neural network algorithms
Doing a non-map-reduce neural network in Mahout would be of substantial interest. I don't see a role for something that is 10x slower than it should be.

On Tue, Feb 25, 2014 at 10:03 AM, Maciej Mazur maciejmaz...@gmail.com wrote:

> I understand that neural networks aren't perfectly suitable for MapReduce... (see the reply above).
[jira] [Commented] (MAHOUT-1426) GSOC 2013 Neural network algorithms
[ https://issues.apache.org/jira/browse/MAHOUT-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911865#comment-13911865 ]

Yexi Jiang commented on MAHOUT-1426:

I totally agree with you. From the algorithmic perspective, RBMs and Autoencoders have proven very effective for feature learning. When training a multi-layer neural network, it is usually necessary to stack RBMs or Autoencoders to learn representative features first. The cost argument plays out differently depending on where the scale is:

1. If the training dataset is large. If the training data is huge, the online version can be slow, as it is not a parallel implementation. If we implement the algorithm in the MapReduce way, the data can be read in parallel. But no matter whether we use stochastic gradient descent, mini-batch gradient descent, or full-batch gradient descent, we need to train the model over many iterations, and in practice we need one job per iteration. It is known that Hadoop's job start-up time is considerable, so the overhead can be even higher than the actual computing time (see the back-of-envelope sketch below). For example, with stochastic gradient descent, after each partition reads one data instance we need to update and synchronize the model. IMHO, BSP is more effective than MapReduce in such a scenario.

2. If the model is large. If the model is large, we need to partition it and store it in a distributed fashion; a solution can be found in a related NIPS paper (http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/large_deep_networks_nips2012.pdf). In this case the distributed system needs to be heterogeneous, since different nodes have different tasks (parameter storage or computing). It is difficult to design an algorithm for such work under the MapReduce style, where every task is considered homogeneous. Actually, according to the Tera-scale deep learning talk (http://static.googleusercontent.com/media/research.google.com/en/us/archive/unsupervised_learning_talk_2012.pdf), even BSP is not quite suitable, since errors can always happen in a large-scale distributed system; in their implementation, they built an asynchronous computing framework to conduct the large-scale learning.

In summary, implementing a MapReduce version of NeuralNetwork is OK, but compared with more suitable computing frameworks it is not especially efficient.
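To make the start-up-overhead argument in point 1 concrete, here is a back-of-envelope calculation; all figures are invented for illustration, not measurements of Hadoop or Mahout:

```java
public class IterationOverhead {
  public static void main(String[] args) {
    int iterations = 1000;     // gradient-descent passes, each needing synchronization
    double startupSec = 30.0;  // assumed fixed job start-up cost per iteration
    double computeSec = 5.0;   // assumed useful gradient computation per iteration

    // one MapReduce job per iteration pays the start-up cost every time
    double totalSec = iterations * (startupSec + computeSec);
    double overheadShare = startupSec / (startupSec + computeSec);

    System.out.printf("total: %.0f s, overhead share: %.0f%%%n",
        totalSec, overheadShare * 100);  // prints: total: 35000 s, overhead share: 86%
  }
}
```

Under these assumed numbers, about 86% of the wall-clock time goes to job start-up rather than computation, which is the scenario where a BSP or asynchronous framework keeps workers alive across iterations and avoids the repeated cost.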