[jira] [Updated] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gokhan Capan updated MAHOUT-1329:
    Resolution: Fixed
    Status: Resolved (was: Patch Available)

Mahout for hadoop 2
    Key: MAHOUT-1329
    URL: https://issues.apache.org/jira/browse/MAHOUT-1329
    Project: Mahout
    Issue Type: Task
    Components: build
    Affects Versions: 0.9
    Reporter: Sergey Svinarchuk
    Assignee: Gokhan Capan
    Labels: patch
    Fix For: 1.0
    Attachments: 1329-2.patch, 1329-3.patch, 1329.patch

Update Mahout to work with Hadoop 2.x, targeting this for Mahout 1.0.
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911436#comment-13911436 ]

Gokhan Capan commented on MAHOUT-1329:

I committed this to trunk.
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911451#comment-13911451 ]

Hudson commented on MAHOUT-1329:

SUCCESS: Integrated in Mahout-Quality #2490 (See [https://builds.apache.org/job/Mahout-Quality/2490/])
MAHOUT-1329: Mahout for hadoop 2 (gcapan: rev 1571637)
* /mahout/trunk/core/pom.xml
* /mahout/trunk/integration/pom.xml
* /mahout/trunk/pom.xml
[jira] [Updated] (MAHOUT-1419) Random decision forest is excessively slow on numeric features
[ https://issues.apache.org/jira/browse/MAHOUT-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-1419:
    Resolution: Fixed
    Fix Version/s: 1.0
    Assignee: Sean Owen
    Status: Resolved (was: Patch Available)

OK, the core patch is in. I think additional test scripts can be added separately as desired.

Random decision forest is excessively slow on numeric features
    Key: MAHOUT-1419
    URL: https://issues.apache.org/jira/browse/MAHOUT-1419
    Project: Mahout
    Issue Type: Bug
    Components: Classification
    Affects Versions: 0.7, 0.8, 0.9
    Reporter: Sean Owen
    Assignee: Sean Owen
    Fix For: 1.0
    Attachments: MAHOUT-1419.patch, create-rf-data.sh, run-rf.sh

Follow-up to MAHOUT-1417. A customer running this observed it taking an unreasonably long time on about 2GB of data: roughly 24 hours, where other random decision forest M/R implementations take 9 minutes. The difference is big enough to be considered a defect. MAHOUT-1417 got that down to about 5 hours; I am trying to improve it further.

One key issue is how splits are evaluated over numeric features. A split is tried for every distinct numeric value of the feature in the whole data set. Since these are floating-point values, they can be (and in the customer's case are) all distinct, so 200K rows means 200K splits to evaluate every time a node is built on the feature. A better approach is to sample percentiles from the feature and evaluate only those as splits. Doing that fully efficiently would require a large rewrite, but there are some modest changes that capture some of the benefit and appear to make it run about 3x faster -- on a data set that exhibits this problem, meaning one using numeric features that are generally distinct, which is not exotic. There are comparable but different problems with the handling of categorical features, but that's for a different patch.

I have a patch, but it changes behavior to some extent, since it evaluates only a sample of splits instead of every possible one. In particular, it makes the output of OptIgSplit no longer match DefaultIgSplit. I think the point is that the optimized version may choose different splits here, which can yield different trees, so that test probably has to go. (Along the way I found a number of micro-optimizations in this part of the code that added up to maybe a 3% speedup, and fixed an NPE too.) I will propose a patch shortly with all of this for thoughts.
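A minimal sketch, in plain Java, of the percentile-sampling idea described above. This is not the actual OptIgSplit code; the class, method, and parameter names are invented for illustration:

```java
import java.util.Arrays;

public class SplitSampling {

  // Instead of evaluating a split at every distinct value of a numeric
  // feature, sort the values once and keep a fixed number of evenly
  // spaced percentiles as candidate thresholds.
  static double[] candidateSplits(double[] featureValues, int numCandidates) {
    double[] sorted = featureValues.clone();
    Arrays.sort(sorted);
    double[] candidates = new double[numCandidates];
    for (int i = 0; i < numCandidates; i++) {
      // position of the (i+1)/(numCandidates+1) quantile in the sorted array
      int idx = (int) (((long) (i + 1) * sorted.length) / (numCandidates + 1));
      candidates[i] = sorted[Math.min(idx, sorted.length - 1)];
    }
    return candidates;
  }
}
```

With 200K distinct values, this cuts the information-gain evaluations per node from 200K down to numCandidates, which is consistent in spirit with the ~3x speedup reported above.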
[jira] [Commented] (MAHOUT-1419) Random decision forest is excessively slow on numeric features
[ https://issues.apache.org/jira/browse/MAHOUT-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911662#comment-13911662 ]

Hudson commented on MAHOUT-1419:

SUCCESS: Integrated in Mahout-Quality #2492 (See [https://builds.apache.org/job/Mahout-Quality/2492/])
MAHOUT-1419: Random decision forest is excessively slow on numeric features (srowen: rev 1571704)
* /mahout/trunk/CHANGELOG
* /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/split/OptIgSplit.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/classifier/df/split/OptIgSplitTest.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/classifier/df/tools/VisualizerTest.java
* /mahout/trunk/examples/src/main/java/org/apache/mahout/classifier/df/mapreduce/BuildForest.java
[jira] [Updated] (MAHOUT-1346) Spark Bindings (DRM)
[ https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-1346:
    Attachment: ScalaSparkBindings.pdf

WIP manual and working notes.

Spark Bindings (DRM)
    Key: MAHOUT-1346
    URL: https://issues.apache.org/jira/browse/MAHOUT-1346
    Project: Mahout
    Issue Type: Improvement
    Affects Versions: 0.8
    Reporter: Dmitriy Lyubimov
    Assignee: Dmitriy Lyubimov
    Fix For: 1.0
    Attachments: ScalaSparkBindings.pdf

Spark bindings for Mahout DRM, with a DRM DSL. Disclaimer: this will all be experimental at this point. The idea is to wrap a DRM in a Spark RDD with support for some basic functionality, and perhaps a humble beginning of a cost-based optimizer:
(0) Spark serialization support for Vector, Matrix
(1) Bagel transposition
(2) slim X'X
(2a) not-so-slim X'X
(3) blockify() (compose an RDD containing vertical blocks of the original input)
(4) read/write Mahout DRM off HDFS
(5) A'B
...
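The bindings themselves are Scala over Spark RDDs (see the attached notes); the plain-Java sketch below only illustrates the arithmetic behind the "slim X'X" item in the list above, assuming the tall-and-skinny case where the k-by-k result fits in memory:

```java
public class SlimGram {

  // X'X for a tall, skinny matrix X is the sum of the k-by-k outer
  // products of its rows, so each worker can accumulate a small dense
  // matrix locally and the partial results can be summed in one reduce.
  static double[][] slimXtX(double[][] rows, int k) {
    double[][] acc = new double[k][k];
    for (double[] x : rows) {        // distributed over RDD partitions in the real bindings
      for (int i = 0; i < k; i++) {
        for (int j = 0; j < k; j++) {
          acc[i][j] += x[i] * x[j];  // rank-one update x * x'
        }
      }
    }
    return acc;
  }
}
```

The attached PDF is the authoritative description of the actual operators and of the not-so-slim variant.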
[jira] [Created] (MAHOUT-1426) GSOC 2013 Neural network algorithms
Maciej Mazur created MAHOUT-1426:
    Summary: GSOC 2013 Neural network algorithms
    Key: MAHOUT-1426
    URL: https://issues.apache.org/jira/browse/MAHOUT-1426
    Project: Mahout
    Issue Type: Improvement
    Components: Classification
    Reporter: Maciej Mazur

I would like to ask about the possibilities of implementing neural network algorithms in Mahout during GSoC. There is a classifier.mlp package with a neural network, but I can see neither an RBM nor an Autoencoder in those classes; there is only a single mention of Autoencoders in the NeuralNetwork class. As far as I know, Mahout doesn't support convolutional networks. Is it a good idea to implement one of these algorithms? Is it a reasonable amount of work?
[jira] [Updated] (MAHOUT-1426) GSOC 2013 Neural network algorithms
[ https://issues.apache.org/jira/browse/MAHOUT-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maciej Mazur updated MAHOUT-1426:
    Description:

I would like to ask about the possibilities of implementing neural network algorithms in Mahout during GSoC. There is a classifier.mlp package with a neural network, but I can see neither an RBM nor an Autoencoder in those classes; there is only a single mention of Autoencoders in the NeuralNetwork class. As far as I know, Mahout doesn't support convolutional networks. Is it a good idea to implement one of these algorithms? Is it a reasonable amount of work? How hard is it to get a GSoC slot in Mahout? Did anyone succeed last year?

    (was: the same description without the final two questions)
Re: [jira] [Created] (MAHOUT-1426) GSOC 2013 Neural network algorithms
Since training a neural network generally requires many iterations, it is not perfectly suited to the MapReduce style. Currently the NeuralNetwork is implemented as an online learning model, and training is conducted via stochastic gradient descent. Moreover, the current version of NeuralNetwork is mainly used for supervised learning, so there is no RBM or Autoencoder.

Regards,
Yexi

2014-02-25 10:34 GMT-05:00 Maciej Mazur (JIRA) j...@apache.org:

> Maciej Mazur created MAHOUT-1426 (GSOC 2013 Neural network algorithms); see the issue description above.

--
Yexi Jiang, ECS 251, yjian...@cs.fiu.edu
School of Computer and Information Science, Florida International University
Homepage: http://users.cis.fiu.edu/~yjian004/
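For readers unfamiliar with the terms, here is a minimal sketch of what "online learning via stochastic gradient descent" means: one example at a time, with a gradient step after each. The names below are illustrative, not the classifier.mlp API, and a single linear unit stands in for the full MLP forward pass:

```java
public class OnlineSgd {

  // Train by visiting examples one at a time; each example triggers an
  // immediate gradient update, which is why the model is "online" and
  // why the loop is inherently sequential (hard to map onto MapReduce).
  static void train(double[][] examples, double[] labels,
                    double[] weights, double learningRate, int epochs) {
    for (int e = 0; e < epochs; e++) {
      for (int n = 0; n < examples.length; n++) {
        double[] x = examples[n];
        // forward pass of a single linear unit (stand-in for the MLP)
        double pred = 0.0;
        for (int i = 0; i < weights.length; i++) {
          pred += weights[i] * x[i];
        }
        double error = pred - labels[n];
        // gradient of squared error w.r.t. each weight; step against it
        for (int i = 0; i < weights.length; i++) {
          weights[i] -= learningRate * error * x[i];
        }
      }
    }
  }
}
```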
[jira] [Comment Edited] (MAHOUT-1426) GSOC 2013 Neural network algorithms
[ https://issues.apache.org/jira/browse/MAHOUT-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911680#comment-13911680 ]

Suneel Marthi edited comment on MAHOUT-1426 at 2/25/14 3:59 PM:

The classifier.mlp is a supervised classifier trained online using SGD. There are old JIRAs that had an RBM implementation (not MapReduce), MAHOUT-968, and one for Autoencoders, MAHOUT-732; neither ever made it into the codebase.

    was (Author: smarthi): the same comment with "supervised" misspelled as "supercised"
[jira] [Commented] (MAHOUT-1426) GSOC 2013 Neural network algorithms
[ https://issues.apache.org/jira/browse/MAHOUT-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911680#comment-13911680 ]

Suneel Marthi commented on MAHOUT-1426:

The classifier.mlp is a supervised classifier trained online using SGD. There are old JIRAs that had an RBM implementation (not MapReduce), MAHOUT-968, and one for Autoencoders, MAHOUT-732; neither ever made it into the codebase.
Re: [jira] [Commented] (MAHOUT-1426) GSOC 2013 Neural network algorithms
I understand that neural networks aren't perfectly suitable for MapReduce, but with a very large network and a large training set, MapReduce seems like a good solution. RBMs and Autoencoders are used for pretraining: they allow learning a better representation for deep architectures (according to http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf). Deep supervised multi-layer neural networks are very hard to train starting from random initialization.

On Tue, Feb 25, 2014 at 5:01 PM, Suneel Marthi (JIRA) j...@apache.org wrote:

> Suneel Marthi commented on MAHOUT-1426 (see the comment above).
Re: [jira] [Commented] (MAHOUT-1426) GSOC 2013 Neural network algorithms
Doing a non-map-reduce neural network in Mahout would be of substantial interest. I don't see a role for something that is 10x slower than it should be.

On Tue, Feb 25, 2014 at 10:03 AM, Maciej Mazur maciejmaz...@gmail.com wrote:

> I understand that neural networks aren't perfectly suitable for MapReduce... (see the reply above).
[jira] [Commented] (MAHOUT-1426) GSOC 2013 Neural network algorithms
[ https://issues.apache.org/jira/browse/MAHOUT-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911865#comment-13911865 ]

Yexi Jiang commented on MAHOUT-1426:

I totally agree with you. From the algorithmic perspective, RBMs and Autoencoders have proven very effective for feature learning. When training a multi-layer neural network, it is usually necessary to stack RBMs or Autoencoders to learn representative features first. The cost argument plays out differently depending on where the scale is:

1. If the training dataset is large. If the training data is huge, the online version can be slow, as it is not a parallel implementation. If we implement the algorithm in the MapReduce way, the data can be read in parallel. But no matter whether we use stochastic gradient descent, mini-batch gradient descent, or full-batch gradient descent, we need to train the model over many iterations, and in practice we need one job per iteration. It is known that Hadoop's job start-up time is considerable, so the overhead can be even higher than the actual computing time (see the back-of-envelope sketch below). For example, with stochastic gradient descent, after each partition reads one data instance we need to update and synchronize the model. IMHO, BSP is more effective than MapReduce in such a scenario.

2. If the model is large. If the model is large, we need to partition it and store it in a distributed fashion; a solution can be found in a related NIPS paper (http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/large_deep_networks_nips2012.pdf). In this case the distributed system needs to be heterogeneous, since different nodes have different tasks (parameter storage or computing). It is difficult to design an algorithm for such work under the MapReduce style, where every task is considered homogeneous. Actually, according to the Tera-scale deep learning talk (http://static.googleusercontent.com/media/research.google.com/en/us/archive/unsupervised_learning_talk_2012.pdf), even BSP is not quite suitable, since errors can always happen in a large-scale distributed system; in their implementation, they built an asynchronous computing framework to conduct the large-scale learning.

In summary, implementing a MapReduce version of NeuralNetwork is OK, but compared with more suitable computing frameworks it is not especially efficient.
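To make the start-up-overhead argument in point 1 concrete, here is a back-of-envelope calculation; all figures are invented for illustration, not measurements of Hadoop or Mahout:

```java
public class IterationOverhead {
  public static void main(String[] args) {
    int iterations = 1000;     // gradient-descent passes, each needing synchronization
    double startupSec = 30.0;  // assumed fixed job start-up cost per iteration
    double computeSec = 5.0;   // assumed useful gradient computation per iteration

    // one MapReduce job per iteration pays the start-up cost every time
    double totalSec = iterations * (startupSec + computeSec);
    double overheadShare = startupSec / (startupSec + computeSec);

    System.out.printf("total: %.0f s, overhead share: %.0f%%%n",
        totalSec, overheadShare * 100);  // prints: total: 35000 s, overhead share: 86%
  }
}
```

Under these assumed numbers, about 86% of the wall-clock time goes to job start-up rather than computation, which is the scenario where a BSP or asynchronous framework keeps workers alive across iterations and avoids the repeated cost.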