[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-18 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13686449#comment-13686449
 ] 

Sebastian Schelter commented on MAHOUT-1214:


We should open a new issue for the bug; don't mix it in here.

 Improve the accuracy of the Spectral KMeans Method
 --

 Key: MAHOUT-1214
 URL: https://issues.apache.org/jira/browse/MAHOUT-1214
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.7
 Environment: Mahout 0.7
Reporter: Yiqun Hu
Assignee: Robin Anil
  Labels: clustering, improvement
 Fix For: 0.8

 Attachments: MAHOUT-1214.patch, MAHOUT-1214.patch, matrix_1, matrix_2


 The current implementation of the spectral KMeans algorithm (Andrew Ng et 
 al., NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
 implementations make it fail even for a very obvious, trivial dataset. We have 
 implemented a solution that resolves these two issues and hope to contribute 
 it back to the community.
 # Issue 1: 
 The EigenVerificationJob in version 0.7 does not check the orthogonality of 
 eigenvectors, which is necessary to obtain correct clustering results in 
 the case of K > 1. We have an idea and an implementation to select eigenvectors 
 based on cosAngle/orthogonality.
 # Issue 2:
 The random seed initialization of the KMeans algorithm is not optimal, and 
 sometimes a bad initialization will produce a wrong clustering result. In this 
 case, the selected K eigenvectors actually provide a better way to initialize 
 the cluster centroids, because each selected eigenvector is a relaxed indicator 
 of the membership of one cluster. For every selected eigenvector, we use the 
 data point whose eigen-component achieves the maximum absolute value. 
 We have already verified our improvement on a synthetic dataset, and it shows 
 that the improved version obtains the optimal clustering result while the 
 current 0.7 version obtains a wrong result.
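
 For illustration, here is a minimal sketch of the two proposed fixes (not the 
 attached patch; names and thresholds are hypothetical), using Mahout's 
 org.apache.mahout.math.Vector API: an eigenvector is kept only if it is nearly 
 orthogonal to the ones already selected, and each selected eigenvector seeds 
 one cluster centroid with the data point whose eigen-component has the largest 
 absolute value.

import java.util.ArrayList;
import java.util.List;
import org.apache.mahout.math.Vector;

public class SpectralSelectionSketch {

  // Keep an eigenvector only if its cosine angle with every previously selected
  // eigenvector stays below a small threshold (i.e. it is nearly orthogonal).
  public static List<Vector> selectOrthogonal(List<Vector> eigenVectors, double maxCosAngle) {
    List<Vector> selected = new ArrayList<Vector>();
    for (Vector candidate : eigenVectors) {
      boolean nearlyOrthogonal = true;
      for (Vector kept : selected) {
        double cosAngle = Math.abs(candidate.dot(kept)) / (candidate.norm(2) * kept.norm(2));
        if (cosAngle > maxCosAngle) {
          nearlyOrthogonal = false;
          break;
        }
      }
      if (nearlyOrthogonal) {
        selected.add(candidate);
      }
    }
    return selected;
  }

  // For each selected eigenvector, pick the index of the data point whose
  // eigen-component has the maximum absolute value; that point seeds one centroid.
  public static int[] centroidSeedIndices(List<Vector> selected) {
    int[] seeds = new int[selected.size()];
    for (int k = 0; k < selected.size(); k++) {
      Vector v = selected.get(k);
      int bestIndex = 0;
      double bestAbs = Double.NEGATIVE_INFINITY;
      for (int i = 0; i < v.size(); i++) {
        if (Math.abs(v.get(i)) > bestAbs) {
          bestAbs = Math.abs(v.get(i));
          bestIndex = i;
        }
      }
      seeds[k] = bestIndex;
    }
    return seeds;
  }
}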

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-18 Thread zhang da (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13686529#comment-13686529
 ] 

zhang da commented on MAHOUT-1214:
--

I believe the dot product is a false alarm and the problem is in our patch. Let 
me fix it and update the patch tonight.

 Improve the accuracy of the Spectral KMeans Method
 --

 Key: MAHOUT-1214
 URL: https://issues.apache.org/jira/browse/MAHOUT-1214
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.7
 Environment: Mahout 0.7
Reporter: Yiqun Hu
Assignee: Robin Anil
  Labels: clustering, improvement
 Fix For: 0.8

 Attachments: MAHOUT-1214.patch, MAHOUT-1214.patch, matrix_1, matrix_2


 The current implementation of the spectral KMeans algorithm (Andrew Ng et 
 al., NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
 implementations make it fail even for a very obvious, trivial dataset. We have 
 implemented a solution that resolves these two issues and hope to contribute 
 it back to the community.
 # Issue 1: 
 The EigenVerificationJob in version 0.7 does not check the orthogonality of 
 eigenvectors, which is necessary to obtain correct clustering results in 
 the case of K > 1. We have an idea and an implementation to select eigenvectors 
 based on cosAngle/orthogonality.
 # Issue 2:
 The random seed initialization of the KMeans algorithm is not optimal, and 
 sometimes a bad initialization will produce a wrong clustering result. In this 
 case, the selected K eigenvectors actually provide a better way to initialize 
 the cluster centroids, because each selected eigenvector is a relaxed indicator 
 of the membership of one cluster. For every selected eigenvector, we use the 
 data point whose eigen-component achieves the maximum absolute value. 
 We have already verified our improvement on a synthetic dataset, and it shows 
 that the improved version obtains the optimal clustering result while the 
 current 0.7 version obtains a wrong result.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (MAHOUT-1265) Add Multilayer Perceptron

2013-06-18 Thread Yexi Jiang (JIRA)
Yexi Jiang created MAHOUT-1265:
--

 Summary: Add Multilayer Perceptron 
 Key: MAHOUT-1265
 URL: https://issues.apache.org/jira/browse/MAHOUT-1265
 Project: Mahout
  Issue Type: New Feature
Reporter: Yexi Jiang


Design of multilayer perceptron


1. Motivation
A multilayer perceptron (MLP) is a kind of feed-forward artificial neural 
network, which is a mathematical model inspired by the biological neural 
network. The multilayer perceptron can be used for various machine learning 
tasks such as classification and regression. It would be helpful to have it 
included in Mahout.

2. API

The design goal of the API is to make the MLP easy for users to work with and 
to keep the implementation details transparent to the user.

The following is example code showing how a user would use the MLP.
-
//  set the parameters
double learningRate = 0.5;
double momentum = 0.1;
double regularization = 0.01;
int[] layerSizeArray = new int[] {2, 5, 1};
String costFuncName = "SquaredError";
String squashingFuncName = "Sigmoid";
//  the location to store the model; if there is already an existing model at
//  the specified location, the MLP will throw an exception
URI modelLocation = ...
MultilayerPerceptron mlp = new MultilayerPerceptron(learningRate, 
    regularization, momentum, squashingFuncName, costFuncName, layerSizeArray, 
    modelLocation);

//  the user can also load an existing model with a given URI and update the
//  model with new training data; if there is no existing model at the specified
//  location, an exception will be thrown
/*
MultilayerPerceptron mlp = new MultilayerPerceptron(learningRate, 
    regularization, momentum, squashingFuncName, costFuncName, modelLocation);
*/

URI trainingDataLocation = …
//  the details of training are transparent to the user; it may run on a
//  single machine or in a distributed environment
mlp.train(trainingDataLocation);

//  the user can also train the model with one training instance at a time,
//  in stochastic gradient descent fashion
Vector trainingInstance = ...
mlp.train(trainingInstance);

//  prepare the input feature
Vector inputFeature = …
//  the semantic meaning of the output result is defined by the user
//  in the general case, the dimension of the output vector is 1 for regression
//  and two-class classification
//  the dimension of the output vector is n for n-class classification (n > 2)
Vector outputVector = mlp.output(inputFeature); 
-


3. Methodology

The output calculation can easily be implemented with a feed-forward approach. 
Also, single-machine training is straightforward. The following describes how 
to train the MLP in a distributed way with batch gradient descent. The 
workflow is illustrated in the figure below.


https://docs.google.com/drawings/d/1s8hiYKpdrP3epe1BzkrddIfShkxPrqSuQBH0NAawEM4/pub?w=960&h=720

For the distributed training, each training iteration is divided into two 
steps: the weight update calculation step and the weight update step. The 
distributed MLP can only be trained in a batch-update fashion.


3.1 The partial weight update calculation step:
This step trains the MLP in a distributed fashion. Each task gets a copy of the 
MLP model and calculates the weight update from its partition of the data.

Suppose the training error is E(w) = ½ \sum_{d \in D} cost(t_d, y_d), where D 
denotes the training set, d denotes a training instance, t_d denotes the class 
label and y_d denotes the output of the MLP. Also, suppose the sigmoid function 
is used as the squashing function, 
squared error is used as the cost function, 
t_i denotes the target value for the ith dimension of the output layer, 
o_i denotes the actual output for the ith dimension of the output layer, 
l denotes the learning rate,
and w_{ij} denotes the weight between the jth neuron in the previous layer and 
the ith neuron in the next layer. 

The weight of each edge is updated as 

\Delta w_{ij} = l * 1 / m * \delta_j * o_i, 

where \delta_j = - \sum_{m} o_j^{(m)} * (1 - o_j^{(m)}) * (t_j^{(m)} - 
o_j^{(m)}) for the output layer, and \delta_j = - \sum_{m} o_j^{(m)} * (1 - 
o_j^{(m)}) * \sum_k \delta_k * w_{jk} for a hidden layer. 

It is easy to see that \delta_j can be rewritten as 

\delta_j = - \sum_{i = 1}^k \sum_{m_i} o_j^{(m_i)} * (1 - o_j^{(m_i)}) * 
(t_j^{(m_i)} - o_j^{(m_i)}),

where the outer sum runs over the k data partitions and m_i ranges over the 
instances in partition i.

The above equation indicates that the computation of \delta_j can be divided 
into k parts.

So for the implementation, each mapper can calculate its part of \delta_j from 
its given partition of the data, and then store the result at a specified 
location.


3.2 The model update step:

After the k parts of \delta_j have been calculated, a separate program can 
merge them into one and update the weight matrices.

This program loads the results calculated in the weight update calculation 
step and updates the weight matrices. 
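
As a rough sketch of this two-step scheme (illustrative only, not the eventual 
Mahout implementation; MlpModel, zeroGradients, accumulateGradients and 
updateWeights are hypothetical helpers), each worker accumulates a partial 
gradient over its data partition, and the merge program sums the k partial 
gradients before applying a single batch update:

import java.util.List;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.Vector;

public class DistributedMlpTrainingSketch {

  // Hypothetical model interface used only for this sketch.
  public interface MlpModel {
    Matrix[] zeroGradients();                               // one gradient matrix per layer
    void accumulateGradients(Vector instance, Matrix[] gradients);
    void updateWeights(int layer, Matrix delta);
  }

  // Step 3.1: executed once per partition (e.g. inside a mapper). Accumulates the
  // partial gradient of every weight matrix over the instances of that partition.
  public static Matrix[] partialGradients(MlpModel model, Iterable<Vector> partition) {
    Matrix[] partial = model.zeroGradients();
    for (Vector instance : partition) {
      // forward pass, then backpropagate the error of this single instance
      model.accumulateGradients(instance, partial);
    }
    return partial;                                         // written to a shared location
  }

  // Step 3.2: executed by one merge program after all k partitions have finished.
  // Sums the k partial gradients and applies one batch update to the weights.
  public static void mergeAndUpdate(MlpModel model, List<Matrix[]> allPartials,
                                    double learningRate, long numInstances) {
    Matrix[] total = model.zeroGradients();
    for (Matrix[] partial : allPartials) {
      for (int layer = 0; layer < total.length; layer++) {
        total[layer] = total[layer].plus(partial[layer]);   // sum the k parts of \delta
      }
    }
    for (int layer = 0; layer < total.length; layer++) {
      // \Delta w = l * (1 / m) * summed gradient, applied once for the whole batch
      model.updateWeights(layer, total[layer].times(learningRate / (double) numInstances));
    }
  }
}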


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, 

[jira] [Commented] (MAHOUT-1265) Add Multilayer Perceptron

2013-06-18 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13686833#comment-13686833
 ] 

Ted Dunning commented on MAHOUT-1265:
-

Yexi,

I would suggest that a more fluid API would be helpful to people. For 
instance, each layer might be an object that could be composed with others to 
build a model, which is then trained.

Secondly, it seems like it would be good to have different kinds of loss 
functions and regularizations.

Also, regarding things like momentum, do you have a sense of whether this 
really needs to be commonly adjusted? Or is there a way to set a good default?

 Add Multilayer Perceptron 
 --

 Key: MAHOUT-1265
 URL: https://issues.apache.org/jira/browse/MAHOUT-1265
 Project: Mahout
  Issue Type: New Feature
Reporter: Yexi Jiang
  Labels: machine_learning, neural_network

 Design of multilayer perceptron
 1. Motivation
 A multilayer perceptron (MLP) is a kind of feed-forward artificial neural 
 network, which is a mathematical model inspired by the biological neural 
 network. The multilayer perceptron can be used for various machine learning 
 tasks such as classification and regression. It would be helpful to have it 
 included in Mahout.
 2. API
 The design goal of the API is to make the MLP easy for users to work with and 
 to keep the implementation details transparent to the user.
 The following is example code showing how a user would use the MLP.
 -
 //  set the parameters
 double learningRate = 0.5;
 double momentum = 0.1;
 double regularization = 0.01;
 int[] layerSizeArray = new int[] {2, 5, 1};
 String costFuncName = "SquaredError";
 String squashingFuncName = "Sigmoid";
 //  the location to store the model; if there is already an existing model at 
 //  the specified location, the MLP will throw an exception
 URI modelLocation = ...
 MultilayerPerceptron mlp = new MultilayerPerceptron(learningRate, 
     regularization, momentum, squashingFuncName, costFuncName, layerSizeArray, 
     modelLocation);
 //  the user can also load an existing model with a given URI and update the 
 //  model with new training data; if there is no existing model at the specified 
 //  location, an exception will be thrown
 /*
 MultilayerPerceptron mlp = new MultilayerPerceptron(learningRate, 
     regularization, momentum, squashingFuncName, costFuncName, modelLocation);
 */
 URI trainingDataLocation = …
 //  the details of training are transparent to the user; it may run on a 
 //  single machine or in a distributed environment
 mlp.train(trainingDataLocation);
 //  the user can also train the model with one training instance at a time, 
 //  in stochastic gradient descent fashion
 Vector trainingInstance = ...
 mlp.train(trainingInstance);
 //  prepare the input feature
 Vector inputFeature = …
 //  the semantic meaning of the output result is defined by the user
 //  in the general case, the dimension of the output vector is 1 for regression 
 //  and two-class classification
 //  the dimension of the output vector is n for n-class classification (n > 2)
 Vector outputVector = mlp.output(inputFeature); 
 -
 3. Methodology
 The output calculation can easily be implemented with a feed-forward approach. 
 Also, single-machine training is straightforward. The following describes how 
 to train the MLP in a distributed way with batch gradient descent. The 
 workflow is illustrated in the figure below.
 https://docs.google.com/drawings/d/1s8hiYKpdrP3epe1BzkrddIfShkxPrqSuQBH0NAawEM4/pub?w=960&h=720
 For the distributed training, each training iteration is divided into two 
 steps: the weight update calculation step and the weight update step. The 
 distributed MLP can only be trained in a batch-update fashion.
 3.1 The partial weight update calculation step:
 This step trains the MLP in a distributed fashion. Each task gets a copy of the 
 MLP model and calculates the weight update from its partition of the data.
 Suppose the training error is E(w) = ½ \sum_{d \in D} cost(t_d, y_d), where 
 D denotes the training set, d denotes a training instance, t_d denotes the 
 class label and y_d denotes the output of the MLP. Also, suppose the sigmoid 
 function is used as the squashing function, 
 squared error is used as the cost function, 
 t_i denotes the target value for the ith dimension of the output layer, 
 o_i denotes the actual output for the ith dimension of the output layer, 
 l denotes the learning rate,
 and w_{ij} denotes the weight between the jth neuron in the previous layer and 
 the ith neuron in the next layer. 
 The weight of each edge is updated as 
 \Delta w_{ij} = l * 1 / m * \delta_j * o_i, 
 where \delta_j = - \sum_{m} o_j^{(m)} * (1 - o_j^{(m)}) * (t_j^{(m)} - 
 o_j^{(m)}) for the output layer, and \delta_j = - \sum_{m} o_j^{(m)} * (1 - 
 o_j^{(m)}) * \sum_k \delta_k * w_{jk} for a hidden layer. 
 It is easy to see 

[jira] [Updated] (MAHOUT-1265) Add Multilayer Perceptron

2013-06-18 Thread Yexi Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yexi Jiang updated MAHOUT-1265:
---

Description: 
Design of multilayer perceptron


1. Motivation
A multilayer perceptron (MLP) is a kind of feed-forward artificial neural 
network, which is a mathematical model inspired by the biological neural 
network. The multilayer perceptron can be used for various machine learning 
tasks such as classification and regression. It would be helpful to have it 
included in Mahout.

2. API

The design goal of the API is to make the MLP easy for users to work with and 
to keep the implementation details transparent to the user.

The following is example code showing how a user would use the MLP.
-
//  set the parameters
double learningRate = 0.5;
double momentum = 0.1;
int[] layerSizeArray = new int[] {2, 5, 1};
String costFuncName = "SquaredError";
String squashingFuncName = "Sigmoid";
//  the location to store the model; if there is already an existing model at
//  the specified location, the MLP will throw an exception
URI modelLocation = ...
MultilayerPerceptron mlp = new MultilayerPerceptron(layerSizeArray, 
    modelLocation);
mlp.setLearningRate(learningRate).setMomentum(momentum).setRegularization(...).setCostFunction(...).setSquashingFunction(...);

//  the user can also load an existing model with a given URI and update the
//  model with new training data; if there is no existing model at the specified
//  location, an exception will be thrown
/*
MultilayerPerceptron mlp = new MultilayerPerceptron(learningRate, 
    regularization, momentum, squashingFuncName, costFuncName, modelLocation);
*/

URI trainingDataLocation = …
//  the details of training are transparent to the user; it may run on a
//  single machine or in a distributed environment
mlp.train(trainingDataLocation);

//  the user can also train the model with one training instance at a time,
//  in stochastic gradient descent fashion
Vector trainingInstance = ...
mlp.train(trainingInstance);

//  prepare the input feature
Vector inputFeature = …
//  the semantic meaning of the output result is defined by the user
//  in the general case, the dimension of the output vector is 1 for regression
//  and two-class classification
//  the dimension of the output vector is n for n-class classification (n > 2)
Vector outputVector = mlp.output(inputFeature); 
-


3. Methodology

The output calculation can easily be implemented with a feed-forward approach. 
Also, single-machine training is straightforward. The following describes how 
to train the MLP in a distributed way with batch gradient descent. The 
workflow is illustrated in the figure below.


https://docs.google.com/drawings/d/1s8hiYKpdrP3epe1BzkrddIfShkxPrqSuQBH0NAawEM4/pub?w=960&h=720

For the distributed training, each training iteration is divided into two 
steps: the weight update calculation step and the weight update step. The 
distributed MLP can only be trained in a batch-update fashion.


3.1 The partial weight update calculation step:
This step trains the MLP in a distributed fashion. Each task gets a copy of the 
MLP model and calculates the weight update from its partition of the data.

Suppose the training error is E(w) = ½ \sum_{d \in D} cost(t_d, y_d), where D 
denotes the training set, d denotes a training instance, t_d denotes the class 
label and y_d denotes the output of the MLP. Also, suppose the sigmoid function 
is used as the squashing function, 
squared error is used as the cost function, 
t_i denotes the target value for the ith dimension of the output layer, 
o_i denotes the actual output for the ith dimension of the output layer, 
l denotes the learning rate,
and w_{ij} denotes the weight between the jth neuron in the previous layer and 
the ith neuron in the next layer. 

The weight of each edge is updated as 

\Delta w_{ij} = l * 1 / m * \delta_j * o_i, 

where \delta_j = - \sum_{m} o_j^{(m)} * (1 - o_j^{(m)}) * (t_j^{(m)} - 
o_j^{(m)}) for the output layer, and \delta_j = - \sum_{m} o_j^{(m)} * (1 - 
o_j^{(m)}) * \sum_k \delta_k * w_{jk} for a hidden layer. 

It is easy to see that \delta_j can be rewritten as 

\delta_j = - \sum_{i = 1}^k \sum_{m_i} o_j^{(m_i)} * (1 - o_j^{(m_i)}) * 
(t_j^{(m_i)} - o_j^{(m_i)}),

where the outer sum runs over the k data partitions and m_i ranges over the 
instances in partition i.

The above equation indicates that the computation of \delta_j can be divided 
into k parts.

So for the implementation, each mapper can calculate its part of \delta_j from 
its given partition of the data, and then store the result at a specified 
location.


3.2 The model update step:

After the k parts of \delta_j have been calculated, a separate program can 
merge them into one and update the weight matrices.

This program loads the results calculated in the weight update calculation 
step and updates the weight matrices. 


  was:
Design of multilayer perceptron


1. Motivation
A multilayer perceptron (MLP) is a kind of feed forward artificial neural 
network, which is a mathematical model inspired 

[jira] [Commented] (MAHOUT-1265) Add Multilayer Perceptron

2013-06-18 Thread Yexi Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13686880#comment-13686880
 ] 

Yexi Jiang commented on MAHOUT-1265:


Ted,

{quote}
I would suggest that a more fluid API would be helpful to people. For instance, 
each layer might be an object that could be composed with others to build a 
model, which is then trained.
{quote}

It seems that you are suggesting a more general neural network, not just the MLP.
An MLP is a kind of feed-forward neural network whose topology is fixed.
It usually consists of several layers, and every pair of neurons in adjacent 
layers is connected.
Therefore, specifying the size of each layer is enough to determine the 
topology of an MLP.

It would be good if we first defined a generic neural network and then built an 
MLP on top of it in the way you describe. An advantage is that the generic 
neural network could be reused to build other types of neural networks in the 
future, e.g. an autoencoder for dimensionality reduction, a recurrent neural 
network for sequence mining, or possibly deep nets.
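
To make that concrete, here is a purely hypothetical sketch of such a 
layer-composition style (every class and method name below is made up for 
illustration; it follows the fragment style of the example above and is not a 
committed design):

// hypothetical fluent, layer-based construction of the same 2-5-1 network
NeuralNetwork net = new NeuralNetwork()
    .addLayer(new DenseLayer(2, 5, new Sigmoid()))   // 2 inputs, 5 hidden units
    .addLayer(new DenseLayer(5, 1, new Sigmoid()))   // 1 output unit
    .costFunction(new SquaredError())
    .regularization(new L2(0.01))
    .learningRate(0.5)
    .momentum(0.1);

net.train(trainingDataLocation);
Vector outputVector = net.output(inputFeature);

An MLP would then just be a convenience wrapper that builds this fixed layer 
stack from a layerSizeArray.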


{quote}
Secondly, it seems like it would be good to have different kinds of loss 
functions and regularizations.
{quote}

Yes, the MLP would allow the user to specify different loss functions, 
squashing functions, and regularizations.


{quote}
Also, regarding things like momentum, do you have a sense of whether this 
really needs to be commonly adjusted? Or is there a way to set a good default?
{quote}

As far as I know, there is no empirical way to set a good default momentum 
weight; a good value depends on the concrete problem. As for the learning 
rate, a good approach is to use a decaying learning rate.
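
For example, one common decay schedule (illustrative only, not a committed API) 
shrinks the learning rate with the iteration count:

public class LearningRateDecaySketch {

  // Inverse-scaling decay: the rate shrinks smoothly as iterations progress.
  public static double decayedLearningRate(double initialRate, double decay, long iteration) {
    return initialRate / (1.0 + decay * iteration);
  }

  public static void main(String[] args) {
    // with initialRate = 0.5 and decay = 0.01, iteration 100 gives 0.25
    System.out.println(decayedLearningRate(0.5, 0.01, 100));
  }
}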




 Add Multilayer Perceptron 
 --

 Key: MAHOUT-1265
 URL: https://issues.apache.org/jira/browse/MAHOUT-1265
 Project: Mahout
  Issue Type: New Feature
Reporter: Yexi Jiang
  Labels: machine_learning, neural_network

 Design of multilayer perceptron
 1. Motivation
 A multilayer perceptron (MLP) is a kind of feed-forward artificial neural 
 network, which is a mathematical model inspired by the biological neural 
 network. The multilayer perceptron can be used for various machine learning 
 tasks such as classification and regression. It would be helpful to have it 
 included in Mahout.
 2. API
 The design goal of the API is to make the MLP easy for users to work with and 
 to keep the implementation details transparent to the user.
 The following is example code showing how a user would use the MLP.
 -
 //  set the parameters
 double learningRate = 0.5;
 double momentum = 0.1;
 int[] layerSizeArray = new int[] {2, 5, 1};
 String costFuncName = "SquaredError";
 String squashingFuncName = "Sigmoid";
 //  the location to store the model; if there is already an existing model at 
 //  the specified location, the MLP will throw an exception
 URI modelLocation = ...
 MultilayerPerceptron mlp = new MultilayerPerceptron(layerSizeArray, 
     modelLocation);
 mlp.setLearningRate(learningRate).setMomentum(momentum).setRegularization(...).setCostFunction(...).setSquashingFunction(...);
 //  the user can also load an existing model with a given URI and update the 
 //  model with new training data; if there is no existing model at the specified 
 //  location, an exception will be thrown
 /*
 MultilayerPerceptron mlp = new MultilayerPerceptron(learningRate, 
     regularization, momentum, squashingFuncName, costFuncName, modelLocation);
 */
 URI trainingDataLocation = …
 //  the details of training are transparent to the user; it may run on a 
 //  single machine or in a distributed environment
 mlp.train(trainingDataLocation);
 //  the user can also train the model with one training instance at a time, 
 //  in stochastic gradient descent fashion
 Vector trainingInstance = ...
 mlp.train(trainingInstance);
 //  prepare the input feature
 Vector inputFeature = …
 //  the semantic meaning of the output result is defined by the user
 //  in the general case, the dimension of the output vector is 1 for regression 
 //  and two-class classification
 //  the dimension of the output vector is n for n-class classification (n > 2)
 Vector outputVector = mlp.output(inputFeature); 
 -
 3. Methodology
 The output calculation can easily be implemented with a feed-forward approach. 
 Also, single-machine training is straightforward. The following describes how 
 to train the MLP in a distributed way with batch gradient descent. The 
 workflow is illustrated in the figure below.
 https://docs.google.com/drawings/d/1s8hiYKpdrP3epe1BzkrddIfShkxPrqSuQBH0NAawEM4/pub?w=960&h=720
 For the distributed training, each training iteration is divided into two 
 steps: the weight update calculation step and the weight update step. The 
 distributed MLP can only be trained in a batch-update fashion.
 3.1 

Build failed in Jenkins: Mahout-Examples-Cluster-Reuters-II #516

2013-06-18 Thread Apache Jenkins Server
See 
https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters-II/516/changes

Changes:

[robinanil] Randomized test for VectorBinaryAggregate

--
[...truncated 5407 lines...]
INFO: Task 'attempt_local_0015_r_00_0' done.
Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 100% reduce 100%
Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0015
Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log
INFO: Counters: 17
Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log
INFO:   File Output Format Counters 
Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log
INFO: Bytes Written=389
Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log
INFO:   FileSystemCounters
Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log
INFO: FILE_BYTES_READ=1274348775
Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log
INFO: FILE_BYTES_WRITTEN=1285878485
Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log
INFO:   File Input Format Counters 
Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log
INFO: Bytes Read=152
Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log
INFO:   Map-Reduce Framework
Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log
INFO: Map output materialized bytes=61
Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log
INFO: Map input records=0
Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log
INFO: Reduce shuffle bytes=0
Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log
INFO: Spilled Records=40
Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log
INFO: Map output bytes=120
Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log
INFO: Total committed heap usage (bytes)=3249930240
Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log
INFO: SPLIT_RAW_BYTES=119
Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log
INFO: Combine input records=20
Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log
INFO: Reduce input records=20
Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log
INFO: Reduce input groups=20
Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log
INFO: Combine output records=20
Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log
INFO: Reduce output records=20
Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log
INFO: Map output records=20
Jun 18, 2013 6:26:11 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: About to run iteration 16 of 20
Jun 18, 2013 6:26:11 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: About to run: Iteration 16 of 20, input path: 
/tmp/mahout-work-hudson/reuters-lda-model/model-15
Jun 18, 2013 6:26:13 PM org.apache.hadoop.mapreduce.lib.input.FileInputFormat 
listStatus
INFO: Total input paths to process : 1
Jun 18, 2013 6:26:13 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0016
Jun 18, 2013 6:26:13 PM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : null
Jun 18, 2013 6:26:13 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer init
INFO: io.sort.mb = 100
Jun 18, 2013 6:26:14 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer init
INFO: data buffer = 79691776/99614720
Jun 18, 2013 6:26:14 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer init
INFO: record buffer = 262144/327680
Jun 18, 2013 6:26:14 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Retrieving configuration
Jun 18, 2013 6:26:14 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Initializing read model
Jun 18, 2013 6:26:14 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Initializing write model
Jun 18, 2013 6:26:14 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Initializing model trainer
Jun 18, 2013 6:26:14 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Starting training threadpool with 4 threads
Jun 18, 2013 6:26:14 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Stopping model trainer
Jun 18, 2013 6:26:14 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Initiating stopping of training threadpool
Jun 18, 2013 6:26:14 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: threadpool took: 0.752647ms
Jun 18, 2013 6:26:14 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 0% reduce 0%
Jun 18, 2013 6:26:15 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: readModel.stop() took 1002.078932ms
Jun 18, 2013 6:26:16 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: writeModel.stop() took 1010.00808ms
Jun 18, 2013 6:26:16 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Writing model
Jun 18, 2013 6:26:16 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
INFO: Starting flush of map output
Jun 18, 2013 6:26:16 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
sortAndSpill
INFO: Finished spill 0
Jun 18, 2013 6:26:16 PM org.apache.hadoop.mapred.Task done
INFO: 

Does RowSimilarity job support down-sampling

2013-06-18 Thread Ted Dunning
I was reading the RowSimilarityJob and it doesn't appear that it does
down-sampling on the original data to minimize the performance impact of
perversely prolific users.

The issue is that if a single user has 100,000 items in their history, we
learn nothing more than if we picked 300 of those while the former would
result in processing 10 billion cooccurrences and the latter would result
in 100,000.  This factor of 10,000 is so large that it can make a big
difference in performance.

I had thought that the code had this down-sampling in place.

If not, I can add row based down-sampling quite easily.
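
For illustration, the kind of per-row down-sampling being proposed can be a 
simple reservoir sample over each user's items; a minimal sketch in plain Java 
(not existing Mahout code):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class DownSamplingSketch {

  // Keep at most sampleSize items per user; users with fewer items are unchanged.
  // Reservoir sampling gives every item an equal chance of being kept.
  public static List<Long> sampleUserItems(List<Long> itemIds, int sampleSize, Random random) {
    if (itemIds.size() <= sampleSize) {
      return itemIds;
    }
    List<Long> reservoir = new ArrayList<Long>(itemIds.subList(0, sampleSize));
    for (int i = sampleSize; i < itemIds.size(); i++) {
      int j = random.nextInt(i + 1);
      if (j < sampleSize) {
        reservoir.set(j, itemIds.get(i));
      }
    }
    return reservoir;
  }
}

Capping a 100,000-item history at 300 items cuts that user's cooccurrence work 
from roughly 10 billion pairs to about 100,000, which is the factor-of-10,000 
savings described above.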


Re: Does RowSimilarity job support down-sampling

2013-06-18 Thread Dan Filimon
I think you can get what you need through the --maxPrefsForUser flag.
Any user with more than that many will only keep a random sample of that size.



On Jun 18, 2013, at 23:27, Ted Dunning ted.dunn...@gmail.com wrote:

 I was reading the RowSimilarityJob and it doesn't appear that it does
 down-sampling on the original data to minimize the performance impact of
 perversely prolific users.
 
 The issue is that if a single user has 100,000 items in their history, we
 learn nothing more than if we picked 300 of those while the former would
 result in processing 10 billion cooccurrences and the latter would result
 in 100,000.  This factor of 10,000 is so large that it can make a big
 difference in performance.
 
 I had thought that the code had this down-sampling in place.
 
 If not, I can add row based down-sampling quite easily.


Re: Does RowSimilarity job support down-sampling

2013-06-18 Thread Ted Dunning
My recollection as well.

I will read the code again.  Didn't see where that happens.


On Tue, Jun 18, 2013 at 10:34 PM, Sean Owen sro...@gmail.com wrote:

 This is the maxPrefsPerUser option IIRC.

 On Tue, Jun 18, 2013 at 9:27 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:
  I was reading the RowSimilarityJob and it doesn't appear that it does
  down-sampling on the original data to minimize the performance impact of
  perversely prolific users.
 
  The issue is that if a single user has 100,000 items in their history, we
  learn nothing more than if we picked 300 of those while the former would
  result in processing 10 billion cooccurrences and the latter would result
  in 100,000.  This factor of 10,000 is so large that it can make a big
  difference in performance.
 
  I had thought that the code had this down-sampling in place.
 
  If not, I can add row based down-sampling quite easily.



Re: Does RowSimilarity job support down-sampling

2013-06-18 Thread Sean Owen
No, it's in ItemSimilarityJob -- I'm looking at it now. It ends up
setting ToItemVectorsMapper.SAMPLE_SIZE, if that helps.

On Tue, Jun 18, 2013 at 9:43 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 Ahh... only effective in RecommenderJob.


Re: Does RowSimilarity job support down-sampling

2013-06-18 Thread Ted Dunning
But RecommenderJob seems to call RowSimilarityJob first.  That is where
sampling needs to be done.

  //calculate the co-occurrence matrix
  ToolRunner.run(getConf(), new RowSimilarityJob(), new String[]{
      "--input", new Path(prepPath, PreparePreferenceMatrixJob.RATING_MATRIX).toString(),
      "--output", similarityMatrixPath.toString(),
      "--numberOfColumns", String.valueOf(numberOfUsers),
      "--similarityClassname", similarityClassname,
      "--maxSimilaritiesPerRow", String.valueOf(maxSimilaritiesPerItem),
      "--excludeSelfSimilarity", String.valueOf(Boolean.TRUE),
      "--threshold", String.valueOf(threshold),
      "--tempDir", getTempPath().toString(),
  });

  // write out the similarity matrix if the user specified that behavior
  if (hasOption("outputPathForSimilarityMatrix")) {
    Path outputPathForSimilarityMatrix =
        new Path(getOption("outputPathForSimilarityMatrix"));

    Job outputSimilarityMatrix = prepareJob(similarityMatrixPath,
        outputPathForSimilarityMatrix,
        SequenceFileInputFormat.class,
        ItemSimilarityJob.MostSimilarItemPairsMapper.class,
        EntityEntityWritable.class, DoubleWritable.class,
        ItemSimilarityJob.MostSimilarItemPairsReducer.class,
        EntityEntityWritable.class, DoubleWritable.class,
        TextOutputFormat.class);

    Configuration mostSimilarItemsConf =
        outputSimilarityMatrix.getConfiguration();
    mostSimilarItemsConf.set(ItemSimilarityJob.ITEM_ID_INDEX_PATH_STR,
        new Path(prepPath, PreparePreferenceMatrixJob.ITEMID_INDEX).toString());
    mostSimilarItemsConf.setInt(ItemSimilarityJob.MAX_SIMILARITIES_PER_ITEM,
        maxSimilaritiesPerItem);
    outputSimilarityMatrix.waitForCompletion(true);
  }
}




On Tue, Jun 18, 2013 at 10:47 PM, Sean Owen sro...@gmail.com wrote:

 No, it's in ItemSimilarityJob -- I'm looking at it now. It ends up
 setting ToItemVectorsMapper.SAMPLE_SIZE, if that helps.

 On Tue, Jun 18, 2013 at 9:43 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:
  Ahh... only effective in RecommenderJob.



Re: Does RowSimilarity job support down-sampling

2013-06-18 Thread Sebastian Schelter
Hi,

RowSimilarityJob by itself does not do down-sampling.

The down-sampling is done by the ToItemVectorsMapper in the
PreparePreferenceMatrixJob, which is responsible for preparing the inputs
(the matrix of interactions between users and items) for
ItemSimilarityJob and RecommenderJob. As Sean noted, the option
maxPrefsPerUser controls the sampling. By default, we use 1,000
samples per user.

We could also move the sampling directly to RowSimilarityJob if people
consider this more useful.

Best,
Sebastian


On 18.06.2013 22:50, Ted Dunning wrote:
 But RecommenderJob seems to call RowSimilarityJob first.  That is where
 sampling needs to be done.
 
   //calculate the co-occurrence matrix
   ToolRunner.run(getConf(), new RowSimilarityJob(), new String[]{
 --input, new Path(prepPath,
 PreparePreferenceMatrixJob.RATING_MATRIX).toString(),
 --output, similarityMatrixPath.toString(),
 --numberOfColumns, String.valueOf(numberOfUsers),
 --similarityClassname, similarityClassname,
 --maxSimilaritiesPerRow, String.valueOf(maxSimilaritiesPerItem),
 --excludeSelfSimilarity, String.valueOf(Boolean.TRUE),
  --threshold, String.valueOf(threshold),
 --tempDir, getTempPath().toString(),
   });
 
   // write out the similarity matrix if the user specified that behavior
   if (hasOption(outputPathForSimilarityMatrix)) {
 Path outputPathForSimilarityMatrix = new
 Path(getOption(outputPathForSimilarityMatrix));
 
 Job outputSimilarityMatrix = prepareJob(similarityMatrixPath,
 outputPathForSimilarityMatrix,
 SequenceFileInputFormat.class,
 ItemSimilarityJob.MostSimilarItemPairsMapper.class,
 EntityEntityWritable.class, DoubleWritable.class,
 ItemSimilarityJob.MostSimilarItemPairsReducer.class,
 EntityEntityWritable.class, DoubleWritable.class,
 TextOutputFormat.class);
 
 Configuration mostSimilarItemsConf =
 outputSimilarityMatrix.getConfiguration();
 mostSimilarItemsConf.set(ItemSimilarityJob.ITEM_ID_INDEX_PATH_STR,
 new Path(prepPath,
 PreparePreferenceMatrixJob.ITEMID_INDEX).toString());
 
 mostSimilarItemsConf.setInt(ItemSimilarityJob.MAX_SIMILARITIES_PER_ITEM,
 maxSimilaritiesPerItem);
 outputSimilarityMatrix.waitForCompletion(true);
   }
 }
 
 
 
 
 On Tue, Jun 18, 2013 at 10:47 PM, Sean Owen sro...@gmail.com wrote:
 
 No, it's in ItemSimilarityJob -- I'm looking at it now. It ends up
 setting ToItemVectorsMapper.SAMPLE_SIZE, if that helps.

 On Tue, Jun 18, 2013 at 9:43 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:
 Ahh... only effective in RecommenderJob.

 



[jira] [Updated] (MAHOUT-1266) Two minor problems in DistributedRowMatrix using MatrixMultiplication

2013-06-18 Thread Martin Illecker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Illecker updated MAHOUT-1266:


Description: 
Hello,

I think I have found two minor problems in *DistributedRowMatrix*.

In [1] the condition is wrong, because (l x m) * (m x n) = (l x n).
The condition should be like in [2]. 

And in *times*[3] the {{this.transpose()}} seems to be missing? (See [4])

Do you have any benchmark results for Mahout MatrixMultiplication?

Thanks!

Martin

[1] 
[https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java#L191-193]
[2] 
[https://github.com/millecker/applications/blob/master/hadoop/rootbeer/matrixmultiplication/src/at/illecker/hadoop/rootbeer/examples/matrixmultiplication/DistributedRowMatrix.java#L222-226]
[3] 
[https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java#L190-206]
[4] 
[https://github.com/millecker/applications/blob/master/hadoop/rootbeer/matrixmultiplication/src/at/illecker/hadoop/rootbeer/examples/matrixmultiplication/DistributedRowMatrix.java#L231-232]



  was:
Hello,

I think I have found two minor problems in *DistributedRowMatrix*.

In [1] the condition is wrong, because (l x m) * (m x n) = (l x n).
The condition should be like in [2]. 

And in *times*[3] the {{this.transpose()}} seems to be missing? (See [4])

Do you have any benchmark results for Mahout MatrixMultiplication?

Thanks!

Martin

[1] 
[https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java#L191-193]
[2] 
[https://github.com/millecker/applications/blob/master/hadoop/rootbeer/matrixmultiplication/src/at/illecker/hadoop/rootbeer/examples/matrixmultiplication/DistributedRowMatrix.java#L222-226]
[3] 
[https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java#L190-206]
[4 
[https://github.com/millecker/applications/blob/master/hadoop/rootbeer/matrixmultiplication/src/at/illecker/hadoop/rootbeer/examples/matrixmultiplication/DistributedRowMatrix.java#L231-232]




 Two minor problems in DistributedRowMatrix using MatrixMultiplication
 -

 Key: MAHOUT-1266
 URL: https://issues.apache.org/jira/browse/MAHOUT-1266
 Project: Mahout
  Issue Type: Bug
  Components: Math
Affects Versions: 0.7
Reporter: Martin Illecker
Priority: Trivial
  Labels: newbie
   Original Estimate: 10m
  Remaining Estimate: 10m

 Hello,
 I think I have found two minor problems in *DistributedRowMatrix*.
 In [1] the condition is wrong, because (l x m) * (m x n) = (l x n).
 The condition should be like in [2]. 
 And in *times*[3] the {{this.transpose()}} seems to be missing? (See [4])
 Do you have any benchmark results for Mahout MatrixMultiplication?
 Thanks!
 Martin
 [1] 
 [https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java#L191-193]
 [2] 
 [https://github.com/millecker/applications/blob/master/hadoop/rootbeer/matrixmultiplication/src/at/illecker/hadoop/rootbeer/examples/matrixmultiplication/DistributedRowMatrix.java#L222-226]
 [3] 
 [https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java#L190-206]
 [4] 
 [https://github.com/millecker/applications/blob/master/hadoop/rootbeer/matrixmultiplication/src/at/illecker/hadoop/rootbeer/examples/matrixmultiplication/DistributedRowMatrix.java#L231-232]

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (MAHOUT-1266) Two minor problems in DistributedRowMatrix using MatrixMultiplication

2013-06-18 Thread Martin Illecker (JIRA)
Martin Illecker created MAHOUT-1266:
---

 Summary: Two minor problems in DistributedRowMatrix using 
MatrixMultiplication
 Key: MAHOUT-1266
 URL: https://issues.apache.org/jira/browse/MAHOUT-1266
 Project: Mahout
  Issue Type: Bug
  Components: Math
Affects Versions: 0.7
Reporter: Martin Illecker
Priority: Trivial


Hello,

I think I have found two minor problems in *DistributedRowMatrix*.

In [1] the condition is wrong, because (l x m) * (m x n) = (l x n).
The condition should be like in [2]. 

And in *times*[3] the {{this.transpose()}} seems to be missing? (See [4])

Do you have any benchmark results for Mahout MatrixMultiplication?

Thanks!

Martin

[1|https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java#L191-193]
[2|https://github.com/millecker/applications/blob/master/hadoop/rootbeer/matrixmultiplication/src/at/illecker/hadoop/rootbeer/examples/matrixmultiplication/DistributedRowMatrix.java#L222-226]
[3|https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java#L190-206]
[4|https://github.com/millecker/applications/blob/master/hadoop/rootbeer/matrixmultiplication/src/at/illecker/hadoop/rootbeer/examples/matrixmultiplication/DistributedRowMatrix.java#L231-232]



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAHOUT-1266) Two minor problems in DistributedRowMatrix using MatrixMultiplication

2013-06-18 Thread Martin Illecker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Illecker updated MAHOUT-1266:


Description: 
Hello,

I think I have found two minor problems in *DistributedRowMatrix*.

In [1] the condition is wrong, because (l x m) * (m x n) = (l x n).
The condition should be like in [2]. 

And in *times*[3] the {{this.transpose()}} seems to be missing? (See [4])

Do you have any benchmark results for Mahout MatrixMultiplication?

Thanks!

Martin

[1] 
[https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java#L191-193]
[2] 
[https://github.com/millecker/applications/blob/master/hadoop/rootbeer/matrixmultiplication/src/at/illecker/hadoop/rootbeer/examples/matrixmultiplication/DistributedRowMatrix.java#L221-225]
[3] 
[https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java#L190-206]
[4] 
[https://github.com/millecker/applications/blob/master/hadoop/rootbeer/matrixmultiplication/src/at/illecker/hadoop/rootbeer/examples/matrixmultiplication/DistributedRowMatrix.java#L230-231]



  was:
Hello,

I think I have found two minor problems in *DistributedRowMatrix*.

In [1] the condition is wrong, because (l x m) * (m x n) = (l x n).
The condition should be like in [2]. 

And in *times*[3] the {{this.transpose()}} seems to be missing? (See [4])

Do you have any benchmark results for Mahout MatrixMultiplication?

Thanks!

Martin

[1] 
[https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java#L191-193]
[2] 
[https://github.com/millecker/applications/blob/master/hadoop/rootbeer/matrixmultiplication/src/at/illecker/hadoop/rootbeer/examples/matrixmultiplication/DistributedRowMatrix.java#L222-226]
[3] 
[https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java#L190-206]
[4] 
[https://github.com/millecker/applications/blob/master/hadoop/rootbeer/matrixmultiplication/src/at/illecker/hadoop/rootbeer/examples/matrixmultiplication/DistributedRowMatrix.java#L231-232]




 Two minor problems in DistributedRowMatrix using MatrixMultiplication
 -

 Key: MAHOUT-1266
 URL: https://issues.apache.org/jira/browse/MAHOUT-1266
 Project: Mahout
  Issue Type: Bug
  Components: Math
Affects Versions: 0.7
Reporter: Martin Illecker
Priority: Trivial
  Labels: newbie
   Original Estimate: 10m
  Remaining Estimate: 10m

 Hello,
 I think I have found two minor problems in *DistributedRowMatrix*.
 In [1] the condition is wrong, because (l x m) * (m x n) = (l x n).
 The condition should be like in [2]. 
 And in *times*[3] the {{this.transpose()}} seems to be missing? (See [4])
 Do you have any benchmark results for Mahout MatrixMultiplication?
 Thanks!
 Martin
 [1] 
 [https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java#L191-193]
 [2] 
 [https://github.com/millecker/applications/blob/master/hadoop/rootbeer/matrixmultiplication/src/at/illecker/hadoop/rootbeer/examples/matrixmultiplication/DistributedRowMatrix.java#L221-225]
 [3] 
 [https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java#L190-206]
 [4] 
 [https://github.com/millecker/applications/blob/master/hadoop/rootbeer/matrixmultiplication/src/at/illecker/hadoop/rootbeer/examples/matrixmultiplication/DistributedRowMatrix.java#L230-231]

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Build failed in Jenkins: Mahout-Quality #2094

2013-06-18 Thread Apache Jenkins Server
See https://builds.apache.org/job/Mahout-Quality/2094/

--
[...truncated 4959 lines...]
Running org.apache.mahout.clustering.spectral.common.TestUnitVectorizerJob
parallel='classes', perCoreThreadCount=false, threadCount=1, 
useUnlimitedThreads=false
parallel='classes', perCoreThreadCount=false, threadCount=1, 
useUnlimitedThreads=false
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.643 sec
Running org.apache.mahout.clustering.streaming.cluster.BallKMeansTest
Running org.apache.mahout.clustering.streaming.cluster.StreamingKMeansTest
parallel='classes', perCoreThreadCount=false, threadCount=1, 
useUnlimitedThreads=false
Running org.apache.mahout.clustering.TestClusterInterface
Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.283 sec
parallel='classes', perCoreThreadCount=false, threadCount=1, 
useUnlimitedThreads=false
Running org.apache.mahout.clustering.fuzzykmeans.TestFuzzyKmeansClustering
Tests run: 15, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 8.532 sec
parallel='classes', perCoreThreadCount=false, threadCount=1, 
useUnlimitedThreads=false
Running org.apache.mahout.clustering.kmeans.TestRandomSeedGenerator
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.514 sec
parallel='classes', perCoreThreadCount=false, threadCount=1, 
useUnlimitedThreads=false
Running org.apache.mahout.clustering.kmeans.TestKmeansClustering
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 15.146 sec
parallel='classes', perCoreThreadCount=false, threadCount=1, 
useUnlimitedThreads=false
Running 
org.apache.mahout.clustering.topdown.postprocessor.ClusterCountReaderTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.009 sec
parallel='classes', perCoreThreadCount=false, threadCount=1, 
useUnlimitedThreads=false
Running 
org.apache.mahout.clustering.topdown.postprocessor.ClusterOutputPostProcessorTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.039 sec
parallel='classes', perCoreThreadCount=false, threadCount=1, 
useUnlimitedThreads=false
Running org.apache.mahout.clustering.topdown.PathDirectoryTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.024 sec
parallel='classes', perCoreThreadCount=false, threadCount=1, 
useUnlimitedThreads=false
Running org.apache.mahout.clustering.classify.ClusterClassificationDriverTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.357 sec
parallel='classes', perCoreThreadCount=false, threadCount=1, 
useUnlimitedThreads=false
Running org.apache.mahout.clustering.dirichlet.TestDistributions
Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.497 sec
parallel='classes', perCoreThreadCount=false, threadCount=1, 
useUnlimitedThreads=false
Running org.apache.mahout.clustering.dirichlet.TestMapReduce
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 21.297 sec
parallel='classes', perCoreThreadCount=false, threadCount=1, 
useUnlimitedThreads=false
Running org.apache.mahout.clustering.dirichlet.TestDirichletClustering
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 74.904 sec
parallel='classes', perCoreThreadCount=false, threadCount=1, 
useUnlimitedThreads=false
Running org.apache.mahout.clustering.minhash.TestMinHashClustering
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.762 sec
parallel='classes', perCoreThreadCount=false, threadCount=1, 
useUnlimitedThreads=false
Running org.apache.mahout.clustering.canopy.TestCanopyCreation
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 35.752 sec
parallel='classes', perCoreThreadCount=false, threadCount=1, 
useUnlimitedThreads=false
Running org.apache.mahout.clustering.TestGaussianAccumulators
Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 11.876 sec
parallel='classes', perCoreThreadCount=false, threadCount=1, 
useUnlimitedThreads=false
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.234 sec
Running org.apache.mahout.classifier.discriminative.WinnowTrainerTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.122 sec
parallel='classes', perCoreThreadCount=false, threadCount=1, 
useUnlimitedThreads=false
Running org.apache.mahout.classifier.discriminative.PerceptronTrainerTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.082 sec
parallel='classes', perCoreThreadCount=false, threadCount=1, 
useUnlimitedThreads=false
Running org.apache.mahout.classifier.discriminative.LinearModelTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 20.216 sec
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.133 sec
parallel='classes', perCoreThreadCount=false, threadCount=1, 
useUnlimitedThreads=false
Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 87.869 sec  
FAILURE!
testRemoval[0](org.apache.mahout.math.neighborhood.SearchSanityTest)  Time 
elapsed: 5.17 sec   FAILURE!
java.lang.AssertionError: 

Re: Does RowSimilarity job support down-sampling

2013-06-18 Thread Ted Dunning
On Tue, Jun 18, 2013 at 11:01 PM, Sebastian Schelter s...@apache.org wrote:

 We could also move the sampling directly to RowSimilarityJob if people
 consider this more useful.


It will have a large effect on the time for the RowSimilarityJob for some
data.

Does anybody have an idea about how much of the total time is in
RowSimilarityJob?


Mahout vectors/matrices/solvers on spark

2013-06-18 Thread Dmitriy Lyubimov
Hello,

So I finally got around to actually doing it.

I want to get Mahout sparse vectors and matrices (DRMs) and rebuild some
solvers using spark and Bagel /scala.

I also want to use in-core solvers that run directly on Mahout.

Question #1: which mahout artifacts are better to import if I don't want
to pick the hadoop stuff dependencies? Is there even such a separation of
code? I know mahout-math seems to try to avoid being hadoop specific but not
sure if it is followed strictly.

Question #2: which in-core solvers are available for Mahout matrices? I
know there's SSVD, probably Cholesky, is there something else? In
particular, I need to be solving linear systems, I guess Cholesky should be
equipped enough to do just that?
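
(For what it's worth, a minimal in-core sketch of solving a linear system with 
mahout-math, assuming QRDecomposition exposes solve(Matrix) as in the 0.7 tree; 
a Cholesky-based variant would be analogous for symmetric positive definite 
systems:)

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.QRDecomposition;
import org.apache.mahout.math.Vector;

public class InCoreSolveSketch {

  // Solve A x = b in core; QR gives the least-squares solution of A X = B,
  // which is the exact solution when A is square and non-singular.
  public static Vector solve(Matrix a, Matrix b) {
    Matrix x = new QRDecomposition(a).solve(b);
    return x.viewColumn(0);
  }

  public static void main(String[] args) {
    Matrix a = new DenseMatrix(new double[][] {{4, 1}, {1, 3}});
    Matrix b = new DenseMatrix(new double[][] {{1}, {2}});
    System.out.println(solve(a, b));   // expected roughly (0.0909, 0.6364)
  }
}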

Question #3: why did we try to import Colt solvers rather than actually
depend on Colt in the first place? Why did we not accept Colt's sparse
matrices and created native ones instead?

Colt seems to have a notion of sparse in-core matrices too and seems like a
well-rounded solution. However, it doesn't seem to be actively
supported, whereas I know Mahout experienced continued enhancements to the
in-core matrix support.

Thanks in advance
-Dmitriy


Re: Mahout vectors/matrices/solvers on spark

2013-06-18 Thread Jake Mannix
On Tue, Jun 18, 2013 at 6:14 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 Hello,

 So I finally got around to actually doing it.

 I want to get Mahout sparse vectors and matrices (DRMs) and rebuild some
 solvers using spark and Bagel /scala.

 I also want to use in-core solvers that run directly on Mahout.

 Question #1: which mahout artifacts are better to import if I don't want
 to pick the hadoop stuff dependencies? Is there even such a separation of
 code? I know mahout-math seems to try to avoid being hadoop specific but not
 sure if it is followed strictly.


mahout-math should not depend on hadoop apis at all, if you build it and
hadoop gets pulled in via maven, then file a ticket, that's a bug.


 Question #2: which in-core solvers are available for Mahout matrices? I
 know there's SSVD, probably Cholesky, is there something else? In
 particular, I need to be solving linear systems, I guess Cholesky should be
 equipped enough to do just that?

 Question #3: why did we try to import Colt solvers rather than actually
 depend on Colt in the first place? Why did we not accept Colt's sparse
 matrices and created native ones instead?

 Colt seems to have a notion of sparse in-core matrices too and seems like a
 well-rounded solution. However, it doesn't seem to be actively
 supported, whereas I know Mahout experienced continued enhancements to the
 in-core matrix support.


Colt was totally abandoned, and I talked to the original author and he blessed 
its adoption.  When we pulled it in, we found it was woefully undertested, and 
tried our best to hook it in with proper tests and use APIs that fit with the 
use cases we had.  Plus, we already had the start of some linear APIs (i.e. 
the Vector interface) and dropping the API completely seemed not terribly 
worth it at the time.



 Thanks in advance
 -Dmitriy




-- 

  -jake


[jira] [Commented] (MAHOUT-1266) Two minor problems in DistributedRowMatrix using MatrixMultiplication

2013-06-18 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13687584#comment-13687584
 ] 

Jake Mannix commented on MAHOUT-1266:
-

As mentioned in the javadocs for the method, it does *not* implement A * B; it 
implements A.transpose() * B, because this operation can be done in one 
map-reduce pass (with both SequenceFiles backing A and B as inputs), while 
computing A * B takes two map-reduce passes.

Why try to super-speed up the process with a GPU, as in the code you linked to, 
if you're going to have to make two full passes (your call to .transpose()) 
over your distributed data set? That will inevitably be far slower than 
anything (unoptimized) you can compute in one MR pass, simply because of all 
the disk IO.
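
A tiny in-core sketch of why A^T * B fits in one pass (plain Java, illustration 
only, not the DistributedRowMatrix code): when rows of A and B that share the 
same row index are co-located, each row pair contributes one outer product, and 
the sum of those outer products is exactly A^T * B.

public class TransposeTimesSketch {

  // Computes A^T * B by summing, over each row index i, the outer product of
  // row i of A with row i of B; this is exactly what a single map-reduce pass
  // can do when both matrices are stored as SequenceFiles keyed by row index.
  public static double[][] transposeTimes(double[][] a, double[][] b) {
    int cols = a[0].length;                       // columns of A = rows of A^T
    int outCols = b[0].length;
    double[][] result = new double[cols][outCols];
    for (int i = 0; i < a.length; i++) {          // one "map" call per shared row index
      for (int j = 0; j < cols; j++) {
        for (int k = 0; k < outCols; k++) {
          result[j][k] += a[i][j] * b[i][k];      // partial outer product of row i
        }
      }
    }
    return result;
  }
}

Computing A * B instead would pair rows of A with columns of B, which is why the 
extra transpose (and a second pass over the data) would be needed first.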

 Two minor problems in DistributedRowMatrix using MatrixMultiplication
 -

 Key: MAHOUT-1266
 URL: https://issues.apache.org/jira/browse/MAHOUT-1266
 Project: Mahout
  Issue Type: Bug
  Components: Math
Affects Versions: 0.7
Reporter: Martin Illecker
Priority: Trivial
  Labels: newbie
   Original Estimate: 10m
  Remaining Estimate: 10m

 Hello,
 I think I have found two minor problems in *DistributedRowMatrix*.
 In [1] the condition is wrong, because (l x m) * (m x n) = (l x n).
 The condition should be like in [2]. 
 And in *times*[3] the {{this.transpose()}} seems to be missing? (See [4])
 Do you have any benchmark results for Mahout MatrixMultiplication?
 Thanks!
 Martin
 [1] 
 [https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java#L191-193]
 [2] 
 [https://github.com/millecker/applications/blob/master/hadoop/rootbeer/matrixmultiplication/src/at/illecker/hadoop/rootbeer/examples/matrixmultiplication/DistributedRowMatrix.java#L221-225]
 [3] 
 [https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java#L190-206]
 [4] 
 [https://github.com/millecker/applications/blob/master/hadoop/rootbeer/matrixmultiplication/src/at/illecker/hadoop/rootbeer/examples/matrixmultiplication/DistributedRowMatrix.java#L230-231]

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: Mahout vectors/matrices/solvers on spark

2013-06-18 Thread Dmitriy Lyubimov
Thank you, Jake. I suspected as much about Colt.
On Jun 18, 2013 8:30 PM, Jake Mannix jake.man...@gmail.com wrote:

 On Tue, Jun 18, 2013 at 6:14 PM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:

  Hello,
 
  So I finally got around to actually doing it.
 
  I want to get Mahout sparse vectors and matrices (DRMs) and rebuild some
  solvers using spark and Bagel /scala.
 
  I also want to use in-core solvers that run directly on Mahout.
 
  Question #1: which mahout artifacts are better to import if I don't want
  to pick the hadoop stuff dependencies? Is there even such a separation of
  code? I know mahout-math seems to try to avoid being hadoop specific but not
  sure if it is followed strictly.
 

 mahout-math should not depend on hadoop apis at all, if you build it and
 hadoop gets pulled in via maven, then file a ticket, that's a bug.


  Question #2: which in-core solvers are available for Mahout matrices? I
  know there's SSVD, probably Cholesky, is there something else? In
  particular, I need to be solving linear systems, I guess Cholesky should be
  equipped enough to do just that?
 
  Question #3: why did we try to import Colt solvers rather than actually
  depend on Colt in the first place? Why did we not accept Colt's sparse
  matrices and created native ones instead?
 
  Colt seems to have a notion of sparse in-core matrices too and seems like a
  well-rounded solution. However, it doesn't seem to be actively
  supported, whereas I know Mahout experienced continued enhancements to the
  in-core matrix support.
 

  Colt was totally abandoned, and I talked to the original author and he blessed
  its adoption.  When we pulled it in, we found it was woefully undertested, and
  tried our best to hook it in with proper tests and use APIs that fit with the
  use cases we had.  Plus, we already had the start of some linear APIs (i.e.
  the Vector interface) and dropping the API completely seemed not terribly
  worth it at the time.


 
  Thanks in advance
  -Dmitriy
 



 --

   -jake



Re: Does RowSimilarity job support down-sampling

2013-06-18 Thread Sebastian Schelter
On 19.06.2013 01:29, Ted Dunning wrote:
 On Tue, Jun 18, 2013 at 11:01 PM, Sebastian Schelter s...@apache.org wrote:
 
 We could also move the sampling directly to RowSimilarityJob if people
 consider this more useful.
 
 It will have a large effect on the time for the RowSimilarityJob for some
 data.

I put the sampling into PreparePreferenceMatrixJob because I considered
it to be use-case specific to recommendations.

 Does anybody have an idea about how much of the total time is in
 RowSimilarityJob?

What do you mean by total time? Compared to the rest of the jobs in
ItemSimilarityJob and RecommenderJob?

-sebastian