[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method
[ https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13686449#comment-13686449 ] Sebastian Schelter commented on MAHOUT-1214: we should open a new issue for the bug, don't mix it in here

Improve the accuracy of the Spectral KMeans Method -- Key: MAHOUT-1214 URL: https://issues.apache.org/jira/browse/MAHOUT-1214 Project: Mahout Issue Type: Improvement Components: Clustering Affects Versions: 0.7 Environment: Mahout 0.7 Reporter: Yiqun Hu Assignee: Robin Anil Labels: clustering, improvement Fix For: 0.8 Attachments: MAHOUT-1214.patch, MAHOUT-1214.patch, matrix_1, matrix_2

The current implementation of the spectral KMeans algorithm (Andrew Ng et al., NIPS 2002) in version 0.7 has two serious issues. These incorrect implementations make it fail even for a very obvious, trivial dataset. We have implemented a solution to resolve these two issues and hope to contribute it back to the community.

# Issue 1: The EigenVerificationJob in version 0.7 does not check the orthogonality of the eigenvectors, which is necessary to obtain correct clustering results when K > 1. We have an idea and an implementation that selects eigenvectors based on cosAngle/orthogonality.
# Issue 2: The random seed initialization of the KMeans algorithm is not optimal, and a bad initialization can sometimes produce a wrong clustering result. In this case, the selected K eigenvectors actually provide a better way to initialize the cluster centroids, because each selected eigenvector is a relaxed indicator of the membership of one cluster. For every selected eigenvector, we use the data point whose eigen component achieves the maximum absolute value.

We have verified the improvement on a synthetic dataset: the improved version obtains the optimal clustering result, while the current 0.7 version obtains a wrong result.

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
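For illustration, the two proposed fixes can be sketched against Mahout's in-core math API roughly as follows. This is a hypothetical sketch, not the code in MAHOUT-1214.patch; the class and method names are made up.

// Hypothetical helper sketching the two proposed fixes for MAHOUT-1214.
import java.util.ArrayList;
import java.util.List;
import org.apache.mahout.math.Vector;

public final class SpectralSeeding {
  private SpectralSeeding() {}

  /** Keep only eigenvectors that are close to orthogonal to the ones already kept. */
  public static List<Vector> filterByOrthogonality(List<Vector> eigenVectors, double maxCosAngle) {
    List<Vector> kept = new ArrayList<Vector>();
    for (Vector candidate : eigenVectors) {
      boolean nearlyParallel = false;
      for (Vector v : kept) {
        double cosAngle = Math.abs(candidate.dot(v)) / (candidate.norm(2) * v.norm(2));
        if (cosAngle > maxCosAngle) {   // too far from orthogonal, skip this candidate
          nearlyParallel = true;
          break;
        }
      }
      if (!nearlyParallel) {
        kept.add(candidate);
      }
    }
    return kept;
  }

  /** For one selected eigenvector, pick the data point with the largest absolute component as the initial centroid index. */
  public static int seedIndexFor(Vector eigenVector) {
    int best = 0;
    double bestAbs = 0.0;
    for (int i = 0; i < eigenVector.size(); i++) {
      double abs = Math.abs(eigenVector.getQuick(i));
      if (abs > bestAbs) {
        bestAbs = abs;
        best = i;
      }
    }
    return best;
  }
}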
[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method
[ https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13686529#comment-13686529 ] zhang da commented on MAHOUT-1214: -- I believe the dot product is a false alarm and the problem is in our patch. Let me fix it and update the patch tonight.

Improve the accuracy of the Spectral KMeans Method -- Key: MAHOUT-1214 URL: https://issues.apache.org/jira/browse/MAHOUT-1214 Project: Mahout Issue Type: Improvement Components: Clustering Affects Versions: 0.7 Reporter: Yiqun Hu Assignee: Robin Anil Fix For: 0.8
[jira] [Created] (MAHOUT-1265) Add Multilayer Perceptron
Yexi Jiang created MAHOUT-1265: -- Summary: Add Multilayer Perceptron Key: MAHOUT-1265 URL: https://issues.apache.org/jira/browse/MAHOUT-1265 Project: Mahout Issue Type: New Feature Reporter: Yexi Jiang

Design of multilayer perceptron

1. Motivation
A multilayer perceptron (MLP) is a feed-forward artificial neural network, a mathematical model inspired by biological neural networks. The multilayer perceptron can be used for various machine learning tasks such as classification and regression. It would be a useful addition to Mahout.

2. API
The design goal of the API is to make the MLP easy to use and to keep the implementation details transparent to the user. The following example code shows how a user would use the MLP.

// set the parameters
double learningRate = 0.5;
double momentum = 0.1;
double regularization = 0.01;
int[] layerSizeArray = new int[] {2, 5, 1};
String costFuncName = "SquaredError";
String squashingFuncName = "Sigmoid";
// the location to store the model; if there is already an existing model at the specified location, the MLP will throw an exception
URI modelLocation = ...
MultilayerPerceptron mlp = new MultilayerPerceptron(learningRate, regularization, momentum, squashingFuncName, costFuncName, layerSizeArray, modelLocation);
// the user can also load an existing model with a given URI and update it with new training data; if there is no existing model at the specified location, an exception will be thrown
/* MultilayerPerceptron mlp = new MultilayerPerceptron(learningRate, regularization, momentum, squashingFuncName, costFuncName, modelLocation); */
URI trainingDataLocation = ...
// the details of training are transparent to the user; it may run on a single machine or in a distributed environment
mlp.train(trainingDataLocation);
// the user can also train the model with one training instance at a time, in a stochastic gradient descent fashion
Vector trainingInstance = ...
mlp.train(trainingInstance);
// prepare the input feature
Vector inputFeature = ...
// the semantic meaning of the output is defined by the user;
// in general, the dimension of the output vector is 1 for regression and two-class classification,
// and n for n-class classification (n > 2)
Vector outputVector = mlp.output(inputFeature);

3. Methodology
The output calculation can be easily implemented in a feed-forward manner, and single-machine training is straightforward. The following describes how to train the MLP in a distributed way with batch gradient descent. The workflow is illustrated in the figure at https://docs.google.com/drawings/d/1s8hiYKpdrP3epe1BzkrddIfShkxPrqSuQBH0NAawEM4/pub?w=960&h=720

For distributed training, each training iteration is divided into two steps: the weight update calculation step and the weight update step. The distributed MLP can only be trained in a batch-update fashion.

3.1 The partial weight update calculation step
This step computes the weight updates in a distributed fashion. Each task gets a copy of the MLP model and calculates the weight updates with one partition of the data. Suppose the training error is E(w) = \frac{1}{2} \sum_{d \in D} cost(t_d, y_d), where D denotes the training set, d a training instance, t_d the class label and y_d the output of the MLP.

Also suppose the sigmoid function is used as the squashing function and squared error as the cost function; t_i denotes the target value for the ith dimension of the output layer, o_i the actual output for the ith dimension of the output layer, l the learning rate, and w_{ij} the weight between the jth neuron in the previous layer and the ith neuron in the next layer. The weight of each edge is updated as

\Delta w_{ij} = \frac{l}{m} \delta_j o_i,

where
\delta_j = -\sum_{m} o_j^{(m)} (1 - o_j^{(m)}) (t_j^{(m)} - o_j^{(m)}) for the output layer, and
\delta_j = -\sum_{m} o_j^{(m)} (1 - o_j^{(m)}) \sum_k \delta_k w_{jk} for the hidden layers.

\delta_j can be rewritten as
\delta_j = -\sum_{i=1}^{k} \sum_{m_i} o_j^{(m_i)} (1 - o_j^{(m_i)}) (t_j^{(m_i)} - o_j^{(m_i)}),
which shows that \delta_j can be split into k parts. For the implementation, each mapper calculates its part of \delta_j from its partition of the data and stores the result at a specified location.

3.2 The model update step
After the k parts of \delta_j have been calculated, a separate program merges them into one and updates the weight matrices. This program loads the results produced in the weight update calculation step and applies them to the weight matrices.

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
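To make section 3.1 concrete, below is a rough sketch of the per-partition gradient accumulation for the output layer. The names are hypothetical and the Hadoop mapper wiring is omitted; this is not part of the proposal or of Mahout.

// Sketch of the per-partition accumulation from section 3.1 (hypothetical helper).
// Each mapper would run this over its split and emit the resulting matrix plus the
// instance count, so the update step can average them into \Delta w_{ij}.
import java.util.List;
import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.Vector;

public final class PartialGradientAccumulator {

  /** Accumulates sum_m o_j^(m) (1 - o_j^(m)) (t_j^(m) - o_j^(m)) * o_i^(m) for the output layer. */
  public static Matrix accumulate(List<Vector> hiddenOutputs,   // o_i per instance
                                  List<Vector> outputs,         // o_j per instance
                                  List<Vector> targets) {       // t_j per instance
    int numHidden = hiddenOutputs.get(0).size();
    int numOutput = outputs.get(0).size();
    Matrix partial = new DenseMatrix(numHidden, numOutput);
    for (int m = 0; m < outputs.size(); m++) {
      Vector o = outputs.get(m);
      Vector t = targets.get(m);
      Vector h = hiddenOutputs.get(m);
      for (int j = 0; j < numOutput; j++) {
        // sigmoid squashing + squared error cost
        double deltaJ = o.get(j) * (1 - o.get(j)) * (t.get(j) - o.get(j));
        for (int i = 0; i < numHidden; i++) {
          partial.set(i, j, partial.get(i, j) + deltaJ * h.get(i));
        }
      }
    }
    return partial;
  }
}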
[jira] [Commented] (MAHOUT-1265) Add Multilayer Perceptron
[ https://issues.apache.org/jira/browse/MAHOUT-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13686833#comment-13686833 ] Ted Dunning commented on MAHOUT-1265: - Yexi,

I would suggest that a more fluid API would be helpful to people. For instance, each layer might be an object which could be composed together to build a model which is then trained.

Secondly, it seems like it would be good to have different kinds of loss functions and regularizations.

Also, regarding things like momentum, do you think this really needs to be commonly adjusted, or is there a way to set a good default?

Add Multilayer Perceptron -- Key: MAHOUT-1265 URL: https://issues.apache.org/jira/browse/MAHOUT-1265 Project: Mahout Issue Type: New Feature Reporter: Yexi Jiang Labels: machine_learning, neural_network
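For illustration, the kind of fluid, layer-as-object API being suggested might look roughly like this. The names below are purely hypothetical; none of these classes exist in Mahout or in the proposal.

// Hypothetical sketch of a composable layer API.
import org.apache.mahout.math.Vector;

public interface Layer {
  Vector forward(Vector input);            // compute this layer's activations
  Vector backward(Vector errorFromAbove);  // propagate the error signal down
}

// Usage sketch: layers are composed into a network, then trained.
// NeuralNetwork, DenseLayer, SigmoidActivation, SquaredErrorLoss and L2Regularizer
// are made-up names used only to show the shape of the API.
//
// NeuralNetwork net = new NeuralNetwork()
//     .addLayer(new DenseLayer(2, 5, new SigmoidActivation()))
//     .addLayer(new DenseLayer(5, 1, new SigmoidActivation()))
//     .loss(new SquaredErrorLoss())
//     .regularizer(new L2Regularizer(0.01));
// net.train(trainingData);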
[jira] [Updated] (MAHOUT-1265) Add Multilayer Perceptron
[ https://issues.apache.org/jira/browse/MAHOUT-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yexi Jiang updated MAHOUT-1265: --- Description:

Design of multilayer perceptron

1. Motivation
A multilayer perceptron (MLP) is a feed-forward artificial neural network, a mathematical model inspired by biological neural networks. The multilayer perceptron can be used for various machine learning tasks such as classification and regression. It would be a useful addition to Mahout.

2. API
The design goal of the API is to make the MLP easy to use and to keep the implementation details transparent to the user. The following example code shows how a user would use the MLP.

// set the parameters
double learningRate = 0.5;
double momentum = 0.1;
int[] layerSizeArray = new int[] {2, 5, 1};
String costFuncName = "SquaredError";
String squashingFuncName = "Sigmoid";
// the location to store the model; if there is already an existing model at the specified location, the MLP will throw an exception
URI modelLocation = ...
MultilayerPerceptron mlp = new MultilayerPerceptron(layerSizeArray, modelLocation);
mlp.setLearningRate(learningRate).setMomentum(momentum).setRegularization(...).setCostFunction(...).setSquashingFunction(...);
// the user can also load an existing model with a given URI and update it with new training data; if there is no existing model at the specified location, an exception will be thrown
/* MultilayerPerceptron mlp = new MultilayerPerceptron(learningRate, regularization, momentum, squashingFuncName, costFuncName, modelLocation); */
URI trainingDataLocation = ...
// the details of training are transparent to the user; it may run on a single machine or in a distributed environment
mlp.train(trainingDataLocation);
// the user can also train the model with one training instance at a time, in a stochastic gradient descent fashion
Vector trainingInstance = ...
mlp.train(trainingInstance);
// prepare the input feature
Vector inputFeature = ...
// the semantic meaning of the output is defined by the user;
// in general, the dimension of the output vector is 1 for regression and two-class classification,
// and n for n-class classification (n > 2)
Vector outputVector = mlp.output(inputFeature);

3. Methodology
The output calculation can be easily implemented in a feed-forward manner, and single-machine training is straightforward. The following describes how to train the MLP in a distributed way with batch gradient descent. The workflow is illustrated in the figure at https://docs.google.com/drawings/d/1s8hiYKpdrP3epe1BzkrddIfShkxPrqSuQBH0NAawEM4/pub?w=960&h=720

For distributed training, each training iteration is divided into two steps: the weight update calculation step and the weight update step. The distributed MLP can only be trained in a batch-update fashion.

3.1 The partial weight update calculation step
This step computes the weight updates in a distributed fashion. Each task gets a copy of the MLP model and calculates the weight updates with one partition of the data. Suppose the training error is E(w) = \frac{1}{2} \sum_{d \in D} cost(t_d, y_d), where D denotes the training set, d a training instance, t_d the class label and y_d the output of the MLP.

Also suppose the sigmoid function is used as the squashing function and squared error as the cost function; t_i denotes the target value for the ith dimension of the output layer, o_i the actual output for the ith dimension of the output layer, l the learning rate, and w_{ij} the weight between the jth neuron in the previous layer and the ith neuron in the next layer. The weight of each edge is updated as

\Delta w_{ij} = \frac{l}{m} \delta_j o_i,

where
\delta_j = -\sum_{m} o_j^{(m)} (1 - o_j^{(m)}) (t_j^{(m)} - o_j^{(m)}) for the output layer, and
\delta_j = -\sum_{m} o_j^{(m)} (1 - o_j^{(m)}) \sum_k \delta_k w_{jk} for the hidden layers.

\delta_j can be rewritten as
\delta_j = -\sum_{i=1}^{k} \sum_{m_i} o_j^{(m_i)} (1 - o_j^{(m_i)}) (t_j^{(m_i)} - o_j^{(m_i)}),
which shows that \delta_j can be split into k parts. For the implementation, each mapper calculates its part of \delta_j from its partition of the data and stores the result at a specified location.

3.2 The model update step
After the k parts of \delta_j have been calculated, a separate program merges them into one and updates the weight matrices. This program loads the results produced in the weight update calculation step and applies them to the weight matrices.
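A minimal sketch of the model update step from section 3.2, assuming the k partial gradient matrices have already been read into memory. The names are hypothetical; the real job would load the partials from the locations written by the mappers, and momentum and regularization are left out for brevity.

// Hypothetical sketch of merging k partial gradients and applying the batch update.
import java.util.List;
import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;

public final class WeightUpdater {

  /** Sum the k partial gradient matrices produced by the mappers. */
  public static Matrix merge(List<Matrix> partials) {
    Matrix merged = new DenseMatrix(partials.get(0).rowSize(), partials.get(0).columnSize());
    for (Matrix p : partials) {
      merged = merged.plus(p);
    }
    return merged;
  }

  /** Apply the averaged update: w_{ij} <- w_{ij} + (l / m) * merged gradient. */
  public static Matrix apply(Matrix weights, Matrix mergedGradient, double learningRate, long numInstances) {
    return weights.plus(mergedGradient.times(learningRate / numInstances));
  }
}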
[jira] [Commented] (MAHOUT-1265) Add Multilayer Perceptron
[ https://issues.apache.org/jira/browse/MAHOUT-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13686880#comment-13686880 ] Yexi Jiang commented on MAHOUT-1265: Ted,

{quote} I would suggest that a more fluid API would be helpful to people. For instance, each layer might be an object which could be composed together to build a model which is then trained. {quote}

It seems that you are suggesting a more general neural network, not just the MLP. An MLP is a feed-forward neural network whose topology is fixed: it consists of several layers, and every pair of neurons in adjacent layers is connected. Therefore, specifying the size of each layer is enough to determine the topology of an MLP. It would be good to first define a generic neural network and then build the MLP on top of it, in the way you described. An advantage is that the generic neural network could be reused to build other types of neural networks in the future, e.g. an autoencoder for dimensionality reduction, a recurrent neural network for sequence mining, or possibly deep nets.

{quote} Secondly, it seems like it would be good to have different kinds of loss function and regularizations. {quote}

Yes, the MLP would allow the user to specify different loss functions, squashing functions, and regularizations.

{quote} Also, regarding things like momentum, do you have an idea that this really needs to be commonly adjusted? or is there a way to set a good default? {quote}

As far as I know, there is no empirical way to set a good default momentum weight; a good value is determined by the concrete problem. As for the learning rate, a good approach is to use a decaying learning rate.

Add Multilayer Perceptron -- Key: MAHOUT-1265 URL: https://issues.apache.org/jira/browse/MAHOUT-1265 Project: Mahout Issue Type: New Feature Reporter: Yexi Jiang Labels: machine_learning, neural_network
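For reference, one common form of decaying learning rate is an inverse-time decay. The particular schedule below is an assumption chosen for illustration, not something specified in the proposal.

// Inverse-time decay: the learning rate shrinks as training progresses.
// lr(t) = lr0 / (1 + decayRate * t), where t is the iteration number.
public final class LearningRateSchedule {
  private final double initialRate;
  private final double decayRate;

  public LearningRateSchedule(double initialRate, double decayRate) {
    this.initialRate = initialRate;
    this.decayRate = decayRate;
  }

  public double rateAt(long iteration) {
    return initialRate / (1.0 + decayRate * iteration);
  }
}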
Build failed in Jenkins: Mahout-Examples-Cluster-Reuters-II #516
See https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters-II/516/changes Changes: [robinanil] Randomized test for VectorBinaryAggregate -- [...truncated 5407 lines...] INFO: Task 'attempt_local_0015_r_00_0' done. Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob INFO: map 100% reduce 100% Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob INFO: Job complete: job_local_0015 Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log INFO: Counters: 17 Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log INFO: File Output Format Counters Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log INFO: Bytes Written=389 Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log INFO: FileSystemCounters Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log INFO: FILE_BYTES_READ=1274348775 Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log INFO: FILE_BYTES_WRITTEN=1285878485 Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log INFO: File Input Format Counters Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log INFO: Bytes Read=152 Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log INFO: Map-Reduce Framework Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log INFO: Map output materialized bytes=61 Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log INFO: Map input records=0 Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log INFO: Reduce shuffle bytes=0 Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log INFO: Spilled Records=40 Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log INFO: Map output bytes=120 Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log INFO: Total committed heap usage (bytes)=3249930240 Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log INFO: SPLIT_RAW_BYTES=119 Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log INFO: Combine input records=20 Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log INFO: Reduce input records=20 Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log INFO: Reduce input groups=20 Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log INFO: Combine output records=20 Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log INFO: Reduce output records=20 Jun 18, 2013 6:26:11 PM org.apache.hadoop.mapred.Counters log INFO: Map output records=20 Jun 18, 2013 6:26:11 PM org.slf4j.impl.JCLLoggerAdapter info INFO: About to run iteration 16 of 20 Jun 18, 2013 6:26:11 PM org.slf4j.impl.JCLLoggerAdapter info INFO: About to run: Iteration 16 of 20, input path: /tmp/mahout-work-hudson/reuters-lda-model/model-15 Jun 18, 2013 6:26:13 PM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus INFO: Total input paths to process : 1 Jun 18, 2013 6:26:13 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob INFO: Running job: job_local_0016 Jun 18, 2013 6:26:13 PM org.apache.hadoop.mapred.Task initialize INFO: Using ResourceCalculatorPlugin : null Jun 18, 2013 6:26:13 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer init INFO: io.sort.mb = 100 Jun 18, 2013 6:26:14 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer init INFO: data buffer = 79691776/99614720 Jun 18, 2013 6:26:14 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer init INFO: record buffer = 262144/327680 Jun 18, 2013 6:26:14 PM org.slf4j.impl.JCLLoggerAdapter info INFO: Retrieving configuration Jun 18, 2013 6:26:14 PM org.slf4j.impl.JCLLoggerAdapter 
info INFO: Initializing read model Jun 18, 2013 6:26:14 PM org.slf4j.impl.JCLLoggerAdapter info INFO: Initializing write model Jun 18, 2013 6:26:14 PM org.slf4j.impl.JCLLoggerAdapter info INFO: Initializing model trainer Jun 18, 2013 6:26:14 PM org.slf4j.impl.JCLLoggerAdapter info INFO: Starting training threadpool with 4 threads Jun 18, 2013 6:26:14 PM org.slf4j.impl.JCLLoggerAdapter info INFO: Stopping model trainer Jun 18, 2013 6:26:14 PM org.slf4j.impl.JCLLoggerAdapter info INFO: Initiating stopping of training threadpool Jun 18, 2013 6:26:14 PM org.slf4j.impl.JCLLoggerAdapter info INFO: threadpool took: 0.752647ms Jun 18, 2013 6:26:14 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob INFO: map 0% reduce 0% Jun 18, 2013 6:26:15 PM org.slf4j.impl.JCLLoggerAdapter info INFO: readModel.stop() took 1002.078932ms Jun 18, 2013 6:26:16 PM org.slf4j.impl.JCLLoggerAdapter info INFO: writeModel.stop() took 1010.00808ms Jun 18, 2013 6:26:16 PM org.slf4j.impl.JCLLoggerAdapter info INFO: Writing model Jun 18, 2013 6:26:16 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush INFO: Starting flush of map output Jun 18, 2013 6:26:16 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill INFO: Finished spill 0 Jun 18, 2013 6:26:16 PM org.apache.hadoop.mapred.Task done INFO:
Does RowSimilarity job support down-sampling
I was reading the RowSimilarityJob and it doesn't appear that it does down-sampling on the original data to minimize the performance impact of perversely prolific users. The issue is that if a single user has 100,000 items in their history, we learn nothing more than if we picked 300 of those, while the former would result in processing 10 billion cooccurrences and the latter would result in 100,000. This factor is so large that it can make a big difference in performance. I had thought that the code had this down-sampling in place. If not, I can add row-based down-sampling quite easily.
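For illustration, row-based down-sampling over a Mahout Vector could look roughly like the reservoir-sampling sketch below. This is illustrative only, not the code that exists in Mahout; the class and method names are made up.

// Hypothetical helper: keep at most maxEntries non-zero entries of a row,
// chosen uniformly at random, before counting cooccurrences.
import java.util.Iterator;
import java.util.Random;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public final class RowDownSampler {
  private RowDownSampler() {}

  public static Vector downSample(Vector row, int maxEntries, Random random) {
    int[] keptIndexes = new int[maxEntries];
    double[] keptValues = new double[maxEntries];
    int seen = 0;
    for (Iterator<Vector.Element> it = row.iterateNonZero(); it.hasNext();) {
      Vector.Element e = it.next();
      if (seen < maxEntries) {
        keptIndexes[seen] = e.index();
        keptValues[seen] = e.get();
      } else {
        // replace a kept entry with probability maxEntries / (seen + 1)
        int j = random.nextInt(seen + 1);
        if (j < maxEntries) {
          keptIndexes[j] = e.index();
          keptValues[j] = e.get();
        }
      }
      seen++;
    }
    int kept = Math.min(seen, maxEntries);
    Vector sampled = new RandomAccessSparseVector(row.size(), kept);
    for (int i = 0; i < kept; i++) {
      sampled.setQuick(keptIndexes[i], keptValues[i]);
    }
    return sampled;
  }
}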
Re: Does RowSimilarity job support down-sampling
I think you can get what you need through the --maxPrefsForUser flag. Any user with more than that many will only keep a random sample of that size. On Jun 18, 2013, at 23:27, Ted Dunning ted.dunn...@gmail.com wrote: I was reading the RowSimilarityJob and it doesn't appear that it does down-sampling on the original data to minimize the performance impact of perversely prolific users. The issue is that if a single user has 100,000 items in their history, we learn nothing more than if we picked 300 of those while the former would result in processing 10 billion cooccurrences and the latter would result in 100,000. This factor of 10,000 is so large that it can make a big difference in performance. I had thought that the code had this down-sampling in place. If not, I can add row based down-sampling quite easily.
Re: Does RowSimilarity job support down-sampling
My recollection as well. I will read the code again. Didn't see where that happens. On Tue, Jun 18, 2013 at 10:34 PM, Sean Owen sro...@gmail.com wrote: This is the maxPrefsPerUser option IIRC. On Tue, Jun 18, 2013 at 9:27 PM, Ted Dunning ted.dunn...@gmail.com wrote: I was reading the RowSimilarityJob and it doesn't appear that it does down-sampling on the original data to minimize the performance impact of perversely prolific users. The issue is that if a single user has 100,000 items in their history, we learn nothing more than if we picked 300 of those while the former would result in processing 10 billion cooccurrences and the latter would result in 100,000. This factor of 10,000 is so large that it can make a big difference in performance. I had thought that the code had this down-sampling in place. If not, I can add row based down-sampling quite easily.
Re: Does RowSimilarity job support down-sampling
No, it's in ItemSimilarityJob -- I'm looking at it now. It ends up setting ToItemVectorsMapper.SAMPLE_SIZE, if that helps. On Tue, Jun 18, 2013 at 9:43 PM, Ted Dunning ted.dunn...@gmail.com wrote: Ahh... only effective in RecommenderJob.
Re: Does RowSimilarity job support down-sampling
But RecommenderJob seems to call RowSimilarityJob first. That is where sampling needs to be done.

// calculate the co-occurrence matrix
ToolRunner.run(getConf(), new RowSimilarityJob(), new String[]{
    "--input", new Path(prepPath, PreparePreferenceMatrixJob.RATING_MATRIX).toString(),
    "--output", similarityMatrixPath.toString(),
    "--numberOfColumns", String.valueOf(numberOfUsers),
    "--similarityClassname", similarityClassname,
    "--maxSimilaritiesPerRow", String.valueOf(maxSimilaritiesPerItem),
    "--excludeSelfSimilarity", String.valueOf(Boolean.TRUE),
    "--threshold", String.valueOf(threshold),
    "--tempDir", getTempPath().toString(),
});

// write out the similarity matrix if the user specified that behavior
if (hasOption("outputPathForSimilarityMatrix")) {
  Path outputPathForSimilarityMatrix = new Path(getOption("outputPathForSimilarityMatrix"));
  Job outputSimilarityMatrix = prepareJob(similarityMatrixPath, outputPathForSimilarityMatrix,
      SequenceFileInputFormat.class,
      ItemSimilarityJob.MostSimilarItemPairsMapper.class, EntityEntityWritable.class, DoubleWritable.class,
      ItemSimilarityJob.MostSimilarItemPairsReducer.class, EntityEntityWritable.class, DoubleWritable.class,
      TextOutputFormat.class);
  Configuration mostSimilarItemsConf = outputSimilarityMatrix.getConfiguration();
  mostSimilarItemsConf.set(ItemSimilarityJob.ITEM_ID_INDEX_PATH_STR,
      new Path(prepPath, PreparePreferenceMatrixJob.ITEMID_INDEX).toString());
  mostSimilarItemsConf.setInt(ItemSimilarityJob.MAX_SIMILARITIES_PER_ITEM, maxSimilaritiesPerItem);
  outputSimilarityMatrix.waitForCompletion(true);
}

On Tue, Jun 18, 2013 at 10:47 PM, Sean Owen sro...@gmail.com wrote: No, it's in ItemSimilarityJob -- I'm looking at it now. It ends up setting ToItemVectorsMapper.SAMPLE_SIZE, if that helps. On Tue, Jun 18, 2013 at 9:43 PM, Ted Dunning ted.dunn...@gmail.com wrote: Ahh... only effective in RecommenderJob.
Re: Does RowSimilarity job support down-sampling
Hi,

RowSimilarityJob by itself does not do down-sampling. The down-sampling is done by the ToItemVectorsMapper in the PreparePreferenceMatrixJob, which is responsible for preparing the inputs (the matrix of interactions between users and items) for ItemSimilarityJob and RecommenderJob. As Sean noted, the option maxPrefsPerUser controls the sampling. By default, we use 1000 samples per user. We could also move the sampling directly to RowSimilarityJob if people consider this more useful.

Best, Sebastian

On 18.06.2013 22:50, Ted Dunning wrote: But RecommenderJob seems to call RowSimilarityJob first. That is where sampling needs to be done. //calculate the co-occurrence matrix ToolRunner.run(getConf(), new RowSimilarityJob(), new String[]{ --input, new Path(prepPath, PreparePreferenceMatrixJob.RATING_MATRIX).toString(), --output, similarityMatrixPath.toString(), --numberOfColumns, String.valueOf(numberOfUsers), --similarityClassname, similarityClassname, --maxSimilaritiesPerRow, String.valueOf(maxSimilaritiesPerItem), --excludeSelfSimilarity, String.valueOf(Boolean.TRUE), --threshold, String.valueOf(threshold), --tempDir, getTempPath().toString(), }); // write out the similarity matrix if the user specified that behavior if (hasOption(outputPathForSimilarityMatrix)) { Path outputPathForSimilarityMatrix = new Path(getOption(outputPathForSimilarityMatrix)); Job outputSimilarityMatrix = prepareJob(similarityMatrixPath, outputPathForSimilarityMatrix, SequenceFileInputFormat.class, ItemSimilarityJob.MostSimilarItemPairsMapper.class, EntityEntityWritable.class, DoubleWritable.class, ItemSimilarityJob.MostSimilarItemPairsReducer.class, EntityEntityWritable.class, DoubleWritable.class, TextOutputFormat.class); Configuration mostSimilarItemsConf = outputSimilarityMatrix.getConfiguration(); mostSimilarItemsConf.set(ItemSimilarityJob.ITEM_ID_INDEX_PATH_STR, new Path(prepPath, PreparePreferenceMatrixJob.ITEMID_INDEX).toString()); mostSimilarItemsConf.setInt(ItemSimilarityJob.MAX_SIMILARITIES_PER_ITEM, maxSimilaritiesPerItem); outputSimilarityMatrix.waitForCompletion(true); } } On Tue, Jun 18, 2013 at 10:47 PM, Sean Owen sro...@gmail.com wrote: No, it's in ItemSimilarityJob -- I'm looking at it now. It ends up setting ToItemVectorsMapper.SAMPLE_SIZE, if that helps. On Tue, Jun 18, 2013 at 9:43 PM, Ted Dunning ted.dunn...@gmail.com wrote: Ahh... only effective in RecommenderJob.
[jira] [Updated] (MAHOUT-1266) Two minor problems in DistributedRowMatrix using MatrixMultiplication
[ https://issues.apache.org/jira/browse/MAHOUT-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Illecker updated MAHOUT-1266: Description:

Hello, I think I have found two minor problems in *DistributedRowMatrix*. In [1] the condition is wrong, because (l x m) * (m x n) = (l x n); the condition should be like in [2]. And in *times* [3], the {{this.transpose()}} seems to be missing? (See [4]) Do you have any benchmark results for Mahout MatrixMultiplication? Thanks! Martin

[1] [https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java#L191-193]
[2] [https://github.com/millecker/applications/blob/master/hadoop/rootbeer/matrixmultiplication/src/at/illecker/hadoop/rootbeer/examples/matrixmultiplication/DistributedRowMatrix.java#L222-226]
[3] [https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java#L190-206]
[4] [https://github.com/millecker/applications/blob/master/hadoop/rootbeer/matrixmultiplication/src/at/illecker/hadoop/rootbeer/examples/matrixmultiplication/DistributedRowMatrix.java#L231-232]

Two minor problems in DistributedRowMatrix using MatrixMultiplication - Key: MAHOUT-1266 URL: https://issues.apache.org/jira/browse/MAHOUT-1266 Project: Mahout Issue Type: Bug Components: Math Affects Versions: 0.7 Reporter: Martin Illecker Priority: Trivial Labels: newbie Original Estimate: 10m Remaining Estimate: 10m

-- This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (MAHOUT-1266) Two minor problems in DistributedRowMatrix using MatrixMultiplication
Martin Illecker created MAHOUT-1266: --- Summary: Two minor problems in DistributedRowMatrix using MatrixMultiplication Key: MAHOUT-1266 URL: https://issues.apache.org/jira/browse/MAHOUT-1266 Project: Mahout Issue Type: Bug Components: Math Affects Versions: 0.7 Reporter: Martin Illecker Priority: Trivial Hello, I think I have found two minor problems in *DistributedRowMatrix*. In [1] the condition is wrong, because (l x m) * (m x n) = (l x n). The condition should be like in [2]. And in *times*[3] the {{this.transpose()}} seems to be missing? (See [4]) Do you have any benchmark results for Mahout MatrixMultiplication? Thanks! Martin [1|https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java#L191-193] [2|https://github.com/millecker/applications/blob/master/hadoop/rootbeer/matrixmultiplication/src/at/illecker/hadoop/rootbeer/examples/matrixmultiplication/DistributedRowMatrix.java#L222-226] [3|https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java#L190-206] [4|https://github.com/millecker/applications/blob/master/hadoop/rootbeer/matrixmultiplication/src/at/illecker/hadoop/rootbeer/examples/matrixmultiplication/DistributedRowMatrix.java#L231-232] -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAHOUT-1266) Two minor problems in DistributedRowMatrix using MatrixMultiplication
[ https://issues.apache.org/jira/browse/MAHOUT-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Illecker updated MAHOUT-1266: Description:

Hello, I think I have found two minor problems in *DistributedRowMatrix*. In [1] the condition is wrong, because (l x m) * (m x n) = (l x n); the condition should be like in [2]. And in *times* [3], the {{this.transpose()}} seems to be missing? (See [4]) Do you have any benchmark results for Mahout MatrixMultiplication? Thanks! Martin

[1] [https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java#L191-193]
[2] [https://github.com/millecker/applications/blob/master/hadoop/rootbeer/matrixmultiplication/src/at/illecker/hadoop/rootbeer/examples/matrixmultiplication/DistributedRowMatrix.java#L221-225]
[3] [https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java#L190-206]
[4] [https://github.com/millecker/applications/blob/master/hadoop/rootbeer/matrixmultiplication/src/at/illecker/hadoop/rootbeer/examples/matrixmultiplication/DistributedRowMatrix.java#L230-231]

Two minor problems in DistributedRowMatrix using MatrixMultiplication - Key: MAHOUT-1266 URL: https://issues.apache.org/jira/browse/MAHOUT-1266 Project: Mahout Issue Type: Bug Components: Math Affects Versions: 0.7 Reporter: Martin Illecker Priority: Trivial Labels: newbie Original Estimate: 10m Remaining Estimate: 10m

-- This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
Build failed in Jenkins: Mahout-Quality #2094
See https://builds.apache.org/job/Mahout-Quality/2094/ -- [...truncated 4959 lines...] Running org.apache.mahout.clustering.spectral.common.TestUnitVectorizerJob parallel='classes', perCoreThreadCount=false, threadCount=1, useUnlimitedThreads=false parallel='classes', perCoreThreadCount=false, threadCount=1, useUnlimitedThreads=false Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.643 sec Running org.apache.mahout.clustering.streaming.cluster.BallKMeansTest Running org.apache.mahout.clustering.streaming.cluster.StreamingKMeansTest parallel='classes', perCoreThreadCount=false, threadCount=1, useUnlimitedThreads=false Running org.apache.mahout.clustering.TestClusterInterface Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.283 sec parallel='classes', perCoreThreadCount=false, threadCount=1, useUnlimitedThreads=false Running org.apache.mahout.clustering.fuzzykmeans.TestFuzzyKmeansClustering Tests run: 15, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 8.532 sec parallel='classes', perCoreThreadCount=false, threadCount=1, useUnlimitedThreads=false Running org.apache.mahout.clustering.kmeans.TestRandomSeedGenerator Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.514 sec parallel='classes', perCoreThreadCount=false, threadCount=1, useUnlimitedThreads=false Running org.apache.mahout.clustering.kmeans.TestKmeansClustering Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 15.146 sec parallel='classes', perCoreThreadCount=false, threadCount=1, useUnlimitedThreads=false Running org.apache.mahout.clustering.topdown.postprocessor.ClusterCountReaderTest Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.009 sec parallel='classes', perCoreThreadCount=false, threadCount=1, useUnlimitedThreads=false Running org.apache.mahout.clustering.topdown.postprocessor.ClusterOutputPostProcessorTest Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.039 sec parallel='classes', perCoreThreadCount=false, threadCount=1, useUnlimitedThreads=false Running org.apache.mahout.clustering.topdown.PathDirectoryTest Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.024 sec parallel='classes', perCoreThreadCount=false, threadCount=1, useUnlimitedThreads=false Running org.apache.mahout.clustering.classify.ClusterClassificationDriverTest Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.357 sec parallel='classes', perCoreThreadCount=false, threadCount=1, useUnlimitedThreads=false Running org.apache.mahout.clustering.dirichlet.TestDistributions Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.497 sec parallel='classes', perCoreThreadCount=false, threadCount=1, useUnlimitedThreads=false Running org.apache.mahout.clustering.dirichlet.TestMapReduce Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 21.297 sec parallel='classes', perCoreThreadCount=false, threadCount=1, useUnlimitedThreads=false Running org.apache.mahout.clustering.dirichlet.TestDirichletClustering Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 74.904 sec parallel='classes', perCoreThreadCount=false, threadCount=1, useUnlimitedThreads=false Running org.apache.mahout.clustering.minhash.TestMinHashClustering Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.762 sec parallel='classes', perCoreThreadCount=false, threadCount=1, useUnlimitedThreads=false Running org.apache.mahout.clustering.canopy.TestCanopyCreation Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
35.752 sec parallel='classes', perCoreThreadCount=false, threadCount=1, useUnlimitedThreads=false Running org.apache.mahout.clustering.TestGaussianAccumulators Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 11.876 sec parallel='classes', perCoreThreadCount=false, threadCount=1, useUnlimitedThreads=false Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.234 sec Running org.apache.mahout.classifier.discriminative.WinnowTrainerTest Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.122 sec parallel='classes', perCoreThreadCount=false, threadCount=1, useUnlimitedThreads=false Running org.apache.mahout.classifier.discriminative.PerceptronTrainerTest Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.082 sec parallel='classes', perCoreThreadCount=false, threadCount=1, useUnlimitedThreads=false Running org.apache.mahout.classifier.discriminative.LinearModelTest Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 20.216 sec Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.133 sec parallel='classes', perCoreThreadCount=false, threadCount=1, useUnlimitedThreads=false Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 87.869 sec FAILURE! testRemoval[0](org.apache.mahout.math.neighborhood.SearchSanityTest) Time elapsed: 5.17 sec FAILURE! java.lang.AssertionError:
Re: Does RowSimilarity job support down-sampling
On Tue, Jun 18, 2013 at 11:01 PM, Sebastian Schelter s...@apache.org wrote: We could also move the sampling directly to RowSimilarityJob if people consider this more useful. It will have a large effect on the time for the RowSimilarityJob for some data. Does anybody have an idea about how much of the total time is in RowSimilarityJob?
Mahout vectors/matrices/solvers on spark
Hello, so I finally got around to actually doing it. I want to get Mahout sparse vectors and matrices (DRMs) and rebuild some solvers using Spark and Bagel/Scala. I also want to use in-core solvers that run directly on Mahout.

Question #1: which Mahout artifacts are best to import if I don't want to pick up the Hadoop dependencies? Is there even such a separation in the code? I know mahout-math seems to try to avoid being Hadoop-specific, but I am not sure whether that is followed strictly.

Question #2: which in-core solvers are available for Mahout matrices? I know there's SSVD, probably Cholesky; is there something else? In particular, I need to solve linear systems, and I guess Cholesky should be equipped to do just that?

Question #3: why did we import the Colt solvers rather than depend on Colt in the first place? Why did we not accept Colt's sparse matrices and create native ones instead? Colt has a notion of sparse in-core matrices too and seems like a well-rounded solution. However, it doesn't seem to be actively supported, whereas I know Mahout's in-core matrix support has seen continued enhancements.

Thanks in advance
-Dmitriy
Re: Mahout vectors/matrices/solvers on spark
On Tue, Jun 18, 2013 at 6:14 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

Hello, so I finally got around to actually doing it. I want to get Mahout sparse vectors and matrices (DRMs) and rebuild some solvers using Spark and Bagel/Scala. I also want to use in-core solvers that run directly on Mahout. Question #1: which Mahout artifacts are best to import if I don't want to pick up the Hadoop dependencies? Is there even such a separation in the code? I know mahout-math seems to try to avoid being Hadoop-specific, but I am not sure whether that is followed strictly.

mahout-math should not depend on Hadoop APIs at all; if you build it and Hadoop gets pulled in via Maven, then file a ticket, that's a bug.

Question #2: which in-core solvers are available for Mahout matrices? I know there's SSVD, probably Cholesky; is there something else? In particular, I need to solve linear systems, and I guess Cholesky should be equipped to do just that? Question #3: why did we import the Colt solvers rather than depend on Colt in the first place? Why did we not accept Colt's sparse matrices and create native ones instead? Colt has a notion of sparse in-core matrices too and seems like a well-rounded solution. However, it doesn't seem to be actively supported, whereas I know Mahout's in-core matrix support has seen continued enhancements.

Colt was totally abandoned, and I talked to the original author and he blessed its adoption. When we pulled it in, we found it was woefully undertested, and we tried our best to hook it in with proper tests and use APIs that fit the use cases we had. Plus, we already had the start of some linear APIs (i.e. the Vector interface), and dropping that API completely seemed not terribly worth it at the time.

Thanks in advance -Dmitriy

-- -jake
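On question #2: mahout-math also ships JAMA-style in-core decompositions (QRDecomposition, CholeskyDecomposition, SingularValueDecomposition) that can be used to solve linear systems. A minimal sketch of an in-core solve via QRDecomposition follows; the exact signatures should be treated as an assumption to check against the mahout-math version actually in use, and for symmetric positive definite systems CholeskyDecomposition would be the cheaper choice.

// In-core solve of A x = b using mahout-math's QR decomposition.
import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.QRDecomposition;
import org.apache.mahout.math.Vector;

public final class SolveExample {
  public static void main(String[] args) {
    Matrix a = new DenseMatrix(new double[][] {{4, 1}, {1, 3}});
    Vector b = new DenseVector(new double[] {1, 2});

    // QRDecomposition.solve expects a Matrix right-hand side, so wrap b in a single column.
    Matrix rhs = new DenseMatrix(2, 1);
    rhs.assignColumn(0, b);

    Matrix x = new QRDecomposition(a).solve(rhs);
    System.out.println("x = " + x.viewColumn(0));
  }
}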
[jira] [Commented] (MAHOUT-1266) Two minor problems in DistributedRowMatrix using MatrixMultiplication
[ https://issues.apache.org/jira/browse/MAHOUT-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13687584#comment-13687584 ] Jake Mannix commented on MAHOUT-1266: - As mentioned in the javadocs for the method, it does *not* implement A * B, it implements A.transpose() * B, because this operation can be done in one map-reduce pass (with the SequenceFiles backing both A and B as inputs), while computing A * B takes two map-reduce passes. Why try to super-speed up the process with a GPU, as in the code you linked to, if you're going to have to make two full passes (your call to .transpose()) over your distributed data set? That will inevitably be way slower than anything (unoptimized) you can compute in one MR pass, simply because of all the disk IO.

Two minor problems in DistributedRowMatrix using MatrixMultiplication - Key: MAHOUT-1266 URL: https://issues.apache.org/jira/browse/MAHOUT-1266 Project: Mahout Issue Type: Bug Components: Math Affects Versions: 0.7 Reporter: Martin Illecker Priority: Trivial Labels: newbie
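The reason A.transpose() * B fits into one pass is that the product decomposes into a sum of outer products of corresponding rows, so a job that co-groups row i of A with row i of B can emit partial products directly. Below is a small in-core illustration of that identity; it is illustrative only and not the DistributedRowMatrix code.

// (A^T B) = sum over rows i of a_i b_i^T, where a_i and b_i are the i-th rows of A and B.
// This is what lets a distributed job compute A'B in a single co-grouped pass.
import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;

public final class TransposeTimesExample {
  public static Matrix transposeTimes(Matrix a, Matrix b) {
    Matrix result = new DenseMatrix(a.columnSize(), b.columnSize());
    for (int i = 0; i < a.rowSize(); i++) {
      // outer product of row i of A with row i of B, accumulated into the result
      result = result.plus(a.viewRow(i).cross(b.viewRow(i)));
    }
    return result;
  }
}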
Re: Mahout vectors/matrices/solvers on spark
Thank you, Jake. I suspected as much about Colt. On Jun 18, 2013 8:30 PM, Jake Mannix jake.man...@gmail.com wrote: On Tue, Jun 18, 2013 at 6:14 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Hello, so i finally got around to actually do it. I want to get Mahout sparse vectors and matrices (DRMs) and rebuild some solvers using spark and Bagel /scala. I also want to use in-core solvers that run directly on Mahout. Question #1: which mahout artifacts are better be imported if I don't want to pick the hadoop stuff dependencies? Is there even such a separation of code? I know mahout-math seems to try to avoid being hadoop specfic but not sure if it is followed strictly. mahout-math should not depend on hadoop apis at all, if you build it and hadoop gets pulled in via maven, then file a ticket, that's a bug. Question #2: which in-core solvers are available for Mahout matrices? I know there's SSVD, probably Cholesky, is there something else? In paticular, i need to be solving linear systems, I guess Cholesky should be equipped enough to do just that? Question #3: why did we try to import Colt solvers rather than actually depend on Colt in the first place? Why did we not accept Colt's sparse matrices and created native ones instead? Colt seems to have a notion of parse in-core matrices too and seems like a well-rounded solution. However, it doesn't seem like being actively supported, whereas I know Mahout experienced continued enhancements to the in-core matrix support. Colt was totally abandoned, and I talked to the original author and he blessed it's adoption. When we pulled it in, we found it was woefully undertested, and tried our best to hook it in with proper tests and use APIs that fit with the use cases we had. Plus, we already had the start of some linear apis (i.e. the Vector interface) and dropping the API completely seemed not terribly worth it at the time. Thanks in advance -Dmitriy -- -jake
Re: Does RowSimilarity job support down-sampling
On 19.06.2013 01:29, Ted Dunning wrote: On Tue, Jun 18, 2013 at 11:01 PM, Sebastian Schelter s...@apache.org wrote: We could also move the sampling directly to RowSimilarityJob if people consider this more useful. It will have a large effect on the time for the RowSimilarityJob for some data. I put the sampling into PreparePreferenceMatrixJob, because I considered it to be usecase specific for recommendations. Does anybody have an idea about how much of the total time is in RowSimilarityJob? What do you mean by total time? Compared to the rest of the jobs in ItemSimilarityJob and RecommenderJob? -sebastian