[jira] [Commented] (MAHOUT-1354) Mahout Support for Hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13842960#comment-13842960 ] Gokhan Capan commented on MAHOUT-1354: -- It looks like, when the hadoop-2 profile is activated, this patch fails to apply the hadoop-2 related dependencies to the integration and examples modules, even though both depend on core and core depends on hadoop-2. For me, moving the hadoop dependencies to the root POM solved the problem, but I don't think we want that, since hadoop is not a common dependency for all modules of the project. CC'ing [~frankscholten] Mahout Support for Hadoop 2 Key: MAHOUT-1354 URL: https://issues.apache.org/jira/browse/MAHOUT-1354 Project: Mahout Issue Type: Improvement Affects Versions: 0.8 Reporter: Suneel Marthi Assignee: Suneel Marthi Fix For: 1.0 Attachments: MAHOUT-1354_initial.patch Mahout support for Hadoop 2, now that Hadoop 2 is official. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
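For reference, the symptom the comment describes can arise with a profile of roughly the following shape: a hadoop-2 profile whose dependencies are declared only in core/pom.xml is not seen by the integration and examples modules, because Maven does not activate profiles from a dependency's POM during transitive resolution. The coordinates below are illustrative, not the actual contents of the attached patch.

```xml
<!-- Illustrative fragment: a hadoop-2 profile declared only in core/pom.xml.
     integration and examples depend on core, but profile-scoped dependencies
     of core are not re-exported to them; each module that needs the hadoop-2
     artifacts must see the profile itself (e.g. by declaring it in the
     parent POM, which all modules inherit). -->
<profile>
  <id>hadoop-2</id>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
  </dependencies>
</profile>
```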
[jira] [Updated] (MAHOUT-1265) Add Multilayer Perceptron
[ https://issues.apache.org/jira/browse/MAHOUT-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yexi Jiang updated MAHOUT-1265: --- Attachment: Mahout-1265-11.patch This is the final version of the patch. It has been reviewed by [~smarthi]. Add Multilayer Perceptron -- Key: MAHOUT-1265 URL: https://issues.apache.org/jira/browse/MAHOUT-1265 Project: Mahout Issue Type: New Feature Reporter: Yexi Jiang Labels: machine_learning, neural_network Attachments: Mahout-1265-11.patch, Mahout-1265-6.patch, mahout-1265.patch

Design of multilayer perceptron

1. Motivation

A multilayer perceptron (MLP) is a kind of feed-forward artificial neural network, a mathematical model inspired by biological neural networks. A multilayer perceptron can be used for various machine learning tasks such as classification and regression, so it would be helpful to include one in Mahout.

2. API

The design goal of the API is to make the MLP easy to use and to keep the implementation details transparent to the user. The following example shows how a user would use the MLP:

// set the parameters
double learningRate = 0.5;
double momentum = 0.1;
int[] layerSizeArray = new int[] {2, 5, 1};
String costFuncName = "SquaredError";
String squashingFuncName = "Sigmoid";
// the location to store the model; if a model already exists at the
// specified location, the MLP will throw an exception
URI modelLocation = ...
MultilayerPerceptron mlp = new MultiLayerPerceptron(layerSizeArray, modelLocation);
mlp.setLearningRate(learningRate).setMomentum(momentum).setRegularization(...).setCostFunction(...).setSquashingFunction(...);
// the user can also load an existing model from a given URI and update it
// with new training data; if no model exists at the specified location,
// an exception will be thrown
/* MultilayerPerceptron mlp = new MultiLayerPerceptron(learningRate, regularization, momentum, squashingFuncName, costFuncName, modelLocation); */
URI trainingDataLocation = ...
// the details of training are transparent to the user; it may run on a
// single machine or in a distributed environment
mlp.train(trainingDataLocation);
// the user can also train the model one instance at a time, in the manner
// of stochastic gradient descent
Vector trainingInstance = ...
mlp.train(trainingInstance);
// prepare the input feature
Vector inputFeature = ...
// the semantic meaning of the output is defined by the user
// in general, the dimension of the output vector is 1 for regression and
// two-class classification, and n for n-class classification (n > 2)
Vector outputVector = mlp.output(inputFeature);

3. Methodology

The output calculation is easily implemented with a feed-forward pass, and single-machine training is straightforward. The following describes how to train the MLP in a distributed way with batch gradient descent. The workflow is illustrated in the figure at https://docs.google.com/drawings/d/1s8hiYKpdrP3epe1BzkrddIfShkxPrqSuQBH0NAawEM4/pub?w=960&h=720

For distributed training, each iteration is divided into two steps: the weight update calculation step and the weight update step. The distributed MLP can only be trained in a batch-update fashion.

3.1 The partial weight update calculation step

This step trains the MLP in a distributed fashion.
Each task gets a copy of the MLP model and calculates the weight updates from its partition of the data. Suppose the training error is

E(w) = 1/2 \sum_{d \in D} cost(t_d, y_d),

where D denotes the training set, d a training instance, t_d the class label, and y_d the output of the MLP. Also suppose the sigmoid function is used as the squashing function and squared error as the cost function; t_i denotes the target value for the ith dimension of the output layer, o_i the actual output for the ith dimension of the output layer, l the learning rate, and w_{ij} the weight between the jth neuron in the previous layer and the ith neuron in the next layer. The weight of each edge is updated as

\Delta w_{ij} = l * (1/m) * \delta_j * o_i,

where

\delta_j = - \sum_{m} o_j^{(m)} * (1 - o_j^{(m)}) * (t_j^{(m)} - o_j^{(m)}) for the output layer, and
\delta_j = - \sum_{m} o_j^{(m)} * (1 - o_j^{(m)}) * \sum_k \delta_k * w_{jk} for a hidden layer.

It follows that \delta_j can be rewritten as

\delta_j = - \sum_{i = 1}^k \sum_{m_i} o_j^{(m_i)} * (1 - o_j^{(m_i)}) * (t_j^{(m_i)} - o_j^{(m_i)}),

which indicates that \delta_j can be divided into k parts. So for the implementation, each mapper can calculate part of \delta_j with
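As a concrete sketch of the per-mapper computation described above (sigmoid squashing, squared-error cost, and a partial \delta_j accumulated over one mapper's data partition), the inner loop might look like the following. The class and method names are illustrative, not Mahout's actual API, and the network is reduced to a single weight layer for brevity.

```java
// Illustrative sketch of one mapper's partial gradient computation for an
// MLP with sigmoid units and squared-error cost, reduced to a single
// weight layer. Names are hypothetical, not Mahout's actual implementation.
public class PartialGradientSketch {

    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // Feed-forward for one layer: out_i = sigmoid(sum_j w[i][j] * in[j])
    static double[] forward(double[][] w, double[] in) {
        double[] out = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            double s = 0.0;
            for (int j = 0; j < in.length; j++) {
                s += w[i][j] * in[j];
            }
            out[i] = sigmoid(s);
        }
        return out;
    }

    // Output-layer delta for one instance m:
    // delta_j^(m) = -o_j^(m) * (1 - o_j^(m)) * (t_j^(m) - o_j^(m))
    static double[] outputDelta(double[] o, double[] t) {
        double[] d = new double[o.length];
        for (int j = 0; j < o.length; j++) {
            d[j] = -o[j] * (1 - o[j]) * (t[j] - o[j]);
        }
        return d;
    }

    // Each mapper accumulates delta_j over its partition; a reducer would
    // then sum the k partial results and apply the update
    // Delta w_ij = l * (1/m) * delta_j * o_i, per the formulas above.
    static double[] partialDelta(double[][] w, double[][] inputs, double[][] targets) {
        double[] acc = new double[w.length];
        for (int m = 0; m < inputs.length; m++) {
            double[] o = forward(w, inputs[m]);
            double[] d = outputDelta(o, targets[m]);
            for (int j = 0; j < acc.length; j++) {
                acc[j] += d[j];
            }
        }
        return acc;
    }
}
```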
[jira] [Commented] (MAHOUT-1354) Mahout Support for Hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843226#comment-13843226 ] Gokhan Capan commented on MAHOUT-1354: -- Yeah, I agree. Mahout Support for Hadoop 2 Key: MAHOUT-1354 URL: https://issues.apache.org/jira/browse/MAHOUT-1354 Project: Mahout Issue Type: Improvement Affects Versions: 0.8 Reporter: Suneel Marthi Assignee: Suneel Marthi Fix For: 1.0 Attachments: MAHOUT-1354_initial.patch Mahout support for Hadoop 2, now that Hadoop 2 is official. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
Jenkins build is back to normal : Mahout-Examples-Cluster-Reuters-II #689
See https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters-II/689/changes
[jira] [Created] (MAHOUT-1375) Apache Mahout
kaan can created MAHOUT-1375: Summary: Apache Mahout Key: MAHOUT-1375 URL: https://issues.apache.org/jira/browse/MAHOUT-1375 Project: Mahout Issue Type: Bug Reporter: kaan can Hello, Firstly, thank you for taking the time to read my letter! My questions are: 1) Which tools are used in Carrot2? 2) Is Carrot2 suitable for supervised or unsupervised learning? 3) Which preprocessing methods and tools are in Carrot2? Kind regards -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (MAHOUT-1375) Apache Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843651#comment-13843651 ] Suneel Marthi commented on MAHOUT-1375: --- Is this about Carrot2? This should be discussed on the Carrot2 forums then. Apache Mahout - Key: MAHOUT-1375 URL: https://issues.apache.org/jira/browse/MAHOUT-1375 Project: Mahout Issue Type: Bug Reporter: kaan can Hello, Firstly, thank you for taking the time to read my letter! My questions are: 1) Which tools are used in Carrot2? 2) Is Carrot2 suitable for supervised or unsupervised learning? 3) Which preprocessing methods and tools are in Carrot2? Kind regards -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (MAHOUT-1375) Apache Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843673#comment-13843673 ] kaan can commented on MAHOUT-1375: -- Sorry, I messed up. 1) Which tools are used in Apache Mahout? 2) Is Apache Mahout suitable for supervised or unsupervised learning? 3) Which preprocessing methods and tools are in Apache Mahout? Apache Mahout - Key: MAHOUT-1375 URL: https://issues.apache.org/jira/browse/MAHOUT-1375 Project: Mahout Issue Type: Bug Reporter: kaan can Hello, Firstly, thank you for taking the time to read my letter! My questions are: 1) Which tools are used in Carrot2? 2) Is Carrot2 suitable for supervised or unsupervised learning? 3) Which preprocessing methods and tools are in Carrot2? Kind regards -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (MAHOUT-1375) Apache Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843695#comment-13843695 ] Suneel Marthi commented on MAHOUT-1375: --- This seems more like a question that should have been posted to the user@ mailing list. Please post your question to the mailing lists. Apache Mahout - Key: MAHOUT-1375 URL: https://issues.apache.org/jira/browse/MAHOUT-1375 Project: Mahout Issue Type: Bug Reporter: kaan can Hello, Firstly, thank you for taking the time to read my letter! My questions are: 1) Which tools are used in Carrot2? 2) Is Carrot2 suitable for supervised or unsupervised learning? 3) Which preprocessing methods and tools are in Carrot2? Kind regards -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (MAHOUT-1371) Arff loader can misinterpret nominals with integer, real or string
[ https://issues.apache.org/jira/browse/MAHOUT-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mansur updated MAHOUT-1371: --- Attachment: (was: MAHOUT-1371.patch) Arff loader can misinterpret nominals with integer, real or string --- Key: MAHOUT-1371 URL: https://issues.apache.org/jira/browse/MAHOUT-1371 Project: Mahout Issue Type: Bug Components: Integration Affects Versions: 0.9 Environment: all Reporter: mansur Labels: ARFF Fix For: 0.9 Attachments: MAHOUT-1371.patch If the nominal values contain a value like integer, real or string, it will be misinterpreted as that type instead of as nominal. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
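A hypothetical illustration of the parsing pitfall this issue describes (the names below are not Mahout's actual ARFF loader code): if the attribute-type token is keyword-matched before checking for a nominal value list, a declaration like `@attribute size {small, integer, large}` is read as an integer attribute. Checking for the braces first avoids this.

```java
// Hypothetical sketch of the ARFF attribute-type pitfall described above;
// not Mahout's actual loader code.
public class ArffTypeSketch {

    enum AttrType { NUMERIC, STRING, NOMINAL }

    // Buggy variant: keyword matching runs before the nominal check, so a
    // nominal spec containing the word "integer" is misread as numeric.
    static AttrType parseBuggy(String typeSpec) {
        String t = typeSpec.toLowerCase();
        if (t.contains("integer") || t.contains("real") || t.contains("numeric")) {
            return AttrType.NUMERIC;
        }
        if (t.contains("string")) {
            return AttrType.STRING;
        }
        return AttrType.NOMINAL;
    }

    // Fixed variant: a brace-delimited value list is always nominal,
    // regardless of which words appear inside it.
    static AttrType parseFixed(String typeSpec) {
        String t = typeSpec.trim();
        if (t.startsWith("{")) {
            return AttrType.NOMINAL;
        }
        return parseBuggy(t);
    }
}
```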
[jira] [Updated] (MAHOUT-1371) Arff loader can misinterpret nominals with integer, real or string
[ https://issues.apache.org/jira/browse/MAHOUT-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mansur updated MAHOUT-1371: --- Attachment: MAHOUT-1371.patch Unit tests written and passing. Arff loader can misinterpret nominals with integer, real or string --- Key: MAHOUT-1371 URL: https://issues.apache.org/jira/browse/MAHOUT-1371 Project: Mahout Issue Type: Bug Components: Integration Affects Versions: 0.9 Environment: all Reporter: mansur Labels: ARFF Fix For: 0.9 Attachments: MAHOUT-1371.patch If the nominal values contain a value like integer, real or string, it will be misinterpreted as that type instead of as nominal. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (MAHOUT-1371) Arff loader can misinterpret nominals with integer, real or string
[ https://issues.apache.org/jira/browse/MAHOUT-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mansur updated MAHOUT-1371: --- Status: Patch Available (was: Open) Arff loader can misinterpret nominals with integer, real or string --- Key: MAHOUT-1371 URL: https://issues.apache.org/jira/browse/MAHOUT-1371 Project: Mahout Issue Type: Bug Components: Integration Affects Versions: 0.9 Environment: all Reporter: mansur Labels: ARFF Fix For: 0.9 Attachments: MAHOUT-1371.patch If the nominal values contain a value like integer, real or string, it will be misinterpreted as that type instead of as nominal. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (MAHOUT-1371) Arff loader can misinterpret nominals with integer, real or string
[ https://issues.apache.org/jira/browse/MAHOUT-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mansur updated MAHOUT-1371: --- Attachment: (was: MAHOUT-1371.patch) Arff loader can misinterpret nominals with integer, real or string --- Key: MAHOUT-1371 URL: https://issues.apache.org/jira/browse/MAHOUT-1371 Project: Mahout Issue Type: Bug Components: Integration Affects Versions: 0.9 Environment: all Reporter: mansur Labels: ARFF Fix For: 0.9 Attachments: MAHOUT-1371.patch If the nominal values contain a value like integer, real or string, it will be misinterpreted as that type instead of as nominal. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (MAHOUT-1371) Arff loader can misinterpret nominals with integer, real or string
[ https://issues.apache.org/jira/browse/MAHOUT-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mansur updated MAHOUT-1371: --- Attachment: MAHOUT-1371.patch Arff loader can misinterpret nominals with integer, real or string --- Key: MAHOUT-1371 URL: https://issues.apache.org/jira/browse/MAHOUT-1371 Project: Mahout Issue Type: Bug Components: Integration Affects Versions: 0.9 Environment: all Reporter: mansur Labels: ARFF Fix For: 0.9 Attachments: MAHOUT-1371.patch If the nominal values contain a value like integer, real or string, it will be misinterpreted as that type instead of as nominal. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Created] (MAHOUT-1376) when mahout trains data, there is Task Id : attempt_201312031842_0751_m_000000_0, Status : FAILED java.lang.IllegalArgumentException
wangqiaoshi created MAHOUT-1376: --- Summary: when mahout trains data, there is Task Id : attempt_201312031842_0751_m_000000_0, Status : FAILED java.lang.IllegalArgumentException Key: MAHOUT-1376 URL: https://issues.apache.org/jira/browse/MAHOUT-1376 Project: Mahout Issue Type: Bug Components: Classification Affects Versions: 0.8 Environment: Hadoop 1.0.3, Mahout 0.8 Reporter: wangqiaoshi Fix For: 0.8 vm001:/usr/local/hadoop/mahout-distribution-0.8 # ./bin/mahout trainnb -i /tmp/mahout-work-root/20news-train-vectors -el -o /tmp/mahout-work-root/model -li /tmp/mahout-work-root/labelindex -ow -c Running on hadoop, using /usr/local/hadoop/hadoop-0.20.2/bin/hadoop and HADOOP_CONF_DIR= MAHOUT-JOB: /usr/local/hadoop/mahout-distribution-0.8/mahout-examples-0.8-job.jar 13/12/10 10:29:56 WARN driver.MahoutDriver: No trainnb.props found on classpath, will use command-line arguments only 13/12/10 10:29:56 INFO common.AbstractJob: Command line arguments: {--alphaI=[1.0], --endPhase=[2147483647], --extractLabels=null, --input=[/tmp/mahout-work-root/20news-train-vectors], --labelIndex=[/tmp/mahout-work-root/labelindex], --output=[/tmp/mahout-work-root/model], --overwrite=null, --startPhase=[0], --tempDir=[temp], --trainComplementary=null} 13/12/10 10:29:56 INFO common.HadoopUtil: Deleting temp 13/12/10 10:29:57 INFO util.NativeCodeLoader: Loaded the native-hadoop library 13/12/10 10:29:57 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library 13/12/10 10:29:57 INFO compress.CodecPool: Got brand-new decompressor 13/12/10 10:30:00 INFO input.FileInputFormat: Total input paths to process : 1 13/12/10 10:30:01 INFO mapred.JobClient: Running job: job_201312031842_0750 13/12/10 10:30:02 INFO mapred.JobClient: map 0% reduce 0% 13/12/10 10:30:18 INFO mapred.JobClient: map 100% reduce 0% 13/12/10 10:30:30 INFO mapred.JobClient: map 100% reduce 100% 13/12/10 10:30:35 INFO mapred.JobClient: Job complete: job_201312031842_0750 13/12/10 10:30:35 INFO mapred.JobClient: Counters: 
29 13/12/10 10:30:35 INFO mapred.JobClient: Job Counters 13/12/10 10:30:35 INFO mapred.JobClient: Launched reduce tasks=1 13/12/10 10:30:35 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=12445 13/12/10 10:30:35 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 13/12/10 10:30:35 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 13/12/10 10:30:35 INFO mapred.JobClient: Rack-local map tasks=1 13/12/10 10:30:35 INFO mapred.JobClient: Launched map tasks=1 13/12/10 10:30:35 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=10355 13/12/10 10:30:35 INFO mapred.JobClient: File Output Format Counters 13/12/10 10:30:35 INFO mapred.JobClient: Bytes Written=97 13/12/10 10:30:35 INFO mapred.JobClient: FileSystemCounters 13/12/10 10:30:35 INFO mapred.JobClient: FILE_BYTES_READ=119 13/12/10 10:30:35 INFO mapred.JobClient: HDFS_BYTES_READ=270 13/12/10 10:30:35 INFO mapred.JobClient: FILE_BYTES_WRITTEN=45827 13/12/10 10:30:35 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=97 13/12/10 10:30:35 INFO mapred.JobClient: File Input Format Counters 13/12/10 10:30:35 INFO mapred.JobClient: Bytes Read=133 13/12/10 10:30:35 INFO mapred.JobClient: Map-Reduce Framework 13/12/10 10:30:35 INFO mapred.JobClient: Map output materialized bytes=14 13/12/10 10:30:35 INFO mapred.JobClient: Map input records=0 13/12/10 10:30:35 INFO mapred.JobClient: Reduce shuffle bytes=0 13/12/10 10:30:35 INFO mapred.JobClient: Spilled Records=0 13/12/10 10:30:35 INFO mapred.JobClient: Map output bytes=0 13/12/10 10:30:35 INFO mapred.JobClient: CPU time spent (ms)=2080 13/12/10 10:30:35 INFO mapred.JobClient: Total committed heap usage (bytes)=1016594432 13/12/10 10:30:35 INFO mapred.JobClient: Combine input records=0 13/12/10 10:30:35 INFO mapred.JobClient: SPLIT_RAW_BYTES=137 13/12/10 10:30:35 INFO mapred.JobClient: Reduce input records=0 13/12/10 10:30:35 INFO mapred.JobClient: Reduce input groups=0 13/12/10 10:30:35 INFO mapred.JobClient: Combine 
output records=0 13/12/10 10:30:35 INFO mapred.JobClient: Physical memory (bytes) snapshot=313008128 13/12/10 10:30:35 INFO mapred.JobClient: Reduce output records=0 13/12/10 10:30:35 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2980098048 13/12/10 10:30:35 INFO mapred.JobClient: Map output records=0 13/12/10 10:30:38 INFO input.FileInputFormat: Total input paths to process : 1 13/12/10 10:30:38 INFO mapred.JobClient: Running job: job_201312031842_0751 13/12/10 10:30:39 INFO mapred.JobClient: map 0% reduce 0% 13/12/10 10:30:55 INFO mapred.JobClient: Task Id : attempt_201312031842_0751_m_000000_0, Status : FAILED java.lang.IllegalArgumentException at