OK, I think I got it. The problem was that I wasn't naming the files properly. If I'm not mistaken, I'll need to organize my training data with one file per label:

-bash-3.2$ hadoop dfs -lsr /user/rfcompton/emotion-training-labeled/
-rw-r--r--   3 rfcompton hadoop    2896850 2013-04-11 16:23 /user/rfcompton/emotion-training-labeled/ANGER_RAGE
-rw-r--r--   3 rfcompton hadoop    3239449 2013-04-11 16:24 /user/rfcompton/emotion-training-labeled/JOY
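In case it helps anyone searching later, here's a rough sketch (file and directory names are made up for illustration, not my actual job) of how lines in the part-m-* layout can be split into that one-file-per-label layout. Since each training line already starts with its label, awk can route every line into a file named after the first token:

```shell
# Illustrative sample input: each line starts with its label.
mkdir -p emotion-training-labeled
printf 'JOY actually turning decent new year\nANGER_RAGE give up life\nJOY good sober New Years Eve\n' > part-m-00000

# Route each line into a file named after its first token (the label),
# producing one file per label as trainclassifier expects.
awk '{ print > ("emotion-training-labeled/" $1) }' part-m-00000

# Then push the directory to HDFS, e.g.:
# hadoop dfs -put emotion-training-labeled /user/rfcompton/emotion-training-labeled
```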
The contents of /user/rfcompton/emotion-training-labeled/JOY look like:

JOY actually turning decent new year ☺
JOY best New Years tonight! ready 2013. <U+1F609> <U+1F38A><U+1F389>
...

On Thu, Apr 11, 2013 at 4:02 PM, Ryan Compton <compton.r...@gmail.com> wrote:
> Also, right before the screen dump I see:
>
> 13/04/11 15:46:40 INFO mapred.JobClient: Combine output records=462236
> 13/04/11 15:46:40 INFO mapred.JobClient: Physical memory (bytes) snapshot=1618497536
> 13/04/11 15:46:40 INFO mapred.JobClient: Reduce output records=419058
> 13/04/11 15:46:40 INFO mapred.JobClient: Virtual memory (bytes) snapshot=4697526272
> 13/04/11 15:46:40 INFO mapred.JobClient: Map output records=702535
> 13/04/11 15:46:40 INFO cbayes.CBayesDriver: Calculating Tf-Idf...
> 13/04/11 15:46:41 INFO common.BayesTfIdfDriver: Counts of documents in Each Label
> 13/04/11 15:46:42 INFO common.BayesTfIdfDriver: {ANGER_RAGE family's personal fucking bank.=1.0, ANGER_RAGE give up life...=1.0, ANGER_RAGE understand peopleS=1.0, ANGER_RAGE many episodes record day?5=1.0, ANGER_RAGE! need punching bag take out angerC=1.0, ANGER_RAGE& right now�� insults make laugh.A=1.0, ANGER_RAGEunny a
>
> On Thu, Apr 11, 2013 at 3:58 PM, Ryan Compton <compton.r...@gmail.com> wrote:
>> I'm trying to train a simple text classifier using cbayes. I've got
>> formatted <Text,Text> sequence files created with
>> com.twitter.elephantbird.pig.store.SequenceFileStorage(), e.g.:
>>
>> JOY actually turning decent new year ☺
>> JOY best New Years tonight! ready 2013. <U+1F609> <U+1F38A><U+1F389>
>> JOY playing Dream League Soccer iPad 2 earned 13 coins!
>> JOY Great way start new ear
>> JOY good sober New Years Eve
>> ANGER_RAGE Last night frank hasn't done revision prelims
>> ANGER_RAGE hell cut forehead such ball ache! Cheers pleb chucks glass bottles around!
>> ANGER_RAGE shops open today customer services shut apparently being paid "come back tomorrow".
>>
>> These are stored in a directory as:
>>
>> /emotion-training-labeled/part-m-0000*
>>
>> I pass the labeled data into cbayes:
>>
>> mahout trainclassifier -i /emotion-training-labeled/ -o emotion-model/ -type cbayes -ng 1 -source hdfs
>>
>> Both map and reduce get to 100%, then I see something about Tf-Idf, followed by what looks like a complete dump of my training data printed to the screen for the next few minutes, and then a stack trace:
>>
>> rything life teach lesson, willing observe learn.” YUP!GJOYB Halbrecht DAN CASTAIC CA found local Videographer. Register FREE:"JOY Palm Read Easy Created WorldJOY=1.0, ANGER_RAGE people fisty latelyK=1.0, ANGER_RAGE ew gon lot em ��=1.0, ANGER_RAGE ain't gonna love =1.0}
>> 13/04/11 15:46:51 INFO common.BayesTfIdfDriver: {dataSource=hdfs, alpha_i=1.0, minDf=1, gramSize=1}
>> 13/04/11 15:46:51 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
>> 13/04/11 15:46:57 INFO mapred.FileInputFormat: Total input paths to process : 3
>> 13/04/11 15:46:58 INFO mapred.JobClient: Cleaning up the staging area hdfs://master/user/rfcompton/.staging/job_201303271312_2786
>> 13/04/11 15:46:58 ERROR security.UserGroupInformation: PriviledgedActionException as:rfcompton (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.io.IOException: Exceeded max jobconf size: 10706309 limit: 5242880
>>     at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3766)
>>     at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
>>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
>>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
>>     at java.security.AccessController.doPrivileged(Native Method)
>>     at javax.security.auth.Subject.doAs(Subject.java:396)
>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
>>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)
>> Caused by: java.io.IOException: Exceeded max jobconf size: 10706309 limit: 5242880
>>     at org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:406)
>>     at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3764)
>>     ... 10 more
>>
>> Exception in thread "main" org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.io.IOException: Exceeded max jobconf size: 10706309 limit: 5242880
>>     at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3766)
>>     at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
>>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
>>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
>>     at java.security.AccessController.doPrivileged(Native Method)
>>     at javax.security.auth.Subject.doAs(Subject.java:396)
>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
>>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)
>> Caused by: java.io.IOException: Exceeded max jobconf size: 10706309 limit: 5242880
>>     at org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:406)
>>     at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3764)
>>     ... 10 more
>>
>>     at org.apache.hadoop.ipc.Client.call(Client.java:1107)
>>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
>>     at org.apache.hadoop.mapred.$Proxy1.submitJob(Unknown Source)
>>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:904)
>>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
>>     at java.security.AccessController.doPrivileged(Native Method)
>>     at javax.security.auth.Subject.doAs(Subject.java:396)
>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
>>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
>>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:807)
>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1242)
>>     at org.apache.mahout.classifier.bayes.mapreduce.common.BayesTfIdfDriver.runJob(BayesTfIdfDriver.java:97)
>>     at org.apache.mahout.classifier.bayes.mapreduce.cbayes.CBayesDriver.runJob(CBayesDriver.java:51)
>>     at org.apache.mahout.classifier.bayes.TrainClassifier.trainCNaiveBayes(TrainClassifier.java:58)
>>     at org.apache.mahout.classifier.bayes.TrainClassifier.main(TrainClassifier.java:151)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
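P.S. For anyone who lands here via the "Exceeded max jobconf size" error in the trace above: the log lines suggest what was happening in my case — with the part-m-* layout, every training line was being treated as its own label, and the huge label set blew past the 5242880-byte job configuration limit, so the real fix was the per-label file layout. If you legitimately need a bigger jobconf, my understanding (hedged — check your Hadoop version's docs) is that the limit is the JobTracker-side property mapred.user.jobconf.limit, which can be raised in mapred-site.xml, e.g.:

```xml
<!-- mapred-site.xml on the JobTracker; value is in bytes.
     Property name is my best reading of the Hadoop 1.x sources, not something
     I have verified on this cluster. -->
<property>
  <name>mapred.user.jobconf.limit</name>
  <value>20971520</value> <!-- raise the ~5 MB default to 20 MB -->
</property>
```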