Also, right before the screen dump I see:

13/04/11 15:46:40 INFO mapred.JobClient: Combine output records=462236
13/04/11 15:46:40 INFO mapred.JobClient: Physical memory (bytes) snapshot=1618497536
13/04/11 15:46:40 INFO mapred.JobClient: Reduce output records=419058
13/04/11 15:46:40 INFO mapred.JobClient: Virtual memory (bytes) snapshot=4697526272
13/04/11 15:46:40 INFO mapred.JobClient: Map output records=702535
13/04/11 15:46:40 INFO cbayes.CBayesDriver: Calculating Tf-Idf...
13/04/11 15:46:41 INFO common.BayesTfIdfDriver: Counts of documents in Each Label
13/04/11 15:46:42 INFO common.BayesTfIdfDriver: {ANGER_RAGE family's personal fucking bank.=1.0, ANGER_RAGE give up life...=1.0, ANGER_RAGE understand peopleS=1.0, ANGER_RAGE many episodes record day?5=1.0, ANGER_RAGE! need punching bag take out angerC=1.0, ANGER_RAGE& right now�� insults make laugh.A=1.0, ANGER_RAGEunny a
On Thu, Apr 11, 2013 at 3:58 PM, Ryan Compton <compton.r...@gmail.com> wrote:
> I'm trying to train a simple text classifier using cbayes. I've got
> formatted <Text,Text> sequence files created with
> com.twitter.elephantbird.pig.store.SequenceFileStorage(), e.g.:
>
> JOY actually turning decent new year ☺
> JOY best New Years tonight! ready 2013. <U+1F609> <U+1F38A><U+1F389>
> JOY playing Dream League Soccer iPad 2 earned 13 coins!
> JOY Great way start new ear
> JOY good sober New Years Eve
> ANGER_RAGE Last night frank hasn't done revision prelims
> ANGER_RAGE hell cut forehead such ball ache! Cheers pleb chucks glass bottles around!
> ANGER_RAGE shops open today customer services shut apparently being paid "come back tomorrow".
>
> These are stored in a directory as:
> /emotion-training-labeled/part-m-0000*
>
> I pass the labeled data into cbayes:
>
> mahout trainclassifier -i /emotion-training-labeled/ -o emotion-model/ -type cbayes -ng 1 -source hdfs
>
> Both map and reduce get to 100%, then I see something about Tf-Idf,
> followed by what looks like a complete dump of my training data printed
> to the screen for the next few minutes, and then a stack trace:
>
> rything life teach lesson, willing observe learn.” YUP!GJOYB Halbrecht
> DAN CASTAIC CA found local Videographer. Register FREE:"JOY Palm Read
> Easy Created WorldJOY=1.0, ANGER_RAGE people fisty latelyK=1.0,
> ANGER_RAGE ew gon lot em ��=1.0, ANGER_RAGE ain't gonna love =1.0}
> 13/04/11 15:46:51 INFO common.BayesTfIdfDriver: {dataSource=hdfs, alpha_i=1.0, minDf=1, gramSize=1}
> 13/04/11 15:46:51 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 13/04/11 15:46:57 INFO mapred.FileInputFormat: Total input paths to process : 3
> 13/04/11 15:46:58 INFO mapred.JobClient: Cleaning up the staging area hdfs://master/user/rfcompton/.staging/job_201303271312_2786
> 13/04/11 15:46:58 ERROR security.UserGroupInformation: PriviledgedActionException as:rfcompton (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.io.IOException: Exceeded max jobconf size: 10706309 limit: 5242880
>     at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3766)
>     at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)
> Caused by: java.io.IOException: Exceeded max jobconf size: 10706309 limit: 5242880
>     at org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:406)
>     at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3764)
>     ... 10 more
>
> Exception in thread "main" org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.io.IOException: Exceeded max jobconf size: 10706309 limit: 5242880
>     at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3766)
>     at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)
> Caused by: java.io.IOException: Exceeded max jobconf size: 10706309 limit: 5242880
>     at org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:406)
>     at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3764)
>     ... 10 more
>
>     at org.apache.hadoop.ipc.Client.call(Client.java:1107)
>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
>     at org.apache.hadoop.mapred.$Proxy1.submitJob(Unknown Source)
>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:904)
>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:807)
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1242)
>     at org.apache.mahout.classifier.bayes.mapreduce.common.BayesTfIdfDriver.runJob(BayesTfIdfDriver.java:97)
>     at org.apache.mahout.classifier.bayes.mapreduce.cbayes.CBayesDriver.runJob(CBayesDriver.java:51)
>     at org.apache.mahout.classifier.bayes.TrainClassifier.trainCNaiveBayes(TrainClassifier.java:58)
>     at org.apache.mahout.classifier.bayes.TrainClassifier.main(TrainClassifier.java:151)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
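
For scale, here is a rough sketch (not part of the job output) that pulls the two byte counts out of the "Exceeded max jobconf size" message in the trace and compares them. The helper name parse_jobconf_error is made up for illustration; the only input is the message text copied from the trace.

```python
import re

# Message text copied verbatim from the stack trace in this thread.
ERROR_LINE = "java.io.IOException: Exceeded max jobconf size: 10706309 limit: 5242880"

MIB = 1024 * 1024


def parse_jobconf_error(line):
    """Return (size_bytes, limit_bytes) from the error message, or None if absent.

    Hypothetical helper for illustration; it only does text matching.
    """
    match = re.search(r"Exceeded max jobconf size: (\d+)\s+limit: (\d+)", line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


size, limit = parse_jobconf_error(ERROR_LINE)
print(f"jobconf size: {size / MIB:.2f} MiB")
print(f"limit:        {limit / MIB:.2f} MiB")
print(f"over by:      {size - limit} bytes ({size / limit:.2f}x the limit)")
```

So the submitted job configuration is roughly 10.21 MiB against a 5 MiB limit, about twice what the JobTracker accepts.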