Ok I think I got it.

The problem was that I wasn't naming the files properly. If I'm not
mistaken, I need to organize my training data as one file per label,
with each file named after its label:
-bash-3.2$ hadoop dfs -lsr /user/rfcompton/emotion-training-labeled/
-rw-r--r--   3 rfcompton hadoop    2896850 2013-04-11 16:23 /user/rfcompton/emotion-training-labeled/ANGER_RAGE
-rw-r--r--   3 rfcompton hadoop    3239449 2013-04-11 16:24 /user/rfcompton/emotion-training-labeled/JOY

where the contents of /user/rfcompton/emotion-training-labeled/JOY look like:
JOY      actually turning decent new year ☺
JOY      best New Years tonight! ready 2013. <U+1F609> <U+1F38A><U+1F389>
...
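
For what it's worth, here is a minimal sketch of how labeled lines could be
regrouped into one local file per label before uploading. It is not part of
Mahout: split_by_label and the output directory are hypothetical names, and it
assumes each line is "LABEL<TAB>text" (the separator in the sample above may
not actually be a tab).

```python
import os
from collections import defaultdict


def split_by_label(lines, out_dir):
    """Group 'LABEL<TAB>text' lines by label and write one file per label.

    Hypothetical helper: each output file is named after its label and keeps
    the original lines unchanged. Returns a dict of label -> line count.
    """
    by_label = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        label, sep, text = line.partition("\t")
        # Skip blank lines and lines that lack a label/text pair.
        if not sep or not label.strip() or not text.strip():
            continue
        by_label[label.strip()].append(line)

    os.makedirs(out_dir, exist_ok=True)
    for label, rows in by_label.items():
        path = os.path.join(out_dir, label)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("\n".join(rows) + "\n")
    return {label: len(rows) for label, rows in by_label.items()}


if __name__ == "__main__":
    sample = [
        "JOY\tgood sober New Years Eve\n",
        "ANGER_RAGE\tpeople fisty lately\n",
    ]
    # "emotion-training-labeled" is an assumed local directory name.
    print(split_by_label(sample, "emotion-training-labeled"))
```

The returned counts give a quick check that only the expected labels (here JOY
and ANGER_RAGE) were found. The resulting files could then be copied into the
HDFS training directory, e.g. with hadoop fs -put.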



On Thu, Apr 11, 2013 at 4:02 PM, Ryan Compton <compton.r...@gmail.com> wrote:
> Also, right before the screen dump I see:
>
> 13/04/11 15:46:40 INFO mapred.JobClient:     Combine output records=462236
> 13/04/11 15:46:40 INFO mapred.JobClient:     Physical memory (bytes) snapshot=1618497536
> 13/04/11 15:46:40 INFO mapred.JobClient:     Reduce output records=419058
> 13/04/11 15:46:40 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=4697526272
> 13/04/11 15:46:40 INFO mapred.JobClient:     Map output records=702535
> 13/04/11 15:46:40 INFO cbayes.CBayesDriver: Calculating Tf-Idf...
> 13/04/11 15:46:41 INFO common.BayesTfIdfDriver: Counts of documents in Each Label
> 13/04/11 15:46:42 INFO common.BayesTfIdfDriver: {ANGER_RAGE  family's
> personal fucking bank.=1.0, ANGER_RAGE give up life...=1.0, ANGER_RAGE
> understand peopleS=1.0, ANGER_RAGE many episodes record day?5=1.0,
> ANGER_RAGE! need punching bag take out angerC=1.0, ANGER_RAGE& right
> now�� insults make laugh.A=1.0, ANGER_RAGEunny a
>
> On Thu, Apr 11, 2013 at 3:58 PM, Ryan Compton <compton.r...@gmail.com> wrote:
>> I'm trying to train a simple text classifier using cbayes. I've got
>> <Text,Text> sequence files created with
>> com.twitter.elephantbird.pig.store.SequenceFileStorage(), e.g.:
>>
>> JOY      actually turning decent new year ☺
>> JOY      best New Years tonight! ready 2013. <U+1F609> <U+1F38A><U+1F389>
>> JOY      playing Dream League Soccer iPad 2 earned 13 coins!
>> JOY      Great way start new ear
>> JOY      good sober New Years Eve
>> ANGER_RAGE       Last night frank hasn't done revision prelims
>> ANGER_RAGE       hell cut forehead such ball ache! Cheers pleb chucks glass bottles around!
>> ANGER_RAGE       shops open today customer services shut apparently being paid "come back tomorrow".
>>
>> These are stored in a directory as:
>> /emotion-training-labeled/part-m-0000*
>>
>> I pass the labeled data into cbayes:
>>
>> mahout trainclassifier -i /emotion-training-labeled/ -o emotion-model/ -type cbayes -ng 1 -source hdfs
>>
>> Both map and reduce get to 100%, then I see something about Tf-Idf,
>> followed by what looks like a complete dump of my training data printed
>> to the screen for the next few minutes, and then a stack trace:
>>
>> rything life teach lesson, willing observe learn.” YUP!GJOYB Halbrecht
>> DAN CASTAIC CA found local Videographer. Register FREE:"JOY Palm Read
>> Easy Created WorldJOY=1.0, ANGER_RAGE people fisty latelyK=1.0,
>> ANGER_RAGE ew gon lot em ��=1.0, ANGER_RAGE ain't gonna love =1.0}
>> 13/04/11 15:46:51 INFO common.BayesTfIdfDriver: {dataSource=hdfs, alpha_i=1.0, minDf=1, gramSize=1}
>> 13/04/11 15:46:51 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
>> 13/04/11 15:46:57 INFO mapred.FileInputFormat: Total input paths to process : 3
>> 13/04/11 15:46:58 INFO mapred.JobClient: Cleaning up the staging area hdfs://master/user/rfcompton/.staging/job_201303271312_2786
>> 13/04/11 15:46:58 ERROR security.UserGroupInformation: PriviledgedActionException as:rfcompton (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.io.IOException: Exceeded max jobconf size: 10706309 limit: 5242880
>>         at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3766)
>>         at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
>>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
>>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at javax.security.auth.Subject.doAs(Subject.java:396)
>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
>>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)
>> Caused by: java.io.IOException: Exceeded max jobconf size: 10706309 limit: 5242880
>>         at org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:406)
>>         at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3764)
>>         ... 10 more
>>
>> Exception in thread "main" org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.io.IOException: Exceeded max jobconf size: 10706309 limit: 5242880
>>         at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3766)
>>         at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
>>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
>>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at javax.security.auth.Subject.doAs(Subject.java:396)
>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
>>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)
>> Caused by: java.io.IOException: Exceeded max jobconf size: 10706309 limit: 5242880
>>         at org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:406)
>>         at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3764)
>>         ... 10 more
>>
>>         at org.apache.hadoop.ipc.Client.call(Client.java:1107)
>>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
>>         at org.apache.hadoop.mapred.$Proxy1.submitJob(Unknown Source)
>>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:904)
>>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at javax.security.auth.Subject.doAs(Subject.java:396)
>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
>>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
>>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:807)
>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1242)
>>         at org.apache.mahout.classifier.bayes.mapreduce.common.BayesTfIdfDriver.runJob(BayesTfIdfDriver.java:97)
>>         at org.apache.mahout.classifier.bayes.mapreduce.cbayes.CBayesDriver.runJob(CBayesDriver.java:51)
>>         at org.apache.mahout.classifier.bayes.TrainClassifier.trainCNaiveBayes(TrainClassifier.java:58)
>>         at org.apache.mahout.classifier.bayes.TrainClassifier.main(TrainClassifier.java:151)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>         at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>         at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
