Re: Options in TrainClassifier.java

2010-10-07 Thread Gangadhar Nittala
Ted, I've added the patch MAHOUT-509_1.patch in Jira [ https://issues.apache.org/jira/browse/MAHOUT-509 ] . Thank you On Thu, Oct 7, 2010 at 12:57 PM, Ted Dunning wrote: > Can you attach the patch there?  The mailing list strips attachments. > > On Wed, Oct 6, 2010 at 9:22 PM, Gangadhar Nittala

Re: Options in TrainClassifier.java

2010-10-07 Thread Ted Dunning
Can you attach the patch there? The mailing list strips attachments. On Wed, Oct 6, 2010 at 9:22 PM, Gangadhar Nittala wrote: > I have attached a patch which has the modified testclassifier.props > and the fix with the parseInt. I think both these belong to > MAHOUT-509 >

Re: Options in TrainClassifier.java

2010-10-06 Thread Gangadhar Nittala
Joe / others, I was finally able to test the changes that were done as part of MAHOUT-509[ https://issues.apache.org/jira/browse/MAHOUT-509] and follow the instructions in the wiki for the Bayes example [ https://cwiki.apache.org/confluence/display/MAHOUT/Wikipedia+Bayes+Example ]. The instruction

Re: Options in TrainClassifier.java

2010-09-26 Thread Gangadhar Nittala
Joe, I am out of town for this week and won't have access to my machine. I will check this during the weekend and will get back to you. Will follow the steps in the wiki. Thank you On Fri, Sep 24, 2010 at 8:44 AM, Joe Kumar wrote: > Hi Gangadhar, > > I ran TestClassifier with similar parameters.

Re: Options in TrainClassifier.java

2010-09-24 Thread Joe Kumar
Hi Gangadhar, I ran TestClassifier with similar parameters. It didn't take me 2 hrs, though. I have documented the steps that worked for me at https://cwiki.apache.org/confluence/display/MAHOUT/Wikipedia+Bayes+Example Can you please get the patch available at MAHOUT-509 and apply it and then try th

Re: Options in TrainClassifier.java

2010-09-23 Thread Gangadhar Nittala
Joe, Can you let me know what was the command you used to test the classifier ? With the ngrams set to 1 as suggested by Robin, I was able to train the classifier. The command: $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job org.apache.mahout.classifier.bay

Re: Options in TrainClassifier.java

2010-09-20 Thread Ted Dunning
There is a test program called TrainNewsGroups in org.apache.mahout.classifier.sgd in the examples module. I would love to work with you to get better documentation pulled together. On Mon, Sep 20, 2010 at 8:13 PM, Gangadhar Nittala wrote: > Joe, > I will try with the ngram setting of 1 and let

Re: Options in TrainClassifier.java

2010-09-20 Thread Gangadhar Nittala
Joe, I will try with the ngram setting of 1 and let you know how it goes. Robin, the ngram parameter is used to check the number of subsequences of characters, isn't it? Or is it evaluated differently w.r.t. the Bayesian classifier? Ted, like Joe mentioned, if you could point us to some informa

Re: Options in TrainClassifier.java

2010-09-20 Thread Joe Kumar
Robin / Gangadhar, With ngram as 1 and all the countries in the country.txt , the model is getting created without any issues. $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job org.apache.mahout.classifier.bayes.TrainClassifier -ng 1 -i wikipediainput -o wikipediamodel -type bayes -sour

Re: Options in TrainClassifier.java

2010-09-20 Thread Joe Kumar
Robin, Thanks for your tip. Will try it out and post updates. reg, Joe. On Mon, Sep 20, 2010 at 6:31 AM, Robin Anil wrote: > Hi Guys, Sorry about not replying, I see two (possible) problems. 1st. You > need at least 2 countries. otherwise there is no classification. Secondly > ngram =3 is a bit t

Re: Options in TrainClassifier.java

2010-09-20 Thread Robin Anil
Hi Guys, Sorry about not replying. I see two (possible) problems. First, you need at least 2 countries; otherwise there is no classification. Second, ngram = 3 is a bit too high. With Wikipedia this will result in a huge number of features. Why don't you try with one and see? Robin On Mon, Sep 20, 201
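
Robin's point about the gram size can be illustrated with a small sketch. It assumes the ngram setting refers to word-level n-grams up to the given size; it does not use Mahout's own tokenizer or feature mapper, and the class and method names (NgramFeatureCount, distinctNgrams) are hypothetical, chosen only for this illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Mahout: counts the distinct word n-grams
// (sizes 1..maxGram) that a token sequence would produce as features.
class NgramFeatureCount {

    // Collects every contiguous run of 1..maxGram tokens as one feature string.
    static Set<String> distinctNgrams(List<String> tokens, int maxGram) {
        Set<String> features = new LinkedHashSet<>();
        for (int n = 1; n <= maxGram; n++) {
            for (int start = 0; start + n <= tokens.size(); start++) {
                features.add(String.join(" ", tokens.subList(start, start + n)));
            }
        }
        return features;
    }

    public static void main(String[] args) {
        List<String> tokens =
            Arrays.asList("the quick brown fox jumps over the lazy dog".split(" "));
        // Print how the distinct feature count grows with the gram size.
        for (int gramSize = 1; gramSize <= 3; gramSize++) {
            System.out.println("gramSize=" + gramSize + " -> "
                + distinctNgrams(tokens, gramSize).size() + " distinct features");
        }
    }
}
```

On a single short sentence the growth is modest, but over a corpus the size of Wikipedia most longer n-grams are unique, so each increase in gram size can add a very large number of new features.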

Re: Options in TrainClassifier.java

2010-09-19 Thread Joe Kumar
Hi Ted, sure. Will keep digging. About SGD, I don't have an idea of how it works. If there is some documentation / reference / quick summary to read about it, that'll be great. Just saw one reference in https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression. I am assuming w

Re: Options in TrainClassifier.java

2010-09-19 Thread deneche abdelhakim
I don't know if it's related, but I remember getting a similar Exception one year ago when I was working on the implementation of Random Forests. In my case it was caused by SequenceFile.Sorter.merge(). I ended up writing my own merge function because I really didn't need to sort the output. On M

Re: Options in TrainClassifier.java

2010-09-19 Thread Joe Kumar
Gangadhar, Just to eliminate the usual suspects, I am using Mac OSX 10.5.8, Mahout 0.4 (revision 986659), Hadoop 0.20.2, 2GB Mem for Hadoop, 80 GB free space. Commands that I executed: I had issues with my namenode and so did a format using hadoop namenode -format. $MAHOUT_HOME/examples/src/test/

Re: Options in TrainClassifier.java

2010-09-19 Thread Ted Dunning
I am watching these efforts with interest, but have been unable to contribute much to the process. I would encourage Joe and others to keep whittling this problem down so that we can understand what is causing it. In the meantime, I think that the SGD classifiers are close to production quality.

Re: Options in TrainClassifier.java

2010-09-19 Thread Gangadhar Nittala
Joe, Even I tried with reducing the number of countries in the country.txt. That didn't help. And in my case, I was monitoring the disk space and at no time did it reach 0%. So, I am not sure if that is the case. To remove the dependency on the number of countries, I even tried with the subjects.tx

Re: Options in TrainClassifier.java

2010-09-19 Thread Joe Kumar
Gangadhar, I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to just have 1 entry (spain) and used WikipediaDatasetCreatorDriver to create the wikipediainput data set and then ran TrainClassifier and it worked. When I ran TestClassifier as below, I got blank results in the output. $

Options in TrainClassifier.java

2010-09-18 Thread Joe Kumar
Gangadhar, After running TrainClassifier again, the map task just failed with the same exception and I am pretty sure it is an issue with disk space. As the map was progressing, I was monitoring my free disk space dropping from 81GB. It came down to 0 after almost 66% through the map task and then

Re: Options in TrainClassifier.java

2010-09-18 Thread Gangadhar Nittala
Joe, I don't think disk space could be the problem, because I did have enough disk space (well, not 81GB, but around 40GB free). I will check whether the suggestions in the thread you mentioned make any difference. Will keep you posted. Thank you On Fri, Sep 17, 2010 at 11:33 PM, Joe Kuma

Re: Options in TrainClassifier.java

2010-09-17 Thread Joe Kumar
Gangadhar, I couldn't find any concrete reason behind this error. Some people have reported that it happens only sporadically. As per some suggestions in this thread ( http://www.mail-archive.com/core-u...@hadoop.apache.org/msg09250.html) , I have changed the location of hadoop tmp dir. Also I have cle

Re: Options in TrainClassifier.java

2010-09-17 Thread Gangadhar Nittala
Thank you Joe for the confirmation. I am also checking the code to see what is causing this issue. Maybe others on the list will know what can cause this issue. I am guessing the root cause is not Mahout but something in Hadoop. On Thu, Sep 16, 2010 at 11:34 PM, Joe Kumar wrote: > Gangadhar, > >

Re: Options in TrainClassifier.java

2010-09-16 Thread Joe Kumar
Gangadhar, After some system issues, I finally ran the TrainClassifier. After almost 65% into the map job, I got the same error that you have mentioned. INFO mapred.JobClient: Task Id : attempt_201009160819_0002_m_00_0, Status : FAILED org.apache.hadoop.util.DiskChecker$DiskErrorException: Cou

Re: Options in TrainClassifier.java

2010-09-15 Thread Joe Kumar
Hi Gangadhar, right. I did the same to execute the TrainClassifier, but since the default datasource is hdfs, we should not be required to provide this parameter. I haven't completed executing the TrainClassifier yet. I'll do it tonight and let you know if I get into trouble. reg, Joe. On Wed,
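
The behaviour Joe is asking for (an option with a documented default should not be mandatory) can be sketched as below. This is only an illustration under stated assumptions: it uses a plain Map in place of whatever command-line parser TrainClassifier actually uses, the OptionDefaults class and valueOrDefault method are hypothetical, and the default values hdfs and bayes are the ones quoted in this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not Mahout code: resolves an option to its documented
// default when the user did not supply it, instead of rejecting the command.
class OptionDefaults {

    // Returns the supplied value, or defaultValue when the option is absent or empty.
    static String valueOrDefault(Map<String, String> parsed, String name, String defaultValue) {
        String value = parsed.get(name);
        return (value == null || value.isEmpty()) ? defaultValue : value;
    }

    public static void main(String[] args) {
        // Simulates a command line that sets only the gram size.
        Map<String, String> parsed = new HashMap<>();
        parsed.put("gramSize", "3");

        // Defaults as described in the thread: dataSource=hdfs, classifierType=bayes.
        String dataSource = valueOrDefault(parsed, "dataSource", "hdfs");
        String classifierType = valueOrDefault(parsed, "classifierType", "bayes");

        System.out.println("dataSource=" + dataSource + ", classifierType=" + classifierType);
    }
}
```

With this approach an option is marked as required only when no sensible default exists, so the help text and the actual behaviour stay consistent.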

Re: Options in TrainClassifier.java

2010-09-15 Thread Gangadhar Nittala
I ran into the issue that Joe mentioned about the command line parameters. I just added the datasource to the command line to execute thus $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 3 --inp

Re: Options in TrainClassifier.java

2010-09-14 Thread Joe Kumar
Robin, sure. I'll submit a patch. The command line flag already has the default behavior specified. --classifierType (-type) classifierType Type of classifier: bayes|cbayes. Default: bayes --dataSource (-source) dataSource Location of

Re: Options in TrainClassifier.java

2010-09-14 Thread Robin Anil
On Wed, Sep 15, 2010 at 10:26 AM, Joe Kumar wrote: > Hi all, > > As I was going through wikipedia example, I encountered a situation with > TrainClassifier wherein some of the options with default values are > actually > mandatory. > The documentation / command line help says that > > 1. defaul

Options in TrainClassifier.java

2010-09-14 Thread Joe Kumar
Hi all, As I was going through wikipedia example, I encountered a situation with TrainClassifier wherein some of the options with default values are actually mandatory. The documentation / command line help says that 1. default source (--datasource) is hdfs but TrainClassifier has withRequi