Re: Searching more Mahout content

2010-09-30 Thread Alex Baranau
Done: https://issues.apache.org/jira/browse/MAHOUT-514 Alex Baranau On Sat, Sep 25, 2010 at 2:59 AM, Ted Dunning wrote: > That would be fabulous. > > On Fri, Sep 24, 2010 at 6:07 AM, Alex Baranau >wrote: > > > I'd suggest to use the approach discussed (and accepted) at > > https://issues.apach

Re: Text Classification using Mahout

2010-09-30 Thread Neil Ghosh
Hi, I am running the twenty-newsgroups example without Hadoop, with the following command: $ mvn -e exec:java \ -Dexec.mainClass=org.apache.mahout.classifier.bayes.TrainClassifier \ -Dexec.args="-i 20news-input \ -o 20news-model \ -type cbayes \ -ng 1 \ -source hdfs" Every time I run this to crea

Re: Text Classification using Mahout

2010-09-30 Thread Sean Owen
Ignore it, it's just Maven doing its thing in the background. It should work fine without internet connectivity. On Thu, Sep 30, 2010 at 1:54 PM, Neil Ghosh wrote: > Hi, > > I am running the twenty-newsgroups example without hadoop , with the > following command > > $ mvn -e exec:java \ > -Dexec.

Re: kmeans vectors

2010-09-30 Thread Jeff Eastman
Not using the synthetic control jobs. They always run Canopy over the converted data and you need to choose t1 and t2 to get the initial k. Once you have run it once, however, copy the data file from output into another folder. From there you can run k-means or any of the other clustering prog
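To illustrate how the t1 and t2 thresholds determine the initial k, here is a minimal Python sketch of canopy clustering. It is not Mahout's implementation: the function name `canopy_centers`, the Euclidean distance, and the centroid step are assumptions made for illustration only.

```python
import math


def canopy_centers(points, t1, t2):
    """Return one centroid per canopy; the number of centroids is the initial k.

    t1 is the loose threshold (canopy membership) and t2 the tight threshold
    (points this close to a center cannot start a new canopy), so t1 > t2.
    """
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining.pop(0)      # first remaining point starts a canopy
        members = [center]
        survivors = []
        for p in remaining:
            d = math.dist(center, p)   # Euclidean distance (assumption)
            if d < t1:
                members.append(p)      # loosely inside this canopy
            if d >= t2:
                survivors.append(p)    # still eligible to seed another canopy
        remaining = survivors
        dim = len(center)
        centers.append(
            tuple(sum(m[i] for m in members) / len(members) for i in range(dim))
        )
    return centers
```

Smaller thresholds produce more canopies and therefore a larger k; larger thresholds produce fewer.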

Re: Getting error in Training the classifier as in TwentyNewsgroup

2010-09-30 Thread Neil Ghosh
Hi Robin, the scripts below always try to run it on Hadoop. Can I run without Hadoop using these scripts? On Fri, Sep 24, 2010 at 11:59 PM, Robin Anil wrote: > also use the mahout shell script > > bin/mahout trainclassifier > bin/mahout testclassifier > > using the same parameters. > > > Ro

Re: Text Classification using Mahout

2010-09-30 Thread Isabel Drost
On Thu, 30 Sep 2010 Sean Owen wrote: > Ignore it, it's just Maven doing its thing in the background. It > should work fine without internet connectivity. To speed up the build process when you do not have internet connectivity you can pass "-o" on the command line to tell Maven that you are no

unknown test data twenty-newsgroups example

2010-09-30 Thread Neil Ghosh
Hi, In this example https://cwiki.apache.org/MAHOUT/twenty-newsgroups.html the test is done on the already classified input text documents. My question is: if I want to test unknown documents, do I need them in a specific format, or can I just keep them (as raw text) in the input folder while testin

Re: Text Classification using Mahout

2010-09-30 Thread Neil Ghosh
Yes, with the -o option Maven looks to be executing in offline mode. Thanks Isabel and Sean On Thu, Sep 30, 2010 at 7:21 PM, Isabel Drost wrote: > On Thu, 30 Sep 2010 Sean Owen wrote: > > > Ignore it, it's just Maven doing its thing in the background. It > > should work fine without internet conne

Usage of TF-IDF weights in cbayes Mahout

2010-09-30 Thread Neil Ghosh
Does anybody have examples/reference how to use TF-IDF weights in mahout cbayes for particular words and phrases while doing text classification ? -- Thanks and Regards Neil http://neilghosh.com

Re: Usage of TF-IDF weights in cbayes Mahout

2010-09-30 Thread Robin Anil
It does that by default for all words. What else do you have in mind? On Thu, Sep 30, 2010 at 8:07 PM, Neil Ghosh wrote: > Does anybody have examples/reference how to use TF-IDF weights in mahout > cbayes for particular words and phrases while doing text classification ? > > -- > Thanks and Rega

Re: Usage of TF-IDF weights in cbayes Mahout

2010-09-30 Thread Neal Richter
On Thu, Sep 30, 2010 at 8:37 AM, Neil Ghosh wrote: > Does anybody have examples/reference how to use TF-IDF weights in mahout > cbayes for particular words and phrases while doing text classification ? http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf - Neal

Re: Usage of TF-IDF weights in cbayes Mahout

2010-09-30 Thread Neil Ghosh
Thanks for replying, Robin. I am quoting an earlier conversation between Grant and me. Now I want to know how to implement the second problem. To be specific my problem is to classify a piece of text crawled from web into > two classes > > 1.It is a +ve feedback > 2.It is

Re: unknown test data twenty-newsgroups example

2010-09-30 Thread Robin Anil
You may split the dataset in 80/20 or some other ratio and try. You can split them after you have created the data in Bayes classifier format or split it into different folders and make them as described in the documentation. Robin On Thu, Sep 30, 2010 at 7:30 PM, Neil Ghosh wrote: > Hi, > > I
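As a rough illustration of the 80/20 split suggested here, the following Python sketch shuffles a list of documents with a fixed seed and cuts it into training and test portions. The function name `split_train_test`, the default fraction, and the seed are assumptions for illustration; this is not a Mahout utility.

```python
import random


def split_train_test(items, train_fraction=0.8, seed=42):
    """Shuffle items reproducibly and split them into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # fixed seed gives a repeatable split
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Applying the function separately to the documents of each label keeps the class proportions similar in both portions.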

Re: Usage of TF-IDF weights in cbayes Mahout

2010-09-30 Thread Robin Anil
> > >> Or Do I have flexibility to give some other input specific to my problem ? >> Such as if words like "Problem", "Complaint" etc are more likely to appear >> in a text containing grievance. >> >> >> > You can provide a Weight, usually TF-IDF, that often does a good job of >> factoring in the

Re: unknown test data twenty-newsgroups example

2010-09-30 Thread Neil Ghosh
Do you mean I should first create the model with correct data in the correct folder (label), then randomly distribute the raw text files among two folders and generate input data, and then run the tester on the mislabelled data? On Thu, Sep 30, 2010 at 9:37 PM, Robin Anil wrote: > Yo

Re: Usage of TF-IDF weights in cbayes Mahout

2010-09-30 Thread Neil Ghosh
So All I have to do is add an extra file containing LABELproblemcomplaintproblemo Along with the usual training data in Bayes format ? On Thu, Sep 30, 2010 at 9:44 PM, Robin Anil wrote: > >>> Or Do I have flexibility to give some other input specific to my problem >>> ? Such as if words like "

Re: Usage of TF-IDF weights in cbayes Mahout

2010-09-30 Thread Robin Anil
extra file or extra line, duplicated instances (to decrease the weights) or duplicate feature in the same instance to increase the weights (classic tf-idf) Robin On Thu, Sep 30, 2010 at 9:50 PM, Neil Ghosh wrote: > So All I have to do is add an extra file containing > > LABELproblemcomplaintprobl
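To show why repeating a feature inside one instance raises its weight, here is a small Python sketch of a textbook TF-IDF computation (raw term count times log(N/df)). This is an assumption chosen for clarity; it is not the exact weighting that Mahout's cbayes applies, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))            # document frequency: count each term once per doc
    weights = []
    for doc in docs:
        tf = Counter(doc)              # term frequency: repeats raise the count
        weights.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return weights
```

A term that occurs twice in an instance gets twice the weight of a single occurrence, while a term that appears in every document gets weight zero under this formula.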

Re: unknown test data twenty-newsgroups example

2010-09-30 Thread Robin Anil
On Thu, Sep 30, 2010 at 9:45 PM, Neil Ghosh wrote: > > Do you mean , I should 1st create the model with correct data in correct > folder (Label). > > Now you throw an instance at it and you will get the correct label, well most of the time.

Re: kmeans vectors

2010-09-30 Thread Matt Tanquary
Hi Jeff, Thanks for your reply. I just got trunk and started the install. It ended with this error: Error loading supplemental data models: Cannot create file-based resource. org.codehaus.plexus.resource.loader.FileResourceCreationException: Cannot create file-based resource. A lot built, so I

Re: kmeans vectors

2010-09-30 Thread Jeff Eastman
Don't think so. Try "mvn clean install" and let me know what happens. On 9/30/10 12:48 PM, Matt Tanquary wrote: Hi Jeff, Thanks for your reply. I just got trunk and started the install. It ended with this error: Error loading supplemental data models: Cannot create file-based resource. org.co

Re: kmeans vectors

2010-09-30 Thread Matt Tanquary
Thanks. It was a permission issue. I had to change the group owner to the current user's group; it's now building. I moved the build from one server to another (which caused the user sync problem). 2010/9/30 Jeff Eastman : > Don't think so. Try "mvn clean install" and let me know what happens.

Re: Usage of TF-IDF weights in cbayes Mahout

2010-09-30 Thread Ted Dunning
That is exactly what it does. On Thu, Sep 30, 2010 at 8:37 AM, Neal Richter wrote: > On Thu, Sep 30, 2010 at 8:37 AM, Neil Ghosh wrote: > > Does anybody have examples/reference how to use TF-IDF weights in mahout > > cbayes for particular words and phrases while doing text classification ? > >

Re: unknown test data twenty-newsgroups example

2010-09-30 Thread Ted Dunning
A very good practice is to use a data set like this: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz Segregating by date avoids problems with duplicate documents appearing in both training and test. It also gives you a standard split so that you can compare to other people's

Re: Mahout usage

2010-09-30 Thread Ted Dunning
Wow. And 24% planning to use it. On Thu, Sep 30, 2010 at 7:13 AM, Grant Ingersoll wrote: > > http://www.businesswire.com/news/home/20100929005052/en/Karmasphere-Study-Shows-Hadoop-Projects-Start-Skunkworkspegs > Mahout usage at 14% of 102 Hadoop devs surveyed. Granted, it's a small > sample, bu

Re: unknown test data twenty-newsgroups example

2010-09-30 Thread Drew Farris
On Thu, Sep 30, 2010 at 10:00 AM, Neil Ghosh wrote: > > My Question is , If I want to test unknown, documents , do I need it in > specific format ? or just keep them (as raw text ) in the input folder while > testing ? If I interpret your question correctly, you're saying "I've trained my classif

Re: Getting error in Training the classifier as in TwentyNewsgroup

2010-09-30 Thread Robin Anil
No. It's implemented purely using Hadoop M/R. You can choose to run it on a remote cluster, on a local cluster, or in-process (where Hadoop runs an in-process M/R) Robin On Thu, Sep 30, 2010 at 6:54 PM, Neil Ghosh wrote: > Hi Robin , > > The below scripts always try to run it on hadoop , Can I run

Re: kmeans vectors

2010-09-30 Thread Matt Tanquary
I tried to use -k with the syntheticcontrol.kmeans.Job program, but it didn't recognize that argument. On Thu, Sep 30, 2010 at 6:18 AM, Jeff Eastman wrote: >  Not using the synthetic control jobs. They always run Canopy over the > converted data and you need to choose t1 and t2 to get the initial

Computing userSimilarity in Taste AbstractSimilarity

2010-09-30 Thread Abigail Gertner
Hello - I noticed something that I think might be a problem with the userSimilarity computation in the AbstractSimilarity class. After updating the running sums, the method checks the value of compare and moves to the next preference value in the list that has the smaller item index, or both if the

Re: Computing userSimilarity in Taste AbstractSimilarity

2010-09-30 Thread Sean Owen
I think it does work, but this code is definitely hard to grok. In my defense it is complex for a reason at least -- performance. When the end of one list of prefs is reached (the line "if (++xPrefIndex >= xLength)") it does check for an inferrer in the next line. If there is one, it sets "xIndex
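The merge-style walk over two preference lists discussed in this thread can be sketched in Python as follows. This is a simplified illustration, not the AbstractSimilarity code: it assumes each list is sorted by item ID, accumulates Pearson running sums only for items both users rated, and omits the inferrer handling entirely. The function name `pearson_on_overlap` is hypothetical.

```python
import math


def pearson_on_overlap(x_prefs, y_prefs):
    """x_prefs, y_prefs: lists of (item_id, rating) sorted by item_id.

    Returns the Pearson correlation over co-rated items, or None if undefined.
    """
    i = j = n = 0
    sum_x = sum_y = sum_xy = sum_x2 = sum_y2 = 0.0
    while i < len(x_prefs) and j < len(y_prefs):
        x_item, x = x_prefs[i]
        y_item, y = y_prefs[j]
        if x_item == y_item:           # both users rated this item
            n += 1
            sum_x += x
            sum_y += y
            sum_xy += x * y
            sum_x2 += x * x
            sum_y2 += y * y
            i += 1                     # advance both lists
            j += 1
        elif x_item < y_item:
            i += 1                     # advance the list with the smaller item ID
        else:
            j += 1
    if n == 0:
        return None
    var_x = n * sum_x2 - sum_x ** 2
    var_y = n * sum_y2 - sum_y ** 2
    if var_x <= 0 or var_y <= 0:       # constant ratings: correlation undefined
        return None
    return (n * sum_xy - sum_x * sum_y) / math.sqrt(var_x * var_y)
```

Because both lists are sorted, a single pass visits each preference at most once, which is the performance motivation for this style of loop.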

Re: Computing userSimilarity in Taste AbstractSimilarity

2010-09-30 Thread Abigail Gertner
I think I must be looking at an older version of the file. I have mahout-0.3 (the most recent one) downloaded from sourceforge. Maybe it is updated since then in SVN? On 9/30/2010 5:38 PM, Sean Owen wrote: > I think it does work, but this code is definitely hard to grok. In my > defense it is co

Re: Computing userSimilarity in Taste AbstractSimilarity

2010-09-30 Thread Sean Owen
Yep looks like this was added since 0.3. You should definitely follow SVN HEAD in general as things change fast. On Thu, Sep 30, 2010 at 10:53 PM, Abigail Gertner wrote: > I think I must be looking at an older version of the file. I have > mahout-0.3 (the most recent one) downloaded from source

recommendation mechanism

2010-09-30 Thread web service
I have got the group lens example working. Had a couple of doubts though - The dataset in grouplens has movieid, userid and the corresponding ratings. However a rating is meant to rate a movie but there are other things related to a movie to which the rating contributes. For example, the actors, di

Re: recommendation mechanism

2010-09-30 Thread Sebastian Schelter
Hi Mac, Collaborative Filtering algorithms only learn from interaction data (known preferences) and are content agnostic, which means they don't look at the actual content of the items. This might sound awkward and counterintuitive at first glance but it works really well when applied. The relat
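The content-agnostic idea can be illustrated with a toy Python recommender that uses nothing but the user-item ratings. It is a deliberately naive sketch, not one of the Taste algorithms: the scoring rule (overlap count times rating) and the function name `recommend` are assumptions for illustration.

```python
from collections import defaultdict


def recommend(prefs, user, top_n=3):
    """prefs: {user: {item: rating}}. Rank items the user has not rated yet.

    Items are scored only from other users' ratings, weighted by how many
    items each of those users shares with the target user.
    """
    seen = prefs[user]
    scores = defaultdict(float)
    for other, other_prefs in prefs.items():
        if other == user:
            continue
        overlap = len(seen.keys() & other_prefs.keys())
        if overlap == 0:
            continue                   # no shared items, no evidence
        for item, rating in other_prefs.items():
            if item not in seen:
                scores[item] += overlap * rating
    # highest score first; ties broken by item ID for a stable order
    return sorted(scores, key=lambda it: (-scores[it], it))[:top_n]
```

No actor, director, or genre information appears anywhere in the input; any such relationships influence the result only through the patterns in who rated what.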

Re: recommendation mechanism

2010-09-30 Thread Ted Dunning
And if you want to see more about recommendation using side data as well as interaction data, the best reference I know of is Menon and Elkan's recent paper: http://arxiv.org/abs/1006.2156 On Thu, Sep 30, 2010 at 4:45 PM, Sebastian Schelter wrote: > If you just wanna know more about the theory

Re: Computing userSimilarity in Taste AbstractSimilarity

2010-09-30 Thread Ted Dunning
Also, it is best to get the source from apache or a mirror. See SVN URL: http://svn.apache.org/repos/asf/mahout/trunk Apache git mirror of same: git://git.apache.org/mahout.git Github mirror of mahout: http://github.com/apache/mahout On Thu, Sep 30, 2010 at 2:53 PM, Abigail Gertner wrote: >

Share your experience using Mahout in the real-world?

2010-09-30 Thread Joseph Turian
Hi, I'm maybe giving a talk on Mahout, and I wanted to present some case-studies from other people using Mahout in real-world, production scenarios. Would you be interested in sharing concrete stories about using Mahout in industry, etc? You can share the information anonymously, or receive credit

Re: kmeans vectors

2010-09-30 Thread Lahiru Samarakoon
Hi Matt, As Jeff has mentioned earlier, you have to choose t1 and t2 to get the k when you are using the syntheticcontrol.kmeans.Job program. So what you have experienced is correct. Thanks, Lahiru