Re: How to approach this? Classification vs Recommendation

2012-05-18 Thread Ted Dunning
Not so trivially, these classifiers can help each other. What you have is a form of transduction or example-based learning. On Fri, May 18, 2012 at 5:24 PM, Sean Owen wrote: > Trivially it's four classifiers. You have just one input here, and > it's binary. That seems like too little info to dis

Re: How to approach this? Classification vs Recommendation

2012-05-18 Thread Sean Owen
Trivially it's four classifiers. You have just one input here, and it's binary. That seems like too little info to discriminate on. All you can learn -- and it doesn't really need a classifier algorithm -- is there's an x% chance of encountering problem a if funded, and (100-x)% of a if not. On Fr
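
A minimal sketch of the estimate described above: with a single binary input, the chance of problem a in each group can be read off as an observed frequency, with no classifier involved. The records and the function name `problem_rate_by_funding` are hypothetical and not from the thread; each group's rate is computed separately from its own counts.

```python
from collections import Counter

def problem_rate_by_funding(records):
    """Return {funded_flag: fraction of users in that group with the problem}."""
    totals = Counter()  # number of users seen per funded flag
    hits = Counter()    # number of those users who had the problem
    for funded, had_problem in records:
        totals[funded] += 1
        if had_problem:
            hits[funded] += 1
    return {flag: hits[flag] / totals[flag] for flag in totals}

# Hypothetical toy data: (funded, had_problem_a)
records = [
    (True, False), (True, True), (True, False), (True, False),
    (False, True), (False, True), (False, False), (False, True),
]
rates = problem_rate_by_funding(records)
print(rates)
```

Repeating this for problems b, c and d gives the four independent estimates that the "four classifiers" reduce to when funding status is the only input.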

Re: Help with running taste-demo on mahout-examples-0.7-SNAPSHOT.jar

2012-05-18 Thread Sean Owen
Yes it's just like making any other servlet-based app. You can find the servlet (and Axis JWS file if you want) and web.xml in the project. Just put it together with compiled code in a .war file. On Fri, May 18, 2012 at 10:09 PM, Dhananjay Sampath wrote: > Thanks Sean, for the super quick respons

How to approach this? Classification vs Recommendation

2012-05-18 Thread fht
Hi, I suppose this is a combination of a generic machine learning question and a mahout question. I have a data set. A user may or may not be part of a funded scheme. If they are not part of the funded scheme they might be susceptible to certain problems a, b, c and d. If they are part of the fu

Re: Help with running taste-demo on mahout-examples-0.7-SNAPSHOT.jar

2012-05-18 Thread Dhananjay Sampath
Thanks Sean, for the super quick response! I wish I had asked this question 3 days ago! Ok, so I have to package my own recommender. Got it. The only place where I found some instructions on packaging a (custom/example) recommender is Manuel's response to Ben on the User archives ( http://mail-archiv

NoClassDefFoundError calling custom analyzer in seq2sparse

2012-05-18 Thread DAN HELM
Hello, I'm sure variations of this question have been posted before but I'm having trouble using my own custom analyzer with seq2sparse. I'm using the -a parameter to pass my class name. To build the class I basically cloned the concept in Mahout's org.apache.mahout.vectorizer.DefaultAnalyz

Re: Help with running taste-demo on mahout-examples-0.7-SNAPSHOT.jar

2012-05-18 Thread Sean Owen
(You can use -DskipTests in Maven. You don't need to run the very lengthy tests.) The bad news is that this example only worked in 0.5, and was removed in 0.6. The underlying pieces are still there, you just would have to assemble the WAR yourself. I'll try to figure out how to remove this; I did

Help with running taste-demo on mahout-examples-0.7-SNAPSHOT.jar

2012-05-18 Thread Dhananjay Sampath
Hi mahout-devs/mahout-users, I am stuck trying to get the mahout taste app demo working and could really use some help. I know that several people have come before me asking the same question and I have gone through almost all of them and have met with little success. I followed all of th

Re: tokenizer for text

2012-05-18 Thread Jiaan Zeng
very helpful info! Thanks a lot. On Fri, May 18, 2012 at 11:37 AM, John Conwell wrote: > Noise in OCR often manifests itself as a whole bunch of singletons in the > corpus of meaningless terms like "lsdjfdslkfj".  So the minFrequency flag > can help in filtering out these terms. > > Stopwords sho

Re: Judging the quality of clustering

2012-05-18 Thread Pat Ferrel
Thanks Jeff. When I did my experiment it used kmeans for three runs k = 10, 20, 10. Number of documents around 3000 (guessing here). The k=10 run did not prune, k=30 pruned 4 clusters. I'll run this again to see if it is repeatable and you are welcome to the dataset. I read that comment but w

Re: tokenizer for text

2012-05-18 Thread John Conwell
Noise in OCR often manifests itself as a whole bunch of singletons in the corpus of meaningless terms like "lsdjfdslkfj". So the minFrequency flag can help in filtering out these terms. Stopwords should be handled by tfidf. For example the word "the" probably has a high frequency in every docume
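
A rough sketch of the two effects described above, under stated assumptions: the `min_frequency` parameter only imitates the idea of a minimum-frequency cutoff, the IDF is the plain log(N/df) textbook form rather than whatever weighting Mahout applies, and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def filter_rare_terms(docs, min_frequency=2):
    """Drop terms whose total count across the corpus is below min_frequency."""
    counts = Counter(term for doc in docs for term in doc)
    return [[t for t in doc if counts[t] >= min_frequency] for doc in docs]

def idf(docs):
    """Plain inverse document frequency, log(N / df), per term."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / df[term]) for term in df}

# Hypothetical toy corpus with one OCR-style noise token
docs = [
    ["the", "price", "lsdjfdslkfj"],
    ["the", "price", "rose"],
    ["the", "market", "rose"],
]
filtered = filter_rare_terms(docs)
weights = idf(filtered)
print(filtered)
print(weights)
```

The singleton noise token disappears at the frequency cutoff, while "the", which occurs in every document, survives the cutoff but receives an IDF weight of zero.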

LDA, printing Topics

2012-05-18 Thread Simon Handley
I'm trying to understand how LDA prints out the words per topic. If I run the reuters example, the topics are printed out like this: Topic 0 > === > dlrs [p(dlrs|topic_0) = 0.09982075792238235 > mln [p(mln|topic_0) = 0.05160370562850524 > its [p(its|topic_0) = 0.026424106119119467 > earni

Re: tokenizer for text

2012-05-18 Thread Jiaan Zeng
Thanks for the quick reply. Stop word filtering or stemming may not help much, I think. Also, the point of using tf-idf vectors is to deal with high-frequency words, so stop word filtering or stemming seems to run counter to the intent of tf-idf. The problem is that the text has lots of noise (

Re: tokenizer for text

2012-05-18 Thread Baoqiang Cao
In addition, you could try to increase the word occurrence thresholds in -s and -md options. On Fri, May 18, 2012 at 9:41 AM, John Conwell wrote: > What do you have in mind as far as a different tokenizer?  Are you doing > stopword filtering?  Maybe look at the stopword list and see if there are >

Re: tokenizer for text

2012-05-18 Thread John Conwell
What do you have in mind as far as a different tokenizer? Are you doing stopword filtering? Maybe look at the stopword list and see if there are other noise words you wish to add. If you are using Lucene to filter stopwords, its stopword list is pretty small (20 or so words). Stemming is another

tokenizer for text

2012-05-18 Thread Jiaan Zeng
Hi List, I am trying to use Mahout to cluster text. The problem is that after running the procedure SparseVectorsFromSequenceFiles, the dimension of the tf-idf vectors is too high (about 50K) and it increases as the number of documents increases. I think there are two ways to handle that. One is to use