Re: Clustering accuracy

2013-02-06 Thread Ted Dunning
Estimating accuracy this way will almost always give you very poor results. The reason is that unsupervised clustering will draw its own boundaries which are very unlikely to match your own. If you want to make this work you can do a few different things: a) semi-supervised clustering. Include

Question about latent dirichlet allocation

2013-02-06 Thread yunming zhang
Hi, I am trying to get Latent Dirichlet Allocation to work. I was following the instructions on this page: https://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html I have two questions: 1) I want to make sure LDA is a different algorithm than the dirichlet clustering algorithm? I am only a

Re: Mahout 0.8-SNAPSHOT and Hadoop 2.0.2 Alpha

2013-02-06 Thread Ellen Friedman
Bojan, Welcome to Mahout! Thanks for bringing your question to the mailing list. Someone else with more technical experience will hopefully be able to answer. Best wishes, Ellen On Wed, Feb 6, 2013 at 4:50 PM, Bojan Kostić wrote: > Hallo, my first post and i hope not last. > For some time i pl

Mahout 0.8-SNAPSHOT and Hadoop 2.0.2 Alpha

2013-02-06 Thread Bojan Kostić
Hello, this is my first post, and I hope not my last. I have been playing with Hadoop and Mahout for some time and am still learning. I would like to set up Mahout 0.8-SNAPSHOT and Hadoop 2.0.2 Alpha to work together. I read on the dev mailing list that Marty Kube built Mahout against Hadoop 2.0.2. Has anyone else tried? And if y

How to dump/interpret CVB output

2013-02-06 Thread keeyong han
Hello there, After some struggle, I managed to run cvb successfully. But I found that dumping the output isn't much easier either. I tried to dump some keywords per cluster by running the following command: mahout vectordump -i [final_state_output_directory_used_in_cvb_run] -o [output_file_pat

Re: Using IDF in CF recommender

2013-02-06 Thread Paulo Villegas
> The effect of downweighting the popular items is very similar to > removing them from recommendations so I still suspect precision will > go down using IDF. Obviously this can pretty easily be tested, I just > wondered if anyone had already done it. > > This brings up a problem with holdout bas

Re: How to segment seq2sparse output into predefined training set and test set?

2013-02-06 Thread Adam Baron
Thanks for the advice. I tried out seq2encoded and that addressed my issue of making the training set and test set use the same feature indices for the same words. However, I'm a little disappointed there is no dictionary file produced by seq2encoded. It would be nice to understand which word(s)

Re: Regarding mahout clustering algorithms

2013-02-06 Thread Jeff Eastman
Note that the old clustering algorithms also run without Hadoop in sequential execution mode from the local file system. On 2/6/13 11:04 AM, Tanguy tlrx wrote: Thanks! -- Tanguy 2013/2/6 Ted Dunning https://github.com/tdunning/knn/ especially the docs directory On Wed, Feb 6, 2013 at 7:5

Re: Using IDF in CF recommender

2013-02-06 Thread Pat Ferrel
The effect of downweighting the popular items is very similar to removing them from recommendations, so I still suspect precision will go down using IDF. Obviously this can be tested pretty easily; I just wondered if anyone had already done it. This brings up a problem with holdout based precisi

Visualize labels of classified news (classify-20newsgroups example)

2013-02-06 Thread BS TLC
Hi, I'm a complete novice with Mahout, and I'm currently learning how to use it by working through the examples. I'm running the 20newsgroups clustering example and I'm wondering how to visualize the labels of the classified documents. Does anyone know how? Thanks, Albert

Re: Using IDF in CF recommender

2013-02-06 Thread Paulo Villegas
This results in no information for universally preferred items, which is indeed what I was looking for. It looks like this should also work for other values or explicit preferences: item prices, ratings, etc. Intuition says this will result in a lower precision related cross validation measu

Re: Using IDF in CF recommender

2013-02-06 Thread Pat Ferrel
oops, forgot the log So... idf weighted preference value = item preference value * log (number of all items/number of users with specific item pref) items 1 0 0 users 1 0 0 1 1 0 freq
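The weighting described in this message can be sketched in a few lines of Python. This is only an illustration, not Mahout code: the function names, the nested-dict data layout, and the sample users and items are all hypothetical. The message writes "number of all items" in the numerator, while conventional IDF uses the total number of users, so the sketch takes the numerator as an explicit argument rather than choosing one.

```python
import math


def idf_weight(total_count, users_with_pref):
    """Return log(total_count / users_with_pref)."""
    if users_with_pref <= 0:
        raise ValueError("users_with_pref must be positive")
    return math.log(total_count / users_with_pref)


def idf_weighted_preferences(prefs, total_count):
    """Scale every preference value by the IDF weight of its item.

    prefs maps user -> {item: preference value} (hypothetical layout).
    total_count is the numerator of the IDF ratio, supplied by the caller.
    """
    # Count how many users expressed a preference for each item.
    users_per_item = {}
    for items in prefs.values():
        for item in items:
            users_per_item[item] = users_per_item.get(item, 0) + 1

    return {
        user: {
            item: value * idf_weight(total_count, users_per_item[item])
            for item, value in items.items()
        }
        for user, items in prefs.items()
    }


# Made-up data: item "a" is preferred by every user, "b" and "c" by one each.
prefs = {
    "u1": {"a": 1.0, "b": 1.0},
    "u2": {"a": 1.0},
    "u3": {"a": 1.0, "c": 1.0},
}
weighted = idf_weighted_preferences(prefs, total_count=len(prefs))
```

With the number of users as the numerator, an item that every user prefers gets weight log(1) = 0, which matches the "no information for universally preferred items" behavior mentioned elsewhere in this thread.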

Re: Regarding mahout clustering algorithms

2013-02-06 Thread Tanguy tlrx
Thanks! -- Tanguy 2013/2/6 Ted Dunning > https://github.com/tdunning/knn/ > > especially the docs directory > > On Wed, Feb 6, 2013 at 7:54 AM, Tanguy tlrx wrote: > > > Hi Ted, > > > > Where can I find more details about these new algorithms? > > > > Thanks, > > > > -- Tanguy > > > > 2013/2/6

Re: Regarding mahout clustering algorithms

2013-02-06 Thread Ted Dunning
https://github.com/tdunning/knn/ especially the docs directory On Wed, Feb 6, 2013 at 7:54 AM, Tanguy tlrx wrote: > Hi Ted, > > Where can I find more details about these new algorithms? > > Thanks, > > -- Tanguy > > 2013/2/6 Ted Dunning > > > Yes they can. > > > > The new algorithms that are j

Re: Rating scale

2013-02-06 Thread Ted Dunning
If you want relative error, you should model the log of the target variable. This is very commonly done with prices. My beefs with SVD methods in general are a) they are often implemented without regularization b) they are typically used to model ratings instead of the desired target behavior
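A small Python sketch of why modeling the log of the target gives relative error. The function name and the price values are made up for illustration and do not come from the thread. An absolute difference in log space depends only on the ratio between predicted and actual values, so a 10% miss on a cheap item and a 10% miss on an expensive item count the same.

```python
import math


def log_abs_error(actual, predicted):
    """Absolute error measured on the log of the target variable."""
    if actual <= 0 or predicted <= 0:
        raise ValueError("log is only defined for positive values")
    return abs(math.log(predicted) - math.log(actual))


# Two hypothetical predictions, both 10% too high, at very different prices.
cheap = log_abs_error(actual=10.0, predicted=11.0)
expensive = log_abs_error(actual=100.0, predicted=110.0)

# Exponentiating the log error recovers the ratio between the two values.
ratio = math.exp(expensive)
```

On the raw scale the two absolute errors are 1 and 10; in log space they are the same, and `ratio` comes back as roughly 1.1.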

Re: Regarding mahout clustering algorithms

2013-02-06 Thread Tanguy tlrx
Hi Ted, Where can I find more details about these new algorithms? Thanks, -- Tanguy 2013/2/6 Ted Dunning > Yes they can. > > The new algorithms that are just now arriving are particularly suited to > non-Hadoop use. > > On Wed, Feb 6, 2013 at 2:06 AM, vivek bairathi >wrote: > > > Hi, > > > >

Re: Rating scale

2013-02-06 Thread Sean Owen
Scaling values scales errors, yes. Yes you would have to normalize by a range to meaningfully compare. On Feb 6, 2013 2:58 PM, "Zia mel" wrote: > In this case where there is different scaling or range, would MAE test > be suitable and understandable ? For example, if we have range 1-5 and > anoth
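The normalization discussed here can be illustrated with a short Python sketch. The function names and the sample ratings are hypothetical. The second data set is the first one linearly rescaled from a 1-5 range to a 1-20 range, so the raw MAE values differ while the range-normalized values agree.

```python
def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def normalized_mae(actual, predicted, low, high):
    """MAE divided by the width of the rating range."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return mae(actual, predicted) / (high - low)


# Hypothetical ratings on a 1-5 scale.
actual_5 = [1.0, 3.0, 5.0]
predicted_5 = [2.0, 3.0, 4.0]


def rescale(x):
    """Map a value from the 1-5 range linearly onto the 1-20 range."""
    return 1.0 + (x - 1.0) * 19.0 / 4.0


actual_20 = [rescale(x) for x in actual_5]
predicted_20 = [rescale(x) for x in predicted_5]

raw_5 = mae(actual_5, predicted_5)
raw_20 = mae(actual_20, predicted_20)
norm_5 = normalized_mae(actual_5, predicted_5, 1.0, 5.0)
norm_20 = normalized_mae(actual_20, predicted_20, 1.0, 20.0)
```

Here `raw_20` is 4.75 times `raw_5` simply because the values were scaled, while `norm_5` and `norm_20` are equal, which is what makes the two data sets comparable.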

Re: Regarding mahout clustering algorithms

2013-02-06 Thread Ted Dunning
Yes they can. The new algorithms that are just now arriving are particularly suited to non-Hadoop use. On Wed, Feb 6, 2013 at 2:06 AM, vivek bairathi wrote: > Hi, > > I want to know that Mahout's clustering algorithms run without Hadoop or > not? > I mean can they be used without Hadoop? > > > -

Re: Rating scale

2013-02-06 Thread Zia mel
In this case, where the scaling or range differs, would an MAE test be suitable and understandable? For example, if we have a range of 1-5 and another of 1-20, do we need to divide by the range to interpret the MAE the same way? Many thanks On Wed, Feb 6, 2013 at 4:13 AM, 万代豊 <20525entrad...@gmail

Re: Rating scale

2013-02-06 Thread 万代豊
Sean, thanks for your clarification. I'll also keep in mind what I need to be careful about with SVD. Regards, Yutaka 2013/2/6 Sean Owen > Yes that would be valid in the sense that the neighborhood based approaches > are outputting a weighted average of prices here which is also a price. You >

Re: Rating scale

2013-02-06 Thread Sean Owen
Yes that would be valid in the sense that the neighborhood based approaches are outputting a weighted average of prices here which is also a price. You would have to think about which similarity metrics are meaningful though. The SVD has a perhaps undesirable behavior here. Because it treats the s

Re: Rating scale

2013-02-06 Thread 万代豊
Hi, I also have a similar question about interpreting results based on how we provide data to the recommender. Typically, we provide rating data on a scale of, say, 1-5 and get the result in the same range (and need to be consistent, as Sean points out). If we assume the provided data with other