Re: Why is the classification model trained by TrainNewsGroups different each time?

2012-07-04 Thread Caspar Hsieh
Hi Ted Dunning, I commented out the line Collections.shuffle(files); in TrainNewsGroups.java so that the model is trained with the same order of examples each time. After recompiling the code and redoing the experiment, the models are still not the same each time. :( And I made sure the vectors before input to

Re: Approaches for combining multiple types of item data for user-user similarity

2012-07-04 Thread Sean Owen
The best default answer is to put them all in one model. The math doesn't care what the things are. Unless you have a strong reason to weight one data set, I wouldn't. If you do, then two models are best. It is hard to weight a subset of the data within most similarity functions. I don't think it

Re: Generating similarity file(s) for item recommender?

2012-07-04 Thread Matt Mitchell
Hi Sean, Myrrix does look interesting! I'll keep an eye on it. What I'd like to do is recommend items to users, yes. I looked at the IDRescorer and it did the job perfectly (pre-filtering). I was a little misleading in regard to the size of the data. The raw data files are around 1GB. But after
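
For context on the pre-filtering mentioned above, here is a minimal IDRescorer sketch. The class name and the allowed-item set are invented for illustration; the interface methods (isFiltered, rescore) are the standard ones, and the rescorer is passed as the third argument to Recommender.recommend().

    import java.util.Set;
    import org.apache.mahout.cf.taste.recommender.IDRescorer;

    // Hypothetical rescorer: restrict recommendations to an allowed set of item IDs.
    public class AllowedItemsRescorer implements IDRescorer {

      private final Set<Long> allowedItemIDs;

      public AllowedItemsRescorer(Set<Long> allowedItemIDs) {
        this.allowedItemIDs = allowedItemIDs;
      }

      @Override
      public double rescore(long itemID, double originalScore) {
        return originalScore; // leave scores of allowed items unchanged
      }

      @Override
      public boolean isFiltered(long itemID) {
        return !allowedItemIDs.contains(itemID); // exclude everything else
      }
    }

It would be used as recommender.recommend(userID, 10, new AllowedItemsRescorer(allowedIDs)).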

recommendations for new users

2012-07-04 Thread Matt Mitchell
Hi, Slowly prototyping a recommender here. The system does not have user accounts. Since the users on the system don't have accounts, I'm struggling a bit with completely new users and what to recommend to them. I do have information about the user, like what referring site they came from (1 of

Re: Generating similarity file(s) for item recommender?

2012-07-04 Thread Sean Owen
If your input is 10MB then the good news is you are not near the scale where you need Hadoop. A simple non-distributed Mahout recommender works well, and includes the Rescorer capability you need. That's a fine place to start. The book ought to give a pretty good tour of how that works in chapter
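
As a rough sketch of the non-distributed setup described here (the file name, user ID, and the choice of log-likelihood item similarity are assumptions for illustration, not recommendations from the thread):

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;

    public class SimpleItemRecommender {
      public static void main(String[] args) throws Exception {
        // prefs.csv: one "userID,itemID,value" triple per line
        DataModel model = new FileDataModel(new File("prefs.csv"));
        GenericItemBasedRecommender recommender =
            new GenericItemBasedRecommender(model, new LogLikelihoodSimilarity(model));
        List<RecommendedItem> recs = recommender.recommend(123L, 10);
        for (RecommendedItem rec : recs) {
          System.out.println(rec.getItemID() + " : " + rec.getValue());
        }
      }
    }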

Re: recommendations for new users

2012-07-04 Thread Sean Owen
Have a look at the PlusAnonymousUserDataModel, which is a bit of a hack but a decent sort of solution for this case. It lets you temporarily add a user to the system and then everything else works as normal, so you can make recommendations to these new / temp users. There isn't a way to inject
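
A sketch of that pattern, assuming the 0.7-era PlusAnonymousUserDataModel API (TEMP_USER_ID, setTempPrefs, clearTempPrefs); the item IDs and preference values are made up, and note the wrapper holds only one temporary user at a time, so concurrent requests need care:

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
    import org.apache.mahout.cf.taste.impl.model.PlusAnonymousUserDataModel;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
    import org.apache.mahout.cf.taste.model.PreferenceArray;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;

    public class AnonymousUserExample {
      public static void main(String[] args) throws Exception {
        PlusAnonymousUserDataModel model =
            new PlusAnonymousUserDataModel(new FileDataModel(new File("prefs.csv")));
        Recommender recommender =
            new GenericItemBasedRecommender(model, new LogLikelihoodSimilarity(model));

        // Preferences gathered so far for the account-less visitor
        PreferenceArray tempPrefs = new GenericUserPreferenceArray(2);
        tempPrefs.setUserID(0, PlusAnonymousUserDataModel.TEMP_USER_ID);
        tempPrefs.setItemID(0, 101L);
        tempPrefs.setValue(0, 1.0f);
        tempPrefs.setUserID(1, PlusAnonymousUserDataModel.TEMP_USER_ID);
        tempPrefs.setItemID(1, 202L);
        tempPrefs.setValue(1, 1.0f);

        model.setTempPrefs(tempPrefs);
        List<RecommendedItem> recs =
            recommender.recommend(PlusAnonymousUserDataModel.TEMP_USER_ID, 5);
        model.clearTempPrefs();

        for (RecommendedItem rec : recs) {
          System.out.println(rec.getItemID() + " : " + rec.getValue());
        }
      }
    }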

Re: Generating similarity file(s) for item recommender?

2012-07-04 Thread Matt Mitchell
Thanks Sean! Nice to know I can stay simple for now. - Matt On Wed, Jul 4, 2012 at 9:59 AM, Sean Owen sro...@gmail.com wrote: If your input is 10MB then the good news is you are not near the scale where you need Hadoop. A simple non-distributed Mahout recommender works well, and includes the

A bunch of SVD questions...

2012-07-04 Thread Razon, Oren
Hi, I'm exploring Mahout's parallel SVD implementation over Hadoop (ALS), and I would like to clarify a few things: 1. How do you recommend the top K items with this job? Does the job factorize the rating matrix, then compute a predicted rating for each cell in the matrix, so when you need a
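
The parallel job itself is driven through Hadoop, but the factorize-then-predict idea behind top-K recommendation can be illustrated with Mahout's in-memory ALS classes. This is not the distributed job asked about, just a sketch, and the feature count, lambda, and iteration values are arbitrary:

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer;
    import org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;

    public class InMemoryAlsExample {
      public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("ratings.csv"));
        // Factorize the rating matrix into user and item feature vectors
        ALSWRFactorizer factorizer = new ALSWRFactorizer(model, 20, 0.065, 10);
        SVDRecommender recommender = new SVDRecommender(model, factorizer);
        // Top K: estimated ratings (dot products of feature vectors), ranked per user
        List<RecommendedItem> topK = recommender.recommend(42L, 10);
        for (RecommendedItem item : topK) {
          System.out.println(item.getItemID() + " : " + item.getValue());
        }
      }
    }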

Re: Why is the classification model trained by TrainNewsGroups different each time?

2012-07-04 Thread Ted Dunning
On Wed, Jul 4, 2012 at 12:09 AM, Caspar Hsieh caspar.hs...@9x9.tv wrote: Hi Ted Dunning, I commented out the line Collections.shuffle(files); in TrainNewsGroups.java so that the model is trained with the same order of examples each time. This will prevent effective learning. You must shuffle the data at
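
If the goal is run-to-run reproducibility rather than removing the shuffle, one option (a suggestion here, not something stated in the thread) is to keep the shuffle but give it a fixed seed; note that other randomness inside the SGD trainer can still make models differ between runs:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    public class SeededShuffleExample {
      public static void main(String[] args) {
        List<String> files = new ArrayList<String>();
        Collections.addAll(files, "doc1.txt", "doc2.txt", "doc3.txt", "doc4.txt");

        // Keep the shuffle (online learners need a randomized example order),
        // but seed it so every run sees the same order.
        Collections.shuffle(files, new Random(42L));

        System.out.println(files);
      }
    }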

custom file data model?

2012-07-04 Thread Matt Mitchell
Hi, I'd like to store additional information in my user preference data files. Is it possible to add more columns to the file that FileDataModel uses? For example, an additional ID that maps to my application's database ID for item IDs, or a simple 3-char code for possible use in custom user-user

Re: custom file data model?

2012-07-04 Thread Sean Owen
Sure. It will ignore columns beyond the fourth, which is an optional timestamp. If you just want it to read some common input file but ignore the unused columns, that's easy. You can copy and modify FileDataModel to do whatever you like, if you want it to use this data. You'd have to change other
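
Concretely, per Sean's description, a preference file with trailing application-specific columns can be loaded unchanged; the file contents in the comment below are invented for illustration:

    import java.io.File;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.model.DataModel;

    public class ExtraColumnsExample {
      public static void main(String[] args) throws Exception {
        // prefs.csv (hypothetical): FileDataModel reads userID,itemID,value and an
        // optional timestamp, and ignores any further columns, e.g.:
        //   1,101,4.5,1341400000,EXT-DB-9,USA
        //   1,102,3.0,1341400100,EXT-DB-3,CAN
        //   2,101,5.0,1341400200,EXT-DB-9,USA
        DataModel model = new FileDataModel(new File("prefs.csv"));
        System.out.println(model.getNumUsers() + " users, " + model.getNumItems() + " items");
      }
    }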

Re: Extracting document/topic inference with the new lda cvb algorithm

2012-07-04 Thread Andy Schlaikjer
Hi Caroline, Jake Mannix and I wrote the LDA CVB implementation. Apologies for the light documentation. When you invoked Mahout, did you supply the --doc_topic_output path parameter? If this is present, after training a model the driver app will apply the model to the input term-vectors, storing

Re: Extracting document/topic inference with the new lda cvb algorithm

2012-07-04 Thread Andy Schlaikjer
I haven't looked into the vector dumper code in detail, but I remember having successfully run some version of it without an input dictionary. Perhaps you've stumbled into a legitimate bug with the utility? For the time being you might also try the sequence file dumper util which is somewhat more

Re: custom file data model?

2012-07-04 Thread Matt Mitchell
Thanks Sean. I'll have a look at creating a custom model! A somewhat related question here... I've also thought about using a separate database for user prefs, either Riak or Amazon's DynamoDB. Any tips on how to create a custom data source? - Matt On Jul 4, 2012, at 11:55 AM, Sean Owen

Re: custom file data model?

2012-07-04 Thread Sean Owen
Look at the example DataModels in integration. The pattern is the same: load it all into memory! It's too slow for real-time otherwise. So there is no point in, say, moving your data from a DB to Dynamo for scalability if you're using non-distributed code. If you're using Hadoop, DataModel is not
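
The load-into-memory pattern usually ends in a GenericDataModel built from a FastByIDMap of per-user preferences. In a real custom DataModel the rows would be fetched from Riak or DynamoDB at startup or on refresh(); here they are hard-coded so the sketch stands alone:

    import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
    import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
    import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.model.PreferenceArray;

    public class InMemoryModelFromStore {
      public static void main(String[] args) throws Exception {
        FastByIDMap<PreferenceArray> userData = new FastByIDMap<PreferenceArray>();

        // User 1 with two preferences (values would come from the external store)
        PreferenceArray user1 = new GenericUserPreferenceArray(2);
        user1.setUserID(0, 1L); user1.setItemID(0, 101L); user1.setValue(0, 4.0f);
        user1.setUserID(1, 1L); user1.setItemID(1, 102L); user1.setValue(1, 2.5f);
        userData.put(1L, user1);

        // User 2 with one preference
        PreferenceArray user2 = new GenericUserPreferenceArray(1);
        user2.setUserID(0, 2L); user2.setItemID(0, 101L); user2.setValue(0, 5.0f);
        userData.put(2L, user2);

        DataModel model = new GenericDataModel(userData);
        System.out.println(model.getNumUsers() + " users loaded into memory");
      }
    }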

Difference when we don't use partial implementation

2012-07-04 Thread Nowal, Akshay
Hi All, I am running decision forests in Mahout; below are the commands that I have used to run the algorithm: Info file: mahout org.apache.mahout.df.tools.Describe -p /user/an32665/KDD/KDDTrain+.arff -f /user/an32665/KDD/KDDTrain+.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L Building

Re: Difference when we don't use partial implementation

2012-07-04 Thread deneche abdelhakim
Hi Akshay, when you don't use the -p parameter, the builder loads the whole dataset into memory on every computing node, so every tree grown is trained on the whole dataset (of course using bagging to select a subset of it). When using -p, every computing node loads a part of the dataset (thus the