Hi, Ted Dunning
I commented out the line Collections.shuffle(files); in TrainNewsGroups.java,
so that the model is trained with the same order of examples each time.
After recompiling the code and redoing the experiment, the models are still
not the same each time. :(
And I made sure the vectors before input to
The best default answer is to put them all in one model. The math
doesn't care what the things are. Unless you have a strong reason to
weight one data set I wouldn't. If you do, then two models is best. It
is hard to weight a subset of the data within most similarity
functions. I don't think it
Hi Sean,
Myrrix does look interesting! I'll keep an eye on it.
What I'd like to do is recommend items to users yes. I looked at the
IdRescorer and it did the job perfectly (pre filtering).
I was a little misleading with regard to the size of the data. The raw
data files are around 1GB. But after
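For anyone following along: the pre-filtering mentioned above works by implementing Mahout's IDRescorer interface. A minimal sketch, where the allowed-item set is a placeholder for whatever filter criterion you have (in stock, right category, etc.):

```java
import java.util.Set;
import org.apache.mahout.cf.taste.recommender.IDRescorer;

// Restricts recommendations to an allowed set of item IDs.
// The allowedItemIDs set is a made-up example criterion.
public class AllowedItemsRescorer implements IDRescorer {

  private final Set<Long> allowedItemIDs;

  public AllowedItemsRescorer(Set<Long> allowedItemIDs) {
    this.allowedItemIDs = allowedItemIDs;
  }

  @Override
  public double rescore(long id, double originalScore) {
    return originalScore; // scores unchanged; we only filter
  }

  @Override
  public boolean isFiltered(long id) {
    return !allowedItemIDs.contains(id); // true = exclude this item
  }
}
```

Pass an instance to recommender.recommend(userID, howMany, rescorer) and filtered items never appear in the results.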
Hi,
Slowly prototyping a recommendation here. The system does not have
user accounts. Since the users on the system don't have accounts, I'm
struggling a bit with completely new users, and what to recommend
them. I do have information about the user, like what referring site
they came from (1 of
If your input is 10MB then the good news is you are not near the scale
where you need Hadoop. A simple non-distributed Mahout recommender
works well, and includes the Rescorer capability you need. That's a
fine place to start.
The book ought to give a pretty good tour of how that works in chapter
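A minimal sketch of such a non-distributed recommender, assuming a prefs.csv of userID,itemID,value lines; the file name, neighborhood size, and user ID below are placeholders:

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

public class SimpleRecommender {
  public static void main(String[] args) throws Exception {
    // Loads the whole preference file into memory.
    DataModel model = new FileDataModel(new File("prefs.csv"));
    PearsonCorrelationSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // 25-nearest-neighbor neighborhood is an arbitrary starting point.
    NearestNUserNeighborhood neighborhood =
        new NearestNUserNeighborhood(25, similarity, model);
    GenericUserBasedRecommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity);
    List<RecommendedItem> top = recommender.recommend(123L, 10); // 123L is a placeholder user ID
    for (RecommendedItem item : top) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}
```

A Rescorer-accepting overload of recommend() exists for the filtering use case discussed elsewhere in the thread.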
Have a look at the PlusAnonymousUserDataModel, which is a bit of a
hack but a decent sort of solution for this case. It lets you
temporarily add a user to the system and then everything else works as
normal, so you can make recommendations to these new / temp users.
There isn't a way to inject
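A sketch of that pattern, assuming a recommender already built over the wrapped model (the item IDs and preference values here are made up):

```java
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
import org.apache.mahout.cf.taste.impl.model.PlusAnonymousUserDataModel;
import org.apache.mahout.cf.taste.model.PreferenceArray;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;

public class AnonymousUserExample {
  // The recommender must be built over plusModel, not the raw delegate.
  static List<RecommendedItem> recommendForAnonymous(
      PlusAnonymousUserDataModel plusModel, Recommender recommender)
      throws Exception {
    PreferenceArray tempPrefs = new GenericUserPreferenceArray(2);
    tempPrefs.setUserID(0, PlusAnonymousUserDataModel.TEMP_USER_ID);
    tempPrefs.setItemID(0, 10L);  // made-up item IDs and values
    tempPrefs.setValue(0, 4.0f);
    tempPrefs.setItemID(1, 11L);
    tempPrefs.setValue(1, 3.0f);
    plusModel.setTempPrefs(tempPrefs);
    try {
      return recommender.recommend(PlusAnonymousUserDataModel.TEMP_USER_ID, 10);
    } finally {
      plusModel.clearTempPrefs(); // remove the temp user afterwards
    }
  }
}
```

Note this model holds one temp user at a time, so concurrent anonymous requests need external synchronization.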
Thanks Sean! Nice to know I can stay simple for now.
- Matt
On Wed, Jul 4, 2012 at 9:59 AM, Sean Owen sro...@gmail.com wrote:
If your input is 10MB then the good news is you are not near the scale
where you need Hadoop. A simple non-distributed Mahout recommender
works well, and includes the
Hi,
I'm exploring Mahout's parallel SVD implementation on Hadoop (ALS), and I would
like to clarify a few things:
1. How do you recommend the top K items with this job? Does the job factorize
the rating matrix, then compute a predicted rating for each cell in the
matrix, so when you need a
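For background on the question above: once ALS has factored the rating matrix into user and item feature matrices, the predicted rating for any (user, item) cell is the dot product of the two learned feature vectors, so top-K recommendation amounts to scoring candidate items this way and keeping the K largest. A toy sketch (the feature values in the test are made up):

```java
public class AlsPredict {
  // Predicted rating = dot product of user and item feature vectors.
  static double predict(double[] userFeatures, double[] itemFeatures) {
    double sum = 0.0;
    for (int k = 0; k < userFeatures.length; k++) {
      sum += userFeatures[k] * itemFeatures[k];
    }
    return sum;
  }
}
```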
On Wed, Jul 4, 2012 at 12:09 AM, Caspar Hsieh caspar.hs...@9x9.tv wrote:
Hi, Ted Dunning
I commented out the line Collections.shuffle(files); in TrainNewsGroups.java,
so that the model is trained with the same order of examples each time.
This will prevent effective learning. You must shuffle the data at
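If the goal of removing the shuffle was reproducibility, a middle ground is to keep the shuffle but seed it, so the order is randomized yet identical across runs. The seed value below is arbitrary, and note this alone may not make training fully deterministic if the learner has other sources of randomness (which the thread suggests it does):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SeededShuffle {
  // Returns a shuffled copy; the same seed always yields the same order.
  static List<String> shuffled(List<String> files, long seed) {
    List<String> copy = new ArrayList<>(files);
    Collections.shuffle(copy, new Random(seed));
    return copy;
  }
}
```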
Hi,
I'd like to store additional information in my user preference data
files. Is it possible to add more columns to the file that
FileDataModel uses? For example, an additional ID that maps to my
application's database ID for item IDs, or a simple 3-char code for
possible use in custom user-user
Sure. It will ignore columns beyond the fourth, which is an optional
timestamp. If you just want it to read some common input file but
ignore the unused columns, that's easy.
You can copy and modify FileDataModel to do whatever you like, if you
want it to use this data. You'd have to change other
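A low-effort alternative to modifying FileDataModel: keep the extra columns in the file (per the answer above, anything past the fourth column is ignored) and parse them separately in your own code when needed. A sketch, assuming a made-up layout of userID,itemID,value,timestamp followed by custom columns; since the fourth column is read as a timestamp, custom data is safest from column five onward:

```java
import java.util.Arrays;

public class ExtraColumns {
  // The part FileDataModel reads: userID,itemID,value[,timestamp].
  static String standardPart(String line) {
    String[] f = line.split(",");
    int keep = Math.min(f.length, 4);
    return String.join(",", Arrays.copyOf(f, keep));
  }

  // Any custom columns beyond the optional timestamp.
  static String[] customColumns(String line) {
    String[] f = line.split(",");
    if (f.length <= 4) {
      return new String[0];
    }
    return Arrays.copyOfRange(f, 4, f.length);
  }
}
```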
Hi Caroline,
Jake Mannix and I wrote the LDA CVB implementation. Apologies for the light
documentation.
When you invoked Mahout, did you supply the --doc_topic_output path
parameter? If this is present, after training a model the driver app will
apply the model to the input term-vectors, storing
I haven't looked into the vector dumper code in detail, but I remember
having successfully run some version of it without an input dictionary.
Perhaps you've stumbled into a legitimate bug with the utility? For the
time being you might also try the sequence file dumper util which is
somewhat more
Thanks Sean. I'll have a look at creating a custom model!
A somewhat related question here... I've also thought about using a separate
database for user prefs, either riak or amazons dynamo db. Any tips on how to
create a custom data source?
- Matt
On Jul 4, 2012, at 11:55 AM, Sean Owen
Look at the example DataModels in integration. The pattern is the
same: load it all into memory! It's too slow for real-time otherwise.
So there is no point in, say, moving your data from a DB to Dynamo for
scalability if you're using non-distributed code. If you're using
Hadoop, DataModel is not
Hi All,
I am running Decision Forest in Mahout; below are the commands that I
used to run the algorithm:
Info file:
mahout org.apache.mahout.df.tools.Describe -p
/user/an32665/KDD/KDDTrain+.arff -f /user/an32665/KDD/KDDTrain+.info -d
N 3 C 2 N C 4 N C 8 N 2 C 19 N L
Building
Hi Akshay,
when you don't use the -p parameter, the builder loads the whole dataset
into memory on every computing node, so every tree grown is trained on the
whole dataset (of course using bagging to select a subset of it). When
using -p, every computing node loads a part of the dataset (thus the
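To make the bagging remark concrete: each tree trains on a bootstrap sample, i.e. N draws with replacement from the N available examples, so each tree sees roughly 63% of the distinct examples. A toy sketch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class Bagging {
  // One bootstrap sample: data.size() draws with replacement.
  static <T> List<T> bootstrapSample(List<T> data, Random rng) {
    List<T> sample = new ArrayList<>(data.size());
    for (int i = 0; i < data.size(); i++) {
      sample.add(data.get(rng.nextInt(data.size())));
    }
    return sample;
  }
}
```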