Re: Why is the classification model trained with TrainNewsGroups not the same each time?

2012-07-04 Thread Caspar Hsieh

Hi, Ted Dunning

I commented out the line Collections.shuffle(files); in TrainNewsGroups.java
so that the model is trained with the same order of examples each time.


After recompiling the code and redoing the experiment, the models are still
not the same each time. :(


And I made sure the vectors fed into training have the same values and
order each time.


How do I fix the order of the examples?


Thanks and regards








On 2012-07-03 23:19, Ted Dunning wrote:

Because the order of the examples is randomized.

On Tue, Jul 3, 2012 at 8:13 AM, Caspar Hsieh caspar.hs...@9x9.tv wrote:


I use the Mahout classification example TrainNewsGroups to train the model
with leak type 3, then use TestNewsGroups to test the model.
Then I re-train the model and test again, and the test results are not the
same as before.

Why is the trained model not the same each time?

Thanks.






Re: Approaches for combining multiple types of item data for user-user similarity

2012-07-04 Thread Sean Owen
The best default answer is to put them all in one model. The math
doesn't care what the things are. Unless you have a strong reason to
weight one data set over the other, I wouldn't. If you do, then two
models is best; it is hard to weight a subset of the data within most
similarity functions. I don't think it would work in Pearson, for
instance, but it could work in Tanimoto.
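
For what it's worth, a minimal sketch of the one-model approach with boolean data (the file names are hypothetical, and it assumes book and video item IDs occupy disjoint ID ranges):

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.common.FastIDSet;
import org.apache.mahout.cf.taste.impl.model.GenericBooleanPrefDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class CombinedUserSimilarity {

  // Read "userID,itemID" lines into a shared user -> item-set map.
  // Item IDs from the two sources are assumed not to collide.
  static void load(String path, FastByIDMap<FastIDSet> userData) throws Exception {
    BufferedReader in = new BufferedReader(new FileReader(path));
    String line;
    while ((line = in.readLine()) != null) {
      String[] cols = line.split(",");
      long userID = Long.parseLong(cols[0]);
      long itemID = Long.parseLong(cols[1]);
      FastIDSet items = userData.get(userID);
      if (items == null) {
        items = new FastIDSet();
        userData.put(userID, items);
      }
      items.add(itemID);
    }
    in.close();
  }

  public static void main(String[] args) throws Exception {
    FastByIDMap<FastIDSet> userData = new FastByIDMap<FastIDSet>();
    load("books.csv", userData);   // (a) books bought
    load("videos.csv", userData);  // (b) videos watched
    DataModel model = new GenericBooleanPrefDataModel(userData);

    // Tanimoto behaves sensibly on this kind of binary, mixed-item data.
    UserSimilarity similarity = new TanimotoCoefficientSimilarity(model);
    GenericUserBasedRecommender recommender = new GenericUserBasedRecommender(
        model, new NearestNUserNeighborhood(10, similarity, model), similarity);

    long someUserID = 12345L; // hypothetical user
    for (long neighborID : recommender.mostSimilarUserIDs(someUserID, 10)) {
      System.out.println(neighborID);
    }
  }
}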

On Wed, Jul 4, 2012 at 1:20 AM, Ken Krugler kkrugler_li...@transpac.com wrote:
 Hi all,

 I'm curious what approaches are recommended for generating user-user 
 similarity, when I've got two (or more) distinct types of item data, both of 
 which are fairly large.

 E.g. let's say I had a set of users where I knew both (a) what books they had 
 bought on Amazon, and (b) what YouTube videos they had watched.

 For each user, I want to find the 10 most similar other users.

  - I could create two separate models, find the nearest 30 users for each 
 user, and combine (maybe with weighting) the results.
  - I could toss all of the data into one model - and I could use a value of
 < 1.0 for whichever type of preference is less important.

 Any other suggestions? Input on the above two approaches?

 Thanks!

 -- Ken

 --
 Ken Krugler
 http://www.scaleunlimited.com
 custom big data solutions & training
 Hadoop, Cascading, Mahout & Solr






Re: Generating similarity file(s) for item recommender?

2012-07-04 Thread Matt Mitchell
Hi Sean,

Myrrix does look interesting! I'll keep an eye on it.

What I'd like to do is recommend items to users, yes. I looked at the
IDRescorer and it did the job perfectly (pre-filtering).

I was a little misleading in regard to the size of the data. The raw
data files are around 1GB. But after the interesting data is extracted
-- session-id, item-id and type-of-event (product image clicked,
product description viewed etc.), the data file comes out to about
10MB. Not so bad.

Btw, just bought the Mahout in Action book!

- Matt

On Tue, Jul 3, 2012 at 10:40 AM, Sean Owen sro...@gmail.com wrote:
 I'm not sure if Mridul's suggestion does what you want. Do you want to
 recommend items to users? then no, you do not start with item IDs and
 recommend to them.

 It sounds like your question is how to compute similarity data. The
 first answer is that you do not use Hadoop unless you must use Hadoop.

 You don't compute it yourself, you let the framework do it with
 LogLikelihoodSimilarity. It just happens automatically. You can use
 caching, you can use precomputation, but that comes after you decide
 that you have too much data to do it all in real-time.

 1GB of input data suggests you have a lot of data. Is that tens of
 millions of user-item associations? then yes you are not in simple
 non-Hadoop land anymore and you need to look at RecommenderJob /
 Hadoop. This doesn't have anything to do with FileDataModel or the
 non-distributed bits.


 To your second point -- this is really what Rescorer does for you,
 lets you filter or boost certain results at query time. But this is
 part of the non-distributed code. You could try stitching together
 some offline similarities from the Hadoop job, and loading them
 selectively in memory as part of the real-time Recommender, but it's
 going to be a bit dicey to get it to work fast.


 I don't mind mentioning that this is exactly the kind of problem I'm
 working on in Myrrix (myrrix.com). It does the offline model building
 on Hadoop and still lets you do real-time recommendations, with
 Rescorer objects if you want. The whole point is to fix up this
 dicey hard part mentioned above. Might take a look.



 On Tue, Jul 3, 2012 at 3:15 PM, Matt Mitchell goodie...@gmail.com wrote:
 Thanks Mridul, I'll try this out. Does getItemIDs return every item id
 from the file in your example?

 This kind of leads me to another, related question... I want to have
 my recommender engine recommend items to a user, but the items should
 be from a known set of item ids. For example, if a user is doing a
 search for gaming system, I only want recommendations for gaming
 system items. I was thinking I could feed the recommendation engine a
 set of item IDs that are known to be gaming systems as a candidate
 set *when executing that actual recommendation*. Does this make sense?
 If so, do you know how I can do this? I basically want to constrain
 the recommendations to a set of known item IDs at recommendation time.

 Thanks again!

 - Matt

 On Tue, Jul 3, 2012 at 8:01 AM, Mridul Kapoor mridulkap...@gmail.com wrote:
 I'm thinking the session ID (in the cookie) would be used as the user ID.
 The events
 are tied to product IDs, so these would be used in generating the
 preferences.


 I guess you could consider product preference on a per-session basis (i.e.
 only items for which a user expresses a preference in a single session
 are similar to each other, in some way or another). This way, you would
 be treating the session IDs as dummy user IDs, which I think should be
 good.


 I'd like to eventually run this on Hadoop, but it'd also be nice to know if
 there is a way to do this locally, while developing the app, maybe using a
 smaller
 dataset?


 Yes just writing a small offline recommender (made to run on a local
 machine) should do; you could take a subset of the data, use a
 FileDataModel, then do something like

 LongPrimitiveIterator itemIDs = dataModel.getItemIDs();


 and iterate over these, getting _n_ recommended items for each and storing
 them somewhere (and maybe using this to evaluate the recommender somehow).

 Best,
 Mridul


recommendations for new users

2012-07-04 Thread Matt Mitchell
Hi,

Slowly prototyping a recommender here. The system does not have
user accounts. Since the users on the system don't have accounts, I'm
struggling a bit with completely new users and what to recommend to
them. I do have information about the user, like what referring site
they came from (1 of n partner sites), the city they want to shop in,
and the rating value of the product. I wonder if I could use this
information to find the most similar existing user, then use that
most similar user to generate recommendations? Anyone have tips for
dealing with this?

I'm not sure if Mahout supports finding the most similar user based on
user attributes; if not, this should be a simple SQL where-like select
from the database.

- Matt


Re: Generating similarity file(s) for item recommender?

2012-07-04 Thread Sean Owen
If your input is 10MB then the good news is you are not near the scale
where you need Hadoop. A simple non-distributed Mahout recommender
works well, and includes the Rescorer capability you need. That's a
fine place to start.
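
For example, a rough sketch of that kind of setup over your session data (the file name, session ID, and candidate item IDs are hypothetical):

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.common.FastIDSet;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.IDRescorer;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class SessionRecommender {

  // Filters recommendations down to a known candidate set (e.g. "gaming system" item IDs).
  static final class CandidateSetRescorer implements IDRescorer {
    private final FastIDSet allowed;
    CandidateSetRescorer(FastIDSet allowed) { this.allowed = allowed; }
    @Override public double rescore(long itemID, double originalScore) { return originalScore; }
    @Override public boolean isFiltered(long itemID) { return !allowed.contains(itemID); }
  }

  public static void main(String[] args) throws Exception {
    // sessions.csv: "sessionID,itemID[,value]" -- the session ID plays the role of user ID.
    DataModel model = new FileDataModel(new File("sessions.csv"));
    ItemSimilarity similarity = new LogLikelihoodSimilarity(model); // computed on the fly
    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, similarity);

    FastIDSet gamingSystemIDs = new FastIDSet(); // e.g. from your search index or database
    gamingSystemIDs.add(501L);
    gamingSystemIDs.add(502L);

    List<RecommendedItem> recs =
        recommender.recommend(1001L, 10, new CandidateSetRescorer(gamingSystemIDs));
    for (RecommendedItem rec : recs) {
      System.out.println(rec.getItemID() + "\t" + rec.getValue());
    }
  }
}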

The book ought to give a pretty good tour of how that works in chapters 2-5, yes.

Separately we can talk offline about Myrrix as needed.

Sean

On Wed, Jul 4, 2012 at 4:50 PM, Matt Mitchell goodie...@gmail.com wrote:
 Hi Sean,

 Myrrix does look interesting! I'll keep an eye on it.

 What I'd like to do is recommend items to users, yes. I looked at the
 IDRescorer and it did the job perfectly (pre-filtering).

 I was a little misleading in regard to the size of the data. The raw
 data files are around 1GB. But after the interesting data is extracted
 -- session-id, item-id and type-of-event (product image clicked,
 product description viewed etc.), the data file comes out to about
 10MB. Not so bad.

 Btw, just bought the Mahout in Action book!

 - Matt


Re: recommendations for new users

2012-07-04 Thread Sean Owen
Have a look at the PlusAnonymousUserDataModel, which is a bit of a
hack but a decent sort of solution for this case. It lets you
temporarily add a user to the system and then everything else works as
normal, so you can make recommendations to these new / temp users.
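
A minimal sketch of that pattern (the data file and the temp user's prefs are hypothetical; double-check the method names against your Mahout version):

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
import org.apache.mahout.cf.taste.impl.model.PlusAnonymousUserDataModel;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.PreferenceArray;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class AnonymousUserRecs {
  public static void main(String[] args) throws Exception {
    // Wrap the real model so an anonymous visitor can be treated as a temporary user.
    DataModel delegate = new FileDataModel(new File("prefs.csv")); // hypothetical file
    PlusAnonymousUserDataModel plusModel = new PlusAnonymousUserDataModel(delegate);

    UserSimilarity similarity = new LogLikelihoodSimilarity(plusModel);
    GenericUserBasedRecommender recommender = new GenericUserBasedRecommender(
        plusModel, new NearestNUserNeighborhood(20, similarity, plusModel), similarity);

    // Whatever associations we do have for this visitor (e.g. items viewed this session).
    PreferenceArray tempPrefs = new GenericUserPreferenceArray(2);
    tempPrefs.setUserID(0, PlusAnonymousUserDataModel.TEMP_USER_ID);
    tempPrefs.setItemID(0, 111L);
    tempPrefs.setValue(0, 1.0f);
    tempPrefs.setUserID(1, PlusAnonymousUserDataModel.TEMP_USER_ID);
    tempPrefs.setItemID(1, 222L);
    tempPrefs.setValue(1, 1.0f);

    plusModel.setTempPrefs(tempPrefs);
    List<RecommendedItem> recs =
        recommender.recommend(PlusAnonymousUserDataModel.TEMP_USER_ID, 10);
    plusModel.clearTempPrefs();

    for (RecommendedItem rec : recs) {
      System.out.println(rec.getItemID() + " " + rec.getValue());
    }
  }
}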

There isn't a way to inject anything but rating/pref information
directly, no. You can use this info in a Rescorer to influence
recommendations; this is not specific to the case of a new user. You
can also decide to make recommendations by a completely different
means for new users -- for example, some canned list of top-10 recs
that is appropriate for their city or referring site. That's
legitimate too in practice.

Yes you can also find most-similar users based on this info. You'd
have to write the similarity metric yourself. I assume this is also
not the metric you use in your real recommender. So maybe you could
use it to find the nearest 1 real user and sub in those
recommendations? or a neighborhood. You would have to rewrite a bit of
what a recommender does to go this way but it's not so hard.
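
A rough sketch of that nearest-real-user substitution (the attribute record and match scoring are application code, not a Mahout API; only the final recommend() call is Mahout):

import java.util.List;
import java.util.Map;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;

public final class NewUserFallback {

  // Hypothetical per-user attributes kept outside Mahout.
  public static final class UserAttributes {
    final String referrerSite;
    final String city;
    public UserAttributes(String referrerSite, String city) {
      this.referrerSite = referrerSite;
      this.city = city;
    }
  }

  // Application-defined notion of "alike"; nothing Mahout-specific here.
  static int matchScore(UserAttributes a, UserAttributes b) {
    int score = 0;
    if (a.referrerSite.equals(b.referrerSite)) { score++; }
    if (a.city.equals(b.city)) { score++; }
    return score;
  }

  // Find the most similar known user and simply reuse their recommendations.
  public static List<RecommendedItem> recommendForNewUser(
      UserAttributes newUser, Map<Long, UserAttributes> knownUsers,
      Recommender recommender, int howMany) throws TasteException {
    long bestUserID = Long.MIN_VALUE;
    int bestScore = -1;
    for (Map.Entry<Long, UserAttributes> entry : knownUsers.entrySet()) {
      int score = matchScore(newUser, entry.getValue());
      if (score > bestScore) {
        bestScore = score;
        bestUserID = entry.getKey();
      }
    }
    return recommender.recommend(bestUserID, howMany);
  }
}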

No, there is no content-based similarity metric; this is too domain-specific.

On Wed, Jul 4, 2012 at 4:54 PM, Matt Mitchell goodie...@gmail.com wrote:
 Hi,

 Slowly prototyping a recommender here. The system does not have
 user accounts. Since the users on the system don't have accounts, I'm
 struggling a bit with completely new users and what to recommend to
 them. I do have information about the user, like what referring site
 they came from (1 of n partner sites), the city they want to shop in,
 and the rating value of the product. I wonder if I could use this
 information to find the most similar existing user, then use that
 most similar user to generate recommendations? Anyone have tips for
 dealing with this?

 I'm not sure if Mahout supports finding the most similar user based on
 user attributes; if not, this should be a simple SQL where-like select
 from the database.

 - Matt


Re: Generating similarity file(s) for item recommender?

2012-07-04 Thread Matt Mitchell
Thanks Sean! Nice to know I can stay simple for now.

- Matt

On Wed, Jul 4, 2012 at 9:59 AM, Sean Owen sro...@gmail.com wrote:
 If your input is 10MB then the good news is you are not near the scale
 where you need Hadoop. A simple non-distributed Mahout recommender
 works well, and includes the Rescorer capability you need. That's a
 fine place to start.

 The book ought to give a pretty good tour of how that works in chapters 2-5,
 yes.

 Separately we can talk offline about Myrrix as needed.

 Sean

 On Wed, Jul 4, 2012 at 4:50 PM, Matt Mitchell goodie...@gmail.com wrote:
 Hi Sean,

 Myrrix does look interesting! I'll keep an eye on it.

 What I'd like to do is recommend items to users, yes. I looked at the
 IDRescorer and it did the job perfectly (pre-filtering).

 I was a little misleading in regard to the size of the data. The raw
 data files are around 1GB. But after the interesting data is extracted
 -- session-id, item-id and type-of-event (product image clicked,
 product description viewed etc.), the data file comes out to about
 10MB. Not so bad.

 Btw, just bought the Mahout in Action book!

 - Matt


A bunch of SVD questions...

2012-07-04 Thread Razon, Oren
Hi,
I'm exploring Mahout's parallel SVD implementation over Hadoop (ALS), and I would
like to clarify a few things:
1.  How do you recommend the top K items with this job? Does the job factorize
the rating matrix, then compute a predicted rating for each cell in the
matrix, so that when you need a recommendation you only need to retrieve the top K
items by predicted value for the user? Or does it factorize the matrix
and require some online logic when a recommendation is requested?
2.  To my knowledge, applying an SVD technique first requires filling in
all the empty cells of the rating matrix (with the average rating, for example). Is this
done during the ALS job (and if so, how are the cells filled), or should it be done
as a preprocessing step?
3.  From my understanding, SVD recommenders are used to predict users' implicit
preferences. From those you can recommend the top K items (items in descending
order of predicted value). I wonder, could this be applied to a binary (explicit)
dataset, where my rating matrix contains only 1/0?
4.  From some reading I found that timeSVD++, developed by Yehuda
Koren, is considered the superior implementation for SVD recommenders. I
wondered if there is any kind of parallel implementation of it on top of
Hadoop? I found this proposal: https://issues.apache.org/jira/browse/MAHOUT-371
What is its status? Has it been reviewed already? Is it stable? Has anyone
experimented with it?

Thanks,
Oren





-
Intel Electronics Ltd.

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.


Re: Why is the classification model trained with TrainNewsGroups not the same each time?

2012-07-04 Thread Ted Dunning
On Wed, Jul 4, 2012 at 12:09 AM, Caspar Hsieh caspar.hs...@9x9.tv wrote:

 Hi, Ted Dunning

 I commented out the line Collections.shuffle(files); in TrainNewsGroups.java
 so that the model is trained with the same order of examples each time.


This will prevent effective learning.  You must shuffle the data at least
once.



 After recompiling the code and redoing the experiment, the models are still not
 the same each time. :(


There is also non-determinism in the AdaptiveLogisticRegression in the
CrossFoldLearner.


 And I made sure the vectors fed into training have the same values and
 order each time.

 How do I fix the order of the examples?


One thing that might help is to tell RandomUtils that you are running a
test.  This will make the random number generators that Mahout controls
become deterministic.

You can add a random number generator argument to the shuffle call to get
determinism there as well.
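
Concretely, something along these lines in TrainNewsGroups (just a sketch; 42 is an arbitrary seed, and files is the list already shuffled there):

import java.util.Collections;
import java.util.Random;
import org.apache.mahout.common.RandomUtils;

// Make the random number generators that Mahout controls deterministic
// (this is the switch the unit tests use).
RandomUtils.useTestSeed();

// Keep the shuffle -- it is needed for effective learning -- but seed it so the
// order of examples is identical on every run.
Collections.shuffle(files, new Random(42L));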


custom file data model?

2012-07-04 Thread Matt Mitchell
Hi,

I'd like to store additional information in my user preference data
files. Is it possible to add more columns to the file that
FileDataModel uses? For example, an additional ID that maps to my
application's database ID for item IDs, or a simple 3-character code for
possible use in custom user-user similarity, etc. Is this possible?

Thanks!

- Matt


Re: custom file data model?

2012-07-04 Thread Sean Owen
Sure. It will ignore columns beyond the fourth, which is an optional
timestamp. If you just want it to read some common input file but
ignore the unused columns, that's easy.

You can copy and modify FileDataModel to do whatever you like, if you
want it to use this data. You'd have to change other code to use your
new data somehow.
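
If you do want the extra columns available to your own code, one alternative to modifying FileDataModel is to parse the file yourself, keep the extras in a side map, and hand only user/item/value to a GenericDataModel. A rough sketch, assuming a hypothetical layout of userID,itemID,value,appItemID,code:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
import org.apache.mahout.cf.taste.impl.model.GenericPreference;
import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.Preference;
import org.apache.mahout.cf.taste.model.PreferenceArray;

public class ExtendedPrefsLoader {
  public static void main(String[] args) throws Exception {
    // Each line: userID,itemID,value,appItemID,code -- the last two are the extra columns.
    Map<Long, List<Preference>> prefsByUser = new HashMap<Long, List<Preference>>();
    Map<Long, String> codeByItem = new HashMap<Long, String>(); // side data for your own code

    BufferedReader in = new BufferedReader(new FileReader("prefs-extended.csv"));
    String line;
    while ((line = in.readLine()) != null) {
      String[] c = line.split(",");
      long userID = Long.parseLong(c[0]);
      long itemID = Long.parseLong(c[1]);
      float value = Float.parseFloat(c[2]);
      codeByItem.put(itemID, c[4]);
      List<Preference> prefs = prefsByUser.get(userID);
      if (prefs == null) {
        prefs = new ArrayList<Preference>();
        prefsByUser.put(userID, prefs);
      }
      prefs.add(new GenericPreference(userID, itemID, value));
    }
    in.close();

    // Only user/item/value go into the DataModel; the extra columns stay in codeByItem.
    FastByIDMap<PreferenceArray> userData = new FastByIDMap<PreferenceArray>();
    for (Map.Entry<Long, List<Preference>> entry : prefsByUser.entrySet()) {
      userData.put(entry.getKey(), new GenericUserPreferenceArray(entry.getValue()));
    }
    DataModel model = new GenericDataModel(userData);
    System.out.println(model.getNumUsers() + " users, " + model.getNumItems() + " items");
  }
}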


On Wed, Jul 4, 2012 at 6:52 PM, Matt Mitchell goodie...@gmail.com wrote:
 Hi,

 I'd like to store additional information in my user preference data
 files. Is it possible to add more columns to the file that
 FileDataModel uses? For example, an additional ID that maps to my
 application's database ID for item IDs, or a simple 3-character code for
 possible use in custom user-user similarity, etc. Is this possible?

 Thanks!

 - Matt


Re: Extracting document/topic inference with the new lda cvb algorithm

2012-07-04 Thread Andy Schlaikjer
Hi Caroline,

Jake Mannix and I wrote the LDA CVB implementation. Apologies for the light
documentation.

When you invoked Mahout, did you supply the --doc_topic_output path
parameter? If this is present, after training a model the driver app will
apply the model to the input term-vectors, storing inference results in the
specified path. If the parameter isn't specified, this final inference run
is skipped:

https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0Driver.java#L74
https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0Driver.java#L331

So, assuming you did generate inference output, I should note that both the
model and inference output have the *same* format: Both the topic-term
matrix and doc-topic inference output are stored as
SequenceFile<IntWritable, VectorWritable> data. If you point the vectordump
util at either data set and supply a dictionary, it'll happily map term ids
or topic ids into term strings using that dictionary... Quite confusing.
Just make sure that when you run vectordump against the doc-topic data that
you don't supply the dictionary-- This way, you'll see the raw topic ids
(zero-based indices) in output, instead of whatever terms those indices
might correspond to in your dictionary.
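
If it's easier, here is a small sketch that reads the doc-topic output directly in Java instead (the part-file path is hypothetical; each key is a document id and each value is a vector of topic probabilities):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class DumpDocTopics {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // One part file of the --doc_topic_output directory (path is hypothetical).
    Path part = new Path("temp/lda-cvb-doc/part-m-00000");
    for (Pair<IntWritable, VectorWritable> record :
        new SequenceFileIterable<IntWritable, VectorWritable>(part, true, conf)) {
      int docID = record.getFirst().get();
      Vector topicProbs = record.getSecond().get(); // index = topic id, value = p(topic | doc)
      System.out.println(docID + "\t" + topicProbs);
    }
  }
}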

Best,
Andy
@sagemintblue


On Wed, Jul 4, 2012 at 2:30 AM, Caroline Meyer caromeye...@gmail.com wrote:

 Hey Guys,

 I have been able to successfully execute the new lda algorithm as well as
 extract the topic/term inference with vectordump. What I was not able to do
 was get the document/topic inference. When I run the same vectordump
 command I get the same kinds of vectors (term:probability) as before.
 Should the vectors not be (topic:probability)?

 The command I run is:

 vectordump -s temp/lda-cvb-doc/part-m-0 -d
 temp/vectors/dictionary.file-* -dt sequencefile -o temp/lda-cvb-topics.txt

 I have not been able to find any documentation except what's in the code.
 Thanks for the help.

 Cheers,
 Caroline



Re: Extracting document/topic inference with the new lda cvb algorithm

2012-07-04 Thread Andy Schlaikjer
I haven't looked into the vector dumper code in detail, but I remember
having successfully run some version of it without an input dictionary.
Perhaps you've stumbled into a legitimate bug with the utility? For the
time being you might also try the sequence file dumper util which is
somewhat more generic but may suit your purpose here.

Andy


On Wed, Jul 4, 2012 at 9:42 AM, Caroline Meyer caromeye...@gmail.com wrote:

 Hi Andy

 If I only use the -s and -o options I get this null pointer exception:

 Exception in thread main java.lang.NullPointerException
 at
 org.apache.mahout.utils.vectors.VectorHelper$1.apply(VectorHelper.java:118)
 at
 org.apache.mahout.utils.vectors.VectorHelper$1.apply(VectorHelper.java:115)
 at com.google.common.collect.Iterators$8.next(Iterators.java:765)
 at java.util.AbstractCollection.toArray(AbstractCollection.java:124)
 at java.util.ArrayList.<init>(ArrayList.java:131)
 at com.google.common.collect.Lists.newArrayList(Lists.java:119)
 at

 org.apache.mahout.utils.vectors.VectorHelper.toWeightedTerms(VectorHelper.java:114)
 at

 org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:124)
 at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:241)

 In the code it looks like it is looking for a dictionary that is
 not specified. Is there another option i am missing?

 Cheers,
 Caroline


 On Wed, Jul 4, 2012 at 6:10 PM, Andy Schlaikjer 
 andrew.schlaik...@gmail.com
  wrote:

  Hi Caroline,
 
  Jake Mannix and I wrote the LDA CVB implementation. Apologies for the
 light
  documentation.
 
  When you invoked Mahout, did you supply the --doc_topic_output path
  parameter? If this is present, after training a model the driver app will
  apply the model to the input term-vectors, storing inference results in
 the
  specified path. If the parameter isn't specified, this final inference
 run
  is skipped:
 
 
 
 https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0Driver.java#L74
 
 
 https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0Driver.java#L331
 
  So, assuming you did generate inference output, I should note that both
 the
  model and inference output have the *same* format: Both the topic-term
  matrix and doc-topic inference output are stored as
  SequenceFile<IntWritable, VectorWritable> data. If you point the
 vectordump
  util at either data set and supply a dictionary, it'll happily map term
 ids
  or topic ids into term strings using that dictionary... Quite confusing.
  Just make sure that when you run vectordump against the doc-topic data
 that
  you don't supply the dictionary-- This way, you'll see the raw topic ids
  (zero-based indices) in output, instead of whatever terms those indices
  might correspond to in your dictionary.
 
  Best,
  Andy
  @sagemintblue
 
 
  On Wed, Jul 4, 2012 at 2:30 AM, Caroline Meyer caromeye...@gmail.com
  wrote:
 
   Hey Guys,
  
   I have been able to successfully execute the new lda algorithm as well
 as
   extract the topic/term inference with vectordump. What I was not able
 to
  do
   was get the document/topic inference. When I run the same vectordump
   command I get the same kinds of vectors (term:probability) as before.
   Should the vectors not be (topic:probability)?
  
   The command I run is:
  
   vectordump -s temp/lda-cvb-doc/part-m-0 -d
   temp/vectors/dictionary.file-* -dt sequencefile -o
  temp/lda-cvb-topics.txt
  
   I have not been able to find any documentation except what's in the
 code.
   Thanks for the help.
  
   Cheers,
   Caroline
  
 



Re: custom file data model?

2012-07-04 Thread Matt Mitchell
Thanks Sean. I'll have a look at creating a custom model!

A somewhat related question here... I've also thought about using a separate
database for user prefs, either Riak or Amazon's DynamoDB. Any tips on how to
create a custom data source?

- Matt

On Jul 4, 2012, at 11:55 AM, Sean Owen sro...@gmail.com wrote:

 Sure. It will ignore columns beyond the fourth, which is an optional
 timestamp. If you just want it to read some common input file but
 ignore the unused columns, that's easy.
 
 You can copy and modify FileDataModel to do whatever you like, if you
 want it to use this data. You'd have to change other code to use your
 new data somehow.
 
 
 On Wed, Jul 4, 2012 at 6:52 PM, Matt Mitchell goodie...@gmail.com wrote:
 Hi,
 
 I'd like to store additional information in my user preference data
 files. Is it possible to add more columns to the file that
 FileDataModel uses? For example, an additional ID that maps to my
 application's database ID for item IDs, or a simple 3-character code for
 possible use in custom user-user similarity, etc. Is this possible?
 
 Thanks!
 
 - Matt


Re: custom file data model?

2012-07-04 Thread Sean Owen
Look at the example DataModels in integration. The pattern is the
same: load it all into memory! It's too slow for real-time otherwise.
So there is no point in, say, moving your data from a DB to Dynamo for
scalability if you're using non-distributed code. If you're using
Hadoop, DataModel is not relevant.

On Wed, Jul 4, 2012 at 9:09 PM, Matt Mitchell goodie...@gmail.com wrote:
 Thanks Sean. I'll have a look at creating a custom model!

 A somewhat related question here... I've also thought about using a separate
 database for user prefs, either Riak or Amazon's DynamoDB. Any tips on how to
 create a custom data source?


Difference when we don't use partial implementation

2012-07-04 Thread Nowal, Akshay
Hi All,

 

I am running decision forests in Mahout; below are the commands that I
have used to run the algorithm:

 

Info file:

mahout org.apache.mahout.df.tools.Describe -p
/user/an32665/KDD/KDDTrain+.arff -f /user/an32665/KDD/KDDTrain+.info -d
N 3 C 2 N C 4 N C 8 N 2 C 19 N L

Building Forest:

mahout org.apache.mahout.df.mapreduce.BuildForest
-Dmapred.max.split.size=1874231 -oob -d /user/an32665/KDD/KDDTrain+.arff
-ds /user/an32665/KDD/KDDTrain+.info -sl 5 -p -t 100 -o nsl-forest

Testing Forest:

mahout org.apache.mahout.df.mapreduce.TestForest -i
/user/an32665/KDD/KDDTest+.arff -ds /user/an32665/KDD/KDDTrain+.info -m
nsl-forest -a -mr -o predictions

 

So while building the forest we use -p for the partial
implementation. I just wanted to know the difference in the algorithm when
we use -p and when we don't.

 

 

Regards,

Akshay Nowal

 



Re: Difference when we don't use partial implementation

2012-07-04 Thread deneche abdelhakim
Hi Akshay,

When you don't use the -p parameter, the builder loads the whole dataset
into memory on every computing node, so every tree grown is trained on the
whole dataset (of course using bagging to select a subset of it). When
using -p, every computing node loads only a part of the dataset (thus the name
partial), so the trees are trained on parts of the dataset. The training
algorithm is the same in both implementations; the partial
implementation is used when the dataset is too big to fit in memory.

On Thu, Jul 5, 2012 at 4:38 AM, Nowal, Akshay akshay_no...@syntelinc.com wrote:

 Hi All,



 I am running decision forests in Mahout; below are the commands that I
 have used to run the algorithm:



 Info file:

 mahout org.apache.mahout.df.tools.Describe -p
 /user/an32665/KDD/KDDTrain+.arff -f /user/an32665/KDD/KDDTrain+.info -d
 N 3 C 2 N C 4 N C 8 N 2 C 19 N L

 Building Forest:

 mahout org.apache.mahout.df.mapreduce.BuildForest
 -Dmapred.max.split.size=1874231 -oob -d /user/an32665/KDD/KDDTrain+.arff
 -ds /user/an32665/KDD/KDDTrain+.info -sl 5 -p -t 100 -o nsl-forest

 Testing Forest:

 mahout org.apache.mahout.df.mapreduce.TestForest -i
 /user/an32665/KDD/KDDTest+.arff -ds /user/an32665/KDD/KDDTrain+.info -m
 nsl-forest -a -mr -o predictions



 So while building the forest we use -p for the partial
 implementation. I just wanted to know the difference in the algorithm when
 we use -p and when we don't.





 Regards,

 Akshay Nowal