Approaches for combining multiple types of item data for user-user similarity

2012-07-03 Thread Ken Krugler
Hi all, I'm curious what approaches are recommended for generating user-user similarity, when I've got two (or more) distinct types of item data, both of which are fairly large. E.g. let's say I had a set of users where I knew both (a) what books they had bought on Amazon, and (b) what YouTube

Re: ItemSimilarity algorithm

2012-07-03 Thread Sean Owen
Item-item similarity is a property of the information you have on two items and just those items. Whether there are just those 2 items over 500K users, or 2M items over 500K users, makes no difference. So no, I don't think that this skew implies you should use any particular algorithm, by itself. I
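
To illustrate the point in the reply above, here is a minimal sketch in plain Python (not Mahout code). It assumes a Tanimoto (Jaccard) coefficient over the sets of users who expressed a preference for each item; the function name `tanimoto` and the toy catalogs are hypothetical. The similarity of two items comes out the same whether the catalog holds 2 items or about a thousand.

```python
def tanimoto(users_a, users_b):
    """Tanimoto (Jaccard) coefficient between two sets of user IDs."""
    a, b = set(users_a), set(users_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Hypothetical toy data: item ID -> set of user IDs with a preference for it.
small_catalog = {
    "item1": {1, 2, 3, 4},
    "item2": {3, 4, 5},
}

# Same two items, plus many unrelated items (assumed data, for illustration).
large_catalog = dict(small_catalog)
for i in range(3, 1000):
    large_catalog[f"item{i}"] = {i % 7, i % 11}

# Only the two items' own user sets enter the computation.
sim_small = tanimoto(small_catalog["item1"], small_catalog["item2"])
sim_large = tanimoto(large_catalog["item1"], large_catalog["item2"])
print(sim_small, sim_large)
```

The number of other items affects how many pairs have to be computed, not the value of any single pair.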

ItemSimilarity algorithm

2012-07-03 Thread Saikat Kanjilal
Hello everyone, I was reading through the documentation on the different ItemSimilarity algorithms in Mahout and had a question: if one has a scenario where the number of items is significantly smaller than the number of users (say 500,000 users to 1,000 items), are there particular item similarity

Re: Why each time the classification model trained by using TrainNewsGroup are not the same?

2012-07-03 Thread Ted Dunning
Because the order of the examples is randomized. On Tue, Jul 3, 2012 at 8:13 AM, Caspar Hsieh wrote: > I use Mahout classification example "TrainNewsGroup" to train the model > with leak type 3, then use "TestNewsGroups" to test the model, > then I re-trained the model and test again, the test r
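
A minimal sketch of why example order matters, assuming the trainer is an online (SGD-style) learner that updates its weights one example at a time. This is plain Python, not the TrainNewsGroup code; `train`, `train_shuffled`, and the two toy examples are hypothetical.

```python
import math
import random

def train(examples, rate=1.0):
    """One pass of online logistic regression over examples in the given order."""
    w = [0.0, 0.0]
    for x, y in examples:
        z = w[0] * x[0] + w[1] * x[1]
        p = 1.0 / (1.0 + math.exp(-z))      # predicted probability of class 1
        for j in range(2):
            w[j] += rate * (y - p) * x[j]   # gradient step for this one example
    return w

def train_shuffled(examples, seed):
    """Shuffle with a seeded generator, then train; same seed, same order."""
    order = list(examples)
    random.Random(seed).shuffle(order)
    return train(order)

# Hypothetical toy data: (features, label).
examples = [((1.0, 0.0), 1), ((1.0, 1.0), 0)]

forward = train(examples)
backward = train(list(reversed(examples)))
print(forward, backward)  # the same data in a different order gives different weights

# Fixing the seed fixes the order, so the result is reproducible.
print(train_shuffled(examples, seed=42) == train_shuffled(examples, seed=42))
```

Each update depends on the weights left by the previous one, so a different presentation order generally ends at a different model, even on identical data.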

Why each time the classification model trained by using TrainNewsGroup are not the same?

2012-07-03 Thread Caspar Hsieh
I use the Mahout classification example "TrainNewsGroup" to train the model with leak type 3, then use "TestNewsGroups" to test the model. Then I re-trained the model and tested again, and the test results were not the same as before. Why is the trained model different each time? Thanks.

Re: Does mahout classification depends on amount of data in each category?

2012-07-03 Thread Sean Owen
(Please don't "ping" your questions on the list -- bad form and makes people less likely to answer.) You do not have to have equal numbers of positive/negative examples. I think you need to go back and read up on the basics of how Bayesian classification works before you dig in to Mahout. This is

Re: Generating similarity file(s) for item recommender?

2012-07-03 Thread Sean Owen
I'm not sure if Mridul's suggestion does what you want. Do you want to recommend items to users? Then no, you do not start with item IDs and recommend to them. It sounds like your question is how to compute similarity data. The first answer is that you do not use Hadoop unless you must use Hadoop.

Re: Does mahout classification depends on amount of data in each category?

2012-07-03 Thread damodar shetyo
Can someone help me with this? Regards, Damodar On Tue, Jul 3, 2012 at 4:27 PM, damodar shetyo wrote: > Hi, > I plan to use the Mahout classification feature. I have a lot of data on which > I am planning to train my model. Now I have a few queries, as follows: > 1) Suppose I have 2 types of data: Spam

Re: Generating similarity file(s) for item recommender?

2012-07-03 Thread Matt Mitchell
Thanks Mridul, I'll try this out. Does getItemIDs return every item id from the file in your example? This kind of leads me to another, related question... I want to have my recommender engine recommend items to a user, but the items should be from a known set of item ids. For example, if a user i

Re: Generating similarity file(s) for item recommender?

2012-07-03 Thread Mridul Kapoor
> I'm thinking the session ID (in the cookie) would be used as the user ID. > The events > are tied to product IDs, so these would be used in generating the > preferences. I guess if you consider product-preference on a per session-basis (i.e. only items for which a user expresses preference for,

Generating similarity file(s) for item recommender?

2012-07-03 Thread Matt Mitchell
Hi, I'm just beginning to play with the Mahout recommendation framework. I'm wondering if I could get some advice for implementing this thing. My data comes from a web app's event logs, where user accounts are only persisted for 30 days -- cookie data. I'm thinking the session ID (in the co

Does mahout classification depends on amount of data in each category?

2012-07-03 Thread damodar shetyo
Hi, I plan to use the Mahout classification feature. I have a lot of data on which I am planning to train my model. Now I have a few queries, as follows: 1) Suppose I have 2 types of data: spam and not spam (this is just an example and not the real use case, but it is similar to my real use case). The amount of sp

Re: nutch and mahout integration

2012-07-03 Thread Alexander Aristov
Hi Lance, I understand that pages are pages, but Nutch stores pages in its own format while Mahout operates on other data formats. I would like to merge Nutch and Mahout with minimum effort, which is why I am asking what is easier: altering Mahout and implementing logic to read/write Nutch data, or impleme