Re: How does SVDRecommender work in mahout?

2012-04-29 Thread Sean Owen
They're implicitly zero as far as the math goes IIRC On Sun, Apr 29, 2012 at 10:45 PM, Daniel Quach wrote: > ah sorry, I meant in the context of the SVDRecommender. > > Your earlier email mentioned that the DataModel does NOT do any subtraction, > nor add back in the end, ensuring the matrix rem

Re: integrating databases

2012-04-29 Thread Sean Owen
Can you connect to an Oracle database? Sure, just do so. I think the SQL just works, but you'll find out. On Mon, Apr 30, 2012 at 2:36 AM, Amrhal Lelasm wrote: > > I had a nice week playing with the Mahout CF Library and its > MySQLJDBCDataModel to get the input data from a database. > But then

Re: integrating databases

2012-04-30 Thread Sean Owen
You would have to write code to manually splice together the data. You'd also have to figure out what to do with updates -- which DB gets it? Or you may say you don't have updates. Querying two DBs and manually combining their results is even slower than one. You're going to have to get all the da

Re: integrating databases

2012-05-01 Thread Sean Owen
(I think the question is more blending two data sources than two recommenders.) On Tue, May 1, 2012 at 9:37 AM, Manuel Blechschmidt < manuel.blechschm...@gmx.de> wrote: > Hi Amrhal, > combining data for a recommender from two data sources is current > research. Search for ensemble learning or ble

Re: Problem Running org.apache.mahout.cf.taste.hadoop.item.RecommenderJob on Hadoop

2012-05-03 Thread Sean Owen
The format is always "user,item,pref" -- I think it makes that pretty clear. On Thu, May 3, 2012 at 7:25 AM, Utkarsh Gupta wrote: > Hi All, > > I am new to Mahout and I am currently reading Mahout in Action. > I was trying to run the RecommenderJob as explained in chapter 6 of this > book with Wi
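
To make the "user,item,pref" format concrete, here is a minimal sketch of a line parser. It is not Mahout's own parsing code; the class and field names (`PrefLineParser`, `Pref`) are made up for illustration, and treating a missing third field as a preference of 1.0 is an assumption.

```java
final class PrefLineParser {

    /** One parsed preference. Field names are illustrative, not Mahout's. */
    static final class Pref {
        final long userID;
        final long itemID;
        final float value;

        Pref(long userID, long itemID, float value) {
            this.userID = userID;
            this.itemID = itemID;
            this.value = value;
        }
    }

    /** Parses one "user,item,pref" line; "user,item" is accepted too. */
    static Pref parse(String line) {
        String[] parts = line.trim().split(",");
        if (parts.length < 2 || parts.length > 3) {
            throw new IllegalArgumentException("Expected user,item[,pref]: " + line);
        }
        long user = Long.parseLong(parts[0].trim());
        long item = Long.parseLong(parts[1].trim());
        // Assumption: a line without a preference value counts as 1.0.
        float value = parts.length == 3 ? Float.parseFloat(parts[2].trim()) : 1.0f;
        return new Pref(user, item, value);
    }

    public static void main(String[] args) {
        Pref p = parse("1,101,3.5");
        System.out.println(p.userID + " " + p.itemID + " " + p.value);
    }
}
```

Running a few lines of the real input through a check like this before submitting the job is a cheap way to catch input that is not in the expected format.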

Re: Mahout + BigDataR Linux

2012-05-03 Thread Sean Owen
*V*owpal Wabbit ? :) On Thu, May 3, 2012 at 5:32 PM, Ted Dunning wrote: > Gently here: > > You misspelled woWpal wabbit. > >

Re: Re: Mahout + BigDataR Linux

2012-05-03 Thread Sean Owen
A machine image is not the only deployment model to be sure, but is the kind of deployment model you need if you're offering something as a cloud service. I'm a big fan of the AWS Marketplace which of course is based on this kind of model. (I'm also about to make the stand-alone Myrrix server avail

Re: Recommendation scores from LogLikelihood Similarity recommender

2012-05-06 Thread Sean Owen
That sounds a lot like something that the cosine similarity would pick up on for sure. On Sun, May 6, 2012 at 6:48 PM, Will C wrote: > So I've taken another try at using recommendations values. However, unlike > something that a user is explicitly rating on a scale of 0-5. I am using a > user's

Re: how to implement item-based recommender on movie genre data?

2012-05-08 Thread Sean Owen
So you have already decided, for each movie, whether it's in or not in each genre? And then you want to create a "profile" -- assuming you mean some kind of meta-genre? This isn't a recommender problem; it's just a clustering problem. I'd use the Tanimoto similarity. You could run the clustering-b

Re: How to index by long ID in RandomAccessSparseVector

2012-05-08 Thread Sean Owen
That's right. It ought to be uncommon but can happen. For recommenders, it "only" means that you start to treat two users or two items as the same thing. That doesn't do much harm though. Maybe one user's recs are a little funny. I do think it would have been useful to index by long, but that woul

Re: Exclusing certain ratings when running recommender

2012-05-08 Thread Sean Owen
Actually that's how IDRescorer already works. It will filter before scoring. On Wed, May 9, 2012 at 2:00 AM, Mugoma Joseph Okomba wrote: > Hello, > > I have database with ratings: 1,2,3,4 > > However, when running the recommender I would, in some cases, want to > exclude items with rating 4. > >

Re: Exclusing certain ratings when running recommender

2012-05-09 Thread Sean Owen
at 7:21 AM, Mugoma Joseph Okomba wrote: > On Wed, May 9, 2012 7:37 am, Sean Owen wrote: > > Actually that's how IDRescorer already works. It will filter before > > scoring. > > > > Does 'before scoring' mean before the recommender extracts recommendations? >

Re: Exclusing certain ratings when running recommender

2012-05-09 Thread Sean Owen
Trust me, I'm telling you how it works since I wrote it. What is unclear about the steps I listed below? Filtering happens before even *scoring*. It works exactly how you want it to. If you're still confused please just read the source code. On Wed, May 9, 2012 at 9:08 AM, Mugoma Joseph Okomba wr

Re: Exclusing certain ratings when running recommender

2012-05-09 Thread Sean Owen
What do you mean "original items"? The user's preferred items are already not candidates for recommendation, but that is nothing to do with the rescorer. It operates on all *candidate* items, *before* scoring. What is your distinction between filtering *recommended* items and *original* items? Eit

Re: Exclusing certain ratings when running recommender

2012-05-09 Thread Sean Owen
f ratings to feed into the recommender so that the > recommender sees 7 ratings instead of 10. But this doesn't look intuitive > so I thought there's a better way of handling this within mahout. > > Probably what I need is a new data model that overwrites > getPreferencesFrom

Re: From Item-based Recommender to User-based Recommender

2012-05-09 Thread Sean Owen
Both of these have been implemented since before Mahout. Can you clarify? Don't understand at the moment. On May 9, 2012 3:42 PM, "冯伟" wrote: > I found that only item-based recommender is implemented in version-0.6. > When I want to use the user-based recommender, all I need to do is to > transpo

Re: From Item-based Recommender to User-based Recommender

2012-05-09 Thread Sean Owen
ndations. When I exchange the postions of uid and itemid, the input > file would be > "itemid, uid, rating" > "itemid, uid, rating" > "itemid, uid, rating" > According to the above input, > org.apache.mahout.cf.taste.hadoop.item.RecommenderJob would gen

Re: Theory question about Pearson Correlation and user based recommender

2012-05-09 Thread Sean Owen
True. But that neighborhood of more-similar users is smaller, and the things they know about are fewer. Maybe better recommendations are found attached to users a bit farther away, that are not recommendable when considering only a very small and close neighborhood. On Wed, May 9, 2012 at 5:13 PM,

Re: how to implement item-based recommender on movie genre data?

2012-05-10 Thread Sean Owen
om an efficiency, accuracy, and > machine learning standpoint, I am not an expert on the subject at all. > > On May 8, 2012, at 12:58 AM, Sean Owen wrote: > > > So you have already decided, for each movie, whether it's in or not in > each > > genre? And then you

Re: Extra low speed mahout distribution with hadoop

2012-05-10 Thread Sean Owen
Hadoop just about always makes things slower, in terms of total resources needed. It adds a lot of overhead; such is the price of parallelism. My rule of thumb is that Hadoop-based algorithms will, all else equal, take 4x more CPU hours. But of course Hadoop lets you distribute. However, I doubt t

Re: Extra low speed mahout distribution with hadoop

2012-05-10 Thread Sean Owen
at run-time. This is the architecture I'm building in the (Mahout-based) Myrrix recommender engine (myrrix.com) Sean On Thu, May 10, 2012 at 10:53 AM, Maksim Areshkau wrote: > On 10.05.12 12:31, "Sean Owen" wrote: > > >>Hadoop just about always makes things slower,

Re: Improve Recommendations

2012-05-10 Thread Sean Owen
The best, or perhaps only, way to integrate such information is to implement your own version of UserSimilarity or ItemSimilarity and then use a user-based or item-based recommender. You can implement whatever similarity rule you think is best according to your metadata. There's not a lot to be sai

Re: Improve Recommendations

2012-05-10 Thread Sean Owen
Thu, May 10, 2012 at 11:52 AM, Sean Owen wrote: > >> The best, or perhaps only, way to integrate such information is to >> implement your own version of UserSimilarity or ItemSimilarity and >> then use a user-based or item-based recommender. You can implement >> whateve

Re: Some guidance for this noob - "Metadata Matching Engine"

2012-05-10 Thread Sean Owen
It's closest to a clustering problem. Because your clusters are so particular -- the elements are very close to each other, very distinct from others -- it reduces to something similar. If you had a good similarity metric for docs, you would just match a new doc against each other doc and figure o

Re: Recommender with ratings takes a long time to process

2012-05-11 Thread Sean Owen
You need to apply a CandidateItemStrategy to reduce the number of elements you consider, or else it will take a very long time because almost the entire model is a candidate for recommendation. On Fri, May 11, 2012 at 6:18 PM, Emilio Suarez wrote: > Hi there, > > The usual setting for the Mahout

Re: Recommender with ratings takes a long time to process

2012-05-11 Thread Sean Owen
Yes, you want the sampling one so you can reduce the number of neighbors you consider. On Fri, May 11, 2012 at 6:47 PM, Emilio Suarez wrote: > Thanks Sean, > > So, do you suggest something like this? > >        LogLikelihoodSimilarity similarity = new > LogLikelihoodSimilarity(fileDataModel); >

Re: Recommender with item features

2012-05-12 Thread Sean Owen
You can write your own ItemSimilarity metric based on the features and then use an item-based recommender. That piece you'd have to do yourself by making up some notion of similarity; if the features were all numeric and normalized you can look at repurposing something based on Euclidean distance o

Re: Persistent Data Model

2012-05-14 Thread Sean Owen
Can you persist a DataModel? Sure. The easiest thing is to read/write a CSV file. Or put the data in a database. The existing implementations already read such things into memory for you. I am not sure what you mean about computing the DataModel as a separate process. The DataModel exists and shoul

Re: Persistent Data Model

2012-05-14 Thread Sean Owen
Yes, you don't want to use the database directly. A relational database will never be fast enough. You want to use RefreshFromJDBCDataModel to load the data into memory periodically. This is really what I was referring to. On Mon, May 14, 2012 at 1:49 PM, Nikolaos Romanos Katsipoulakis wrote: >

Re: Persistent Data Model

2012-05-15 Thread Sean Owen
2012 05:09 PM, Sean Owen wrote: >> >> Yes, you don't want to use the database directly. A relational >> database will never be fast enough. >> >> You want to use RefreshFromJDBCDataModel to load the data into memory >> periodically. This is really what I w

Re: CachingRecommender versus Recommender

2012-05-15 Thread Sean Owen
Recommender is the interface that all implementations implement. CachingRecommender wraps any other instance of Recommender and transparently caches its responses. You would use this in a case where you have memory to spare and want to save CPU by not recomputing recs per user frequently. This he

Re: Save a UserSimilarity in a File

2012-05-17 Thread Sean Owen
I think it will be way too slow to update a file like this, as you must seek and rewrite the whole thing. Files aren't for this sort of update. A DB might be. Even if it were it is probably too slow to read the similarity from backing store each time. It needs to be mostly in memory. That's why th

Re: Save a UserSimilarity in a File

2012-05-17 Thread Sean Owen
ote: > On 05/17/2012 02:06 PM, Sean Owen wrote: >> >> I think it will be way too slow to update a file like this, as you must >> seek and rewrite the whole thing. Files aren't for this sort of update. A >> DB might be. >> >> Even if it were it is probably to

Re: Help with running taste-demo on mahout-examples-0.7-SNAPSHOT.jar

2012-05-18 Thread Sean Owen
(You can use -DskipTests in Maven. You don't need to run the very lengthy tests.) The bad news is that this example only worked in 0.5, and was removed in 0.6. The underlying pieces are still there, you just would have to assemble the WAR yourself. I'll try to figure out how to remove this; I did

Re: Help with running taste-demo on mahout-examples-0.7-SNAPSHOT.jar

2012-05-18 Thread Sean Owen
> best, > DJ > > On Fri, May 18, 2012 at 11:12 AM, Sean Owen wrote: > >> (You can use -DskipTests in Maven. You don't need to run the very >> lengthy tests.) >> >> The bad news is that this example only worked in 0.5, and was removed >> in 0.6. The

Re: How to approach this? Classification vs Recommendation

2012-05-18 Thread Sean Owen
Trivially it's four classifiers. You have just one input here, and it's binary. That seems like too little info to discriminate on. All you can learn -- and it doesn't really need a classifier algorithm -- is there's an x% chance of encountering problem a if funded, and (100-x)% of a if not. On Fr

Re: Mixing simiarity measures

2012-05-19 Thread Sean Owen
On Sun, May 20, 2012 at 2:31 AM, Mugoma Joseph Okomba wrote: > E.g we use PearsonCorrelationSimilarity to get similarity between users > but find that only overlaps in ratings between the 2 users are being > considered in final result, without consideration of the overall > population. Yes, this

Re: Help with running taste-demo on mahout-examples-0.7-SNAPSHOT.jar

2012-05-20 Thread Sean Owen
On Sun, May 20, 2012 at 6:43 PM, Dhananjay Sampath wrote: > My eventual plan is to use the mongo-mahout lib to push data in and out of > mongo. A rails app will eventually read from it and/or update it. So while > the packaging is not entirely a deal-breaker, a purely subjective opinion > of mine

Re: Getting Recommendation for User not in Input Data

2012-05-21 Thread Sean Owen
You can't get recommendations for that user -- there is no data on which to recommend anything. The system knows nothing about the user. The best you can do is temporarily add the user. See PlusAnonymousUserDataModel. On Mon, May 21, 2012 at 1:28 PM, Utkarsh Gupta wrote: > Hi All, > > > I am usi

Re: Getting Recommendation for User not in Input Data

2012-05-22 Thread Sean Owen
For that you simply split the data into test/training data, and compare the test data to the training results. This is all already done for you in GenericRecommenderIRStatsEvaluator. On Tue, May 22, 2012 at 5:52 AM, Utkarsh Gupta wrote: > Hi Sean, > > Thanks for your quick reply :) > I have looke

Re: Any other names for generic user/item based recommenders?

2012-05-24 Thread Sean Owen
"User-based" and "item-based" are commonly used. I might also refer to them both as similarity-based and/or neighborhood-based approaches. On Fri, May 25, 2012 at 6:06 AM, Daniel Quach wrote: > I'm writing a short paper about recommender systems and I have been using > mahout's generic item base

Re: RecommenderJob Hadoop execution times

2012-05-29 Thread Sean Owen
I am almost certain it is the combiner phase. The mappers are locally "compacting" the output so much less must be sent to the reducer. You can often speed it up by increasing io.sort.factor (merge more ways) and io.sort.mb (give more space for merging in memory). On Tue, May 29, 2012 at 12:27 PM,

Re: RecommenderJob not working for boolean Data?

2012-05-29 Thread Sean Owen
Did you set the flag for boolean data, --booleanData? Processing is somewhat different without ratings, since you are no longer predicting ratings. On Tue, May 29, 2012 at 10:58 PM, Oliver B. Fischer wrote: > Hi all, > > today tried to use RecommenderJob to produce some simple recommendations. >

Re: RecommenderJob not working for boolean Data?

2012-05-29 Thread Sean Owen
Oops, you said that. What are your input rows like? What similarity metric -- one that doesn't need ratings? On Tue, May 29, 2012 at 11:30 PM, Oliver B. Fischer wrote: > Yes, I tried --booleanData and --booleanData true. The result was the same. > > Prior that I tried the Wikipedia link database

Re: Server sizing Hadoop + Mahout

2012-05-30 Thread Sean Owen
You haven't even said what algorithm. It even depends on the distribution of your data, in addition to amount, not to mention the type of servers, configuration, etc. It's impossible to give a meaningful baseline. You can run your real data on a real cluster to get some notion. Run-time and require

Re: Training Data and Precision/Recall evaluation

2012-05-30 Thread Sean Owen
Yes, because it tests on a user-by-user basis. There's not the same notion of test/training set. Each user is split individually one at a time, according to the "at" parameter. On Wed, May 30, 2012 at 10:16 AM, Daniel Quach wrote: > I want to use the GenericRecommenderIRStatsEvaluator to get > p

Re: RecommenderJob Hadoop execution times

2012-05-30 Thread Sean Owen
I don't know if you'd observe much difference -- it's not a "bug". It does take time to combine the records. It would take longer if you didn't. Heap size is not relevant here, but yes that is a function of the Hadoop config. Have a look at the source code to see what generates which output and whe

Re: Clustering a large crawl

2012-05-31 Thread Sean Owen
On Thu, May 31, 2012 at 12:36 AM, Pat Ferrel wrote: > I see >double denominator = Math.sqrt(lengthSquaredp1) * > Math.sqrt(lengthSquaredp2); >// correct for floating-point rounding errors >if (denominator < dotProduct) { > denominator = dotProduct; >} >return 1.0 - dotPro

Re: NetflixRecommender Data

2012-06-01 Thread Sean Owen
It is no longer officially available because of the lawsuit against Netflix. On Jun 1, 2012 10:05 PM, "Swapna" wrote: > I want to run the examples but do not know where to get the netflix data > from. > > >

Re: NetflixRecommender Data

2012-06-02 Thread Sean Owen
Yes it was never a great example as it was not distributed. I'll remove after the release. On Jun 2, 2012 3:38 PM, "Isabel Drost" wrote: > On 01.06.2012 Sean Owen wrote: > > It is no longer officially available because of the lawsuit against > > Netflix. > > H

Re: How to make recommendations using ALS

2012-06-04 Thread Sean Owen
Yes, that's how you do it. You just keep the top N. This is typically quite fast and is parallelizable across cores trivially. Yes you can also mix in neighborhood based techniques. You can calculate user and item similarities in feature space, fast. Cosine similarity is fine in this space. It wou
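
As a rough illustration of computing similarity in feature space, the sketch below takes two factor vectors (for example two items' feature vectors from ALS) and returns their cosine similarity. The class name `FeatureSpaceSimilarity` is hypothetical, and returning 0 for an all-zero vector is an assumption, not something the message specifies.

```java
final class FeatureSpaceSimilarity {

    /** Cosine similarity between two equal-length feature vectors. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        // Assumption: an all-zero vector is treated as similar to nothing.
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        System.out.println(cosine(new double[] {1, 2}, new double[] {2, 4}));
    }
}
```

Because the vectors have only as many entries as there are features, each comparison is cheap, which is why this kind of similarity is fast to compute.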

Re: ItemSimilarityJob

2012-06-04 Thread Sean Owen
That's how it used to work but it was restricted to integers a long time ago purely for speed and memory. It makes a big difference. Many (most?) use cases have some numeric ID for these guys already. Otherwise no reason it needs to be an integer it just needs to have an ordering. You can retain

Re: libjars not working for ItemSimilarityJob

2012-06-04 Thread Sean Owen
That's a Hadoop flag not Mahout, right? On Jun 4, 2012 10:49 PM, "Something Something" wrote: > Hello, > > Trying to use the 'ItemSimilarityJob', but the '--libjars' argument doesn't > seem to work. Anybody run into similar issue? Please let me know. > Thanks. >

Re: libjars not working for ItemSimilarityJob

2012-06-05 Thread Sean Owen
jobs. My other MR jobs seem > to run successfully & seem to recognize this param. Anyway, let me look > into it some more before asking on the Hadoop mailing list. > > On Mon, Jun 4, 2012 at 3:29 PM, Sean Owen wrote: > > > That's a Hadoop flag not Mahout, right?

Re: ItemSimilarityJob

2012-06-05 Thread Sean Owen
> sure what you mean by 'needs to have an ordering'. > > > > > > On Mon, Jun 4, 2012 at 3:29 PM, Sean Owen wrote: > > > >> That's how it used to work but it was restricted to integers a long time > >> ago purely for speed and memory. It

Re: Web Service Interface for triggering a Hadoop Job

2012-06-05 Thread Sean Owen
-D arguments are to the JVM. You can't pass args to the JVM here and you are passing them to the program instead. These are really just setting key value pairs in the Configuration object. So just do that instead. In general I don't think this is a good design for a long running Hadoop job to be

Re: ItemSimilarityJob creates no output

2012-06-05 Thread Sean Owen
Is your input very small? It is probably getting mostly pruned as a result, as most of it looks like low-count data. And then there is almost no info on which to compute similarity. On Tue, Jun 5, 2012 at 7:13 PM, Something Something wrote: > One thing I noticed is that in step 4 of this process

Re: Frequent itemset mining

2012-06-05 Thread Sean Owen
It wouldn't surprise me, though I don't know this implementation or your setup. Locally, you're not really running Hadoop -- it's all local, and there is no HDFS to replicate and such. You are saving the big overhead of shuffling data across machines, and the overhead of starting new workers. For s

Re: Java Heap Error: ItemSimilarityJob

2012-06-06 Thread Sean Owen
You need to increase the size of the children's heap. mapred.child.java.opts can be set to -Xmx4g for example. This is usually put in mapred-site.xml. Sampling does decrease the size of the intermediate outputs; probably not the final output so much. But this is not your problem. You are running o

Re: compiling Mahout in Intellij IDE

2012-06-06 Thread Sean Owen
I don't get any errors... what do you get? Are you perhaps not getting the top-level pom.xml which defines hadoop.version? Mahout uses Hadoop. That's why you need Hadoop. On Wed, Jun 6, 2012 at 10:58 AM, Yaprak Ayazoglu wrote: > Hi, > > I'm a newbie for Mahout project. I'm trying to "compile" the

Re: compiling Mahout in Intellij IDE

2012-06-06 Thread Sean Owen
The error from Maven indicates your installation is messed up. No you do not need to install Hadoop. But Mahout needs Hadoop to compile. Maven takes care of that for you, and it just works. It sounds like you have a Maven problem locally. On Wed, Jun 6, 2012 at 11:14 AM, Yaprak Ayazoglu wrote:

Re: ItemSimilarityJob creates no output

2012-06-06 Thread Sean Owen
That sounds like plenty of data -- doubting that's any issue. Is it very sparse? Meaning many items exist just for one user? It's really sparseness that might produce few or no similarities. I think something else is at work here but don't know off the top of my head based on the info so far. Yes

Re: ItemSimilarityJob creates no output

2012-06-06 Thread Sean Owen
s fairly similar. Is there a > place where I can upload part of my file for someone else to try? > > OR BETTER YET - Can someone provide a small file that always returns a few > similarities? Does a file such as this included in the source? > > Thanks for the help. > > On

Re: Can't get correct co-occurrence count

2012-06-06 Thread Sean Owen
This isn't much info. What counts? What is your relevant data? Is it not simply being pruned? On Jun 6, 2012 6:26 PM, "邓路" wrote: > Hi All: > > I can't get correct co-occurrence count from the ItemSimilarity job; > > The following parameters are used: --similarityClassname, > "SIMILARITY_

Re: ItemIDs in order of input

2012-06-07 Thread Sean Owen
Define "previous item"? Data points have no ordering in the framework. It sounds like something you will have to track yourself. On Thu, Jun 7, 2012 at 8:27 AM, chanju Jeon wrote: > How can I get ItemIDs in order of input from DataModel? > > I want to know previous item when estimate preference f

Re: ItemIDs in order of input

2012-06-07 Thread Sean Owen
There is nothing for this in mahout, since the ordering of IDs generally has no meaning. On Thu, Jun 7, 2012 at 10:01 AM, chanju Jeon wrote: > When input-data have [1,101,3 / 1,105,2 / 1,103,5 / 1,104,1 / 1,102,4], > I want to get item-list which order-list is written like [101, 105, 103, > 104,

Re: Scalability of ParallelALSFactorizationJob with implicit feedback

2012-06-11 Thread Sean Owen
Not so with ALS. The matrix in question is (# users) x (# features), so the number of rated items by any user won't matter. I didn't write this job, but implemented a similar pipeline. I struggled with this kind of tradeoff: loading the user feature matrix in memory is a scalability bottleneck (st

Re: Scalability of ParallelALSFactorizationJob with implicit feedback

2012-06-11 Thread Sean Owen
ParallelALSFactorizationJob? No, not based on co-occurrence. This is just a matrix-factorization approach using alternating least squares as opposed to say SGD to solve for the factors. On Mon, Jun 11, 2012 at 8:28 PM, Ted Dunning wrote: > Decomposition techniques do help with this, but it sound

Re: A question about "RecommenderIntro.java" example in" Mahout in Action" book

2012-06-13 Thread Sean Owen
(Since you also asked in the publisher's forum, and it pertains to the book, I will answer there rather than on the general Mahout mailing list.) On Wed, Jun 13, 2012 at 2:25 PM, Yaprak Ayazoglu wrote: > Hi, > > I'm following "Mahout in Action" book to learn Mahout. I'm trying to run > the exampl

Re: How to recommend users?

2012-06-14 Thread Sean Owen
To recommend users to users, you need some kind of user-user interaction data. You are right, you don't have that directly. But at first you described this as merely finding similar users. For that, you can use your data. You don't need a Recommender even. You just need any implementation of UserSi

Re: Few items in the recommendation list

2012-06-14 Thread Sean Owen
The problem is the sparseness of your data. On average, each user made about 1.3 ratings. Few users even had 2, I'd imagine. So, it is hard to establish any similarity between any two users, because most users overlap in 0 or 1 items, and that means Pearson correlation is undefined. (Using log-lik
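
To see why an overlap of 0 or 1 items leaves the Pearson correlation undefined, here is a small self-contained sketch. It is a textbook Pearson computation over the co-rated items of two users, not Mahout's implementation, and the class name `PearsonOverlap` is made up.

```java
final class PearsonOverlap {

    /**
     * Pearson correlation over the ratings two users gave to the items
     * they have in common. Returns NaN when it is undefined.
     */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("Need one rating per shared item from each user");
        }
        int n = x.length;
        double sumX = 0.0;
        double sumY = 0.0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
        }
        double meanX = sumX / n;
        double meanY = sumY / n;
        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        double denominator = Math.sqrt(sxx) * Math.sqrt(syy);
        // With 0 or 1 shared items (or constant ratings) the variance is 0.
        return denominator == 0.0 ? Double.NaN : sxy / denominator;
    }

    public static void main(String[] args) {
        System.out.println(pearson(new double[] {4}, new double[] {5}));
    }
}
```

With a single shared item both deviations from the mean are zero, so the denominator vanishes and no correlation can be computed, whatever the two ratings are.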

Re: Few items in the recommendation list

2012-06-15 Thread Sean Owen
ender, It gives me prediction for 656 > items :D (I think the other items that are left also have prediction of 0) > > The recommender has an average absolute deviation of ~1.13, I'll take a > look at myrrix > > Thanks again. > . > > 2012/6/14 Sean Owen > > > T

Re: A question about "RecommenderIntro.java" example in" Mahout in Action" book

2012-06-17 Thread Sean Owen
rsion from svn I was not > able to get it to build (it seemed to be missing a few jar files but I > forget which ones), but that's a different issue. Right now I am trying to > figure out how to run the first examples in some version of Mahout. If > anybody knows how to run the exam

Re: Mahout in Action Ch.05 Example

2012-06-18 Thread Sean Owen
It wasn't in 0.6 either -- it has since been removed. The book goes with version 0.5. Packaging a WAR file is just a matter of constructing a zip file with the right structure. I can explain it or you can look it up, but if you're following the book, best to just use 0.5. Sean On Mon, Jun 18, 20

Re: Mahout in Action Ch.06 Hadoop RecommenderJob Example

2012-06-19 Thread Sean Owen
... but this is a problem to do with bad input, it seems. And the book examples go with 0.5. The bug you are thinking of does not affect anything written about in the book. 0.5 is the right version to use as far as the book examples are concerned. What format is your input? should be "user,item,ra

Re: MinHash implementation thoughts

2012-06-19 Thread Sean Owen
Lance, what are you referring to here? Sam if you have some patch to suggest that would be good, open a JIRA issue. On Tue, Jun 19, 2012 at 10:05 AM, Lance Norskog wrote: > Some Mahout people are not sure that the MinHash implementation is > right. You should try your versions of how the algorith

Re: Mahout in Action Ch.06 Hadoop RecommenderJob Example

2012-06-19 Thread Sean Owen
Yes, it also says that you need to either translate this file into canonical "user,item" CSV format, or, swap in the Mapper provided in the book in place of the normal first Mapper to read this data. This file by itself doesn't work with the project since it is not in the right format. On Tue, Jun

Re: Mahout in Action Book (printed) already available from MEAP orders in Germany?

2012-06-19 Thread Sean Owen
(Since this is the project user list, probably not the right place to ask. Manning should address it directly with you.) On Tue, Jun 19, 2012 at 6:36 PM, Christoph Hermann wrote: > Hey there, > > i ordered my Mahout in Action through the MEAP program on 11.2.2010 and have > not yet received my so

Re: NNMF Recommendations

2012-06-20 Thread Sean Owen
I'd say it this way: the matrix is *approximately* factored as Vapprox = W * H'. Vapprox is like V but has slightly different values. But it has values everywhere, not just where the 1s were. And so you just look at the user's row in Vapprox and pick the top values (ignoring items that were already
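
The step of reading a user's row of Vapprox and keeping the top values can be sketched as follows. The sketch assumes W is stored as one factor vector per user and H as one factor vector per item, so an entry of Vapprox is a dot product; the names `FactorizedTopN` and `recommend` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class FactorizedTopN {

    /**
     * Scores every item for one user as the dot product of the user's
     * factor vector with the item's factor vector, skips items the user
     * already has, and returns the indices of the best-scoring items.
     */
    static List<Integer> recommend(double[] userFactors,
                                   double[][] itemFactors,
                                   Set<Integer> alreadySeen,
                                   int howMany) {
        final double[] scores = new double[itemFactors.length];
        List<Integer> candidates = new ArrayList<>();
        for (int item = 0; item < itemFactors.length; item++) {
            if (alreadySeen.contains(item)) {
                continue;
            }
            double score = 0.0;
            for (int f = 0; f < userFactors.length; f++) {
                score += userFactors[f] * itemFactors[item][f];
            }
            scores[item] = score;
            candidates.add(item);
        }
        // Highest estimated value first.
        candidates.sort((a, b) -> Double.compare(scores[b], scores[a]));
        return new ArrayList<>(candidates.subList(0, Math.min(howMany, candidates.size())));
    }
}
```

Only the one row of Vapprox that belongs to the user is ever computed, so the full dense product never has to be materialized.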

Re: Scalability of ParallelALSFactorizationJob with implicit feedback

2012-06-20 Thread Sean Owen
uched by a reducer. On Mon, Jun 11, 2012 at 8:19 PM, Sean Owen wrote: > If you like ALS on Hadoop, I don't mind again plugging the Myrrix > Computation Layer (http://myrrix.com/documentation-computation-layer/), a > sort of spin off of this kind of work I've been doing in Mahout

Re: Kmeans clustering with Tanimoto distance measure in Mahout

2012-06-21 Thread Sean Owen
Granted I too may be missing something since I am not familiar with the code so much, but... The Tanimoto "distance" isn't a proper distance metric, is it? Not when defined over real-valued vectors like it is here. That seems like the root issue here. I'm pretty sure we need a distance metric in cl

Re: Kmeans clustering with Tanimoto distance measure in Mahout

2012-06-21 Thread Sean Owen
Erm, I think I am thinking of canopy clustering. For k-means, I suppose you could say the choice of the k means isn't quite right if k-1 of them are nowhere near most points. I don't know how they were chosen. But again maybe not the real issue at heart here. On Thu, Jun 21, 2012 at 12:36 PM, Shl

Re: Kmeans clustering with Tanimoto distance measure in Mahout

2012-06-21 Thread Sean Owen
Rather than reinvent the wheel here, I'd stick to more well-understood metrics. I did my homework and indeed the generalized Tanimoto distance is not a distance metric. It would be, if all values were 0 or 1. So, try rounding the vector coordinates to 0 or 1. You have anecdotal evidence that this
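
A minimal sketch of the rounding suggestion: each coordinate is first forced to 0 or 1, and the Tanimoto (Jaccard) distance is computed on the result. The 0.5 cutoff and the class name `BinaryTanimoto` are assumptions made for illustration; this is not Mahout's TanimotoDistanceMeasure.

```java
final class BinaryTanimoto {

    /** Assumed rounding rule: a coordinate of 0.5 or more counts as 1. */
    private static boolean isSet(double value) {
        return value >= 0.5;
    }

    /** Tanimoto distance after rounding both vectors to 0/1 coordinates. */
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        int both = 0;
        int either = 0;
        for (int i = 0; i < a.length; i++) {
            boolean inA = isSet(a[i]);
            boolean inB = isSet(b[i]);
            if (inA && inB) {
                both++;
            }
            if (inA || inB) {
                either++;
            }
        }
        // Two all-zero vectors are identical, so their distance is 0.
        return either == 0 ? 0.0 : 1.0 - (double) both / either;
    }

    public static void main(String[] args) {
        System.out.println(distance(new double[] {1, 1, 0, 0}, new double[] {1, 0, 1, 0}));
    }
}
```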

Re: "Direction" of co-occurence and log-likelihood ratio

2012-06-21 Thread Sean Owen
On Thu, Jun 21, 2012 at 9:01 PM, Nimrod Priell wrote: > On a completely different subject: I wrote a simple RelevantItemsDataSplitter > and RecommenderIRStatsEvaluator which take a list of item IDs, and run CF > evaluation by hiding items only out of that list, and asking to recommend > only ou

Re: "Direction" of co-occurence and log-likelihood ratio

2012-06-21 Thread Sean Owen
Is this not just a matter of comparing the frequency of "the" with "the the"? If "the" is 1/n of the words, then "the the" ought to be 1/n^2. If it's less, it's under-represented. On Thu, Jun 21, 2012 at 9:01 PM, Nimrod Priell wrote: > I am wondering if there's a way to detect whether the deviati
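
The comparison described above can be sketched as a ratio of the observed frequency of a repeated word to the frequency expected if adjacent words were independent. The class name `RepeatedWordRatio` and the tokenized input are illustrative assumptions; a ratio below 1 means the pair is under-represented.

```java
final class RepeatedWordRatio {

    /**
     * Observed frequency of the bigram "word word" divided by the
     * expected frequency p(word)^2 under independence.
     */
    static double ratio(String[] tokens, String word) {
        if (tokens.length < 2) {
            throw new IllegalArgumentException("Need at least two tokens");
        }
        int count = 0;
        int pairs = 0;
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(word)) {
                count++;
                if (i + 1 < tokens.length && tokens[i + 1].equals(word)) {
                    pairs++;
                }
            }
        }
        if (count == 0) {
            throw new IllegalArgumentException("Word does not occur: " + word);
        }
        double p = (double) count / tokens.length;              // the "1/n" in the text
        double observed = (double) pairs / (tokens.length - 1); // share of all bigrams
        return observed / (p * p);
    }

    public static void main(String[] args) {
        String[] tokens = "the cat saw the the dog".split(" ");
        System.out.println(ratio(tokens, "the"));
    }
}
```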

Re: Performance issue with Item-based Recommendation and User-based Recommendation

2012-06-21 Thread Sean Owen
I would suggest pruning similarities near 0, and then treating missing similarities as 0 later at runtime. It may take a bit of coding. But you should be able to throw away a lot without compromising much of the result. On Thu, Jun 21, 2012 at 10:16 PM, Way Cool wrote: > Hi, guys, > > For item-ba
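
One way to picture the pruning idea is a store that drops similarities near 0 when they are written and answers 0 for anything it does not hold. This is only a sketch under assumptions: the class `PrunedSimilarityStore`, the string key, and the threshold value are all made up here, and none of it is Mahout API.

```java
import java.util.HashMap;
import java.util.Map;

final class PrunedSimilarityStore {

    private final Map<String, Double> kept = new HashMap<>();
    private final double threshold;

    PrunedSimilarityStore(double threshold) {
        this.threshold = threshold;
    }

    /** Order-independent key, so (a, b) and (b, a) share one entry. */
    private static String key(long a, long b) {
        return Math.min(a, b) + ":" + Math.max(a, b);
    }

    /** Stores the similarity only if it is not close to 0. */
    void put(long a, long b, double similarity) {
        if (Math.abs(similarity) >= threshold) {
            kept.put(key(a, b), similarity);
        }
    }

    /** Missing pairs are treated as similarity 0 at lookup time. */
    double get(long a, long b) {
        Double similarity = kept.get(key(a, b));
        return similarity == null ? 0.0 : similarity;
    }

    int size() {
        return kept.size();
    }
}
```

How much memory this saves depends on the threshold and on how many pairs sit near 0, which is something to measure on the real data.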

Re: "Direction" of co-occurence and log-likelihood ratio

2012-06-21 Thread Sean Owen
The idea is sound but for a different and I think stronger reason. For this kind of test you need to hold out items that are some of the best recommendations, since that's what the recommender is trying to find. Holding out random items isn't OK since the recommender is not simply trying to parrot

Re: Performance issue with Item-based Recommendation and User-based Recommendation

2012-06-21 Thread Sean Owen
happen > without loading everything in memory? > > Thanks. > > > On Thu, Jun 21, 2012 at 3:29 PM, Sean Owen wrote: > >> I would suggest pruning similarities near 0, and then treating missing >> similarities as 0 later at runtime. It may take a bit of coding. But >

Re: Performance issue with Item-based Recommendation and User-based Recommendation

2012-06-22 Thread Sean Owen
SVD and ALS aren't "user-based" or "item-based". They don't operate by computing similarities, which is the good news. I think the ALS model is more appropriate. The SVD is more sophisticated (complex and hard to compute), arguably "overkill" for what recommenders need to do, and doesn't deal with

Re: Mahout org.apache.mahout.cf.taste.* / Maven pom.xml

2012-06-22 Thread Sean Owen
You can't "mvn install" from a subdirectory since it will not have built the other required artifacts. Build from the top level. If your follow-on questions concern the book only, it's best to keep them to the book forum: http://www.manning-sandbox.com/forum.jspa?forumID=623 On Fri, Jun 22, 2012 a

Re: Converting UUID to Long

2012-06-22 Thread Sean Owen
Just hashing is almost surely fine. I'd XOR 64-bit chunks of the UUID to make a 64-bit value. The probability of collision at this size is vanishingly small, and collisions do little damage anyway. Note that in the Hadoop jobs the longs are hashed down to ints anyway! On Fri, Jun 22, 2012 at 3:43
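A minimal sketch of the XOR folding, written in Python for brevity. The function name `uuid_to_long` is made up, and the final step reinterprets the result as a signed 64-bit value on the assumption that it will end up in a Java `long`:

```python
import uuid

MASK64 = (1 << 64) - 1


def uuid_to_long(value):
    """Fold a 128-bit UUID string into a signed 64-bit integer by XOR."""
    u = uuid.UUID(value)
    high = u.int >> 64        # most significant 64 bits
    low = u.int & MASK64      # least significant 64 bits
    folded = high ^ low
    # Reinterpret as two's-complement so the value fits a Java long.
    return folded - (1 << 64) if folded >= (1 << 63) else folded


small = uuid_to_long("00000000-0000-0001-0000-000000000003")
negative = uuid_to_long("ffffffff-ffff-ffff-0000-000000000000")
```

The mapping is one-way, so if recommendations need to be reported by UUID, a lookup table from the long back to the original UUID has to be kept separately.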

Re: Re-scorer in Distributed recommender

2012-06-22 Thread Sean Owen
No, it would happen in AggregateAndRecommender when it computes the score to rank on. On Jun 22, 2012 8:24 PM, "Jaspreet Singh" wrote: > Hi > > I am trying to include a re-scorer like functionality in the distributed > model. Should I be targetting the VectorNormMapper in the RowSimilarityJob > ?

Re: Question about Item Based Collaborative Filtering

2012-06-22 Thread Sean Owen
Using 1 is just fine for the reasons you give. You would be surprised how OK it is to use this even for dislikes. In fact just omit the third field in your CSV. However you need to set the boolean data flag and choose a similarity metric that is defined over such data. Pearson / cosine is not for

Re: Mahout-Hive Integration..

2012-06-23 Thread Sean Owen
You just write the output to HDFS or whatever you want and then parse it in Hive. You may need to put another M/R job in place to convert into some form convenient for Hive. There is no need for direct Mahout-Hive integration and so it does not exist. On Sat, Jun 23, 2012 at 10:06 AM, VIGNESH PRAJ

Re: Question about Item Based Collaborative Filtering

2012-06-24 Thread Sean Owen
arding memory settings? > > Thanks once again for your help. > > > On Fri, Jun 22, 2012 at 11:08 PM, Sean Owen wrote: > >> Using 1 is just fine for the reasons you give. You would be surprised how >> OK it is to use this even for dislikes. In fact just omit the thir

Re: Question about Item Based Collaborative Filtering

2012-06-25 Thread Sean Owen
The error doesn't seem to relate to memory anyway: java.lang.IllegalArgumentException: unresolved address On Mon, Jun 25, 2012 at 7:06 AM, Something Something wrote: > Please ignore the latest email.  When I increased the memory size to 8g, > all steps worked.  Now validating output.  Thanks a l

Re: simple OnlineLogisticRegression classication example using mahout

2012-06-27 Thread Sean Owen
Those are both true; they may not be the issue here. The test point definitely belongs in the first of the two groups you created. Why is the result surprising? On Wed, Jun 27, 2012 at 9:15 AM, Lance Norskog wrote: > Not enough samples. Machine learning algorithms in general do well if > you ha

Re: TFIDF values from seq2sparse

2012-06-27 Thread Sean Owen
Yes this is a known bug. Grant, I had an open question to you on this one -- what do you think about the fix? On Wed, Jun 27, 2012 at 3:11 PM, Yuval Feinstein wrote: > I believe I found the problem. > As this contradicts Mahout's documentation, it might be a bug in > Mahout 0.6 or the documentati

Re: simple OnlineLogisticRegression classication example using mahout

2012-06-28 Thread Sean Owen
Because equals() is implemented. Two Points that are equals() will not have the same hashCode(), which is wrong. It only matters, I suppose, if Point is used in some context where it matters, like a HashMap key. But it is used as a HashMap key here! It happens to succeed because get() is only ever

Re: Pseudodistributed recommender hangs on AWS EMR

2012-06-28 Thread Sean Owen
I don't think this is something to do with Mahout. Looks like an error from EMR. I have not seen anything like this. On Jun 28, 2012 1:40 PM, "Oliver B. Fischer" wrote: > Hi, > > I try to run some test with the pseudodistributed recommender job at AWS > using one of the late 0.7 snapshots. > > Ev
