Re: Mahout performance issues

2011-12-04 Thread Sebastian Schelter
I created a JIRA issue to supply a non-distributed counterpart of the sampling that is done in the distributed item similarity computation: https://issues.apache.org/jira/browse/MAHOUT-914 2011/12/2 Sean Owen sro...@gmail.com: For your purposes, it's LogLikelihoodSimilarity. I made similar changes

Re: problem at : Installing and testing Taste

2011-12-04 Thread VIGNESH PRAJAPATI
Yes, I had not modified it, but after referring to this link I replaced my pom of taste-web. Now there are other errors, such as: *At command mvn compile* [INFO] Scanning for projects... [WARNING] [WARNING] Some problems were encountered while building the effective model for

Re: problem at : Installing and testing Taste

2011-12-04 Thread Sean Owen
OK, only the errors are relevant, and some indicate that you're missing some dirs. For example make the lib directory referenced below. I think you should go back to version 0.5 as-is, rather than try any modifications. I think it was built with Maven 2.x rather than 3.x, so you may have to try

Re: Mahout performance issues

2011-12-04 Thread Daniel Zohar
Combining the latest commits with my optimized-SamplingCandidateItemsStrategy (http://pastebin.com/6n9C8Pw1) I achieved satisfying results. All the queries were under one second. Sebastian, I took a look at your patch and I think it's more practical than the current

Re: Mahout performance issues

2011-12-04 Thread Sean Owen
Are you referring to my patch, MAHOUT-910? It does let you specify a hard cap, really -- if you place a limit of X, then at most X^2 item-item associations come out. Before you could not bound the result, really, since one user could rate a lot of items. I think it's slightly more efficient and

Re: Mahout performance issues

2011-12-04 Thread Daniel Zohar
Actually I was referring to Sebastian's. I haven't seen that you committed anything to SamplingCandidateItemsStrategy. Can you tell me in which class the change appears? On Sun, Dec 4, 2011 at 2:06 PM, Sean Owen sro...@gmail.com wrote: Are you referring to my patch, MAHOUT-910? It does let you

Re: Mahout performance issues

2011-12-04 Thread Sean Owen
Have a look at the patch attached to MAHOUT-910. I have not committed it yet so as to allow review. https://issues.apache.org/jira/browse/MAHOUT-910 The current implementation samples users. MAHOUT-914 samples items from users. MAHOUT-910 samples both. What's most ideal? I had supposed we want

Re: Mahout performance issues

2011-12-04 Thread Sebastian Schelter
Hi Daniel, My view is this: I think you can pretty safely down-sample power users, as is done in https://issues.apache.org/jira/browse/MAHOUT-914. I did some experiments on the movielens1M dataset that showed that you get a negligible error given you look at enough interactions per user:
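
As an illustration of what down-sampling a power user can look like, here is a minimal sketch. It is not the MAHOUT-914 patch: the class PowerUserSampler, the method downSample, and the cap of 500 are hypothetical names and values chosen for this example, and the sampling is a plain shuffle-and-truncate.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper, not part of Mahout: caps how many interactions are kept per user. */
class PowerUserSampler {

  /**
   * Returns at most maxPrefs item IDs chosen uniformly at random from itemIDs.
   * Users at or below the cap are returned unchanged (as a copy).
   */
  static List<Long> downSample(List<Long> itemIDs, int maxPrefs, Random random) {
    if (itemIDs.size() <= maxPrefs) {
      return new ArrayList<>(itemIDs);
    }
    // Shuffle a copy so the caller's list is left untouched, then keep a prefix.
    List<Long> shuffled = new ArrayList<>(itemIDs);
    Collections.shuffle(shuffled, random);
    return new ArrayList<>(shuffled.subList(0, maxPrefs));
  }

  public static void main(String[] args) {
    // A made-up "power user" with 5000 interactions.
    List<Long> powerUser = new ArrayList<>();
    for (long id = 0; id < 5000; id++) {
      powerUser.add(id);
    }
    List<Long> sampled = downSample(powerUser, 500, new Random(42));
    System.out.println(sampled.size()); // prints 500
  }
}
```

Ordinary users pass through unchanged, so only the heaviest users lose information; how large the cap must be to keep the error negligible is the empirical question raised in the message.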

Re: Mahout performance issues

2011-12-04 Thread Daniel Zohar
Sean, your impl. is indeed better than mine but for some reason when I ran it for a user with a lot of interactions, I got 2023 possibleItemIDs (although I used 10,2 in the constructor). Sebastian, I will also try to experiment with your patch. I would just like to add that in my opinion, as

Re: Mahout performance issues

2011-12-04 Thread Daniel Zohar
I assume the parameter does not affect the possibleItemIDs because of the following line: max = (int) Math.max(defaultMaxPrefsPerItemConsidered, userItemCountMultiplier * Math.log(Math.max(dataModel.getNumUsers(), dataModel.getNumItems()))); On Sun, Dec 4, 2011 at 2:59 PM, Daniel Zohar
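
To make the quoted expression easier to reason about, here is a stand-alone sketch of it. The class SamplingCap and the method cap are hypothetical, the parameter types are assumed to be int, and the data model sizes are made-up numbers. It only shows that the expression takes the larger of the two terms, so the first argument behaves as a floor rather than a ceiling.

```java
/** Hypothetical stand-alone version of the quoted expression; not the Mahout class itself. */
class SamplingCap {

  /** Larger of the configured default and multiplier * ln(max(numUsers, numItems)). */
  static int cap(int defaultMaxPrefsPerItemConsidered, int userItemCountMultiplier,
                 int numUsers, int numItems) {
    return (int) Math.max(defaultMaxPrefsPerItemConsidered,
        userItemCountMultiplier * Math.log(Math.max(numUsers, numItems)));
  }

  public static void main(String[] args) {
    // Assumed sizes: with one million users the log term (about 27.6) exceeds the default of 10.
    System.out.println(cap(10, 2, 1_000_000, 50_000)); // prints 27
    // With a tiny data model the log term (about 9.2) is below the default, so 10 is used.
    System.out.println(cap(10, 2, 100, 80)); // prints 10
  }
}
```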

Re: Mahout performance issues

2011-12-04 Thread Sean Owen
To talk about this clearly, let me go back to my example and add to it: --- Say we're recommending for user A. User A is connected to items 1, 2, 3. Those items are connected to other users X, Y, Z. And those users in turn are connected to items 100, 101, 102, 103. You can down-sample three
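
The user-to-item-to-user-to-item walk in this example can be sketched as follows. This is not Mahout code: the class CandidateGraph, its methods, and the exact wiring between X, Y, Z and items 100 to 103 are assumptions made for illustration. The message is cut off, so placing one cap on each hop is only one reading of where limits could go, and the sketch truncates deterministically (keeps the first n entries) instead of sampling at random.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical toy graph of user-item interactions; not a Mahout DataModel. */
class CandidateGraph {
  private final Map<String, List<Long>> itemsByUser = new LinkedHashMap<>();
  private final Map<Long, List<String>> usersByItem = new LinkedHashMap<>();

  /** Records that the user interacted with the item, in both directions. */
  void link(String user, long item) {
    itemsByUser.computeIfAbsent(user, k -> new ArrayList<>()).add(item);
    usersByItem.computeIfAbsent(item, k -> new ArrayList<>()).add(user);
  }

  private static <T> List<T> firstN(List<T> list, int n) {
    return list.subList(0, Math.min(n, list.size()));
  }

  /** Walks user -> items -> other users -> their items, with one cap per hop. */
  Set<Long> candidates(String user, int maxOwnItems, int maxUsersPerItem,
                       int maxItemsPerOtherUser) {
    List<Long> ownItems = itemsByUser.getOrDefault(user, new ArrayList<>());
    Set<Long> result = new LinkedHashSet<>();
    for (long item : firstN(ownItems, maxOwnItems)) {
      for (String other : firstN(usersByItem.get(item), maxUsersPerItem)) {
        if (other.equals(user)) {
          continue; // the user is trivially connected to their own items
        }
        for (long candidate : firstN(itemsByUser.get(other), maxItemsPerOtherUser)) {
          if (!ownItems.contains(candidate)) {
            result.add(candidate); // only items the user has not seen yet
          }
        }
      }
    }
    return result;
  }

  public static void main(String[] args) {
    CandidateGraph g = new CandidateGraph();
    g.link("A", 1); g.link("A", 2); g.link("A", 3);
    g.link("X", 1); g.link("X", 100); g.link("X", 101);
    g.link("Y", 2); g.link("Y", 101); g.link("Y", 102);
    g.link("Z", 3); g.link("Z", 103);
    System.out.println(g.candidates("A", 10, 10, 10)); // prints [100, 101, 102, 103]
    System.out.println(g.candidates("A", 1, 10, 10));  // prints [100, 101]
  }
}
```

With caps a, b, and c on the three hops, the walk visits at most a * b * c item entries, which is what makes the cost of a single recommendation bounded regardless of how active any one user is.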

Re: Mahout performance issues

2011-12-04 Thread Ted Dunning
Sean, You can also do #1. That is what I have used in the past and what I recommend. That achieves a large part of #2, but what is most important is that it *directly* addresses the key cost factor in off-line recommendations since the number of item pairs emitted is proportional to the sum of

Re: 20newsgroups example does not print verbose output

2011-12-04 Thread Grant Ingersoll
What do you have for logging in your classpath? On Dec 1, 2011, at 1:24 PM, magicalo wrote: Hello, I have run the 20newsgroups example on my own data set. It runs successfully and prints the summary output. However, I have enabled the verbose option in the script when I run the

Re: Time series analysis

2011-12-04 Thread Ted Dunning
2011/12/4 myn m...@163.com does mahout contain this method? Which method? Time series analysis is not a method.

Re: Time series analysis

2011-12-04 Thread Peyman Mohajerian
Any time you have data collected over time, you have time series data. For example, data from the trajectory of hand movement in biomechanics or the movement of a given stock on a given day, where the x-axis is time. FFT-based frequency analysis of the data is an example of time series analysis. In general regression are

Re: Time series analysis

2011-12-04 Thread Ted Dunning
Classification and clustering are also common tasks in time series analysis. Furthermore, not all time series have samples that are expressed as simple continuous values. Think about click streams or financial transactions. Neither can be expressed as a simple number. On Sun, Dec 4, 2011 at 7:29

When is PCA expected to be fully implemented into Mahout?

2011-12-04 Thread magicalo
Hello, Is there an expected release date for the PCA algorithm as part of Mahout? Tx!

Re: When is PCA expected to be fully implemented into Mahout?

2011-12-04 Thread Raphael Cendrillon
Hi Magicalo, You can find a patch for PCA under MAHOUT-512 which is available here https://issues.apache.org/jira/browse/MAHOUT-512. This implementation scales well with training samples and calculates the covariance matrix in a distributed way. The feature size is not so scalable as the