Mahout performance issues

2011-11-30 Thread Daniel Zohar
Hello all, This email follows the correspondence in StackExchange between myself and Sean Owen. Please see http://stackoverflow.com/questions/8240383/apache-mahout-performance-issues I'm building a boolean-based recommendation engine with the following data: - 12M users - 2M items - 18M

Clustering - Sequence File from Directory

2011-11-30 Thread Faizan(Aroha)
Would anyone please give any hint? On running the following command: bin/mahout seqdirectory -c UTF-8 -i examples/reuters-extracted/ -o reuters-seqfiles I'm getting the following error: MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath. MAHOUT_LOCAL is set,

Re: Mahout performance issues

2011-11-30 Thread Sean Owen
I have a few more thoughts. First, I was wrong about what the first parameter to SamplingCandidateStrategy means. It's effectively a minimum, rather than a maximum; setting it to 1 just means it will sample at least 1 pref. I think you figured that out. I think values like (5,1) are probably about

Re: Mahout performance issues

2011-11-30 Thread Daniel Zohar
Hi Sean, First of all let me thank you for all your help thus far :) I am using Mahout 0.5. The application is not live yet, so I assume multi-threading is not a problem at the moment. I definitely see that the bottleneck is in the similarity computations. Looking at

Re: Mahout performance issues

2011-11-30 Thread Sean Owen
Yeah, I agree that using just a handful of candidates is far too few and that's not a solution. It should not be so slow even with a reasonable number of prefs and users. Multi-threading *is* a problem insofar as there is no multi-threading helping speed up your request. But that's a side issue.

Re: Mahout performance issues

2011-11-30 Thread Daniel Zohar
I will now try using the latest snapshot from http://svn.apache.org/repos/asf/mahout/trunk . I would really prefer to avoid pre-computing the item similarities at the moment. Do you believe I can achieve good performance without it? Is there any specific pruning method you would recommend? I

Re: Mahout performance issues

2011-11-30 Thread Daniel Zohar
I just tested the app with Mahout 0.6. There seems to be a small performance improvement, but recommendations for the 'heavy users' still take between 1 and 5 seconds. On Wed, Nov 30, 2011 at 4:50 PM, Daniel Zohar disso...@gmail.com wrote: I will now try using the latest snapshot from

RE: Successful Organization Meeting for Austin SIGKDD

2011-11-30 Thread Saikat Kanjilal
Hi everyone, I'd love to set up a hacker dojo in the Seattle area, similar to what David is doing in Austin. Are there other folks interested in doing this with a similar theme? Please let me know. This is a great way to do deep dives on some of the algorithms in Mahout. Regards From:

Re: Mahout performance issues

2011-11-30 Thread Sean Owen
Have you used CachingItemSimilarity? That will hold common similarities in memory. It's a lot easier than pre-computing and might help. I think something like your change is a good one (Sebastian, what do you think?) in that it gives you the ultimate lever to control how many candidates are

Re: Mahout performance issues

2011-11-30 Thread Dan Beaulieu
Hi all, this is a tangent and can mostly be ignored by the people interested in this problem. I'm new to Machine Learning and especially Mahout. Following this discussion has made me a bit confused. Isn't Mahout used for large datasets where it makes sense to distribute the work? Why then isn't

Re: Mahout performance issues

2011-11-30 Thread Sean Owen
The simple answer is that Mahout absorbed a non-distributed recommender project called Taste, which scales up to a point which may be sufficient for a lot of users. It certainly is a lot simpler. Yes, it is realistic to do near-real-time recommendations, though it gets harder and harder and

Re: Relevance score - Classification

2011-11-30 Thread Isabel Drost
On 29.11.2011 Faizan(Aroha) wrote: In our case, I think we won't be looking much into features I am moving towards clustering as Tantons's mentioned. Hmm - what kind of similarity measure are you planning to use for that? What makes two items similar in your use case? Isabel

Re: LDATopic

2011-11-30 Thread Isabel Drost
On 28.11.2011 bish maten wrote: mahout ldatopics -i mahout-work/abc/abc-lda/state-20 -d mahout-work/abc/abc-out-seqdir-sparse-lda/dictionary.file-0 -dt sequencefile (there were no errors reported and command worked fine with following output). Does the output appear ok? Hmm - this only

Re: Mahout distribution download

2011-11-30 Thread Isabel Drost
On 28.11.2011 Sean Owen wrote: There is no newer distribution, but, you can always check out the very latest from Subversion: https://cwiki.apache.org/confluence/display/MAHOUT/Version+Control Also we do publish nightly builds at the Apache Maven-Snapshot repository. If you would like to help

Re: Data class taxonomy for machine learning

2011-11-30 Thread Isabel Drost
On 29.11.2011 Ted Dunning wrote: I find this taxonomy excessive and over-done. The distinctions I find useful include - continuous variables - discrete variables with a known set of values (I call these categorical, usually). This includes ordinal variables since ordering rarely makes a

Re: LDATopic

2011-11-30 Thread Jake Mannix
On Wed, Nov 30, 2011 at 1:03 PM, Isabel Drost isa...@apache.org wrote: On 28.11.2011 bish maten wrote: mahout ldatopics -i mahout-work/abc/abc-lda/state-20 -d mahout-work/abc/abc-out-seqdir-sparse-lda/dictionary.file-0 -dt sequencefile (there were no errors reported and command worked

Re: Clustering - Sequence File from Directory

2011-11-30 Thread Isabel Drost
On 30.11.2011 Faizan(Aroha) wrote: Would anyone please give any hint? On running the following command: bin/mahout seqdirectory -c UTF-8 -i examples/reuters-extracted/ -o reuters-seqfiles I'm getting the following error: MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to

Re: Clustering graph coloring and layout

2011-11-30 Thread Dmitriy Lyubimov
Nice! It is very obvious I cannot avoid learning R (sigh). On Wed, Nov 30, 2011 at 2:58 PM, Ted Dunning ted.dunn...@gmail.com wrote: Here is some that I just whipped up. I have also attached an example of the output. In the sample output, notice how you can see different stories about what

Re: Clustering graph coloring and layout

2011-11-30 Thread Grant Ingersoll
Can you share the R code too? On Nov 30, 2011, at 2:58 PM, Ted Dunning wrote: Here is some that I just whipped up. I have also attached an example of the output. In the sample output, notice how you can see different stories about what clusters the brown-ish and purple clusters are

Re: Clustering graph coloring and layout

2011-11-30 Thread Ted Dunning
Sure. I attached it, but those get stripped. I didn't realize that this was going to the list. Try here: http://dl.dropbox.com/u/36863361/cluster-viz.r And here for the image: http://dl.dropbox.com/u/36863361/xyz.png On Wed, Nov 30, 2011 at 4:04 PM, Grant Ingersoll gsing...@apache.org wrote:

Re: Data class taxonomy for machine learning

2011-11-30 Thread Lance Norskog
Problemanalyze.pdf is not there. On Wed, Nov 30, 2011 at 1:14 PM, Isabel Drost isa...@apache.org wrote: On 29.11.2011 Ted Dunning wrote: I find this taxonomy excessive and over-done. The distinctions I find useful include - continuous variables - discrete variables with a known set

Re: LDATopic

2011-11-30 Thread Charles Earl
Jake, Thanks for the pending update. Slightly off topic, if I understand your notes on MAHOUT-897, Gibbs sampling would only be feasible in MR implementations that support efficient iteration -- Spark, perhaps YARN -- but not for Mahout as currently conceived. In the case of Spark, the RDD is

RE: Clustering - Sequence File from Directory

2011-11-30 Thread Faizan(Aroha)
Yes, I did build all of Mahout. But fortunately the issue has been resolved. I just unset the environment variable MAHOUT_LOCAL and it worked. Thanks. -Original Message- From: Isabel Drost [mailto:isa...@apache.org] Sent: Thursday, December 01, 2011 2:20 AM To: user@mahout.apache.org

Re: Data class taxonomy for machine learning

2011-11-30 Thread Ted Dunning
It is not spelled that way in German. Use an s near the end of the word. Other than that, I can't imagine the problem. The link worked for me earlier today and just now as well. On Wed, Nov 30, 2011 at 7:20 PM, Lance Norskog goks...@gmail.com wrote: Problemanalyze.pdf is not there. On Wed,

Re: Data class taxonomy for machine learning

2011-11-30 Thread Lance Norskog
Oops, the other one: Datenaufbereitung.pdf http://potsdam.de/ml/teaching/ws10/ida/Datenselektion_und_Datenaufbereitung.pdf does not work. On Wed, Nov 30, 2011 at 8:41 PM, Ted Dunning ted.dunn...@gmail.com wrote: It is not spelled that way in German. Use an s near the end of the word. Other

Re: Data class taxonomy for machine learning

2011-11-30 Thread Ted Dunning
Join the lines together. On Wed, Nov 30, 2011 at 8:45 PM, Lance Norskog goks...@gmail.com wrote: Oops, the other one: Datenaufbereitung.pdf http://potsdam.de/ml/teaching/ws10/ida/Datenselektion_und_Datenaufbereitung.pdf does not work. On Wed, Nov 30, 2011 at 8:41 PM, Ted Dunning

OutOfMemoryError - Clustering

2011-11-30 Thread Faizan(Aroha)
I've successfully run the vectorization process on the Reuters dataset. Now I'm trying to vectorize the wiki dataset (10.6GB), and I'm getting an OutOfMemoryError. Any help? Thanks. aroha@aroha-laptop:~/workspace/mahout$ bin/mahout seqdirectory -c UTF-8 -i