Hello all,
This email follows the correspondence in StackExchange between myself and
Sean Owen. Please see
http://stackoverflow.com/questions/8240383/apache-mahout-performance-issues
I'm building a boolean-based recommendation engine with the following data:
- 12M users
- 2M items
- 18M
Would anyone please give any hint?
On running the following command:
bin/mahout seqdirectory -c UTF-8
-i examples/reuters-extracted/ -o reuters-seqfiles
I'm getting the following error:
MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
MAHOUT_LOCAL is set,
I have a few more thoughts.
First, I was wrong about what the first parameter to
SamplingCandidateStrategy means. It's effectively a minimum, rather than
maximum; setting to 1 just means it will sample at least 1 pref. I think
you figured that out. I think values like (5,1) are probably about
Hi Sean,
First of all let me thank you for all your help thus far :)
I am using Mahout 0.5.
At the moment the application is not live yet, so I assume multi-threading
is not a problem at the moment.
I definitely see that the bottleneck is in the similarities computations.
Looking at
Yeah, I agree that using just a handful of candidates is far too few and
that's not a solution. It should not be so slow even with a reasonable
number of prefs and users.
Multi-threading *is* a problem insofar as there is no multi-threading
helping speed up your request. But that's a side issue.
I will now try using the latest snapshot from
http://svn.apache.org/repos/asf/mahout/trunk .
I would really prefer to avoid pre-computing the item similarities at the
moment. Do you believe I can achieve good performance without it?
Is there any specific pruning method you would recommend? I
I just tested the app with Mahout 0.6.
There seems to be a small performance improvement, but
recommendations for the 'heavy users' still take between 1 and 5 seconds.
On Wed, Nov 30, 2011 at 4:50 PM, Daniel Zohar disso...@gmail.com wrote:
I will now try using the latest snapshot from
Hi everyone, I'd love to set up a hacker dojo in the Seattle area, similar to
what David is doing in Austin. Are there other folks interested in doing this
with a similar theme? Please let me know. This is a great way to do deep dives
on some of the algorithms in Mahout. Regards
From:
Have you used CachingItemSimilarity? That will hold common similarities in
memory. It's a lot easier than pre-computing and might help.
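The idea of holding common similarities in memory can be sketched with a small memoizing wrapper. This is only an illustration of the caching idea, not Mahout's CachingItemSimilarity: the names PairSimilarity, CachingPairSimilarity and CachingDemo are hypothetical, the similarity is assumed to be symmetric, and the cache here is unbounded, which a real deployment would probably want to cap.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a pairwise item similarity; not Mahout's interface. */
interface PairSimilarity {
    double similarity(long itemA, long itemB);
}

/** Memoizing wrapper: each unordered item pair is computed once, then served from memory. */
final class CachingPairSimilarity implements PairSimilarity {
    private final PairSimilarity delegate;
    // Unbounded for brevity (an assumption of this sketch, not a recommendation).
    private final Map<String, Double> cache = new ConcurrentHashMap<>();

    CachingPairSimilarity(PairSimilarity delegate) {
        this.delegate = delegate;
    }

    @Override
    public double similarity(long itemA, long itemB) {
        // Normalize the pair so (a, b) and (b, a) share one cache entry.
        long lo = Math.min(itemA, itemB);
        long hi = Math.max(itemA, itemB);
        String key = lo + ":" + hi;
        return cache.computeIfAbsent(key, k -> delegate.similarity(lo, hi));
    }

    int cachedPairs() {
        return cache.size();
    }
}

public class CachingDemo {
    public static void main(String[] args) {
        int[] calls = {0};
        // Toy similarity that counts how often it is actually evaluated.
        PairSimilarity slow = (a, b) -> {
            calls[0]++;
            return 1.0 / (1 + Math.abs(a - b));
        };
        CachingPairSimilarity cached = new CachingPairSimilarity(slow);
        cached.similarity(3, 7);
        cached.similarity(7, 3); // same unordered pair, served from the cache
        System.out.println(calls[0]); // prints 1
    }
}
```

The benefit depends on how often the same item pairs recur across requests; for 'heavy users' whose items overlap with many others, repeated lookups are the case this helps.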
I think something like your change is a good one (Sebastian what do you
think) in that it gives you the ultimate lever to control how many
candidates are
Hi all, this is a tangent and can mostly be ignored by the people
interested in this problem.
I'm new to Machine Learning and especially Mahout. Following this
discussion has made me a bit confused.
Isn't Mahout used for large datasets where it makes sense to distribute the
work? Why then isn't
The simple answer is that:
Mahout absorbed a non-distributed recommender project called Taste, which
scales up to a point which may be sufficient for a lot of users. It
certainly is a lot simpler. Yes it is realistic to do near-real-time
recommendations, though it gets harder and harder and
On 29.11.2011 Faizan(Aroha) wrote:
In our case, I think we won't be looking much into features
I am moving towards clustering as Tantons's mentioned.
Hmm - what kind of similarity measure are you planning to use for that? What
makes two items similar in your use case?
Isabel
On 28.11.2011 bish maten wrote:
mahout ldatopics -i mahout-work/abc/abc-lda/state-20 -d
mahout-work/abc/abc-out-seqdir-sparse-lda/dictionary.file-0 -dt
sequencefile (there were no errors reported and command worked fine with
following output). Does the output appear ok?
Hmm - this only
On 28.11.2011 Sean Owen wrote:
There is no newer distribution, but, you can always check out the very
latest from Subversion:
https://cwiki.apache.org/confluence/display/MAHOUT/Version+Control
Also we do publish nightly builds at the Apache Maven-Snapshot repository.
If you would like to help
On 29.11.2011 Ted Dunning wrote:
I find this taxonomy excessive and over-done. The distinctions I find
useful include
- continuous variables
- discrete variables with a known set of values (I call these categorical,
usually). This includes ordinal variables since ordering rarely makes a
On Wed, Nov 30, 2011 at 1:03 PM, Isabel Drost isa...@apache.org wrote:
On 28.11.2011 bish maten wrote:
mahout ldatopics -i mahout-work/abc/abc-lda/state-20 -d
mahout-work/abc/abc-out-seqdir-sparse-lda/dictionary.file-0 -dt
sequencefile (there were no errors reported and command worked
On 30.11.2011 Faizan(Aroha) wrote:
Would anyone please give any hint?
On running the following command:
bin/mahout seqdirectory -c UTF-8
-i examples/reuters-extracted/ -o reuters-seqfiles
I'm getting the following error:
MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to
Nice!
It is very obvious I cannot avoid learning R (sigh).
On Wed, Nov 30, 2011 at 2:58 PM, Ted Dunning ted.dunn...@gmail.com wrote:
Here is some that I just whipped up. I have also attached an example of
the output.
In the sample output, notice how you can see different stories about what
Can you share the R code too?
On Nov 30, 2011, at 2:58 PM, Ted Dunning wrote:
Here is some that I just whipped up. I have also attached an example of the
output.
In the sample output, notice how you can see different stories about what
clusters the brown-ish and purple clusters are
Sure. I attached it, but those get stripped. I didn't realize that this
was going to the list.
Try here: http://dl.dropbox.com/u/36863361/cluster-viz.r
And here for the image: http://dl.dropbox.com/u/36863361/xyz.png
On Wed, Nov 30, 2011 at 4:04 PM, Grant Ingersoll gsing...@apache.org wrote:
Problemanalyze.pdf is not there.
On Wed, Nov 30, 2011 at 1:14 PM, Isabel Drost isa...@apache.org wrote:
On 29.11.2011 Ted Dunning wrote:
I find this taxonomy excessive and over-done. The distinctions I find
useful include
- continuous variables
- discrete variables with a known set
Jake,
Thanks for the pending update.
Slightly off topic, if I understand your notes on MAHOUT-897, Gibbs sampling
would only be feasible in MR implementations that support efficient iteration --
Spark, perhaps YARN -- but not for Mahout as currently conceived. In the case
of Spark, the RDD is
Yes I did build all of mahout.
But fortunately the issue has been resolved.
I just unset the environment variable MAHOUT_LOCAL and it worked.
thanks.
-----Original Message-----
From: Isabel Drost [mailto:isa...@apache.org]
Sent: Thursday, December 01, 2011 2:20 AM
To: user@mahout.apache.org
It is not spelled that way in German. Use an s near the end of the word.
Other than that, I can't imagine the problem. The link worked for me
earlier today and just now as well.
On Wed, Nov 30, 2011 at 7:20 PM, Lance Norskog goks...@gmail.com wrote:
Problemanalyze.pdf is not there.
On Wed,
Oops, the other one:
Datenaufbereitung.pdf
http://potsdam.de/ml/teaching/ws10/ida/Datenselektion_und_Datenaufbereitung.pdf
does not work.
On Wed, Nov 30, 2011 at 8:41 PM, Ted Dunning ted.dunn...@gmail.com wrote:
It is not spelled that way in German. Use an s near the end of the word.
Other
Join the lines together.
On Wed, Nov 30, 2011 at 8:45 PM, Lance Norskog goks...@gmail.com wrote:
Oops, the other one:
Datenaufbereitung.pdf
http://potsdam.de/ml/teaching/ws10/ida/Datenselektion_und_Datenaufbereitung.pdf
does
not work.
On Wed, Nov 30, 2011 at 8:41 PM, Ted Dunning
I've successfully run the vectorization process on the Reuters dataset.
Now I'm trying to vectorize the wiki dataset (10.6 GB),
and I'm getting an OutOfMemoryError.
Any help?
Thanks.
aroha@aroha-laptop:~/workspace/mahout$ bin/mahout seqdirectory -c UTF-8 -i