This isn't anything to do with chmod, as far as I know: Hadoop uses Java to
set readable permission, and this is not implemented in Windows.
chmod is already on the Cygwin path anyway.
It seems pretty normal that Hadoop might want to make its output directory
writable!
On Thu, Nov 17, 2011 at
One more question. OK, so I use Lanczos to find V_k by finding the top k
eigenvectors of AT * A. A is sparse. But isn't AT * A dense, then? Is that
just how it is?
This will be my last basic question for the week:
I understand that A ~= U_k * S_k * V_kT. Let's call the product on the
right A_k.
On Thu, Nov 17, 2011 at 5:26 AM, Sean Owen sro...@gmail.com wrote:
One more question. OK, so I use Lanczos to find V_k by finding the top k
eigenvectors of AT * A. A is sparse. But isn't AT * A dense, then? Is that
just how it is?
A'A and AA' are both dense, yes, but you never compute them.
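To make the "never compute them" point concrete, here is a minimal sketch of how a Lanczos-style solver could apply A'A to a vector using only the non-zeros of A: first y = A x, then z = A' y. The class name, the method name `timesAtA`, and the coordinate-list storage are assumptions for illustration only, not Mahout's actual API.

```java
import java.util.Arrays;

/** Sketch: multiply a vector by A'A without ever forming the dense product. */
final class ImplicitGramProduct {

    /**
     * A is given in coordinate form: stored entry i is (rows[i], cols[i], vals[i]).
     * Returns A' * (A * x), touching only the stored non-zeros.
     */
    static double[] timesAtA(int numRows, int numCols,
                             int[] rows, int[] cols, double[] vals,
                             double[] x) {
        // Step 1: y = A x
        double[] y = new double[numRows];
        for (int i = 0; i < vals.length; i++) {
            y[rows[i]] += vals[i] * x[cols[i]];
        }
        // Step 2: z = A' y
        double[] z = new double[numCols];
        for (int i = 0; i < vals.length; i++) {
            z[cols[i]] += vals[i] * y[rows[i]];
        }
        return z;
    }

    public static void main(String[] args) {
        // A = [[1, 0, 2],
        //      [0, 3, 0]]
        int[] rows = {0, 0, 1};
        int[] cols = {0, 2, 1};
        double[] vals = {1.0, 2.0, 3.0};
        double[] x = {1.0, 1.0, 1.0};
        // prints [3.0, 9.0, 6.0]
        System.out.println(Arrays.toString(timesAtA(2, 3, rows, cols, vals, x)));
    }
}
```

The cost of each product is proportional to the number of non-zeros in A, whereas a materialized A'A could need on the order of numCols squared entries.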
Ah-ha. That's clicked now. Especially as I read the comments and see it
already says exactly this.
And I understand that you just compute extra eigenvectors then throw out
near-duplicates, or those that are too un-eigenvector -- are there good
pointers on the alternatives for that, or are
Is there some way to weight particular preferences within Mahout? For
example, suppose you were creating some kind of literature recommender that
uses a 5-star preference scale. If you wanted to give double the weighting
to preferences for novels versus preferences for short stories, what would
Not directly, but you could modify an item-based recommender to do so.
Where it uses an item-item similarity as a weight in a weighted average,
you could modify the weight however you like depending on the types of the
two items.
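As a rough illustration of that idea, the sketch below computes a weighted-average estimate in which each item-item similarity is scaled by a boost before it is used as a weight. It is plain Java and deliberately does not use Mahout classes; the class name, the `boosts` array, and the 2.0 factor for novels are hypothetical, and it assumes non-negative similarities.

```java
/** Sketch: item-based estimate with type-dependent weights (not Mahout code). */
final class TypeWeightedEstimate {

    /**
     * ratings[i]      the user's rating for the i-th item they have rated
     * similarities[i] similarity between that item and the target item
     * boosts[i]       hypothetical multiplier chosen from the types of the
     *                 two items, e.g. 2.0 when the rated item is a novel
     */
    static double estimate(double[] ratings, double[] similarities, double[] boosts) {
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (int i = 0; i < ratings.length; i++) {
            double weight = similarities[i] * boosts[i];
            weightedSum += weight * ratings[i];
            totalWeight += weight;
        }
        // No usable neighbours: signal "no estimate" with NaN.
        return totalWeight == 0.0 ? Double.NaN : weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        double[] ratings = {5.0, 2.0};       // a novel and a short story
        double[] similarities = {0.5, 0.5};
        double[] boosts = {2.0, 1.0};        // the novel counts double
        // prints 4.0 (the unboosted average would be 3.5)
        System.out.println(estimate(ratings, similarities, boosts));
    }
}
```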
On Thu, Nov 17, 2011 at 5:16 PM, Jamey Wood jamey.w...@gmail.com
On Thu, Nov 17, 2011 at 7:21 AM, Jake Mannix jake.man...@gmail.com wrote:
On Thu, Nov 17, 2011 at 5:26 AM, Sean Owen sro...@gmail.com wrote:
One more question. OK, so I use Lanczos to find V_k by finding the top k
eigenvectors of AT * A. A is sparse. But isn't AT * A dense, then? Is
that
Hi Grant,
I am running the NewsKMeansClustering Class from NetBeans (Run - Run
File). I did not change anything in the class code except the name of the
input directory, so the class can see the dataset that I want to cluster.
So, I changed the statement:
String inputDir = inputDir;
to:
String
Yeah... a good alternative is to use the random projection stuff.
On Thu, Nov 17, 2011 at 9:12 AM, Sean Owen sro...@gmail.com wrote:
Ah-ha. That's clicked now. Especially as I read the comments and see it
already says exactly this.
And I understand that you just compute extra eigenvectors
Hi Jeff,
Can you please elaborate on what is meant by the -c path? I am running the
NewsKMeansClustering class normally from NetBeans (not from a command-line
shell, nor from the mahout launcher script). So, I am not including any
options with the run.
Thanks,
Ahmad
On Wed, Nov 16, 2011 at 5:22 PM,
Thanks, Sean. We'll look into that.
For user-based recommenders (or even just calculating UserSimilarity),
would it have the desired effect if we added multiple virtual preference
data points for the real items that we wished to more heavily weight?
For example, if our real preference data
I am interested in starting a hacker dojo in Austin for big data machine
learning. We would meet one evening a week to work on coding up Hadoop-based
machine learning and statistical analysis problems for big data systems. This
would be a hacker dojo where the focus is on coding. I can teach
How did you set the heap sizes? If you are running on a cluster you need to add
properties to your mapred-site.xml. Something like this:
<property>
  <name>mapred.map.child.java.opts</name>
  <value>-Xmx1500m</value>
  <description>Java opts for the map tasks.
MapR:
Default heapsize(-Xmx) is
Well I think you could fit it inside some of the user-user similarities,
yes. For a Pearson correlation, you could count important items twice or
something, yes. I wouldn't do that by literally adding more items to the
model as it creates other problems. It's possible; it may or may not have
the
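One way to "count important items twice" without adding rows to the data model is a weighted Pearson correlation, where each co-rated item carries a weight. The sketch below is a plain-Java illustration, not Mahout's `UserSimilarity` implementation; the class name and the `w` weights are assumptions. An integer weight of 2 gives the same result as listing that item twice.

```java
/** Sketch: Pearson correlation with per-item weights (not Mahout code). */
final class WeightedPearson {

    /** x and y are the two users' ratings on their co-rated items; w are item weights. */
    static double correlation(double[] x, double[] y, double[] w) {
        double sumW = 0.0, meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < x.length; i++) {
            sumW += w[i];
            meanX += w[i] * x[i];
            meanY += w[i] * y[i];
        }
        meanX /= sumW;
        meanY /= sumW;

        double cov = 0.0, varX = 0.0, varY = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += w[i] * dx * dy;
            varX += w[i] * dx * dx;
            varY += w[i] * dy * dy;
        }
        double denom = Math.sqrt(varX * varY);
        // Undefined when either user's weighted ratings have zero variance.
        return denom == 0.0 ? Double.NaN : cov / denom;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0};
        double[] y = {1.0, 3.0, 2.0};
        double[] w = {2.0, 1.0, 1.0}; // first item counts double
        System.out.println(correlation(x, y, w));
    }
}
```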
Agree.
On Thu, Nov 17, 2011 at 11:30 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:
However,
it would seem to me that QR as a completely isolated job would have
little value in machine learning applications.
On Thu, Nov 17, 2011 at 11:30 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:
I will finish adding an option with Cholesky decomposition route to
SSVD some time early in Q1 2012.
PPS I already put some jobs in (they are in the trunk) for Cholesky
route. I thought it would be an easy mod but then
I think Dmitriy's description of the SGD and ALS-WR approach hits the
nail on the head.
However there is a third way to factorize the rating matrix which we
haven't talked about yet. It's described in Yehuda Koren's
Collaborative Filtering for Implicit Feedback Datasets
Yes. This is even one more step away from straightforward SVD, i.e.
explicitly analyzing implicit feedback (pun intended).
On Thu, Nov 17, 2011 at 12:38 PM, Sebastian Schelter s...@apache.org wrote:
I think Dmitriy's description of the SGD and ALS-WR approach hits the
nail on the head.
I've never implemented LSI. Is there a way to incrementally build the model
(by simply indexing documents) or is it something that one only runs after the
fact once one has built up the much bigger matrix? If it's the former, I bet
it wouldn't be that hard to just implement the appropriate
It is possible to index/vectorize new documents in an existing projection.
Building the projection is pretty much a from-scratch operation.
Rebuilding the projection can be done pretty infrequently.
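A minimal sketch of projecting a new document into an existing projection, assuming the usual LSI setup where A (documents by terms) is approximated by U_k * S_k * V_kT, so a new document row d maps to d * V_k * S_k^-1. The class name and the dense array layout are assumptions for illustration, not Mahout's API.

```java
import java.util.Arrays;

/** Sketch: fold a new document vector into an existing rank-k LSI space. */
final class FoldIn {

    /**
     * doc            term weights of the new document, length = number of terms
     * vk             V_k, one row per term and one column per factor
     * singularValues the k singular values on the diagonal of S_k
     */
    static double[] project(double[] doc, double[][] vk, double[] singularValues) {
        int k = singularValues.length;
        double[] coords = new double[k];
        for (int t = 0; t < doc.length; t++) {
            if (doc[t] == 0.0) {
                continue; // skip terms the document does not contain
            }
            for (int j = 0; j < k; j++) {
                coords[j] += doc[t] * vk[t][j];
            }
        }
        for (int j = 0; j < k; j++) {
            coords[j] /= singularValues[j]; // multiply by S_k^-1
        }
        return coords;
    }

    public static void main(String[] args) {
        double[][] vk = {{1.0, 0.0}, {0.0, 1.0}, {0.0, 0.0}};
        double[] singularValues = {2.0, 4.0};
        double[] doc = {2.0, 4.0, 7.0};
        // prints [1.0, 1.0]; the third term is outside the retained factors
        System.out.println(Arrays.toString(project(doc, vk, singularValues)));
    }
}
```

The projection leaves V_k and S_k untouched, which is why the new document cannot change what the factors mean until the projection is rebuilt.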
On Thu, Nov 17, 2011 at 1:47 PM, Grant Ingersoll gsing...@apache.org wrote:
I've never
The only way I know of to build the model incrementally is to do a
'fold in' of new observations.
However, folding in (which is just a multiplication of a new vector
over the matrices as Ted explained somewhere else) is just a
projection into already trained space of factors, but not a repetition
PS the danger of using an overly specific corpus is that training may
not be able to learn polysemy very well unless it sees other documents
with examples of use of the industry jargon words that may also mean
something else. But you definitely want to include documents that do
have words
'chmod' is the program that sets readable permission. It does whatever
Windows magic is required to match the POSIX command-line semantics.
The Cygwin path is not the true Windows path. So, when Java runs, it gets
the true path, which has no Cygwin.
You have to add c:\cygwin\bin to the Windows path
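A quick way to check what the JVM itself sees is to scan the PATH it was started with for a chmod executable. This is only a diagnostic sketch; the class and method names are made up for illustration, and the file names searched for (`chmod`, `chmod.exe`) are assumptions about a typical Cygwin install.

```java
import java.io.File;
import java.util.regex.Pattern;

/** Sketch: report whether chmod is on the PATH that this JVM actually received. */
final class ChmodOnPath {

    /** Returns the first file on the given path matching one of the names, or null. */
    static File findOnPath(String path, String... names) {
        if (path == null) {
            return null;
        }
        for (String dir : path.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : names) {
                File candidate = new File(dir, name);
                if (candidate.isFile()) {
                    return candidate;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        File found = findOnPath(System.getenv("PATH"), "chmod", "chmod.exe");
        System.out.println(found == null
                ? "chmod is not visible to this JVM"
                : "chmod found at " + found);
    }
}
```

Running this from the same launcher (for example the IDE) that starts the Hadoop job shows whether that process inherits the Cygwin directory.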
Hi guys,
I just noticed an out-of-memory problem in the ClusterDumper class. It
seems that it loads all the data (for example, the clusteredPoints) into a
Map container, which costs a huge amount of memory if we have GBs of data. I
think we could also use MapReduce to print the results instead of