I think that if your input vector is a set, ele.get() should be used; if instead your
input vector is a sparse vector, ele.index() would be used.
Please correct me if I'm wrong.
for (int i = 0; i < numHashFunctions; i++) {
for (Vector.Element ele : featureVector) {
/// Shouldn't the
I doubt that query compilation is the major cost here. The problem is that too
many records are being moved too often.
On Aug 15, 2011, at 10:23 PM, Lance Norskog goks...@gmail.com wrote:
The standard advice also applies: use stored procedures if you can. If
not, use
Yes, I also doubt that the cost of parsing a simple select a,b,c from
x query matters compared to sending 80K records across the network.
On Tue, Aug 16, 2011 at 6:23 AM, Lance Norskog goks...@gmail.com wrote:
The standard advice also applies: use stored procedures if you can. If
not, use
I'm not the authoritative voice here, but I would also agree with your
interpretation -- it's indices rather than values that I'd use.
I can imagine using min-hash on values, but that would not seem to be
the most natural thing to do.
(I don't understand the comment about set and get(). Vectors
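To make the index-vs-value distinction concrete, here is a minimal sketch of MinHash over the *indices* of a sparse vector's non-zero elements. The vector is modeled as a plain map and the hash function is an arbitrary stand-in, not Mahout's Vector or HashFactory API:

```java
import java.util.Map;
import java.util.TreeMap;

public class MinHashSketch {

    // Simple parameterized integer hash; a stand-in for a real hash family.
    static int hash(int value, int seed) {
        int h = value * 0x9E3779B1 + seed;
        h ^= h >>> 16;
        return h & 0x7FFFFFFF; // keep it non-negative
    }

    // One signature entry per hash function: the minimum hash over all
    // non-zero *indices* -- not over the stored values.
    public static int[] signature(Map<Integer, Double> sparseVector, int numHashFunctions) {
        int[] sig = new int[numHashFunctions];
        for (int i = 0; i < numHashFunctions; i++) {
            int min = Integer.MAX_VALUE;
            for (Map.Entry<Integer, Double> ele : sparseVector.entrySet()) {
                // ele.getKey() plays the role of ele.index() here
                min = Math.min(min, hash(ele.getKey(), i));
            }
            sig[i] = min;
        }
        return sig;
    }

    public static void main(String[] args) {
        Map<Integer, Double> v = new TreeMap<>();
        v.put(3, 1.0);
        v.put(17, 2.5);
        v.put(42, 0.5);
        System.out.println(java.util.Arrays.toString(signature(v, 4)));
    }
}
```

Because only indices are hashed, two vectors with the same non-zero positions but different values get identical signatures, which is the behavior being discussed above.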
Hello Ted,
thanks for your help!
To give you more details:
Clustering in this case has something of pattern recognition:
for the first graph, I am looking for following pattern:
* *
* *
* *
for the second graph, I basically want following pattern:
* *
Hi Sean,
Why is only userCache cleared on refresh?
On 15 August 2011 19:32, Sean Owen sro...@gmail.com wrote:
For the interested, I wrote a follow-up to this article, focusing on
using *Hadoop* with Cassandra and Mahout:
http://acunu.com/blogs/sean-owen/scaling-cassandra-and-mahout-hadoop/
We're talking about the first article, and CassandraDataModel?
That is just a mistake, I'll fix it.
On Tue, Aug 16, 2011 at 1:21 PM, Marko Ciric ciric.ma...@gmail.com wrote:
Hi Sean,
Why is only userCache cleared on refresh?
Hi,
Thanks for your post. However, the proposed solution will not work, because
getFeatureID is needed to populate the weight matrix. So the proposed
modifications to the code would result in the model not being loaded correctly
and in a wrong execution.
The problem with the large memory requirement
+1 Me too. If there aren't already unit tests which guarantee this then we need
to add them. This is a pretty important capability not to guarantee in the API.
-Original Message-
From: Blake Lemoine [mailto:bal2...@gmail.com]
Sent: Saturday, August 13, 2011 4:46 PM
To:
I just looked at the initial JIRA to create this implementation and saw the
example code that uses it --
https://issues.apache.org/jira/browse/MAHOUT-344
The LastfmDataConverter class is indeed creating a vector with the indices
stored in the values and spurious information stored in the indices:
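As an illustration of the two shapes being contrasted (a sketch using a plain map as a stand-in for a sparse vector; the item ids are made up, and this is not the actual converter code):

```java
import java.util.HashMap;
import java.util.Map;

public class VectorEncoding {

    // Correct shape: the item id is the *index*; the value carries the weight.
    public static Map<Integer, Double> correct(int[] itemIds) {
        Map<Integer, Double> v = new HashMap<>();
        for (int id : itemIds) {
            v.put(id, 1.0);
        }
        return v;
    }

    // Buggy shape described above: a running counter is the index and the
    // item id is stored as the value, so the indices carry no information.
    public static Map<Integer, Double> buggy(int[] itemIds) {
        Map<Integer, Double> v = new HashMap<>();
        for (int i = 0; i < itemIds.length; i++) {
            v.put(i, (double) itemIds[i]);
        }
        return v;
    }

    public static void main(String[] args) {
        int[] items = {101, 205, 733};
        System.out.println("correct: " + correct(items));
        System.out.println("buggy:   " + buggy(items));
    }
}
```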
It's the right place -- best-effort question-answering is not always that good.
A JIRA is a good thing if you have a specific idea of the issue /
enhancement and ideally a proposed patch. That is tracked with some
loose regularity and so might get more attention.
On Tue, Aug 16, 2011 at 5:13 PM,
Thanks again. I am using 0.5 right now. We will try to patch it up and see how
it performs. In the meantime, I am having another (possibly user?) error: I
have a 260 x 230 matrix. I set k+p = 40, and it fails with
Exception in thread "main" java.io.IOException: Q job unsuccessful.
at
I've been getting this exception a lot as well. I've been going through
some of the examples in Mahout In Action book, and I get errors a lot when I
follow the instructions word for word -- either due to typos in the book (it
seems like there were a few sections where a script was updated due to
This is unusually small input. What's the block size? Use large blocks (such
as 30,000). Block size can't be less than k+p.
Can you please cut and paste the actual log of the qjob tasks that failed? This is
a front-end error, but the actual problem is in the backend, ranging
anywhere from hadoop
(Since it's specifically about the book, might be better to post in the
Manning forums.)
The final version, which is a fair bit more up-to-date than the MEAP
version, is synced with 0.5. It was re-read by a technical proofreader to
make sure it all works, so I imagine most of this has been
I guess technically it's a subject for another patch: the front end can just
automatically clamp -r (block height) to be no less than k+p. Right now, if
that's not the case, only the backend catches it, and the
backend should have a meaningful message about it, but not the
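A front-end guard along those lines might look like this (a sketch; the class and method names are made up, not the actual SSVD CLI code):

```java
public class BlockHeightGuard {

    // Clamp the requested block height (-r) so it is never below k + p,
    // failing fast in the front end instead of deep in the backend job.
    public static int effectiveBlockHeight(int requested, int k, int p) {
        int minimum = k + p;
        if (requested < minimum) {
            // Emit a meaningful message rather than a cryptic backend error.
            System.err.println("block height " + requested
                + " raised to k + p = " + minimum);
            return minimum;
        }
        return requested;
    }

    public static void main(String[] args) {
        System.out.println(effectiveBlockHeight(30_000, 30, 10)); // 30000
        System.out.println(effectiveBlockHeight(25, 30, 10));     // 40
    }
}
```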
When I first started reading the Manning book, I was a little surprised by
the description of data structures for preferences in the collaborative
filtering section. Before getting the book I had really only played around
with the Vector implementations and I was used to the Vectors being generic
It's more an artifact of history than design. When this project kicked off
it was pretty open-ended -- large scale machine learning. At some early
stage we merged in my (previous, independent) project called Taste, which
was all collaborative filtering and not Hadoop-based. So that's where this
OK.
This is more of a kind of time series analysis even if the horizontal axis
isn't time.
You need to extract features from these graphs before doing clustering.
Something like extreme values of smoothed second derivative might be
useful. Spectral or cepstral features might be useful as well,
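A minimal sketch of the second-derivative idea: smooth the series with a moving average, take the discrete second derivative, and report the positions of its local extrema as candidate features. The window size and the extremum criterion here are arbitrary illustrative choices:

```java
import java.util.ArrayList;
import java.util.List;

public class CurveFeatures {

    // Symmetric moving average with the given half-window.
    public static double[] movingAverage(double[] x, int window) {
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(x.length - 1, i + window);
            double sum = 0;
            for (int j = lo; j <= hi; j++) sum += x[j];
            out[i] = sum / (hi - lo + 1);
        }
        return out;
    }

    // Central-difference second derivative: x[i-1] - 2*x[i] + x[i+1].
    public static double[] secondDerivative(double[] x) {
        double[] d2 = new double[x.length];
        for (int i = 1; i < x.length - 1; i++) {
            d2[i] = x[i - 1] - 2 * x[i] + x[i + 1];
        }
        return d2;
    }

    // Indices where |d2| is a strict local maximum: candidate feature points.
    public static List<Integer> extrema(double[] d2) {
        List<Integer> peaks = new ArrayList<>();
        for (int i = 1; i < d2.length - 1; i++) {
            double a = Math.abs(d2[i]);
            if (a > Math.abs(d2[i - 1]) && a > Math.abs(d2[i + 1])) {
                peaks.add(i);
            }
        }
        return peaks;
    }

    public static void main(String[] args) {
        // A curve with one sharp kink at index 5.
        double[] series = {0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0};
        System.out.println(extrema(secondDerivative(movingAverage(series, 1))));
    }
}
```

The resulting extremum positions (and their magnitudes) could then be fed to the clusterer as a small, shape-oriented feature vector instead of the raw graph.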
Is there a way to selectively reload data from the database for a
user? That way, we wouldn't have to pull down 80k records on every
reload?
On Mon, Aug 15, 2011 at 1:59 PM, Sean Owen sro...@gmail.com wrote:
That's more reasonable. It sounds a bit long still, but I could believe
it is due to the
On Aug 16, 2011, at 10:35 AM, Dmitriy Lyubimov wrote:
This is unusually small input. What's the block size? Use large blocks (such
as 30,000). Block size can't be less than k+p.
I did set blockSize to 30,000 (as recommended in the PDF that you wrote up). As
far as input size, the reason to
In principle, it would be really nice if we could parametrize our desire
for larger entity sets / vocabularies (have keys of type 'long' vs. 'int')
and
our precision on values ('float' vs. 'double' vs even 'boolean').
But while we've talked about this, adding a proliferation of FloatVector,
Hm. This is not common at all.
This error would surface if a map split can't accumulate at least k+p rows.
That's another requirement which usually is a non-issue -- any precomputed
split must contain at least k+p rows, which would normally fail to hold
only if the matrix is extra wide and dense, in
PS another idea i have: it is of course possible to use multiple files for
the input, such as output from another job. But again, if there
are any that contain fewer than k+p rows, they would of course generate
individual splits and must be pre-aggregated (it is similar to pig
Number of mappers is 7. DFS block size is 128 MB, the reason I think there are
7 mappers being used is that I am using a Pig script to generate the sequence
file of Vectors and that script generates 7 reducers. I am not setting
minSplitSize though.
On Aug 16, 2011, at 12:15 PM, Dmitriy
yep that's what i figured. you have 193 rows or so but distributed between 7
files so they are small and would generate several mappers and there are
probably some there with a small row count.
See my other email. This method is for big data, big files. If you want to
automate handling of small
also, with data as small as this, the stochastic noise ratio would be
significant (as in the law of large numbers), so if you really think you might
need to handle inputs that small, you'd better write a pipeline that detects
this as a corner case and just runs an in-memory decomposition. In fact, i think
dense
There isn't -- you could probably add that to your copy fairly easily. Just
clear the in memory representation and reload what you want from the DB.
On Tue, Aug 16, 2011 at 7:34 PM, Salil Apte sa...@offlinelabs.com wrote:
Is there a way to selectively reload data from the database for a
user?
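A sketch of that per-user reload idea: evict one user's cached preferences and lazily re-read only that user from the backing store. Here loadFromDb stands in for a real JDBC query, and nothing below is actual Mahout API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class SelectiveReload {

    private final Map<Long, Map<Long, Float>> cache = new ConcurrentHashMap<>();
    private final Function<Long, Map<Long, Float>> loadFromDb;

    public SelectiveReload(Function<Long, Map<Long, Float>> loadFromDb) {
        this.loadFromDb = loadFromDb;
    }

    // Only the requested user's rows are fetched, not the whole 80K records.
    public Map<Long, Float> preferencesFor(long userId) {
        return cache.computeIfAbsent(userId, loadFromDb);
    }

    // Evict one user; the next access re-reads just that user from the DB.
    public void refreshUser(long userId) {
        cache.remove(userId);
    }

    public static void main(String[] args) {
        Map<Long, Float> fake = new HashMap<>();
        fake.put(42L, 5.0f);
        SelectiveReload model = new SelectiveReload(user -> new HashMap<>(fake));
        System.out.println(model.preferencesFor(1L));
        model.refreshUser(1L);
        System.out.println(model.preferencesFor(1L));
    }
}
```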
PPS Mahout also has an in-memory SVD Colt-migrated solver, which is BTW what i
am using in local tests to assert SSVD results. Although it starts to feel
slow pretty quickly and sometimes produces errors (i think it starts feeling
slow at 10k x 1k inputs)
On Tue, Aug 16, 2011 at 12:52 PM, Dmitriy
I have decided to do something similar: Do the pipeline in memory and not
invoke map-reduce for small datasets which I think will handle the issue.
Thanks again for clearing that up.
Esh
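A dispatch along those lines could be as simple as this sketch (the threshold and branch names are made up for illustration; the ~10k x 1k figure echoes the slowness estimate mentioned below):

```java
public class SsvdDispatcher {

    static final long IN_MEMORY_CELL_LIMIT = 10_000L * 1_000L; // ~10k x 1k

    // Run an in-memory decomposition for small inputs and only fall back
    // to the MapReduce pipeline for big ones.
    public static String choosePipeline(long rows, long cols) {
        if (rows * cols <= IN_MEMORY_CELL_LIMIT) {
            return "in-memory";   // e.g. a Colt-style sequential SVD
        }
        return "map-reduce";      // the distributed SSVD job
    }

    public static void main(String[] args) {
        System.out.println(choosePipeline(260, 230));        // in-memory
        System.out.println(choosePipeline(1_000_000, 5000)); // map-reduce
    }
}
```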
Aug 16, 2011, at 1:45 PM, Dmitriy Lyubimov wrote:
PPS Mahout also has in-memory SVD Colt-migrated solver
I have several in-memory implementations almost ready to publish.
These provide a straightforward implementation of the original SSVD algorithm
from the Martinsson and Halko paper, a version that avoids QR and LQ
decompositions, and an out-of-core version that only keeps a moderately sized
amount of
Actually SGD is just this for classification. It is (pretty) scalable and
definitely not normally parallel.
On Tue, Aug 16, 2011 at 11:16 AM, Sean Owen sro...@gmail.com wrote:
There are no non-distributed counterparts for clustering and
classification.
It's not symmetric, and it would be
There are major costs incurred if we move to long indexes for matrices.
That might be a good thing to do and it would be pretty easy to provide
legacy access points, but it would hurt me to spend 30% on memory to do
this.
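A quick back-of-envelope check of that figure (a sketch counting only the raw index and value widths per sparse entry, ignoring JVM object headers and alignment):

```java
public class IndexWidthCost {

    // A sparse entry stored as (index, value) costs 4 + 8 bytes with int
    // indices and 8 + 8 bytes with long indices.
    public static double overheadPercent() {
        int intEntry = Integer.BYTES + Double.BYTES;  // 12 bytes
        int longEntry = Long.BYTES + Double.BYTES;    // 16 bytes
        return 100.0 * (longEntry - intEntry) / intEntry;
    }

    public static void main(String[] args) {
        System.out.printf("long indices cost %.0f%% more per entry%n", overheadPercent());
    }
}
```

That works out to about 33% more per entry, roughly consistent with the ~30% memory cost quoted above.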
The need on the recommendation side was to have id's that would not
On Tue, Aug 16, 2011 at 3:16 PM, Ted Dunning ted.dunn...@gmail.com wrote:
There are major costs incurred if we move to long indexes for matrices.
That might be a good thing to do and it would be pretty easy to provide
legacy access points, but it would hurt me to spend 30% on memory to do
On Tue, Aug 16, 2011 at 3:28 PM, Jake Mannix jake.man...@gmail.com wrote:
The need on the recommendation side was to have id's that would not
collide
without having to check. That is a bit different from the matrix world
where you have a conceptually dense set of integer indexes.
Why
Actually, most clustering algorithms have sequential implementations (-xm,
--method sequential) that read from and write to the same files but run a
single, non-mapreduce thread in memory using their respective reference
implementations.
-Original Message-
From: Sean Owen
Hi all,
I'm following the instructions on the Mahout wiki for launching a
non-distributed recommender service:
$ cd integration
$ cp ../examples/target/grouplens.jar ./lib
Unfortunately, I don't have an integration directory in my local file
system. I tried out my recommender by adding a simple
Are there any recommender algorithms designed for micro-sharding the
data model? The use case would be a mobile app that stores only a data
model for the phone owner.
It seems like a user-user recommender does not need data for all
users; nearby users plus some background noise should be enough
Yes. That is quite reasonably possible. It isn't really micro-sharding
since it will be different for every user rather than being a universal
sharding of all users.
On Tue, Aug 16, 2011 at 8:35 PM, Lance Norskog goks...@gmail.com wrote:
Are there any recommender algorithms designed for