Re: MinHash implementation

2011-08-16 Thread 刘鎏
I think that if your input vector is a set, ele.get() should be used; if instead your input vector is a sparse vector, ele.index() should be used. Pls correct me if I'm wrong. for (int i = 0; i < numHashFunctions; i++) { for (Vector.Element ele : featureVector) { /// Shouldn't the

Re: Slow ReloadFromJDBCDataModel

2011-08-16 Thread Ted Dunning
I doubt that query compilation is the major cost here. The problem is that too many records are being moved too often. Sent from my iPad On Aug 15, 2011, at 10:23 PM, Lance Norskog goks...@gmail.com wrote: The standard advice also applies: use stored procedures if you can. If not, use

Re: Slow ReloadFromJDBCDataModel

2011-08-16 Thread Sean Owen
Yes, I also doubt that the cost of parsing a simple select a,b,c from x query matters compared to sending 80K records across the network. On Tue, Aug 16, 2011 at 6:23 AM, Lance Norskog goks...@gmail.com wrote: The standard advice also applies: use stored procedures if you can. If not, use

Re: MinHash implementation

2011-08-16 Thread Sean Owen
I'm not the authoritative voice here, but I would also agree with your interpretation -- it's indices rather than values that I'd use. I can imagine using min-hash on values, but that would not seem to be the most natural thing to do. (I don't understand the comment about set and get(). Vectors
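To illustrate the interpretation discussed in this thread (hashing the indices of the non-zero entries rather than their values), here is a minimal, self-contained sketch. It is not Mahout's MinHash code: the class name IndexMinHash, the hash family (a * x + b) mod PRIME, and the seed handling are all assumptions made for the example, and indices are assumed to be non-negative.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative sketch only; not taken from Mahout. */
final class IndexMinHash {
    // Assumed hash family: h_i(x) = (a_i * x + b_i) mod PRIME, with PRIME = 2^31 - 1.
    private static final long PRIME = 2147483647L;

    private final long[] a;
    private final long[] b;

    IndexMinHash(int numHashFunctions, long seed) {
        Random rnd = new Random(seed);
        a = new long[numHashFunctions];
        b = new long[numHashFunctions];
        for (int i = 0; i < numHashFunctions; i++) {
            a[i] = 1 + rnd.nextInt(Integer.MAX_VALUE - 1); // in [1, PRIME - 1]
            b[i] = rnd.nextInt(Integer.MAX_VALUE);         // in [0, PRIME - 1]
        }
    }

    /**
     * Computes one minimum per hash function over the indices of the
     * non-zero entries. The stored values are never looked at.
     */
    long[] signature(int[] nonZeroIndices) {
        long[] mins = new long[a.length];
        Arrays.fill(mins, Long.MAX_VALUE);
        for (int i = 0; i < a.length; i++) {
            for (int index : nonZeroIndices) {
                long h = (a[i] * index + b[i]) % PRIME;
                if (h < mins[i]) {
                    mins[i] = h;
                }
            }
        }
        return mins;
    }
}
```

Because only the index set is hashed, two vectors with the same non-zero positions get the same signature regardless of the values stored there, which is the set-like behaviour min-hash is normally used for.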

Re: Clustering Data

2011-08-16 Thread Alexander Kerner
Hello Ted, thanks for your help! To give you more details: Clustering in this case has something of pattern recognition: for the first graph, I am looking for the following pattern: * * * * * * for the second graph, I basically want the following pattern: * *

Re: Article on Mahout recommenders and Cassandra

2011-08-16 Thread Marko Ciric
Hi Sean, Why is only userCache cleared on refresh? On 15 August 2011 19:32, Sean Owen sro...@gmail.com wrote: For the interested, I wrote a follow-up to this article, focusing on using *Hadoop* with Cassandra and Mahout: http://acunu.com/blogs/sean-owen/scaling-cassandra-and-mahout-hadoop/

Re: Article on Mahout recommenders and Cassandra

2011-08-16 Thread Sean Owen
We're talking about the first article, and CassandraDataModel? That is just a mistake, I'll fix it. On Tue, Aug 16, 2011 at 1:21 PM, Marko Ciric ciric.ma...@gmail.com wrote: Hi Sean, Why is only userCache cleared on refresh?

Re: Issues with memory use and inconsistent or state-influenced results when using CBayesAlgorithm

2011-08-16 Thread Mut
Hi, Thanks for your post. However, the proposed solution will not work because the getFeatureID is needed to populate the weight matrix. So the proposed modifications to the code will result in not loading the model correctly and a wrong execution. The problem with the large memory requirement

RE: Mahout KMeans Output

2011-08-16 Thread Jeff Eastman
+1 Me too. If there aren't already unit tests which guarantee this then we need to add them. This is a pretty important capability not to guarantee in the API. -Original Message- From: Blake Lemoine [mailto:bal2...@gmail.com] Sent: Saturday, August 13, 2011 4:46 PM To:

Re: MinHash implementation

2011-08-16 Thread Jeff Hansen
I just looked at the initial JIRA to create this implementation and saw the example code that uses it -- https://issues.apache.org/jira/browse/MAHOUT-344 The LastfmDataConverter class is indeed creating a vector with the indices stored in the values and spurious information stored in the indices:

Re: MinHash implementation

2011-08-16 Thread Sean Owen
It's the right place -- best-effort question-answering is not always that good. A JIRA is a good thing if you have a specific idea of the issue / enhancement and ideally a proposed patch. That is tracked with some loose regularity and so might get more attention. On Tue, Aug 16, 2011 at 5:13 PM,

Re: Errors in SSVD

2011-08-16 Thread Eshwaran Vijaya Kumar
Thanks again. I am using 0.5 right now. We will try to patch it up and see how it performs. In the meantime, I am having another (possibly user?) error: I have a 260 x 230 matrix. I set k+p = 40, and it fails with Exception in thread "main" java.io.IOException: Q job unsuccessful. at

Re: df-count/data does not exist

2011-08-16 Thread Jeff Hansen
I've been getting this exception a lot as well. I've been going through some of the examples in Mahout In Action book, and I get errors a lot when I follow the instructions word for word -- either due to typos in the book (it seems like there were a few sections where a script was updated due to

Re: Errors in SSVD

2011-08-16 Thread Dmitriy Lyubimov
This is unusually small input. What's the block size? Use large blocks (such as 30,000). Block size can't be less than k+p. Can you please cut and paste the actual log of the qjob tasks that failed? This is a front-end error, but the actual problem is in the backend, ranging anywhere from hadoop

Re: df-count/data does not exist

2011-08-16 Thread Sean Owen
(Since it's specifically about the book, might be better to post in the Manning forums.) The final version, which is a fair bit more up-to-date than the MEAP version, is synced with 0.5. It was re-read by a technical proofreader to make sure it all works, so I imagine most of this has been

Re: Errors in SSVD

2011-08-16 Thread Dmitriy Lyubimov
I guess technically it's a subject for another patch: the front end can just automatically force -r (block height) to be no less than k+p. Right now, if that's not the case, only the backend catches it, and the backend should have a meaningful message about it, but not the
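The front-end guard suggested here could look roughly like the following. This is a hedged sketch, not code from Mahout's SSVD solver: the class and method names are hypothetical, and the only rule taken from the thread is that block height must not be less than k+p.

```java
/** Illustrative sketch only; names are hypothetical, not Mahout's API. */
final class SsvdBlockHeight {
    /**
     * Returns the block height to use: the requested value, raised to k + p
     * if the request is too small.
     */
    static int effectiveBlockHeight(int requested, int k, int p) {
        if (k <= 0 || p < 0) {
            throw new IllegalArgumentException("k must be positive and p non-negative");
        }
        return Math.max(requested, k + p);
    }
}
```

With k+p = 40, a requested height of 10 would be raised to 40, while the recommended 30,000 would pass through unchanged.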

Vectors vs Preferences

2011-08-16 Thread Jeff Hansen
When I first started reading the Manning book, I was a little surprised by the description of data structures for preferences in the collaborative filtering section. Before getting the book I had really only played around with the Vector implementations and I was used to the Vectors being generic

Re: Vectors vs Preferences

2011-08-16 Thread Sean Owen
It's more an artifact of history than design. When this project kicked off it was pretty open-ended -- large scale machine learning. At some early stage we merged in my (previous, independent) project called Taste, which was all collaborative filtering and not Hadoop-based. So that's where this

Re: Clustering Data

2011-08-16 Thread Ted Dunning
OK. This is more of a kind of time series analysis even if the horizontal axis isn't time. You need to extract features from these graphs before doing clustering. Something like extreme values of smoothed second derivative might be useful. Spectral or cepstral features might be useful as well,

Re: Slow ReloadFromJDBCDataModel

2011-08-16 Thread Salil Apte
Is there a way to selectively reload data from the database for a user? That way, we wouldn't have to pull down 80k records on every reload? On Mon, Aug 15, 2011 at 1:59 PM, Sean Owen sro...@gmail.com wrote: That's more reasonable. It sounds a bit long still but could believe it is due to the

Re: Errors in SSVD

2011-08-16 Thread Eshwaran Vijaya Kumar
On Aug 16, 2011, at 10:35 AM, Dmitriy Lyubimov wrote: This is unusually small input. What's the block size? Use large blocks (such as 30,000). Block size can't be less than k+p. I did set blockSize to 30,000 (as recommended in the PDF that you wrote up). As far as input size, the reason to

Re: Vectors vs Preferences

2011-08-16 Thread Jake Mannix
In principle, it would be really nice if we could parametrize our desire for larger entity sets / vocabularies (have keys of type 'long' vs. 'int') and our precision on values ('float' vs. 'double' vs even 'boolean'). But while we've talked about this, adding a proliferation of FloatVector,

Re: Errors in SSVD

2011-08-16 Thread Dmitriy Lyubimov
Hm. This is not common at all. This error would surface if map split can't accumulate at least k+p rows. That's another requirement which usually is non-issue -- any precomputed split must contain at least k+p rows, which normally would not be the case only if matrix is extra wide and dense, in

Re: Errors in SSVD

2011-08-16 Thread Dmitriy Lyubimov
PS another idea that i have is that it is possible to use multiple files for the input of course, such as output from another job. But again, if there are any that contain less than k+p rows, they of course would generate individual splits and must be pre-aggregated (it is similar to pig

Re: Errors in SSVD

2011-08-16 Thread Eshwaran Vijaya Kumar
Number of mappers is 7. DFS block size is 128 MB, the reason I think there are 7 mappers being used is that I am using a Pig script to generate the sequence file of Vectors and that script generates 7 reducers. I am not setting minSplitSize though. On Aug 16, 2011, at 12:15 PM, Dmitriy

Re: Errors in SSVD

2011-08-16 Thread Dmitriy Lyubimov
yep that's what i figured. you have 193 rows or so but distributed between 7 files so they are small and would generate several mappers and there are probably some there with a small row count. See my other email. This method is for big data, big files. If you want to automate handling of small
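A pre-flight check for the condition described in this thread (every split must hold at least k+p rows) might be sketched as below. The class and method names are hypothetical, and the per-file row counts in the usage note are invented for illustration; the thread only says there are roughly 193 rows spread over 7 files.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not part of Mahout. */
final class SplitRowCheck {
    /** Returns the positions of splits that hold fewer than k + p rows. */
    static List<Integer> undersizedSplits(int[] rowsPerSplit, int k, int p) {
        List<Integer> undersized = new ArrayList<>();
        for (int i = 0; i < rowsPerSplit.length; i++) {
            if (rowsPerSplit[i] < k + p) {
                undersized.add(i);
            }
        }
        return undersized;
    }
}
```

If the 193 rows were spread evenly, for example as {28, 28, 28, 28, 27, 27, 27}, every one of the 7 splits would fall below k+p = 40, so the files would need to be pre-aggregated or the decomposition run in memory instead.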

Re: Errors in SSVD

2011-08-16 Thread Dmitriy Lyubimov
also, with data as small as this, stochastic noise ratio would be significant (as in 'big numbers' law) so if you really think you might need to handle inputs that small, you better write a pipeline that detects this as a corner case and just runs in-memory decomposition. In fact, i think dense

Re: Slow ReloadFromJDBCDataModel

2011-08-16 Thread Sean Owen
There isn't -- you could probably add that to your copy fairly easily. Just clear the in memory representation and reload what you want from the DB. On Tue, Aug 16, 2011 at 7:34 PM, Salil Apte sa...@offlinelabs.com wrote: Is there a way to selectively reload data from the database for a user?
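The suggestion to clear the in-memory representation and reload only what is needed could be prototyped along these lines. This is a sketch under stated assumptions, not ReloadFromJDBCDataModel or any other Mahout class: PerUserPreferenceCache is a hypothetical name, the loader function stands in for a per-user query against the database, and the class is not thread-safe.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

/** Illustrative sketch only; not Mahout's data model API. */
final class PerUserPreferenceCache {
    // userId -> (itemId -> preference value)
    private final Map<Long, Map<Long, Float>> prefsByUser = new HashMap<>();
    // Stand-in for a per-user database query (an assumption of this example).
    private final LongFunction<Map<Long, Float>> loader;

    PerUserPreferenceCache(LongFunction<Map<Long, Float>> loader) {
        this.loader = loader;
    }

    /** Returns cached preferences, loading them on first access. */
    Map<Long, Float> get(long userId) {
        return prefsByUser.computeIfAbsent(userId, id -> loader.apply(id));
    }

    /** Replaces one user's cached preferences without touching other users. */
    void reloadUser(long userId) {
        prefsByUser.put(userId, loader.apply(userId));
    }
}
```

Calling reloadUser for a single user costs one small query rather than pulling all 80K records again, at the price of other users' data possibly being stale until they are reloaded too.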

Re: Errors in SSVD

2011-08-16 Thread Dmitriy Lyubimov
PPS Mahout also has in-memory SVD Colt-migrated solver which is BTW what i am using in local tests to assert SSVD results. Although it starts to feel slow pretty quickly and sometimes produces errors (i think it starts feeling slow at 10k x 1k inputs) On Tue, Aug 16, 2011 at 12:52 PM, Dmitriy

Re: Errors in SSVD

2011-08-16 Thread Eshwaran Vijaya Kumar
I have decided to do something similar: Do the pipeline in memory and not invoke map-reduce for small datasets which I think will handle the issue. Thanks again for clearing that up. Esh Aug 16, 2011, at 1:45 PM, Dmitriy Lyubimov wrote: PPS Mahout also has in-memory SVD Colt-migrated solver

Re: Errors in SSVD

2011-08-16 Thread Ted Dunning
I have several in-memory implementations almost ready to publish. These provide straightforward implementation of the original SSVD algorithm from the Martinsson and Halko paper, a version that avoids QR and LQ decompositions and an out-of-core version that only keeps a moderate sized amount of

Re: Vectors vs Preferences

2011-08-16 Thread Ted Dunning
Actually SGD is just this for classification. It is (pretty) scalable and definitely not normally parallel. On Tue, Aug 16, 2011 at 11:16 AM, Sean Owen sro...@gmail.com wrote: There are no non-distributed counterparts for clustering and classification. It's not symmetric, and it would be

Re: Vectors vs Preferences

2011-08-16 Thread Ted Dunning
There are major costs incurred if we move to long indexes for matrices. That might be a good thing to do and it would be pretty easy to provide legacy access points, but it would hurt me to spend 30% on memory to do this. The need on the recommendation side was to have id's that would not

Re: Vectors vs Preferences

2011-08-16 Thread Jake Mannix
On Tue, Aug 16, 2011 at 3:16 PM, Ted Dunning ted.dunn...@gmail.com wrote: There are major costs incurred if we move to long indexes for matrices. That might be a good thing to do and it would be pretty easy to provide legacy access points, but it would hurt me to spend 30% on memory to do

Re: Vectors vs Preferences

2011-08-16 Thread Ted Dunning
On Tue, Aug 16, 2011 at 3:28 PM, Jake Mannix jake.man...@gmail.com wrote: The need on the recommendation side was to have id's that would not collide without having to check. That is a bit different from the matrix world where you have a conceptually dense set of integer indexes. Why

RE: Vectors vs Preferences

2011-08-16 Thread Jeff Eastman
Actually, most clustering algorithms have sequential implementations (-xm, --method sequential) that read from and write to the same files but run a single, non-mapreduce thread in memory using their respective reference implementations. -Original Message- From: Sean Owen

How to launch a single-node recommender service?

2011-08-16 Thread Ozgun Erdogan
Hi all, I'm following the instructions on the Mahout wiki for launching a non-distributed recommender service: $ cd integration $ cp ../examples/target/grouplens.jar ./lib Unfortunately, I don't have an integration directory in my local file system. I tried out my recommender by adding a simple

Single-user recommenders?

2011-08-16 Thread Lance Norskog
Are there any recommender algorithms designed for micro-sharding the data model? The use case would be a mobile app that stores only a data model for the phone owner. It seems like a user-user recommender does not need data for all users; nearby users plus some background noise should be enough

Re: Single-user recommenders?

2011-08-16 Thread Ted Dunning
Yes. That is quite reasonably possible. It isn't really micro-sharding since it will be different for every user rather than being a universal sharding of all users. On Tue, Aug 16, 2011 at 8:35 PM, Lance Norskog goks...@gmail.com wrote: Are there any recommender algorithms designed for