Re: Canopy Clustering not scaling

2010-05-02 Thread Jeff Eastman
These sorts of optimizations could delay the growth of canopy clusters in situations where the clustering thresholds are set too low for the dataset. At some point the mapper would still OOM with enough points if they all become clusters. That decision rests with the T2 threshold, which determines if
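
A minimal sketch of the T1/T2 decision under discussion, assuming the standard canopy rule (T1 > T2; a point only seeds a new canopy if it is farther than T2 from every existing center). The class and method names are illustrative, not Mahout's actual API:

    import java.util.ArrayList;
    import java.util.List;

    class CanopySketch {
      static double t1 = 3.0;  // loose threshold
      static double t2 = 1.5;  // tight threshold: decides whether a point seeds a new canopy
      static List<double[]> centers = new ArrayList<double[]>();

      static void observe(double[] point) {
        boolean stronglyBound = false;
        for (double[] center : centers) {
          double d = distance(center, point);
          if (d < t2) {
            stronglyBound = true;  // close enough that it must not become a new center
          }
          // d < t1 would also add the point to this canopy (accumulation omitted)
        }
        if (!stronglyBound) {
          centers.add(point);  // with t2 set too low, nearly every point lands here
        }
      }

      static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
          double diff = a[i] - b[i];
          sum += diff * diff;
        }
        return Math.sqrt(sum);
      }
    }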

Re: Canopy Clustering not scaling

2010-05-02 Thread Ted Dunning
How about making the threshold adapt over time? Another option is to keep a count of all of the canopies so far and evict any that have too few points with too large an average distance. The points emitted so far would still reference these canopies, but we wouldn't be able to add new points to the
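
A hypothetical sketch of the eviction idea, assuming each canopy tracks a point count and a running distance sum; the thresholds and class names are made up for illustration:

    import java.util.Iterator;
    import java.util.List;

    class CanopyStats {
      int points;
      double totalDistance;

      double averageDistance() {
        return points == 0 ? Double.MAX_VALUE : totalDistance / points;
      }
    }

    class CanopyEviction {
      // Points already emitted keep referencing evicted canopies; we simply
      // stop adding new points to them, as suggested above.
      static void evict(List<CanopyStats> canopies, int minPoints, double maxAvgDistance) {
        Iterator<CanopyStats> it = canopies.iterator();
        while (it.hasNext()) {
          CanopyStats c = it.next();
          if (c.points < minPoints && c.averageDistance() > maxAvgDistance) {
            it.remove();
          }
        }
      }
    }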

Re: Optimization opportunity: Speed up serialization and deserialization

2010-05-02 Thread Ted Dunning
The net effect will be to decrease the effect of the downstream compressor, but I would still expect the final result to be a bit smaller with upstream improvements in representation. Speed will be better with the better representations if only because the downstream compressor will have to deal w

Re: Optimization opportunity: Speed up serialization and deserialization

2010-05-02 Thread Ted Dunning
The really major win would be if we handle integer (especially boolean) matrices specially. Attacking the 4-byte cost of the index in a sparse vector helps, but attacking the 8-byte value would be even better. For sparse boolean matrices, the value can go away entirely. All of these efforts will have
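
One way the "value can go away entirely" point could look for a sparse boolean vector: only the non-zero indexes are written. This is a sketch, not Mahout's actual vector format:

    import java.io.DataOutput;
    import java.io.IOException;

    class SparseBooleanVectorWriter {
      static void write(DataOutput out, int cardinality, int[] nonZeroIndexes)
          throws IOException {
        out.writeInt(cardinality);
        out.writeInt(nonZeroIndexes.length);
        for (int index : nonZeroIndexes) {
          out.writeInt(index);  // the index itself could also be varint/delta encoded
        }
        // no values are written: the presence of an index implies the value 1
      }
    }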

Re: Optimization opportunity: Speed up serialization and deserialization

2010-05-02 Thread Sean Owen
It's the same approach to variable-length encoding, yes. Zig-zag is a trick to make negative numbers "compatible" with this encoding. Because two's-complement negative numbers start with a bunch of 1s, their representation is terrible under this variable-length encoding -- always of maximum length.
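
The zig-zag mapping itself is tiny; this is the standard form used by protocol buffers and Avro (0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...), so small-magnitude negatives stay short under variable-length encoding:

    class ZigZag {
      static int encode(int n) {
        return (n << 1) ^ (n >> 31);  // arithmetic shift propagates the sign bit
      }

      static int decode(int n) {
        return (n >>> 1) ^ -(n & 1);
      }
    }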

Re: Quickstart for kMeans

2010-05-02 Thread Jeff Eastman
Indeed, the wiki is pretty out of date in some areas and the actual APIs have changed (since 2008!). For users wishing to launch clustering jobs using trunk, I suggest checking out TestCDbwEvaluator and TestClusterDumper in utils, which employ the latest versions. These do not use the command-line f

Re: Optimization opportunity: Speed up serialization and deserialization

2010-05-02 Thread Robin Anil
LZO is supposedly the best option, but due to GPL restrictions it was removed. QuickLZ hasn't yet been integrated into the Hadoop code base. Robin On Mon, May 3, 2010 at 1:15 AM, Drew Farris wrote: > Is this what is commonly referred to as zig-zag encoding? Avro uses the > same > technique:

Re: Quickstart for kMeans

2010-05-02 Thread Sisir Koppaka
Two more useful resources for quickstarting with the code - http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/ http://www.lucenebootcamp.com/lucene-boot-camp-preclass-training/ On Mon, May 3, 2010 at 1:14 AM, Robin Anil wrote: > Nice work! > > On Mon, Ma

Re: Optimization opportunity: Speed up serialization and deserialization

2010-05-02 Thread Drew Farris
Is this what is commonly referred to as zig-zag encoding? Avro uses the same technique: http://hadoop.apache.org/avro/docs/1.3.2/spec.html#binary_encoding For sequential sparse vectors we could get an additional win by delta encoding the indexes. This would allow the index, stored as the differ
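
A sketch of the delta idea: with indexes stored in ascending order, only the gaps between consecutive indexes are written, and each gap is varint encoded. Class and method names are illustrative:

    import java.io.DataOutput;
    import java.io.IOException;

    class DeltaIndexWriter {
      static void writeIndexes(DataOutput out, int[] sortedIndexes) throws IOException {
        writeVarInt(out, sortedIndexes.length);
        int previous = 0;
        for (int index : sortedIndexes) {
          writeVarInt(out, index - previous);  // store the gap, not the absolute index
          previous = index;
        }
      }

      // Unsigned base-128 varint, as in protocol buffers/Avro.
      static void writeVarInt(DataOutput out, int value) throws IOException {
        while ((value & ~0x7F) != 0) {
          out.writeByte((value & 0x7F) | 0x80);
          value >>>= 7;
        }
        out.writeByte(value);
      }
    }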

Quickstart for kMeans

2010-05-02 Thread Sisir Koppaka
For GSoC students: in case anyone was going through the code and finding some difficulty in running stuff, I have updated the kMeans page on the wiki with a short quickstart shell script that will run it for you. You can tweak the settings

Re: Optimization opportunity: Speed up serialization and deserialization

2010-05-02 Thread Sean Owen
That much is expected, right? Since it stores a 4-byte index along with each 8-byte double value, the sparse representation is bigger when over 8/(4+8) = 66% of the values are non-default / non-zero. But variable-encoding the index value trims a byte or more per element depending on your assumption
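
A back-of-the-envelope check of that crossover for a hypothetical 1,000-element vector (8 bytes per dense element, 12 bytes per naive sparse entry):

    class SizeCrossover {
      public static void main(String[] args) {
        int cardinality = 1000;
        long denseBytes = 8L * cardinality;            // 8,000 bytes regardless of sparsity
        for (int nonZeros : new int[] {500, 667, 800}) {
          long sparseBytes = 12L * nonZeros;           // 4-byte index + 8-byte double
          System.out.printf("nnz=%d sparse=%d dense=%d%n", nonZeros, sparseBytes, denseBytes);
        }
        // the sparse form overtakes dense once nnz/cardinality exceeds 8/12, about 66%
      }
    }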

Re: Optimization opportunity: Speed up serialization and deserialization

2010-05-02 Thread Sean Owen
That's the one! I actually didn't know this was how PBs did the variable-length encoding but it makes sense; it's about the most efficient thing I can imagine. Values up to 16,383 fit in two bytes, which is less than a 4-byte int and the 3 bytes or so it would take under the other scheme. Could add up over th
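
For reference, the read side of such a base-128 varint (7 payload bits per byte, high bit set on every byte except the last, so values up to 2^14 - 1 = 16,383 fit in two bytes). A sketch, not Mahout's code:

    import java.io.DataInput;
    import java.io.IOException;

    class VarIntReader {
      static int readVarInt(DataInput in) throws IOException {
        int value = 0;
        int shift = 0;
        while (true) {
          int b = in.readByte() & 0xFF;
          value |= (b & 0x7F) << shift;
          if ((b & 0x80) == 0) {
            return value;
          }
          shift += 7;
        }
      }
    }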

Re: Optimization opportunity: Speed up serialization and deserialization

2010-05-02 Thread Robin Anil
On Sun, May 2, 2010 at 9:40 PM, Sean Owen wrote: > What's the specific improvement idea? > > Size and speed improvements would be good. The Hadoop serialization > mechanism is already pretty low-level, dealing directly in bytes (as > opposed to fancier stuff like Avro). It's if anything fast and

Re: Wiki Access

2010-05-02 Thread Jeff Eastman
I saw that email too, but confluence appears to be working. I've sent a request to infrastructure... On 5/2/10 9:12 AM, Robin Anil wrote: I believe they are upgrading confluence. I got an email about it yesterday On Sun, May 2, 2010 at 9:40 PM, Jeff Eastman wrote: I can't seem t

Wiki Access

2010-05-02 Thread Jeff Eastman
I can't seem to log into the wiki any more and two password reset attempts have failed to produce the promised password email (I checked my spam filter too). Does anybody have enough karma to help me out? Jeff

Re: Optimization opportunity: Speed up serialization and deserialization

2010-05-02 Thread Sean Owen
What's the specific improvement idea? Size and speed improvements would be good. The Hadoop serialization mechanism is already pretty low-level, dealing directly in bytes (as opposed to fancier stuff like Avro). If anything it's fast and lean, but quite manual. The latest Writable updates squeezed
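
A minimal example of the hand-rolled Writable style being described, where the class serializes its own fields straight to a DataOutput; the field layout here is invented for illustration:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    class PointWritable implements Writable {
      private int id;
      private double[] values = new double[0];

      public void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeInt(values.length);
        for (double v : values) {
          out.writeDouble(v);
        }
      }

      public void readFields(DataInput in) throws IOException {
        id = in.readInt();
        values = new double[in.readInt()];
        for (int i = 0; i < values.length; i++) {
          values[i] = in.readDouble();
        }
      }
    }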

Optimization opportunity: Speed up serialization and deserialization

2010-05-02 Thread Robin Anil
I am getting more and more ideas as I try to write about scaling Mahout clustering. I added a serialize and deserialize benchmark for Vectors and checked the speed of our vectors. Here is the output with Cardinality=1000 Sparsity=1000(dense) numVectors=100 loop=100 (hence writing 10K(int-doubles)
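
Roughly the shape such a micro-benchmark could take, assuming plain dense vectors written as raw doubles; this is a stand-in, not the actual Mahout benchmark code:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.util.Random;

    class SerializationBenchmark {
      public static void main(String[] args) throws IOException {
        int cardinality = 1000, numVectors = 100, loop = 100;
        Random random = new Random(42);
        double[][] vectors = new double[numVectors][cardinality];
        for (double[] v : vectors) {
          for (int i = 0; i < cardinality; i++) {
            v[i] = random.nextDouble();
          }
        }

        long start = System.nanoTime();
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (int l = 0; l < loop; l++) {
          for (double[] v : vectors) {
            for (double d : v) {
              out.writeDouble(d);  // dense: just the 8-byte values
            }
          }
        }
        out.flush();
        System.out.printf("write: %.1f ms, %d bytes%n",
            (System.nanoTime() - start) / 1e6, bytes.size());

        start = System.nanoTime();
        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        for (int l = 0; l < loop; l++) {
          for (int v = 0; v < numVectors; v++) {
            for (int i = 0; i < cardinality; i++) {
              in.readDouble();
            }
          }
        }
        System.out.printf("read: %.1f ms%n", (System.nanoTime() - start) / 1e6);
      }
    }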

Re: Canopy Clustering not scaling

2010-05-02 Thread Jeff Eastman
You could try using more, smaller input splits, but large datasets and too-small distance thresholds will choke up the mappers, with the number of canopies approaching the number of points seen by the mapper. Also the single reducer will choke unless the thresholds allow condensing the mapper canopi

Re: Canopy Clustering not scaling

2010-05-02 Thread Sean Owen
As I said, "you can imagine how the rest goes" -- this is a taste of how you might distribute the key piece of the computation you asked about, and certainly does that correctly. It is not the whole algorithm of course -- up to you. On Sun, May 2, 2010 at 1:52 PM, Robin Anil wrote: > I dont think

Re: Canopy Clustering not scaling

2010-05-02 Thread Robin Anil
I don't think you got the algorithm correct. The canopy list is empty at the start and is automatically populated using the distance threshold. This may work; I don't have a clue how to get till here. On Sun, May 2, 2010 at 6:15 PM, Sean Owen wrote: > How about this for the first phase? I think you can

Re: Canopy Clustering not scaling

2010-05-02 Thread Sean Owen
How about this for the first phase? I think you can imagine how the rest goes, more later...
Mapper 1A. map() input: one canopy; map() output: canopy ID -> canopy
Mapper 1B. (Has in memory all canopy IDs, read at startup.) map() input: one point; map() output: for each canopy ID, canopy ID -> point
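
A hedged sketch of what "Mapper 1B" could look like with the Hadoop 0.20 API: canopy IDs are side-loaded at startup and each point is emitted once per canopy ID. The types and the ID-loading are placeholders, not actual Mahout classes:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class PointToCanopyMapper extends Mapper<LongWritable, Text, Text, Text> {

      private final List<String> canopyIds = new ArrayList<String>();

      protected void setup(Context context) throws IOException, InterruptedException {
        // In practice the canopy IDs would be read from the DistributedCache or HDFS;
        // hard-coded here only to keep the sketch self-contained.
        canopyIds.add("canopy-0");
        canopyIds.add("canopy-1");
      }

      protected void map(LongWritable key, Text point, Context context)
          throws IOException, InterruptedException {
        // Emit the point once per canopy ID so a reducer sees, per canopy,
        // all points it may need to measure against that canopy.
        for (String canopyId : canopyIds) {
          context.write(new Text(canopyId), point);
        }
      }
    }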

Re: Canopy Clustering not scaling

2010-05-02 Thread Robin Anil
On Sun, May 2, 2010 at 5:45 PM, Sean Owen wrote: > Not surprising indeed, that won't scale at some point. > What is the stage that needs everything in memory? Maybe describing > that helps imagine solutions. > The algorithm is simple: for each point read into the mapper, find the canopy it

Re: Canopy Clustering not scaling

2010-05-02 Thread Sean Owen
Not surprising indeed, that won't scale at some point. What is the stage that needs everything in memory? Maybe describing that helps imagine solutions. The typical reason for this, in my experience back in the day, was needing to look up data infrequently in a key-value way. "Side-loading" off HD

Canopy Clustering not scaling

2010-05-02 Thread Robin Anil
Keeping all canopies in memory is not making things scale. I frequently run into out-of-memory errors when the distance thresholds are not good on Reuters. Any ideas on optimizing this? Robin

Re: svn commit: r939867 - in /lucene/mahout/trunk: core/src/main/java/org/apache/mahout/clustering/dirichlet/ core/src/main/java/org/apache/mahout/clustering/kmeans/ core/src/main/java/org/apache/ma

2010-05-02 Thread Robin Anil
Works fine :) Sorry about that.