Re: Streaming KMeans clustering

2013-12-25 Thread Johannes Schulte
To be honest, i always cancelled the sketching after a while because i wasn't satisfied with the points per second speed. The version used is the 0.8 release. if i find the time i'm gonna look what is called when and where and how often and what the problem could be. On Thu, Dec 26, 2013 at 8:22

Re: Streaming KMeans clustering

2013-12-25 Thread Ted Dunning
Interesting. In Dan's tests on sparse data, he got about 10x speedup net. You didn't run multiple sketching passes did you? Also, which version? There was a horrendous clone in there at one time. On Wed, Dec 25, 2013 at 2:07 PM, Johannes Schulte < johannes.schu...@gmail.com> wrote: > ever

Re: Happy Holidays!

2013-12-25 Thread Tharindu Rusira
Happy Holidays everyone !!! :) On Wed, Dec 25, 2013 at 8:09 AM, Andrew Musselman < andrew.mussel...@gmail.com> wrote: > Merry Christmas and a Happy New Year! > > > On Dec 24, 2013, at 3:36 PM, Stevo Slavić wrote: > > > > Happy Holidays Everyone! > > > > > > On Tue, Dec 24, 2013 at 12:28 PM, Fra

[jira] [Commented] (MAHOUT-1388) Add command line support and logging for MLP

2013-12-25 Thread Yexi Jiang (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856671#comment-13856671 ] Yexi Jiang commented on MAHOUT-1388: [~smarthi] OK, I'll add it. Currently, it only s

Re: Streaming KMeans clustering

2013-12-25 Thread Johannes Schulte
everybody should have the right to do job.getConfiguration().set("mapred.reduce.child.java.opts", "-Xmx2G"); for that :) For my problems, i always felt the sketching took too long. i put up a simple comparison here: g...@github.com:baunz/cluster-comprarison.git it generates some sample vector

Re: Streaming KMeans clustering

2013-12-25 Thread Suneel Marthi
Not sure how that would work in a corporate setting wherein there's a fixed systemwide setting that cannot be overridden. Sent from my iPhone > On Dec 25, 2013, at 9:44 AM, Sebastian Schelter wrote: >> On 25.12.2013 14:19, Suneel Marthi wrote: >> On Tuesday, December

Re: Streaming KMeans clustering

2013-12-25 Thread Sebastian Schelter
On 25.12.2013 14:19, Suneel Marthi wrote: >>> On Tuesday, December 24, 2013 4:23 PM, Ted Dunning wrote: >>> For reference, on a 16 core machine, I was able to run the sequential version of streaming k-means on 1,000,000 points, each with 10 dimensions in about 20 se

Re: Streaming KMeans clustering

2013-12-25 Thread Suneel Marthi
>> On Tuesday, December 24, 2013 4:23 PM, Ted Dunning wrote: >> For reference, on a 16 core machine, I was able to run the sequential version of streaming k-means on 1,000,000 points, each with 10 dimensions in about 20 seconds. The map-reduce versions are comparable subject to scal

[jira] [Updated] (MAHOUT-1358) StreamingKMeansThread throws IllegalArgumentException when REDUCE_STREAMING_KMEANS is set to true

2013-12-25 Thread Suneel Marthi (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suneel Marthi updated MAHOUT-1358: Description: Running StreamingKMeans Clustering with REDUCE_STREAMING_KMEANS = true and when

[jira] [Commented] (MAHOUT-1388) Add command line support and logging for MLP

2013-12-25 Thread Suneel Marthi (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856592#comment-13856592 ] Suneel Marthi commented on MAHOUT-1388: [~yxjiang] Also please provide adequate Lo

Re: Streaming KMeans clustering

2013-12-25 Thread Suneel Marthi
@Johannes, how many datapoints did u have in ur test?  Since the Streaming KMeans runs through a single reducer how much memory did u have to allocate if u had like a million data points?  What was the expectedDistanceCutoff you had? @All, My experience so far has been that once you are done wit

Re: Streaming KMeans clustering

2013-12-25 Thread Suneel Marthi
On Wednesday, December 25, 2013 5:20 AM, Sebastian Schelter wrote: Hi Johannes, can you share some details about the dataset that you ran streaming k-means on (number of datapoints, cardinality, etc)? @Ted/Suneel Shouldn't the approximate searching techniques (e.g. projection search) he

Re: Streaming KMeans clustering

2013-12-25 Thread Suneel Marthi
@Johannes, from reading your 2 emails I couldn't quite tell whether Streaming KMeans worked for you or not. What were the issues you had identified with pending additions and projection? On Wednesday, December 25, 2013 5:40 AM, Johannes Schulte wrote: Hey Sebastian, it was a text like clustering pr

Re: Streaming KMeans clustering

2013-12-25 Thread Johannes Schulte
Hey Sebastian, it was a text like clustering problem with a dimensionality of 100 000, the number of data points could have been million but i always cancelled it after a while (i used the java classes, not the command line version and monitored the progress). As for my statements above: The

Re: Streaming KMeans clustering

2013-12-25 Thread Sebastian Schelter
Hi Johannes, can you share some details about the dataset that you ran streaming k-means on (number of datapoints, cardinality, etc)? @Ted/Suneel Shouldn't the approximate searching techniques (e.g. projection search) help cope with high dimensional inputs? --sebastian On 25.12.2013 10:42, Joh

Re: Streaming KMeans clustering

2013-12-25 Thread Johannes Schulte
Hi, i also had problems getting up to speed but i attributed that to the cardinality of the vectors. i didn't do the math exactly but while streaming k-means improves over regular k-means in using log(k) and (n_umber of datapoints / k) passes, the d_imension parameter from the original k*d*