Re: Streaming KMeans clustering
Hi, I also had problems getting up to speed, and I attributed that to the cardinality of the vectors. I didn't do the math exactly, but while streaming k-means improves over regular k-means by needing only log(k) search cost and (number of datapoints / k) passes, the dimension parameter d from the original k*d*n cost stays untouched, right? What is your vectors' cardinality?

On Wed, Dec 25, 2013 at 5:19 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:

Ted, What were the CLI parameters when you ran this test for 1M points - no. of clusters k, km, distanceMeasure, projectionSearch, estimatedDistanceCutoff?

On Tuesday, December 24, 2013 4:23 PM, Ted Dunning ted.dunn...@gmail.com wrote:

For reference, on a 16-core machine, I was able to run the sequential version of streaming k-means on 1,000,000 points, each with 10 dimensions, in about 20 seconds. The map-reduce versions are comparable, subject to scaling, except for startup time.

On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter s...@apache.org wrote:

That the algorithm runs a single reducer is expected. The algorithm creates a sketch of the data in parallel in the map phase, which is collected by the reducer afterwards. The reducer then applies an expensive in-memory clustering algorithm to the sketch. Which dataset are you using for testing? I can also do some tests on a cluster here. I can imagine two possible causes for the problems: maybe there's a problem with the vectors and some calculations take very long because the wrong access pattern or implementation is chosen. Another problem could be that the mappers and reducers have too little memory and spend a lot of time running garbage collections. --sebastian

On 23.12.2013 22:14, Suneel Marthi wrote:

Has anyone been successful running Streaming KMeans clustering on a large dataset (> 100,000 points)? It just seems to take a very long time (> 4 hrs) for the mappers to finish on about 300K data points, and the reduce phase has only a single reducer running and throws an OOM, failing the job several hours after it has been kicked off. It's the same story when trying to run in sequential mode. Looking at the code, the bottleneck seems to be in StreamingKMeans.clusterInternal(); without understanding the behaviour of the algorithm I am not sure if the sequence of steps in there is correct. There are a few calls that call themselves repeatedly over and over again, like StreamingKMeans.clusterInternal() and Searcher.searchFirst(). We really need to have this working on datasets that are larger than the 20K Reuters dataset. I am trying to run this on 300K vectors with k = 100, km = 1261 and FastProjectSearch.
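The log(k)-vs-k*d*n point above can be checked with back-of-the-envelope arithmetic. The sketch below uses the figures discussed in this thread (300K points, k = 100) and a 100,000-dimensional input; it is an illustration of the asymptotics, not Mahout code:

```java
// Back-of-the-envelope cost comparison (a sketch, not Mahout code):
// one Lloyd iteration of regular k-means does n * k distance
// computations, each O(d); the streaming sketch pass does roughly
// n * log2(k) searches, each still O(d). The dimension d appears in
// both, so the speedup is about k / log2(k) regardless of cardinality.
public class CostSketch {
    static long lloydOps(long n, long k, long d) {
        return n * k * d; // distance ops for one Lloyd iteration
    }

    static long streamingOps(long n, long k, long d) {
        return (long) (n * (Math.log(k) / Math.log(2)) * d);
    }

    public static void main(String[] args) {
        long n = 300_000, k = 100, d = 100_000; // figures from this thread
        System.out.println(lloydOps(n, k, d) / streamingOps(n, k, d));
    }
}
```

With k = 100 the per-pass ratio works out to roughly k / log2(k), about 15x, and as Johannes notes it is independent of d: high dimensionality makes each distance computation expensive for both algorithms equally.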
Re: Streaming KMeans clustering
Hi Johannes, can you share some details about the dataset that you ran streaming k-means on (number of datapoints, cardinality, etc.)?

@Ted/Suneel Shouldn't the approximate search techniques (e.g. projection search) help cope with high-dimensional inputs?

--sebastian

On 25.12.2013 10:42, Johannes Schulte wrote: [...]
Re: Streaming KMeans clustering
Hey Sebastian, it was a text-like clustering problem with a dimensionality of 100,000; the number of data points could have been a million, but I always cancelled it after a while (I used the Java classes, not the command-line version, and monitored the progress). As for my statements above: they are possibly not quite correct. Sure, the projection search reduces the amount of searching needed, but when I looked into the code I identified two problems, if I remember correctly:

- the searching of pending additions
- the projection itself

But I'll have to retry that and look into the code again. I ended up using the old k-means code on a sample of the data.

cheers, johannes

On Wed, Dec 25, 2013 at 11:17 AM, Sebastian Schelter s...@apache.org wrote: [...]
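For readers unfamiliar with the technique under discussion, a minimal single-projection search can be sketched as follows. This is a toy illustration of the general idea (project onto a random direction, keep points sorted by the projected scalar, compare a query only against its neighbours in projection order); it is not Mahout's FastProjectionSearch, and all names here are hypothetical:

```java
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Toy single-projection nearest-neighbour search (NOT Mahout's
// FastProjectionSearch): points are projected onto one random direction
// and kept sorted by that scalar; a query checks only the two candidates
// adjacent to its own projection value.
public class ProjectionSearchSketch {
    private final double[] direction;
    private final TreeMap<Double, double[]> byProjection = new TreeMap<>();

    ProjectionSearchSketch(int dim, Random rnd) {
        direction = new double[dim];
        for (int i = 0; i < dim; i++) {
            direction[i] = rnd.nextGaussian(); // random projection direction
        }
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    static double squaredDist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return s;
    }

    void add(double[] v) {
        byProjection.put(dot(v, direction), v);
    }

    // approximate nearest neighbour: compare only the entries just below
    // and just above the query's projection value
    double[] searchFirst(double[] q) {
        double p = dot(q, direction);
        Map.Entry<Double, double[]> lo = byProjection.floorEntry(p);
        Map.Entry<Double, double[]> hi = byProjection.ceilingEntry(p);
        if (lo == null) return hi.getValue();
        if (hi == null) return lo.getValue();
        return squaredDist(q, lo.getValue()) <= squaredDist(q, hi.getValue())
            ? lo.getValue() : hi.getValue();
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        ProjectionSearchSketch search = new ProjectionSearchSketch(10, rnd);
        for (int i = 0; i < 1000; i++) {
            double[] v = new double[10];
            for (int j = 0; j < 10; j++) v[j] = rnd.nextGaussian();
            search.add(v);
        }
        // distance from the origin to its approximate nearest neighbour
        System.out.println(squaredDist(new double[10], search.searchFirst(new double[10])));
    }
}
```

The projection itself costs O(nnz) per sparse vector, which is cheap; the thread's complaint is about the surrounding bookkeeping (pending additions, removals), not the projection arithmetic.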
Re: Streaming KMeans clustering
@Johannes, I didn't quite get from reading your 2 emails whether Streaming KMeans worked for you or not. What were the issues you had identified with pending additions and the projection?

On Wednesday, December 25, 2013 5:40 AM, Johannes Schulte johannes.schu...@gmail.com wrote: [...]
[jira] [Commented] (MAHOUT-1388) Add command line support and logging for MLP
[ https://issues.apache.org/jira/browse/MAHOUT-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13856592#comment-13856592 ]

Suneel Marthi commented on MAHOUT-1388:
---
[~yxjiang] Also please provide adequate logging statements in your code. Would you be only supporting CSV as input format?

Add command line support and logging for MLP
Key: MAHOUT-1388
URL: https://issues.apache.org/jira/browse/MAHOUT-1388
Project: Mahout
Issue Type: Improvement
Components: Classification
Affects Versions: 1.0
Reporter: Yexi Jiang
Labels: mlp, sgd
Fix For: 1.0

The user should have the ability to run the Perceptron from the command line. There are two modes for MLP, training and labeling: the first takes the data as input and outputs the model; the second takes the model and unlabeled data as input and outputs the results. The parameters are as follows:

--mode -mo // train or label
--input -i (input data)
--model -mo // in training mode, this is the location to store the model (if the specified location has an existing model, it will update the model through incremental learning); in labeling mode, this is the location of the model to use
--output -o // this is only useful in labeling mode
--layersize -ls (no. of units per hidden layer) // use comma-separated numbers to indicate the number of neurons in each layer (including input layer and output layer)
--momentum -m
--learningrate -l
--regularizationweight -r
--costfunction -cf // the type of cost function

For example, to train a 3-layer (including input, hidden, and output) MLP with the minus_squared cost function, 0.1 learning rate, 0.1 momentum rate, and 0.01 regularization weight, the parameters would be:

mlp -mo train -i /tmp/training-data.csv -o /tmp/model.model -ls 5,3,1 -l 0.1 -m 0.1 -r 0.01 -cf minus_squared

This command would read the training data from /tmp/training-data.csv and write the trained model to /tmp/model.model. If a user needs to use an existing model, they would use the following command:

mlp -mo label -i /tmp/unlabel-data.csv -m /tmp/model.model -o /tmp/label-result

Moreover, we should provide default values if the user does not specify any.

-- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (MAHOUT-1358) StreamingKMeansThread throws IllegalArgumentException when REDUCE_STREAMING_KMEANS is set to true
[ https://issues.apache.org/jira/browse/MAHOUT-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suneel Marthi updated MAHOUT-1358:
---
Description: Running StreamingKMeans clustering with REDUCE_STREAMING_KMEANS = true and no estimatedDistanceCutoff specified throws the following error:

{code}
java.lang.IllegalArgumentException: Must have nonzero number of training and test vectors. Asked for %.1f %% of %d vectors for test [10.00149011612, 0]
  at com.google.common.base.Preconditions.checkArgument(Preconditions.java:120)
  at org.apache.mahout.clustering.streaming.cluster.BallKMeans.splitTrainTest(BallKMeans.java:176)
  at org.apache.mahout.clustering.streaming.cluster.BallKMeans.cluster(BallKMeans.java:192)
  at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.getBestCentroids(StreamingKMeansReducer.java:107)
  at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:73)
  at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:37)
  at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
  at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
  at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
{code}

The issue is caused by the following code in StreamingKMeansThread.call():

{code}
Iterator<Centroid> datapointsIterator = datapoints.iterator();
if (estimateDistanceCutoff == StreamingKMeansDriver.INVALID_DISTANCE_CUTOFF) {
  List<Centroid> estimatePoints = Lists.newArrayListWithExpectedSize(NUM_ESTIMATE_POINTS);
  while (datapointsIterator.hasNext() && estimatePoints.size() < NUM_ESTIMATE_POINTS) {
    estimatePoints.add(datapointsIterator.next());
  }
  estimateDistanceCutoff = ClusteringUtils.estimateDistanceCutoff(estimatePoints, searcher.getDistanceMeasure());
}
StreamingKMeans clusterer = new StreamingKMeans(searcher, numClusters, estimateDistanceCutoff);
while (datapointsIterator.hasNext()) {
  clusterer.cluster(datapointsIterator.next());
}
{code}

The code uses the same iterator twice, and it fails on the second use for obvious reasons.

was: Running StreamingKMeans clustering with REDUCE_STREAMING_KMEANS = true throws the following error [same stack trace and code excerpt as above]

StreamingKMeansThread throws IllegalArgumentException when REDUCE_STREAMING_KMEANS is set to true
Key: MAHOUT-1358
URL: https://issues.apache.org/jira/browse/MAHOUT-1358
Project: Mahout
Issue Type: Bug
Components:
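One way to avoid consuming the iterator twice is to buffer the prefix used for the cutoff estimate and then cluster both the buffer and the remainder. The sketch below demonstrates the pattern with simplified stand-in types (plain `double[]` instead of Mahout's Centroid, and a list standing in for the clusterer); it is an illustration of the fix, not the actual patch:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch of one way to fix the MAHOUT-1358 double-iteration bug: buffer
// the prefix consumed for the distance-cutoff estimate, then process both
// the buffer and the rest of the iterator, so no points are silently
// dropped. Types are simplified stand-ins, not Mahout's classes.
public class IteratorReuseFix {
    static final int NUM_ESTIMATE_POINTS = 100;

    static List<double[]> clusterAll(Iterable<double[]> datapoints) {
        Iterator<double[]> it = datapoints.iterator();

        // 1. buffer a prefix for the distance-cutoff estimate
        List<double[]> estimatePoints = new ArrayList<>();
        while (it.hasNext() && estimatePoints.size() < NUM_ESTIMATE_POINTS) {
            estimatePoints.add(it.next());
        }
        // (the cutoff estimation would run here, on estimatePoints)

        // 2. feed the buffered prefix AND the remaining points onward;
        //    the list stands in for clusterer.cluster(...) calls
        List<double[]> clustered = new ArrayList<>(estimatePoints);
        while (it.hasNext()) {
            clustered.add(it.next());
        }
        return clustered;
    }

    public static void main(String[] args) {
        List<double[]> pts = new ArrayList<>();
        for (int i = 0; i < 250; i++) pts.add(new double[]{i});
        System.out.println(clusterAll(pts).size()); // all 250 points survive
    }
}
```

The key design point is that the estimate sample is not thrown away: with the buggy code, the first NUM_ESTIMATE_POINTS points were consumed by the estimator and never reached the clusterer, which is how a small input ends up with zero training vectors in BallKMeans.splitTrainTest().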
Re: Streaming KMeans clustering
On Tuesday, December 24, 2013 4:23 PM, Ted Dunning ted.dunn...@gmail.com wrote:

For reference, on a 16-core machine, I was able to run the sequential version of streaming k-means on 1,000,000 points, each with 10 dimensions, in about 20 seconds. The map-reduce versions are comparable subject to scaling except for startup time.

@Ted, were you working off the Streaming KMeans impl as in Mahout 0.8? Not sure how this would have even worked for you in sequential mode in light of the issues reported against M-1314, M-1358, M-1380 (all of which impact sequential mode), unless you had fixed them locally. What were your estimatedDistanceCutoff, number of clusters 'k', and projection search, and how much memory did you have to allocate to the single reducer?

On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter s...@apache.org wrote: [...]
Re: Streaming KMeans clustering
On 25.12.2013 14:19, Suneel Marthi wrote: [...]

If I read the source code correctly, the final reducer clusters the sketch, which should contain m * k * log n intermediate centroids, where k is the number of desired clusters, m is the number of mappers run, and n is the number of datapoints. Those centroids are expected to be dense, so we can estimate the memory required for the final reducer using this formula.

On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter s...@apache.org wrote: [...]
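Sebastian's m * k * log n sketch-size formula can be turned into quick arithmetic. The sketch below assumes a base-2 log, 4 mappers, and 100,000-dimensional dense double centroids (the dimensionality Johannes reported); the mapper count is an illustrative assumption, not a measured value:

```java
// Quick memory estimate from the m * k * log(n) sketch-size formula:
// the sketch holds about m * k * log2(n) intermediate centroids, and a
// dense centroid of dimension d costs roughly d * 8 bytes (doubles).
// The base-2 log and the mapper count are illustrative assumptions.
public class SketchMemory {
    static long sketchBytes(long n, long k, long m, long d) {
        long centroids = (long) (m * k * (Math.log(n) / Math.log(2)));
        return centroids * d * 8L;
    }

    public static void main(String[] args) {
        // 300K points, k = 100, 4 mappers, 100,000-dimensional vectors:
        // dense centroids alone come to several GB, consistent with the
        // single-reducer OOM reported at the top of this thread
        System.out.println(sketchBytes(300_000, 100, 4, 100_000) / (1L << 30) + " GB");
    }
}
```

Under these assumptions the reducer needs several gigabytes for the centroids alone, before any searcher bookkeeping, which explains why a default-sized reducer heap falls over on this dataset.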
Re: Streaming KMeans clustering
Not sure how that would work in a corporate setting wherein there's a fixed system-wide setting that cannot be overridden.

Sent from my iPhone

On Dec 25, 2013, at 9:44 AM, Sebastian Schelter s...@apache.org wrote: [...]
Re: Streaming KMeans clustering
everybody should have the right to do job.getConfiguration().set(mapred.reduce.child.java.opts, -Xmx2G); for that :) For my problems, i always felt the sketching took too long. i put up a simple comparison here: g...@github.com:baunz/cluster-comprarison.git it generates some sample vectors and clusters them with regular k-means, and streaming k-means, both sequentially. i took 10 kmeans iterations as a benchmark and used the default values for FastProjectionSearch from the kMeans Driver Class. Visual VM tells me the most time is spent in FastProjectionSearch.remove(). This is called on every added datapoint. Maybe i got something wrong but for this sparse, high dimensional vectors i never got streaming k-means faster than the regula version On Wed, Dec 25, 2013 at 3:49 PM, Suneel Marthi suneel_mar...@yahoo.comwrote: Not sure how that would work in a corporate setting wherein there's a fixed systemwide setting that cannot be overridden. Sent from my iPhone On Dec 25, 2013, at 9:44 AM, Sebastian Schelter s...@apache.org wrote: On 25.12.2013 14:19, Suneel Marthi wrote: On Tuesday, December 24, 2013 4:23 PM, Ted Dunning ted.dunn...@gmail.com wrote: For reference, on a 16 core machine, I was able to run the sequential version of streaming k-means on 1,000,000 points, each with 10 dimensions in about 20 seconds. The map-reduce versions are comparable subject to scaling except for startup time. @Ted, were u working off the Streaming KMeans impl as in Mahout 0.8. Not sure how this would have even worked for u in sequential mode in light of the issues reported against M-1314, M-1358, M-1380 (all of which impact the sequential mode); unless u had fixed them locally. What were ur estimatedDistanceCutoff, number of clusters 'k', projection search and how much memory did u have to allocate to the single Reducer? 
If I read the source code correctly, the final reducer clusters the sketch, which should contain m * k * log n intermediate centroids, where k is the number of desired clusters, m is the number of mappers run, and n is the number of data points. Those centroids are expected to be dense, so we can estimate the memory required for the final reducer using this formula.
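Sebastian's m * k * log n estimate above can be turned into a back-of-the-envelope heap calculation for the single reducer. The following is a sketch under stated assumptions: 8 bytes per dense double entry, a natural logarithm, and a 2x JVM overhead factor are my guesses, not numbers from the thread.

```python
import math

def sketch_memory_bytes(n_points, k, n_mappers, dims,
                        bytes_per_entry=8, jvm_overhead=2.0):
    """Rough reducer memory estimate for the streaming k-means sketch.

    The sketch holds about m * k * log(n) centroids (per the thread),
    each a dense vector of `dims` doubles. `jvm_overhead` is an assumed
    fudge factor for object headers and GC headroom.
    """
    n_centroids = n_mappers * k * math.log(n_points)
    return n_centroids * dims * bytes_per_entry * jvm_overhead

# Example: 300K points, k=100, 10 mappers, 1000-dimensional vectors
est = sketch_memory_bytes(300_000, 100, 10, 1000)
print(f"{est / 2**20:.0f} MiB")  # prints "192 MiB" under these assumptions
```

Under these (hypothetical) numbers the reducer heap demand is modest, which supports the suspicion in the thread that the observed OOM comes from undersized task JVMs or very high-dimensional dense centroids rather than from the sketch size itself.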
[jira] [Commented] (MAHOUT-1388) Add command line support and logging for MLP
[ https://issues.apache.org/jira/browse/MAHOUT-1388 ] Yexi Jiang commented on MAHOUT-1388:

[~smarthi] OK, I'll add it. Currently, it only supports CSV.

Add command line support and logging for MLP
Key: MAHOUT-1388
URL: https://issues.apache.org/jira/browse/MAHOUT-1388
Project: Mahout
Issue Type: Improvement
Components: Classification
Affects Versions: 1.0
Reporter: Yexi Jiang
Labels: mlp, sgd
Fix For: 1.0

The user should have the ability to run the Perceptron from the command line. There are two modes for MLP, training and labeling: the first takes the data as input and outputs the model; the second takes the model and unlabeled data as input and outputs the results. The parameters are as follows:

--mode -mo // train or label
--input -i // input data
--model -mo // in training mode, this is the location to store the model (if the specified location has an existing model, it will update the model through incremental learning); in labeling mode, this is the location to store the result
--output -o // this is only useful in labeling mode
--layersize -ls // comma-separated numbers of units per layer (including the input layer and output layer)
--momentum -m
--learningrate -l
--regularizationweight -r
--costfunction -cf // the type of cost function

For example, to train a 3-layer (including input, hidden, and output) MLP with the Minus_Square cost function, 0.1 learning rate, 0.1 momentum rate, and 0.01 regularization weight, the command would be:

mlp -mo train -i /tmp/training-data.csv -o /tmp/model.model -ls 5,3,1 -l 0.1 -m 0.1 -r 0.01 -cf minus_squared

This command would read the training data from /tmp/training-data.csv and write the trained model to /tmp/model.model.
If a user needs to use an existing model, the following command would be used: mlp -mo label -i /tmp/unlabel-data.csv -m /tmp/model.model -o /tmp/label-result Moreover, we should provide default values if the user does not specify any. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
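The flag set described in the issue can be sketched with a standard argument parser (shown here in Python's argparse rather than Mahout's Java CLI machinery). The default values are illustrative placeholders, since the issue asks for defaults without specifying them, and because the issue text lists -mo as the short form of both --mode and --model, the short form is left off --model here to avoid the collision.

```python
import argparse

def build_mlp_parser():
    """Sketch of the MLP command line described in MAHOUT-1388.

    Defaults are illustrative, not from the issue. --model gets no short
    flag because the issue assigns -mo to both --mode and --model.
    """
    p = argparse.ArgumentParser(prog="mlp")
    p.add_argument("--mode", "-mo", choices=["train", "label"], required=True)
    p.add_argument("--input", "-i", required=True)     # input data
    p.add_argument("--model", required=True)           # model location
    p.add_argument("--output", "-o")                   # labeling mode only
    p.add_argument("--layersize", "-ls")               # e.g. "5,3,1"
    p.add_argument("--momentum", "-m", type=float, default=0.1)
    p.add_argument("--learningrate", "-l", type=float, default=0.5)
    p.add_argument("--regularizationweight", "-r", type=float, default=0.0)
    p.add_argument("--costfunction", "-cf", default="minus_squared")
    return p

args = build_mlp_parser().parse_args(
    "-mo train -i /tmp/training-data.csv --model /tmp/model.model "
    "-ls 5,3,1 -l 0.1 -m 0.1 -r 0.01 -cf minus_squared".split())
print(args.mode, args.layersize)  # prints "train 5,3,1"
```

With this shape, unspecified flags simply fall back to the parser defaults, which is the behavior the comment above asks for.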
Re: Happy Holidays!
Happy Holidays everyone !!! :)

On Wed, Dec 25, 2013 at 8:09 AM, Andrew Musselman andrew.mussel...@gmail.com wrote: Merry Christmas and a Happy New Year!

On Dec 24, 2013, at 3:36 PM, Stevo Slavić ssla...@gmail.com wrote: Happy Holidays Everyone!

On Tue, Dec 24, 2013 at 12:28 PM, Frank Scholten fr...@frankscholten.nl wrote: Best wishes!

On Tue, Dec 24, 2013 at 11:11 AM, Sebastian Schelter s...@apache.org wrote: dito!

On 24.12.2013 11:09, Isabel Drost-Fromm wrote: I'd like to take some time and wish everyone a Happy Holiday! Enjoy the time with your family and friends. Thank you all for your contributions and work on Mahout. Looking forward to an exciting 2014. Isabel

-- M.P. Tharindu Rusira Kumara, Department of Computer Science and Engineering, University of Moratuwa, Sri Lanka. +94757033733 www.tharindu-rusira.blogspot.com
Re: Streaming KMeans clustering
Interesting. In Dan's tests on sparse data, he got about a 10x net speedup. You didn't run multiple sketching passes, did you? Also, which version? There was a horrendous clone in there at one time.

On Wed, Dec 25, 2013 at 2:07 PM, Johannes Schulte johannes.schu...@gmail.com wrote: [...]
Re: Streaming KMeans clustering
To be honest, I always cancelled the sketching after a while because I wasn't satisfied with the points-per-second rate. The version used is the 0.8 release. If I find the time, I'm going to look at what is called when, where, and how often, and what the problem could be.

On Thu, Dec 26, 2013 at 8:22 AM, Ted Dunning ted.dunn...@gmail.com wrote: [...]
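To make the behavior under discussion concrete, here is a toy, sequential sketch of the streaming k-means idea (in Python rather than Mahout's Java): far points spawn new centroids, near points fold into their nearest centroid, and when the sketch outgrows its budget the distance cutoff is inflated and the centroids themselves are re-clustered, which is the recursive collapse the thread attributes to StreamingKMeans.clusterInternal(). All names, the beta factor, and the thresholds are illustrative; this is not the Mahout implementation.

```python
import math
import random

def streaming_sketch(points, max_centroids, cutoff, beta=1.3, seed=0):
    """Toy one-pass streaming k-means sketch (illustrative only)."""
    rng = random.Random(seed)
    centroids = []  # list of (mean vector, weight) pairs

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def add(p, w):
        nonlocal centroids, cutoff
        if not centroids:
            centroids.append((list(p), w))
            return
        i, d = min(((j, dist(p, c)) for j, (c, _) in enumerate(centroids)),
                   key=lambda t: t[1])
        if rng.random() < min(d / cutoff, 1.0):
            centroids.append((list(p), w))  # far point: new centroid
        else:
            c, cw = centroids[i]            # near point: weighted merge
            centroids[i] = ([(ci * cw + pi * w) / (cw + w)
                             for ci, pi in zip(c, p)], cw + w)
        if len(centroids) > max_centroids:
            cutoff *= beta                  # inflate cutoff, re-cluster
            old, centroids = centroids, []  # the "clusterInternal" collapse
            for c, cw in old:
                add(c, cw)

    for p in points:
        add(p, 1.0)
    return centroids

# Three well-separated Gaussian blobs of 100 points each
random.seed(42)
pts = [(random.gauss(cx, 0.1), random.gauss(cy, 0.1))
       for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(100)]
sketch = streaming_sketch(pts, max_centroids=30, cutoff=1.0)
print(len(sketch))  # far fewer than the 300 input points, at most 30
```

Even this toy version shows why the nearest-neighbor search inside add() dominates the run time (the step Johannes observed in FastProjectionSearch), and why a badly estimated initial cutoff forces repeated collapses.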