Re: Streaming K-means
Dmitriy is correct that "Streaming KMeans" in MLlib is the wrong name for something that was meant to convey "Spark Streaming + KMeans". The Mahout Streaming KMeans is an implementation of the Meyerson paper referred to in Dmitriy's email.

I have had folks misconstrue Streaming KMeans as being Spark Streaming + KMeans, thanks to the bad naming on the part of the MLlib folks. I spoke to Jeremy Freeman, the MLlib Streaming KMeans contributor, about this, and he agrees that the intent was to convey "Spark Streaming + KMeans"; he definitely wasn't aware of the Streaming KMeans algorithm, which existed well before Spark Streaming.

On Tue, Jun 16, 2015 at 5:34 PM, Dmitriy Lyubimov wrote:

> "Streaming k-means" is something else, afaik. Streaming k-means is reserved
> for a particular k-means method (in Mahout, at least [1]).
>
> As far as I understand, what MLlib calls "streaming k-means" is a name
> given by an MLlib contributor that really means "online k-means", i.e.
> radar-style tracking of centroids over time over a stream of (x_i, t_i)
> pairs. It uses Spark Streaming, but has nothing to do with the Shindler
> et al. method.
>
> At least that was our understanding the last time we looked at the issue
> of the names here.
>
> So this issue has come up several times already, when people come and say,
> "what? streaming k-means? mllib has streaming k-means", whereas everybody
> else is talking about something else.
>
> [1] http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets.pdf
Re: Streaming K-means
"Streaming k-means" is something else, afaik. Streaming k-means is reserved for a particular k-means method (in Mahout, at least [1]).

As far as I understand, what MLlib calls "streaming k-means" is a name given by an MLlib contributor that really means "online k-means", i.e. radar-style tracking of centroids over time over a stream of (x_i, t_i) pairs. It uses Spark Streaming, but has nothing to do with the Shindler et al. method.

At least that was our understanding the last time we looked at the issue of the names here.

So this issue has come up several times already, when people come and say, "what? streaming k-means? mllib has streaming k-means", whereas everybody else is talking about something else.

[1] http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets.pdf

On Tue, Jun 16, 2015 at 2:04 PM, RJ Nowling wrote:

> There is a streaming k-means implementation in MLlib that uses reservoir
> sampling.
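To make the distinction concrete, here is a minimal sketch of what "online k-means" in the sense described above could look like: each arriving point moves its nearest centroid a little, so the centroids track the stream over time. This is a generic sequential k-means update written for illustration only; it is not MLlib's or Mahout's code, and the class name, the `decay` parameter and the method names are assumptions.

```java
/** Illustrative online (sequential) k-means; hypothetical names, not MLlib or Mahout code. */
final class OnlineKMeans {
    private final double[][] centroids;
    private final double[] counts;  // effective number of points seen per centroid
    private final double decay;     // assumption: 1.0 = never forget, < 1.0 discounts old points

    OnlineKMeans(double[][] initialCentroids, double decay) {
        this.centroids = new double[initialCentroids.length][];
        for (int i = 0; i < initialCentroids.length; i++) {
            this.centroids[i] = initialCentroids[i].clone();
        }
        this.counts = new double[initialCentroids.length];
        this.decay = decay;
    }

    /** Assigns x to its nearest centroid, moves that centroid toward x, and returns its index. */
    int update(double[] x) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0.0;
            for (int j = 0; j < x.length; j++) {
                double diff = x[j] - centroids[c][j];
                d += diff * diff;  // squared Euclidean distance
            }
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        // Discounted running mean: old mass is scaled by decay, the new point has weight 1.
        counts[best] = counts[best] * decay + 1.0;
        double rate = 1.0 / counts[best];
        for (int j = 0; j < x.length; j++) {
            centroids[best][j] += rate * (x[j] - centroids[best][j]);
        }
        return best;
    }

    double[] centroid(int i) {
        return centroids[i].clone();
    }
}
```

Note that nothing here builds a sketch of the data for a later clustering pass, which is the defining step of the Mahout method discussed in this thread.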
Re: Streaming K-means
There is a streaming k-means implementation in MLlib that uses reservoir sampling.

On Tue, Jun 2, 2015 at 2:24 AM, Marko Dinic wrote:

> Ted,
>
> Thank you for your answer.
>
> What would you then recommend I do? My idea is to implement it to
> enable clustering of time series using DTW (Dynamic Time Warping) as the
> distance measure. As you know, the main problem is that K-medoids is not
> scalable, so that's standing in my way. Of course, it could be used with
> other distances as well.
>
> I have already implemented something that I consider a scalable K-medoids,
> based on using pivots to speed up medoid selection
> (https://seer.lcc.ufmg.br/index.php/jidm/article/viewFile/99/82). This
> works for distance measures such as Euclidean and has some limitations
> (best results are in the case of a normal distribution; outliers could be
> a problem), but it works pretty well (considering the computations). The
> thing is, it can't be used with DTW, since it relies on projections, while
> the triangle inequality does not hold for DTW. That is why I'm considering
> this Streaming approach now.
>
> Do you think it is worth a shot? I'm really stretching for a scalable
> solution.
>
> Best regards,
> Marko
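For readers unfamiliar with the term, reservoir sampling keeps a uniform random sample of fixed size from a stream of unknown length. The sketch below is the textbook "Algorithm R", shown only to illustrate the technique named above; it is not taken from MLlib, and the class and method names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Textbook reservoir sampling (Algorithm R); hypothetical names, not MLlib code. */
final class Reservoir {
    /** Returns up to k items chosen uniformly at random from the stream. */
    static <T> List<T> sample(Iterable<T> stream, int k, Random rng) {
        List<T> reservoir = new ArrayList<>();
        int seen = 0;
        for (T item : stream) {
            seen++;
            if (reservoir.size() < k) {
                // Fill phase: the first k items are always kept.
                reservoir.add(item);
            } else {
                // Replace phase: the item survives with probability k / seen.
                int j = rng.nextInt(seen);  // uniform in [0, seen)
                if (j < k) {
                    reservoir.set(j, item);
                }
            }
        }
        return reservoir;
    }
}
```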
Re: Streaming K-means
Ted,

Thank you for your answer.

What would you then recommend I do? My idea is to implement it to enable clustering of time series using DTW (Dynamic Time Warping) as the distance measure. As you know, the main problem is that K-medoids is not scalable, so that's standing in my way. Of course, it could be used with other distances as well.

I have already implemented something that I consider a scalable K-medoids, based on using pivots to speed up medoid selection (https://seer.lcc.ufmg.br/index.php/jidm/article/viewFile/99/82). This works for distance measures such as Euclidean and has some limitations (best results are in the case of a normal distribution; outliers could be a problem), but it works pretty well (considering the computations). The thing is, it can't be used with DTW, since it relies on projections, while the triangle inequality does not hold for DTW. That is why I'm considering this Streaming approach now.

Do you think it is worth a shot? I'm really stretching for a scalable solution.

Best regards,
Marko

On Tue 02 Jun 2015 12:03:40 AM CEST, Ted Dunning wrote:

> The streaming k-means works by building a sketch of the data which is then
> used to do real clustering.
>
> It might be that this sketch would be acceptable to do k-medoids, but that
> is definitely not guaranteed.
>
> Similarly, it might be possible to build a medoid sketch instead of a
> mean-based sketch, but this is also unexplored ground.
>
> The virtue of the first approach (using a k-means sketch as input to
> k-medoids) would be that it would make the k-medoids scalable.
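The remark that the triangle inequality does not hold for DTW can be checked with a small sketch. The code below is the classic dynamic-programming DTW with absolute difference as the local cost and no warping window; the class name and these choices are assumptions for illustration, not the implementation discussed in this message.

```java
import java.util.Arrays;

/** Classic O(n*m) dynamic time warping; hypothetical helper for illustration. */
final class Dtw {
    /** DTW distance between two series, using |a_i - b_j| as the local cost. */
    static double distance(double[] a, double[] b) {
        int n = a.length;
        int m = b.length;
        double[][] cost = new double[n + 1][m + 1];
        for (double[] row : cost) {
            Arrays.fill(row, Double.POSITIVE_INFINITY);
        }
        cost[0][0] = 0.0;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double local = Math.abs(a[i - 1] - b[j - 1]);
                // Extend the cheapest of: diagonal match, repeat b[j-1], repeat a[i-1].
                double bestPrev = Math.min(cost[i - 1][j - 1],
                        Math.min(cost[i - 1][j], cost[i][j - 1]));
                cost[i][j] = local + bestPrev;
            }
        }
        return cost[n][m];
    }
}
```

With x = {0, 0, 0}, y = {0} and z = {1}, this gives DTW(x, y) = 0 and DTW(y, z) = 1, but DTW(x, z) = 3, so DTW(x, z) > DTW(x, y) + DTW(y, z). Any pruning that relies on the triangle inequality can therefore discard the wrong candidates under DTW.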
Re: Streaming K-means
The streaming k-means works by building a sketch of the data which is then used to do real clustering.

It might be that this sketch would be acceptable to do k-medoids, but that is definitely not guaranteed.

Similarly, it might be possible to build a medoid sketch instead of a mean-based sketch, but this is also unexplored ground.

The virtue of the first approach (using a k-means sketch as input to k-medoids) would be that it would make the k-medoids scalable.

On Mon, Jun 1, 2015 at 1:04 PM, Marko Dinic wrote:

> Hello everyone,
>
> I have an idea and I would like to get validation from the community
> about it.
>
> In Mahout there is an implementation of Streaming K-means. I'm interested
> in your opinion: would it make sense to make a similar implementation of
> Streaming K-medoids?
>
> K-medoids has even bigger problems than K-means because it's not scalable,
> but it can be useful in some cases (e.g. it allows more sophisticated
> distance measures).
>
> What is your opinion about implementing this?
>
> Best regards,
> Marko
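As a rough illustration of the first approach, a k-medoids step can be run over the sketch rather than over the raw data: each sketch point carries a weight (the number of original points it stands for), and the medoid is the sketch point with the smallest total weighted distance to the others. This is a hedged sketch under those assumptions only; the class name, the weight array and the pluggable distance function are hypothetical, and, as noted above, nothing guarantees that a mean-based sketch is adequate for k-medoids.

```java
import java.util.function.ToDoubleBiFunction;

/** Hypothetical medoid step over a weighted sketch; not part of Mahout. */
final class SketchMedoids {
    /**
     * Returns the index of the sketch point whose total weighted distance to all
     * sketch points is smallest. weights[j] is the number of original points that
     * sketch point j represents; dist can be any distance, e.g. DTW.
     */
    static int weightedMedoid(double[][] points, double[] weights,
                              ToDoubleBiFunction<double[], double[]> dist) {
        int best = -1;
        double bestCost = Double.POSITIVE_INFINITY;
        for (int i = 0; i < points.length; i++) {
            double cost = 0.0;
            for (int j = 0; j < points.length; j++) {
                cost += weights[j] * dist.applyAsDouble(points[i], points[j]);
            }
            if (cost < bestCost) {
                bestCost = cost;
                best = i;
            }
        }
        return best;
    }
}
```

The cost is quadratic in the number of sketch points, not in the number of original points, which is where the scalability would come from.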
Re: Streaming K Means exception without any reason
Here is the dataset; I've just checked to be sure it is the right one.

On 09.10.2014. 15:34, Suneel Marthi wrote:

> Heh, your data size is tiny indeed. One of the edge conditions I was
> alluding to was the failure of this implementation on tiny datasets.
>
> Do you see any output clusters? If so, how many points? Is it possible to
> share your dataset to troubleshoot?

--
Regards,
Marko Dinić
Re: Streaming K Means exception without any reason
Here is the dataset.

On Thursday, 09 October 2014, 16:53:25 CEST, Marko Dinić wrote:

> Yes, it is small, but it is just a sample, so the dataset will probably be
> much bigger. So you think that this was the problem? Will this problem be
> avoided with a larger dataset?
>
> As far as I remember, there were no output clusters. I'm sending the
> dataset, if you want to take a look.
>
> Thanks again.
>
> --
> Regards,
> Marko Dinić

--
Regards,
Marko Dinić
Re: Streaming K Means exception without any reason
Yes, it is small, but it is just a sample, so the dataset will probably be much bigger. So you think that this was the problem? Will this problem be avoided with a larger dataset?

As far as I remember, there were no output clusters. I'm sending the dataset, if you want to take a look.

Thanks again.

On Thursday, 09 October 2014, 15:34:36 CEST, Suneel Marthi wrote:

> Heh, your data size is tiny indeed. One of the edge conditions I was
> alluding to was the failure of this implementation on tiny datasets.
>
> Do you see any output clusters? If so, how many points? Is it possible to
> share your dataset to troubleshoot?

--
Regards,
Marko Dinić
Re: Streaming K Means exception without any reason
Heh, your data size is tiny indeed. One of the edge conditions I was alluding to was the failure of this implementation on tiny datasets.

Do you see any output clusters? If so, how many points? Is it possible to share your dataset to troubleshoot?

On Thu, Oct 9, 2014 at 9:18 AM, Marko Dinić wrote:

> Suneel,
>
> Thank you for your answer; this was rather strange to me.
>
> The number of points is 942. I have multiple runs; in each run I have a
> loop in which the number of clusters is increased in each iteration, and I
> multiply that number by 3, since I'm expecting log(n) initial centroids
> before the Ball K Means step. It's actually an attempt at implementing the
> elbow method. It's very strange that this crash happens only occasionally.
>
> Can I expect problems like this to be fixed in the future? I'm using it
> since it gives better results, both in speed and clustering quality, but it
> would be a problem if it crashes like this.
>
> --
> Regards,
> Marko Dinić
Re: Streaming K Means exception without any reason
Suneel, Thank you for your answer, this was rather strange to me. The number of points is 942. I have multiple runs, in each run I have a loop in which number of clusters is increased in each iteration and I multiple that number by 3, since I'm expecting log(n) initial centroids, before Ball K Means step. It's actually an attempt of elbow method implementation. It's very strange that this crashing happens occasionally. Can I expect that problems like this be fixed in future? I'm using it since it gives better results, both in speed and clustering quality, but it would be a problem if it crashes like this. On четвртак, 09. октобар 2014. 14:54:28 CEST, Suneel Marthi wrote: Seen this issue happen a few times before, there are few edge conditions that need to be fixed in the Streaming KMeans code and you are right that the generated clusters are different on successive runs given the same input. IIRC this stacktrace is due to BallKMeans failing to read any input centroids - can't recall the sequence that leads to this off the top of my head, will have to look. What's the size of ur input - the no. of points u r trying to cluster, how r u setting the value for estimatedNumMapClusters ? Streaming KMeans is still experimental and has scalability issues that need to be worked out. There are few other scenarios wherein Streaming KMeans fails that u should be aware of, see https://issues.apache.org/jira/browse/MAHOUT-1469. Lemme take a look at this. On Thu, Oct 9, 2014 at 5:39 AM, Marko Dinić wrote: Hello everyone, I'm using Mahout Streaming K Means multiple times in a loop, every time for same input data, and output path is always different. Concretely, I'm increasing number of clusters in each iteration. Currently it is run on a single machine. 
A couple of times (maybe 3 of 20 runs) I get this exception

Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Merging 1 sorted segments
Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Down to the last merge-pass, with 1 segments left of total size: 1623 bytes
Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
WARNING: job_local1196467414_0036
java.lang.NullPointerException
    at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:213)
    at org.apache.mahout.math.random.WeightedThing.<init>(WeightedThing.java:31)
    at org.apache.mahout.math.neighborhood.ProjectionSearch.searchFirst(ProjectionSearch.java:191)
    at org.apache.mahout.clustering.streaming.cluster.BallKMeans.iterativeAssignment(BallKMeans.java:395)
    at org.apache.mahout.clustering.streaming.cluster.BallKMeans.cluster(BallKMeans.java:208)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.getBestCentroids(StreamingKMeansReducer.java:107)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:73)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:37)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)

I'm running it like this:

    String[] args1 = new String[] {"-i", dataPath, "-o", plusOneCentroids,
        "-k", String.valueOf(i+1),
        "--estimatedNumMapClusters", String.valueOf((i+1)*3),
        "-ow"};
    StreamingKMeansDriver.main(args1);

I'm using the same configuration, and the same dataset, but I see no reason why I get this exception, and it's even stranger that it doesn't always occur.
Any ideas? Thanks -- Regards, Marko Dinić
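For reference, the loop described in this message can be sketched without any Mahout dependency as a helper that only builds the argument array for each value of k. This is a minimal sketch, not a fix for the exception: the class name `ElbowSweepArgs`, the method names, and the paths in `main` are hypothetical, and the k * ln(n) estimate (natural log, rounded up) is an assumption based on the commonly quoted guideline for `--estimatedNumMapClusters`, in place of the `(i+1)*3` used in the thread. The flags themselves are the ones shown in the original call.

```java
/** Builds StreamingKMeansDriver-style argument arrays for an elbow-method sweep over k. */
public class ElbowSweepArgs {

    /**
     * Assumed rule of thumb: about k * ln(n) map clusters, rounded up.
     * The log base and the rounding are assumptions, not taken from the thread.
     */
    public static int estimatedMapClusters(int k, long numPoints) {
        double logN = Math.log(Math.max(2L, numPoints)); // guard against n < 2
        return Math.max(k, (int) Math.ceil(k * logN));   // never fewer than k
    }

    /** Same flags as the call in the thread: -i, -o, -k, --estimatedNumMapClusters, -ow. */
    public static String[] buildArgs(String input, String output, int k, long numPoints) {
        return new String[] {
            "-i", input,
            "-o", output,
            "-k", String.valueOf(k),
            "--estimatedNumMapClusters", String.valueOf(estimatedMapClusters(k, numPoints)),
            "-ow"
        };
    }

    public static void main(String[] args) {
        long numPoints = 942; // dataset size mentioned in the thread
        for (int k = 1; k <= 5; k++) {
            // Hypothetical paths; each k gets its own output directory.
            String[] driverArgs = buildArgs("data/points", "out/k-" + k, k, numPoints);
            System.out.println(String.join(" ", driverArgs));
            // With Mahout on the classpath, this is where the thread's call would go:
            // StreamingKMeansDriver.main(driverArgs);
        }
    }
}
```

Keeping the argument construction in one place makes it easy to log exactly which k and map-cluster estimate were in effect on a run that failed.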
Re: Streaming K Means exception without any reason
Seen this issue happen a few times before, there are a few edge conditions that need to be fixed in the Streaming KMeans code and you are right that the generated clusters are different on successive runs given the same input. IIRC this stacktrace is due to BallKMeans failing to read any input centroids - can't recall the sequence that leads to this off the top of my head, will have to look. What's the size of ur input - the no. of points u r trying to cluster, how r u setting the value for estimatedNumMapClusters? Streaming KMeans is still experimental and has scalability issues that need to be worked out. There are a few other scenarios wherein Streaming KMeans fails that u should be aware of, see https://issues.apache.org/jira/browse/MAHOUT-1469. Lemme take a look at this.

On Thu, Oct 9, 2014 at 5:39 AM, Marko Dinić wrote:
> Hello everyone,
>
> I'm using Mahout Streaming K Means multiple times in a loop, every time
> for same input data, and output path is always different. Concretely, I'm
> increasing number of clusters in each iteration. Currently it is run on a
> single machine.
>
> A couple of times (maybe 3 of 20 runs) I get this exception
>
> Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.Merger$MergeQueue merge
> INFO: Merging 1 sorted segments
> Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.Merger$MergeQueue merge
> INFO: Down to the last merge-pass, with 1 segments left of total size: 1623 bytes
> Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
> INFO:
> Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
> WARNING: job_local1196467414_0036
> java.lang.NullPointerException
>     at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:213)
>     at org.apache.mahout.math.random.WeightedThing.<init>(WeightedThing.java:31)
>     at org.apache.mahout.math.neighborhood.ProjectionSearch.searchFirst(ProjectionSearch.java:191)
>     at org.apache.mahout.clustering.streaming.cluster.BallKMeans.iterativeAssignment(BallKMeans.java:395)
>     at org.apache.mahout.clustering.streaming.cluster.BallKMeans.cluster(BallKMeans.java:208)
>     at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.getBestCentroids(StreamingKMeansReducer.java:107)
>     at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:73)
>     at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:37)
>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
>
> I'm running it like this:
>
> String[] args1 = new String[] {"-i",dataPath,"-o",
> plusOneCentroids,"-k",String.valueOf(i+1),
> "--estimatedNumMapClusters",String.valueOf((i+1)*3),
> "-ow"};
> StreamingKMeansDriver.main(args1);
>
> I'm using the same configuration, and the same dataset, but I see no
> reason why I get this exception, and it's even stranger that it doesn't
> always occur.
>
> Any ideas?
>
> Thanks
>
Re: Streaming K Means
Suneel,

Thank you again for your answer. I'm trying to implement a kind of cluster-based anomaly detection. For that, I need to cluster normal examples, and then, when a new example enters the system, I need to assign it to the nearest centroid (by calculating the distance between the existing centroids and the new example), and then I need the distances from the points in that cluster to the centroid.

I could use K Means for that, but I'm hoping to get better results using Streaming K Means, primarily because of its KMeans++ initialization (which I could probably implement myself, but I'm trying to avoid that, since it is already implemented). I also understand that it can be faster than the usual K Means, since it does one-pass clustering before the Ball K Means step. Please correct me if you disagree with the things I said.

Maybe I'm doing something wrong, but I'm getting only one file as output - part-r-0, while I'm expecting something like ClusteredPoints and Clusters-*-final, as in the case of KMeans. How can I get and read in the centroids and clustered points? Also, I see this qualcluster in the examples/bin/cluster-reuters.sh that you have provided, what is it used for?

Thanks, Marko

On Monday, 29 September 2014, 20:00:33 CEST, Suneel Marthi wrote:

This was replied to earlier with the details u r looking for, repeating here again: See http://stackoverflow.com/questions/17272296/how-to-use-mahout-streaming-k-means/18090471#18090471 for how to invoke Streaming KMeans. Also look at examples/bin/cluster-reuters.sh for the Streaming KMeans option. If all that u r looking for is centroids and distances from centroids, wouldn't KMeans suffice? It would help if u could provide more details as to what u r trying to accomplish here?

On Mon, Sep 29, 2014 at 9:55 AM, Marko wrote:

Hello everyone, I have previously asked a question about Streaming K Means examples, and got an answer that there are not so many available.
Can anyone give me an example of how to call Streaming K Means clustering for a dataset, and how to get the results? What are the results, are they the same as in basic K Means? Do I get centroids and clustered points? And do I get the distance between a point and its centroid, like in K Means? I would like to run Streaming K Means clustering on a dataset, and read in centroids, and also I need the distance from the points to their given centroids. How to do that? Thanks -- Regards, Marko Dinić
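The assignment step described in this message (compute the distance from a new example to each existing centroid, pick the nearest, then compare against the distances seen in that cluster) can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class `NearestCentroid` and its methods are hypothetical, centroids are assumed to be already loaded as `double[]` vectors, Euclidean distance is assumed, and flagging a point when it lies farther from its centroid than any training point of that cluster is one possible threshold rule, not something prescribed in the thread.

```java
/** Nearest-centroid assignment with a simple per-cluster distance threshold. */
public class NearestCentroid {

    /** Euclidean distance between two vectors of equal length. */
    public static double euclidean(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Index of the centroid closest to the point; ties go to the lower index. */
    public static int nearest(double[][] centroids, double[] point) {
        if (centroids.length == 0) {
            throw new IllegalArgumentException("No centroids given");
        }
        int best = 0;
        double bestDist = euclidean(centroids[0], point);
        for (int c = 1; c < centroids.length; c++) {
            double dist = euclidean(centroids[c], point);
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /**
     * Assumed rule: a point is anomalous if it is farther from its nearest centroid
     * than the largest distance observed among that cluster's training points.
     * maxTrainDistance[c] holds that largest distance for cluster c.
     */
    public static boolean isAnomaly(double[][] centroids, double[] maxTrainDistance, double[] point) {
        int c = nearest(centroids, point);
        return euclidean(centroids[c], point) > maxTrainDistance[c];
    }

    public static void main(String[] args) {
        double[][] centroids = {{0.0, 0.0}, {10.0, 10.0}};
        double[] maxTrainDistance = {2.0, 2.0};
        double[] candidate = {5.0, 5.0};
        System.out.println("cluster=" + nearest(centroids, candidate)
            + " anomaly=" + isAnomaly(centroids, maxTrainDistance, candidate));
    }
}
```

Because the rule only needs the centroids and one number per cluster, it works the same way whether the centroids come from K Means or from Streaming K Means.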
Re: Streaming K Means
This was replied to earlier with the details u r looking for, repeating here again: See http://stackoverflow.com/questions/17272296/how-to-use-mahout-streaming-k-means/18090471#18090471 for how to invoke Streaming KMeans. Also look at examples/bin/cluster-reuters.sh for the Streaming KMeans option. If all that u r looking for is centroids and distances from centroids, wouldn't KMeans suffice? It would help if u could provide more details as to what u r trying to accomplish here?

On Mon, Sep 29, 2014 at 9:55 AM, Marko wrote:
> Hello everyone,
>
> I have previously asked a question about Streaming K Means examples, and
> got an answer that there are not so many available.
>
> Can anyone give me example of how to call Streaming K Means clustering for
> a dataset, and how to get the results?
>
> What are the results, are they the same as in basic K Means? Do I get
> centroids and clustered points? And do I get the distance between point and
> its centroid, like in K Means?
>
> I would like to run Streaming K Means clustering on a dataset, and read in
> centroids, and also I need the distance from the points to their given
> centroids. How to do that?
>
> Thanks
>