Re: Streaming K-means

2015-06-17 Thread Suneel Marthi
Dmitriy is correct: "Streaming KMeans" in MLlib is a misleading name for
something that was meant to convey "Spark Streaming + KMeans".

The Mahout Streaming KMeans is an implementation of the Meyerson paper
referred to in Dmitriy's email.

I have had folks misconstrue Streaming KMeans as being Spark Streaming +
KMeans, thanks to the bad naming on the part of the MLlib folks.

I spoke to Jeremy Freeman, the MLlib Streaming KMeans contributor, about
this, and he agrees that the intent was to convey "Spark Streaming +
KMeans"; he definitely wasn't aware of the Streaming KMeans algorithm,
which existed well before Spark Streaming.
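
To make the contrast concrete, the MLlib-style "streaming" update is
essentially an online/forgetful mean per centroid - roughly the following
plain-Java sketch (illustrative only, not MLlib's actual code; the
assignment and decay details are simplified):

  // Illustrative online ("forgetful") k-means update of the kind MLlib's
  // StreamingKMeans applies per micro-batch. Plain Java, no Spark or Mahout
  // dependencies; squared Euclidean distance, simplified decay handling.
  static void updateOnline(double[][] centers, double[] counts,
                           double[][] batch, double decay) {
    for (double[] x : batch) {
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int j = 0; j < centers.length; j++) {
        double d = 0;
        for (int f = 0; f < x.length; f++) {
          double diff = centers[j][f] - x[f];
          d += diff * diff;
        }
        if (d < bestDist) { bestDist = d; best = j; }
      }
      // decayed running mean: old mass fades, the new point is folded in
      counts[best] = counts[best] * decay + 1.0;
      double eta = 1.0 / counts[best];
      for (int f = 0; f < x.length; f++) {
        centers[best][f] += eta * (x[f] - centers[best][f]);
      }
    }
  }

That is a completely different animal from the sketch-based algorithm in the
Meyerson paper.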





Re: Streaming K-means

2015-06-16 Thread Dmitriy Lyubimov
 "streaming k-means" is something else afaik. Streaming k-means is reserved
for a particular k-means method (in Mahout, at least, [1]).

Whereas as far as i understand what mllib calls "streaming k-means" is name
given by mllib contributor which really means "online k-means", i.e. radar
tracking of centroids over time over stream of (x_i, t_i) pairs and that
uses Spark streaming, but has nothing to do with Shindler et. al. method.

At least that was our understanding last time we looked at the issue of the
names here.

So this issue has come several times already when people come and say,
"what? streaming k-means? mllib has streaming k-means" whereas everybody
else is talking about something else.

[1]
http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets.pdf
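
Very roughly, the sketch pass in [1] goes like the following plain-Java
illustration (much simplified, with ad-hoc threshold handling; this is not
Mahout's actual implementation, just the shape of the idea):

  // Much-simplified illustration of the one-pass "sketch" idea behind [1]:
  // keep a set of weighted centroids; a new point either merges into a nearby
  // centroid or, with probability growing with its distance, opens a new one,
  // and the distance threshold is loosened whenever the centroid budget is
  // exceeded. The real algorithm is considerably more careful.
  import java.util.ArrayList;
  import java.util.List;
  import java.util.Random;

  class SketchPass {
    static List<double[]> sketch(Iterable<double[]> points, int maxCentroids) {
      List<double[]> centroids = new ArrayList<>();  // centroid coordinates
      List<Double> weights = new ArrayList<>();      // points absorbed by each
      double threshold = 1e-6;
      Random rnd = new Random(42);
      for (double[] x : points) {
        int nearest = -1;
        double nearestDist = Double.MAX_VALUE;
        for (int i = 0; i < centroids.size(); i++) {
          double d = distance(centroids.get(i), x);
          if (d < nearestDist) { nearestDist = d; nearest = i; }
        }
        if (nearest < 0 || rnd.nextDouble() < nearestDist / threshold) {
          centroids.add(x.clone());                  // far point: new centroid
          weights.add(1.0);
        } else {
          double w = weights.get(nearest);           // near point: fold it in
          double[] c = centroids.get(nearest);
          for (int f = 0; f < c.length; f++) c[f] = (c[f] * w + x[f]) / (w + 1);
          weights.set(nearest, w + 1);
        }
        if (centroids.size() > maxCentroids) threshold *= 1.5;  // loosen budget
      }
      return centroids;
    }

    static double distance(double[] a, double[] b) {
      double s = 0;
      for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
      return Math.sqrt(s);
    }
  }

The "real" clustering (ball k-means in Mahout) then runs on those weighted
centroids rather than on the raw data.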



Re: Streaming K-means

2015-06-16 Thread RJ Nowling
There is a streaming k-means implementation in MLlib that uses reservoir
sampling.



Re: Streaming K-means

2015-06-02 Thread Marko Dinic

Ted,

Thank you for your answer.

What would you then recommend I do? My idea is to implement it to
enable clustering of time series using DTW (Dynamic Time Warping) as the
distance measure. As you know, the main problem is that K-medoids is
not scalable, and that's standing in my way. Of course, it could be used
with other distances as well.
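
For reference, a minimal DTW sketch in plain Java (purely illustrative - no
warping window, no lower bounds, none of the optimizations a real
implementation would need):

  // Dynamic time warping distance between two series, computed with the
  // standard O(n*m) dynamic-programming recurrence.
  static double dtw(double[] a, double[] b) {
    int n = a.length, m = b.length;
    double[][] cost = new double[n + 1][m + 1];
    for (double[] row : cost) java.util.Arrays.fill(row, Double.POSITIVE_INFINITY);
    cost[0][0] = 0;
    for (int i = 1; i <= n; i++) {
      for (int j = 1; j <= m; j++) {
        double d = Math.abs(a[i - 1] - b[j - 1]);
        cost[i][j] = d + Math.min(cost[i - 1][j - 1],
            Math.min(cost[i - 1][j], cost[i][j - 1]));
      }
    }
    return cost[n][m];
  }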


I have already implemented something that I consider a scalable
K-medoids, based on using pivots to speed up medoid selection
(https://seer.lcc.ufmg.br/index.php/jidm/article/viewFile/99/82). This
works for distance measures such as Euclidean and has some limitations
(best results are obtained for normally distributed data, and outliers
can be a problem), but it works pretty well considering the computation
involved. The thing is, it can't be used with DTW, since it relies on
projections, and the triangle inequality does not hold for DTW. That is
why I'm considering this Streaming approach now.


Do you think it is worth giving a shot? I'm really reaching for a
scalable solution.


Best regards,
Marko




Re: Streaming K-means

2015-06-01 Thread Ted Dunning
The streaming k-means works by building a sketch of the data, which is then
used to do the real clustering.

It might be that this sketch would be acceptable to do k-medoids, but that
is definitely not guaranteed.

Similarly, it might be possible to build a medoid sketch instead of a
mean-based sketch, but this is also unexplored ground.

The virtue of the first approach (using an m-means sketch as input to
k-medoids) would be that it would make k-medoids scalable.
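
A minimal sketch of that first approach, assuming the sketch is already
available as weighted points (a naive PAM-style swap search, purely
illustrative and nowhere near production quality):

  // Naive k-medoids over the weighted centroids of a streaming k-means
  // sketch, with a pluggable distance. Tries every (medoid, candidate) swap
  // until no swap lowers the weighted cost; fine for a few hundred sketch
  // points, hopeless for raw data - which is exactly the point of sketching.
  import java.util.function.ToDoubleBiFunction;

  class SketchKMedoids {
    static int[] cluster(double[][] sketch, double[] weights, int k,
                         ToDoubleBiFunction<double[], double[]> distance) {
      int[] medoids = new int[k];
      for (int i = 0; i < k; i++) medoids[i] = i;     // crude init: first k points
      double best = cost(sketch, weights, medoids, distance);
      boolean improved = true;
      while (improved) {
        improved = false;
        for (int m = 0; m < k; m++) {
          int keep = medoids[m];
          for (int cand = 0; cand < sketch.length; cand++) {
            medoids[m] = cand;
            double c = cost(sketch, weights, medoids, distance);
            if (c < best) { best = c; keep = cand; improved = true; }
          }
          medoids[m] = keep;                          // retain the best swap found
        }
      }
      return medoids;
    }

    // weighted sum of distances from each sketch point to its nearest medoid
    static double cost(double[][] sketch, double[] weights, int[] medoids,
                       ToDoubleBiFunction<double[], double[]> distance) {
      double total = 0;
      for (int i = 0; i < sketch.length; i++) {
        double nearest = Double.MAX_VALUE;
        for (int m : medoids) {
          nearest = Math.min(nearest, distance.applyAsDouble(sketch[i], sketch[m]));
        }
        total += weights[i] * nearest;
      }
      return total;
    }
  }

Whether the means in the sketch are good medoid candidates is, again, the
open question.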



On Mon, Jun 1, 2015 at 1:04 PM, Marko Dinic wrote:

> Hello everyone,
>
> I have an idea and I would like to get some validation from the community
> about it.
>
> In Mahout there is an implementation of Streaming K-means. I'm interested
> in your opinion: would it make sense to make a similar implementation of
> Streaming K-medoids?
>
> K-medoids has even bigger problems than K-means because it's not scalable,
> but it can be useful in some cases (e.g., it allows more sophisticated
> distance measures).
>
> What is your opinion about implementing this?
>
> Best regards,
> Marko
>


Re: Streaming K Means exception without any reason

2014-10-09 Thread Marko Dinić

Here is the dataset, I've just checked to be sure it is the right one.




Re: Streaming K Means exception without any reason

2014-10-09 Thread Marko Dinić

Here is the dataset.



Re: Streaming K Means exception without any reason

2014-10-09 Thread Marko Dinić
Yes, it is small, but it is just a sample, so the dataset will probably
be much bigger. So you think that this was the problem? Will this
problem be avoided with a larger dataset?

I think there were no output clusters, as far as I remember. I'm sending
the dataset, if you want to take a look.

Thanks again.



Re: Streaming K Means exception without any reason

2014-10-09 Thread Suneel Marthi
Heh, your data size is tiny indeed. One of the edge conditions I was
alluding to was the failure of this implementation on tiny datasets.

Do you see any output clusters? If so, how many points?
Is it possible to share your dataset to troubleshoot?




Re: Streaming K Means exception without any reason

2014-10-09 Thread Marko Dinić

Suneel,

Thank you for your answer; this was rather strange to me.

The number of points is 942. I have multiple runs; in each run I have a
loop in which the number of clusters is increased in each iteration, and
I multiply that number by 3, since I'm expecting log(n) initial
centroids before the Ball K Means step. It's actually an attempt at an
elbow method implementation. It's very strange that this crash happens
only occasionally.

Can I expect problems like this to be fixed in the future? I'm using it
since it gives better results, both in speed and clustering quality,
but it would be a problem if it crashes like this.
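
For reference, the loop is essentially the following (simplified; dataPath,
outputBase and maxK are placeholders, and the flags are the same ones as in
the original post quoted further down the thread):

  // Simplified version of the elbow-method loop described above. Each
  // iteration clusters the same input with one more cluster and a
  // proportionally larger map-side sketch (3 * k estimated map clusters).
  // The package of StreamingKMeansDriver is assumed from the stack trace.
  import org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver;

  public class ElbowLoop {
    public static void run(String dataPath, String outputBase, int maxK) throws Exception {
      for (int i = 0; i < maxK; i++) {
        int k = i + 1;
        String[] args = {
            "-i", dataPath,
            "-o", outputBase + "/k-" + k,
            "-k", String.valueOf(k),
            "--estimatedNumMapClusters", String.valueOf(k * 3),
            "-ow"};
        StreamingKMeansDriver.main(args);
      }
    }
  }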




Re: Streaming K Means exception without any reason

2014-10-09 Thread Suneel Marthi
I've seen this issue happen a few times before; there are a few edge
conditions that need to be fixed in the Streaming KMeans code, and you are
right that the generated clusters are different on successive runs given
the same input.

IIRC this stack trace is due to BallKMeans failing to read any input
centroids - I can't recall the sequence that leads to this off the top of
my head, will have to look.

What's the size of your input - the number of points you are trying to
cluster - and how are you setting the value for estimatedNumMapClusters?
Streaming KMeans is still experimental and has scalability issues that need
to be worked out.

There are a few other scenarios in which Streaming KMeans fails that you
should be aware of; see https://issues.apache.org/jira/browse/MAHOUT-1469.

Let me take a look at this.



On Thu, Oct 9, 2014 at 5:39 AM, Marko Dinić wrote:

> Hello everyone,
>
> I'm using Mahout Streaming K Means multiple times in a loop, every time
> for the same input data, and the output path is always different.
> Concretely, I'm increasing the number of clusters in each iteration.
> Currently it is run on a single machine.
>
> A couple of times (maybe 3 out of 20 runs) I get this exception:
>
> Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.Merger$MergeQueue merge
> INFO: Merging 1 sorted segments
> Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.Merger$MergeQueue merge
> INFO: Down to the last merge-pass, with 1 segments left of total size: 1623 bytes
> Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
> INFO:
> Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
> WARNING: job_local1196467414_0036
> java.lang.NullPointerException
>   at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:213)
>   at org.apache.mahout.math.random.WeightedThing.<init>(WeightedThing.java:31)
>   at org.apache.mahout.math.neighborhood.ProjectionSearch.searchFirst(ProjectionSearch.java:191)
>   at org.apache.mahout.clustering.streaming.cluster.BallKMeans.iterativeAssignment(BallKMeans.java:395)
>   at org.apache.mahout.clustering.streaming.cluster.BallKMeans.cluster(BallKMeans.java:208)
>   at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.getBestCentroids(StreamingKMeansReducer.java:107)
>   at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:73)
>   at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:37)
>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
>   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
>
> I'm running it like this:
>
> String[] args1 = new String[] {
>     "-i", dataPath,
>     "-o", plusOneCentroids,
>     "-k", String.valueOf(i + 1),
>     "--estimatedNumMapClusters", String.valueOf((i + 1) * 3),
>     "-ow"};
> StreamingKMeansDriver.main(args1);
>
> I'm using the same configuration and the same dataset, but I see no
> reason why I get this exception, and it's even stranger that it doesn't
> always occur.
>
> Any ideas?
>
> Thanks


Re: Streaming K Means

2014-10-02 Thread Marko Dinić

Suneel,

I thank you again for your answer.

I'm trying to implement some kind of cluster-based anomaly detection.
For that, I need to cluster normal examples, and then, when a new
example comes into the system, I need to assign it to the nearest centroid
(by calculating the distance between the existing centroids and the new
example), and then I need the distances from the points in that cluster
to the centroid.
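
In plain Java terms, the assignment step I mean is just something like this
(illustrative only; the centroids would come from the Streaming K Means
output, and I'm using Euclidean distance here):

  // Assign a new example to its nearest centroid and report the distance;
  // the example is then scored against the distances of the points already
  // assigned to that centroid.
  static int nearestCentroid(double[][] centroids, double[] example,
                             double[] outDistance) {
    int best = -1;
    double bestDist = Double.MAX_VALUE;
    for (int i = 0; i < centroids.length; i++) {
      double sum = 0;
      for (int f = 0; f < example.length; f++) {
        double d = centroids[i][f] - example[f];
        sum += d * d;
      }
      double dist = Math.sqrt(sum);
      if (dist < bestDist) { bestDist = dist; best = i; }
    }
    outDistance[0] = bestDist;   // distance of the example to its centroid
    return best;                 // index of the nearest centroid
  }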


I could use K Means for that, but I'm hoping to get better results
using Streaming K Means, primarily because of its KMeans++
initialization (which I could probably implement myself, but I'm trying
to avoid that, since it is already implemented), and also I understand
that it can be faster than the usual K Means, since it does a one-pass
clustering before the Ball K Means step. Please correct me if you
disagree with anything I said.


Maybe I'm doing something wrong, but I'm getting only one file as
output - part-r-0 - while I'm expecting something like
ClusteredPoints and Clusters-*-final, as in the case of KMeans. How can
I get and read in the centroids and clustered points?


Also, I see this qualcluster step in the examples/bin/cluster-reuters.sh
that you pointed to; what is it used for?


Thanks,
Marko



Re: Streaming K Means

2014-09-29 Thread Suneel Marthi
This was replied to earlier with the details you are looking for; repeating
here again:

See
http://stackoverflow.com/questions/17272296/how-to-use-mahout-streaming-k-means/18090471#18090471
for how to invoke Streaming KMeans.

Also look at examples/bin/cluster-reuters.sh for the Streaming KMeans
option.

If all that you are looking for is centroids and distances from centroids,
wouldn't KMeans suffice? It would help if you could provide more details as
to what you are trying to accomplish here.
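
For reading the resulting centroids back, something along these lines should
work (a sketch only - the IntWritable/CentroidWritable key-value classes are
what I'd expect from the reducer output, so verify them against your Mahout
version):

  // Sketch of reading the streaming k-means reducer output. Assumes the
  // output is a SequenceFile of IntWritable -> CentroidWritable (verify the
  // value class against your Mahout version); each centroid carries a
  // weight, i.e. the number of points it absorbed.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable;
  import org.apache.mahout.math.Centroid;

  public class ReadCentroids {
    public static void main(String[] argv) throws Exception {
      Configuration conf = new Configuration();
      Path path = new Path(argv[0]);   // the part-r-* file under the output dir
      SequenceFile.Reader reader =
          new SequenceFile.Reader(FileSystem.get(conf), path, conf);
      IntWritable key = new IntWritable();
      CentroidWritable value = new CentroidWritable();
      while (reader.next(key, value)) {
        Centroid c = value.getCentroid();
        System.out.println(key.get() + "\tweight=" + c.getWeight()
            + "\tvector=" + c.getVector());
      }
      reader.close();
    }
  }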


On Mon, Sep 29, 2014 at 9:55 AM, Marko  wrote:

> Hello everyone,
>
> I have previously asked a question about Streaming K Means examples, and
> got an answer that there are not so many available.
>
> Can anyone give me an example of how to call Streaming K Means clustering
> for a dataset, and how to get the results?
>
> What are the results - are they the same as in basic K Means? Do I get
> centroids and clustered points? And do I get the distance between a point
> and its centroid, like in K Means?
>
> I would like to run Streaming K Means clustering on a dataset and read in
> the centroids; I also need the distance from the points to their assigned
> centroids. How can I do that?
>
> Thanks
>