Here is the dataset.

On Thursday, October 09, 2014 16:53:25 CEST, Marko Dinić wrote:
Yes, it is small, but it is just a sample; the full dataset will probably
be much bigger. So you think that this was the problem? Will this
problem be avoided in the case of a larger dataset?

As I remember, there were no output clusters. I'm sending
the dataset in case you want to take a look.

Thanks again.

On Thursday, October 09, 2014 15:34:36 CEST, Suneel Marthi wrote:
Heh... your data size is tiny indeed. One of the edge conditions I was
alluding to was the failure of this implementation on tiny datasets.

Do you see any output clusters? If so, how many points?
Is it possible to share your dataset to troubleshoot?


On Thu, Oct 9, 2014 at 9:18 AM, Marko Dinić <marko.di...@nissatech.com>
wrote:

Suneel,

Thank you for your answer, this was rather strange to me.

The number of points is 942. I have multiple runs; in each run I have a
loop in which the number of clusters is increased in each iteration, and I
multiply that number by 3, since I'm expecting log(n) initial centroids
before the Ball K-Means step. It's actually an attempt at an elbow method
implementation. It's very strange that this crash happens occasionally.
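The arithmetic behind that multiplier can be sanity-checked without Mahout. A commonly cited guideline for the sketch phase of Streaming KMeans is on the order of k·log(n) intermediate centroids; with n = 942, log(942) ≈ 6.85, so a ×3 multiplier is noticeably smaller. A minimal sketch in plain Java (the class and method names are hypothetical, for illustration only):

```java
// Sanity check of the centroid-count arithmetic described above.
// "ElbowSketch" and its method names are hypothetical illustrations.
public class ElbowSketch {

    // The multiplier actually used in the loop: k * 3.
    static int estimatedMapClusters(int k) {
        return k * 3;
    }

    // The commonly cited heuristic: roughly k * log(n) sketch centroids.
    static int logHeuristic(int k, int n) {
        return (int) Math.ceil(k * Math.log(n));
    }

    public static void main(String[] args) {
        int n = 942; // number of points in the dataset
        for (int k = 1; k <= 5; k++) {
            System.out.println("k=" + k
                    + " used=" + estimatedMapClusters(k)
                    + " k*log(n)=" + logHeuristic(k, n));
        }
    }
}
```

For n = 942 the k·log(n) value is more than twice the k·3 value, which could matter if the sketch phase ends up with too few centroids to hand to BallKMeans.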

Can I expect problems like this to be fixed in the future? I'm using it
since it gives better results, both in speed and clustering quality, but it
would be a problem if it crashes like this.


On Thursday, October 09, 2014 14:54:28 CEST, Suneel Marthi wrote:

I've seen this issue happen a few times before; there are a few edge
conditions that need to be fixed in the Streaming KMeans code, and you are
right that the generated clusters are different on successive runs given
the same input.

IIRC this stacktrace is due to BallKMeans failing to read any input
centroids - I can't recall the sequence that leads to this off the top of
my head, will have to look.

What's the size of your input - the number of points you are trying to
cluster - and how are you setting the value for --estimatedNumMapClusters?
Streaming KMeans is still experimental and has scalability issues that
need to be worked out.

There are a few other scenarios wherein Streaming KMeans fails that you
should be aware of; see https://issues.apache.org/jira/browse/MAHOUT-1469.

Let me take a look at this.



On Thu, Oct 9, 2014 at 5:39 AM, Marko Dinić
<marko.di...@nissatech.com>
wrote:

Hello everyone,

I'm using Mahout Streaming K Means multiple times in a loop, every time
on the same input data, and the output path is always different.
Concretely, I'm increasing the number of clusters in each iteration.
Currently it is run on a single machine.

A couple of times (maybe 3 out of 20 runs) I get this exception:

Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Merging 1 sorted segments
Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Down to the last merge-pass, with 1 segments left of total size: 1623 bytes
Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
WARNING: job_local1196467414_0036
java.lang.NullPointerException
      at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:213)
      at org.apache.mahout.math.random.WeightedThing.<init>(WeightedThing.java:31)
      at org.apache.mahout.math.neighborhood.ProjectionSearch.searchFirst(ProjectionSearch.java:191)
      at org.apache.mahout.clustering.streaming.cluster.BallKMeans.iterativeAssignment(BallKMeans.java:395)
      at org.apache.mahout.clustering.streaming.cluster.BallKMeans.cluster(BallKMeans.java:208)
      at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.getBestCentroids(StreamingKMeansReducer.java:107)
      at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:73)
      at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:37)
      at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
      at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
      at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
      at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)

I'm running it like this:

String[] args1 = new String[] {"-i", dataPath,
                               "-o", plusOneCentroids,
                               "-k", String.valueOf(i + 1),
                               "--estimatedNumMapClusters", String.valueOf((i + 1) * 3),
                               "-ow"};
StreamingKMeansDriver.main(args1);
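Since the crash is intermittent (3 out of 20 runs) and surfaces as a RuntimeException (the NullPointerException above), one workaround until the underlying bug is fixed is to retry the driver call. A minimal sketch in plain Java; `RetryRunner` is a hypothetical helper, and the task passed in would wrap the StreamingKMeansDriver.main(args1) call from the snippet above:

```java
// Hypothetical retry helper for an intermittently failing task.
public class RetryRunner {

    // Runs the task up to maxAttempts times, returning on first success.
    // Rethrows the last failure if every attempt throws.
    static void runWithRetries(Runnable task, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                task.run();
                return; // success, stop retrying
            } catch (RuntimeException e) {
                last = e; // e.g. the NPE thrown out of BallKMeans
            }
        }
        throw last; // every attempt failed
    }
}
```

Note that MAHOUT-1469 lists failure modes as well; retries only help when the failure genuinely varies between runs (e.g. due to random projections or seeding), not when it is deterministic for a given input.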

I'm using the same configuration and the same dataset, but I see no
reason why I get this exception, and it's even stranger that it doesn't
always occur.

Any ideas?

Thanks



--
Regards,
Marko Dinić



