Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Dirichlet Process Clustering 
(https://cwiki.apache.org/confluence/display/MAHOUT/Dirichlet+Process+Clustering)


Edited by Jeff Eastman:
---------------------------------------------------------------------
h1. Overview

The Dirichlet Process Clustering algorithm performs Bayesian mixture modeling.

The idea is that we use a probabilistic mixture of a number of models that we 
use to explain some observed data. Each observed data point is assumed to have 
come from one of the models in the mixture, but we don't know which.  The way 
we deal with that is to use a so-called latent parameter which specifies which 
model each data point came from.

In addition, since this is a Bayesian clustering algorithm, we don't want to 
actually commit to any single explanation, but rather to sample from the 
distribution of models and latent assignments of data points to models given 
the observed data and the prior distributions of model parameters. This 
sampling process is initialized by taking models at random from the prior 
distribution for models.

Then, we iteratively assign points to the different models using the mixture 
probabilities and the degree of fit between the point and each model expressed 
as a probability that the point was generated by that model. After points are 
assigned, new parameters for each model are sampled from the posterior 
distribution for the model parameters considering all of the observed data 
points that were assigned to the model.  Models without any data points are 
also sampled, but since they have no points assigned, the new samples are 
effectively taken from the prior distribution for model parameters.

The result is a number of samples that represent mixing probabilities, models 
and assignment of points to models. If the total number of possible models is 
substantially larger than the number that ever have points assigned to them, 
then this algorithm provides a (nearly) non-parametric clustering algorithm. 
These samples can give us interesting information that is lacking from a normal 
clustering that consists of a single assignment of points to clusters.  
Firstly, by examining the number of models in each sample that actually has any 
points assigned to it, we can get information about how many models (clusters) 
that the data support. Morevoer, by examining how often two points are assigned 
to the same model, we can get an approximate measure of how likely these points 
are to be explained by the same model.  Such soft membership information is 
difficult to come by with conventional clustering methods.

Finally, we can get an idea of the stability of how the data can be described.  
Typically, aspects of the data with lots of data available wind up with stable 
descriptions while at the edges, there are aspects that are phenomena that we 
can't really commit to a solid description, but it is still clear that the well 
supported explanations are insufficient to explain these additional aspects. 
One thing that can be difficult about these samples is that we can't always 
assign a correlation between the models in the different samples.  Probably the 
best way to do this is to look for overlap in the assignments of data 
observations to the different models.

h2. Design of Implementation

The implementation accepts one input directory containing the data points to be 
clustered. The data directory contains multiple input files of 
SequenceFile(key, VectorWritable). The input data points are not modified by 
the implementation, allowing experimentation with initial clustering and 
convergence values.

The program iterates over the input points, outputting a new directory 
"clusters-N" containing SequenceFile(Text, DirichletCluster) files for each 
iteration N. This process uses a mapper/reducer/driver as follows:

DirichletMapper - reads the input clusters during its configure() method, then 
assigns and outputs each input point to a probable cluster as defined by the 
model's pdf() function. Output _key_ is: clusterId. Output _value_ is: input 
point.
DirichletReducer - reads the input clusters during its configure() method, then 
each reducer receives clusterId:VectorWritable pairs from all mappers and 
accumulates them to produce a new posterior model for each cluster which is 
output. Output _key_ is: clusterId. Output value is: DirichletCluster. Reducer 
outputs are used as the input clusters for the next iteration.
DirichletDriver - iterates over the points and clusters until the given number 
of iterations has been reached. During iterations, a new clusters directory 
"clusters-N" is produced with the output clusters from the previous iteration 
used for input to the next. A final optional pass over the data using the 
DirichletClusterMapper clusters all points to an output directory 
"clusteredPoints" and has no combiner or reducer steps.

h2. Running Dirichlet Process Clustering

The Dirichlet clustering algorithm may be run using a command-line invocation 
on DirichletDriver.main or by making a Java call to DirichletDriver.runJob(). 

Invocation using the command line takes the form:

{noformat}
bin/mahout dirichlet \
    -i <input vectors directory> \
    -o <output working directory> \
    -a0 <the alpha_0 parameter to the Dirichlet Distribution>
    -x <maximum number of iterations> \
    -k <number of models to create from prior> \
    -md <the ModelDistribution class name. Default NormalModelDistribution> \
    -mp <the ModelPrototype class name. Default SequentialAccessSparseVector> \
    -dm <optional DistanceMeasure class name for some ModelDistribution>
    -ow <overwrite output directory if present>
    -cl <run input vector clustering after computing Clusters>
    -e <emit vectors to most likely cluster during clustering>
    -t <threshold to use for clustering if -e is false>
    -xm <execution method: sequential or mapreduce>
{noformat}

Invocation using Java involves supplying the following arguments:

# input: a file path string to a directory containing the input data set a 
SequenceFile(WritableComparable, VectorWritable). The sequence file _key_ is 
not used.
# output: a file path string to an empty directory which is used for all output 
from the algorithm.
# modelFactory: an instance of ModelDistribution which will be used for the 
clustering.
# numClusters: the number of models to be used for the clustering. This should 
be larger than the number of clusters which is expected in the data set.
# maxIterations: the number of iterations to run for the clustering.
# alpha_0: a double value (default is 1.0) used for creating the 
DirichletDistribution. Influences the likelihood that new, empty clusters will 
be selected for assignment in the first iteration.
# runClustering: a boolean indicating, if true, that the clustering step is to 
be executed after clusters have been determined.
# emitMostLikely: a boolean indicating, if true, that the clustering step 
should only emit the most likely cluster for each clustered point.
# threshold: a double indicating, if emitMostLikely is false, the cluster 
probability threshold used for emitting multiple clusters for each point. A 
value of 0 will emit all clusters with their associated probabilities for each 
vector.
# runSequential: a boolean indicating, if true, that the clustering is to be 
run using the sequential reference implementation in memory.

After running the algorithm, the output directory will contain:
# clusters-N: directories containing SequenceFiles(Text, DirichletCluster) 
produced by the algorithm for each iteration. The Text _key_ is a cluster 
identifier string.
# clusteredPoints: (if runClustering enabled) a directory containing 
SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable _key_ is the 
clusterId. The WeightedVectorWritable _value_ is a bean containing a double 
_weight_ and a VectorWritable _vector_ where the weight indicates the 
probability that the vector is a member of the cluster (the Probability Density 
Function (pdf) of the cluster model evaluated at the vector). 


h1. Examples

The following images illustrate three different prior models applied to a set 
of randomly-generated 2-d data points. The points are generated using a normal 
distribution centered at a mean location and with a constant standard 
deviation. See the README file in the 
[/examples/src/main/java/org/apache/mahout/clustering/display/README.txt|http://svn.apache.org/repos/asf/mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/display/README.txt]
 for details on running similar examples.

The points are generated as follows:

* 500 samples m=\[1.0, 1.0\] sd=3.0
* 300 samples m=\[1.0, 0.0\] sd=0.5
* 300 samples m=\[0.0, 2.0\] sd=0.1

In the first image, the points are plotted and the 3-sigma boundaries of their 
generator are superimposed. It is, of course, impossible to tell which model 
actually generated each point as there is some probability - perhaps small - 
that any of the models could have generated every point.

!SampleData.png!

In the next image, the Dirichlet Process Clusterer is run against the sample 
points using a NormalModelDistribution with m=\[0.0, 0.0\] sd=1.0. This 
distribution represents the least amount of prior information, as its sampled 
models all have constant parameters. The resulting significant models 
(representing > 5% of the population) are superimposed upon the sample data. 
Since all prior models are identical and their pdfs are the same, the first 
iteration's assignment of points to models is completely governed by the 
initial mixture values. Since these are also identical, it means the first 
iteration assigns points to models at random. During subsequent iterations, the 
models diverge from the origin but there is some over-fitting in the result.

As Dirichlet clustering is an iterative process, the following illustrations 
include the cluster information from all iterations. The final cluster values 
are in bold red and earlier iterations are shown in \[orange, yellow, green, 
blue, violet and the rest are all gray\]. These illustrate the cluster 
convergence process over the last several iterations and can be helpful in 
tuning the algorithm.

!DirichletN.png!

The next image improves upon this situation by using a 
SampledNormalDistribution. In this distribution, the prior models have means 
that are sampled from a normal distribution and all have a constant sd=1. This 
distribution creates initial models that are centered at different coordinates. 
During the first iteration, each model thus has a different pdf for each point 
and the iteration assigns points to the more-likely models given this value. 
The result is a decent capture of the sample data parameters but there is still 
some over-fitting.

!DirichletSN.png!

The above image was run through 20 iterations and the cluster assignments are 
clearly moving indicating the clustering is not yet converged. The next image 
runs the same model for 40 iterations, producing an accurate model of the input 
data.

!DirichletSN40.png!

The next image uses an AsymmetricSampledNormalDistribution in which the model's 
standard deviation is also represented as a 2-d vector. This causes the 
clusters to assume elliptical shapes in the resulting clustering. This 
represents an incorrect prior assumption but it is interesting that it fits the 
actual sample data quite well. Had we suspected the sample points were 
generated in a similar manner then this distribution would have been the most 
logical model.

!DirichletASN.png!

In order to explore an asymmetrical sample data distribution, the following 
image shows a number of points generated according to the following parameters. 
Again, the generator's 3-sigma ellipses are superimposed:

* 500 samples m=\[1.0, 1.0\] sd=\[3.0, 1.0\]
* 300 samples m=\[1.0, 0.0\] sd=\[0.5, 1.0\]
* 300 samples m=\[0.0, 2.0\] sd=\[0.1, 0.5\]

!AsymmetricSampleData.png!

The following image shows the results of applying the symmetrical 
SampledNormalDistribution to the asymmetrically-generated sample data. It does 
a valiant effort but does not capture a very good set of models because the 
circular model assumption does not fit the data.

!2dDirichletSN.png!

Finally, the AsymmetricSampledNormalDistribution is run against the 
asymmetrical sample data. Though there is some over-fitting, it does a credible 
job of capturing the underlying models. Different arguments (numClusters, 
alpha0, numIterations) and display thresholds will yield slightly different 
results. Compare the first run of numClusters=20 models for 20 iterations with 
another run of numClusters=40 models for 40 iterations.

!2dDirichletASN.png!
!2dDirichletASN4040.png!

h1. References

McCullagh and Yang: http://ba.stat.cmu.edu/journal/2008/vol03/issue01/yang.pdf

There is also a more approachable example in [Chris Bishop's book on Machine 
Learning| 
http://research.microsoft.com/en-us/um/people/cmbishop/PRML/index.htm]. I think 
that chapter 9 is where the example of clustering using a mixture model is 
found.

The Neal and Blei references from the McCullagh and Yang paper are also good. 
Zoubin Gharamani has some very [nice tutorials out which describe why 
non-parametric Bayesian approaches to problems are very 
cool|http://learning.eng.cam.ac.uk/zoubin/talks/uai05tutorial-b.pdf], there are 
video versions about as well.


Change your notification preferences: 
https://cwiki.apache.org/confluence/users/viewnotifications.action    

Reply via email to