On Nov 2, 2011, at 5:29 PM, Jeff Eastman wrote:

> I think the scalability problems you are seeing are a consequence of using
> the default GaussianCluster models. These models perform especially poorly
> for large text clustering problems such as email. The pdf() calculation over
> wide topic vectors does a lot of complicated math for each term pdf and then
> underflows on the combined pdf() product to boot. I've updated
> build-reuters.sh to use a CosineDistanceMeasure and a DistanceMeasureCluster
> instead and the performance has improved over 100x on Reuters. So has,
> evidently, the quality of the clustering. See recent posts "Dirichlet Process
> Clustering not working".

I shall try that on the build-asf-email.sh script.
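
For anyone following along, here is a minimal sketch of the underflow described above. It is plain Python, not Mahout code: the function names (gaussian_pdf, product_pdf, cosine_distance), the 5000-term width, and the toy vectors are all assumptions made up for illustration, and the real GaussianCluster and CosineDistanceMeasure implementations may differ in detail.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a one-dimensional normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def product_pdf(point, center, sd=1.0):
    """Naive combined pdf: multiply the per-term densities together."""
    p = 1.0
    for x, m in zip(point, center):
        p *= gaussian_pdf(x, m, sd)
    return p


def cosine_distance(a, b):
    """1 - cos(angle between a and b); assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical wide, sparse "term vectors" (5000 terms, mostly zero).
dims = 5000
point = [1.0 if i % 50 == 0 else 0.0 for i in range(dims)]
center = [0.5 if i % 25 == 0 else 0.0 for i in range(dims)]

# With sd = 1 every per-term density is below 0.4, so the product of
# 5000 of them is far smaller than the smallest positive double.
combined = product_pdf(point, center)

# The cosine distance stays within [0, 2] however wide the vectors are.
distance = cosine_distance(point, center)

print("combined pdf:", combined)
print("cosine distance:", distance)
```

Once the combined pdf reaches zero for every cluster, the model has no basis for preferring one cluster over another, which may be part of why the clustering quality suffered as well. Summing log densities would be another way around the underflow, though it does nothing about the per-term cost.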
