On Nov 2, 2011, at 5:29 PM, Jeff Eastman wrote:

> I think the scalability problems you are seeing are a consequence of using 
> the default GaussianCluster models. These models perform especially poorly 
> for large text clustering problems such as email. The pdf() calculation over 
> wide topic vectors does a lot of complicated math for each term pdf and then 
> underflows on the combined pdf() product to boot. I've updated 
> build-reuters.sh to use a CosineDistanceMeasure and a DistanceMeasureCluster 
> instead and the performance has improved over 100x on Reuters. So has, 
> evidently, the quality of the clustering. See recent posts "Dirichlet Process 
> Clustering not working".

I shall try that on the build-asf-email.sh script.
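
For anyone following the thread, here is a rough sketch of the underflow Jeff describes and of a cosine distance for comparison. It is plain Python, not Mahout code: the function names, the 200-term vector and the 1e-3 per-term density are all made up for illustration, and the cosine distance is the usual 1 - cosine similarity rather than a copy of what CosineDistanceMeasure does internally.

```python
import math

def product_of_pdfs(term_pdfs):
    """Multiply per-term densities directly (hypothetical helper)."""
    p = 1.0
    for v in term_pdfs:
        p *= v
    return p

def sum_of_log_pdfs(term_pdfs):
    """Same quantity in log space, which does not underflow here."""
    return sum(math.log(v) for v in term_pdfs)

def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length dense vectors.

    Returning 1.0 for a zero vector is an assumption made for this sketch.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

# Illustrative numbers only: 200 terms, each with a small density.
term_pdfs = [1e-3] * 200

print(product_of_pdfs(term_pdfs))   # direct product collapses to 0.0
print(sum_of_log_pdfs(term_pdfs))   # log-space sum stays finite
print(cosine_distance([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # parallel vectors
```

With wide text vectors the direct product falls below the smallest representable double long before all terms are multiplied in, so every cluster ends up with the same pdf of zero. A distance measure needs only one dot product and two norms per comparison and avoids the issue.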
