Unfortunately, I am using private data that I cannot share. I index emails with Solr and then create vectors from them. They work fine with k-means; I just wanted to try out the Dirichlet algorithm.
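In case it helps anyone following along, here is a minimal sketch of the size mismatch discussed in the quoted thread below. It is not the Mahout 0.2 source: plain double[] arrays stand in for Mahout's Vector, and the names PdfDimensionSketch and sampleMeans are made up for illustration. The pdf expression is copied from the NormalModel code quoted below; the only change is that the prior means are sized to the data cardinality n instead of a hard-coded 2.

```java
// Illustration only: double[] stands in for Mahout's Vector, and all names
// here are hypothetical rather than part of any Mahout API.
final class PdfDimensionSketch {
  static final double SQRT_2PI = Math.sqrt(2.0 * Math.PI);

  // Dot product: sum of products of corresponding elements.
  // It is undefined when the two vectors have different lengths.
  static double dot(double[] a, double[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException(
          "cardinality mismatch: " + a.length + " vs " + b.length);
    }
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  // Same expression as the pdf() quoted in the thread.
  static double pdf(double[] x, double[] mean, double stdDev) {
    double sd2 = stdDev * stdDev;
    double exp = -(dot(x, x) - 2 * dot(x, mean) + dot(mean, mean)) / (2 * sd2);
    return Math.exp(exp) / (stdDev * SQRT_2PI);
  }

  // Hypothetical counterpart of sampleFromPrior: zero means of size n,
  // where n is the data cardinality, instead of a fixed size of 2.
  static double[][] sampleMeans(int howMany, int n) {
    double[][] means = new double[howMany][];
    for (int i = 0; i < howMany; i++) {
      means[i] = new double[n];
    }
    return means;
  }

  public static void main(String[] args) {
    double[] x = {1.0, 0.0, 2.0, 0.5};
    try {
      pdf(x, new double[2], 1.0); // prior mean of size 2, data of size 4
    } catch (IllegalArgumentException e) {
      System.out.println("size-2 prior fails: " + e.getMessage());
    }
    double[] mean = sampleMeans(1, x.length)[0];
    System.out.println("size-n prior works: pdf = " + pdf(x, mean, 1.0));
  }
}
```

With real document vectors, n would be the dictionary size produced by the Solr export, so the mean would probably need a sparse representation rather than a dense array; that part is a guess on my side.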
On Thu, Jan 14, 2010 at 8:49 PM, Jeff Eastman <[email protected]> wrote:

> I gather you are doing text clustering? Are you using one of our example
> datasets or one which is publicly available?
>
> Bogdan Vatkov wrote:
>
>> Hi Jeff,
>>
>> What kind of details do you need to continue?
>> In the meantime I am going back to kmeans anyway (maybe I should really
>> start by adding canopy to my kmeans-only scenario first ;)).
>>
>> Best regards,
>> Bogdan
>>
>> On Thu, Jan 14, 2010 at 1:49 AM, Jeff Eastman <[email protected]> wrote:
>>
>>> I think KMeans and Canopy are the most-used and therefore the most
>>> robust. Dirichlet still has not seen much use beyond some test examples
>>> and NormalModel has at least one known problem (with sample() only
>>> returning the maximum likelihood) that has been reported but never
>>> fixed. Can you point me to the problem you are running so I can try to
>>> get up to speed? It has been some time since I worked in this code but
>>> I'm keen to do so and I have some time to invest.
>>>
>>> Jeff
>>>
>>> Bogdan Vatkov wrote:
>>>
>>>> But if I am the first one to use Dirichlet, which algorithm is the
>>>> recommended one? Are all other algs better than Dirichlet so no one
>>>> used it ;)?
>>>>
>>>> On Thu, Jan 14, 2010 at 1:23 AM, Jeff Eastman <[email protected]> wrote:
>>>>
>>>>> The NormalModelDistribution seems to still think all the data vectors
>>>>> are size=2. In SampleFromPrior, it is creating models with that size.
>>>>> Subsequently, when you calculate the pdf with your data value (x) the
>>>>> sizes are incompatible. Suggest changing 'DenseVector(2)' to
>>>>> 'DenseVector(n)', where n is your data cardinality. Please also look
>>>>> at the rest of the math in DenseVector with suspicion. AFAIK, you are
>>>>> the first person to try to use Dirichlet.
>>>>>
>>>>> Bogdan Vatkov wrote:
>>>>>
>>>>>> I see a stack when the size of the vector mean is set to 2:
>>>>>>
>>>>>> Daemon Thread [Thread-9] (Suspended (breakpoint at line 48 in NormalModel))
>>>>>>   NormalModel.<init>(Vector, double) line: 48
>>>>>>   NormalModelDistribution.sampleFromPrior(int) line: 33
>>>>>>   DirichletState<O>.<init>(ModelDistribution<O>, int, double, int, int) line: 48
>>>>>>   DirichletDriver.createState(String, int, double) line: 172
>>>>>>   DirichletDriver.writeInitialState(String, String, String, int, double) line: 150
>>>>>>   DirichletDriver.runJob(String, String, String, int, int, double, int) line: 133
>>>>>>   DirichletDriver.main(String[]) line: 109
>>>>>>   Clusters.doClustering() line: 244
>>>>>>   Clusters.access$0(Clusters) line: 175
>>>>>>   Clusters$1.run() line: 148
>>>>>>   Thread.run() line: 619
>>>>>>
>>>>>> public class NormalModelDistribution implements ModelDistribution<Vector> {
>>>>>>   @Override
>>>>>>   public Model<Vector>[] sampleFromPrior(int howMany) {
>>>>>>     Model<Vector>[] result = new NormalModel[howMany];
>>>>>>     for (int i = 0; i < howMany; i++) {
>>>>>>       result[i] = new NormalModel(new DenseVector(2), 1);
>>>>>>     }
>>>>>>     return result;
>>>>>>   }
>>>>>>
>>>>>> and later this vector is dotted to
>>>>>>
>>>>>>   @Override
>>>>>>   public double pdf(Vector x) {
>>>>>>     double sd2 = stdDev * stdDev;
>>>>>>     double exp = -(x.dot(x) - 2 * x.dot(mean) + mean.dot(mean)) / (2 * sd2);
>>>>>>     double ex = Math.exp(exp);
>>>>>>     return ex / (stdDev * sqrt2pi);
>>>>>>   }
>>>>>>
>>>>>> x vector which is coming from Hadoop MapRunner through the map
>>>>>> function:
>>>>>>
>>>>>>   public void map(WritableComparable<?> key, Vector v,
>>>>>>       OutputCollector<Text, Vector> output, Reporter reporter)
>>>>>>       throws IOException {
>>>>>>
>>>>>> any idea?
>>>>>>
>>>>>> btw, I am running Mahout 0.2...should I move to 0.3 or to trunk? is
>>>>>> it safe enough to run against trunk?
>>>>>>
>>>>>> On Wed, Jan 13, 2010 at 10:13 PM, Ted Dunning <[email protected]> wrote:
>>>>>>
>>>>>>> On Wed, Jan 13, 2010 at 11:53 AM, Bogdan Vatkov <[email protected]> wrote:
>>>>>>>
>>>>>>>> Sorry, what does that mean :)?
>>>>>>>
>>>>>>> It means that there is probably a programming bug somehow. At the
>>>>>>> very least, the program is not robust with respect to strange
>>>>>>> invocations.
>>>>>>>
>>>>>>>> what is a dotted vector? and why aren't they the same?
>>>>>>>
>>>>>>> dot product is a vector operation that is the sum of products of
>>>>>>> corresponding elements of the two vectors being operated on. If
>>>>>>> these vectors don't have the same length, then it is an error.
>>>>>>>
>>>>>>>> what should I investigate?
>>>>>>>
>>>>>>> I am not familiar with the code, but if I had time to look, my
>>>>>>> strategy would be to start in the NormalModel and work back up the
>>>>>>> stack trace to find out how the vectors came to be different
>>>>>>> lengths. No doubt, the code in NormalModel will not tell you
>>>>>>> anything, but you can see which vectors are involved and by walking
>>>>>>> up the stack you may be able to see where they come from.
>>>>>>>
>>>>>>>> I am basically running my complete kmeans scenario (same input
>>>>>>>> data, same number of clusters param, etc.) but just replacing
>>>>>>>> KmeansDriver.main step with a DirichletDriver.main call...of
>>>>>>>> course the arguments are adjusted since kmeans and dirichlet do
>>>>>>>> not have the same arguments.
>>>>>>>
>>>>>>> I would think that this sounds very plausible.
>>>>>>>
>>>>>>>> I am not sure what number I should give for the alpha argument,
>>>>>>>
>>>>>>> Alpha should have a value in the range from 0.01 to 20. I would
>>>>>>> scan with 1, 2, 5 magnitude steps to see what works well for your
>>>>>>> data (i.e. 0.01, 0.02, 0.05, 0.1, 0.2 ... 20). A value of 1 is a
>>>>>>> fine place to start. The effect of different values should be small
>>>>>>> over a pretty wide range.
>>>>>>>
>>>>>>>> iterations
>>>>>>>> and reductions...here is my current argument set:
>>>>>>>>
>>>>>>>> args = new String[] {
>>>>>>>>   "--input",
>>>>>>>>   "/store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/solr_index.vec",
>>>>>>>>   "--output", config.getClustersDir(),
>>>>>>>>   "--modelClass",
>>>>>>>>   "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution",
>>>>>>>>   "--maxIter", "15",
>>>>>>>>   "--alpha", "1.0",
>>>>>>>>   "--k", config.getClustersCount(),
>>>>>>>>   "--maxRed", "2"
>>>>>>>> };
>>>>>>>
>>>>>>> Not off-hand.

--
Best regards,
Bogdan
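
P.S. For the alpha scan Ted suggests in the quoted thread (1, 2, 5 steps per order of magnitude between 0.01 and 20), the candidate values could be generated roughly as sketched here. This is only a sketch: the class and method names are made up, and how each value is fed to the driver (one run per "--alpha" value) is my assumption.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Mahout: builds the 1-2-5 grid of alpha
// values as strings suitable for a "--alpha" argument.
final class AlphaGridSketch {

  // Returns m * 10^e for m in {1, 2, 5} and e in [minExp, maxExp],
  // keeping only values that do not exceed max.
  static List<String> alphaGrid(int minExp, int maxExp, BigDecimal max) {
    List<String> grid = new ArrayList<>();
    for (int e = minExp; e <= maxExp; e++) {
      for (long m : new long[] {1, 2, 5}) {
        // BigDecimal avoids floating-point noise such as 0.020000000000000004.
        BigDecimal value = BigDecimal.valueOf(m).scaleByPowerOfTen(e);
        if (value.compareTo(max) <= 0) {
          grid.add(value.toPlainString());
        }
      }
    }
    return grid;
  }

  public static void main(String[] args) {
    // 0.01 = 1 * 10^-2 up to 20 = 2 * 10^1
    for (String alpha : alphaGrid(-2, 1, new BigDecimal("20"))) {
      System.out.println("--alpha " + alpha);
    }
  }
}
```

Since Ted expects the effect of alpha to be small over a wide range, starting at 1 and only scanning further if the clusters look poor is probably enough.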
