Hi Jeff,

What kind of details do you need to continue? In the meantime I am going back to kmeans anyway (maybe I should really start by adding canopy to my kmeans-only scenario first ;)).
Best regards,
Bogdan

On Thu, Jan 14, 2010 at 1:49 AM, Jeff Eastman <[email protected]> wrote:

> I think KMeans and Canopy are the most-used and therefore the most robust.
> Dirichlet still has not seen much use beyond some test examples and
> NormalModel has at least one known problem (with sample() only returning the
> maximum likelihood) that has been reported but never fixed. Can you point me
> to the problem you are running so I can try to get up to speed? It has been
> some time since I worked in this code but I'm keen to do so and I have some
> time to invest.
>
> Jeff
>
> Bogdan Vatkov wrote:
>
>> But if I am the first one to use Dirichlet, which algorithm is the
>> recommended one? Are all the other algs better than Dirichlet so no one
>> used it ;)?
>>
>> On Thu, Jan 14, 2010 at 1:23 AM, Jeff Eastman <[email protected]> wrote:
>>
>>> The NormalModelDistribution seems to still think all the data vectors are
>>> size=2. In SampleFromPrior, it is creating models with that size.
>>> Subsequently, when you calculate the pdf with your data value (x) the sizes
>>> are incompatible. Suggest changing 'DenseVector(2)' to 'DenseVector(n)',
>>> where n is your data cardinality. Please also look at the rest of the math
>>> in DenseVector with suspicion. AFAIK, you are the first person to try to
>>> use Dirichlet.
>>>
>>> Bogdan Vatkov wrote:
>>>
>>>> I see a stack when the size of the vector mean is set to 2:
>>>>
>>>> Daemon Thread [Thread-9] (Suspended (breakpoint at line 48 in NormalModel))
>>>> NormalModel.<init>(Vector, double) line: 48
>>>> NormalModelDistribution.sampleFromPrior(int) line: 33
>>>> DirichletState<O>.<init>(ModelDistribution<O>, int, double, int, int) line: 48
>>>> DirichletDriver.createState(String, int, double) line: 172
>>>> DirichletDriver.writeInitialState(String, String, String, int, double) line: 150
>>>> DirichletDriver.runJob(String, String, String, int, int, double, int) line: 133
>>>> DirichletDriver.main(String[]) line: 109
>>>> Clusters.doClustering() line: 244
>>>> Clusters.access$0(Clusters) line: 175
>>>> Clusters$1.run() line: 148
>>>> Thread.run() line: 619
>>>>
>>>> public class NormalModelDistribution implements ModelDistribution<Vector> {
>>>>   @Override
>>>>   public Model<Vector>[] sampleFromPrior(int howMany) {
>>>>     Model<Vector>[] result = new NormalModel[howMany];
>>>>     for (int i = 0; i < howMany; i++) {
>>>>       result[i] = new NormalModel(new DenseVector(2), 1);
>>>>     }
>>>>     return result;
>>>>   }
>>>>
>>>> and later this vector is dotted to
>>>>
>>>>   @Override
>>>>   public double pdf(Vector x) {
>>>>     double sd2 = stdDev * stdDev;
>>>>     double exp = -(x.dot(x) - 2 * x.dot(mean) + mean.dot(mean)) / (2 * sd2);
>>>>     double ex = Math.exp(exp);
>>>>     return ex / (stdDev * sqrt2pi);
>>>>   }
>>>>
>>>> x vector which is coming from Hadoop MapRunner through the map function:
>>>>
>>>>   public void map(WritableComparable<?> key, Vector v,
>>>>       OutputCollector<Text, Vector> output, Reporter reporter)
>>>>       throws IOException {
>>>>
>>>> any idea?
>>>>
>>>> btw, I am running Mahout 0.2...should I move to 0.3 or to trunk? is it safe
>>>> enough to run against trunk?
>>>>
>>>> On Wed, Jan 13, 2010 at 10:13 PM, Ted Dunning <[email protected]> wrote:
>>>>
>>>>> On Wed, Jan 13, 2010 at 11:53 AM, Bogdan Vatkov <[email protected]> wrote:
>>>>>
>>>>>> Sorry, what does that mean :)?
>>>>>
>>>>> It means that there is probably a programming bug somehow. At the very
>>>>> least, the program is not robust with respect to strange invocations.
>>>>>
>>>>>> what is a dotted vector? and why aren't they the same?
>>>>>
>>>>> dot product is a vector operation that is the sum of products of
>>>>> corresponding elements of the two vectors being operated on. If these
>>>>> vectors don't have the same length, then it is an error.
>>>>>
>>>>>> what should I investigate?
>>>>>
>>>>> I am not familiar with the code, but if I had time to look, my strategy
>>>>> would be to start in the NormalModel and work back up the stack trace to
>>>>> find out how the vectors came to be different lengths. No doubt, the code
>>>>> in NormalModel will not tell you anything, but you can see which vectors
>>>>> are involved and by walking up the stack you may be able to see where they
>>>>> come from.
>>>>>
>>>>>> I am basically running my complete kmeans scenario (same input data, same
>>>>>> number of clusters param, etc.) but just replacing KmeansDriver.main step
>>>>>> with a DirichletDriver.main call...of course the arguments are adjusted
>>>>>> since kmeans and dirichlet do not have the same arguments.
>>>>>
>>>>> I would think that this sounds very plausible.
>>>>>
>>>>>> I am not sure what number I should give for the alpha argument,
>>>>>
>>>>> Alpha should have a value in the range from 0.01 to 20. I would scan with
>>>>> 1, 2, 5 magnitude steps to see what works well for your data (i.e. 0.01,
>>>>> 0.02, 0.05, 0.1, 0.2 ... 20). A value of 1 is a fine place to start. The
>>>>> effect of different values should be small over a pretty wide range.
>>>>>
>>>>>> iterations
>>>>>> and reductions...here is my current argument set:
>>>>>>
>>>>>> args = new String[] {
>>>>>>   "--input",
>>>>>>   "/store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/solr_index.vec",
>>>>>>   "--output", config.getClustersDir(),
>>>>>>   "--modelClass",
>>>>>>   "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution",
>>>>>>   "--maxIter", "15",
>>>>>>   "--alpha", "1.0",
>>>>>>   "--k", config.getClustersCount(),
>>>>>>   "--maxRed", "2"
>>>>>> };
>>>>>
>>>>> Not off-hand.

--
Best regards, Bogdan
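
For anyone following the thread, the size mismatch discussed above can be sketched in plain Java. This is only an illustration under stated assumptions: it uses `double[]` instead of Mahout's `Vector`, and the class `NormalModelSketch`, its `dot` helper, and the extra `cardinality` parameter on `sampleFromPrior` are hypothetical names, not part of Mahout. The `pdf` body mirrors the formula quoted in the thread.

```java
class NormalModelSketch {
  private static final double SQRT_2_PI = Math.sqrt(2.0 * Math.PI);

  final double[] mean;
  final double stdDev;

  NormalModelSketch(double[] mean, double stdDev) {
    this.mean = mean;
    this.stdDev = stdDev;
  }

  // Dot product: sum of products of corresponding elements.
  // Fails loudly when the two vectors have different lengths.
  static double dot(double[] a, double[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException(
          "cardinality mismatch: " + a.length + " vs " + b.length);
    }
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  // Same expression as the pdf() quoted in the thread.
  double pdf(double[] x) {
    double sd2 = stdDev * stdDev;
    double exp = -(dot(x, x) - 2 * dot(x, mean) + dot(mean, mean)) / (2 * sd2);
    return Math.exp(exp) / (stdDev * SQRT_2_PI);
  }

  // Hypothetical variant of sampleFromPrior: the mean is sized to the
  // data cardinality instead of the hard-coded 2.
  static NormalModelSketch[] sampleFromPrior(int howMany, int cardinality) {
    NormalModelSketch[] result = new NormalModelSketch[howMany];
    for (int i = 0; i < howMany; i++) {
      result[i] = new NormalModelSketch(new double[cardinality], 1.0);
    }
    return result;
  }

  public static void main(String[] args) {
    double[] x = {1.0, 2.0, 3.0};
    // Mean sized to the data: pdf can be evaluated.
    System.out.println(sampleFromPrior(1, 3)[0].pdf(x));
    // Mean of size 2 against data of size 3: the dot product is rejected.
    try {
      sampleFromPrior(1, 2)[0].pdf(x);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

With the mean sized to the data, `pdf` evaluates normally; with a size-2 mean and size-3 data, the length check in `dot` reports the mismatch at the point where the two vectors first meet.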

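
The "1, 2, 5 magnitude steps" suggested for alpha can also be written out as a small sketch. The class `AlphaGrid` and method `oneTwoFive` are hypothetical names used only for illustration; the bounds are passed in hundredths (1 to 2000 for the 0.01 to 20 range from the thread) so the grid is built with integer arithmetic and converted to `double` at the end.

```java
import java.util.ArrayList;
import java.util.List;

class AlphaGrid {
  // Builds a 1-2-5 grid per decade between the two bounds (inclusive).
  // Bounds are given in hundredths, e.g. 1 means 0.01 and 2000 means 20.
  static List<Double> oneTwoFive(int minHundredths, int maxHundredths) {
    int[] steps = {1, 2, 5};
    List<Double> grid = new ArrayList<>();
    for (long decade = 1; decade <= maxHundredths; decade *= 10) {
      for (int step : steps) {
        long value = decade * step;
        if (value >= minHundredths && value <= maxHundredths) {
          grid.add(value / 100.0);
        }
      }
    }
    return grid;
  }

  public static void main(String[] args) {
    // Candidate alpha values from 0.01 to 20.
    for (double alpha : oneTwoFive(1, 2000)) {
      System.out.println("--alpha " + alpha);
    }
  }
}
```

Each printed value could be passed as the `--alpha` argument in the argument set shown above, one run per value, starting with 1 as suggested in the thread.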