But if I am the first one to use Dirichlet, which algorithm is the recommended one? Are all the other algorithms better than Dirichlet, so that no one has used it ;)?

On Thu, Jan 14, 2010 at 1:23 AM, Jeff Eastman <[email protected]> wrote:

> The NormalModelDistribution seems to still think all the data vectors are
> size=2. In sampleFromPrior, it is creating models with that size.
> Subsequently, when you calculate the pdf with your data value (x) the sizes
> are incompatible. Suggest changing 'DenseVector(2)' to 'DenseVector(n)',
> where n is your data cardinality. Please also look at the rest of the math
> in DenseVector with suspicion. AFAIK, you are the first person to try to
> use Dirichlet.
>
> Bogdan Vatkov wrote:
>
>> I see a stack when the size of the vector mean is set to 2:
>>
>> Daemon Thread [Thread-9] (Suspended (breakpoint at line 48 in NormalModel))
>>   NormalModel.<init>(Vector, double) line: 48
>>   NormalModelDistribution.sampleFromPrior(int) line: 33
>>   DirichletState<O>.<init>(ModelDistribution<O>, int, double, int, int) line: 48
>>   DirichletDriver.createState(String, int, double) line: 172
>>   DirichletDriver.writeInitialState(String, String, String, int, double) line: 150
>>   DirichletDriver.runJob(String, String, String, int, int, double, int) line: 133
>>   DirichletDriver.main(String[]) line: 109
>>   Clusters.doClustering() line: 244
>>   Clusters.access$0(Clusters) line: 175
>>   Clusters$1.run() line: 148
>>   Thread.run() line: 619
>>
>> public class NormalModelDistribution implements ModelDistribution<Vector> {
>>   @Override
>>   public Model<Vector>[] sampleFromPrior(int howMany) {
>>     Model<Vector>[] result = new NormalModel[howMany];
>>     for (int i = 0; i < howMany; i++) {
>>       result[i] = new NormalModel(new DenseVector(2), 1);
>>     }
>>     return result;
>>   }
>>
>> and later this vector is dotted to
>>
>>   @Override
>>   public double pdf(Vector x) {
>>     double sd2 = stdDev * stdDev;
>>     double exp = -(x.dot(x) - 2 * x.dot(mean) + mean.dot(mean)) / (2 * sd2);
>>     double ex = Math.exp(exp);
>>     return ex / (stdDev * sqrt2pi);
>>   }
>>
>> x vector which is coming from Hadoop MapRunner through the map function:
>>
>>   public void map(WritableComparable<?> key, Vector v,
>>       OutputCollector<Text, Vector> output, Reporter reporter)
>>       throws IOException {
>>
>> any idea?
>>
>> btw, I am running Mahout 0.2...should I move to 0.3 or to trunk? is it
>> safe enough to run against trunk?
>>
>> On Wed, Jan 13, 2010 at 10:13 PM, Ted Dunning <[email protected]> wrote:
>>
>>> On Wed, Jan 13, 2010 at 11:53 AM, Bogdan Vatkov <[email protected]> wrote:
>>>
>>>> Sorry, what does that mean :)?
>>>
>>> It means that there is probably a programming bug somehow. At the very
>>> least, the program is not robust with respect to strange invocations.
>>>
>>>> what is a dotted vector? and why aren't they the same?
>>>
>>> dot product is a vector operation that is the sum of products of
>>> corresponding elements of the two vectors being operated on. If these
>>> vectors don't have the same length, then it is an error.
>>>
>>> what should I investigate?
>>>
>>> I am not familiar with the code, but if I had time to look, my strategy
>>> would be to start in the NormalModel and work back up the stack trace to
>>> find out how the vectors came to be different lengths. No doubt, the code
>>> in NormalModel will not tell you anything, but you can see which vectors
>>> are involved and by walking up the stack you may be able to see where they
>>> come from.
>>>
>>>> I am basically running my complete kmeans scenario (same input data, same
>>>> number of clusters param, etc.) but just replacing KmeansDriver.main step
>>>> with a DirichletDriver.main call...of course the arguments are adjusted
>>>> since kmeans and dirichlet do not have the same arguments.
>>>
>>> I would think that this sounds very plausible.
>>>
>>>> I am not sure what number I should give for the alpha argument,
>>>
>>> Alpha should have a value in the range from 0.01 to 20. I would scan with
>>> 1, 2, 5 magnitude steps to see what works well for your data. (i.e. 0.01,
>>> 0.02, 0.05, 0.1, 0.2 ... 20). A value of 1 is a fine place to start. The
>>> effect of different values should be small over a pretty wide range.
>>>
>>>> iterations and reductions...here is my current argument set:
>>>>
>>>> args = new String[] {
>>>>   "--input",
>>>>   "/store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/solr_index.vec",
>>>>   "--output", config.getClustersDir(),
>>>>   "--modelClass",
>>>>   "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution",
>>>>   "--maxIter", "15",
>>>>   "--alpha", "1.0",
>>>>   "--k", config.getClustersCount(),
>>>>   "--maxRed", "2"
>>>> };
>>>
>>> Not off-hand.

--
Best regards,
Bogdan
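
For anyone reading this thread later, the size mismatch discussed above can be reproduced without Mahout. The sketch below is only an illustration on plain double[] arrays: the class CardinalityCheck and its methods dot, pdf and priorMean are hypothetical names, not Mahout API, and sqrt2pi from the quoted code is assumed to be Math.sqrt(2 * Math.PI). It shows that the pdf expression works once the prior mean has the same cardinality as the data vector and fails when the mean is fixed at size 2.

```java
/** Hypothetical stand-in for the vector math quoted in this thread; not Mahout code. */
final class CardinalityCheck {

    /** Dot product: sum of products of corresponding elements. */
    static double dot(double[] a, double[] b) {
        if (a.length != b.length) {
            // Vectors of different lengths cannot be dotted.
            throw new IllegalArgumentException(
                "cardinality mismatch: " + a.length + " vs " + b.length);
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Same expression as the quoted pdf method, written for plain arrays. */
    static double pdf(double[] x, double[] mean, double stdDev) {
        double sd2 = stdDev * stdDev;
        double exp = -(dot(x, x) - 2 * dot(x, mean) + dot(mean, mean)) / (2 * sd2);
        // Assumption: sqrt2pi in the quoted code equals Math.sqrt(2 * Math.PI).
        return Math.exp(exp) / (stdDev * Math.sqrt(2 * Math.PI));
    }

    /** Zero mean sized to the data cardinality n instead of a fixed 2. */
    static double[] priorMean(int n) {
        return new double[n];
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0};

        // Mean sized to the data: the pdf can be evaluated.
        System.out.println(pdf(x, priorMean(x.length), 1.0));

        // Mean fixed at size 2, as in the quoted sampleFromPrior: the dot product fails.
        try {
            pdf(x, priorMean(2), 1.0);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In the real code the equivalent change would be the one Jeff suggests: pass the data cardinality through to sampleFromPrior rather than hard-coding 2.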
