But if I am the first one to use Dirichlet, which algorithm is the recommended
one? Are all the other algorithms better than Dirichlet, so no one has used it ;)?

On Thu, Jan 14, 2010 at 1:23 AM, Jeff Eastman <[email protected]> wrote:

> The NormalModelDistribution seems to still think all the data vectors are
> size=2.  In sampleFromPrior, it is creating models with that size.
> Subsequently, when you calculate the pdf with your data value (x), the sizes
> are incompatible. I suggest changing 'DenseVector(2)' to 'DenseVector(n)',
> where n is your data cardinality. Please also look at the rest of the math
> in DenseVector with suspicion. AFAIK, you are the first person to try to
> use Dirichlet.
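
For what it's worth, here is a minimal sketch of that change as I understand it. It is not Mahout code: plain double[] arrays stand in for DenseVector, and SimpleNormalModel, PriorSketch and the parameter n are hypothetical names made up for illustration. The pdf uses the same formula as the quoted NormalModel.pdf, written as a squared distance.

```java
// Sketch only: plain arrays stand in for Mahout's DenseVector, and
// SimpleNormalModel / PriorSketch are hypothetical names, not Mahout classes.
final class SimpleNormalModel {
  private static final double SQRT_2_PI = Math.sqrt(2.0 * Math.PI);
  final double[] mean;
  final double stdDev;

  SimpleNormalModel(double[] mean, double stdDev) {
    this.mean = mean;
    this.stdDev = stdDev;
  }

  // x.x - 2*x.mean + mean.mean equals |x - mean|^2, so this matches the
  // quoted formula: exp(-|x - mean|^2 / (2 * sd^2)) / (sd * sqrt(2 * pi)).
  double pdf(double[] x) {
    if (x.length != mean.length) {
      throw new IllegalArgumentException(
          "cardinality mismatch: x=" + x.length + ", mean=" + mean.length);
    }
    double squaredDistance = 0.0;
    for (int i = 0; i < x.length; i++) {
      double d = x[i] - mean[i];
      squaredDistance += d * d;
    }
    return Math.exp(-squaredDistance / (2.0 * stdDev * stdDev)) / (stdDev * SQRT_2_PI);
  }
}

final class PriorSketch {
  // Builds howMany models whose zero mean has the data cardinality n
  // instead of a hard-coded size of 2.
  static SimpleNormalModel[] sampleFromPrior(int howMany, int n) {
    SimpleNormalModel[] result = new SimpleNormalModel[howMany];
    for (int i = 0; i < howMany; i++) {
      result[i] = new SimpleNormalModel(new double[n], 1.0);
    }
    return result;
  }
}
```

With this, PriorSketch.sampleFromPrior(k, n) gives models whose pdf accepts any x of length n, and a vector of any other length fails with a clear message rather than deep inside the dot product.
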
>
>
>
> Bogdan Vatkov wrote:
>
>> I see a stack trace where the size of the mean vector is set to 2:
>>
>> Daemon Thread [Thread-9] (Suspended (breakpoint at line 48 in NormalModel))
>> NormalModel.<init>(Vector, double) line: 48
>> NormalModelDistribution.sampleFromPrior(int) line: 33
>> DirichletState<O>.<init>(ModelDistribution<O>, int, double, int, int) line: 48
>> DirichletDriver.createState(String, int, double) line: 172
>> DirichletDriver.writeInitialState(String, String, String, int, double) line: 150
>> DirichletDriver.runJob(String, String, String, int, int, double, int) line: 133
>> DirichletDriver.main(String[]) line: 109
>> Clusters.doClustering() line: 244
>> Clusters.access$0(Clusters) line: 175
>> Clusters$1.run() line: 148
>> Thread.run() line: 619
>>
>>
>> public class NormalModelDistribution implements ModelDistribution<Vector> {
>>   @Override
>>   public Model<Vector>[] sampleFromPrior(int howMany) {
>>     Model<Vector>[] result = new NormalModel[howMany];
>>     for (int i = 0; i < howMany; i++) {
>>       result[i] = new NormalModel(new DenseVector(2), 1);
>>     }
>>     return result;
>>   }
>>
>> and later this vector is dotted with x in pdf:
>>  @Override
>>  public double pdf(Vector x) {
>>    double sd2 = stdDev * stdDev;
>>    double exp = -(x.dot(x) - 2 * x.dot(mean) + mean.dot(mean)) / (2 * sd2);
>>    double ex = Math.exp(exp);
>>    return ex / (stdDev * sqrt2pi);
>>  }
>>
>> The x vector comes from the Hadoop MapRunner through the map function:
>>
>>  public void map(WritableComparable<?> key, Vector v,
>>                  OutputCollector<Text, Vector> output, Reporter reporter)
>>      throws IOException {
>>
>>
>> Any idea?
>>
>> BTW, I am running Mahout 0.2. Should I move to 0.3 or to trunk? Is it safe
>> enough to run against trunk?
>>
>> On Wed, Jan 13, 2010 at 10:13 PM, Ted Dunning <[email protected]>
>> wrote:
>>
>>
>>
>>> On Wed, Jan 13, 2010 at 11:53 AM, Bogdan Vatkov <[email protected]> wrote:
>>>
>>>> Sorry, what does that mean :)?
>>>>
>>>>
>>>>
>>> It means that there is probably a programming bug somehow.  At the very
>>> least, the program is not robust with respect to strange invocations.
>>>
>>>
>>>
>>>
>>>> What is a dotted vector? And why aren't they the same?
>>>>
>>>>
>>>>
>>> A dot product is a vector operation that sums the products of
>>> corresponding elements of the two vectors being operated on.  If these
>>> vectors don't have the same length, then it is an error.
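
To make that definition concrete for myself, here is a minimal sketch of a dot product with an explicit length check. It uses plain arrays, not Mahout's Vector, and DotProductSketch is a hypothetical name, so treat it as an illustration of the definition rather than what the library does.

```java
// Sketch only: DotProductSketch is a hypothetical helper, not a Mahout class.
final class DotProductSketch {
  // Sums the products of corresponding elements; rejects vectors of
  // different lengths, since the operation is undefined for them.
  static double dot(double[] a, double[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException(
          "length mismatch: " + a.length + " vs " + b.length);
    }
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }
}
```

For example, dot({1, 2, 3}, {4, 5, 6}) is 1*4 + 2*5 + 3*6 = 32, while a size-2 mean against a longer data vector is rejected up front.
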
>>>
>>>> what should I investigate?
>>>>
>>> I am not familiar with the code, but if I had time to look, my strategy
>>> would be to start in the NormalModel and work back up the stack trace to
>>> find out how the vectors came to be different lengths.  No doubt, the code
>>> in NormalModel will not tell you anything, but you can see which vectors
>>> are involved and by walking up the stack you may be able to see where they
>>> come from.
>>>
>>>
>>>
>>>
>>>> I am basically running my complete kmeans scenario (same input data, same
>>>> number of clusters param, etc.) but just replacing KmeansDriver.main step
>>>> with a DirichletDriver.main call...of course the arguments are adjusted
>>>> since kmeans and dirichlet do not have the same arguments.
>>>>
>>>>
>>>>
>>> I would think that this sounds very plausible.
>>>
>>>
>>>
>>>
>>>> I am not sure what number I should give for the alpha argument,
>>>>
>>>>
>>> Alpha should have a value in the range from 0.01 to 20.  I would scan with
>>> 1, 2, 5 magnitude steps to see what works well for your data (i.e. 0.01,
>>> 0.02, 0.05, 0.1, 0.2 ... 20).  A value of 1 is a fine place to start.  The
>>> effect of different values should be small over a pretty wide range.
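
If I read the 1, 2, 5 scan correctly, the candidate alpha values could be generated with something like the sketch below. AlphaGridSketch and alphaGrid are hypothetical names of my own, and the result would still have to be fed to the driver one value at a time through --alpha.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

// Sketch only: AlphaGridSketch is a hypothetical helper, not part of Mahout.
final class AlphaGridSketch {
  // Returns 1, 2, 5 multiples of successive powers of ten, starting at
  // 10^minExponent and stopping before the first value greater than max.
  static List<Double> alphaGrid(int minExponent, double max) {
    long[] mantissas = {1L, 2L, 5L};
    List<Double> grid = new ArrayList<>();
    for (int exponent = minExponent; ; exponent++) {
      for (long m : mantissas) {
        // BigDecimal keeps the decimal steps exact before converting to double.
        double alpha = BigDecimal.valueOf(m).scaleByPowerOfTen(exponent).doubleValue();
        if (alpha > max) {
          return grid;
        }
        grid.add(alpha);
      }
    }
  }
}
```

Calling alphaGrid(-2, 20.0) gives the eleven values 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10 and 20.
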
>>>
>>>
>>>
>>>
>>>> iterations
>>>> and reductions...here is my current argument set:
>>>>
>>>> args = new String[] {
>>>> "--input",
>>>> "/store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/solr_index.vec",
>>>> "--output", config.getClustersDir(),
>>>> "--modelClass",
>>>> "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution",
>>>> "--maxIter", "15",
>>>> "--alpha", "1.0",
>>>> "--k", config.getClustersCount(),
>>>> "--maxRed", "2"
>>>> };
>>>>
>>>>
>>>>
>>>>
>>> Not off-hand.


-- 
Best regards,
Bogdan
