Hi Jeff,

What kind of details do you need to continue?
In the meantime I am going back to kmeans anyway (maybe I will really start by
adding canopy to my kmeans-only scenario first ;)).

Best regards,
Bogdan

On Thu, Jan 14, 2010 at 1:49 AM, Jeff Eastman <[email protected]> wrote:

> I think KMeans and Canopy are the most-used and therefore the most robust.
> Dirichlet still has not seen much use beyond some test examples and
> NormalModel has at least one known problem (with sample() only returning the
> maximum likelihood) that has been reported but never fixed. Can you point me
> to the problem you are running so I can try to get up to speed? It has been
> some time since I worked in this code but I'm keen to do so and I have some
> time to invest.
>
> Jeff
>
>
>
> Bogdan Vatkov wrote:
>
>> But if I am the first one to use Dirichlet, which algorithm is the
>> recommended one? Are all the other algorithms better than Dirichlet, so no
>> one has used it ;)?
>>
>> On Thu, Jan 14, 2010 at 1:23 AM, Jeff Eastman <[email protected]> wrote:
>>
>>
>>
>>> The NormalModelDistribution seems to still think all the data vectors are
>>> size=2.  In SampleFromPrior, it is creating models with that size.
>>> Subsequently, when you calculate the pdf with your data value (x) the
>>> sizes
>>> are incompatible. Suggest changing 'DenseVector(2)' to 'DenseVector(n)',
>>> where n is your data cardinality. Please also look at the rest of the
>>> math
>>> in DenseVector with suspicion. AFAIK, you are the first person to try to
>>> use Dirichlet.
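
A minimal sketch of the suggested change, written against plain double[]
arrays so it runs without Mahout. SimpleNormalModel and the extra
cardinality parameter are hypothetical stand-ins, not the Mahout classes or
their signatures. The pdf mirrors the expression quoted further down in this
thread and has not been checked for correct multivariate normalization.

```java
public class PriorCardinalitySketch {

  /** Hypothetical stand-in for NormalModel: a mean vector plus a standard deviation. */
  public static final class SimpleNormalModel {
    public final double[] mean;
    public final double stdDev;

    public SimpleNormalModel(double[] mean, double stdDev) {
      this.mean = mean;
      this.stdDev = stdDev;
    }

    /** Same expression as the pdf quoted in the thread, with an explicit size check. */
    public double pdf(double[] x) {
      if (x.length != mean.length) {
        throw new IllegalArgumentException(
            "Cardinality mismatch: " + x.length + " vs " + mean.length);
      }
      double sqDist = 0.0; // equals x.x - 2 * x.mean + mean.mean
      for (int i = 0; i < x.length; i++) {
        double d = x[i] - mean[i];
        sqDist += d * d;
      }
      return Math.exp(-sqDist / (2 * stdDev * stdDev))
          / (stdDev * Math.sqrt(2 * Math.PI));
    }
  }

  /** Creates prior models whose mean has the data's cardinality instead of a fixed 2. */
  public static SimpleNormalModel[] sampleFromPrior(int howMany, int cardinality) {
    SimpleNormalModel[] result = new SimpleNormalModel[howMany];
    for (int i = 0; i < howMany; i++) {
      // new double[cardinality] is all zeros, like an empty DenseVector(n).
      result[i] = new SimpleNormalModel(new double[cardinality], 1.0);
    }
    return result;
  }

  public static void main(String[] args) {
    double[] x = {0.5, 1.0, 0.0, 2.0}; // a data vector of cardinality 4
    SimpleNormalModel[] models = sampleFromPrior(3, x.length);
    System.out.println(models[0].pdf(x)); // sizes now agree, so no error
  }
}
```

How the cardinality would reach the model distribution in the real code (a
constructor argument, a job parameter, or something else) is left open here.
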
>>>
>>>
>>>
>>> Bogdan Vatkov wrote:
>>>
>>>
>>>
>>>> I see this stack at the point where the size of the mean vector is set to 2:
>>>>
>>>> Daemon Thread [Thread-9] (Suspended (breakpoint at line 48 in NormalModel))
>>>> NormalModel.<init>(Vector, double) line: 48
>>>> NormalModelDistribution.sampleFromPrior(int) line: 33
>>>> DirichletState<O>.<init>(ModelDistribution<O>, int, double, int, int) line: 48
>>>> DirichletDriver.createState(String, int, double) line: 172
>>>> DirichletDriver.writeInitialState(String, String, String, int, double) line: 150
>>>> DirichletDriver.runJob(String, String, String, int, int, double, int) line: 133
>>>> DirichletDriver.main(String[]) line: 109
>>>> Clusters.doClustering() line: 244
>>>> Clusters.access$0(Clusters) line: 175
>>>> Clusters$1.run() line: 148
>>>> Thread.run() line: 619
>>>>
>>>>
>>>> public class NormalModelDistribution implements ModelDistribution<Vector> {
>>>>   @Override
>>>>   public Model<Vector>[] sampleFromPrior(int howMany) {
>>>>     Model<Vector>[] result = new NormalModel[howMany];
>>>>     for (int i = 0; i < howMany; i++) {
>>>>       result[i] = new NormalModel(new DenseVector(2), 1);
>>>>     }
>>>>     return result;
>>>>   }
>>>>
>>>> Later this mean vector is dotted with the x vector in pdf():
>>>>  @Override
>>>>  public double pdf(Vector x) {
>>>>   double sd2 = stdDev * stdDev;
>>>>   double exp = -(x.dot(x) - 2 * x.dot(mean) + mean.dot(mean)) / (2 * sd2);
>>>>   double ex = Math.exp(exp);
>>>>   return ex / (stdDev * sqrt2pi);
>>>>  }
>>>>
>>>> The x vector comes from the Hadoop MapRunner through the map function:
>>>>
>>>>  public void map(WritableComparable<?> key, Vector v,
>>>>                 OutputCollector<Text, Vector> output, Reporter reporter)
>>>>      throws IOException {
>>>>
>>>>
>>>> any idea?
>>>>
>>>> Btw, I am running Mahout 0.2... should I move to 0.3 or to trunk? Is it
>>>> safe enough to run against trunk?
>>>>
>>>> On Wed, Jan 13, 2010 at 10:13 PM, Ted Dunning <[email protected]>
>>>> wrote:
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> On Wed, Jan 13, 2010 at 11:53 AM, Bogdan Vatkov <[email protected]> wrote:
>>>>>
>>>>>> Sorry, what does that mean :)?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> It means that there is probably a programming bug somehow.  At the very
>>>>> least, the program is not robust with respect to strange invocations.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> what is a dotted vector? and why aren't they the same?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> The dot product is a vector operation that sums the products of the
>>>>> corresponding elements of the two vectors being operated on.  If the
>>>>> vectors don't have the same length, then it is an error.
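
A minimal sketch of the dot product described above, using plain double[]
arrays rather than Mahout's Vector classes. The class name DotProductSketch
and the exception message are illustrative assumptions, not Mahout's API.

```java
public class DotProductSketch {

  /** Sum of products of corresponding elements; rejects mismatched lengths. */
  public static double dot(double[] a, double[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException(
          "Cardinality mismatch: " + a.length + " vs " + b.length);
    }
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  public static void main(String[] args) {
    double[] mean = new double[2];  // two zeros, like an empty DenseVector(2)
    double[] x = {1.0, 2.0, 3.0};   // a data vector of cardinality 3
    System.out.println(dot(x, x));  // same length, so this succeeds
    try {
      dot(x, mean);                 // lengths 3 and 2, so this is an error
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```
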
>>>>>
>>>>>> what should I investigate?
>>>>>>
>>>>> I am not familiar with the code, but if I had time to look, my strategy
>>>>> would be to start in the NormalModel and work back up the stack trace to
>>>>> find out how the vectors came to be different lengths.  No doubt, the code
>>>>> in NormalModel will not tell you anything, but you can see which vectors
>>>>> are involved and by walking up the stack you may be able to see where they
>>>>> come from.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> I am basically running my complete kmeans scenario (same input data,
>>>>>> same
>>>>>> number of clusters param, etc.) but just replacing KmeansDriver.main
>>>>>> step
>>>>>> with a DirichletDriver.main call...of course the arguments are
>>>>>> adjusted
>>>>>> since kmeans and dirichlet do not have the same arguments.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> I would think that this sounds very plausible.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> I am not sure what number I should give for the alpha argument,
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> Alpha should have a value in the range from 0.01 to 20.  I would scan with
>>>>> 1, 2, 5 magnitude steps to see what works well for your data (i.e. 0.01,
>>>>> 0.02, 0.05, 0.1, 0.2 ... 20).  A value of 1 is a fine place to start.  The
>>>>> effect of different values should be small over a pretty wide range.
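
The 1, 2, 5 scan above could be generated along these lines. This is only a
sketch: AlphaScanSketch and alphaGrid are hypothetical names, and how each
value is fed to a clustering run is left out.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

public class AlphaScanSketch {

  /**
   * Builds a 1-2-5 grid: 1, 2 and 5 times each power of ten from
   * 10^minExp to 10^maxExp, keeping only values that do not exceed max.
   */
  public static List<Double> alphaGrid(int minExp, int maxExp, BigDecimal max) {
    int[] mantissas = {1, 2, 5};
    List<Double> grid = new ArrayList<>();
    for (int exp = minExp; exp <= maxExp; exp++) {
      for (int m : mantissas) {
        // BigDecimal keeps the decimal steps exact before converting to double.
        BigDecimal alpha = BigDecimal.valueOf(m).scaleByPowerOfTen(exp);
        if (alpha.compareTo(max) <= 0) {
          grid.add(alpha.doubleValue());
        }
      }
    }
    return grid;
  }

  public static void main(String[] args) {
    // Range suggested in the thread: 0.01 up to 20.
    for (double alpha : alphaGrid(-2, 1, new BigDecimal("20"))) {
      // Each value would be used as the "--alpha" argument of one run.
      System.out.println("--alpha " + alpha);
    }
  }
}
```

Since the effect is expected to be small over a wide range, a coarse grid like
this is probably enough before trying anything finer.
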
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> iterations
>>>>>> and reductions...here is my current argument set:
>>>>>>
>>>>>> args = new String[] {
>>>>>> "--input",
>>>>>> "/store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/solr_index.vec",
>>>>>> "--output", config.getClustersDir(),
>>>>>> "--modelClass",
>>>>>> "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution",
>>>>>> "--maxIter", "15",
>>>>>> "--alpha", "1.0",
>>>>>> "--k", config.getClustersCount(),
>>>>>> "--maxRed", "2"
>>>>>> };
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> Not off-hand.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>>
>
>


-- 
Best regards,
Bogdan
