Ah, ok, perhaps I will start with something similar and see how far I can get with Dirichlet.

Bogdan Vatkov wrote:
Unfortunately, I am using private data which I cannot share. I am using
emails indexed by Solr and then creating vectors out of them. I am using
them with k-means and everything is OK. I just wanted to try out the
Dirichlet algorithm.

On Thu, Jan 14, 2010 at 8:49 PM, Jeff Eastman <[email protected]> wrote:

I gather you are doing text clustering? Are you using one of our example
datasets or one which is publicly available?



Bogdan Vatkov wrote:

Hi Jeff,

What kind of details do you need to continue?
In the meantime I am going back to kmeans anyway (maybe I should really
start by adding canopy to my kmeans-only scenario first ;)).

Best regards,
Bogdan

On Thu, Jan 14, 2010 at 1:49 AM, Jeff Eastman <[email protected]> wrote:

I think KMeans and Canopy are the most-used and therefore the most robust.
Dirichlet still has not seen much use beyond some test examples and
NormalModel has at least one known problem (with sample() only returning
the maximum likelihood) that has been reported but never fixed. Can you
point me to the problem you are running so I can try to get up to speed?
It has been some time since I worked in this code but I'm keen to do so
and I have some time to invest.

Jeff



Bogdan Vatkov wrote:

But if I am the first one to use Dirichlet, which algorithm is the
recommended one? Are all the other algorithms better than Dirichlet, so
that no one has used it ;)?

On Thu, Jan 14, 2010 at 1:23 AM, Jeff Eastman <[email protected]> wrote:



The NormalModelDistribution seems to still think all the data vectors are
size=2. In sampleFromPrior, it is creating models with that size.
Subsequently, when you calculate the pdf with your data value (x), the
sizes are incompatible. I suggest changing 'DenseVector(2)' to
'DenseVector(n)', where n is your data cardinality. Please also look at
the rest of the math in DenseVector with suspicion. AFAIK, you are the
first person to try to use Dirichlet.
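
A minimal sketch of the suggested change, written against plain arrays rather than the Mahout classes. The names PriorCardinalitySketch and SimpleNormalModel and the extra cardinality parameter are hypothetical, introduced only for illustration; the point is that the prior's mean vector is sized from the data instead of being fixed at 2:

```java
final class PriorCardinalitySketch {

    /** Hypothetical stand-in for a normal model: a mean vector plus one standard deviation. */
    static final class SimpleNormalModel {
        final double[] mean;
        final double stdDev;

        SimpleNormalModel(double[] mean, double stdDev) {
            this.mean = mean;
            this.stdDev = stdDev;
        }
    }

    /**
     * Creates howMany prior models. Each mean is a zero vector of length
     * cardinality (the "n" in the suggestion) rather than a hard-coded 2.
     */
    static SimpleNormalModel[] sampleFromPrior(int howMany, int cardinality) {
        SimpleNormalModel[] result = new SimpleNormalModel[howMany];
        for (int i = 0; i < howMany; i++) {
            result[i] = new SimpleNormalModel(new double[cardinality], 1.0);
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed example: 3 prior models for data vectors with 5 elements.
        SimpleNormalModel[] models = sampleFromPrior(3, 5);
        System.out.println(models.length + " models, mean length " + models[0].mean.length);
    }
}
```

In the real code the cardinality would have to be passed to the model distribution somehow (for example through a constructor argument); how best to do that in Mahout is not settled in this thread.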



Bogdan Vatkov wrote:





I see this stack when the size of the mean vector is set to 2:

Daemon Thread [Thread-9] (Suspended (breakpoint at line 48 in NormalModel))
  NormalModel.<init>(Vector, double) line: 48
  NormalModelDistribution.sampleFromPrior(int) line: 33
  DirichletState<O>.<init>(ModelDistribution<O>, int, double, int, int) line: 48
  DirichletDriver.createState(String, int, double) line: 172
  DirichletDriver.writeInitialState(String, String, String, int, double) line: 150
  DirichletDriver.runJob(String, String, String, int, int, double, int) line: 133
  DirichletDriver.main(String[]) line: 109
  Clusters.doClustering() line: 244
  Clusters.access$0(Clusters) line: 175
  Clusters$1.run() line: 148
  Thread.run() line: 619


public class NormalModelDistribution implements ModelDistribution<Vector> {
  @Override
  public Model<Vector>[] sampleFromPrior(int howMany) {
    Model<Vector>[] result = new NormalModel[howMany];
    for (int i = 0; i < howMany; i++) {
      result[i] = new NormalModel(new DenseVector(2), 1);
    }
    return result;
  }

and later, in pdf(), this mean vector is dotted with the x vector:

  @Override
  public double pdf(Vector x) {
    double sd2 = stdDev * stdDev;
    // x.x - 2 x.mean + mean.mean is the squared distance between x and mean
    double exp = -(x.dot(x) - 2 * x.dot(mean) + mean.dot(mean)) / (2 * sd2);
    double ex = Math.exp(exp);
    return ex / (stdDev * sqrt2pi);
  }

The x vector is coming from the Hadoop MapRunner through the map function:

  public void map(WritableComparable<?> key, Vector v,
                  OutputCollector<Text, Vector> output, Reporter reporter)
      throws IOException {

any idea?

btw, I am running Mahout 0.2... should I move to 0.3 or to trunk? Is it
safe enough to run against trunk?

On Wed, Jan 13, 2010 at 10:13 PM, Ted Dunning <[email protected]> wrote:

On Wed, Jan 13, 2010 at 11:53 AM, Bogdan Vatkov <[email protected]> wrote:

Sorry, what does that mean :)?

It means that there is probably a programming bug somehow. At the very
least, the program is not robust with respect to strange invocations.

What is a dotted vector? And why aren't they the same?

A dot product is a vector operation that is the sum of the products of
corresponding elements of the two vectors being operated on. If these
vectors don't have the same length, then it is an error.
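
As an illustration of that definition, here is a small sketch on plain arrays. It is not the Mahout Vector implementation; the class name DotProductSketch and the wording of the error message are made up for the example:

```java
final class DotProductSketch {

    /** Sum of products of corresponding elements; rejects vectors of different lengths. */
    static double dot(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "Length mismatch: " + a.length + " vs " + b.length);
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        // 1*4 + 2*5 + 3*6 = 32
        System.out.println(dot(new double[] {1, 2, 3}, new double[] {4, 5, 6}));
        try {
            // A length-2 mean against a length-3 data vector, as in the failing case.
            dot(new double[] {0, 0}, new double[] {1, 2, 3});
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A mean vector of length 2 dotted with a longer data vector hits the error branch, which is the situation the stack trace points at.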

What should I investigate?

I am not familiar with the code, but if I had time to look, my strategy
would be to start in the NormalModel and work back up the stack trace to
find out how the vectors came to be different lengths. No doubt, the code
in NormalModel will not tell you anything, but you can see which vectors
are involved and by walking up the stack you may be able to see where
they come from.








I am basically running my complete kmeans scenario (same input data, same
number of clusters param, etc.) but just replacing the KMeansDriver.main
step with a DirichletDriver.main call... of course the arguments are
adjusted, since kmeans and dirichlet do not have the same arguments.







I would think that this sounds very plausible.








I am not sure what number I should give for the alpha argument,

Alpha should have a value in the range from 0.01 to 20. I would scan with
1, 2, 5 magnitude steps to see what works well for your data (i.e. 0.01,
0.02, 0.05, 0.1, 0.2 ... 20). A value of 1 is a fine place to start. The
effect of different values should be small over a pretty wide range.
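
The 1, 2, 5 scan described above could be generated along these lines. This is only a sketch: the class and method names are hypothetical, and each resulting value would still have to be passed to the driver as the --alpha argument by hand:

```java
import java.util.ArrayList;
import java.util.List;

final class AlphaScanSketch {

    /** Builds the grid 0.01, 0.02, 0.05, 0.1, ... up to and including 20. */
    static List<Double> alphaGrid() {
        List<Double> grid = new ArrayList<>();
        // Work in integer hundredths (1 means 0.01, 2000 means 20) so that
        // each value is produced by a single division.
        for (int decade = 1; decade <= 1000; decade *= 10) {
            for (int step : new int[] {1, 2, 5}) {
                int hundredths = step * decade;
                if (hundredths <= 2000) {
                    grid.add(hundredths / 100.0);
                }
            }
        }
        return grid;
    }

    public static void main(String[] args) {
        for (double alpha : alphaGrid()) {
            // One clustering run per candidate value, e.g. "--alpha", Double.toString(alpha)
            System.out.println(alpha);
        }
    }
}
```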








iterations and reductions... here is my current argument set:

args = new String[] {
    "--input", "/store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/solr_index.vec",
    "--output", config.getClustersDir(),
    "--modelClass", "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution",
    "--maxIter", "15",
    "--alpha", "1.0",
    "--k", config.getClustersCount(),
    "--maxRed", "2"
};








Not off-hand.
















