The NormalModelDistribution still seems to assume that all data vectors have size 2. In sampleFromPrior, it creates models with that size, so when you later calculate the pdf with your data value (x), the sizes are incompatible. I suggest changing 'DenseVector(2)' to 'DenseVector(n)', where n is your data cardinality. Please also look at the rest of the math in DenseVector with suspicion. AFAIK, you are the first person to try to use Dirichlet.
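
A minimal sketch of the suggested change, using plain double[] arrays in place of Mahout's Vector classes so that it runs on its own. The class PriorSketch, its method names, and the parameter n are hypothetical and not part of Mahout; the sketch only illustrates that the prior's mean must have the same cardinality as the data before the exponent from pdf can be evaluated.

```java
public class PriorSketch {
    // Hypothetical stand-in for sampleFromPrior: every model mean gets n entries,
    // where n is the cardinality of the data vectors (instead of a hard-coded 2).
    static double[][] sampleMeansFromPrior(int howMany, int n) {
        double[][] means = new double[howMany][];
        for (int i = 0; i < howMany; i++) {
            means[i] = new double[n]; // zero mean of the right size
        }
        return means;
    }

    // Dot product that rejects vectors of different lengths.
    static double dot(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "cardinality mismatch: " + a.length + " vs " + b.length);
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Same exponent as in the quoted pdf: -(x.x - 2 x.mean + mean.mean) / (2 sd^2).
    static double exponent(double[] x, double[] mean, double stdDev) {
        double sd2 = stdDev * stdDev;
        return -(dot(x, x) - 2 * dot(x, mean) + dot(mean, mean)) / (2 * sd2);
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0};
        double[][] means = sampleMeansFromPrior(4, x.length);
        System.out.println(exponent(x, means[0], 1.0)); // prints -7.0
    }
}
```

With a mean of size 2 and a data vector of size 3, the same call to exponent would fail in dot, which is the kind of mismatch the stack trace points to.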

Bogdan Vatkov wrote:
I see this stack when the size of the mean vector is set to 2:

Daemon Thread [Thread-9] (Suspended (breakpoint at line 48 in NormalModel))
NormalModel.<init>(Vector, double) line: 48
NormalModelDistribution.sampleFromPrior(int) line: 33
DirichletState<O>.<init>(ModelDistribution<O>, int, double, int, int) line: 48
DirichletDriver.createState(String, int, double) line: 172
DirichletDriver.writeInitialState(String, String, String, int, double) line: 150
DirichletDriver.runJob(String, String, String, int, int, double, int) line: 133
DirichletDriver.main(String[]) line: 109
Clusters.doClustering() line: 244
Clusters.access$0(Clusters) line: 175
Clusters$1.run() line: 148
Thread.run() line: 619


public class NormalModelDistribution implements ModelDistribution<Vector> {
  @Override
  public Model<Vector>[] sampleFromPrior(int howMany) {
    Model<Vector>[] result = new NormalModel[howMany];
    for (int i = 0; i < howMany; i++) {
      result[i] = new NormalModel(new DenseVector(2), 1);
    }
    return result;
  }

and later this vector is dotted with x in pdf:
  @Override
  public double pdf(Vector x) {
    double sd2 = stdDev * stdDev;
    double exp = -(x.dot(x) - 2 * x.dot(mean) + mean.dot(mean)) / (2 * sd2);
    double ex = Math.exp(exp);
    return ex / (stdDev * sqrt2pi);
  }

The x vector comes from the Hadoop MapRunner through the map function:

  public void map(WritableComparable<?> key, Vector v,
                  OutputCollector<Text, Vector> output, Reporter reporter)
      throws IOException {


Any idea?

By the way, I am running Mahout 0.2. Should I move to 0.3 or to trunk? Is it
safe enough to run against trunk?

On Wed, Jan 13, 2010 at 10:13 PM, Ted Dunning <[email protected]> wrote:

On Wed, Jan 13, 2010 at 11:53 AM, Bogdan Vatkov <[email protected]> wrote:
Sorry, what does that mean :)?

It means that there is probably a programming bug somehow.  At the very
least, the program is not robust with respect to strange invocations.


What is a dotted vector? And why aren't they the same?

The dot product is a vector operation: the sum of the products of
corresponding elements of the two vectors being operated on.  If the
vectors don't have the same length, it is an error.
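
As a small illustration of that definition, here is a sketch on plain double[] arrays. The class DotProductSketch and its dot method are hypothetical helpers for this example, not Mahout's Vector.dot; the length check mirrors the kind of error described above.

```java
public class DotProductSketch {
    // Sum of products of corresponding elements; mismatched lengths are an error.
    static double dot(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "length mismatch: " + a.length + " vs " + b.length);
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        // 1*4 + 2*5 + 3*6
        System.out.println(dot(new double[]{1, 2, 3}, new double[]{4, 5, 6})); // prints 32.0

        // A size-2 vector against a size-3 vector cannot be dotted.
        try {
            dot(new double[]{0, 0}, new double[]{1, 2, 3});
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints length mismatch: 2 vs 3
        }
    }
}
```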

What should I investigate?

I am not familiar with the code, but if I had time to look, my strategy
would be to start in the NormalModel and work back up the stack trace to
find out how the vectors came to be different lengths.  No doubt, the code
in NormalModel will not tell you anything, but you can see which vectors
are involved, and by walking up the stack you may be able to see where they
come from.


I am basically running my complete kmeans scenario (same input data, same
number of clusters param, etc.) but just replacing KmeansDriver.main step
with a DirichletDriver.main call...of course the arguments are adjusted
since kmeans and dirichlet do not have the same arguments.

I would think that this sounds very plausible.


I am not sure what number I should give for the alpha argument,

Alpha should have a value in the range from 0.01 to 20.  I would scan in
1, 2, 5 steps per order of magnitude to see what works well for your data
(i.e. 0.01, 0.02, 0.05, 0.1, 0.2 ... 20).  A value of 1 is a fine place to
start.  The effect of different values should be small over a pretty wide
range.
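
The 1, 2, 5 scan can be generated rather than typed by hand. This is only a sketch: the class AlphaGridSketch and its alphaGrid method are hypothetical names, and BigDecimal is used purely so the printed values are exact decimals. Each value would then be passed as the "--alpha" argument of a separate run.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

public class AlphaGridSketch {
    // Builds 1, 2, 5 multiples of each power of ten from 10^minExp10 to 10^maxExp10,
    // keeping only values that do not exceed maxAlpha.
    static List<BigDecimal> alphaGrid(int minExp10, int maxExp10, BigDecimal maxAlpha) {
        int[] mantissas = {1, 2, 5};
        List<BigDecimal> grid = new ArrayList<>();
        for (int exp10 = minExp10; exp10 <= maxExp10; exp10++) {
            for (int m : mantissas) {
                BigDecimal alpha = BigDecimal.valueOf(m).scaleByPowerOfTen(exp10);
                if (alpha.compareTo(maxAlpha) <= 0) {
                    grid.add(alpha);
                }
            }
        }
        return grid;
    }

    public static void main(String[] args) {
        // Range from the advice above: 0.01 up to 20.
        for (BigDecimal alpha : alphaGrid(-2, 1, new BigDecimal("20"))) {
            System.out.println(alpha.toPlainString());
        }
    }
}
```

For the range 0.01 to 20 this yields eleven candidate values, from 0.01 up to 20.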


iterations and reductions... here is my current argument set:

args = new String[] {
    "--input",
    "/store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/solr_index.vec",
    "--output", config.getClustersDir(),
    "--modelClass",
    "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution",
    "--maxIter", "15",
    "--alpha", "1.0",
    "--k", config.getClustersCount(),
    "--maxRed", "2"
};


Not off-hand.




