I see this stack trace at the point where the size of the mean vector is set to 2:

Daemon Thread [Thread-9] (Suspended (breakpoint at line 48 in NormalModel))
NormalModel.<init>(Vector, double) line: 48
NormalModelDistribution.sampleFromPrior(int) line: 33
DirichletState<O>.<init>(ModelDistribution<O>, int, double, int, int) line: 48
DirichletDriver.createState(String, int, double) line: 172
DirichletDriver.writeInitialState(String, String, String, int, double) line: 150
DirichletDriver.runJob(String, String, String, int, int, double, int) line: 133
DirichletDriver.main(String[]) line: 109
Clusters.doClustering() line: 244
Clusters.access$0(Clusters) line: 175
Clusters$1.run() line: 148
Thread.run() line: 619


public class NormalModelDistribution implements ModelDistribution<Vector> {
  @Override
  public Model<Vector>[] sampleFromPrior(int howMany) {
    Model<Vector>[] result = new NormalModel[howMany];
    for (int i = 0; i < howMany; i++) {
      result[i] = new NormalModel(new DenseVector(2), 1);
    }
    return result;
  }

Later, this mean vector is dotted with x in pdf():
  @Override
  public double pdf(Vector x) {
    double sd2 = stdDev * stdDev;
    double exp = -(x.dot(x) - 2 * x.dot(mean) + mean.dot(mean)) / (2 * sd2);
    double ex = Math.exp(exp);
    return ex / (stdDev * sqrt2pi);
  }

The x vector comes from the Hadoop MapRunner through the map function:

  public void map(WritableComparable<?> key, Vector v,
                  OutputCollector<Text, Vector> output, Reporter reporter)
      throws IOException {

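To make the suspected mismatch concrete, here is a minimal sketch using plain double arrays rather than Mahout's Vector classes. The class name DotMismatchSketch, the explicit length check, and the 5-element input are my own assumptions for illustration, and I am assuming sqrt2pi is sqrt(2*pi). It only shows that a mean of size 2 cannot be dotted with an input of any other size:

```java
// Plain-array sketch; this is NOT Mahout's Vector API.
class DotMismatchSketch {

  // Sum of products of corresponding elements; rejects vectors of different length.
  static double dot(double[] a, double[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException(
          "cardinality mismatch: " + a.length + " vs " + b.length);
    }
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  // Same expression as in pdf() above, with mean and stdDev passed in explicitly.
  static double pdf(double[] x, double[] mean, double stdDev) {
    double sd2 = stdDev * stdDev;
    double exp = -(dot(x, x) - 2 * dot(x, mean) + dot(mean, mean)) / (2 * sd2);
    return Math.exp(exp) / (stdDev * Math.sqrt(2.0 * Math.PI));
  }

  public static void main(String[] args) {
    double[] mean = new double[2];     // plays the role of new DenseVector(2)
    double[] sameSize = {1.0, 1.0};
    double[] tooLong = new double[5];  // hypothetical input with 5 elements

    System.out.println(pdf(sameSize, mean, 1.0));
    try {
      pdf(tooLong, mean, 1.0);         // x.dot(x) works, x.dot(mean) fails
    } catch (IllegalArgumentException e) {
      System.out.println("failed: " + e.getMessage());
    }
  }
}
```

If my input vectors from the Solr index have more than 2 elements, this would be consistent with the failure I am seeing.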

Any idea?

Btw, I am running Mahout 0.2. Should I move to 0.3 or to trunk? Is it safe
enough to run against trunk?

On Wed, Jan 13, 2010 at 10:13 PM, Ted Dunning wrote:

> On Wed, Jan 13, 2010 at 11:53 AM, Bogdan Vatkov wrote:
>
> > Sorry, what does that mean :)?
> >
>
> It means that there is probably a programming bug somehow.  At the very
> least, the program is not robust with respect to strange invocations.
>
>
> > what is a dotted vector? and why aren't they the same?
> >
>
> dot product is a vector operation that is the sum of products of
> corresponding elements of the two vectors being operated on.  If these
> vectors don't have the same length, then it is an error.
>
> > what should I investigate?
> >
>
> I am not familiar with the code, but if I had time to look, my strategy
> would be to start in the NormalModel and work back up the stack trace to
> find out how the vectors came to be different lengths.  No doubt, the code
> in NormalModel will not tell you anything, but you can see which vectors are
> involved and by walking up the stack you may be able to see where they come
> from.
>
>
> > I am basically running my complete kmeans scenario (same input data, same
> > number of clusters param, etc.) but just replacing KmeansDriver.main step
> > with a DirichletDriver.main call...of course the arguments are adjusted
> > since kmeans and dirichlet do not have the same arguments.
> >
>
> I would think that this sounds very plausible.
>
>
> > I am not sure what number I should give for the alpha argument,
>
>
> Alpha should have a value in the range from 0.01 to 20.  I would scan with
> 1,2, 5 magnitude steps to see what works well for your data.  (i.e. 0.01,
> 0.02, 0.05, 0.1, 0.2 ... 20).  A value of 1 is a fine place to start.  The
> effect of different values should be small over a pretty wide range.
>
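For reference, the 1, 2, 5 scan described above could be generated with something like the following. This is only a sketch: the class name AlphaSweepSketch and the method alphas() are made up for illustration and are not part of Mahout. Each value would then be passed as the "--alpha" argument of a separate run.

```java
// Hypothetical helper, not part of Mahout: builds the 1-2-5 alpha scan.
class AlphaSweepSketch {

  // Returns 0.01, 0.02, 0.05, 0.1, ... up to and including 20 (11 values).
  static double[] alphas() {
    int[] mantissas = {1, 2, 5};
    double[] out = new double[11];
    int n = 0;
    for (int exp = -2; exp <= 1; exp++) {
      for (int m : mantissas) {
        double a = m * Math.pow(10, exp);
        if (a <= 20.0) {               // drops 50, the only value above the range
          out[n++] = a;
        }
      }
    }
    return out;
  }

  public static void main(String[] args) {
    for (double a : alphas()) {
      // Each value is a candidate for one clustering run.
      System.out.println("--alpha " + a);
    }
  }
}
```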
>
> > iterations
> > and reductions...here is my current argument set:
> >
> > args = new String[] {
> > "--input",
> > "/store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/solr_index.vec",
> > "--output", config.getClustersDir(),
> > "--modelClass",
> > "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution",
> > "--maxIter", "15",
> > "--alpha", "1.0",
> > "--k", config.getClustersCount(),
> > "--maxRed", "2"
> > };
> >
> >
> Not off-hand.
>



-- 
Best regards,
Bogdan
