I'm trying to run the Dirichlet clustering example from (http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html). The command line:

$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.1.job org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job

... loads our example jar file which contains the following structure:

>jar -tf mahout-examples-0.1.job
META-INF/
...
org/apache/mahout/clustering/syntheticcontrol/dirichlet/Job.class
org/apache/mahout/clustering/syntheticcontrol/dirichlet/NormalScModel.class
org/apache/mahout/clustering/syntheticcontrol/dirichlet/NormalScModelDistribution.class
org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.class
...
lib/mahout-core-0.1-tests.jar
lib/mahout-core-0.1.jar
lib/hadoop-core-0.19.1.jar
...

The dirichlet/Job first runs a map-reduce job to convert the input data into Mahout Vector format and then runs the DirichletDriver.runJob() method contained in the lib/mahout-core-0.1.jar. This method calls DirichletDriver.createState() which initializes a NormalScModelDistribution with a set of NormalScModels that represent the prior state of the clustering. This state is then written to HDFS and the job begins running the iterations which assign input data points to the models. So far so good.

public static DirichletState<Vector> createState(String modelFactory, int numModels, double alpha_0)
    throws ClassNotFoundException, InstantiationException, IllegalAccessException {
  ClassLoader ccl = Thread.currentThread().getContextClassLoader();
  Class<?> cl = ccl.loadClass(modelFactory);
  ModelDistribution<Vector> factory = (ModelDistribution<Vector>) cl.newInstance();
  DirichletState<Vector> state = new DirichletState<Vector>(factory, numModels, alpha_0, 1, 1);
  return state;
}
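
One workaround I could try is to stop relying on the context class loader alone and fall back to other loaders. The following is only a sketch of that idea using plain JDK calls: the class name FallbackLoading and the method loadWithFallback are hypothetical and not part of Mahout, and whether any of the candidate loaders can actually see the example classes inside a Hadoop task is exactly what I don't know yet.

```java
public class FallbackLoading {

    /**
     * Tries each candidate loader in order and returns the first class found.
     * Hypothetical helper, not part of Mahout or Hadoop.
     */
    public static Class<?> loadWithFallback(String className, ClassLoader... candidates)
            throws ClassNotFoundException {
        ClassNotFoundException last = null;
        for (ClassLoader loader : candidates) {
            if (loader == null) {
                continue; // skip loaders that are not set
            }
            try {
                return Class.forName(className, true, loader);
            } catch (ClassNotFoundException e) {
                last = e; // remember the failure and try the next loader
            }
        }
        throw last != null ? last : new ClassNotFoundException(className);
    }

    public static void main(String[] args) throws Exception {
        String name = args.length > 0 ? args[0] : "java.util.ArrayList";
        Class<?> cl = loadWithFallback(
                name,
                Thread.currentThread().getContextClassLoader(),
                FallbackLoading.class.getClassLoader());
        System.out.println("Loaded " + cl.getName() + " via " + cl.getClassLoader());
    }
}
```

In createState() this would replace the single ccl.loadClass(modelFactory) call, with the context loader first and the loader of the calling class second.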


In the DirichletMapper, which also lives in the lib/mahout jar, the configure() method reads in the current model state by calling DirichletDriver.createState(). In this invocation, however, it throws a ClassNotFoundException:

09/03/22 09:33:03 INFO mapred.JobClient: Task Id : attempt_200903211441_0025_m_000000_2, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.mahout.clustering.syntheticcontrol.dirichlet.NormalScModelDistribution
    at org.apache.mahout.clustering.dirichlet.DirichletMapper.getDirichletState(DirichletMapper.java:97)
    at org.apache.mahout.clustering.dirichlet.DirichletMapper.configure(DirichletMapper.java:61)

The kMeans job, which uses the same class loader code to load its distance measure in similar driver code, works fine. The difference is that the referenced distance measure is contained in the mahout-core-0.1.jar, not the mahout-examples-0.1.job. Both jobs run fine in test mode from Eclipse.

It would seem that there is some subtle difference in the class loader structures used by the DirichletDriver and DirichletMapper process invocations. In the former, the driver code is called by code living in the example jar; in the latter, the driver code is called by code living in the mahout jar. It's as if the first case can see into the lib/mahout classes but the second cannot see out to the classes in the example jar.
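
To check this guess, a small probe could be run from both places (the driver and the mapper's configure()) to compare what each class loader can see. This is a minimal sketch using only JDK calls; LoaderProbe, chain() and canLoad() are hypothetical names, not Mahout or Hadoop APIs.

```java
import java.util.ArrayList;
import java.util.List;

public class LoaderProbe {

    /** Lists the loader classes from the given loader up to the bootstrap loader. */
    public static List<String> chain(ClassLoader loader) {
        List<String> names = new ArrayList<>();
        for (ClassLoader cur = loader; cur != null; cur = cur.getParent()) {
            names.add(cur.getClass().getName());
        }
        names.add("bootstrap"); // a null parent means the bootstrap loader
        return names;
    }

    /** Returns true if the given loader can resolve the class name. */
    public static boolean canLoad(ClassLoader loader, String className) {
        try {
            Class.forName(className, false, loader); // do not initialize the class
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "java.lang.String";
        ClassLoader context = Thread.currentThread().getContextClassLoader();
        ClassLoader defining = LoaderProbe.class.getClassLoader();

        System.out.println("context loader chain:  " + chain(context));
        System.out.println("defining loader chain: " + chain(defining));
        System.out.println("context can load " + name + ": " + canLoad(context, name));
        System.out.println("defining can load " + name + ": " + canLoad(defining, name));
    }
}
```

Calling canLoad() with the NormalScModelDistribution class name in both the driver and the mapper, and comparing the two chains, should show whether the loaders really differ between the two invocations.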

Can anybody clarify what is going on and how to fix it?

Jeff
