I'm trying to run the Dirichlet clustering example from
http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html. The command
line:
  $HADOOP_HOME/bin/hadoop jar \
      $MAHOUT_HOME/examples/target/mahout-examples-0.1.job \
      org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job
... loads our example jar file, which contains the following structure:
>jar -tf mahout-examples-0.1.job
META-INF/
...
org/apache/mahout/clustering/syntheticcontrol/dirichlet/Job.class
org/apache/mahout/clustering/syntheticcontrol/dirichlet/NormalScModel.class
org/apache/mahout/clustering/syntheticcontrol/dirichlet/NormalScModelDistribution.class
org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.class
...
lib/mahout-core-0.1-tests.jar
lib/mahout-core-0.1.jar
lib/hadoop-core-0.19.1.jar
...
The dirichlet/Job first runs a map-reduce job to convert the input data
into Mahout Vector format and then runs the DirichletDriver.runJob()
method contained in the lib/mahout-core-0.1.jar. This method calls
DirichletDriver.createState() which initializes a
NormalScModelDistribution with a set of NormalScModels that represent
the prior state of the clustering. This state is then written to HDFS
and the job begins running the iterations which assign input data points
to the models. So far so good.
  public static DirichletState<Vector> createState(String modelFactory,
      int numModels, double alpha_0) throws ClassNotFoundException,
      InstantiationException, IllegalAccessException {
    ClassLoader ccl = Thread.currentThread().getContextClassLoader();
    Class<?> cl = ccl.loadClass(modelFactory);
    ModelDistribution<Vector> factory =
        (ModelDistribution<Vector>) cl.newInstance();
    DirichletState<Vector> state =
        new DirichletState<Vector>(factory, numModels, alpha_0, 1, 1);
    return state;
  }
In the DirichletMapper, also in the lib/mahout jar, the configure()
method reads in the current model state by calling
DirichletDriver.createState(). In this invocation, however, it throws a
ClassNotFoundException:
09/03/22 09:33:03 INFO mapred.JobClient: Task Id : attempt_200903211441_0025_m_000000_2, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.mahout.clustering.syntheticcontrol.dirichlet.NormalScModelDistribution
    at org.apache.mahout.clustering.dirichlet.DirichletMapper.getDirichletState(DirichletMapper.java:97)
    at org.apache.mahout.clustering.dirichlet.DirichletMapper.configure(DirichletMapper.java:61)
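One workaround that might be worth trying is to let the loading code fall back to a second class loader when the context class loader cannot find the class. The following is only a minimal sketch using standard JDK calls: the class FallbackLoader and the method loadWithFallback are hypothetical names, not part of Mahout or Hadoop, and whether this helps depends on which loader in the task JVM can actually see the example classes, which I have not established.

```java
final class FallbackLoader {

    /**
     * Hypothetical helper (not a Mahout or Hadoop API): try the thread
     * context class loader first, then the loader that defined 'anchor'.
     */
    static Class<?> loadWithFallback(String className, Class<?> anchor)
            throws ClassNotFoundException {
        ClassLoader ccl = Thread.currentThread().getContextClassLoader();
        if (ccl != null) {
            try {
                return Class.forName(className, true, ccl);
            } catch (ClassNotFoundException e) {
                // Not visible from the context loader; try the anchor's loader.
            }
        }
        // getClassLoader() may return null for bootstrap classes;
        // Class.forName treats a null loader as the bootstrap loader.
        return Class.forName(className, true, anchor.getClassLoader());
    }

    public static void main(String[] args) throws Exception {
        // Stand-in class name; a real caller would pass the model factory name.
        Class<?> c = loadWithFallback("java.util.ArrayList", FallbackLoader.class);
        System.out.println(c.getName());
    }
}
```

In createState() the anchor would have to be a class known to be loaded from the jar that contains the model factory, which is exactly the part that is unclear in the mapper case.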
The kMeans job, which uses the same class loader code to load its
distance measure in similar driver code, works fine. The difference is
that the referenced distance measure is contained in the
mahout-core-0.1.jar, not the mahout-examples-0.1.job. Both jobs run fine
in test mode from Eclipse.
It would seem that there is some subtle difference in the class loader
structures used by the DirichletDriver and DirichletMapper process
invocations. In the former, the driver code is called by code living in
the example jar; in the latter, the driver code is called by code living
in the mahout jar. It's as if the first case can see into the lib/mahout
classes but the second cannot see out to the classes in the example jar.
Can anybody clarify what is going on and how to fix it?
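To compare the two invocations, something like the following could be logged from both the driver and the mapper. It is a rough diagnostic sketch using only standard JDK calls; LoaderChain, describe, and canSee are names I made up for illustration.

```java
final class LoaderChain {

    /** Lists the loader classes from 'loader' up through its parents. */
    static String describe(ClassLoader loader) {
        StringBuilder sb = new StringBuilder();
        for (ClassLoader cl = loader; cl != null; cl = cl.getParent()) {
            sb.append(cl.getClass().getName()).append(" -> ");
        }
        // A null parent conventionally means the bootstrap loader.
        return sb.append("bootstrap").toString();
    }

    /** Reports whether 'loader' can resolve 'className', without initializing it. */
    static boolean canSee(ClassLoader loader, String className) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ClassLoader ccl = Thread.currentThread().getContextClassLoader();
        System.out.println("context loader: " + describe(ccl));
        System.out.println("own loader:     "
            + describe(LoaderChain.class.getClassLoader()));
        System.out.println("sees java.lang.String: "
            + canSee(ccl, "java.lang.String"));
    }
}
```

Calling canSee() with the NormalScModelDistribution class name in both places should show which loader, if any, has the example classes on its path.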
Jeff