[
https://issues.apache.org/jira/browse/MAHOUT-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012960#comment-13012960
]
Julien Nioche commented on MAHOUT-621:
--------------------------------------
From https://issues.apache.org/jira/browse/MAHOUT-368
{quote} > Why not have a bundle artifact where all the Mahout submodules
would be put in a single jar?
How is this not trivial for you to handle with maven?
If you are writing your own maven project (recommended), then
jar-with-dependencies will do what you want.
If you are extending Mahout (ok for prototypes), just put your code in the
examples job jar and all will be good.
{quote}
I am not extending Mahout, and as you've probably seen in the comments above,
the point is to be able to generate Mahout data structures from Behemoth, so
putting the code in examples is not an option anyway.
Back to the original problem. I generate a job file for my Mahout module in
Behemoth (https://github.com/jnioche/behemoth/tree/master/modules/mahout) and
manage the dependencies with Ivy. The main class (SparseVectorsFromBehemoth) is
a slightly modified version of SparseVectorsFromSequenceFiles which gets the
Tokens from Behemoth documents instead of using Lucene and generates the data
structures expected by the classifiers and clusterers.
The job file contains:
* the Behemoth classes for the Mahout module
* the dependencies in /lib including
** mahout-math-0.4.jar
** mahout-core-0.4.jar
The problem I had was the same as Han Hui Wen's (MAHOUT-368), i.e. I was
getting a ClassNotFoundException on org.apache.mahout.math.VectorWritable. My
understanding of the problem is that my main class calls DictionaryVectorizer,
which in my job file sits in lib/mahout-core-0.4.jar and which depends on
VectorWritable in lib/mahout-math-0.4.jar. For some reason MapReduce was not
able to find VectorWritable, which I assume has to do with the jobs in
DictionaryVectorizer calling
'job.setJarByClass(DictionaryVectorizer.class)'.
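To make the failure mode concrete, here is a small analogue (Python zip
imports standing in for jar classloading; the module and file names are made
up, and this is only an illustration of why archives nested under lib/ need
special handling by the framework, not the actual Hadoop mechanism):

```python
import io, sys, zipfile

# Build inner.zip (stand-in for mahout-math-0.4.jar) holding one module,
# then outer.zip (stand-in for the job file) with inner.zip under lib/.
inner = io.BytesIO()
with zipfile.ZipFile(inner, "w") as z:
    z.writestr("vectormod.py", "VALUE = 42\n")  # stand-in for VectorWritable
with zipfile.ZipFile("outer.zip", "w") as z:
    z.writestr("lib/inner.zip", inner.getvalue())

# An archive nested inside another archive is opaque to the importer,
# much as a jar under lib/ is invisible to a plain classpath loader
# unless the framework unpacks it or adds it to the classpath:
sys.path.insert(0, "outer.zip/lib/inner.zip")
try:
    import vectormod
    nested_ok = True
except ImportError:
    nested_ok = False  # this is what happens

# The same archive unpacked to a real file on the path loads fine:
with open("inner.zip", "wb") as f:
    f.write(inner.getvalue())
sys.path.insert(0, "inner.zip")
import vectormod
print(nested_ok, vectormod.VALUE)  # -> False 42
```

If the framework's handling of lib/ changes between versions, the symptom is
exactly a class-not-found on a class that is visibly present in a nested jar.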
I could of course use jar-with-dependencies on the Mahout code to generate a
single jar and then manage that jar locally. However, this would give me very
little control over the dependencies used by Mahout (e.g. versions potentially
conflicting with other components in my job files), and I'd rather rely on
externally published jars anyway. A better option would be to simply unpack
the content of the mahout-core and mahout-math jars into the root of my job
file. At least the Mahout dependencies would then be handled and versioned
normally.
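The unpack-into-root option could look something like this (a rough sketch in
Python, with zip handling standing in for the real Ivy/Ant build step; the
Behemoth class path and the first-wins duplicate policy are illustrative, and
DictionaryVectorizer's package path is only a placeholder):

```python
import zipfile

def merge_into_job_root(job_jar, dep_jars, own_entries):
    """Build a job file whose root holds the unpacked classes of dep_jars
    (e.g. mahout-core/mahout-math) next to our own classes, instead of
    shipping those jars under lib/."""
    seen = set()
    with zipfile.ZipFile(job_jar, "w") as out:
        for name, data in own_entries.items():
            seen.add(name)
            out.writestr(name, data)
        for dep in dep_jars:
            with zipfile.ZipFile(dep) as z:
                for entry in z.namelist():
                    # skip directories; first entry wins on duplicates
                    if entry.endswith("/") or entry in seen:
                        continue
                    seen.add(entry)
                    out.writestr(entry, z.read(entry))

# Demo with dummy jars standing in for the Mahout artifacts
# (class paths are illustrative, not the exact 0.4 package layout):
for jar, cls in [
    ("mahout-core-0.4.jar",
     "org/apache/mahout/vectorizer/DictionaryVectorizer.class"),
    ("mahout-math-0.4.jar",
     "org/apache/mahout/math/VectorWritable.class"),
]:
    with zipfile.ZipFile(jar, "w") as z:
        z.writestr(cls, b"")

merge_into_job_root(
    "behemoth-mahout.job",
    ["mahout-core-0.4.jar", "mahout-math-0.4.jar"],
    {"com/digitalpebble/behemoth/mahout/Main.class": b""})
names = zipfile.ZipFile("behemoth-mahout.job").namelist()
print(names)
```

With this layout every class is at the job file root, so no nested-jar
resolution is needed at all, while each Mahout jar stays a normal, versioned
external artifact until build time.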
I've tried with Hadoop 0.21.0 and did not get this issue, so I suppose that
something must have changed in the way the classloader handles dependencies
within a job file.
Makes sense?
> Support more data import mechanisms
> -----------------------------------
>
> Key: MAHOUT-621
> URL: https://issues.apache.org/jira/browse/MAHOUT-621
> Project: Mahout
> Issue Type: Improvement
> Reporter: Grant Ingersoll
> Labels: gsoc2011, mahout-gsoc-11
>
> We should have more ways of getting data in:
> 1. ARFF (MAHOUT-155)
> 2. CSV (MAHOUT-548)
> 3. Databases
> 4. Behemoth (Tika, Map-Reduce)
> 5. Other
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira