Deploying a jar with a single class extending Analyzer results in an error for a missing org.apache.lucene.analysis.Analyzer

   mahout seq2sparse -i wp-seqfiles/part-r-00000 -o wp-vectors -ow -a
   com.custom.analyzers.LuceneStemmingAnalyzer -chunk 100 -wt tfidf -s
   2 -md 3 -x 95 -ng 2 -ml 50 -seq -n 2
   MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
   Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
   HADOOP_CONF_DIR=/usr/local/hadoop/conf
   MAHOUT-JOB:
   /usr/local/mahout/examples/target/mahout-examples-0.6-job.jar
   12/03/09 14:55:32 INFO vectorizer.SparseVectorsFromSequenceFiles:
   Maximum n-gram size is: 2
   12/03/09 14:55:33 INFO vectorizer.SparseVectorsFromSequenceFiles:
   Minimum LLR value: 50.0
   12/03/09 14:55:33 INFO vectorizer.SparseVectorsFromSequenceFiles:
   Number of reduce tasks: 1
   Exception in thread "main" java.lang.NoClassDefFoundError:
   org/apache/lucene/analysis/Analyzer
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
        at
   java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
        at java.net.URLClassLoader.access$000(URLClassLoader.java:73)

It seems to be finding my custom Lucene analyzer but not the abstract Analyzer class it extends?

If I go back to the WhitespaceAnalyzer, all is well:

   mahout seq2sparse -i wp-seqfiles/part-r-00000 -o wp-vectors -ow -a
   org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 100 -wt tfidf
   -s 2 -md 3 -x 95 -ng 2 -ml 50 -seq -n 2

org.apache.lucene.analysis.Analyzer and org.apache.lucene.analysis.WhitespaceAnalyzer are in the same jar, so I'm confused why it finds one but not the other.

The same code works on my laptop, so my deployment environment must be missing something. Any ideas?
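One way to narrow down a NoClassDefFoundError like this is to print the classpath the job actually sees, one entry per line, and look for lucene-core. This is only a sketch: the variable below stands in for the output of `hadoop classpath` on the cluster node, and the jar paths and version are hypothetical.

```shell
# Hypothetical classpath, standing in for the output of `hadoop classpath`
# on the failing machine; on a real cluster you would pipe that command
# instead of echoing this variable.
JOB_CLASSPATH="/usr/local/hadoop/lib/commons-logging-1.1.1.jar:/usr/local/mahout/custom-analyzers.jar:/usr/local/mahout/lib/lucene-core-3.4.0.jar"

# One entry per line, filtered for lucene; an empty result on the cluster
# (but not on the laptop) would match the NoClassDefFoundError above.
echo "$JOB_CLASSPATH" | tr ':' '\n' | grep -i lucene
```

If the grep comes back empty on the deployment box, the custom-analyzer jar is reachable but the lucene-core jar it depends on is not.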

On 3/7/12 1:24 AM, Abbas wrote:
Hi Bogdan,

This is in reply to your previous post, where you asked about stop words ("word-stoppers") in Mahout.

Well, I was recently fighting with the same thing and found a solution that worked fine. What you should do is:
1. Create your own (customized) Lucene analyzer by extending the Analyzer class
and overriding its tokenStream method.

2. Package your custom analyzer into a jar file. Make sure the Class-Path entry
in its MANIFEST.MF points at your Lucene jar; that path is resolved relative to
where your jar sits on disk.

3. Place the jar in mahout/examples/target/dependency. If you get a
ClassNotFoundException in the next step, try putting the two jar files in
hadoop/lib/ as well, or add them to the HADOOP_CLASSPATH and CLASSPATH
environment variables.

4. Run your seq2sparse command, naming your custom analyzer in the -a
parameter.

5. Run your k-means command as you would otherwise do.
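For step 1, here is a minimal sketch against the Lucene 3.x API that Mahout 0.6 builds against. The class name mirrors the one in the command at the top of the thread, but the filter chain (lowercasing, English stop words, Porter stemming) is an assumption, not the original poster's actual code; in the real jar the class would sit in a package matching the -a argument.

```java
// In the real jar this would be declared in a package matching the
// -a argument, e.g. com.custom.analyzers (omitted here for brevity).
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;

public final class LuceneStemmingAnalyzer extends Analyzer {

    // seq2sparse instantiates the analyzer by reflection, so a public
    // no-argument constructor is required (the implicit one suffices).

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Split on non-letters and lowercase in a single pass.
        TokenStream stream = new LowerCaseTokenizer(reader);
        // Drop common English stop words ("the", "and", ...).
        stream = new StopFilter(true, stream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        // Reduce each remaining token to its Porter stem.
        return new PorterStemFilter(stream);
    }
}
```

Compiling this needs only the lucene-core 3.x jar on the classpath, which is the same jar the cluster has to see at run time.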

Hope this helps.

If you need the complete code for the custom analyzer, let me know.

Thanks
Abbas
