Deploying a jar with a single class extending Analyzer results in an error for a missing org.apache.lucene.analysis.Analyzer

   mahout seq2sparse -i wp-seqfiles/part-r-00000 -o wp-vectors -ow -a
   com.custom.analyzers.LuceneStemmingAnalyzer -chunk 100 -wt tfidf -s
   2 -md 3 -x 95 -ng 2 -ml 50 -seq -n 2
   MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
   Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
   HADOOP_CONF_DIR=/usr/local/hadoop/conf
   MAHOUT-JOB:
   /usr/local/mahout/examples/target/mahout-examples-0.6-job.jar
   12/03/09 14:55:32 INFO vectorizer.SparseVectorsFromSequenceFiles:
   Maximum n-gram size is: 2
   12/03/09 14:55:33 INFO vectorizer.SparseVectorsFromSequenceFiles:
   Minimum LLR value: 50.0
   12/03/09 14:55:33 INFO vectorizer.SparseVectorsFromSequenceFiles:
   Number of reduce tasks: 1
   Exception in thread "main" java.lang.NoClassDefFoundError:
   org/apache/lucene/analysis/Analyzer
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
        at
   java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
        at java.net.URLClassLoader.access$000(URLClassLoader.java:73)

It seems to be finding my custom Lucene analyzer but not the abstract Analyzer class it extends?

If I go back to the WhitespaceAnalyzer, all is well:

   mahout seq2sparse -i wp-seqfiles/part-r-00000 -o wp-vectors -ow -a
   org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 100 -wt tfidf
   -s 2 -md 3 -x 95 -ng 2 -ml 50 -seq -n 2

org.apache.lucene.analysis.Analyzer and org.apache.lucene.analysis.WhitespaceAnalyzer are in the same jar, so I'm confused why it finds one but not the other.

The same code works on my laptop, so my deployment environment must be missing something. Any ideas?
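One way to narrow down a NoClassDefFoundError like this is to print the classpath the job actually sees, one entry per line, and look for lucene-core. This is only a sketch: the variable below stands in for the output of `hadoop classpath` on the cluster node, and the jar paths and version are hypothetical.

```shell
# Hypothetical classpath, standing in for the output of `hadoop classpath`
# on the failing machine; on a real cluster you would pipe that command
# instead of echoing this variable.
JOB_CLASSPATH="/usr/local/hadoop/lib/commons-logging-1.1.1.jar:/usr/local/mahout/custom-analyzers.jar:/usr/local/mahout/lib/lucene-core-3.4.0.jar"

# One entry per line, filtered for lucene; an empty result on the cluster
# (but not on the laptop) would match the NoClassDefFoundError above.
echo "$JOB_CLASSPATH" | tr ':' '\n' | grep -i lucene
```

If the grep comes back empty on the deployment box, the custom-analyzer jar is reachable but the lucene-core jar it depends on is not.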

On 3/7/12 1:24 AM, Abbas wrote:
Hi Bogdan,

This is in reply to your previous post, where you asked about stop words ("word-stoppers") in Mahout.

Well, I was recently fighting with the same thing and found a solution that worked fine. What you should do is:
1. Create your own (customized) Lucene analyzer by extending the Analyzer class
and overriding its tokenStream method.

2. Package your custom analyzer into a jar file. Make sure the Class-Path entry
in its MANIFEST.MF points at your Lucene jar; that path is resolved relative to
where your jar sits on disk.

3. Place the jar in mahout/examples/target/dependency. If you get a
ClassNotFoundException in the next step, try putting the two jar files in
hadoop/lib/ as well, or add them to the HADOOP_CLASSPATH and CLASSPATH
environment variables.

4. Run your seq2sparse command, naming your custom analyzer in the -a
parameter.

5. Run your k-means command as you would otherwise do.
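For step 1, here is a minimal sketch against the Lucene 3.x API that Mahout 0.6 builds against. The class name mirrors the one in the command at the top of the thread, but the filter chain (lowercasing, English stop words, Porter stemming) is an assumption, not the original poster's actual code; in the real jar the class would sit in a package matching the -a argument.

```java
// In the real jar this would be declared in a package matching the
// -a argument, e.g. com.custom.analyzers (omitted here for brevity).
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;

public final class LuceneStemmingAnalyzer extends Analyzer {

    // seq2sparse instantiates the analyzer by reflection, so a public
    // no-argument constructor is required (the implicit one suffices).

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Split on non-letters and lowercase in a single pass.
        TokenStream stream = new LowerCaseTokenizer(reader);
        // Drop common English stop words ("the", "and", ...).
        stream = new StopFilter(true, stream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        // Reduce each remaining token to its Porter stem.
        return new PorterStemFilter(stream);
    }
}
```

Compiling this needs only the lucene-core 3.x jar on the classpath, which is the same jar the cluster has to see at run time.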

Hope this helps.

If you need the complete code for the custom analyzer, let me know.

Thanks
Abbas
