OK, that did work for Mahout, thanks! But now Hadoop cannot load the class, even though the jar containing it has been added to the Hadoop classpath:
hadoop@ubuntu:/home/camilo/mahout-distribution-0.4$ echo $HADOOP_CLASSPATH
/home/camilo/mahout-distribution-0.4/utils/target/dependency/lucene-core-3.0.2.jar:/home/camilo/mahout-distribution-0.4/utils/target/dependency/lucene-analyzers-3.0.2.jar:/home/hadoop/my_analyzer.jar

I get:

hadoop@ubuntu:/home/camilo/mahout-distribution-0.4$ bin/mahout seq2sparse -i /htmless_articles_seq -o /htmless_articles_vectors_2 -wt tfidf -a com.my.analyzers.MyAnalyzer
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
No HADOOP_CONF_DIR set, using /usr/local/hadoop/conf
11/04/21 13:39:33 WARN driver.MahoutDriver: No seq2sparse.props found on classpath, will use command-line arguments only
11/04/21 13:39:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 3
11/04/21 13:39:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
11/04/21 13:39:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
11/04/21 13:39:33 INFO common.HadoopUtil: Deleting /htmless_articles_vectors_2
11/04/21 13:39:33 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/04/21 13:39:33 INFO input.FileInputFormat: Total input paths to process : 1
11/04/21 13:39:33 INFO mapred.JobClient: Running job: job_201104211109_0038
11/04/21 13:39:34 INFO mapred.JobClient:  map 0% reduce 0%
11/04/21 13:39:43 INFO mapred.JobClient: Task Id : attempt_201104211109_0038_m_000000_0, Status : FAILED
java.lang.IllegalStateException: java.lang.ClassNotFoundException: com.my.analyzers.MyAnalyzer
        at org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper.setup(SequenceFileTokenizerMapper.java:61)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.ClassNotFoundException: com.my.analyzers.MyAnalyzer
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
        at org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper.setup(SequenceFileTokenizerMapper.java:57)
        ... 4 more

Is there anything I'm missing there?

On 2011-04-20, at 1:32 PM, Ian Helmke wrote:

> Yes, if you make a subclass of StandardAnalyzer or your own Analyzer
> that has a constructor with no arguments (presumably one which calls a
> superclass constructor with the arguments you want), that should work
> nicely. (You could also just add a zero-argument constructor to your
> own custom analyzer.)
>
> On Wed, Apr 20, 2011 at 1:25 PM, Camilo Lopez <cam...@camilolopez.com> wrote:
>> Ian,
>>
>> I'm using 3.0.x (the one that comes by default in Mahout's trunk now).
>> By nullary constructor, do you mean I should overload the constructor to receive
>> no args in my own custom class?
>>
>> On 2011-04-20, at 1:23 PM, Ian Helmke wrote:
>>
>>> What version of Lucene are you using? If you use Lucene 3.0 or later,
>>> you can't use StandardAnalyzer as-is because it has no no-args
>>> constructor. You could try the Mahout DefaultAnalyzer (which wraps the
>>> Lucene analyzer in a no-argument constructor). I have gotten custom
>>> analyzers to work, but they need to have a nullary constructor.
>>>
>>> On Wed, Apr 20, 2011 at 12:58 PM, Camilo Lopez <cam...@camilolopez.com> wrote:
>>>> Hi List,
>>>>
>>>> Trying to run custom analyzer classes I always get an
>>>> InstantiationException. At first I suspected my own code, but trying with
>>>> what is supposed to be the default value,
>>>> 'org.apache.lucene.analysis.standard.StandardAnalyzer', I still get the
>>>> same exception.
>>>>
>>>> This is the command:
>>>>
>>>> bin/mahout seq2sparse -i /htmless_articles_seq -o
>>>> /htmless_articles_vectors_1 -ng 3 -x35 -wt tfidf -a
>>>> org.apache.lucene.analysis.standard.StandardAnalyzer -nv
>>>>
>>>> Looking a little deeper (i.e. catching the InstantiationException and
>>>> rethrowing getCause()), it turns out the problem is
>>>> caused by a NullPointerException:
>>>>
>>>> Exception in thread "main" java.lang.NullPointerException
>>>>         at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:211)
>>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>>         at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:52)
>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>>         at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>         at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>         at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>>
>>>> Am I missing something, or is there another way to create/use custom
>>>> analyzers in seq2sparse?
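For readers landing on this thread later: both failures come down to seq2sparse loading the `-a` class reflectively. The class must be visible to the classloader doing the loading (on the worker, for the map task) and must expose a public no-argument (nullary) constructor, or reflective instantiation fails. A minimal, stdlib-only sketch of the nullary-constructor requirement; the class names here are illustrative, not Mahout's actual code:

```java
// Sketch: reflective instantiation, similar in spirit to how the -a class
// name is turned into an instance. (Illustrative classes, not Mahout code.)

// A class whose only constructor takes arguments cannot be created via
// Class.newInstance() -- this mirrors the StandardAnalyzer problem.
class NeedsArgs {
    NeedsArgs(String arg) { }
}

// A wrapper that supplies a public nullary constructor (and calls the
// argument-taking constructor with a default) instantiates fine.
class NoArgsWrapper {
    private final NeedsArgs inner;
    public NoArgsWrapper() { this.inner = new NeedsArgs("default"); }
}

public class ReflectionDemo {
    public static void main(String[] args) throws Exception {
        // Works: NoArgsWrapper has a public no-argument constructor.
        Object ok = Class.forName("NoArgsWrapper").newInstance();
        System.out.println("created " + ok.getClass().getSimpleName());

        // Fails: NeedsArgs has no nullary constructor, so newInstance()
        // throws InstantiationException.
        try {
            Class.forName("NeedsArgs").newInstance();
        } catch (InstantiationException e) {
            System.out.println("InstantiationException for NeedsArgs");
        }
    }
}
```

Note that even a correct wrapper like this still has to reach the task JVMs: the jar needs to be shipped with the job (not just placed on the client's HADOOP_CLASSPATH, which only affects the JVM that submits the job), which is consistent with the ClassNotFoundException appearing inside the map attempt above.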