[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606592#comment-14606592
 ] 

Asitang Mishra edited comment on NUTCH-2038 at 6/30/15 5:43 PM:
----------------------------------------------------------------

Hi [~wastl-nagel] and [~lewismc],
Please take a look at the latest patch and help me figure out the exception.

I am hitting the following issue when running in local mode (please test the
latest pull for this). I also hit it in pull #40. Please test and see whether
you are facing it too.
I have added all the dependencies and don't understand why it is still throwing
a ClassNotFoundException.

java.lang.Exception: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:857)
        at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:718)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper
        at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:340)
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:810)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:855)
        ... 9 more
2015-06-29 15:45:05,038 ERROR naivebayes.NaiveBayesParseFilter - Error occured while training:: java.lang.IllegalStateException: Job failed!
        at org.apache.mahout.vectorizer.DocumentProcessor.tokenizeDocuments(DocumentProcessor.java:95)
        at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:257)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:56)
        at org.apache.nutch.parsefilter.naivebayes.NaiveBayesClassifier.createModel(NaiveBayesClassifier.java:105)
        at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.train(NaiveBayesParseFilter.java:90)
        at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.setConf(NaiveBayesParseFilter.java:160)
        at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:163)
        at org.apache.nutch.plugin.PluginRepository.getOrderedPlugins(PluginRepository.java:441)
        at org.apache.nutch.parse.HtmlParseFilters.<init>(HtmlParseFilters.java:35)
        at org.apache.nutch.parse.html.HtmlParser.setConf(HtmlParser.java:343)
        at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:163)
        at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:136)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:78)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:104)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:46)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
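
The innermost frames of the trace show the lookup going through
sun.misc.Launcher$AppClassLoader, i.e. the JVM's application class loader. One
way to narrow the problem down might be to check which class loaders can
actually see the Mahout class. The following is only a minimal sketch:
ClassLoaderProbe and isVisible are hypothetical names used for illustration and
are not part of Nutch, Hadoop or Mahout.

```java
// Hypothetical diagnostic helper (not part of Nutch, Hadoop or Mahout).
final class ClassLoaderProbe {

    // Returns true if 'loader' can resolve 'className', without initializing the class.
    static boolean isVisible(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Default to the class named in the stack trace; another name can be passed as an argument.
        String name = args.length > 0
                ? args[0]
                : "org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper";
        System.out.println("system class loader:  "
                + isVisible(name, ClassLoader.getSystemClassLoader()));
        System.out.println("context class loader: "
                + isVisible(name, Thread.currentThread().getContextClassLoader()));
    }
}
```

If the class is visible to one loader but not to the other, that would suggest
the jar is present but not on the classpath of the loader that Hadoop's
Configuration uses to resolve the mapper class.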



> Naive Bayes classifier based html Parse filter (for filtering outlinks)
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-2038
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2038
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, injector, parser
>            Reporter: Asitang Mishra
>            Assignee: Chris A. Mattmann
>              Labels: memex, nutch
>             Fix For: 1.11
>
>
> An HTML parse filter that filters outlinks in two stages. First, classify
> the parse text and decide whether the parent page is relevant. If it is
> relevant, keep all of its outlinks. If it is irrelevant, go through each
> outlink and check whether the URL contains any of the important words from a
> list. If it does, let that outlink pass.
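
The two stages described above can be sketched roughly as follows. This is an
illustration only, not the code from the patch: TwoStageOutlinkFilterSketch is
a hypothetical class, the relevance check is passed in as a plain predicate
standing in for the Naive Bayes classifier, and the case-insensitive substring
match on the URL is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch of the two-stage outlink filtering idea.
final class TwoStageOutlinkFilterSketch {

    private final Predicate<String> isRelevant; // stand-in for the trained classifier
    private final Set<String> importantWords;   // words that let an outlink pass

    TwoStageOutlinkFilterSketch(Predicate<String> isRelevant, Set<String> importantWords) {
        this.isRelevant = isRelevant;
        this.importantWords = importantWords;
    }

    List<String> filter(String parseText, List<String> outlinks) {
        // Stage 1: a relevant parent page keeps all of its outlinks.
        if (isRelevant.test(parseText)) {
            return new ArrayList<>(outlinks);
        }
        // Stage 2: for an irrelevant page, keep only URLs containing an important word.
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            String lower = url.toLowerCase(Locale.ROOT);
            for (String word : importantWords) {
                if (lower.contains(word.toLowerCase(Locale.ROOT))) {
                    kept.add(url);
                    break;
                }
            }
        }
        return kept;
    }
}
```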



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
