[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606592#comment-14606592 ]
Asitang Mishra edited comment on NUTCH-2038 at 6/30/15 5:43 PM: ---------------------------------------------------------------- Hi [~wastl-nagel] and Hi [~lewismc], Please, take a look at the latest patch and help me figure out the exception!!, I am facing the following issue when running in local (please test the latest pull for this). This I even faced in the pull #40 here. Please test and see if you are facing it too. I have added all the dependencies, dont seem to understand why it's still givin class not found!!! java.lang.Exception: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354) Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:857) at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:718) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364) at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ClassNotFoundException: org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper at java.net.URLClassLoader$1.run(URLClassLoader.java:372) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:360) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:340) at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:810) at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:855) ... 9 more 2015-06-29 15:45:05,038 ERROR naivebayes.NaiveBayesParseFilter - Error occured while training:: java.lang.IllegalStateException: Job failed! at org.apache.mahout.vectorizer.DocumentProcessor.tokenizeDocuments(DocumentProcessor.java:95) at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:257) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:56) at org.apache.nutch.parsefilter.naivebayes.NaiveBayesClassifier.createModel(NaiveBayesClassifier.java:105) at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.train(NaiveBayesParseFilter.java:90) at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.setConf(NaiveBayesParseFilter.java:160) at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:163) at org.apache.nutch.plugin.PluginRepository.getOrderedPlugins(PluginRepository.java:441) at org.apache.nutch.parse.HtmlParseFilters.<init>(HtmlParseFilters.java:35) at org.apache.nutch.parse.html.HtmlParser.setConf(HtmlParser.java:343) at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:163) at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:136) at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:78) at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:104) at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:46) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366) at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) was (Author: asitang): Hi [~wastl-nagel], I am facing the following issue when running in local (please test the latest pull for this). This I even faced in the pull #40 here. Please test and see if you are facing it too. I have added all the dependencies, dont seem to understand why it's still givin class not found!!! java.lang.Exception: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354) Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:857) at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:718) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364) at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ClassNotFoundException: org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper at java.net.URLClassLoader$1.run(URLClassLoader.java:372) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:360) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:340) at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:810) at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:855) ... 9 more 2015-06-29 15:45:05,038 ERROR naivebayes.NaiveBayesParseFilter - Error occured while training:: java.lang.IllegalStateException: Job failed! at org.apache.mahout.vectorizer.DocumentProcessor.tokenizeDocuments(DocumentProcessor.java:95) at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:257) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:56) at org.apache.nutch.parsefilter.naivebayes.NaiveBayesClassifier.createModel(NaiveBayesClassifier.java:105) at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.train(NaiveBayesParseFilter.java:90) at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.setConf(NaiveBayesParseFilter.java:160) at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:163) at org.apache.nutch.plugin.PluginRepository.getOrderedPlugins(PluginRepository.java:441) at org.apache.nutch.parse.HtmlParseFilters.<init>(HtmlParseFilters.java:35) at org.apache.nutch.parse.html.HtmlParser.setConf(HtmlParser.java:343) at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:163) at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:136) at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:78) at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:104) at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:46) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366) at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) > Naive Bayes classifier based html Parse filter (for filtering outlinks) > ----------------------------------------------------------------------- > > Key: NUTCH-2038 > URL: https://issues.apache.org/jira/browse/NUTCH-2038 > Project: Nutch > Issue Type: New Feature > Components: fetcher, injector, parser > Reporter: Asitang Mishra > Assignee: Chris A. Mattmann > Labels: memex, nutch > Fix For: 1.11 > > > A html parse filter that will filter out the outlinks in two stages. > Classify the parse text and decide if the parent page is relevant. If > relevant then don't filter the outlinks. If irrelevant then go thru each > outlink and see if the url contains any of the important words from a list. > If it does then let it pass. -- This message was sent by Atlassian JIRA (v6.3.4#6332)