Hi,

I tried to upgrade any23 from 2.1 to 2.2 in the Nutch code base.

Changes:

1. src/plugin/any23/ivy.xml:
   <dependency org="org.apache.any23" name="apache-any23-core" rev="2.2" conf="*->default">

2. src/plugin/any23/plugin.xml:
   <library name="apache-any23-api-2.2.jar"/>
   <library name="apache-any23-core-2.2.jar"/>
   <library name="apache-any23-csvutils-2.2.jar"/>
   <library name="apache-any23-encoding-2.2.jar"/>
   <library name="apache-any23-mime-2.2.jar"/>

After "ant runtime", the following jar files are present in runtime/local/plugins/any23:

   any23.jar
   apache-any23-api-2.2.jar
   apache-any23-core-2.2.jar
   apache-any23-csvutils-2.2.jar
   apache-any23-encoding-2.2.jar
   apache-any23-mime-2.2.jar

I then ran a simple parse check (ParserChecker) on a test HTML file and got the following errors:

1. java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError: org/eclipse/rdf4j/common/lang/service/ServiceRegistry
   ....
   Caused by: java.lang.NoClassDefFoundError: org/eclipse/rdf4j/common/lang/service/ServiceRegistry

2. java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError: org/apache/any23/extractor/ExtractorRegistryImpl
   ...
   Caused by: java.lang.NoClassDefFoundError: org/apache/any23/extractor/ExtractorRegistryImpl

The entire log file is attached as debug.txt.

Regards,
Govind

2018-04-02 17:09:49,999 INFO parse.ParserChecker (ParserChecker.java:run(122)) - fetching: file:/tmp/exact_code.html
2018-04-02 17:09:50,205 DEBUG util.ObjectCache (ObjectCache.java:get(43)) - No object cache found for conf=Configuration: core-default.xml, core-site.xml, nutch-default.xml, nutch-site.xml, instantiating a new object cache
2018-04-02 17:09:50,328 DEBUG util.ObjectCache (ObjectCache.java:get(43)) - No object cache found for conf=Configuration: core-default.xml, core-site.xml, nutch-default.xml, nutch-site.xml, instantiating a new object cache
2018-04-02 17:09:50,366 TRACE file.File (FileResponse.java:<init>(117)) - fetching file:/tmp/exact_code.html
2018-04-02 17:09:50,450 INFO parse.ParseSegment (ParseSegment.java:isTruncated(207)) - file:/tmp/exact_code.html skipped. Content of size 79433 was truncated to 65536
2018-04-02 17:09:50,450 WARN parse.ParserChecker (ParserChecker.java:run(187)) - Content is truncated, parse may fail!
2018-04-02 17:09:50,457 TRACE parse.ParsePluginsReader (ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: extractor, extension-id: ir.co.bayan.simorq.zal.extractor.nutch.ExtractorParser
2018-04-02 17:09:50,457 TRACE parse.ParsePluginsReader (ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-tika, extension-id: org.apache.nutch.parse.tika.TikaParser
2018-04-02 17:09:50,457 TRACE parse.ParsePluginsReader (ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-ext, extension-id: ExtParser
2018-04-02 17:09:50,457 TRACE parse.ParsePluginsReader (ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-html, extension-id: org.apache.nutch.parse.html.HtmlParser
2018-04-02 17:09:50,458 TRACE parse.ParsePluginsReader (ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-js, extension-id: JSParser
2018-04-02 17:09:50,458 TRACE parse.ParsePluginsReader (ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: feed, extension-id: org.apache.nutch.parse.feed.FeedParser
2018-04-02 17:09:50,458 TRACE parse.ParsePluginsReader (ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-swf, extension-id: org.apache.nutch.parse.swf.SWFParser
2018-04-02 17:09:50,458 TRACE parse.ParsePluginsReader (ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-zip, extension-id: org.apache.nutch.parse.zip.ZipParser
2018-04-02 17:09:50,461 INFO parse.ParserFactory (ParserFactory.java:matchExtensions(374)) - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser - org.apache.nutch.parse.html.HtmlParser] are enabled via the plugin.includes system property, and all claim to support the content type text/html, but they are not mapped to it in the parse-plugins.xml file
2018-04-02 17:09:50,871 DEBUG parse.ParseUtil (ParseUtil.java:parse(91)) - Parsing [file:/tmp/exact_code.html] with [org.apache.nutch.parse.tika.TikaParser@693fe6c9]
2018-04-02 17:09:50,878 DEBUG tika.TikaParser (TikaParser.java:getParse(101)) - Using Tika parser org.apache.tika.parser.html.HtmlParser for mime-type text/html
2018-04-02 17:09:51,205 TRACE tika.TikaParser (TikaParser.java:getParse(152)) - Meta tags for file:/tmp/exact_code.html: base=null, noCache=false, noFollow=false, noIndex=false, refresh=false, refreshHref=null
 * general tags:
   - viewport = width=device-width, initial-scale=1
   - dc:title = I.F. on Kharms – Just a Beginning
   - content-encoding = UTF-8
   - generator = WordPress 4.9.4
   - content-type = text/html; charset=UTF-8
   - robots = index,follow
 * http-equiv tags:
2018-04-02 17:09:51,206 TRACE tika.TikaParser (TikaParser.java:getParse(159)) - Getting text...
2018-04-02 17:09:51,222 TRACE tika.TikaParser (TikaParser.java:getParse(165)) - Getting title...
2018-04-02 17:09:51,224 TRACE tika.TikaParser (TikaParser.java:getParse(183)) - Getting links (base URL = file:/tmp/exact_code.html) ...
2018-04-02 17:09:51,227 TRACE tika.TikaParser (TikaParser.java:getParse(193)) - found 40 outlinks in file:/tmp/exact_code.html
2018-04-02 17:09:51,248 WARN parse.ParseUtil (ParseUtil.java:runParser(173)) - Error parsing file:/tmp/exact_code.html with org.apache.nutch.parse.tika.TikaParser@693fe6c9
java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError: org/eclipse/rdf4j/common/lang/service/ServiceRegistry
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:206)
    at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:171)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:95)
    at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:202)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:268)
Caused by: java.lang.NoClassDefFoundError: org/eclipse/rdf4j/common/lang/service/ServiceRegistry
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
    at org.apache.nutch.plugin.PluginClassLoader.loadClass(PluginClassLoader.java:70)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.any23.Any23.<init>(Any23.java:137)
    at org.apache.any23.Any23.<init>(Any23.java:147)
    at org.apache.nutch.any23.Any23ParseFilter$Any23Parser.parse(Any23ParseFilter.java:109)
    at org.apache.nutch.any23.Any23ParseFilter$Any23Parser.<init>(Any23ParseFilter.java:92)
    at org.apache.nutch.any23.Any23ParseFilter.filter(Any23ParseFilter.java:172)
    at org.apache.nutch.parse.HtmlParseFilters.filter(HtmlParseFilters.java:46)
    at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:227)
    at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
    at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.eclipse.rdf4j.common.lang.service.ServiceRegistry
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.nutch.plugin.PluginClassLoader.loadClassFromSystem(PluginClassLoader.java:104)
    at org.apache.nutch.plugin.PluginClassLoader.loadClassFromParent(PluginClassLoader.java:92)
    at org.apache.nutch.plugin.PluginClassLoader.loadClass(PluginClassLoader.java:72)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 24 more
2018-04-02 17:09:51,251 DEBUG parse.ParseUtil (ParseUtil.java:parse(91)) - Parsing [file:/tmp/exact_code.html] with [org.apache.nutch.parse.html.HtmlParser@6a84a97d]
2018-04-02 17:09:51,280 TRACE util.EncodingDetector (EncodingDetector.java:guessEncoding(243)) - file:/tmp/exact_code.html: charset utf-8 (sniffed)
2018-04-02 17:09:51,280 TRACE util.EncodingDetector (EncodingDetector.java:guessEncoding(258)) - file:/tmp/exact_code.html: Choosing encoding: utf-8 (sniffed)
2018-04-02 17:09:51,280 TRACE html.HtmlParser (HtmlParser.java:getParse(180)) - Parsing...
[Error] :10:44: Missing attribute name.
[Error] :11:56: Missing attribute name.
[Error] :12:43: Missing attribute name.
[Error] :13:55: Missing whitespace before attribute "rel".
[Error] :13:69: Missing attribute name.
[Error] :14:135: Missing attribute name.
[Error] :15:153: Missing attribute name.
[Error] :16:169: Missing attribute name.
[Error] :35:228: Missing attribute name.
[Error] :36:179: Missing attribute name.
[Error] :45:82: Missing attribute name.
[Error] :46:116: Missing attribute name.
[Error] :47:129: Missing attribute name.
[Error] :48:123: Missing attribute name.
[Error] :49:50: Missing attribute name.
[Error] :50:75: Missing attribute name.
[Error] :51:70: Missing attribute name.
[Error] :52:179: Missing attribute name.
[Error] :53:187: Missing attribute name.
[Error] :67:210: Missing attribute name.
[Error] :150:581: Missing attribute name.
[Error] :151:247: Missing attribute name.
[Error] :152:135: Missing attribute name.
[Error] :153:107: Missing attribute name.
[Error] :153:186: Missing attribute name.
[Error] :154:74: Missing attribute name.
[Error] :174:123: Missing attribute name.
[Error] :187:14: Missing attribute name.
[Error] :348:581: Premature end of file encountered.
[Error] :348:581: Premature end of file encountered.
[Warning] :348:581: Element <PATH> not closed properly.
[Warning] :348:581: Element <SYMBOL> not closed properly.
[Warning] :348:581: Element <DEFS> not closed properly.
[Warning] :348:581: Element <SVG> not closed properly.
[Warning] :348:581: Element <BODY> not closed properly.
[Warning] :348:581: Element <HTML> not closed properly.
2018-04-02 17:09:51,377 TRACE html.HtmlParser (HtmlParser.java:getParse(205)) - Meta tags for file:/tmp/exact_code.html: base=null, noCache=false, noFollow=false, noIndex=false, refresh=false, refreshHref=null
 * general tags:
   - viewport = width=device-width, initial-scale=1
   - generator = WordPress 4.9.4
   - robots = index,follow
 * http-equiv tags:
2018-04-02 17:09:51,377 TRACE html.HtmlParser (HtmlParser.java:getParse(211)) - Getting text...
2018-04-02 17:09:51,385 TRACE html.HtmlParser (HtmlParser.java:getParse(217)) - Getting title...
2018-04-02 17:09:51,386 TRACE html.HtmlParser (HtmlParser.java:getParse(235)) - Getting links...
2018-04-02 17:09:51,388 TRACE html.HtmlParser (HtmlParser.java:getParse(240)) - found 47 outlinks in file:/tmp/exact_code.html
2018-04-02 17:09:51,389 WARN parse.ParseUtil (ParseUtil.java:runParser(173)) - Error parsing file:/tmp/exact_code.html with org.apache.nutch.parse.html.HtmlParser@6a84a97d
java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError: org/apache/any23/extractor/ExtractorRegistryImpl
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:206)
    at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:171)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:95)
    at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:202)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:268)
Caused by: java.lang.NoClassDefFoundError: org/apache/any23/extractor/ExtractorRegistryImpl
    at org.apache.any23.Any23.<init>(Any23.java:137)
    at org.apache.any23.Any23.<init>(Any23.java:147)
    at org.apache.nutch.any23.Any23ParseFilter$Any23Parser.parse(Any23ParseFilter.java:109)
    at org.apache.nutch.any23.Any23ParseFilter$Any23Parser.<init>(Any23ParseFilter.java:92)
    at org.apache.nutch.any23.Any23ParseFilter.filter(Any23ParseFilter.java:172)
    at org.apache.nutch.parse.HtmlParseFilters.filter(HtmlParseFilters.java:46)
    at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:257)
    at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
    at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2018-04-02 17:09:51,390 WARN parse.ParseUtil (ParseUtil.java:parse(104)) - Unable to successfully parse content file:/tmp/exact_code.html of type text/html
2018-04-02 17:09:51,391 INFO crawl.SignatureFactory (SignatureFactory.java:getSignature(51)) - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2018-04-02 17:09:51,400 INFO parse.ParserChecker (ParserChecker.java:run(214)) - parsing: file:/tmp/exact_code.html
2018-04-02 17:09:51,400 INFO parse.ParserChecker (ParserChecker.java:run(215)) - contentType: text/html
2018-04-02 17:09:51,400 INFO parse.ParserChecker (ParserChecker.java:run(216)) - signature: 650db1bac1e2c1c04ad51c0f1b54f379
2018-04-02 17:09:51,401 INFO parse.ParserChecker (ParserChecker.java:run(244)) - --------- Url ---------------
2018-04-02 17:09:51,401 INFO parse.ParserChecker (ParserChecker.java:run(246)) - --------- ParseData ---------
2018-04-02 17:09:51,401 INFO parse.ParserChecker (ParserChecker.java:run(249)) - --------- ParseText ---------
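
Both traces in the log come down to a class that the plugin class loader could not load. A quick way to see which jar, if any, in the plugin directory actually provides a given class is to scan the jars directly. The script below is only a minimal sketch: `jars_containing` is a hypothetical helper name (not part of Nutch or Any23), and the default directory is simply the `runtime/local/plugins/any23` path mentioned in the message.

```python
import sys
import zipfile
from pathlib import Path


def jars_containing(plugin_dir, class_name):
    """Return the file names of jars in plugin_dir that contain class_name.

    class_name may be dotted (org.example.Foo) or slash-separated
    (org/example/Foo), as it appears in NoClassDefFoundError messages.
    """
    # A class is stored in a jar as <package path>/<ClassName>.class
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in sorted(Path(plugin_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    hits.append(jar.name)
        except zipfile.BadZipFile:
            # Not a readable jar/zip archive; skip it
            continue
    return hits


if __name__ == "__main__":
    # Assumption: default to the plugin directory named in the message
    plugin_dir = sys.argv[1] if len(sys.argv) > 1 else "runtime/local/plugins/any23"
    for cls in (
        "org.eclipse.rdf4j.common.lang.service.ServiceRegistry",
        "org.apache.any23.extractor.ExtractorRegistryImpl",
    ):
        hits = jars_containing(plugin_dir, cls)
        print(cls, "->", ", ".join(hits) if hits else "NOT FOUND")
```

If a class is reported as NOT FOUND, that would suggest the jar providing it was never copied into the plugin directory, for example a transitive dependency that has no matching <library> entry in plugin.xml.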