[ https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-2937:
-----------------------------------
    Fix Version/s: 1.20
                       (was: 1.21)

> parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode
> -------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-2937
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2937
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser, plugin
>    Affects Versions: 1.19
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.20
>
>
> While testing NUTCH-2919 I've seen the following error caused by a conflicting dependency to commons-io:
> - 2.11.0 Nutch core
> - 2.11.0 parse-tika (excluded to avoid duplicated dependencies)
> - 2.5 provided by Hadoop
> This causes errors parsing some office and other documents (but not all), for example:
> {noformat}
> 2022-01-15 01:36:31,365 WARN [FetcherThread] org.apache.nutch.parse.ParseUtil: Error parsing http://kurskrun.ru/privacypolicy with org.apache.nutch.parse.tika.TikaParser
> java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: 'org.apache.commons.io.input.CloseShieldInputStream org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
>     at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
>     at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
>     at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
>     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
>     at org.apache.nutch.fetcher.FetcherThread.output(FetcherThread.java:715)
>     at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:431)
> Caused by: java.lang.NoSuchMethodError: 'org.apache.commons.io.input.CloseShieldInputStream org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:120)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:115)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151)
>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90)
>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34)
>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
>     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at java.base/java.lang.Thread.run(Thread.java:829)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
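
A NoSuchMethodError like the one in the issue description usually means the class was loaded from an older jar than the caller was compiled against. One way to confirm which commons-io copy wins at runtime is to ask the JVM where the class came from and whether it exposes the missing method. The following is only a minimal sketch: the class name ClasspathProbe and its helper methods are hypothetical and not part of Nutch, Tika or Hadoop, and the result is only meaningful when the probe runs under the same class loader as the failing parser (inside the Hadoop task), which this standalone version does not guarantee.

```java
import java.io.InputStream;
import java.security.CodeSource;

// Hypothetical diagnostic helper, not part of Nutch.
public class ClasspathProbe {

    /** Returns the jar or directory a class was loaded from. */
    public static String locationOf(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        // Classes without a code source (for example JDK core classes) report no location.
        if (source == null || source.getLocation() == null) {
            return "(no code source)";
        }
        return source.getLocation().toString();
    }

    /** Returns true if the class exposes a public method with this name and parameter types. */
    public static boolean hasMethod(Class<?> cls, String name, Class<?>... paramTypes) {
        try {
            cls.getMethod(name, paramTypes);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Default to the class named in the stack trace; another name may be passed as an argument.
        String className = args.length > 0
                ? args[0]
                : "org.apache.commons.io.input.CloseShieldInputStream";
        try {
            // Load without initializing the class, using this class's own loader.
            Class<?> cls = Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            System.out.println(className + " loaded from " + locationOf(cls));
            System.out.println("has wrap(InputStream): " + hasMethod(cls, "wrap", InputStream.class));
        } catch (ClassNotFoundException e) {
            System.out.println(className + " is not on the classpath");
        }
    }
}
```

If the reported location is a Hadoop-provided commons-io jar and the wrap(InputStream) check prints false, that would be consistent with the conflict described in the issue.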