[jira] [Commented] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode
[ https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834532#comment-17834532 ]

Tim Allison commented on NUTCH-2937:

I really, really, really wish we didn't have to do this! :P Happy to help!

> parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode
> ---
>
> Key: NUTCH-2937
> URL: https://issues.apache.org/jira/browse/NUTCH-2937
> Project: Nutch
> Issue Type: Bug
> Components: parser, plugin
> Affects Versions: 1.19
> Reporter: Sebastian Nagel
> Assignee: Tim Allison
> Priority: Major
> Fix For: 1.20
>
> While testing NUTCH-2919 I've seen the following error caused by a conflicting dependency to commons-io:
> - 2.11.0 Nutch core
> - 2.11.0 parse-tika (excluded to avoid duplicated dependencies)
> - 2.5 provided by Hadoop
>
> This causes errors parsing some office and other documents (but not all), for example:
> {noformat}
> 2022-01-15 01:36:31,365 WARN [FetcherThread] org.apache.nutch.parse.ParseUtil: Error parsing http://kurskrun.ru/privacypolicy with org.apache.nutch.parse.tika.TikaParser
> java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: 'org.apache.commons.io.input.CloseShieldInputStream org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
>     at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
>     at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
>     at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
>     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
>     at org.apache.nutch.fetcher.FetcherThread.output(FetcherThread.java:715)
>     at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:431)
> Caused by: java.lang.NoSuchMethodError: 'org.apache.commons.io.input.CloseShieldInputStream org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:120)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:115)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151)
>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90)
>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34)
>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
>     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at java.base/java.lang.Thread.run(Thread.java:829)
> {noformat}

--
This message was sent by Atlassian Jira (v8.20.10#820010)
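A NoSuchMethodError like the one in the report usually means an older copy of a library was loaded ahead of the expected one. One way to check this on a given classpath is to ask the JVM where a class was loaded from and whether the expected method is visible. The following is only a minimal sketch using plain reflection; the class name `ClasspathProbe` and its helper methods are hypothetical and not part of Nutch, Tika, or Hadoop.

```java
import java.io.InputStream;
import java.security.CodeSource;

public class ClasspathProbe {

    /** Returns the location a class was loaded from, or a note if none is recorded. */
    public static String locationOf(Class<?> clazz) {
        CodeSource source = clazz.getProtectionDomain().getCodeSource();
        // Classes loaded by the bootstrap loader (the JDK itself) have no code source.
        return source == null ? "(bootstrap/JDK)" : String.valueOf(source.getLocation());
    }

    /** True if the class exposes a public method with this name and parameter types. */
    public static boolean hasMethod(Class<?> clazz, String name, Class<?>... params) {
        try {
            clazz.getMethod(name, params);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    /** Summarizes where a class comes from and whether the method is present. */
    public static String describe(String className, String method, Class<?>... params) {
        try {
            Class<?> clazz = Class.forName(className);
            return className + " from " + locationOf(clazz)
                + ", " + method + " present: " + hasMethod(clazz, method, params);
        } catch (ClassNotFoundException e) {
            return className + " not on classpath";
        }
    }

    public static void main(String[] args) {
        // Class and method names are taken from the stack trace in the report.
        System.out.println(describe(
            "org.apache.commons.io.input.CloseShieldInputStream", "wrap", InputStream.class));
    }
}
```

Run with the same classpath as the failing task, the output would indicate which jar supplies the class and whether `wrap(InputStream)` is visible there. The probe only reflects the class loader it runs in, so inside a plugin class loader the result may differ.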
[jira] [Resolved] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode
[ https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-2937.
Resolution: Fixed

Fixed in NUTCH-2959 by using the shaded Tika package. Thanks, [~tallison]!
[jira] [Assigned] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode
[ https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel reassigned NUTCH-2937:
Assignee: Tim Allison
[jira] [Updated] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode
[ https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-2937:
Fix Version/s: 1.20 (was: 1.21)
[jira] [Resolved] (NUTCH-3005) Upgrade selenium as needed
[ https://issues.apache.org/jira/browse/NUTCH-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-3005.
Resolution: Implemented

Done by [~lewismc] as part of NUTCH-3036, commit [1563396|https://github.com/apache/nutch/blob/1563396d952393462fffab1f686e9ffd5d006cf6/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java#L151].

> Upgrade selenium as needed
> ---
>
> Key: NUTCH-3005
> URL: https://issues.apache.org/jira/browse/NUTCH-3005
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.19
> Reporter: Tim Allison
> Priority: Trivial
> Fix For: 1.20
>
> When we choose to upgrade selenium, we should take note of this blog about changes in headless chromium:
> https://www.selenium.dev/blog/2023/headless-is-going-away/
>
> import org.openqa.selenium.WebDriver;
> import org.openqa.selenium.chrome.ChromeDriver;
> import org.openqa.selenium.chrome.ChromeOptions;
> ChromeOptions options = new ChromeOptions();
> // Request the new headless mode described in the blog post.
> options.addArguments("--headless=new");
> WebDriver driver = new ChromeDriver(options);
> driver.get("https://selenium.dev");
> driver.quit();
[jira] [Resolved] (NUTCH-3016) Upgrade Apache Ivy to 2.5.2
[ https://issues.apache.org/jira/browse/NUTCH-3016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-3016.
Resolution: Duplicate

> Upgrade Apache Ivy to 2.5.2
> ---
>
> Key: NUTCH-3016
> URL: https://issues.apache.org/jira/browse/NUTCH-3016
> Project: Nutch
> Issue Type: Task
> Components: build, ivy
> Affects Versions: 1.19
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.20
>
> [Apache Ivy v2.5.2|https://ant.apache.org/ivy/history/2.5.2/release-notes.html] was released on August 20 2023!
> We should upgrade.
[jira] [Updated] (NUTCH-3016) Upgrade Apache Ivy to 2.5.2
[ https://issues.apache.org/jira/browse/NUTCH-3016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-3016:
Fix Version/s: 1.20 (was: 1.21)
[jira] [Updated] (NUTCH-3005) Upgrade selenium as needed
[ https://issues.apache.org/jira/browse/NUTCH-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-3005:
Affects Version/s: 1.19
[jira] [Updated] (NUTCH-3005) Upgrade selenium as needed
[ https://issues.apache.org/jira/browse/NUTCH-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-3005:
Fix Version/s: 1.20
[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-3028:
Affects Version/s: 1.19

> WARCExported to support filtering by JEXL
> ---
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.19
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the next example, all records with SOME_KEY=SOME_VALUE in their parseData metadata are exported to WARC.
>
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'
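The condition encoded by the expression above can be illustrated in plain Java. This is only a sketch of the predicate's logic, not of how Nutch or JEXL evaluates it: a `Map<String, String>` stands in for the parse metadata, and the class `ParseMetaFilterSketch` with its two predicates is hypothetical. It also shows that, in plain Java, the same call chain fails when the key is absent, while putting the literal first avoids that. How the JEXL engine treats a missing key depends on its configuration, which the issue does not state.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

public class ParseMetaFilterSketch {

    // Mirrors the call chain literally: throws NullPointerException if SOME_KEY is absent.
    public static final Predicate<Map<String, String>> LITERAL =
        meta -> meta.get("SOME_KEY").equals("SOME_VALUE");

    // Null-safe form: a missing key simply does not match.
    public static final Predicate<Map<String, String>> NULL_SAFE =
        meta -> "SOME_VALUE".equals(meta.get("SOME_KEY"));

    public static void main(String[] args) {
        Map<String, String> matching = new HashMap<>();
        matching.put("SOME_KEY", "SOME_VALUE");
        Map<String, String> empty = new HashMap<>();

        System.out.println("matching record kept: " + NULL_SAFE.test(matching));
        System.out.println("record without key kept: " + NULL_SAFE.test(empty));
    }
}
```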
[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-3028:
Fix Version/s: 1.21