[jira] [Commented] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode

2024-04-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834532#comment-17834532
 ] 

Tim Allison commented on NUTCH-2937:


I really, really, really wish we didn't have to do this! :P

Happy to help!

> parse-tika: review dependency exclusions and avoid dependency conflicts in 
> distributed mode
> ---
>
> Key: NUTCH-2937
> URL: https://issues.apache.org/jira/browse/NUTCH-2937
> Project: Nutch
> Issue Type: Bug
> Components: parser, plugin
> Affects Versions: 1.19
> Reporter: Sebastian Nagel
> Assignee: Tim Allison
> Priority: Major
> Fix For: 1.20
>
>
> While testing NUTCH-2919 I've seen the following error, caused by 
> conflicting versions of the commons-io dependency:
> - 2.11.0 Nutch core
> - 2.11.0 parse-tika (excluded to avoid duplicated dependencies)
> - 2.5 provided by Hadoop
> This causes errors when parsing some (but not all) office and other 
> documents, for example:
> {noformat}
> 2022-01-15 01:36:31,365 WARN [FetcherThread] org.apache.nutch.parse.ParseUtil: Error parsing http://kurskrun.ru/privacypolicy with org.apache.nutch.parse.tika.TikaParser
> java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: 'org.apache.commons.io.input.CloseShieldInputStream org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
>     at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
>     at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
>     at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
>     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
>     at org.apache.nutch.fetcher.FetcherThread.output(FetcherThread.java:715)
>     at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:431)
> Caused by: java.lang.NoSuchMethodError: 'org.apache.commons.io.input.CloseShieldInputStream org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:120)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:115)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151)
>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90)
>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34)
>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
>     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at java.base/java.lang.Thread.run(Thread.java:829)
> {noformat}
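
One way to confirm which copy of a class wins at runtime is to ask the JVM where it loaded the class from and whether the expected method exists. What follows is a minimal sketch using only the JDK. The class name `ClasspathProbe` and its `describe` helper are invented for illustration and are not part of Nutch or Tika. Run on the classpath of the failing task, it would presumably point at the older commons-io jar and report `wrap` as missing.

```java
import java.io.InputStream;
import java.security.CodeSource;

public class ClasspathProbe {

    /**
     * Reports where a class was loaded from and whether it exposes a given
     * public method. Hypothetical helper, not part of any Nutch or Tika API.
     */
    public static String describe(String className, String methodName,
                                  Class<?>... paramTypes) {
        Class<?> cls;
        try {
            // Load without running static initializers; we only inspect the class.
            cls = Class.forName(className, false,
                    ClasspathProbe.class.getClassLoader());
        } catch (ClassNotFoundException e) {
            return className + ": not on classpath";
        }
        // Classes from the JDK itself have no code source.
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        String origin = (src == null) ? "JDK runtime"
                                      : String.valueOf(src.getLocation());
        try {
            cls.getMethod(methodName, paramTypes);
            return className + " [" + origin + "]: " + methodName + " present";
        } catch (NoSuchMethodException e) {
            return className + " [" + origin + "]: " + methodName + " MISSING";
        }
    }

    public static void main(String[] args) {
        // The class and method named in the NoSuchMethodError above.
        System.out.println(describe(
                "org.apache.commons.io.input.CloseShieldInputStream",
                "wrap", InputStream.class));
    }
}
```

The class is looked up by name so that the sketch compiles without commons-io on the compile-time classpath. The location in square brackets is the jar (or directory) the class actually came from.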



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2937.

Resolution: Fixed

Fixed in NUTCH-2959 by using the shaded Tika package. Thanks, [~tallison]!






[jira] [Assigned] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2937:
--

Assignee: Tim Allison






[jira] [Updated] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2937:
---
Fix Version/s: 1.20
   (was: 1.21)






[jira] [Resolved] (NUTCH-3005) Upgrade selenium as needed

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3005.

Resolution: Implemented

Done by [~lewismc] as part of NUTCH-3036, commit 
[1563396|https://github.com/apache/nutch/blob/1563396d952393462fffab1f686e9ffd5d006cf6/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java#L151].

> Upgrade selenium as needed
> --
>
> Key: NUTCH-3005
> URL: https://issues.apache.org/jira/browse/NUTCH-3005
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.19
> Reporter: Tim Allison
> Priority: Trivial
> Fix For: 1.20
>
>
> When we choose to upgrade selenium, we should take note of this blog about 
> changes in headless chromium: 
> https://www.selenium.dev/blog/2023/headless-is-going-away/
> ChromeOptions options = new ChromeOptions();
> options.addArguments("--headless=new");
> WebDriver driver = new ChromeDriver(options);
> driver.get("https://selenium.dev");
> driver.quit();





[jira] [Resolved] (NUTCH-3016) Upgrade Apache Ivy to 2.5.2

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3016.

Resolution: Duplicate

> Upgrade Apache Ivy to 2.5.2
> ---
>
> Key: NUTCH-3016
> URL: https://issues.apache.org/jira/browse/NUTCH-3016
> Project: Nutch
> Issue Type: Task
> Components: build, ivy
> Affects Versions: 1.19
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.20
>
>
> [Apache Ivy 
> v2.5.2|https://ant.apache.org/ivy/history/2.5.2/release-notes.html] was 
> released on August 20 2023!
> We should upgrade.





[jira] [Updated] (NUTCH-3016) Upgrade Apache Ivy to 2.5.2

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3016:
---
Fix Version/s: 1.20
   (was: 1.21)






[jira] [Updated] (NUTCH-3005) Upgrade selenium as needed

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3005:
---
Affects Version/s: 1.19






[jira] [Updated] (NUTCH-3005) Upgrade selenium as needed

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3005:
---
Fix Version/s: 1.20






[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3028:
---
Affects Version/s: 1.19

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.19
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the 
> next example, all records with SOME_KEY=SOME_VALUE in their parseData 
> metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'
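
For readers less familiar with JEXL, the expression above is an ordinary method chain on the record's parse metadata. A plain-Java sketch of the same test follows. It uses a `Map<String, String>` as a hypothetical stand-in for the parse metadata and does not call any Nutch or JEXL API; the class and method names are invented for illustration. Note the constant-first `equals`, which simply skips records lacking the key. How the real expression treats a missing key is not stated in the issue.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MetaFilterSketch {

    // Same test as the expression, on a stand-in metadata map.
    // Constant-first equals: a record without SOME_KEY yields false, not an exception.
    public static boolean matches(Map<String, String> parseMeta) {
        return "SOME_VALUE".equals(parseMeta.get("SOME_KEY"));
    }

    // Returns the URLs of the records that would pass the filter,
    // in the iteration order of the input map.
    public static List<String> select(Map<String, Map<String, String>> recordsByUrl) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> e : recordsByUrl.entrySet()) {
            if (matches(e.getValue())) {
                selected.add(e.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Made-up records: URL mapped to its parse metadata.
        Map<String, Map<String, String>> records = new LinkedHashMap<>();
        records.put("http://example.org/a", Map.of("SOME_KEY", "SOME_VALUE"));
        records.put("http://example.org/b", Map.of("SOME_KEY", "other"));
        records.put("http://example.org/c", Map.of());
        System.out.println(select(records));
    }
}
```

Only the first of the three sample records carries `SOME_KEY=SOME_VALUE`, so it is the only one selected.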



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3028:
---
Fix Version/s: 1.21



