[ 
https://issues.apache.org/jira/browse/NUTCH-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18051379#comment-18051379
 ] 

ASF GitHub Bot commented on NUTCH-3110:
---------------------------------------

lewismc opened a new pull request, #887:
URL: https://github.com/apache/nutch/pull/887

   This PR attempts to address 
[NUTCH-3110](https://issues.apache.org/jira/browse/NUTCH-3110) and, in the 
process, supersedes https://github.com/apache/nutch/pull/850.
   It upgrades Apache Tika from the shaded artifacts to the 
official Tika 3.2.3 release, addressing compatibility issues and restoring full 
functionality. Noteworthy changes:
   * Both plugins (language-identifier & parse-tika) exclude slf4j-api to 
prevent class loader conflicts (NUTCH-3108)
   * Duplicate outlinks: Changed `HashMap` to `LinkedHashMap` in 
`DOMContentUtils.java` to preserve link insertion order while deduplicating.
   * UTF-16 encoding test: Fixed double BOM issue in `TestHtmlParser.java` 
where Java's UTF-16 encoder was adding a second BOM.
   * Boilerpipe support: Restored boilerpipe content extraction using the new 
`tika-handler-boilerpipe` module.
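
To illustrate the duplicate-outlinks item, here is a minimal sketch of why the map type matters when deduplicating. It is not the code from `DOMContentUtils.java`; the class `OutlinkDedupSketch` and the method `dedupe` are hypothetical names chosen for this example. Both map types drop duplicate keys, but only `LinkedHashMap` guarantees that iteration follows first-insertion order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: hypothetical names, not taken from DOMContentUtils.java.
class OutlinkDedupSketch {

    // Deduplicate URLs by using them as map keys. The caller supplies the map,
    // so HashMap and LinkedHashMap can be compared side by side.
    static List<String> dedupe(List<String> urls, Map<String, Boolean> seen) {
        for (String url : urls) {
            seen.putIfAbsent(url, Boolean.TRUE); // a repeated URL is ignored
        }
        return new ArrayList<>(seen.keySet());
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "http://example.org/b",
            "http://example.org/a",
            "http://example.org/b", // duplicate
            "http://example.org/c");

        // LinkedHashMap: keys come back in the order they were first inserted.
        System.out.println(dedupe(urls, new LinkedHashMap<>()));

        // HashMap: duplicates are also removed, but the iteration order is
        // unspecified and may differ from the order of the links in the page.
        System.out.println(dedupe(urls, new HashMap<>()));
    }
}
```

With `LinkedHashMap`, the result lists `b`, `a`, `c` in the order the links were first seen, which keeps outlink order stable across runs and JVM versions.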
   
   Additionally, a number of new tests should assist with future Tika upgrades:
   
   - TestBoilerpipeExtraction - Boilerpipe integration tests
   - TestLinkExtractionEdgeCases - Link extraction behavior tests
   - TestEncodingDetection - Charset detection tests
   - TestMetadataExtraction - HTML metadata extraction tests
   - TestParserFailureHandling - Error handling/graceful degradation tests
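
For context on the UTF-16 encoding test fix mentioned above, the following sketch shows how a double BOM can arise in Java. The actual change in `TestHtmlParser.java` is not shown in this message, so this only illustrates the mechanism; `Utf16BomSketch` and its methods are hypothetical names. The underlying behavior is standard: encoding with the generic `UTF-16` charset makes Java write a big-endian byte-order mark itself, while `UTF-16BE` writes none.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: hypothetical names, not taken from TestHtmlParser.java.
class Utf16BomSketch {

    // The byte-order mark as a Java character (U+FEFF).
    static final String BOM = "\uFEFF";

    // Count how many big-endian BOMs (bytes 0xFE 0xFF) start the array.
    static int countLeadingBoms(byte[] bytes) {
        int count = 0;
        for (int i = 0; i + 1 < bytes.length; i += 2) {
            if ((bytes[i] & 0xFF) != 0xFE || (bytes[i + 1] & 0xFF) != 0xFF) {
                break;
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String html = "<html></html>";

        // Explicit BOM plus the generic UTF-16 encoder: the encoder adds its own BOM.
        byte[] doubled = (BOM + html).getBytes(StandardCharsets.UTF_16);

        // Two ways to end up with exactly one BOM.
        byte[] encoderBom = html.getBytes(StandardCharsets.UTF_16);
        byte[] explicitBom = (BOM + html).getBytes(StandardCharsets.UTF_16BE);

        System.out.println(countLeadingBoms(doubled));     // 2
        System.out.println(countLeadingBoms(encoderBom));  // 1
        System.out.println(countLeadingBoms(explicitBom)); // 1
    }
}
```

A test fixture that prepends a BOM by hand and then encodes with `UTF-16` therefore feeds the parser two marks, which is consistent with the double BOM issue described in the PR.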




> Upgrade to Tika 3.2.3
> ---------------------
>
>                 Key: NUTCH-3110
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3110
>             Project: Nutch
>          Issue Type: Improvement
>          Components: dependency, parse-filter, parser
>    Affects Versions: 1.20
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.22
>
>
> Upgrade either to the default Tika 3.1.0 or the shaded packages 3.1.0.0 
> provided by [~tallison], see discussion in [PR 
> #849|https://github.com/apache/nutch/pull/849].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
