lewismc opened a new pull request, #887:
URL: https://github.com/apache/nutch/pull/887

   This PR attempts to address 
[NUTCH-3110](https://issues.apache.org/jira/browse/NUTCH-3110) and, in the 
process, supersedes https://github.com/apache/nutch/pull/850.
   It upgrades Apache Tika from the shaded artifacts to the 
official Tika 3.2.3 release, addressing compatibility issues and restoring full 
functionality. Some noteworthy proposals:
   * Both plugins (language-identifier & parse-tika) exclude slf4j-api to 
prevent class loader conflicts (NUTCH-3108)
   * Duplicate outlinks: Changed `HashMap` to `LinkedHashMap` in 
`DOMContentUtils.java` to preserve link insertion order while deduplicating.
   * UTF-16 encoding test: Fixed double BOM issue in `TestHtmlParser.java` 
where Java's UTF-16 encoder was adding a second BOM.
   * Boilerpipe support: Restored boilerpipe content extraction using the new 
`tika-handler-boilerpipe` module.
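
For readers unfamiliar with the `HashMap` to `LinkedHashMap` change, here is a minimal sketch of the general idea. It is not the code in `DOMContentUtils.java`; the class name `OutlinkDedupSketch` and the method `dedupe` are hypothetical and only illustrate why `LinkedHashMap` gives deduplication with a stable, insertion-ordered result.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not the actual Nutch implementation.
class OutlinkDedupSketch {

    /** Returns the URLs with duplicates dropped, in first-seen order. */
    static List<String> dedupe(List<String> urls) {
        // LinkedHashMap iterates in insertion order; a plain HashMap
        // makes no ordering guarantee at all.
        Map<String, Boolean> seen = new LinkedHashMap<>();
        for (String url : urls) {
            // putIfAbsent keeps the position of the first occurrence.
            seen.putIfAbsent(url, Boolean.TRUE);
        }
        return new ArrayList<>(seen.keySet());
    }

    public static void main(String[] args) {
        List<String> links = List.of(
            "http://example.org/b",
            "http://example.org/a",
            "http://example.org/b");
        System.out.println(dedupe(links));
    }
}
```

With a plain `HashMap`, the same loop would still deduplicate, but the iteration order of the result would depend on hash values rather than on the order in which links appear in the page.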
   
   Additionally, a set of new tests should assist in future Tika upgrades:
   
   - TestBoilerpipeExtraction - Boilerpipe integration tests
   - TestLinkExtractionEdgeCases - Link extraction behavior tests
   - TestEncodingDetection - Charset detection tests
   - TestMetadataExtraction - HTML metadata extraction tests
   - TestParserFailureHandling - Error handling/graceful degradation tests
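
As background for the UTF-16 double BOM fix mentioned above, the following sketch shows how such a duplicate can arise in plain Java. It assumes the test data already carried a BOM before being encoded, which may not be exactly how `TestHtmlParser.java` was written; the class `Utf16BomSketch` and its methods are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the double BOM effect, not the Nutch test code.
class Utf16BomSketch {

    /** Java's UTF-16 encoder prepends a big-endian BOM (FE FF) on its own. */
    static byte[] encoderBom(String text) {
        return text.getBytes(StandardCharsets.UTF_16);
    }

    /** A BOM character already in the text is encoded in addition to it. */
    static byte[] doubleBom(String text) {
        return ("\uFEFF" + text).getBytes(StandardCharsets.UTF_16);
    }

    /** One way to get exactly one BOM: write it by hand, then use UTF-16BE. */
    static byte[] explicitBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_16BE); // no BOM added
        byte[] out = new byte[body.length + 2];
        out[0] = (byte) 0xFE;
        out[1] = (byte) 0xFF;
        System.arraycopy(body, 0, out, 2, body.length);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(encoderBom("a").length);  // BOM + one code unit
        System.out.println(doubleBom("a").length);   // two BOMs + one code unit
        System.out.println(explicitBom("a").length);
    }
}
```

The `UTF_16BE` and `UTF_16LE` charsets do not write a BOM when encoding, so they are the usual choice when the test controls the BOM bytes itself.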


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
