Peter Ansell created ANY23-554:
----------------------------------
Summary: Avoid using carriage return to detect windows-1252
charset if content type has been identified from metadata
Key: ANY23-554
URL: https://issues.apache.org/jira/browse/ANY23-554
Project: Apache Any23
Issue Type: Task
Reporter: Peter Ansell
Two encoding detection tests fail on Windows and on Windows Subsystem for
Linux because a heuristic overrides the charset declared in a meta tag. The
heuristic is unlikely to be correct in its current form: carriage returns
appear in many documents produced on Windows, and those documents may
legitimately be encoded as ISO-8859-1. If the author has declared ISO-8859-1
in a meta tag, the presence of carriage return characters should not
override that declaration with an incompatible Windows-specific code page,
windows-1252.
The relevant code is:
https://github.com/apache/any23/blob/any23-2.6/encoding/src/main/java/org/apache/any23/encoding/EncodingUtils.java#L62-L69
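To make the proposed behaviour concrete, here is a minimal sketch of the
decision order being suggested. It is not the code in EncodingUtils.java; the
class name CharsetChoiceSketch, the method choose and its parameters are
hypothetical, and the sketch assumes the declared charset (from a meta tag or
other metadata) and the detector's guess are already available as separate
values.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical illustration only; names and structure are assumptions,
// not the actual Any23 implementation.
final class CharsetChoiceSketch {

    // windows-1252 is not one of Java's guaranteed standard charsets,
    // but it ships with common JDK distributions.
    static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    /**
     * Picks a charset, giving a declared charset priority over the
     * carriage-return heuristic.
     *
     * @param declared charset taken from metadata (e.g. a meta tag), if any
     * @param detected charset guessed by the detector
     * @param content  raw document bytes
     */
    static Charset choose(Optional<Charset> declared, Charset detected, byte[] content) {
        // A declared charset wins outright: the heuristic is never consulted.
        if (declared.isPresent()) {
            return declared.get();
        }
        // Only without a declaration does a carriage return promote an
        // ISO-8859-1 guess to windows-1252.
        if (StandardCharsets.ISO_8859_1.equals(detected) && containsCarriageReturn(content)) {
            return WINDOWS_1252;
        }
        return detected;
    }

    // Returns true if the content contains at least one CR (0x0D) byte.
    static boolean containsCarriageReturn(byte[] content) {
        for (byte b : content) {
            if (b == '\r') {
                return true;
            }
        }
        return false;
    }
}
```

Under this ordering, a document with CRLF line endings and a meta tag
declaring ISO-8859-1 would be reported as ISO-8859-1 on every platform, which
is what the two failing tests expect.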
The tests that are failing on Windows and WSL2 are:
[INFO] Results:
[INFO]
[ERROR] Failures:
[ERROR] TikaEncodingDetectorTest.testISO8859HTML:58->assertEncoding:128
Unexpected encoding expected:<[ISO-8859-1]> but was:<[windows-1252]>
[ERROR] TikaEncodingDetectorTest.testISO8859XHTML:63->assertEncoding:128
Unexpected encoding expected:<[ISO-8859-1]> but was:<[windows-1252]>
[INFO]
[ERROR] Tests run: 12, Failures: 2, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Apache Any23 2.6:
[INFO]
[INFO] Apache Any23 ....................................... SUCCESS [01:57 min]
[INFO] Apache Any23 :: Base API ........................... SUCCESS [ 56.016 s]
[INFO] Apache Any23 :: Test Resources ..................... SUCCESS [ 1.068 s]
[INFO] Apache Any23 :: CSV Utilities ...................... SUCCESS [ 2.759 s]
[INFO] Apache Any23 :: Mime Type Detection ................ SUCCESS [01:10 min]
[INFO] Apache Any23 :: Encoding Detection ................. FAILURE [ 4.160 s]
[INFO] Apache Any23 :: Core ............................... SKIPPED
[INFO] Apache Any23 :: CLI ................................ SKIPPED
[INFO] ------------------------------------------------------------------------