[ https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15765034#comment-15765034 ]
Pascal Essiembre commented on TIKA-2219:
----------------------------------------

I am relying on CharsetDetector. Thanks for the fix!

> CharsetDetector no longer detects windows-1252 charset
> ------------------------------------------------------
>
>                 Key: TIKA-2219
>                 URL: https://issues.apache.org/jira/browse/TIKA-2219
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.14
>         Environment: Any.
>            Reporter: Pascal Essiembre
>            Priority: Minor
>             Fix For: 2.0, 1.15
>
>
> Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is
> always detected instead. While not tested, this likely affects other
> windows-125* encodings as well.
> I tracked it down to a change in the
> {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method. It now always
> returns "ISO-8859-1", whereas before it was:
> {{return haveC1Bytes ? "windows-1252" : "ISO-8859-1";}}
> That condition has been moved to the {{match(CharsetDetector det)}} method
> so that the returned CharsetMatch has the proper name. The problem is that
> the {{CharsetDetector#detectAll()}} method overwrites the correct match with
> a new one that returns the value of {{#getName()}} from the
> {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).
> There might be legitimate reasons why the {{CharsetMatch}} instances in the
> {{detectAll()}} method are replaced with new ones, but changing this code in
> that method appears to work for me:
>     // Remove this:
>     // CharsetMatch m = new CharsetMatch(this, csr, confidence);
>     // matches.add(m);
>     // Add this instead:
>     matches.add(charsetMatch);

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
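
For readers who want to see the mechanism in isolation, here is a minimal, self-contained sketch of the behaviour described in the issue. It does not use Tika's classes: `Match`, `Latin1Recognizer`, `detectAllReplacing` and `detectAllPreserving` are hypothetical stand-ins, the C1 byte range (0x80 to 0x9F) is assumed from the `haveC1Bytes` name, and the confidence value of 50 is arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the reported behaviour; these are NOT Tika's actual classes. */
final class CharsetMatchSketch {

    /** Hypothetical stand-in for a CharsetMatch: a charset name plus a confidence. */
    static final class Match {
        final String name;
        final int confidence;

        Match(String name, int confidence) {
            this.name = name;
            this.confidence = confidence;
        }
    }

    /** Hypothetical stand-in for the ISO-8859-1 recognizer. */
    static final class Latin1Recognizer {
        // After the change described in the issue, the recognizer-level name is fixed.
        String getName() {
            return "ISO-8859-1";
        }

        // The C1 check lives in match(), so the match it returns carries the right name.
        Match match(byte[] input) {
            boolean haveC1Bytes = false;
            for (byte b : input) {
                int v = b & 0xFF;
                if (v >= 0x80 && v <= 0x9F) { // assumed C1 range
                    haveC1Bytes = true;
                    break;
                }
            }
            // 50 is an arbitrary confidence chosen for this sketch.
            return new Match(haveC1Bytes ? "windows-1252" : "ISO-8859-1", 50);
        }
    }

    /** Models the problem: the recognizer's match is discarded and rebuilt from getName(). */
    static List<Match> detectAllReplacing(byte[] input) {
        Latin1Recognizer csr = new Latin1Recognizer();
        List<Match> matches = new ArrayList<>();
        Match charsetMatch = csr.match(input);
        matches.add(new Match(csr.getName(), charsetMatch.confidence));
        return matches;
    }

    /** Models the proposed change: the recognizer's own match is kept as-is. */
    static List<Match> detectAllPreserving(byte[] input) {
        Latin1Recognizer csr = new Latin1Recognizer();
        List<Match> matches = new ArrayList<>();
        Match charsetMatch = csr.match(input);
        matches.add(charsetMatch);
        return matches;
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are curly quotes in windows-1252 and fall in the C1 range.
        byte[] sample = {(byte) 0x93, 'h', 'i', (byte) 0x94};
        System.out.println(detectAllReplacing(sample).get(0).name);  // prints "ISO-8859-1"
        System.out.println(detectAllPreserving(sample).get(0).name); // prints "windows-1252"
    }
}
```

In this toy model the only difference between the two methods is which object is added to the list, which is the same one-line change proposed in the issue.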