[ https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15765412#comment-15765412 ]
Hudson commented on TIKA-2219:
------------------------------

FAILURE: Integrated in Jenkins build tika-2.x-windows #86 (See [https://builds.apache.org/job/tika-2.x-windows/86/])
TIKA-2219 make sure to transmit charset name in detectAll via Pascal (tallison: rev 54154e0045066dfb50a10d158090262acaabaaba)
* (edit) tika-parser-modules/tika-parser-text-module/src/test/java/org/apache/tika/parser/txt/CharsetDetectorTest.java


> CharsetDetector no longer detects windows-1252 charset
> ------------------------------------------------------
>
>                 Key: TIKA-2219
>                 URL: https://issues.apache.org/jira/browse/TIKA-2219
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.14
>         Environment: Any.
>            Reporter: Pascal Essiembre
>            Priority: Minor
>             Fix For: 2.0, 1.15
>
>
> Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is
> always detected instead. While not tested, this likely affects other
> windows-125* encodings as well.
> I tracked it down to a change in the
> {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method. Now it always
> returns "ISO-8859-1" whereas before it was:
> {{return haveC1Bytes ? "windows-1252" : "ISO-8859-1";}}
> Now that condition has been moved to the {{match(CharsetDetector det)}}
> method so that the returned CharsetMatch has the proper name. The problem
> with that is {{CharsetDetector#detectAll()}} method overwrites the correct
> match with a new one that will return the value of {{#getName()}} from the
> {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).
> There might be legitimate reasons why the {{CharsetMatch}} instances in
> {{detectAll()}} method are replaced with new ones, but changing this code in
> that method appears to work for me:
>             // Remove this:
>             // CharsetMatch m = new CharsetMatch(this, csr, confidence);
>             // matches.add(m);
>             // Add this instead:
>             matches.add(charsetMatch);



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
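
For readers who want to see the reported behaviour in isolation, here is a minimal sketch of the mechanism described in the report above. It does not use Tika's classes: `DetectAllSketch`, `Match`, `Latin1Recognizer`, `detectAllRebuilding` and `detectAllKeeping` are hypothetical stand-ins, the fixed confidence of 50 is a placeholder, and treating bytes 0x80 to 0x9F as the "C1 bytes" is an assumption about what `haveC1Bytes` tracks.

```java
import java.util.ArrayList;
import java.util.List;

public class DetectAllSketch {

    /** Hypothetical stand-in for a charset match: a name plus a confidence. */
    public static final class Match {
        public final String name;
        public final int confidence;

        public Match(String name, int confidence) {
            this.name = name;
            this.confidence = confidence;
        }
    }

    /** Hypothetical stand-in for the ISO-8859-1 recognizer from the report. */
    public static final class Latin1Recognizer {

        /** Always returns the generic name, as the report describes for 1.14. */
        public String getName() {
            return "ISO-8859-1";
        }

        /** The name is chosen here; bytes 0x80-0x9F are assumed to be the C1 bytes. */
        public Match match(byte[] input) {
            boolean haveC1Bytes = false;
            for (byte b : input) {
                int v = b & 0xFF;
                if (v >= 0x80 && v <= 0x9F) {
                    haveC1Bytes = true;
                    break;
                }
            }
            // 50 is a placeholder confidence, not a value taken from Tika.
            return new Match(haveC1Bytes ? "windows-1252" : "ISO-8859-1", 50);
        }
    }

    /** Reported behaviour: the match is rebuilt from getName(), so the name is lost. */
    public static List<Match> detectAllRebuilding(Latin1Recognizer csr, byte[] input) {
        List<Match> matches = new ArrayList<>();
        Match charsetMatch = csr.match(input);
        matches.add(new Match(csr.getName(), charsetMatch.confidence));
        return matches;
    }

    /** Suggested change: keep the match object the recognizer returned. */
    public static List<Match> detectAllKeeping(Latin1Recognizer csr, byte[] input) {
        List<Match> matches = new ArrayList<>();
        matches.add(csr.match(input));
        return matches;
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are curly quotes in windows-1252 and fall in the C1 range.
        byte[] input = {(byte) 0x93, 'h', 'i', (byte) 0x94};
        Latin1Recognizer csr = new Latin1Recognizer();
        System.out.println(detectAllRebuilding(csr, input).get(0).name); // ISO-8859-1
        System.out.println(detectAllKeeping(csr, input).get(0).name);    // windows-1252
    }
}
```

In this sketch the only difference between the two methods is whether the recognizer's own match is added to the list or replaced by one built from `getName()`. That is the one-line change suggested in the report.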