[ https://issues.apache.org/jira/browse/TIKA-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176070#comment-13176070 ]
Nick Burch commented on TIKA-793:
---------------------------------

I've tracked this to two bugs. Both relate to the handling of UTF-16 encoded strings.

I've fixed the first in r1224865, which was a problem in the null termination stripping.

The second is the handling of the COMM (Comment) tag, which contains both a language and text. We don't currently support the language being encoded differently from the text; that remains to be fixed (and really needs a test file too).

> Invalid ASCII character (65533) when retriving MP3 metadata
> -----------------------------------------------------------
>
>                 Key: TIKA-793
>                 URL: https://issues.apache.org/jira/browse/TIKA-793
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata, parser
>    Affects Versions: 1.0
>         Environment: Ubuntu 10.04 (x64), Android (2.2 +)
>            Reporter: William Seemann
>            Priority: Minor
>         Attachments: TikaTest.java
>
>
> When extracting metadata from certain mp3's (the id3 version appears to be
> 2.4) I'm seeing invalid characters at the end of the parsed fields. For
> example:
> American M�
> which should be:
> American Me
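
For illustration, here is a minimal sketch of how stripping a null terminator from a UTF-16 string can go wrong. It is not the Tika code changed in r1224865; the class and method names are hypothetical, and it assumes a UTF-16LE field ending in a two-byte 0x00 0x00 terminator. Dropping trailing 0x00 bytes one at a time also removes the high byte of a final character such as "e" (0x65 0x00), leaving an odd-length buffer whose last byte decodes to U+FFFD (65533). Stripping whole two-byte units avoids that.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch; names are illustrative and not taken from Tika.
class Utf16TerminatorSketch {

    // Naive approach: drop every trailing 0x00 byte. This suits single-byte
    // encodings, but it also eats the high byte of a final UTF-16LE character.
    static byte[] stripTrailingNullBytes(byte[] data) {
        int end = data.length;
        while (end > 0 && data[end - 1] == 0) {
            end--;
        }
        return Arrays.copyOf(data, end);
    }

    // UTF-16 aware approach: only drop complete 0x00 0x00 code units
    // that sit on a two-byte boundary.
    static byte[] stripUtf16Terminators(byte[] data) {
        int end = data.length;
        while (end >= 2 && end % 2 == 0
                && data[end - 1] == 0 && data[end - 2] == 0) {
            end -= 2;
        }
        return Arrays.copyOf(data, end);
    }

    // Builds an assumed sample field: "American Me" in UTF-16LE followed by
    // a two-byte null terminator (Arrays.copyOf pads with zero bytes).
    static byte[] sampleField() {
        byte[] text = "American Me".getBytes(StandardCharsets.UTF_16LE);
        return Arrays.copyOf(text, text.length + 2);
    }

    // Decodes as UTF-16LE; malformed input is replaced, not rejected.
    static String decode(byte[] data) {
        return new String(data, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] field = sampleField();
        System.out.println(decode(stripTrailingNullBytes(field)));
        System.out.println(decode(stripUtf16Terminators(field)));
    }
}
```

The two-byte check only applies to UTF-16 encoded fields; single-byte encodings such as ISO-8859-1 would still use a one-byte terminator.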