Hi Yuriy! Am Sun, 24 May 2015 23:08:12 +0300 schrieb "Yuriy M. Kaminskiy" <[email protected]>:
> utf-16 decoder in id3 parser improperly identifies surrogate pairs, > resulting in improper identification of characters in 0xf800-0xfffe > range as leading surrogate and decoding failure. > > E.g. attempt to decode title "「x」~y~" results in: > [id3.c:1065] error: Invalid UTF16 surrogate pair at 0 (0xff62). > and empty parsed title. Could you please send me (mpg123 upstream maintainer) a little (piece of an) example file to add as regression test for this? As ID3 tag writers also have a history of messing up encoding, I'd like to use the original and not a fake I did myself;-) Regarding the patch: Oh, yes, I see stupid me not getting the proper idea about bit masks back in 2006/2007 in this case. Let's recap to be on the safe side: high surrogate range: 0xD800 to 0xDBFF low suggogate range: 0xDC00 to 0xDFFF Do we agree on that or is my knowledge of UTF-16 outdated? I sense that the mask 0xf800 doesn't cover the first range properly, neither. We need to detect bit sequences between 0b1101100000000000 0b1101101111111111 We don't want to catch 0b110111xxxxxxxxxx in there. So a proper mask should be 0b1111110000000000 which is 0xfc00 in hex, too. Verifying the low surrogate range: 0b1101110000000000 0b1101111111111111 The mask 0b1111110000000000 seems appropriate here, too. How convenient. This smells of intelligent design, doesn't it? ;-) So 0xfc00 should be used both for low and high surrogates to properly tell them apart with the additional bit. I'm attaching a revised patch that should enter mpg123 trunk shortly. Feel free to yell and show the error in my current reasoning … Alrighty then, Thomas
Index: src/libmpg123/id3.c
===================================================================
--- src/libmpg123/id3.c (Revision 3642)
+++ src/libmpg123/id3.c (Arbeitskopie)
@@ -1051,10 +1051,10 @@
for(i=0; i < n; i+=2)
{
unsigned long point = ((unsigned long) s[i+high]<<8) + s[i+low];
- if((point & 0xd800) == 0xd800) /* lead surrogate */
+ if((point & 0xfc00) == 0xd800) /* lead surrogate */
{
unsigned short second = (i+3 < l) ? (s[i+2+high]<<8) + s[i+2+low] : 0;
- if((second & 0xdc00) == 0xdc00) /* good... */
+ if((second & 0xfc00) == 0xdc00) /* good... */
{
point = FULLPOINT(point,second);
length += UTF8LEN(point); /* possibly 4 bytes */
@@ -1077,7 +1077,7 @@
for(i=0; i < n; i+=2)
{
unsigned long codepoint = ((unsigned long) s[i+high]<<8) + s[i+low];
- if((codepoint & 0xd800) == 0xd800) /* lead surrogate */
+ if((codepoint & 0xfc00) == 0xd800) /* lead surrogate */
{
unsigned short second = (s[i+2+high]<<8) + s[i+2+low];
codepoint = FULLPOINT(codepoint,second);
pgpNizo58qS9m.pgp
Description: Digitale Signatur von OpenPGP

