Hi Yuriy!

Am Sun, 24 May 2015 23:08:12 +0300
schrieb "Yuriy M. Kaminskiy" <yum...@gmail.com>: 

> utf-16 decoder in id3 parser improperly identifies surrogate pairs, 
> resulting in improper identification of characters in 0xf800-0xfffe 
> range as leading surrogate and decoding failure.
> 
> E.g. attempt to decode title "「x」~y~" results in:
> [id3.c:1065] error: Invalid UTF16 surrogate pair at 0 (0xff62).
> and empty parsed title.

Could you please send me (mpg123 upstream maintainer) a little (piece
of an) example file to add as regression test for this? As ID3 tag
writers also have a history of messing up encoding, I'd like to use the
original and not a fake I did myself;-)

Regarding the patch: Oh, yes, I see stupid me not getting the proper
idea about bit masks back in 2006/2007 in this case.

Let's recap to be on the safe side:

high surrogate range: 0xD800 to 0xDBFF
 low suggogate range: 0xDC00 to 0xDFFF

Do we agree on that or is my knowledge of UTF-16 outdated?

I sense that the mask 0xf800 doesn't cover the first range properly,
neither. We need to detect bit sequences between

0b1101100000000000
0b1101101111111111

We don't want to catch

0b110111xxxxxxxxxx

in there. So a proper mask should be

0b1111110000000000

which is 0xfc00 in hex, too. Verifying the low surrogate range:

0b1101110000000000
0b1101111111111111

The mask

0b1111110000000000

seems appropriate here, too. How convenient. This smells of intelligent
design, doesn't it? ;-) So 0xfc00 should be used both for low and high
surrogates to properly tell them apart with the additional bit.

I'm attaching a revised patch that should enter mpg123 trunk shortly.

Feel free to yell and show the error in my current reasoning …


Alrighty then,

Thomas

Index: src/libmpg123/id3.c
===================================================================
--- src/libmpg123/id3.c	(Revision 3642)
+++ src/libmpg123/id3.c	(Arbeitskopie)
@@ -1051,10 +1051,10 @@
 	for(i=0; i < n; i+=2)
 	{
 		unsigned long point = ((unsigned long) s[i+high]<<8) + s[i+low];
-		if((point & 0xd800) == 0xd800) /* lead surrogate */
+		if((point & 0xfc00) == 0xd800) /* lead surrogate */
 		{
 			unsigned short second = (i+3 < l) ? (s[i+2+high]<<8) + s[i+2+low] : 0;
-			if((second & 0xdc00) == 0xdc00) /* good... */
+			if((second & 0xfc00) == 0xdc00) /* good... */
 			{
 				point = FULLPOINT(point,second);
 				length += UTF8LEN(point); /* possibly 4 bytes */
@@ -1077,7 +1077,7 @@
 	for(i=0; i < n; i+=2)
 	{
 		unsigned long codepoint = ((unsigned long) s[i+high]<<8) + s[i+low];
-		if((codepoint & 0xd800) == 0xd800) /* lead surrogate */
+		if((codepoint & 0xfc00) == 0xd800) /* lead surrogate */
 		{
 			unsigned short second = (s[i+2+high]<<8) + s[i+2+low];
 			codepoint = FULLPOINT(codepoint,second);

Attachment: pgpqDpi42mv8d.pgp
Description: Digitale Signatur von OpenPGP

_______________________________________________
pkg-multimedia-maintainers mailing list
pkg-multimedia-maintainers@lists.alioth.debian.org
http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-multimedia-maintainers

Reply via email to