Jakub Wilk <jw...@debian.org> writes:

>>The reason is the following (see
>>https://github.com/pediapress/pyfribidi/issues/2):
>>
>> fribidi_utf8_to_unicode consumes at most 3 bytes for a single
>> unicode character, i.e. it does not handle unicode character above
>> 0xffff.
>
> As far as I can see this is not true. In Debian, we allocate 4 bytes
> per characters. (An upstream version, which the Debian package is
> based on, is completely broken in this respect: it allocates a buffer
> of static size. See bug #570068)

Upstream is pretty much dead in this case. I've published our version on
PyPI. However, I didn't ask or inform the original authors about that.

>
>> For a 4 byte utf-8 sequence it will generate 2 unicode characters,
>> which overflows the logical buffer.
>
> I'm confused. What is "it" in your sentence? Why 2 Unicode characters?

"it" refers to the 4-byte UTF-8 sequence.

Here's the inner loop of "fribidi_utf8_to_unicode" from
fribidi-char-sets-utf8.c:

,----
|   length = 0;
|   while ((FriBidiStrIndex) (s - t) < len)
|     {
|       register unsigned char ch = *s;
|       if (ch <= 0x7f)         /* one byte */
|       {
|         *us++ = *s++;
|       }
|       else if (ch <= 0xdf)    /* 2 byte */
|       {
|         *us++ = ((*s & 0x1f) << 6) + (*(s + 1) & 0x3f);
|         s += 2;
|       }
|       else                    /* 3 byte */
|       {
|         *us++ =
|           ((int) (*s & 0x0f) << 12) +
|           ((*(s + 1) & 0x3f) << 6) + (*(s + 2) & 0x3f);
|         s += 3;
|       }
|       length++;
|     }
`----

Assume you have a 4-byte UTF-8 sequence. One loop step consumes at most
3 bytes of it (there's no "4 byte" case), leaving 1 byte of that
sequence for further processing. This leftover byte generates another
Unicode character. pyfribidi uses the length of the Python unicode
string as the buffer size, which is less than the number of characters
fribidi_utf8_to_unicode generates. And there you have your buffer
overflow.
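
To make the counting concrete, here is a minimal sketch that mirrors only
the branch structure of the loop above and stores nothing. Both helper
names are made up for illustration and are not part of fribidi or
pyfribidi; the second one assumes well-formed UTF-8 input.

```c
#include <stddef.h>

/* Illustrative helper (not part of fribidi): number of output characters
   a decoder with only 1-, 2- and 3-byte cases emits for len bytes of
   input. Only the lead byte of each step is inspected. */
static size_t units_emitted(const unsigned char *s, size_t len)
{
    size_t i = 0, units = 0;

    while (i < len) {
        unsigned char ch = s[i];

        if (ch <= 0x7f)
            i += 1;     /* one byte */
        else if (ch <= 0xdf)
            i += 2;     /* 2 byte; a leftover continuation byte lands here */
        else
            i += 3;     /* 3 byte; also taken for a 4-byte lead byte */
        units++;
    }
    return units;
}

/* Illustrative helper: number of code points, counted as the bytes that
   are not continuation bytes (10xxxxxx). Assumes well-formed UTF-8. */
static size_t code_points(const unsigned char *s, size_t len)
{
    size_t i, n = 0;

    for (i = 0; i < len; i++)
        if ((s[i] & 0xc0) != 0x80)
            n++;
    return n;
}
```

For U+1F600 (bytes f0 9f 98 80), units_emitted() returns 2 while
code_points() returns 1, so a buffer sized for one character would
receive two.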

To confirm the issue, you can add an assert that checks that
fribidi_utf8_to_unicode's return value (the length of the string) equals
unicode_length.

>
> Anyway I tried to double the buffer size (8 bytes per characters of
> original string) but this didn't fix the crash. So likely the problem
> lies somewhere else.

I'm pretty sure my analysis is correct, and I'm not quite sure what
you did here.

-- 
Cheers
Ralf


