On 3/31/06, M.-A. Lemburg <[EMAIL PROTECTED]> wrote:
> Martin v. Löwis wrote:
> > Neal Norwitz wrote:
> >> See http://python.org/sf/1454485 for the gory details. Basically, if
> >> you create a unicode array (array.array('u')) and try to append an
> >> 8-bit string (i.e., not unicode), you can crash the interpreter.
> >>
> >> The problem is that the string is converted without question to a
> >> unicode buffer. Within unicode, it assumes the data to be valid, but
> >> this isn't necessarily the case. We wind up accessing an array with a
> >> negative index and boom.
> >
> > There are several problems combined here, which might need discussion:
> >
> > - why does the 'u#' converter use the buffer interface if available?
> >   It should just support Unicode objects. The buffer object makes
> >   no promise that the buffer actually is meaningful UCS-2/UCS-4, so
> >   u# shouldn't guess that it is.
> >   (FWIW, it currently truncates the buffer size to the next-smaller
> >   multiple of sizeof(Py_UNICODE), and silently so.)
> >
> > I think that part should just go: u# should be restricted to unicode
> > objects.
>
> 'u#' is intended to match 's#', which also uses the buffer
> interface. It expects the buffer returned by the object
> to be a Py_UNICODE* buffer, hence the calculation of the
> length.
>
> However, we already have 'es#', which is a lot safer to use
> in this respect: you can explicitly define the encoding you
> want to see, e.g. 'unicode-internal', and the associated
> codec also takes care of range checks, etc.
>
> So, I'm +1 on restricting 'u#' to Unicode objects.
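
For reference, a minimal sketch of what the 'es#' route looks like at the
C level. The function name is hypothetical and the module method table is
omitted; the point is that the 'unicode-internal' codec validates the data
and raises on bad input, instead of reinterpreting an arbitrary buffer the
way 'u#' does:

    #include "Python.h"

    static PyObject *
    example_take_text(PyObject *self, PyObject *args)
    {
        char *buf = NULL;   /* NULL means PyArg_ParseTuple allocates it */
        int len = 0;        /* length of the decoded buffer, in bytes */

        if (!PyArg_ParseTuple(args, "es#", "unicode-internal", &buf, &len))
            return NULL;

        /* buf now holds len bytes of validated Py_UNICODE data,
           i.e. len / sizeof(Py_UNICODE) characters. */

        PyMem_Free(buf);    /* 'es#' allocates; the caller must free */
        Py_RETURN_NONE;
    }
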
Note: 2.5 no longer crashes, but 2.4 does.

Does this mean you would like to see this patch checked in to 2.5?
What should we do about 2.4?

Index: Python/getargs.c
===================================================================
--- Python/getargs.c    (revision 45333)
+++ Python/getargs.c    (working copy)
@@ -1042,11 +1042,8 @@
                         STORE_SIZE(PyUnicode_GET_SIZE(arg));
                 }
                 else {
-                        char *buf;
-                        Py_ssize_t count = convertbuffer(arg, p, &buf);
-                        if (count < 0)
-                                return converterr(buf, arg, msgbuf, bufsize);
-                        STORE_SIZE(count/(sizeof(Py_UNICODE)));
+                        return converterr("cannot convert raw buffers",
+                                          arg, msgbuf, bufsize);
                 }
                 format++;
         } else {

> > - should Python guarantee that all characters in a Unicode object
> >   are between 0 and sys.maxunicode? Currently, it is possible to
> >   create Unicode strings with either negative or very large Py_UNICODE
> >   elements.
> >
> > - if the answer to the last question is no (i.e. if it is intentional
> >   that a unicode object can contain arbitrary Py_UNICODE values): should
> >   Python then guarantee that Py_UNICODE is an unsigned type?
>
> Py_UNICODE must always be unsigned. The whole implementation
> relies on this and has been designed with this in mind (see
> PEP 100). AFAICT, the configure script does check that Py_UNICODE
> is always unsigned.

Martin fixed the crashing problem in 2.5 by making wchar_t unsigned;
the signed definition was a bug (a configure test was reversed, IIRC).
Can this change to wchar_t be made in 2.4? That technically changes
all the interfaces, even though it was a mistake. What should be done
for 2.4?

n
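
As a side note on the signedness question, a tiny standalone C example
(not CPython code) of why a signed 16-bit Py_UNICODE is dangerous: on
typical two's-complement platforms a character value >= 0x8000 becomes
negative, which is exactly how a lookup can end up using a negative index:

    #include <stdio.h>

    int main(void)
    {
        unsigned short u = 0xFF41;   /* a character outside the ASCII range */
        short s = (short)u;          /* what a signed 16-bit Py_UNICODE would
                                        hold; implementation-defined, but -191
                                        on two's-complement platforms */

        printf("as unsigned index: %d\n", (int)u);   /* 65345 */
        printf("as signed index:   %d\n", (int)s);   /* -191 */
        return 0;
    }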