Re: [Python-Dev] unicode vs buffer (array) design issue can crash interpreter

2006-04-14 Thread M.-A. Lemburg
Martin v. Löwis wrote:
> M.-A. Lemburg wrote:
>> I'd argue that such code is broken anyway: 3rd party code simply
>> cannot make any assumptions on the typedef behind Py_UNICODE.
> 
> The code would work just fine now, but will stop working with the
> change. Users of the code might not be open to arguments.

Fair enough. Let's leave things as they are for 2.4, then.

> In any case, it's up to the release manager to decide.
> 
>> Note that you'd only see this change when compiling Python in the
>> non-standard UCS4 setting on Linux.
> 
> Sure. That happens to be the default on many Linux distributions,
> though.

Unfortunately, that's true.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Apr 14 2006)
>>> Python/Zope Consulting and Support ...http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] unicode vs buffer (array) design issue can crash interpreter

2006-04-14 Thread Martin v. Löwis
M.-A. Lemburg wrote:
> I'd argue that such code is broken anyway: 3rd party code simply
> cannot make any assumptions on the typedef behind Py_UNICODE.

The code would work just fine now, but will stop working with the
change. Users of the code might not be open to arguments.

In any case, it's up to the release manager to decide.

> Note that you'd only see this change when compiling Python in the
> non-standard UCS4 setting on Linux.

Sure. That happens to be the default on many Linux distributions,
though.

Regards,
Martin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] unicode vs buffer (array) design issue can crash interpreter

2006-04-14 Thread M.-A. Lemburg
Martin v. Löwis wrote:
> Neal Norwitz wrote:
>> I'll leave this decision to Martin or someone else, since I'm not
>> familiar with the ramifications.  Since it was documented as unsigned,
>> I think it's reasonable to consider changing.  Though it could create
>> signed-ness warnings in other modules.  I'm not sure but it's possible
>> it could create problems for C++ compilers since they are pickier.
> 
> My concern is not so much that it becomes unsigned in 2.4.4, but that
> it stops being a typedef for wchar_t on Linux. C++ code that uses that
> assumption might stop compiling.

I'd argue that such code is broken anyway: 3rd party code simply
cannot make any assumptions on the typedef behind Py_UNICODE.

Note that you'd only see this change when compiling Python in the
non-standard UCS4 setting on Linux.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Apr 14 2006)
>>> Python/Zope Consulting and Support ...http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] unicode vs buffer (array) design issue can crash interpreter

2006-04-14 Thread Martin v. Löwis
Neal Norwitz wrote:
> I'll leave this decision to Martin or someone else, since I'm not
> familiar with the ramifications.  Since it was documented as unsigned,
> I think it's reasonable to consider changing.  Though it could create
> signed-ness warnings in other modules.  I'm not sure but it's possible
> it could create problems for C++ compilers since they are pickier.

My concern is not so much that it becomes unsigned in 2.4.4, but that
it stops being a typedef for wchar_t on Linux. C++ code that uses that
assumption might stop compiling.

Regards,
Martin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] unicode vs buffer (array) design issue can crash interpreter

2006-04-13 Thread Neal Norwitz
On 4/13/06, M.-A. Lemburg <[EMAIL PROTECTED]> wrote:
> >
> > Does this mean you would like to see this patch checked in to 2.5?
>
> Yes.

Ok, I checked this in to 2.5 (minus the syntax error).

> I also think that changing the type from signed to unsigned
> by backporting the configure fix will only make things safer
> for the user, since extensions will probably not even be aware
> of the fact that Py_UNICODE could be signed (it has always been
> documented to be unsigned).
>
> So +1 on backporting the configure test fix to 2.4.

I'll leave this decision to Martin or someone else, since I'm not
familiar with the ramifications.  Since it was documented as unsigned,
I think it's reasonable to consider changing.  Though it could create
signed-ness warnings in other modules.  I'm not sure but it's possible
it could create problems for C++ compilers since they are pickier.

n
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] unicode vs buffer (array) design issue can crash interpreter

2006-04-13 Thread M.-A. Lemburg
Neal Norwitz wrote:
> On 3/31/06, M.-A. Lemburg <[EMAIL PROTECTED]> wrote:
>> Martin v. Löwis wrote:
>>> Neal Norwitz wrote:
 See http://python.org/sf/1454485 for the gory details.  Basically if
 you create a unicode array (array.array('u')) and try to append an
 8-bit string (ie, not unicode), you can crash the interpreter.

 The problem is that the string is converted without question to a
 unicode buffer.  Within unicode, it assumes the data to be valid, but
 this isn't necessarily the case.  We wind up accessing an array with a
 negative index and boom.
>>> There are several problems combined here, which might need discussion:
>>>
>>> - why does the 'u#' converter use the buffer interface if available?
>>>   it should just support Unicode objects. The buffer object makes
>>>   no promise that the buffer actually is meaningful UCS-2/UCS-4, so
>>>   u# shouldn't guess that it is.
>>>   (FWIW, it currently truncates the buffer size to the next-smaller
>>>multiple of sizeof(Py_UNICODE), and silently so)
>>>
>>>   I think that part should just go: u# should be restricted to unicode
>>>   objects.
>> 'u#' is intended to match 's#' which also uses the buffer
>> interface. It expects the buffer returned by the object
>> to a be a Py_UNICODE* buffer, hence the calculation of the
>> length.
>>
>> However, we already have 'es#' which is a lot safer to use
>> in this respect: you can explicity define the encoding you
>> want to see, e.g. 'unicode-internal' and the associated
>> codec also takes care of range checks, etc.
>>
>> So, I'm +1 on restricting 'u#' to Unicode objects.
> 
> Note:  2.5 no longer crashes, 2.4 does.
> 
> Does this mean you would like to see this patch checked in to 2.5? 

Yes.

> What should we do about 2.4?

Perhaps you could add a warning that is displayed when
using u# with non-Unicode objects ?!

> Index: Python/getargs.c
> ===
> --- Python/getargs.c(revision 45333)
> +++ Python/getargs.c(working copy)
> @@ -1042,11 +1042,8 @@
> STORE_SIZE(PyUnicode_GET_SIZE(arg));
> }
> else {
> -   char *buf;
> -   Py_ssize_t count = convertbuffer(arg, p, &buf);
> -   if (count < 0)
> -   return converterr(buf, arg, msgbuf, bufsize);
> -   STORE_SIZE(count/(sizeof(Py_UNICODE)));
> +   return converterr("cannot convert raw 
> buffers"",
> + arg, msgbuf, bufsize);
> }
> format++;
> } else {
> 
>>> - should Python guarantee that all characters in a Unicode object
>>>   are between 0 and sys.maxunicode? Currently, it is possible to
>>>   create Unicode strings with either negative or very large Py_UNICODE
>>>   elements.
>>>
>>> - if the answer to the last question is no (i.e. if it is intentional
>>>   that a unicode object can contain arbitrary Py_UNICODE values): should
>>>   Python then guarantee that Py_UNICODE is an unsigned type?
>> Py_UNICODE must always be unsigned. The whole implementation
>> relies on this and has been designed with this in mind (see
>> PEP 100). AFAICT, the configure does check that Py_UNICODE
>> is always unsigned.
> 
> Martin fixed the crashing problem in 2.5 by making wchar_t unsigned
> which was a bug.  (A configure test was reversed IIRC.)  Can this
> change to wchar_t be made in 2.4?  That technically changes all the
> interfaces even though it was a mistake.  What should be done for 2.4?

If users want to interface from wchar_t to Python's Unicode
type they have to go through the PyUnicode_FromWideChar()
and PyUnicode_AsWideChar() interfaces. Assuming that Py_UNICODE
is the same as wchar_t is simply wrong (and always was).

I also think that changing the type from signed to unsigned
by backporting the configure fix will only make things safer
for the user, since extensions will probably not even be aware
of the fact that Py_UNICODE could be signed (it has always been
documented to be unsigned).

So +1 on backporting the configure test fix to 2.4.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Apr 13 2006)
>>> Python/Zope Consulting and Support ...http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] unicode vs buffer (array) design issue can crash interpreter

2006-04-12 Thread Neal Norwitz
On 3/31/06, M.-A. Lemburg <[EMAIL PROTECTED]> wrote:
> Martin v. Löwis wrote:
> > Neal Norwitz wrote:
> >> See http://python.org/sf/1454485 for the gory details.  Basically if
> >> you create a unicode array (array.array('u')) and try to append an
> >> 8-bit string (ie, not unicode), you can crash the interpreter.
> >>
> >> The problem is that the string is converted without question to a
> >> unicode buffer.  Within unicode, it assumes the data to be valid, but
> >> this isn't necessarily the case.  We wind up accessing an array with a
> >> negative index and boom.
> >
> > There are several problems combined here, which might need discussion:
> >
> > - why does the 'u#' converter use the buffer interface if available?
> >   it should just support Unicode objects. The buffer object makes
> >   no promise that the buffer actually is meaningful UCS-2/UCS-4, so
> >   u# shouldn't guess that it is.
> >   (FWIW, it currently truncates the buffer size to the next-smaller
> >multiple of sizeof(Py_UNICODE), and silently so)
> >
> >   I think that part should just go: u# should be restricted to unicode
> >   objects.
>
> 'u#' is intended to match 's#' which also uses the buffer
> interface. It expects the buffer returned by the object
> to a be a Py_UNICODE* buffer, hence the calculation of the
> length.
>
> However, we already have 'es#' which is a lot safer to use
> in this respect: you can explicity define the encoding you
> want to see, e.g. 'unicode-internal' and the associated
> codec also takes care of range checks, etc.
>
> So, I'm +1 on restricting 'u#' to Unicode objects.

Note:  2.5 no longer crashes, 2.4 does.

Does this mean you would like to see this patch checked in to 2.5? 
What should we do about 2.4?

Index: Python/getargs.c
===
--- Python/getargs.c(revision 45333)
+++ Python/getargs.c(working copy)
@@ -1042,11 +1042,8 @@
STORE_SIZE(PyUnicode_GET_SIZE(arg));
}
else {
-   char *buf;
-   Py_ssize_t count = convertbuffer(arg, p, &buf);
-   if (count < 0)
-   return converterr(buf, arg, msgbuf, bufsize);
-   STORE_SIZE(count/(sizeof(Py_UNICODE)));
+   return converterr("cannot convert raw buffers"",
+ arg, msgbuf, bufsize);
}
format++;
} else {

> > - should Python guarantee that all characters in a Unicode object
> >   are between 0 and sys.maxunicode? Currently, it is possible to
> >   create Unicode strings with either negative or very large Py_UNICODE
> >   elements.
> >
> > - if the answer to the last question is no (i.e. if it is intentional
> >   that a unicode object can contain arbitrary Py_UNICODE values): should
> >   Python then guarantee that Py_UNICODE is an unsigned type?
>
> Py_UNICODE must always be unsigned. The whole implementation
> relies on this and has been designed with this in mind (see
> PEP 100). AFAICT, the configure does check that Py_UNICODE
> is always unsigned.

Martin fixed the crashing problem in 2.5 by making wchar_t unsigned
which was a bug.  (A configure test was reversed IIRC.)  Can this
change to wchar_t be made in 2.4?  That technically changes all the
interfaces even though it was a mistake.  What should be done for 2.4?

n
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] unicode vs buffer (array) design issue can crash interpreter

2006-03-31 Thread M.-A. Lemburg
Martin v. Löwis wrote:
> Neal Norwitz wrote:
>> See http://python.org/sf/1454485 for the gory details.  Basically if
>> you create a unicode array (array.array('u')) and try to append an
>> 8-bit string (ie, not unicode), you can crash the interpreter.
>>
>> The problem is that the string is converted without question to a
>> unicode buffer.  Within unicode, it assumes the data to be valid, but
>> this isn't necessarily the case.  We wind up accessing an array with a
>> negative index and boom.
> 
> There are several problems combined here, which might need discussion:
> 
> - why does the 'u#' converter use the buffer interface if available?
>   it should just support Unicode objects. The buffer object makes
>   no promise that the buffer actually is meaningful UCS-2/UCS-4, so
>   u# shouldn't guess that it is.
>   (FWIW, it currently truncates the buffer size to the next-smaller
>multiple of sizeof(Py_UNICODE), and silently so)
> 
>   I think that part should just go: u# should be restricted to unicode
>   objects.

'u#' is intended to match 's#' which also uses the buffer
interface. It expects the buffer returned by the object
to a be a Py_UNICODE* buffer, hence the calculation of the
length.

However, we already have 'es#' which is a lot safer to use
in this respect: you can explicity define the encoding you
want to see, e.g. 'unicode-internal' and the associated
codec also takes care of range checks, etc.

So, I'm +1 on restricting 'u#' to Unicode objects.

> - should Python guarantee that all characters in a Unicode object
>   are between 0 and sys.maxunicode? Currently, it is possible to
>   create Unicode strings with either negative or very large Py_UNICODE
>   elements.
> 
> - if the answer to the last question is no (i.e. if it is intentional
>   that a unicode object can contain arbitrary Py_UNICODE values): should
>   Python then guarantee that Py_UNICODE is an unsigned type?

Py_UNICODE must always be unsigned. The whole implementation
relies on this and has been designed with this in mind (see
PEP 100). AFAICT, the configure does check that Py_UNICODE
is always unsigned.

Regarding the permitted range of values, I think the necessary
overhead to check that all Py_UNICODE* array values are within
the currently permitted range would unnecessarily slow down
the implementation.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Mar 31 2006)
>>> Python/Zope Consulting and Support ...http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] unicode vs buffer (array) design issue can crash interpreter

2006-03-30 Thread Martin v. Löwis
Neal Norwitz wrote:
> See http://python.org/sf/1454485 for the gory details.  Basically if
> you create a unicode array (array.array('u')) and try to append an
> 8-bit string (ie, not unicode), you can crash the interpreter.
> 
> The problem is that the string is converted without question to a
> unicode buffer.  Within unicode, it assumes the data to be valid, but
> this isn't necessarily the case.  We wind up accessing an array with a
> negative index and boom.

There are several problems combined here, which might need discussion:

- why does the 'u#' converter use the buffer interface if available?
  it should just support Unicode objects. The buffer object makes
  no promise that the buffer actually is meaningful UCS-2/UCS-4, so
  u# shouldn't guess that it is.
  (FWIW, it currently truncates the buffer size to the next-smaller
   multiple of sizeof(Py_UNICODE), and silently so)

  I think that part should just go: u# should be restricted to unicode
  objects.

- should Python guarantee that all characters in a Unicode object
  are between 0 and sys.maxunicode? Currently, it is possible to
  create Unicode strings with either negative or very large Py_UNICODE
  elements.

- if the answer to the last question is no (i.e. if it is intentional
  that a unicode object can contain arbitrary Py_UNICODE values): should
  Python then guarantee that Py_UNICODE is an unsigned type?

Regards,
Martin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com