On Sat, Sep 4, 2010 at 8:29 AM, Stefan Behnel <[email protected]> wrote:
> Carl Witty, 25.08.2010 22:21:
>> On Wed, Aug 25, 2010 at 12:15 PM, Stefan Behnel wrote:
>>> Lisandro Dalcin, 25.08.2010 20:28:
>>>> When trying to cythonize my code using the -3 flag, I got many errors
>>>> like the one below:
>>>>
>>>> Error converting Pyrex file to C:
>>>> ------------------------------------------------------------
>>>> ...
>>>>       if not (<int>PetscInitializeCalled): return
>>>>       if (<int>PetscFinalizeCalled): return
>>>>       # deinstall custom error handler
>>>>       ierr = PetscPopErrorHandlerPython()
>>>>       if ierr != 0:
>>>>           fprintf(stderr, "PetscPopErrorHandler() failed "
>>>>                          ^
>>>> ------------------------------------------------------------
>>>>
>>>> /u/dalcinl/Devel/petsc4py-dev/src/PETSc/PETSc.pyx:307:24: Unicode
>>>> literals do not support coercion to C types other than Py_UNICODE.
>>>
>>> Right, the parser reads the literal as unicode string here before type
>>> analysis figures out that it's really meant to be a bytes literal.
>>>
>>> This will be hard to change as recovering the original bytes literal is
>>> impossible once it's converted to a unicode string (remember that you can
>>> use arbitrary character escape sequences in the literal). So I'm leaning
>>> towards keeping this as an error. After all, Unicode string literals are one
>>> of the things that a user explicitly requests with the -3 switch.
>>
>> How about allowing it for ASCII literals and leaving it an error if
>> there are any codepoints in the literal outside the 0-127 range?
>
> It's not so unlikely that you find C (data) strings that contain (escaped)
> non-ASCII characters. Those strings would need a 'b' prefix then. So you'd
> end up with some C strings that work without prefix and others for which
> you need a 'b', even if both clearly occur in a C char* context.

In my experience, non-ASCII literals are even more uncommon than
non-ASCII user data, but it would be really nice to at least handle
the ASCII case smoothly.

> The problem is, unprefixed string literals found in source code compiled by
> Cython are equally likely to be meant as unicode strings, byte strings, C
> strings or polymorphic strings these days. There isn't one obvious "do what I
> mean" way. Remember that Lisandro brought this up because Cython reported
> an *error* when compiling the code. I find that a lot better than silently
> accepting something that may not have been meant that way.
>
> One thing we could do, however, is to parse all (unprefixed?) strings as
> both unicode strings *and* byte strings. That would induce a (minor) bit of
> overhead in the parser (both in terms of memory and speed), but it would
> allow us to recover the original byte sequence of a Unicode string during
> type analysis if we find that we need to coerce it to a byte string.
>
> In case we need to, we could then even write both types of byte sequences
> into the string constant table in the C file, so that we can recover the
> exact byte sequence and the correct Unicode character sequence depending on
> the CPython runtime.
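
To make the dual-parsing idea quoted above concrete, here is a minimal
sketch in plain Python. It is not Cython's parser: the function name
parse_both and its source_encoding parameter are made up for
illustration, and it only understands the \xNN and \\ escapes. It shows
why the byte sequence cannot be recovered from the unicode value alone:
an escaped "\xe9" and a literal "é" give the same unicode string but
different bytes.

```python
def parse_both(body, source_encoding="utf-8"):
    """Parse the text between the quotes of a literal two ways at once.

    Returns (unicode_value, bytes_value). Hypothetical helper: only the
    \\xNN and \\\\ escapes are handled, everything else is copied through.
    """
    text = []
    data = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        nxt = body[i + 1:i + 2]
        if ch == "\\" and nxt == "x":
            # \xNN is code point NN in a unicode string, byte NN in bytes.
            code = int(body[i + 2:i + 4], 16)
            text.append(chr(code))
            data.append(code)
            i += 4
        elif ch == "\\" and nxt == "\\":
            text.append("\\")
            data.append(0x5C)
            i += 2
        else:
            # An unescaped character keeps its source-file byte encoding.
            text.append(ch)
            data.extend(ch.encode(source_encoding))
            i += 1
    return "".join(text), bytes(data)


escaped = parse_both(r"caf\xe9")   # escape written in the source
literal = parse_both("caf\u00e9")  # character written directly
```

Both calls return the same unicode value, but the byte values differ
(one byte for the escape, two UTF-8 bytes for the literal character),
so keeping both around is what makes a later coercion to bytes exact.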

How about we parse the literals as unicode strings, and if used in a
bytes context we raise a compile time error if any character does not
fit in a char? Thus "\u0001" would still be OK in a bytes context, but
"\u1000" would not be. It may even be better to set the limit to 127,
as that is the truly unambiguous range, and require a 'b' prefix if you
really want something more.
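
A rough sketch of that check, in plain Python rather than actual Cython
compiler code. The names LiteralCoercionError and
coerce_unicode_literal_to_bytes are invented for the example, and the
limit parameter selects between the two thresholds discussed (127 for
ASCII only, 255 for anything that fits in a char).

```python
class LiteralCoercionError(Exception):
    """Hypothetical stand-in for a compile-time error."""


def coerce_unicode_literal_to_bytes(value, limit=127):
    """Return the bytes for a unicode literal used in a bytes context.

    `value` is the already-unescaped unicode literal. `limit` is the
    largest code point accepted without an explicit prefix.
    """
    if limit > 255:
        raise ValueError("limit must fit in a single byte")
    for index, char in enumerate(value):
        if ord(char) > limit:
            raise LiteralCoercionError(
                "code point U+%04X at index %d needs an explicit b prefix"
                % (ord(char), index))
    # Every code point is <= limit <= 255, so Latin-1 maps each one to
    # the byte with the same numeric value.
    return value.encode("latin-1")
```

With limit=255 a literal like "\xe9" is accepted and becomes the single
byte 0xE9; with the default of 127 it is rejected, which is the stricter
variant that avoids guessing an encoding.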

- Robert
_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev