Quentin Wenger <[email protected]> added the comment:
> > this limitation to the latin-1 subset is not compatible with the
> > documentation, which says that valid Python identifiers are valid group
> > names.
>
> Not all latin-1 characters are valid identifier, for example:
>
> >>> '\x94'.encode('latin1')
> b'\x94'
> >>> '\x94'.isidentifier()
> False
True, but that's not the point. Δ is a valid Python identifier but not a valid
group name in bytes regexes, because it lies outside the latin-1 range. The
documentation does not mention this.
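A minimal sketch of the mismatch, assuming only the standard re module. The variable names are illustrative, and encoding the name as UTF-8 for the bytes pattern is just one possible choice, since latin-1 cannot encode it at all:

```python
import re

name = "Δ"

# Δ is a valid Python identifier, and a str pattern accepts it as a group name.
is_identifier = name.isidentifier()
str_groups = re.match("(?P<Δ>a)", "a").groupdict()

# Δ has no latin-1 encoding, so it cannot be spelled in a bytes pattern that way.
try:
    name.encode("latin-1")
    latin1_encodable = True
except UnicodeEncodeError:
    latin1_encodable = False

# Encoding it as UTF-8 instead (an assumption made for this sketch) yields bytes
# that the bytes pattern parser rejects as a group name.
try:
    re.compile(b"(?P<" + name.encode("utf-8") + b">a)")
    bytes_pattern_ok = True
except re.error:
    bytes_pattern_ok = False

print(is_identifier, str_groups, latin1_encodable, bytes_pattern_ok)
```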
> There is a workaround, you can convert `bytes` to `str` with "latin-1"
> decoder before processing, IIRC there will be no extra overhead
> (memory/speed) during processing, then the name and content are the same
> type. :)
I am not looking for a workaround for my current code.
And the simplest workaround is to encode the group names back to bytes with
latin-1, because re should not decode them to strings with latin-1 in the
first place.
Are you saying that the proper way to use bytes regexes is to use string
regexes instead?
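A minimal sketch of that encode-back workaround, for illustration only. It uses an ASCII group name (word, chosen for this example) so that it also runs on newer Python releases, which restrict group names in bytes patterns to ASCII:

```python
import re

match = re.match(rb"(?P<word>\w+)", b"hello")

# groupdict() on a bytes match returns str keys but bytes values.
groups = match.groupdict()

# Encoding each key with latin-1 restores the bytes spelling used in the pattern.
bytes_groups = {name.encode("latin-1"): value for name, value in groups.items()}

print(groups)
print(bytes_groups)
```

After the conversion, names and contents are the same type again, which is what the pattern author wrote in the first place.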
> Please look at these:
>
> >>> orig_name = "Ř"
> >>> orig_ch = orig_name.encode("cp1250") # Because why not?
> >>> orig_ch
> b'\xd8'
> >>> name = list(re.match(b"(?P<" + orig_ch + b">)",
> ...                      b"").groupdict().keys())[0]
> >>> name
> 'Ø' # '\xd8'
> >>> name == orig_name
> False
> >>> name.encode("latin-1")
> b'\xd8'
> >>> name.encode("latin-1") == orig_ch
> True
>
> "Ř" (\u0158) --cp1250--> b'\xd8'
> "Ø" (\u00d8) --latin-1--> b'\xd8'
That's no surprise, I carefully crafted this example. :-)
Rather, that is exactly my point: several different strings (which can all be
valid Python identifiers) can have the same single-byte representation, simply
by means of different encodings (duh).
So why convert group names to strings when outputting them from matches, when
you don't know where the bytes come from, or even whether they ever were
strings? That should be left to the programmer.
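The ambiguity can be sketched without re at all. The snippet below decodes the byte from the example with three codecs; cp1251 is an extra codec added here for illustration and does not appear in the discussion above:

```python
raw = b"\xd8"

# The same byte read under three single-byte encodings.
codecs_to_try = ("latin-1", "cp1250", "cp1251")
decoded = {codec: raw.decode(codec) for codec in codecs_to_try}

for codec, text in decoded.items():
    print(f"{codec:8} -> {text!r} (identifier: {text.isidentifier()})")

# Each string encodes back to the same byte under its own codec, so the byte
# alone cannot say which string was meant.
round_trips = all(text.encode(codec) == raw for codec, text in decoded.items())
```

Only the code that produced the bytes knows which of these readings, if any, is the intended one.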
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue40980>
_______________________________________