Quentin Wenger added the comment:
Another pathological case: literal backslashes
```
>>> re.match(".*", r"\\")
```
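To make the backslash case concrete, here is a small sketch (not part of the original comment) showing how a run of literal backslashes doubles in length once repr'ed, which is why a fixed-width cut can easily land inside an escape. The variable names are illustrative only.

```python
import re

subject = "\\" * 30            # 30 literal backslashes
body = repr(subject)           # each backslash is written as two characters
m = re.match(".*", subject)

print(len(subject), len(body)) # length before and after repr
print(repr(m))                 # the match repr may shorten the escaped text
```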
Quentin Wenger added the comment:
*off
--
___
Python tracker
<https://bugs.python.org/issue39949>
___
___
Python-bugs-list mailing list
Unsubscribe:
Quentin Wenger added the comment:
(but those are one-character escapes, so that should be fine - either the
escape is complete or the backslash is trailing and can be "peeled off")
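A minimal sketch of that "peeling" idea, assuming only one-character escapes such as \\ and \t are present (it does not handle \xNN escapes). The function name and its `limit` parameter are hypothetical and are not how re itself is implemented.

```python
def peel_trailing_backslash(body: str, limit: int) -> str:
    """Cut an already-escaped repr body to at most `limit` characters
    without leaving half of a one-character escape at the end."""
    if limit >= len(body):
        return body                      # nothing to truncate
    cut = body[:limit]
    # Count the backslashes at the end of the cut text.
    trailing = len(cut) - len(cut.rstrip("\\"))
    # An odd count means the last backslash starts an escape that was cut.
    if trailing % 2 == 1:
        cut = cut[:-1]
    return cut


escaped_tab = repr("a\tb")[1:-1]         # the four characters a, \, t, b
print(peel_trailing_backslash(escaped_tab, 2))
print(peel_trailing_backslash(escaped_tab, 3))
```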
Quentin Wenger added the comment:
And ascii escapes should also not be forgotten.
```
>>> re.match(b".*", b"\t")
>>> re.match(".*", "\t")
```
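For illustration, the following sketch prints how much longer such values become once repr'ed. The third value, a non-ASCII bytestring, is an extra case added here as an assumption about what is relevant; it shows the four-character \xNN form.

```python
# Compare the raw length with the length of the repr for each value.
for value in ("\t", b"\t", b"\xc3\xba"):
    text = repr(value)
    print(len(value), len(text), text)
```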
Quentin Wenger added the comment:
An additional difficulty exists for bytes regexes, because there non-ASCII
characters are repr'ed using escape sequences, so there is a risk of cutting
one in the middle.
```
>>> import re
>>
Quentin Wenger added the comment:
bytes are _not_ Unicode code points, not even in the 256 range. End of the
story.
Quentin Wenger added the comment:
If I don't have to think about the str -> bytes direction, re should first stop
going in the other direction.
When I have bytes regexes I actually don't care about strings and would happily
receive group names as bytes. But no, re decides th
Quentin Wenger added the comment:
Because utf-8 is Python's default encoding, e.g. in source files, decode() and
encode(). Literally everywhere.
If you ask around "I have a bytestring, I need a string, what do I do?", using
latin-1 will not be the first answer (and moreov
Quentin Wenger added the comment:
I just had an "aha moment": What re claims is that, rather than doing as I
suggested:
> ```
> # consider the following bytestring pattern
> >>> p = b"(?P<\xc3\xba>)"
>
> # what character does the group name
Quentin Wenger added the comment:
File objects are an example of a square-bracket repr with string parameters in
the repr, but no truncation is performed (see
https://github.com/python/cpython/blob/master/Modules/_io/textio.c#L2912).
Various truncations with the same (lack of?) clarity are
Quentin Wenger added the comment:
Oh ok, I was misled by the example in your first message, where you did have
both the quote and ellipsis.
I don't have a strong opinion.
- having the quote is a bit more "clean"
- but not having it makes it clear that the pattern is truncated
Quentin Wenger added the comment:
You questioned my knowledge of encodings. Let's quote from one of the most
famous introductory articles on the subject
(https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicod
Quentin Wenger added the comment:
The problem can also be played in reverse, maybe it is more telling:
```
# consider the following bytestring pattern
>>> p = b"(?P<\xc3\xba>)"
# what character does the group name correspond to?
# maybe we can try to infer it b
Quentin Wenger added the comment:
And there's no need for a cryptic encoding like cp1250 for this problem to
arise. Here is a simple example with Python's default encoding utf-8:
```
>>> a = "ú"
>>> b = list(re.match(b"(?P<" + a.encode() + b&
Quentin Wenger added the comment:
> > this limitation to the latin-1 subset is not compatible with the
> > documentation, which says that valid Python identifiers are valid group
> > names.
>
> Not all latin-1 characters are valid identifier, for example:
>
>
Quentin Wenger added the comment:
I prove my point that the decoding to string is arbitrary:
```
>>> import re
>>> orig_name = "Ř"
>>> orig_ch = orig_name.encode("cp1250") # Because why not?
>>> name = list(re.match(b"(?P<"
Quentin Wenger added the comment:
> It seems you don't know some knowledge of encoding yet.
I don't have to be ashamed of my knowledge of encoding. Yet you are right that
I was missing a subtlety, which is that latin-1 is a strict subset of Unicode
rather than a complet
Quentin Wenger added the comment:
I welcome any counter-example to the eval()'able property in the stdlib.
I do believe in this rule as hard and fast, because it works for small
patterns, only biting you when you grow, probably programmatically (so exactly
when you actually could nee
Quentin Wenger added the comment:
@eric.smith thanks, no problem.
If I can give any advice on the present issue, I would suggest putting the
ellipsis _inside_ the quote, to make clear that the pattern is being truncated,
not the match. So instead of
```
<_sre.SRE_Match object; span=(0,
Quentin Wenger added the comment:
The issue with the second variant is that utf-8 is an arbitrary (although
default) choice.
But re already makes that same arbitrary choice when decoding the group names
into a string, which is my original complaint
Quentin Wenger added the comment:
For a bit of background, the other issue is about the repr of compiled
patterns, not match objects.
Please see my argument there about the conformance to repr's doc - merely
adding an ellipsis would _not_ solve this case.
I have however nothing agains
Quentin Wenger added the comment:
Sorry, b"(?P<\xce\x94>)"
--
___
Python tracker
<https://bugs.python.org/issue40980>
Quentin Wenger added the comment:
But Δ has no latin-1 representation. So Δ currently cannot be used as a group
name in bytes regex, although it is a valid Python identifier. So that's a bug.
I mean, if you insist on having group names as strings even for bytes regexes,
then it i
Quentin Wenger added the comment:
> So b'\xe9' is mapped to \u00e9, it is `é`.
Yes but \xe9 is not strictly valid utf-8, or say not the canonical
representation of "é". So there is no way to get \xe9 starting from é without
leaving utf-8. So starting with é
Quentin Wenger added the comment:
Of course an inconvenience in my program is not per se the reason to change the
language. I just wanted to motivate that the current situation gives unexpected
results.
"\xe9" doesn't look like proper utf-8 to me:
```
>>> "
Quentin Wenger added the comment:
All in all, it is simply a matter of compliance. The doc of repr says that a
repr is either
- a string that can be eval()'ed back to (an equivalent of) the original object
- or a "more loose" angle-bracket representation.
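The two conventions can be illustrated with a short sketch (an illustration added here, not code from the report; the pattern is arbitrary and deliberately short):

```python
import re

short = re.compile("a+b")

# First convention: the repr is source code that rebuilds an equivalent object.
rebuilt = eval(repr(short))
print(rebuilt.pattern == short.pattern)

# Second convention: an angle-bracket description that is not meant for eval().
m = short.match("aab")
print(repr(m))
```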
re.compile with
Quentin Wenger added the comment:
should *be a valid name
Quentin Wenger added the comment:
Agreed to some extent, but there is the difference that group names are
embedded in the pattern, which has to be bytes if the target is bytes.
My use case is in an all-bytes, no-string project where I construct a large
regular expression at startup, with
Quentin Wenger added the comment:
Pardon me, but I see an important difference with the other bug report: that
one is about a repr in angle brackets, and as such does not require an exact
output, so an ellipsis is good enough.
In this bug, the output of repr gives a string that can, at
Quentin Wenger added the comment:
This also affects functions/methods expecting a group name as parameter (e.g.
match.group), the group name has to be passed as string.
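A short sketch of what this looks like in practice, using an ASCII group name chosen only for illustration:

```python
import re

m = re.match(b"(?P<a>x)", b"x")
print(m.groupdict())       # keys are str, values are bytes
print(m.group("a"))        # lookup by str name

# Looking the group up by a bytes name is not accepted.
try:
    m.group(b"a")
    bytes_lookup_ok = True
except IndexError:
    bytes_lookup_ok = False
print(bytes_lookup_ok)
```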
Quentin Wenger added the comment:
Note: it actually truncates at 200 characters, counting the initial quote of
the argument's repr.
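One way to check this on a given interpreter is sketched below. The 300-character pattern is an arbitrary choice, and whether the repr is actually cut depends on the Python version in use, so no output is claimed here.

```python
import re

long_pattern = re.compile("a" * 300)
text = repr(long_pattern)
print(len(text))           # compare with the 300 characters of the pattern

# If the repr was cut, the string literal is unterminated and eval() fails.
try:
    eval(text)
    roundtrips = True
except SyntaxError:
    roundtrips = False
print(roundtrips)
```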
Change by Quentin Wenger :
--
title: re.compile's repr truncates patterns at 199 characters -> re.compile's
repr truncates patterns at 200 characters
New submission from Quentin Wenger :
This seems somewhat arbitrary and yields unusable results, going against the
doc:
> repr(object)
> Return a string containing a printable representation of an object. For many
> types, this function makes an attempt to return a string that would
New submission from Quentin Wenger :
I noticed that match.groupdict() returns string keys, even for a bytes regex:
```
>>> import re
>>> re.match(b"(?P<a>)", b"").groupdict()
{'a': b''}
```
This seems somewhat strange, because string an