[issue39949] truncating match in regular expression match objects repr

2020-06-19 Thread Quentin Wenger
Quentin Wenger added the comment: Other pathological case: literal backslashes ``` >>> re.match(".*", r"\\") ``` -- ___ Python tracker <ht

[issue39949] truncating match in regular expression match objects repr

2020-06-19 Thread Quentin Wenger
Quentin Wenger added the comment: *off -- ___ Python tracker <https://bugs.python.org/issue39949> ___ ___ Python-bugs-list mailing list Unsubscribe:

[issue39949] truncating match in regular expression match objects repr

2020-06-19 Thread Quentin Wenger
Quentin Wenger added the comment: (but those are one-character escapes, so that should be fine - either the escape is complete or the backslash is trailing and can be "peeled of") -- ___ Python tracker <https://bugs.python.o

[issue39949] truncating match in regular expression match objects repr

2020-06-19 Thread Quentin Wenger
Quentin Wenger added the comment: And ascii escapes should also not be forgotten. ``` >>> re.match(b".*", b"\t") >>> re.match(".*", "\t") ```

[issue39949] truncating match in regular expression match objects repr

2020-06-19 Thread Quentin Wenger
Quentin Wenger added the comment: An additional difficulty exists for bytes regexes, because there non-ASCII bytes are repr'ed using escape sequences. So there's a risk of cutting one in the middle. ``` >>> import re >>> re.m
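A minimal sketch of the risk described in this comment, using only `repr` and slicing. The cut-off width of 8 is an arbitrary value chosen for illustration, not the width `re` actually uses:

```python
# A byte outside the ASCII range is shown as a four-character escape.
data = b"\xe9\xe9"
full = repr(data)

# Cutting the text at a fixed width can land in the middle of an escape:
# here the slice ends with a dangling backslash-x and no hex digits.
cut = full[:8]

print(full)
print(cut)
```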

[issue40980] group names of bytes regexes are strings

2020-06-17 Thread Quentin Wenger
Quentin Wenger added the comment: bytes are _not_ Unicode code points, not even in the 256 range. End of the story.

[issue40980] group names of bytes regexes are strings

2020-06-17 Thread Quentin Wenger
Quentin Wenger added the comment: If I don't have to think about the str -> bytes direction, re should first stop going in the other direction. When I have bytes regexes I actually don't care about strings and would happily receive group names as bytes. But no, re decides that lati

[issue40980] group names of bytes regexes are strings

2020-06-17 Thread Quentin Wenger
Quentin Wenger added the comment: Because utf-8 is Python's default encoding, e.g. in source files, decode() and encode(). Literally everywhere. If you ask around "I have a bytestring, I need a string, what do I do?", using latin-1 will not be the first answer (and moreover, t

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: I just had an "aha moment": What re claims is that, rather than doing as I suggested: > ``` > # consider the following bytestring pattern > >>> p = b"(?P<\xc3\xba>)" > > # what character does the group

[issue39949] truncating match in regular expression match objects repr

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: File objects are an example of a square-bracket repr with string parameters in the repr, but no truncation is performed (see https://github.com/python/cpython/blob/master/Modules/_io/textio.c#L2912). Various truncations with the same (lack of?) clarity

[issue39949] truncating match in regular expression match objects repr

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: Oh ok, I was misled by the example in your first message, where you did have both the quote and ellipsis. I don't have a strong opinion. - having the quote is a bit more "clean" - but not having it makes clear that the pattern is truncated (per

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: You questioned my knowledge of encodings. Let's quote from one of the most famous introductory articles on the subject (https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: The problem can also be played in reverse, maybe it is more telling: ``` # consider the following bytestring pattern >>> p = b"(?P<\xc3\xba>)" # what character does the group name correspond to? # maybe we can try to infer it b

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: And there's no need for a cryptic encoding like cp1250 for this problem to arise. Here is a simple example with Python's default encoding utf-8: ``` >>> a = "ú" >>> b = list(re.match(b"(?P<" + a.encode() + b">)

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: > > this limitation to the latin-1 subset is not compatible with the > > documentation, which says that valid Python identifiers are valid group > > names. > > Not all latin-1 characters are valid identifier, for example: > >

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: I prove my point that the decoding to string is arbitrary: ``` >>> import re >>> orig_name = "Ř" >>> orig_ch = orig_name.encode("cp1250") # Because why not? >>> name = list(re.match(b"(?P<"

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: > It seems you don't know some knowledge of encoding yet. I don't have to be ashamed of my knowledge of encoding. Yet you are right that I was missing a subtlety, which is that latin-1 is a strict subset of Unicode rather than a completely arbitr

[issue40984] re.compile's repr truncates patterns at 200 characters

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: I welcome any counter-example to the eval()'able property in the stdlib. I do believe in this rule as hard and fast, because it works for small patterns, only biting you when you grow, probably programmatically (so exactly when you actually could need

[issue39949] truncating match in regular expression match objects repr

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: @eric.smith thanks, no problem. If I can give any advice on this present issue, I would suggest having the ellipsis _inside_ the quote, to make clear that the pattern is being truncated, not the match. So instead of ``` <_sre.SRE_Match object; span=(0,

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: The issue with the second variant is that utf-8 is an arbitrary (although default) choice. But: re is already making that same arbitrary choice in decoding the group names into a string, which is my original complaint

[issue39949] truncating match in regular expression match objects repr

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: For a bit of background, the other issue is about the repr of compiled patterns, not match objects. Please see my argument there about the conformance to repr's doc - merely adding an ellipsis would _not_ solve this case. I have however nothing against

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: Sorry, b"(?P<\xce\x94>)" -- ___ Python tracker <https://bugs.python.org/issue40980> ___ ___ Python-bugs-list ma

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: But Δ has no latin-1 representation. So Δ currently cannot be used as a group name in bytes regex, although it is a valid Python identifier. So that's a bug. I mean, if you insist on having group names as strings even for bytes regexes
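The two facts this comment relies on can be checked without involving `re` at all. A small sketch, in which the variable names are illustrative only:

```python
name = "Δ"

# Δ is accepted as a Python identifier...
is_identifier = name.isidentifier()

# ...but it has no latin-1 encoding, so no single byte decodes to it.
try:
    name.encode("latin-1")
    has_latin1_form = True
except UnicodeEncodeError:
    has_latin1_form = False

# Its utf-8 form is two bytes.
utf8_form = name.encode("utf-8")

print(is_identifier, has_latin1_form, utf8_form)
```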

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: > So b'\xe9' is mapped to \u00e9, it is `é`. Yes but \xe9 is not strictly valid utf-8, or say not the canonical representation of "é". So there is no way to get \xe9 starting from é without leaving utf-8. So starting with é as group n
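The mismatch argued about here comes down to how the two codecs treat the same bytes. A short sketch using only `str.encode` and `bytes.decode`, with variable names made up for illustration:

```python
# latin-1 maps each byte to the code point with the same number.
from_latin1 = b"\xe9".decode("latin-1")

# utf-8 encodes the same character as two bytes instead.
as_utf8 = "é".encode("utf-8")

# Decoding those utf-8 bytes as latin-1 yields two characters, not one.
misread = as_utf8.decode("latin-1")

print(from_latin1, as_utf8, misread)
```

So a name that starts out as a str and is encoded with the default utf-8 does not come back unchanged if the bytes are later read as latin-1.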

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: Of course an inconvenience in my program is not per se the reason to change the language. I just wanted to motivate that the current situation gives unexpected results. "\xe9" doesn't look like proper utf-8 to me: ``` >>> "é"

[issue40984] re.compile's repr truncates patterns at 200 characters

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: All in all, it is simply a matter of compliance. The doc of repr says that a repr is either - a string that can be eval()'ed back to (an equivalent of) the original object - or a "more loose" angle-bracket representation. re.compile with smal

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: should *be a valid name

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: Agreed to some extent, but there is the difference that group names are embedded in the pattern, which has to be bytes if the target is bytes. My use case is in an all-bytes, no-string project where I construct a large regular expression at startup

[issue40984] re.compile's repr truncates patterns at 200 characters

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: Pardon me, but I see an important difference with the other bug report: that one is about a repr in angle brackets, and as such does not require an exact output, so an ellipsis is good enough. In this bug, the output of repr gives a string that can

[issue40980] group names of bytes regexes are strings

2020-06-15 Thread Quentin Wenger
Quentin Wenger added the comment: This also affects functions/methods expecting a group name as parameter (e.g. match.group); the group name has to be passed as a string.

[issue40984] re.compile's repr truncates patterns at 199 characters

2020-06-15 Thread Quentin Wenger
Quentin Wenger added the comment: Note: it actually truncates at 200 characters, counting the initial quote of the argument's repr.
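One way to observe the behaviour reported in this issue is to check whether `eval(repr(...))` gives the pattern back. This is a sketch only: `repr_round_trips` is a hypothetical helper written for this illustration, and the result for the long pattern assumes a CPython build that truncates the repr as the report describes.

```python
import re


def repr_round_trips(pattern_text):
    """Return True if eval(repr(compiled)) gives back the same pattern text."""
    compiled = re.compile(pattern_text)
    try:
        rebuilt = eval(repr(compiled), {"re": re})
    except SyntaxError:
        # A repr cut off before its closing quote is not valid Python.
        return False
    return rebuilt.pattern == pattern_text


short_ok = repr_round_trips("a" * 10)   # well under the reported limit
long_ok = repr_round_trips("a" * 300)   # well over the reported limit

print(short_ok, long_ok)
```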

[issue40984] re.compile's repr truncates patterns at 200 characters

2020-06-15 Thread Quentin Wenger
Change by Quentin Wenger: title: re.compile's repr truncates patterns at 199 characters -> re.compile's repr truncates patterns at 200 characters

[issue40984] re.compile's repr truncates patterns at 199 characters

2020-06-15 Thread Quentin Wenger
New submission from Quentin Wenger : This seems somewhat arbitrary and yields unusable results, going against the doc: > repr(object) > Return a string containing a printable representation of an object. For many > types, this function makes an attempt to return a string that wo

[issue40980] group names of bytes regexes are strings

2020-06-14 Thread Quentin Wenger
New submission from Quentin Wenger: I noticed that match.groupdict() returns string keys, even for a bytes regex: ``` >>> import re >>> re.match(b"(?P<a>)", b"").groupdict() {'a': b''} ``` This seems somewhat strange, because string and bytes matching in
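The observation in this submission can be reproduced with an ASCII group name. A small sketch, where the variable names are chosen for illustration:

```python
import re

# A bytes pattern with one named, empty group, matched against empty bytes.
match = re.match(b"(?P<a>)", b"")
groups = match.groupdict()

# The matched values are bytes, but the keys come back as str.
key_types = {type(k) for k in groups}
value_types = {type(v) for v in groups.values()}

# Looking the group up by name likewise takes a str.
by_name = match.group("a")

print(groups, key_types, value_types, by_name)
```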