Steven D'Aprano writes:
> On Sat, Jan 07, 2023 at 10:48:48AM -0800, Peter Ludemann wrote:
> > You can get almost the same result using pattern matching. For example,
> > your
> > "foo:bar;baz".partition(":", ";")
> > can be done by a well-known matching idiom:
> > re.match(r'([^:]*):([^;]*);(.*)', 'foo:bar;baz').groups()
>
> "Well-known" he says :-)
It *is* well-known to those who know. Just because you don't like
regex doesn't mean it's not well-known. I wouldn't use that idiom
though; I'd use an explicit character class in most cases I encounter.
> I think that the regex solution is also wrong because it requires you
> to know *exactly* what order the separators are found in the source
> string.
But that's characteristic of many examples. In "structured" mail
headers like Content-Type, you want the separators to come in the
order ':', '=', ';'. In a URI scheme with an authority component, you
want them in the order '@', ':'. Except that you don't, in both those
examples. In Content-Type, the '=' is optional, and there may be
multiple ';'. In authority, the ':' before the port is optional, and there's
an optional ':' to separate password from username before the '@'.
And it gets worse: in the authority case, the username is optional.
In the common case of anonymous access, the username is omitted, so
user, _, domain = "example.com".partition('@')
does the wrong thing!
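
A minimal sketch of that failure, plus one possible guard. The helper
name split_userinfo is made up for illustration; it is not an existing
API.

```python
# Naive split of an authority on '@'.
user, sep, domain = "example.com".partition('@')
# No '@' is present, so the whole string lands in `user`
# and `domain` comes back empty.
print(repr(user), repr(sep), repr(domain))

# Hypothetical helper: split from the right and check whether the
# separator was actually found before trusting the left-hand piece.
def split_userinfo(authority):
    userinfo, sep, host = authority.rpartition('@')
    return (userinfo if sep else None), host

print(split_userinfo("example.com"))
print(split_userinfo("anonymous@example.com"))
```

The point is only that the returned separator has to be inspected; the
three-way unpacking alone cannot tell "no username" from "no domain".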
> If we swap the semi-colon and the colon in the source, but not
> the pattern, the idiom fails:
>
> >>> re.match(r'([^:]*):([^;]*);(.*)', 'foo;bar:baz').groups()
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> AttributeError: 'NoneType' object has no attribute 'groups'
>
> So that makes it useless for the case where you want to split on any of
> a number of separators, but don't know which order they occur in.
Examples where the order of separators doesn't matter? In most of the
examples I need, swapping order is a parse error.
> You call it "almost the same result" but it is nothing like the result
> from partition. The separators are lost,
Trivial to fix, just add parens, in the simpler grouping form as a
bonus! I'm not asking you to like the resulting regexp better, just
pointing out that your dislike of regex is driving the discussion in
unprofitable directions.
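
For concreteness, a sketch of "just add parens", applied to the pattern
quoted above. The variable names are mine.

```python
import re

# Same idiom as the quoted pattern, with each separator wrapped in its
# own plain capturing group so it shows up in the result.
pattern = re.compile(r'([^:]*)(:)([^;]*)(;)(.*)')

parts = pattern.match('foo:bar;baz').groups()
print(parts)

# With the separators in the other order there is still no match, so
# callers should test for None rather than call .groups() blindly.
swapped = pattern.match('foo;bar:baz')
print(swapped)
```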
> and it splits the string all at once instead of one split per call.
So does the original proposal, that's part of the point of it, I
think.
I really don't see any of the variations on the proposal as a
particularly valuable addition. It's already easy to screw up your
parse with str.partition. Take the authority example: you can fix
the order problem with '@' by using str.rpartition, but the multiple
optional ':' mean that whichever r?partition you use, you can get it
wrong unless you check the order of '@' and ':', so you have to use a
recursive parse, not a sequential parse. But you can write a regex
version of authority to give a sequence of tokens rather than a parse,
and you convert that into a parse by checking each element of the
sequence for None in a deterministic order. I prefer the latter
approach (Emacs user since Emacs was programmed in TECO), but as long
as you allow me to use regex for character classes and sequences, I
can live with restrictions on use of regex in the style guide.
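
A rough sketch of what I mean by a regex that yields a sequence of
tokens. The pattern is a deliberate simplification of an authority
(no IPv6 literals, no percent-encoding checks), and the names AUTHORITY
and tokenize_authority are invented for this example.

```python
import re

# Simplified authority: [user[:password]@]host[:port]
AUTHORITY = re.compile(
    r'(?:([^:@/]*)(?::([^@/]*))?@)?'   # optional userinfo, ending in '@'
    r'([^:@/]*)'                       # host
    r'(?::([0-9]*))?'                  # optional port
)

def tokenize_authority(text):
    """Return (user, password, host, port); absent parts are None."""
    m = AUTHORITY.fullmatch(text)
    if m is None:
        raise ValueError(f"not an authority: {text!r}")
    return m.groups()

# The "parse" is then a fixed sequence of None checks on the tokens.
for sample in ("example.com", "example.com:80",
               "user@example.com", "user:pw@example.com:8080"):
    user, password, host, port = tokenize_authority(sample)
    print(sample, "->", user, password, host, port)
```

Each optional piece is either a string or None, so the caller never has
to reason about which ':' came before or after the '@'.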
Parsing is hard. Both regex and r?partition are best used as low-
level tools for tokenizing, and you're asking for trouble if you try
to use them for parsing past a certain point. My breaking point for
regex is somewhere around the authority example, but I wouldn't push
back if my project's style guide said to break that up. I *would*
however often prefer regexp to r?partition because it would allow
character classes, and in most of the areas I work with (mail, URIs,
encodings) being able to detect lexical errors by using character
classes is helpful. And I would prefer "one bite per call" partition
to a partition at multiple points. Where I'm being pretty fuzzy, the
.split methods are fine.
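
To illustrate the character-class point, a small sketch for a single
"name=value" parameter such as the ones in Content-Type. The TOKEN
class is only an approximation of a mail/HTTP token, and parse_param is
a hypothetical helper.

```python
import re

# Approximate "token" character class; an assumption for this example,
# not a faithful copy of any RFC grammar.
TOKEN = r"[A-Za-z0-9!#$%&'*+.^_`|~-]+"
PARAM = re.compile(rf"({TOKEN})=({TOKEN})")

def parse_param(text):
    """Return (name, value), or None if either side has a bad character."""
    m = PARAM.fullmatch(text)
    return m.groups() if m else None

# partition splits happily even though 'char set' is not a valid name.
print("char set=utf-8".partition('='))

# The character class turns the same input into a detectable error.
print(parse_param("char set=utf-8"))
print(parse_param("charset=utf-8"))
```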
-- Yet another Steve