Steven D'Aprano writes:

 > On Sat, Jan 07, 2023 at 10:48:48AM -0800, Peter Ludemann wrote:
 > > You can get almost the same result using pattern matching. For
 > > example, your
 > > "foo:bar;baz".partition(":", ";")
 > > can be done by a well-known matching idiom:
 > > re.match(r'([^:]*):([^;]*);(.*)', 'foo:bar;baz').groups()
 > 
 > "Well-known" he says :-)

It *is* well-known to those who know.  Just because you don't like
regex doesn't mean it's not well-known.  I wouldn't use that idiom
though; I'd use an explicit character class in most cases I encounter.

 > I think that the regex solution is also wrong because it requires you 
 > to know *exactly* what order the separators are found in the source 
 > string.

But that's characteristic of many examples.  In "structured" mail
headers like Content-Type, you want the separators to come in the
order ':', '=', ';'.  In a URI scheme with an authority component, you
want them in the order '@', ':'.  Except that you don't, in both those
examples.  In Content-Type, the '=' is optional, and there may be
multiple ';'.  In authority, the existing ':' is optional, and there's
an optional ':' to separate password from username before the '@'.

And it gets worse: in the authority case, the username is optional.
In the common case of anonymous access, the username is omitted, so

user, _, domain = "example.com".partition('@')

does the wrong thing!
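
A minimal sketch of that failure, for illustration only (the address
"alice@example.com" is a made-up example, not from the thread):

```python
# No '@' present: partition() puts the whole string in the FIRST slot,
# so "user" receives the host name and "domain" is empty.
head = "example.com".partition('@')
print(head)        # ('example.com', '', '')

# rpartition() puts the whole string in the LAST slot instead.
tail = "example.com".rpartition('@')
print(tail)        # ('', '', 'example.com')

# When a username is present, both methods agree on this input.
with_user = "alice@example.com".rpartition('@')
print(with_user)   # ('alice', '@', 'example.com')
```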

 > If we swap the semi-colon and the colon in the source, but not 
 > the pattern, the idiom fails:
 > 
 >     >>> re.match(r'([^:]*):([^;]*);(.*)', 'foo;bar:baz').groups()
 >     Traceback (most recent call last):
 >       File "<stdin>", line 1, in <module>
 >     AttributeError: 'NoneType' object has no attribute 'groups'
 > 
 > So that makes it useless for the case where you want to split on any of 
 > a number of separators, but don't know which order they occur in.

Examples where the order of separators doesn't matter?  In most of the
examples I need, swapping order is a parse error.

 > You call it "almost the same result" but it is nothing like the result 
 > from partition. The separators are lost,

Trivial to fix, just add parens, in the simpler grouping form as a
bonus!  I'm not asking you to like the resulting regexp better, just
pointing out that your dislike of regex is driving the discussion in
unprofitable directions.
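
One reading of "just add parens", sketched here as an assumption (this
five-group pattern does not appear in the thread), is to wrap each
separator in its own capturing group:

```python
import re

# Same idiom as the quoted pattern, with each separator captured too.
PATTERN = re.compile(r'([^:]*)(:)([^;]*)(;)(.*)')

parts = PATTERN.match('foo:bar;baz').groups()
print(parts)   # ('foo', ':', 'bar', ';', 'baz')

# For comparison, the same five pieces from two chained partition() calls.
first, sep1, rest = 'foo:bar;baz'.partition(':')
second, sep2, third = rest.partition(';')
chained = (first, sep1, second, sep2, third)
```

On this input both approaches yield the same five-element result.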

 > and it splits the string all at once instead of one split per call.

So does the original proposal, that's part of the point of it, I
think.

I really don't see any of the variations on the proposal as a
particularly valuable addition.  It's already easy to screw up your
parse with str.partition (the authority example: although you can fix
the order problem with '@' by using str.rpartition, the multiple
optional ':' mean that whichever r?partition you use, you can get it
wrong unless you check the order of '@' and ':', so you have to use a
recursive parse, not a sequential parse).  But you can write a regex
version of authority to give a sequence of tokens rather than a parse,
and you convert that into a parse by checking each element of the
sequence for None in a deterministic order.  I prefer the latter
approach (Emacs user since Emacs was programmed in TECO), but as long
as you allow me to use regex for character classes and sequences, I
can live with restrictions on use of regex in the style guide.
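
A rough sketch of the "sequence of tokens, then check for None"
approach described above.  Everything here is an assumption made for
illustration: the names AUTHORITY and parse_authority are hypothetical,
and the grammar is deliberately simplified (no IPv6 literals, no
percent-encoding checks, and a port, if present, must be non-empty).

```python
import re

# Tokenizer: each optional piece is a named group that is None if absent.
AUTHORITY = re.compile(
    r'''
    (?:                              # optional userinfo, ended by '@'
        (?P<user>[^:@/]*)
        (?: : (?P<password>[^@/]*) )?
        @
    )?
    (?P<host>[^:@/]+)
    (?: : (?P<port>[0-9]+) )?        # optional port
    ''',
    re.VERBOSE,
)

def parse_authority(text):
    """Turn the token sequence into a parse by checking for None in order."""
    m = AUTHORITY.fullmatch(text)
    if m is None:
        raise ValueError(f"not a valid authority: {text!r}")
    user, password, host, port = m.group('user', 'password', 'host', 'port')
    return {
        'user': user,                # None means anonymous access
        'password': password,        # None unless 'user:password@' was given
        'host': host,
        'port': int(port) if port is not None else None,
    }

print(parse_authority("example.com"))
print(parse_authority("alice:secret@example.com:8080"))
```

Because the pattern fixes where '@' and each ':' may appear, the
anonymous case and the ':' before a port are not confused with the
username and password fields.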

Parsing is hard.  Both regex and r?partition are best used as low-
level tools for tokenizing, and you're asking for trouble if you try
to use them for parsing past a certain point.  My breaking point for
regex is somewhere around the authority example, but I wouldn't push
back if my project's style guide said to break that up.  I *would*
however often prefer regexp to r?partition because it would allow
character classes, and in most of the areas I work with (mail, URIs,
encodings) being able to detect lexical errors by using character
classes is helpful.  And I would prefer "one bite per call" partition
to a partition at multiple points.  Where I'm being pretty fuzzy, the
.split methods are fine.
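
To illustrate the point about character classes catching lexical
errors, here is a small sketch using the URI scheme syntax from RFC
3986 (a letter followed by letters, digits, '+', '-' or '.').  The
names SCHEME and split_scheme are hypothetical, chosen for this example.

```python
import re

# Explicit character class for the scheme; the rest is taken as-is.
SCHEME = re.compile(r'([A-Za-z][A-Za-z0-9+.-]*):(.*)', re.DOTALL)

def split_scheme(uri):
    """Split 'scheme:rest', rejecting lexically invalid schemes."""
    m = SCHEME.fullmatch(uri)
    if m is None:
        raise ValueError(f"bad or missing scheme in {uri!r}")
    return m.group(1), m.group(2)

print(split_scheme("https://example.com"))   # ('https', '//example.com')

# partition() accepts the malformed scheme without complaint ...
print("ht tp://example.com".partition(':'))  # ('ht tp', ':', '//example.com')

# ... while the character class turns it into an error.
try:
    split_scheme("ht tp://example.com")
except ValueError as exc:
    print("rejected:", exc)
```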

-- Yet another Steve
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/VDZQVHGUPAOUCPL4HPAXFTQPNAHNJZIK/
Code of Conduct: http://python.org/psf/codeofconduct/