All your observations are on the mark, Andrew, so I went ahead and coded the patch against *cpython*. This is the (draft) pull request.
https://github.com/python/cpython/pull/17735

I did not write new tests for *findalliter()* or *findfirst()*: the
correctness of *findalliter()* is implied by *findall()* being reimplemented
in terms of it, and the correctness of *findfirst()* depends entirely on that
of *first()*, which is still under discussion.

Later I may try implementing *findalliter()* in C (move the core of
*findall()* to within the outer of *finditer()*?).

Cheers,

On Sat, Dec 28, 2019 at 4:03 PM Andrew Barnert <abarn...@yahoo.com> wrote:

> On Dec 28, 2019, at 10:12, Juancarlo Añez <apal...@gmail.com> wrote:
>
> > As far as I understand it, my implementation of *findalliter()* matches
> > the semantics in the *switch* statement.
>
> There’s nothing outside the switch statement that converts PyNone values
> to empty strings, so whatever the difference is, it must be inside the
> switch statement. And, even though your control flow is the same, there are
> two obvious ways in which you aren’t using the same input as the C code, so
> the semantics aren’t going to be the same.
>
> The C code pulls values out of the pattern’s internal state without
> building a match object. My guess is that this is where the difference
> is: either building the match object, or somewhere inside the groups method,
> unmatched groups get converted into something else, which the groups method
> then replaces with its default parameter, which defaults to None, while the
> C state_getslice function is just pulling a 0-length string out of the
> input. But without diving into the code that’s just a guess.
>
> The C code also switches on the number of groups in the pattern (which I
> think is exposed from the compiled pattern object?), not the number of
> results in the current match. I’d guess that’s guaranteed to always be the
> same even in weird cases like nested groups, so isn’t relevant here, but
> again that’s just a guess.
>
> > This is the matching implementation:
> >
> >     for m in re.finditer(pattern, string, flags=flags):
> >         g = m.groups()
> >         if len(g) == 1:
> >             yield g[0]
> >         elif g:
> >             yield tuple(s if s else '' for s in g)
> >         else:
> >             yield m.group()
>
> Why not just call groups(default='') instead of calling groups() to
> replace them with None and then using a genexpr to convert that None to ''?
>
> More importantly, you can’t return '', you have to return '' or b''
> depending on the type of the input string, using the same rule (whatever it
> is) that findall and the rest of the module use. (I think that’s worked out
> at compile time and exposed on the compiled pattern object, but I’m not
> sure.)
>
> And even using a default with groups assumes I guessed right about the
> problem, and that it’s the only difference in behavior. If not, it may
> still be a hack that only sometimes gets the right answer and just nobody’s
> thought up a test case otherwise. I think you really do need to go through
> either the C code or the docs to make sure there aren’t any other edge
> cases.
>
> > Updated unit test:
>
> Are there tests for findall and/or finditer in the stdlib test suite with
> wide coverage that you could adapt to compare list(findalliter) vs. findall
> or something?

-- 
Juancarlo *Añez*
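
For reference, a minimal sketch of the points raised in the quoted reply. It is not the code in the pull request. It assumes that the empty value can be chosen from the type of the compiled pattern's `pattern` attribute (the rule the C code actually uses has not been checked here), and it switches on the pattern's group count rather than on the current match, as the reply says the C code does. The name `findalliter` follows the thread; the test cases are illustrative only.

```python
import re


def findalliter(pattern, string, flags=0):
    """Lazily yield the items that re.findall() would return (sketch)."""
    compiled = re.compile(pattern, flags)
    # Assumption: a str pattern yields str results, anything else yields bytes.
    empty = '' if isinstance(compiled.pattern, str) else b''
    for m in compiled.finditer(string):
        if compiled.groups == 0:
            # No groups in the pattern: yield the whole match.
            yield m.group()
        elif compiled.groups == 1:
            # One group: yield it, or the empty value if it did not match.
            yield m.groups(default=empty)[0]
        else:
            # Several groups: yield a tuple, unmatched groups become empty.
            yield m.groups(default=empty)


# Compare against re.findall() on a few hand-picked cases.
cases = [
    (r'\d+', 'a1b22'),        # no groups
    (r'(a)|b', 'ab'),         # one group, sometimes unmatched
    (r'(a)|(b)', 'ab'),       # two groups, one always unmatched
    (rb'(a)|(b)', b'ab'),     # bytes pattern and bytes input
]
for pat, text in cases:
    assert list(findalliter(pat, text)) == re.findall(pat, text)
```

Agreement on these few cases does not rule out other edge cases; adapting the stdlib tests for findall(), as suggested in the quoted reply, would give much wider coverage.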
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/35OMYLEM35PGFWCXMVGHHOJRXNXLNTKP/
Code of Conduct: http://python.org/psf/codeofconduct/