All your observations are on the mark, Andrew, so I went ahead and coded
the patch against *cpython*. This is the (draft) pull request.

https://github.com/python/cpython/pull/17735

I did not write new tests for *findalliter()* or *findfirst()*, because the
correctness of *findalliter()* is implied by having *findall()* reimplemented
in terms of it, and the correctness of *findfirst()* depends entirely on
the correctness of *first()*, which is still under discussion.
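
To make the "implied by" argument concrete, here is a minimal pure-Python sketch of the relationship. It is not the code in the pull request: the function bodies are illustrative, and it assumes the input is a plain str or bytes object so that `string[:0]` gives an empty value of the right type.

```python
import re


def findalliter(pattern, string, flags=0):
    """Yield, one at a time, the items that re.findall() would return."""
    compiled = re.compile(pattern, flags)
    # Assumption: string is str or bytes, so this is '' or b''.
    empty = string[:0]
    for m in compiled.finditer(string):
        if compiled.groups == 0:
            # No groups in the pattern: yield the whole match.
            yield m.group()
        elif compiled.groups == 1:
            # One group: yield it, with an unmatched group as empty.
            yield m.groups(default=empty)[0]
        else:
            # Several groups: yield a tuple, unmatched groups as empty.
            yield m.groups(default=empty)


def findall(pattern, string, flags=0):
    """findall() expressed in terms of findalliter()."""
    return list(findalliter(pattern, string, flags))
```

With this shape, any existing test that exercises *findall()* also exercises every branch of *findalliter()*.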

Later I may try implementing *findalliter()* in C (move the core of
*findall()* to within the outer of *finditer()*?)

Cheers,

On Sat, Dec 28, 2019 at 4:03 PM Andrew Barnert <abarn...@yahoo.com> wrote:

> On Dec 28, 2019, at 10:12, Juancarlo Añez <apal...@gmail.com> wrote:
>
>
> As far as I understand it, my implementation of *findalliter()* matches
> the semantics in the *switch* statement.
>
>
> There’s nothing outside the switch statement that converts Py_None values
> to empty strings, so whatever the difference is, it must be inside the
> switch statement. And, even though your control flow is the same, there are
> two obvious ways in which you aren’t using the same input as the C code, so
> the semantics aren’t going to be the same.
>
> The C code pulls values out of the pattern’s internal state without
> building a match object. My guess is that this is where the difference
> is—either building the match object, or somewhere inside the groups method,
> unmatched groups get converted into something else, which the groups method
> then replaces with its default parameter, which defaults to None, while the
> C state_getslice function is just pulling a 0-length string out of the
> input. But without diving into the code that’s just a guess.
>
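
The observable part of that guess can be checked from Python without reading the C source. A small sketch using only the public re API (the variable names are just for illustration):

```python
import re

# The first group does not participate in this match.
m = re.search(r'(a)|(b)', 'b')

# Match.groups() reports an unmatched group as None by default...
unmatched_as_none = m.groups()             # (None, 'b')

# ...unless a different default is requested.
unmatched_as_empty = m.groups(default='')  # ('', 'b')

# findall() reports the same unmatched group as an empty string.
findall_result = re.findall(r'(a)|(b)', 'b')  # [('', 'b')]
```

This shows only what the two APIs return, not where in the C code the difference arises.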
> The C code also switches on the number of groups in the pattern (which I
> think is exposed from the compiled pattern object?), not the number of
> results in the current match. I’d guess that’s guaranteed to always be the
> same even in weird cases like nested groups, so isn’t relevant here, but
> again that’s just a guess.
>
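
The group count is exposed as the `groups` attribute of a compiled pattern. A short sketch comparing it with the length of `Match.groups()` on a pattern with nested and optional groups (the pattern and input are arbitrary examples):

```python
import re

# Three groups: an outer group, an optional nested group, and an alternative.
compiled = re.compile(r'(a(b)?)|(c)')

# Length of Match.groups() for every match in a sample input.
lengths = [len(m.groups()) for m in compiled.finditer('ab a c')]

# True if every match reports exactly as many groups as the pattern has.
always_equal = all(n == compiled.groups for n in lengths)
```

A few examples agreeing is evidence rather than proof, so the C code or the docs remain the authority here.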
> This is the matching implementation:
>
>     for m in re.finditer(pattern, string, flags=flags):
>         g = m.groups()
>         if len(g) == 1:
>             yield g[0]
>         elif g:
>             yield tuple(s if s else '' for s in g)
>         else:
>             yield m.group()
>
>
> Why not just call groups(default='') instead of calling groups() to
> replace them with None and then using a genexpr to convert that None to ''?
>
> More importantly, you can’t return '', you have to return '' or b''
> depending on the type of the input string, using the same rule (whatever it
> is) that findall and the rest of the module use. (I think that’s worked out
> at compile time and exposed on the compiled pattern object, but I’m not
> sure.)
>
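
One way to pick the right empty value, sketched under the assumption that the type of the compiled pattern's `pattern` attribute is a good enough proxy for the rule the module uses. `empty_for` is a hypothetical helper, not part of re:

```python
import re


def empty_for(compiled):
    """Return '' for a str pattern and b'' for a bytes pattern.

    Hypothetical helper: it relies on re refusing to mix str patterns
    with bytes input (and vice versa), so the pattern type is assumed
    to determine the type of the results.
    """
    return '' if isinstance(compiled.pattern, str) else b''


str_pattern = re.compile(r'(a)?b')
bytes_pattern = re.compile(rb'(a)?b')

# Unmatched groups filled with an empty value of the matching type.
str_groups = str_pattern.search('b').groups(default=empty_for(str_pattern))
bytes_groups = bytes_pattern.search(b'b').groups(default=empty_for(bytes_pattern))
```

Whether this matches the rule in the C code for every bytes-like input is not confirmed here.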
> And even using a default with groups assumes I guessed right about the
> problem, and that it’s the only difference in behavior. If not, it may
> still be a hack that only sometimes gets the right answer and just nobody’s
> thought up a test case otherwise. I think you really do need to go through
> either the C code or the docs to make sure there aren’t any other edge
> cases.
>
> Updated unit test:
>
>
> Are there tests for findall and/or finditer in the stdlib test suite with
> wide coverage that you could adapt to compare list(findalliter) vs. findall
> or something?
>
>
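
A minimal sketch of such a comparison. It is not adapted from the stdlib test suite: the case list is a small hand-picked sample, and `draft_findalliter` is simply the implementation quoted above wrapped in a function so that it can run.

```python
import re


def draft_findalliter(pattern, string, flags=0):
    # The draft quoted earlier in this message, unchanged.
    for m in re.finditer(pattern, string, flags=flags):
        g = m.groups()
        if len(g) == 1:
            yield g[0]
        elif g:
            yield tuple(s if s else '' for s in g)
        else:
            yield m.group()


# Hand-picked (pattern, string) pairs; a real test would use many more.
CASES = [
    (r'\w+', 'ab cd'),        # no groups
    (r'(a)(b)?', 'ab a'),     # optional second group
    (r'(a)|(b)', 'ab'),       # alternation, one group unmatched
    (r'(a)?b', 'b'),          # single group, unmatched
    (rb'(a)|(b)', b'ab'),     # bytes pattern and input
]


def mismatches(candidate, cases=CASES):
    """Return (pattern, expected, actual) for each case that disagrees."""
    bad = []
    for pattern, string in cases:
        expected = re.findall(pattern, string)
        actual = list(candidate(pattern, string))
        if actual != expected:
            bad.append((pattern, expected, actual))
    return bad


draft_mismatches = mismatches(draft_findalliter)
```

On this sample the draft disagrees with *findall()* in two cases: the single unmatched group comes back as None, and the bytes case gets '' where *findall()* gives b''.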

-- 
Juancarlo *Añez*
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/35OMYLEM35PGFWCXMVGHHOJRXNXLNTKP/
Code of Conduct: http://python.org/psf/codeofconduct/