[Python-ideas] Re: findfirst() from findalliter(), and first

Juancarlo Añez Sat, 28 Dec 2019 10:15:12 -0800

Good point, Andrew!

I forgot to include the unit tests I wrote. They are included below.


For reference, this is the implementation of *findall()*:

static PyObject *
_sre_SRE_Pattern_findall_impl(PatternObject *self, PyObject *string,
                              Py_ssize_t pos, Py_ssize_t endpos)
/*[clinic end generated code: output=f4966baceea60aca input=5b6a4ee799741563]*/
{
    SRE_STATE state;
    PyObject* list;
    Py_ssize_t status;
    Py_ssize_t i, b, e;

    if (!state_init(&state, self, string, pos, endpos))
        return NULL;

    list = PyList_New(0);
    if (!list) {
        state_fini(&state);
        return NULL;
    }

    while (state.start <= state.end) {

        PyObject* item;

        state_reset(&state);

        state.ptr = state.start;

        status = sre_search(&state, PatternObject_GetCode(self));
        if (PyErr_Occurred())
            goto error;

        if (status <= 0) {
            if (status == 0)
                break;
            pattern_error(status);
            goto error;
        }

        /* don't bother to build a match object */
        switch (self->groups) {
        case 0:
            b = STATE_OFFSET(&state, state.start);
            e = STATE_OFFSET(&state, state.ptr);
            item = getslice(state.isbytes, state.beginning,
                            string, b, e);
            if (!item)
                goto error;
            break;
        case 1:
            item = state_getslice(&state, 1, string, 1);
            if (!item)
                goto error;
            break;
        default:
            item = PyTuple_New(self->groups);
            if (!item)
                goto error;
            for (i = 0; i < self->groups; i++) {
                PyObject* o = state_getslice(&state, i+1, string, 1);
                if (!o) {
                    Py_DECREF(item);
                    goto error;
                }
                PyTuple_SET_ITEM(item, i, o);
            }
            break;
        }

        status = PyList_Append(list, item);
        Py_DECREF(item);
        if (status < 0)
            goto error;

        state.must_advance = (state.ptr == state.start);
        state.start = state.ptr;
    }

    state_fini(&state);
    return list;

error:
    Py_DECREF(list);
    state_fini(&state);
    return NULL;

}

As far as I understand it, my implementation of *findalliter()* matches the
semantics in the *switch* statement.
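
As a quick illustration of the three branches of that *switch*, here is a small check of what *findall()* returns depending on the number of capture groups in the pattern. It is only a sketch using the standard *re* module; the sample string is the one used in the unit tests in this message.

```python
import re

s = 'xxxaxxxxbxxx'

# case 0: no groups -> the whole match as a string
no_groups = re.findall('a.*b', s)

# case 1: exactly one group -> that group's text, not a 1-tuple
one_group = re.findall('(a)', s)

# default: two or more groups -> a tuple with one string per group
many_groups = re.findall('(a).*(b)', s)

print(no_groups)    # ['axxxxb']
print(one_group)    # ['a']
print(many_groups)  # [('a', 'b')]
```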

This is my unit test. As you noted, it's not complete as it lacks a case
for unmatched groups.

import re

import pytest


def test_findfirst():
    def check(pattern, string):
        try:
            g = re.findall(pattern, string)
            assert findfirst(pattern, string) == g[0], '%s %s' % (pattern, string)
            assert list(findalliter(pattern, string)) == g, '%s %s' % (pattern, string)
        except ValueError:
            assert len(g) == 0
            with pytest.raises(ValueError):
                findfirst(pattern, string)

    s = 'xxxaxxxxbxxx'
    check('a', s)
    check('(a)', s)
    check('(a|b)', s)
    check('a.*b', s)
    check('(a).*(b)', s)

    # no match
    check('y', s)

    assert findfirst('y', s, default='THE DEFAULT') == 'THE DEFAULT'

    base = s
    for i in range(2, 10):
        s = base * i
        g = re.findall('a', s)
        assert len(g) > 1

        check('a', s)
        check('(a)', s)
        check('(a|b)', s)
        check('a.*b', s)
        check('(a).*(b)', s)

        check('y', s)

You're right about *findall()* *always* returning strings:

In [1]: import re

In [2]: re.findall(r'(a).*(b)?.*(c)', 'axxxxxxxxxxxxc')
Out[2]: [('a', '', 'c')]

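The same behaviour can be seen from the match object side. This is a minimal comparison, assuming (as Andrew guessed in the message quoted below) that *findall()* treats groups that did not participate in the match the way *m.groups(default='')* does:

```python
import re

pattern = r'(a).*(b)?.*(c)'
string = 'axxxxxxxxxxxxc'

m = re.search(pattern, string)

# groups() reports a non-participating group as None by default
print(m.groups())            # ('a', None, 'c')

# passing default='' substitutes the empty string instead
print(m.groups(default=''))  # ('a', '', 'c')

# which is what findall() puts in its result for this input
print(re.findall(pattern, string) == [m.groups(default='')])  # True
```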

This is the matching implementation:

    for m in re.finditer(pattern, string, flags=flags):
        # default='' so that groups which did not participate in the
        # match come back as '' (also in the single-group case)
        g = m.groups(default='')
        if len(g) == 1:
            yield g[0]
        elif g:
            yield g
        else:
            yield m.group()
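
For completeness, here is a self-contained sketch of how that loop could be wrapped into functions. The signatures of *findalliter()* and *findfirst()*, the `_MISSING` sentinel, and the ValueError raised on no match are assumptions inferred from the unit tests in this message, not a settled API. It fills non-participating groups via *m.groups(default=string[:0])*, on the assumption that this also keeps bytes patterns consistent with *findall()*.

```python
import re

_MISSING = object()  # hypothetical sentinel meaning "no default was given"


def findalliter(pattern, string, flags=0):
    """Yield the same items re.findall() would return, lazily."""
    empty = string[:0]  # '' for str input, b'' for bytes input
    for m in re.finditer(pattern, string, flags=flags):
        g = m.groups(default=empty)
        if len(g) == 1:
            yield g[0]          # one group: the group's text
        elif g:
            yield g             # several groups: a tuple of strings
        else:
            yield m.group()     # no groups: the whole match


def findfirst(pattern, string, flags=0, default=_MISSING):
    """Return the first findalliter() item, or default if there is none."""
    for item in findalliter(pattern, string, flags=flags):
        return item
    if default is _MISSING:
        raise ValueError('no match for %r' % (pattern,))
    return default


s = 'xxxaxxxxbxxx'
print(findfirst('(a).*(c)?.*(b)', s))            # ('a', '', 'b')
print(findfirst('y', s, default='THE DEFAULT'))  # THE DEFAULT
```

Because *findfirst()* stops at the first item the generator yields, it does not scan the rest of the string the way *re.findall(...)[0]* would.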

Updated unit test:

import re

import pytest


def test_findfirst():
    def check(pattern, string):
        try:
            g = re.findall(pattern, string)
            assert findfirst(pattern, string) == g[0], '%s %s' % (pattern, string)
            assert list(findalliter(pattern, string)) == g, '%s %s' % (pattern, string)
        except ValueError:
            assert len(g) == 0
            with pytest.raises(ValueError):
                findfirst(pattern, string)


    def all_checks(s):
        check('a', s)
        check('(a)', s)
        check('(a|b)', s)
        check('a.*b', s)
        check('(a).*(b)', s)
        check('(a).*(c)?.*(b)', s)  # non-matching group

        # no match
        check('y', s)
        check('(y)?.*(z)?.*(q)', s)

        # empty matches
        check('(a).*(y?).*(b)', s)

        assert findfirst('y', s, default='THE DEFAULT') == 'THE DEFAULT'

    s = 'xxxaxxxxbxxx'
    all_checks(s)

    base = s
    for i in range(2, 10):
        s = base * i
        g = re.findall('a', s)
        assert len(g) == i

        all_checks(s)


Cheers,


On Fri, Dec 27, 2019 at 5:37 PM Andrew Barnert <abarn...@yahoo.com> wrote:

> > On Dec 27, 2019, at 09:00, Juancarlo Añez <apal...@gmail.com> wrote:
> >
> >    for m in re.finditer(pattern, string, flags=flags):
> >        g = m.groups()
> >        if len(g) == 1:
> >            yield g[0]
> >        elif g:
> >            yield g
> >        else:
> >            yield m.group()
>
> I don’t think this does the same thing as findall in every case. For
> example, for capture groups that don’t participate in the match, you’ll get
> tuples like ('spam', None, '42'), when I’m pretty sure findall always has
> strings no matter what. I’m not sure exactly what the rule is for how it
> does that: maybe it’s just the same thing as m.groups(default='') would
> give you?
>
> At any rate, it seems like this isn’t as trivial to port from C as it
> looked, so this needs solid unit tests. (Maybe the ones for findall are
> already good enough if you just adapt them?)
>
> That’s also a great argument that it should be added to the re module, so
> people don’t have to try to figure out how to port C code to Python and
> then test the hell out of it just to get something that’s only missing in
> the first place for historical/naming reasons.
>
>
>

-- 
Juancarlo *Añez*

Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/LGNZ7Q5VUYXHUOCSDID5FL6S5TNUABHR/
