On Mon, Jul 15, 2019 at 8:47 PM Andrew Barnert <abarn...@yahoo.com> wrote:
> On Jul 15, 2019, at 18:44, Nam Nguyen <bits...@gmail.com> wrote:
> >
> > I have implemented a tiny (~200 SLOCs) package at
> > https://gitlab.com/nam-nguyen/parser_compynator that demonstrates
> > something like this is possible. There are several examples for you to
> > get a feel for it, as well as some early benchmark numbers to consider.
> > This is far smaller than any of the Python parsing libraries I have
> > looked at, yet more universal than many of them. I hope that it would
> > convert the skeptics ;).
>
> For at least some of your use cases, I don't think it's a problem that
> it's 70x slower than the custom parsers you'd be replacing. How often do
> you need to parse a million URLs in your inner loop? Also, if the
> function composition is really the performance hurdle, can you optimize
> that away relatively simply, just by building an explicit tree
> (expression-template style) and walking the tree in a __call__ method,
> rather than building an implicit tree of nested calls? (And that could
> be optimized further if needed, e.g. by turning the tree walk into a
> simple virtual machine where all of the fundamental operations are
> inlined into the loop, and maybe even accelerating that with C code.)
>
> But I do think it's a problem that there seems to be no way to usefully
> indicate failure to the caller, and I'm not sure that could be fixed as
> easily.

An empty set signifies that the parse has failed. Perhaps I have
misunderstood what you indicated here.

> Invalid inputs in your readme examples don't fail, they successfully
> return an empty set.

Because the library supports ambiguity, it can return more than one parse
result. The guarantee is that if it returns an empty set, the parse has
failed.

> There also doesn't seem to be any way to trigger a hard fail rather
> than a backtrack.

You can have a parser that raises an exception. None of the primitive
parsers do that, though.

> So I'm not sure how a real urlparse replacement could do the things the
> current one does, like raising a ValueError on https://abc.d[ef.ghi/
> complaining that the netloc looks like an invalid IPv6 address. (Maybe
> you could def a function that raises a ValueError and attach it as a
> where somewhere in the parser tree? But even if that works, wouldn't you
> get a meaningless exception that doesn't have any information about
> where in the source text or where in the parse tree it came from or why
> it was raised, and, as your readme says, a stack trace full of garbage?)

urlparse right now raises ValueError('Invalid IPv6 URL'). It does not
mention where in the source text the error comes from.

> Can you add failure handling without breaking the "~200LOC and easy to
> read" feature of the library, and without breaking the "easy to read
> once you grok parser combinators" feature of the parsers built with it?

This is a good request. I will have to play around with the idea more.
What I think could be the most challenging task is attributing a failure
to the appropriate rule(s) (e.g. expr expects "term + term", but the
input only has "term +"). I feel like some metadata about the grammar
might be required here, and that might be too unwieldy to provide in a
parser combinator formulation. Interestingly enough, regex doesn't have
anything like this either.

Cheers,
Nam
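A minimal, self-contained sketch of the conventions discussed above:
results are sets of (value, next position) pairs, an empty set means the
parse failed, a set with more than one element means the input parsed
ambiguously, and a hard failure has to be requested explicitly via a
raising combinator. The helper names here (lit, alt, then, require,
empty) are made up for illustration and are not parser_compynator's
actual API.

    def lit(c):
        # Match one literal character. Success yields a one-element set
        # of (value, next_position) pairs; failure yields an empty set.
        def parse(text, pos=0):
            return {(c, pos + 1)} if text[pos:pos + 1] == c else set()
        return parse

    def alt(p, q):
        # Alternation is just set union, so ambiguity shows up naturally
        # as a result set with more than one element.
        def parse(text, pos=0):
            return p(text, pos) | q(text, pos)
        return parse

    def then(p, q):
        # Sequence: feed every result of p into q and combine the values.
        def parse(text, pos=0):
            return {(v1 + v2, end2)
                    for v1, end1 in p(text, pos)
                    for v2, end2 in q(text, end1)}
        return parse

    def require(p, message):
        # Turn a would-be backtrack (empty set) into a hard error by
        # raising. None of the primitive parsers behave like this; the
        # grammar author has to opt in.
        def parse(text, pos=0):
            results = p(text, pos)
            if not results:
                raise ValueError('%s at position %d' % (message, pos))
            return results
        return parse

    def empty(text, pos=0):
        # Match nothing and always succeed.
        return {('', pos)}

    lit('a')('abc')              # {('a', 1)}           one parse
    lit('a')('xbc')              # set()                failure
    alt(lit('a'), empty)('abc')  # {('a', 1), ('', 0)}  two parses

    bracketed = then(lit('['), require(lit(']'), 'unclosed bracket'))
    bracketed('[]')              # {('[]', 2)}
    bracketed('[x')              # ValueError: unclosed bracket at position 1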
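For reference, the urlparse behaviour mentioned above is easy to
reproduce, and the error message indeed carries no position information:

    >>> from urllib.parse import urlparse
    >>> urlparse('https://abc.d[ef.ghi/')
    Traceback (most recent call last):
      ...
    ValueError: Invalid IPv6 URL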