On Apr 10, 8:38 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> The impression that I have (from a distance) is that Pyparsing is a good
> interface abstraction with a kludgy and slow implementation. That the
> implementation uses regexps just goes to show how kludgy it is. One
> hopes that someday there will be a more serious implementation, perhaps
> using llvm-py (I wonder whatever happened to that project, by the way)
> so that your parser script will compile to executable machine code on
> the fly.
I am definitely flattered that pyparsing stirs up so much interest, and among such a distinguished group. But I have to take some umbrage at Paul Rubin's left-handed compliment, "Pyparsing is a good interface abstraction with a kludgy and slow implementation," especially since he forms his opinions "from a distance". I actually *did* put some thought into what I wanted from pyparsing before designing it, and that thinking formed the basis of a chapter of "Getting Started with Pyparsing" (available here as a free online excerpt: http://my.safaribooksonline.com/9780596514235/what_makes_pyparsing_so_special#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTk3ODA1OTY1MTQyMzUvMTYmaW1hZ2VwYWdlPTE2), the "Zen of Pyparsing" as it were. My goals were:

- build parsers using explicit constructs (such as words, groups, repetition, and alternatives), rather than encoding expressions in specialized character sequences, as found in regexen
- easy parser construction from primitive elements up to complex groups and alternatives, using Python's operator overloading so that parsers can be written directly in ordinary Python syntax; include mechanisms for defining recursive parser expressions
- implicit skipping of whitespace between parser elements
- results returned not just as a list of strings, but as a rich data object, with access to parsed fields by name or by list index, borrowing interfaces from both dicts and lists for natural adoption into common Python idioms
- no separate code-generation step, a la lex/yacc
- support for parse-time callbacks, for specialized token handling, conversion, and/or construction of data structures
- 100% pure Python, to be runnable on any platform that supports Python
- liberal licensing, to permit easy adoption into any user's projects anywhere

So raw performance really didn't even make my short list, beyond the obvious "should be tolerably fast enough."
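To make those goals concrete, here is a small sketch (my own toy example, not from the book, and it assumes pyparsing is installed) showing explicit constructs, operator overloading, implicit whitespace skipping, and named-field access on the results:

```python
from pyparsing import Word, Group, Suppress, alphas, nums, delimitedList

# explicit constructs, combined with overloaded operators - no regex syntax
ident   = Word(alphas, alphas + nums + "_")
integer = Word(nums)

# parse assignments like "point = (1, 2, 3)"; whitespace between
# elements is skipped implicitly, and fields get results names
assignment = (ident("name") + Suppress("=")
              + Group(Suppress("(") + delimitedList(integer) + Suppress(")"))("coords"))

result = assignment.parseString("point = ( 1, 2,3 )")
print(result.name)          # access parsed fields by name, dict-style
print(list(result.coords))  # or treat grouped results like a list
```

The returned ParseResults object is what the "rich data object" bullet refers to: the same result answers to both `result.name` and `result[0]`.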
I have found myself reading posts on c.l.py with wording like "I'm trying to parse <blah-blah> and I've been trying for hours/days to get this regex working." For kicks, I'd spend 5-15 minutes working up a working pyparsing solution, which *does* run comparatively slowly, perhaps taking a few minutes to process the poster's data file. But the net solution is developed and running in under half an hour, which to me seems like an overall gain compared to hours of fruitless struggling with backslashes and regex character sequences. On top of which, the pyparsing solutions are still readable when I come back to them weeks or months later, instead of my staring at some line-noise regex and scratching my head wondering what it was for. And sometimes "comparatively slowly" means that it runs 50x slower than a compiled method that runs in 0.02 seconds - that's still getting the job done in just 1 second.

And is the internal use of regexes within pyparsing really a "kludge"? Why? They are almost completely hidden from the parser developer. And yet by using compiled regexes, I retain the portability of 100% Python while leveraging the compiled speed of the re engine.

It does seem that there have been many posts of late (either on c.l.py or the related posts on Stackoverflow) where the OP is trying to either scrape content from HTML, or parse some type of recursive expression. HTML scrapers implemented using re's are terribly fragile, since HTML in the wild often contains little surprises (unexpected whitespace; upper/lower case inconsistencies; tag attributes in unpredictable order; attribute values with double, single, or no quotation marks) which completely frustrate any re-based approach. Granted, there are times when an re-parsing-of-HTML endeavor *isn't* futile or doomed from the start - the OP may be working with a very restricted set of HTML, generated from some other script so that the output is very consistent.
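As a quick illustration of handling those "little surprises" (again assuming pyparsing is installed; the sample HTML string is invented), pyparsing's makeHTMLTags helper builds tag expressions that tolerate odd whitespace, mixed-case tag and attribute names, shuffled attribute order, and quoted or unquoted attribute values:

```python
from pyparsing import makeHTMLTags

# makeHTMLTags returns expressions for the opening and closing tags;
# matched attributes are exposed as named results (lowercased)
aStart, aEnd = makeHTMLTags("a")

# messy-but-legal HTML: uppercase tag, unquoted value, spaces around '='
html = '<A TARGET=_blank href = "http://example.com">a link</a>'

for tokens, start, end in aStart.scanString(html):
    print(tokens.href)  # -> http://example.com
```

A hand-rolled regex for the same job would have to anticipate every one of those variations explicitly.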
Unfortunately, this poster usually gets thrown under the same "you'll never be able to parse HTML with re's" bus. I can't explain the surge in these posts, other than to wonder if we aren't just seeing a skewed sample - that is, the many cases where people *are* successfully using re's to solve their text extraction problems aren't getting posted to c.l.py, since no one posts questions they already have the answers to.

So don't be too dismissive of pyparsing, Mr. Rubin. I've gotten many e-mails, wiki, and forum posts from Python users at all levels of the expertise scale, saying that pyparsing has helped them be very productive in one aspect or another of creating a command parser, or adding safe expression evaluation to an app, or just extracting some specific data from a log file. I am encouraged that most report that they can get their parsers working in reasonably short order, often by reworking one of the examples that comes with pyparsing.

If you're offering to write that extension to pyparsing that generates the parser runtime in fast machine code, it sounds totally bitchin' and I'd be happy to include it when it's ready.

-- Paul
--
http://mail.python.org/mailman/listinfo/python-list