Re: Intro to Pyparsing Article at ONLamp
Anton Vredegoor wrote:
> And pave the way for a natural language parser. Maybe there's even some (sketchy) path now to link computer languages and natural languages. In my mind Python has always been closer to human languages than other programming languages.
>
> From what I learned about it, language recognition is the easy part, language production is what is hard. But even the easy part has a long way to go, and since we're also using a

I think you're underestimating just how far "a long way to go" is, for natural language processing. I daresay that no current computer-language parser will come even close to recognizing a significant fraction of human language.

Using English, because that's the only language I'm fluent in, consider the sentence:

"The horse raced past the barn fell."

It's just one of many "garden path sentences," where something that occurs late in the sentence needs to trigger a reparse of the entire sentence. This is made even worse because of the semantic meanings of English words -- English, along with every other nonconstructed language that I know of, is grammatically ambiguous, in that semantic meanings are necessary to make 100% confident parses. That's indeed the basis of a class of humour.

Generating human language -- turning concepts into words -- is the easy part. A concept-to-English transformer would only need to transform into a subset of English, and nobody will notice the difference.

--
It's just an object; it's not what you think. :wq
--
http://mail.python.org/mailman/listinfo/python-list
Re: Intro to Pyparsing Article at ONLamp
Christopher Subich wrote:
> Using English, because that's the only language I'm fluent in, consider the sentence:
>
> "The horse raced past the barn fell."
>
> It's just one of many "garden path sentences," where something that occurs late in the sentence needs to trigger a reparse of the entire sentence.

I can't parse that at all. Are you sure it's correct? Aren't "raced" and "fell" both trying to be verbs on the same subject? English surely doesn't allow that forbids that sort of thing. (wink)

-Peter
Re: Intro to Pyparsing Article at ONLamp
Peter Hansen wrote:
> Christopher Subich wrote:
>> Using English, because that's the only language I'm fluent in, consider the sentence:
>>
>> "The horse raced past the barn fell."
>>
>> It's just one of many "garden path sentences," where something that occurs late in the sentence needs to trigger a reparse of the entire sentence.
>
> I can't parse that at all.

Upon seeing 'fell' as the main verb, you have to reparse 'raced past the barn' as not the predicate but as a past participle adjectival phrase, like 'bought last year' or 'expected to win'. The phrase parsed as a predicate reparses as a modifier, as in this sentence ;-)

Terry Jan Reedy
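The reparse described here can be illustrated with a toy backtracking recognizer. This is only a sketch under loud assumptions: the six-word lexicon, the tag names (V_PAST, V_PART, RRC for "reduced relative clause"), the grammar rules, and every function name are invented for this one sentence and do not come from any library or from the thread. The point is just that "raced" carries two readings, and the main-verb reading only fails once "fell" is reached.

```python
# Toy lexicon (hypothetical): "raced" is ambiguous between a past-tense
# main verb and a past participle.
LEXICON = {
    "the": {"DET"},
    "horse": {"N"},
    "barn": {"N"},
    "past": {"P"},
    "raced": {"V_PAST", "V_PART"},
    "fell": {"V_PAST"},
}


def parse_np(words, i):
    """Yield (tree, next_index) for every noun phrase starting at words[i]."""
    if (i + 1 < len(words)
            and "DET" in LEXICON[words[i]]
            and "N" in LEXICON[words[i + 1]]):
        base = ("NP", words[i], words[i + 1])
        yield base, i + 2
        # Second reading: NP + participle + PP, i.e. a reduced relative
        # clause ("the horse [that was] raced past the barn").
        j = i + 2
        if j < len(words) and "V_PART" in LEXICON[words[j]]:
            for pp, k in parse_pp(words, j + 1):
                yield ("NP", base, ("RRC", words[j], pp)), k


def parse_pp(words, i):
    """Yield (tree, next_index) for every prepositional phrase at words[i]."""
    if i < len(words) and "P" in LEXICON[words[i]]:
        for np, j in parse_np(words, i + 1):
            yield ("PP", words[i], np), j


def parse_vp(words, i):
    """Yield (tree, next_index) for every verb phrase at words[i]."""
    if i < len(words) and "V_PAST" in LEXICON[words[i]]:
        yield ("VP", words[i]), i + 1
        for pp, j in parse_pp(words, i + 1):
            yield ("VP", words[i], pp), j


def parse_sentence(text):
    """Return every S -> NP VP analysis that consumes the whole sentence."""
    words = text.lower().rstrip(".").split()
    return [("S", np, vp)
            for np, i in parse_np(words, 0)
            for vp, j in parse_vp(words, i)
            if j == len(words)]


if __name__ == "__main__":
    for tree in parse_sentence("The horse raced past the barn fell."):
        print(tree)
```

With "fell" at the end, the only surviving analysis is the one where "raced past the barn" modifies "the horse" and "fell" is the main verb; without "fell", the only surviving analysis has "raced" as the main verb. Words outside the toy lexicon raise a KeyError, which is one small reminder of how far this is from real natural-language parsing.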
Re: Intro to Pyparsing Article at ONLamp
On Mon, 30 Jan 2006 16:39:51 -0500 in comp.lang.python, Peter Hansen wrote:
> Christopher Subich wrote:
>> Using English, because that's the only language I'm fluent in, consider the sentence:
>>
>> "The horse raced past the barn fell."
>>
>> It's just one of many "garden path sentences," where something that occurs late in the sentence needs to trigger a reparse of the entire sentence.
>
> I can't parse that at all. Are you sure it's correct? Aren't "raced" and "fell" both trying to be verbs on the same subject? English surely doesn't allow that forbids that sort of thing. (wink)

I had a heck of a time myself. Try "The horse that was raced..." and see if it doesn't make more sense.

Regards,
-=Dave
--
Change is inevitable, progress is not.
Re: Intro to Pyparsing Article at ONLamp
On Mon, 30 Jan 2006 16:39:51 -0500, Peter Hansen wrote:
> Christopher Subich wrote:
>> Using English, because that's the only language I'm fluent in, consider the sentence:
>>
>> "The horse raced past the barn fell."
>>
>> It's just one of many "garden path sentences," where something that occurs late in the sentence needs to trigger a reparse of the entire sentence.
>
> I can't parse that at all. Are you sure it's correct? Aren't "raced" and "fell" both trying to be verbs on the same subject? English surely doesn't allow that forbids that sort of thing. (wink)

The computer at CMU is pretty good at parsing. You can try it at
http://www.link.cs.cmu.edu/link/submit-sentence-4.html

Here's what it did with "The horse raced past the barn fell.":

Time 0.00 seconds (81.38 total)
Found 2 linkages (2 with no P.P. violations)
Linkage 1, cost vector = (UNUSED=0 DIS=0 AND=0 LEN=13)

[ASCII linkage diagram garbled in the archive; link labels shown: Xp, Ss, Wd, Js, Ds, Mv, MVp]
LEFT-WALL the horse.n raced.v past.p the barn.n fell.v .

Constituent tree:

(S (NP (NP The horse)
       (VP raced
           (PP past
               (NP the barn))))
   (VP fell)
   .)

IIUC, that's the way I parse it too ;-) (I.e., "The horse [being] raced past the barn fell.")

BTW, the online response has some clickable elements in the diagram to get to definitions of the terms.

Regards,
Bengt Richter
Re: Intro to Pyparsing Article at ONLamp
Bengt Richter wrote:
> On Mon, 30 Jan 2006 16:39:51 -0500, Peter Hansen wrote:
> [...]
> The computer at CMU is pretty good at parsing. You can try it at
> http://www.link.cs.cmu.edu/link/submit-sentence-4.html
>
> Here's what it did with "The horse raced past the barn fell.":
> [...]

I suppose we shouldn't torment these programs ...

Time 0.03 seconds (81.41 total)
Found 10 linkages (6 with no P.P. violations)
Linkage 1, cost vector = (UNUSED=0 DIS=2 AND=0 LEN=19)

[ASCII linkage diagram garbled in the archive; link labels shown: Os, Bs*t, MVt, Sp*i, Ce, D*u, R, Cr, Ss, MVb, Mpc, Js, Ds]
I.p thought.v I.p saw.v the langauge[?].n that.r Python is.v better.a than in the corridor.n

Constituent tree:

(S (NP I)
   (VP thought
       (SBAR (S (NP I)
                (VP saw
                    (NP (NP the langauge)
                        (SBAR (WHNP that)
                              (S (NP Python)
                                 (VP is
                                     (ADVP better)
                                     (PP (NP than)
                                         (PP in
                                             (NP the corridor))))))))))))

--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/
Re: Intro to Pyparsing Article at ONLamp
Terry Reedy wrote:
> Peter Hansen wrote:
>> Christopher Subich wrote:
>>> "The horse raced past the barn fell."
>>>
>>> It's just one of many "garden path sentences," where something that occurs late in the sentence needs to trigger a reparse of the entire sentence.
>>
>> I can't parse that at all.
>
> Upon seeing 'fell' as the main verb, you have to reparse 'raced past the barn' as not the predicate but as a past participle adjectival phrase, like 'bought last year' or 'expected to win'. The phrase parsed as a predicate reparses as a modifier, as in this sentence ;-)

Ah, as in "the horse that was raced past the barn fell". Got it. :-)

-Peter
Re: Intro to Pyparsing Article at ONLamp
Paul McGuire wrote:
> There are two types of parsers: design-driven and data-driven. With design-driven parsing, you start with a BNF that defines your language or data format, and then construct the corresponding grammar parser. As the design evolves and expands (new features, keywords, additional options), the parser has to be adjusted to keep up.
>
> With data-driven parsing, you are starting with data to be parsed, and you have to discern the patterns that structure this data. Data-driven parsing usually shows this exact phenomenon that you describe, that new structures that were not seen or recognized before arrive in new data files, and the parser breaks. There are a number of steps you can take to make your parser less fragile in the face of uncertain data inputs:
> - using results names to access parsed tokens, instead of relying on simple position within an array of tokens
> - anticipating features that are not shown in the input data, but that are known to be supported (for example, the grammar expressions returned by pyparsing's makeHTMLTags method support arbitrary HTML attributes - this creates a more robust parser than simply coding a parser or regexp to match 'A HREF=' + quotedString)
> - accepting case-insensitive inputs
> - accepting whitespace between adjacent tokens, but not requiring it - pyparsing already does this for you

I'd like to add another parser type; let's call this a natural language parser type. Here we have to quickly adapt to human typing errors or problems with the transmission channel. I think videotext pages offer both kinds of challenges, so could provide good training material. Of course in such circumstances it seems to be hardly possible for a computer alone to produce correct parsing. Sometimes I even have to start up a chess program to inspect a game after parsing it into a PGN file and correct unlikely or impossible move sequences. So since we're now into human-assisted parsing anyway, the most gain would be made in further improving the user interface?

>> For example, I had this experience when parsing chess games from videotext pages I grab from my videotext enabled TV capture card. Maybe once or twice in a year there's a chess page with games on videotext, but videotext chess display format always changes slightly in the meantime so I have to adapt my script. For such things I've switched back to 'hand' coding because it seems to be more flexible.
>
> Do these chess games display in PGN format (for instance, "15. Bg5 Rf8 16. a3 Bd5 17. Re1+ Nde5")? The examples directory that comes with pyparsing includes a PGN parser (submitted by Alberto Santini).

Ah, now I remember, I think this was what got me started on pyparsing some time ago. The Dutch videotext pages are online too (and there's a game today):

http://teletekst.nos.nl/tekst/683-01.html

But as I said there can be transmission errors and human errors. And the Dutch notation is used; for example, an L is a B, a P is an N (knight), D is Q, T is R.

I'd be interested in a parser that could make inferences about chess games and use it to correct these pages!

>> What I would like to see, in order to improve on this situation is a graphical (tkinter) editor-highlighter in which it would be possible to select blocks of text from an (example) page and 'name' this block of text and select a grammar which it complies with, in order to assign a role to it later. That would be the perfect companion to pyparsing. At the moment I don't even know if such a thing would be feasible...
>
> There are some commercial parser generator products that work exactly this way, so I'm sure it's feasible. Yes, this would be a huge enabler for creating grammars.

And pave the way for a natural language parser. Maybe there's even some (sketchy) path now to link computer languages and natural languages. In my mind Python has always been closer to human languages than other programming languages. From what I learned about it, language recognition is the easy part, language production is what is hard. But even the easy part has a long way to go, and since we're also using a *visual* interface for something that in the end originates from sound sequences (even what I type here is essentially a representation of a verbal report) we have ultimately a difficult switch back to auditory parsing ahead of us.

But in the meantime the tools produced (even if only for text parsing) are already useful and entertaining. Keep up the good work.

Anton.
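The Dutch piece letters mentioned in this message can be translated to English (PGN-style) letters before the text is handed to a PGN parser. What follows is a minimal sketch, not anything posted in the thread: the mapping reflects the usual Dutch names (K koning, D dame, T toren, L loper, P paard, where the knight becomes N because K is the king in both notations), while the regular expression and the function name are assumptions chosen for illustration. It only handles short algebraic moves and does nothing about transmission errors.

```python
import re

# Dutch piece letter -> English piece letter (assumed standard mapping).
DUTCH_TO_ENGLISH = {"K": "K", "D": "Q", "T": "R", "L": "B", "P": "N"}

# A letter only counts as a piece when it starts a word and is followed by
# a move body: optional disambiguation, optional capture sign, target square.
PIECE_MOVE = re.compile(r"\b([KDTLP])(?=[a-h1-8]?x?[a-h][1-8])")


def dutch_to_english(movetext):
    """Replace Dutch piece letters in short algebraic moves with English ones."""
    return PIECE_MOVE.sub(lambda m: DUTCH_TO_ENGLISH[m.group(1)], movetext)


if __name__ == "__main__":
    print(dutch_to_english("15. Lg5 Tf8 16. a3 Ld5 17. Te1+ Pde5"))
```

The lookahead keeps the substitution away from ordinary words such as player names that happen to start with one of these letters, since those are not followed by a square. Inferring and correcting impossible moves, as wished for above, would need an actual chess engine or move generator on top of this.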
Re: Intro to Pyparsing Article at ONLamp
Paul McGuire wrote:
> I just published my first article on ONLamp, a beginner's walkthrough for pyparsing.
>
> Please check it out at http://www.onlamp.com/pub/a/python/2006/01/26/pyparsing.html, and be sure to post any questions or comments.

I like your article and pyparsing. But since you ask for comments I'll give some.

For unchanging datafile formats pyparsing seems to be OK. But for highly volatile data like videotext pages or maybe some html tables one often has the experience of failure after investing some time in writing a grammar because the dataformats seem to change between the times one uses the script.

For example, I had this experience when parsing chess games from videotext pages I grab from my videotext enabled TV capture card. Maybe once or twice in a year there's a chess page with games on videotext, but videotext chess display format always changes slightly in the meantime so I have to adapt my script. For such things I've switched back to 'hand' coding because it seems to be more flexible.

(Or use a live internet connection to view the game instead of parsing videotext, but that's a lot less fun, and I don't have internet in some places.)

What I would like to see, in order to improve on this situation is a graphical (tkinter) editor-highlighter in which it would be possible to select blocks of text from an (example) page and 'name' this block of text and select a grammar which it complies with, in order to assign a role to it later. That would be the perfect companion to pyparsing.

At the moment I don't even know if such a thing would be feasible, or how hard it would be to make it, but I remember having seen data analyzing tools based on fixed column width data files, which is of course in a whole other league of difficulty of programming, but at least it gives some encouragement to the idea that it would be possible.

Thank you for your ONLamp article and for making pyparsing available. I had some fun experimenting with it and it gave me some insights in parsing grammars.

Anton
Re: Intro to Pyparsing Article at ONLamp
Anton Vredegoor wrote:
> I like your article and pyparsing. But since you ask for comments I'll give some.
>
> For unchanging datafile formats pyparsing seems to be OK. But for highly volatile data like videotext pages or maybe some html tables one often has the experience of failure after investing some time in writing a grammar because the dataformats seem to change between the times one uses the script.

There are two types of parsers: design-driven and data-driven. With design-driven parsing, you start with a BNF that defines your language or data format, and then construct the corresponding grammar parser. As the design evolves and expands (new features, keywords, additional options), the parser has to be adjusted to keep up.

With data-driven parsing, you are starting with data to be parsed, and you have to discern the patterns that structure this data. Data-driven parsing usually shows this exact phenomenon that you describe, that new structures that were not seen or recognized before arrive in new data files, and the parser breaks. There are a number of steps you can take to make your parser less fragile in the face of uncertain data inputs:
- using results names to access parsed tokens, instead of relying on simple position within an array of tokens
- anticipating features that are not shown in the input data, but that are known to be supported (for example, the grammar expressions returned by pyparsing's makeHTMLTags method support arbitrary HTML attributes - this creates a more robust parser than simply coding a parser or regexp to match 'A HREF=' + quotedString)
- accepting case-insensitive inputs
- accepting whitespace between adjacent tokens, but not requiring it - pyparsing already does this for you

> For example, I had this experience when parsing chess games from videotext pages I grab from my videotext enabled TV capture card. Maybe once or twice in a year there's a chess page with games on videotext, but videotext chess display format always changes slightly in the meantime so I have to adapt my script. For such things I've switched back to 'hand' coding because it seems to be more flexible.

Do these chess games display in PGN format (for instance, "15. Bg5 Rf8 16. a3 Bd5 17. Re1+ Nde5")? The examples directory that comes with pyparsing includes a PGN parser (submitted by Alberto Santini).

> What I would like to see, in order to improve on this situation is a graphical (tkinter) editor-highlighter in which it would be possible to select blocks of text from an (example) page and 'name' this block of text and select a grammar which it complies with, in order to assign a role to it later. That would be the perfect companion to pyparsing.
>
> At the moment I don't even know if such a thing would be feasible...

There are some commercial parser generator products that work exactly this way, so I'm sure it's feasible. Yes, this would be a huge enabler for creating grammars.

> Thank you for your ONLamp article and for making pyparsing available. I had some fun experimenting with it and it gave me some insights in parsing grammars.

Glad you enjoyed it, thanks for taking the time to reply!

-- Paul
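Three of the robustness steps listed in this message (results names, case-insensitive input, optional whitespace) can be shown in a few lines. This is a rough sketch that assumes the third-party pyparsing package is installed; the "move 15: white" record format and the names record, number, and player are invented for illustration and do not come from the article or the thread. It uses the camelCase method names of the pyparsing of that era, which current releases still accept as aliases.

```python
from pyparsing import CaselessKeyword, Suppress, Word, alphas, nums

# Convert the matched digits to an int at parse time.
integer = Word(nums).setParseAction(lambda toks: int(toks[0]))

# Hypothetical record format: "move <number> : <player>".
# - CaselessKeyword accepts "move", "MOVE", "Move", ...
# - expr("name") attaches a results name, so callers never index by position.
# - pyparsing skips whitespace between tokens but does not require it.
record = (
    CaselessKeyword("move")
    + integer("number")
    + Suppress(":")
    + Word(alphas)("player")
)

if __name__ == "__main__":
    for line in ("move 15: white", "MOVE   15 :White", "Move 15:white"):
        result = record.parseString(line)
        print(result["number"], result["player"])
```

Because the fields are fetched by name, inserting another token into the grammar later (say, an optional clock time before the player) would not break code that reads result["player"], which is the fragility the first bullet is about.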
Intro to Pyparsing Article at ONLamp
I just published my first article on ONLamp, a beginner's walkthrough for pyparsing.

Please check it out at http://www.onlamp.com/pub/a/python/2006/01/26/pyparsing.html, and be sure to post any questions or comments.

-- Paul