I am trying to write a parser to handle human-generated "info files" that accompany the type of legal live concert recordings you can find at http://bt.etree.org (see, for example, http://bklyn.org/~cae/info-files/mmw2002-04-20.txt and the numerous other examples in http://bklyn.org/~cae/info-files/).
These generally follow a common structure, but since they are typed up by hand there can be a lot of variation. The overall structure is usually something along the lines of band name, date, venue, source and transfer information, and then setlist/tracking info.

Because of the irregular structure, I am finding that writing a pure token-based parser is pretty tricky. I have a halfway-decent line-oriented parser, implemented mostly as a bunch of "if" statements that test against some state variables and regular expressions matching certain tell-tale strings (for example, different brands of microphones, DAT decks, concert hall names, state abbreviations, etc.).

For some masochistic reason, though, I've decided that I need to reimplement this using a proper grammar, and Parse::RecDescent seems like a good fit. But maybe not. As I said, I'm having difficulty with the token-based nature of P::RD. In some cases I want things split up word-wise, but in others I'd prefer to look for strings anywhere within a line (e.g. a microphone name like "Schoeps" is a pretty good indicator that I'm dealing with source info, and source info is pretty much guaranteed to span an entire line).

Here's my line-based parser: http://bklyn.org/~cae/InfoFile.pm
Here's the skeletal Parse::RecDescent parser I'm trying to use to do the same thing: http://bklyn.org/~cae/parser

I've tried my hand at using the <skip> directive, with a little luck (see the "artist" rule, which seems to work well) and also some spectacular failures: if I try to use it in the source or sourceinfo rules, things end up not matching. I'm also having difficulty with some of my rules being too greedy, and I am not sure how to stop them. For example, the "source" rule as written often ends up gobbling tokens like "Disc 1", which I'm hoping to match with the "disc" rule, or "Set I", which I try to match with the "set" rule. I've tried using ...!rule a bit, but again with little luck.
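For comparison, the line-based approach described above boils down to something like the following stripped-down sketch. This is not the code from InfoFile.pm: the patterns, the sample text, and the parse_info name are all invented for illustration, and the real tell-tale lists are much longer. It also assumes ISO-style dates and that the first non-blank line is the band name, neither of which holds for every file.

```perl
use strict;
use warnings;

# Hypothetical tell-tale patterns; the real lists are far longer.
my $DATE_RE   = qr/\b(\d{4}-\d{2}-\d{2})\b/;           # assumes ISO-style dates
my $STATE_RE  = qr/,\s*[A-Z]{2}\s*$/;                  # line ends in ", NY" etc.
my $DISC_RE   = qr/^\s*(?:Disc|CD)\s*(\d+)\b/i;
my $SET_RE    = qr/^\s*(Set\s+[IVX\d]+|Encore)\b/i;
my $TRACK_RE  = qr/^\s*(\d+)[.)]?\s+(.+?)\s*$/;
my $SOURCE_RE = qr/\b(?:Schoeps|DAT)\b/i;              # matched anywhere in the line

# Classify each line using regexes plus a little state, and return a hashref.
sub parse_info {
    my @lines = @_;
    my %info = (artist => undef, date => undef, venue => undef,
                source => [], tracks => []);
    my ($disc, $set) = (0, '');    # state: current disc number and set label

    for my $line (@lines) {
        next if $line =~ /^\s*$/;                      # skip blank lines
        (my $text = $line) =~ s/^\s+|\s+$//g;          # trimmed copy of the line

        if (!defined $info{artist}) {
            $info{artist} = $text;                     # first non-blank line
        }
        elsif (!defined $info{date} && $line =~ $DATE_RE) {
            $info{date} = $1;
        }
        elsif (defined $info{date} && !defined $info{venue}
               && $line =~ $STATE_RE) {
            $info{venue} = $text;
        }
        elsif ($line =~ $DISC_RE)   { $disc = $1 }
        elsif ($line =~ $SET_RE)    { $set  = $1 }
        elsif ($line =~ $TRACK_RE)  {
            push @{ $info{tracks} },
                { disc => $disc, set => $set, num => $1, title => $2 };
        }
        elsif ($line =~ $SOURCE_RE) {
            push @{ $info{source} }, $text;            # whole line is source info
        }
    }
    return \%info;
}

# Made-up sample in the general shape of an info file.
my $sample = <<'END';
Example Band
2002-04-20
Example Hall, Brooklyn, NY

Source: Schoeps mk4 > DAT
Disc 1
Set I
1. First Song
2. Second Song
Encore
3. Third Song
END

my $info = parse_info(split /\n/, $sample);
print "$info->{artist} / $info->{date} / $info->{venue}\n";
printf "d%d %-6s %2d %s\n", @{$_}{qw(disc set num title)}
    for @{ $info->{tracks} };
```

The point of the sketch is that each line is classified as a whole, so a "Disc 1" or "Set I" line can never be swallowed by the source-info branch. That is exactly the property I'm struggling to get out of a token-based grammar.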
I'd like to have some way to tell the parser that a newline should (usually) signal the end of a rule.

If anyone has any advice, I'd greatly appreciate it. It may be the case that the data set I'm working with is just NOT suited to this type of parsing, but I don't think I know enough about the solution domain to reach this decision myself.

--
Caleb Epstein | bklyn . org   |
cae at        | Brooklyn Dust | Th' MIND is the Pizza Palace of th' SOUL
bklyn dot org | Bunny Mfg.    |