On Wed, Nov 4, 2015 at 4:06 AM, Karsten Loesing <kars...@torproject.org> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hello developers,
>
> in the past few days I have been working on a grammar to parse Tor
> bridge network statuses and hopefully other Tor descriptors in the
> future. It's working, for some definition of working, but some issues
> remain and I need some help.
>
> I just uploaded my sources, consisting only of the grammar with a fair
> amount of documentation:
>
> https://people.torproject.org/~karsten/volatile/BridgeNetworkStatus.g4

Nice work, Karsten!

I'm hoping we move towards some kind of machine-readable
grammar/schema for all our data formats, and that we have our actual
parsing/encoding code generated from it. (When I did a survey of
where all our crash/assertion bugs for the last few years were, they
seemed to have a higher-than-usual concentration in our parsing code.)

One thing about this grammar in particular, though: It is over-strict.
It matches only the formats we use today, and not the formats we are
allowed to use in the future.

For one example, a flag on an 's' line can be any non-space string -
but this grammar will fail to parse unrecognized flags.

On the other hand, while we specify the order of r, s, w, p, a, lines
in a generated consensus, clients are required to parse the s, w, p,
and a lines in any order, but not to allow two s lines in a single 'r'
entry.

I think that because of the free-ordering and multiplicity-restriction
rules for our data formats, a context-free grammar simply isn't going
to match our spec very well.

> Quoting from that file to facilitate discussion here:
>
> There are multiple goals of having a grammar for Tor descriptors
> available on CollecTor:
>
> 1. Translate descriptors to JSON for statistical analysis: Some tools
> and databases require Tor descriptors in a standard format like JSON.
> This grammar and a parser generated from it can help making that
> translation as easy as possible, also to keep future maintenance as
> low as possible.
>
> 2. Provide a basis for descriptor-parsing libraries: As of late 2015,
> there are three libraries for parsing Tor descriptors: metrics-lib for
> Java, Stem for Python, and Zoossh for Go. It would be beneficial to
> place as much knowledge about the descriptor format into a grammar
> shared by all those libraries and then generate parsers for different
> languages from that grammar.
>
> 3. Serve as documentation for the Tor directory protocol
> specification: Tor descriptors are already documented using a
> hand-written grammar, but that may contain slight inaccuracies because
> it's not verified. This grammar could fix that by either detecting
> inaccuracies while trying to rewrite it to an executable grammar form
> or by replacing the grammar in the specification documentation with
> this executable grammar.
>
> Open issues and questions:
>
> - Was it smart to explicitly include all those SP tokens in the
> rules, or should those be discarded right away by the lexer? The main
> reason for keeping them was to stay as close to the specification as
> possible, but maybe that has downsides on the other goals.

IMO, once we have a grammar that is truly correct, that grammar should
_be_ the spec, and we should revise the main spec to reference the
grammar.

> - If a bridge uses a nickname (or other token that's supposed to be a
> STRING) that is also a keyword like "r" or "published", things get
> confusing. Try editing the input bridge network status and observe
> the result. But those are perfectly valid nicknames, so what can we do?

Change the lexing rules so that keywords are only recognized as such
at position 0 on the line, outside of a base64 block?

best wishes,
-- 
Nick
_______________________________________________
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
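
As a rough illustration of two points from the reply above (keywords
recognized only at position 0 on a line, and follow-up lines accepted in
any order but rejected when an at-most-once line repeats), a minimal
Python sketch might look like the following. It is not taken from the
thread, the grammar file, or any Tor library. The function names, the
AT_MOST_ONCE set (the thread only states the restriction for 's' lines),
and the simplified sample lines are all assumptions made for
illustration, and base64 blocks are not handled.

```python
# Hypothetical sketch: line-oriented parsing of 'r' entries.
# Assumption: "s", "w" and "p" may appear at most once per entry.
# The thread only states this restriction for "s" lines.
AT_MOST_ONCE = {"s", "w", "p"}


def split_line(line):
    """Treat only the first token on a line as a keyword.

    Everything after the first space is returned as plain arguments,
    so a nickname such as "r" or "published" is never a keyword.
    """
    keyword, _, rest = line.partition(" ")
    return keyword, rest.split(" ") if rest else []


def parse_entries(text):
    """Group lines into entries, each starting at an 'r' line.

    Follow-up lines are accepted in any order. Unrecognized keywords
    and unrecognized flags are kept rather than rejected.
    """
    entries = []
    for line in text.splitlines():
        if not line:
            continue
        keyword, args = split_line(line)
        if keyword == "r":
            entries.append({"r": args, "lines": {}})
            continue
        if not entries:
            raise ValueError(f"{keyword!r} line before any 'r' line")
        lines = entries[-1]["lines"]
        if keyword in AT_MOST_ONCE and keyword in lines:
            raise ValueError(f"duplicate {keyword!r} line in one 'r' entry")
        lines.setdefault(keyword, []).append(args)
    return entries


# Made-up, simplified input: real 'r' lines carry more fields.
SAMPLE = "\n".join([
    "r published ID1 DIGEST1",
    "w Bandwidth=100",
    "s Running Valid SomeFutureFlag",
    "r r ID2 DIGEST2",
    "s Running",
])

if __name__ == "__main__":
    for entry in parse_entries(SAMPLE):
        print(entry)
```

Doing the keyword decision per line, before any tokenizing of the
arguments, is what keeps nicknames like "r" unambiguous here. The
duplicate check is an ordinary lookup in a dictionary, which is the kind
of rule that is awkward to state in a context-free grammar.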