On 3 Jun 2015 23:09, "Jonathan S. Shapiro" <[email protected]> wrote:

> Keean: read the constraints again. The machine in question does not have adequate memory resources to buffer the input by many, many orders of magnitude. It is impossible to build a generally backwards-seekable stream under these conditions.
The kernel buffer cache uses as much memory as is available, so it is an optimisation. With a seekable stream you just ask the other end to seek to the location if it is not locally buffered. I agree that if the stream is 'live', say captured from a read-only radio link from a satellite, then you need to parse the stream within finite bounds.

> I'm sorry - that doesn't tell me how you are going about extracting the keywords.

Short answer: you don't need to; the parser will fail when it sees the next "token" in an unexpected position (see the first sketch below).

> You said in your earlier note that you were concerned about the need for context-sensitive tokenization. I think there are two cases of this:
>
> 1. Sub-languages, as when javascript code appears within an HTML script tag. This one isn't so bad. You can model the tokenizer as a stream and introduce sub-tokenizers that advance the stream input in a fashion conceptually similar to what fparsec's CharStream construct is doing.
>
> 2. Truly context sensitive lexing, in which certain lexemes are [or are not] admitted in certain parse contexts. In effect, this amounts to having different start states for the tokenizer, which is in turn equivalent to having multiple sub-tokenizers as I just sketched in [1]. It's a fairly vile language design approach, but you don't always control the language that you are dealing with, so sometimes there is little choice.
>
> Note that context-sensitive lexing makes it difficult to cache the tokens in the token stream when backtracking occurs. It's doable, but certainly not pretty...

It's well known that many languages have hacks in their tokenizers (for things like string literals) that push syntactic structure into the tokenizer. For example, if we allow user-defined operator symbols, we still need to treat some tokens, such as brackets, differently (see the second sketch below). This makes it hard to change the syntax of the language when bits of the syntax are distributed between the tokenizer and the parser. As I said above, if you accept a bit more backtracking then you don't need the tokenizer.

My parser is designed for implementing languages, so the expected case is reading from disk files. I understand that reading from a stream is a valid use case, but it is not one I am targeting.

I think the general parser problem is interesting, and I would like to find the time to write a parser combinator library that does: combinators -> parsing graph -> graph rewrite to eliminate backtracking. The idea is that we can take two alternative parsers like:

    "do" or "done"

and convert them to:

    "do", (optional "ne")

where "optional" has two continuations, one for each of the original continuations. In fact I think I should add a branching parser for exactly that case, as it is useful even without the parser-graph rewriting (third sketch below).

Keean.
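P.S. A few sketches to make the above concrete. All three are in Haskell using Parsec, purely for illustration - the names ('keyword', 'block', and so on) are invented for the examples and are not from my library.

First, the scannerless point: a keyword is just an ordinary character-level parser, and it fails as soon as the following characters don't fit, with no separate tokenizer involved:

    import Text.Parsec
    import Text.Parsec.String (Parser)

    -- Match a literal word directly on the character stream, refusing to
    -- match when more identifier characters follow, so "do" does not
    -- accidentally match the prefix of "done" or "dot".
    keyword :: String -> Parser String
    keyword w = try (string w <* notFollowedBy alphaNum)

    -- A toy block: "do" ... "done", taking the body as raw characters.
    block :: Parser String
    block = keyword "do" *> manyTill anyChar (try (keyword "done"))

Here parse block "" "do x := 1 done" succeeds, while parse block "" "done" fails at the very first keyword - the "fails when it sees the next token in an unexpected position" behaviour.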
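Second, syntax leaking into the tokenizer: the string-literal delimiters and the bracket special cases end up inside the lexer even though they are really syntactic structure (again, the token names are made up for the sketch):

    import Text.Parsec
    import Text.Parsec.String (Parser)

    data Token = TIdent String | TStr String | TBracket Char | TOp String
      deriving Show

    tok :: Parser Token
    tok =  TStr <$> (char '"' *> many (noneOf "\"") <* char '"')  -- literal syntax baked into the lexer
       <|> TIdent <$> many1 letter
       <|> TBracket <$> oneOf "()[]{}"            -- brackets can never be user-defined operators
       <|> TOp <$> many1 (oneOf "+-*/<>=!&|:")    -- everything else is up for grabs

Changing the string or bracket syntax now means editing the tokenizer as well as the parser, which is exactly the distribution problem I mean.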
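Third, the do/done factoring written out by hand. This is only a sketch of the rewritten shape: in the proposed graph rewrite the optional node would carry two continuations and never backtrack, whereas plain Parsec still needs a try for a partially matched "ne":

    import Text.Parsec
    import Text.Parsec.String (Parser)

    data Kw = Do | Done deriving Show

    -- Unfactored: backtracks over the shared prefix "do".
    doOrDone0 :: Parser Kw
    doOrDone0 = try (Done <$ string "done") <|> (Do <$ string "do")

    -- Factored: commit to "do", then branch on the optional "ne".
    doOrDone :: Parser Kw
    doOrDone = string "do" *> option Do (Done <$ try (string "ne"))

Both accept the same inputs, but the factored version re-reads at most the "ne" suffix rather than the whole keyword.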
