On 3 Jun 2015 23:09, "Jonathan S. Shapiro" <[email protected]> wrote:
> Keean: read the constraints again. The machine in question does not have
> adequate memory resources to buffer the input, by many, many orders of
> magnitude. It is impossible to build a generally backwards-seekable
> stream under these conditions.

The kernel buffer cache uses as much memory as is available, so local
buffering is only an optimisation. With a seekable stream you can simply
ask the other end to seek to the location if it is not buffered locally.

I agree that if the stream is 'live', say captured from a read-only radio
link from a satellite, then you need to parse the stream within fixed
memory bounds.

> I'm sorry - that doesn't tell me how you are going about extracting the
> keywords.

Short answer: you don't need to; the parser will fail when it sees the next
"token" in an unexpected position.

> You said in your earlier note that you were concerned about the need for
> context-sensitive tokenization. I think there are two cases of this:
>
> 1. Sub-languages, as when JavaScript code appears within an HTML script
> tag. This one isn't so bad. You can model the tokenizer as a stream and
> introduce sub-tokenizers that advance the stream input in a fashion
> conceptually similar to what fparsec's CharStream construct is doing.
>
> 2. Truly context-sensitive lexing, in which certain lexemes are [or are
> not] admitted in certain parse contexts. In effect, this amounts to
> having different start states for the tokenizer, which is in turn
> equivalent to having multiple sub-tokenizers as I just sketched in [1].
> It's a fairly vile language design approach, but you don't always
> control the language that you are dealing with, so sometimes there is
> little choice.
>
> Note that context-sensitive lexing makes it difficult to cache the
> tokens in the token stream when backtracking occurs. It's doable, but
> certainly not pretty...
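
To make the sub-tokenizer idea concrete, here is a minimal Haskell sketch
of a tokenizer with two start states (the mode and token names are invented
for illustration; this is not fparsec's actual CharStream API):

    import Data.Char (isAlphaNum, isSpace)
    import Data.List (stripPrefix)

    -- Two start states, i.e. two sub-tokenizers.
    data Mode  = HtmlMode | ScriptMode deriving Show
    data Token = Text String | Ident String | OpenScript | CloseScript
      deriving Show

    -- Each step returns a token, the (possibly switched) mode, and the
    -- remaining input.  The mode switches at the sub-language boundary.
    nextToken :: Mode -> String -> Maybe (Token, Mode, String)
    nextToken HtmlMode s
      | Just rest <- stripPrefix "<script>" s
                  = Just (OpenScript, ScriptMode, rest)
      | (txt@(_:_), rest) <- span (/= '<') s
                  = Just (Text txt, HtmlMode, rest)
      | otherwise = Nothing
    nextToken ScriptMode s
      | Just rest <- stripPrefix "</script>" s'
                  = Just (CloseScript, HtmlMode, rest)
      | (w@(_:_), rest) <- span isAlphaNum s'
                  = Just (Ident w, ScriptMode, rest)
      | otherwise = Nothing
      where s' = dropWhile isSpace s

    tokenize :: Mode -> String -> [Token]
    tokenize m s = case nextToken m s of
      Just (t, m', rest) -> t : tokenize m' rest
      Nothing            -> []

Caching tokens across a backtrack then means keying the cache on (Mode,
position) rather than position alone, which is exactly the "doable but not
pretty" part.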

It's well known that many languages have hacks in their tokenizers (for
things like string literals) that push syntactic structure down into the
tokenizer. For example, if we allow user-defined operator symbols, we still
need to treat some tokens, such as brackets, specially.

This makes it hard to change the syntax of the language, because bits of
the syntax end up distributed between the tokenizer and the parser.
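
As an illustration (a sketch, not any particular language's tokenizer),
even with fully user-definable operators the brackets have to be carved out
of the operator alphabet:

    import Data.Char (isSpace)

    -- Characters from which user-defined operators may be built.
    -- Brackets are deliberately excluded, so that "*(" lexes as two
    -- tokens rather than as one user-defined operator.
    opChar :: Char -> Bool
    opChar c = c `elem` "!#$%&*+./<=>?@\\^|-~:"

    data Tok = TOp String | TBracket Char | TName String deriving Show

    lexer :: String -> [Tok]
    lexer [] = []
    lexer (c:cs)
      | isSpace c         = lexer cs
      | c `elem` "()[]{}" = TBracket c : lexer cs
      | opChar c          = let (op, rest) = span opChar (c:cs)
                            in TOp op : lexer rest
      | otherwise         = let (nm, rest) = break boundary (c:cs)
                            in TName nm : lexer rest
      where boundary x = isSpace x || opChar x || x `elem` "()[]{}"

The bracket special case is grammar knowledge living in the lexer: change
the grammar and the lexer has to change with it.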

As I said above, if you accept a bit more backtracking then you don't need
the tokenizer at all.

My parser is designed for implementing languages, so the expected case is
reading from disk files. I understand reading from a stream is a valid use
case, but it's not one I am targeting.

I think the general parser problem is interesting, and I would like to find
the time to write a parser combinator library that does:

combinators -> parsing graph -> graph rewrite to eliminate backtracking.

The idea being we can take two parsers like:

"do" or "done"

And convert to:

"do", (optional "ne")

Where 'optional' branches into two continuations, one for each of the
original alternatives.
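
In Parsec-style notation (illustrative only; my library's combinators may
differ), the rewrite is ordinary left-factoring:

    import Text.Parsec
    import Text.Parsec.String (Parser)

    -- Naive: needs 'try', because "do" is a prefix of "done" and the
    -- parser must back up two characters when "done" fails part-way.
    naive :: Parser String
    naive = try (string "done") <|> string "do"

    -- Factored: the shared prefix "do" is consumed exactly once, and the
    -- optional part branches into two continuations on one character of
    -- lookahead -- no input is ever re-read.
    factored :: Parser String
    factored = do
      _ <- string "do"
      branch <- optionMaybe (char 'n')
      case branch of
        Nothing -> pure "do"
        Just _  -> char 'e' >> pure "done"

    -- (On input like "don" the factored form commits to the "done"
    -- branch and fails where 'naive' would backtrack; a trailing
    -- notFollowedBy alphaNum makes the two agree on real keywords.)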

In fact I think I should add a branching parser for exactly that case, as
it's useful even without the parser-graph rewriting.

Keean.