On 3 Jun 2015 20:31, "Jonathan S. Shapiro" <[email protected]> wrote:
>
> Keean:
>
> You make a couple of assertions in your response that I can't work my way
through. Could very well be that I just haven't thought this through hard
enough yet. Apologies, but I think this will make more sense if I take your
comments out of order.
>
> BUFFERING
>
> I *did* note the smileys, but I want to make sure we're laughing at the
same joke. How do you buffer a terabyte stream arriving over a network port
(thus not backwards seekable) on a machine with 4GB to 8GB of memory?

Use a seekable stream, and buffer as many pages in RAM as you have
available. Kernel page buffers in the unified page cache already work this
way, so you don't need to do anything special; just use a buffered stream
and fseek.
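
For concreteness, something along these lines would do it (a minimal
Haskell sketch, assuming the non-seekable network input is first spooled to
a temporary file; withSpooledStream is a hypothetical helper, not existing
code):

    import System.IO
    import qualified Data.ByteString as BS

    -- Spool a non-seekable source into a temp file, then hand the parser a
    -- buffered, seekable handle. The kernel page cache keeps recently read
    -- pages in RAM, so re-reads after hSeek are normally cheap.
    withSpooledStream :: Handle -> (Handle -> IO a) -> IO a
    withSpooledStream source action = do
      (_path, tmp) <- openBinaryTempFile "/tmp" "stream.bin"
      let copy = do
            chunk <- BS.hGetSome source 65536      -- 64 KiB at a time
            if BS.null chunk
              then return ()
              else BS.hPut tmp chunk >> copy
      copy
      hSeek tmp AbsoluteSeek 0                     -- the fseek equivalent
      result <- action tmp
      hClose tmp
      return result

The spool file needs backing store somewhere, of course; the point is just
that a seekable handle plus the page cache gives you the windowed buffering
without doing anything special.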

> DO/DONE
>
> You wrote:
>
>> I don't see the do/done problem. One looks for "d" "o" (optional space)
then some other token, the other for "d" "o" "n" "e"; there is no ambiguity.
The "do" parser should clearly reject "don...".
>
>
> We're doing our token matching with simple comparison on the leading
string. In order to realize that the input 'd', 'o', 'n', 'e' should not
satisfy "do", we need to know one of two things:
>
> 1. The list of all tokens, by which we might come to know that there is a
potential longer match, or

> 2. The list of potential token separator/terminator characters, by which
we would know that the character after 'o' must be consumed under the
maximal munch rule.

> The reason it's a problem is that I don't see either of those
requirements satisfied. There's no tokenizer state, so we don't know the
token terminator characters. Meanwhile, the "do" and "done" matching rules
are worlds apart with no obvious way to combine them into a single regexp.
>
> Oh. I see how the GLSL parser did it. The keyword combinator checks that the
matched string is not trailed by a keyword continuation character. Is that
the short answer? I also see that this disambiguates keywords from other
identifiers **provided** we match the keywords first.

Yes.
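
Roughly, the shape is something like the following Parsec-style sketch (the
combinator name and the choice of alphaNum as the "keyword continuation"
class are my illustration of the idea, not the GLSL parser's actual code):

    import Text.Parsec
    import Text.Parsec.String (Parser)

    -- Match the literal keyword, then insist it is not followed by a
    -- character that could continue an identifier, so "do" never matches a
    -- prefix of "done". try makes the whole thing backtrack on failure.
    keyword :: String -> Parser String
    keyword kw = try (string kw <* notFollowedBy alphaNum) <* spaces

With this, parsing "done" with keyword "do" fails cleanly, and keywords
matched before the general identifier rule cannot be mistaken for
identifiers.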

>
> VIRTUAL TOKENIZATION
>
> You wrote:
>
>> I use virtual tokenization, which is a parser combinator. That is, I take a
parser description in combinators, and convert it into a tokenizer. It does
both in one pass by effectively using lazy tokenization.
>
>
> I'm probably missing something perfectly obvious, but I don't see how
this is done. I understand that if you can gather the tokens together into
a list of regexps you can generate a tokenizer. I also understand that the
tokenizer can be lazily turned into optimized code. What I don't see is how
to gather the regexps.
>
> At the time the various constructs like (keyword "if") appear, they are
appearing in completely disconnected [sub]parsers. Later those [sub]parsers
are joined by connecting up the resulting functions. At that point you can
no longer walk the parsers to locate the contained keyword matchers.
>
> I can certainly see how to do it if the type of "parser" is not (stream,
state) -> result. That is: if the return value of a combinator is not of
function type.
>
> So: how are you able to extract the various keyword and similar regexps
for merging?

Well, the lazy tokenization I am currently using relies on whitespace, so
"do " is clearly different from "done ".

This is not ideal, as you can't write things like "do{...". In this case,
though, backtracking means there is no problem (because "done" would fail to
match).

You only get problems if both "done" and "do ne" are valid syntax.
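
As an illustrative sketch of the whitespace-reliant version (again
Parsec-style, with made-up names rather than the actual implementation):

    import Text.Parsec
    import Text.Parsec.String (Parser)

    -- Each keyword must be followed by whitespace. Applied to "done ", the
    -- "do" parser fails on the 'n', try backtracks, and the "done"
    -- alternative matches instead, so the order of alternatives is harmless.
    kwWS :: String -> Parser String
    kwWS kw = try (string kw <* many1 space)

    doOrDone :: Parser String
    doOrDone = kwWS "do" <|> kwWS "done"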

Keean.
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
