I just now added the ![[ ]] comment syntax and fixed !comments.

Force-pushed to the erg/modern4 branch.

On Fri, Aug 7, 2015 at 5:27 PM, Doug Coleman <doug.cole...@gmail.com> wrote:

> [[This is kind of a brain-dump and not completely organized, but I'm going
> to send it.]]
>
> The proposed "new-parser" is a lexer and parser with specific roles for
> each. SYNTAX: words that execute arbitrary code would be replaced with
> PARSER: words that only parse text, plus a separate compile pass.
>
> The main goals for the new-parser are:
>
> 1) allow the new-parser to parse files without compiling them
>
> Since the lexer/parser must know all parsing words before encountering
> them, or risk a bad parse, we have to choose between the following:
>
> a) USE:/USING: forms are handled before other code
> b) have -syntax.factor files that define PARSER:s and load them all and
> force disambiguation
> c) keep a metafile with a USING: list, like npm's package.json that pulls
> in modules before parsing.
> d) something else!
>
> 2) to remember the parsed text to allow renaming/moving/deleting
> words/vocabularies and to enable other refactoring tools
>
> 3) track exact usage to allow perfect code reloading/renaming, even for
> syntax that "goes away" with the current parser, such as octal literals
>
> 4) to avoid having to use backslashes to escape strings by using Lua
> long-string syntax, which allows strings with arbitrary content to be
> embedded inside any source file without ambiguity
>
> a) this allows embedding DSLs with any syntax you want
>
> 5) allow for better docs/markdown syntax while still being 100% Factor
> syntax, or allow registering different file endings so Factor knows how to
> handle each file
>
> Lexer algorithm
>
> The lexer takes an entire stream (``utf8 file-contents`` for files) and
> parses it into tokens, which are typed slices of the underlying stream
> data. The parser sees each token, and if the token is a PARSER: word, it
> runs that word's parser to complete the parse.
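> To illustrate the "typed slice" idea, here is a minimal sketch in Python
> (not the actual implementation; the names are made up):

```python
from dataclasses import dataclass

@dataclass
class Token:
    # A token keeps the whole source plus its own offsets, so it is a
    # slice of the underlying stream rather than a copied string.
    source: str
    start: int
    end: int
    kind: str  # e.g. "word", "comment", "typed-string"

    @property
    def text(self):
        return self.source[self.start:self.end]
```

> A token like ``Token("1 2 +", 4, 5, "word")`` then reports ``+`` as its
> text while still pointing back into the original stream.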
>
> ``tokens`` the lexer will recognize:
>
> 1) single line comments
>
> ! this is a comment
> !this is a comment
> append! ! this is the word append! and a comment
> USING: ! the using list, comments are ok anywhere since the lexer knows
> kernel math ;
>
> restrictions:
> a) words that start with a ``!`` are not allowed, but words ending in or
> containing ``!`` are fine, e.g. ``append!`` and ``map!reduce`` are ok;
> ``!append`` is a comment
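> That rule fits in one line of Python (a hypothetical helper, not the real
> lexer):

```python
def is_comment_token(token):
    # Only a leading '!' makes a token a comment; a '!' at the end
    # or in the middle is just part of the word name.
    return token.startswith("!")
```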
>
>
> 2) typed strings
>
> "regular string! must escape things \"quotes\" etc, but
> can be multiline"
> resource"core/math/math.factor" ! replaces "resource:core/math/math.factor"
> vocab"math" ! replaces "vocab:math"
> url"google.com" ! instead of URL" google.com"
> sbuf"my string buffer"
>
> restrictions:
> c) can't have a " in word names, they will parse as typed strings instead
>
>
> 3) typed array-likes (compile-time)
>
> { 1 2 }
> { 1 2 3 4 + } ! becomes { 1 2 7 } at compile-time
> suffix-array{ 1 2 3 } ! suffix array literal
> V{ } ! vector literal
> H{ { 1 2 } }
> simplify{ "x" 0 * 1 + 3 * 12 + } ! from
> http://re-factor.blogspot.com/2015/08/automated-reasoning.html
>
> restrictions:
> d) words that end in { parse until the matching } using lexer tokenization
> rules
>
>
> 4) typed quotation-likes (run-time)
>
> [ ] ! regular quotation
> [ { 1 2 3 + } { 4 5 } v+ ] ! [ { 1 5 } { 4 5 } v+ ] at compile-time
> { { 1 2 3 + } { 4 5 } v+ } ! { 5 10 } at compile-time
> H{ { 1 2 } { 2 3 4 + } [ 5 + ] } ! H{ { 1 2 } { 2 7 } { 5 + } } at
> compile-time
> simplify[ "x" 0 * 1 + 3 * 12 + ] ! from
> http://re-factor.blogspot.com/2015/08/automated-reasoning.html
>
> restrictions:
> e) words that end in [ parse until the matching ] using lexer tokenization
> rules
>
>
> 5) typed stack annotation word
>
> ( a b c -- d ) ! regular stack effect
>
> ( a b c ) ! input stack effect, lexical variable assignment
>
> 1 2 3 :> ( a b c ) ! current multiple assignment follows the rule
>
> shuffle( a b -- b a ) ! current shuffle word follows this
>
> FUNCTION: int getcwd ( char *buf, size_t size ) ; ! follows the rule
>
> restrictions:
> words that end in ( must parse until the matching ) using lexer
> tokenization rules
>
>
> 6) typed long-strings
>
> [[long string]]
>
> [[This string doesn't need "escapes"\n and is a single line since the
> newline is just a "backslash n".]]
>
> [=[embed the long string "[[long string]]"]=]
>
> [==[embed the previous string: [=[embed the long string "[[long
> string]]"]=]]==]
>
> ! The current EBNF: syntax still works, but you can also have arbitrary
> EBNF literals
>
> CONSTANT: simple-tokenizer-ebnf-literal EBNF[=[
> space = [ \t\n\r]
> escaped-char = "\\" .:ch => [[ ch ]]
> quoted = '"' (escaped-char | [^"])*:a '"' => [[ a ]]
> unquoted = (escaped-char | [^ \t\n\r"])+
> argument = (quoted | unquoted) => [[ >string ]]
> command = space* (argument:a space* => [[ a ]])+:c !(.) => [[ c ]]
> ]=]
>
> CONSTANT: hello-world c-program[====[
> #include <stdio.h>
>
> int main(int argc, char *argv[]) {
>     printf("hello\n");
>     printf("oh noes, the closing ]] ]=] ]==] ]===]\n");
>     return 0;
> }
> ]====]
>
>
> restrictions:
> words that have the following tokens anywhere will parse as long strings:
> [= {= [[ {{
>
> - after ``[=``, only further ``=`` characters followed by ``[`` are
> allowed, e.g. ``[======[`` is ok, ``[=====   [`` is an error
> - ``[===[`` parses until ``]===]`` or throws an error
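> The level-counting rule can be sketched like this (Python; the real lexer
> works on stream slices, and these names are made up):

```python
import re

# An opener is '[', any number of '=', then '[' again; anything else
# after '[=' fails to match and is an error, per the rule above.
OPENER = re.compile(r"\[(=*)\[")

def read_long_string(text, pos=0):
    m = OPENER.match(text, pos)
    if m is None:
        raise ValueError("malformed long-string opener")
    level = m.group(1)            # '' for [[ , '===' for [===[
    closer = "]" + level + "]"    # the only sequence that can end it
    start = m.end()
    end = text.find(closer, start)
    if end == -1:
        raise ValueError("unterminated long string, expected " + closer)
    return text[start:end], end + len(closer)
```

> So ``[=[a ]] b]=]`` yields ``a ]] b``: the plain ``]]`` inside doesn't
> close it because the levels don't match.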
>
>
> To sum the lexer up:
> ! starts a comment except within a word
> foo" starts a typed foo string
> foo{ starts a typed compile-time literal
> foo[ starts a typed run-time literal
> foo( starts a typed stack annotation word
> foo{{ starts a typed compile-time string
> foo[[ foo[=[ start a typed run-time string
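> That summary, sketched as a single classification function (Python,
> illustrative only; names and the exact ordering are my assumptions):

```python
def classify(token):
    # Order matters: comments first, then long-string openers (whose
    # markers may appear anywhere in the token), then trailing delimiters.
    if token.startswith("!"):
        return "comment"
    if any(m in token for m in ("[[", "[=", "{{", "{=")):
        return "long-string"
    if token.endswith('"'):
        return "typed-string"
    if token.endswith("{"):
        return "compile-time-literal"
    if token.endswith("["):
        return "run-time-literal"
    if token.endswith("("):
        return "stack-annotation"
    return "word"
```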
>
> I want to add multiline comments, not sure what the syntax would be, but
> leaning toward ![[ ![===[ etc so you don't have to deal with C-style
> embedded comments, ML-style matched comments (* *) that you can't put
> arbitrary text in, etc.
>
> The goal of the long-string is to not have to think about quoting strings
> or nesting comments, which has wasted thousands (millions?) of programmer
> hours and caused countless bugs and so much frustration. Triple-quoted
> strings just hide the problem for a while, but eventually you will need to
> escape something even in a triple-quoted string, e.g. docs about
> triple-quoted strings, or pages and pages of code you want to just
> copy/paste into a literal in the repl. Lua recognizes this, but I am not a
> Lua programmer, so I don't know the extent to which this solves problems
> for people. I think it works even better in Factor, especially when you
> allow adding types to strings for DSLs, module system file handlers,
> Factor docs syntax, etc.
>
>
> If the libertarian/anarchist/free-spirit in you feels troubled by all the
> naming restrictions that THE MAN is forcing on you, and you really want to
> name a word ``key-[`` or have ``funky-literals{ ]``, then we could think about
> adding lexing words which override the rules laid out above. The lexer
> would see the start of a lexer rule but check if you have overridden it and
> act accordingly.
>
> Module system idea:
>
> Handling -docs.factor, -tests.factor:
> If we had a docs[[ ]] form and a tests[[ ]] form, and you could
> register certain file endings/extensions with these parsers, then adding
> different files could be automated and simplified.
> foo/bar/bar.factor -- loads to a factor[[ ]], an arbitrary factor code
> literal
> foo/bar/bar-syntax.factor -- loads to a syntax[[]]
> foo/bar/bar-docs.factor -- loads with a docs[[]]
> foo/bar/bar-tests.factor -- loads to a tests[[]]
> bar.c -- is really just a C file, loads into a c-file[[ ]] and factor
> compiles it or forwards it to clang or whatever you want
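> One way to sketch that registry (Python; every name here is hypothetical):

```python
HANDLERS = {}

def register_ending(ending, handler):
    # Associate a filename ending with a long-string form.
    HANDLERS[ending] = handler

def handler_for(path):
    # Longest registered ending wins, so '-docs.factor' beats '.factor'.
    for ending in sorted(HANDLERS, key=len, reverse=True):
        if path.endswith(ending):
            return HANDLERS[ending]
    raise LookupError("no handler for " + path)

register_ending(".factor", "factor[[ ]]")
register_ending("-syntax.factor", "syntax[[ ]]")
register_ending("-docs.factor", "docs[[ ]]")
register_ending("-tests.factor", "tests[[ ]]")
register_ending(".c", "c-file[[ ]]")
```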
>
> Since you can nest the long-strings arbitrarily, to handle them you just
> strip off the delimiters and parse the inside as whatever you want, and you
> can even invoke the Factor parser again. Docs could be like this:
>
> ! DOCS EXAMPLE in -docs.factor file
>
> $article[=[$link[[interned-words]] $title[[Looking up and creating words]]
> A word is said to be $emphasis[[interned]] if it is a member of the
> vocabulary named by its vocabulary slot. Otherwise, the word is
> $emphasis[[uninterned]].
>
> Parsing words add definitions to the current vocabulary. When a source
> file is being parsed, the current vocabulary is initially set to
> $vocab-link[[scratchpad]]. The current vocabulary may be changed with the
> $link[[IN:]] parsing word (see $link[[word-search]]).
> $subsections[
>     create-word
>     create-word-in
>     lookup-word
> ]
> ]=]
>
> Or you could just register a markdown[[ ]] handler for .md/.markdown files
> and write a markdown to Factor docs compiler, or compile Factor docs to
> .markdown etc. The current docs could be converted mechanically.
> Suggestions?
>
>
>
> More thoughts, unimplemented, open for discussion (as everything is!):
>
> If we wanted to have no PARSER: words at all, we could have another rule:
> CAPITAL: words parse until ;
> lowercase: take-one-token
>
> This almost works perfectly with the current system, except for words like
> GENERIC: GENERIC# etc. However, Slava laid out plans to remove such words
> in a blog post from 2008. FUNCTION: currently doesn't have a semi but it
> could be added back.
>
>
> http://factor-language.blogspot.com/2008/06/syntax-proposal-for-multi-methods.html
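> The two rules above are simple enough to state in a few lines (a Python
> sketch; the rule names are mine, and GENERIC#-style exceptions are
> ignored):

```python
def scan_rule(word):
    # An all-caps defining word scans ahead to ';';
    # a lowercase one takes a single token.
    if word.endswith(":") and len(word) > 1:
        return "parse-until-;" if word[:-1].isupper() else "take-one-token"
    return "ordinary-word"
```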
>
> The advantage of having such a regular syntax is that even non-programmers
> can look at the code and see exactly how it parses, with the familiar
> syntax of English such as matching parentheses, upper and lower case, and
> scanning ahead until a token, which is how sentences work.
>
>
> Some new-parser words (extra/modern/factor/factor.factor):
>
> lexer primitives:
> token - get a lexer token
> parse - get a lexer token and run it as a parsing word (if it is one)
> raw - bypass the lexer, get a token until whitespace
> ";" raw-until - call ``raw`` until you hit a ";"
> ";" parse-until - call parse until you hit a ";'
> ";EBNF" multiline-string-until - take chars until you hit ;EBNF
>
> lexer syntax sugar, shortcuts:
> body - syntax sugar for ``";" parse-until``
> new-word - syntax sugar for ``token``, but tags it as a new word
> new-class - syntax sugar for ``token``, but tags it as a new class
> existing-word - same deal
> existing-class - same deal
>
> examples:
> QPARSER: qparser QPARSER: raw raw body ;
> QPARSER: function : new-word parse body ;
> QPARSER: c-function FUNCTION: token new-word parse ;
> QPARSER: memo MEMO: new-word parse body ;
> QPARSER: postpone POSTPONE: raw ;
> QPARSER: symbols SYMBOLS: ";" raw-until ;
> QPARSER: char CHAR: raw ;
> QPARSER: constant CONSTANT: new-word parse ;
> QPARSER: morse [MORSE "MORSE]" multiline-string-until ;
>
> All foo[ foo( foo{ etc don't really need QPARSER: definitions. (Q stands
> for quick, as this is the quicker iteration of the parser compared to the
> previous implementation ;)
>
> Tools:
>
> I have a tool in another branch that can rename the comment character from
> ! to whatever you want and rewrite all the Factor files.
>
> The new-parser can parse 4249 source, docs, and test files in 2.5 seconds
> before any extra optimizations, and I'm sure there's potential for more.
>
>
> Compilation:
>
> Compilation will go through a few passes:
>
> 1) parse everything into a sequence of definitions
> 2) iterate the definitions and define new class/word symbols
> 3) take the USING: list into account and resolve all classes/words and
> numbers; anything that is not one of these will throw an error
> 4) output a top-level quotation that compiles all the words at once, where
> each word builds its own quotation that ends in ``define``,
> ``define-generic``, etc.
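> Pass 3 might look something like this (a Python sketch under assumed data
> shapes; none of these names exist in the codebase):

```python
def resolve(token, new_symbols, using):
    # A token must be a newly defined symbol, a word from a vocabulary
    # in the USING: list, or a number; anything else is an error.
    if token in new_symbols:
        return ("new", token)
    for vocab, words in using.items():
        if token in words:
            return ("word", vocab, token)
    try:
        return ("number", float(token))
    except ValueError:
        raise NameError("unresolved token: " + token)
```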
>
>
> Other random ideas and consequences:
>
> - can allow circularly dependent vocabularies
> - circular definitions
> - can remove DEFER:, << >>, POSTPONE: (replace with ``\``), maybe others
> - symbols can only be either a class or a word but not both, which is
> almost the case now (around five are both still)
> - IN: can go away, base it on filename, IN-UNSAFE: can replace IN: for
> when it doesn't match filename
> - possibly ALL code could be in scope, or scope by USE: core, USE: basis,
> etc, and disambiguate as needed
> - need to reexamine Joe's module system proposal for ideas
> https://github.com/slavapestov/factor/issues/641
> https://gist.github.com/jckarter/3440892
>
>
> Road map:
> I need a few days of large blocks of uninterrupted time to get things to
> compile and reload correctly. The compiler invoker (not the actual
> compiler), refactoring tools, and source writer need to be fixed up. The
> walker tool needs to be rewritten, but it can handle unexpanded parsing
> words, local variables, etc.
> Help is welcome!
>
>
> The code so far:
>
> The parser works as described but without long-string comments. The other
> vocabs are kind of half-baked, but I have written them a couple of times to
> varying degrees of completeness. The current parser can parse the entire
> Factor codebase without erroring. It may still have some problems writing
> files back, but those can be ironed out because I did it once before.
>
> git remote add erg https://github.com/erg/factor.git
> git fetch erg
> git checkout modern4
> code is in extra/modern/
>
> "modern" load
> all-factor-files [ dup quick-parse-path ] { } map>assoc
>
> "1 2 3" qparse >out .
> "math" qparse-vocab
> "math" qparse-vocab.
>
>
> Let me know what you think about any of this!
>
> Doug
>
> On Fri, Aug 7, 2015 at 2:52 PM, Jon Harper <jon.harpe...@gmail.com> wrote:
>
>> Hi Doug,
>> so I guess everyone has been teased with all the clues about the new
>> parser :)
>> 1fcf96cada0737 says "something else soon.",
>> https://github.com/slavapestov/factor/issues/1398 mentions it, etc.
>>
>> Could you share your plans for the new parser ? How will it be different,
>> what will it improve, etc ?
>>
>> Thanks,
>> Jon
>>
>>
>> ------------------------------------------------------------------------------
>>
>> _______________________________________________
>> Factor-talk mailing list
>> Factor-talk@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/factor-talk
>>
>>
>