I just now added the ![[ ]] comment syntax and fixed !comments. Force-pushed to the erg/modern4 branch.
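Concretely, that covers comment forms like these (an illustrative sketch based on the rules in the proposal quoted below, not output captured from the branch):

! a single-line comment
!also a comment, since words may not start with !
append! ! still the word append!, followed by a comment
![[ a multiline comment that can span
lines and contain "quotes" without any escaping ]]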
On Fri, Aug 7, 2015 at 5:27 PM, Doug Coleman <doug.cole...@gmail.com> wrote:

[[This is kind of a brain-dump and not completely organized, but I'm going to send it.]]

The proposed "new-parser" is a lexer and parser with specific roles for each. SYNTAX: words that execute arbitrary code would be replaced by PARSER: words that only parse text, plus a separate compile pass.

The main goals for the new-parser are:

1) allow the new-parser to parse files without compiling them

Since the lexer/parser must know all parsing words before encountering them, or risk a bad parse, we have to choose between the following:

a) USE:/USING: forms are handled before other code
b) have -syntax.factor files that define PARSER:s, load them all, and force disambiguation
c) keep a metafile with a USING: list, like npm's package.json, that pulls in modules before parsing
d) something else!

2) remember the parsed text, to allow renaming/moving/deleting words/vocabularies and other refactoring tools

3) record exact usage, to allow perfect code reloading/renaming even for syntax that "goes away" with the current parser, such as octal literals

4) avoid having to use backslashes to escape strings, by using Lua long-string syntax, which allows strings with arbitrary content to be embedded inside any source file without ambiguity

a) this allows embedding DSLs with any syntax you want

5) allow for better docs/markdown syntax while still being 100% Factor syntax, or allow registering different file endings so Factor knows how to handle each file

Lexer algorithm

The lexer takes an entire stream (``utf8 file-contents`` for files) and parses it into tokens, which are typed slices of the underlying stream data. The parser sees each token and, if the token is a PARSER:, runs that token's parser to complete the parse.

``tokens`` the lexer will recognize:

1) single line comments

! this is a comment
!this is a comment
append! ! this is the word append! and a comment
USING: ! the using list, comments are ok anywhere since the lexer knows
kernel math ;

restrictions:
a) words that start with ``!`` are not allowed, but words ending in or containing ! are fine, e.g. append! and map!reduce are ok, while !append is a comment

2) typed strings

"regular string! must escape things \"quotes\" etc, but
can be multiline"
resource"core/math/math.factor" ! replaces "resource:core/math/math.factor"
vocab"math" ! replaces "vocab:math"
url"google.com" ! instead of URL" google.com"
sbuf"my string buffer"

restrictions:
c) can't have a " in word names; they would parse as typed strings instead

3) typed array-likes (compile-time)

{ 1 2 }
{ 1 2 3 4 + } ! becomes { 1 2 7 } at compile-time
suffix-array{ 1 2 3 } ! suffix array literal
V{ } ! vector literal
H{ { 1 2 } }
simplify{ "x" 0 * 1 + 3 * 12 + } ! from http://re-factor.blogspot.com/2015/08/automated-reasoning.html

restrictions:
d) words that end in { parse until the matching } using lexer tokenization rules

4) typed quotation-likes (run-time)

[ ] ! regular quotation
[ { 1 2 3 + } { 4 5 } v+ ] ! [ { 1 5 } { 4 5 } v+ ] at compile-time
{ { 1 2 3 + } { 4 5 } v+ } ! { 5 10 } at compile-time
H{ { 1 2 } { 2 3 4 + } [ 5 + ] } ! H{ { 1 2 } { 2 7 } { 5 + } } at compile-time
simplify[ "x" 0 * 1 + 3 * 12 + ] ! from http://re-factor.blogspot.com/2015/08/automated-reasoning.html

restrictions:
e) words that end in [ parse until the matching ] using lexer tokenization rules
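To make the "becomes { 1 2 7 } at compile-time" comments above concrete: one way to picture the folding is to run the literal's body on an empty datastack and keep whatever it leaves behind. This is only a sketch of the intended semantics using today's ``with-datastack``, not the proposed implementation, and ``fold-literal`` is a made-up name:

USING: kernel math prettyprint ;

! Hypothetical helper: run a quotation on an empty datastack and
! capture the values it leaves, which is roughly what folding
! { 1 2 3 4 + } into { 1 2 7 } at compile time would mean.
: fold-literal ( quot -- values ) { } swap with-datastack ;

[ 1 2 3 4 + ] fold-literal . ! { 1 2 7 }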
5) typed stack annotation word

( a b c -- d ) ! regular stack effect
( a b c ) ! input stack effect, lexical variable assignment
1 2 3 :> ( a b c ) ! the current multiple assignment follows the rule
shuffle( a b -- b a ) ! the current shuffle word follows this
FUNCTION: int getcwd ( char *buf, size_t size ) ; ! follows the rule

restrictions:
words that end in ( must parse until ) using lexer tokenization rules

6) typed long-strings

[[long string]]

[[This string doesn't need "escapes"\n and is a single line since the newline is just a "backslash n".]]

[=[embed the long string "[[long string]]"]=]

[==[embed the previous string: embed the long string "[[long string]]"]=]]==]

! The current EBNF: syntax still works, but you can also have arbitrary EBNF literals

CONSTANT: simple-tokenizer-ebnf-literal EBNF[=[
space = [ \t\n\r]
escaped-char = "\\" .:ch => [[ ch ]]
quoted = '"' (escaped-char | [^"])*:a '"' => [[ a ]]
unquoted = (escaped-char | [^ \t\n\r"])+
argument = (quoted | unquoted) => [[ >string ]]
command = space* (argument:a space* => [[ a ]])+:c !(.) => [[ c ]]
]=]

CONSTANT: hello-world c-program[====[
#include <stdio.h>

int main(int argc, char *argv[]) {
    printf("hello\n");
    printf("oh noes, the closing ]] ]=] ]==] ]===]\n");
    return 0;
}
]====]

restrictions:
words that contain any of the following tokens anywhere will parse as long strings: [= {= [[ {{

- ``[=`` throws an error unless it is followed by more =s and then a [, e.g. ``[======[`` is ok, ``[===== [`` is an error
- ``[===[`` parses until ``]===]`` or throws an error

To sum the lexer up:

! starts a comment except within a word
foo" starts a typed foo string
foo{ starts a typed compile-time literal
foo[ starts a typed run-time literal
foo( starts a typed stack annotation word
foo{{ starts a typed compile-time string
foo[[ foo[=[ starts a typed run-time string
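That dispatch is mechanical enough to sketch in a few lines of today's Factor. The symbol names and ``classify-token`` are invented for illustration, the real lexer works on slices rather than strings, and only the single-= long-string level is handled here:

USING: combinators kernel prettyprint sequences ;

SYMBOLS: comment typed-string compile-time-literal run-time-literal
    stack-annotation compile-time-string run-time-string plain-token ;

! Decide what kind of token a chunk of text begins, using only the
! summary rules above. Order matters: [[ must be checked before [.
: classify-token ( token -- class )
    {
        { [ dup "!" head? ] [ drop comment ] }
        { [ dup "{{" tail? ] [ drop compile-time-string ] }
        { [ dup "[[" tail? ] [ drop run-time-string ] }
        { [ dup "[=[" tail? ] [ drop run-time-string ] }
        { [ dup "\"" tail? ] [ drop typed-string ] }
        { [ dup "{" tail? ] [ drop compile-time-literal ] }
        { [ dup "[" tail? ] [ drop run-time-literal ] }
        { [ dup "(" tail? ] [ drop stack-annotation ] }
        [ drop plain-token ]
    } cond ;

"suffix-array{" classify-token . ! compile-time-literal
"append!" classify-token .       ! plain-token
"!append" classify-token .       ! comment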
I want to add multiline comments. I'm not sure what the syntax would be, but I'm leaning toward ![[ ![===[ etc., so you don't have to deal with C-style embedded comments, or ML-style matched comments (* *) that you can't put arbitrary text in, etc.

The goal of the long-string is to not have to think about quoting strings or nesting comments, which has wasted thousands (millions?) of programmer hours and caused countless bugs and so much frustration. Triple-quoted strings just hide the problem for a while, but eventually you will need to escape something even in a triple-quoted string, e.g. docs about triple-quoted strings, or pages and pages of code you want to just copy/paste into a literal in the repl. Lua recognizes this, but I am not a Lua programmer so I don't know the extent to which it solves problems for people. I think it works even better in Factor, especially when you allow adding types to strings for DSLs, module system file handlers, Factor docs syntax, etc.

If the libertarian/anarchist/free-spirit in you feels troubled by all the naming restrictions that THE MAN is forcing on you, and you really want to name a word ``key-[`` or have ``funky-literals{ ]``, then we could think about adding lexing words which override the rules laid out above. The lexer would see the start of a lexer rule but check whether you have overridden it and act accordingly.

Module system idea:

Handling -docs.factor, -tests.factor:

If we had a docs[[ ]] form and a tests[[ ]] form, and you could register certain file endings/extensions with these parsers, then adding different kinds of files could be automated and simplified.

foo/bar/bar.factor -- loads to a factor[[ ]], an arbitrary Factor code literal
foo/bar/bar-syntax.factor -- loads to a syntax[[ ]]
foo/bar/bar-docs.factor -- loads with a docs[[ ]]
foo/bar/bar-tests.factor -- loads to a tests[[ ]]
bar.c -- is really just a C file; it loads into a c-file[[ ]] and Factor compiles it or forwards it to clang or whatever you want

Since you can nest the long-strings arbitrarily, to handle them you just strip off the delimiters and parse the inside as whatever you want, and you can even invoke the Factor parser again. Docs could look like this:

! DOCS EXAMPLE in a -docs.factor file

$article[=[$link[[interned-words]] $title[[Looking up and creating words]]
A word is said to be $emphasis[[interned]] if it is a member of the vocabulary named by its vocabulary slot. Otherwise, the word is $emphasis[[uninterned]].

Parsing words add definitions to the current vocabulary. When a source file is being parsed, the current vocabulary is initially set to $vocab-link[[scratchpad]]. The current vocabulary may be changed with the $link[[IN:]] parsing word (see $link[[word-search]]).
$subsections[
    create-word
    create-word-in
    lookup-word
]
]=]

Or you could just register a markdown[[ ]] handler for .md/.markdown files and write a markdown-to-Factor-docs compiler, or compile Factor docs to .markdown, etc. The current docs could be converted mechanically. Suggestions?
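One way to picture the registration table, as a sketch in today's Factor: ``file-handlers`` and ``handler-for`` are made-up names, and the handler strings are just labels standing in for whichever typed long-string form would receive the file's contents.

USING: kernel prettyprint sequences ;

! Hypothetical suffix -> handler table. More specific suffixes come
! first so bar-docs.factor is not swallowed by the plain .factor rule.
CONSTANT: file-handlers {
    { "-syntax.factor" "syntax[[ ]]"   }
    { "-docs.factor"   "docs[[ ]]"     }
    { "-tests.factor"  "tests[[ ]]"    }
    { ".factor"        "factor[[ ]]"   }
    { ".c"             "c-file[[ ]]"   }
    { ".md"            "markdown[[ ]]" }
}

! Find the first suffix that matches the end of the filename.
: handler-for ( filename -- handler/f )
    file-handlers [ first tail? ] with find nip
    [ second ] [ f ] if* ;

"foo/bar/bar-docs.factor" handler-for . ! "docs[[ ]]"
"foo/bar/bar.c" handler-for .           ! "c-file[[ ]]"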
More thoughts, unimplemented, open for discussion (as everything is!):

If we wanted to have no PARSER: words at all, we could have another rule:

CAPITAL: words parse until ;
lowercase: take-one-token

This almost works perfectly with the current system, except for words like GENERIC: GENERIC# etc. However, Slava laid out plans to remove such words in a blog post from 2008. FUNCTION: currently doesn't have a semi, but it could be added back.

http://factor-language.blogspot.com/2008/06/syntax-proposal-for-multi-methods.html

The advantage of having such a regular syntax is that even non-programmers can look at the code and see exactly how it parses, using familiar conventions from English: matching parentheses, upper and lower case, and scanning ahead until a token, which is how sentences work.

Some new-parser words (extra/modern/factor/factor.factor):

lexer primitives:

token - get a lexer token
parse - get a lexer token and run it as a parsing word (if it is one)
raw - bypass the lexer, get a token up to the next whitespace
";" raw-until - call ``raw`` until you hit a ";"
";" parse-until - call ``parse`` until you hit a ";"
";EBNF" multiline-string-until - take chars until you hit ;EBNF

lexer syntax sugar, shortcuts:

body - syntax sugar for ``";" parse-until``
new-word - syntax sugar for ``token``, but tags it as a new word
new-class - syntax sugar for ``token``, but tags it as a new class
existing-word - same deal
existing-class - same deal

examples:

QPARSER: qparser QPARSER: raw raw body ;
QPARSER: function : new-word parse body ;
QPARSER: c-function FUNCTION: token new-word parse ;
QPARSER: memo MEMO: new-word parse body ;
QPARSER: postpone POSTPONE: raw ;
QPARSER: symbols SYMBOLS: ";" raw-until ;
QPARSER: char CHAR: raw ;
QPARSER: constant CONSTANT: new-word parse ;
QPARSER: morse [MORSE "MORSE]" multiline-string-until ;

All the foo[ foo( foo{ etc. forms don't really need QPARSER: definitions. (Q stands for quick, as this is the quicker iteration of the parser compared to the previous implementation ;)

Tools:

I have a tool in another branch that can rename the comment character from ! to whatever you want and rewrite all the Factor files.

The new-parser can parse 4249 source, docs, and test files in 2.5 seconds before any extra optimizations, which I'm sure there's potential for.

Compilation:

Compilation will go through a few passes:

1) parse everything into a sequence of definitions
2) iterate the definitions and define new class/word symbols
3) take into account the USING: list and resolve all classes, words, and numbers (see the sketch after this list); anything that is not one of these will throw an error
4) output a top-level quotation that compiles all the words at once, where each word builds its own quotation that ends in ``define``, ``define-generic``, etc.
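For pass 3, here is roughly what "resolve against the USING: list" could mean, sketched with the existing ``lookup-word`` (the same word the docs example above mentions) and ``string>number``. ``resolve-token`` is an invented name, and real resolution would also have to handle classes and produce useful errors:

USING: kernel math.parser prettyprint sequences words ;

! Try each vocabulary from the USING: list in order, fall back to
! reading the token as a number, otherwise give up.
: resolve-token ( token using -- obj )
    dupd [ lookup-word ] with map-find drop
    [ nip ] [ string>number ] if*
    [ "unresolved token" throw ] unless* ;

"filter" { "math" "sequences" } resolve-token . ! filter
"42" { "math" "sequences" } resolve-token .     ! 42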
Other random ideas and consequences:

- can allow circularly dependent vocabularies
- circular definitions
- can remove DEFER:, << >>, POSTPONE: (replace with ``\``), maybe others
- symbols can only be either a class or a word but not both, which is almost the case now (around five are still both)
- IN: can go away and be based on the filename; IN-UNSAFE: can replace IN: for when it doesn't match the filename
- possibly ALL code could be in scope, or scope by USE: core, USE: basis, etc., and disambiguate as needed
- need to reexamine Joe's module system proposal for ideas:
  https://github.com/slavapestov/factor/issues/641
  https://gist.github.com/jckarter/3440892

Road map:

I need a few days of large blocks of uninterrupted time to get things to compile and reload correctly. The compiler invoker (not the actual compiler), the refactoring tools, and the source writer need to be fixed up. The walker tool needs to be rewritten but can handle parsing words unexpanded, local variables, etc. Help is welcome!

The code so far:

The parser works as described, but without long-string comments. The other vocabs are kind of half-baked, but I have written them a couple of times to varying degrees of completeness. The current parser can parse the entire Factor codebase without erroring. It may still have some problems writing files back, but those can be ironed out because I did it once before.

git remote add erg https://github.com/erg/factor.git
git fetch erg
git checkout modern4

The code is in extra/modern/.

"modern" load
all-factor-files [ dup quick-parse-path ] { } map>assoc

"1 2 3" qparse >out .
"math" qparse-vocab
"math" qparse-vocab.

Let me know what you think about any of this!

Doug

On Fri, Aug 7, 2015 at 2:52 PM, Jon Harper <jon.harpe...@gmail.com> wrote:

Hi Doug,
so I guess everyone has been teased with all the clues about the new parser :)
1fcf96cada0737 says "something else soon.", https://github.com/slavapestov/factor/issues/1398 mentions it, etc.

Could you share your plans for the new parser? How will it be different, what will it improve, etc.?

Thanks,
Jon