[[This is kind of a brain-dump and not completely organized, but I'm going to send it.]]
The proposed "new-parser" is a lexer and parser with specific roles for each. SYNTAX: words that execute arbitrary code would be replaced with PARSER: words that only parse text, plus a separate compile pass.

The main goals for the new-parser are:

1) Allow the new-parser to parse files without compiling them. Since the lexer/parser must know all parsing words before encountering them, or risk a bad parse, we have to choose between the following:
   a) USE:/USING: forms are handled before other code
   b) have -syntax.factor files that define PARSER: words, load them all, and force disambiguation
   c) keep a metafile with a USING: list, like npm's package.json, that pulls in modules before parsing
   d) something else!
2) Remember the parsed text, to allow renaming/moving/deleting words/vocabularies and other refactoring tools.
3) Record exact usage, to allow perfect code reloading/renaming, even for syntax that "goes away" with the current parser, such as octal literals.
4) Avoid having to use backslashes to escape strings, by using Lua long-string syntax, which allows strings with arbitrary content to be embedded inside any source file without ambiguity.
   a) this allows embedding DSLs with any syntax you want
5) Allow for better docs/markdown syntax while still being 100% Factor syntax, or allow registering different file endings so Factor knows how to handle each file.

Lexer algorithm

The lexer takes an entire stream (``utf8 file-contents`` for files) and parses it into tokens, which are typed slices of the underlying stream data. The parser sees each token, and if the token is a PARSER: word, it runs that token's parser to complete the parse.

Tokens the lexer will recognize:

1) single-line comments

    ! this is a comment
    !this is a comment
    append! ! this is the word append! and a comment
    USING: ! the using list, comments are ok anywhere since the lexer knows
    kernel math ;

   restrictions:
   a) words that start with a ``!`` are not allowed, but words ending in or containing ! are fine, e.g.
``append!`` and ``map!reduce`` are ok; ``!append`` is a comment.

2) typed strings

    "regular string! must escape things \"quotes\" etc, but can be multiline"
    resource"core/math/math.factor" ! replaces "resource:core/math/math.factor"
    vocab"math"                     ! replaces "vocab:math"
    url"google.com"                 ! instead of URL" google.com"
    sbuf"my string buffer"

   restrictions:
   c) can't have a " in word names; they would parse as typed strings instead

3) typed array-likes (compile-time)

    { 1 2 }
    { 1 2 3 4 + }         ! becomes { 1 2 7 } at compile-time
    suffix-array{ 1 2 3 } ! suffix array literal
    V{ }                  ! vector literal
    H{ { 1 2 } }
    simplify{ "x" 0 * 1 + 3 * 12 + } ! from http://re-factor.blogspot.com/2015/08/automated-reasoning.html

   restrictions:
   d) words that end in { parse until the matching } using lexer tokenization rules

4) typed quotation-likes (run-time)

    [ ]                              ! regular quotation
    [ { 1 2 3 + } { 4 5 } v+ ]       ! [ { 1 5 } { 4 5 } v+ ] at compile-time
    { { 1 2 3 + } { 4 5 } v+ }       ! { 5 10 } at compile-time
    H{ { 1 2 } { 2 3 4 + } [ 5 + ] } ! H{ { 1 2 } { 2 7 } { 5 + } } at compile-time
    simplify[ "x" 0 * 1 + 3 * 12 + ] ! from http://re-factor.blogspot.com/2015/08/automated-reasoning.html

   restrictions:
   e) words that end in [ parse until the matching ] using lexer tokenization rules

5) typed stack annotation words

    ( a b c -- d )     ! regular stack effect
    ( a b c )          ! input stack effect, lexical variable assignment
    1 2 3 :> ( a b c ) ! current multiple assignment follows the rule
    shuffle( a b -- b a ) ! current shuffle word follows this
    FUNCTION: int getcwd ( char *buf, size_t size ) ; ! follows the rule

   restrictions:
   words that end in ( must parse until ) using lexer tokenization rules

6) typed long-strings

    [[long string]]
    [[This string doesn't need "escapes"\n and is a single line since the newline is just a "backslash n".]]
    [=[embed the long string "[[long string]]"]=]
    [==[embed the previous string: [=[embed the long string "[[long string]]"]=]]==]
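To make the tokenization rules above concrete, here is a minimal sketch in Python (not code from the branch; the function name and the exact set of kinds are my own reading of the rules) of how the lexer could classify a token head by its prefix/suffix. Note that the long-string check has to come before the plain ``[`` check, since ``foo[[`` also ends in ``[``:

```python
import re

def head_kind(head):
    """Classify a whitespace-delimited token head under the proposed rules:
    !foo       -> comment               foo"  -> typed string
    foo{       -> compile-time literal  foo[  -> run-time literal
    foo(       -> stack annotation      foo[[ / foo[=[ -> long string
    anything else is an ordinary word (``!`` inside a word is fine)."""
    if head.startswith("!"):
        return "comment"
    if head.endswith("[[") or re.search(r"\[=+\[$", head):
        return "long-string"          # must check before the plain [ rule
    if head.endswith('"'):
        return "typed-string"
    if head.endswith("{"):
        return "compile-time-literal"
    if head.endswith("["):
        return "run-time-literal"
    if head.endswith("("):
        return "stack-annotation"
    return "word"

print(head_kind("!append"))     # comment
print(head_kind("map!reduce"))  # word
print(head_kind('resource"'))   # typed-string
print(head_kind("EBNF[=["))     # long-string
```

(The ``{{`` compile-time string form is omitted from this sketch for brevity.)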
The current EBNF: syntax still works, but you can also have arbitrary EBNF literals:

    CONSTANT: simple-tokenizer-ebnf-literal EBNF[=[
        space = [ \t\n\r]
        escaped-char = "\\" .:ch => [[ ch ]]
        quoted = '"' (escaped-char | [^"])*:a '"' => [[ a ]]
        unquoted = (escaped-char | [^ \t\n\r"])+
        argument = (quoted | unquoted) => [[ >string ]]
        command = space* (argument:a space* => [[ a ]])+:c !(.) => [[ c ]]
    ]=]

    CONSTANT: hello-world c-program[====[
    #include <stdio.h>
    int main(int argc, char *argv[]) {
        printf("hello\n");
        printf("oh noes, the closing ]] ]=] ]==] ]===]\n");
        return 0;
    }
    ]====]

restrictions: words that have the following tokens anywhere will parse as long strings: [= {= [[ {{
- ``[=`` throws an error if any character other than = followed by [ is found, e.g. ``[======[`` is ok, ``[===== [`` is an error
- ``[===[`` parses until ``]===]`` or throws an error

To sum the lexer up:

    !       starts a comment except within a word
    foo"    starts a typed foo string
    foo{    starts a typed compile-time literal
    foo[    starts a typed run-time literal
    foo(    starts a typed stack annotation word
    foo{{   starts a typed compile-time string
    foo[[ foo[=[ starts a typed run-time string

I want to add multiline comments. I'm not sure what the syntax would be, but I'm leaning toward ![[ ![===[ etc., so you don't have to deal with C-style embedded comments, ML-style matched comments (* *) that you can't put arbitrary text in, etc.

The goal of the long-string is to not have to think about quoting strings or nesting comments, which has wasted thousands (millions?) of programmer hours and caused countless bugs and so much frustration. Triple-quoted strings just hide the problem for a while, but eventually you will need to escape something even in a triple-quoted string, e.g. docs about triple-quoted strings, pages and pages of code you want to just copy/paste into a literal in the repl, etc. Lua recognizes this, but I am not a Lua programmer, so I don't know the extent to which this solves problems for people.
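As a sanity check of the matching rule, here is a small Python sketch (the helper name and return convention are hypothetical, not from the branch) of scanning a ``[=*[`` opener and extracting the body up to the matching ``]=*]``, which is what makes arbitrary nesting unambiguous:

```python
def read_long_string(text, pos):
    """Scan a long string starting at text[pos] (which must be '['):
    '[' + N '=' signs + '[' opens, and the body runs until the first
    ']' + N '=' signs + ']'. Returns (body, index after the closer)."""
    assert text[pos] == "["
    level = 0
    pos += 1
    while pos < len(text) and text[pos] == "=":
        level += 1
        pos += 1
    if pos >= len(text) or text[pos] != "[":
        raise ValueError("malformed long-string opener")  # e.g. '[===== ['
    pos += 1
    closer = "]" + "=" * level + "]"
    end = text.find(closer, pos)
    if end == -1:
        raise ValueError("unterminated long string")
    return text[pos:end], end + len(closer)

body, _ = read_long_string('[=[embed "[[long string]]"]=]', 0)
print(body)  # embed "[[long string]]"
```

Because the closer must carry the same number of ``=`` signs as the opener, an inner ``]=]`` never terminates an outer ``[==[ ... ]==]``.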
I think it works even better in Factor when you allow adding types to strings for DSLs, module system file handlers, Factor docs syntax, etc.

If the libertarian/anarchist/free-spirit in you feels troubled by all the naming restrictions that THE MAN is forcing on you, and you really want to name a word ``key-[`` or have ``funky-literals{ ]``, then we could think about adding lexing words which override the rules laid out above. The lexer would see the start of a lexer rule but check if you have overridden it and act accordingly.

Module system idea:

Handling -docs.factor, -tests.factor: if we had a docs[[ ]] form and a tests[[ ]] form, and you could register certain file endings/extensions with these parsers, then adding different kinds of files could be automated and simplified.

    foo/bar/bar.factor        -- loads to a factor[[ ]], an arbitrary Factor code literal
    foo/bar/bar-syntax.factor -- loads to a syntax[[ ]]
    foo/bar/bar-docs.factor   -- loads with a docs[[ ]]
    foo/bar/bar-tests.factor  -- loads to a tests[[ ]]
    bar.c                     -- is really just a C file; loads into a c-file[[ ]] and Factor compiles it or forwards it to clang or whatever you want

Since you can nest the long-strings arbitrarily, to handle them you just strip off the delimiters and parse the inside as whatever you want, and you can even invoke the Factor parser again. Docs could look like this:

    ! DOCS EXAMPLE in -docs.factor file
    $article[=[$link[[interned-words]]
    $title[[Looking up and creating words]]
    A word is said to be $emphasis[[interned]] if it is a member of the
    vocabulary named by its vocabulary slot. Otherwise, the word is
    $emphasis[[uninterned]].
    Parsing words add definitions to the current vocabulary. When a source
    file is being parsed, the current vocabulary is initially set to
    $vocab-link[[scratchpad]]. The current vocabulary may be changed with
    the $link[[IN:]] parsing word (see $link[[word-search]]).
    $subsections[
        create-word
        create-word-in
        lookup-word
    ] ]=]

Or you could just register a markdown[[ ]] handler for .md/.markdown files and write a markdown-to-Factor-docs compiler, or compile Factor docs to .markdown, etc. The current docs could be converted mechanically. Suggestions?

More thoughts, unimplemented, open for discussion (as everything is!):

If we wanted to have no PARSER: words at all, we could have another rule:

    CAPITAL: words parse until ;
    lowercase: take-one-token

This almost works perfectly with the current system, except for words like GENERIC: GENERIC# etc. However, Slava laid out plans to remove such words in a blog post from 2008. FUNCTION: currently doesn't have a trailing semicolon, but it could be added back.

http://factor-language.blogspot.com/2008/06/syntax-proposal-for-multi-methods.html

The advantage of having such a regular syntax is that even non-programmers can look at the code and see exactly how it parses, with conventions familiar from English such as matching parentheses, upper and lower case, and scanning ahead until a token, which is how sentences work.
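The CAPITAL:/lowercase: rule above can be sketched in a few lines of Python (a hypothetical illustration; real tokenization is richer than whitespace splitting, and ``use:`` here is just an invented lowercase example):

```python
def parse_definition(tokens, i):
    """Consume one definition starting at tokens[i] under the proposed rule:
    a CAPITAL: word scans ahead to the next ';',
    a lowercase: word takes exactly one following token.
    Returns ((head, body-tokens), next index)."""
    head = tokens[i]
    assert head.endswith(":")
    if head[:-1].isupper():          # CAPITAL: words parse until ;
        j = tokens.index(";", i + 1)
        return (head, tokens[i + 1:j]), j + 1
    else:                            # lowercase: take-one-token
        return (head, [tokens[i + 1]]), i + 2

tokens = "SYMBOLS: a b c ; use: math".split()
print(parse_definition(tokens, 0))  # (('SYMBOLS:', ['a', 'b', 'c']), 5)
print(parse_definition(tokens, 5))  # (('use:', ['math']), 7)
```

The point of the sketch is that the parse is decidable from the spelling of the head token alone, with no arbitrary code execution.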
Some new-parser words (extra/modern/factor/factor.factor):

lexer primitives:

    token - get a lexer token
    parse - get a lexer token and run it as a parsing word (if it is one)
    raw - bypass the lexer, get a token until whitespace
    ";" raw-until - call ``raw`` until you hit a ";"
    ";" parse-until - call ``parse`` until you hit a ";"
    ";EBNF" multiline-string-until - take chars until you hit ;EBNF

lexer syntax sugar, shortcuts:

    body - syntax sugar for ``";" parse-until``
    new-word - syntax sugar for ``token``, but tags it as a new word
    new-class - syntax sugar for ``token``, but tags it as a new class
    existing-word - same deal
    existing-class - same deal

examples:

    QPARSER: qparser QPARSER: raw raw body ;
    QPARSER: function : new-word parse body ;
    QPARSER: c-function FUNCTION: token new-word parse ;
    QPARSER: memo MEMO: new-word parse body ;
    QPARSER: postpone POSTPONE: raw ;
    QPARSER: symbols SYMBOLS: ";" raw-until ;
    QPARSER: char CHAR: raw ;
    QPARSER: constant CONSTANT: new-word parse ;
    QPARSER: morse [MORSE "MORSE]" multiline-string-until ;

All the foo[ foo( foo{ etc. forms don't really need QPARSER: definitions. (Q stands for quick, as this is the quicker iteration of the parser compared to the previous implementation ;)

Tools:

I have a tool in another branch that can rename the comment character from ! to whatever you want and rewrite all the Factor files.

The new-parser can parse 4249 source, docs, and tests files in 2.5 seconds before any extra optimizations, and I'm sure there's potential for more.

Compilation:

Compilation will go through a few passes:

1) parse everything into a sequence of definitions
2) iterate the definitions and define new class/word symbols
3) take into account the USING: list and resolve all classes/words, and numbers; anything that is not one of these will throw an error
4) output a top-level quotation that compiles all the words at once, where each word builds its own quotation that ends in ``define``, ``define-generic``, etc.
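The pass structure can be sketched like this (a Python illustration under my own assumptions; the real passes would emit Factor quotations, not tuples, and pass 1 is elided). Defining all new names in pass 2, before any resolution happens in pass 3, is what makes forward and circular references work:

```python
def compile_source(definitions, using):
    """Sketch of the proposed multi-pass compile.
    definitions: list of (name, body-token-list) pairs from pass 1."""
    # pass 2: define all new class/word symbols up front
    defined = {name for name, _body in definitions}
    # pass 3: resolve every token against USING: plus the new names,
    # allow numbers, and error on anything else
    def resolve(tok):
        if tok in defined or tok in using:
            return tok
        try:
            return float(tok)
        except ValueError:
            raise NameError("unresolved word: " + tok)
    resolved = [(name, [resolve(t) for t in body]) for name, body in definitions]
    # pass 4: one top-level sequence that defines all the words at once
    return [("define", name, body) for name, body in resolved]

# 'odd?' and 'even?' reference each other; pass 2 makes that legal.
prog = [("odd?", ["even?"]), ("even?", ["odd?", "1"])]
print(compile_source(prog, using={"not"}))
```

This is also where the "can allow circularly dependent vocabularies" consequence falls out: resolution only happens after every definition has been registered.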
Other random ideas and consequences:

- can allow circularly dependent vocabularies
- can allow circular definitions
- can remove DEFER:, << >>, POSTPONE: (replace with ``\``), maybe others
- symbols can only be either a class or a word but not both, which is almost the case now (around five are still both)
- IN: can go away, based on the filename; IN-UNSAFE: can replace IN: for when it doesn't match the filename
- possibly ALL code could be in scope, or scope by USE: core, USE: basis, etc., and disambiguate as needed
- need to reexamine Joe's module system proposal for ideas
  https://github.com/slavapestov/factor/issues/641
  https://gist.github.com/jckarter/3440892

Road map:

I need a few days of large blocks of uninterrupted time to get things to compile and reload correctly. The compiler invoker (not the actual compiler), refactoring tools, and source writer need to be fixed up. The walker tool needs to be rewritten but can handle parsing words unexpanded, local variables, etc. Help is welcome!

The code so far:

The parser works as described but without long-string comments. The other vocabs are kind of half-baked, but I have written them a couple of times to varying degrees of completeness. The current parser can parse the entire Factor codebase without erroring. It may still have some problems writing files back, but those can be ironed out because I did it once before.

    git remote add erg https://github.com/erg/factor.git
    git fetch erg
    git checkout modern4

code is in extra/modern/

    "modern" load
    all-factor-files [ dup quick-parse-path ] { } map>assoc
    "1 2 3" qparse >out .
    "math" qparse-vocab
    "math" qparse-vocab.

Let me know what you think about any of this!

Doug

On Fri, Aug 7, 2015 at 2:52 PM, Jon Harper <jon.harpe...@gmail.com> wrote:
> Hi Doug,
> so I guess everyone has been teased with all the clues about the new
> parser :)
> 1fcf96cada0737 says "something else soon.",
> https://github.com/slavapestov/factor/issues/1398 mentions it, etc.
>
> Could you share your plans for the new parser ?
> How will it be different, what will it improve, etc ?
>
> Thanks,
> Jon
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Factor-talk mailing list
> Factor-talk@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/factor-talk