> Well, way back when I tried to make sense of busybox's awk
> implementation, which is around 3000 lines of C.
>
> More recently, I read about half the posix awk description and dug up a
> copy of the original "The AWK Programming Language" book by Aho,
> Kernighan, and Weinberger from 1988, which I've read the introduction of.
I have that book, and skimmed it again a few weeks ago. It's actually a real gem that goes beyond awk into computer science -- they have a nice mini-Make implementation in awk, as well as topological sorting, recursive descent parsing, etc. (FWIW, Programming in Lua by the Lua authors is another book in the same vein -- it's great even if, like me, you're not that keen on Lua the language.)

The 8K LOC Kernighan Awk is apparently the one described in this book. But I wonder how far that is from common practice? If busybox awk can do it in 3K LOC (+ libs and regex), then that's VERY encouraging. I thought it was going to be closer to the 21K LOC mawk.

busybox awk looks like a pretty straightforward interpreter architecture from what I can tell -- lex, parse, walk a tree to execute, and runtime support with hash tables and so forth. next_token() is a little longer than I'd like at 200+ lines, but that's nothing compared to the 1000+ line lexing functions in mksh or bash (and I'm not even counting the serious macro usage).

Everything about shell is harder than awk -- lexing, parsing, execution, interactivity, etc. (An exception might be the number of dialects; shell does seem to be pretty highly standardized and the implementations are conformant.)

>> The lexer is 582 lines of clean looking C code (it's Kernighan, so I guess
>> we all know his style :) ), which is not insignificant!
>
> I'm not adding yacc as a build dependency.

AFAICT, the lexer in bwk doesn't really depend on Yacc. It does include ytab.h and implement a function called yylex(), but I don't think there should be any problem just renaming that function and calling it from your own code.

Lexers are generally easier to reuse than parsers, because there are only a limited number of ways to write them, and they do something pretty well-defined. They also seem worth reusing because there's a lot of hairy logic and corner cases.
But yeah, it seems like a BSD license, not a public domain license, in any case. Though Dennis seems to think it's compatible with your philosophy.

>> From my research, it seems like a significantly easier problem than the
>> shell.
>
> Yeah, probably. The shell is actually about 30 commands integrated
> together, several of which are literal commands and several of which are
> implicit "expand this environment variable", which could be $RANDOM or
> $PPID or $SECONDS which actually invokes a function but you still need
> to support ${#SECONDS} or ${RANDOM:1:3}. Some of them literally are
> external commands where [ is "test" and : is "true" and "help" I already
> did, and I already did ulimit, some are $(( )) is kinda like expr but
> not quite, "trap" is its own thing, "read" is actually fairly elaborate,
> command history navigation (and "history expansion" which I've never
> personally used)... Job control is a whole subsystem (and pipes and
> redirection are integrated into that; you suspend a _pipeline_ not a
> process, and kill needs job control integration when run from toysh).
> $PWD != abspath although it looks like getcwd() returns what we need
> there, but I need to adjust cd to strip directory entries instead of
> actually traversing the filesystem .. (but only for _leading_ .. in the
> path, I think? Need to test). There's all that loop and test logic,
> shell functions and alias, pushd/popd/dirs, I have NEVER understood what
> the "getopts" command is for but need to try again, don't get me STARTED
> on the dozens of different things "set" does let alone "set -o"...

Right, there is an entire sublanguage within ${} and $(()), and those sublanguages have to be intertwined with the main language in parsing and execution.

I think I have all the language stuff under control. The hardest part is parsing $(), because you have to run the entire parser to know where the ending ) is. Right paren is used in four places:

1. subshell -- ( echo hi )

2.
command sub $(echo hi) -- this is parsed MUCH differently than a subshell, because it's at the intra-word level and not the outer word level, e.g. 'foo'$(echo hi)"bar"$((1+1)) is all a single shell word, but you can't put a subshell in there.

3. case foo) -- bash has bugs with putting this in a subshell. To make it more exciting, case (foo) is also valid in the POSIX grammar (and all implementations).

4. func() { }

Since anything can appear within $(), all those right parens have to be disambiguated. There is a similar problem with } overloading -- braces are used for anonymous blocks and brace expansion, in addition to ${} var expansion. I found bash bugs here too.

I have a very nice and clean way of figuring out ) though... if anyone's interested, I can show it.

The bash aosabook chapter, which I've referred to several times, talks about how they had to duplicate a lot of the parser to handle this, and it still isn't right:

"Much of the work to recognize the end of the command substitution during the parsing stage is encapsulated into a single function (parse_comsub), which knows an uncomfortable amount of shell syntax and duplicates rather more of the token-reading code than is optimal. This function has to know about here documents, shell comments, metacharacters and word boundaries, quoting, and when reserved words are acceptable (so it knows when it's in a case statement); it took a while to get that right."

http://aosabook.org/en/bash.html

The reason I can do better essentially boils down to the fact that I'm using a top-down parser rather than a bottom-up parser (yacc).

"getopts" is simply a binding for libc getopt(). It just looks weird because sh is a weird programming language. But it's basically a 1-to-1 mapping. You provide a spec string like "a:bcd:" and then you loop, reading global variables, to fill out your own flags struct / global variables.

The main part I don't have figured out is interactive job control. I don't really ever use that, since I use tmux.
I have interactive parsing and autocompletion figured out though (with prototype code).

Shell is definitely hard because it's big, and there are a lot of overlapping and interacting features. But I think just running in batch mode is a huge step forward, because you can test it with builds, and you can be confident you have all the parsing solved. Though I made sure that my parser works both in batch mode and interactively.

Interactive parsing is hard because of $PS2!!! $PS2 is the > prompt you get after an incomplete command. That interacts with quoting, here docs, compound commands, etc. This is barely mentioned in the POSIX spec, but every implementation I tried does it right (except for a tiny bug I found in dash).

Andy

_______________________________________________
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net