Hi all, I'm terrible at writing papers but I thought I'd share this because I think the technique is good I just want it out there. Hopefully it makes sense but probably not.
Parsing JavaScript by counting character pairs ============================================== Abstract -------- The traditional way to parse JavaScript is using a lexer and converting characters to tokens first and then use those tokens to define the grammar. The paper describes a unique way of parsing JavaScript allowing tokenization on the fly and obtaining the current state by using pairs of characters and their positions when the last state occurred. This paper challenges the accepted norm of parsing and hopes to introduce new ideas and improve how fast JavaScript engines can execute code. Language symmetry ----------------- The parsing technique described requires a certain level of symmetry in order to be effective and it would work best when the characters used to describe the language are unique. JavaScript isn't completely symmetrical in the fact that it's syntax overlaps with other parts. A good example of this is the ternary operator it overlaps when using the colon character with a labeled statement or object literal identifier however the parsing speed would be more effective if the language had chosen characters that didn't overlap and were always symmetrical. Defining character pairs ------------------------ Pairs of characters refers to when a character requires another character in order to be valid syntax, for example an array literal opening character is "[" and it's closing character is "]" which this paper refers to them as a pair and the opening and closing of that pair is used to track state. The ending character of each pair can know it's starting character's last state therefore determining it's current state based on what happened before it's starting character by using the counts of every pair of characters you can use this method to obtain states from not only the starting character but the current executing block of code. The following characters are character pairs in JavaScript "[" and "]", "(" and ")", "{" and "}". Each character pair requires a lookup variable to store the current state being tracked relative to the position count. The count position is obtained by incrementing a counter for when each pair opens and decrementing it when it closes. We will refer to the pair "[]" as square lookup (when wanting to obtain current JavaScript state) and square count when changing the count of the opening and closing state. Likewise "()" will be referred to as paren lookup and paren count and "{}" will be referred to as curly lookup and curly count. Left flag --------- A left flag variable is used when the previous state has a value that can be used in conjunction with the next state. A good example of this is the divide operator as it requires a valid object on the left hand side in order to be valid or it is not an operator. If the left flag is not set for the "/" character then it assumed that this character is a regular expression and not a divide operator when other states have been take into account. The left flag always starts at zero or false and is set to true or 1 when the current state is finished and it's a valid object. If we take the following syntax example: /a/ here we have a regular expression, the left flag is zero when the "/" is encountered if the next character is not "*" or "/" then it will be detected as a regular expression and parsed until complete. Then when the regular expression is closed the left flag is set to 1 since the regular expression is a valid object and can be used with the next state. Storing state ------------- To track the state of a "square pair" we use the left flag to determine the current state of the opening character if the left flag is 1 then the starting state is an object accessor, if the flag is 0 when the starting state is an array literal. When the starting state is determined the "square lookup" state is set using the current counter positions of all available pairs of characters. E.g. the square pair counter (SQ) will be set as 1, the paren pair counter (PA) will be set as 0 and the curly pair (CU) will be set as 0. The lookup square will then concatenate each counter pair (or other method) and set the state based on that position. squareLookup[''+SQ+PA+CU]=state. It is important to use every pair count since there could be n number of nested array literals or other statements. When the ending character is encountered it can know it's partners state by decrementing it's counter or reading then decrementing. //[ opening character encountered Left = 0 SQ = 0 PA = 0 CU = 0 (SQ++) + PA + CU squareLookup[''+SQ+PA+CU]='Array literal'; //] ending character encountered currentState = squareLookup[''+SQ+PA+CU]; SQ-- left = 1 Non-symmetrical syntax ---------------------- As mentioned previously syntax that is non-symmetrical such as object literals or ternary operators present a problem since their syntax overlaps with characters we consider pairs. This could result in an incorrect state being assigned or a counter being increased/decreased incorrectly. The problem is easily solved by using the current counter positions to determine the state of a character. Lets use ternary operators as an example if the opening character "?" is encountered we track it as ternary since that character is unique to ternary operators we track this by using the position of the current pairs of characters e.g. SQ = 0, PA = 0, CU = 0 using these combinations a isTernary flag is used then by using this flag no matter how many characters or pairs occur inside the ternary the ending combination will always be the same when the ending ":" character is encountered or nested ternary. Syntax checking needs to be added if the combinations of characters produces an invalid ternary statement but this can also be determined using the counters. These set of rules also need to be applied to any other syntax that is non-symmetrical such as case statements, var statements, if statements etc. Special cases ------------- If statements present a problem since the language is badly designed in order to account for poorly constructed code in the early days of the web. An example of this is the following syntax error if(0);;else alert(1) whereas this is valid syntax if(0);else alert(1). Here we need to account for asi (automatically semi-colon insertion) and count the number of statements after a if statement occurs in order to validate the else statement. Another special case is a for loop, the semi-colon within a for loop acts differently than an end statement. A flag is required to determine the difference between block statements or expressions. The following code demonstrates this for(;function(){}/123;)break; In order to determine if the function is a function statement or function expression we need to look up the for in flag based on the count positions. It gets slightly complex since the for keyword occurs before the paren and is parsed separately so when a lookup occurs it need to be -1 of the paren count. Validation ---------- At the end of parsing syntax the counter positions should always be zero, if any is greater than zero then you have a unmatched character since it means at least one character is still open and the closing character for the pair hasn't been found. You also need to prevent incorrect order of characters such as ")" occurring before "(" by using a rule-set of strict rules that lists the expected states and determines which states can follow each other. The counter positions can also be used to validate the amount of characters within a certain statement for example the for statement could be validated by checking the for flag to find if you are inside a for statement and the semi-colons could be counted to verify the correct amount characters. The counter positions play an important role here since there could be n number of statements and so you need to track the count of characters related to the current statement. Rulesets -------- By using a separate rule-set to control which state is allowed after a previous state you can make the rules control how you parse JavaScript. E.g if a new feature is introduced then you could modify the rule-set rather than the parser itself to introduce the new feature. Storing the states as a number would enable faster parsing since JavaScript is faster when evaluating numbers rather than strings and would also reduce memory overhead. The rule-set could store expected states after a certain state is found allowing you to create a strict pattern of characters that should follow each other yet have the control separate from the parsing logic. Using rule-sets separates parser logic into a manageable form that can easily be edited by others even if unfamiliar with the parsing logic itself. It could enable quick testing of new feature ideas, dynamically changing parsing rules or supporting different languages easily. Conclusion ---------- Further work needs to be done to establish whether it is feasible to implement such a parser using the techniques described, in particular the overhead of storing states on heavily nested JavaScript within multiple pairs of characters and it's impact on performance compared to traditional parsing. In addition a more elegant way of storing lookup states for each pair of characters rather than conversions of int based counters to strings and it's impact on performance of these lookups. The techniques presented in this paper can apply to any language and should allow extremely quick parsing and be even more effective when characters do not overlap with other parts of the syntax. It also makes the use of a separate tokenizing step obsolete since tokenization can occur as the string is being parsed. State lookups should also reduce the need to look ahead or behind when parsing code since the partner character can determine the correct state of the ending character resulting in better overall parser performance and faster executing JavaScript. Credits ------- Much of this work has been produced while defending attacks presented by Alexey Silin and Jonas Magazinius on my JavaScript sandbox over the years. I'd also like to thank the work of my fellow "slackers" Mario Heiderich, Eduardo Vela, David Lindsay, Stefano Di Paola, Soroush Dalili, Giorgio Maone and anyone else I forgot. Cheers Gareth
_______________________________________________ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss