Paper: Parsing JavaScript by counting character pairs

gaz Heyes Wed, 16 Jan 2013 01:42:19 -0800

Hi all,

I'm terrible at writing papers but I thought I'd share this because I think
the technique is good I just want it out there. Hopefully it makes sense
but probably not.


Parsing JavaScript by counting character pairs
==============================================

Abstract
--------

The traditional way to parse JavaScript is using a lexer and converting
characters to tokens first and then use those tokens to define the grammar.
The paper describes a unique way of parsing JavaScript allowing
tokenization on the fly and obtaining the current state by using pairs of
characters and their positions when the last state occurred. This paper
challenges the accepted norm of parsing and hopes to introduce new ideas
and improve how fast JavaScript engines can execute code.

Language symmetry
-----------------

The parsing technique described requires a certain level of symmetry in
order to be effective and it would work best when the characters used to
describe the language are unique. JavaScript isn't completely symmetrical
in the fact that it's syntax overlaps with other parts. A good example of
this is the ternary operator it overlaps when using the colon character
with a labeled statement or object literal identifier however the parsing
speed would be more effective if the language had chosen characters that
didn't overlap and were always symmetrical.

Defining character pairs
------------------------

Pairs of characters refers to when a character requires another character
in order to be valid syntax, for example an array literal opening character
is "[" and it's closing character is "]" which this paper refers to them as
a pair and the opening and closing of that pair is used to track state.

The ending character of each pair can know it's starting character's last
state therefore determining it's current state based on what happened
before it's starting character by using the counts of every pair of
characters you can use this method to obtain states from not only the
starting character but the current executing block of code.

The following characters are character pairs in JavaScript "[" and "]", "("
and ")", "{" and "}". Each character pair requires a lookup variable to
store the current state being tracked relative to the position count.

The count position is obtained by incrementing a counter for when each pair
opens and decrementing it when it closes. We will refer to the pair "[]" as
square lookup (when wanting to obtain current JavaScript state) and square
count when changing the count of the opening and closing state. Likewise
"()" will be referred to as paren lookup and paren count and "{}" will be
referred to as curly lookup and curly count.

Left flag
---------

A left flag variable is used when the previous state has a value that can
be used in conjunction with the next state. A good example of this is the
divide operator as it requires a valid object on the left hand side in
order to be valid or it is not an operator. If the left flag is not set for
the "/" character then it assumed that this character is a regular
expression and not a divide operator when other states have been take into
account. The left flag always starts at zero or false and is set to true or
1 when the current state is finished and it's a valid object. If we take
the following syntax example: /a/ here we have a regular expression, the
left flag is zero when the "/" is encountered if the next character is not
"*" or "/" then it will be detected as a regular expression and parsed
until complete. Then when the regular expression is closed the left flag is
set to 1 since the regular expression is a valid object and can be used
with the next state.

Storing state
-------------

To track the state of a "square pair" we use the left flag to determine the
current state of the opening character if the left flag is 1 then the
starting state is an object accessor, if the flag is 0 when the starting
state is an array literal. When the starting state is determined the
"square lookup" state is set using the current counter positions of all
available pairs of characters. E.g. the square pair counter (SQ) will be
set as 1, the paren pair counter (PA) will be set as 0 and the curly pair
(CU) will be set as 0. The lookup square will then concatenate each counter
pair (or other method) and set the state based on that position.
squareLookup[''+SQ+PA+CU]=state. It is important to use every pair count
since there could be n number of nested array literals or other statements.
When the ending character is encountered it can know it's partners state by
decrementing it's counter or reading then decrementing.

//[ opening character encountered
Left = 0
SQ = 0
PA = 0
CU = 0
(SQ++) + PA + CU
squareLookup[''+SQ+PA+CU]='Array literal';


//] ending character encountered
currentState = squareLookup[''+SQ+PA+CU];
SQ--
left = 1

Non-symmetrical syntax
----------------------

As mentioned previously syntax that is non-symmetrical such as object
literals or ternary operators present a problem since their syntax overlaps
with characters we consider pairs. This could result in an incorrect state
being assigned or a counter being increased/decreased incorrectly. The
problem is easily solved by using the current counter positions to
determine the state of a character. Lets use ternary operators as an
example if the opening character "?" is encountered we track it as ternary
since that character is unique to ternary operators we track this by using
the position of the current pairs of characters e.g. SQ = 0, PA = 0, CU = 0
using these combinations a isTernary flag is used then by using this flag
no matter how many characters or pairs occur inside the ternary the ending
combination will always be the same when the ending ":" character is
encountered or nested ternary. Syntax checking needs to be added if the
combinations of characters produces an invalid ternary statement but this
can also be determined using the counters. These set of rules also need to
be applied to any other syntax that is non-symmetrical such as case
statements, var statements, if statements etc.

Special cases
-------------

If statements present a problem since the language is badly designed in
order to account for poorly constructed code in the early days of the web.
An example of this is the following syntax error if(0);;else alert(1)
whereas this is valid syntax if(0);else alert(1). Here we need to account
for asi (automatically semi-colon insertion) and count the number of
statements after a if statement occurs in order to validate the else
statement. Another special case is a for loop, the semi-colon within a for
loop acts differently than an end statement. A flag is required to
determine the difference between block statements or expressions. The
following code demonstrates this for(;function(){}/123;)break; In order to
determine if the function is a function statement or function expression we
need to look up the for in flag based on the count positions. It gets
slightly complex since the for keyword occurs before the paren and is
parsed separately so when a lookup occurs it need to be -1 of the paren
count.

Validation
----------

At the end of parsing syntax the counter positions should always be zero,
if any is greater than zero then you have a unmatched character since it
means at least one character is still open and the closing character for
the pair hasn't been found. You also need to prevent incorrect order of
characters such as ")" occurring before "(" by using a rule-set of strict
rules that lists the expected states and determines which states can follow
each other. The counter positions can also be used to validate the amount
of characters within a certain statement for example the for statement
could be validated by checking the for flag to find if you are inside a for
statement and the semi-colons could be counted to verify the correct amount
characters. The counter positions play an important role here since there
could be n number of statements and so you need to track the count of
characters related to the current statement.

Rulesets
--------

By using a separate rule-set to control which state is allowed after a
previous state you can make the rules control how you parse JavaScript. E.g
if a new feature is introduced then you could modify the rule-set rather
than the parser itself to introduce the new feature. Storing the states as
a number would enable faster parsing since JavaScript is faster when
evaluating numbers rather than strings and would also reduce memory
overhead. The rule-set could store expected states after a certain state is
found allowing you to create a strict pattern of characters that should
follow each other yet have the control separate from the parsing logic.
Using rule-sets separates parser logic into a manageable form that can
easily be edited by others even if unfamiliar with the parsing logic
itself. It could enable quick testing of new feature ideas, dynamically
changing parsing rules or supporting different languages easily.

Conclusion
----------

Further work needs to be done to establish whether it is feasible to
implement such a parser using the techniques described, in particular the
overhead of storing states on heavily nested JavaScript within multiple
pairs of characters and it's impact on performance compared to traditional
parsing. In addition a more elegant way of storing lookup states for each
pair of characters rather than conversions of int based counters to strings
and it's impact on performance of these lookups.

The techniques presented in this paper can apply to any language and should
allow extremely quick parsing and be even more effective when characters do
not overlap with other parts of the syntax. It also makes the use of a
separate tokenizing step obsolete since tokenization can occur as the
string is being parsed. State lookups should also reduce the need to look
ahead or behind when parsing code since the partner character can determine
the correct state of the ending character resulting in better overall
parser performance and faster executing JavaScript.

Credits
-------

Much of this work has been produced while defending attacks presented by
Alexey Silin and Jonas Magazinius on my JavaScript sandbox over the years.
I'd also like to thank the work of my fellow "slackers" Mario Heiderich,
Eduardo Vela, David Lindsay, Stefano Di Paola, Soroush Dalili, Giorgio
Maone and anyone else I forgot.

Cheers

Gareth

_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss

Paper: Parsing JavaScript by counting character pairs

Reply via email to