Ivan Alagenchev and Mark Goodwin asked me to take a look at their
project to bring DOMinator, a taint analysis for SpiderMonkey, up to
date, and help them come up with a path to having it integrated into
SpiderMonkey. Taint analyses are valuable in auditing web security -
looking for XSS vulnerabilities, for example. The implementation has a
long way to go, on several fronts, but it's a really cool tool, and it
did seem to me that there was a path to making it production-quality.
I'm posting my thoughts here, so that interested SpiderMonkey folks can
comment if they like.
The taint analysis applies to strings only, and has four parts:
* It identifies certain *sources* of strings as "tainted":
document.URL, input fields, and so on.
* The JavaScript engine propagates taint information on strings.
Taking a substring of a tainted string, or concatenating a tainted
string, yields a tainted string. Regexp operations, charAt, and so
on all propagate taint information. And so on.
* It identifies certain *sinks* as vulnerable: eval, 'src' attributes
on script elements, and so on.
* Finally, the tool's user interface logs the appearance of tainted
strings at vulnerable sinks. The taint metadata actually records the
provenance of each region of a tainted string, so the tool can
explain exactly why the final string is tainted, which is really
helpful in constructing XSS attacks.
Here's a video (no sound; amazing sub-title technique) that demonstrates
DOMinator. It's a little hard to follow, but he finds an XSS
vulnerability on www.aol.com easily:
http://www.youtube.com/watch?feature=player_embedded&v=f_It469LUFM
tl;dr: The present implementation has many issues, but with enough work,
I think it could be maintainable, efficient, and valuable.
Details:
I looked at the code in the github repo at
https://github.com/alagenchev/spider_monkey. It seems to me that the
following issues need to be addressed:
* The way the taint metadata is stored needs to change. A linked list
isn't appropriate for operating at scale.
We should steal a flag bit from JSString::d::lengthAndFlags (as we
do now), but use that bit to mean that the string has an entry in a
per-runtime taint metadata HashMap, indexed by string address.
js::gc::MarkChildren(JSTracer *, JSString *) should check that bit,
and mark any strings hash table entry refers to. JSString::finalize
should check that bit, and clean up the hash table entry. That way,
the hash table entries live exactly as long as their strings do, so
it's easy to understand their performance impact and check for
correctness.
* Atoms present a challenge to any taint analysis. SpiderMonkey's
existing atomization optimization relies on the assumption that
strings with the same text are interchangable - but that assumption
does not hold if some strings with a given text are tainted and
others aren't. The whole point of atoms is to let object identity
stand in for textual equality. I don't think we should try to change
that; atoms should not be taintable.
Fortunately, atoms are not actually stored in very many places (as
far as I can tell). String literals are atoms, but that doesn't
concern us, as string literals are never tainted. Property names,
stored in js::Shape::propid_, are atoms; but for that use, we could
store the taint metadata (or a pointer to such) in Shape directly,
alongside the atom, and let propid_ continue to work as it does now.
Operations that retrieve property names from objects, like for-in
and Object.getOwnPropertyNames, would need to check for the presence
of taint metadata, and return a properly tainted string, instead of
the real atom.
There may be other places atoms get used that I'm forgetting. But
taint analyses don't need to be perfect to be helpful; if we lose
taint from time to time, that's fine. The web pages we're analyzing
aren't hostile code, so they won't wiggle and squirm to avoid
propagating taint. So as long as we handle the common paths, the
tool should still be useful.
As far as the present implementation is concerned, it seems like
js_AtomizeString in the github repo now has a taint leak. If we
create an atom from a tainted string, then we taint the atom;
looking up the atom for an untainted string with the same text will
yield a tainted atom, when it should not.
* The patch adds functions on 'String', and an accessor on
'String.prototype'. We can't really add methods directly to String,
for the sake of web compatibility. Rather, we should handle taint
the way we handle Debugger. See JS_DefineDebuggerObject, in
js/src/vm/Debugger.cpp, and look at its uses. Or, similarly, look at
JS_InitReflect. Those functions define 'Debugger' or 'Reflect' on
globals when requested; perhaps you could define a 'Taint' object,
with functions like 'Taint.getTaintInfo(string)' and so on.
* The original author of DOMinator ignored SpiderMonkey coding style:
indentation, spacing, naming, and so on. That's easy but
time-consuming to fix, but it's got to be done. The SM style is
described here:
https://wiki.mozilla.org/JavaScript:SpiderMonkey:C_Coding_Style
Current code tends to avoid preprocessor macros and conditionals, in
favor of little utility classes with inline method definitions. I
think the TAINT_CONDITION macros could be replaced by utility
classes with inline methods; these methods could have different
definitions when building with taint analysis enabled than disabled,
such that, when taint is disabled, the compiler can (say) see that
some methods always return false, and remove the taint-propagation
code, without the use of preprocessor conditionals (other than in
the definition of the class itself).
I hope this is helpful. It certainly would be nice to hear what other
SpiderMonkey developers think about it.
_______________________________________________
dev-tech-js-engine-internals mailing list
[email protected]
https://lists.mozilla.org/listinfo/dev-tech-js-engine-internals