On 06/26/2014 04:50 AM, Robert O'Callahan wrote:
Your email is unclear as to whether you're proposing integrating some
particular analysis engine or framework into SpiderMonkey (or more than
one), or just some minimal set of hooks to enable others to supply such
engines/frameworks. I'm going to assume the latter since I think the former
makes no sense at all.

Yes, the idea I have in mind is to have some kind of self-hosted compartment dedicated to analysis: if a function named "xyz" is declared on that compartment's global, then the engine can call it, preferably asynchronously (as we might not want to pay a cross-compartment call per event), or synchronously (waiting for the day we inline cross-compartment calls in ICs / jitted code), or maybe both.
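
As a hypothetical illustration of that convention (the probe name and the dispatch modes are assumptions, not an existing interface):

  // Inside the analysis compartment: declaring "xyz" on the global is
  // all it takes for the engine to treat it as a probe.
  function xyz(value) {
    // Invoked asynchronously by default (events can be batched), or
    // synchronously if the engine is told to wait for its result.
  }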

In terms of hooks, an API enabling arbitrary program transformation has big
advantages as a basis for implementing dynamic analyses, compared to other
kinds of API:
1) Maximum flexibility for tool builders. You can do essentially anything
you want with the program execution.
2) Simple interface to the underlying VM. So it's easy to maintain as the
VM evolves. And, importantly, minimal work for Mozilla.

Except that Mozilla would be maintaining some of these tools, as we want to rely on them. For example, the Security team wants to rely on a taint analysis, or even on simpler analyses, to check whether events have been validated before being processed.

3) Potentially very low overhead, because instrumentation code can be
inlined into application code by the JIT.

I have a question for you, and also for the people who have built such analyses in SpiderMonkey. Why take all the pain of integrating an analysis into SpiderMonkey's code, which is hard and changes frequently, when it would be easy (based on what you mention) to just do a source-to-source transformation?

Why have we seen three proposals for implementing taint analysis inside SpiderMonkey so far? It sounds to me as if there is something that is not easily reachable from a source-to-source transformation, but easier to hook once you are deep inside the engine.

I spent a few years writing dynamic analysis tools for Java, and they all
used bytecode transformation for all these reasons.

I understand your argument that we should support transformations on a substrate which is standardized. Maybe this is just a matter of naming the API properly, so that analyses feel like they are hooked onto the spec's definitions of JavaScript.

You identified some disadvantages:
1) It may be difficult to keep the language support of code transformation
tools in sync with SpiderMonkey.
2) Code transformation tools may introduce application bugs (e.g. by
polluting the global or due to a bug in translation).
3) Transformed code may incur unacceptable slowdown (e.g. due to ubiquitous
boxing).
(Did I miss anything?)

Source-to-source implies that analysis developers have to know about the JS implementation and JS syntax, while such work belongs to the JavaScript engine developers.

The goal of such an API is to push the work to where the knowledge is: I do not expect analysis developers to understand all the subtle details of JavaScript (cf. the Jalangi issues). On the other hand, I do not expect JavaScript engine developers to maintain any kind of analysis integrated into the JS engine (except for optimization purposes).

Having a dynamic analysis API is just a middle way that lets each group deal with the problems they know.

I think #2 really only matters for people who want to deploy dynamic
analysis in customer-facing production systems, and I don't think that will
be important anytime soon.

On the contrary, I think/hope we could ship a trivial taint analysis to monitor privacy, in a similar way to what Lightbeam (Collusion) is doing.

#1 doesn't seem like a big problem to me. Extending a JS parser is not that
hard.

Extending one JS parser, maybe. Extending two JS parsers in exactly the same way is harder.

New language features with complex semantics require significant tool
updates whatever API we use.

Not as much as the syntax does; the bytecode is an example of this, as it is a kind of subset that the bytecode emitter targets. As you mentioned, manipulating bytecode is easy, but manipulating the source while ensuring that we keep the same semantics can be more complex.

A trivial example is destructuring syntax:

  var [a, b] = c;

Where do you hook the getters? Do you have to understand the construct well enough to translate it to:

  var a = $.arrayGet(c, 0);
  var b = $.arrayGet(c, 1);

And I do not have to look far to see that this is already done by the parser, and that the parser handles the name clashes for us (what if instead of "c" this was "a"?). A correct translation needs a fresh temporary, as sketched below. Do we want every analysis developer to make the same mistake, or should we just provide them with an API, as Jalangi does?
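
A minimal sketch of a clash-safe translation for the "var [a, b] = a;" case, reusing the hypothetical $.arrayGet helper from above; $tmp stands for a fresh name the rewriter must invent:

  // Read the old "a" into a fresh temporary before the pattern rebinds it.
  var $tmp = a;
  var a = $.arrayGet($tmp, 0);
  var b = $.arrayGet($tmp, 1);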

If we're using these tools ourselves, we'd
have to update the tools sometime between landing the feature in
SpiderMonkey and starting to use it in FirefoxOS or elsewhere where we're
depending on analysis.

Like everything else, but there is a greater chance of breaking something which relies on source-to-source transformation than something which relies on a lower-level (ECMA-based?) API.

#3 is interesting and perhaps where lessons learned from Java and other
contexts do not apply. I think we should dig into specific tool examples
for this; maybe some combination of more intelligent translation and
judicious API extensions can solve the problems.

Nicolas B. Pierron wrote:

Personally, I think that these issues imply that we should avoid relying
on a source-to-source mapping if we want to provide meaningful security
results. We could replicate the same or a similar API in SpiderMonkey, and
even make one compatible with Jalangi analyses.


It's not clear what you mean by "the same or a similar API" here.

I mean that I want such an API to be a JavaScript API. I do not want us to provide a function for adding each hook; I want the JS engine to provide one function for registering all the hooks you want, in a separate compartment.

  var a = newAnalysisGlobal();
  a.eval("load('my-analysis.js')");

  var g = newGlobal({analysis: a});

  // Generate bytecode probes based on the functions currently present on
  // the analysis global.
  g.eval("…");

We can either take inspiration from Jalangi's interface for writing analyses, or just bridge the two with a wrapper. Such analyses should be implemented in JavaScript and not in any other language, as our primary audience is JavaScript developers.
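
As a hypothetical illustration, my-analysis.js (loaded in the snippet above) could look like the following; the getField hook is Jalangi-inspired and purely illustrative, not a committed interface:

  // my-analysis.js: runs inside the analysis compartment.  Any function
  // declared on this global would be picked up as a probe.
  var readCounts = Object.create(null);

  // Called for every property read performed by the analyzed global.
  function getField(obj, name, value) {
    readCounts[name] = (readCounts[name] || 0) + 1;
    return value;  // observe only, never mutate the program
  }

  // Dump the collected counts, e.g. at shutdown.
  function report() {
    for (var name in readCounts)
      print(name + ": " + readCounts[name]);
  }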

Suppose we add opcodes dedicated to monitoring values (at the bytecode
emitter level) instead of doing source-to-source transformation. One of the
advantages would be that frontend developers would not have to maintain the
Jalangi sources when we add new features to SpiderMonkey; moreover, the
bytecode emitter already breaks everything down into opcodes, which are
easier to wrap than the source.

Analyses are usually made to observe the execution of some code, not to
mutate it.  So if we only monitor the execution, instead of emulating it, we
might be able to batch the analysis calls.  Delivering those batches
asynchronously means that the overhead of running an analysis stays minimal
while the analyzed code is running.
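
A hypothetical sketch of the batching idea from the engine's side; the names (probe, flush, analysisGlobal.processBatch) and the flush threshold are made up for illustration:

  // Cheap, JIT-inlinable probe: append a compact record and keep going.
  var buffer = [];
  function probe(kind, value) {
    buffer.push(kind, value);
    if (buffer.length >= 4096)
      flush();
  }

  // Rare cross-compartment call: one per batch instead of one per event.
  function flush() {
    analysisGlobal.processBatch(buffer);
    buffer = [];
  }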


Logging and log analysis have their place, but a lot of dynamic analysis
tools rely on efficient synchronous online data processing in
instrumentation code. For example, if you want to count the number of times
a program point is reached, it's much more efficient to increment a global
variable at that program point than to log to a buffer every time that
point is reached, and count log entries offline. For many analyses of
real-world applications, high-volume data logging is neither efficient nor
scalable. Here are a couple of examples of Java tools I worked on where
synchronous online data processing was essential:
-- http://fsl.cs.illinois.edu/images/e/e8/P385-goldsmith.pdf
-- http://web5.cs.columbia.edu/~junfeng/09fa-e6998/papers/hybrid.pdf
So I think injection of synchronously executed instrumentation is essential
for a large class of analyses.
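
A minimal sketch of that contrast, purely illustrative:

  // Synchronous online processing: the analysis state is updated inline,
  // so there is nothing to log or post-process.
  var hits = 0;
  function probeOnline() { hits++; }

  // Logging alternative: every event is materialized first, then counted
  // offline; far more data movement for the same answer.
  var log = [];
  function probeLogged() { log.push(1); }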

The asynchrony is one suggestion for making recording analyses faster, by avoiding frequent cross-compartment calls. I do not see any issue with having synchronous requests; on the contrary, I think it might be interesting to interrupt the program's execution on such a request, or even to change its execution (something we can only do synchronously) to prevent security holes / privacy leaks, as sketched below.
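
For instance, here is a minimal sketch of such a synchronous hook; the hook name invokeFunPre (Jalangi-inspired) and the isTainted helper are assumptions:

  // Called synchronously before every function call in the analyzed
  // global.  Because it runs before the call, it can abort the operation,
  // something a batched, asynchronous probe cannot do.
  function invokeFunPre(callee, args) {
    for (var i = 0; i < args.length; i++) {
      if (isTainted(args[i]))  // isTainted: hypothetical helper
        throw new Error("tainted value passed to " + callee.name);
    }
  }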

On the other hand, I do think that we should have asynchronous analyses first, but only the use cases of potential users can answer this question for us.

--
Nicolas B. Pierron

