Re: [v8-dev] Utility to check if a given stream can parse as Javascript (ORB)

'Łukasz Anforowicz' via v8-dev Thu, 12 Aug 2021 11:10:59 -0700

Thank you very much for the feedback - much appreciated.  I've tried to
reply to some of the feedback inline, below.


Let me step back a little bit, and observe that distinguishing JS from
non-JS might not necessarily require full-fidelity JS parsing (to catch,
say, 95% of non-JS responses).  On one hand it might be undesirable to
introduce additional sniffing/parsing heuristics (defined and evolving
separately from the JS parser and spec), but maybe such heuristics would be
useful for catching PDF, ZIP, MSWORD, and other non-JS files that exhibit
some "obvious" signs that they are non-JS?  Maybe we can brainstorm
together on what such heuristics could look like?  I've tried to gather some
notes in a doc here
<https://docs.google.com/document/d/1qUbE2ySi6av3arUEw5DNdFJIKKBbWGRGsXz_ew3S7HQ/edit#heading=h.mptmm5bpjtdn>,
but let me copy them below for your convenience:

ORB-with-html/json/xml-sniffing shows that some security benefits of ORB
may be realized without full-fidelity JS sniffing/parsing.  Let’s explore
various considerations that may lead to the discovery of alternative
approaches to sniffing.


   - Pri1 requirement: Avoid breaking existing websites.
      - Requirement: Never block responses with JS body
         - Requirement: never block HTML/JS polyglots
         - Requirement: skip elements that are okay both in HTML and JS:
            - <!-- … --> comments
            - Whitespace
      - Non-requirement (?): Never block responses with *future* JS bodies
         - We don’t want to break _existing_ websites.  But maybe we can
           force _future_ websites to label their JS as JavaScript MIME type
           <https://mimesniff.spec.whatwg.org/#javascript-mime-type>.
           Therefore, we don’t have a requirement to robustly handle
           _future_ JS versions/specs (and/or _future_ image formats like
           JXL).
   - Pri2 requirement: Block as many responses as possible
      - Requirement: Block as many non-JS responses as possible (after
        earlier ORB steps rule out that we are dealing with an image,
        audio, video, or stylesheet):
         - Requirement: Block responses starting with: %PDF-
         - Requirement: Block zip files - files starting with 50 4B 03 04,
           or 50 4B 05 06, or 50 4B 07 08
         - Requirement: Block MS Office files - files starting with D0 CF 11
           E0 A1 B1 1A E1 (source: MS-XLS spec
           <https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-xls/cd03cb5f-ca02-4934-a391-bb674cb8aa06>
           + Microsoft Compound File Binary File Format spec
           <https://www.loc.gov/preservation/digital/formats/fdd/fdd000380.shtml>)
         - Requirement: Block CSV files
         - Requirement: Block XML files - files starting with: <?xml
         - Requirement: Block ProtoBuf (binary and text encoding)
         - Requirement: Block responses beginning with JSON parser breaker -
           examples: )]}' {}&& while(1);
         - Requirement: Block HTML files: whitespace + HTML comments followed
           by a HTML element
   - Pri2 requirement: Do not regress performance / latency / etc
      - Requirement: Make the final decision based on the 1st 1024 bytes of
        the response body.
   - Assumption: UTF-8 (maybe it is okay if UTF-16 encoding of Javascript is
     not recognized as Javascript?)
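
To make the byte prefixes in the list above concrete, here is a minimal
Python sketch of a prefix check.  The names NON_JS_PREFIXES, SNIFF_LIMIT and
has_non_js_prefix are hypothetical (they are not part of ORB, Chromium or
V8); the byte values are the ones listed above.  CSV, ProtoBuf and HTML are
left out because they have no fixed prefix.

```python
# Hypothetical sketch: these names are illustrative, not an existing API.
# Byte prefixes are copied from the requirements list in this message.
NON_JS_PREFIXES = (
    b"%PDF-",                             # PDF
    b"PK\x03\x04",                        # ZIP: 50 4B 03 04
    b"PK\x05\x06",                        # ZIP: 50 4B 05 06
    b"PK\x07\x08",                        # ZIP: 50 4B 07 08
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1",  # MS Office compound file
    b"<?xml",                             # XML declaration
    b")]}'",                              # JSON parser breakers
    b"{}&&",
    b"while(1);",
)

SNIFF_LIMIT = 1024  # decide on the first 1024 bytes of the body only


def has_non_js_prefix(body: bytes) -> bool:
    """Return True if the sniffed prefix starts with a listed signature."""
    head = body[:SNIFF_LIMIT]
    # bytes.startswith accepts a tuple and checks each candidate prefix.
    return head.startswith(NON_JS_PREFIXES)
```

For example, has_non_js_prefix(b"%PDF-1.7") is True and
has_non_js_prefix(b"var x = 1;") is False.  This sketch does not skip
leading whitespace or a byte order mark, which a real sniffer would
probably need to consider.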


We probably don't want the sniffer to have PDF-specific or ZIP-specific
knowledge.  But maybe there are some generic heuristics that would detect
PDF and ZIP as non-JS?

Not-quite-working heuristic: Maybe ASCII control characters should mean:
non-JS (except LF and CR and other WhiteSpace and LineTerminator
characters)?  This is a bit problematic because SourceCharacter in JS BNF
<https://262.ecma-international.org/12.0/#sec-grammar-summary> allows any
Unicode code point.  OTOH, maybe this only matters inside JS comments or
string literals?
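
A rough Python sketch of the control-character idea above, under the
assumption that we only look at raw bytes of the first 1024 bytes and do not
track whether we are inside a comment or string literal.  The names
(looks_binary, ALLOWED_CONTROLS, SNIFF_LIMIT) are made up for illustration.

```python
SNIFF_LIMIT = 1024  # only the first 1024 bytes are inspected

# ASCII control characters that JS accepts as WhiteSpace or LineTerminator:
# TAB (0x09), LF (0x0A), VT (0x0B), FF (0x0C), CR (0x0D).
ALLOWED_CONTROLS = frozenset(b"\t\n\x0b\x0c\r")


def looks_binary(body: bytes) -> bool:
    """Return True if the prefix has an ASCII control byte that is not
    JS whitespace or a JS line terminator."""
    head = body[:SNIFF_LIMIT]
    return any(
        (b < 0x20 or b == 0x7F) and b not in ALLOWED_CONTROLS
        for b in head
    )
```

This flags the ZIP and MS Office signatures (they contain bytes such as 0x03
and 0x11), but it also flags a script with a raw 0x01 byte inside a string
literal, which is the false positive described above.  Also, the %PDF-
header line by itself contains no control characters, so a PDF is flagged
only if control bytes appear elsewhere in the sniffed prefix.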

Thanks,

Lukasz

On Thu, Aug 12, 2021 at 12:47 AM Leszek Swirski <lesz...@chromium.org>
wrote:

> That's also the case for invalid LHS on assignment, e.g. `lhs() = 5`,
> which should be an early error but for web compat we make it a runtime
> error.
>
> Overall, this could be something we expose, but:
>
>    1. There's a couple of additional complications around JS standards
>    incompatible errors (like the two aforementioned ones), some of which are
>    intentional
>    2. There's the rule-of-2 violation
>    3. This breaks streaming compilation (since the full body of the
>    resource has to be available for parsing before it is sent to the renderer)
>
Having to present the full body of a resource is indeed problematic
(because it requires gathering the whole response body before CORB/ORB can
pass/expose the body to the renderer process).

>
>    1. Parsing JS ain't cheap, and doing so as part of the network
>    process, presumably before sending anything to the renderer, is quite a
>    cost
>    2. We don't have a way of distinguishing valid JS from valid JSON
>    during parse, so we'd effectively need to parse twice
>
This seems solvable with something like:

When the sniffer sees:

[ 123, 456, "long string taking X bytes",

then it should block the response when the Content-Type is a JSON MIME
type, but otherwise it should allow the response (trading off security for
backward compatibility).

When the sniffer sees:

{ "foo":

then it should block the response, because such a prefix never results in
valid Javascript. (Although the JSON object syntax is exactly Javascript's
object-initializer syntax, a Javascript object-initializer expression is
not valid as a standalone Javascript statement.)
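
The two cases above could be sketched in Python roughly as follows.  This is
only an illustration under my own assumptions: the function names and the
"block"/"allow" return values are hypothetical, the JSON MIME type check is
a simplified reading of the mimesniff definition (application/json,
text/json, or a subtype ending in +json), and anything that matches neither
case is simply allowed.

```python
import re

# '{' followed by a double-quoted string and ':' at the start of the body.
# At statement position '{' opens a block, so this prefix cannot be JS.
_OBJECT_PREFIX = re.compile(rb'\s*\{\s*"(?:[^"\\]|\\.)*"\s*:')


def is_json_mime_type(content_type: str) -> bool:
    """Simplified JSON MIME type check (parameters are ignored)."""
    essence = content_type.split(";", 1)[0].strip().lower()
    return essence in ("application/json", "text/json") or essence.endswith("+json")


def sniff_json_prefix(prefix: bytes, content_type: str) -> str:
    """Return "block" or "allow" for the sniffed prefix of a response."""
    if _OBJECT_PREFIX.match(prefix):
        return "block"  # { "foo": ... never parses as a JS program
    if prefix.lstrip().startswith(b"["):
        # A JSON list is also a valid JS expression statement, so block
        # only when the server itself labeled the response as JSON.
        return "block" if is_json_mime_type(content_type) else "allow"
    return "allow"
```

With this sketch, a body starting with { "foo": is blocked regardless of
Content-Type, while a body starting with [ 123, 456, is blocked for
application/json but allowed for, say, text/html.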



>
>    1. Standards can change, and syntax can change with it, so whether or
>    not something is blocked will be version dependent
>
That is a fair point.  OTOH, we might have some flexibility here, because
A) if CORB/ORB blocks only non-javascript responses, then this should have
very little impact on web pages that work fine and B) we mostly care about
avoiding breaking _existing_ websites (and therefore might be okay
ignoring _future_ Javascript spec changes and forcing _future_ scripts to
always be served with a correct MIME type).

>
>    1. The DX of blocking a script just because of a parse error may be
>    suboptimal
>
Ack.  This is something that I didn't have in focus, so thanks for
bringing this up.  Maybe (?) it is okay to say that to get good Developer
eXperience one has to label their scripts with the correct JavaScript MIME
type.

>
> On Thu, Aug 12, 2021 at 9:37 AM 'Mathias Bynens' via v8-dev <
> v8-dev@googlegroups.com> wrote:
>
>> Another complication is that V8 currently doesn’t throw early (“parse”)
>> errors for regular expression literals (issue 896
>> <https://bugs.chromium.org/p/v8/issues/detail?id=896>). This would have
>> to be resolved before we can accurately validate whether a given input is
>> valid JS or not.
>>
>> On Thu, Aug 12, 2021 at 9:31 AM 'Hannes Payer' via v8-dev <
>> v8-dev@googlegroups.com> wrote:
>>
>>> Hi Lukasz,
>>>
>>> To understand your question correctly: You want an API which returns
>>> true if the JavaScript input is valid, right?
>>>
>>
Yes.  I am not sure at this point whether the input is 1) a string
containing the whole response body, 2) a string containing a prefix of the
response body (e.g. the 1st 1024 bytes), or 3) a stream.

>
>>> I think this surgery should be possible but I am deferring to the parser
>>> owners. @Leszek Swirski <lesz...@google.com> @Toon Verwaest
>>> <verwa...@google.com> WDYT? Maybe that's even a nice testing mode for
>>> JS language features.
>>>
>>> The parser is quite complicated which is a problem from a security
>>> perspective. That's a Rule-of-2 violation.
>>>
>>
Ack.

>
>>> -Hannes
>>>
>>> On Wed, Aug 11, 2021 at 9:21 PM 'Łukasz Anforowicz' via v8-dev <
>>> v8-dev@googlegroups.com> wrote:
>>>
>>>> Hello v8-dev@,
>>>>
>>>> Could you please help me with my questions below (related to parsing
>>>> Javascript)?  Please let me know if I should try another email alias
>>>> instead (I wasn't quite sure where to start asking questions).
>>>>
>>>> Context:
>>>>
>>>>    - ORB proposes <https://github.com/annevk/orb> to parse a HTTP
>>>>    response body to verify if it can be parsed as Javascript (blocking
>>>>    no-cors HTTP responses if the response body doesn't represent
>>>>    Javascript, because earlier ORB steps have already verified that the
>>>>    response doesn't represent other valid no-cors scenarios like
>>>>    audio/image/video/stylesheet/etc).
>>>>    - AFAICT, public v8 APIs provide a way to compile a script
>>>>    (e.g. v8::ScriptCompiler::CompileUnboundScript which takes a string as
>>>>    input, and a v8::ScriptCompiler::StartStreaming which takes a stream as
>>>>    input).  OTOH, v8/src/parsing/parser.cc doesn't seem to be exposed via
>>>>    the public API.
>>>>
>>>> Questions:
>>>>
>>>>    - *Would it be possible and/or reasonable to provide a public v8
>>>>    API for checking if a stream can be parsed as Javascript?*
>>>>       - Assumption: No cache integration is needed (the parsing will
>>>>       happen outside of a renderer process;  no compilation will be done).
>>>>       - Requirement: For JSON, the parser should indicate that this is
>>>>       not a valid Javascript (e.g. for JSON objects + for JSON lists that
>>>>       terminate without invoking any list methods)
>>>>       - I am happy to tackle this work, but I may need some guidance
>>>>       and hand-holding regarding some of the details.
>>>>    - *Is it fair to describe Javascript parsing as risky from a
>>>>    security perspective?*  (e.g. something to avoid in a
>>>>    NetworkService process and consider doing in a Utility process instead)
>>>>       - On one hand, the input is a text stream (no binary offsets)
>>>>       and the output is just a boolean (definitely-not-a-Javascript VS
>>>>       the-prefix-still-parses-as-Javascript).  And I imagine that the
>>>>       essence of the parser just mechanically transcribes the BNF rules
>>>>       for Javascript.  OTOH, parsers can get fairly complex, and so it
>>>>       seems that the act of parsing might be seen as violating the
>>>>       Rule-of-2
>>>>       <https://chromium.googlesource.com/chromium/src/+/refs/heads/main/docs/security/rule-of-2.md>.
>>>>
>>>> --
>>>> Thanks,
>>>>
>>>> Lukasz
>>>>
>>>> --
>>>> --
>>>> v8-dev mailing list
>>>> v8-dev@googlegroups.com
>>>> http://groups.google.com/group/v8-dev
>>>> ---
>>>> You received this message because you are subscribed to the Google
>>>> Groups "v8-dev" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to v8-dev+unsubscr...@googlegroups.com.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/v8-dev/d4dd45ff-3b73-4d4b-883d-d2e8ba4123e7n%40googlegroups.com
>>>> <https://groups.google.com/d/msgid/v8-dev/d4dd45ff-3b73-4d4b-883d-d2e8ba4123e7n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>
>>>
>>>
>>> --
>>>
>>>
>>> Hannes Payer |  V8 |  Google Germany GmbH |  Erika-Mann Str. 33, 80636
>>> München
>>>
>>>
>>
>


-- 
Thanks,

Lukasz
