Re: [PHP-DEV] [RFC] [VOTE] is_literal

Craig Francis Wed, 21 Jul 2021 08:10:51 -0700

> On 20 Jul 2021, at 01:23, tyson andre <tysonandre...@hotmail.com> wrote:
> 
>> As an aside, only 4 of 23 'no' voters provided any comment as to why they
>> voted that way on the mailing list, which I feel undermines the point of
>> the Request For Comment process, with an additional 5 responding personally
>> off-list after prompting. This makes it harder (or impossible) for points
>> to be discussed and addressed.
> 
> 1. My earlier comments about static analysis, and on behavior depending on 
> whether opcache is enabled



Thanks Tyson,

Just to check I've not missed anything, are you referring to your comments from 
March 2020?
https://news-web.php.net/php.internals/109192 
<https://news-web.php.net/php.internals/109192>

I mentioned your email in the RFC, as I agree that developers should use Static 
Analysis, but I think it's highly unlikely we will be able to get most PHP 
developers to use it:
https://wiki.php.net/rfc/is_literal#static_analysis 
<https://wiki.php.net/rfc/is_literal#static_analysis>

As to the OPcache, Joe's implementation already works with its optimisations, 
so the results are consistent.
https://wiki.php.net/rfc/is_literal#compiler_optimisations 
<https://wiki.php.net/rfc/is_literal#compiler_optimisations>



> 2. This might prevent certain optimizations in the future. For example, 
> currently, 1-byte strings are all interned to save memory.
>    If is_literal was part of php prior to proposing that optimization, then 
> that optimization may be rejected.


Like the OPcache optimisations, the 1-byte strings and other existing 
optimisations work with Joe's implementation. I cannot comment or predict any 
future optimisation work, but this argument could be made for any change to PHP.



> 3. PHP's `string` type is used both for (hopefully) valid unicode strings and 
> for low level operations on literal byte arrays (e.g. cryptogrophy).
>    It seems really, really strange for a type system to track trustedness for 
> a low level primitive to track byte arrays. (the php 6 unicode string 
> refactoring failed)
> 
>    Further, if this were to be extended in the future beyond the original 
> proposal (e.g. literal strings that are entirely digits are automatically 
> interned or marked as trusted),
>    this might open previously safe code acting on byte arrays to side channel 
> issues such as timing attacks (https://en.wikipedia.org/wiki/Timing_attack)


We're just adding a flag to say if the string was written by the developer, 
which is key for identifying Injection Vulnerabilities (it's similar to the 
Interned flag).

With cryptography and timing attacks, the only part affected would be during 
string concatenation, which has already got several optimisations that 
introduce variability in timings (e.g. concatenating variables containing 'a' 
with 'b' is about twice as slow than 'a' with '').

We did have a discussion about integer support, but the way 
integers/floats/booleans are currently stored in memory would require a big 
change to note if those values were defined in the developers PHP source code. 
We also explored the idea of allowing all integers, and while they cannot lead 
to an Injection Vulnerability, the consensus was that a developer defined 
string is easier to understand:
https://wiki.php.net/rfc/is_literal#integer_values 
<https://wiki.php.net/rfc/is_literal#integer_values>



> 4. Internal functions and userland polyfills for those functions may 
> unintentionally differ significantly for the resulting taintedness,
>    e.g. base64_decode in userland being built up byte by byte would end up 
> being possibly untainted?


All of these functions would return a non-literal string, as the values were 
not written by the developer (as in, directly defined in their source code).



> 5. The fact that 1-byte strings are almost always interned seems like a 
> noticeable inconsistency (though library authors can deal with it once 
> they're aware of it), though for it to become an issue a library may need to 
> take multiple strings as input
>    (e.g. a contrived example`"echo -- " . $trustedPrefix . 
> shell_escape($notTrusted)` for $trustedPrefix of "'" (or "\n") and 
> $notTrusted of "; evaluate command"


1-byte strings have already been addressed in the implementation.

Libraries need to take user values separately from developer defined strings, 
because developers frequently make mistakes:
https://github.com/craigfrancis/php-is-literal-rfc/blob/main/justification/mistakes.php?ts=4
 
<https://github.com/craigfrancis/php-is-literal-rfc/blob/main/justification/mistakes.php?ts=4>

Especially when it comes to escaping values:
https://github.com/craigfrancis/php-is-literal-rfc/blob/main/justification/escaping.php?ts=4
 
<https://github.com/craigfrancis/php-is-literal-rfc/blob/main/justification/escaping.php?ts=4>

In your command example, the library could copy how Parameterised Queries work 
in SQL.
https://github.com/craigfrancis/php-is-literal-rfc/blob/main/examples/cli-basic.php
 
<https://github.com/craigfrancis/php-is-literal-rfc/blob/main/examples/cli-basic.php>



> 6. Including it in core would make it harder to remove later if it interfered 
> with opcache or jit work, or to migrate code to alternative interpreters for 
> php if those were ever implemented (if frameworks were to extensively depend 
> on is_literal)


This is why Joe was very careful with the implementation. Overall the approach 
is simple, where it's just providing libraries with the key bit of information 
to identify their primary source of Injection Vulnerabilities (when a developer 
has incorrectly included user data).



> 7. Tracking whether a string is untrusted is a definition only suitable for a 
> few (extremely common) formats for php. But for less common features, even 
> stringified integers may be a problem (e.g. binary file formats, etc)
> 
>    This is relatively minor given that php is typically used in a web 
> programming context with json or html or js/css output
> 
>    I'd think is_interned()/intern_string() is much closer to tracking 
> something that corresponds with php's internals (e.g. and may allow saving 
> memory in long-running processes which receive duplicate strings as input), 
> though the 10 people who wanted fully featured trustedness checking would 
> probably want is_literal instead


The implementation is pretty close to matching the Interned flag, but isn't 
exactly the same, as the Interned flag can be unpredictable for normal PHP 
developers (e.g. the chr() function), and it could be changed in the future 
(e.g. Joe noted how there was a suggestion to intern keys while JSON decoding 
or unserializing).



> 8. Serializing and unserializing data would lose information about 
> trustedness of inputs, unpredictably (e.g. unserialize() in php 8.0 interns 
> array keys).
> 
>    There's no (efficient) way to change trusted strings to untrusted or vice 
> versa, though there are inefficient workarounds (modifying a byte and 
> restoring it to stop trusting it, imploding single characters to create a 
> trusted string)
> 
>    This may done implicitly in frameworks using APCu/memcached/redis as a 
> cache
> 
>    (I definitely don't think the serialization data format should track 
> is_literal())


Agreed, serializing and unserializing (or any other modifications) does not set 
the literal flag.

As to providing an easy way to mark a non-literal string to a literal - in 
short, we do not want to recreate the biggest flaw of Taint Checking, where 
naive developers would see this as a way to mark escaped strings as "safe", 
giving them a false sense of security that these strings can now be trusted.



> 9. Future refactorings, optimizations or deoptimizations (or security fixes) 
> to unserialize(), etc. may unexpectedly break code using is_literal that 
> throw instead of warn (more bug reports, discourage users from upgrading, 
> etc.)


The implementation already includes several tests, and output from functions 
like unserialize() do not set the literal flag.



> 10. This RFC adds an unknown amount of future work for php-src and PECLs to 
> *intuitively* support mapping trusted inputs to trusted outputs or vice versa 
> - less commonly used or unmaintained functions may not behave as expected for 
> a while


This shouldn't be an issue, as the flag is only being used to identify literal 
strings. If new strings are being created or modified, they are no longer 
considered literal strings (the flag is off by default).



> 11. https://pecl.php.net/package/taint is available already for a use case 
> with some overlap for setups that need this


Libraries need something that's available to everyone (we all make mistakes; 
and the developers most likely to introduce these vulnerabilities are unlikely 
to install an extension, or use Static Analysis).

And with the Taint extension in particular, it has really helped me experiment 
with this concept, but that approach does give a false sense of security, which 
is why this RFC did not re-create it:
https://wiki.php.net/rfc/is_literal#taint_checking 
<https://wiki.php.net/rfc/is_literal#taint_checking>



> Aside: I'd have to wonder if ZSTR_IS_INTERNED (and the function to make an 
> interned string) would make sense to expose in a PECL as a regular 
> `extension` (not a `zend_extension`) if is_interned also fails.
> Unlike the zend_extension for 
> https://www.php.net/manual/en/function.is-tainted.php ,
> something simple may be possible without needing the performance hit and
> future conflicts with XDebug that I assume 
> https://www.php.net/manual/en/function.is-tainted.php would be prone to.
> (https://pecl.php.net/package/taint seems to use a separate bit to track 
> this. The latest release of the Taint pecl fixes XDebug compatibility)


As noted above, we couldn't expose the Interned flag directly, as it's a little 
unpredictable. That's why we added a new flag - one that at first glance might 
look the same, but handles the edge cases.

Thanks again,
Craig

Re: [PHP-DEV] [RFC] [VOTE] is_literal

Reply via email to