apr_token_* conclusions (was: Better casecmpstr[n]?)

William A Rowe Jr Wed, 25 Nov 2015 09:42:56 -0800

On Wed, Nov 25, 2015 at 10:17 AM, Jim Jagielski <j...@jagunet.com> wrote:


> What is the current status? Is this on hold?
>

It is looking for a good name.  I'm happy with apr_token_strcasecmp
to best indicate its use-case and provenance.  Does that work for
everyone?

It is looking for clearer docs.  Spent 20 hours just reviewing locale in C
(partly to phrase this discussion more accurately). About to pen those
based on better definitions of terms.

So here are my conclusions as they apply to apr, and to httpd after
all of this locale review...

Background
-----------------

The C spec defines the locale as "C" (a.k.a. "POSIX") at startup.  Until
somewhere in the code the locale of the process is switched with
a call to setlocale(LC_ALL, "") or similar, the application remains in this
deterministic (Anglicized) state.  The empty string causes LANG
and the LC_* variables to be evaluated.  Most modern *nix utilities do
this right off the bat in their source.  The compiler's runtime should
*not* do so without instruction, so while gcc does the right thing in
this respect, other compilers may not be behaving appropriately.
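
To make the startup behavior concrete, here is a minimal sketch using
only standard setlocale().  The helper names locale_is_c() and
adopt_env_locale() are mine, purely for illustration; they are not part
of APR or httpd.

```c
#include <locale.h>
#include <string.h>

/* Returns 1 if the program's current LC_ALL locale is named "C".
 * Passing NULL as the locale argument only queries; it changes nothing. */
int locale_is_c(void)
{
    const char *name = setlocale(LC_ALL, NULL);
    return name != NULL && strcmp(name, "C") == 0;
}

/* Adopts whatever LANG and the LC_* variables specify.  Returns the
 * resulting locale name, or NULL if the environment names a locale
 * that the C library cannot provide. */
const char *adopt_env_locale(void)
{
    return setlocale(LC_ALL, "");
}
```

A freshly started program that calls locale_is_c() before anything else
should see 1; only after adopt_env_locale() (or an equivalent call made
by some embedded library) can that answer change.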

You might react "how do we handle UTF-8 [or other code page] then?"
The answer is that in the "C" locale, high-bit characters (in ASCII, and
even unusual EBCDIC codes) are effectively opaque.  They *can* be
UTF-8, they might belong to an SBCS (single-byte charset), and they
might be entirely meaningless.  They won't be case folded (but will
keep their unique identity).  Some consumer may recognize them,
others will not.  The C lib functions will treat each octet of an MBCS
as a distinct character, which is a reason that our old-school autoindex
(pre-fancy tables) misaligns the columns when the filename contains
any multibyte characters.  We had simply byte-counted chars.
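
The byte-count versus character-count mismatch can be sketched as
follows.  utf8_codepoints() is a hypothetical helper, not the autoindex
code, and it assumes the input is well-formed UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Counts UTF-8 code points by skipping continuation octets, which
 * always have the bit pattern 10xxxxxx.  Assumes well-formed UTF-8. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the name "caf\xC3\xA9.txt" (an e-acute encoded as two octets),
strlen() reports 9 while utf8_codepoints() reports 8, so padding
computed from strlen() comes up one column short.  Code points are
still only an approximation of display width, since wide and combining
characters are not accounted for here.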

If our code splits these multibyte sequences, things can go wrong
somewhere down the way.  In particular, treating each octet of these
multibyte sequences as a distinct character is fine in UTF-8, where
every part of the multibyte sequence is high-bit-set (and therefore
opaque), but is not fine in ISO-2022-JP, where 7-bit octets may
change their meaning (there are bugs to the effect that we search
a string for pathname separator characters, and these can occur
in multibyte sequences which are valid file names, and not path
separator characters).  There isn't much we can do about this.
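
A small illustration of the separator problem, working purely at the
octet level.  The sample name is constructed by hand: ESC $ B switches
to two-byte JIS X 0208 mode and ESC ( B switches back to ASCII.  I am
assuming 0x21 0x2F is an assigned two-byte code; which glyph it maps to
does not matter for the point.  naive_slash_offset() is a hypothetical
stand-in for any byte-wise separator search.

```c
#include <string.h>

/* One two-byte character (0x21 0x2F) wrapped in ISO-2022-JP escape
 * sequences, followed by an ASCII ".txt" suffix. */
const char sample_name[] = "\x1B$B\x21\x2F\x1B(B.txt";

/* Naive byte search: returns the offset of the first '/' octet,
 * or -1 if the string contains none. */
long naive_slash_offset(const char *s)
{
    const char *p = strchr(s, '/');
    return p != NULL ? (long)(p - s) : -1;
}
```

naive_slash_offset(sample_name) returns 4, the trail octet of the
two-byte character, even though the name contains no path separator.
The same search over UTF-8 cannot misfire this way, because no octet of
a UTF-8 multibyte sequence falls in the 7-bit range.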

When using httpd on a *nix filesystem containing UTF-8 chars, we
accept UTF-8 filenames both in client provided fields and within the
httpd.conf configuration.  On a filesystem containing SBCS filenames,
we similarly accept these without translation.  It's up to the admin
to decide their schema; without a third party module, the UTF-8
name isn't recognized in an SBCS directory, and vice versa.

On Windows, all system resources, including filenames, are stored
in Unicode by the OS.  Within APR, we simply treat all system
resource strings as UTF-8, and therefore the conf file in httpd needs
to spell out the UTF-8 name of the resource. All that said, other than
these resource names, even on Windows most string processing
follows the same opaque logic as *nix.

The Mac OS is somewhat similar to Windows, in that all of the
resource names are actually UTF-8 encoded, AFAICT.  But unlike
Windows, these opaque strings "just work"; we don't have to do any
Unicode transliteration to get there.

Observations
-------------------

Nowhere in apr or httpd do *we* call setlocale() to change things.  So
the current use of tolower(), toupper(), strcasecmp() etc should not be
subject to dangerous transliterations by *our* doing.

There is nothing that suggests that the APR consumer *cannot*
call setlocale()!  So we need to revisit APR 2.0 and ensure that
our functions perform as expected even when operating under
some different locale.  This suggests that apr_token_str[n]casecmp,
apr_token_tolower|toupper, and much of the rest of the code that
exists only to evaluate ASCII alpha characters should *not* pay any
attention to the currently defined locale.

In httpd, where things go sideways is if someone is calling setlocale(),
for example in an in-process PHP, Perl or Lua script, because this
changes the core operation of httpd.  If the script switches the locale
to Turkish, for example, our forced-lowercase content-type conversion
will cause "IMAGE/GIF" to become "ımage/gıf", clearly not what the
specs intended.

I imagine that this has rarely shown up because the few scripts that
might toggle this were correctly written to toggle back the previous
locale upon completion.  But if they are running inline and have
the locale toggled during the filter handoff, the filters themselves
are likely facing some unexpected behavior.
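
The "toggle back" discipline described above can be sketched like this.
with_ctype_locale() and record_lower_I() are hypothetical names of my
own, not anything PHP, Perl, Lua or httpd actually ships.

```c
#include <ctype.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Example callback: stores tolower('I') as seen under the locale in
 * effect at the time of the call. */
void record_lower_I(void *out)
{
    *(int *)out = tolower('I');
}

/* Runs fn(arg) under the named LC_CTYPE locale, then restores the
 * previous one.  Returns 0 on success, or -1 if the old locale could
 * not be saved or the new one could not be set. */
int with_ctype_locale(const char *name, void (*fn)(void *), void *arg)
{
    const char *cur = setlocale(LC_CTYPE, NULL);
    char *saved;

    if (cur == NULL)
        return -1;
    /* Copy the name: a later setlocale() call may overwrite the
     * buffer that the query returned. */
    saved = malloc(strlen(cur) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, cur);

    if (setlocale(LC_CTYPE, name) == NULL) {
        free(saved);
        return -1;
    }
    fn(arg);                      /* whole process sees the new locale here */
    setlocale(LC_CTYPE, saved);   /* toggle back */
    free(saved);
    return 0;
}
```

Under the "C" locale the callback records 'i'.  Under a single-byte
Turkish locale (the name, something like "tr_TR.ISO8859-9", is platform
dependent and an assumption on my part) it would record the dotless i
instead.  Note that setlocale() is process-wide, so even a well-behaved
caller exposes every other thread to the new locale between the two
calls.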

APR conclusions
-------------------------

Adding unambiguous token handling functions would be good for
the few case-insensitive string comparison, string folding, and
search functions.  It allows the spec-consumer to trust their string
processing.  I'm going to suggest apr_token_* as the API prefix.
The API will preserve "POSIX behavior" as defined by A-Z <> a-z
character equivalence and make no compensation for locales
or for MBCS strings.
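
A rough sketch of what "A-Z <> a-z and nothing else" means in code.
The tok_* names are placeholders, this is not the committed APR
implementation, and the range test assumes an ASCII execution charset
(EBCDIC letters are not contiguous, so a table would be needed there).

```c
#include <stddef.h>

/* Folds only ASCII 'A'..'Z'.  Every other value, including all
 * high-bit octets, passes through unchanged.  Never consults the
 * current locale. */
static int tok_tolower(int c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Locale-independent case-insensitive comparison of two tokens. */
int tok_strcasecmp(const char *a, const char *b)
{
    const unsigned char *ua = (const unsigned char *)a;
    const unsigned char *ub = (const unsigned char *)b;

    for (;; ua++, ub++) {
        int ca = tok_tolower(*ua);
        int cb = tok_tolower(*ub);
        if (ca != cb || ca == 0)
            return ca - cb;
    }
}

/* Same, but compares at most n octets. */
int tok_strncasecmp(const char *a, const char *b, size_t n)
{
    const unsigned char *ua = (const unsigned char *)a;
    const unsigned char *ub = (const unsigned char *)b;

    for (; n > 0; n--, ua++, ub++) {
        int ca = tok_tolower(*ua);
        int cb = tok_tolower(*ub);
        if (ca != cb || ca == 0)
            return ca - cb;
    }
    return 0;
}
```

With this, tok_strcasecmp("IMAGE/GIF", "image/gif") is 0 no matter what
setlocale() call some embedded script has made, while two different
high-bit octets such as 0xC9 and 0xE9 always compare unequal.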

httpd conclusions
-------------------------

There are so many edge cases that we simply need to preserve
the code as-is in httpd 2.4 and warn off anyone toggling the locale
that httpd is operating under.  Making the 'offer' to run under many
non-POSIX locales is opening up a can of security vulnerabilities.
It would be irresponsible to half-fix this.

We should perform a thorough review of httpd 2.x trunk so as to
make a statement upon release that the code has been adapted
to correctly operate under most non-POSIX locales.  Warn off
users from third party modules that have not yet made a similar
statement.  Kill all the redundant httpd 2.x == APR functions that
do not belong in trunk (but may have been appropriate in 2.earlier
while waiting for APR to be released).

To the extent that platforms have very poor implementations
of strcasecmp(), this isn't justification to overload httpd with more
band-aids.  Fix the underlying implementation.  So I'm -0.5 on
backporting this change into httpd 2.4 until we see a comprehensive
justification, or until we have comprehensively fixed trunk and are
prepared to make the same "setlocale()-safe" assertion about 2.4.future.
