On Wed, Nov 25, 2015 at 10:17 AM, Jim Jagielski <j...@jagunet.com> wrote:
> What is the current status? Is this on hold?
>

It is looking for a good name. I'm happy with apr_token_strcasecmp to best indicate its use-case and provenance. Does that work for everyone?

It is looking for clearer docs. I spent 20 hours just reviewing locale in C (partly to phrase this discussion more accurately), and I'm about to pen those docs based on better definitions of terms.

So here are my conclusions as they apply to apr, and to httpd, after all of this locale review...

Background
-----------------

The C spec defines the locale as "C" (e.g. "POSIX") at startup. Until somewhere in the code the locale of the current thread is switched with a call to setlocale(LC_ALL, "") or similar, the application remains in this deterministic (Anglicized) state. The empty string causes LANG and the LC_* variables to be evaluated. Most modern *nix utilities do this right off the bat in their source. The compiler should *not* do so without instruction, so while gcc does the right thing in this respect, other compilers may not be behaving appropriately.

You might react "how do we handle UTF-8 [or other code page] then?" The answer is that in the "C" locale, high-bit characters (in ASCII, and even unusual EBCDIC codes) are effectively opaque. They *can* be UTF-8, they might belong to an SBCS (single-byte charset), and they might be entirely meaningless. They won't be case folded (but will keep their unique identity). Some consumers may recognize them, others will not.

The C lib functions will treat each octet of an MBCS as a distinct character, which is a reason that our old-school autoindex (pre-fancy tables) misaligns the columns when the filename contains any multibyte characters. We had simply byte-counted chars. If our code splits these multibyte sequences, things can go wrong somewhere down the way.
In particular, treating these multibyte sequences as distinct characters is fine in UTF-8, where every part of the multibyte sequence is high-bit-set (and therefore opaque), but is not fine in ISO-2022-JP, where low-bit-set characters may change their meaning. (There are bugs to the effect that we search a string for pathname separator characters, and these can occur in multibyte sequences which are valid file names, and are not path separator characters.) There isn't much we can do about this.

When using httpd on a *nix filesystem containing UTF-8 chars, we accept UTF-8 filenames both in client-provided fields and within the httpd.conf configuration. On a filesystem containing SBCS filenames, we similarly accept these without translation. It's up to the admin to decide their schema; without a third party module, the UTF-8 name isn't recognized in an SBCS directory, and vice versa.

On Windows, all system resources, including filenames, are stored in Unicode by the OS. Within APR, we simply treat all system resource strings as UTF-8, and therefore the conf file in httpd needs to spell out the UTF-8 name of the resource. All that said, other than these resource names, even on Windows most string processing follows the same opaque logic as *nix.

The Mac OS is somewhat similar to Windows, in that all of the resource names are actually UTF-8 encoded, AFAICT. But unlike Windows, these opaque strings "just work"; we don't have to do any Unicode transliteration to get there.

Observations
-------------------

Nowhere in apr or httpd do *we* call setlocale() to change things. So the current use of tolower(), toupper(), strcasecmp() etc. should not be subject to dangerous transliterations by *our* doing.

There is nothing that suggests that the APR consumer *cannot* call setlocale()! So we need to revisit APR 2.0 and ensure that our functions perform as expected even when operating under some different locale.
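
To make the point about consumers calling setlocale() concrete, here is a minimal C sketch of how code could check whether the process is still in the startup locale. The helper name current_locale_is_c is made up for illustration and is not an APR or httpd function; the only library call used is the standard setlocale(), where a NULL second argument queries the locale without changing it.

```c
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper (not part of APR or httpd): returns 1 if the
 * program-wide locale is still the startup "C"/"POSIX" locale, 0 otherwise.
 */
static int current_locale_is_c(void)
{
    /* A NULL locale argument only queries; it does not modify anything. */
    const char *name = setlocale(LC_ALL, NULL);

    if (name == NULL)
        return 0;
    return strcmp(name, "C") == 0 || strcmp(name, "POSIX") == 0;
}
```

Calling this before and after a consumer's setlocale(LC_ALL, "") would show whether LANG and the LC_* variables moved the process out of the "C" locale, which is the situation the locale-sensitive C library functions then react to.
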
This suggests that apr_token_str[n]casecmp, apr_token_tolower|toupper, and much of the rest of the code that exists only to evaluate ASCII alpha characters must *not* pay any attention to the currently defined locale.

In httpd, where things go sideways is if someone is calling setlocale(), for example in an in-process PHP, Perl or Lua script, because this changes the core operation of httpd. If the script switches the locale to Turkish, for example, our forced-lowercase content-type conversion will cause "IMAGE/GIF" to become "ımage/gıf", clearly not what the specs intended.

I imagine that this has rarely shown up because the few scripts that might toggle this were correctly written to toggle back to the previous locale upon completion. But if they are running inline and have the locale toggled during the filter handoff, the filters themselves are likely facing some unexpected behavior.

APR conclusions
-------------------------

Adding unambiguous token handling functions would be good for the few case-insensitive string comparison, string folding, and search functions. It allows the spec-consumer to trust their string processing.

I'm going to suggest apr_token_* as the API prefix. The API will preserve "POSIX behavior" as defined by A-Z <> a-z character equivalence and make no compensation for locales or for MBCS strings.

httpd conclusions
-------------------------

There are so many edge cases that we simply need to preserve the code as-is in httpd 2.4 and warn off anyone toggling the locale that httpd is operating under. Making the 'offer' to run under many non-POSIX locales is opening up a can of security vulnerabilities. It would be irresponsible to half-fix this.

We should perform a thorough review of httpd 2.x trunk so as to make a statement upon release that the code has been adapted to correctly operate under most non-POSIX locales. Warn off users from third party modules that have not yet made a similar statement.
Kill all the redundant httpd 2.x == APR functions that do not belong in trunk (but may have been appropriate in 2.earlier while waiting for APR to be released).

To the extent that some platforms have very poor implementations of strcmp(), this isn't justification to overload httpd with more band-aids. Fix the underlying implementation.

So I'm -0.5 on backporting this change into httpd 2.4 until we see a comprehensive justification, or until we have comprehensively fixed trunk and are prepared to make the same "setlocale()-safe" assertion about 2.4.future.
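
For reference, the A-Z <> a-z equivalence proposed for apr_token_* under "APR conclusions" can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the eventual APR implementation: the sketch_ names are hypothetical, the range test assumes an ASCII-based execution character set (an EBCDIC build would need a lookup table instead), and nothing here consults the locale.

```c
/*
 * Hypothetical sketch, not the APR API. Folds only the ASCII letters
 * A-Z to a-z; every other octet, including all high-bit octets, is
 * returned unchanged. Assumes an ASCII-based charset.
 */
static unsigned char sketch_token_tolower(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/*
 * Locale-independent case-insensitive comparison with the usual
 * strcasecmp() return convention: <0, 0 or >0.
 */
static int sketch_token_strcasecmp(const char *s1, const char *s2)
{
    const unsigned char *a = (const unsigned char *)s1;
    const unsigned char *b = (const unsigned char *)s2;

    for (;; a++, b++) {
        unsigned char ca = sketch_token_tolower(*a);
        unsigned char cb = sketch_token_tolower(*b);

        if (ca != cb)
            return (int)ca - (int)cb;   /* first differing folded octet */
        if (ca == '\0')
            return 0;                   /* both strings ended together */
    }
}
```

Because the fold is a fixed range test rather than a call to tolower(), "IMAGE/GIF" and "image/gif" compare equal whatever locale a script has switched to, and multibyte or SBCS octets pass through untouched.
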