On Wed, Sep 3, 2014 at 3:20 PM, Stephan Beal <[email protected]> wrote:
> On Wed, Sep 3, 2014 at 11:13 PM, jose isaias cabrera
> <[email protected]> wrote:
>
> > PHP should handle the encoding whether or not it has the BOM.
> >
> > As "should" Excel!
>
> Unlike Excel, with PHP the fix is easy - remove the BOM, which is simple
> once you have a program which lets you know it's there (it's hidden in many
> editors).
>
> My point is only - adding a BOM is not a viable solution: it's a
> deprecated/discouraged/worst-practice because so many tools don't deal well
> with them.

The problem is that the recommendations have varied over time (do use
it, don't use it). Win32 was supporting good old 2-byte-only Unicode
back before there was a UTF-8 (or indeed before there was a UTF-16).

The name "byte order mark" is a misnomer (though mostly accurate). The
BOM is really a "zero width no-break space" (ZWNBSP, U+FEFF), which is
"harmless" at the beginning of a file and thus useful as a signature to
identify the encoding of a file when other metadata / encoding
information is not available.

It is useful (if used) to detect the encoding of a file, but POSIX
systems "wimped out" and basically never embraced the Unicode
specification as originally written, writing their own encoding, a File
System Safe Unicode Transformation Format (FSS-UTF), which later became
UTF-8. Don't get me wrong, I like UTF-8 and prefer it to UTF-16. But
using ZWNBSP as a signature makes sense for all UTF encodings when the
parties agree to use it as such.

When the only Unicode encoding was UCS-2, the signature was only useful
for determining the byte order of the originating system. After UTF-8
was introduced, it became useful in its own right to identify a
byte-oriented encoding. Then, when UCS-2 was effectively dropped and
UTF-16 replaced it (so that the code point space could be extended from
one 16-bit plane to 17 16-bit planes), and UTF-32 was introduced, it
became even more useful.

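To make the signature idea concrete, here is a minimal Python sketch of
sniffing a leading U+FEFF to pick an encoding. The names `sniff_bom` and
`decode_with_signature` are invented for this example, and falling back
to UTF-8 when no signature is present is purely an assumption; only the
BOM constants come from Python's standard `codecs` module.

```python
import codecs

# Check the longest signatures first: the UTF-32 LE signature (FF FE 00 00)
# begins with the UTF-16 LE signature (FF FE).
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, signature_length), or (None, 0) if none is found."""
    for signature, encoding in _SIGNATURES:
        if data.startswith(signature):
            return encoding, len(signature)
    return None, 0


def decode_with_signature(data: bytes, fallback: str = "utf-8") -> str:
    """Decode data, dropping a leading signature if one is present.

    The fallback encoding is an assumption for this sketch; real input
    without a signature still needs external metadata or a guess.
    """
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or fallback)
```

Note that this is still a guess: UTF-16 LE text whose first real
character is U+0000 would be taken for UTF-32 LE, so the signature only
works when both parties agree to use it.
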
In any case, if we can't get all operating systems to agree on what
code sequence marks the end of a line of text (CR only, LF only, CR LF,
something else entirely), I don't expect we'll get agreement on the
"proper" way to construct Unicode-centric text files any time soon.
Both approaches have pros and cons, though I would maintain that
external metadata that unambiguously identifies the text encoding is by
far the best option, far preferable to guessing, no matter how high the
confidence that the guess is correct.

-- 
Scott Robison
_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

