On Mon, Sep 14, 2020 at 09:49:13PM -0400, James Blachly via Digitalmars-d-learn wrote:
> I wish to write a function including ∂x and ∂y (these are trivial to
> type with appropriate keyboard shortcuts - alt+d on Mac), but without
> a unicode byte order mark at the beginning of the file, the lexer
> rejects the tokens.
>
> It is not apparently easy to insert such marks (AFAICT no common tool
> does this specifically), while other languages work fine (i.e., accept
> unicode in their source) without it.
>
> Is there a downside to at least presuming UTF-8?
Tested it locally, with and without a BOM; the lexer rejects ∂ as a valid token either way. I suspect the reason has nothing to do with BOMs, but with the fact that ∂ is not classified as alphabetic (see std.uni.isAlpha, which returns false for ∂). The following code, which contains Cyrillic letters, compiles just fine without a BOM (std.uni.isAlpha('Ш') returns true):

	import std.stdio;

	void main() {
		int Ш = 1;
		writeln(Ш);
	}

As the docs for std.uni.isAlpha state, it tests for the general Unicode category 'Alphabetic'. Probably identifiers are restricted to characters of this category plus digits and '_' (and maybe one or two others, perhaps '$'? Don't remember now).

T

-- 
People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANG
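For anyone who wants to verify the classification claim directly, here is a minimal sketch (assuming a D compiler with Phobos available) that prints what std.uni.isAlpha says about the two characters in question:

```d
import std.stdio;
import std.uni;

void main()
{
    // U+2202 PARTIAL DIFFERENTIAL is a math symbol (category Sm),
    // not Alphabetic, so isAlpha should report false.
    writeln(isAlpha('∂'));

    // U+0428 CYRILLIC CAPITAL LETTER SHA is a letter (category Lu),
    // so isAlpha should report true.
    writeln(isAlpha('Ш'));
}
```

If isAlpha('∂') is indeed false, that would explain the lexer rejecting ∂ in identifiers regardless of whether the file has a BOM.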