Re: Why is BOM required to use unicode in tokens?
On Wednesday, 16 September 2020 at 00:22:15 UTC, Steven Schveighoffer wrote: On 9/15/20 8:10 PM, James Blachly wrote: On 9/15/20 10:59 AM, Steven Schveighoffer wrote: [...] Steve: It sounds as if the spec is correct but the glyph (codepoint?) range is outdated. If this is the case, it would be a worthwhile update. Do you really think it would be rejected out of hand? I don't really know the answer, as I'm not a unicode expert. Someone should verify that the character you want to use for a symbol name is actually considered a letter or not. Using phobos to prove this is kind of self-defeating, as I'm pretty sure it would be in league with DMD if there is a bug. I checked, it's not a letter. None of the math symbols are. But if it's not a letter, then it would take more than just updating the range. It would be a change in the philosophy of what constitutes an identifier name.
Re: Why is BOM required to use unicode in tokens?
On Tuesday, 15 September 2020 at 16:23:01 UTC, Jon Degenhardt wrote:

# The 'Ш' and 'ä' characters are fine.
$ echo $'import std.stdio; void Шä() { writeln("Hello World!"); } void main() { Шä(); }' | dmd -run -
Hello World!

# But not '∂'
$ echo $'import std.stdio; void x∂() { writeln("Hello World!"); } void main() { x∂(); }' | dmd -run -
__stdin.d(1): Error: char 0x2202 not allowed in identifier

Yes. The same troubles for widely used Greek symbols (Sigma, alpha and some other). Unfortunately...
Re: Why is BOM required to use unicode in tokens?
On Tuesday, 15 September 2020 at 01:49:13 UTC, James Blachly wrote: I wish to write a function including ∂x and ∂y (these are trivial to type with appropriate keyboard shortcuts - alt+d on Mac), but without a unicode byte order mark at the beginning of the file, the lexer rejects the tokens. It is not apparently easy to insert such marks (AFAICT no common tool does this specifically), while other languages work fine (i.e., accept unicode in their source) without it. Is there a downside to at least presuming UTF-8? As you probably already know, BOM means byte order mark, so it is only relevant for multi-byte encodings (UTF-16, UTF-32). A BOM for UTF-8 isn't required, and in fact it's discouraged. Your editor should automatically insert a BOM if appropriate when you save your file. Probably you need to select the appropriate encoding for your file. Typically this is available in the 'Save as..' dialog, or the settings.
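For readers who do want to check whether a file carries a UTF-8 BOM: it is the code point U+FEFF encoded as the three bytes EF BB BF. The following is only a minimal sketch; `stripUtf8Bom` is a hypothetical helper name, not something from Phobos or DMD.

```d
import std.stdio : writeln;
import std.string : representation;

/// Hypothetical helper: returns `data` without a leading UTF-8 BOM, if any.
const(ubyte)[] stripUtf8Bom(const(ubyte)[] data)
{
    // U+FEFF encoded as UTF-8 is the byte sequence EF BB BF.
    static immutable ubyte[3] bom = [0xEF, 0xBB, 0xBF];
    if (data.length >= 3 && data[0 .. 3] == bom[])
        return data[3 .. $];
    return data;
}

void main()
{
    // A tiny source text that starts with a BOM.
    auto source = "\uFEFFvoid main() {}".representation;
    auto stripped = stripUtf8Bom(source);
    writeln("bytes removed: ", source.length - stripped.length);
}
```

Input without a BOM is returned unchanged, so the helper can be applied unconditionally.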
Re: Why is BOM required to use unicode in tokens?
On Wednesday, 16 September 2020 at 07:38:26 UTC, Dominikus Dittes Scherkl wrote:
We only need to define which properties a character needs in order to be allowed in an identifier.

I think the following change in the grammar would be sufficient:

Identifier:
    IdentifierStart
    IdentifierStart IdentifierChars

IdentifierChars:
    IdentifierChar
    IdentifierChar IdentifierChars

IdentifierStart:
    _
    Any Unicode codepoint with general category Lu, Ll, Lt, Lo, Nl or No

IdentifierChar:
    IdentifierStart
    Any Unicode codepoint with general category Lm, Mn, Me, Mc or Nd
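To see what the proposed grammar would accept, here is a rough sketch using the general-category sets exposed by std.uni. `isProposedIdentifier` is a hypothetical name, the sets are rebuilt on every call for simplicity, and the result depends on the Unicode tables shipped with the Phobos version in use. This is not how DMD's lexer works.

```d
import std.stdio : writeln;
import std.uni : unicode;

/// Hypothetical helper: true if `name` matches the grammar proposed above.
bool isProposedIdentifier(string name)
{
    if (name.length == 0)
        return false;

    // IdentifierStart categories from the proposal ('_' is handled below).
    auto start = unicode.Lu | unicode.Ll | unicode.Lt | unicode.Lo
               | unicode.Nl | unicode.No;
    // IdentifierChar adds modifier letters, marks and decimal digits.
    auto rest = start | unicode.Lm | unicode.Mn | unicode.Me
              | unicode.Mc | unicode.Nd;

    bool first = true;
    foreach (dchar c; name) // decodes the UTF-8 string code point by code point
    {
        const ok = c == '_' || (first ? start[c] : rest[c]);
        if (!ok)
            return false;
        first = false;
    }
    return true;
}

void main()
{
    foreach (name; ["Шä", "x∂", "_tmp1", "1abc"])
        writeln(name, " -> ", isProposedIdentifier(name));
}
```

Note that '∂' (general category Sm) is still rejected under this proposal, which matches the point made elsewhere in the thread that math symbols are not letters.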
Re: Why is BOM required to use unicode in tokens?
On Wednesday, 16 September 2020 at 00:22:15 UTC, Steven Schveighoffer wrote: Someone should verify that the character you want to use for a symbol name is actually considered a letter or not. Using phobos to prove this is kind of self-defeating, as I'm pretty sure it would be in league with DMD if there is a bug. UnicodeData.txt (a data file provided by the unicode organization itself since version 1) contains exactly the necessary properties (in an easily parsable format), so we don't need to hard-code the list of allowed identifier characters, but can instead use the latest version provided by unicode (changing every year!). We only need to define which properties a character needs in order to be allowed in an identifier.
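As an illustration of how simple the format is: each UnicodeData.txt record is a semicolon-separated line whose first three fields are the code point (hex), the character name and the general category. A minimal sketch of reading those fields follows; `UcdEntry` and `parseUcdLine` are hypothetical names, and the sample records are truncated to their first three fields.

```d
import std.array : split;
import std.conv : to;
import std.stdio : writefln;

/// The first three fields of a UnicodeData.txt record.
struct UcdEntry
{
    dchar codepoint;
    string name;
    string category; // general category, e.g. "Lu" or "Sm"
}

/// Hypothetical helper: parses one semicolon-separated record.
UcdEntry parseUcdLine(string line)
{
    auto fields = line.split(";");
    // Field 0 is the code point in hexadecimal, without a "U+" prefix.
    return UcdEntry(cast(dchar) fields[0].to!uint(16), fields[1], fields[2]);
}

void main()
{
    // Records truncated to their first three fields for illustration.
    foreach (line; ["0428;CYRILLIC CAPITAL LETTER SHA;Lu",
                    "2202;PARTIAL DIFFERENTIAL;Sm"])
    {
        auto e = parseUcdLine(line);
        writefln("U+%04X %s (%s)", cast(uint) e.codepoint, e.category, e.name);
    }
}
```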
Re: Why is BOM required to use unicode in tokens?
On 9/15/20 8:24 PM, James Blachly wrote: Again with the self-reply :/ Forgot the reference: https://www.unicode.org/versions/Unicode13.0.0/ch22.pdf
Re: Why is BOM required to use unicode in tokens?
On 9/15/20 8:10 PM, James Blachly wrote: Steve: It sounds as if the spec is correct but the glyph (codepoint?) range is outdated. If this is the case, it would be a worthwhile update. Do you really think it would be rejected out of hand? OK, interestingly this code point 0x2202 falls within the range "mathematical operators" [0], and I could see why in general a range called "operators" (which includes e.g. set membership, relations, operators you would see in abstract algebra, etc.) would be excluded from identifiers. However, the first 8 codepoints in the range are "Miscellaneous mathematical symbols" and include several that would be appropriately included as/in token names. Indeed, chapter 22, page 823 of the Unicode standard groups ∂ U+2202 (the partial differential symbol in question) with the "Basic Set of Alphanumeric Characters", which includes Latin 0-9, [a-z,A-Z], uppercase Greek A-Ω, nabla and variant theta, the lowercase Greek letters, and besides U+2202 ∂, six additional glyph variants. Due to de-duplication of code points, some things that may rightly appear in multiple ranges (like U+2202 ∂) are assigned to only one, and that I think is the fate that befell this variant delta.
Re: Why is BOM required to use unicode in tokens?
On 9/15/20 8:10 PM, James Blachly wrote: On 9/15/20 10:59 AM, Steven Schveighoffer wrote: Thanks to Paul, Jon, Dominikus and H.S. for thoughtful responses. What will it take (i.e. order of difficulty) to get this fixed -- will merely a bug report (and PR, not sure if I can tackle or not) do it, or will this require more in-depth discussion with compiler maintainers? I'm thinking your issue will not be fixed (just like we don't allow $abc to be an identifier). But the spec can be fixed to refer to the correct standards. Steve: It sounds as if the spec is correct but the glyph (codepoint?) range is outdated. If this is the case, it would be a worthwhile update. Do you really think it would be rejected out of hand? I don't really know the answer, as I'm not a unicode expert. Someone should verify that the character you want to use for a symbol name is actually considered a letter or not. Using phobos to prove this is kind of self-defeating, as I'm pretty sure it would be in league with DMD if there is a bug. But if it's not a letter, then it would take more than just updating the range. It would be a change in the philosophy of what constitutes an identifier name. -Steve
Re: Why is BOM required to use unicode in tokens?
On 9/15/20 10:59 AM, Steven Schveighoffer wrote: Thanks to Paul, Jon, Dominikus and H.S. for thoughtful responses. What will it take (i.e. order of difficulty) to get this fixed -- will merely a bug report (and PR, not sure if I can tackle or not) do it, or will this require more in-depth discussion with compiler maintainers? I'm thinking your issue will not be fixed (just like we don't allow $abc to be an identifier). But the spec can be fixed to refer to the correct standards. -Steve Steve: It sounds as if the spec is correct but the glyph (codepoint?) range is outdated. If this is the case, it would be a worthwhile update. Do you really think it would be rejected out of hand?
Re: Why is BOM required to use unicode in tokens?
On Tuesday, 15 September 2020 at 21:27:25 UTC, Ola Fosheim Grøstad wrote: On Tuesday, 15 September 2020 at 01:49:13 UTC, James Blachly wrote: I wish to write a function including ∂x and ∂y (these are You can use the greek letter delta instead: δ Wouldn't that imply a normal differential?
Re: Why is BOM required to use unicode in tokens?
On Tuesday, 15 September 2020 at 01:49:13 UTC, James Blachly wrote: I wish to write a function including ∂x and ∂y (these are You can use the greek letter delta instead: δ
Re: Why is BOM required to use unicode in tokens?
On Tuesday, 15 September 2020 at 14:59:03 UTC, Steven Schveighoffer wrote:
On 9/15/20 10:18 AM, James Blachly wrote:
What will it take (i.e. order of difficulty) to get this fixed -- will merely a bug report (and PR, not sure if I can tackle or not) do it, or will this require more in-depth discussion with compiler maintainers?

I'm thinking your issue will not be fixed (just like we don't allow $abc to be an identifier). But the spec can be fixed to refer to the correct standards.

Looks like it has to do with the '∂' character. But non-ascii alphabetic characters work generally.

# The 'Ш' and 'ä' characters are fine.
$ echo $'import std.stdio; void Шä() { writeln("Hello World!"); } void main() { Шä(); }' | dmd -run -
Hello World!

# But not '∂'
$ echo $'import std.stdio; void x∂() { writeln("Hello World!"); } void main() { x∂(); }' | dmd -run -
__stdin.d(1): Error: char 0x2202 not allowed in identifier
__stdin.d(1): Error: character 0x2202 is not a valid token
__stdin.d(1): Error: char 0x2202 not allowed in identifier
__stdin.d(1): Error: character 0x2202 is not a valid token

However, 'Ш' and 'ä' satisfy the definition of a Unicode letter, '∂' does not. (Using D's current Unicode definitions). I'll use tsv-filter (from tsv-utils) to show this rather than writing out the full D code. But, this uses std.regex.matchFirst().

# The input
$ echo $'x\n∂\nШ\nä'
x
∂
Ш
ä

# The input filtered by Unicode letter '\p{L}'
$ echo $'x\n∂\nШ\nä' | tsv-filter --regex 1:'^\p{L}$'
x
Ш
ä

The spec can be made more clear and correct. But if a "universal alpha" is essentially about Unicode letters you might be looking for a change in the spec to use the symbol chosen.

--Jon
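The same letter check can be written directly in D with std.regex.matchFirst(), which the post says tsv-filter uses. A minimal sketch, assuming the same four one-character inputs; `isSingleLetter` is a hypothetical helper name and the regex is compiled on every call for brevity.

```d
import std.regex : matchFirst, regex;
import std.stdio : writeln;

/// Hypothetical helper mirroring the tsv-filter call: true if the whole
/// line is exactly one code point matching '\p{L}' (Unicode letter).
bool isSingleLetter(string line)
{
    auto letter = regex(`^\p{L}$`);
    return !matchFirst(line, letter).empty;
}

void main()
{
    // x, ∂ (U+2202), Ш (U+0428), ä (U+00E4, precomposed)
    foreach (line; ["x", "\u2202", "\u0428", "\u00E4"])
        if (isSingleLetter(line))
            writeln(line);
}
```

The escapes are used so that 'ä' is guaranteed to be the single precomposed code point; a decomposed 'a' plus combining diaeresis would be two code points and would not match this pattern.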
Re: Why is BOM required to use unicode in tokens?
On 9/15/20 10:18 AM, James Blachly wrote: On 9/15/20 4:36 AM, Dominikus Dittes Scherkl wrote: On Tuesday, 15 September 2020 at 06:49:08 UTC, Jon Degenhardt wrote: On Tuesday, 15 September 2020 at 02:23:31 UTC, Paul Backus wrote: Identifiers start with a letter, _, or universal alpha, and are followed by any number of letters, _, digits, or universal alphas. Universal alphas are as defined in ISO/IEC 9899:1999(E) Appendix D of the C99 Standard. I was unable to find the definition of a "universal alpha", or whether that includes non-ascii alphabetic characters. ISO/IEC 9899:1999 (E) Annex D Universal character names for identifiers - --- This is outdated to the brim. Also it doesn't allow for letter-like symbols (which is debatable, but especially the mathematical ones like double-struck letters are intended for such use). Instead of some old C-Standard, D should better rely directly on the properties from UnicodeData.txt, which is updated with every new unicode version. Thanks to Paul, Jon, Dominikus and H.S. for thoughtful responses. What will it take (i.e. order of difficulty) to get this fixed -- will merely a bug report (and PR, not sure if I can tackle or not) do it, or will this require more in-depth discussion with compiler maintainers? I'm thinking your issue will not be fixed (just like we don't allow $abc to be an identifier). But the spec can be fixed to refer to the correct standards. -Steve
Re: Why is BOM required to use unicode in tokens?
On 9/15/20 4:36 AM, Dominikus Dittes Scherkl wrote: On Tuesday, 15 September 2020 at 06:49:08 UTC, Jon Degenhardt wrote: On Tuesday, 15 September 2020 at 02:23:31 UTC, Paul Backus wrote: Identifiers start with a letter, _, or universal alpha, and are followed by any number of letters, _, digits, or universal alphas. Universal alphas are as defined in ISO/IEC 9899:1999(E) Appendix D of the C99 Standard. I was unable to find the definition of a "universal alpha", or whether that includes non-ascii alphabetic characters. ISO/IEC 9899:1999 (E) Annex D Universal character names for identifiers - ... --- This is outdated to the brim. Also it doesn't allow for letter-like symbols (which is debatable, but especially the mathematical ones like double-struck letters are intended for such use). Instead of some old C-Standard, D should better rely directly on the properties from UnicodeData.txt, which is updated with every new unicode version. Thanks to Paul, Jon, Dominikus and H.S. for thoughtful responses. What will it take (i.e. order of difficulty) to get this fixed -- will merely a bug report (and PR, not sure if I can tackle or not) do it, or will this require more in-depth discussion with compiler maintainers? James
Re: Why is BOM required to use unicode in tokens?
On Tuesday, 15 September 2020 at 06:49:08 UTC, Jon Degenhardt wrote:
On Tuesday, 15 September 2020 at 02:23:31 UTC, Paul Backus wrote:
Identifiers start with a letter, _, or universal alpha, and are followed by any number of letters, _, digits, or universal alphas. Universal alphas are as defined in ISO/IEC 9899:1999(E) Appendix D of the C99 Standard.

I was unable to find the definition of a "universal alpha", or whether that includes non-ascii alphabetic characters.

ISO/IEC 9899:1999 (E) Annex D Universal character names for identifiers -

Latin: 00AA, 00BA, 00C0-00D6, 00D8-00F6, 00F8-01F5, 01FA-0217, 0250-02A8, 1E00-1E9B, 1EA0-1EF9, 207F
Greek: 0386, 0388-038A, 038C, 038E-03A1, 03A3-03CE, 03D0-03D6, 03DA, 03DC, 03DE, 03E0, 03E2-03F3, 1F00-1F15, 1F18-1F1D, 1F20-1F45, 1F48-1F4D, 1F50-1F57, 1F59, 1F5B, 1F5D, 1F5F-1F7D, 1F80-1FB4, 1FB6-1FBC, 1FC2-1FC4, 1FC6-1FCC, 1FD0-1FD3, 1FD6-1FDB, 1FE0-1FEC, 1FF2-1FF4, 1FF6-1FFC
Cyrillic: 0401-040C, 040E-044F, 0451-045C, 045E-0481, 0490-04C4, 04C7-04C8, 04CB-04CC, 04D0-04EB, 04EE-04F5, 04F8-04F9
Armenian: 0531-0556, 0561-0587
Hebrew: 05B0-05B9, 05BB-05BD, 05BF, 05C1-05C2, 05D0-05EA, 05F0-05F2
Arabic: 0621-063A, 0640-0652, 0670-06B7, 06BA-06BE, 06C0-06CE, 06D0-06DC, 06E5-06E8, 06EA-06ED
Devanagari: 0901-0903, 0905-0939, 093E-094D, 0950-0952, 0958-0963
Bengali: 0981-0983, 0985-098C, 098F-0990, 0993-09A8, 09AA-09B0, 09B2, 09B6-09B9, 09BE-09C4, 09C7-09C8, 09CB-09CD, 09DC-09DD, 09DF-09E3, 09F0-09F1
Gurmukhi: 0A02, 0A05-0A0A, 0A0F-0A10, 0A13-0A28, 0A2A-0A30, 0A32-0A33, 0A35-0A36, 0A38-0A39, 0A3E-0A42, 0A47-0A48, 0A4B-0A4D, 0A59-0A5C, 0A5E, 0A74
Gujarati: 0A81-0A83, 0A85-0A8B, 0A8D, 0A8F-0A91, 0A93-0AA8, 0AAA-0AB0, 0AB2-0AB3, 0AB5-0AB9, 0ABD-0AC5, 0AC7-0AC9, 0ACB-0ACD, 0AD0, 0AE0
Oriya: 0B01-0B03, 0B05-0B0C, 0B0F-0B10, 0B13-0B28, 0B2A-0B30, 0B32-0B33, 0B36-0B39, 0B3E-0B43, 0B47-0B48, 0B4B-0B4D, 0B5C-0B5D, 0B5F-0B61
Tamil: 0B82-0B83, 0B85-0B8A, 0B8E-0B90, 0B92-0B95, 0B99-0B9A, 0B9C, 0B9E-0B9F, 0BA3-0BA4, 0BA8-0BAA, 0BAE-0BB5, 0BB7-0BB9, 0BBE-0BC2, 0BC6-0BC8, 0BCA-0BCD
Telugu: 0C01-0C03, 0C05-0C0C, 0C0E-0C10, 0C12-0C28, 0C2A-0C33, 0C35-0C39, 0C3E-0C44, 0C46-0C48, 0C4A-0C4D, 0C60-0C61
Kannada: 0C82-0C83, 0C85-0C8C, 0C8E-0C90, 0C92-0CA8, 0CAA-0CB3, 0CB5-0CB9, 0CBE-0CC4, 0CC6-0CC8, 0CCA-0CCD, 0CDE, 0CE0-0CE1
Malayalam: 0D02-0D03, 0D05-0D0C, 0D0E-0D10, 0D12-0D28, 0D2A-0D39, 0D3E-0D43, 0D46-0D48, 0D4A-0D4D, 0D60-0D61
Thai: 0E01-0E3A, 0E40-0E5B
Lao: 0E81-0E82, 0E84, 0E87-0E88, 0E8A, 0E8D, 0E94-0E97, 0E99-0E9F, 0EA1-0EA3, 0EA5, 0EA7, 0EAA-0EAB, 0EAD-0EAE, 0EB0-0EB9, 0EBB-0EBD, 0EC0-0EC4, 0EC6, 0EC8-0ECD, 0EDC-0EDD
Tibetan: 0F00, 0F18-0F19, 0F35, 0F37, 0F39, 0F3E-0F47, 0F49-0F69, 0F71-0F84, 0F86-0F8B, 0F90-0F95, 0F97, 0F99-0FAD, 0FB1-0FB7, 0FB9
Georgian: 10A0-10C5, 10D0-10F6
Hiragana: 3041-3093, 309B-309C
Katakana: 30A1-30F6, 30FB-30FC
Bopomofo: 3105-312C
CJK Unified Ideographs: 4E00-9FA5
Hangul: AC00-D7A3
Digits: 0660-0669, 06F0-06F9, 0966-096F, 09E6-09EF, 0A66-0A6F, 0AE6-0AEF, 0B66-0B6F, 0BE7-0BEF, 0C66-0C6F, 0CE6-0CEF, 0D66-0D6F, 0E50-0E59, 0ED0-0ED9, 0F20-0F33
Special characters: 00B5, 00B7, 02B0-02B8, 02BB, 02BD-02C1, 02D0-02D1, 02E0-02E4, 037A, 0559, 093D, 0B3D, 1FBE, 203F-2040, 2102, 2107, 210A-2113, 2115, 2118-211D, 2124, 2126, 2128, 212A-2131, 2133-2138, 2160-2182, 3005-3007, 3021-3029

---

This is outdated to the brim. Also it doesn't allow for letter-like symbols (which is debatable, but especially the mathematical ones like double-struck letters are intended for such use). Instead of some old C-Standard, D should better rely directly on the properties from UnicodeData.txt, which is updated with every new unicode version.
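A hard-coded list like this amounts to a range-table lookup. The sketch below shows the idea with only a small excerpt of the Latin, Greek and Cyrillic ranges quoted above; `CodeRange`, `annexDExcerpt` and `inExcerpt` are hypothetical names, and this is not a claim about how DMD stores or searches its table.

```d
import std.stdio : writefln;

/// Inclusive range of code points.
struct CodeRange
{
    uint lo, hi;
}

// Excerpt only: a few of the ranges from the Annex D list quoted above.
immutable CodeRange[] annexDExcerpt = [
    CodeRange(0x00C0, 0x00D6), // Latin
    CodeRange(0x00D8, 0x00F6), // Latin
    CodeRange(0x00F8, 0x01F5), // Latin
    CodeRange(0x03A3, 0x03CE), // Greek
    CodeRange(0x040E, 0x044F), // Cyrillic
];

/// Hypothetical helper: true if `c` falls in one of the excerpted ranges.
bool inExcerpt(dchar c)
{
    foreach (r; annexDExcerpt)
        if (r.lo <= c && c <= r.hi)
            return true;
    return false;
}

void main()
{
    // ä, Ш, δ and ∂
    foreach (dchar c; "\u00E4\u0428\u03B4\u2202"d)
        writefln("U+%04X in excerpt: %s", cast(uint) c, inExcerpt(c));
}
```

A linear scan is enough for a sketch; a sorted full table would normally be searched with a binary search instead.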
Re: Why is BOM required to use unicode in tokens?
On Tuesday, 15 September 2020 at 02:23:31 UTC, Paul Backus wrote: On Tuesday, 15 September 2020 at 01:49:13 UTC, James Blachly wrote: I wish to write a function including ∂x and ∂y (these are trivial to type with appropriate keyboard shortcuts - alt+d on Mac), but without a unicode byte order mark at the beginning of the file, the lexer rejects the tokens. It is not apparently easy to insert such marks (AFAICT no common tool does this specifically), while other languages work fine (i.e., accept unicode in their source) without it. Is there a downside to at least presuming UTF-8? According to the spec [1] this should Just Work. I'd recommend filing a bug. [1] https://dlang.org/spec/lex.html#source_text Under the identifiers section (https://dlang.org/spec/lex.html#identifiers) it describes identifiers as: Identifiers start with a letter, _, or universal alpha, and are followed by any number of letters, _, digits, or universal alphas. Universal alphas are as defined in ISO/IEC 9899:1999(E) Appendix D of the C99 Standard. I was unable to find the definition of a "universal alpha", or whether that includes non-ascii alphabetic characters.
Re: Why is BOM required to use unicode in tokens?
On Mon, Sep 14, 2020 at 09:49:13PM -0400, James Blachly via Digitalmars-d-learn wrote:
> I wish to write a function including ∂x and ∂y (these are trivial to
> type with appropriate keyboard shortcuts - alt+d on Mac), but without
> a unicode byte order mark at the beginning of the file, the lexer
> rejects the tokens.
>
> It is not apparently easy to insert such marks (AFAICT no common tool
> does this specifically), while other languages work fine (i.e., accept
> unicode in their source) without it.
>
> Is there a downside to at least presuming UTF-8?

Tested it locally, with and without BOM; the lexer rejects ∂ as a valid token. I suspect the reason has nothing to do with BOMs, but with the fact that ∂ is not classified as an alphanumeric (see std.uni.isAlpha, which returns false for ∂).

The following code, which contains Cyrillic letters, compiles just fine without BOM (std.uni.isAlpha('Ш') returns true):

    import std.stdio;

    void main() {
        int Ш = 1;
        writeln(Ш);
    }

As the docs for std.uni.isAlpha state, it tests for general Unicode category 'Alphabetic'. Probably identifiers are restricted to characters of this category plus the numerics and '_' (and maybe one or two others, perhaps '$'? Don't remember now).

T

--
People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANG
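The std.uni classification mentioned here is easy to check. The sketch below prints isAlpha for a few characters and, as an addition not taken from the post, also prints std.uni.isSymbol, which covers the symbol categories (Sm, Sc, Sk, So) that '∂' belongs to. Results depend on the Unicode tables of the Phobos version in use.

```d
import std.stdio : writefln;
import std.uni : isAlpha, isSymbol;

void main()
{
    // x, Ш (U+0428), ä (U+00E4) and ∂ (U+2202)
    foreach (dchar c; "x\u0428\u00E4\u2202"d)
        writefln("U+%04X alpha=%s symbol=%s",
                 cast(uint) c, isAlpha(c), isSymbol(c));
}
```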
Re: Why is BOM required to use unicode in tokens?
On Tuesday, 15 September 2020 at 01:49:13 UTC, James Blachly wrote: I wish to write a function including ∂x and ∂y (these are trivial to type with appropriate keyboard shortcuts - alt+d on Mac), but without a unicode byte order mark at the beginning of the file, the lexer rejects the tokens. It is not apparently easy to insert such marks (AFAICT no common tool does this specifically), while other languages work fine (i.e., accept unicode in their source) without it. Is there a downside to at least presuming UTF-8? According to the spec [1] this should Just Work. I'd recommend filing a bug. [1] https://dlang.org/spec/lex.html#source_text