On Mar 28, 10 18:56, Yigal Chripun wrote:
KennyTM~ Wrote:

On Mar 26, 10 18:52, yigal chripun wrote:
KennyTM~ Wrote:

On Mar 26, 10 05:46, yigal chripun wrote:

while it's true that '?' has one Unicode value for it, it's not true for all 
sorts of diacritics and combining code points. So your approach is to pass the 
responsibility for that to the end user, who in 99.9999% of cases will not handle 
this correctly.


Non-issue. Since when can a character literal store > 1 code-point?

character != code-point

D chars are, as you say, really code points and not always complete characters.

here's a use case for you:
you want to write a fully unicode aware search engine.
If you just try to match the given sequence of code points in the search term, 
you will miss valid matches since, for instance, you do not take into account 
permutations of the order of combining marks.
You can't just assume that the code-point value identifies the character.
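The combining-marks point can be sketched in Python (chosen here only because it ships a Unicode normalization module in the standard library; the issue itself is language-agnostic):

```python
import unicodedata

# "e-acute" can be spelled as one precomposed code point (U+00E9)
# or as "e" followed by a combining acute accent (U+0301).
precomposed = "\u00e9"
combining = "e\u0301"

# A naive code-point comparison misses the match...
print(precomposed == combining)  # False

# ...but NFC normalization maps both spellings to the same
# code-point sequence, so the match is found.
print(unicodedata.normalize("NFC", combining) == precomposed)  # True
```

A search engine that compares raw code-point sequences without normalizing first would report the first result and miss the match.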

Stop being off-topic. '?' is of type char, not string. A char always
holds an octet of a UTF-8 encoded sequence. Its numerical content is
unique and well-defined*. Therefore adding 4 to '?' also has a meaning.

* If you're paranoid you may request the spec to ensure the character is
in NFC form.

Huh? You jump into the middle of a conversation and I'm off-topic?


Yes. The original discussion is about implicit conversion, which leads to whether ('x' + 1) is semantically correct. How is this related to a search engine?

(Technically even this is off-topic. The title said implicit *enum* conversion.)
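The ('x' + 1) question can be mirrored with Python's ord/chr pair (Python has no char type, so this is only an analogy for the numeric view of a code unit):

```python
# When char is treated as a numeric code unit, ('x' + 1) has a
# well-defined result: the next code point.
print(chr(ord('x') + 1))  # 'y'

# The same arithmetic on an arbitrary code point, however, can land
# on an unassigned or otherwise surprising character -- arithmetic is
# defined on the numbers, not on the characters they happen to name.
```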

Now, to get back to the topic at hand:

D's current design is:
char/dchar/wchar are integral types that can contain any value/encoding, even 
though D prefers Unicode. This is not enforced.
e.g. you can have a valid wchar, increment it by 1, and get an invalid 
wchar.
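A concrete instance of that increment hazard, sketched in Python (the arithmetic is the point, not the language): 0xD7FF is a valid Unicode scalar value, while 0xD800, one step up, is a lone high surrogate that no UTF can encode on its own.

```python
valid = 0xD7FF        # a valid Unicode scalar value / UTF-16 code unit
invalid = valid + 1   # 0xD800, a lone high surrogate

chr(valid)            # fine: a well-formed code point

try:
    # Lone surrogates are not encodable in any UTF.
    chr(invalid).encode("utf-16")
except UnicodeEncodeError:
    print("0xD800 cannot be encoded on its own")
```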


Wrong. Read the specs: http://digitalmars.com/d/1.0/type.html, http://digitalmars.com/d/2.0/type.html

 * char  = unsigned 8 bit UTF-8
 * wchar = unsigned 16 bit UTF-16
 * dchar = unsigned 32 bit UTF-32

To contain any encoding, use ubyte.
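The three code-unit widths in that list can be checked with Python's codecs, which expose the same three encodings (Python here is just a convenient calculator; D's char/wchar/dchar correspond to one UTF-8/UTF-16/UTF-32 code unit respectively):

```python
s = "h\u00e9llo"  # "héllo": 5 characters, one of them non-ASCII

# Number of code units needed in each encoding:
print(len(s.encode("utf-8")))           # 6  (é takes two UTF-8 code units)
print(len(s.encode("utf-16-le")) // 2)  # 5  (one 16-bit unit each, here)
print(len(s.encode("utf-32-le")) // 4)  # 5  (always one 32-bit unit each)
```

This is why a single D char cannot in general hold a whole character: it holds one UTF-8 code unit, and a character may need several.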

Instead, let's have proper, well-defined semantics in D:

Design A:
char/wchar/dchar are defined to be Unicode code points in the respective 
encodings. This is enforced by the language, so if you want to use a 
different encoding you must use something like bits!8.
Arithmetic on code points is defined according to the Unicode standard.

Design B:
char represents a (perhaps multi-byte) character.
Arithmetic on this type is *not* defined.

In either case these types should not be treated as plain integral types.
