> Does Java String support all unicode characters and particularly does it
> support more characters than UTF-8
It’s not about Java, it’s about the UTF-8 standard. Please take a look at [1]:

> In November 2003, UTF-8 was restricted by RFC 3629 to match the constraints
> of the UTF-16 character encoding: explicitly prohibiting code points
> corresponding to the high and low surrogate characters removed more than 3%
> of the three-byte sequences, and ending at U+10FFFF removed more than 48% of
> the four-byte sequences and all five- and six-byte sequences.

And [2]:

> The definition of UTF-8 prohibits encoding character numbers between U+D800
> and U+DFFF, which are reserved for use with the UTF-16 encoding form (as
> surrogate pairs) and do not directly represent characters.

Actually, we already have some modes to support this restriction of UTF-8.
Please take a look at BinaryUtils#utf8BytesToStr [3].

[1] https://en.wikipedia.org/wiki/UTF-8
[2] https://datatracker.ietf.org/doc/html/rfc3629
[3] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/binary/BinaryUtils.java#L2387

> On 13 Dec 2021, at 13:57, Ivan Pavlukhin <[email protected]> wrote:
>
>> UTF-8 can’t encode all UNICODE characters.
>
> Nikolay, could you please elaborate? My understanding is that encoding
> we speak about matters for conversion from byte arrays to strings.
> Does Java String support all unicode characters and particularly does
> it support more characters than UTF-8 (I am not saying here that java
> String uses UTF-8)?
>
> 2021-12-13 12:56 GMT+03:00, Ivan Daschinsky <[email protected]>:
>> UTF-8 is already a default encoding in our BinaryObject format. So.... I am
>> for unification.
>>
>> Mon, 13 Dec 2021 at 12:50, Nikolay Izhikov <[email protected]>:
>>
>>> Hello, Ivan.
>>>
>>> UTF-8 can’t encode all UNICODE characters.
>>>
>>>> On 13 Dec 2021, at 12:49, Ivan Daschinsky <[email protected]> wrote:
>>>>
>>>> Khm, maybe a better variant is to enforce all strings to be encoded in
>>>> UTF-8?
>>>> AFAIK multi OS cluster is a quite common case.
>>>>
>>>>
>>>> Mon, 13 Dec 2021 at 11:36, Mikhail Petrov <[email protected]>:
>>>>
>>>>> Igniters,
>>>>>
>>>>> Recently we faced the problem that if the cluster consists of nodes
>>>>> running in the JVM with different encodings, many issues arise.
>>>>> The root cause of the mentioned issues is components that use
>>>>> `String#getBytes()` and `new String(<byte array>)`, which relies on the
>>>>> system default encoding. Thus, if a string is deserialized on a node
>>>>> with a different encoding from the one that serialized it, the
>>>>> deserialized string can be different from the original one.
>>>>>
>>>>> For example:
>>>>>
>>>>> Serialization/deserialization of string in communication messages may be
>>>>> broken for some strings on nodes running in a JVM with a different
>>>>> encoding as DirectByteBufferStreamImplV2 uses String#getBytes() to
>>>>> serialize strings - [1]
>>>>>
>>>>> Or the IgniteAuthenticationProcessor can compute different security IDs
>>>>> for the user on different nodes in this case - [2]
>>>>>
>>>>> What do you think, if we solve this problem globally, by rejecting to
>>>>> join nodes that run on JVMs with different encodings?
>>>>>
>>>>> As a result, we will be sure that all cluster nodes have the same
>>>>> encoding and all related problems will be solved.
>>>>>
>>>>> [1] - https://issues.apache.org/jira/browse/IGNITE-16106
>>>>> [2] - https://issues.apache.org/jira/browse/IGNITE-16068
>>>>>
>>>>> --
>>>>> Mikhail
>>>>>
>>>>>
>>>>
>>>> --
>>>> Sincerely yours, Ivan Daschinskiy
>>>
>>>
>>
>> --
>> Sincerely yours, Ivan Daschinskiy
>>
>
>
> --
>
> Best regards,
> Ivan Pavlukhin
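
To make the two points of this thread concrete, here is a minimal sketch in plain Java. It is only an illustration, not Ignite code: the class `EncodingDemo` and the helpers `transfer` and `isValidForUtf8` are hypothetical names, and ISO-8859-1 merely stands in for "a node with a different default encoding". It shows that a string encoded with one charset and decoded with another can come back changed, and that the only Java strings UTF-8 cannot encode are those containing unpaired surrogates (U+D800..U+DFFF). Properly paired surrogates, i.e. supplementary characters, encode fine.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    /** Simulates sending a string between two nodes: encode with one charset, decode with another. */
    static String transfer(String s, Charset sender, Charset receiver) {
        return new String(s.getBytes(sender), receiver);
    }

    /** True if a strict UTF-8 encoder accepts the string, i.e. it contains no unpaired surrogates. */
    static boolean isValidForUtf8(String s) {
        try {
            // A fresh encoder reports malformed input; String#getBytes would silently substitute '?'.
            StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
            return true;
        }
        catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Cyrillic sample written with escapes so the source file encoding does not matter.
        String cyrillic = "\u041F\u0440\u0438\u0432\u0435\u0442";
        String emoji = "\uD83D\uDE00"; // U+1F600 as a valid surrogate pair
        String lone = "a\uD800b";      // unpaired high surrogate

        Charset utf8 = StandardCharsets.UTF_8;
        Charset latin1 = StandardCharsets.ISO_8859_1;

        // Same charset on both sides: the string survives.
        System.out.println(cyrillic.equals(transfer(cyrillic, utf8, utf8)));   // true
        // Different charsets on sender and receiver: the string is corrupted.
        System.out.println(cyrillic.equals(transfer(cyrillic, utf8, latin1))); // false

        // Supplementary characters are fine in UTF-8 (four bytes here).
        System.out.println(isValidForUtf8(emoji));                             // true
        // An unpaired surrogate is a legal Java String but not encodable as UTF-8.
        System.out.println(isValidForUtf8(lone));                              // false
        System.out.println(transfer(lone, utf8, utf8));                        // a?b
    }
}
```

Passing an explicit `Charset` to `String#getBytes(Charset)` and `new String(byte[], Charset)`, as `transfer` does, removes the dependency on the JVM default encoding regardless of which charset is eventually chosen.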
