>> But now multiple components independently serialize strings for their
>> needs and use default encoding for this.
>> For example DirectByteBufferStreamImplV2#writeString,
>> MetaStorage#writeRaw and so on

We should fix all of them.

>> BinaryUtils#utf8BytesToStr

Let's use this everywhere.

As for me, I expect far more problems from enforcing a rule that fails node
joins than from making all components use UTF-8. Some weird cases (surrogate
pairs) we can (I strongly believe it is OK) simply not consider at all.

Mon, Dec 13, 2021 at 15:15, Nikolay Izhikov <nizhi...@apache.org>:

> > Does Java String support all unicode characters and particularly does it
> > support more characters than UTF-8
>
> It’s not about Java, it’s about the UTF-8 standard.
>
> Please, take a look at [1]
>
> > In November 2003, UTF-8 was restricted by RFC 3629 to match the
> > constraints of the UTF-16 character encoding: explicitly prohibiting code
> > points corresponding to the high and low surrogate characters removed more
> > than 3% of the three-byte sequences, and ending at U+10FFFF removed more
> > than 48% of the four-byte sequences and all five- and six-byte sequences.
>
> And [2]
>
> > The definition of UTF-8 prohibits encoding character numbers between
> > U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form
> > (as surrogate pairs) and do not directly represent characters.
>
> Actually, we already have some modes to support this restriction of UTF-8.
> Please, take a look at BinaryUtils#utf8BytesToStr [3]
>
> [1] https://en.wikipedia.org/wiki/UTF-8
> [2] https://datatracker.ietf.org/doc/html/rfc3629
> [3] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/binary/BinaryUtils.java#L2387
>
> > On Dec 13, 2021, at 13:57, Ivan Pavlukhin <vololo...@gmail.com> wrote:
> >
> >> UTF-8 can’t encode all UNICODE characters.
> >
> > Nikolay, could you please elaborate? My understanding is that the encoding
> > we speak about matters for conversion from byte arrays to strings.
> > Does Java String support all unicode characters and particularly does
> > it support more characters than UTF-8 (I am not saying here that java
> > String uses UTF-8)?
> >
> > 2021-12-13 12:56 GMT+03:00, Ivan Daschinsky <ivanda...@gmail.com>:
> >> UTF-8 is already a default encoding in our BinaryObject format. So... I am
> >> for unification.
> >>
> >> Mon, Dec 13, 2021 at 12:50, Nikolay Izhikov <nizhi...@apache.org>:
> >>
> >>> Hello, Ivan.
> >>>
> >>> UTF-8 can’t encode all UNICODE characters.
> >>>
> >>>> On Dec 13, 2021, at 12:49, Ivan Daschinsky <ivanda...@gmail.com> wrote:
> >>>>
> >>>> Khm, maybe a better variant is to enforce all strings to be encoded in
> >>>> UTF-8?
> >>>> AFAIK a multi-OS cluster is quite a common case.
> >>>>
> >>>>
> >>>> Mon, Dec 13, 2021 at 11:36, Mikhail Petrov <pmgheap....@gmail.com>:
> >>>>
> >>>>> Igniters,
> >>>>>
> >>>>> Recently we faced the problem that if the cluster consists of nodes
> >>>>> running in JVMs with different encodings, many issues arise.
> >>>>> The root cause of the mentioned issues is components that use
> >>>>> `String#getBytes()` and `new String(<byte array>)`, which rely on
> >>>>> the system default encoding. Thus, if a string is deserialized on a
> >>>>> node with a different encoding from the one that serialized it, the
> >>>>> deserialized string can be different from the original one.
> >>>>>
> >>>>> For example:
> >>>>>
> >>>>> Serialization/deserialization of strings in communication messages may
> >>>>> be broken for some strings on nodes running in a JVM with a different
> >>>>> encoding, as DirectByteBufferStreamImplV2 uses String#getBytes() to
> >>>>> serialize strings - [1]
> >>>>>
> >>>>> Or the IgniteAuthenticationProcessor can compute different security
> >>>>> IDs for the user on different nodes in this case - [2]
> >>>>>
> >>>>> What do you think about solving this problem globally, by refusing to
> >>>>> let nodes join if they run on JVMs with different encodings?
> >>>>>
> >>>>> As a result, we will be sure that all cluster nodes have the same
> >>>>> encoding and all related problems will be solved.
> >>>>>
> >>>>> [1] - https://issues.apache.org/jira/browse/IGNITE-16106
> >>>>> [2] - https://issues.apache.org/jira/browse/IGNITE-16068
> >>>>>
> >>>>> --
> >>>>> Mikhail
> >>>>>
> >>>>
> >>>> --
> >>>> Sincerely yours, Ivan Daschinskiy
> >>>
> >>
> >> --
> >> Sincerely yours, Ivan Daschinskiy
> >
> > --
> > Best regards,
> > Ivan Pavlukhin

--
Sincerely yours, Ivan Daschinskiy
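
To make the two points debated in this thread concrete, here is a minimal sketch in plain Java. It is not Ignite code: the class `EncodingRoundTrip` and the helper `roundTrip` are hypothetical names invented for illustration. The explicit charsets stand in for the JVM default charsets of two different nodes, which is an assumption of the demo. It shows (a) that encoding with one charset and decoding with another changes the string, while an explicit UTF-8 on both sides preserves it, and (b) that a valid surrogate pair survives UTF-8, whereas a lone surrogate does not.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Ignite.
class EncodingRoundTrip {
    /**
     * Encodes with one charset and decodes with another, imitating
     * String#getBytes() on the sending node and new String(byte[]) on a
     * receiving node whose JVM default charset differs.
     */
    static String roundTrip(String s, Charset sender, Charset receiver) {
        byte[] bytes = s.getBytes(sender);
        return new String(bytes, receiver);
    }

    public static void main(String[] args) {
        Charset utf8 = StandardCharsets.UTF_8;
        Charset latin1 = StandardCharsets.ISO_8859_1;

        // "cafe" with an acute accent, escaped so the source file encoding does not matter.
        String text = "caf\u00e9";

        // Mismatched charsets: the two UTF-8 bytes of U+00E9 are decoded
        // as two separate Latin-1 characters, so the string changes.
        System.out.println(text.equals(roundTrip(text, utf8, latin1))); // false

        // The same explicit charset on both sides: the string survives.
        System.out.println(text.equals(roundTrip(text, utf8, utf8))); // true

        // A valid surrogate pair (U+1F600) is one code point and round-trips.
        String pair = "\uD83D\uDE00";
        System.out.println(pair.equals(roundTrip(pair, utf8, utf8))); // true

        // A lone surrogate has no UTF-8 encoding; getBytes substitutes '?'.
        String lone = "\uD800";
        System.out.println(roundTrip(lone, utf8, utf8)); // ?
    }
}
```

Under these assumptions, only unpaired surrogates are lost when UTF-8 is used explicitly on both sides, so the cases that a unified UTF-8 encoding cannot cover are limited to strings that are already malformed UTF-16.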