> I guess Nikolay is talking about the problem with UTF-8 in case the string contains unpaired surrogate symbols
Folks, give me a clue: why is this a problem? Naively, it looks like a good restriction rather than a problem. What problems can it cause in practice?

2021-12-13 16:32 GMT+03:00, Ilya Kasnacheev <ilya.kasnach...@gmail.com>:
> Hello!
>
> We already have a warning about this, see IgniteKernal.checkFileEncoding()
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> Mon, Dec 13, 2021 at 16:26, Ivan Daschinsky <ivanda...@gmail.com>:
>
>> >> But now multiple components independently serialize strings for their needs and use default encoding for this.
>> >> For example DirectByteBufferStreamImplV2#writeString, MetaStorage#writeRaw and so on
>> We should fix all of them.
>>
>> >> BinaryUtils#utf8BytesToStr
>> Let's use this everywhere.
>>
>> As for me, I expect far more problems from enforcing a rule that fails the join than from making all components use UTF-8.
>> Some weird cases (surrogate pairs) we can simply not consider at all (I strongly believe that is OK).
>>
>> Mon, Dec 13, 2021 at 15:15, Nikolay Izhikov <nizhi...@apache.org>:
>>
>> > > Does Java String support all unicode characters and particularly does it support more characters than UTF-8
>> >
>> > It’s not about Java, it’s about the UTF-8 standard.
>> >
>> > Please, take a look at [1]
>> >
>> > > In November 2003, UTF-8 was restricted by RFC 3629 to match the constraints of the UTF-16 character encoding: explicitly prohibiting code points corresponding to the high and low surrogate characters removed more than 3% of the three-byte sequences, and ending at U+10FFFF removed more than 48% of the four-byte sequences and all five- and six-byte sequences.
>> >
>> > And [2]
>> >
>> > > The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters.
>> >
>> > Actually, we already have some modes to support this restriction of UTF-8.
>> > Please, take a look at BinaryUtils#utf8BytesToStr [3]
>> >
>> >
>> > [1] https://en.wikipedia.org/wiki/UTF-8
>> > [2] https://datatracker.ietf.org/doc/html/rfc3629
>> > [3] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/binary/BinaryUtils.java#L2387
>> >
>> > > On Dec 13, 2021, at 13:57, Ivan Pavlukhin <vololo...@gmail.com> wrote:
>> > >
>> > >> UTF-8 can’t encode all UNICODE characters.
>> > >
>> > > Nikolay, could you please elaborate? My understanding is that the encoding we speak about matters for conversion from byte arrays to strings. Does Java String support all unicode characters and particularly does it support more characters than UTF-8 (I am not saying here that java String uses UTF-8)?
>> > >
>> > > 2021-12-13 12:56 GMT+03:00, Ivan Daschinsky <ivanda...@gmail.com>:
>> > >> UTF-8 is already a default encoding in our BinaryObject format.
>> > >> So.... I am for unification.
>> > >>
>> > >> Mon, Dec 13, 2021 at 12:50, Nikolay Izhikov <nizhi...@apache.org>:
>> > >>
>> > >>> Hello, Ivan.
>> > >>>
>> > >>> UTF-8 can’t encode all UNICODE characters.
>> > >>>
>> > >>>> On Dec 13, 2021, at 12:49, Ivan Daschinsky <ivanda...@gmail.com> wrote:
>> > >>>>
>> > >>>> Khm, maybe a better variant is to enforce all strings to be encoded in UTF-8?
>> > >>>> AFAIK multi OS cluster is a quite common case.
>> > >>>>
>> > >>>>
>> > >>>> Mon, Dec 13, 2021 at 11:36, Mikhail Petrov <pmgheap....@gmail.com>:
>> > >>>>
>> > >>>>> Igniters,
>> > >>>>>
>> > >>>>> Recently we faced the problem that if the cluster consists of nodes running in the JVM with different encodings, many issues arise.
>> > >>>>> The root cause of the mentioned issues is components that use `String#getBytes()` and `new String(<byte array>)`, which rely on the system default encoding. Thus, if a string is deserialized on a node with a different encoding from the one that serialized it, the deserialized string can be different from the original one.
>> > >>>>>
>> > >>>>> For example:
>> > >>>>>
>> > >>>>> Serialization/deserialization of strings in communication messages may be broken for some strings on nodes running in a JVM with a different encoding, as DirectByteBufferStreamImplV2 uses String#getBytes() to serialize strings - [1]
>> > >>>>>
>> > >>>>> Or the IgniteAuthenticationProcessor can compute different security IDs for the user on different nodes in this case - [2]
>> > >>>>>
>> > >>>>> What do you think, if we solve this problem globally, by rejecting to join nodes that run on JVMs with different encodings?
>> > >>>>>
>> > >>>>> As a result, we will be sure that all cluster nodes have the same encoding and all related problems will be solved.
>> > >>>>>
>> > >>>>> [1] - https://issues.apache.org/jira/browse/IGNITE-16106
>> > >>>>> [2] - https://issues.apache.org/jira/browse/IGNITE-16068
>> > >>>>>
>> > >>>>> --
>> > >>>>> Mikhail
>> > >>>>>
>> > >>>>>
>> > >>>>
>> > >>>> --
>> > >>>> Sincerely yours, Ivan Daschinskiy
>> > >>>
>> > >>>
>> > >>
>> > >> --
>> > >> Sincerely yours, Ivan Daschinskiy
>> > >>
>> > >
>> > >
>> > > --
>> > >
>> > > Best regards,
>> > > Ivan Pavlukhin
>> >
>> >
>>
>> --
>> Sincerely yours, Ivan Daschinskiy
>>
>

--
Best regards,
Ivan Pavlukhin
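
For reference, here is a minimal sketch of the two effects discussed in this thread: a string written with one charset and read with another, and a string holding an unpaired surrogate pushed through UTF-8. The class `EncodingRoundTripSketch` and its `roundTrip` helper are hypothetical names made up for illustration and are not Ignite code. The sketch passes charsets explicitly to imitate two nodes whose JVM default encodings differ, on the assumption that `String#getBytes()` and `new String(<byte array>)` would pick up those defaults.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Ignite.
class EncodingRoundTripSketch {
    /**
     * Encodes the string with the writer's charset and decodes the bytes with
     * the reader's charset, imitating two nodes with different default encodings.
     */
    static String roundTrip(String s, Charset writer, Charset reader) {
        byte[] bytes = s.getBytes(writer);
        return new String(bytes, reader);
    }

    public static void main(String[] args) {
        // Latin small letter e with acute accent.
        String accented = "\u00E9";

        // Both sides use UTF-8: the string survives.
        System.out.println(accented.equals(
            roundTrip(accented, StandardCharsets.UTF_8, StandardCharsets.UTF_8)));

        // Writer uses ISO-8859-1, reader uses UTF-8: the single byte 0xE9 is
        // malformed UTF-8, so the decoder substitutes U+FFFD.
        System.out.println(accented.equals(
            roundTrip(accented, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8)));

        // Unpaired high surrogate: a valid Java String, but not encodable in
        // UTF-8, so getBytes substitutes the replacement byte for '?'.
        String loneSurrogate = "\uD800";
        System.out.println(loneSurrogate.equals(
            roundTrip(loneSurrogate, StandardCharsets.UTF_8, StandardCharsets.UTF_8)));
    }
}
```

Running `main` prints `true`, `false`, `false`. The last line is the surrogate case: even with UTF-8 on both sides, such a string does not come back unchanged, which is the practical cost of the restriction. Whether that matters for Ignite depends on whether such strings occur in real data.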