Unpaired surrogates are emoji symbols. One would have to be completely insane to use emojis in a login.
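To make the surrogate case concrete, here is a minimal Java sketch (the class and method names are hypothetical, not Ignite code). It assumes only the standard `String#getBytes(Charset)` and `new String(byte[], Charset)` calls. An emoji is stored in a Java String as a valid surrogate pair and survives a UTF-8 round trip, while a lone (unpaired) surrogate is replaced with `?` on encoding, so the decoded string no longer matches the original:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Ignite.
final class SurrogateRoundTrip {
    /** Encodes to UTF-8 and decodes back, as a serializer with an explicit charset would. */
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // U+1F600 stored as a valid surrogate pair
        String broken = "user\uD83D";  // high surrogate with no low surrogate after it

        System.out.println(emoji.equals(roundTrip(emoji)));   // the pair survives
        System.out.println(broken.equals(roundTrip(broken))); // the lone surrogate does not
        System.out.println(roundTrip(broken));                // the surrogate became '?'
    }
}
```

So the distorted-login scenario needs a malformed string to begin with; a well-formed emoji login would round-trip unchanged.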
Mon, Dec 13, 2021, 21:30 Mikhail Petrov <pmgheap....@gmail.com>:
> Ivan, strings with unpaired surrogate symbols are serialized and
> deserialized by Java's UTF-8 encoder and decoder successfully, but the
> result does not match the initial string. As a result, if a user's login
> contains these symbols, it will be distorted after deserialization and
> the user will not be able to log in. I understand that it is quite a
> rare case.
> Anyway, a way to solve this problem was introduced here -
> https://issues.apache.org/jira/browse/IGNITE-3098
>
> Frankly, it is not the topic I would like to discuss now. The main
> question is - should we restrict the join of nodes with different
> encodings or just fix all places where the implicit default encoding is
> used and specify an explicit one, as Ivan Daschinsky suggested?
>
> From my point of view, it is better to reject nodes with different
> encodings (especially after Ilya Kasnacheev mentioned that we already
> have a warning "Differing character encodings across cluster may lead
> to erratic behavior"). It will help to avoid "erratic behavior", not
> just warn about it. It is important since the problems related to string
> encoding can occur in different components and the cause of them is not
> always obvious.
>
> WDYT?
>
> On 13.12.2021 20:01, Ivan Pavlukhin wrote:
> >> I guess Nikolay is talking about the problem with UTF-8 in case string
> >> contains unpaired surrogate symbols
> > Folks, give me a clue why it is a problem? Naively it seems to be a
> > good restriction rather than a problem. What problems can it cause in
> > practice?
> >
> > 2021-12-13 16:32 GMT+03:00, Ilya Kasnacheev <ilya.kasnach...@gmail.com>:
> >> Hello!
> >>
> >> We already have a warning about this, see
> >> IgniteKernal.checkFileEncoding()
> >>
> >> Regards,
> >> --
> >> Ilya Kasnacheev
> >>
> >>
> >> Mon, Dec 13, 2021
> >> at 16:26, Ivan Daschinsky <ivanda...@gmail.com>:
> >>
> >>>>> But now multiple components
> >>>>> independently serialize strings for their needs and use default
> >>>>> encoding for this.
> >>>>> For example DirectByteBufferStreamImplV2#writeString,
> >>>>> MetaStorage#writeRaw and so on
> >>> We should fix all of them.
> >>>
> >>>>> BinaryUtils#utf8BytesToStr
> >>> Let's use this everywhere.
> >>>
> >>> As for me, I expect far more problems from enforcing a rule that fails
> >>> the join than from making all components use UTF-8.
> >>> Some weird cases (surrogate pairs) we can (I strongly believe it is OK)
> >>> simply not consider at all.
> >>>
> >>> Mon, Dec 13, 2021 at 15:15, Nikolay Izhikov <nizhi...@apache.org>:
> >>>
> >>>>> Does Java String support all unicode characters and particularly does
> >>>>> it support more characters than UTF-8
> >>>>
> >>>> It’s not about Java, it’s about the UTF-8 standard.
> >>>>
> >>>> Please, take a look at [1]
> >>>>
> >>>>> In November 2003, UTF-8 was restricted by RFC 3629 to match the
> >>>>> constraints of the UTF-16 character encoding: explicitly prohibiting
> >>>>> code points corresponding to the high and low surrogate characters
> >>>>> removed more than 3% of the three-byte sequences, and ending at
> >>>>> U+10FFFF removed more than 48% of the four-byte sequences and all
> >>>>> five- and six-byte sequences.
> >>>>
> >>>> And [2]
> >>>>
> >>>>> The definition of UTF-8 prohibits encoding character numbers between
> >>>>> U+D800 and U+DFFF, which are reserved for use with the UTF-16
> >>>>> encoding form (as surrogate pairs) and do not directly represent
> >>>>> characters.
> >>>>
> >>>> Actually, we already have some modes to support this restriction of
> >>>> UTF-8.
> >>>> Please, take a look at BinaryUtils#utf8BytesToStr [3]
> >>>>
> >>>>
> >>>> [1] https://en.wikipedia.org/wiki/UTF-8
> >>>> [2] https://datatracker.ietf.org/doc/html/rfc3629
> >>>> [3] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/binary/BinaryUtils.java#L2387
> >>>>
> >>>>> On Dec 13, 2021, at 13:57, Ivan Pavlukhin <vololo...@gmail.com> wrote:
> >>>>>
> >>>>>> UTF-8 can’t encode all UNICODE characters.
> >>>>> Nikolay, could you please elaborate? My understanding is that the
> >>>>> encoding we speak about matters for conversion from byte arrays to
> >>>>> strings. Does Java String support all unicode characters and
> >>>>> particularly does it support more characters than UTF-8 (I am not
> >>>>> saying here that java String uses UTF-8)?
> >>>>>
> >>>>> 2021-12-13 12:56 GMT+03:00, Ivan Daschinsky <ivanda...@gmail.com>:
> >>>>>> UTF-8 is already a default encoding in our BinaryObject format.
> >>>>>> So.... I am for unification.
> >>>>>>
> >>>>>> Mon, Dec 13, 2021 at 12:50, Nikolay Izhikov <nizhi...@apache.org>:
> >>>>>>
> >>>>>>> Hello, Ivan.
> >>>>>>>
> >>>>>>> UTF-8 can’t encode all UNICODE characters.
> >>>>>>>
> >>>>>>>> On Dec 13, 2021, at 12:49, Ivan Daschinsky <ivanda...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Khm, maybe a better variant is to enforce all strings to be
> >>>>>>>> encoded in UTF-8?
> >>>>>>>> AFAIK multi OS cluster is a quite common case.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Mon, Dec 13, 2021 at 11:36, Mikhail Petrov <pmgheap....@gmail.com>:
> >>>>>>>>> Igniters,
> >>>>>>>>>
> >>>>>>>>> Recently we faced the problem that if the cluster consists of
> >>>>>>>>> nodes running in the JVM with different encodings, many issues arise.
> >>>>>>>>> The root cause of the mentioned issues is components that use
> >>>>>>>>> `String#getBytes()` and `new String(<byte array>)`, which rely
> >>>>>>>>> on the system default encoding. Thus, if a string is
> >>>>>>>>> deserialized on a node with a different encoding from the one
> >>>>>>>>> that serialized it, the deserialized string can be different
> >>>>>>>>> from the original one.
> >>>>>>>>>
> >>>>>>>>> For example:
> >>>>>>>>>
> >>>>>>>>> Serialization/deserialization of strings in communication
> >>>>>>>>> messages may be broken for some strings on nodes running in a
> >>>>>>>>> JVM with a different encoding, as DirectByteBufferStreamImplV2
> >>>>>>>>> uses String#getBytes() to serialize strings - [1]
> >>>>>>>>>
> >>>>>>>>> Or the IgniteAuthenticationProcessor can compute different
> >>>>>>>>> security IDs for the user on different nodes in this case - [2]
> >>>>>>>>>
> >>>>>>>>> What do you think about solving this problem globally, by
> >>>>>>>>> refusing to join nodes that run on JVMs with different encodings?
> >>>>>>>>>
> >>>>>>>>> As a result, we will be sure that all cluster nodes have the same
> >>>>>>>>> encoding and all related problems will be solved.
> >>>>>>>>>
> >>>>>>>>> [1] - https://issues.apache.org/jira/browse/IGNITE-16106
> >>>>>>>>> [2] - https://issues.apache.org/jira/browse/IGNITE-16068
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Mikhail
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>> --
> >>>>>>>> Sincerely yours, Ivan Daschinskiy
> >>>>>>>
> >>>>>> --
> >>>>>> Sincerely yours, Ivan Daschinskiy
> >>>>>>
> >>>>>
> >>>>> --
> >>>>>
> >>>>> Best regards,
> >>>>> Ivan Pavlukhin
> >>>>
> >>> --
> >>> Sincerely yours, Ivan Daschinskiy
> >>>
> >
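
For reference, the default-encoding mismatch from the original message can be sketched in plain Java. This is only an illustration under stated assumptions: the class and the `transfer` helper are hypothetical, and the two charsets are passed explicitly to stand in for two JVMs whose default encodings differ (the real code paths call `String#getBytes()` and `new String(byte[])` without arguments).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Ignite.
final class DefaultEncodingMismatch {
    /**
     * Simulates sending a string between two nodes: the sender encodes with its
     * charset, the receiver decodes the same bytes with its own charset.
     */
    static String transfer(String s, Charset sender, Charset receiver) {
        byte[] wire = s.getBytes(sender);
        return new String(wire, receiver);
    }

    public static void main(String[] args) {
        String login = "andr\u00E9"; // "andre" with an acute accent on the last letter

        // Nodes disagree on the encoding: the login is distorted on arrival.
        String distorted = transfer(login, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(login.equals(distorted));

        // Both sides use an explicit UTF-8: the login arrives intact.
        String intact = transfer(login, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(login.equals(intact));
    }
}
```

Either proposal in the thread removes the first case: rejecting nodes with a different default encoding makes sender and receiver agree, and specifying an explicit charset everywhere makes the default irrelevant.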