Re: [DISCUSSION] Reject join of nodes with different character encodings

Ivan Daschinsky Mon, 13 Dec 2021 12:07:21 -0800

Unpaired surrogates are emoji symbols. One would have to be completely
insane to use emojis in a login.


Mon, 13 Dec 2021, 21:30 Mikhail Petrov <[email protected]>:

> Ivan, strings with unpaired surrogate symbols are serialized and
> deserialized by Java's UTF-8 encoder and decoder successfully, but the
> result does not match the initial string. As a result, if a user's login
> contains these symbols, it will be distorted after deserialization and
> the user will not be able to log in. I understand that it is quite a
> rare case.
> Anyway, the way to solve this problem was introduced here -
> https://issues.apache.org/jira/browse/IGNITE-3098
>
> Frankly, it is not the topic I would like to discuss now. The main
> question is - should we restrict the join of nodes with different
> encodings or just fix all places where implicit default encoding is used
> and specify the explicit one as Ivan Daschinsky suggested?
>
> From my point of view, it is better to reject nodes with different
> encodings (especially after Ilya Kasnacheev mentioned that we already
> have a warning "Differing character encodings across cluster may lead
> to erratic behavior"). It will help to avoid "erratic behavior", not
> just warn about it. This is important since problems related to string
> encoding can occur in different components and their cause is not
> always obvious.
>
> WDYT?
>
> On 13.12.2021 20:01, Ivan Pavlukhin wrote:
> >> I guess Nikolay is talking about the problem with UTF-8 in case string
> >> contains unpaired surrogate symbols
> > Folks, give me a clue: why is it a problem? Naively, it seems to be a
> > good restriction rather than a problem. What problems can it cause in
> > practice?
> >
> > 2021-12-13 16:32 GMT+03:00, Ilya Kasnacheev <[email protected]>:
> >> Hello!
> >>
> >> We already have a warning about this, see
> >> IgniteKernal.checkFileEncoding()
> >>
> >> Regards,
> >> --
> >> Ilya Kasnacheev
> >>
> >>
> >> Mon, 13 Dec 2021 at 16:26, Ivan Daschinsky <[email protected]>:
> >>
> >>>>> But now multiple components
> >>>>> independently serialize strings for their needs and use default
> >>>>> encoding
> >>>>> for this.
> >>>>> For example, DirectByteBufferStreamImplV2#writeString,
> >>>>> MetaStorage#writeRaw and so on
> >>> We should fix all of them.
> >>>
> >>>>> BinaryUtils#utf8BytesToStr
> >>> Let's use this everywhere.
> >>>
> >>> As for me, I expect far more problems from enforcing a rule that
> >>> fails the join than from enforcing all components to use UTF-8.
> >>> Some weird cases (surrogate pairs) we can simply not consider at all
> >>> (I strongly believe that is OK).
> >>>
> >>> Mon, 13 Dec 2021 at 15:15, Nikolay Izhikov <[email protected]>:
> >>>
> >>>>> Does Java String support all unicode characters and particularly does
> >>>>> it support more characters than UTF-8
> >>>>
> >>>> It’s not about Java, it’s about UTF-8 standard.
> >>>>
> >>>> Please, take a look at [1]
> >>>>
> >>>>> In November 2003, UTF-8 was restricted by RFC 3629 to match the
> >>>>> constraints of the UTF-16 character encoding: explicitly prohibiting
> >>>>> code points corresponding to the high and low surrogate characters
> >>>>> removed more than 3% of the three-byte sequences, and ending at
> >>>>> U+10FFFF removed more than 48% of the four-byte sequences and all
> >>>>> five- and six-byte sequences.
> >>>>
> >>>> And [2]
> >>>>
> >>>>> The definition of UTF-8 prohibits encoding character numbers between
> >>>>> U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding
> >>>>> form (as surrogate pairs) and do not directly represent characters.
> >>>>
> >>>> Actually, we already have some modes to support this restriction of
> >>>> UTF-8. Please, take a look at BinaryUtils#utf8BytesToStr [3]
> >>>>
> >>>>
> >>>> [1] https://en.wikipedia.org/wiki/UTF-8
> >>>> [2] https://datatracker.ietf.org/doc/html/rfc3629
> >>>> [3] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/binary/BinaryUtils.java#L2387
> >>>>
> >>>>> On 13 Dec 2021, at 13:57, Ivan Pavlukhin <[email protected]> wrote:
> >>>>>> UTF-8 can’t encode all UNICODE characters.
> >>>>> Nikolay, could you please elaborate? My understanding is that the
> >>>>> encoding we are talking about matters for conversion from byte
> >>>>> arrays to strings. Does Java String support all unicode characters
> >>>>> and particularly does it support more characters than UTF-8 (I am
> >>>>> not saying here that java String uses UTF-8)?
> >>>>>
> >>>>> 2021-12-13 12:56 GMT+03:00, Ivan Daschinsky <[email protected]>:
> >>>>>> UTF-8 is already the default encoding in our BinaryObject format.
> >>>>>> So... I am for unification.
> >>>>>>
> >>>>>> Mon, 13 Dec 2021 at 12:50, Nikolay Izhikov <[email protected]>:
> >>>>>>
> >>>>>>> Hello, Ivan.
> >>>>>>>
> >>>>>>> UTF-8 can’t encode all UNICODE characters.
> >>>>>>>
> >>>>>>>> On 13 Dec 2021, at 12:49, Ivan Daschinsky <[email protected]> wrote:
> >>>>>>>> Hmm, maybe a better variant is to enforce all strings to be
> >>>>>>>> encoded in UTF-8?
> >>>>>>>> AFAIK a multi-OS cluster is quite a common case.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Mon, 13 Dec 2021 at 11:36, Mikhail Petrov <[email protected]>:
> >>>>>>>>> Igniters,
> >>>>>>>>>
> >>>>>>>>> Recently we faced the problem that if the cluster consists of
> >>>>>>>>> nodes running in JVMs with different encodings, many issues
> >>>>>>>>> arise. The root cause of these issues is components that use
> >>>>>>>>> `String#getBytes()` and `new String(<byte array>)`, which rely
> >>>>>>>>> on the system default encoding. Thus, if a string is
> >>>>>>>>> deserialized on a node with a different encoding from the one
> >>>>>>>>> that serialized it, the deserialized string can differ from the
> >>>>>>>>> original one.
> >>>>>>>>>
> >>>>>>>>> For example:
> >>>>>>>>>
> >>>>>>>>> Serialization/deserialization of strings in communication
> >>>>>>>>> messages may be broken for some strings on nodes running in a
> >>>>>>>>> JVM with a different encoding, as DirectByteBufferStreamImplV2
> >>>>>>>>> uses String#getBytes() to serialize strings - [1]
> >>>>>>>>>
> >>>>>>>>> Or the IgniteAuthenticationProcessor can compute different
> >>>>>>>>> security IDs for the user on different nodes in this case - [2]
> >>>>>>>>>
> >>>>>>>>> What do you think about solving this problem globally, by
> >>>>>>>>> rejecting the join of nodes that run on JVMs with different
> >>>>>>>>> encodings?
> >>>>>>>>>
> >>>>>>>>> As a result, we will be sure that all cluster nodes have the same
> >>>>>>>>> encoding and all related problems will be solved.
> >>>>>>>>>
> >>>>>>>>> [1] - https://issues.apache.org/jira/browse/IGNITE-16106
> >>>>>>>>> [2] - https://issues.apache.org/jira/browse/IGNITE-16068
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Mikhail
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>> --
> >>>>>>>> Sincerely yours, Ivan Daschinskiy
> >>>>>>>
> >>>>>> --
> >>>>>> Sincerely yours, Ivan Daschinskiy
> >>>>>>
> >>>>>
> >>>>> --
> >>>>>
> >>>>> Best regards,
> >>>>> Ivan Pavlukhin
> >>>>
> >>> --
> >>> Sincerely yours, Ivan Daschinskiy
> >>>
> >
>
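
The default-encoding problem described in the first message of the thread
can be reproduced with a minimal sketch like the one below. It is only an
illustration under stated assumptions: DefaultEncodingDemo and roundTrip are
hypothetical names that do not exist in Ignite, and the two charsets passed
in stand in for the JVM default encodings of the sending and receiving
nodes, which String#getBytes() and new String(byte[]) would pick up
implicitly.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Ignite.
final class DefaultEncodingDemo {
    /**
     * Encodes a string with the sender's charset and decodes it with the
     * receiver's charset, mimicking two JVMs with different defaults.
     */
    static String roundTrip(String s, Charset senderCharset, Charset receiverCharset) {
        // Stands in for s.getBytes() on the sending node.
        byte[] bytes = s.getBytes(senderCharset);

        // Stands in for new String(bytes) on the receiving node.
        return new String(bytes, receiverCharset);
    }

    public static void main(String[] args) {
        // A Cyrillic login written with escapes so that the source file
        // itself does not depend on the compiler's encoding.
        String login = "\u041c\u0438\u0445\u0430\u0438\u043b";

        // Different charsets on the two sides: the login is distorted.
        String distorted = roundTrip(login, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println("mismatched charsets equal: " + login.equals(distorted));

        // The same explicit charset on both sides: the login survives.
        String intact = roundTrip(login, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println("explicit UTF-8 equal: " + login.equals(intact));
    }
}
```

Passing an explicit charset on both sides, as in the second call, is the
"use UTF-8 everywhere" option discussed above; rejecting the join of nodes
with different defaults is the other option and is not shown here.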

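The unpaired-surrogate case raised in the thread can be checked with a
similar sketch. UnpairedSurrogateDemo and its methods are hypothetical names
and not Ignite code; only standard java.nio.charset calls are used. A
well-formed surrogate pair (for example an emoji outside the Basic
Multilingual Plane) round-trips through UTF-8 unchanged, while a lone
surrogate is silently replaced by String#getBytes(Charset), so the decoded
string no longer matches the original.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Ignite.
final class UnpairedSurrogateDemo {
    /** Encodes to UTF-8 and decodes back using the lenient String API. */
    static String utf8RoundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    /** Returns false if the string cannot be encoded as well-formed UTF-8. */
    static boolean isEncodableAsUtf8(String s) {
        try {
            StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(s));

            return true;
        }
        catch (CharacterCodingException e) {
            // Thrown for malformed input such as a lone surrogate.
            return false;
        }
    }

    public static void main(String[] args) {
        String paired = "user\uD83D\uDE00"; // high + low surrogate (U+1F600)
        String unpaired = "user\uD83D";     // lone high surrogate

        System.out.println("paired survives: " + paired.equals(utf8RoundTrip(paired)));
        System.out.println("unpaired survives: " + unpaired.equals(utf8RoundTrip(unpaired)));
        System.out.println("unpaired encodable: " + isEncodableAsUtf8(unpaired));
    }
}
```

A strict check like isEncodableAsUtf8 is one way to reject such logins up
front instead of letting them be distorted; whether that is worth doing is
exactly the question debated above.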