Re: [DISCUSSION] Reject join of nodes with different character encodings

Ivan Pavlukhin Mon, 13 Dec 2021 09:02:21 -0800

> I guess Nikolay is talking about the problem with UTF-8 in case string 
> contains unpaired surrogate symbols


Folks, could you give me a clue as to why this is a problem? Naively, it
seems to be a good restriction rather than a problem. What problems can
it cause in practice?
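
For reference, the surrogate case under discussion can be reproduced with
a minimal sketch like the one below. It is only an illustration, not
Ignite code: the class name UnpairedSurrogateDemo and the helper
roundTrip are hypothetical. It relies on the documented JDK behavior that
String#getBytes(Charset) replaces input it cannot encode with the
charset's replacement bytes, so a lone surrogate should not survive a
UTF-8 round trip, while a properly paired one should.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Ignite.
final class UnpairedSurrogateDemo {
    /** Encodes to UTF-8 and decodes back, as a serializer/deserializer pair would. */
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Valid surrogate pair (U+1F600): encodable as a four-byte UTF-8 sequence.
        String paired = "a" + "\uD83D\uDE00" + "z";
        // Lone high surrogate: legal in a Java String, but not encodable in UTF-8.
        String unpaired = "a" + "\uD83D" + "z";

        System.out.println(paired.equals(roundTrip(paired)));     // true
        System.out.println(unpaired.equals(roundTrip(unpaired))); // false
        System.out.println(roundTrip(unpaired));                  // a?z
    }
}
```

So the practical effect seems limited to strings that are not valid
Unicode text in the first place: they are silently altered rather than
rejected.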

2021-12-13 16:32 GMT+03:00, Ilya Kasnacheev <[email protected]>:
> Hello!
>
> We already have a warning about this, see IgniteKernal.checkFileEncoding()
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> Mon, 13 Dec 2021 at 16:26, Ivan Daschinsky <[email protected]>:
>
>> >> But now multiple components independently serialize strings for
>> >> their needs and use the default encoding for this.
>> >> For example, DirectByteBufferStreamImplV2#writeString,
>> >> MetaStorage#writeRaw and so on.
>> We should fix all of them.
>>
>> >> BinaryUtils#utf8BytesToStr
>> Let's use this everywhere.
>>
>> As for me, I expect far more problems from enforcing a rule that makes
>> the join fail than from enforcing all components to use UTF-8.
>> Some weird cases (surrogate pairs) we can simply not consider at all
>> (I strongly believe that is OK).
>>
>> Mon, 13 Dec 2021 at 15:15, Nikolay Izhikov <[email protected]>:
>>
>> > > Does Java String support all unicode characters and particularly does
>> > > it support more characters than UTF-8
>> >
>> > It’s not about Java, it’s about the UTF-8 standard.
>> >
>> > Please, take a look at [1]
>> >
>> > > In November 2003, UTF-8 was restricted by RFC 3629 to match the
>> > > constraints of the UTF-16 character encoding: explicitly prohibiting
>> > > code points corresponding to the high and low surrogate characters
>> > > removed more than 3% of the three-byte sequences, and ending at
>> > > U+10FFFF removed more than 48% of the four-byte sequences and all
>> > > five- and six-byte sequences.
>> >
>> > And [2]
>> >
>> > > The definition of UTF-8 prohibits encoding character numbers between
>> > > U+D800 and U+DFFF, which are reserved for use with the UTF-16
>> > > encoding form (as surrogate pairs) and do not directly represent
>> > > characters.
>> >
>> > Actually, we already have some modes to support this restriction of
>> > UTF-8.
>> > Please, take a look at BinaryUtils#utf8BytesToStr [3]
>> >
>> >
>> > [1] https://en.wikipedia.org/wiki/UTF-8
>> > [2] https://datatracker.ietf.org/doc/html/rfc3629
>> > [3] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/binary/BinaryUtils.java#L2387
>> >
>> > > On 13 Dec 2021, at 13:57, Ivan Pavlukhin <[email protected]> wrote:
>> > >
>> > >> UTF-8 can’t encode all UNICODE characters.
>> > >
>> > > Nikolay, could you please elaborate? My understanding is that the
>> > > encoding we are speaking about matters for conversion from byte
>> > > arrays to strings.
>> > > Does Java String support all unicode characters and particularly does
>> > > it support more characters than UTF-8 (I am not saying here that java
>> > > String uses UTF-8)?
>> > >
>> > > 2021-12-13 12:56 GMT+03:00, Ivan Daschinsky <[email protected]>:
>> > >> UTF-8 is already the default encoding in our BinaryObject format.
>> > >> So... I am for unification.
>> > >>
>> > >> Mon, 13 Dec 2021 at 12:50, Nikolay Izhikov <[email protected]>:
>> > >>
>> > >>> Hello, Ivan.
>> > >>>
>> > >>> UTF-8 can’t encode all UNICODE characters.
>> > >>>
>> > >>>> On 13 Dec 2021, at 12:49, Ivan Daschinsky <[email protected]> wrote:
>> > >>>>
>> > >>>> Hmm, maybe a better option is to enforce that all strings are
>> > >>>> encoded in UTF-8?
>> > >>>> AFAIK a multi-OS cluster is quite a common case.
>> > >>>>
>> > >>>>
>> > >>>> Mon, 13 Dec 2021 at 11:36, Mikhail Petrov <[email protected]>:
>> > >>>>
>> > >>>>> Igniters,
>> > >>>>>
>> > >>>>> Recently we faced the problem that if the cluster consists of
>> > >>>>> nodes running in JVMs with different encodings, many issues
>> > >>>>> arise. The root cause of these issues is components that use
>> > >>>>> `String#getBytes()` and `new String(<byte array>)`, which rely
>> > >>>>> on the system default encoding. Thus, if a string is
>> > >>>>> deserialized on a node with a different encoding from the one
>> > >>>>> that serialized it, the deserialized string can differ from the
>> > >>>>> original one.
>> > >>>>>
>> > >>>>> For example:
>> > >>>>>
>> > >>>>> Serialization/deserialization of strings in communication
>> > >>>>> messages may be broken for some strings on nodes running in a
>> > >>>>> JVM with a different encoding, as DirectByteBufferStreamImplV2
>> > >>>>> uses String#getBytes() to serialize strings - [1]
>> > >>>>>
>> > >>>>> Or the IgniteAuthenticationProcessor can compute different
>> > >>>>> security IDs for the user on different nodes in this case - [2]
>> > >>>>>
>> > >>>>> What do you think about solving this problem globally by
>> > >>>>> rejecting the join of nodes that run on JVMs with different
>> > >>>>> encodings?
>> > >>>>>
>> > >>>>> As a result, we will be sure that all cluster nodes have the same
>> > >>>>> encoding and all related problems will be solved.
>> > >>>>>
>> > >>>>> [1] - https://issues.apache.org/jira/browse/IGNITE-16106
>> > >>>>> [2] - https://issues.apache.org/jira/browse/IGNITE-16068
>> > >>>>>
>> > >>>>> --
>> > >>>>> Mikhail
>> > >>>>>
>> > >>>>>
>> > >>>>
>> > >>>> --
>> > >>>> Sincerely yours, Ivan Daschinskiy
>> > >>>
>> > >>>
>> > >>
>> > >> --
>> > >> Sincerely yours, Ivan Daschinskiy
>> > >>
>> > >
>> > >
>> > > --
>> > >
>> > > Best regards,
>> > > Ivan Pavlukhin
>> >
>> >
>>
>> --
>> Sincerely yours, Ivan Daschinskiy
>>
>


-- 

Best regards,
Ivan Pavlukhin
