Re: [DISCUSSION] Reject join of nodes with different character encodings
We copy the values unchanged, as is, in their byte representation. Could you please specify what could go wrong? I see only one possibility:

1. Start the cluster with the default encoding (this is only the Windows case :)) and set some metastorage values with non-ASCII chars.
2. Stop it and restart it with a different encoding specified.

I suppose that this is a very rare case, and all the user has to do is erase the metastore. The other variant is to make all users erase the metastore in order to use UTF-8.

--
Sincerely yours, Ivan Daschinskiy
Re: [DISCUSSION] Reject join of nodes with different character encodings
Ivan,

I'm still not sure it is a good idea to upgrade the metastorage automatically, because we can't detect the charset the metastorage was created with and, at the same time, we can't be sure the current charset is the correct one.

So, is there any guarantee that the metastorage is consistent even if it was "upgraded" successfully?

As far as I can see, we just copy the metastorage keys to a temporary store key by key and then write them back to the original one. It seems that if something goes wrong, the user may end up with both stores (original and temporary) broken.

--
Best regards,
Andrey V. Mashenkov
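The point that the original charset cannot be detected from the stored bytes can be illustrated with a small sketch. It is not Ignite code: the class and method names are hypothetical, and only standard JDK calls are used. The same bytes pass a strict decode under both UTF-8 and ISO-8859-1, so validation alone cannot tell which charset wrote them.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration, not part of Ignite. */
class CharsetAmbiguity {
    /** Strictly decodes the bytes; returns null if they are not valid in the given charset. */
    static String strictDecode(byte[] bytes, Charset cs) {
        try {
            return cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
        }
        catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Bytes written by a node whose charset was UTF-8.
        byte[] stored = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        String asUtf8 = strictDecode(stored, StandardCharsets.UTF_8);
        String asLatin1 = strictDecode(stored, StandardCharsets.ISO_8859_1);

        // Both decodes succeed, yet they produce different strings.
        System.out.println("valid as UTF-8: " + (asUtf8 != null));
        System.out.println("valid as ISO-8859-1: " + (asLatin1 != null));
        System.out.println("same text: " + asUtf8.equals(asLatin1));
    }
}
```

Single-byte charsets such as ISO-8859-1 accept every byte sequence, which is why a successful "upgrade" says little about whether the right source charset was assumed.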
Re: [DISCUSSION] Reject join of nodes with different character encodings
Andrey, I believe that we already have all the machinery to do the migration safely. See, for example, org.apache.ignite.internal.processors.cache.persistence.metastorage.MetaStorage#init and org.apache.ignite.internal.processors.cache.persistence.metastorage.MetaStorage.TmpStorage. This machinery was introduced for a slightly different task, but we can reuse it for the current purpose.

--
Sincerely yours, Ivan Daschinskiy
Re: [DISCUSSION] Reject join of nodes with different character encodings
Thank you all for your replies!

I got the idea and agree with it. Based on the results of the discussion, I have filed a ticket [1]. I will try to investigate it.

[1] - https://issues.apache.org/jira/browse/IGNITE-16157
Re: [DISCUSSION] Reject join of nodes with different character encodings
Andrey, agree with you, good point.
Re: [DISCUSSION] Reject join of nodes with different character encodings
Guys,

I like the idea of a flag, but for a different purpose. I think the flag makes it easy to detect whether the metastorage was created on a new version with a fixed charset or on an older version with the user-defined default. Depending on the flag, we can choose the new strategy that forces UTF-8, or fall back to the old one with defaultCharset and print a warning and a recommendation in the log.

Adding any compatibility code is extremely error-prone, because if you fail in the middle of the restore process, you will get a broken metastorage with keys in different charsets. At that point, there is no way to detect the broken keys anymore.
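The flag-based strategy described above could look roughly like the following sketch. Everything here is an assumption made for illustration: the class, the `utf8Flag` parameter and the log text are hypothetical and do not correspond to existing Ignite APIs.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration, not part of Ignite. */
class MetastorageCharsetChooser {
    /**
     * @param utf8Flag Hypothetical marker: true if the store was created by a version that always writes UTF-8.
     * @param legacyDflt Charset to fall back to for stores created by older versions.
     * @param log Collects warnings; stands in for a real logger.
     * @return Charset to use for reading and writing strings in this store.
     */
    static Charset choose(boolean utf8Flag, Charset legacyDflt, StringBuilder log) {
        if (utf8Flag)
            return StandardCharsets.UTF_8;

        // Old store: keep the previous behaviour, but tell the user about it.
        log.append("WARN: metastorage was created without a fixed charset, falling back to ")
            .append(legacyDflt.name())
            .append(". Consider recreating it so that UTF-8 is used.");

        return legacyDflt;
    }

    public static void main(String[] args) {
        StringBuilder log = new StringBuilder();

        System.out.println(choose(true, Charset.defaultCharset(), log));
        System.out.println(choose(false, Charset.defaultCharset(), log));
        System.out.println(log);
    }
}
```

With this shape no existing key is rewritten, so a failure cannot leave the store with keys in mixed charsets.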
Re: [DISCUSSION] Reject join of nodes with different character encodings
Slava, great ticket! I suppose that we can add a feature flag to BPlusMetaIO, and if it is not present or its value is false, we can rebuild the metastore during recovery: decode the strings using the default system encoding and save all of them back in UTF-8. After recovery, we should use UTF-8 by default.
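The per-value step of the rebuild proposed above (decode with the old default encoding, save back as UTF-8) can be sketched as follows. This is only an illustration with hypothetical names; the BPlusMetaIO flag and the recovery procedure themselves are not modelled.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical illustration, not part of Ignite. */
class LegacyStringMigration {
    /**
     * Re-encodes a stored string value from the legacy charset to UTF-8.
     * The result is only correct if legacyCharset is really the charset the bytes were written with.
     */
    static byte[] toUtf8(byte[] legacyBytes, Charset legacyCharset) {
        String decoded = new String(legacyBytes, legacyCharset);

        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Value written by an old node whose default charset was ISO-8859-1.
        byte[] legacy = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        byte[] migrated = toUtf8(legacy, StandardCharsets.ISO_8859_1);

        System.out.println(Arrays.toString(legacy));
        System.out.println(Arrays.toString(migrated));
    }
}
```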
Re: [DISCUSSION] Reject join of nodes with different character encodings
Hi folks,

IMHO, we should do our best to fix all these places and should avoid using the default charset. In my understanding, this is only

> The main question is - should we restrict the join of nodes with different encodings or just fix all places where implicit default encoding is used and specify the explicit one as Ivan Daschinsky suggested?

Restricting the join of nodes is not a solution for all cases. You are in trouble even if you use a one-node cluster: just change the default charset on your system and restart the node with an existing PDS [1].

> As for me, I'm expecting a way more problem with enforcing rule to fail, rather than enforcing all components to use UTF-8

Absolutely agree with Ivan.

[1] https://issues.apache.org/jira/browse/IGNITE-16080

Thanks,
S.
Re: [DISCUSSION] Reject join of nodes with different character encodings
Do the encodings in question somehow influence the actual stored data (bytes)? If so, using an implicit platform encoding sounds quite dangerous. Moving data between servers (or perhaps even rebalancing) can lead to bad consequences. Anyway, IMHO an implicit encoding is not good, but a sensible default is quite robust.
Re: [DISCUSSION] Reject join of nodes with different character encodings
Unpaired surrogates are emoji symbols. One should be completely insane to use emojis in login.
Re: [DISCUSSION] Reject join of nodes with different character encodings
Ivan,

Strings with unpaired surrogate symbols are serialized and deserialized by the Java UTF-8 encoder and decoder without errors, but the result does not match the initial string. As a result, if a user's login contains such symbols, it will be distorted after deserialization and the user will not be able to log in. I understand that this is quite a rare case. Anyway, a way to solve this problem was introduced here - https://issues.apache.org/jira/browse/IGNITE-3098

Frankly, it is not the topic I would like to discuss now. The main question is: should we restrict the join of nodes with different encodings, or just fix all places where the implicit default encoding is used and specify an explicit one, as Ivan Daschinsky suggested?

From my point of view, it is better to reject nodes with different encodings (especially after Ilya Kasnacheev mentioned that we already have a warning "Differing character encodings across cluster may lead to erratic behavior"). It will help to avoid "erratic behavior", not just warn about it. This is important since problems related to string encoding can occur in different components and their cause is not always obvious.

WDYT?

On 13.12.2021 20:01, Ivan Pavlukhin wrote:
> > I guess Nikolay is talking about the problem with UTF-8 in case string
> > contains unpaired surrogate symbols
>
> Folks, give me a clue why it is a problem? Naively it seems to be a
> good restriction rather than a problem. What problems can it cause in
> practice?
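The distortion of strings with unpaired surrogates mentioned in this message can be reproduced with plain JDK calls. The following is a minimal sketch, not Ignite code: the class name, the helper `utf8RoundTrip` and the sample login are made up for illustration, and only `String#getBytes(Charset)` and `new String(byte[], Charset)` are assumed.

```java
import java.nio.charset.StandardCharsets;

public class UnpairedSurrogateRoundTrip {
    /** Encodes the string to UTF-8 bytes and decodes it back (illustrative helper). */
    public static String utf8RoundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical login that ends with an unpaired high surrogate.
        String login = "user\uD800";

        // A well-formed surrogate pair (one supplementary code point).
        String paired = "user\uD83D\uDE00";

        // No exception is thrown in either case, but only the paired string survives.
        System.out.println("unpaired survives: " + login.equals(utf8RoundTrip(login)));
        System.out.println("paired survives:   " + paired.equals(utf8RoundTrip(paired)));
    }
}
```

The encoder silently substitutes the unpaired surrogate instead of failing, which is why the mismatch only shows up later, for example when the restored login is compared with the stored one.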
Re: [DISCUSSION] Reject join of nodes with different character encodings
> I guess Nikolay is talking about the problem with UTF-8 in case string
> contains unpaired surrogate symbols

Folks, give me a clue why it is a problem? Naively it seems to be a good restriction rather than a problem. What problems can it cause in practice?

2021-12-13 16:32 GMT+03:00, Ilya Kasnacheev:
> Hello!
>
> We already have a warning about this, see IgniteKernal.checkFileEncoding()
>
> Regards,
> --
> Ilya Kasnacheev
Re: [DISCUSSION] Reject join of nodes with different character encodings
Hello!

We already have a warning about this, see IgniteKernal.checkFileEncoding()

Regards,
--
Ilya Kasnacheev

On Mon, Dec 13, 2021 at 16:26, Ivan Daschinsky wrote:
> >> But now multiple components
> >> independently serialize strings for their needs and use default encoding
> >> for this.
> >> For example DirectByteBufferStreamImplV2#writeString,
> >> MetaStorage#writeRaw and so on
> We should fix all of them.
>
> >> BinaryUtils#utf8BytesToStr
> Let's use this everywhere.
>
> As for me, I expect far more problems from enforcing a rule that rejects
> the join than from enforcing all components to use UTF-8.
> Some weird cases (surrogate pairs) we can (I strongly believe it is OK)
> simply not consider at all.
>
> --
> Sincerely yours, Ivan Daschinskiy
Re: [DISCUSSION] Reject join of nodes with different character encodings
>> But now multiple components
>> independently serialize strings for their needs and use default encoding
>> for this.
>> For example DirectByteBufferStreamImplV2#writeString,
>> MetaStorage#writeRaw and so on
We should fix all of them.

>> BinaryUtils#utf8BytesToStr
Let's use this everywhere.

As for me, I expect far more problems from enforcing a rule that rejects the join than from enforcing all components to use UTF-8.
Some weird cases (surrogate pairs) we can (I strongly believe it is OK) simply not consider at all.

On Mon, Dec 13, 2021 at 15:15, Nikolay Izhikov wrote:
> > Does Java String support all unicode characters and particularly does it
> > support more characters than UTF-8
>
> It’s not about Java, it’s about the UTF-8 standard.
>
> Please, take a look at [1]
>
> > In November 2003, UTF-8 was restricted by RFC 3629 to match the
> > constraints of the UTF-16 character encoding: explicitly prohibiting code
> > points corresponding to the high and low surrogate characters removed more
> > than 3% of the three-byte sequences, and ending at U+10FFFF removed more
> > than 48% of the four-byte sequences and all five- and six-byte sequences.
>
> And [2]
>
> > The definition of UTF-8 prohibits encoding character numbers between
> > U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form
> > (as surrogate pairs) and do not directly represent characters.
>
> Actually, we already have some modes to support this restriction of UTF-8.
> Please, take a look at BinaryUtils#utf8BytesToStr [3]
>
> [1] https://en.wikipedia.org/wiki/UTF-8
> [2] https://datatracker.ietf.org/doc/html/rfc3629
> [3] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/binary/BinaryUtils.java#L2387

--
Sincerely yours, Ivan Daschinskiy
Re: [DISCUSSION] Reject join of nodes with different character encodings
> Does Java String support all unicode characters and particularly does it
> support more characters than UTF-8

It’s not about Java, it’s about the UTF-8 standard.

Please, take a look at [1]

> In November 2003, UTF-8 was restricted by RFC 3629 to match the constraints
> of the UTF-16 character encoding: explicitly prohibiting code points
> corresponding to the high and low surrogate characters removed more than 3%
> of the three-byte sequences, and ending at U+10FFFF removed more than 48% of
> the four-byte sequences and all five- and six-byte sequences.

And [2]

> The definition of UTF-8 prohibits encoding character numbers between U+D800
> and U+DFFF, which are reserved for use with the UTF-16 encoding form (as
> surrogate pairs) and do not directly represent characters.

Actually, we already have some modes to support this restriction of UTF-8.
Please, take a look at BinaryUtils#utf8BytesToStr [3]

[1] https://en.wikipedia.org/wiki/UTF-8
[2] https://datatracker.ietf.org/doc/html/rfc3629
[3] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/binary/BinaryUtils.java#L2387

> On Dec 13, 2021, at 13:57, Ivan Pavlukhin wrote:
>
> > UTF-8 can’t encode all UNICODE characters.
>
> Nikolay, could you please elaborate? My understanding is that the encoding
> we speak about matters for conversion from byte arrays to strings.
> Does Java String support all Unicode characters, and in particular does
> it support more characters than UTF-8 (I am not saying here that Java
> String uses UTF-8)?
>
> --
> Best regards,
> Ivan Pavlukhin
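The UTF-8 restriction on U+D800..U+DFFF quoted in this message can be observed from Java with a strict encoder. This is only a sketch under stated assumptions: the class and method names are hypothetical, and it uses the JDK's `CharsetEncoder` with `CodingErrorAction.REPORT` rather than Ignite's `BinaryUtils`, whose behavior is not reproduced here.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8Check {
    /** Returns true if the string can be encoded as well-formed UTF-8 (illustrative helper). */
    public static boolean isEncodableAsUtf8(String s) {
        // REPORT makes the encoder throw instead of silently substituting a replacement.
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);

        try {
            enc.encode(CharBuffer.wrap(s));

            return true;
        }
        catch (CharacterCodingException e) {
            // An unpaired surrogate is reported as malformed input.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isEncodableAsUtf8("plain"));         // well-formed
        System.out.println(isEncodableAsUtf8("\uD83D\uDE00"));  // surrogate pair, well-formed
        System.out.println(isEncodableAsUtf8("a\uD800"));       // unpaired high surrogate
    }
}
```

A Java `String` is a sequence of UTF-16 code units, so it can hold an unpaired surrogate even though no valid UTF-8 byte sequence exists for it; that gap is what the thread refers to.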
Re: [DISCUSSION] Reject join of nodes with different character encodings
Ivan Daschinsky,

> better variant is to enforce all strings to be encoded in UTF-8

I agree that it is a possible way to go. But now multiple components independently serialize strings for their needs and use the default encoding for this. For example, DirectByteBufferStreamImplV2#writeString, MetaStorage#writeRaw and so on. Even if we fix all these cases, we cannot guarantee that the problem described above will not arise again. Also, it seems to be easy for the user to specify the encoding for the Ignite Java process manually - through the `file.encoding` system property.

Ivan Pavlukhin,

I guess Nikolay is talking about the problem with UTF-8 in case a string contains unpaired surrogate symbols (e.g. used for encoding in UTF-16). In this case UTF-8 fails to serialize the string correctly, since unpaired surrogate characters are forbidden in UTF-8. Though this problem was solved for the binary marshaller - see `BinaryWriterExImpl#doWriteString` and `BinaryUtils#strToUtf8Bytes`

On 13.12.2021 13:57, Ivan Pavlukhin wrote:
> > UTF-8 can’t encode all UNICODE characters.
>
> Nikolay, could you please elaborate? My understanding is that the encoding
> we speak about matters for conversion from byte arrays to strings.
> Does Java String support all Unicode characters, and in particular does
> it support more characters than UTF-8 (I am not saying here that Java
> String uses UTF-8)?
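For illustration, the kind of check that a "reject join" rule implies could look roughly like the sketch below. Everything here is hypothetical: `EncodingJoinCheck` and `validateJoin` are invented names, the way a remote node's encoding would be delivered is not shown, and this is not how Ignite implements node validation. Only `Charset.defaultCharset()`, `Charset.forName(String)` and the `file.encoding` system property come from the JDK.

```java
import java.nio.charset.Charset;

public class EncodingJoinCheck {
    /**
     * Hypothetical validation: returns null if the joining node may join,
     * or an error message if its encoding differs from the local one.
     */
    public static String validateJoin(String locEncoding, String rmtEncoding) {
        // Charset.forName resolves aliases, so "UTF8" and "UTF-8" compare as equal.
        Charset loc = Charset.forName(locEncoding);
        Charset rmt = Charset.forName(rmtEncoding);

        if (loc.equals(rmt))
            return null;

        return "Joining node encoding differs [local=" + loc.name() + ", remote=" + rmt.name() + ']';
    }

    public static void main(String[] args) {
        // The encoding the local JVM actually uses by default.
        String locEncoding = Charset.defaultCharset().name();

        System.out.println("file.encoding property: " + System.getProperty("file.encoding"));
        System.out.println("default charset: " + locEncoding);
        System.out.println("check against ISO-8859-1: " + validateJoin(locEncoding, "ISO-8859-1"));
    }
}
```

Comparing resolved `Charset` objects rather than raw property strings avoids rejecting nodes that merely spell the same encoding differently.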
Re: [DISCUSSION] Reject join of nodes with different character encodings
> UTF-8 can’t encode all UNICODE characters.

Nikolay, could you please elaborate? My understanding is that the encoding we speak about matters for conversion from byte arrays to strings. Does Java String support all Unicode characters, and in particular does it support more characters than UTF-8 (I am not saying here that Java String uses UTF-8)?

2021-12-13 12:56 GMT+03:00, Ivan Daschinsky:
> UTF-8 is already the default encoding in our BinaryObject format. So I am
> for unification.
>
> --
> Sincerely yours, Ivan Daschinskiy

--
Best regards,
Ivan Pavlukhin
Re: [DISCUSSION] Reject join of nodes with different character encodings
UTF-8 is already the default encoding in our BinaryObject format. So I am for unification.

On Mon, Dec 13, 2021 at 12:50, Nikolay Izhikov wrote:
> Hello, Ivan.
>
> UTF-8 can’t encode all UNICODE characters.

--
Sincerely yours, Ivan Daschinskiy
Re: [DISCUSSION] Reject join of nodes with different character encodings
Hello, Ivan.

UTF-8 can’t encode all UNICODE characters.

> On Dec 13, 2021, at 12:49, Ivan Daschinsky wrote:
>
> Khm, maybe a better variant is to enforce all strings to be encoded in
> UTF-8?
> AFAIK a multi-OS cluster is quite a common case.
>
> --
> Sincerely yours, Ivan Daschinskiy
Re: [DISCUSSION] Reject join of nodes with different character encodings
Khm, maybe a better variant is to enforce all strings to be encoded in UTF-8?
AFAIK a multi-OS cluster is quite a common case.

On Mon, Dec 13, 2021 at 11:36, Mikhail Petrov wrote:
> Igniters,
>
> Recently we faced the problem that if the cluster consists of nodes
> running in JVMs with different encodings, many issues arise.
> The root cause of these issues is components that use
> `String#getBytes()` and `new String()`, which rely on the
> system default encoding. Thus, if a string is deserialized on a node
> with a different encoding from the one that serialized it, the
> deserialized string can differ from the original one.
>
> For example:
>
> Serialization/deserialization of strings in communication messages may be
> broken for some strings on nodes running in a JVM with a different
> encoding, as DirectByteBufferStreamImplV2 uses String#getBytes() to
> serialize strings - [1]
>
> Or the IgniteAuthenticationProcessor can compute different security IDs
> for the user on different nodes in this case - [2]
>
> What do you think about solving this problem globally by rejecting the
> join of nodes that run on JVMs with different encodings?
>
> As a result, we will be sure that all cluster nodes have the same
> encoding and all related problems will be solved.
>
> [1] - https://issues.apache.org/jira/browse/IGNITE-16106
> [2] - https://issues.apache.org/jira/browse/IGNITE-16068
>
> --
> Mikhail

--
Sincerely yours, Ivan Daschinskiy
[DISCUSSION] Reject join of nodes with different character encodings
Igniters,

Recently we faced the problem that if the cluster consists of nodes running in JVMs with different encodings, many issues arise. The root cause of these issues is components that use `String#getBytes()` and `new String()`, which rely on the system default encoding. Thus, if a string is deserialized on a node with a different encoding from the one that serialized it, the deserialized string can differ from the original one.

For example:

Serialization/deserialization of strings in communication messages may be broken for some strings on nodes running in a JVM with a different encoding, as DirectByteBufferStreamImplV2 uses String#getBytes() to serialize strings - [1]

Or the IgniteAuthenticationProcessor can compute different security IDs for the user on different nodes in this case - [2]

What do you think about solving this problem globally by rejecting the join of nodes that run on JVMs with different encodings?

As a result, we will be sure that all cluster nodes have the same encoding and all related problems will be solved.

[1] - https://issues.apache.org/jira/browse/IGNITE-16106
[2] - https://issues.apache.org/jira/browse/IGNITE-16068

--
Mikhail
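The root cause described in this message can be shown outside Ignite. The sketch below is an assumption-laden illustration, not Ignite code: the class name and the `roundTrip` helper are invented, and the two charsets are passed explicitly to stand in for the different default encodings that `String#getBytes()` and the byte-array `String` constructor would pick up on two nodes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetMismatch {
    /** Encodes with the "sender" charset and decodes with the "receiver" charset (illustrative helper). */
    public static String roundTrip(String s, Charset senderCharset, Charset receiverCharset) {
        byte[] bytes = s.getBytes(senderCharset);   // stands in for String#getBytes() on the sending node

        return new String(bytes, receiverCharset);  // stands in for decoding on the receiving node
    }

    public static void main(String[] args) {
        // Cyrillic sample written with Unicode escapes so the source file encoding does not matter.
        String login = "\u041C\u0438\u0445\u0430\u0438\u043B";

        String same = roundTrip(login, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String mixed = roundTrip(login, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println("default charset of this JVM: " + Charset.defaultCharset());
        System.out.println("same charsets, string preserved:  " + login.equals(same));
        System.out.println("mixed charsets, string preserved: " + login.equals(mixed));
    }
}
```

Passing an explicit charset on both sides, as in the first call, removes the dependency on the JVM default; relying on the default behaves like the second call whenever the two nodes disagree.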