[jira] [Commented] (IGNITE-3140) C++: UTF-16 surrogate symbols are not serialized properly
[ https://issues.apache.org/jira/browse/IGNITE-3140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15409254#comment-15409254 ]

ASF GitHub Bot commented on IGNITE-3140:

Github user isapego closed the pull request at:

    https://github.com/apache/ignite/pull/829

> C++: UTF-16 surrogate symbols are not serialized properly
>
> Key: IGNITE-3140
> URL: https://issues.apache.org/jira/browse/IGNITE-3140
> Project: Ignite
> Issue Type: Bug
> Components: platforms
> Affects Versions: 1.5.0.final
> Reporter: Denis Magda
> Assignee: Vladimir Ozerov
> Fix For: 1.7
>
> There is an issue with serialization of a surrogate symbol with {{BinaryMarshaller}}. On the Java side, String serialization logic was improved to support all the cases. Refer to IGNITE-3098.
> C++ serialization logic has to be updated as well. Please refer to the algorithm located in the ignite-3098 branch in the following places:
> - {{BinaryUtils.strToUtf8Bytes}} - serialization
> - {{BinaryUtils.utf8BytesToStr}} - deserialization
> - {{IgniteSystemProperties.IGNITE_BINARY_MARSHALLER_USE_STRING_SERIALIZATION_VER_2}} controls which version of serialization logic to use (old or new).

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (IGNITE-3140) C++: UTF-16 surrogate symbols are not serialized properly
[ https://issues.apache.org/jira/browse/IGNITE-3140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348191#comment-15348191 ]

ASF GitHub Bot commented on IGNITE-3140:

GitHub user isapego opened a pull request:

    https://github.com/apache/ignite/pull/829

    IGNITE-3140: Added tests for string format validity.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/isapego/ignite ignite-3140

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/ignite/pull/829.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #829

commit 9ca608aa8cf91d546dfa08dd7f560d42047b1c3d
Author: isapego
Date: 2016-06-23T16:53:32Z

    IGNITE-3140: Added tests for string format validity.

commit 41e8d703d45b532c0169f523672fe7ab5291bb87
Author: isapego
Date: 2016-06-23T17:00:55Z

    IGNTIE-3140: Minor fix for test.

commit f70ef622cb05159da99786697a533260bae57929
Author: isapego
Date: 2016-06-24T12:14:25Z

    IGNITE-3140: Fix for JVM reloading.
[jira] [Commented] (IGNITE-3140) C++: UTF-16 surrogate symbols are not serialized properly
[ https://issues.apache.org/jira/browse/IGNITE-3140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15286750#comment-15286750 ]

ASF GitHub Bot commented on IGNITE-3140:

Github user asfgit closed the pull request at:

    https://github.com/apache/ignite/pull/723
[jira] [Commented] (IGNITE-3140) C++: UTF-16 surrogate symbols are not serialized properly
[ https://issues.apache.org/jira/browse/IGNITE-3140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15286603#comment-15286603 ]

Igor Sapego commented on IGNITE-3140:

Ok, it seems that {{BinaryUtils.utf8BytesToStr}} does not implement conversion of 4-byte UTF-8 sequences to UTF-16 surrogate pairs. I believe we should implement it on the Java side. Apart from that, everything seems to be OK from the C++ point of view.
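For readers following the thread, the conversion mentioned in the comment above (a 4-byte UTF-8 sequence becoming a UTF-16 surrogate pair) can be sketched roughly as follows. This is only an illustration in C++, not the Ignite code: the function name Utf8ToUtf16 is hypothetical, and the sketch does not reject overlong encodings.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of Ignite): decodes UTF-8 bytes into UTF-16
// code units. Code points above U+FFFF arrive as 4-byte sequences and are
// split into a high/low surrogate pair. Overlong forms are not rejected here.
std::vector<uint16_t> Utf8ToUtf16(const std::string& in)
{
    std::vector<uint16_t> out;
    size_t i = 0;

    while (i < in.size())
    {
        unsigned char b0 = static_cast<unsigned char>(in[i]);
        uint32_t cp;
        size_t len;

        // The lead byte determines the sequence length and the first payload bits.
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (len > in.size() - i)
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes 6 payload bits.
        for (size_t k = 1; k < len; ++k)
        {
            unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        i += len;

        if (cp < 0x10000)
        {
            out.push_back(static_cast<uint16_t>(cp));
        }
        else
        {
            if (cp > 0x10FFFF)
                throw std::invalid_argument("code point out of Unicode range");

            // Supplementary code point: emit high surrogate, then low surrogate.
            cp -= 0x10000;
            assert(cp <= 0xFFFFF);
            out.push_back(static_cast<uint16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<uint16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }

    return out;
}
```

Given the 4-byte sequence F0 90 90 B7 (code point U+10437), this sketch yields the two code units 0xD801 and 0xDC37, which is the pair used as the example elsewhere in this thread.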
[jira] [Commented] (IGNITE-3140) C++: UTF-16 surrogate symbols are not serialized properly
[ https://issues.apache.org/jira/browse/IGNITE-3140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15286443#comment-15286443 ]

Igor Sapego commented on IGNITE-3140:

Ok, I get it. It seems that our new method {{BinaryUtils.utf8BytesToStr}} does not really support all valid UTF-8 strings. I'll add a related test and see how C++ can deal with that.
[jira] [Commented] (IGNITE-3140) C++: UTF-16 surrogate symbols are not serialized properly
[ https://issues.apache.org/jira/browse/IGNITE-3140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15286334#comment-15286334 ]

Igor Sapego commented on IGNITE-3140:

Denis,

According to [wikipedia|https://en.wikipedia.org/wiki/UTF-8#Description], code points between {{U+0800}} and {{U+}} are serialized using 3 bytes in UTF-8, so everything seems to conform to the specification in our case. Though these code points themselves may be considered invalid by some implementations, the encoding is still valid.

The C++ standard itself does not specify string encoding in any way and does not include functions to operate on encodings, so there is no such thing as serialization in the encoding sense on the C++ side. This means that whatever you put into a C++ string remains usable, because in C++ a string is just a sequence of characters of a specified size.

So I simply can't serialize a UTF-16 string on the C++ side unless I write the serialization algorithm myself or use a third-party implementation.
[jira] [Commented] (IGNITE-3140) C++: UTF-16 surrogate symbols are not serialized properly
[ https://issues.apache.org/jira/browse/IGNITE-3140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15286028#comment-15286028 ]

Denis Magda commented on IGNITE-3140:

Igor,

The new serialization algorithm on the Java side serializes every symbol greater than 0x07FF using 3 bytes. This means that if a String contains a valid surrogate pair such as {{0xD801, 0xDC37}}, the new algorithm will use 6 bytes to encode it, while basic UTF-8 coders/decoders will use only 4 bytes.

The C++ side won't be able to properly deserialize {{0xD801, 0xDC37}} because it will be encoded in 6 bytes.

Try to serialize this String on the C++ side. It should be encoded in 4 bytes, while the new Java algorithm encodes it in 6 bytes.

{noformat}
str = new String(new char[] {0xD801, 0xDC37});
{noformat}
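The 4-byte versus 6-byte difference described above can be reproduced with a small C++ sketch. It is an illustration under assumptions, not the Ignite implementation: the names EncodeCodePoint, EncodeStandard and EncodePerUnit are hypothetical, and EncodePerUnit only mirrors the behaviour as described in the comment (every UTF-16 unit greater than 0x07FF written as 3 bytes).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: plain UTF-8 bit layout for one value up to U+10FFFF.
std::string EncodeCodePoint(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    std::string out;

    if (cp < 0x80)
    {
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }

    return out;
}

// Standard behaviour: a valid surrogate pair is first combined into one
// supplementary code point, which then takes 4 bytes.
std::string EncodeStandard(const std::vector<uint16_t>& units)
{
    std::string out;

    for (size_t i = 0; i < units.size(); ++i)
    {
        uint32_t cp = units[i];
        bool high = cp >= 0xD800 && cp <= 0xDBFF;

        if (high && i + 1 < units.size() &&
            units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF)
        {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i + 1] - 0xDC00);
            ++i; // the low surrogate has been consumed
        }

        out += EncodeCodePoint(cp);
    }

    return out;
}

// Behaviour as described in the comment: each UTF-16 unit is encoded on its
// own, so each half of a surrogate pair takes 3 bytes (6 bytes per pair).
std::string EncodePerUnit(const std::vector<uint16_t>& units)
{
    std::string out;

    for (uint16_t u : units)
        out += EncodeCodePoint(u);

    return out;
}
```

For the pair {0xD801, 0xDC37}, EncodeStandard produces the 4 bytes F0 90 90 B7, while EncodePerUnit produces the 6 bytes ED A0 81 ED B0 B7. A decoder that expects the first form will not reconstruct the pair from the second unless it handles it explicitly.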
[jira] [Commented] (IGNITE-3140) C++: UTF-16 surrogate symbols are not serialized properly
[ https://issues.apache.org/jira/browse/IGNITE-3140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285021#comment-15285021 ]

Igor Sapego commented on IGNITE-3140:

Denis,

C++ uses UTF-8 encoding right now, and as long as Java and .NET nodes write strings in UTF-8 we are not going to have any problems with deserialization. On the C++ side we just copy the received string bytes without performing any complex processing and use them as a string. As long as it is valid UTF-8 data (and in our Binary protocol it is), everything is going to work just fine.
[jira] [Commented] (IGNITE-3140) C++: UTF-16 surrogate symbols are not serialized properly
[ https://issues.apache.org/jira/browse/IGNITE-3140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284919#comment-15284919 ]

Denis Magda commented on IGNITE-3140:

Igor,

In a heterogeneous cluster (Java and C++ nodes), the C++ side won't be able to properly deserialize surrogate symbols from a String serialized on the Java side if the new algorithm (from IGNITE-3098) is used. This is the reason why we should rewrite the string serialization logic on the C++ side as well.
[jira] [Commented] (IGNITE-3140) C++: UTF-16 surrogate symbols are not serialized properly
[ https://issues.apache.org/jira/browse/IGNITE-3140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284659#comment-15284659 ]

Igor Sapego commented on IGNITE-3140:

We do not deal with UTF-16 in the C++ code. We expect the user to provide us with strings in valid UTF-8 format. Added a test with a malformed UTF-8 string where we expect an exception.
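As an illustration of what rejecting a malformed UTF-8 string with an exception might look like, here is a minimal C++ sketch. It is not the test added in the pull request: ValidateUtf8, IsValidUtf8 and the allowSurrogates flag are hypothetical names introduced only for this example. The flag is there because, as discussed in this thread, 3-byte encodings of surrogate values are rejected by strict validators but may be produced by a per-unit encoder.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical validator (not Ignite code): throws std::invalid_argument if
// `s` is not well-formed UTF-8. When allowSurrogates is true, 3-byte
// encodings of U+D800..U+DFFF are tolerated.
void ValidateUtf8(const std::string& s, bool allowSurrogates = false)
{
    size_t i = 0;

    while (i < s.size())
    {
        unsigned char b0 = static_cast<unsigned char>(s[i]);
        size_t len;
        uint32_t cp;
        uint32_t min; // smallest code point allowed for this sequence length

        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (len > s.size() - i)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (size_t k = 1; k < len; ++k)
        {
            unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }

        if (cp < min)
            throw std::invalid_argument("overlong UTF-8 encoding");
        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of Unicode range");
        if (!allowSurrogates && cp >= 0xD800 && cp <= 0xDFFF)
            throw std::invalid_argument("surrogate code point in UTF-8");

        i += len;
        assert(i <= s.size());
    }
}

// Convenience wrapper that turns the exception into a boolean result.
bool IsValidUtf8(const std::string& s, bool allowSurrogates = false)
{
    try
    {
        ValidateUtf8(s, allowSurrogates);
        return true;
    }
    catch (const std::invalid_argument&)
    {
        return false;
    }
}
```

With this sketch, a lone lead byte such as C3 is rejected, F0 90 90 B7 is accepted, and the 6-byte form ED A0 81 ED B0 B7 is accepted only when allowSurrogates is set.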