[jira] [Commented] (IGNITE-5655) Introduce pluggable string encoder/decoder
[ https://issues.apache.org/jira/browse/IGNITE-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16283464#comment-16283464 ] Andrey Kuznetsov commented on IGNITE-5655: -- I'd like to conduct more subtle investigation on impact of this change on some big enough data. Until this is done I won't suggest to merge such a significant change to master. > Introduce pluggable string encoder/decoder > -- > > Key: IGNITE-5655 > URL: https://issues.apache.org/jira/browse/IGNITE-5655 > Project: Ignite > Issue Type: New Feature > Components: binary >Affects Versions: 2.0 >Reporter: Valentin Kulichenko >Assignee: Andrey Kuznetsov > Labels: iep-2, important > Fix For: 2.4 > > > Currently binary marshaller encodes strings in UTF-8. However, sometimes it > makes sense to serialize strings with different encodings to save space. > Let's add global property to control String encoding and customize our binary > protocol to support it. For instance, we can add another flag > {{ENCODED_STRING}}, which will write strings as follows: > [flag][encoding_flag][str_len][str_bytes] > First implementation should set preferred encoding for strings in > BinaryConfiguration. This setting is optional, default encoding is UTF-8. > Currently, the same BinaryConfiguration is used for all cluster nodes, thus > no encoding clashes are possible. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (IGNITE-5655) Introduce pluggable string encoder/decoder
[ https://issues.apache.org/jira/browse/IGNITE-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16185807#comment-16185807 ] Vladimir Ozerov commented on IGNITE-5655: - [~andrey-kuznetsov], I'll have a look thanks. Meanwhile I'd like to move it AI 2.4. We haven't merged a single fix to binary protocol for many months so far. Mainly because we added a lot of other platforms and clients, and persistence. So every protocol change require cooperation of many contributors. I am in the process of preparing separate WIKI page with all protocol-related changes [1]. Once ready I will start a new thread on the dev list where I'd like to discuss on how to manage such changes - either through a single ticket or multiple tickets; whether every change should be propagated to all platforms/clients in a single version, or it is possible to spread them between versions (e.g. Java part in 2.3, .NET in 2.4, etc). Once we agree on the process, we will try doing our best to accumulate all the changes in 2.4 version. [1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-2%3A+Binary+object+format+improvements > Introduce pluggable string encoder/decoder > -- > > Key: IGNITE-5655 > URL: https://issues.apache.org/jira/browse/IGNITE-5655 > Project: Ignite > Issue Type: New Feature > Components: binary >Affects Versions: 2.0 >Reporter: Valentin Kulichenko >Assignee: Andrey Kuznetsov > Labels: iep-2, important > Fix For: 2.3 > > > Currently binary marshaller encodes strings in UTF-8. However, sometimes it > makes sense to serialize strings with different encodings to save space. > Let's add global property to control String encoding and customize our binary > protocol to support it. For instance, we can add another flag > {{ENCODED_STRING}}, which will write strings as follows: > [flag][encoding_flag][str_len][str_bytes] > First implementation should set preferred encoding for strings in > BinaryConfiguration. This setting is optional, default encoding is UTF-8. > Currently, the same BinaryConfiguration is used for all cluster nodes, thus > no encoding clashes are possible. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (IGNITE-5655) Introduce pluggable string encoder/decoder
[ https://issues.apache.org/jira/browse/IGNITE-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140009#comment-16140009 ] Andrey Kuznetsov commented on IGNITE-5655: -- [~vozerov], thanks for your comments. As for ignoring encoding byte in binary fields comparison (#7), it works without changes right now, since we have only one non-UTF-8 encoding. But it's fragile, and I'll fix it, of course. As for #7.2, we don't need to deserialize fields when encodings match, since we just test for equality (as opposed to {{c1.compareTo(c2)}}). > Introduce pluggable string encoder/decoder > -- > > Key: IGNITE-5655 > URL: https://issues.apache.org/jira/browse/IGNITE-5655 > Project: Ignite > Issue Type: New Feature > Components: binary >Affects Versions: 2.0 >Reporter: Valentin Kulichenko >Assignee: Andrey Kuznetsov > Fix For: 2.2 > > > Currently binary marshaller encodes strings in UTF-8. However, sometimes it > makes sense to serialize strings with different encodings to save space. > Let's add global property to control String encoding and customize our binary > protocol to support it. For instance, we can add another flag > {{ENCODED_STRING}}, which will write strings as follows: > [flag][encoding_flag][str_len][str_bytes] > First implementation should set preferred encoding for strings in > BinaryConfiguration. This setting is optional, default encoding is UTF-8. > Currently, the same BinaryConfiguration is used for all cluster nodes, thus > no encoding clashes are possible. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (IGNITE-5655) Introduce pluggable string encoder/decoder
[ https://issues.apache.org/jira/browse/IGNITE-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138271#comment-16138271 ] Vladimir Ozerov commented on IGNITE-5655: - [~andrey-kuznetsov], my comments: 1) {{BinaryConfiguration.encoding}} - passing it as a {{byte}} is bad idea as it limits total number of available encodings. We should pass it as {{String}}, in the same way as it is done in Java classes. In case encoding is not supported, an exception should be thrown. 2) {{OdbcColumnMeta, OdbcMessageParser}} - unnecessary changes; ODBC doesn't distinguish between String types, this is our internal implementation detail. 3) {{SqlListenerNioListener}} - another unnecessary change. We encode strings to save space in storage and page memory. Encoding strings when sending error message doesn't make sense. 4) {{SqlListenerUtils}} - unnecessary changes, as encoded strings are not supported in JDBC/ODBC at the moment (I do not see any changes in drivers). Hence, we should always talk to drivers using {{UTF-8}} strings. 5) {{BinaryStringEncoding}} - should be a part of public API, so let's move it to {{org.apache.ignite.binary}} package. 6) {{BinaryReaderExImpl.fieldFlagNames}} - looks like an overkill to me. Just concatenate two well-known flags by hand. 7) {{BinarySerializedFieldComparator}} - it is illegal to ignore encoding byte. You may endup with situation where two different strings are encoded with different encoding, but have similar binary representation. Your code will report {{true}}, while {{false}} should be returned. Correct flow here: 7.1) If objects have different encodings - deserialize them and compare using String.compareTo 7.2) If encodings are same, we need to understand whether encoded representation is comparable. For example, suppose that in certain encoding {{A}} represented as {{2}}, and {{B}} is represented as {{1}}. Then {{A.compareTo(B) == -1}}, but {{compareArrays(serialize(A), serialize(B) == 1}}, which is wrong. E.g. see \[1\]. We need to think on how to expose this properly, but I think this is a different task. For now let's just restrict any binary comparison of non-UTF8 strings and always deserialize them. 8) Feature is definitely undertested. At the very least we need the following tests: 8.1) Run SQL queries with custom encoding 8.2) Start two nodes with different encoding and see how they interact with each other. Not covered cases which I propose to implement separately: ODBC, JDBC, C++, .NET, efficient ocmparison for inlined indexes and BinarySerializedFieldComparator. The rest looks good to me. Thanks for contribution! \[1\] https://dev.mysql.com/doc/refman/5.7/en/charset-general.html > Introduce pluggable string encoder/decoder > -- > > Key: IGNITE-5655 > URL: https://issues.apache.org/jira/browse/IGNITE-5655 > Project: Ignite > Issue Type: New Feature > Components: binary >Affects Versions: 2.0 >Reporter: Valentin Kulichenko >Assignee: Andrey Kuznetsov > Fix For: 2.2 > > > Currently binary marshaller encodes strings in UTF-8. However, sometimes it > makes sense to serialize strings with different encodings to save space. > Let's add global property to control String encoding and customize our binary > protocol to support it. For instance, we can add another flag > {{ENCODED_STRING}}, which will write strings as follows: > [flag][encoding_flag][str_len][str_bytes] > First implementation should set preferred encoding for strings in > BinaryConfiguration. This setting is optional, default encoding is UTF-8. > Currently, the same BinaryConfiguration is used for all cluster nodes, thus > no encoding clashes are possible. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (IGNITE-5655) Introduce pluggable string encoder/decoder
[ https://issues.apache.org/jira/browse/IGNITE-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120193#comment-16120193 ] Andrey Kuznetsov commented on IGNITE-5655: -- [~devozerov], please review my changes. > Introduce pluggable string encoder/decoder > -- > > Key: IGNITE-5655 > URL: https://issues.apache.org/jira/browse/IGNITE-5655 > Project: Ignite > Issue Type: New Feature > Components: binary >Affects Versions: 2.0 >Reporter: Valentin Kulichenko >Assignee: Andrey Kuznetsov > Fix For: 2.2 > > > Currently binary marshaller encodes strings in UTF-8. However, sometimes it > makes sense to serialize strings with different encodings to save space. > Let's add global property to control String encoding and customize our binary > protocol to support it. For instance, we can add another flag > {{ENCODED_STRING}}, which will write strings as follows: > [flag][encoding_flag][str_len][str_bytes] > First implementation should set preferred encoding for strings in > BinaryConfiguration. This setting is optional, default encoding is UTF-8. > Currently, the same BinaryConfiguration is used for all cluster nodes, thus > no encoding clashes are possible. -- This message was sent by Atlassian JIRA (v6.4.14#64029)