[jira] [Commented] (IGNITE-5655) Introduce pluggable string encoder/decoder

2017-12-08 Thread Andrey Kuznetsov (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16283464#comment-16283464
 ] 

Andrey Kuznetsov commented on IGNITE-5655:
--

I'd like to conduct more subtle investigation on impact of this change on some 
big enough data. Until this is done I won't suggest to merge such a significant 
change to master.

> Introduce pluggable string encoder/decoder
> --
>
> Key: IGNITE-5655
> URL: https://issues.apache.org/jira/browse/IGNITE-5655
> Project: Ignite
>  Issue Type: New Feature
>  Components: binary
>Affects Versions: 2.0
>Reporter: Valentin Kulichenko
>Assignee: Andrey Kuznetsov
>  Labels: iep-2, important
> Fix For: 2.4
>
>
> Currently binary marshaller encodes strings in UTF-8. However, sometimes it 
> makes sense to serialize strings with different encodings to save space. 
> Let's add global property to control String encoding and customize our binary 
> protocol to support it. For instance, we can add another flag 
> {{ENCODED_STRING}}, which will write strings as follows:
> [flag][encoding_flag][str_len][str_bytes]
> First implementation should set preferred encoding for strings in 
> BinaryConfiguration. This setting is optional, default encoding is UTF-8. 
> Currently, the same BinaryConfiguration is used for all cluster nodes, thus 
> no encoding clashes are possible.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (IGNITE-5655) Introduce pluggable string encoder/decoder

2017-09-29 Thread Vladimir Ozerov (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16185807#comment-16185807
 ] 

Vladimir Ozerov commented on IGNITE-5655:
-

[~andrey-kuznetsov], I'll have a look thanks. 

Meanwhile I'd like to move it AI 2.4. We haven't merged a single fix to binary 
protocol for many months so far. Mainly because we added a lot of other 
platforms and clients, and persistence. So every protocol change require 
cooperation of many contributors. I am in the process of preparing separate 
WIKI page with all protocol-related changes [1]. Once ready I will start a new 
thread on the dev list where I'd like to discuss on how to manage such changes 
- either through a single ticket or multiple tickets; whether every change 
should be propagated to all platforms/clients in a single version, or it is 
possible to spread them between versions (e.g. Java part in 2.3, .NET in 2.4, 
etc). Once we agree on the process, we will try doing our best to accumulate 
all the changes in 2.4 version.

[1] 
https://cwiki.apache.org/confluence/display/IGNITE/IEP-2%3A+Binary+object+format+improvements

> Introduce pluggable string encoder/decoder
> --
>
> Key: IGNITE-5655
> URL: https://issues.apache.org/jira/browse/IGNITE-5655
> Project: Ignite
>  Issue Type: New Feature
>  Components: binary
>Affects Versions: 2.0
>Reporter: Valentin Kulichenko
>Assignee: Andrey Kuznetsov
>  Labels: iep-2, important
> Fix For: 2.3
>
>
> Currently binary marshaller encodes strings in UTF-8. However, sometimes it 
> makes sense to serialize strings with different encodings to save space. 
> Let's add global property to control String encoding and customize our binary 
> protocol to support it. For instance, we can add another flag 
> {{ENCODED_STRING}}, which will write strings as follows:
> [flag][encoding_flag][str_len][str_bytes]
> First implementation should set preferred encoding for strings in 
> BinaryConfiguration. This setting is optional, default encoding is UTF-8. 
> Currently, the same BinaryConfiguration is used for all cluster nodes, thus 
> no encoding clashes are possible.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (IGNITE-5655) Introduce pluggable string encoder/decoder

2017-08-24 Thread Andrey Kuznetsov (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140009#comment-16140009
 ] 

Andrey Kuznetsov commented on IGNITE-5655:
--

[~vozerov], thanks for your comments.

As for ignoring encoding byte in binary fields comparison (#7), it works 
without changes right now, since we have only one non-UTF-8 encoding. But it's 
fragile, and I'll fix it, of course.

As for #7.2, we don't need to deserialize fields when encodings match, since we 
just test for equality (as opposed to {{c1.compareTo(c2)}}).

> Introduce pluggable string encoder/decoder
> --
>
> Key: IGNITE-5655
> URL: https://issues.apache.org/jira/browse/IGNITE-5655
> Project: Ignite
>  Issue Type: New Feature
>  Components: binary
>Affects Versions: 2.0
>Reporter: Valentin Kulichenko
>Assignee: Andrey Kuznetsov
> Fix For: 2.2
>
>
> Currently binary marshaller encodes strings in UTF-8. However, sometimes it 
> makes sense to serialize strings with different encodings to save space. 
> Let's add global property to control String encoding and customize our binary 
> protocol to support it. For instance, we can add another flag 
> {{ENCODED_STRING}}, which will write strings as follows:
> [flag][encoding_flag][str_len][str_bytes]
> First implementation should set preferred encoding for strings in 
> BinaryConfiguration. This setting is optional, default encoding is UTF-8. 
> Currently, the same BinaryConfiguration is used for all cluster nodes, thus 
> no encoding clashes are possible.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (IGNITE-5655) Introduce pluggable string encoder/decoder

2017-08-23 Thread Vladimir Ozerov (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138271#comment-16138271
 ] 

Vladimir Ozerov commented on IGNITE-5655:
-

[~andrey-kuznetsov], my comments:
1) {{BinaryConfiguration.encoding}} - passing it as a {{byte}} is bad idea as 
it limits total number of available encodings. We should pass it as {{String}}, 
in the same way as it is done in Java classes. In case encoding is not 
supported, an exception should be thrown.
2) {{OdbcColumnMeta, OdbcMessageParser}} - unnecessary changes; ODBC doesn't 
distinguish between String types, this is our internal implementation detail.
3) {{SqlListenerNioListener}} - another unnecessary change. We encode strings 
to save space in storage and page memory. Encoding strings when sending error 
message doesn't make sense.
4) {{SqlListenerUtils}} - unnecessary changes, as encoded strings are not 
supported in JDBC/ODBC at the moment (I do not see any changes in drivers). 
Hence, we should always talk to drivers using {{UTF-8}} strings.
5) {{BinaryStringEncoding}} - should be a part of public API, so let's move it 
to {{org.apache.ignite.binary}} package.
6) {{BinaryReaderExImpl.fieldFlagNames}} - looks like an overkill to me. Just 
concatenate two well-known flags by hand.
7) {{BinarySerializedFieldComparator}} - it is illegal to ignore encoding byte. 
You may endup with situation where two different strings are encoded with 
different encoding, but have similar binary representation. Your code will 
report {{true}}, while {{false}} should be returned. Correct flow here:
7.1) If objects have different encodings - deserialize them and compare using 
String.compareTo
7.2) If encodings are same, we need to understand whether encoded 
representation is comparable. For example, suppose that in certain encoding 
{{A}} represented as {{2}}, and {{B}} is represented as {{1}}. Then 
{{A.compareTo(B) == -1}}, but {{compareArrays(serialize(A), serialize(B) == 
1}}, which is wrong. E.g. see \[1\]. We need to think on how to expose this 
properly, but I think this is a different task. For now let's just restrict any 
binary comparison of non-UTF8 strings and always deserialize them.
8) Feature is definitely undertested. At the very least we need the following 
tests:
8.1) Run SQL queries with custom encoding
8.2) Start two nodes with different encoding and see how they interact with 
each other.

Not covered cases which I propose to implement separately: ODBC, JDBC, C++, 
.NET, efficient ocmparison for inlined indexes and 
BinarySerializedFieldComparator.

The rest looks good to me. Thanks for contribution!

\[1\] https://dev.mysql.com/doc/refman/5.7/en/charset-general.html

> Introduce pluggable string encoder/decoder
> --
>
> Key: IGNITE-5655
> URL: https://issues.apache.org/jira/browse/IGNITE-5655
> Project: Ignite
>  Issue Type: New Feature
>  Components: binary
>Affects Versions: 2.0
>Reporter: Valentin Kulichenko
>Assignee: Andrey Kuznetsov
> Fix For: 2.2
>
>
> Currently binary marshaller encodes strings in UTF-8. However, sometimes it 
> makes sense to serialize strings with different encodings to save space. 
> Let's add global property to control String encoding and customize our binary 
> protocol to support it. For instance, we can add another flag 
> {{ENCODED_STRING}}, which will write strings as follows:
> [flag][encoding_flag][str_len][str_bytes]
> First implementation should set preferred encoding for strings in 
> BinaryConfiguration. This setting is optional, default encoding is UTF-8. 
> Currently, the same BinaryConfiguration is used for all cluster nodes, thus 
> no encoding clashes are possible.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (IGNITE-5655) Introduce pluggable string encoder/decoder

2017-08-09 Thread Andrey Kuznetsov (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120193#comment-16120193
 ] 

Andrey Kuznetsov commented on IGNITE-5655:
--

[~devozerov], please review my changes.

> Introduce pluggable string encoder/decoder
> --
>
> Key: IGNITE-5655
> URL: https://issues.apache.org/jira/browse/IGNITE-5655
> Project: Ignite
>  Issue Type: New Feature
>  Components: binary
>Affects Versions: 2.0
>Reporter: Valentin Kulichenko
>Assignee: Andrey Kuznetsov
> Fix For: 2.2
>
>
> Currently binary marshaller encodes strings in UTF-8. However, sometimes it 
> makes sense to serialize strings with different encodings to save space. 
> Let's add global property to control String encoding and customize our binary 
> protocol to support it. For instance, we can add another flag 
> {{ENCODED_STRING}}, which will write strings as follows:
> [flag][encoding_flag][str_len][str_bytes]
> First implementation should set preferred encoding for strings in 
> BinaryConfiguration. This setting is optional, default encoding is UTF-8. 
> Currently, the same BinaryConfiguration is used for all cluster nodes, thus 
> no encoding clashes are possible.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)