Re: IGNITE-5655: Mixing binary string encodings in Ignite cluster

2017-09-11 Thread Dmitriy Setrakyan
Vova, generally agree, but why not also support per-cache (per-table)
settings?


Re: IGNITE-5655: Mixing binary string encodings in Ignite cluster

2017-09-11 Thread Vladimir Ozerov
OK, here is the implementation plan I propose:
1) Add a global character set configuration -
IgniteConfiguration.characterSet. Note that it is located in
IgniteConfiguration, not BinaryConfiguration.
2) All cluster nodes must have the same character set.
3) Once defined, the character set can never be changed. In the future we
will probably provide import/export utilities to help users migrate
between character sets. Such strict behavior is normal for other major
DBMS vendors (e.g. Oracle), so it should work for us as well.
4) We will add a "characterSet" property to all clients (ODBC, JDBC, thin
client). It will be validated during the handshake phase, and an exception
is thrown in case of mismatch.
5) In the future we will work on relaxing these restrictions in favor of
runtime conversions on the fly.
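
To illustrate points 1 and 4, a minimal sketch of how the proposed setting
could look. Both setCharacterSet() and the characterSet client property are
hypothetical - neither exists in Ignite today - so the imagined calls are
shown as comments:

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class CharacterSetConfigSketch {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Point 1: a single global setting on IgniteConfiguration (not
        // BinaryConfiguration). Point 2: identical on every node.
        // cfg.setCharacterSet("windows-1251");

        Ignition.start(cfg);

        // Point 4: a JDBC/ODBC/thin client would declare the same value,
        // validated during handshake, e.g.:
        // jdbc:ignite:thin://127.0.0.1?characterSet=windows-1251
    }
}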

Thoughts?



Re: IGNITE-5655: Mixing binary string encodings in Ignite cluster

2017-09-11 Thread Vladimir Ozerov
Dima,

You contradict yourself - you vote for per-column encoding on the one hand,
but say that it is "over-architected" on the other. This is exactly what I
am talking about - anything more than a hard-coded cluster-wide encoding is
complex. You cannot simply define a per-column encoding. In addition, you
would either have to pass information about this encoding to all cluster
members and to all clients, so that they construct the correct binary
object in the first place, or you would have to re-convert the binary
object on the fly, which is what I suggested. There is no simple solution
here.

I vote for a cluster-wide encoding for now, but with transparent conversion
when needed.


Re: IGNITE-5655: Mixing binary string encodings in Ignite cluster

2017-09-06 Thread Dmitriy Setrakyan
I would agree with Andrey; it does look a bit over-architected to me. Why
would anyone try to move data from one encoding to another? Is it a real
use case that needs to be handled automatically?

Here is what I think we should handle:

   1. Ability to set a cluster-wide encoding. This should be easy.
   2. Ability to set a per-column encoding, perhaps at cache creation or
   table creation. For example, at cache creation time, we could let the
   user define all column names that will have non-default encodings (see
   the sketch after this list).
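
To make item 2 concrete, here is a sketch of where such a per-column
setting could live. setFieldEncoding() is imagined - no such method exists
in the Ignite API - and the "Person" value type is just a placeholder:

import java.util.Collections;

import org.apache.ignite.cache.QueryEntity;
import org.apache.ignite.configuration.CacheConfiguration;

public class PerColumnEncodingSketch {
    public static void main(String[] args) {
        // Declare the table shape at cache creation time.
        QueryEntity entity = new QueryEntity("java.lang.String", "Person");
        entity.addQueryField("name", "java.lang.String", null);

        // Imagined API: override the cluster-wide default for "name".
        // entity.setFieldEncoding("name", "windows-1251");

        CacheConfiguration<String, Object> ccfg =
            new CacheConfiguration<>("person");
        ccfg.setQueryEntities(Collections.singletonList(entity));
    }
}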

Thoughts?

D.

On Wed, Sep 6, 2017 at 6:27 AM, Andrey Kuznetsov wrote:

> As for option #1, it's not so bad. We have currently implemented a global
> encoding switch, and this looks similar to a DBMS: if the server works
> with a certain encoding, then all clients should be configured to use the
> same encoding for correct string processing.
>
> Option #2 provokes a number of questions.
>
> What are the performance implications of such hidden binary reencoding?
>
> Who will check for possible data loss on transparent reencoding (when an
> object walks between caches/fields with distinct encodings)?
>
> How should we handle nested binary objects? On the one hand, they should
> be reencoded in the way described by Vladimir. On the other hand,
> BinaryObject is an independent entity that can be serialized/deserialized
> freely, moved between various data structures, etc. It will be
> frustrating for the user to find its binary state changed after storing
> it in the grid, with possible data corruption.
>
>
> As far as I can see, we are trying to couple orthogonal APIs:
> BinaryMarshaller, IgniteCache and SQL. BinaryMarshaller is
> Java-datatype-driven: it creates a 1-to-1 mapping between Java types and
> their binary representations, and now we are trying to map two binary
> types (STRING and ENCODED_STRING) to a single String class. IgniteCache
> is a much more flexible API than SQL, but it lacks an encoded-string
> datatype that exists in the SQL dialects of some RDBMSs: `varchar(n)
> character set some_charset`. It's not a popular idea, but many problems
> could be solved by adding such a type. Those IgniteCache API users who
> don't need it won't use it, but it could become a bridge between the SQL
> and BinaryMarshaller encoded-string types.
>
> --
> Best regards,
>   Andrey Kuznetsov.
>
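
A possible answer to the data-loss question Andrey raises above, sketched
with plain java.nio (this is not existing Ignite logic): a CharsetEncoder
can report unmappable characters up front instead of silently replacing
them.

import java.nio.charset.Charset;

public class LossCheckSketch {
    // Returns true only if every character of s survives encoding into
    // the target charset without replacement.
    static boolean losslesslyEncodable(String s, Charset target) {
        return target.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");
        System.out.println(losslesslyEncodable("привет", cp1251)); // true
        System.out.println(losslesslyEncodable("日本語", cp1251));   // false
    }
}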


Re: IGNITE-5655: Mixing binary string encodings in Ignite cluster

2017-09-06 Thread Vladimir Ozerov
What we tried to achieve is that several encodings could co-exist in a
single cluster or even a single cache. This would be great from a UX
perspective. However, from what Andrey wrote, I understand that this would
be pretty hard to achieve, as we rely heavily on identical binary
representations of the objects being compared. That said, while this could
work for SQL with some adjustments, we would have severe problems with
BinaryObject.equals().

Let's think about how we can resolve this. I see two options:
1) Allow only a single encoding in the whole cluster. Easy to implement,
but very bad from a usability perspective. This would especially affect
clients - client nodes, and what is worse, drivers and thin clients! They
all would have to bother about which encoding to use. But maybe we can
share this information during the handshake (as every client has one).

2) Add a custom encoding flag/ID to the object header if a non-standard
encoding appears somewhere inside the object (even in nested objects).
This way, we will be able to re-create the object when the expected and
actual encodings don't match. For example, consider two caches/tables with
different encodings (not implemented in the current iteration, but we may
decide to implement per-cache encodings in the future, as many RDBMSs
support this). If I then move object A from cache 1 with UTF-8 encoding to
cache 2 with Cp1251 encoding, the encoding mismatch will be detected
through the object header (or footer) and the object will be re-built
transparently for the user.

The second option is preferable to me as a long-term solution, but it
would require more effort.
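
A rough sketch of the conversion step from option 2, using plain java.nio.
The per-object charset comes from the hypothetical header flag/ID described
above; Ignite's real binary layout is not shown:

import java.nio.charset.Charset;

public class ReencodeSketch {
    // Decode the string payload with the charset recorded in the
    // (hypothetical) header flag, then encode it with the charset the
    // target cache expects. A lossy target charset may silently replace
    // unmappable characters, which is a data-loss risk of its own.
    static byte[] reencode(byte[] payload, Charset actual, Charset expected) {
        if (actual.equals(expected))
            return payload; // fast path: encodings already match

        return new String(payload, actual).getBytes(expected);
    }
}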

Thoughts?


Re: IGNITE-5655: Mixing binary string encodings in Ignite cluster

2017-09-05 Thread Dmitriy Setrakyan
Can we just detect the encoding at the cache level, or at least at the
column level? This way, if the encodings do not match, we throw an
exception immediately.
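
A tiny sketch of that fail-fast check - every name here is hypothetical,
since no such validation exists in Ignite today:

public class EncodingGuardSketch {
    // Compare the encoding recorded for the cache (or column) with the
    // one the writing node uses; reject the operation on mismatch.
    static void validateEncoding(String cacheEncoding, String nodeEncoding) {
        if (!cacheEncoding.equalsIgnoreCase(nodeEncoding))
            throw new IllegalStateException("Encoding mismatch: cache uses "
                + cacheEncoding + ", node uses " + nodeEncoding);
    }
}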

Will it work?

D.



IGNITE-5655: Mixing binary string encodings in Ignite cluster

2017-09-05 Thread Andrey Kuznetsov
Hi Igniters!

I've run into a couple of issues related to different binary string
encoding settings on different cluster nodes.

Suppose the cluster has two nodes. Node0 uses win-1251 to marshal strings
with BinaryMarshaller, while Node1 uses the default utf-8 encoding. Let's
create a replicated cache and add an entry on Node0:

node0.cache("myCache").put("k", "v");

Then

node1.cache("myCache").get("k")

returns null.

Let me describe the cause. First, the string key comes to Node1 as the
binary payload of a DHT update request, so it has win-1251 encoding. This
representation stays in the offheap area of Node1. Then a GetTask arrives
with the same key as a plain (Serializable) Java object; BinaryMarshaller
encodes the key using utf-8 (Node1's setting). Finally, the B+Tree lookup
fails for this binary key due to the different encodings.

When the key is just a string, this can be fixed by fully decoding binary
strings on B+Tree lookups. But when the key is an arbitrary object with
some strings inside, this approach is too expensive.
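
To make the failure concrete, a minimal standalone demonstration (plain
JDK, no Ignite involved) of why a byte-wise comparison cannot match the
two representations of the same string:

import java.nio.charset.Charset;
import java.util.Arrays;

public class EncodingMismatchDemo {
    public static void main(String[] args) {
        // "ключ" ("key") is four Cyrillic characters.
        byte[] fromNode0 = "ключ".getBytes(Charset.forName("windows-1251"));
        byte[] fromNode1 = "ключ".getBytes(Charset.forName("UTF-8"));

        System.out.println(fromNode0.length); // 4: one byte per character
        System.out.println(fromNode1.length); // 8: two bytes per character
        System.out.println(Arrays.equals(fromNode0, fromNode1)); // false
    }
}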

The second issue relates to lossy string encodings. A mixed-encoding
cluster does not guarantee string data integrity when a "lossless" node
goes down for a while.

Any ideas on addressing these issues?

-- 
Best regards,
  Andrey Kuznetsov.