What we tried to achieve is that several encodings could co-exist in a
single cluster, or even in a single cache. This would be great from a UX
perspective. However, from what Andrey wrote, I understand that this
would be pretty hard to achieve, as we rely heavily on compared objects
having identical binary representations. While this could work for SQL
with some adjustments, we would have severe problems with
BinaryObject.equals().
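
To illustrate the equals() problem: binary comparison boils down to
comparing serialized bytes, and the same string produces different bytes
under different charsets. A minimal JDK-only sketch (plain Java, no
Ignite APIs involved):

    import java.nio.charset.Charset;
    import java.util.Arrays;

    public class EncodingMismatch {
        public static void main(String[] args) {
            // Same logical string marshalled on nodes with different
            // encodings.
            String s = "ключ";
            byte[] utf8 = s.getBytes(Charset.forName("UTF-8"));
            byte[] win1251 = s.getBytes(Charset.forName("windows-1251"));

            // A byte-level comparison -- effectively what a binary
            // equals() or a B+Tree key lookup does -- fails even though
            // the strings are logically equal.
            System.out.println(Arrays.equals(utf8, win1251)); // false
        }
    }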

Let's think about how we can resolve this. I see two options:
1) Allow only a single encoding in the whole cluster. Easy to implement,
but very bad from a usability perspective. This would especially affect
clients - client nodes and, what is worse, drivers and thin clients! They
would all have to care about which encoding to use. But maybe we can
share this information during the handshake (every client performs one).

2) Add a custom encoding flag/ID to the object header if a non-standard
encoding appears anywhere inside the object (even in nested objects).
This way, we will be able to re-create the object if the expected and
actual encodings do not match. For example, consider two caches/tables
with different encodings (not implemented in the current iteration, but
we may decide to implement per-cache encodings in the future, as
virtually any RDBMS supports them). Now suppose I move object A from
cache 1 with UTF-8 encoding to cache 2 with Cp1251 encoding. In this
case the encoding mismatch is detected through the object header (or
footer) and the object is rebuilt transparently for the user, as
sketched below.
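
A rough sketch of the rebuild step, assuming a hypothetical header flag
has already told us the source encoding (the header layout and the
helper below are illustrative, not existing Ignite APIs; the transcoding
itself is plain JDK):

    import java.nio.charset.Charset;

    public class StringTranscoder {
        // Re-encodes raw string bytes from the encoding recorded in a
        // (hypothetical) object header to the encoding the target cache
        // expects. Note: lossy if the target charset cannot represent
        // some of the characters.
        static byte[] transcode(byte[] raw, String srcEnc, String dstEnc) {
            String decoded = new String(raw, Charset.forName(srcEnc));
            return decoded.getBytes(Charset.forName(dstEnc));
        }

        public static void main(String[] args) {
            // Object came from cache 1; its (hypothetical) header flag
            // says the strings inside are UTF-8.
            byte[] fromCache1 =
                "значение".getBytes(Charset.forName("UTF-8"));
            // Cache 2 expects Cp1251, so rebuild the bytes transparently.
            byte[] forCache2 = transcode(fromCache1, "UTF-8", "windows-1251");
            System.out.println(
                new String(forCache2, Charset.forName("windows-1251")));
        }
    }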

The second option is preferable to me as a long-term solution, but it
would require more effort.

Thoughts?

On Wed, Sep 6, 2017 at 3:33 AM, Dmitriy Setrakyan <[email protected]>
wrote:

> Can we just detect the encoding at the cache level, or at least at the
> column level? This way, if the encodings do not match, we throw an
> exception immediately.
>
> Will it work?
>
> D.
>
> On Tue, Sep 5, 2017 at 9:16 AM, Andrey Kuznetsov <[email protected]>
> wrote:
>
> > Hi Igniters!
> >
> > I have run into a couple of issues related to different binary string
> > encoding settings on different cluster nodes.
> >
> > Suppose the cluster has two nodes: Node0 uses win-1251 to marshal
> > strings with BinaryMarshaller, while Node1 uses the default utf-8
> > encoding. Let's create a replicated cache and add an entry on Node0:
> >
> > node0.cache("myCache").put("k", "v");
> >
> > Then
> >
> > node1.cache("myCache").get("k")
> >
> > returns null.
> >
> > Let me describe the cause. First, the string key comes to Node1 as the
> > binary payload of a DHT update request, and it has win-1251 encoding.
> > This representation stays in the offheap area of Node1. Then a GetTask
> > arrives with the same key as a plain (Serializable) Java object;
> > BinaryMarshaller encodes the key using utf-8 (Node1's setting).
> > Finally, the B+Tree lookup fails for this binary key due to the
> > different encodings.
> >
> > When the key is just a string, this can be fixed by fully decoding
> > binary strings on B+Tree lookups. But when the key is an arbitrary
> > object with some strings inside, this approach is too expensive.
> >
> > The second issue relates to lossy string encodings. A mixed-encoding
> > cluster does not guarantee string data integrity when the "lossless"
> > node goes down for a while.
> >
> > Any ideas on addressing these issues?
> >
> > --
> > Best regards,
> >   Andrey Kuznetsov.
> >
>
