Re: IGNITE-5655: Mixing binary string encodings in Ignite cluster

Dmitriy Setrakyan Wed, 06 Sep 2017 18:51:10 -0700

I would agree with Andrey, it does look a bit over-architected to me. Why
would anyone try to move data from one encoding to another? Is it a real
use case that needs to be handled automatically?


Here is what I think we should handle:

   1. Ability to set a cluster-wide encoding. This should be easy.
   2. Ability to set a per-column encoding. Such an encoding should be set at
   the column level, perhaps at cache or table creation. For example, at
   cache creation time, we could let the user list all the columns that
   will have non-default encodings.
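
A minimal sketch of what these two settings could amount to, written in
Python for brevity and not as Ignite API. The names `CacheEncodingConfig`,
`default_encoding` and `column_encodings` are hypothetical and exist only
for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CacheEncodingConfig:
    """Hypothetical per-cache settings: a default plus per-column overrides."""
    default_encoding: str = "utf-8"                        # item 1: cluster-wide default
    column_encodings: dict = field(default_factory=dict)   # item 2: per-column overrides

    def encoding_for(self, column: str) -> str:
        # Columns that are not listed explicitly fall back to the default.
        return self.column_encodings.get(column, self.default_encoding)

    def encode_row(self, row: dict) -> dict:
        # Encode only string values; other types pass through untouched.
        return {
            col: val.encode(self.encoding_for(col)) if isinstance(val, str) else val
            for col, val in row.items()
        }


# Declared once, at cache creation time: only "city" deviates from the default.
cfg = CacheEncodingConfig(column_encodings={"city": "cp1251"})
encoded = cfg.encode_row({"name": "Иван", "city": "Москва", "age": 30})
```

Here the set of non-default columns is fixed up front, so no object ever
needs to carry its own encoding information.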

Thoughts?

D.

On Wed, Sep 6, 2017 at 6:27 AM, Andrey Kuznetsov <[email protected]> wrote:

> As for option #1, it's not so bad. Currently we've implemented a global
> encoding switch, and this looks similar to a DBMS: if the server works with
> a certain encoding, then all clients should be configured to use the same
> encoding for correct string processing.
>
> Option #2 provokes a number of questions.
>
> What are the performance implications of such hidden binary reencoding?
>
> Who will check for possible data loss on transparent reencoding (when an
> object moves between caches/fields with distinct encodings)?
>
> How should we handle nested binary objects? On the one hand, they should be
> reencoded in the way described by Vladimir. On the other hand, a BinaryObject
> is an independent entity that can be serialized/deserialized freely, moved
> between various data structures, etc. It will be frustrating for a user to
> find its binary state changed after storing it in the grid, with possible
> data corruption.
>
>
> As far as I can see, we are trying to couple orthogonal APIs:
> BinaryMarshaller, IgniteCache and SQL. BinaryMarshaller is
> Java-datatype-driven: it creates a 1-to-1 mapping between Java types and
> their binary representations, and now we are trying to map two binary types
> (STRING and ENCODED_STRING) to a single String class. IgniteCache is a much
> more flexible API than SQL, but it lacks an encoded string datatype, which
> exists in the SQL dialects of some RDBMSs:
> `varchar(n) character set some_charset`.
> It's not a popular idea, but many problems could be solved by adding such a
> type. Those IgniteCache API users who don't need it won't use it, but it
> could become a bridge between SQL and BinaryMarshaller encoded-string
> types.
>
> 2017-09-06 10:32 GMT+03:00 Vladimir Ozerov <[email protected]>:
>
> > What we tried to achieve is that several encodings could co-exist in a
> > single cluster or even a single cache. This would be great from a UX
> > perspective. However, from what Andrey wrote, I understand that this
> > would be pretty hard to achieve, as we rely heavily on similar binary
> > representation of objects being compared. That said, while this could
> > work for SQL with some adjustments, we will have severe problems with
> > BinaryObject.equals().
> >
> > Let's think about how we can resolve this. I see two options:
> > 1) Allow only a single encoding in the whole cluster. Easy to implement,
> > but very bad from a usability perspective. This would especially affect
> > clients: client nodes and, what is worse, drivers and thin clients! They
> > would all have to care about which encoding to use. But maybe we can
> > share this information during the handshake (as every client has a
> > handshake).
> >
> > 2) Add a custom encoding flag/ID to the object header if a non-standard
> > encoding appears somewhere inside the object (even in nested objects).
> > This way, we will be able to re-create the object if the expected and
> > actual encodings don't match. For example, consider two caches/tables
> > with different encodings (not implemented in the current iteration, but
> > we may decide to implement per-cache encodings in the future, as any
> > RDBMS supports them). Suppose I then decide to move object A from cache 1
> > with UTF-8 encoding to cache 2 with Cp1251 encoding. In this case I will
> > detect the encoding mismatch through the object header (or footer) and
> > rebuild the object transparently for the user.
> >
> > The second option is preferable to me as a long-term solution, but would
> > require more effort.
> >
> > Thoughts?
> >
> > --
> Best regards,
>   Andrey Kuznetsov.
>
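
The BinaryObject.equals() concern quoted above comes down to the fact that
the same string has different bytes under different encodings. A small
Python illustration, not Ignite code, assuming equality is decided by
comparing the serialized bytes:

```python
text = "Привет"

utf8_bytes = text.encode("utf-8")      # 2 bytes per Cyrillic letter
cp1251_bytes = text.encode("cp1251")   # 1 byte per Cyrillic letter

# A byte-wise comparison treats these as two different values...
same_bytes = utf8_bytes == cp1251_bytes

# ...even though both decode back to the same string.
same_text = utf8_bytes.decode("utf-8") == cp1251_bytes.decode("cp1251")
```

Any comparison that works on the binary form therefore needs either a single
encoding per compared field or a decoding step before comparing.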

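Option 2 in the quoted message (an encoding ID in the object header, with a
transparent rebuild on mismatch) and the data-loss question raised against
it can be sketched together. This is a toy model in Python, not Ignite code:
`EncodedObject` and `put` are hypothetical names, and the "header" is just a
field holding the encoding name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodedObject:
    """Toy stand-in for a binary object: an encoding ID plus a string payload."""
    encoding: str    # plays the role of the header flag from option 2
    payload: bytes


def put(obj: EncodedObject, cache_encoding: str) -> EncodedObject:
    """Return obj unchanged if encodings match, otherwise rebuild it."""
    if obj.encoding == cache_encoding:
        return obj
    text = obj.payload.decode(obj.encoding)
    try:
        # Strict mode raises instead of silently substituting characters.
        rebuilt = text.encode(cache_encoding, errors="strict")
    except UnicodeEncodeError as e:
        raise ValueError(
            f"cannot re-encode {obj.encoding} -> {cache_encoding} without data loss"
        ) from e
    return EncodedObject(cache_encoding, rebuilt)


# Cyrillic text is representable in Cp1251, so the rebuild succeeds.
a = EncodedObject("utf-8", "Привет".encode("utf-8"))
moved = put(a, "cp1251")

# CJK text has no Cp1251 mapping, so the move is rejected rather than corrupted.
b = EncodedObject("utf-8", "日本語".encode("utf-8"))
try:
    put(b, "cp1251")
    lossy_rejected = False
except ValueError:
    lossy_rejected = True
```

The sketch makes the cost visible: every mismatched move pays for a full
decode and encode, and someone has to decide what happens when the target
encoding cannot represent the data.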