Re: IGNITE-5655: Mixing binary string encodings in Ignite cluster

Dmitriy Setrakyan Mon, 11 Sep 2017 15:01:01 -0700

Vova, generally agree, but why not also support per-cache (per-table)
settings?



On Mon, Sep 11, 2017 at 1:16 AM, Vladimir Ozerov <[email protected]>
wrote:

> OK, here is the implementation plan I propose:
> 1) Add global character set configuration -
> IgniteConfiguration.characterSet. Note, it is located in
> IgniteConfiguration, not BinaryConfiguration.
> 2) All cluster nodes must have the same character set.
> 3) Once defined, the character set can never be changed. In the future we
> will probably have import/export utilities, which will help users migrate
> between character sets. Such strict behavior is normal for other major DBMS
> vendors (e.g. Oracle), so it should work for us as well.
> 4) We will add a "characterSet" property to all clients (ODBC, JDBC, thin
> client). It will be validated during the handshake phase. An exception is
> thrown in case of a mismatch.
> 5) In the future we will work on relaxing these restrictions in favor of
> runtime conversions on the fly.
>
> Thoughts?
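
For illustration only, items 2 and 4 of the plan above could be sketched roughly as follows. This is a minimal sketch, not Ignite code: the class `CharsetHandshake` and its `validate` method are hypothetical names, and the only assumption taken from the plan is that the client sends its character set name during the handshake and the server rejects a mismatch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch; none of these names are part of the Ignite API. */
final class CharsetHandshake {
    /** Character set fixed once for the whole cluster (plan items 1-3). */
    private final Charset clusterCharset;

    CharsetHandshake(Charset clusterCharset) {
        this.clusterCharset = clusterCharset;
    }

    /**
     * Validates the character set name a client announces during handshake
     * (plan item 4). Charset.forName itself throws for unknown or illegal names.
     */
    void validate(String clientCharsetName) {
        Charset clientCharset = Charset.forName(clientCharsetName);

        if (!clusterCharset.equals(clientCharset)) {
            throw new IllegalStateException("Character set mismatch: cluster="
                + clusterCharset.name() + ", client=" + clientCharset.name());
        }
    }

    public static void main(String[] args) {
        CharsetHandshake handshake = new CharsetHandshake(StandardCharsets.UTF_8);

        handshake.validate("UTF-8"); // Same character set: accepted.

        try {
            handshake.validate("ISO-8859-1"); // Different character set: rejected.
        }
        catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The same check would presumably apply to server nodes joining the cluster, since item 2 requires all of them to agree.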
>
>
>
> On Mon, Sep 11, 2017 at 11:01 AM, Vladimir Ozerov <[email protected]>
> wrote:
>
> > Dima,
> >
> > You contradict yourself: you vote for per-column encoding on the one hand,
> > but say that it is "over-architected" on the other. This is exactly
> > what I am talking about: anything more than a hard-coded cluster-wide
> > encoding is complex. You cannot simply define per-column encoding. In
> > addition, you should either pass information about this encoding to all
> > cluster members and to all clients, so that they construct the correct
> > binary object in the first place, or re-convert the binary object on the
> > fly, which is what I suggested. No simple solution here.
> >
> > I vote for cluster-wide encoding for now, but with transparent conversion
> > when needed.
> >
> >
> > On Thu, Sep 7, 2017 at 4:50 AM, Dmitriy Setrakyan <[email protected]>
> > wrote:
> >
> >> I would agree with Andrey, it does look a bit over-architected to me. Why
> >> would anyone try to move data from one encoding to another? Is it a real
> >> use case that needs to be handled automatically?
> >>
> >> Here is what I think we should handle:
> >>
> >>    1. Ability to set cluster-wide encoding. This should be easy.
> >>    2. Ability to set per-column encoding. Such encoding should be set at
> >>    the per-column level, perhaps at cache creation or table creation. For
> >>    example, at cache creation time, we could let the user define all
> >>    column names that will have non-default encodings.
> >>
> >> Thoughts?
> >>
> >> D.
> >>
> >> On Wed, Sep 6, 2017 at 6:27 AM, Andrey Kuznetsov <[email protected]>
> >> wrote:
> >>
> >> > As for option #1, it's not so bad. Currently we've implemented a global
> >> > encoding switch, and this looks similar to a DBMS: if the server works
> >> > with a certain encoding, then all clients should be configured to use
> >> > the same encoding for correct string processing.
> >> >
> >> > Option #2 provokes a number of questions.
> >> >
> >> > What are the performance implications of such hidden binary reencoding?
> >> >
> >> > Who will check for possible data loss on transparent reencoding (when an
> >> > object moves between caches/fields with distinct encodings)?
> >> >
> >> > How should we handle nested binary objects? On the one hand, they should
> >> > be reencoded in the way described by Vladimir. On the other hand,
> >> > BinaryObject is an independent entity that can be serialized/deserialized
> >> > freely, moved between various data structures, etc. It will be
> >> > frustrating for a user to find its binary state changed after storing it
> >> > in a grid, with possible data corruption.
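
To make the data-loss question above concrete, here is a minimal sketch using only the standard `java.nio.charset` API. `ReencodeCheck` is a hypothetical helper, not part of Ignite. It refuses to re-encode a string value when the target character set cannot represent every character. It assumes the JRE provides `windows-1251` (the Cp1251 mentioned later in the thread), which is common but not guaranteed by the Java specification.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper; not part of the Ignite API. */
final class ReencodeCheck {
    /** Returns true if every character of the text can be encoded in the target charset. */
    static boolean isLossless(String text, Charset target) {
        return target.newEncoder().canEncode(text);
    }

    /** Re-encodes bytes from one charset to another, failing instead of losing data. */
    static byte[] reencode(byte[] src, Charset from, Charset to) {
        // Caveat: this decoding step silently replaces malformed input with U+FFFD.
        String text = new String(src, from);

        if (!isLossless(text, to))
            throw new IllegalArgumentException("Value is not representable in " + to.name());

        return text.getBytes(to);
    }

    public static void main(String[] args) {
        // Assumption: windows-1251 is available in this JRE.
        Charset cp1251 = Charset.forName("windows-1251");

        // Cyrillic text fits into windows-1251, one byte per character.
        byte[] utf8 = "\u041f\u0440\u0438\u0432\u0435\u0442".getBytes(StandardCharsets.UTF_8);
        System.out.println(reencode(utf8, StandardCharsets.UTF_8, cp1251).length);

        // CJK text does not, so a transparent conversion would corrupt it.
        System.out.println(isLossless("\u65e5\u672c", cp1251));
    }
}
```

A check like this costs a full pass over every string, which is also part of the performance question.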
> >> >
> >> >
> >> > As far as I can see, we are trying to couple orthogonal APIs:
> >> > BinaryMarshaller, IgniteCache and SQL. BinaryMarshaller is
> >> > Java-datatype-driven: it creates a 1-to-1 mapping between Java types and
> >> > their binary representations, and now we are trying to map two binary
> >> > types (STRING and ENCODED_STRING) to a single String class. IgniteCache
> >> > is a much more flexible API than SQL, but it lacks the encoded string
> >> > datatype that exists in the SQL dialects of some RDBMSs:
> >> > `varchar(n) character set some_charset`. It's not a popular idea, but
> >> > many problems could be solved by adding such a type. Those IgniteCache
> >> > API users who don't need it won't use it, but it could become a bridge
> >> > between SQL and BinaryMarshaller encoded-string types.
> >> >
> >> > 2017-09-06 10:32 GMT+03:00 Vladimir Ozerov <[email protected]>:
> >> >
> >> > > What we tried to achieve is that several encodings could co-exist in a
> >> > > single cluster or even a single cache. This would be great from a UX
> >> > > perspective. However, from what Andrey wrote, I understand that this
> >> > > would be pretty hard to achieve, as we rely heavily on objects being
> >> > > compared having a similar binary representation. That said, while this
> >> > > could work for SQL with some adjustments, we will have severe problems
> >> > > with BinaryObject.equals().
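
The comparison problem can be shown without any Ignite code. The sketch below is an illustration under one assumption: that equality is decided on the serialized bytes, which is how the paragraph above describes the reliance on binary representation. `EncodedEquality` is a hypothetical name, and `windows-1251` is assumed to be available in the JRE.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical illustration; not how Ignite itself is implemented. */
final class EncodedEquality {
    /** Byte-wise comparison of the same text encoded in two charsets. */
    static boolean sameBytes(String text, Charset a, Charset b) {
        return Arrays.equals(text.getBytes(a), text.getBytes(b));
    }

    /** Comparison after decoding, which requires knowing each side's charset. */
    static boolean sameText(byte[] x, Charset xCharset, byte[] y, Charset yCharset) {
        return new String(x, xCharset).equals(new String(y, yCharset));
    }

    public static void main(String[] args) {
        // Assumption: windows-1251 is available in this JRE.
        Charset cp1251 = Charset.forName("windows-1251");
        String text = "\u041f\u0440\u0438\u0432\u0435\u0442";

        // Same logical string, different bytes: a byte-wise equals() reports a difference.
        System.out.println(sameBytes(text, StandardCharsets.UTF_8, cp1251));

        // Decoding both sides first restores equality, at the cost of extra work per comparison.
        System.out.println(sameText(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8,
            text.getBytes(cp1251), cp1251));
    }
}
```

Pure ASCII values encode identically in both character sets, so a mixed-encoding cluster could appear to work until the first non-ASCII string is compared.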
> >> > >
> >> > > Let's think about how we can resolve this. I see two options:
> >> > > 1) Allow only a single encoding in the whole cluster. Easy to
> >> > > implement, but very bad from a usability perspective. This would
> >> > > especially affect clients: client nodes and, what is worse, drivers
> >> > > and thin clients! They all would have to bother about which encoding
> >> > > to use. But maybe we can share this information during the handshake
> >> > > (as every client has a handshake).
> >> > >
> >> > > 2) Add a custom encoding flag/ID to the object header if a non-standard
> >> > > encoding appears somewhere inside the object (even in nested objects).
> >> > > This way, we will be able to re-create the object if the expected and
> >> > > actual encodings don't match. For example, consider two caches/tables
> >> > > with different encodings (not implemented in the current iteration, but
> >> > > we may decide to implement per-cache encodings in the future, as any
> >> > > RDBMS supports this). Then I decide to move object A from cache 1 with
> >> > > UTF-8 encoding to cache 2 with Cp1251 encoding. In this case I will
> >> > > detect the encoding mismatch through the object header (or footer) and
> >> > > re-build it transparently for the user.
> >> > >
> >> > > The second option is preferable to me as a long-term solution, but
> >> > > would require more effort.
> >> > >
> >> > > Thoughts?
> >> > >
> >> > > --
> >> > Best regards,
> >> >   Andrey Kuznetsov.
> >> >
> >>
> >
> >
>
