OK, here is the implementation plan I propose:

1) Add a global character set setting - IgniteConfiguration.characterSet. Note that it lives in IgniteConfiguration, not BinaryConfiguration.
2) All cluster nodes must use the same character set.
3) Once defined, the character set can never be changed. In the future we will probably provide import/export utilities to help users migrate between character sets. Such strict behavior is normal for other major DBMS vendors (e.g. Oracle), so it should work for us as well.
4) We will add a "characterSet" property to all clients (ODBC, JDBC, thin client). It will be validated during the handshake phase, and an exception is thrown in case of mismatch (see the sketch below).
5) In the future we will work on relaxing these restrictions in favor of runtime conversions on the fly.
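To make points 1 and 4 more concrete, here is a rough sketch of how it might look on the server side and at handshake time. None of this exists yet - IgniteConfiguration.setCharacterSet() and the validation helper below are assumptions drawn from the plan above, not an actual API:

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class CharacterSetSketch {
    public static void main(String[] args) {
        // Point 1: assumed cluster-wide property on IgniteConfiguration
        // (not BinaryConfiguration). setCharacterSet() does not exist today.
        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setCharacterSet("UTF-8"); // must be identical on every node and never changed later

        Ignition.start(cfg);
    }

    // Point 4: assumed handshake-time check on the server side. The client
    // (ODBC/JDBC/thin) declares its "characterSet" property; a mismatch fails the handshake.
    static void validateClientCharacterSet(String clusterCharSet, String clientCharSet) {
        if (!clusterCharSet.equalsIgnoreCase(clientCharSet))
            throw new IllegalStateException("Character set mismatch: cluster uses " +
                clusterCharSet + ", client requested " + clientCharSet);
    }
}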
Thoughts?

On Mon, Sep 11, 2017 at 11:01 AM, Vladimir Ozerov <[email protected]> wrote:

> Dima,
>
> You contradict yourself - you vote for per-column encoding on the one hand,
> but say that it is "over-architected" on the other. This is exactly what I am
> talking about - anything more than a hard-coded cluster-wide encoding is
> complex. You cannot simply define per-column encoding. In addition, you should
> either pass information about this encoding to all cluster members and to all
> clients, so that they construct the correct binary object in the first place,
> or you should re-convert the binary object on the fly, which is what I
> suggested. No simple solution here.
>
> I vote for cluster-wide encoding for now, but with transparent conversion
> when needed.
>
> On Thu, Sep 7, 2017 at 4:50 AM, Dmitriy Setrakyan <[email protected]> wrote:
>
>> I would agree with Andrey, it does look a bit over-architected to me. Why
>> would anyone try to move data from one encoding to another? Is it a real
>> use case that needs to be handled automatically?
>>
>> Here is what I think we should handle:
>>
>> 1. Ability to set a cluster-wide encoding. This should be easy.
>> 2. Ability to set per-column encoding. Such encoding should be set at the
>> per-column level, perhaps at cache creation or table creation. For example,
>> at cache creation time, we could let the user define all column names that
>> will have non-default encodings.
>>
>> Thoughts?
>>
>> D.
>>
>> On Wed, Sep 6, 2017 at 6:27 AM, Andrey Kuznetsov <[email protected]> wrote:
>>
>> > As of option #1, it's not so bad. Currently we've implemented a
>> > global-level encoding switch, and this looks similar to a DBMS: if the
>> > server works with a certain encoding, then all clients should be
>> > configured to use the same encoding for correct string processing.
>> >
>> > Option #2 provokes a number of questions.
>> >
>> > What are the performance implications of such hidden binary re-encoding?
>> >
>> > Who will check for possible data loss on transparent re-encoding (when an
>> > object walks between caches/fields with distinct encodings)?
>> >
>> > How should we handle nested binary objects? On the one hand, they should
>> > be re-encoded in the way described by Vladimir. On the other hand,
>> > BinaryObject is an independent entity that can be serialized/deserialized
>> > freely, moved between various data structures, etc. It will be
>> > frustrating for the user to find its binary state changed after storing
>> > it in a grid, with possible data corruption.
>> >
>> > As far as I can see, we are trying to couple orthogonal APIs:
>> > BinaryMarshaller, IgniteCache and SQL. BinaryMarshaller is
>> > Java-datatype-driven; it creates a 1-to-1 mapping between Java types and
>> > their binary representations, and now we are trying to map two binary
>> > types (STRING and ENCODED_STRING) to a single String class. IgniteCache
>> > is a much more flexible API than SQL, but it lacks an encoded string
>> > datatype that exists in the SQL of some RDBMSs:
>> > `varchar(n) character set some_charset`. It's not a popular idea, but
>> > many problems could be solved by adding such a type. Those IgniteCache
>> > API users who don't need it won't use it, but it could become a bridge
>> > between SQL and BinaryMarshaller encoded-string types.
>> >
>> > 2017-09-06 10:32 GMT+03:00 Vladimir Ozerov <[email protected]>:
>> >
>> > > What we tried to achieve is that several encodings could co-exist in a
>> > > single cluster or even a single cache. This would be great from a UX
>> > > perspective. However, from what Andrey wrote, I understand that this
>> > > would be pretty hard to achieve, as we rely heavily on a similar binary
>> > > representation of objects being compared. That said, while this could
>> > > work for SQL with some adjustments, we will have severe problems with
>> > > BinaryObject.equals().
>> > >
>> > > Let's think about how we can resolve this. I see two options:
>> > > 1) Allow only a single encoding in the whole cluster. Easy to
>> > > implement, but very bad from a usability perspective. This would
>> > > especially affect clients - client nodes, and what is worse, drivers
>> > > and thin clients! They would all have to care about which encoding to
>> > > use. But maybe we can share this information during the handshake (as
>> > > every client has a handshake).
>> > >
>> > > 2) Add a custom encoding flag/ID to the object header if a non-standard
>> > > encoding appears somewhere inside the object (even in nested objects).
>> > > This way, we will be able to re-create the object if the expected and
>> > > actual encodings don't match. For example, consider two caches/tables
>> > > with different encodings (not implemented in the current iteration, but
>> > > we may decide to implement per-cache encodings in the future, as many
>> > > RDBMSs support it). Then I decide to move object A from cache 1 with
>> > > UTF-8 encoding to cache 2 with Cp1251 encoding. In this case I will
>> > > detect the encoding mismatch through the object header (or footer) and
>> > > re-build the object transparently for the user.
>> > >
>> > > The second option is preferable to me as a long-term solution, but it
>> > > would require more effort.
>> > >
>> > > Thoughts?
>> >
>> > --
>> > Best regards,
>> > Andrey Kuznetsov.
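
P.S. On the runtime-conversion idea (point 5 above, and option 2 in the quoted thread): the byte-level re-encoding itself is trivial in Java; the hard parts are detecting the mismatch from the object header and keeping BinaryObject.equals() and data-loss checks correct. A minimal, hypothetical sketch of just the conversion step, nothing Ignite-specific:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Byte-level step behind "re-build it transparently for the user": re-encode a
// string payload from the charset it was written with into the charset the
// target cache expects. Everything around it (per-cache encoding metadata, the
// header flag, equality guarantees) is exactly what the thread identifies as
// the hard part.
final class StringReencodeSketch {
    static byte[] reencode(byte[] payload, Charset from, Charset to) {
        if (from.equals(to))
            return payload; // same encoding - keep the binary form untouched

        // Note: lossy if 'to' cannot represent every character in the payload.
        return new String(payload, from).getBytes(to);
    }

    public static void main(String[] args) {
        byte[] utf8 = "привет".getBytes(StandardCharsets.UTF_8);
        byte[] cp1251 = reencode(utf8, StandardCharsets.UTF_8, Charset.forName("windows-1251"));
        System.out.println(new String(cp1251, Charset.forName("windows-1251")));
    }
}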
