Vova, generally agree, but why not also support per-cache (per-table) settings?
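
A minimal sketch of how a per-cache setting could layer on top of the cluster-wide one. Everything here is hypothetical: `CharsetSettingsSketch`, `withCacheCharset` and `effective` are illustration-only names, not existing Ignite API, and the proposed `IgniteConfiguration.characterSet` property is represented only by the constructor argument.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical holder: one cluster-wide charset plus optional per-cache overrides. */
final class CharsetSettingsSketch {
    /** Cluster-wide default (stands in for the proposed global setting). */
    private final Charset clusterDflt;

    /** Per-cache overrides, keyed by cache name. */
    private final Map<String, Charset> perCache = new HashMap<>();

    CharsetSettingsSketch(Charset clusterDflt) {
        this.clusterDflt = clusterDflt;
    }

    /** Registers an override for a single cache and returns this for chaining. */
    CharsetSettingsSketch withCacheCharset(String cacheName, Charset cs) {
        perCache.put(cacheName, cs);
        return this;
    }

    /** Returns the per-cache charset if one is set, otherwise the cluster default. */
    Charset effective(String cacheName) {
        return perCache.getOrDefault(cacheName, clusterDflt);
    }
}
```

With this shape, caches that never set an override behave exactly as in the cluster-wide-only plan, so the per-cache setting would be purely additive.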

On Mon, Sep 11, 2017 at 1:16 AM, Vladimir Ozerov <[email protected]> wrote:

> OK, here is the implementation plan I propose:
> 1) Add a global character set configuration -
> IgniteConfiguration.characterSet. Note that it is located in
> IgniteConfiguration, not BinaryConfiguration.
> 2) All cluster nodes must have the same character set.
> 3) Once defined, the character set can never be changed. In the future we
> will probably have import/export utilities, which will help users migrate
> between character sets. Such strict behavior is normal for other major DBMS
> vendors (e.g. Oracle), so it should work for us as well.
> 4) We will add a "characterSet" property to all clients (ODBC, JDBC, thin
> client). It will be validated during the handshake phase. An exception is
> thrown in case of a mismatch.
> 5) In the future we will work on relaxing these restrictions in favor of
> runtime conversions on the fly.
>
> Thoughts?
>
> On Mon, Sep 11, 2017 at 11:01 AM, Vladimir Ozerov <[email protected]>
> wrote:
>
> > Dima,
> >
> > You contradict yourself - you vote for per-column encoding on the one
> > hand, but say that it is "over-architected" on the other. This is exactly
> > what I am talking about - anything more than a hard-coded cluster-wide
> > encoding is complex. You cannot simply define a per-column encoding. In
> > addition, you should either pass information about this encoding to all
> > cluster members and to all clients, so that they construct the correct
> > binary object in the first place, or you should re-convert the binary
> > object on the fly, which is what I suggested. No simple solution here.
> >
> > I vote for cluster-wide encoding for now, but with transparent conversion
> > when needed.
> >
> > On Thu, Sep 7, 2017 at 4:50 AM, Dmitriy Setrakyan <[email protected]>
> > wrote:
> >
> >> I would agree with Andrey, it does look a bit over-architected to me. Why
> >> would anyone try to move data from one encoding to another?
> >> Is it a real use case that needs to be handled automatically?
> >>
> >> Here is what I think we should handle:
> >>
> >> 1. Ability to set cluster-wide encoding. This should be easy.
> >> 2. Ability to set per-column encoding. Such encoding should be set at
> >> the per-column level, perhaps at cache creation or table creation. For
> >> example, at cache creation time, we could let the user define all column
> >> names that will have non-default encodings.
> >>
> >> Thoughts?
> >>
> >> D.
> >>
> >> On Wed, Sep 6, 2017 at 6:27 AM, Andrey Kuznetsov <[email protected]>
> >> wrote:
> >>
> >> > As for option #1, it's not so bad. Currently we've implemented a
> >> > global-level encoding switch, and this looks similar to a DBMS: if the
> >> > server works with a certain encoding, then all clients should be
> >> > configured to use the same encoding for correct string processing.
> >> >
> >> > Option #2 raises a number of questions.
> >> >
> >> > What are the performance implications of such hidden binary reencoding?
> >> >
> >> > Who will check for possible data loss on transparent reencoding (when
> >> > an object moves between caches/fields with distinct encodings)?
> >> >
> >> > How should we handle nested binary objects? On the one hand, they
> >> > should be reencoded in the way described by Vladimir. On the other
> >> > hand, BinaryObject is an independent entity that can be
> >> > serialized/deserialized freely, moved between various data structures,
> >> > etc. It will be frustrating for a user to find its binary state changed
> >> > after storing it in a grid, with possible data corruption.
> >> >
> >> > As far as I can see, we are trying to couple orthogonal APIs:
> >> > BinaryMarshaller, IgniteCache and SQL.
> >> > BinaryMarshaller is Java-datatype-driven: it creates a 1-to-1 mapping
> >> > between Java types and their binary representations, and now we are
> >> > trying to map two binary types (STRING and ENCODED_STRING) to a single
> >> > String class. IgniteCache is a much more flexible API than SQL, but it
> >> > lacks an encoded string datatype, which exists in the SQL dialects of
> >> > some RDBMSs: `varchar(n) character set some_charset`.
> >> > It's not a popular idea, but many problems could be solved by adding
> >> > such a type. Those IgniteCache API users who don't need it won't use
> >> > it, but it could become a bridge between SQL and BinaryMarshaller
> >> > encoded-string types.
> >> >
> >> > 2017-09-06 10:32 GMT+03:00 Vladimir Ozerov <[email protected]>:
> >> >
> >> > > What we tried to achieve is that several encodings could co-exist in
> >> > > a single cluster or even a single cache. This would be great from a
> >> > > UX perspective. However, from what Andrey wrote, I understand that
> >> > > this would be pretty hard to achieve, as we rely heavily on a similar
> >> > > binary representation of the objects being compared. That said, while
> >> > > this could work for SQL with some adjustments, we will have severe
> >> > > problems with BinaryObject.equals().
> >> > >
> >> > > Let's think about how we can resolve this. I see two options:
> >> > > 1) Allow only a single encoding in the whole cluster. Easy to
> >> > > implement, but very bad from a usability perspective. This would
> >> > > especially affect clients - client nodes and, what is worse, drivers
> >> > > and thin clients! They would all have to bother about which encoding
> >> > > to use. But maybe we can share this information during the handshake
> >> > > (as every client has a handshake).
> >> > >
> >> > > 2) Add a custom encoding flag/ID to the object header if a
> >> > > non-standard encoding appears somewhere inside the object (even in
> >> > > nested objects).
> >> > > This way, we will be able to re-create the object if needed when the
> >> > > expected and actual encodings don't match. For example, consider that
> >> > > we have two caches/tables with different encodings (not implemented
> >> > > in the current iteration, but we may decide to implement per-cache
> >> > > encodings in the future, as any RDBMS supports it). And then I decide
> >> > > to move object A from cache 1 with UTF-8 encoding to cache 2 with
> >> > > Cp1251 encoding. In this case I will detect the encoding mismatch
> >> > > through the object header (or footer) and re-build it transparently
> >> > > for the user.
> >> > >
> >> > > The second option is preferable to me as a long-term solution, but
> >> > > would require more effort.
> >> > >
> >> > > Thoughts?
> >> > >
> >> > --
> >> > Best regards,
> >> > Andrey Kuznetsov.
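
To make the "rebuild on mismatch" idea and the data-loss question from the quoted thread concrete, here is a minimal sketch of re-encoding a string payload between two charsets using only `java.nio.charset`. `ReencodeSketch` and `reencode` are hypothetical names; nothing here is the actual BinaryMarshaller code, and reading the source charset from an object header is assumed rather than shown.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

/** Hypothetical helper, not part of Ignite. */
final class ReencodeSketch {
    /**
     * Re-encodes string bytes from one charset to another.
     * Throws instead of silently substituting characters, so data loss
     * is reported to the caller rather than hidden.
     */
    static byte[] reencode(byte[] src, Charset from, Charset to) throws CharacterCodingException {
        if (from.equals(to))
            return src; // Encodings match: nothing to rebuild.

        // Decode strictly: malformed input is an error, not a replacement char.
        CharBuffer chars = from.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(src));

        // Encode strictly: a character the target charset cannot represent is an error.
        ByteBuffer out = to.newEncoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .encode(chars);

        byte[] res = new byte[out.remaining()];
        out.get(res);
        return res;
    }
}
```

For the UTF-8 to Cp1251 case mentioned above, the target would be obtained with `Charset.forName("windows-1251")`; that charset ships with typical full JDKs but is not among the charsets every Java platform is required to support. The strict `REPORT` actions are a design choice for this sketch: they turn Andrey's "who will check for data loss" question into an exception at the point of conversion.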
