@Ryan & Jon - thanks again for pursuing this - I think it'll be a big improvement.
IMHO, it'd be good to add a Requirements section to the doc. If the current Phoenix type system meets those requirements, then why not just go with that? I think we need a binary serialization spec that includes compound keys in the row key plus all the SQL primitive data types that we want to support (minimally all the SQL types that Phoenix currently supports).

@Nick - I like the abstraction of the DataType, but that doesn't solve the problem for non-Java usage. I'm also a bit worried that it might become a bottleneck for implementors of the serialization spec, as there are many different platform-specific operations that will likely be done on the row key. We can try to get everything necessary into the DataType interface, but I suspect that implementors will need to go under the covers at times rather than wait for another release of the module that defines the DataType interface.

Thanks,
James

On Wed, May 14, 2014 at 5:17 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
> On Tue, May 13, 2014 at 3:35 PM, Ryan Blue <rb...@cloudera.com> wrote:
>
> > I think there's a little confusion in what we are trying to accomplish.
> > What I want to do is to write a minimal specification for how to store a
> > set of types. I'm not trying to leave much flexibility, what I want is
> > clarity and simplicity.
> >
>
> This is admirable and was my initial goal as well. The trouble is, you
> cannot please everyone, current users and new. So, we decided it was better
> to provide a pluggable framework for extension + some basic implementations
> than to implement a closed system.
>
> > This is similar to OrderedBytes work, but a subset of it. A good example is
> > that while it's possible to use different encodings (avro, protobuf,
> > thrift, ...) it isn't practical for an application to support all of those
> > encodings. So for interoperability between Kite, Phoenix, and others, I
> > want a set of requirements that is as small as possible.
> >
>
> Minimal is good. The surface area of o.a.h.h.types is as large as it is
> because there was always "just one more" type to support or encoding to
> provide.
>
> > To make the requirements small, I used off-the-shelf protobuf [1] plus a
> > small set of memcmp encodings: ints, floats, and binary. That way, we don't
> > have to talk about how to make a memcmp Date in bytes, for example. A Date
> > is an int, which we know how to encode, and we can agree separately on how
> > a Date is represented (e.g., Julian vs unix epoch). [2] The same applies
> > to binary, where the encoding handles sorting and nulls, but not charsets.
> >
>
> I think you should focus on the primitives you want to support. The
> compound type stuff (ie, "rowkey encodings") is a can of worms because you
> need to support existing users, new users, novice users, and advanced
> users. Hence the interop between the DataType interface and the Struct
> classes. These work together to support all of these use-cases with the
> same basic code. For example, the protobuf encoding of position|wire-type +
> encoded value is easily implemented using Struct.
>
> I firmly believe that we cannot dictate rowkey composition. Applications,
> however, are free to implement their own. By using the common DataType
> interface, they can all interoperate.
>
> > This is the largest reason why I didn't include OrderedBytes directly in
> > the spec. For example, OB includes a varint that I don't think is needed. I
> > don't object to its inclusion in OB, but I think it isn't a necessary
> > requirement for implementing this spec.
> >
>
> Again, the surface area is as it is because of community consensus during
> the first phase of implementation. That consensus disagrees with you.
>
> > I think there are 3 things to clear up:
> > 1. What types from OB are not included, and why?
> > 2. Why not use OB-style structs?
> > 3. Why choose protobuf for complex records?
> >
> > Does that sound like a reasonable direction to head with this discussion?
> >
>
> Yes, sounds great!
>
> > As far as the DataType API, I think that works great with what I'm trying
> > to do. We'd build a DataType implementation for the encoding and the API
> > will let applications handle the underlying encoding. And other encoding
> > strategies can be swapped in as well, if we want to address shortcomings in
> > this one, or have another for a different use case.
> >
>
> I'm quite pleased to hear that. Applications like Kite, Phoenix, Kiji are
> the target audience of the DataType API.
>
> Thank you for picking back up this baton. It's sat for too long.
>
> -n
>
> > On 05/13/2014 02:33 PM, Nick Dimiduk wrote:
> >
> >> Breaking off hackathon thread.
> >>
> >> The conversation around HBASE-8089 concluded with two points:
> >> - HBase should provide support for order-preserving encodings while
> >>   not dropping support for the existing encoding formats.
> >> - HBase is not in the business of schema management; that is a
> >>   responsibility left to application developers.
> >>
> >> To handle the first point, OrderedBytes is provided. For supporting
> >> the second, the DataType API is introduced. By introducing this layer
> >> above specific encoding formats, it gives us a hook for plugging in
> >> different implementations and for helper utilities to ship with HBase,
> >> such as HBASE-10091.
> >>
> >> Things get fuzzy around complex data types: pojos, compound rowkeys (a
> >> special case of pojo), maps/dicts, and lists/arrays. These types are
> >> composed of other types and have different requirements based on where
> >> in the schema they're used. Again, by falling back on the DataType API,
> >> we give application developers an "out" for doing what makes the most
> >> sense for them.
> >>
> >> For compound rowkeys, the Struct class is designed to fill in this gap,
> >> sitting between data encoding and schema expression.
> >> It gives the application implementer, the person managing the schema,
> >> enough flexibility to express the key encoding in terms of the
> >> component types. These components are not limited to the simple
> >> primitives already defined, but any DataType implementation. Order
> >> preservation is likely important here.
> >>
> >> For arrays/lists, there's no implementation yet, but you can see how it
> >> might be done if you have a look at Struct. Order preservation may or
> >> may not be important for arrays/lists.
> >>
> >> The situation for maps/dicts is similar to arrays/lists. The one
> >> complication is the case where you want to map to a column family. How
> >> can these APIs support this thing?
> >>
> >> Pojos are a little more complicated. Probably Struct is sufficient for
> >> basic cases, but it doesn't support nice features like versioning --
> >> these are sacrificed in favor of order preservation. Luckily, there's
> >> plenty of tools out there for this already: Avro, MessagePack, Protobuf,
> >> Thrift, &c. There's no need to reinvent the wheel here. Application
> >> developers can implement the DataType API backed by their management
> >> tool of choice. I created HBASE-11161 and will post a patch shortly.
> >>
> >> Specific comments about the Hackathon notes inline.
> >>
> >> Thanks,
> >> Nick
> >>
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Cloudera, Inc.
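
For anyone following the thread, the idea of a memcmp-comparable int encoding, a Date stored as an int, and a compound row key built from such fields can be sketched roughly as below. This is only an illustration under assumptions: the class OrderedKeySketch and all of its methods are hypothetical and are not part of HBase, OrderedBytes, Phoenix, or Kite; the sign-bit flip is a common order-preserving technique, not necessarily the encoding any of those projects specifies; and days since the unix epoch is just one of the Date representations mentioned in the thread.

```java
import java.nio.ByteBuffer;
import java.time.LocalDate;

/** Hypothetical sketch; not an HBase, Phoenix, or Kite API. */
final class OrderedKeySketch {

    /** Encodes a signed int so that unsigned byte-wise order matches numeric order. */
    static byte[] encodeInt(int value) {
        // Flipping the sign bit maps MIN_VALUE..MAX_VALUE onto 0x00000000..0xFFFFFFFF.
        // ByteBuffer writes big-endian by default, so the most significant byte comes first.
        return ByteBuffer.allocate(4).putInt(value ^ Integer.MIN_VALUE).array();
    }

    /** Reverses encodeInt. */
    static int decodeInt(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt() ^ Integer.MIN_VALUE;
    }

    /** Assumption: a Date is represented as days since the unix epoch, stored as an int. */
    static byte[] encodeDate(LocalDate date) {
        return encodeInt(Math.toIntExact(date.toEpochDay()));
    }

    /** Concatenates fixed-width int fields into a compound key; field order defines sort order. */
    static byte[] compoundKey(int... fields) {
        ByteBuffer buf = ByteBuffer.allocate(4 * fields.length);
        for (int field : fields) {
            buf.put(encodeInt(field));
        }
        return buf.array();
    }

    /** memcmp-style comparison: unsigned, byte by byte; a shorter prefix sorts first. */
    static int memcmp(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

Because every field here is fixed width, plain concatenation keeps the compound key sortable. Variable-length fields such as binary or strings would need a terminator or escaping scheme, which this sketch does not attempt.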