I agree that a user-extensible interface is a required feature here. Personally, I'd love to ship a set of standard GIS tools on HBase. Let's keep in mind, though, that SQL and user applications are not the only consumers of this interface. A big motivation is allowing interop with the other higher-level MR languages. *cough* Where are my Pig and Hive peeps in this thread?
On Mon, Apr 1, 2013 at 11:33 PM, James Taylor <jtay...@salesforce.com> wrote:

> Maybe if we can keep nullability separate from the serialization/deserialization, we can come up with a solution that works? We're able to essentially infer that a column is null based on its value being missing or empty. So if an iterator through the row key bytes could detect/indicate that, then an application could "infer" the value is null.
>
> We're definitely planning on keeping byte[] accessors for use cases that need it. I'm curious on the geographic data case, though, could you use a fixed length long with a couple of new SQL built-ins to encode/decode the latitude/longitude?
>
> On 04/01/2013 11:29 PM, Jesse Yates wrote:
>> Actually, that isn't all that far-fetched of a format Matt - pretty common anytime anyone wants to do sortable lat/long (*cough* three letter agencies *cough*).
>>
>> Wouldn't we get the same by providing a simple set of libraries (ala orderly + other HBase useful things) and then still give access to the underlying byte array? Perhaps a nullable key type in that lib makes sense if lots of people need it and it would be nice to have standard libraries so tools could interop much more easily.
>> -------------------
>> Jesse Yates
>> @jesse_yates
>> jyates.github.com
>>
>> On Mon, Apr 1, 2013 at 11:17 PM, Matt Corgan <mcor...@hotpads.com> wrote:
>>> Ah, I didn't even realize sql allowed null key parts. Maybe a goal of the interfaces should be to provide first-class support for custom user types in addition to the standard ones included. Part of the power of hbase's plain byte[] keys is that users can concoct the perfect key for their data type. For example, I have a lot of geographic data where I interleave latitude/longitude bits into a sortable 64 bit value that would probably never be included in a standard library.
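For anyone following along, here is a rough sketch of the kind of key Matt describes: latitude and longitude scaled to 32 bits each and bit-interleaved into one 64-bit value, written big-endian so that plain unsigned byte[] comparison follows the interleaved order. This is not Matt's code. The class name GeoKeySketch, the linear scaling, and the choice of latitude in the odd bit positions are all assumptions made for illustration.

```java
// Hypothetical sketch only: names, scaling, and bit layout are assumptions.
final class GeoKeySketch {

    // Map a coordinate in [min, max] linearly onto the range [0, 2^32 - 1].
    static long scale(double value, double min, double max) {
        if (!(value >= min && value <= max)) { // also rejects NaN
            throw new IllegalArgumentException("out of range: " + value);
        }
        double fraction = (value - min) / (max - min);
        return (long) (fraction * 0xFFFFFFFFL);
    }

    // Spread the low 32 bits of x onto the even bit positions of a long.
    static long spread(long x) {
        x &= 0xFFFFFFFFL;
        x = (x | (x << 16)) & 0x0000FFFF0000FFFFL;
        x = (x | (x << 8)) & 0x00FF00FF00FF00FFL;
        x = (x | (x << 4)) & 0x0F0F0F0F0F0F0F0FL;
        x = (x | (x << 2)) & 0x3333333333333333L;
        x = (x | (x << 1)) & 0x5555555555555555L;
        return x;
    }

    // Latitude bits land on odd positions, longitude bits on even positions.
    static long interleave(double lat, double lon) {
        return (spread(scale(lat, -90.0, 90.0)) << 1)
                | spread(scale(lon, -180.0, 180.0));
    }

    // Big-endian bytes, so unsigned lexicographic order matches the
    // unsigned numeric order of the interleaved value.
    static byte[] toKey(double lat, double lon) {
        long z = interleave(lat, lon);
        byte[] key = new byte[8];
        for (int i = 7; i >= 0; i--) {
            key[i] = (byte) z;
            z >>>= 8;
        }
        return key;
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : toKey(37.7749, -122.4194)) {
            hex.append(String.format("%02x", b & 0xFF));
        }
        System.out.println(hex);
    }
}
```

A key like this is fixed-width and has no spare encoding for null, which is exactly why it is a useful test case for a user-extensible type interface.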
>>> On Mon, Apr 1, 2013 at 8:38 PM, Enis Söztutar <enis....@gmail.com> wrote:
>>>> I think having Int32, and NullableInt32 would support minimum overhead, as well as allowing SQL semantics.
>>>>
>>>> On Mon, Apr 1, 2013 at 7:26 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>>>>> Furthermore, is it more important to support null values than to squeeze all representations into minimum size (4-bytes for int32, &c.)?
>>>>>
>>>>> On Apr 1, 2013 4:41 PM, "Nick Dimiduk" <ndimi...@gmail.com> wrote:
>>>>>> On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <jtay...@salesforce.com> wrote:
>>>>>>> From the SQL perspective, handling null is important.
>>>>>>
>>>>>> From your perspective, it is critical to support NULLs, even at the expense of fixed-width encodings at all or supporting representation of a full range of values. That is, you'd rather be able to represent NULL than -2^31?
>>>>>>
>>>>>>> On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
>>>>>>>> Thanks for the thoughtful response (and code!).
>>>>>>>>
>>>>>>>> I'm thinking I will press forward with a base implementation that does not support nulls. The idea is to provide an extensible set of interfaces, so I think this will not box us into a corner later. That is, a mirroring package could be implemented that supports null values and accepts the relevant trade-offs.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Nick
>>>>>>>>
>>>>>>>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <mcor...@hotpads.com> wrote:
>>>>>>>>> I spent some time this weekend extracting bits of our serialization code to a public github repo at http://github.com/hotpads/data-tools. Contributions are welcome - i'm sure we all have this stuff laying around.
>>>>>>>>>
>>>>>>>>> You can see I've bumped into the NULL problem in a few places:
>>>>>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
>>>>>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
>>>>>>>>>
>>>>>>>>> Looking back, I think my latest opinion on the topic is to reject nullability as the rule since it can cause unexpected behavior and confusion.
>>>>>>>>> It's cleaner to provide a wrapper class (so both LongArrayList plus NullableLongArrayList) that explicitly defines the behavior, and costs a little more in performance. If the user can't find a pre-made wrapper class, it's not very difficult for each user to provide their own interpretation of null and check for it themselves.
>>>>>>>>>
>>>>>>>>> If you reject nullability, the question becomes what to do in situations where you're implementing existing interfaces that accept nullable params. The LongArrayList above implements List<Long> which requires an add(Long) method. In the above implementation I chose to swap nulls with Long.MIN_VALUE, however I'm now thinking it best to force the user to make that swap and then throw IllegalArgumentException if they pass null.
>>>>>>>>>
>>>>>>>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <doug.m...@explorysmedical.com> wrote:
>>>>>>>>>> Hmmm... good question.
>>>>>>>>>>
>>>>>>>>>> I think that fixed width support is important for a great many rowkey constructs cases, so I'd rather see something like losing MIN_VALUE and keeping fixed width.
>>>>>>>>>>
>>>>>>>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <ndimi...@gmail.com> wrote:
>>>>>>>>>>> Heya,
>>>>>>>>>>>
>>>>>>>>>>> Thinking about data types and serialization. I think null support is an important characteristic for the serialized representations, especially when considering the compound type. However, doing so is directly incompatible with fixed-width representations for numerics.
>>>>>>>>>>> For instance, if we want to have a fixed-width signed long stored on 8-bytes, where do you put null? float and double types can cheat a little by folding negative and positive NaN's into a single representation (this isn't strictly correct!), leaving a place to represent null. In the long example case, the obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by one. This will allocate an additional encoding which can be used for null. My experience working with scientific data, however, makes me wince at the idea.
>>>>>>>>>>>
>>>>>>>>>>> The variable-width encodings have it a little easier. There's already enough going on that it's simpler to make room.
>>>>>>>>>>>
>>>>>>>>>>> Remember, the final goal is to support order-preserving serialization. This imposes some limitations on our encoding strategies. For instance, it's not enough to simply encode null, it really needs to be encoded as 0x00 so as to sort lexicographically earlier than any other value.
>>>>>>>>>>>
>>>>>>>>>>> What do you think? Any ideas, experiences, etc?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Nick
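To make the trade-off in the thread concrete, here is a minimal sketch of one way to split the two cases, along the lines of the Int32 / NullableInt32 and LongArrayList / NullableLongArrayList suggestions. The plain encoding is a fixed 8 bytes and covers the full long range with no room for null. The nullable variant spends one extra header byte, 0x00 for null and 0x01 for a value, so null sorts before everything else. The class name OrderedLongSketch, the header values, and the 9-byte layout are assumptions for illustration, not a design anyone in this thread has settled on.

```java
// Hypothetical sketch only: class name, header bytes, and layout are assumptions.
final class OrderedLongSketch {

    // Fixed-width, order-preserving encoding: flip the sign bit and write
    // big-endian, so unsigned byte order matches signed numeric order.
    static byte[] encode(long value) {
        long flipped = value ^ Long.MIN_VALUE;
        byte[] out = new byte[8];
        for (int i = 7; i >= 0; i--) {
            out[i] = (byte) flipped;
            flipped >>>= 8;
        }
        return out;
    }

    static long decode(byte[] bytes, int offset) {
        long v = 0;
        for (int i = 0; i < 8; i++) {
            v = (v << 8) | (bytes[offset + i] & 0xFF);
        }
        return v ^ Long.MIN_VALUE;
    }

    // Nullable variant: fixed 9 bytes. A 0x00 header marks null (remaining
    // bytes stay zero); a 0x01 header is followed by the 8-byte encoding.
    static byte[] encodeNullable(Long value) {
        byte[] out = new byte[9];
        if (value == null) {
            return out;
        }
        out[0] = 0x01;
        System.arraycopy(encode(value), 0, out, 1, 8);
        return out;
    }

    static Long decodeNullable(byte[] bytes) {
        if (bytes[0] == 0x00) {
            return null;
        }
        return decode(bytes, 1);
    }

    // Unsigned lexicographic comparison of two byte arrays.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (c != 0) {
                return c;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        System.out.println(compareBytes(encode(-1L), encode(0L)) < 0);
        System.out.println(
                compareBytes(encodeNullable(null), encodeNullable(Long.MIN_VALUE)) < 0);
    }
}
```

The cost of the nullable variant is one byte per value, which is the overhead the non-nullable type avoids. The alternative Doug prefers, giving up MIN_VALUE to keep 8 bytes, would instead reserve one encoding inside the existing width.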