Re: HBase Types: Explicit Null Support

Dmitriy Ryaboy Wed, 03 Apr 2013 11:29:47 -0700

Hiya Nick,
Pig converts data for HBase storage using this class:
https://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/hbase/HBaseBinaryConverter.java
(which is mostly just calling into HBase's Bytes class). As long as Bytes
handles the null stuff, we'll just inherit the behavior.
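
To make "the null stuff" concrete, here is a minimal sketch of the trade-off discussed in the quoted thread below: a nullable long encoding that spends one marker byte so that null (0x00) sorts ahead of every value under unsigned byte-wise comparison, which is how HBase orders row keys. The class and method names are hypothetical and exist only for illustration; this is not HBase's Bytes API and not what Pig's HBaseBinaryConverter does today.

```java
// Hypothetical sketch: NOT the HBase Bytes API or Pig's HBaseBinaryConverter.
final class NullableLongCodec {
    static final byte NULL_MARKER = 0x00;     // sorts before every non-null encoding
    static final byte NOT_NULL_MARKER = 0x01;

    /** Encodes a possibly-null Long so unsigned byte order matches value order, nulls first. */
    static byte[] encode(Long value) {
        if (value == null) {
            return new byte[] { NULL_MARKER };
        }
        // Flip the sign bit so negative values compare below positive ones as unsigned bytes.
        long bits = value ^ Long.MIN_VALUE;
        byte[] out = new byte[9];
        out[0] = NOT_NULL_MARKER;
        for (int i = 0; i < 8; i++) {
            out[1 + i] = (byte) (bits >>> (56 - 8 * i)); // big-endian
        }
        return out;
    }

    /** Inverse of encode; returns null for the single-byte null marker. */
    static Long decode(byte[] encoded) {
        if (encoded.length == 1 && encoded[0] == NULL_MARKER) {
            return null;
        }
        if (encoded.length != 9 || encoded[0] != NOT_NULL_MARKER) {
            throw new IllegalArgumentException("not a NullableLongCodec encoding");
        }
        long bits = 0;
        for (int i = 1; i <= 8; i++) {
            bits = (bits << 8) | (encoded[i] & 0xFFL);
        }
        return bits ^ Long.MIN_VALUE;
    }

    /** Unsigned lexicographic comparison of two byte arrays. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }
}
```

The cost is the marker byte: non-null values take 9 bytes instead of 8 and null takes 1, so the encoding is no longer fixed-width, but the full range including Long.MIN_VALUE stays representable.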



On Tue, Apr 2, 2013 at 9:40 AM, Nick Dimiduk <[email protected]> wrote:

> I agree that a user-extensible interface is a required feature here.
> Personally, I'd love to ship a set of standard GIS tools on HBase. Let's
> keep in mind, though, that SQL and user applications are not the only
> consumers of this interface. A big motivation is allowing interop with the
> other higher MR languages. *cough* Where are my Pig and Hive peeps in this
> thread?
>
> On Mon, Apr 1, 2013 at 11:33 PM, James Taylor <[email protected]> wrote:
>
> > Maybe if we can keep nullability separate from the
> > serialization/deserialization, we can come up with a solution that works?
> > We're able to essentially infer that a column is null based on its value
> > being missing or empty. So if an iterator through the row key bytes could
> > detect/indicate that, then an application could "infer" the value is
> > null.
> >
> > We're definitely planning on keeping byte[] accessors for use cases that
> > need it. I'm curious on the geographic data case, though, could you use a
> > fixed length long with a couple of new SQL built-ins to encode/decode the
> > latitude/longitude?
> >
> >
> > On 04/01/2013 11:29 PM, Jesse Yates wrote:
> >
> >> Actually, that isn't all that far-fetched of a format Matt - pretty
> >> common anytime anyone wants to do sortable lat/long (*cough* three
> >> letter agencies *cough*).
> >>
> >> Wouldn't we get the same by providing a simple set of libraries (ala
> >> orderly + other HBase useful things) and then still give access to the
> >> underlying byte array? Perhaps a nullable key type in that lib makes
> >> sense if lots of people need it and it would be nice to have standard
> >> libraries so tools could interop much more easily.
> >> -------------------
> >> Jesse Yates
> >> @jesse_yates
> >> jyates.github.com
> >>
> >>
> >> On Mon, Apr 1, 2013 at 11:17 PM, Matt Corgan <[email protected]> wrote:
> >>
> >>> Ah, I didn't even realize sql allowed null key parts.  Maybe a goal of
> >>> the interfaces should be to provide first-class support for custom user
> >>> types in addition to the standard ones included.  Part of the power of
> >>> hbase's plain byte[] keys is that users can concoct the perfect key for
> >>> their data type.  For example, I have a lot of geographic data where I
> >>> interleave latitude/longitude bits into a sortable 64 bit value that
> >>> would probably never be included in a standard library.
> >>>
> >>>
> >>> On Mon, Apr 1, 2013 at 8:38 PM, Enis Söztutar <[email protected]> wrote:
> >>>
> >>>> I think having Int32, and NullableInt32 would support minimum overhead,
> >>>> as well as allowing SQL semantics.
> >>>>
> >>>>
> >>>> On Mon, Apr 1, 2013 at 7:26 PM, Nick Dimiduk <[email protected]> wrote:
> >>>>
> >>>>> Furthermore, is it more important to support null values than squeeze
> >>>>> all representations into minimum size (4-bytes for int32, &c.)?
> >>>>> On Apr 1, 2013 4:41 PM, "Nick Dimiduk" <[email protected]> wrote:
> >>>>>
> >>>>>> On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <[email protected]> wrote:
> >>>>>>
> >>>>>>> From the SQL perspective, handling null is important.
> >>>>>>
> >>>>>> From your perspective, it is critical to support NULLs, even at the
> >>>>>> expense of fixed-width encodings at all or supporting representation
> >>>>>> of a full range of values. That is, you'd rather be able to represent
> >>>>>> NULL than -2^31?
> >>>>>>
> >>>>>>> On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
> >>>>>>>
> >>>>>>>> Thanks for the thoughtful response (and code!).
> >>>>>>>>
> >>>>>>>> I'm thinking I will press forward with a base implementation that does
> >>>>>>>> not support nulls. The idea is to provide an extensible set of
> >>>>>>>> interfaces, so I think this will not box us into a corner later. That
> >>>>>>>> is, a mirroring package could be implemented that supports null values
> >>>>>>>> and accepts the relevant trade-offs.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Nick
> >>>>>>>>
> >>>>>>>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> I spent some time this weekend extracting bits of our serialization
> >>>>>>>>> code to a public github repo at http://github.com/hotpads/data-tools .
> >>>>>>>>> Contributions are welcome - i'm sure we all have this stuff laying
> >>>>>>>>> around.
> >>>>>>>>>
> >>>>>>>>> You can see I've bumped into the NULL problem in a few places:
> >>>>>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
> >>>>>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
> >>>>>>>>>
> >>>>>>>>> Looking back, I think my latest opinion on the topic is to reject
> >>>>>>>>> nullability as the rule since it can cause unexpected behavior and
> >>>>>>>>> confusion.  It's cleaner to provide a wrapper class (so both
> >>>>>>>>> LongArrayList plus NullableLongArrayList) that explicitly defines the
> >>>>>>>>> behavior, and costs a little more in performance.  If the user can't
> >>>>>>>>> find a pre-made wrapper class, it's not very difficult for each user
> >>>>>>>>> to provide their own interpretation of null and check for it
> >>>>>>>>> themselves.
> >>>>>>>>>
> >>>>>>>>> If you reject nullability, the question becomes what to do in
> >>>>>>>>> situations where you're implementing existing interfaces that accept
> >>>>>>>>> nullable params.  The LongArrayList above implements List<Long> which
> >>>>>>>>> requires an add(Long) method.  In the above implementation I chose to
> >>>>>>>>> swap nulls with Long.MIN_VALUE, however I'm now thinking it best to
> >>>>>>>>> force the user to make that swap and then throw
> >>>>>>>>> IllegalArgumentException if they pass null.
> >>>>
> >>>>>>>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hmmm... good question.
> >>>>>>>>>>
> >>>>>>>>>> I think that fixed width support is important for a great many rowkey
> >>>>>>>>>> constructs cases, so I'd rather see something like losing MIN_VALUE
> >>>>>>>>>> and keeping fixed width.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Heya,
> >>>>>>>>>>>
> >>>>>>>>>>> Thinking about data types and serialization. I think null support is
> >>>>>>>>>>> an important characteristic for the serialized representations,
> >>>>>>>>>>> especially when considering the compound type. However, doing so is
> >>>>>>>>>>> directly incompatible with fixed-width representations for numerics.
> >>>>>>>>>>> For instance, if we want to have a fixed-width signed long stored on
> >>>>>>>>>>> 8-bytes, where do you put null? float and double types can cheat a
> >>>>>>>>>>> little by folding negative and positive NaN's into a single
> >>>>>>>>>>> representation (this isn't strictly correct!), leaving a place to
> >>>>>>>>>>> represent null. In the long example case, the obvious choice is to
> >>>>>>>>>>> reduce MAX_VALUE or increase MIN_VALUE by one. This will allocate an
> >>>>>>>>>>> additional encoding which can be used for null. My experience working
> >>>>>>>>>>> with scientific data, however, makes me wince at the idea.
> >>>>>>>>>>>
> >>>>>>>>>>> The variable-width encodings have it a little easier. There's already
> >>>>>>>>>>> enough going on that it's simpler to make room.
> >>>>>>>>>>>
> >>>>>>>>>>> Remember, the final goal is to support order-preserving serialization.
> >>>>>>>>>>> This imposes some limitations on our encoding strategies. For
> >>>>>>>>>>> instance, it's not enough to simply encode null, it really needs to be
> >>>>>>>>>>> encoded as 0x00 so as to sort lexicographically earlier than any other
> >>>>>>>>>>> value.
> >>>>>>>>>>>
> >>>>>>>>>>> What do you think? Any ideas, experiences, etc?
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Nick
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >
>
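
The interleaved latitude/longitude key that Matt describes above, and the encode/decode built-ins James asks about, can be sketched roughly as follows. This is a generic Z-order (Morton) interleave under assumed conventions (each coordinate quantized to 32 bits, latitude in the odd bit positions); the thread does not show Matt's actual format, and the class and method names are hypothetical.

```java
// Hypothetical sketch of a sortable 64-bit lat/long key (Z-order interleave).
// The quantization and bit layout are assumptions, not the format used in the thread.
final class LatLongKey {
    private static final double SCALE = 0xFFFFFFFFL; // largest unsigned 32-bit value

    /** Maps a coordinate in [min, max] onto an unsigned 32-bit integer held in a long. */
    static long quantize(double value, double min, double max) {
        if (value < min || value > max) {
            throw new IllegalArgumentException("coordinate out of range: " + value);
        }
        return (long) ((value - min) / (max - min) * SCALE);
    }

    /** Approximate inverse of quantize. */
    static double dequantize(long q, double min, double max) {
        return min + (q / SCALE) * (max - min);
    }

    /** Spreads the low 32 bits of x into the even bit positions of a 64-bit value. */
    static long spread(long x) {
        x &= 0xFFFFFFFFL;
        x = (x | (x << 16)) & 0x0000FFFF0000FFFFL;
        x = (x | (x << 8)) & 0x00FF00FF00FF00FFL;
        x = (x | (x << 4)) & 0x0F0F0F0F0F0F0F0FL;
        x = (x | (x << 2)) & 0x3333333333333333L;
        x = (x | (x << 1)) & 0x5555555555555555L;
        return x;
    }

    /** Inverse of spread: gathers the even bit positions back into the low 32 bits. */
    static long compact(long x) {
        x &= 0x5555555555555555L;
        x = (x | (x >>> 1)) & 0x3333333333333333L;
        x = (x | (x >>> 2)) & 0x0F0F0F0F0F0F0F0FL;
        x = (x | (x >>> 4)) & 0x00FF00FF00FF00FFL;
        x = (x | (x >>> 8)) & 0x0000FFFF0000FFFFL;
        x = (x | (x >>> 16)) & 0x00000000FFFFFFFFL;
        return x;
    }

    /** Interleaves latitude (odd bits) and longitude (even bits) into one 64-bit key. */
    static long encode(double lat, double lon) {
        return (spread(quantize(lat, -90, 90)) << 1) | spread(quantize(lon, -180, 180));
    }

    /** Returns {latitude, longitude}, accurate to the quantization step. */
    static double[] decode(long key) {
        double lat = dequantize(compact(key >>> 1), -90, 90);
        double lon = dequantize(compact(key), -180, 180);
        return new double[] { lat, lon };
    }
}
```

The key is meant to be compared as an unsigned value (for example with Long.compareUnsigned, or as 8 big-endian bytes in a row key), and every bit pattern is a valid coordinate, which is exactly why a fixed-width format like this leaves no spare encoding for null.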
