Re: HBase Types: Explicit Null Support

Nick Dimiduk Thu, 04 Apr 2013 17:34:57 -0700

On Wed, Apr 3, 2013 at 11:29 AM, Dmitriy Ryaboy <[email protected]> wrote:


> Hiya Nick,
> Pig converts data for HBase storage using this class:
>
> https://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/hbase/HBaseBinaryConverter.java
> (which is mostly just calling into HBase's Bytes class). As long as Bytes
> handles the null stuff, we'll just inherit the behavior.
>

Dmitriy,

Precisely how this will be exposed via the hbase client is TBD. We won't be
deprecating the existing Bytes utility from the client view, so a new API
for supporting these types will be provided. I'll be able to provide
support and/or a patch for Pig (et al) once the implementation is a bit
further along.

My question for you as a Pig representative is more about how Pig users
expect Pig to handle NULLs. Are NULL values within a tuple a
common occurrence in Pig? In comparison, I'm thinking about the prevalence
of NULL in SQL.

Thanks,
Nick

On Tue, Apr 2, 2013 at 9:40 AM, Nick Dimiduk <[email protected]> wrote:
>
> > I agree that a user-extensible interface is a required feature here.
> > Personally, I'd love to ship a set of standard GIS tools on HBase. Let's
> > keep in mind, though, that SQL and user applications are not the only
> > consumers of this interface. A big motivation is allowing interop with the
> > other higher MR languages. *cough* Where are my Pig and Hive peeps in this
> > thread?
> >
> > On Mon, Apr 1, 2013 at 11:33 PM, James Taylor <[email protected]> wrote:
> >
> > > Maybe if we can keep nullability separate from the
> > > serialization/deserialization, we can come up with a solution that works?
> > > We're able to essentially infer that a column is null based on its value
> > > being missing or empty. So if an iterator through the row key bytes could
> > > detect/indicate that, then an application could "infer" the value is null.
> > >
> > > We're definitely planning on keeping byte[] accessors for use cases that
> > > need it. I'm curious on the geographic data case, though, could you use a
> > > fixed length long with a couple of new SQL built-ins to encode/decode the
> > > latitude/longitude?
> > >
> > >
> > > On 04/01/2013 11:29 PM, Jesse Yates wrote:
> > >
> > >> Actually, that isn't all that far-fetched of a format Matt - pretty common
> > >> anytime anyone wants to do sortable lat/long (*cough* three letter
> > >> agencies *cough*).
> > >>
> > >> Wouldn't we get the same by providing a simple set of libraries (ala
> > >> orderly + other HBase useful things) and then still give access to the
> > >> underlying byte array? Perhaps a nullable key type in that lib makes sense
> > >> if lots of people need it and it would be nice to have standard libraries
> > >> so tools could interop much more easily.
> > >> -------------------
> > >> Jesse Yates
> > >> @jesse_yates
> > >> jyates.github.com
> > >>
> > >>
> > >> On Mon, Apr 1, 2013 at 11:17 PM, Matt Corgan <[email protected]> wrote:
> > >>
> > >>> Ah, I didn't even realize sql allowed null key parts. Maybe a goal of
> > >>> the interfaces should be to provide first-class support for custom user
> > >>> types in addition to the standard ones included. Part of the power of
> > >>> hbase's plain byte[] keys is that users can concoct the perfect key for
> > >>> their data type. For example, I have a lot of geographic data where I
> > >>> interleave latitude/longitude bits into a sortable 64 bit value that
> > >>> would probably never be included in a standard library.
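
For readers unfamiliar with the key layout Matt mentions, here is a minimal sketch of bit interleaving (a Z-order or Morton code). It is not Matt's code: the class name GeoKey, the 32-bit quantization of each coordinate, and the choice to put latitude in the odd bit positions are all assumptions made for illustration.

```java
/** Hypothetical sketch of a sortable 64-bit lat/long key; not taken from the thread. */
final class GeoKey {

    /** Maps a coordinate in [min, max] onto the unsigned 32-bit range (assumed scheme). */
    static long quantize(double value, double min, double max) {
        double fraction = (value - min) / (max - min);
        return (long) (fraction * 0xFFFFFFFFL);
    }

    /** Spreads the low 32 bits of x so that they occupy the even bit positions. */
    static long spread(long x) {
        x &= 0xFFFFFFFFL;
        x = (x | (x << 16)) & 0x0000FFFF0000FFFFL;
        x = (x | (x << 8))  & 0x00FF00FF00FF00FFL;
        x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0FL;
        x = (x | (x << 2))  & 0x3333333333333333L;
        x = (x | (x << 1))  & 0x5555555555555555L;
        return x;
    }

    /** Interleaves latitude (odd bits) and longitude (even bits) into one 64-bit value. */
    static long interleave(double latitude, double longitude) {
        long lat = quantize(latitude, -90.0, 90.0);
        long lon = quantize(longitude, -180.0, 180.0);
        return (spread(lat) << 1) | spread(lon);
    }
}
```

The result uses all 64 bits, so it has to be compared as an unsigned value (for example with Long.compareUnsigned) or written out big-endian and compared byte by byte. Every bit pattern is a legal key here, which is why such a type leaves no spare encoding for null.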
> > >>>
> > >>>
> > >>> On Mon, Apr 1, 2013 at 8:38 PM, Enis Söztutar <[email protected]> wrote:
> > >>>
> > >>>> I think having Int32, and NullableInt32 would support minimum overhead,
> > >>>> as well as allowing SQL semantics.
> > >>>>
> > >>>>
> > >>>> On Mon, Apr 1, 2013 at 7:26 PM, Nick Dimiduk <[email protected]> wrote:
> > >>>>
> > >>>>> Furthermore, is it more important to support null values than squeeze
> > >>>>> all representations into minimum size (4-bytes for int32, &c.)?
> > >>>>> On Apr 1, 2013 4:41 PM, "Nick Dimiduk" <[email protected]> wrote:
> > >>>>>
> > >>>>>> On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <[email protected]> wrote:
> > >>>>>>
> > >>>>>>> From the SQL perspective, handling null is important.
> > >>>>>>
> > >>>>>> From your perspective, it is critical to support NULLs, even at the
> > >>>>>> expense of fixed-width encodings at all or supporting representation
> > >>>>>> of a full range of values. That is, you'd rather be able to represent
> > >>>>>> NULL than -2^31?
> > >>>>>>
> > >>>>>>> On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
> > >>>>>>>
> > >>>>>>>> Thanks for the thoughtful response (and code!).
> > >>>>>>>>
> > >>>>>>>> I'm thinking I will press forward with a base implementation that does
> > >>>>>>>> not support nulls. The idea is to provide an extensible set of
> > >>>>>>>> interfaces, so I think this will not box us into a corner later. That
> > >>>>>>>> is, a mirroring package could be implemented that supports null values
> > >>>>>>>> and accepts the relevant trade-offs.
> > >>>>>>>>
> > >>>>>>>> Thanks,
> > >>>>>>>> Nick
> > >>>>>>>>
> > >>>>>>>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <[email protected]> wrote:
> > >>>>>>>>
> > >>>>>>>>> I spent some time this weekend extracting bits of our serialization
> > >>>>>>>>> code to a public github repo at
> > >>>>>>>>> http://github.com/hotpads/data-tools. Contributions are welcome -
> > >>>>>>>>> i'm sure we all have this stuff laying around.
> > >>>>>>>>>
> > >>>>>>>>> You can see I've bumped into the NULL problem in a few places:
> > >>>>>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
> > >>>>>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
> > >>>>>>>>>
> > >>>>>>>>> Looking back, I think my latest opinion on the topic is to reject
> > >>>>>>>>> nullability as the rule since it can cause unexpected behavior and
> > >>>>>>>>> confusion. It's cleaner to provide a wrapper class (so both
> > >>>>>>>>> LongArrayList plus NullableLongArrayList) that explicitly defines the
> > >>>>>>>>> behavior, and costs a little more in performance. If the user can't
> > >>>>>>>>> find a pre-made wrapper class, it's not very difficult for each user
> > >>>>>>>>> to provide their own interpretation of null and check for it
> > >>>>>>>>> themselves.
> > >>>>>>>>>
> > >>>>>>>>> If you reject nullability, the question becomes what to do in
> > >>>>>>>>> situations where you're implementing existing interfaces that accept
> > >>>>>>>>> nullable params. The LongArrayList above implements List<Long> which
> > >>>>>>>>> requires an add(Long) method. In the above implementation I chose to
> > >>>>>>>>> swap nulls with Long.MIN_VALUE, however I'm now thinking it best to
> > >>>>>>>>> force the user to make that swap and then throw
> > >>>>>>>>> IllegalArgumentException if they pass null.
> > >>>>
> > >>>>>>>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <[email protected]> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Hmmm... good question.
> > >>>>>>>>>>
> > >>>>>>>>>> I think that fixed width support is important for a great many rowkey
> > >>>>>>>>>> constructs cases, so I'd rather see something like losing MIN_VALUE
> > >>>>>>>>>> and keeping fixed width.
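
Doug's preference (give up MIN_VALUE, keep a fixed width) could look roughly like the sketch below. It assumes the usual sign-bit flip for order-preserving longs, under which Long.MIN_VALUE would encode as eight 0x00 bytes; that slot is handed to null instead. The class name FixedWidthNullableLong and the decision to reject MIN_VALUE with an exception are illustrative assumptions, not something settled in this thread.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch: 8-byte, order-preserving, nullable long that gives up MIN_VALUE. */
final class FixedWidthNullableLong {

    /** Always returns exactly 8 bytes; null becomes the all-zero encoding. */
    static byte[] encode(Long value) {
        if (value != null && value == Long.MIN_VALUE) {
            throw new IllegalArgumentException("Long.MIN_VALUE is reserved to represent null");
        }
        // Flipping the sign bit makes unsigned byte order match signed numeric order.
        // MIN_VALUE would flip to 0, so 0 is free to stand for null.
        long flipped = (value == null) ? 0L : (value ^ Long.MIN_VALUE);
        return ByteBuffer.allocate(8).putLong(flipped).array();
    }

    /** Inverse of encode; the all-zero encoding decodes to null. */
    static Long decode(byte[] bytes) {
        long flipped = ByteBuffer.wrap(bytes).getLong();
        return (flipped == 0L) ? null : Long.valueOf(flipped ^ Long.MIN_VALUE);
    }
}
```

Null still sorts before every representable value, and the key stays 8 bytes wide. The cost is exactly the one debated above: -2^63 can no longer be stored.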
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <[email protected]> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Heya,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thinking about data types and serialization. I think null support is
> > >>>>>>>>>>> an important characteristic for the serialized representations,
> > >>>>>>>>>>> especially when considering the compound type. However, doing so is
> > >>>>>>>>>>> directly incompatible with fixed-width representations for numerics.
> > >>>>>>>>>>> For instance, if we want to have a fixed-width signed long stored on
> > >>>>>>>>>>> 8-bytes, where do you put null? float and double types can cheat a
> > >>>>>>>>>>> little by folding negative and positive NaN's into a single
> > >>>>>>>>>>> representation (this isn't strictly correct!), leaving a place to
> > >>>>>>>>>>> represent null. In the long example case, the obvious choice is to
> > >>>>>>>>>>> reduce MAX_VALUE or increase MIN_VALUE by one. This will allocate an
> > >>>>>>>>>>> additional encoding which can be used for null. My experience
> > >>>>>>>>>>> working with scientific data, however, makes me wince at the idea.
> > >>>>>>>>>>>
> > >>>>>>>>>>> The variable-width encodings have it a little easier. There's
> > >>>>>>>>>>> already enough going on that it's simpler to make room.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Remember, the final goal is to support order-preserving
> > >>>>>>>>>>> serialization. This imposes some limitations on our encoding
> > >>>>>>>>>>> strategies. For instance, it's not enough to simply encode null, it
> > >>>>>>>>>>> really needs to be encoded as 0x00 so as to sort lexicographically
> > >>>>>>>>>>> earlier than any other value.
> > >>>>>>>>>>
> > >>>>>>>>>>> What do you think? Any ideas, experiences, etc?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks,
> > >>>>>>>>>>> Nick
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >
> >
>
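
The original message at the bottom of the thread asks for an order-preserving encoding in which null is written as 0x00 and therefore sorts first. A minimal sketch of one way to get that while keeping the full long range is to spend one extra header byte. The names NullableLongCodec and compareBytes, and the 0x01 marker for non-null values, are assumptions for illustration; this is not the API that was eventually proposed for HBase.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch: order-preserving nullable long with a one-byte header. */
final class NullableLongCodec {
    private static final byte NULL_MARKER = 0x00;      // sorts before any non-null value
    private static final byte NOT_NULL_MARKER = 0x01;  // assumed marker for present values

    /** Encodes null as a single 0x00 byte, any other value as a marker plus 8 bytes. */
    static byte[] encode(Long value) {
        if (value == null) {
            return new byte[] { NULL_MARKER };
        }
        // Flipping the sign bit makes unsigned byte order match signed numeric order.
        long flipped = value ^ Long.MIN_VALUE;
        return ByteBuffer.allocate(9).put(NOT_NULL_MARKER).putLong(flipped).array();
    }

    /** Inverse of encode. */
    static Long decode(byte[] bytes) {
        if (bytes[0] == NULL_MARKER) {
            return null;
        }
        return ByteBuffer.wrap(bytes, 1, 8).getLong() ^ Long.MIN_VALUE;
    }

    /** Unsigned lexicographic comparison, the order in which byte[] row keys sort. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xff, b[i] & 0xff);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }
}
```

With this layout null sorts ahead of Long.MIN_VALUE and no value is sacrificed, but every non-null value costs 9 bytes instead of 8. That is the overhead versus range trade-off the thread goes back and forth on.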
