Re: HBase Types: Explicit Null Support

James Taylor Mon, 01 Apr 2013 23:34:14 -0700

Maybe if we can keep nullability separate from the serialization/deserialization, we can come up with a solution that works? We're able to essentially infer that a column is null based on its value being missing or empty. So if an iterator through the row key bytes could detect/indicate that, then an application could "infer" the value is null.
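As a rough illustration of that idea (not Phoenix's actual code), the sketch below walks a row key and reports an empty component as null. It assumes, purely for illustration, variable-length components that never contain 0x00 and are separated by a single 0x00 byte; `RowKeyPartsSketch`, `SEPARATOR`, and `split` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RowKeyPartsSketch {
    // Assumed separator between variable-length key components.
    static final byte SEPARATOR = 0x00;

    // Splits a row key on SEPARATOR. An empty component is reported as null,
    // so the caller can "infer" nullability without the codec knowing about it.
    static List<byte[]> split(byte[] rowKey) {
        List<byte[]> parts = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= rowKey.length; i++) {
            if (i == rowKey.length || rowKey[i] == SEPARATOR) {
                parts.add(i == start ? null : Arrays.copyOfRange(rowKey, start, i));
                start = i + 1;
            }
        }
        return parts;
    }
}
```

Here the serialized form carries no explicit null marker; null is only an interpretation applied while iterating.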

We're definitely planning on keeping byte[] accessors for use cases that need it. I'm curious about the geographic data case, though: could you use a fixed-length long with a couple of new SQL built-ins to encode/decode the latitude/longitude?


On 04/01/2013 11:29 PM, Jesse Yates wrote:

Actually, that isn't all that far-fetched of a format, Matt - pretty common
anytime anyone wants to do sortable lat/long (*cough* three letter agencies
*cough*).

Wouldn't we get the same by providing a simple set of libraries (a la
orderly + other HBase useful things) and then still give access to the
underlying byte array? Perhaps a nullable key type in that lib makes sense
if lots of people need it, and it would be nice to have standard libraries
so tools could interop much more easily.
-------------------
Jesse Yates
@jesse_yates
jyates.github.com


On Mon, Apr 1, 2013 at 11:17 PM, Matt Corgan <[email protected]> wrote:

Ah, I didn't even realize sql allowed null key parts.  Maybe a goal of the
interfaces should be to provide first-class support for custom user types
in addition to the standard ones included.  Part of the power of hbase's
plain byte[] keys is that users can concoct the perfect key for their data
type.  For example, I have a lot of geographic data where I interleave
latitude/longitude bits into a sortable 64 bit value that would probably
never be included in a standard library.
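A minimal sketch of the kind of key described above, not Matt's actual code: each coordinate is quantized to 32 bits and the bits are interleaved (a Z-order or Morton code) into one 64-bit value, written big-endian so that unsigned byte-wise comparison matches numeric order. The class and method names, the 32-bit quantization, and the choice to put latitude in the odd bits are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;

final class GeoKeySketch {
    // Maps a coordinate in [min, max] onto an unsigned 32-bit integer held in a long.
    static long quantize(double value, double min, double max) {
        if (value < min || value > max) {
            throw new IllegalArgumentException("out of range: " + value);
        }
        double fraction = (value - min) / (max - min);
        return (long) (fraction * 0xFFFFFFFFL);
    }

    // Spreads the low 32 bits of x so that bit i moves to bit 2*i.
    static long spread(long x) {
        x &= 0xFFFFFFFFL;
        x = (x | (x << 16)) & 0x0000FFFF0000FFFFL;
        x = (x | (x << 8)) & 0x00FF00FF00FF00FFL;
        x = (x | (x << 4)) & 0x0F0F0F0F0F0F0F0FL;
        x = (x | (x << 2)) & 0x3333333333333333L;
        x = (x | (x << 1)) & 0x5555555555555555L;
        return x;
    }

    // Interleaves latitude bits (odd positions) with longitude bits (even positions).
    static long interleave(double lat, double lon) {
        long latBits = quantize(lat, -90.0, 90.0);
        long lonBits = quantize(lon, -180.0, 180.0);
        return (spread(latBits) << 1) | spread(lonBits);
    }

    // Big-endian bytes, intended to be compared as unsigned bytes.
    static byte[] toKey(double lat, double lon) {
        return ByteBuffer.allocate(Long.BYTES).putLong(interleave(lat, lon)).array();
    }
}
```

The interleaved value uses all 64 bits, so it only sorts correctly when treated as unsigned; comparing the big-endian byte[] form lexicographically does that.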


On Mon, Apr 1, 2013 at 8:38 PM, Enis Söztutar <[email protected]> wrote:

I think having Int32, and NullableInt32 would support minimum overhead, as
well as allowing SQL semantics.


On Mon, Apr 1, 2013 at 7:26 PM, Nick Dimiduk <[email protected]> wrote:

Furthermore, is it more important to support null values than to squeeze all
representations into minimum size (4 bytes for int32, &c.)?
On Apr 1, 2013 4:41 PM, "Nick Dimiduk" <[email protected]> wrote:

On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <[email protected]> wrote:

From the SQL perspective, handling null is important.


From your perspective, it is critical to support NULLs, even at the
expense of fixed-width encodings altogether, or of representing the full
range of values. That is, you'd rather be able to represent NULL than
-2^31?

On 04/01/2013 01:32 PM, Nick Dimiduk wrote:

Thanks for the thoughtful response (and code!).

I'm thinking I will press forward with a base implementation that does not
support nulls. The idea is to provide an extensible set of interfaces, so I
think this will not box us into a corner later. That is, a mirroring
package could be implemented that supports null values and accepts
the relevant trade-offs.

Thanks,
Nick

On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <[email protected]> wrote:

I spent some time this weekend extracting bits of our serialization code to
a public github repo at http://github.com/hotpads/data-tools.
Contributions are welcome - I'm sure we all have this stuff laying around.

You can see I've bumped into the NULL problem in a few places:
* https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
* https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java

Looking back, I think my latest opinion on the topic is to reject
nullability as the rule since it can cause unexpected behavior and
confusion.  It's cleaner to provide a wrapper class (so both LongArrayList
plus NullableLongArrayList) that explicitly defines the behavior and costs
a little more in performance.  If the user can't find a pre-made wrapper
class, it's not very difficult for each user to provide their own
interpretation of null and check for it themselves.

If you reject nullability, the question becomes what to do in situations
where you're implementing existing interfaces that accept nullable params.
The LongArrayList above implements List<Long> which requires an add(Long)
method.  In the above implementation I chose to swap nulls with
Long.MIN_VALUE, however I'm now thinking it best to force the user to make
that swap and then throw IllegalArgumentException if they pass null.
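The stricter behavior described above might look roughly like the following. This is a hypothetical sketch, not the hotpads LongArrayList: `StrictLongListSketch` is an invented name, and it implements only enough of List<Long> (via AbstractList) to show the null check.

```java
import java.util.AbstractList;
import java.util.Arrays;

final class StrictLongListSketch extends AbstractList<Long> {
    private long[] values = new long[8];
    private int size = 0;

    @Override
    public boolean add(Long value) {
        // No silent swap to Long.MIN_VALUE: the caller must pick a sentinel.
        if (value == null) {
            throw new IllegalArgumentException("null is not supported by this list");
        }
        if (size == values.length) {
            values = Arrays.copyOf(values, size * 2);
        }
        values[size++] = value;
        modCount++;
        return true;
    }

    @Override
    public Long get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        return values[index];
    }

    @Override
    public int size() {
        return size;
    }
}
```

A caller that wants the old behavior would write `list.add(v == null ? Long.MIN_VALUE : v)` itself, which makes the loss of one value explicit at the call site.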


On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <[email protected]> wrote:

Hmmm... good question.

I think that fixed-width support is important for a great many rowkey
construction cases, so I'd rather see something like losing MIN_VALUE and
keeping fixed width.




On 4/1/13 2:00 PM, "Nick Dimiduk" <[email protected]> wrote:

Heya,

Thinking about data types and serialization. I think null support is an
important characteristic for the serialized representations, especially
when considering the compound type. However, doing so is directly
incompatible with fixed-width representations for numerics. For instance,
if we want to have a fixed-width signed long stored on 8 bytes, where do
you put null? float and double types can cheat a little by folding negative
and positive NaN's into a single representation (this isn't strictly
correct!), leaving a place to represent null. In the long example case, the
obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by one. This
frees an additional encoding which can be used for null. My experience
working with scientific data, however, makes me wince at the idea.
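To make the range-reduction option concrete, here is a minimal sketch that keeps the 8-byte width by giving up Long.MIN_VALUE. It assumes the sign bit is flipped so that unsigned byte order matches signed numeric order, which makes the all-zero encoding (normally Long.MIN_VALUE) available for null. `ReservedNullLongSketch` and its methods are hypothetical names, not part of any proposed API.

```java
import java.nio.ByteBuffer;

final class ReservedNullLongSketch {
    // Always 8 bytes. All-zero bytes mean null; Long.MIN_VALUE is not representable.
    static byte[] encode(Long value) {
        long bits;
        if (value == null) {
            bits = 0L;
        } else if (value == Long.MIN_VALUE) {
            throw new IllegalArgumentException("Long.MIN_VALUE is reserved for null");
        } else {
            // Flip the sign bit so negative values sort before positive ones.
            bits = value ^ Long.MIN_VALUE;
        }
        return ByteBuffer.allocate(Long.BYTES).putLong(bits).array();
    }

    static Long decode(byte[] bytes) {
        long bits = ByteBuffer.wrap(bytes).getLong();
        return bits == 0L ? null : Long.valueOf(bits ^ Long.MIN_VALUE);
    }
}
```

The cost is exactly the one that prompts the wince: one legitimate value can no longer be stored.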

The variable-width encodings have it a little easier. There's already
enough going on that it's simpler to make room.

Remember, the final goal is to support order-preserving serialization. This
imposes some limitations on our encoding strategies. For instance, it's not
enough to simply encode null, it really needs to be encoded as 0x00 so as
to sort lexicographically earlier than any other value.
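A small sketch of what that constraint could mean in practice, under assumptions of my own: a non-nullable long is written as 8 big-endian bytes with the sign bit flipped, and a nullable variant spends one extra header byte, 0x00 for null and 0x01 otherwise, so null sorts first. All names here are hypothetical and this is not a proposed wire format.

```java
import java.nio.ByteBuffer;

final class OrderedLongSketch {
    static final byte NULL_MARKER = 0x00;
    static final byte NOT_NULL_MARKER = 0x01;

    // Fixed-width, non-nullable: flipping the sign bit makes unsigned
    // byte order agree with signed numeric order.
    static byte[] encodeLong(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value ^ Long.MIN_VALUE).array();
    }

    static long decodeLong(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong() ^ Long.MIN_VALUE;
    }

    // Nullable: null is the single byte 0x00; other values are 0x01 plus 8 bytes.
    static byte[] encodeNullableLong(Long value) {
        if (value == null) {
            return new byte[] { NULL_MARKER };
        }
        return ByteBuffer.allocate(1 + Long.BYTES)
                .put(NOT_NULL_MARKER)
                .putLong(value ^ Long.MIN_VALUE)
                .array();
    }

    static Long decodeNullableLong(byte[] bytes) {
        if (bytes[0] == NULL_MARKER) {
            return null;
        }
        return ByteBuffer.wrap(bytes, 1, Long.BYTES).getLong() ^ Long.MIN_VALUE;
    }

    // Lexicographic comparison of unsigned bytes; shorter array wins on a tie.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return a.length - b.length;
    }
}
```

The nullable form keeps the full range of long but is no longer fixed-width at 8 bytes, which is the trade-off under discussion.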

What do you think? Any ideas, experiences, etc?

Thanks,
Nick
