Furthermore, is it more important to support null values than to squeeze all representations into minimum size (4 bytes for int32, &c.)?

On Apr 1, 2013 4:41 PM, "Nick Dimiduk" <[email protected]> wrote:
> On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <[email protected]> wrote:
>
>> From the SQL perspective, handling null is important.
>
> From your perspective, it is critical to support NULLs, even at the
> expense of fixed-width encodings at all or supporting representation of a
> full range of values. That is, you'd rather be able to represent NULL than
> -2^31?
>
>> On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
>>
>>> Thanks for the thoughtful response (and code!).
>>>
>>> I'm thinking I will press forward with a base implementation that does
>>> not support nulls. The idea is to provide an extensible set of
>>> interfaces, so I think this will not box us into a corner later. That
>>> is, a mirroring package could be implemented that supports null values
>>> and accepts the relevant trade-offs.
>>>
>>> Thanks,
>>> Nick
>>>
>>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <[email protected]> wrote:
>>>
>>>> I spent some time this weekend extracting bits of our serialization
>>>> code to a public github repo at
>>>> http://github.com/hotpads/data-tools .
>>>> Contributions are welcome - i'm sure we all have this stuff laying
>>>> around.
>>>>
>>>> You can see I've bumped into the NULL problem in a few places:
>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
>>>>
>>>> Looking back, I think my latest opinion on the topic is to reject
>>>> nullability as the rule since it can cause unexpected behavior and
>>>> confusion.
>>>> It's cleaner to provide a wrapper class (so both LongArrayList plus
>>>> NullableLongArrayList) that explicitly defines the behavior, and costs
>>>> a little more in performance. If the user can't find a pre-made wrapper
>>>> class, it's not very difficult for each user to provide their own
>>>> interpretation of null and check for it themselves.
>>>>
>>>> If you reject nullability, the question becomes what to do in situations
>>>> where you're implementing existing interfaces that accept nullable
>>>> params. The LongArrayList above implements List<Long> which requires an
>>>> add(Long) method. In the above implementation I chose to swap nulls with
>>>> Long.MIN_VALUE, however I'm now thinking it best to force the user to
>>>> make that swap and then throw IllegalArgumentException if they pass null.
>>>>
>>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <[email protected]> wrote:
>>>>
>>>>> Hmmm... good question.
>>>>>
>>>>> I think that fixed width support is important for a great many rowkey
>>>>> constructs cases, so I'd rather see something like losing MIN_VALUE and
>>>>> keeping fixed width.
>>>>>
>>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <[email protected]> wrote:
>>>>>
>>>>>> Heya,
>>>>>>
>>>>>> Thinking about data types and serialization. I think null support is
>>>>>> an important characteristic for the serialized representations,
>>>>>> especially when considering the compound type. However, doing so is
>>>>>> directly incompatible with fixed-width representations for numerics.
>>>>>> For instance, if we want to have a fixed-width signed long stored on
>>>>>> 8-bytes, where do you put null? float and double types can cheat a
>>>>>> little by folding negative and positive NaN's into a single
>>>>>> representation (this isn't strictly correct!), leaving a place to
>>>>>> represent null.
>>>>>> In the long example case, the obvious choice is to reduce MAX_VALUE
>>>>>> or increase MIN_VALUE by one. This will allocate an additional
>>>>>> encoding which can be used for null. My experience working with
>>>>>> scientific data, however, makes me wince at the idea.
>>>>>>
>>>>>> The variable-width encodings have it a little easier. There's already
>>>>>> enough going on that it's simpler to make room.
>>>>>>
>>>>>> Remember, the final goal is to support order-preserving serialization.
>>>>>> This imposes some limitations on our encoding strategies. For instance,
>>>>>> it's not enough to simply encode null, it really needs to be encoded as
>>>>>> 0x00 so as to sort lexicographically earlier than any other value.
>>>>>>
>>>>>> What do you think? Any ideas, experiences, etc?
>>>>>>
>>>>>> Thanks,
>>>>>> Nick
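
To make the trade-off in the original message concrete, here is a minimal Java sketch of a fixed-width, order-preserving encoding for a nullable long. It is an illustration only, not code from this thread or from data-tools: the class name NullableOrderedLong and its methods are hypothetical, and it assumes Long.MIN_VALUE is the value given up (the "losing MIN_VALUE" option) so that eight 0x00 bytes can mean null and sort before every real value.

```java
import java.util.Arrays;

/**
 * Sketch: fixed-width (8 byte), order-preserving encoding of a nullable long.
 * Assumption for illustration: Long.MIN_VALUE is sacrificed so that its
 * encoding, eight 0x00 bytes, can represent null.
 */
final class NullableOrderedLong {
    static final int WIDTH = Long.BYTES; // always 8 bytes, null included

    /** Encodes a value (or null) so that unsigned byte order matches numeric order, null first. */
    static byte[] encode(Long value) {
        if (value != null && value == Long.MIN_VALUE) {
            throw new IllegalArgumentException("Long.MIN_VALUE is reserved to represent null");
        }
        // Flipping the sign bit maps MIN_VALUE..MAX_VALUE onto 0x00..00 .. 0xFF..FF,
        // so negative numbers sort before positive ones as unsigned bytes.
        long bits = (value == null) ? 0L : value ^ Long.MIN_VALUE;
        byte[] out = new byte[WIDTH];
        for (int i = WIDTH - 1; i >= 0; i--) { // big-endian: most significant byte first
            out[i] = (byte) bits;
            bits >>>= 8;
        }
        return out;
    }

    /** Inverse of encode: eight 0x00 bytes decode to null. */
    static Long decode(byte[] encoded) {
        if (encoded.length != WIDTH) {
            throw new IllegalArgumentException("expected " + WIDTH + " bytes");
        }
        long bits = 0L;
        for (byte b : encoded) {
            bits = (bits << 8) | (b & 0xFF);
        }
        return bits == 0L ? null : Long.valueOf(bits ^ Long.MIN_VALUE);
    }

    /** Unsigned lexicographic comparison, the order a byte-sorted key store would apply. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        Long[] samples = {42L, null, -7L, Long.MAX_VALUE, 0L};
        byte[][] keys = new byte[samples.length][];
        for (int i = 0; i < samples.length; i++) {
            keys[i] = encode(samples[i]);
        }
        // Sort by raw bytes only, then decode to show the resulting order.
        Arrays.sort(keys, NullableOrderedLong::compareBytes);
        for (byte[] key : keys) {
            System.out.println(decode(key));
        }
    }
}
```

Running main prints null first, followed by -7, 0, 42 and Long.MAX_VALUE, which is the order in which a byte-sorted store would return these keys. The cost is the one debated above: encode(Long.MIN_VALUE) is rejected, so the full range of values is no longer representable.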
