On Wed, Apr 3, 2013 at 11:29 AM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
> Hiya Nick,
> Pig converts data for HBase storage using this class:
>
> https://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/hbase/HBaseBinaryConverter.java (which
> is mostly just calling into HBase's Bytes class). As long as Bytes
> handles the null stuff, we'll just inherit the behavior.
>

Dmitriy,

Precisely how this will be exposed via the hbase client is TBD. We won't be
deprecating the existing Bytes utility from the client view, so a new API
for supporting these types will be provided. I'll be able to provide support
and/or a patch for Pig (et al) once the implementation is a bit further
along.

My question for you as a Pig representative is more about how Pig users
expect Pig to handle NULLs. Are NULL values within a tuple a common
occurrence in Pig? In comparison, I'm thinking about the prevalence of NULL
in SQL.

Thanks,
Nick

On Tue, Apr 2, 2013 at 9:40 AM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>
> > I agree that a user-extensible interface is a required feature here.
> > Personally, I'd love to ship a set of standard GIS tools on HBase. Let's
> > keep in mind, though, that SQL and user applications are not the only
> > consumers of this interface. A big motivation is allowing interop with the
> > other higher MR languages. *cough* Where are my Pig and Hive peeps in this
> > thread?
> >
> > On Mon, Apr 1, 2013 at 11:33 PM, James Taylor <jtay...@salesforce.com> wrote:
> >
> > > Maybe if we can keep nullability separate from the
> > > serialization/deserialization, we can come up with a solution that works?
> > > We're able to essentially infer that a column is null based on its value
> > > being missing or empty. So if an iterator through the row key bytes could
> > > detect/indicate that, then an application could "infer" the value is null.
> > >
> > > We're definitely planning on keeping byte[] accessors for use cases that
> > > need it.
> > > I'm curious on the geographic data case, though, could you use a
> > > fixed length long with a couple of new SQL built-ins to encode/decode the
> > > latitude/longitude?
> > >
> > >
> > > On 04/01/2013 11:29 PM, Jesse Yates wrote:
> > >
> > >> Actually, that isn't all that far-fetched of a format Matt - pretty common
> > >> anytime anyone wants to do sortable lat/long (*cough* three letter agencies
> > >> *cough*).
> > >>
> > >> Wouldn't we get the same by providing a simple set of libraries (ala
> > >> orderly + other HBase useful things) and then still give access to the
> > >> underlying byte array? Perhaps a nullable key type in that lib makes sense
> > >> if lots of people need it and it would be nice to have standard libraries
> > >> so tools could interop much more easily.
> > >> -------------------
> > >> Jesse Yates
> > >> @jesse_yates
> > >> jyates.github.com
> > >>
> > >>
> > >> On Mon, Apr 1, 2013 at 11:17 PM, Matt Corgan <mcor...@hotpads.com> wrote:
> > >>
> > >>> Ah, I didn't even realize sql allowed null key parts. Maybe a goal of the
> > >>> interfaces should be to provide first-class support for custom user types
> > >>> in addition to the standard ones included. Part of the power of hbase's
> > >>> plain byte[] keys is that users can concoct the perfect key for their data
> > >>> type. For example, I have a lot of geographic data where I interleave
> > >>> latitude/longitude bits into a sortable 64 bit value that would probably
> > >>> never be included in a standard library.
> > >>>
> > >>>
> > >>> On Mon, Apr 1, 2013 at 8:38 PM, Enis Söztutar <enis....@gmail.com> wrote:
> > >>>
> > >>>> I think having Int32, and NullableInt32 would support minimum overhead, as
> > >>>> well as allowing SQL semantics.
> > >>>>
> > >>>>
> > >>>> On Mon, Apr 1, 2013 at 7:26 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
> > >>>>
> > >>>>> Furthermore, is it more important to support null values than squeeze all
> > >>>>> representations into minimum size (4-bytes for int32, &c.)?
> > >>>>> On Apr 1, 2013 4:41 PM, "Nick Dimiduk" <ndimi...@gmail.com> wrote:
> > >>>>>
> > >>>>>> On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <jtay...@salesforce.com> wrote:
> > >>>>>>
> > >>>>>>> From the SQL perspective, handling null is important.
> > >>>>>>
> > >>>>>> From your perspective, it is critical to support NULLs, even at the
> > >>>>>> expense of fixed-width encodings at all or supporting representation of a
> > >>>>>> full range of values. That is, you'd rather be able to represent NULL than
> > >>>>>> -2^31?
> > >>>>>>
> > >>>>>> On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
> > >>>>>>
> > >>>>>>>> Thanks for the thoughtful response (and code!).
> > >>>>>>>>
> > >>>>>>>> I'm thinking I will press forward with a base implementation that does not
> > >>>>>>>> support nulls. The idea is to provide an extensible set of interfaces, so I
> > >>>>>>>> think this will not box us into a corner later. That is, a mirroring
> > >>>>>>>> package could be implemented that supports null values and accepts
> > >>>>>>>> the relevant trade-offs.
> > >>>>>>>>
> > >>>>>>>> Thanks,
> > >>>>>>>> Nick
> > >>>>>>>>
> > >>>>>>>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <mcor...@hotpads.com> wrote:
> > >>>>>>>>
> > >>>>>>>>> I spent some time this weekend extracting bits of our serialization code to
> > >>>>>>>>> a public github repo at http://github.com/hotpads/data-tools .
> > >>>>>>>>> Contributions are welcome - i'm sure we all have this stuff laying around.
> > >>>>>>>>>
> > >>>>>>>>> You can see I've bumped into the NULL problem in a few places:
> > >>>>>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
> > >>>>>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
> > >>>>>>>>>
> > >>>>>>>>> Looking back, I think my latest opinion
> > >>>>>>>>> on the topic is to reject
> > >>>>>>>>> nullability as the rule since it can cause unexpected behavior and
> > >>>>>>>>> confusion. It's cleaner to provide a wrapper class (so both LongArrayList
> > >>>>>>>>> plus NullableLongArrayList) that explicitly defines the behavior, and costs
> > >>>>>>>>> a little more in performance. If the user can't find a pre-made wrapper
> > >>>>>>>>> class, it's not very difficult for each user to provide their own
> > >>>>>>>>> interpretation of null and check for it themselves.
> > >>>>>>>>>
> > >>>>>>>>> If you reject nullability, the question becomes what to do in situations
> > >>>>>>>>> where you're implementing existing interfaces that accept nullable params.
> > >>>>>>>>> The LongArrayList above implements List<Long> which requires an add(Long)
> > >>>>>>>>> method. In the above implementation I chose to swap nulls with
> > >>>>>>>>> Long.MIN_VALUE, however I'm now thinking it best to force the user to make
> > >>>>>>>>> that swap and then throw IllegalArgumentException if they pass null.
> > >>>>>>>>>
> > >>>>>>>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <doug.m...@explorysmedical.com> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Hmmm... good question.
> > >>>>>>>>>>
> > >>>>>>>>>> I think that fixed width support is important for a great many rowkey
> > >>>>>>>>>> constructs cases, so I'd rather see something like losing MIN_VALUE and
> > >>>>>>>>>> keeping fixed width.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <ndimi...@gmail.com> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Heya,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thinking about data types and serialization. I think null support is an
> > >>>>>>>>>>> important characteristic for the serialized representations, especially
> > >>>>>>>>>>> when considering the compound type. However, doing so is directly
> > >>>>>>>>>>> incompatible with fixed-width representations for numerics. For instance,
> > >>>>>>>>>>> if we want to have a fixed-width signed long stored on 8-bytes, where do
> > >>>>>>>>>>> you put null? float and double types can cheat a little by folding negative
> > >>>>>>>>>>> and positive NaN's into a single representation (this isn't strictly
> > >>>>>>>>>>> correct!), leaving a place to represent null. In the long example case, the
> > >>>>>>>>>>> obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by one. This
> > >>>>>>>>>>> will allocate an additional encoding which can be used for null. My
> > >>>>>>>>>>> experience working with scientific data, however, makes me wince at the
> > >>>>>>>>>>> idea.
> > >>>>>>>>>>>
> > >>>>>>>>>>> The variable-width encodings have it a little easier. There's already
> > >>>>>>>>>>> enough going on that it's simpler to make room.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Remember, the final goal is to support order-preserving serialization.
> > >>>>>>>>>>> This imposes some limitations on our encoding strategies. For instance, it's not
> > >>>>>>>>>>> enough to simply encode null, it really needs to be encoded as 0x00 so as
> > >>>>>>>>>>> to sort lexicographically earlier than any other value.
> > >>>>>>>>>>>
> > >>>>>>>>>>> What do you think? Any ideas, experiences, etc?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks,
> > >>>>>>>>>>> Nick
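For reference, here is a minimal Java sketch of the idea in the original message of this thread: a nullable signed long encoded so that unsigned lexicographic byte order matches numeric order, with null stored as a single 0x00 byte that sorts before every value. The class name NullableLongCodec, the 0x01 value marker, and the 9-byte layout are assumptions made up for illustration; this is not the API discussed for HBase. Note that it gives up fixed width (1 byte for null, 9 bytes otherwise), which is one side of the trade-off debated above.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; names and layout are assumptions.
final class NullableLongCodec {
    private static final byte NULL_MARKER = 0x00;   // sorts before any value
    private static final byte VALUE_MARKER = 0x01;  // prefix for non-null values

    /** Encodes a nullable long so unsigned byte order matches numeric order. */
    static byte[] encode(Long value) {
        if (value == null) {
            return new byte[] { NULL_MARKER };
        }
        // Flipping the sign bit maps MIN_VALUE..MAX_VALUE onto
        // 0x00..00 through 0xFF..FF, so big-endian bytes sort numerically.
        long flipped = value ^ Long.MIN_VALUE;
        return ByteBuffer.allocate(9).put(VALUE_MARKER).putLong(flipped).array();
    }

    /** Reverses encode(); returns null for the single-byte null encoding. */
    static Long decode(byte[] bytes) {
        if (bytes.length == 1 && bytes[0] == NULL_MARKER) {
            return null;
        }
        if (bytes.length != 9 || bytes[0] != VALUE_MARKER) {
            throw new IllegalArgumentException("not a NullableLongCodec encoding");
        }
        return ByteBuffer.wrap(bytes, 1, 8).getLong() ^ Long.MIN_VALUE;
    }

    /** Lexicographic comparison that treats each byte as unsigned. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return a.length - b.length;
    }
}
```

Under these assumptions, encode(null) compares lower than encode(Long.MIN_VALUE), and the full range of long values stays representable. A fixed-width variant would instead reserve one value, such as MIN_VALUE, as the null sentinel.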