There is one database that I'm aware of that uses sentinels _and_ supports
complex types with missing values: Kx's KDB+. This has led to some
seriously strange choices like the ASCII space character being used as the
sentinel value for strings. See
https://code.kx.com/wiki/Reference/Datatypes for
more details.

On Thu, Nov 8, 2018 at 4:39 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hey Matt,
>
> Thanks for giving your perspective on the mailing list.
>
> My objective in writing about this recently
> (http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/, though I
> need to update since the sentinel case can be done more efficiently
> than what's there now) was to help dispel the notion that using a
> separate value (bit or byte) to encode nullness is a performance
> compromise to comply with the requirements of database systems. I too
> prefer real world benchmarks to microbenchmarks, and probably null
> checking is not going to be the main driver of aggregate system
> performance. I had heard many people over the years object to bitmaps
> on performance grounds but without analysis to back it up.
>
> Some context for other readers on the mailing list: A language like R
> is not a database and has fewer built-in scalar types: int32, double,
> string (interned), and boolean. Out of these, int32 and double can use
> one bit pattern for NA (null) and not lose too much. A database system
> generally can't make that kind of compromise, and most popular
> databases can distinguish INT32_MIN (or any other value used as a
> sentinel) and null. If you loaded data from an Avro or Parquet file
> that contained one of those values, you'd have to decide what to do
> with the data (though I understand there's integer64 add-on packages
> for R now)
>
> Now back to Arrow -- we have 3 main kinds of data types:
>
> * Fixed size primitive
> * Variable size primitive (binary, utf8)
> * Nested (list, struct, union)
>
> Out of these, "fixed size primitive" is the only one that can
> generally support O(1) in-place mutation / updates, though all of them
> could support a O(1) "make null" operation (by zeroing a bit). In
> general, when faced with designs we have preferred choices benefiting
> use cases where datasets are treated as immutable or copy-on-write.
>
> If an application _does_ need to do mutation on primitive arrays, then
> you could choose to always allocate the validity bitmap so that it can
> be mutated without requiring allocations to happen arbitrarily in your
> processing workflow. But, if you have data without nulls, it is a nice
> feature to be able to ignore the bitmap or not allocate one at all. If
> you constructed an array from data that you know to be non-nullable,
> some implementations might wish to avoid the waste of creating a
> bitmap with all 1's.
>
> For example, if we create an array::Array from a normal NumPy array of
> integers (which cannot have nulls), we have
>
> In [6]: import pyarrow as pa
> In [7]: import numpy as np
> In [8]: arr = pa.array(np.array([1, 2, 3, 4]))
>
> In [9]: arr.buffers()
> Out[9]: [None, <pyarrow.lib.Buffer at 0x7f34ecd3eea0>]
>
> In [10]: arr.null_count
> Out[10]: 0
>
> Normally, the first buffer would be the validity bitmap memory, but
> here it was not allocated because there are no nulls.
>
> Creating an open standard data representation is a difficult thing;
> one cannot be "all things to all people" but the intent is to be a
> suitable lingua franca for language agnostic data interchange and as a
> runtime representation for analytical query engines (where most
> operators are "pure"). If the Arrow community's goal were to create a
> "mutable column store" then some things might be designed differently
> (perhaps more like internals of https://kudu.apache.org/). It is
> helpful to have an understanding of what compromises have been made
> and how costly they are in real world applications.
>
> best
> Wes
> On Mon, Nov 5, 2018 at 8:27 PM Jacques Nadeau <jacq...@apache.org> wrote:
> >
> > On Mon, Nov 5, 2018 at 3:43 PM Matt Dowle <mattjdo...@gmail.com> wrote:
> >
> > > 1. I see. Good idea. Can we assume bitmap is always present in Arrow
> then?
> > > I thought I'd seen Wes argue that if there were no NAs, the bitmap
> doesn't
> > > need to be allocated.  Indeed I wasn't worried about the extra storage,
> > > although for 10,000 columns I wonder about the number of vectors.
> > >
> >
> > I think different implementations handle this differently at the moment.
> In
> > the Java code, we allocate the validity buffer at initial allocation
> > always. We're also looking to enhance the allocation strategy so the
> fixed
> > part of values are always allocated with validity (single allocation) to
> > avoid any extra object housekeeping.
> >
> >
> > > 2. It's only subjective until the code complexity is measured, then
> it's
> > > not subjective. I suppose after 20 years of using sentinels, I'm used
> to it
> > > and trust it. I'll keep an open mind on this.
> > >
> > Yup, fair enough.
> >
> >
> > > 3. Since I criticized the scale of Wes' benchmark, I felt I should
> show how
> > > I do benchmarks myself to show where I'm coming from. Yes none-null,
> > > some-null and all-null paths offer savings. But that's the same under
> both
> > > sentinel and bitmap approaches. Under both approaches, you just need to
> > > know which case you're in. That involves storing the number of NAs in
> the
> > > header/summary which can be done under both approaches.
> > >
> >
> > The item we appreciate is that you can do a single comparison every 64
> > values to determine which of the three cases you are in (make this a
> local
> > decision). This means you don't have to do housekeeping ahead of time. It
> > also means that the window of choice is narrow, minimizing the penalty in
> > situations where you have rare invalid values (or rare valid values).
>

Reply via email to