On Fri, Nov 9, 2018 at 4:51 PM Matt Dowle <mattjdo...@gmail.com> wrote:
>
> > There is one database that I'm aware of that uses sentinels _and_
> supports complex types with missing values: Kx's KDB+.
> I read this and was pleased that KDB is being used as a reference.  It is a
> seriously good database: the gold-standard in many people's eyes.
>
> > This has led to some seriously strange choices like the ASCII space
> character being used as the sentinel value for strings.
> But then I saw this. Surely if sentinels are good enough for KDB then isn't
> that a sign that sentinels are not as bad as this group fears?

KDB has a good reputation in the financial world, but it is a very
niche product. I personally wouldn't draw any inferences about
database design from something with such a small and specialized
audience.

>
> What about grouping and joining columns that contain NA?   Here's an
> example from R data.table :
>
> > DT = data.table(x=c(1,3,3,NA,1,NA), v=1:6)
> > DT
>        x     v
>    <num> <int>
> 1:     1     1
> 2:     3     2
> 3:     3     3
> 4:    NA     4
> 5:     1     5
> 6:    NA     6
> > DT[,sum(v),keyby=x]
>        x    V1
>    <num> <int>
> 1:    NA    10
> 2:     1     6
> 3:     3     5
>
> The NAs are grouped as a distinct value and are not excluded for
> statistical robustness reasons.  This is very easy to achieve efficiently
> internally; in fact there is no special code to deal with the NA values
> because they are just another distinct value (the sentinel).  In Arrow if a
> bitmap is present, there would be more code needed to deal with the NAs
> (either way: including the NA group or excluding the NA group), if I
> understand correctly.

It depends on who's doing the analysis. Some database systems exclude
nulls in aggregations altogether. In others you indeed would need to
reserve an aggregation bucket to null and use the bitmap when
determining which bucket to update for each value

>
> On Thu, Nov 8, 2018 at 3:18 PM Phillip Cloud <cpcl...@gmail.com> wrote:
>
> > There is one database that I'm aware of that uses sentinels _and_ supports
> > complex types with missing values: Kx's KDB+. This has led to some
> > seriously strange choices like the ASCII space character being used as the
> > sentinel value for strings. See
> > https://code.kx.com/wiki/Reference/Datatypes for
> > more details.
> >
> > On Thu, Nov 8, 2018 at 4:39 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > hey Matt,
> > >
> > > Thanks for giving your perspective on the mailing list.
> > >
> > > My objective in writing about this recently
> > > (http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/, though I
> > > need to update since the sentinel case can be done more efficiently
> > > than what's there now) was to help dispel the notion that using a
> > > separate value (bit or byte) to encode nullness is a performance
> > > compromise to comply with the requirements of database systems. I too
> > > prefer real world benchmarks to microbenchmarks, and probably null
> > > checking is not going to be the main driver of aggregate system
> > > performance. I had heard many people over the years object to bitmaps
> > > on performance grounds but without analysis to back it up.
> > >
> > > Some context for other readers on the mailing list: A language like R
> > > is not a database and has fewer built-in scalar types: int32, double,
> > > string (interned), and boolean. Out of these, int32 and double can use
> > > one bit pattern for NA (null) and not lose too much. A database system
> > > generally can't make that kind of compromise, and most popular
> > > databases can distinguish INT32_MIN (or any other value used as a
> > > sentinel) and null. If you loaded data from an Avro or Parquet file
> > > that contained one of those values, you'd have to decide what to do
> > > with the data (though I understand there's integer64 add-on packages
> > > for R now)
> > >
> > > Now back to Arrow -- we have 3 main kinds of data types:
> > >
> > > * Fixed size primitive
> > > * Variable size primitive (binary, utf8)
> > > * Nested (list, struct, union)
> > >
> > > Out of these, "fixed size primitive" is the only one that can
> > > generally support O(1) in-place mutation / updates, though all of them
> > > could support a O(1) "make null" operation (by zeroing a bit). In
> > > general, when faced with designs we have preferred choices benefiting
> > > use cases where datasets are treated as immutable or copy-on-write.
> > >
> > > If an application _does_ need to do mutation on primitive arrays, then
> > > you could choose to always allocate the validity bitmap so that it can
> > > be mutated without requiring allocations to happen arbitrarily in your
> > > processing workflow. But, if you have data without nulls, it is a nice
> > > feature to be able to ignore the bitmap or not allocate one at all. If
> > > you constructed an array from data that you know to be non-nullable,
> > > some implementations might wish to avoid the waste of creating a
> > > bitmap with all 1's.
> > >
> > > For example, if we create an array::Array from a normal NumPy array of
> > > integers (which cannot have nulls), we have
> > >
> > > In [6]: import pyarrow as pa
> > > In [7]: import numpy as np
> > > In [8]: arr = pa.array(np.array([1, 2, 3, 4]))
> > >
> > > In [9]: arr.buffers()
> > > Out[9]: [None, <pyarrow.lib.Buffer at 0x7f34ecd3eea0>]
> > >
> > > In [10]: arr.null_count
> > > Out[10]: 0
> > >
> > > Normally, the first buffer would be the validity bitmap memory, but
> > > here it was not allocated because there are no nulls.
> > >
> > > Creating an open standard data representation is a difficult thing;
> > > one cannot be "all things to all people" but the intent is to be a
> > > suitable lingua franca for language agnostic data interchange and as a
> > > runtime representation for analytical query engines (where most
> > > operators are "pure"). If the Arrow community's goal were to create a
> > > "mutable column store" then some things might be designed differently
> > > (perhaps more like internals of https://kudu.apache.org/). It is
> > > helpful to have an understanding of what compromises have been made
> > > and how costly they are in real world applications.
> > >
> > > best
> > > Wes
> > > On Mon, Nov 5, 2018 at 8:27 PM Jacques Nadeau <jacq...@apache.org>
> > wrote:
> > > >
> > > > On Mon, Nov 5, 2018 at 3:43 PM Matt Dowle <mattjdo...@gmail.com>
> > wrote:
> > > >
> > > > > 1. I see. Good idea. Can we assume bitmap is always present in Arrow
> > > then?
> > > > > I thought I'd seen Wes argue that if there were no NAs, the bitmap
> > > doesn't
> > > > > need to be allocated.  Indeed I wasn't worried about the extra
> > storage,
> > > > > although for 10,000 columns I wonder about the number of vectors.
> > > > >
> > > >
> > > > I think different implementations handle this differently at the
> > moment.
> > > In
> > > > the Java code, we allocate the validity buffer at initial allocation
> > > > always. We're also looking to enhance the allocation strategy so the
> > > fixed
> > > > part of values are always allocated with validity (single allocation)
> > to
> > > > avoid any extra object housekeeping.
> > > >
> > > >
> > > > > 2. It's only subjective until the code complexity is measured, then
> > > it's
> > > > > not subjective. I suppose after 20 years of using sentinels, I'm used
> > > to it
> > > > > and trust it. I'll keep an open mind on this.
> > > > >
> > > > Yup, fair enough.
> > > >
> > > >
> > > > > 3. Since I criticized the scale of Wes' benchmark, I felt I should
> > > show how
> > > > > I do benchmarks myself to show where I'm coming from. Yes none-null,
> > > > > some-null and all-null paths offer savings. But that's the same under
> > > both
> > > > > sentinel and bitmap approaches. Under both approaches, you just need
> > to
> > > > > know which case you're in. That involves storing the number of NAs in
> > > the
> > > > > header/summary which can be done under both approaches.
> > > > >
> > > >
> > > > The item we appreciate is that you can do a single comparison every 64
> > > > values to determine which of the three cases you are in (make this a
> > > local
> > > > decision). This means you don't have to do housekeeping ahead of time.
> > It
> > > > also means that the window of choice is narrow, minimizing the penalty
> > in
> > > > situations where you have rare invalid values (or rare valid values).
> > >
> >

Reply via email to