1. I see. Good idea. Can we assume bitmap is always present in Arrow then?
I thought I'd seen Wes argue that if there were no NAs, the bitmap doesn't
need to be allocated.  Indeed I wasn't worried about the extra storage,
although for 10,000 columns I wonder about the number of vectors.
2. It's only subjective until the code complexity is measured, then it's
not subjective. I suppose after 20 years of using sentinels, I'm used to it
and trust it. I'll keep an open mind on this.
3. Since I criticized the scale of Wes' benchmark, I felt I should show how
I do benchmarks myself to show where I'm coming from. Yes none-null,
some-null and all-null paths offer savings. But that's the same under both
sentinel and bitmap approaches. Under both approaches, you just need to
know which case you're in. That involves storing the number of NAs in the
header/summary which can be done under both approaches.

On Mon, Nov 5, 2018 at 2:58 PM Jacques Nadeau <jacq...@apache.org> wrote:

> A few random thoughts...
>
> It seems like you outline three things as possible concerns:
>
> 1) Late allocation failure
> 2) Code complexity, errors.
> 3) Performance
>
> For 1: is the validity bit sizing really too expensive to allocate on
> initial allocation? In our experience, it is nominal in relationship to the
> size of the rest of the data. Is your data different than this? If you
> assume an initial allocation rather than a lazy allocation, your concern of
> late failure go away. (If you want to pre-initialize to all set instead of
> not-set by default, that seems like a minor memory clearing change.)
>
> For 2: I think this is very subjective. Having code that checks for a
> particular sentinel and then has special handling for it seems much more
> complex/error prone to me than keeping validity information completely
> independent of value storage. In effective, sentinel values actually
> compound two completely independent concerns. This becomes even more
> complex in situations where sentinels can change depending on the range of
> possible data.
>
> For 3: I'm not sure how the db-benchmark is helpful for this discussion
> since it isn't granular enough to clearly understand the difference of this
> particular design choice. I can assure you that the ability to eliminate
> large amount of branches and operations in a null-if-null complex
> expression tree pattern made substantial performance improvements in
> Gandiva. Even in Dremio, we've specialized for sub-segments of data where
> you have three custom paths: none-null, some-null, all-null. Being able to
> do that without having to read each value individually can be the largest
> of all wins, especially where many people declare all data as nullable but
> rarely use that nullability.
>
>
>
>
>
>
> On Mon, Nov 5, 2018 at 2:31 PM Matt Dowle <mattjdo...@gmail.com> wrote:
>
> > Hi,
> > (First post to this mailing list.)
> > I tweeted here and Wes invited me to follow up on this list :
> > https://twitter.com/wesmckinn/status/1059440916987961346
> >
> > Wes - it was great to meet you at Stanford in September. There I
> mentioned
> > the assign/update aspect which is a downside of bitmap for NA, imho. I
> did
> > see your recent article but I didn't see the update aspect mentioned
> there.
> > It's rarely about speed for me and I'm uncomfortable when the longest
> > runtime presented is just 0.04s.
> >
> > The concern surrounds updating (UPDATE in SQL, sub-assign in Python/R).
> If
> > there are no NAs in the column,  the first assignment of an NA has to
> > allocate the bitmap vector and link that properly to the column. This
> > allocation could fail (albeit rarely) due to out-of-memory so code has to
> > exist to deal with that properly, from multi-threaded code too. I
> mentioned
> > cache lines in the tweet not to focus on speed, but to focus on the
> > complexity of updating the several places correctly from multi-threaded
> > code: updating the NA count, the bitmap value and the column itself is 3
> > assigns all to be correctly wrapped in one critical section?
> >
> > In contrast, the sentinel approach (INT_MIN) is one assign to one place.
> > That's simpler. There's less scope for a corruption since it isn't
> possible
> > for a junk value to be wrongly included due to the bitmap values not
> being
> > set properly, somehow. Simpler means less code. When we're talking about
> > multi-threaded update to memory-mapped shared data, small simplifications
> > like this (sentinel) can help a lot for robustness, safety and
> correctness
> > while keeping the internal code required to achieve that to a minimum.
> >
> > Here is an example benchmark which I consider more relevant to make
> design
> > decisions based on: https://h2oai.github.io/db-benchmark/.  More
> relevant
> > because the scale is in minutes and not-working. It's this not-working
> > aspect that has often been in my mind when making data.table more memory
> > efficient, for example. Not really raw sub-second speed per se.  Bitmap
> for
> > NA has a higher chance of resulting in instability (crashes or incorrect
> > results) than sentinel simply due to there being more parts (three things
> > to keep in sync when updating, including a possible allocate): more to go
> > wrong at the low level.
> >
> > No?  I'm interested to hear further and I would like to use Arrow. I
> really
> > would. But this is a major sticking point.
> >
> > Best, Matt
> >
>

Reply via email to