> On Jan 28, 2019, at 11:22 AM, Wes McKinney <wesmck...@gmail.com> wrote:
> 
> I was having a discussion recently about Arrow and the topic of
> server-side filtering vs. client-side filtering came up.
> 
> The basic problem is this:
> 
> If you have a RecordBatch that you wish to filter out some of the
> "rows", one way to track this in-memory is to create a separate array
> of true/false values instead of forcing a materialization of a
> filtered RecordBatch.
> 
> So you might have a record batch with 2 fields and 4 rows
> 
> a: [1, 2, 3, 4]
> b: ['foo', 'bar', 'baz', 'qux']
> 
> and then a filter
> 
> is_selected: [true, true, false, true]

In Gandiva too, we use a slight variant of this: a selection vector that 
holds the indices of the selected records.
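
To make the two representations concrete, here is a minimal pure-Python 
sketch that converts the boolean filter above into a selection vector and 
back. The helper names are made up for illustration; they are not Gandiva or 
Arrow APIs.

```python
def mask_to_selection_vector(mask):
    """Return the indices of the rows whose mask value is true."""
    return [i for i, keep in enumerate(mask) if keep]


def selection_vector_to_mask(indices, num_rows):
    """Rebuild the boolean mask from the selected row indices."""
    mask = [False] * num_rows
    for i in indices:
        mask[i] = True
    return mask


is_selected = [True, True, False, True]
selection = mask_to_selection_vector(is_selected)
print(selection)  # [0, 1, 3]
```

The selection vector tends to be the smaller of the two when few rows 
survive the filter, while the mask has a fixed size of one entry per row.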

> 
> This can be easily handled as an application-level concern. Creating a
> bitmask is generally cheap relative to materializing the filtered
> version, and some operators may support "pushing down" such filters
> (e.g. aggregations may accept a selection mask to exclude "off"
> values). I myself implemented such a scheme in the past for a query
> engine I built in the 2013-2014 time frame and it yielded material
> performance improvements in some cases.
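
As a rough illustration of the "push down" idea, an aggregation can take 
the mask as an extra argument and skip the "off" rows, so no filtered copy 
of the column is ever built. This is only a pure-Python sketch; masked_sum 
is a hypothetical name, not an existing Arrow kernel.

```python
def masked_sum(values, mask):
    """Sum only the values whose mask entry is true.

    The input column is read in place; no filtered copy is materialized.
    """
    total = 0
    for value, keep in zip(values, mask):
        if keep:
            total += value
    return total


a = [1, 2, 3, 4]
is_selected = [True, True, False, True]
print(masked_sum(a, is_selected))  # 7
```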
> 
> One question is what you should do when you want to put the data on
> the wire, e.g. via RPC / Flight or IPC. Two options
> 
> * Pass the complete RecordBatch, plus the filter as a "special" field
> and attach some metadata so that you know that filter field is
> "special"
> 
> * Filter the RecordBatch before sending, send only the selected rows
> 
> The first option can of course be implemented as an application-level
> detail, much as we handle the serialization of pandas row indexes
> right now (where we have custom pandas metadata and "special"
> fields/columns for the index arrays). But it could be a common enough
> use case to merit a more well-defined standardized approach.

> 
> I'm not sure what the answer is, but I wanted to describe the problem
> as I see it and see if anyone has any thoughts about it.
> 
> I'm aware that Dremio is a user of "selection vectors" (selected
> indices, instead of boolean true/false values, so we would have [0, 1,
> 3] in the above case), so a similar discussion may apply to passing
> selection vectors on the wire.

I’m not sure it’s worth passing the unfiltered record batch on the wire (the 
cost of the additional network bytes will likely outweigh the materialization 
cost). But it’s definitely beneficial between operators (e.g. Filter and 
Projector).
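
A back-of-the-envelope estimate of the wire cost, under simplifying 
assumptions: a single fixed-width column, a filter sent as a bitmask of one 
bit per row, and no accounting for metadata, padding or validity bitmaps. 
The row counts and the 10% selectivity are hypothetical numbers chosen for 
illustration, not measurements.

```python
def bytes_unfiltered_with_mask(num_rows, bytes_per_value):
    """Bytes for the full column plus a 1-bit-per-row filter bitmask."""
    values = num_rows * bytes_per_value
    bitmask = (num_rows + 7) // 8  # round up to whole bytes
    return values + bitmask


def bytes_filtered(num_selected, bytes_per_value):
    """Bytes for a column that holds only the selected rows."""
    return num_selected * bytes_per_value


num_rows = 1_000_000      # hypothetical batch size
num_selected = 100_000    # hypothetical: 10% of rows pass the filter
print(bytes_unfiltered_with_mask(num_rows, 4))  # 4125000
print(bytes_filtered(num_selected, 4))          # 400000
```

The gap narrows as the fraction of selected rows grows, so the trade-off 
depends on how selective the filter is.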

Also, having an efficient way to materialize the filtered version will be 
useful (i.e. a selection vector remover).
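
Conceptually, such a remover just gathers the selected rows of every column. 
A minimal sketch, assuming the batch is modelled as a dict of plain Python 
lists (remove_selection_vector is a hypothetical helper, not Gandiva's 
implementation):

```python
def remove_selection_vector(columns, selection):
    """Materialize a filtered batch by gathering the selected rows.

    columns:   dict mapping field name -> list of values (equal lengths)
    selection: list of row indices to keep, in output order
    """
    return {
        name: [values[i] for i in selection]
        for name, values in columns.items()
    }


batch = {"a": [1, 2, 3, 4], "b": ["foo", "bar", "baz", "qux"]}
filtered = remove_selection_vector(batch, [0, 1, 3])
print(filtered)  # {'a': [1, 2, 4], 'b': ['foo', 'bar', 'qux']}
```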

> 
> - Wes
