On Mon, Jan 28, 2019 at 12:53 AM Wes McKinney <wesmck...@gmail.com> wrote:
> I was having a discussion recently about Arrow and the topic of > server-side filtering vs. client-side filtering came up. > > The basic problem is this: > > If you have a RecordBatch that you wish to filter out some of the > "rows", one way to track this in-memory is to create a separate array > of true/false values instead of forcing a materialization of a > filtered RecordBatch. > > So you might have a record batch with 2 fields and 4 rows > > a: [1, 2, 3, 4] > b: ['foo', 'bar', 'baz', 'qux'] > > and then a filter > > is_selected: [true, true, false, true] > > This can be easily handled as an application-level concern. Creating a > bitmask is generally cheap relative to materializing the filtered > version, and some operators may support "pushing down" such filters > (e.g. aggregations may accept a selection mask to exclude "off" > values). I myself implemented such a scheme in the past for a query > engine I built in the 2013-2014 time frame and it yielded material > performance improvements in some cases. > > One question is what you should do when you want to put the data on > the wire, e.g. via RPC / Flight or IPC. Two options > > * Pass the complete RecordBatch, plus the filter as a "special" field > and attach some metadata so that you know that filter field is > "special" > > * Filter the RecordBatch before sending, send only the selected rows > > The first option can of course be implemented as an application-level > detail, much as we handle the serialization of pandas row indexes > right now (where we have custom pandas metadata and "special" > fields/columns for the index arrays). But it could be a common enough > use case to merit a more well-defined standardized approach. > > I'm not sure what the answer is, but I wanted to describe the problem > as I see it and see if anyone has any thoughts about it. > > I'm aware that Dremio is a user of "selection vectors" (selected > indices, instead of boolean true/false values, so we would have [0, 1, > 3] in the above case), so the similar discussion may apply to passing > selection vectors on the wire. > Some remarks: - The purpose of using selection vector/masks is to exploit late materialization as long as possible. I think having mask support make sense in the zero-copy IPC, e.g. same-host or host-to-gpu transfer (no true zero-copy but... avoid the scatter-copy preceeding the DMA copy). - Selection vector as opposed the raw bitmaps start making sense when selectivity is well under 50%. If we add the support of masking in the protocol, I'd make sure that it would have future compatibility for other types of compressed bitmap. This could be as simple as a tagged union with only one type for now. - I'd also take into account the iterator-composability with the current null-bitmap. (bitmap & bitmap) is easier to compose than (bitmap & selection-vector). - If we _don't_ add masking support to the protocol and still want filter push-down, we'll need to add a feature which is the dual, a column that represents the original row id (or the row offset).