[GitHub] [arrow-rs] tustvold opened a new issue #284: RecordBatch Sort Order

GitBox Tue, 11 May 2021 06:38:25 -0700


tustvold opened a new issue #284:
URL: https://github.com/apache/arrow-rs/issues/284

A RecordBatch is often sorted lexicographically on one or more columns, and
knowing this makes it possible to skip redundant sorts, perform more
efficient lookups, and so on.
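
As a rough illustration of the redundant-sort case, here is a minimal sketch. `SortColumn` and `sort_is_redundant` are hypothetical names invented for this example, not part of arrow-rs. It relies on the fact that data sorted lexicographically on a list of columns is also sorted on any leading prefix of that list.

```rust
/// Hypothetical description of one sort key; not part of the arrow-rs API.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortColumn {
    pub name: String,
    pub descending: bool,
}

/// Returns true when data already sorted by `existing` is also sorted by
/// `requested`, i.e. when `requested` is a leading prefix of `existing`.
/// An empty `requested` is trivially satisfied.
pub fn sort_is_redundant(existing: &[SortColumn], requested: &[SortColumn]) -> bool {
    existing.starts_with(requested)
}
```

With a batch known to be sorted on `(host, time)`, a request to sort on `host` alone could be skipped, whereas a request to sort on `time` alone could not.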

To give a concrete use-case, within
[IOx](https://github.com/influxdata/influxdb_iox/) data is compacted into
sorted, read-only blocks that periodically must be merged together into new
sorted, read-only blocks. As a result, we perform a lot of operations on
blocks of sorted data that would benefit from being able to express that
they are sorted.

It should be noted that the parquet format already has a similar concept
stored in its metadata, see
[here](https://github.com/apache/parquet-format/blob/2e23a1168f50e83cacbbf970259a947e430ebe3a/src/main/thrift/parquet.thrift#L827)
although I've yet to find an implementation that actually makes use of it.

In the short term I can work around this with IOx-specific logic, but I
thought it worthwhile to start a conversation about introducing some
standardised way to represent this in an arrow schema, as I imagine crates like
DataFusion would also stand to benefit from this.

There are some areas that I can see being pretty gnarly, however.
DataFusion, and I imagine other systems, use a single schema to refer to a
collection of RecordBatches. Some logic would therefore be needed to compute
the common sort order "prefix" between the record batches. A similar but more
complex issue would arise when merging schemas.
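
To make the "prefix" idea concrete, here is a minimal sketch of how it might be computed. `SortKey` and `common_sort_prefix` are hypothetical names used only for illustration, not existing arrow-rs or DataFusion APIs. It assumes each batch carries an ordered list of sort keys.

```rust
/// Hypothetical description of one sort key; not an arrow-rs type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortKey {
    pub column: String,
    pub descending: bool,
    pub nulls_first: bool,
}

/// Returns the longest leading run of sort keys shared by every batch.
/// An empty result means no common ordering can be assumed.
pub fn common_sort_prefix(orders: &[Vec<SortKey>]) -> Vec<SortKey> {
    let (first, rest) = match orders.split_first() {
        Some(parts) => parts,
        None => return Vec::new(),
    };
    // Start with the full first order and shrink it as other batches disagree.
    let mut len = first.len();
    for order in rest {
        let shared = first
            .iter()
            .zip(order.iter())
            .take_while(|(a, b)| a == b)
            .count();
        len = len.min(shared);
    }
    first[..len].to_vec()
}
```

Note that the result only says each batch is individually sorted on the shared keys. It does not imply that the batches are sorted relative to one another, so concatenating them would still need a merge.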

I'm not sure if this is even the right place to be raising this issue, but
thought it couldn't hurt to do so :smile:

