tustvold opened a new issue #284: URL: https://github.com/apache/arrow-rs/issues/284
It is often the case that a RecordBatch is sorted lexicographically on one or more columns, and knowing this allows eliminating redundant sorts, performing more efficient lookups, etc.

To give a concrete use case, within [IOx](https://github.com/influxdata/influxdb_iox/) data is compacted into sorted, read-only blocks that must periodically be merged together into new sorted, read-only blocks. The result is that we perform a lot of operations on blocks of sorted data, and these operations would benefit from the blocks being able to express that they are sorted.

It should be noted that the parquet format already has a similar concept stored in its metadata, see [here](https://github.com/apache/parquet-format/blob/2e23a1168f50e83cacbbf970259a947e430ebe3a/src/main/thrift/parquet.thrift#L827), although I've yet to find an implementation that actually makes use of it.

In the short term I can work around this with IOx-specific logic, but I thought it worthwhile to start a conversation about introducing a standardised way to represent this in an arrow schema, as I imagine crates like Datafusion would also stand to benefit from it.

There are some areas that I can see being pretty gnarly, however. Datafusion, and I imagine other systems, use a single schema to refer to a collection of RecordBatches. Some logic would therefore be needed to compute the common sort order "prefix" between the record batches. A similar, but more complex, issue would arise when merging schemas.

I'm not sure if this is even the right place to be raising this issue, but thought it couldn't hurt to do so :smile:
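
To make the "common sort order prefix" idea a bit more concrete, here is a minimal sketch in plain Rust. Nothing in it exists in arrow today: `SortKey` and `common_sort_prefix` are hypothetical names, and the sketch assumes that each batch's sort order can be described as an ordered list of (column, direction, null placement) entries.

```rust
/// Hypothetical description of one sort key. This is NOT an arrow type;
/// it only illustrates what per-batch sort metadata might carry.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortKey {
    pub column: String,
    pub descending: bool,
    pub nulls_first: bool,
}

/// Returns the longest leading run of sort keys shared by every batch.
///
/// A collection of batches can only be treated as sorted on this shared
/// prefix. An empty input yields an empty order, as does any pair of
/// batches whose first keys differ.
pub fn common_sort_prefix(orders: &[Vec<SortKey>]) -> Vec<SortKey> {
    let mut iter = orders.iter();
    let first = match iter.next() {
        Some(first) => first,
        None => return Vec::new(),
    };

    // Start with the whole first order and shrink it as other batches disagree.
    let mut len = first.len();
    for order in iter {
        let shared = first
            .iter()
            .zip(order.iter())
            .take_while(|(a, b)| a == b)
            .count();
        len = len.min(shared);
    }
    first[..len].to_vec()
}
```

With three batches sorted on `(host, region, time)`, `(host, region)` and `(host, time)`, this would report `(host)` as the only order that holds for all of them. Keys are compared in full, so a batch sorted on `host` descending shares no prefix with one sorted on `host` ascending. Note that this only says each batch is individually sorted on the prefix; it says nothing about ordering across batches.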
