tustvold opened a new issue #284:
URL: https://github.com/apache/arrow-rs/issues/284


   It is often the case that a RecordBatch is sorted lexicographically on one 
or more columns, and knowing this makes it possible to eliminate redundant 
sorts, perform more efficient lookups, etc.
   
   To give a concrete use-case, within 
[IOx](https://github.com/influxdata/influxdb_iox/) data is compacted into 
sorted, read-only blocks that periodically must be merged together into new 
sorted, read-only blocks. As a result, we perform a lot of operations on 
blocks of sorted data, and these operations would benefit from being able to 
express that the data is sorted. 
   
   It should be noted that the parquet format already has a similar concept 
stored in its metadata, see 
[here](https://github.com/apache/parquet-format/blob/2e23a1168f50e83cacbbf970259a947e430ebe3a/src/main/thrift/parquet.thrift#L827)
 although I've yet to find an implementation that actually makes use of it.
   
   In the short term I can work around this with IOx-specific logic, but I 
thought it worthwhile to start a conversation about introducing some sort of 
standardised way to represent this in an arrow schema, as I imagine crates like 
DataFusion would also stand to benefit from this.
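
   To make the idea a bit more concrete, here is a minimal sketch of one 
possible representation, assuming the sort order is carried in the custom 
string key-value metadata that arrow schemas already support. It uses only the 
standard library, with a plain `HashMap` standing in for the schema metadata. 
The `sort_order` key, the `SortKey` type and the `column asc|desc` text 
encoding are all hypothetical, not an existing arrow convention, and null 
ordering is left out for brevity.

```rust
use std::collections::HashMap;

/// Hypothetical metadata key; nothing in arrow defines this today.
const SORT_ORDER_KEY: &str = "sort_order";

/// Hypothetical description of one sort column.
#[derive(Debug, Clone, PartialEq, Eq)]
struct SortKey {
    column: String,
    descending: bool,
}

/// Encodes the keys as "col asc,col desc,...".
/// Simplification: column names must not contain ','.
fn encode(keys: &[SortKey]) -> String {
    keys.iter()
        .map(|k| format!("{} {}", k.column, if k.descending { "desc" } else { "asc" }))
        .collect::<Vec<_>>()
        .join(",")
}

/// Parses the encoding produced by `encode`; returns None if it is malformed.
fn decode(value: &str) -> Option<Vec<SortKey>> {
    if value.is_empty() {
        return Some(Vec::new());
    }
    value
        .split(',')
        .map(|part| {
            // Split on the last space so column names may contain spaces.
            let (column, direction) = part.rsplit_once(' ')?;
            let descending = match direction {
                "asc" => false,
                "desc" => true,
                _ => return None,
            };
            Some(SortKey { column: column.to_string(), descending })
        })
        .collect()
}

fn main() {
    let keys = vec![
        SortKey { column: "host".to_string(), descending: false },
        SortKey { column: "time".to_string(), descending: true },
    ];

    // Stand-in for the schema's key-value metadata.
    let mut metadata: HashMap<String, String> = HashMap::new();
    metadata.insert(SORT_ORDER_KEY.to_string(), encode(&keys));

    // A consumer that understands the key can recover the sort order.
    let decoded = metadata.get(SORT_ORDER_KEY).and_then(|v| decode(v));
    println!("{:?}", decoded);
}
```

   A string-encoded metadata entry has the advantage of needing no format 
change, but consumers that don't know about the key would silently drop or 
invalidate it, which is part of why a standardised representation seems worth 
discussing.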
   
   There are some areas that I can see being pretty gnarly, however. 
DataFusion, and I imagine other systems, use a single schema to refer to a 
collection of RecordBatches. Some logic would therefore be needed to compute 
the common sort order "prefix" between the record batches. A similar but more 
complex issue would arise when merging schemas. 
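
   As a rough illustration of the "prefix" computation, here is a minimal 
sketch, assuming each batch's sort order is available as an ordered list of 
keys. The `SortKey` type and the `common_sort_prefix` and `key` helpers are 
hypothetical names for this example only; the fields loosely mirror the 
descending / nulls-first flags in the parquet metadata mentioned above.

```rust
/// Hypothetical description of one sort column.
#[derive(Debug, Clone, PartialEq, Eq)]
struct SortKey {
    column: String,
    descending: bool,
    nulls_first: bool,
}

/// Returns the longest sort order that is a prefix of every input order.
/// No inputs, or any input with an unknown (empty) order, gives an empty result.
fn common_sort_prefix(orders: &[Vec<SortKey>]) -> Vec<SortKey> {
    let (first, rest) = match orders.split_first() {
        Some(split) => split,
        None => return Vec::new(),
    };

    // Length of the prefix of `first` shared by every order seen so far.
    let mut len = first.len();
    for order in rest {
        let shared = first
            .iter()
            .zip(order.iter())
            .take_while(|(a, b)| a == b)
            .count();
        len = len.min(shared);
    }
    first[..len].to_vec()
}

/// Convenience constructor: ascending, nulls last.
fn key(column: &str) -> SortKey {
    SortKey {
        column: column.to_string(),
        descending: false,
        nulls_first: false,
    }
}

fn main() {
    let a = vec![key("host"), key("region"), key("time")];
    let b = vec![key("host"), key("region")];
    let c = vec![key("host"), key("time")];

    // All three batches agree only on the leading "host" key.
    let prefix = common_sort_prefix(&[a, b, c]);
    println!("{:?}", prefix);
}
```

   Note that this only says each batch is individually sorted on the shared 
prefix; it says nothing about ordering across batches, which would need to be 
tracked separately.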
   
   I'm not sure if this is even the right place to be raising this issue, but 
thought it couldn't hurt to do so :smile: 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

