Thank you for your feedback, Antoine. I updated my draft PR and added the 'range_inc' type: https://github.com/apache/arrow/pull/50028/
Please let me know if you have any further suggestions :)

Best,
Hoeze

On 02.06.26 at 19:00, Antoine Pitrou wrote:
On 25/05/2026 at 16:54, Hoeze wrote:

Yes, you're right, the current proposal would probably not be sufficient for continuous PostgreSQL ranges.

Column-level boundary flags were intentional, as they allow closedness to be checked in the schema instead of at runtime. This is also how Pandas' `IntervalArray`/`IntervalIndex` works.

PostgreSQL's built-in discrete ranges (`int4range`, `int8range`, `daterange`) canonicalize to left-closed intervals; here my proposal would be sufficient. However, continuous ranges (`numrange`, `tsrange`, `tstzrange`, ...) cannot be canonicalized. In this case my proposal would indeed not be flexible enough.

I could imagine a number of possible solutions to this shortcoming:

* Union type of all four closedness versions: Possible but not very elegant. It would shift the implementation burden towards applications, which would have to support union types.
* Create a separate canonical data type for per-value boundary flags: Storage type `Struct<lower: T, upper: T, lower_inc: bool, upper_inc: bool>`, mirroring PostgreSQL's internal representation. Both types would coexist: `arrow.range` for the uniform case (and for canonicalized discrete PostgreSQL ranges), and e.g. `arrow.range_inc` for continuous (PostgreSQL) ranges.
* Extend `arrow.range` itself with a per-value mode: Keep a single extension type, but allow `{"closed": "per_value"}` in the metadata, in which case the storage struct gains two boolean fields `lower_inc` and `upper_inc`. One extension name, two storage layouts. Simpler from a type-registry standpoint, slightly more conditional logic in implementations.
* Always store per-value flags: Drop the metadata key entirely and always use `Struct<lower: T, upper: T, lower_inc: bool, upper_inc: bool>`. Two extra bytes per row uncompressed, but highly RLE/dictionary-friendly when uniform (which it usually is). Maximally simple to specify, at the cost of some overhead in the common pandas-style case.
I currently lean towards the second option, as it preserves the schema-level check for the common case while still giving continuous ranges with per-value closedness a lossless path. The fixed-shape and variable-shape tensor extension types went the same route. The main alternative would be option 3, but a single extension name covering two storage layouts ties the layout to a JSON metadata field rather than to the type name itself, which I believe is easier for downstream tooling to get wrong. What do you think?

I agree that option 2 sounds best; the tensor analogy is spot-on.

Regards

Antoine.
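
For readers following the thread, the difference between the uniform and the per-value layout can be sketched in plain Python. This is only an illustration, not code from the PR: the names `RangeInc`, `canonicalize_discrete` and `to_uniform_column` are hypothetical, the `closed` values (`left`, `right`, `both`, `neither`) are borrowed from pandas and assumed rather than taken from the proposal, and plain tuples stand in for Arrow struct storage.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class RangeInc:
    """One row of the per-value layout: bounds plus inclusivity flags."""
    lower: int
    upper: int
    lower_inc: bool
    upper_inc: bool


# Assumed naming for the column-level flag, following pandas' `closed` values.
CLOSED_NAMES = {
    (True, False): "left",
    (False, True): "right",
    (True, True): "both",
    (False, False): "neither",
}


def canonicalize_discrete(r: RangeInc, step: int = 1) -> RangeInc:
    """Rewrite a discrete (integer) range as left-closed, right-open.

    This mirrors how PostgreSQL canonicalizes int4range/int8range to [lower, upper).
    It only works because a discrete type has a well-defined next value.
    """
    lower = r.lower if r.lower_inc else r.lower + step
    upper = r.upper + step if r.upper_inc else r.upper
    return RangeInc(lower, upper, True, False)


def to_uniform_column(
    rows: List[RangeInc],
) -> Optional[Tuple[str, List[Tuple[int, int]]]]:
    """Return (closed, bounds) if every row shares the same flags, else None.

    A None result means the column needs the per-value layout.
    """
    flags = {(r.lower_inc, r.upper_inc) for r in rows}
    if len(flags) != 1:
        return None
    closed = CLOSED_NAMES[flags.pop()]
    return closed, [(r.lower, r.upper) for r in rows]


# (1, 5] over the integers holds {2, 3, 4, 5}, i.e. [2, 6).
print(canonicalize_discrete(RangeInc(1, 5, False, True)))
```

Discrete rows can always be canonicalized first and then stored with a single column-level flag; continuous bounds have no next value to step to, so a column with mixed flags has to keep them per row.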
