Re: [Format] Bounded numbers?
To add onto Antoine's two points:

> 1. expressing the semantics (and perhaps enforcing them, e.g. return an error when an addition gives a result out of bounds)

There was a proposed extension type to capture Range/Interval (https://issues.apache.org/jira/browse/ARROW-12637). I can imagine having a kernel that takes a scalar range/interval and applies it to integers (and then maybe tracks the metadata). There have also been some discussions on structured metadata for data statistics, but no one has put in the effort to formalize a proposal on this.

> 2. improving performance / resource usage

As Antoine noted, Parquet encoding already deals with this well. For Arrow, at some point we might introduce alternative encodings (https://github.com/apache/arrow/pull/4815 is an old proposal) that could be used to reduce the bit-width requirements and so save space. As noted by Antoine, I don't expect Arrow to support non-power-of-2 integers though. There have also been some proposals to support lower bit-width Decimal types, which could also help for things like temperature.

On Tue, Jun 22, 2021 at 7:02 AM Alessandro Molina <alessan...@ursacomputing.com> wrote:

> On Tue, Jun 22, 2021 at 12:27 PM Antoine Pitrou wrote:
>
> > On Mon, 21 Jun 2021 23:50:29 -0400
> > Ying Zhou wrote:
> > > Hi,
> > >
> > > In data people use there are often bounded numbers, mostly integers with clear and fixed upper and lower bounds but also decimals and floats as well e.g. test scores, numerous codes in older databases, max temperature of a city, latitudes, longitudes, numerous IDs etc. I wonder whether we should include such types in Arrow (and more importantly in Parquet & Avro where size matters a lot more).
> >
> > You are expressing two separate concerns here:
> > 1. expressing the semantics (and perhaps enforcing them, e.g. return an error when an addition gives a result out of bounds)
>
> I wonder if DictionaryArray could be a foundation for such semantics. It doesn't seem unreasonable to have a check that prevents you from adding values that are outside of the values accepted by the dictionary. Seems reasonable to implement most things like test scores, temperatures etc... Probably unreasonable for things with a bigger domain of valid values like coordinates and floats in general.
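To make the "kernel that takes a scalar range/interval and applies it to integers" idea in the message above a bit more concrete, here is a minimal pure-Python sketch. It is not an Arrow API: `OutOfBoundsError`, `check_bounds` and `checked_add` are hypothetical names, plain lists stand in for integer arrays, and `None` stands in for a null slot. It only assumes a closed interval [low, high] is carried alongside the data.

```python
from typing import List, Optional, Sequence


class OutOfBoundsError(ValueError):
    """Raised when a value falls outside the declared closed interval."""


def check_bounds(values: Sequence[Optional[int]], low: int, high: int) -> None:
    """Validate every non-null value against the closed interval [low, high]."""
    for i, v in enumerate(values):
        if v is not None and not (low <= v <= high):
            raise OutOfBoundsError(
                f"value {v} at index {i} is outside [{low}, {high}]"
            )


def checked_add(
    a: Sequence[Optional[int]],
    b: Sequence[Optional[int]],
    low: int,
    high: int,
) -> List[Optional[int]]:
    """Element-wise addition that propagates nulls and rejects
    any result outside [low, high]."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    out = [None if x is None or y is None else x + y for x, y in zip(a, b)]
    check_bounds(out, low, high)
    return out
```

For example, `checked_add([10, None, 40], [5, 3, 50], 0, 100)` gives `[15, None, 90]`, while `checked_add([60], [50], 0, 100)` raises `OutOfBoundsError`. Whether the bounds would live in an extension type, in custom metadata, or in a kernel option is exactly the open question in this thread; the sketch takes no position on that.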
Re: [Format] Bounded numbers?
On Tue, Jun 22, 2021 at 12:27 PM Antoine Pitrou wrote:

> On Mon, 21 Jun 2021 23:50:29 -0400
> Ying Zhou wrote:
> > Hi,
> >
> > In data people use there are often bounded numbers, mostly integers with clear and fixed upper and lower bounds but also decimals and floats as well e.g. test scores, numerous codes in older databases, max temperature of a city, latitudes, longitudes, numerous IDs etc. I wonder whether we should include such types in Arrow (and more importantly in Parquet & Avro where size matters a lot more).
>
> You are expressing two separate concerns here:
> 1. expressing the semantics (and perhaps enforcing them, e.g. return an error when an addition gives a result out of bounds)

I wonder if DictionaryArray could be a foundation for such semantics. It doesn't seem unreasonable to have a check that prevents you from adding values that are outside of the values accepted by the dictionary. Seems reasonable to implement most things like test scores, temperatures etc... Probably unreasonable for things with a bigger domain of valid values like coordinates and floats in general.
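The dictionary-based check suggested above could look roughly like the following pure-Python sketch. It does not use the real DictionaryArray API; `dictionary_encode_checked` is a hypothetical helper, the `allowed` list plays the role of the dictionary, and `None` stands in for a null slot.

```python
from typing import Hashable, List, Optional, Sequence


def dictionary_encode_checked(
    values: Sequence[Optional[Hashable]],
    allowed: Sequence[Hashable],
) -> List[Optional[int]]:
    """Encode values as indices into `allowed`, refusing anything
    that is not already present in the dictionary."""
    index_of = {v: i for i, v in enumerate(allowed)}
    indices: List[Optional[int]] = []
    for v in values:
        if v is None:
            indices.append(None)  # nulls pass through unchanged
        elif v in index_of:
            indices.append(index_of[v])
        else:
            raise ValueError(f"{v!r} is not an accepted dictionary value")
    return indices


# Hypothetical example: test scores that may only be multiples of 5.
allowed_scores = list(range(0, 101, 5))
encoded = dictionary_encode_checked([0, 50, None, 100], allowed_scores)
```

The dictionary has to enumerate every valid value, which is why this is workable for a few dozen test scores but, as noted above, not for coordinates or floats in general.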
Re: [Format] Bounded numbers?
If you need to use them in an application that is built on Arrow and Parquet, you can certainly implement an Arrow extension type (on top of FixedSizeBinary in Arrow, for example).

On Tue, Jun 22, 2021 at 5:27 AM Antoine Pitrou wrote:
>
> On Mon, 21 Jun 2021 23:50:29 -0400
> Ying Zhou wrote:
> > Hi,
> >
> > In data people use there are often bounded numbers, mostly integers with clear and fixed upper and lower bounds but also decimals and floats as well e.g. test scores, numerous codes in older databases, max temperature of a city, latitudes, longitudes, numerous IDs etc. I wonder whether we should include such types in Arrow (and more importantly in Parquet & Avro where size matters a lot more).
>
> You are expressing two separate concerns here:
> 1. expressing the semantics (and perhaps enforcing them, e.g. return an error when an addition gives a result out of bounds)
> 2. improving performance / resource usage
>
> I would reject concern #2. In Arrow, we probably don't want to standardize integers with a non-power of two bitwidth. In Parquet, integer compression already takes advantage of actual magnitude (using e.g. DELTA_BINARY_PACKED: https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5). Additional information about the expected magnitude would probably not bring any additional gains.
>
> As for concern #1, I have no strong opinion. Perhaps that could be expressed as custom metadata, or perhaps as a dedicated parametric BoundInteger datatype.
>
> Regards
>
> Antoine.
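As a rough illustration of what an application-level type stored in fixed-size binary slots might hold, here is a pure-Python sketch. It is not the Arrow extension type API, and the layout is only an assumption made for the example: each value is stored as its unsigned offset from the lower bound, little-endian, in the smallest whole number of bytes that covers the range. `byte_width`, `encode` and `decode` are hypothetical helpers.

```python
def byte_width(low: int, high: int) -> int:
    """Smallest whole number of bytes that can hold any offset in [0, high - low]."""
    if low > high:
        raise ValueError("low must not exceed high")
    span = high - low
    return max(1, (span.bit_length() + 7) // 8)


def encode(value: int, low: int, high: int) -> bytes:
    """Store `value` as its unsigned little-endian offset from `low`."""
    if not low <= value <= high:
        raise ValueError(f"{value} is outside [{low}, {high}]")
    return (value - low).to_bytes(byte_width(low, high), "little")


def decode(raw: bytes, low: int, high: int) -> int:
    """Inverse of `encode`; also rejects byte patterns that map above `high`."""
    if len(raw) != byte_width(low, high):
        raise ValueError("unexpected slot width")
    value = int.from_bytes(raw, "little") + low
    if value > high:
        raise ValueError(f"decoded value {value} is outside [{low}, {high}]")
    return value
```

With bounds of -90 and 90 (whole-degree latitudes, say) every value fits in a single byte, and `decode(encode(42, -90, 90), -90, 90)` round-trips to `42`. A real extension type would additionally need to carry the bounds in its type parameters or metadata so that readers can interpret the bytes.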
Re: [Format] Bounded numbers?
On Mon, 21 Jun 2021 23:50:29 -0400
Ying Zhou wrote:
> Hi,
>
> In data people use there are often bounded numbers, mostly integers with clear and fixed upper and lower bounds but also decimals and floats as well e.g. test scores, numerous codes in older databases, max temperature of a city, latitudes, longitudes, numerous IDs etc. I wonder whether we should include such types in Arrow (and more importantly in Parquet & Avro where size matters a lot more).

You are expressing two separate concerns here:
1. expressing the semantics (and perhaps enforcing them, e.g. return an error when an addition gives a result out of bounds)
2. improving performance / resource usage

I would reject concern #2. In Arrow, we probably don't want to standardize integers with a non-power of two bitwidth. In Parquet, integer compression already takes advantage of actual magnitude (using e.g. DELTA_BINARY_PACKED: https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5). Additional information about the expected magnitude would probably not bring any additional gains.

As for concern #1, I have no strong opinion. Perhaps that could be expressed as custom metadata, or perhaps as a dedicated parametric BoundInteger datatype.

Regards

Antoine.
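To see why declared bounds would add little on top of an encoding that already adapts to actual magnitude, here is a deliberately simplified pure-Python sketch of the core idea behind delta bit-packing. It is not the DELTA_BINARY_PACKED format itself (which, per the linked spec, works in blocks and miniblocks with its own headers); `packed_bit_width` is a hypothetical helper that treats the whole input as one block.

```python
from typing import Sequence


def packed_bit_width(values: Sequence[int]) -> int:
    """Bits needed per stored delta after subtracting the minimum delta.

    Simplification: the whole sequence is treated as a single block.
    """
    if len(values) < 2:
        return 0  # nothing to store beyond the first value
    deltas = [b - a for a, b in zip(values, values[1:])]
    min_delta = min(deltas)
    # After subtracting the minimum, every adjusted delta is >= 0.
    widest = max(d - min_delta for d in deltas)
    return widest.bit_length()


# IDs around one million: the deltas are 3, -2 and 6, so the adjusted
# deltas are 5, 0 and 8, and 4 bits per delta are enough.
width = packed_bit_width([1_000_000, 1_000_003, 1_000_001, 1_000_007])
```

The width is derived from the data actually present, not from a declared type-level range, so a bounded-integer annotation would not make the encoded column any smaller in this scheme.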
[Format] Bounded numbers?
Hi,

In data people use there are often bounded numbers, mostly integers with clear and fixed upper and lower bounds but also decimals and floats as well e.g. test scores, numerous codes in older databases, max temperature of a city, latitudes, longitudes, numerous IDs etc. I wonder whether we should include such types in Arrow (and more importantly in Parquet & Avro where size matters a lot more).

P.S. An implementation of bounded integers in C++ is here: https://github.com/davidstone/bounded-integer