Re: [Format] Bounded numbers?

2021-06-22 Thread Micah Kornfield
To add onto Antoine's two points:

> 1. expressing the semantics (and perhaps enforcing them, e.g. return an
>    error when an addition gives a result out of bounds)

There was a proposed extension type to capture Range/Interval (
https://issues.apache.org/jira/browse/ARROW-12637).  I can imagine having a
kernel that takes a scalar range/interval and applies it to integers (and
then maybe tracks the metadata).  There have also been some discussions on
structured metadata for data statistics, but no one has put in the effort to
formalize a proposal for this.
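
To make the kernel idea a bit more concrete, here is a minimal pure-Python
sketch. It is not an Arrow API: the function name check_range, the use of
None to model nulls, and the choice to treat nulls as in range are all
assumptions made for illustration. It checks integers against a closed
interval and hands the bounds back so a caller could track them as metadata.

```python
from typing import List, Optional, Sequence, Tuple


def check_range(
    values: Sequence[Optional[int]], lower: int, upper: int
) -> Tuple[List[bool], Tuple[int, int]]:
    """Hypothetical kernel: flag values inside the closed interval [lower, upper]."""
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    # None models a null slot; treating nulls as in range is an assumption.
    mask = [v is None or lower <= v <= upper for v in values]
    # Return the bounds alongside the mask so they can be carried as metadata.
    return mask, (lower, upper)


mask, bounds = check_range([0, 55, None, 101], lower=0, upper=100)
```

A real kernel would operate on Arrow buffers and validity bitmaps rather
than Python lists, but the contract would look roughly the same.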


> 2. improving performance / resource usage

As Antoine noted, Parquet encoding already deals with this well.  For Arrow,
at some point we might introduce alternative encodings
(https://github.com/apache/arrow/pull/4815 is an old proposal) that
could reduce the bit-width requirements and save space.  As Antoine also
noted, I don't expect Arrow to support integers with non-power-of-2 bit
widths, though.  There have also been some proposals to support lower
bit-width Decimal types, which could help for things like temperature.

On Tue, Jun 22, 2021 at 7:02 AM Alessandro Molina <
alessan...@ursacomputing.com> wrote:

> On Tue, Jun 22, 2021 at 12:27 PM Antoine Pitrou wrote:
>
> > On Mon, 21 Jun 2021 23:50:29 -0400
> > Ying Zhou  wrote:
> > > Hi,
> > >
> > > In data people use there are often bounded numbers, mostly integers with
> > > clear and fixed upper and lower bounds but also decimals and floats as well
> > > e.g. test scores, numerous codes in older databases, max temperature of a
> > > city, latitudes, longitudes, numerous IDs etc. I wonder whether we should
> > > include such types in Arrow (and more importantly in Parquet & Avro where
> > > size matters a lot more).
> >
> > You are expressing two separate concerns here:
> > 1. expressing the semantics (and perhaps enforcing them, e.g. return an
> >    error when an addition gives a result out of bounds)
> >
>
> I wonder if DictionaryArray could be a foundation for such semantics. It
> doesn't seem unreasonable to have a check that prevents you from adding
> values that are outside of the values accepted by the dictionary. That seems
> reasonable for most things like test scores, temperatures, etc., but
> probably unreasonable for things with a bigger domain of valid values, like
> coordinates and floats in general.
>


Re: [Format] Bounded numbers?

2021-06-22 Thread Alessandro Molina
On Tue, Jun 22, 2021 at 12:27 PM Antoine Pitrou  wrote:

> On Mon, 21 Jun 2021 23:50:29 -0400
> Ying Zhou  wrote:
> > Hi,
> >
> > In data people use there are often bounded numbers, mostly integers with
> > clear and fixed upper and lower bounds but also decimals and floats as well
> > e.g. test scores, numerous codes in older databases, max temperature of a
> > city, latitudes, longitudes, numerous IDs etc. I wonder whether we should
> > include such types in Arrow (and more importantly in Parquet & Avro where
> > size matters a lot more).
>
> You are expressing two separate concerns here:
> 1. expressing the semantics (and perhaps enforcing them, e.g. return an
>    error when an addition gives a result out of bounds)
>

I wonder if DictionaryArray could be a foundation for such semantics. It
doesn't seem unreasonable to have a check that prevents you from adding
values that are outside of the values accepted by the dictionary. That seems
reasonable for most things like test scores, temperatures, etc., but
probably unreasonable for things with a bigger domain of valid values, like
coordinates and floats in general.
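
As a rough sketch of what such a check could look like, the pure-Python
class below models a dictionary-encoded column that refuses values missing
from its dictionary. The class DomainDictionary and its methods are
hypothetical names for illustration only; this is not how DictionaryArray
behaves today.

```python
from typing import Dict, Hashable, List, Sequence


class DomainDictionary:
    """Hypothetical dictionary-encoded column that rejects unknown values."""

    def __init__(self, allowed: Sequence[Hashable]) -> None:
        # The dictionary: the full set of values the column may contain.
        self.values: List[Hashable] = list(allowed)
        self._index: Dict[Hashable, int] = {v: i for i, v in enumerate(self.values)}
        # The encoded data: one dictionary index per appended value.
        self.indices: List[int] = []

    def append(self, value: Hashable) -> None:
        try:
            self.indices.append(self._index[value])
        except KeyError:
            raise ValueError(f"{value!r} is not in the dictionary") from None

    def to_list(self) -> List[Hashable]:
        # Decode indices back into the original values.
        return [self.values[i] for i in self.indices]


scores = DomainDictionary(range(0, 101))  # test scores from 0 to 100
scores.append(87)
scores.append(100)
try:
    scores.append(101)  # outside the dictionary, so it should be rejected
    rejected = False
except ValueError:
    rejected = True
```

The dictionary has to enumerate every valid value, which is why this looks
workable for 101 test scores and much less so for floats.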


Re: [Format] Bounded numbers?

2021-06-22 Thread Wes McKinney
If you need to use them in an application that is built on Arrow and
Parquet, you can certainly implement an Arrow extension type (on top
of FixedSizeBinary in Arrow, for example).

On Tue, Jun 22, 2021 at 5:27 AM Antoine Pitrou  wrote:
>
> On Mon, 21 Jun 2021 23:50:29 -0400
> Ying Zhou  wrote:
> > Hi,
> >
> > In data people use there are often bounded numbers, mostly integers with 
> > clear and fixed upper and lower bounds but also decimals and floats as well 
> > e.g. test scores, numerous codes in older databases, max temperature of a 
> > city, latitudes, longitudes, numerous IDs etc. I wonder whether we should 
> > include such types in Arrow (and more importantly in Parquet & Avro where 
> > size matters a lot more).
>
> You are expressing two separate concerns here:
> 1. expressing the semantics (and perhaps enforcing them, e.g. return an
>    error when an addition gives a result out of bounds)
> 2. improving performance / resource usage
>
> I would reject concern #2.  In Arrow, we probably don't want to
> standardize integers with a non-power of two bitwidth.  In Parquet,
> integer compression already takes advantage of actual magnitude (using
> e.g. DELTA_BINARY_PACKED:
> https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5).
> Additional information about the expected magnitude would probably not
> bring any additional gains.
>
> As for concern #1, I have no strong opinion.  Perhaps that could be
> expressed as custom metadata, or perhaps as a dedicated
> parametric BoundInteger datatype.
>
> Regards
>
> Antoine.
>
>


Re: [Format] Bounded numbers?

2021-06-22 Thread Antoine Pitrou
On Mon, 21 Jun 2021 23:50:29 -0400
Ying Zhou  wrote:
> Hi,
> 
> In data people use there are often bounded numbers, mostly integers with 
> clear and fixed upper and lower bounds but also decimals and floats as well 
> e.g. test scores, numerous codes in older databases, max temperature of a 
> city, latitudes, longitudes, numerous IDs etc. I wonder whether we should 
> include such types in Arrow (and more importantly in Parquet & Avro where 
> size matters a lot more).

You are expressing two separate concerns here:
1. expressing the semantics (and perhaps enforcing them, e.g. return an
   error when an addition gives a result out of bounds)
2. improving performance / resource usage

I would reject concern #2.  In Arrow, we probably don't want to
standardize integers with a non-power of two bitwidth.  In Parquet,
integer compression already takes advantage of actual magnitude (using
e.g. DELTA_BINARY_PACKED:
https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5).
Additional information about the expected magnitude would probably not
bring any additional gains.
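
To illustrate the point about magnitude, here is a deliberately simplified
sketch. It is not the DELTA_BINARY_PACKED algorithm itself, which works on
deltas between consecutive values in blocks; it only subtracts the minimum
and counts the bits needed for the remaining span. The helper name
bits_needed and the sample data are made up for the example.

```python
from typing import Sequence


def bits_needed(values: Sequence[int]) -> int:
    """Bits per value after subtracting the minimum (simplified packing)."""
    span = max(values) - min(values)
    # int.bit_length() is 0 for a span of 0, so keep at least one bit.
    return max(1, span.bit_length())


test_scores = [72, 88, 95, 100, 61, 0]
city_temps_c = [-12, 3, 18, 27, 35, 41]

score_bits = bits_needed(test_scores)   # span of 100 fits in 7 bits
temp_bits = bits_needed(city_temps_c)   # span of 53 fits in 6 bits
```

The width falls out of the values actually present, so a declared bound
would not tell the encoder anything it cannot already see.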

As for concern #1, I have no strong opinion.  Perhaps that could be
expressed as custom metadata, or perhaps as a dedicated
parametric BoundInteger datatype.
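
For concern #1, the enforcement semantics could look something like the
following pure-Python sketch. The class name BoundedInt, its fields, and the
choice of OverflowError are assumptions for illustration, not a proposed
Arrow type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundedInt:
    """Hypothetical integer parameterized by inclusive lower and upper bounds."""

    value: int
    lower: int
    upper: int

    def __post_init__(self) -> None:
        # Every construction, including results of arithmetic, is validated.
        if not self.lower <= self.value <= self.upper:
            raise OverflowError(
                f"{self.value} is outside [{self.lower}, {self.upper}]"
            )

    def __add__(self, other: "BoundedInt") -> "BoundedInt":
        if (self.lower, self.upper) != (other.lower, other.upper):
            raise TypeError("operands have different bounds")
        return BoundedInt(self.value + other.value, self.lower, self.upper)


a = BoundedInt(60, 0, 100)
b = BoundedInt(30, 0, 100)
total = a + b  # 90 stays within [0, 100]
try:
    a + a  # 120 is out of bounds, so this raises
    overflowed = False
except OverflowError:
    overflowed = True
```

Whether an out-of-bounds result should raise, saturate, or become null is
exactly the kind of question a real proposal would have to settle.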

Regards

Antoine.




[Format] Bounded numbers?

2021-06-21 Thread Ying Zhou
Hi,

Real-world data often contains bounded numbers: mostly integers with clear,
fixed upper and lower bounds, but also decimals and floats. Examples include
test scores, numerous codes in older databases, the maximum temperature of a
city, latitudes, longitudes, and numerous IDs. I wonder whether we should
include such types in Arrow (and, more importantly, in Parquet & Avro, where
size matters a lot more).

P.S. An implementation of bounded integers in C++ is here: 
https://github.com/davidstone/bounded-integer