Agreed. Also, I would like to revise my previous comment about the small risk. While prototyping this I did hit some bumps, which primarily stemmed from two issues:

* I was unable to find Arrow/JSON files with a non-default decimal bitwidth among the arrow-testing generated files (I think we only have the on-the-fly generated file in archery).
* The FFI interface defaults the decimal bitwidth to 128 (`d:{precision},{scale}`), and implementations may not support the 256 case (e.g. Rust has no native i256).

For these cases, this could be the first non-default decimal implementation. So, maybe we should follow the standard procedure?
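To make the FFI point concrete, here is a minimal Rust sketch of what reading the format string involves, assuming the layout from the C data interface spec (`d:<precision>,<scale>` plus an optional `<bitwidth>` element that defaults to 128); the `DecimalType` struct and `parse_decimal_format` helper are hypothetical, not from any Arrow crate:

```rust
// Hypothetical parser for the Arrow C data interface decimal format string.
// Assumed layout (per the C data interface spec):
//   "d:<precision>,<scale>"            -> bitwidth defaults to 128
//   "d:<precision>,<scale>,<bitwidth>" -> explicit bitwidth, e.g. 32 or 64

#[derive(Debug, PartialEq)]
struct DecimalType {
    precision: u8,
    scale: i8,
    bit_width: u16,
}

fn parse_decimal_format(fmt: &str) -> Option<DecimalType> {
    let rest = fmt.strip_prefix("d:")?;
    let mut parts = rest.split(',');
    let precision: u8 = parts.next()?.parse().ok()?;
    let scale: i8 = parts.next()?.parse().ok()?;
    // The third element is optional; the spec default is 128 bits,
    // which is why 32/64-bit decimals are the "non-default" case.
    let bit_width: u16 = match parts.next() {
        Some(w) => w.parse().ok()?,
        None => 128,
    };
    Some(DecimalType { precision, scale, bit_width })
}

fn main() {
    // Default bitwidth: readers must interpret this as decimal128.
    assert_eq!(
        parse_decimal_format("d:12,2"),
        Some(DecimalType { precision: 12, scale: 2, bit_width: 128 })
    );
    // Explicit 64-bit decimal, as discussed in this thread.
    assert_eq!(
        parse_decimal_format("d:12,2,64"),
        Some(DecimalType { precision: 12, scale: 2, bit_width: 64 })
    );
}
```

An implementation that only knows the two-element form would reject (or worse, misread) the three-element one, which is exactly the integration gap described above.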
Best,
Jorge

On Tue, Mar 8, 2022 at 9:22 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> > I’d also like to chime in in favor of 32- and 64-bit decimals because
> > it’ll help achieve better performance on TPC-H (and maybe other
> > benchmarks). The decimal columns need only 12 digits of precision, for
> > which a 64-bit decimal is sufficient. It’s currently wasteful to use a
> > 128-bit decimal. You can technically use a float too, but I expect 64-bit
> > decimal to be faster.
>
> We should be careful here. If this assumes loading from Parquet or other
> file formats currently in the library, arbitrarily changing the type to
> load the minimum data length possible could break users; this should
> probably be a configuration option. This also reminds me that I think
> there is some technical debt with decimals and Parquet [1].
>
> [1] https://issues.apache.org/jira/browse/ARROW-12022
>
> On Tue, Mar 8, 2022 at 11:05 AM Sasha Krassovsky <krassovskysa...@gmail.com>
> wrote:
>
> > I’d also like to chime in in favor of 32- and 64-bit decimals because
> > it’ll help achieve better performance on TPC-H (and maybe other
> > benchmarks). The decimal columns need only 12 digits of precision, for
> > which a 64-bit decimal is sufficient. It’s currently wasteful to use a
> > 128-bit decimal. You can technically use a float too, but I expect 64-bit
> > decimal to be faster.
> >
> > Sasha Krassovsky
> >
> > > On 8 March 2022, at 09:01, Micah Kornfield <emkornfi...@gmail.com>
> > > wrote:
> > >
> > >> Do we want to keep the historical "C++ and Java" requirement or
> > >> do we want to make it a more flexible "two independent official
> > >> implementations", which could be for example C++ and Rust, Rust and
> > >> Java, etc.
> > >
> > > I think flexibility here is a good idea; I'd like to hear other
> > > opinions.
> > >
> > > For this particular case, if there aren't volunteers to help out in
> > > another implementation, I'm willing to help with Java (I don't have
> > > the bandwidth to do both C++ and Java).
> > >
> > > Cheers,
> > > -Micah
> > >
> > >> On Tue, Mar 8, 2022 at 8:23 AM Antoine Pitrou <anto...@python.org>
> > >> wrote:
> > >>
> > >> On 07/03/2022 at 20:26, Micah Kornfield wrote:
> > >>>> Relaxing from {128,256} to {32,64,128,256} seems a low risk
> > >>>> from an integration perspective, as implementations already need to
> > >>>> read the bitwidth to select the appropriate physical representation
> > >>>> (if they support it).
> > >>>
> > >>> I think there are two reasons for having implementations first:
> > >>> 1. Lower risk of bugs in the implementation/spec.
> > >>> 2. A mechanism to ensure that there is some bootstrapped coverage in
> > >>> commonly used reference implementations.
> > >>
> > >> That sounds reasonable.
> > >>
> > >> Another question that came to my mind is: traditionally, we've
> > >> mandated implementations in the two reference Arrow implementations
> > >> (C++ and Java).
> > >> However, our implementation landscape is now much richer than it
> > >> used to be (for example, there is tremendous activity on the Rust
> > >> side). Do we want to keep the historical "C++ and Java" requirement,
> > >> or do we want to make it a more flexible "two independent official
> > >> implementations", which could be for example C++ and Rust, Rust and
> > >> Java, etc.?
> > >>
> > >> (By "independent" I mean that one should not be based on the other;
> > >> for example, it should not be "C++ and Python" :-))
> > >>
> > >> Regards
> > >>
> > >> Antoine.
> > >>
> > >>> I agree 1 is fairly low-risk.
> > >>>
> > >>> On Mon, Mar 7, 2022 at 11:11 AM Jorge Cardoso Leitão <
> > >>> jorgecarlei...@gmail.com> wrote:
> > >>>
> > >>>> +1 adding 32- and 64-bit decimals.
> > >>>>
> > >>>> +0 to release it without integration tests - both IPC and the C data
> > >>>> interface use a variable bitwidth to declare the appropriate size
> > >>>> for decimal types. Relaxing from {128,256} to {32,64,128,256} seems
> > >>>> a low risk from an integration perspective, as implementations
> > >>>> already need to read the bitwidth to select the appropriate physical
> > >>>> representation (if they support it).
> > >>>>
> > >>>> Best,
> > >>>> Jorge
> > >>>>
> > >>>> On Mon, Mar 7, 2022, 11:41 Antoine Pitrou <anto...@python.org> wrote:
> > >>>>
> > >>>>> On 03/03/2022 at 18:05, Micah Kornfield wrote:
> > >>>>>> I think it makes sense to add these. Typically when adding new
> > >>>>>> types, we've waited on the official vote until there are two
> > >>>>>> reference implementations demonstrating compatibility.
> > >>>>>
> > >>>>> You are right, I had forgotten about that. Though in this case, it
> > >>>>> might be argued we are just relaxing the constraints on an existing
> > >>>>> type.
> > >>>>>
> > >>>>> What do others think?
> > >>>>>
> > >>>>> Regards
> > >>>>>
> > >>>>> Antoine.
> > >>>>>
> > >>>>>> On Thu, Mar 3, 2022 at 6:55 AM Antoine Pitrou <anto...@python.org>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> Hello,
> > >>>>>>>
> > >>>>>>> Currently, the Arrow format specification restricts the bitwidth
> > >>>>>>> of decimal numbers to either 128 or 256 bits.
> > >>>>>>>
> > >>>>>>> However, there is interest in allowing other bitwidths, at least
> > >>>>>>> 32 and 64 bits for this proposal. A 64-bit (respectively 32-bit)
> > >>>>>>> decimal datatype would allow for precisions of up to 18 digits
> > >>>>>>> (respectively 9 digits), which are sufficient for some
> > >>>>>>> applications that are mainly looking for exact computations
> > >>>>>>> rather than sheer precision. Obviously, smaller datatypes are
> > >>>>>>> cheaper to store in memory and cheaper to run computations on.
> > >>>>>>>
> > >>>>>>> For example, the Spark documentation mentions that some decimal
> > >>>>>>> types may fit in a Java int (32 bits) or long (64 bits):
> > >>>>>>>
> > >>>>>>> https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/DecimalType.html
> > >>>>>>>
> > >>>>>>> ... and a draft PR had even been filed for initial support in the
> > >>>>>>> C++ implementation (https://github.com/apache/arrow/pull/8578).
> > >>>>>>>
> > >>>>>>> I am therefore proposing that we relax the wording in the Arrow
> > >>>>>>> format specification to also allow 32- and 64-bit decimal types.
> > >>>>>>>
> > >>>>>>> This is a preliminary discussion to gather opinions and potential
> > >>>>>>> counter-arguments against this proposal. If no strong
> > >>>>>>> counter-argument emerges, we will probably run a vote in a week
> > >>>>>>> or two.
> > >>>>>>>
> > >>>>>>> Best regards
> > >>>>>>>
> > >>>>>>> Antoine.
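As a footnote to the precision numbers quoted above (9 digits for a 32-bit decimal, 18 digits for a 64-bit one), here is a minimal Rust sketch that double-checks them against the range of the corresponding signed integers; `max_full_digits` is an ad-hoc helper, not an Arrow API:

```rust
// Back-of-the-envelope check: an N-bit signed integer can faithfully hold
// every decimal number of P digits iff 10^P - 1 <= iN::MAX.

fn max_full_digits(max_abs: u128) -> u32 {
    let mut digits = 0;
    let mut bound: u128 = 9; // largest (digits + 1)-digit number: 9, 99, 999, ...
    while bound <= max_abs {
        digits += 1;
        bound = bound * 10 + 9;
    }
    digits
}

fn main() {
    // 32-bit decimal: up to 9 digits of precision.
    assert_eq!(max_full_digits(i32::MAX as u128), 9);
    // 64-bit decimal: up to 18 digits (comfortably covers TPC-H's 12-digit columns).
    assert_eq!(max_full_digits(i64::MAX as u128), 18);
}
```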