Agreed. Also, I would like to revise my previous comment about the small risk. While prototyping this I did hit some bumps, which primarily stemmed from two issues:

* I was unable to find Arrow/JSON files with a non-default decimal bitwidth among the arrow-testing generated files (I think we only have the on-the-fly generated file in archery).
* The FFI interface defaults the decimal bitwidth to 128 (`d:{precision},{scale}`), and implementations may not support the 256 case (e.g. Rust has no native i256).

For these cases, this could be the first non-default decimal implementation. So, maybe we should follow the standard procedure?
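To make the FFI point concrete, here is a minimal Rust sketch of what reading the format string involves, assuming the layout from the C data interface spec (`d:<precision>,<scale>` plus an optional `<bitwidth>` element that defaults to 128); the `DecimalType` struct and `parse_decimal_format` helper are hypothetical, not from any Arrow crate:

```rust
// Hypothetical parser for the Arrow C data interface decimal format string.
// Assumed layout (per the C data interface spec):
//   "d:<precision>,<scale>"            -> bitwidth defaults to 128
//   "d:<precision>,<scale>,<bitwidth>" -> explicit bitwidth, e.g. 32 or 64

#[derive(Debug, PartialEq)]
struct DecimalType {
    precision: u8,
    scale: i8,
    bit_width: u16,
}

fn parse_decimal_format(fmt: &str) -> Option<DecimalType> {
    let rest = fmt.strip_prefix("d:")?;
    let mut parts = rest.split(',');
    let precision: u8 = parts.next()?.parse().ok()?;
    let scale: i8 = parts.next()?.parse().ok()?;
    // The third element is optional; the spec default is 128 bits,
    // which is why 32/64-bit decimals are the "non-default" case.
    let bit_width: u16 = match parts.next() {
        Some(w) => w.parse().ok()?,
        None => 128,
    };
    Some(DecimalType { precision, scale, bit_width })
}

fn main() {
    // Default bitwidth: readers must interpret this as decimal128.
    assert_eq!(
        parse_decimal_format("d:12,2"),
        Some(DecimalType { precision: 12, scale: 2, bit_width: 128 })
    );
    // Explicit 64-bit decimal, as discussed in this thread.
    assert_eq!(
        parse_decimal_format("d:12,2,64"),
        Some(DecimalType { precision: 12, scale: 2, bit_width: 64 })
    );
}
```

An implementation that only knows the two-element form would reject (or worse, misread) the three-element one, which is exactly the integration gap described above.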
Best,
Jorge

On Tue, Mar 8, 2022 at 9:22 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> > I’d also like to chime in in favor of 32- and 64-bit decimals because
> > it’ll help achieve better performance on TPC-H (and maybe other
> > benchmarks). The decimal columns need only 12 digits of precision, for
> > which a 64-bit decimal is sufficient. It’s currently wasteful to use a
> > 128-bit decimal. You can technically use a float too, but I expect 64-bit
> > decimal to be faster.
>
> We should be careful here. If this assumes loading from Parquet or other
> file formats currently in the library, arbitrarily changing the type to
> load the minimum data length possible could break users; this should
> probably be a configuration option. This also reminds me that I think
> there is some technical debt with decimals and Parquet [1].
>
> [1] https://issues.apache.org/jira/browse/ARROW-12022
>
> On Tue, Mar 8, 2022 at 11:05 AM Sasha Krassovsky <krassovskysa...@gmail.com>
> wrote:
>
> > I’d also like to chime in in favor of 32- and 64-bit decimals because
> > it’ll help achieve better performance on TPC-H (and maybe other
> > benchmarks). The decimal columns need only 12 digits of precision, for
> > which a 64-bit decimal is sufficient. It’s currently wasteful to use a
> > 128-bit decimal. You can technically use a float too, but I expect 64-bit
> > decimal to be faster.
> >
> > Sasha Krassovsky
> >
> > > On 8 March 2022, at 09:01, Micah Kornfield <emkornfi...@gmail.com>
> > > wrote:
> > >
> > >> Do we want to keep the historical "C++ and Java" requirement or
> > >> do we want to make it a more flexible "two independent official
> > >> implementations", which could be for example C++ and Rust, Rust and
> > >> Java, etc.
> > >
> > > I think flexibility here is a good idea; I'd like to hear other
> > > opinions.
> > >
> > > For this particular case, if there aren't volunteers to help out in
> > > another implementation, I'm willing to help with Java (I don't have
> > > the bandwidth to do both C++ and Java).
> > >
> > > Cheers,
> > > -Micah
> > >
> > >> On Tue, Mar 8, 2022 at 8:23 AM Antoine Pitrou <anto...@python.org>
> > >> wrote:
> > >>
> > >> On 07/03/2022 at 20:26, Micah Kornfield wrote:
> > >>>> Relaxing from {128,256} to {32,64,128,256} seems a low risk
> > >>>> from an integration perspective, as implementations already need to
> > >>>> read the bitwidth to select the appropriate physical representation
> > >>>> (if they support it).
> > >>>
> > >>> I think there are two reasons for having implementations first:
> > >>> 1. Lower risk of bugs in the implementation/spec.
> > >>> 2. A mechanism to ensure that there is some bootstrapped coverage in
> > >>> commonly used reference implementations.
> > >>
> > >> That sounds reasonable.
> > >>
> > >> Another question that came to my mind is: traditionally, we've
> > >> mandated implementations in the two reference Arrow implementations
> > >> (C++ and Java).
> > >> However, our implementation landscape is now much richer than it
> > >> used to be (for example, there is tremendous activity on the Rust
> > >> side). Do we want to keep the historical "C++ and Java" requirement,
> > >> or do we want to make it a more flexible "two independent official
> > >> implementations", which could be for example C++ and Rust, Rust and
> > >> Java, etc.?
> > >>
> > >> (By "independent" I mean that one should not be based on the other;
> > >> for example, it should not be "C++ and Python" :-))
> > >>
> > >> Regards
> > >>
> > >> Antoine.
> > >>
> > >>> I agree 1 is fairly low-risk.
> > >>>
> > >>> On Mon, Mar 7, 2022 at 11:11 AM Jorge Cardoso Leitão <
> > >>> jorgecarlei...@gmail.com> wrote:
> > >>>
> > >>>> +1 adding 32- and 64-bit decimals.
> > >>>>
> > >>>> +0 to release it without integration tests - both IPC and the C data
> > >>>> interface use a variable bitwidth to declare the appropriate size
> > >>>> for decimal types. Relaxing from {128,256} to {32,64,128,256} seems
> > >>>> a low risk from an integration perspective, as implementations
> > >>>> already need to read the bitwidth to select the appropriate physical
> > >>>> representation (if they support it).
> > >>>>
> > >>>> Best,
> > >>>> Jorge
> > >>>>
> > >>>> On Mon, Mar 7, 2022, 11:41 Antoine Pitrou <anto...@python.org> wrote:
> > >>>>
> > >>>>> On 03/03/2022 at 18:05, Micah Kornfield wrote:
> > >>>>>> I think it makes sense to add these. Typically when adding new
> > >>>>>> types, we've waited on the official vote until there are two
> > >>>>>> reference implementations demonstrating compatibility.
> > >>>>>
> > >>>>> You are right, I had forgotten about that. Though in this case, it
> > >>>>> might be argued we are just relaxing the constraints on an existing
> > >>>>> type.
> > >>>>>
> > >>>>> What do others think?
> > >>>>>
> > >>>>> Regards
> > >>>>>
> > >>>>> Antoine.
> > >>>>>
> > >>>>>> On Thu, Mar 3, 2022 at 6:55 AM Antoine Pitrou <anto...@python.org>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> Hello,
> > >>>>>>>
> > >>>>>>> Currently, the Arrow format specification restricts the bitwidth
> > >>>>>>> of decimal numbers to either 128 or 256 bits.
> > >>>>>>>
> > >>>>>>> However, there is interest in allowing other bitwidths, at least
> > >>>>>>> 32 and 64 bits for this proposal. A 64-bit (respectively 32-bit)
> > >>>>>>> decimal datatype would allow for precisions of up to 18 digits
> > >>>>>>> (respectively 9 digits), which are sufficient for some
> > >>>>>>> applications that are mainly looking for exact computations
> > >>>>>>> rather than sheer precision. Obviously, smaller datatypes are
> > >>>>>>> cheaper to store in memory and cheaper to run computations on.
> > >>>>>>>
> > >>>>>>> For example, the Spark documentation mentions that some decimal
> > >>>>>>> types may fit in a Java int (32 bits) or long (64 bits):
> > >>>>>>>
> > >>>>>>> https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/DecimalType.html
> > >>>>>>>
> > >>>>>>> ... and a draft PR had even been filed for initial support in the
> > >>>>>>> C++ implementation (https://github.com/apache/arrow/pull/8578).
> > >>>>>>>
> > >>>>>>> I am therefore proposing that we relax the wording in the Arrow
> > >>>>>>> format specification to also allow 32- and 64-bit decimal types.
> > >>>>>>>
> > >>>>>>> This is a preliminary discussion to gather opinions and potential
> > >>>>>>> counter-arguments against this proposal. If no strong
> > >>>>>>> counter-argument emerges, we will probably run a vote in a week
> > >>>>>>> or two.
> > >>>>>>>
> > >>>>>>> Best regards
> > >>>>>>>
> > >>>>>>> Antoine.
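As a footnote to the precision numbers quoted above (9 digits for a 32-bit decimal, 18 digits for a 64-bit one), here is a minimal Rust sketch that double-checks them against the range of the corresponding signed integers; `max_full_digits` is an ad-hoc helper, not an Arrow API:

```rust
// Back-of-the-envelope check: an N-bit signed integer can faithfully hold
// every decimal number of P digits iff 10^P - 1 <= iN::MAX.

fn max_full_digits(max_abs: u128) -> u32 {
    let mut digits = 0;
    let mut bound: u128 = 9; // largest (digits + 1)-digit number: 9, 99, 999, ...
    while bound <= max_abs {
        digits += 1;
        bound = bound * 10 + 9;
    }
    digits
}

fn main() {
    // 32-bit decimal: up to 9 digits of precision.
    assert_eq!(max_full_digits(i32::MAX as u128), 9);
    // 64-bit decimal: up to 18 digits (comfortably covers TPC-H's 12-digit columns).
    assert_eq!(max_full_digits(i64::MAX as u128), 18);
}
```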