Re: Flight/FlightSQL Optimization for Small Results?

2022-03-08 Thread Micah Kornfield
>
> The operation flow would be like this, or what would it look like?
> Client ---> GetFlightInfo (query/update operation in payload) ---> Server
> ---> Results (non-streamed)


This is roughly the flow I was imagining if the server chooses to send back
inlined data.
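
A minimal sketch of the client side of that flow, assuming each endpoint in the returned FlightInfo may optionally carry inlined batches next to its ticket. The `Endpoint`, `FlightInfo`, `inline_batches`, and `read_results` names below are hypothetical stand-ins for illustration, not the fields defined in the PR or the pyarrow API:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Endpoint:
    ticket: bytes
    # Hypothetical field: batches the server chose to inline, if any.
    inline_batches: Optional[List[str]] = None


@dataclass
class FlightInfo:
    endpoints: List[Endpoint]


def read_results(info: FlightInfo,
                 do_get: Callable[[bytes], List[str]]) -> List[str]:
    """Use inlined data when present, otherwise redeem the ticket."""
    results: List[str] = []
    for endpoint in info.endpoints:
        if endpoint.inline_batches is not None:
            # Small result: no second round trip is needed.
            results.extend(endpoint.inline_batches)
        else:
            # Large result: fall back to the usual DoGet call.
            results.extend(do_get(endpoint.ticket))
    return results
```

The ticket stays on every endpoint, so a client that ignores the inlined data can still call DoGet as before.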

-Micah

On Tue, Mar 8, 2022 at 11:27 AM Gavin Ray  wrote:

> Thank you for doing this, left a few questions on the GH issue
>
> I would adopt this proposal as soon as it makes it into nightlies
> (or possibly earlier if it's just a matter of regenerating the proto
> definitions)
>
> The operation flow would be like this, or what would it look like?
>
> Client ---> GetFlightInfo (query/update operation in payload) ---> Server
> ---> Results (non-streamed)
>
>
>
>
> On Tue, Mar 8, 2022 at 2:04 PM Micah Kornfield 
> wrote:
>
>> Some people have already left comments on
>> https://github.com/apache/arrow/pull/12571  More eyes on it would be
>> appreciated.  If there aren't more comments, I'll try to start
>> implementing
>> this feature in Flight next week, and hopefully have a vote after it is
>> supported in Java and C++/Python.
>>
>>
>> Thanks,
>> Micah
>>
>> On Fri, Mar 4, 2022 at 10:54 PM Micah Kornfield 
>> wrote:
>>
>> > I put together straw-man proposal in PR [1] for the Flight changes.
>> > Ultimately, it seemed based on the use-cases discussed inlining the
>> data on
>> > the Ticket made the most sense.  This might be overly complex (I'm not
>> sure
>> > how I feel about a enum indicating partial vs full results) but welcome
>> > feedback.  Once we get consensus on this proposal, I can add changes to
>> > Flight SQL and try to provide reference implementations.
>> >
>> > [1] https://github.com/apache/arrow/pull/12571
>> >
>> > On Tue, Mar 1, 2022 at 10:51 PM Micah Kornfield 
>> > wrote:
>> >
>> >> Would it make sense to make this part of DoGet since it
>> >>> still would be returning a record batch
>> >>
>> >> I would lean against this. I think in many cases the client doesn't
>> know
>> >> the size of the data that it expects.  Leaving the flexibility on the
>> >> server side to send back inlined data when it thinks it makes sense,
>> or a
>> >> bunch of tickets when there is in fact a lot of data seems like the
>> best
>> >> option here.
>> >>
>> >> For cases like previewing data, you usually just want to get a small
>> >>> amount
>> >>> of data quickly.
>> >>
>> >> This is interesting and might be an additional use case.  If we did
>> >> decide to extend FlightInfo we might also want a way of annotating
>> inlined
>> >> data with its corresponding ticket.  That way even for large results,
>> you
>> >> could still send back a small preview if desired.
>> >>
>> >> After considering it a little bit I think I'm sold that inlined data
>> >> should not replace a ticket.  So in my mind the open question is
>> whether
>> >> the client needs to actively opt-in to inlined data.  The scenarios I
>> could
>> >> come with where inlined data isn't useful are:
>> >> 1.  The client is an old client and isn't aware inline data might be
>> >> returned.  In this case the main cost is of extra data on the wire and
>> >> storing it as unknown fields [1].
>> >> 2.  The client is a new client but still doesn't want to get inline
>> data
>> >> (it might want to distribute all consumption to other processes).  Same
>> >> cost is paid as option 1.
>> >>
>> >> Are there other scenarios?  If servers choose reasonable limits on what
>> >> data to inline, the extra complexity of negotiating with the client in
>> this
>> >> case might not be worth the benefits.
>> >>
>> >> Cheers,
>> >> Micah
>> >>
>> >>
>> >> [1]
>> https://developers.google.com/protocol-buffers/docs/proto3#unknowns
>> >>
>> >> On Tue, Mar 1, 2022 at 10:01 PM Bryan Cutler 
>> wrote:
>> >>
>> >>> I think this would be a useful feature and be nice to have in Flight
>> >>> core.
>> >>> For cases like previewing data, you usually just want to get a small
>> >>> amount
>> >>> of data quickly. Would it make sense to make this part of DoGet since
>> it
>> >>> still would be returning a record batch? Perhaps a Ticket could be
>> made
>> >>> to
>> >>> have an optional FlightDescriptor that would serve as an all-in-one
>> shot?
>> >>>
>> >>> On Tue, Mar 1, 2022 at 8:44 AM David Li  wrote:
>> >>>
>> >>> > I agree with something along Antoine's proposal, though: maybe we
>> >>> should
>> >>> > be more structured with the flags (akin to what Micah mentioned with
>> >>> the
>> >>> > Feature enum).
>> >>> >
>> >>> > Also, the flag could be embedded into the Flight SQL messages
>> instead.
>> >>> (So
>> >>> > in effect, Flight would only add the capability to return data with
>> >>> > FlightInfo, and it's up to applications, like Flight SQL, to decide
>> how
>> >>> > they want to take advantage of that.)
>> >>> >
>> >>> > I think having a completely separate method and return type and
>> having
>> >>> to
>> >>> > poll for it beforehand somewhat defeats the purpose of having
>> it/would
>> >>> be
>> >>> > much 

Re: [Discuss][Format] Add 32-bit and 64-bit Decimals

2022-03-08 Thread Jorge Cardoso Leitão
Agreed.

Also, I would like to revise my previous comment about the small risk.
While prototyping this I did hit some bumps. They primarily came from two
sources:

* I was unable to find arrow/json files in the arrow-testing generated
files with a non-default decimal bitwidth (I think we only have the
on-the-fly generated file in archery)
* the FFI interface has a default decimal of 128 (`d:{precision},{scale}`)
and implementations may not support the 256 case (e.g. Rust has no native
i256). For these implementations, this could be the first non-default
decimal bitwidth they support.
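
To illustrate the second point, here is a rough sketch of how a consumer might read the decimal format string, assuming the C data interface forms `d:precision,scale` (128-bit when no bitwidth is given) and `d:precision,scale,bitwidth`. The function name `parse_decimal_format` is made up for this example:

```python
from typing import Tuple


def parse_decimal_format(fmt: str) -> Tuple[int, int, int]:
    """Return (precision, scale, bit_width) for a decimal format string."""
    if not fmt.startswith("d:"):
        raise ValueError(f"not a decimal format string: {fmt!r}")
    parts = fmt[2:].split(",")
    if len(parts) not in (2, 3):
        raise ValueError(f"malformed decimal format string: {fmt!r}")
    precision, scale = int(parts[0]), int(parts[1])
    # The bitwidth is optional; 128 is the default when it is omitted.
    bit_width = int(parts[2]) if len(parts) == 3 else 128
    return precision, scale, bit_width
```

An implementation that has only ever handled the two-field form never exercises the third field, which is where a 32- or 64-bit decimal would first show up.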

So, maybe we follow the standard procedure?

Best,
Jorge



On Tue, Mar 8, 2022 at 9:22 PM Micah Kornfield 
wrote:

> >
> > I’d also like to chime in in favor of 32- and 64-bit decimals because
> > it’ll help achieve better performance on TPC-H (and maybe other
> > benchmarks). The decimal columns need only 12 digits of precision, for
> > which a 64-bit decimal is sufficient. It’s currently wasteful to use a
> > 128-bit decimal. You can technically use a float too, but I expect 64-bit
> > decimal to be faster.
>
>
> We should be careful here.  If this assumes loading from Parquet or other
> file formats currently in the library, arbitrarily changing the type to
> load the minimum data-length possible could break users, this should
> probably be a configuration option.  This also reminds me I think there is
> some technical debt with decimals and parquet.
>
> [1] https://issues.apache.org/jira/browse/ARROW-12022
>
> On Tue, Mar 8, 2022 at 11:05 AM Sasha Krassovsky <
> krassovskysa...@gmail.com>
> wrote:
>
> > I’d also like to chime in in favor of 32- and 64-bit decimals because
> > it’ll help achieve better performance on TPC-H (and maybe other
> > benchmarks). The decimal columns need only 12 digits of precision, for
> > which a 64-bit decimal is sufficient. It’s currently wasteful to use a
> > 128-bit decimal. You can technically use a float too, but I expect 64-bit
> > decimal to be faster.
> >
> > Sasha Krassovsky
> >
> > > On March 8, 2022, at 09:01, Micah Kornfield wrote:
> > >
> > > 
> > >>
> > >>
> > >> Do we want to keep the historical "C++ and Java" requirement or
> > >> do we want to make it a more flexible "two independent official
> > >> implementations", which could be for example C++ and Rust, Rust and
> > >> Java, etc.
> > >
> > >
> > > I think flexibility here is a good idea, I'd like to hear other
> opinions.
> > >
> > > For this particular case if there aren't volunteers to help out in
> > another
> > > implementation I'm willing to help with Java (I don't have bandwidth to
> > > do both C++ and Java).
> > >
> > > Cheers,
> > > -Micah
> > >
> > >> On Tue, Mar 8, 2022 at 8:23 AM Antoine Pitrou 
> > wrote:
> > >>
> > >>
> > >> On 7 March 2022 at 20:26, Micah Kornfield wrote:
> > 
> >  Relaxing from {128,256} to {32,64,128,256} seems a low risk
> >  from an integration perspective, as implementations already need to
> > read
> >  the bitwidth to select the appropriate physical representation (if
> > they
> >  support it).
> > >>>
> > >>> I think there are two reasons for having implementations first.
> > >>> 1.  Lower risk bugs in implementation/spec.
> > >>> 2.  A mechanism to ensure that there is some boot-strapped coverage
> in
> > >>> commonly used reference implementations.
> > >>
> > >> That sounds reasonable.
> > >>
> > >> Another question that came to my mind is: traditionally, we've
> mandated
> > >> implementations in the two reference Arrow implementations (C++ and
> > >> Java).  However, our implementation landscape is now much richer than
> it
> > >> used to be (for example, there is a tremendous activity on the Rust
> > >> side).  Do we want to keep the historical "C++ and Java" requirement
> or
> > >> do we want to make it a more flexible "two independent official
> > >> implementations", which could be for example C++ and Rust, Rust and
> > >> Java, etc.
> > >>
> > >> (by "independent" I mean that one should not be based on the other,
> for
> > >> example it should not be "C++ and Python" :-))
> > >>
> > >> Regards
> > >>
> > >> Antoine.
> > >>
> > >>
> > >>>
> > >>> I agree 1, is fairly low-risk.
> > >>>
> > >>> On Mon, Mar 7, 2022 at 11:11 AM Jorge Cardoso Leitão <
> > >>> jorgecarlei...@gmail.com> wrote:
> > >>>
> >  +1 adding 32 and 64 bit decimals.
> > 
> >  +0 to release it without integration tests - both IPC and the C data
> >  interface use a variable bit width to declare the appropriate size
> for
> >  decimal types. Relaxing from {128,256} to {32,64,128,256} seems a
> low
> > >> risk
> >  from an integration perspective, as implementations already need to
> > read
> >  the bitwidth to select the appropriate physical representation (if
> > they
> >  support it).
> > 
> >  Best,
> >  Jorge
> > 
> > 
> > 
> > 
> >  On Mon, Mar 7, 2022, 11:41 Antoine Pitrou 
> wrote:
> > 
> > >
> > > Le 03/03/2022 à 

Re: [DISCUSS][Rust] Biweekly sync call for arrow/datafusion again?

2022-03-08 Thread Andrew Lamb
I am not sure if everyone saw it in the agenda[1], but we plan to have a
meeting tomorrow. I plan to record it for anyone who cannot make this
time.

15:00 UTC Wednesday March 9, 2022
Meeting Location: (in agenda)
Matthew Turner: focused on JIT and row representation, next Wednesday,
March 9th
@yijie: JIT overview

[1]
https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit#

On Thu, Mar 3, 2022 at 12:50 AM Benson Muite 
wrote:

> Interested in learning more about this. Can work through the code and
> discuss on 17 March either 4:00 or 16:00 UTC.
>
> Benson
>
> On 3/3/22 12:03 AM, Andrew Lamb wrote:
> > I noticed that Matthew Turner added a note to the agenda[1] for a walk
> > through of the JIT code. I would be interested in this as well -- would
> > anyone plan to be on the call and discuss it?
> >
> > I don't think I have time to prepare that content prior
> >
> > Andrew
> >
> > [1]
> >
> https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit#
> >
>


Re: [Discuss][Format] Add 32-bit and 64-bit Decimals

2022-03-08 Thread Micah Kornfield
>
> I’d also like to chime in in favor of 32- and 64-bit decimals because
> it’ll help achieve better performance on TPC-H (and maybe other
> benchmarks). The decimal columns need only 12 digits of precision, for
> which a 64-bit decimal is sufficient. It’s currently wasteful to use a
> 128-bit decimal. You can technically use a float too, but I expect 64-bit
> decimal to be faster.


We should be careful here.  If this assumes loading from Parquet or other
file formats currently in the library, arbitrarily changing the type to
load the minimum data-length possible could break users; this should
probably be a configuration option.  This also reminds me that I think there
is some technical debt with decimals and Parquet [1].

[1] https://issues.apache.org/jira/browse/ARROW-12022
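
As a rough sketch of what such a configuration option could look like (the `DecimalReadOptions` class and its `narrow_decimals` flag are hypothetical, not an existing Arrow or Parquet reader setting), the default keeps today's 128/256-bit behaviour and narrowing is opt-in:

```python
from dataclasses import dataclass

# Maximum decimal digits representable per bitwidth.
_MAX_PRECISION = {32: 9, 64: 18, 128: 38, 256: 76}


@dataclass(frozen=True)
class DecimalReadOptions:
    # Hypothetical flag: off by default so existing users see no type change.
    narrow_decimals: bool = False


def decimal_bit_width(precision: int,
                      options: DecimalReadOptions = DecimalReadOptions()) -> int:
    """Pick the bitwidth used when loading a decimal column."""
    if not 1 <= precision <= 76:
        raise ValueError(f"unsupported precision: {precision}")
    candidates = (32, 64, 128, 256) if options.narrow_decimals else (128, 256)
    return next(w for w in candidates if precision <= _MAX_PRECISION[w])
```

With the default options a 12-digit column still loads as a 128-bit decimal; only a caller that sets the flag gets the 64-bit type.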

On Tue, Mar 8, 2022 at 11:05 AM Sasha Krassovsky 
wrote:

> I’d also like to chime in in favor of 32- and 64-bit decimals because
> it’ll help achieve better performance on TPC-H (and maybe other
> benchmarks). The decimal columns need only 12 digits of precision, for
> which a 64-bit decimal is sufficient. It’s currently wasteful to use a
> 128-bit decimal. You can technically use a float too, but I expect 64-bit
> decimal to be faster.
>
> Sasha Krassovsky
>
> > On March 8, 2022, at 09:01, Micah Kornfield wrote:
> >
> > 
> >>
> >>
> >> Do we want to keep the historical "C++ and Java" requirement or
> >> do we want to make it a more flexible "two independent official
> >> implementations", which could be for example C++ and Rust, Rust and
> >> Java, etc.
> >
> >
> > I think flexibility here is a good idea, I'd like to hear other opinions.
> >
> > For this particular case if there aren't volunteers to help out in
> another
> > implementation I'm willing to help with Java (I don't have bandwidth to
> > do both C++ and Java).
> >
> > Cheers,
> > -Micah
> >
> >> On Tue, Mar 8, 2022 at 8:23 AM Antoine Pitrou 
> wrote:
> >>
> >>
> >> On 7 March 2022 at 20:26, Micah Kornfield wrote:
> 
>  Relaxing from {128,256} to {32,64,128,256} seems a low risk
>  from an integration perspective, as implementations already need to
> read
>  the bitwidth to select the appropriate physical representation (if
> they
>  support it).
> >>>
> >>> I think there are two reasons for having implementations first.
> >>> 1.  Lower risk bugs in implementation/spec.
> >>> 2.  A mechanism to ensure that there is some boot-strapped coverage in
> >>> commonly used reference implementations.
> >>
> >> That sounds reasonable.
> >>
> >> Another question that came to my mind is: traditionally, we've mandated
> >> implementations in the two reference Arrow implementations (C++ and
> >> Java).  However, our implementation landscape is now much richer than it
> >> used to be (for example, there is a tremendous activity on the Rust
> >> side).  Do we want to keep the historical "C++ and Java" requirement or
> >> do we want to make it a more flexible "two independent official
> >> implementations", which could be for example C++ and Rust, Rust and
> >> Java, etc.
> >>
> >> (by "independent" I mean that one should not be based on the other, for
> >> example it should not be "C++ and Python" :-))
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >>>
> >>> I agree 1, is fairly low-risk.
> >>>
> >>> On Mon, Mar 7, 2022 at 11:11 AM Jorge Cardoso Leitão <
> >>> jorgecarlei...@gmail.com> wrote:
> >>>
>  +1 adding 32 and 64 bit decimals.
> 
>  +0 to release it without integration tests - both IPC and the C data
>  interface use a variable bit width to declare the appropriate size for
>  decimal types. Relaxing from {128,256} to {32,64,128,256} seems a low
> >> risk
>  from an integration perspective, as implementations already need to
> read
>  the bitwidth to select the appropriate physical representation (if
> they
>  support it).
> 
>  Best,
>  Jorge
> 
> 
> 
> 
>  On Mon, Mar 7, 2022, 11:41 Antoine Pitrou  wrote:
> 
> >
> > On 3 March 2022 at 18:05, Micah Kornfield wrote:
> >> I think this makes sense to add these.  Typically when adding new
>  types,
> >> we've waited  on the official vote until there are two reference
> >> implementations demonstrating compatibility.
> >
> > You are right, I had forgotten about that.  Though in this case, it
> > might be argued we are just relaxing the constraints on an existing
> >> type.
> >
> > What do others think?
> >
> > Regards
> >
> > Antoine.
> >
> >
> >>
> >> On Thu, Mar 3, 2022 at 6:55 AM Antoine Pitrou 
> > wrote:
> >>
> >>>
> >>> Hello,
> >>>
> >>> Currently, the Arrow format specification restricts the bitwidth of
> >>> decimal numbers to either 128 or 256 bits.
> >>>
> >>> However, there is interest in allowing other bitwidths, at least 32
>  and
> >>> 64 bits for this proposal. A 64-bit (respectively 32-bit) decimal
> >>> 

Re: Flight/FlightSQL Optimization for Small Results?

2022-03-08 Thread Gavin Ray
Thank you for doing this. I left a few questions on the GH issue.

I would adopt this proposal as soon as it makes it into nightlies
(or possibly earlier if it's just a matter of regenerating the proto
definitions)

Would the operation flow look like this? If not, what would it look like?

Client ---> GetFlightInfo (query/update operation in payload) ---> Server
---> Results (non-streamed)




On Tue, Mar 8, 2022 at 2:04 PM Micah Kornfield 
wrote:

> Some people have already left comments on
> https://github.com/apache/arrow/pull/12571  More eyes on it would be
> appreciated.  If there aren't more comments, I'll try to start implementing
> this feature in Flight next week, and hopefully have a vote after it is
> supported in Java and C++/Python.
>
>
> Thanks,
> Micah
>
> On Fri, Mar 4, 2022 at 10:54 PM Micah Kornfield 
> wrote:
>
> > I put together straw-man proposal in PR [1] for the Flight changes.
> > Ultimately, it seemed based on the use-cases discussed inlining the data
> on
> > the Ticket made the most sense.  This might be overly complex (I'm not
> sure
> > how I feel about a enum indicating partial vs full results) but welcome
> > feedback.  Once we get consensus on this proposal, I can add changes to
> > Flight SQL and try to provide reference implementations.
> >
> > [1] https://github.com/apache/arrow/pull/12571
> >
> > On Tue, Mar 1, 2022 at 10:51 PM Micah Kornfield 
> > wrote:
> >
> >> Would it make sense to make this part of DoGet since it
> >>> still would be returning a record batch
> >>
> >> I would lean against this. I think in many cases the client doesn't know
> >> the size of the data that it expects.  Leaving the flexibility on the
> >> server side to send back inlined data when it thinks it makes sense, or
> a
> >> bunch of tickets when there is in fact a lot of data seems like the best
> >> option here.
> >>
> >> For cases like previewing data, you usually just want to get a small
> >>> amount
> >>> of data quickly.
> >>
> >> This is interesting and might be an additional use case.  If we did
> >> decide to extend FlightInfo we might also want a way of annotating
> inlined
> >> data with its corresponding ticket.  That way even for large results,
> you
> >> could still send back a small preview if desired.
> >>
> >> After considering it a little bit I think I'm sold that inlined data
> >> should not replace a ticket.  So in my mind the open question is whether
> >> the client needs to actively opt-in to inlined data.  The scenarios I
> could
> >> come with where inlined data isn't useful are:
> >> 1.  The client is an old client and isn't aware inline data might be
> >> returned.  In this case the main cost is of extra data on the wire and
> >> storing it as unknown fields [1].
> >> 2.  The client is a new client but still doesn't want to get inline data
> >> (it might want to distribute all consumption to other processes).  Same
> >> cost is paid as option 1.
> >>
> >> Are there other scenarios?  If servers choose reasonable limits on what
> >> data to inline, the extra complexity of negotiating with the client in
> this
> >> case might not be worth the benefits.
> >>
> >> Cheers,
> >> Micah
> >>
> >>
> >> [1] https://developers.google.com/protocol-buffers/docs/proto3#unknowns
> >>
> >> On Tue, Mar 1, 2022 at 10:01 PM Bryan Cutler  wrote:
> >>
> >>> I think this would be a useful feature and be nice to have in Flight
> >>> core.
> >>> For cases like previewing data, you usually just want to get a small
> >>> amount
> >>> of data quickly. Would it make sense to make this part of DoGet since
> it
> >>> still would be returning a record batch? Perhaps a Ticket could be made
> >>> to
> >>> have an optional FlightDescriptor that would serve as an all-in-one
> shot?
> >>>
> >>> On Tue, Mar 1, 2022 at 8:44 AM David Li  wrote:
> >>>
> >>> > I agree with something along Antoine's proposal, though: maybe we
> >>> should
> >>> > be more structured with the flags (akin to what Micah mentioned with
> >>> the
> >>> > Feature enum).
> >>> >
> >>> > Also, the flag could be embedded into the Flight SQL messages
> instead.
> >>> (So
> >>> > in effect, Flight would only add the capability to return data with
> >>> > FlightInfo, and it's up to applications, like Flight SQL, to decide
> how
> >>> > they want to take advantage of that.)
> >>> >
> >>> > I think having a completely separate method and return type and
> having
> >>> to
> >>> > poll for it beforehand somewhat defeats the purpose of having
> it/would
> >>> be
> >>> > much harder of a transition.
> >>> >
> >>> > Also: it should be `repeated FlightInfo inline_data` right? In case
> we
> >>> > also need dictionary batches?
> >>> >
> >>> > On Tue, Mar 1, 2022, at 11:39, Antoine Pitrou wrote:
> >>> > > Can we just add the following field to the FlightDescriptor
> message:
> >>> > >
> >>> > >   bool accept_inline_data = 4;
> >>> > >
> >>> > > and this one to the FlightInfo message:
> >>> > >
> >>> > >   FlightData inline_data = 100;
> >>> > >
> >>> > > 

Re: [Discuss][Format] Add 32-bit and 64-bit Decimals

2022-03-08 Thread Sasha Krassovsky
I’d also like to chime in in favor of 32- and 64-bit decimals because it’ll 
help achieve better performance on TPC-H (and maybe other benchmarks). The 
decimal columns need only 12 digits of precision, for which a 64-bit decimal is 
sufficient. It’s currently wasteful to use a 128-bit decimal. You can 
technically use a float too, but I expect 64-bit decimal to be faster. 
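
To make the arithmetic concrete, here is a small sketch (the helper names `max_precision` and `min_bit_width` are made up for illustration). It assumes a signed two's-complement integer backs each decimal, so a width of `b` bits holds every value with `p` digits as long as `10**p - 1 <= 2**(b - 1) - 1`:

```python
def max_precision(bit_width: int) -> int:
    """Largest digit count p such that every p-digit integer fits."""
    largest = 2 ** (bit_width - 1) - 1
    # `largest` is never of the form 99...9, so one digit must be given up.
    return len(str(largest)) - 1


def min_bit_width(precision: int, widths=(32, 64, 128, 256)) -> int:
    """Smallest candidate bitwidth that can hold `precision` digits."""
    for width in widths:
        if precision <= max_precision(width):
            return width
    raise ValueError(f"precision {precision} exceeds all candidate widths")
```

For the 12-digit TPC-H columns this picks 64 bits: 32 bits top out at 9 digits, while 64 bits cover up to 18.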

Sasha Krassovsky

> On March 8, 2022, at 09:01, Micah Kornfield wrote:
> 
> 
>> 
>> 
>> Do we want to keep the historical "C++ and Java" requirement or
>> do we want to make it a more flexible "two independent official
>> implementations", which could be for example C++ and Rust, Rust and
>> Java, etc.
> 
> 
> I think flexibility here is a good idea, I'd like to hear other opinions.
> 
> For this particular case if there aren't volunteers to help out in another
> implementation I'm willing to help with Java (I don't have bandwidth to
> do both C++ and Java).
> 
> Cheers,
> -Micah
> 
>> On Tue, Mar 8, 2022 at 8:23 AM Antoine Pitrou  wrote:
>> 
>> 
>> On 7 March 2022 at 20:26, Micah Kornfield wrote:
 
 Relaxing from {128,256} to {32,64,128,256} seems a low risk
 from an integration perspective, as implementations already need to read
 the bitwidth to select the appropriate physical representation (if they
 support it).
>>> 
>>> I think there are two reasons for having implementations first.
>>> 1.  Lower risk bugs in implementation/spec.
>>> 2.  A mechanism to ensure that there is some boot-strapped coverage in
>>> commonly used reference implementations.
>> 
>> That sounds reasonable.
>> 
>> Another question that came to my mind is: traditionally, we've mandated
>> implementations in the two reference Arrow implementations (C++ and
>> Java).  However, our implementation landscape is now much richer than it
>> used to be (for example, there is a tremendous activity on the Rust
>> side).  Do we want to keep the historical "C++ and Java" requirement or
>> do we want to make it a more flexible "two independent official
>> implementations", which could be for example C++ and Rust, Rust and
>> Java, etc.
>> 
>> (by "independent" I mean that one should not be based on the other, for
>> example it should not be "C++ and Python" :-))
>> 
>> Regards
>> 
>> Antoine.
>> 
>> 
>>> 
>>> I agree 1, is fairly low-risk.
>>> 
>>> On Mon, Mar 7, 2022 at 11:11 AM Jorge Cardoso Leitão <
>>> jorgecarlei...@gmail.com> wrote:
>>> 
 +1 adding 32 and 64 bit decimals.
 
 +0 to release it without integration tests - both IPC and the C data
 interface use a variable bit width to declare the appropriate size for
 decimal types. Relaxing from {128,256} to {32,64,128,256} seems a low
>> risk
 from an integration perspective, as implementations already need to read
 the bitwidth to select the appropriate physical representation (if they
 support it).
 
 Best,
 Jorge
 
 
 
 
 On Mon, Mar 7, 2022, 11:41 Antoine Pitrou  wrote:
 
> 
> On 3 March 2022 at 18:05, Micah Kornfield wrote:
>> I think this makes sense to add these.  Typically when adding new
 types,
>> we've waited  on the official vote until there are two reference
>> implementations demonstrating compatibility.
> 
> You are right, I had forgotten about that.  Though in this case, it
> might be argued we are just relaxing the constraints on an existing
>> type.
> 
> What do others think?
> 
> Regards
> 
> Antoine.
> 
> 
>> 
>> On Thu, Mar 3, 2022 at 6:55 AM Antoine Pitrou 
> wrote:
>> 
>>> 
>>> Hello,
>>> 
>>> Currently, the Arrow format specification restricts the bitwidth of
>>> decimal numbers to either 128 or 256 bits.
>>> 
>>> However, there is interest in allowing other bitwidths, at least 32
 and
>>> 64 bits for this proposal. A 64-bit (respectively 32-bit) decimal
>>> datatype would allow for precisions of up to 18 digits (respectively
>> 9
>>> digits), which are sufficient for some applications which are mainly
>>> looking for exact computations rather than sheer precision.
>> Obviously,
>>> smaller datatypes are cheaper to store in memory and cheaper to run
>>> computations on.
>>> 
>>> For example, the Spark documentation mentions that some decimal types
>>> may fit in a Java int (32 bits) or long (64 bits):
>>> 
>>> 
> 
 
>> https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/DecimalType.html
>>> 
>>> ... and a draft PR had even been filed for initial support in the C++
>>> implementation (https://github.com/apache/arrow/pull/8578).
>>> 
>>> I am therefore proposing that we relax the wording in the Arrow
>> format
>>> specification to also allow 32- and 64-bit decimal types.
>>> 
>>> This is a preliminary discussion to gather opinions and potential
>>> counter-arguments against this 

Re: Flight/FlightSQL Optimization for Small Results?

2022-03-08 Thread Micah Kornfield
Some people have already left comments on
https://github.com/apache/arrow/pull/12571.  More eyes on it would be
appreciated.  If there aren't more comments, I'll try to start implementing
this feature in Flight next week, and hopefully have a vote after it is
supported in Java and C++/Python.


Thanks,
Micah

On Fri, Mar 4, 2022 at 10:54 PM Micah Kornfield 
wrote:

> I put together straw-man proposal in PR [1] for the Flight changes.
> Ultimately, it seemed based on the use-cases discussed inlining the data on
> the Ticket made the most sense.  This might be overly complex (I'm not sure
> how I feel about a enum indicating partial vs full results) but welcome
> feedback.  Once we get consensus on this proposal, I can add changes to
> Flight SQL and try to provide reference implementations.
>
> [1] https://github.com/apache/arrow/pull/12571
>
> On Tue, Mar 1, 2022 at 10:51 PM Micah Kornfield 
> wrote:
>
>> Would it make sense to make this part of DoGet since it
>>> still would be returning a record batch
>>
>> I would lean against this. I think in many cases the client doesn't know
>> the size of the data that it expects.  Leaving the flexibility on the
>> server side to send back inlined data when it thinks it makes sense, or a
>> bunch of tickets when there is in fact a lot of data seems like the best
>> option here.
>>
>> For cases like previewing data, you usually just want to get a small
>>> amount
>>> of data quickly.
>>
>> This is interesting and might be an additional use case.  If we did
>> decide to extend FlightInfo we might also want a way of annotating inlined
>> data with its corresponding ticket.  That way even for large results, you
>> could still send back a small preview if desired.
>>
>> After considering it a little bit I think I'm sold that inlined data
>> should not replace a ticket.  So in my mind the open question is whether
>> the client needs to actively opt-in to inlined data.  The scenarios I could
>> come with where inlined data isn't useful are:
>> 1.  The client is an old client and isn't aware inline data might be
>> returned.  In this case the main cost is of extra data on the wire and
>> storing it as unknown fields [1].
>> 2.  The client is a new client but still doesn't want to get inline data
>> (it might want to distribute all consumption to other processes).  Same
>> cost is paid as option 1.
>>
>> Are there other scenarios?  If servers choose reasonable limits on what
>> data to inline, the extra complexity of negotiating with the client in this
>> case might not be worth the benefits.
>>
>> Cheers,
>> Micah
>>
>>
>> [1] https://developers.google.com/protocol-buffers/docs/proto3#unknowns
>>
>> On Tue, Mar 1, 2022 at 10:01 PM Bryan Cutler  wrote:
>>
>>> I think this would be a useful feature and be nice to have in Flight
>>> core.
>>> For cases like previewing data, you usually just want to get a small
>>> amount
>>> of data quickly. Would it make sense to make this part of DoGet since it
>>> still would be returning a record batch? Perhaps a Ticket could be made
>>> to
>>> have an optional FlightDescriptor that would serve as an all-in-one shot?
>>>
>>> On Tue, Mar 1, 2022 at 8:44 AM David Li  wrote:
>>>
>>> > I agree with something along Antoine's proposal, though: maybe we
>>> should
>>> > be more structured with the flags (akin to what Micah mentioned with
>>> the
>>> > Feature enum).
>>> >
>>> > Also, the flag could be embedded into the Flight SQL messages instead.
>>> (So
>>> > in effect, Flight would only add the capability to return data with
>>> > FlightInfo, and it's up to applications, like Flight SQL, to decide how
>>> > they want to take advantage of that.)
>>> >
>>> > I think having a completely separate method and return type and having
>>> to
>>> > poll for it beforehand somewhat defeats the purpose of having it/would
>>> be
>>> > much harder of a transition.
>>> >
>>> > Also: it should be `repeated FlightInfo inline_data` right? In case we
>>> > also need dictionary batches?
>>> >
>>> > On Tue, Mar 1, 2022, at 11:39, Antoine Pitrou wrote:
>>> > > Can we just add the following field to the FlightDescriptor message:
>>> > >
>>> > >   bool accept_inline_data = 4;
>>> > >
>>> > > and this one to the FlightInfo message:
>>> > >
>>> > >   FlightData inline_data = 100;
>>> > >
>>> > > Then new clients can `accept_inline_data` to true (the default being
>>> > > false if omitted) to signal servers that they can put the data if
>>> > > `inline_data` if deemed small enough.
>>> > >
>>> > > (the `accept_inline_data` field could also be used to the Criteria
>>> > > message)
>>> > >
>>> > >
>>> > > Alternatively, if the FlightDescriptor expansion looks a bit dirty
>>> > > (FlightDescriptor being used in other contexts where
>>> > > `accept_inline_data` makes no sense), we can instead define a new
>>> > > method:
>>> > >
>>> > >   rpc GetFlightInfoEx(GetFlightInfoRequest) returns (FlightInfo) {}
>>> > >
>>> > > with:
>>> > >
>>> > > message GetFlightInfoRequest {
>>> 

Re: [Discuss][Format] Add 32-bit and 64-bit Decimals

2022-03-08 Thread Micah Kornfield
>
> Do we want to keep the historical "C++ and Java" requirement or
> do we want to make it a more flexible "two independent official
> implementations", which could be for example C++ and Rust, Rust and
> Java, etc.


I think flexibility here is a good idea; I'd like to hear other opinions.

For this particular case, if there aren't volunteers to help out in another
implementation, I'm willing to help with Java (I don't have bandwidth to
do both C++ and Java).

Cheers,
-Micah

On Tue, Mar 8, 2022 at 8:23 AM Antoine Pitrou  wrote:

>
> On 7 March 2022 at 20:26, Micah Kornfield wrote:
> >>
> >> Relaxing from {128,256} to {32,64,128,256} seems a low risk
> >> from an integration perspective, as implementations already need to read
> >> the bitwidth to select the appropriate physical representation (if they
> >> support it).
> >
> > I think there are two reasons for having implementations first.
> > 1.  Lower risk of bugs in the implementation/spec.
> > 2.  A mechanism to ensure that there is some boot-strapped coverage in
> > commonly used reference implementations.
>
> That sounds reasonable.
>
> Another question that came to my mind is: traditionally, we've mandated
> implementations in the two reference Arrow implementations (C++ and
> Java).  However, our implementation landscape is now much richer than it
> used to be (for example, there is a tremendous activity on the Rust
> side).  Do we want to keep the historical "C++ and Java" requirement or
> do we want to make it a more flexible "two independent official
> implementations", which could be for example C++ and Rust, Rust and
> Java, etc.
>
> (by "independent" I mean that one should not be based on the other, for
> example it should not be "C++ and Python" :-))
>
> Regards
>
> Antoine.
>
>
> >
> > I agree 1 is fairly low-risk.
> >
> > On Mon, Mar 7, 2022 at 11:11 AM Jorge Cardoso Leitão <
> > jorgecarlei...@gmail.com> wrote:
> >
> >> +1 adding 32 and 64 bit decimals.
> >>
> >> +0 to release it without integration tests - both IPC and the C data
> >> interface use a variable bit width to declare the appropriate size for
> >> decimal types. Relaxing from {128,256} to {32,64,128,256} seems a low risk
> >> from an integration perspective, as implementations already need to read
> >> the bitwidth to select the appropriate physical representation (if they
> >> support it).
> >>
> >> Best,
> >> Jorge
> >>
> >>
> >>
> >>
> >> On Mon, Mar 7, 2022, 11:41 Antoine Pitrou  wrote:
> >>
> >>>
> >>> On 03/03/2022 at 18:05, Micah Kornfield wrote:
> >>>> I think this makes sense to add these.  Typically when adding new types,
> >>>> we've waited on the official vote until there are two reference
> >>>> implementations demonstrating compatibility.
> >>>
> >>> You are right, I had forgotten about that.  Though in this case, it
> >>> might be argued we are just relaxing the constraints on an existing type.
> >>>
> >>> What do others think?
> >>>
> >>> Regards
> >>>
> >>> Antoine.
> >>>
> >>>
> >>>>
> >>>> On Thu, Mar 3, 2022 at 6:55 AM Antoine Pitrou wrote:
> >>>>
> >>>>> Hello,
> >>>>>
> >>>>> Currently, the Arrow format specification restricts the bitwidth of
> >>>>> decimal numbers to either 128 or 256 bits.
> >>>>>
> >>>>> However, there is interest in allowing other bitwidths, at least 32 and
> >>>>> 64 bits for this proposal. A 64-bit (respectively 32-bit) decimal
> >>>>> datatype would allow for precisions of up to 18 digits (respectively 9
> >>>>> digits), which are sufficient for some applications which are mainly
> >>>>> looking for exact computations rather than sheer precision. Obviously,
> >>>>> smaller datatypes are cheaper to store in memory and cheaper to run
> >>>>> computations on.
> >>>>>
> >>>>> For example, the Spark documentation mentions that some decimal types
> >>>>> may fit in a Java int (32 bits) or long (64 bits):
> >>>>>
> >>>>> https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/DecimalType.html
> >>>>>
> >>>>> ... and a draft PR had even been filed for initial support in the C++
> >>>>> implementation (https://github.com/apache/arrow/pull/8578).
> >>>>>
> >>>>> I am therefore proposing that we relax the wording in the Arrow format
> >>>>> specification to also allow 32- and 64-bit decimal types.
> >>>>>
> >>>>> This is a preliminary discussion to gather opinions and potential
> >>>>> counter-arguments against this proposal. If no strong counter-argument
> >>>>> emerges, we will probably run a vote in a week or two.
> >>>>>
> >>>>> Best regards
> >>>>>
> >>>>> Antoine.
>


[RESULT][VOTE][RUST] Release Apache Arrow Rust 10.0.0 RC1

2022-03-08 Thread Andrew Lamb
With 8 +1 (3 binding) the release is approved! Thank you to all who
verified it.

The release is available here:
  https://dist.apache.org/repos/dist/release/arrow/arrow-rs-10.0.0

It has also been uploaded to crates.io:
https://crates.io/crates/arrow/10.0.0
https://crates.io/crates/arrow-flight/10.0.0
https://crates.io/crates/parquet/10.0.0
https://crates.io/crates/parquet-derive/10.0.0
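
For anyone checking a downloaded tarball by hand rather than with the verification script mentioned in the vote email, a minimal sketch of the checksum step might look like the following. It assumes a SHA-256 checksum file in the common `<hex digest>  <filename>` layout; that layout, and the helper names `sha256_hex`, `expected_digest` and `checksum_matches`, are assumptions for illustration, not something stated in this thread. Signature verification is a separate step and is not covered here.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def expected_digest(checksum_line: str) -> str:
    """Extract the digest from a '<hex digest>  <filename>' line (assumed layout)."""
    return checksum_line.split()[0].lower()


def checksum_matches(data: bytes, checksum_line: str) -> bool:
    """True if the data hashes to the digest recorded in the checksum line."""
    return sha256_hex(data) == expected_digest(checksum_line)
```

In practice `data` would be the bytes of the downloaded tarball and `checksum_line` the contents of the accompanying checksum file.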

On Mon, Mar 7, 2022 at 11:59 PM Yijie Shen 
wrote:

> +1 (non-binding) verified on Windows Subsystem for Linux. Thanks, Andrew!
>
> On Tue, Mar 8, 2022 at 10:43 AM QP Hou  wrote:
>
> > +1 (binding). Thanks Andrew.
> >
> > On Mon, Mar 7, 2022 at 9:17 AM Chao Sun  wrote:
> > >
> > > +1 (non-binding) verified on Mac. Thanks Andrew!
> > >
> > > On Mon, Mar 7, 2022 at 7:47 AM Matthew Turner
> > >  wrote:
> > > >
> > > > +1 (non-binding) after running release verification script on M1 Mac.
> > > >
> > > > Thanks, Andrew.
> > > >
> > > > From: Andy Grove 
> > > > Date: Monday, March 7, 2022 at 10:00 AM
> > > > To: dev 
> > > > Subject: Re: [VOTE][RUST] Release Apache Arrow Rust 10.0.0 RC1
> > > > +1 (binding)
> > > >
> > > > Verified on Ubuntu 20.04.3 LTS
> > > >
> > > > On Mon, Mar 7, 2022 at 6:52 AM Kun Liu  wrote:
> > > >
> > > > > > I have tested it on my Mac and got the "Release candidate looks good!"
> > > > > > message.
> > > > > > The unit tests passed on my Mac.
> > > > >
> > > > > +1 non-binding.
> > > > >
> > > > > Thanks,
> > > > > Kun
> > > > >
> > > > >
> > > > > > On Sat, Mar 5, 2022 at 10:00 PM Wang Xudong  wrote:
> > > > >
> > > > > > +1 non-binding
> > > > > >
> > > > > > Test on macOS, "Release candidate looks good!"
> > > > > > Thank you alamb!
> > > > > >
> > > > > > ---
> > > > > > xudong
> > > > > >
> > > > > >
> > > > > >
> > > > > > > On Sat, Mar 5, 2022 at 8:06 PM Andrew Lamb  wrote:
> > > > > >
> > > > > > > Salutations Arrow Rust Community,
> > > > > > >
> > > > > > > > I would like to propose a release of Apache Arrow Rust Implementation,
> > > > > > > > version 10.0.0. As previously discussed [5], the "Integration Test" CI is
> > > > > > > > failing [6], but we have determined it is a bug in the test, not in the
> > > > > > > > code itself, and have a fix ready [7].
> > > > > > >
> > > > > > > This release candidate is based on commit:
> > > > > > > a7bd09abde0010a58d0cd0557384df5aadba83ac [1]
> > > > > > >
> > > > > > > The proposed release tarball and signatures are hosted at [2].
> > > > > > >
> > > > > > > The changelog is located at [3].
> > > > > > >
> > > > > > > > Please download, verify checksums and signatures, run the unit tests,
> > > > > > > > and vote on the release. There is a script [4] that automates some of
> > > > > > > > the verification.
> > > > > > >
> > > > > > > The vote will be open for at least 72 hours.
> > > > > > >
> > > > > > > [ ] +1 Release this as Apache Arrow Rust
> > > > > > > [ ] +0
> > > > > > > [ ] -1 Do not release this as Apache Arrow Rust  because...
> > > > > > >
> > > > > > > [1]:
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> >
> https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Farrow-rs%2Ftree%2Fa7bd09abde0010a58d0cd0557384df5aadba83acdata=04%7C01%7C%7C8db71e50f13d468fcd2b08da004b4654%7C84df9e7fe9f640afb435%7C1%7C0%7C637822620540523542%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000sdata=66NubVTy23HM%2B8coRtaWHAvQWYPQDMEvfNeBTHyc40E%3Dreserved=0
> > > > > > > [2]:
> > > > > > >
> > > > >
> >
> https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdist.apache.org%2Frepos%2Fdist%2Fdev%2Farrow%2Fapache-arrow-rs-10.0.0-rc1data=04%7C01%7C%7C8db71e50f13d468fcd2b08da004b4654%7C84df9e7fe9f640afb435%7C1%7C0%7C637822620540523542%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000sdata=fcoy2Ee7qx0UdQo504491hgeIL9%2Fekbnz35J4CgduuQ%3Dreserved=0
> > > > > > > [3]:
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> >
> https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Farrow-rs%2Fblob%2Fa7bd09abde0010a58d0cd0557384df5aadba83ac%2FCHANGELOG.mddata=04%7C01%7C%7C8db71e50f13d468fcd2b08da004b4654%7C84df9e7fe9f640afb435%7C1%7C0%7C637822620540523542%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000sdata=BTjlJ7euXemTh%2BV0Bw90iCGdhvBJCKWS50dQm8AhXjQ%3Dreserved=0
> > > > > > > [4]:
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> >
> https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Farrow-rs%2Fblob%2Fmaster%2Fdev%2Frelease%2Fverify-release-candidate.shdata=04%7C01%7C%7C8db71e50f13d468fcd2b08da004b4654%7C84df9e7fe9f640afb435%7C1%7C0%7C637822620540523542%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000sdata=QXf6z4OLE0cqbOJpzHCB2JYe5AULep4PgABZFbBCjFs%3Dreserved=0
> > > > > > > [5]:
> >
> 

Re: [Discuss][Format] Add 32-bit and 64-bit Decimals

2022-03-08 Thread Antoine Pitrou



On 07/03/2022 at 20:26, Micah Kornfield wrote:
>> Relaxing from {128,256} to {32,64,128,256} seems a low risk
>> from an integration perspective, as implementations already need to read
>> the bitwidth to select the appropriate physical representation (if they
>> support it).
>
> I think there are two reasons for having implementations first.
> 1.  Lower risk of bugs in the implementation/spec.
> 2.  A mechanism to ensure that there is some boot-strapped coverage in
> commonly used reference implementations.

That sounds reasonable.

Another question that came to my mind is: traditionally, we've mandated 
implementations in the two reference Arrow implementations (C++ and 
Java).  However, our implementation landscape is now much richer than it 
used to be (for example, there is a tremendous activity on the Rust 
side).  Do we want to keep the historical "C++ and Java" requirement or 
do we want to make it a more flexible "two independent official 
implementations", which could be for example C++ and Rust, Rust and 
Java, etc.


(by "independent" I mean that one should not be based on the other, for 
example it should not be "C++ and Python" :-))


Regards

Antoine.




> I agree 1 is fairly low-risk.
>
> On Mon, Mar 7, 2022 at 11:11 AM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
>> +1 adding 32 and 64 bit decimals.
>>
>> +0 to release it without integration tests - both IPC and the C data
>> interface use a variable bit width to declare the appropriate size for
>> decimal types. Relaxing from {128,256} to {32,64,128,256} seems a low risk
>> from an integration perspective, as implementations already need to read
>> the bitwidth to select the appropriate physical representation (if they
>> support it).
>>
>> Best,
>> Jorge




>> On Mon, Mar 7, 2022, 11:41 Antoine Pitrou  wrote:
>>
>>> On 03/03/2022 at 18:05, Micah Kornfield wrote:
>>>> I think this makes sense to add these.  Typically when adding new types,
>>>> we've waited on the official vote until there are two reference
>>>> implementations demonstrating compatibility.
>>>
>>> You are right, I had forgotten about that.  Though in this case, it
>>> might be argued we are just relaxing the constraints on an existing type.
>>>
>>> What do others think?
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>> On Thu, Mar 3, 2022 at 6:55 AM Antoine Pitrou wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> Currently, the Arrow format specification restricts the bitwidth of
>>>>> decimal numbers to either 128 or 256 bits.
>>>>>
>>>>> However, there is interest in allowing other bitwidths, at least 32 and
>>>>> 64 bits for this proposal. A 64-bit (respectively 32-bit) decimal
>>>>> datatype would allow for precisions of up to 18 digits (respectively 9
>>>>> digits), which are sufficient for some applications which are mainly
>>>>> looking for exact computations rather than sheer precision. Obviously,
>>>>> smaller datatypes are cheaper to store in memory and cheaper to run
>>>>> computations on.
>>>>>
>>>>> For example, the Spark documentation mentions that some decimal types
>>>>> may fit in a Java int (32 bits) or long (64 bits):
>>>>>
>>>>> https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/DecimalType.html
>>>>>
>>>>> ... and a draft PR had even been filed for initial support in the C++
>>>>> implementation (https://github.com/apache/arrow/pull/8578).
>>>>>
>>>>> I am therefore proposing that we relax the wording in the Arrow format
>>>>> specification to also allow 32- and 64-bit decimal types.
>>>>>
>>>>> This is a preliminary discussion to gather opinions and potential
>>>>> counter-arguments against this proposal. If no strong counter-argument
>>>>> emerges, we will probably run a vote in a week or two.
>>>>>
>>>>> Best regards
>>>>>
>>>>> Antoine.
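
The precision figures in the original proposal (9 digits for 32 bits, 18 for 64 bits) can be checked with a few lines of arithmetic. The sketch below assumes the unscaled decimal value is stored as a signed two's-complement integer of the given bit width; `max_decimal_precision` and `smallest_bit_width` are hypothetical helper names, not functions from any Arrow implementation.

```python
def max_decimal_precision(bit_width: int) -> int:
    """Largest p such that every p-digit integer fits in a signed
    two's-complement integer of `bit_width` bits."""
    max_value = 2 ** (bit_width - 1) - 1
    precision = 0
    # 10**(p+1) - 1 is the largest integer with p+1 decimal digits.
    while 10 ** (precision + 1) - 1 <= max_value:
        precision += 1
    return precision


def smallest_bit_width(precision: int, widths=(32, 64, 128, 256)) -> int:
    """Smallest candidate bit width able to hold `precision` digits."""
    for width in widths:
        if precision <= max_decimal_precision(width):
            return width
    raise ValueError(f"no candidate bit width holds {precision} digits")


if __name__ == "__main__":
    for width in (32, 64, 128, 256):
        print(width, max_decimal_precision(width))
    # prints: 32 9, 64 18, 128 38, 256 76 (one pair per line)
```

Under that assumption, a column declared with precision 10 through 18 would fit in 64 bits instead of 128, which is the storage and compute saving the proposal is after.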