Re: Flight/FlightSQL Optimization for Small Results?
> The operation flow would be like this, or what would it look like?
>
> Client ---> GetFlightInfo (query/update operation in payload) ---> Server
> ---> Results (non-streamed)

This is roughly the flow I was imagining if the server chooses to send back inlined data.

-Micah
Re: [Discuss][Format] Add 32-bit and 64-bit Decimals
Agreed.

Also, I would like to revise my previous comment about the small risk. While prototyping this I did hit some bumps. They came primarily from two sources:

* I was unable to find arrow/json files in the arrow-testing generated files with a non-default decimal bitwidth (I think we only have the on-the-fly generated file in archery)
* the FFI interface has a default decimal bitwidth of 128 (`d:{precision},{scale}`) and implementations may not support the 256 case (e.g. Rust has no native i256). For these cases, this could be the first non-default decimal implementation.

So, maybe we follow the standard procedure?

Best,
Jorge

On Tue, Mar 8, 2022 at 9:22 PM Micah Kornfield wrote:

> > I’d also like to chime in in favor of 32- and 64-bit decimals because
> > it’ll help achieve better performance on TPC-H (and maybe other
> > benchmarks). The decimal columns need only 12 digits of precision, for
> > which a 64-bit decimal is sufficient. It’s currently wasteful to use a
> > 128-bit decimal. You can technically use a float too, but I expect 64-bit
> > decimal to be faster.
>
> We should be careful here. If this assumes loading from Parquet or other
> file formats currently in the library, arbitrarily changing the type to
> load the minimum data-length possible could break users, this should
> probably be a configuration option. This also reminds me I think there is
> some technical debt with decimals and parquet.
>
> [1] https://issues.apache.org/jira/browse/ARROW-12022
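The point about the FFI default can be illustrated with a small sketch. In the Arrow C data interface a decimal is spelled `d:precision,scale`, with an optional third component giving the bitwidth; when it is omitted the bitwidth is 128. The parser below is only an illustration of that rule: the `DecimalType` tuple, the `parse_decimal_format` function and its `supported` parameter are hypothetical names, not part of any Arrow library.

```python
from typing import NamedTuple, Tuple


class DecimalType(NamedTuple):
    # Hypothetical container for a parsed decimal type.
    precision: int
    scale: int
    bit_width: int


def parse_decimal_format(fmt: str, supported: Tuple[int, ...] = (128, 256)) -> DecimalType:
    """Parse a C data interface decimal format string such as "d:19,10" or "d:19,10,256".

    `supported` is an assumption standing in for the bitwidths that one
    particular implementation happens to handle.
    """
    if not fmt.startswith("d:"):
        raise ValueError(f"not a decimal format string: {fmt!r}")
    parts = fmt[2:].split(",")
    if len(parts) not in (2, 3):
        raise ValueError(f"malformed decimal format string: {fmt!r}")
    precision, scale = int(parts[0]), int(parts[1])
    # The bitwidth is optional and defaults to 128 when absent.
    bit_width = int(parts[2]) if len(parts) == 3 else 128
    if bit_width not in supported:
        raise NotImplementedError(f"decimal bitwidth {bit_width} is not supported")
    return DecimalType(precision, scale, bit_width)
```

An implementation that only handles the default would reject `"d:12,2,64"` until 64 is added to `supported`, which is the kind of first non-default code path described above.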
Re: [DISCUSS][Rust] Biweekly sync call for arrow/datafusion again?
I am not sure if everyone saw it in the agenda[1], but we plan to have a meeting tomorrow. I'll plan to record it for anyone who can not make this time.

15:00 UTC Wednesday March 9, 2022
Meeting Location: (in agenda)

Matthew Turner: focused on JIT and row representation, next Wednesday, March 9th, @yijie: JIT overview

[1] https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit#

On Thu, Mar 3, 2022 at 12:50 AM Benson Muite wrote:

> Interested in learning more about this. Can work through the code and
> discuss on 17 March either 4:00 or 16:00 UTC.
>
> Benson
>
> On 3/3/22 12:03 AM, Andrew Lamb wrote:
> > I noticed that Matthew Turner added a note to the agenda[1] for a walk
> > through of the JIT code. I would be interested in this as well -- would
> > anyone plan to be on the call and discuss it?
> >
> > I don't think I have time to prepare that content prior
> >
> > Andrew
> >
> > [1]
> > https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit#
Re: [Discuss][Format] Add 32-bit and 64-bit Decimals
> I’d also like to chime in in favor of 32- and 64-bit decimals because
> it’ll help achieve better performance on TPC-H (and maybe other
> benchmarks). The decimal columns need only 12 digits of precision, for
> which a 64-bit decimal is sufficient. It’s currently wasteful to use a
> 128-bit decimal. You can technically use a float too, but I expect 64-bit
> decimal to be faster.

We should be careful here. If this assumes loading from Parquet or other file formats currently in the library, arbitrarily changing the type to load the minimum data-length possible could break users, this should probably be a configuration option. This also reminds me I think there is some technical debt with decimals and parquet.

[1] https://issues.apache.org/jira/browse/ARROW-12022
Re: Flight/FlightSQL Optimization for Small Results?
Thank you for doing this, left a few questions on the GH issue

I would adopt this proposal as soon as it makes it into nightlies (or possibly earlier if it's just a matter of regenerating the proto definitions)

The operation flow would be like this, or what would it look like?

Client ---> GetFlightInfo (query/update operation in payload) ---> Server ---> Results (non-streamed)

On Tue, Mar 8, 2022 at 2:04 PM Micah Kornfield wrote:

> Some people have already left comments on
> https://github.com/apache/arrow/pull/12571 More eyes on it would be
> appreciated. If there aren't more comments, I'll try to start implementing
> this feature in Flight next week, and hopefully have a vote after it is
> supported in Java and C++/Python.
>
> Thanks,
> Micah
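The flow discussed in this thread (the server may inline small results in the GetFlightInfo response, and the client falls back to tickets otherwise) can be sketched on the client side as below. This is a toy model under stated assumptions, not the Flight API: the `Endpoint` and `FlightInfo` classes, the `inline_data` attribute and the `do_get` callable are all hypothetical stand-ins, and the actual field names and placement are whatever the linked PR ends up defining.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Endpoint:
    # Hypothetical model: every endpoint still carries a ticket, and the
    # server may additionally attach the data inline when it is small.
    ticket: bytes
    inline_data: Optional[List[dict]] = None


@dataclass
class FlightInfo:
    endpoints: List[Endpoint] = field(default_factory=list)


def fetch_results(info: FlightInfo, do_get: Callable[[bytes], List[dict]]) -> List[dict]:
    """Collect rows, using inlined data when present and DoGet otherwise."""
    rows: List[dict] = []
    for endpoint in info.endpoints:
        if endpoint.inline_data is not None:
            # Small result: no second round trip is needed.
            rows.extend(endpoint.inline_data)
        else:
            # Large result: redeem the ticket as usual.
            rows.extend(do_get(endpoint.ticket))
    return rows
```

Because the ticket is kept alongside the inlined data, a client that does not know about inlining can keep calling DoGet for every endpoint and still get correct results; it only pays for the unused bytes on the wire.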
Re: [Discuss][Format] Add 32-bit and 64-bit Decimals
I’d also like to chime in in favor of 32- and 64-bit decimals because it’ll help achieve better performance on TPC-H (and maybe other benchmarks). The decimal columns need only 12 digits of precision, for which a 64-bit decimal is sufficient. It’s currently wasteful to use a 128-bit decimal. You can technically use a float too, but I expect 64-bit decimal to be faster.

Sasha Krassovsky
Re: Flight/FlightSQL Optimization for Small Results?
Some people have already left comments on https://github.com/apache/arrow/pull/12571. More eyes on it would be appreciated. If there aren't more comments, I'll try to start implementing this feature in Flight next week, and hopefully have a vote after it is supported in Java and C++/Python.

Thanks,
Micah

On Fri, Mar 4, 2022 at 10:54 PM Micah Kornfield wrote:

> I put together a straw-man proposal in PR [1] for the Flight changes.
> Ultimately, based on the use-cases discussed, inlining the data on
> the Ticket seemed to make the most sense. This might be overly complex
> (I'm not sure how I feel about an enum indicating partial vs full
> results) but welcome feedback. Once we get consensus on this proposal,
> I can add changes to Flight SQL and try to provide reference
> implementations.
>
> [1] https://github.com/apache/arrow/pull/12571
>
> On Tue, Mar 1, 2022 at 10:51 PM Micah Kornfield wrote:
>
>>> Would it make sense to make this part of DoGet since it
>>> still would be returning a record batch
>>
>> I would lean against this. I think in many cases the client doesn't know
>> the size of the data that it expects. Leaving the flexibility on the
>> server side to send back inlined data when it thinks it makes sense, or a
>> bunch of tickets when there is in fact a lot of data seems like the best
>> option here.
>>
>>> For cases like previewing data, you usually just want to get a small
>>> amount of data quickly.
>>
>> This is interesting and might be an additional use case. If we did
>> decide to extend FlightInfo we might also want a way of annotating inlined
>> data with its corresponding ticket. That way even for large results, you
>> could still send back a small preview if desired.
>>
>> After considering it a little bit I think I'm sold that inlined data
>> should not replace a ticket. So in my mind the open question is whether
>> the client needs to actively opt-in to inlined data. The scenarios I could
>> come up with where inlined data isn't useful are:
>> 1. The client is an old client and isn't aware inline data might be
>> returned. In this case the main cost is of extra data on the wire and
>> storing it as unknown fields [1].
>> 2. The client is a new client but still doesn't want to get inline data
>> (it might want to distribute all consumption to other processes). Same
>> cost is paid as option 1.
>>
>> Are there other scenarios? If servers choose reasonable limits on what
>> data to inline, the extra complexity of negotiating with the client in this
>> case might not be worth the benefits.
>>
>> Cheers,
>> Micah
>>
>> [1] https://developers.google.com/protocol-buffers/docs/proto3#unknowns
>>
>> On Tue, Mar 1, 2022 at 10:01 PM Bryan Cutler wrote:
>>
>>> I think this would be a useful feature and be nice to have in Flight
>>> core. For cases like previewing data, you usually just want to get a
>>> small amount of data quickly. Would it make sense to make this part of
>>> DoGet since it still would be returning a record batch? Perhaps a
>>> Ticket could be made to have an optional FlightDescriptor that would
>>> serve as an all-in-one shot?
>>>
>>> On Tue, Mar 1, 2022 at 8:44 AM David Li wrote:
>>>
>>> > I agree with something along Antoine's proposal, though: maybe we
>>> > should be more structured with the flags (akin to what Micah
>>> > mentioned with the Feature enum).
>>> >
>>> > Also, the flag could be embedded into the Flight SQL messages instead.
>>> > (So in effect, Flight would only add the capability to return data
>>> > with FlightInfo, and it's up to applications, like Flight SQL, to
>>> > decide how they want to take advantage of that.)
>>> >
>>> > I think having a completely separate method and return type and having
>>> > to poll for it beforehand somewhat defeats the purpose of having
>>> > it/would be much harder of a transition.
>>> >
>>> > Also: it should be `repeated FlightInfo inline_data` right? In case we
>>> > also need dictionary batches?
>>> >
>>> > On Tue, Mar 1, 2022, at 11:39, Antoine Pitrou wrote:
>>> > > Can we just add the following field to the FlightDescriptor message:
>>> > >
>>> > > bool accept_inline_data = 4;
>>> > >
>>> > > and this one to the FlightInfo message:
>>> > >
>>> > > FlightData inline_data = 100;
>>> > >
>>> > > Then new clients can set `accept_inline_data` to true (the default
>>> > > being false if omitted) to signal servers that they can put the data
>>> > > in `inline_data` if deemed small enough.
>>> > >
>>> > > (the `accept_inline_data` field could also be added to the Criteria
>>> > > message)
>>> > >
>>> > >
>>> > > Alternatively, if the FlightDescriptor expansion looks a bit dirty
>>> > > (FlightDescriptor being used in other contexts where
>>> > > `accept_inline_data` makes no sense), we can instead define a new
>>> > > method:
>>> > >
>>> > > rpc GetFlightInfoEx(GetFlightInfoRequest) returns (FlightInfo) {}
>>> > >
>>> > > with:
>>> > >
>>> > > message GetFlightInfoRequest {
Re: [Discuss][Format] Add 32-bit and 64-bit Decimals
> Do we want to keep the historical "C++ and Java" requirement or
> do we want to make it a more flexible "two independent official
> implementations", which could be for example C++ and Rust, Rust and
> Java, etc.

I think flexibility here is a good idea, and I'd like to hear other opinions.

For this particular case, if there aren't volunteers to help out in another
implementation, I'm willing to help with Java (I don't have the bandwidth to do
both C++ and Java).

Cheers,
-Micah

On Tue, Mar 8, 2022 at 8:23 AM Antoine Pitrou wrote:

>
> On 07/03/2022 at 20:26, Micah Kornfield wrote:
> >>
> >> Relaxing from {128,256} to {32,64,128,256} seems a low risk
> >> from an integration perspective, as implementations already need to read
> >> the bitwidth to select the appropriate physical representation (if they
> >> support it).
> >
> > I think there are two reasons for having implementations first.
> > 1. Lower risk of bugs in the implementation/spec.
> > 2. A mechanism to ensure that there is some bootstrapped coverage in
> > commonly used reference implementations.
>
> That sounds reasonable.
>
> Another question that came to my mind is: traditionally, we've mandated
> implementations in the two reference Arrow implementations (C++ and
> Java). However, our implementation landscape is now much richer than it
> used to be (for example, there is tremendous activity on the Rust
> side). Do we want to keep the historical "C++ and Java" requirement or
> do we want to make it a more flexible "two independent official
> implementations", which could be for example C++ and Rust, Rust and
> Java, etc.
>
> (by "independent" I mean that one should not be based on the other, for
> example it should not be "C++ and Python" :-))
>
> Regards
>
> Antoine.
>
> >
> > I agree 1 is fairly low-risk.
> >
> > On Mon, Mar 7, 2022 at 11:11 AM Jorge Cardoso Leitão <
> > jorgecarlei...@gmail.com> wrote:
> >
> >> +1 adding 32 and 64 bit decimals.
> >>
> >> +0 to release it without integration tests - both IPC and the C data
> >> interface use a variable bit width to declare the appropriate size for
> >> decimal types. Relaxing from {128,256} to {32,64,128,256} seems a low risk
> >> from an integration perspective, as implementations already need to read
> >> the bitwidth to select the appropriate physical representation (if they
> >> support it).
> >>
> >> Best,
> >> Jorge
> >>
> >> On Mon, Mar 7, 2022, 11:41 Antoine Pitrou wrote:
> >>
> >>> On 03/03/2022 at 18:05, Micah Kornfield wrote:
> >>>> I think it makes sense to add these. Typically when adding new types,
> >>>> we've waited on the official vote until there are two reference
> >>>> implementations demonstrating compatibility.
> >>>
> >>> You are right, I had forgotten about that. Though in this case, it
> >>> might be argued we are just relaxing the constraints on an existing type.
> >>>
> >>> What do others think?
> >>>
> >>> Regards
> >>>
> >>> Antoine.
> >>>
> >>>> On Thu, Mar 3, 2022 at 6:55 AM Antoine Pitrou wrote:
> >>>>
> >>>>> Hello,
> >>>>>
> >>>>> Currently, the Arrow format specification restricts the bitwidth of
> >>>>> decimal numbers to either 128 or 256 bits.
> >>>>>
> >>>>> However, there is interest in allowing other bitwidths, at least 32 and
> >>>>> 64 bits for this proposal. A 64-bit (respectively 32-bit) decimal
> >>>>> datatype would allow for precisions of up to 18 digits (respectively 9
> >>>>> digits), which are sufficient for some applications which are mainly
> >>>>> looking for exact computations rather than sheer precision. Obviously,
> >>>>> smaller datatypes are cheaper to store in memory and cheaper to run
> >>>>> computations on.
> >>>>>
> >>>>> For example, the Spark documentation mentions that some decimal types
> >>>>> may fit in a Java int (32 bits) or long (64 bits):
> >>>>>
> >>>>> https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/DecimalType.html
> >>>>>
> >>>>> ... and a draft PR had even been filed for initial support in the C++
> >>>>> implementation (https://github.com/apache/arrow/pull/8578).
> >>>>>
> >>>>> I am therefore proposing that we relax the wording in the Arrow format
> >>>>> specification to also allow 32- and 64-bit decimal types.
> >>>>>
> >>>>> This is a preliminary discussion to gather opinions and potential
> >>>>> counter-arguments against this proposal. If no strong counter-argument
> >>>>> emerges, we will probably run a vote in a week or two.
> >>>>>
> >>>>> Best regards
> >>>>>
> >>>>> Antoine.
[RESULT][VOTE][RUST] Release Apache Arrow Rust 10.0.0 RC1
With 8 +1 (3 binding) the release is approved! Thank you to all who verified it.

The release is available here:
https://dist.apache.org/repos/dist/release/arrow/arrow-rs-10.0.0

It has also been uploaded to crates.io:
https://crates.io/crates/arrow/10.0.0
https://crates.io/crates/arrow-flight/10.0.0
https://crates.io/crates/parquet/10.0.0
https://crates.io/crates/parquet-derive/10.0.0

On Mon, Mar 7, 2022 at 11:59 PM Yijie Shen wrote:

> +1 (non-binding) verified on Windows Subsystem for Linux. Thanks, Andrew!
>
> On Tue, Mar 8, 2022 at 10:43 AM QP Hou wrote:
>
> > +1 (binding). Thanks Andrew.
> >
> > On Mon, Mar 7, 2022 at 9:17 AM Chao Sun wrote:
> > >
> > > +1 (non-binding) verified on Mac. Thanks Andrew!
> > >
> > > On Mon, Mar 7, 2022 at 7:47 AM Matthew Turner wrote:
> > > >
> > > > +1 (non-binding) after running release verification script on M1 Mac.
> > > >
> > > > Thanks, Andrew.
> > > >
> > > > From: Andy Grove
> > > > Date: Monday, March 7, 2022 at 10:00 AM
> > > > To: dev
> > > > Subject: Re: [VOTE][RUST] Release Apache Arrow Rust 10.0.0 RC1
> > > > +1 (binding)
> > > >
> > > > Verified on Ubuntu 20.04.3 LTS
> > > >
> > > > On Mon, Mar 7, 2022 at 6:52 AM Kun Liu wrote:
> > > >
> > > > > I tested it on my Mac and got the "Release candidate looks good!"
> > > > > message.
> > > > > The unit tests passed on my Mac.
> > > > >
> > > > > +1 non-binding.
> > > > >
> > > > > Thanks,
> > > > > Kun
> > > > >
> > > > > R
> > > > >
> > > > > Wang Xudong wrote on Sat, Mar 5, 2022 at 22:00:
> > > > >
> > > > > > +1 non-binding
> > > > > >
> > > > > > Tested on macOS, "Release candidate looks good!"
> > > > > > Thank you alamb!
> > > > > >
> > > > > > ---
> > > > > > xudong
> > > > > >
> > > > > > Andrew Lamb wrote on Sat, Mar 5, 2022 at 20:06:
> > > > > >
> > > > > > > Salutations Arrow Rust Community,
> > > > > > >
> > > > > > > I would like to propose a release of Apache Arrow Rust Implementation,
> > > > > > > version 10.0.0.
> > > > > > > As previously discussed[5] the "Integration Test" CI is
> > > > > > > failing[6], but we have determined it is a bug in the test, not in the
> > > > > > > code itself, and have a fix ready [7].
> > > > > > >
> > > > > > > This release candidate is based on commit:
> > > > > > > a7bd09abde0010a58d0cd0557384df5aadba83ac [1]
> > > > > > >
> > > > > > > The proposed release tarball and signatures are hosted at [2].
> > > > > > >
> > > > > > > The changelog is located at [3].
> > > > > > >
> > > > > > > Please download, verify checksums and signatures, run the unit tests,
> > > > > > > and vote on the release. There is a script [4] that automates some of
> > > > > > > the verification.
> > > > > > >
> > > > > > > The vote will be open for at least 72 hours.
> > > > > > >
> > > > > > > [ ] +1 Release this as Apache Arrow Rust
> > > > > > > [ ] +0
> > > > > > > [ ] -1 Do not release this as Apache Arrow Rust because...
> > > > > > >
> > > > > > > [1]: https://github.com/apache/arrow-rs/tree/a7bd09abde0010a58d0cd0557384df5aadba83ac
> > > > > > > [2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-10.0.0-rc1
> > > > > > > [3]: https://github.com/apache/arrow-rs/blob/a7bd09abde0010a58d0cd0557384df5aadba83ac/CHANGELOG.md
> > > > > > > [4]: https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
> > > > > > > [5]:
Re: [Discuss][Format] Add 32-bit and 64-bit Decimals
On 07/03/2022 at 20:26, Micah Kornfield wrote:
>>
>> Relaxing from {128,256} to {32,64,128,256} seems a low risk
>> from an integration perspective, as implementations already need to read
>> the bitwidth to select the appropriate physical representation (if they
>> support it).
>
> I think there are two reasons for having implementations first.
> 1. Lower risk of bugs in the implementation/spec.
> 2. A mechanism to ensure that there is some bootstrapped coverage in
> commonly used reference implementations.

That sounds reasonable.

Another question that came to my mind is: traditionally, we've mandated
implementations in the two reference Arrow implementations (C++ and
Java). However, our implementation landscape is now much richer than it
used to be (for example, there is tremendous activity on the Rust
side). Do we want to keep the historical "C++ and Java" requirement or
do we want to make it a more flexible "two independent official
implementations", which could be for example C++ and Rust, Rust and
Java, etc.

(by "independent" I mean that one should not be based on the other, for
example it should not be "C++ and Python" :-))

Regards

Antoine.

>
> I agree 1 is fairly low-risk.
>
> On Mon, Mar 7, 2022 at 11:11 AM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
>> +1 adding 32 and 64 bit decimals.
>>
>> +0 to release it without integration tests - both IPC and the C data
>> interface use a variable bit width to declare the appropriate size for
>> decimal types. Relaxing from {128,256} to {32,64,128,256} seems a low risk
>> from an integration perspective, as implementations already need to read
>> the bitwidth to select the appropriate physical representation (if they
>> support it).
>>
>> Best,
>> Jorge
>>
>> On Mon, Mar 7, 2022, 11:41 Antoine Pitrou wrote:
>>
>>> On 03/03/2022 at 18:05, Micah Kornfield wrote:
>>>> I think it makes sense to add these. Typically when adding new types,
>>>> we've waited on the official vote until there are two reference
>>>> implementations demonstrating compatibility.
>>>
>>> You are right, I had forgotten about that. Though in this case, it
>>> might be argued we are just relaxing the constraints on an existing type.
>>>
>>> What do others think?
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>> On Thu, Mar 3, 2022 at 6:55 AM Antoine Pitrou wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> Currently, the Arrow format specification restricts the bitwidth of
>>>>> decimal numbers to either 128 or 256 bits.
>>>>>
>>>>> However, there is interest in allowing other bitwidths, at least 32 and
>>>>> 64 bits for this proposal. A 64-bit (respectively 32-bit) decimal
>>>>> datatype would allow for precisions of up to 18 digits (respectively 9
>>>>> digits), which are sufficient for some applications which are mainly
>>>>> looking for exact computations rather than sheer precision. Obviously,
>>>>> smaller datatypes are cheaper to store in memory and cheaper to run
>>>>> computations on.
>>>>>
>>>>> For example, the Spark documentation mentions that some decimal types
>>>>> may fit in a Java int (32 bits) or long (64 bits):
>>>>>
>>>>> https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/DecimalType.html
>>>>>
>>>>> ... and a draft PR had even been filed for initial support in the C++
>>>>> implementation (https://github.com/apache/arrow/pull/8578).
>>>>>
>>>>> I am therefore proposing that we relax the wording in the Arrow format
>>>>> specification to also allow 32- and 64-bit decimal types.
>>>>>
>>>>> This is a preliminary discussion to gather opinions and potential
>>>>> counter-arguments against this proposal. If no strong counter-argument
>>>>> emerges, we will probably run a vote in a week or two.
>>>>>
>>>>> Best regards
>>>>>
>>>>> Antoine.
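The 9- and 18-digit figures quoted in the proposal can be checked with a short calculation. The sketch below assumes the unscaled decimal value is stored as a signed two's-complement integer of the given bit width, which is how the existing 128- and 256-bit decimals are laid out. `max_decimal_precision` and `smallest_bit_width` are illustrative names, not part of any Arrow API.

```python
def max_decimal_precision(bit_width: int) -> int:
    """Largest p such that every p-digit integer fits in a signed integer
    of `bit_width` bits (two's complement assumed)."""
    max_value = 2 ** (bit_width - 1) - 1
    # max_value has d digits but is never 99...9, so only d - 1 digits
    # are guaranteed to fit for every possible value.
    return len(str(max_value)) - 1


def smallest_bit_width(precision: int, candidates=(32, 64, 128, 256)) -> int:
    """Pick the narrowest candidate width that can hold `precision` digits."""
    for width in candidates:
        if precision <= max_decimal_precision(width):
            return width
    raise ValueError(f"precision {precision} exceeds all candidate widths")


for width in (32, 64, 128, 256):
    print(width, max_decimal_precision(width))
# prints:
# 32 9
# 64 18
# 128 38
# 256 76
```

Under the same assumption, a writer could call `smallest_bit_width(10)` and get 64, since ten digits no longer fit in 32 bits.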