The ADBC and Flight SQL proposals have been updated for Micah/Taeyun/Will's 
comments.

On Wed, Feb 15, 2023, at 09:17, David Li wrote:
> Hi Taeyun,
>
> Thanks for the detailed feedback!
>
> - I will clarify that PollFlightInfo should return as quickly as 
> possible on the first call, and that updates in progress value are also 
> OK (though the server shouldn't spam updates). (I wanted to avoid 
> streaming calls as it does not work as well with browser-based gRPC 
> clients.)
> - I will clarify cancel_descriptor to note that it is optional.
> - I wanted to avoid adding several new RPC methods, but if there is 
> rough agreement that these would be generally useful, I will add them 
> and deprecate the Flight SQL message [3]. (We could also possibly 
> define 'standard' DoAction Protobuf messages, but I worry about 
> implementation [1]. I may prototype this first, since then we could 
> avoid having redundant paths in Flight RPC/Flight SQL.) If we do this, 
> I think we do not need cancel_descriptor. (It can work like 
> CancelQuery.)
> - I meant that CancelQuery should work with a partial FlightInfo from a 
> PollFlightInfo response. However this doesn't work if there's no 
> endpoints in the response! I will add app_metadata fields to 
> FlightInfo/FlightEndpoint. I think this can also be useful for 
> applications that need to add their own semantics to these messages 
> anyways, since Ticket is not meant to be parsed by the client. (You 
> could stuff the info into the schema, but that also doesn't work if the 
> schema is not yet known.)
>
> As for the partial DoGet: I think this is interesting and we can 
> discuss. Google BigQuery Storage supports this use case [2]. As you 
> note, if you are using this to request only a few rows, you may not 
> benefit much from Arrow.
>
> [1]: The C++ Protobuf library makes it difficult to define and share 
> messages across multiple shared libraries. On Windows, protoc does not 
> properly insert dllimport/dllexport macros (despite claiming to), and 
> on Unixes Protobuf interacts oddly with our linker script/symbol 
> hiding. This would be a lot of work, but I wonder if we could use an 
> implementation like upb/nanopb that does not rely on global state for 
> Arrow. This would also hopefully ease conflicts with projects that want 
> to use their own Protobuf definitions - as with Substrait. The main 
> challenge here is getting them to work with gRPC; I think we would have 
> to handroll the gRPC code that is normally generated. This may not be 
> too bad, just undocumented/may not be a stable API, and it would also 
> let us avoid the iffy casting we currently do to bypass gRPC's 
> serialization.
> [2]: 
> https://github.com/googleapis/googleapis/blob/1870ba2163526fa9fba63bf899c92707476d4603/google/cloud/bigquery/storage/v1/storage.proto#L268-L282
> [3]: It may be time to consider explicit versioning of Flight 
> RPC/Flight SQL?
>
> On Tue, Feb 14, 2023, at 20:45, Taeyun Kim wrote:
>> Hi David,
>>
>> Thank you very much for your proposal.
>> My comments about it are as follows:
>>
>> About PollFlightInfo:
>>
>> Many SQL queries (in fact, almost all OLAP queries?) cannot produce any
>> output records until it completes - because of GROUP BY or ORDER BY clause.
>> In that case, PollFlightInfo can degenerate to GetFlightInfo since the
>> server will not respond unless there are changes to the result. If the
>> 'progress' field of RetryInfo is also regarded as the result, the server can
>> respond with a different progress value. But the server that does not know
>> the progress information cannot use that.
>> The client can call the RPC with a timeout to avoid arbitrarily long
>> polling, but in that case, the client would not be able to get a descriptor
>> for cancellation of the query if the first PollFlightInfo does not return
>> soon. Maybe it should be specified that the server processing PollFlightInfo
>> must return immediately after it parses the query and starts executing it to
>> provide the cancel_descriptor as soon as possible.
>> Regarding cancel_descriptor, it would be nice for the server to unset it
>> even if the query is still in progress, to notify the client that the query
>> cancellation is not supported.
>> BTW, I thought of something like StreamingGetFlightInfo, which is a
>> bidirectional streaming version of PollFlightInfo. But maybe PollFlightInfo
>> is better since the other client that does not own the GRPC call stream can
>> cancel the query. (Or maybe StreamingGetFlightInfo can send
>> cancel_descriptor for use outside the stream.)
>>
>> About CloseQuery:
>>
>> I think that it would be great if the RPC call is in Flight RPC rather than
>> in FlightSQL RPC since the FlightInfo that it tries to close is got from
>> GetFlightInfo/PollFlightInfo in Flight RPC. In that case, maybe it would be
>> nice to name it 'CloseFlightInfo', to be matched with GetFlightInfo.
>>
>> About RefreshQuery:
>>
>> Same as CloseQuery. Maybe it can be named 'RetainFlightInfo'.
>>
>> About CancelQuery:
>>
>> I don't know how to use it. CancenQuery requires FlightInfo from the server.
>> But by the time the client receives FlightInfo, the query has been already
>> completed, doesn't it?
>>
>> Another (unrelated?) request (not in the proposal):
>>
>> In DoGet, the client must consume the whole endpoint. It can make it
>> difficult for a client who only wants to or can retrieve only a small
>> portion of it. (For example, there may be a web client that displays the
>> result in tabular format page-by-page. A web server can cache the DoGet
>> result, but by doing that the web server must manage a state. A stateful web
>> server is harder to implement and manage.) Can we have a variant of DoGet
>> that only retrieves a portion of an endpoint? That RPC method can have
>> record_offset and record_count arguments. (Maybe it defeats the purpose of
>> Flight RPC which prefers fast, bulk transfer.)
>>
>> Thank you.
>>
>> -----Original Message-----
>> From: David Li <lidav...@apache.org> 
>> Sent: Wednesday, February 15, 2023 8:06 AM
>> To: dev@arrow.apache.org
>> Subject: Re: [DISCUSS] Flight RPC/Flight SQL/ADBC enhancements
>>
>> Ah, right. I haven't written up the last set of ADBC proposals yet. I'll do
>> that in the next day or two.
>>
>> On Tue, Feb 14, 2023, at 17:38, Will Jones wrote:
>>> Hi David,
>>>
>>> The proposals in the Flight/Flight SQL document look excellent. As 
>>> I've been looking at ADBC I've been wondering about polling / async 
>>> execution, cancellation, and progress indicators. Glad to see those in 
>>> the Flight document, but where are they in the ADBC issues? Do they 
>>> still need to be created?
>>>
>>> Best,
>>>
>>> Will Jones
>>>
>>> On Tue, Feb 14, 2023 at 12:58 PM David Li <lidav...@apache.org> wrote:
>>>
>>>> Hello,
>>>>
>>>> I would like to submit some Flight RPC and Flight SQL enhancements 
>>>> for discussion. They cover the following:
>>>>
>>>> - Executing 'queries' in a retryable, nonblocking way
>>>> - Handling ordered result sets
>>>> - Handling expiration of/re-reading result sets
>>>>
>>>> In addition, there are corresponding proposals for ADBC in 
>>>> anticipation of these features, James's catalogs proposal for Flight 
>>>> SQL, and other feedback.
>>>>
>>>> The Flight proposals are described in this document [1]. It should be 
>>>> open for comments.
>>>> The ADBC proposals are filed as individual issues in this milestone [2].
>>>>
>>>> Any feedback is much appreciated. There are not yet prototype 
>>>> implementations, but if there is a rough consensus then I can begin on
>> that.
>>>>
>>>> [1]:
>>>> https://docs.google.com/document/d/1jhPyPZSOo2iy0LqIJVUs9KWPyFULVFJXT
>>>> ILDfkadx2g/edit?usp=sharing
>>>> [2]: https://github.com/apache/arrow-adbc/milestone/3
>>>>
>>>> Thanks,
>>>> David
>>>>

Reply via email to