Hi David,

Thanks for sharing the Spark information. I agree with you
understanding. Thanks for summarizing this.


If anyone has any comments about our long-running queries
approach, please share your comment on this thread, issue or
PR:

Issue: https://github.com/apache/arrow/issues/36155
PR: https://github.com/apache/arrow/pull/36946


Thanks,
-- 
kou

In <0a28d06b-f18e-4717-8527-988159eb4...@app.fastmail.com>
  "Re: [DISCUSS][Format][Flight] Long-running queries support" on Thu, 03 Aug 
2023 09:18:21 -0400,
  "David Li" <lidav...@apache.org> wrote:

> Thanks Kou!
> 
> For those interested, Spark recently landed a proposal with some similar 
> goals in Spark Connect, calling it "reattachable execution": 
> https://github.com/apache/spark/pull/42228
> 
> From my understanding, Flight is more flexible here, decoupling the query 
> execution from fetching the result set, and allowing the result set to be 
> distributed/partitioned. (That said, Spark Connect's implementation lets you 
> resume from where you left off when fetching data, while currently Flight 
> doesn't standardize anything other than retrying the entire DoGet.) Both use 
> or plan to use timeouts to deal with cleaning up resources (while allowing 
> clients to explicitly free resources too).
> 
> On Tue, Aug 1, 2023, at 16:45, Sutou Kouhei wrote:
>> Hi,
>>
>> I would like to propose adding support for long-running
>> queries to Apache Arrow Flight. If anyone has comments for
>> this proposal, please share them at here or the issue for
>> this proposal:
>>   https://github.com/apache/arrow/issues/36155
>>
>> Implementation in C++/Go/Java:
>>   https://github.com/apache/arrow/pull/36946
>> Documentations:
>>   
>> http://crossbow.voltrondata.com/pr_docs/36946/format/Flight.html#downloading-data-by-running-a-heavy-query
>>
>>
>> This is one of proposals in "[DISCUSS] Flight RPC/Flight
>> SQL/ADBC enhancements":
>>
>>   https://lists.apache.org/thread/247z3t06mf132nocngc1jkp3oqglz7jp
>>
>> See also the "Flight RPC: Long-Running Queries" section in
>> the design document for the proposals:
>>
>>   
>> https://docs.google.com/document/d/1jhPyPZSOo2iy0LqIJVUs9KWPyFULVFJXTILDfkadx2g/edit#
>>
>>
>> Changes since the original proposal:
>>
>> * Removed RetryInfo.cancel_descriptor because we can use the
>>   existing CancelFlightInfo action.
>> * Renamed RetryInfo.retry_descriptor to
>>   RetryInfo.flight_descriptor because we have only one
>>   descriptor by removing RetryInfo.cancel_descriptor.
>>
>>
>> Background:
>>
>> Queries generally don't complete instantly (as much as we
>> would like them to). So where can we put the 'query
>> evaluation time'?
>>
>> * In GetFlightInfo: block and wait for the query to complete.
>>   * Con: this is a long-running blocking call, which may
>>     fail or time out. Then when the client retries, the
>>     server has to redo all the work.
>>   * Con: parts of the result may be ready before others, but
>>     the client can't do anything until everything is ready.
>>
>> * In DoGet: return a fixed number of partitions
>>   * Con: this makes handling worker failures hard. Systems
>>     like Trino support fault-tolerant execution by replacing
>>     workers at runtime. But GetFlightInfo has already
>>     passed, so we can't notify the client of new workers2.
>>   * Con: we have to know or fix the partitioning up front.
>>
>> Neither solution is optimal.
>>
>>
>> Proposal:
>>
>> Add PollFlightInfo as a retryable version of GetFlightInfo.
>> Clients can poll the current query status and start reading
>> the currently available results so far before the query is
>> completed.
>>
>> See documentation for details:
>>   
>> http://crossbow.voltrondata.com/pr_docs/36946/format/Flight.html#downloading-data-by-running-a-heavy-query
>>
>>
>> Implementation:
>>
>> https://github.com/apache/arrow/pull/36946 is an
>> implementation of this proposal. The pull request has the
>> followings:
>>
>> 1. Format changes:
>>    * format/Flight.proto
>>      
>> https://github.com/apache/arrow/pull/36946/files#diff-53b6c132dcc789483c879f667a1c675792b77aae9a056b257d6b20287bb09dba
>>
>> 2. Documentation changes:
>>    docs/source/format/Flight.rst
>>    
>> https://github.com/apache/arrow/pull/36946/files#diff-839518fb41e923de682e8587f0b6fdb00eb8f3361d360c2f7249284a136a7d89
>>
>> 3. The C++ implementation and an integration test:
>>    * cpp/src/arrow/flight/
>>
>> 4. The Go implementation and an integration test:
>>    * go/arrow/flight/server.go
>>    * go/arrow/internal/flight_integration/scenario.go
>>
>> 5. The Java implementation and an integration test:
>>    * java/flight/flight-core/
>>    * java/flight/flight-integration-tests/
>>
>>
>> Next:
>>
>> I'll start a vote for this proposal after we reach a consensus
>> on this proposal.
>>
>> It's the standard process for format change.
>> See also:
>> https://arrow.apache.org/docs/dev/format/Changing.html
>>
>>
>> Thanks,
>> -- 
>> kou

Reply via email to