This is an automated email from the ASF dual-hosted git repository.
zeroshade pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/main by this push:
new 067fd2a2c6 GH-46193: [Flight][Format] Extend Flight Location URI
Semantics (#46194)
067fd2a2c6 is described below
commit 067fd2a2c6e54d33b9ae8a3324f59bebe960d485
Author: Matt Topol <[email protected]>
AuthorDate: Mon May 5 11:44:11 2025 -0400
GH-46193: [Flight][Format] Extend Flight Location URI Semantics (#46194)
### Rationale for this change
Updating the documentation in Flight.proto and Flight.rst to extend the
semantics of the allowed Flight location URIs.
### What changes are included in this PR?
Only documentation changes. Currently, none of the Arrow Flight
implementations handle the returned URIs beyond possibly parsing them
and wrapping them in a `Location` structure. It is left to the consumer
to decide whether to reuse the same client or spin up a new client for
the new location when performing the `DoGet` request. As such, no
code or library changes were needed to accommodate this as part of this PR.
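
Since handling is left to the consumer, a rough sketch of that client-side logic may help. The Python below is illustrative only and is not part of any Arrow Flight library: the function names `classify_location` and `response_format`, and the returned labels, are hypothetical. It assumes the URI schemes and media types described in this change, and uses only the standard library.

```python
from typing import Optional
from urllib.parse import urlsplit

# Scheme documented for "redeem the ticket on the same service".
REUSE_SCHEME = "arrow-flight-reuse-connection"


def classify_location(uri: str) -> str:
    """Return 'reuse', 'flight', or 'http' for a Flight Location URI.

    The labels are hypothetical; they only name the three cases
    described in the Location documentation.
    """
    if uri == "":
        return "reuse"  # empty string: DoGet on the originating service
    scheme = urlsplit(uri).scheme.lower()
    if scheme == REUSE_SCHEME:
        return "reuse"
    if scheme == "grpc" or scheme.startswith("grpc+"):
        return "flight"  # DoGet against the service at this URI
    if scheme in ("http", "https"):
        return "http"  # plain GET; the Ticket is ignored
    raise ValueError(f"unsupported Flight location scheme: {scheme!r}")


def response_format(content_type: Optional[str]) -> str:
    """Guess the payload format of an HTTP response from its Content-Type."""
    if content_type is None:
        return "arrow-ipc-stream"  # unspecified: assume Arrow IPC stream
    media = content_type.split(";", 1)[0].strip().lower()
    if media in ("", "application/octet-stream",
                 "application/vnd.apache.arrow.stream"):
        return "arrow-ipc-stream"  # generic types still mean Arrow IPC
    if media == "application/vnd.apache.parquet":
        return "parquet"
    return media  # leave other media types for the caller to handle
```

A consumer could call `classify_location` on each `Location` of a `FlightEndpoint` to decide between reusing its client, creating a new Flight client, or issuing an HTTP GET, and then call `response_format` on the response header in the HTTP case.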
* GitHub Issue: #46193
---------
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Ian Cook <[email protected]>
Co-authored-by: Raúl Cumplido <[email protected]>
---
docs/source/format/Flight.rst | 53 +++++++++++++++++++++++++++++++++++++++++++
format/Flight.proto | 37 ++++++++++++++++++++++++++++--
2 files changed, 88 insertions(+), 2 deletions(-)
diff --git a/docs/source/format/Flight.rst b/docs/source/format/Flight.rst
index aac979cf75..7355a698d0 100644
--- a/docs/source/format/Flight.rst
+++ b/docs/source/format/Flight.rst
@@ -333,6 +333,13 @@ schemes for the given transports:
+----------------------------+--------------------------------+
| (reuse connection) | arrow-flight-reuse-connection: |
+----------------------------+--------------------------------+
+| HTTP (1) | http: or https: |
++----------------------------+--------------------------------+
+
+Notes:
+
+* \(1) See :ref:`flight-extended-uris` for semantics when using
+ http/https as the transport. The URI should be accessible via a GET request.
Connection Reuse
----------------
@@ -360,6 +367,52 @@ string, so the obvious candidates are not compatible. The chosen
representation can be parsed by both implementations, as well as Go's
``net/url`` and Python's ``urllib.parse``.
+.. _flight-extended-uris:
+
+Extended Location URIs
+----------------------
+
+In addition to alternative transports, a server may also return
+URIs that reference an external service or object storage location.
+This can be useful in cases where intermediate data is cached as
+Apache Parquet files on cloud storage or is otherwise accessible
+via an HTTP service. In these scenarios, it is more efficient to be
+able to provide a URI where the client may simply download the data
+directly, rather than requiring a Flight service to read it back into
+memory and serve it from a ``DoGet`` request.
+
+To avoid the complexities of Flight clients having to implement support
+for multiple different cloud storage vendors (e.g. AWS S3, Google Cloud),
+we extend the URIs to only allow an HTTP/HTTPS URI where the client can
+perform a simple GET request to download the data. Authentication can be
+handled either by negotiating externally to the Flight protocol or by the
+server sending a presigned URL that the client can make a GET request to.
+This should be supported by all current major cloud storage vendors, meaning
+only the server needs to know the semantics of the underlying object store APIs.
+
+When using an extended location URI, the client should ignore any
+value in the ``Ticket`` field of the ``FlightEndpoint``. The
+``Ticket`` is only used for identifying data in the context of a
+Flight service, and is not needed when the client is directly
+downloading data from an external service.
+
+Clients should assume that, unless otherwise specified, the data is
+being returned using the :ref:`format-ipc` just as it would
+via a ``DoGet`` call. If the returned ``Content-Type`` header is a generic
+media type such as ``application/octet-stream``, the client should still assume
+it is an Arrow IPC stream. For other media types, such as Apache Parquet,
+the server should use the appropriate IANA Media Type that a client
+would recognize.
+
+Finally, the server may also allow the client to choose what format the
+data is returned in by respecting the ``Accept`` header in the request.
+If multiple formats are requested and supported, the choice of which to
+use is server-specific. If none of the requested content-types are
+supported, the server may respond with 406 (Not Acceptable) or
+415 (Unsupported Media Type), or successfully respond with a different
+format that it does support, along with the correct ``Content-Type``
+header.
+
Error Handling
==============
diff --git a/format/Flight.proto b/format/Flight.proto
index f2b0f889cf..690031ff00 100644
--- a/format/Flight.proto
+++ b/format/Flight.proto
@@ -426,8 +426,41 @@ message Ticket {
}
/*
- * A location where a Flight service will accept retrieval of a particular
- * stream given a ticket.
+ * A location to retrieve a particular stream from. This URI should be one of
+ * the following:
+ * - An empty string or the string 'arrow-flight-reuse-connection://?':
+ * indicating that the ticket can be redeemed on the service where the
+ * ticket was generated via a DoGet request.
+ * - A valid grpc URI (grpc://, grpc+tls://, grpc+unix://, etc.):
+ * indicating that the ticket can be redeemed on the service at the given
+ * URI via a DoGet request.
+ * - A valid HTTP URI (http://, https://, etc.):
+ * indicating that the client should perform a GET request against the
+ * given URI to retrieve the stream. The ticket should be empty
+ * in this case and should be ignored by the client. Cloud object storage
+ * can be utilized by presigned URLs or mediating the auth separately and
+ * returning the full URL (e.g. https://amzn-s3-demo-bucket.s3.us-west-2.amazonaws.com/...).
+ *
+ * We allow non-Flight URIs for the purpose of allowing Flight services to indicate that
+ * results can be downloaded in formats other than Arrow (such as Parquet) or to allow
+ * direct fetching of results from a URI to reduce excess copying and data movement.
+ * In these cases, the following conventions should be followed by servers and clients:
+ *
+ * - Unless otherwise specified by the 'Content-Type' header of the response,
+ * a client should assume the response is using the Arrow IPC Streaming format.
+ * Usage of an IANA media type like 'application/octet-stream' should be assumed to
+ * be using the Arrow IPC Streaming format.
+ * - The server may allow the client to choose a specific response format by
+ * specifying an 'Accept' header in the request, such as 'application/vnd.apache.parquet'
+ * or 'application/vnd.apache.arrow.stream'. If multiple types are requested and
+ * supported by the server, the choice of which to use is server-specific. If
+ * none of the requested content-types are supported, the server may respond with
+ * either 406 (Not Acceptable) or 415 (Unsupported Media Type), or successfully
+ * respond with a different format that it does support along with the correct
+ * 'Content-Type' header.
+ *
+ * Note: new schemes may be proposed in the future to allow for more flexibility based
+ * on community requests.
*/
message Location {
string uri = 1;