This is an automated email from the ASF dual-hosted git repository.

zeroshade pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/main by this push:
     new 067fd2a2c6 GH-46193: [Flight][Format] Extend Flight Location URI Semantics (#46194)
067fd2a2c6 is described below

commit 067fd2a2c6e54d33b9ae8a3324f59bebe960d485
Author: Matt Topol <[email protected]>
AuthorDate: Mon May 5 11:44:11 2025 -0400

    GH-46193: [Flight][Format] Extend Flight Location URI Semantics (#46194)
    
    ### Rationale for this change
    Updating the documentation in Flight.proto and Flight.rst to extend the
    semantics of the allowed Flight location URIs.
    
    ### What changes are included in this PR?
    Just documentation changes. Currently, none of the Arrow Flight
    implementations actually implement handling of the returned URIs beyond
    possibly parsing them and wrapping in a `Location` structure. It is left
    to the consumer to implement the logic of whether to re-use the same
    client or spin up a new client with the new location etc. to perform the
    `DoGet` request against. As such, there wasn't a need to make any
    code/library changes to accommodate this as part of this PR.
    
    
    * GitHub Issue: #46193
    
    ---------
    
    Co-authored-by: Sutou Kouhei <[email protected]>
    Co-authored-by: Ian Cook <[email protected]>
    Co-authored-by: Raúl Cumplido <[email protected]>
---
 docs/source/format/Flight.rst | 53 +++++++++++++++++++++++++++++++++++++++++++
 format/Flight.proto           | 37 ++++++++++++++++++++++++++++--
 2 files changed, 88 insertions(+), 2 deletions(-)

diff --git a/docs/source/format/Flight.rst b/docs/source/format/Flight.rst
index aac979cf75..7355a698d0 100644
--- a/docs/source/format/Flight.rst
+++ b/docs/source/format/Flight.rst
@@ -333,6 +333,13 @@ schemes for the given transports:
 +----------------------------+--------------------------------+
 | (reuse connection)         | arrow-flight-reuse-connection: |
 +----------------------------+--------------------------------+
+| HTTP (1)                   | http: or https:                |
++----------------------------+--------------------------------+
+
+Notes:
+
+* \(1) See :ref:`flight-extended-uris` for semantics when using
+   http/https as the transport. The URI should be accessible via a GET request.
 
 Connection Reuse
 ----------------
@@ -360,6 +367,52 @@ string, so the obvious candidates are not compatible.  The chosen
 representation can be parsed by both implementations, as well as Go's
 ``net/url`` and Python's ``urllib.parse``.
 
+.. _flight-extended-uris:
+
+Extended Location URIs
+----------------------
+
+In addition to alternative transports, a server may also return
+URIs that reference an external service or object storage location.
+This can be useful in cases where intermediate data is cached as
+Apache Parquet files on cloud storage or is otherwise accessible
+via an HTTP service. In these scenarios, it is more efficient to be
+able to provide a URI where the client may simply download the data
+directly, rather than requiring a Flight service to read it back into
+memory and serve it from a ``DoGet`` request.
+
+To avoid the complexities of Flight clients having to implement support
+for multiple different cloud storage vendors (e.g. AWS S3, Google Cloud),
+we extend the URIs to only allow an HTTP/HTTPS URI where the client can
+perform a simple GET request to download the data. Authentication can be
+handled either by negotiating externally to the Flight protocol or by the
+server sending a presigned URL that the client can make a GET request to.
+This should be supported by all current major cloud storage vendors, meaning
+only the server needs to know the semantics of the underlying object store APIs.
+
+When using an extended location URI, the client should ignore any
+value in the ``Ticket`` field of the ``FlightEndpoint``. The
+``Ticket`` is only used for identifying data in the context of a
+Flight service, and is not needed when the client is directly
+downloading data from an external service.
+
+Clients should assume that, unless otherwise specified, the data is
+being returned using the :ref:`format-ipc` just as it would
+via a ``DoGet`` call. If the returned ``Content-Type`` header is a generic
+media type such as ``application/octet-stream``, the client should still assume
+it is an Arrow IPC stream. For other media types, such as Apache Parquet,
+the server should use the appropriate IANA Media Type that a client
+would recognize.
+
+Finally, the server may also allow the client to choose what format the
+data is returned in by respecting the ``Accept`` header in the request.
+If multiple formats are requested and supported, the choice of which to
+use is server-specific. If none of the requested content-types are
+supported, the server may respond with 406 (Not Acceptable) or
+415 (Unsupported Media Type), or successfully respond with a different
+format that it does support, along with the correct ``Content-Type``
+header.
+
 Error Handling
 ==============
 
diff --git a/format/Flight.proto b/format/Flight.proto
index f2b0f889cf..690031ff00 100644
--- a/format/Flight.proto
+++ b/format/Flight.proto
@@ -426,8 +426,41 @@ message Ticket {
 }
 
 /*
- * A location where a Flight service will accept retrieval of a particular
- * stream given a ticket.
+ * A location to retrieve a particular stream from. This URI should be one of
+ * the following:
+ *  - An empty string or the string 'arrow-flight-reuse-connection://?':
+ *    indicating that the ticket can be redeemed on the service where the
+ *    ticket was generated via a DoGet request.
+ *  - A valid grpc URI (grpc://, grpc+tls://, grpc+unix://, etc.):
+ *    indicating that the ticket can be redeemed on the service at the given
+ *    URI via a DoGet request.
+ *  - A valid HTTP URI (http://, https://, etc.):
+ *    indicating that the client should perform a GET request against the
+ *    given URI to retrieve the stream. The ticket should be empty
+ *    in this case and should be ignored by the client. Cloud object storage
+ *    can be utilized by presigned URLs or mediating the auth separately and
+ *    returning the full URL (e.g. https://amzn-s3-demo-bucket.s3.us-west-2.amazonaws.com/...).
+ *
+ * We allow non-Flight URIs for the purpose of allowing Flight services to indicate that
+ * results can be downloaded in formats other than Arrow (such as Parquet) or to allow
+ * direct fetching of results from a URI to reduce excess copying and data movement.
+ * In these cases, the following conventions should be followed by servers and clients:
+ *
+ *  - Unless otherwise specified by the 'Content-Type' header of the response,
+ *    a client should assume the response is using the Arrow IPC Streaming format.
+ *    Usage of an IANA media type like 'application/octet-stream' should be assumed to
+ *    be using the Arrow IPC Streaming format.
+ *  - The server may allow the client to choose a specific response format by
+ *    specifying an 'Accept' header in the request, such as 'application/vnd.apache.parquet'
+ *    or 'application/vnd.apache.arrow.stream'. If multiple types are requested and
+ *    supported by the server, the choice of which to use is server-specific. If
+ *    none of the requested content-types are supported, the server may respond with
+ *    either 406 (Not Acceptable) or 415 (Unsupported Media Type), or successfully
+ *    respond with a different format that it does support along with the correct
+ *    'Content-Type' header.
+ *
+ * Note: new schemes may be proposed in the future to allow for more flexibility based
+ * on community requests.
  */
 message Location {
   string uri = 1;

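Since the commit message notes that handling of returned URIs is left to the consumer, the following is a minimal, non-authoritative sketch of how a client might apply the conventions documented above. It is not part of the commit or of any Arrow Flight library. The function names `plan_fetch` and `response_format` and their return labels are hypothetical and chosen only for illustration; the sketch uses only the Python standard library.

```python
from urllib.parse import urlsplit

# Media types named in the Flight.proto comment above.
ARROW_STREAM = "application/vnd.apache.arrow.stream"
PARQUET = "application/vnd.apache.parquet"


def plan_fetch(uri: str) -> str:
    """Map a Location URI to a (hypothetical) action label.

    'reuse'    -> redeem the ticket via DoGet on the current connection
    'doget'    -> redeem the ticket via DoGet on the service at the URI
    'http-get' -> ignore the ticket and issue an HTTP GET against the URI
    """
    if uri == "":
        return "reuse"
    scheme = urlsplit(uri).scheme.lower()
    if scheme == "arrow-flight-reuse-connection":
        return "reuse"
    if scheme == "grpc" or scheme.startswith("grpc+"):
        return "doget"
    if scheme in ("http", "https"):
        return "http-get"
    # Assumption: this sketch rejects any scheme it does not recognize.
    raise ValueError(f"unsupported location scheme: {scheme!r}")


def response_format(content_type):
    """Interpret the Content-Type header of an HTTP GET response."""
    if content_type is None:
        # No header: assume an Arrow IPC stream, as with DoGet.
        return "arrow-ipc-stream"
    # Drop parameters such as '; charset=...' before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in ("", "application/octet-stream", ARROW_STREAM):
        return "arrow-ipc-stream"
    if media_type == PARQUET:
        return "parquet"
    # Any other media type is passed through for the caller to handle.
    return media_type
```

A caller would typically invoke `plan_fetch` on each `Location` of a `FlightEndpoint` and, for the `'http-get'` case, pass the response's `Content-Type` header to `response_format` before decoding the body.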