hi all — I was just catching up on e-mail threads and wanted to give a few
historical comments on this.

When we were assembling the Arrow PMC and committing to do the project in
2015, standardizing Arrow-over-REST was always on the TODO list. At that
time we didn't have the IPC protocol yet, so that was the fundamental
design/engineering work that had to happen first. I agree that having
example code and well-documented patterns for using REST+Arrow in
production would make it easier for people to adopt Arrow for transport in
their systems, and it would have been better to do this years ago to help
onboard users into the ecosystem faster (and make the "getting started"
part of this less of a DIY affair).

When Jacques and I did the original design / prototyping for Flight-on-gRPC
(2018), the goal was not to present it as "the Preferred Way" to do network
transport (which could steer people away from using HTTP directly, though
perhaps that was an unintended consequence of our effort to develop and
promote Flight), but rather to establish generic patterns for building
distributed Arrow services with gRPC and to optimize the serialization path
(i.e. avoiding the extra protobuf encoding/decoding steps that would be
present if you naively used gRPC and put the Arrow IPC format in a protobuf
message as a blob).

In any case, I think having a "Rosetta stone"-type setup of starter code
for building HTTP services that send and receive Arrow would be a help to
developers/users who want to adopt Arrow in their systems.

Thanks
Wes

On Wed, Dec 6, 2023 at 1:30 PM Ian Cook <ianmc...@apache.org> wrote:

> I just remembered that there is an unused "Arrow Experiments" repo [1]
> which Wes created a few years ago [2]. That seems like a more
> appropriate place to open PRs like this one. If there are no
> objections, I will start using that repo for these Arrow-over-HTTP
> PRs.
>
> [1] https://github.com/apache/arrow-experiments
> [2] https://lists.apache.org/thread/cw14s874pwplzf9ycnvfwtwq0xq17npg
>
> Ian
>
> On Wed, Dec 6, 2023 at 1:45 PM Ian Cook <ianmc...@apache.org> wrote:
> >
> > Antoine,
> >
> > Thank you for taking a look. I agree—these are basic examples intended
> > to prove the concept and answer fundamental questions. Next I intend
> > to expand the set of examples to cover more complex cases.
> >
> > > This might necessitate some kind of framing layer, or a
> > > standardized delimiter.
> >
> > I am interested to hear more perspectives on this. My perspective is
> > that we should recommend using HTTP conventions to keep clean
> > separation between the Arrow-formatted binary data payloads and the
> > various application-specific fields. This can be achieved by encoding
> > application-specific fields in URI paths, query parameters, headers,
> > or separate parts of multipart/form-data messages.
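> >
> > For example, here is a minimal sketch of that pattern using Flask and
> > pyarrow (the endpoint, query parameter, and header names below are just
> > illustrative, not part of any proposal):
> >
> >     import pyarrow as pa
> >     from flask import Flask, Response, request
> >
> >     app = Flask(__name__)
> >
> >     @app.route("/query")
> >     def query():
> >         # Application-specific input arrives as a query parameter.
> >         limit = int(request.args.get("limit", 10))
> >         table = pa.table({"id": list(range(limit))})  # placeholder result
> >
> >         # The response body is nothing but an Arrow IPC stream.
> >         sink = pa.BufferOutputStream()
> >         with pa.ipc.new_stream(sink, table.schema) as writer:
> >             writer.write_table(table)
> >
> >         return Response(
> >             sink.getvalue().to_pybytes(),
> >             mimetype="application/vnd.apache.arrow.stream",
> >             # Application-specific output goes in headers, not the body.
> >             headers={"X-Query-Elapsed-Ms": "42"},
> >         )
> >
> > A client can then read the body directly with pa.ipc.open_stream() and
> > pick up the application-specific fields from the response headers.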
> >
> > Ian
> >
> > > On Wed, Dec 6, 2023 at 1:24 PM Antoine Pitrou <anto...@python.org> wrote:
> > >
> > >
> > > Hi,
> > >
> > > While this looks like a nice start, I would expect more precise
> > > recommendations for writing non-trivial services. In particular, one
> > > question is how to send both an application-specific POST request and
> > > an Arrow stream, or an application-specific GET response and an Arrow
> > > stream. This might necessitate some kind of framing layer, or a
> > > standardized delimiter.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > >
> > > On 05/12/2023 at 21:10, Ian Cook wrote:
> > > > This is a continuation of the discussion entitled "[DISCUSS]
> > > > Protocol for exchanging Arrow data over REST APIs". See the previous
> > > > messages at
> > > > https://lists.apache.org/thread/vfz74gv1knnhjdkro47shzd1z5g5ggnf.
> > > >
> > > > To inform this discussion, I created some basic Arrow-over-HTTP
> > > > client and server examples here:
> > > > https://github.com/apache/arrow/pull/39081
> > > >
> > > > My intention is to expand and improve this set of examples (with
> > > > your help) until they reflect a set of conventions that we are
> > > > comfortable documenting as recommendations.
> > > >
> > > > Please take a look and add comments / suggestions in the PR.
> > > >
> > > > Thanks,
> > > > Ian
> > > >
> > > > On Tue, Nov 21, 2023 at 1:35 PM Dewey Dunnington
> > > > <de...@voltrondata.com.invalid> wrote:
> > > >
> > > >> I also think a set of best practices for Arrow over HTTP would be
> > > >> a valuable resource for the community... even if it never becomes a
> > > >> specification of its own, it will be beneficial for API developers
> > > >> and consumers of those APIs to have a place to look to understand
> > > >> how Arrow can help improve throughput/latency/maybe other things.
> > > >> Possibly something like httpbin.org but for requests/responses that
> > > >> use Arrow would be helpful as well. Thank you Ian for leading this
> > > >> effort!
> > > >>
> > > >> It has mostly been covered already, but in the (ubiquitous)
> > > >> situation where a response contains some schema/table and some
> > > >> non-schema/table information there is some tension between
> > > >> throughput (best served by a JSON response plus one or more IPC
> > > >> stream responses) and latency (best served by a single HTTP
> > > >> response? JSON? IPC with metadata/header?). In addition to
> > > >> Antoine's list, I would add:
> > > >>
> > > >> - How to serve the same table in multiple requests (e.g., to
> > > >> saturate a network connection, or because separate worker nodes are
> > > >> generating results anyway).
> > > >> - How to inline a small schema/table into a single request with
> > > >> other metadata (I have seen this done as base64-encoded IPC in
> > > >> JSON, but perhaps there is a better way; a rough sketch follows
> > > >> below)
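> > > >>
> > > >> That base64-in-JSON pattern can be as simple as the following
> > > >> sketch with pyarrow (the JSON field names are made up for
> > > >> illustration):
> > > >>
> > > >>     import base64, json
> > > >>     import pyarrow as pa
> > > >>
> > > >>     def table_to_json(table, extra_metadata):
> > > >>         # Serialize the table as an Arrow IPC stream, then base64 it
> > > >>         # so it can sit next to other fields in one JSON document.
> > > >>         sink = pa.BufferOutputStream()
> > > >>         with pa.ipc.new_stream(sink, table.schema) as writer:
> > > >>             writer.write_table(table)
> > > >>         return json.dumps({
> > > >>             "metadata": extra_metadata,
> > > >>             "arrow_ipc_base64": base64.b64encode(
> > > >>                 sink.getvalue().to_pybytes()).decode("ascii"),
> > > >>         })
> > > >>
> > > >>     def table_from_json(doc):
> > > >>         # Reverse the process on the receiving side.
> > > >>         payload = json.loads(doc)
> > > >>         buf = base64.b64decode(payload["arrow_ipc_base64"])
> > > >>         return pa.ipc.open_stream(buf).read_all(), payload["metadata"]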
> > > >>
> > > >> If anybody is interested in experimenting, I repurposed a previous
> > > >> experiment I had as a Flask app that can stream IPC to a client:
> > > >>
> > > >> https://github.com/paleolimbot/2023-11-21_arrow-over-http-scratchpad/pull/1/files
> > > >>
> > > >>> - recommendations about compression
> > > >>
> > > >> Just a note that there is also Content-Encoding: gzip (for
> > > >> consumers like Arrow JS that don't currently support buffer
> > > >> compression but that can leverage the facilities of the browser/HTTP
> > > >> library).
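> > > >>
> > > >> As a sketch, the server-side part of that is just standard HTTP
> > > >> content encoding, e.g.:
> > > >>
> > > >>     import gzip
> > > >>
> > > >>     # Placeholder for a serialized Arrow IPC stream.
> > > >>     ipc_stream_bytes = b"..."
> > > >>     # Serve with: Content-Type: application/vnd.apache.arrow.stream
> > > >>     #             Content-Encoding: gzip
> > > >>     body = gzip.compress(ipc_stream_bytes)
> > > >>
> > > >> Many HTTP servers and reverse proxies can also apply this
> > > >> compression automatically.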
> > > >>
> > > >> Cheers!
> > > >>
> > > >> -dewey
> > > >>
> > > >>
> > > >> On Mon, Nov 20, 2023 at 8:30 PM Sutou Kouhei <k...@clear-code.com> wrote:
> > > >>>
> > > >>> Hi,
> > > >>>
> > > >>>> But how is the performance?
> > > >>>
> > > >>> It's faster than the original JSON based API.
> > > >>>
> > > >>> I implemented Apache Arrow support for a C# client, so I
> > > >>> measured only with Apache Arrow C#, but the Apache Arrow
> > > >>> based API is faster than the JSON based API.
> > > >>>
> > > >>>> Have you measured the throughput of this approach to see
> > > >>>> if it is comparable to using Flight SQL?
> > > >>>
> > > >>> Sorry, I didn't measure the throughput. In this case, the
> > > >>> elapsed time of one request/response pair is more important
> > > >>> than throughput. And it was faster than the JSON based API,
> > > >>> with good enough performance.
> > > >>>
> > > >>> I couldn't compare to a Flight SQL based approach because
> > > >>> Groonga doesn't support Flight SQL yet.
> > > >>>
> > > >>>> Is this approach able to saturate a fast network
> > > >>>> connection?
> > > >>>
> > > >>> I think that we can't measure this with the Groonga case,
> > > >>> because in the Groonga case data isn't sent continuously.
> > > >>> Here is one of the request patterns:
> > > >>>
> > > >>> 1. Groonga has log data partitioned by day
> > > >>> 2. Groonga does full text search against one partition (2023-11-01)
> > > >>> 3. Groonga sends the result to client as Apache Arrow
> > > >>>     streaming format record batches
> > > >>> 4. Groonga does full text search against the next partition
> > > >>>    (2023-11-02)
> > > >>> 5. Groonga sends the result to client as Apache Arrow
> > > >>>     streaming format record batches
> > > >>> 6. ...
> > > >>>
> > > >>> In this case, the result data aren't being sent continuously
> > > >>> (search -> send -> search -> send -> ...), so it doesn't
> > > >>> saturate a fast network connection.
> > > >>>
> > > >>> (3. and 4. could run in parallel, but that's not implemented yet.)
> > > >>>
> > > >>> If we optimize this approach, it may be able to saturate a fast
> > > >>> network connection.
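> > > >>>
> > > >>> To illustrate only (this is not Groonga's actual implementation),
> > > >>> in Python one way to do this kind of incremental send is to emit
> > > >>> the IPC stream message by message as each partition finishes,
> > > >>> assuming no dictionary-encoded columns:
> > > >>>
> > > >>>     import pyarrow as pa
> > > >>>
> > > >>>     # IPC end-of-stream marker
> > > >>>     ARROW_EOS = b"\xff\xff\xff\xff\x00\x00\x00\x00"
> > > >>>
> > > >>>     def ipc_stream(schema, batches):
> > > >>>         # batches: an iterator yielding a RecordBatch per partition
> > > >>>         # as each full text search finishes (search -> send -> ...)
> > > >>>         yield schema.serialize().to_pybytes()     # schema message
> > > >>>         for batch in batches:
> > > >>>             yield batch.serialize().to_pybytes()  # one message each
> > > >>>         yield ARROW_EOS
> > > >>>
> > > >>> With an HTTP framework that accepts a generator as the response
> > > >>> body, each yielded piece is typically sent to the client using
> > > >>> Transfer-Encoding: chunked.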
> > > >>>
> > > >>>> And what about the case in which the server wants to begin
> > > >>>> sending batches to the client before the total number of result
> > > >>>> batches / records is known?
> > > >>>
> > > >>> Ah, sorry. I forgot to explain that case. Groonga uses the
> > > >>> above approach for it.
> > > >>>
> > > >>>> - server should not return the result data in the body of a
> > > >>>> response to a query request; instead server should return a
> > > >>>> response body that gives URI(s) at which clients can GET the
> > > >>>> result data
> > > >>>
> > > >>> If we want to do this, the standard "Location" HTTP header may
> > > >>> be suitable.
> > > >>>
> > > >>>> - transmit result data in chunks (Transfer-Encoding: chunked),
> > > >>>> with recommendations about chunk size
> > > >>>
> > > >>> Ah, sorry. I forgot to explain this case too. Groonga uses
> > > >>> "Transfer-Encoding: chunked". But the recommended chunk size may
> > > >>> be case-by-case... If a server can produce data quickly enough, a
> > > >>> larger chunk size may be faster. Otherwise, a larger chunk size
> > > >>> may be slower.
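> > > >>>
> > > >>> For example, a server that already has the serialized IPC bytes
> > > >>> in hand could make the chunk size an explicit knob (this is only
> > > >>> a sketch, not what Groonga does):
> > > >>>
> > > >>>     def http_chunks(ipc_bytes, chunk_size=256 * 1024):
> > > >>>         # Re-chunk a serialized Arrow IPC stream so each HTTP
> > > >>>         # chunk has a chosen size; the best value depends on the
> > > >>>         # workload.
> > > >>>         for offset in range(0, len(ipc_bytes), chunk_size):
> > > >>>             yield ipc_bytes[offset:offset + chunk_size]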
> > > >>>
> > > >>>> - support range requests (Accept-Ranges: bytes) to allow
> > > >>>> clients to request result ranges (or not?)
> > > >>>
> > > >>> In the Groonga case, it's not supported, because Groonga drops
> > > >>> the result after one request/response pair. Groonga can't return
> > > >>> only the specified range of the result after the response has
> > > >>> been returned.
> > > >>>
> > > >>>> - recommendations about compression
> > > >>>
> > > >>> In the case that the network is the bottleneck, LZ4 or
> > > >>> Zstandard compression will improve total performance.
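> > > >>>
> > > >>> For example, with pyarrow the IPC writer can compress the record
> > > >>> batch buffers like this (a sketch with placeholder data; "zstd"
> > > >>> could be "lz4" instead):
> > > >>>
> > > >>>     import pyarrow as pa
> > > >>>
> > > >>>     table = pa.table({"x": list(range(1000))})  # placeholder data
> > > >>>     options = pa.ipc.IpcWriteOptions(compression="zstd")
> > > >>>     sink = pa.BufferOutputStream()
> > > >>>     with pa.ipc.new_stream(sink, table.schema,
> > > >>>                            options=options) as writer:
> > > >>>         writer.write_table(table)
> > > >>>     compressed_ipc = sink.getvalue().to_pybytes()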
> > > >>>
> > > >>>> - recommendations about TCP receive window size
> > > >>>> - recommendation to open multiple TCP connections on very fast
> > > >>>> networks (e.g. >25 Gbps) where a CPU thread could be the
> > > >>>> throughput bottleneck
> > > >>>
> > > >>> HTTP/3 may be better for these cases.
> > > >>>
> > > >>>
> > > >>> Thanks,
> > > >>> --
> > > >>> kou
> > > >>>
> > > >>> In <CANa9GTHuXBBkn-=uevmbr2edmiyquunc6qdqdvh7gpeps9c...@mail.gmail.com>
> > > >>>   "Re: [DISCUSS] Protocol for exchanging Arrow data over REST APIs"
> > > >>>   on Sat, 18 Nov 2023 13:51:53 -0500,
> > > >>>   Ian Cook <ianmc...@apache.org> wrote:
> > > >>>
> > > >>>> Hi Kou,
> > > >>>>
> > > >>>> I think it is too early to make a specific proposal. I hope to
> > > >>>> use this discussion to collect more information about existing
> > > >>>> approaches. If several viable approaches emerge from this
> > > >>>> discussion, then I think we should make a document listing them,
> > > >>>> like you suggest.
> > > >>>>
> > > >>>> Thank you for the information about Groonga. This type of
> > > >>>> straightforward HTTP-based approach would work in the context of
> > > >>>> a REST API, as I understand it.
> > > >>>>
> > > >>>> But how is the performance? Have you measured the throughput of
> > > >>>> this approach to see if it is comparable to using Flight SQL? Is
> > > >>>> this approach able to saturate a fast network connection?
> > > >>>>
> > > >>>> And what about the case in which the server wants to begin
> > > >>>> sending batches to the client before the total number of result
> > > >>>> batches / records is known? Would this approach work in that
> > > >>>> case? I think so but I am not sure.
> > > >>>>
> > > >>>> If this HTTP-based type of approach is sufficiently performant
> > > >>>> and it works in a sufficient proportion of the envisioned use
> > > >>>> cases, then perhaps the proposed spec / protocol could be based
> > > >>>> on this approach. If so, then we could refocus this discussion on
> > > >>>> which best practices to incorporate / recommend, such as:
> > > >>>> - server should not return the result data in the body of a
> > > >>>> response to a query request; instead server should return a
> > > >>>> response body that gives URI(s) at which clients can GET the
> > > >>>> result data
> > > >>>> - transmit result data in chunks (Transfer-Encoding: chunked),
> > > >>>> with recommendations about chunk size
> > > >>>> - support range requests (Accept-Ranges: bytes) to allow clients
> > > >>>> to request result ranges (or not?)
> > > >>>> - recommendations about compression
> > > >>>> - recommendations about TCP receive window size
> > > >>>> - recommendation to open multiple TCP connections on very fast
> > > >>>> networks (e.g. >25 Gbps) where a CPU thread could be the
> > > >>>> throughput bottleneck
> > > >>>>
> > > >>>> On the other hand, if the performance and functionality of this
> > > >>>> HTTP-based type of approach is not sufficient, then we might
> > > >>>> consider fundamentally different approaches.
> > > >>>>
> > > >>>> Ian
> > > >>
> > > >
>
