I am very interested personally in the benchmarks - the claim is 20x which
seems like quite a lot.

One thing I would suggest is that you consider using Arrow directly in your
JSON API.  We found that the fastest way to send datasets over the wire in
JSON is to base-64 encode the binary data and send that - the Arrow
JavaScript client has support for reading datasets of this form.
Potentially your slowdown isn't going to be getting data from the database
but rather getting the data to the client and manipulating it there.
Gzipped base-64 encoded binary data over JSON is a win in several
dimensions over sending raw sequences of maps in JSON form.
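
To make the transport idea concrete, here is a minimal sketch in Python using
only the standard library.  The packed integers are a stand-in for a real
Arrow IPC stream, and the function names and the "encoding"/"data" field
names are made up for illustration; they are not part of any Arrow API.

```python
import base64
import gzip
import json
import struct


def encode_payload(binary: bytes) -> str:
    """Gzip the bytes, base-64 encode them, and wrap them in a JSON object."""
    compressed = gzip.compress(binary)
    return json.dumps({
        "encoding": "gzip+base64",  # hypothetical field name
        "data": base64.b64encode(compressed).decode("ascii"),
    })


def decode_payload(message: str) -> bytes:
    """Reverse encode_payload: parse JSON, base-64 decode, then gunzip."""
    body = json.loads(message)
    return gzip.decompress(base64.b64decode(body["data"]))


# Stand-in for an Arrow IPC stream: one column of packed 64-bit integers.
values = list(range(10_000))
column_bytes = struct.pack(f"<{len(values)}q", *values)

# The same column sent as one binary blob and as a sequence of maps.
binary_message = encode_payload(column_bytes)
rows_message = json.dumps([{"id": v} for v in values])

# The binary form survives the round trip byte for byte.
assert decode_payload(binary_message) == column_bytes
print(len(binary_message), len(rows_message))
```

The printed sizes are only a rough feel for this toy column, not a
benchmark; real results depend on the data and on whether the HTTP layer
already compresses responses.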

Specifically for the JVM we have a dataframe system that works both server
<https://github.com/techascent/tech.ml.dataset> and client side
<https://github.com/cnuernber/tmdjs> with the transport taken care of.  It
has SQL bindings similar to the Arrow SQL system, although those bindings
are in raw form, meaning they don't federate or anything like that; they
are just dataframes directly on top of JDBC.  Arrow format just makes more
sense for sending/retrieving bulk data, especially when you compress it, so
it would be great to see it supported at the driver level, e.g. have the
postgres driver itself sending/taking data in Arrow form.

Anyway, I would expect Arrow IPC with compression to be much faster than
raw remote JDBC.  I wouldn't expect it to be faster than colocated JDBC as
it sits on top of JDBC.  I hope the Arrow SQL pathways encourage DB vendors
to support Arrow at a low level in their drivers, as it is IMHO a very cheap
way engineering-wise to get great perf.  I would rather get query results
in Arrow form and I would rather upload bulk inserts and bulk prepared
queries to the server in Arrow form.

On Fri, Feb 25, 2022 at 6:20 PM Gavin Ray <[email protected]> wrote:

> Got it, thank you so much.
>
> With that in mind, I think I will do some experiments using the Derby demo
> FlightSQL server
> and the JDBC driver in the PR to wrap standard JDBC database connections.
>
> If anyone is interested, I could post some JMH benchmarks comparing
> queries using JDBC
> and FlightSQL JDBC on H2/HSQL if I do this?
>
> On Fri, Feb 25, 2022 at 8:09 PM James Duong <[email protected]>
> wrote:
>
>>
>> On Fri, Feb 25, 2022 at 4:55 PM Gavin Ray <[email protected]> wrote:
>>
>>> to build a FlightSQL producer that delegates to other databases
>>>> (possibly using JDBC?).
>>>> Then building a JSON-based API or service on top of the
>>>> FlightSqlClient.
>>>>
>>>
>>> Yes, exactly =D
>>>
>>> Thank you for the detailed breakdown, that makes things very clear.
>>> In this case, the JSON API and FlightSQL server would likely be the same
>>> process, so I think your scenario #1 would be my answer.
>>>
>>> You also have the option of using the Arrow JDBC driver to get the
>>>> benefit of the Arrow Flight protocol but still write your code using JDBC.
>>>>
>>>
>>> I have been following that PR on Github, it's very exciting.
>>> One thing I am not sure I understand though -- with the JDBC driver for
>>> Arrow, there still needs to be a FlightSQL Server talking to the database
>>> right?
>>>
>> You're correct. You need a FlightSQL server fronting whatever database
>> you'd like to expose to use it with the Arrow JDBC driver. Ideally you
>> could co-locate the FlightSQL Server with the database (which would
>> effectively make remote calls implemented using Arrow Flight rather than
>> the database's proprietary protocol).
>>
>>>
>>> So if I understand it, with the FlightSQL JDBC driver, it takes the
>>> place of the FlightSQL Client and the request flow would be something like:
>>>
>>>     Client <--> JSON API <--> FlightSQL JDBC Driver <--> FlightSQL Server <--> Database
>>>
>>> And to connect the FlightSQL Server to the Database would require
>>> wrapping it in
>>> JDBC (unless some native/direct wire protocol implementation was written
>>> for each DB) right?
>>>
>> Yes
>>
>>>
>>> Is this more performant than querying through regular JDBC? (Maybe
>>> because of Arrow's format?)
>>>
>> Potentially more performant due to the Arrow Flight protocol.
>>
>>>
>>>
>>> On Fri, Feb 25, 2022 at 7:33 PM James Duong <[email protected]>
>>> wrote:
>>>
>>>> Hi Gavin,
>>>>
>>>> If I'm understanding correctly, what you're thinking of is to build a
>>>> FlightSQL producer that delegates to other databases (possibly using 
>>>> JDBC?).
>>>> Then building a JSON-based API or service on top of the
>>>> FlightSqlClient.
>>>>
>>>> There are two potentially remote calls happening here:
>>>> JSON API (FlightSqlClient) -> FlightSqlServer
>>>> FlightSqlServer -> each database being federated to.
>>>>
>>>> In this scenario, the benefits of Flight vary depending on your network
>>>> topology:
>>>> 1. If the JSON API and FlightSqlServer are co-located there is little
>>>> time being spent on the network sending data through Flight, so it is not
>>>> very beneficial.
>>>> 2. If the JSON API is remote and FlightSqlServer is co-located with
>>>> federated databases, the majority of the network transmission will be using
>>>> Flight so you should see some performance benefits.
>>>> 3. If the JSON API, FlightSqlServer, and databases are all remote from
>>>> each other, you _might_ see benefits to Flight.
>>>>
>>>> If you instead built your JSON app to federate queries directly to each
>>>> database you have one remote call happening
>>>> JSON API (JDBC) -> federated database
>>>> This might be faster or slower depending on the network conditions
>>>> between the JSON API and the federated database.
>>>> The above would use the JDBC driver's protocol which may not perform as
>>>> well as Flight.
>>>>
>>>> You also have the option of using the Arrow JDBC driver to get the
>>>> benefit of the Arrow Flight protocol but still write your code using JDBC.
>>>>
>>>> On Fri, Feb 25, 2022 at 3:59 PM Gavin Ray <[email protected]>
>>>> wrote:
>>>>
>>>>> Excuse me if this question seems a bit silly, but I'm wondering
>>>>> whether it would
>>>>> make sense to use FlightSQL to power a JSON API that talks to multiple
>>>>> databases.
>>>>>
>>>>> I know that the docs around Flight/FlightSQL say that there is a marked
>>>>> performance improvement over ODBC/JDBC, but I assume this is only if
>>>>> the service
>>>>> sending the data over Flight isn't using one of these to interact with
>>>>> the
>>>>> datasource, right?
>>>>>
>>>>> If I have a JVM application using JDBC and sending the responses as
>>>>> JSON, would
>>>>> it still make sense to look towards implementing a FlightSQL Server
>>>>> because of
>>>>> the ability to distribute operations across multiple instances and it
>>>>> being
>>>>> language-agnostic? Or would I be losing most of the benefits here?
>>>>>
>>>>> Not familiar with the Arrow format and project as a whole, so still
>>>>> trying to
>>>>> wrap my head around things, sorry!
>>>>>
>>>>> Thank you =)
>>>>> Gavin Ray
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> *James Duong*
>>>> Lead Software Developer
>>>> Bit Quill Technologies Inc.
>>>> Direct: +1.604.562.6082 | [email protected]
>>>> https://www.bitquilltech.com
>>>>
>>>> This email message is for the sole use of the intended recipient(s) and
>>>> may contain confidential and privileged information.  Any unauthorized
>>>> review, use, disclosure, or distribution is prohibited.  If you are not the
>>>> intended recipient, please contact the sender by reply email and destroy
>>>> all copies of the original message.  Thank you.
>>>>
>>>
>>
>
