ODBC and JDBC do not specify a wire protocol. So, while the client APIs are 
definitely row-based, any particular driver could use a protocol that is based 
on Arrow data.

There is immense investment in ODBC and JDBC drivers, and they handle complex 
cases such as connection pooling, statement cancellation, authentication, and 
encryption. So if we are able to retrofit Arrow into existing drivers, it is a 
big win.

I know JDBC better than ODBC, and there is a convenient back-door that would 
allow a compliant driver to provide bulk Arrow data. The back-door is the 
Wrapper interface, implemented by ResultSet:

  Connection c = ...; // obtain a connection as usual
  Statement s = c.createStatement();
  ResultSet r = s.executeQuery("select * from myTable where x > 100");
  if (r.isWrapperFor(ArrowStream.class)) {
    ArrowStream a = r.unwrap(ArrowStream.class);
    // process data in bulk
  } else {
    // process rows one at a time
    while (r.next()) {
      …
    }
  }
  r.close();
  s.close();

I invented ArrowStream for the purposes of this example, but we could write 
such an interface as an extension to JDBC. A JDBC application could execute the 
above code against any JDBC driver, and non-compliant drivers would just fall 
back to the old row-at-a-time API.
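
As a rough sketch (the names and method signatures below are purely 
illustrative, not a concrete proposal), such an extension interface might 
look something like this, built on the Arrow Java vector classes:

  // Hypothetical extension interface; nothing like this exists in JDBC or
  // Arrow today. A driver that already holds columnar data would implement it.
  import org.apache.arrow.vector.VectorSchemaRoot;
  import org.apache.arrow.vector.types.pojo.Schema;

  public interface ArrowStream extends AutoCloseable {
    /** Arrow schema of the result set. */
    Schema getSchema();

    /** Returns the next batch of Arrow data, or null when exhausted. */
    VectorSchemaRoot nextBatch() throws java.sql.SQLException;
  }

A driver that cannot produce Arrow data would simply never report itself as a 
wrapper for this interface, so existing drivers are unaffected.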

I am co-author of the Avatica JDBC driver (part of Calcite) and we have long 
considered adding Arrow as a data format for our transport. With the above 
extension, we would be able to hand the Arrow data directly to the client 
application, rather than going through JDBC’s row-and-column-at-a-time API.
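
To illustrate the driver side (again just a sketch; the class name and field 
are invented for this example), a driver's ResultSet implementation could keep 
a reference to the Arrow data and expose it through the standard Wrapper 
methods:

  // Sketch of a driver ResultSet that can expose the hypothetical ArrowStream.
  // Declared abstract so the many remaining ResultSet methods are elided here.
  public abstract class ArrowCapableResultSet implements java.sql.ResultSet {
    // set when the transport delivered Arrow batches, null otherwise
    private final ArrowStream arrowStream;

    protected ArrowCapableResultSet(ArrowStream arrowStream) {
      this.arrowStream = arrowStream;
    }

    @Override
    public boolean isWrapperFor(Class<?> iface) {
      return iface.isInstance(this)
          || (arrowStream != null && iface.isInstance(arrowStream));
    }

    @Override
    public <T> T unwrap(Class<T> iface) throws java.sql.SQLException {
      if (iface.isInstance(this)) {
        return iface.cast(this);
      }
      if (arrowStream != null && iface.isInstance(arrowStream)) {
        return iface.cast(arrowStream);
      }
      throw new java.sql.SQLException("Not a wrapper for " + iface);
    }
  }

Row-at-a-time access would remain available for clients that never ask for 
the wrapper.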

I think other drivers might also find this a convenient way to get Arrow data 
directly to their clients.

Julian

> On Sep 29, 2020, at 12:29 AM, Sven Wagner-Boysen 
> <sven.wagner-boy...@signavio.com> wrote:
> 
> I think this is a great initiative.
> If I understand correctly, it would open up Arrow for many more use cases
> and allow, for example, connecting BI tools like Power BI, Tableau, etc. to
> DataFusion. I'll also try to make some time to support this.
> Thanks!
> Sven
> 
> On Mon, Sep 28, 2020 at 3:49 AM Andy Grove <andygrov...@gmail.com> wrote:
> 
>> I didn't get a chance yet to really read this thread in detail but I am
>> definitely very interested in this conversation and will make time this
>> week to add my thoughts.
>> 
>> Thanks,
>> 
>> Andy.
>> 
>> On Sun, Sep 27, 2020, 4:01 PM Adam Lippai <a...@rigo.sk> wrote:
>> 
>>> Hi Neville,
>>> 
>>> yes, my concern with common row-based DB APIs is that I use
>>> Arrow/Parquet for OLAP too.
>>> What https://turbodbc.readthedocs.io/en/latest/ (Python) or
>>> https://github.com/pacman82/odbc-api#state (Rust) does is read
>>> large blocks of data instead of processing rows one by one, but indeed,
>>> the ODBC and PostgreSQL wire protocols are still row-based.
>>> 
>>> Clickhouse is an interesting example, as it directly supports Arrow and
>>> Parquet *server-side* (I haven't tried it yet, I just read it in the docs).
>>> 
>>> Best regards,
>>> Adam Lippai
>>> 
>>> On Sun, Sep 27, 2020 at 11:24 PM Neville Dipale <nevilled...@gmail.com>
>>> wrote:
>>> 
>>>> Thanks for the feedback
>>>> 
>>>> My interest is mainly in the narrow use case of reading and writing
>>>> batch data, so I wouldn't want to deal with producing and consuming
>>>> rows per se.
>>>> Andy has worked on RDBC (https://github.com/tokio-rs/rdbc) for the
>>>> row-based or OLTP case,
>>>> and I'm considering something more suitable for the OLAP case.
>>>> 
>>>> @Wes I'll have a read through the Python DB API; I've also been looking
>>>> at JDBC, as well as how Apache Spark manages to get such good performance
>>>> from JDBC.
>>>> 
>>>> I haven't been an ODBC fan, but mainly because of historic struggles
>>>> with getting it to work on Linux envs where I don't have system control.
>>>> With that said, we could still support ODBC.
>>>> 
>>>> @Jorge, I have an implementation at rust-dataframe (
>>>> https://github.com/nevi-me/rust-dataframe/tree/master/src/io/sql/postgres)
>>>> which uses rust-postgres. However, I don't use the row-based API, as that
>>>> comes at a serialization cost (going from bytes > Rust types > Arrow).
>>>> I instead use the Postgres binary format (
>>>> https://github.com/nevi-me/rust-dataframe/blob/master/src/io/sql/postgres/reader.rs#L204
>>>> ).
>>>> That postgres module would be the starting point of such a separate crate.
>>>> 
>>>> For Postgres <> Arrow type conversions, I leverage 2 methods:
>>>> 
>>>> 1. When reading a table, I get the schema from the *information_schema*
>>>> system table
>>>> 2. When reading a query, I issue the query with a 1-row limit, and
>>>> convert the row's schema to an Arrow schema
>>>> 
>>>> @Adam I think async and pooling would be attainable, yes; if an
>>>> underlying SQL crate uses R2D2 for pooling, an API that supports that
>>>> could be provided.
>>>> 
>>>> In summary, I'm thinking along the lines of:
>>>> 
>>>> * A reader that takes connection parameters & a query or table
>>>> * The reader can handle partitioning if need be (similar to how Spark
>>>> does it)
>>>> * The reader returns a Schema, and can be iterated on to return data in
>>>> batches
>>>> 
>>>> * A writer that takes connection parameters and a table
>>>> * The writer writes batches to a table, and is able to write batches in
>>>> parallel
>>>> 
>>>> In the case of a hypothetical interface with columnar databases like
>>>> Clickhouse, we would be able to leverage materialising Arrow arrays
>>>> directly from columns, instead of the column-wise conversions that would
>>>> otherwise have to be performed from row-based APIs.
>>>> 
>>>> Neville
>>>> 
>>>> 
>>>> On Sun, 27 Sep 2020 at 22:08, Adam Lippai <a...@rigo.sk> wrote:
>>>> 
>>>>> One more universal approach is to use ODBC; this is a recent Rust
>>>>> conversation (with an example) on the topic:
>>>>> https://github.com/Koka/odbc-rs/issues/140
>>>>> 
>>>>> Honestly, I find the Python DB API too simple; all it provides is a
>>>>> row-by-row API. I miss four things:
>>>>> 
>>>>>   - Batched or bulk processing, both for data loading and dumping.
>>>>>   - Async support (Python has asyncio and async web frameworks, but no
>>>>>   async DB spec). SQLAlchemy async support is coming soon and there is
>>>>>   https://github.com/encode/databases
>>>>>   - Connection pooling (it's common to use TLS, and connection reuse
>>>>>   would be nice as TLS 1.3 is not here yet)
>>>>>   - Failover / load balancing support (this is connected to the
>>>>>   previous point)
>>>>> 
>>>>> Best regards,
>>>>> Adam Lippai
>>>>> 
>>>>> On Sun, Sep 27, 2020 at 9:57 PM Jorge Cardoso Leitão <
>>>>> jorgecarlei...@gmail.com> wrote:
>>>>> 
>>>>>> That would be awesome! I agree with this, and it would be really
>>>>>> useful, as it would leverage all the goodies that RDBMSs have w.r.t.
>>>>>> transactions, etc.
>>>>>> 
>>>>>> I would probably go for having the database-specifics outside of the
>>>>>> arrow project, so that they can be used by other folks beyond arrow,
>>>>>> and keep the arrow-specifics (i.e. the conversion from the specific
>>>>>> databases' formats to Arrow) as part of the arrow crate. Ideally, as
>>>>>> Wes wrote, with some standard to make it easier to handle different DBs.
>>>>>> 
>>>>>> I think that there are two layers: one is how to connect to a
>>>>>> database, the other is how to serialize/deserialize. AFAIK PEP 249
>>>>>> covers both layers, as it standardizes things like `connect` and
>>>>>> `tpc_begin`, as well as how things should be serialized to Python
>>>>>> objects (e.g. dates should be datetime.date). This split is done by
>>>>>> postgres for Rust <https://github.com/sfackler/rust-postgres>, as it
>>>>>> offers 5 crates:
>>>>>> * postgres-async
>>>>>> * postgres-sync (a blocking wrapper of postgres-async)
>>>>>> * postgres-types (to convert to native Rust <---- IMO this one is what
>>>>>> we want to offer in Arrow)
>>>>>> * postgres-TLS
>>>>>> * postgres-openssl
>>>>>> 
>>>>>> `postgres-sync` implements Iterator<Row> (`client.query`), and
>>>>>> postgres-async implements Stream<Row>.
>>>>>> 
>>>>>> One idea is to have an iterator/stream adapter, generic over T, that
>>>>>> yields RecordBatches. Implementations of this trait by different
>>>>>> providers would allow them to be used in Arrow and DataFusion.
>>>>>> 
>>>>>> Besides postgres, one idea is to pick the top from this list
>>>>>> <https://db-engines.com/en/ranking>:
>>>>>> 
>>>>>> * Oracle
>>>>>> * MySQL
>>>>>> * MsSQL
>>>>>> 
>>>>>> Another idea is to start by supporting SQLite, which is a good
>>>>>> development environment for working with relational databases.
>>>>>> 
>>>>>> Best,
>>>>>> Jorge
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Sun, Sep 27, 2020 at 4:22 AM Neville Dipale <
>>> nevilled...@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi Arrow developers
>>>>>>> 
>>>>>>> I would like to gauge the appetite for an Arrow SQL connector
>> that:
>>>>>>> 
>>>>>>> * Reads and writes Arrow data to and from SQL databases
>>>>>>> * Reads tables and queries into record batches, and writes batches
>>>>>>> to tables (either append or overwrite)
>>>>>>> * Leverages binary SQL formats where available (e.g. PostgreSQL format
>>>>>>> is relatively easy and well-documented)
>>>>>>> * Provides a batch interface that abstracts away the different
>>>>>>> database semantics, and exposes a RecordBatchReader (
>>>>>>> https://docs.rs/arrow/1.0.1/arrow/record_batch/trait.RecordBatchReader.html
>>>>>>> ), and perhaps a RecordBatchWriter
>>>>>>> * Resides in the Rust repo as either an arrow::sql module (like
>>>>>>> arrow::csv, arrow::json, arrow::ipc) or alternatively a separate crate
>>>>>>> in the workspace (*arrow-sql*?)
>>>>>>> 
>>>>>>> I would be able to contribute a Postgres reader/writer as a start.
>>>>>>> I could make this a separate crate, but to drive adoption I would
>>>>>>> prefer this living in Arrow; it can also remain up to date (sometimes
>>>>>>> we reorganise modules and end up breaking dependencies).
>>>>>>> 
>>>>>>> Also, being developed next to DataFusion could allow DF to
>> support
>>>> SQL
>>>>>>> databases, as this would be yet another datasource.
>>>>>>> 
>>>>>>> Some questions:
>>>>>>> * Should such a library support async, sync, or both IO methods?
>>>>>>> * Other than Postgres, what other databases would be interesting?
>>>>>>> Here I'm hoping that once we've established a suitable API, it could
>>>>>>> be easier to natively support more database types.
>>>>>>> 
>>>>>>> Potential concerns:
>>>>>>> 
>>>>>>> * Sparse database support
>>>>>>> It's a lot of effort to write database connectors, especially if
>>>>>>> starting from scratch (unlike with, say, JDBC). What if we end up
>>>>>>> supporting only 1 or 2 database servers?
>>>>>>> Perhaps in that case we could keep the module without publishing it
>>>>>>> to crates.io until we're happy with database support, or even its
>>>>>>> usage.
>>>>>>> 
>>>>>>> * Dependency bloat
>>>>>>> We could feature-gate database types to reduce the number of
>>>>>>> dependencies if one only wants certain DB connectors.
>>>>>>> 
>>>>>>> * Why not use Java's JDBC adapter?
>>>>>>> I already do this, but sometimes, when working on a Rust project,
>>>>>>> creating a separate JVM service solely to extract Arrow data is a lot
>>>>>>> of effort.
>>>>>>> I also don't think it's currently possible to use the adapter to save
>>>>>>> Arrow data in a database.
>>>>>>> 
>>>>>>> * What about Flight SQL extensions?
>>>>>>> There have been discussions around creating Flight SQL extensions,
>>>>>>> and the Rust SQL adapter could implement that and co-exist well.
>>>>>>> From a crate dependency perspective, *arrow-flight* depends on *arrow*,
>>>>>>> so it could also depend on this *arrow-sql* crate.
>>>>>>> 
>>>>>>> Please let me know what you think
>>>>>>> 
>>>>>>> Regards
>>>>>>> Neville
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 
> 
