Thanks for the feedback

My interest is mainly in the narrow use case of reading and writing batch
data, so I wouldn't want to deal with producing and consuming rows per se.
Andy has worked on RDBC (https://github.com/tokio-rs/rdbc) for the
row-based or OLTP case, and I'm considering something more suitable for
the OLAP case.

@Wes I'll have a read through the Python DB API. I've also been looking at
JDBC, as well as how Apache Spark manages to get such good performance
from JDBC.

I haven't been an ODBC fan, but mainly because of historic struggles with
getting it to work on Linux environments where I don't have system
control. With that said, we could still support ODBC.

@Jorge, I have an implementation at rust-dataframe (
https://github.com/nevi-me/rust-dataframe/tree/master/src/io/sql/postgres)
which uses rust-postgres. However, I don't use the row-based API, as that
comes at a serialization cost (going from bytes > Rust types > Arrow).
I instead use the Postgres binary format (
https://github.com/nevi-me/rust-dataframe/blob/master/src/io/sql/postgres/reader.rs#L204
).
That postgres module would be the starting point of such a separate crate.
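
To illustrate the idea, here is a minimal sketch (not the rust-dataframe
code itself), assuming the blocking *postgres* crate and an illustrative
table name:

use std::io::Read;

use postgres::{Client, NoTls};

/// Stream a table out in PostgreSQL's binary COPY format and return the
/// raw bytes; a parser can then decode them straight into Arrow builders,
/// skipping the per-row, per-value conversions of the row API.
fn copy_binary(conn: &str, table: &str) -> Result<Vec<u8>, Box<dyn std::error::Error>> {
    let mut client = Client::connect(conn, NoTls)?;
    // COPY ... TO STDOUT (FORMAT binary) avoids the text/row round-trip.
    let query = format!("COPY {} TO STDOUT (FORMAT binary)", table);
    let mut reader = client.copy_out(query.as_str())?;
    let mut buf = Vec::new();
    reader.read_to_end(&mut buf)?;
    // `buf` now holds the binary COPY stream: a 19-byte header followed by
    // length-prefixed fields, one per column per row.
    Ok(buf)
}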

For Postgres <> Arrow type conversions, I leverage 2 methods:

1. When reading a table, I get the schema from the *information_schema*
system table (a rough sketch of this follows below)
2. When reading a query, I issue the query with a 1-row limit, and convert
the row's schema to an Arrow schema
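
As a sketch of the first method (the type mapping below is deliberately
minimal, and a real implementation would cover far more Postgres types):

use arrow::datatypes::{DataType, Field, Schema};
use postgres::Client;

/// Build an Arrow schema for a table from information_schema.columns.
fn table_schema(client: &mut Client, table: &str) -> Result<Schema, Box<dyn std::error::Error>> {
    let rows = client.query(
        "SELECT column_name::text, data_type::text, is_nullable::text \
         FROM information_schema.columns \
         WHERE table_name = $1 \
         ORDER BY ordinal_position",
        &[&table],
    )?;
    let fields: Vec<Field> = rows
        .iter()
        .map(|row| {
            let name: &str = row.get(0);
            let pg_type: &str = row.get(1);
            let nullable: &str = row.get(2);
            let dtype = match pg_type {
                "smallint" => DataType::Int16,
                "integer" => DataType::Int32,
                "bigint" => DataType::Int64,
                "real" => DataType::Float32,
                "double precision" => DataType::Float64,
                "boolean" => DataType::Boolean,
                // Fall back to strings for everything else in this sketch.
                _ => DataType::Utf8,
            };
            Field::new(name, dtype, nullable == "YES")
        })
        .collect();
    Ok(Schema::new(fields))
}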

@Adam I think async and pooling would be attainable, yes. If an underlying
SQL crate uses r2d2 for pooling, an API that supports that could be
provided.
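
For example (a sketch only, assuming the r2d2 and r2d2_postgres crates;
the pool size is arbitrary):

use r2d2_postgres::{postgres::NoTls, PostgresConnectionManager};

/// Build a connection pool that the reader's partitions can draw from.
fn make_pool(
    conn: &str,
) -> Result<r2d2::Pool<PostgresConnectionManager<NoTls>>, Box<dyn std::error::Error>> {
    // Parse the connection string into a postgres::Config for the manager.
    let manager = PostgresConnectionManager::new(conn.parse()?, NoTls);
    let pool = r2d2::Pool::builder().max_size(8).build(manager)?;
    Ok(pool)
}

Each partition would then check out a connection with `pool.get()?` and
run its query or COPY on that, instead of opening its own connection.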

In summary, I'm thinking along the lines of the following (a rough sketch
in code follows the lists):

* A reader that takes connection parameters & a query or table
* The reader can handle partitioning if need be (similar to how Spark does
it)
* The reader returns a Schema, and can be iterated on to return data in
batches

* A writer that takes connection parameters and a table
* The writer writes batches to a table, and is able to write batches in
parallel
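
A very rough sketch of what that surface could look like (the names and
signatures below are placeholders, not a committed API):

use arrow::datatypes::SchemaRef;
use arrow::error::Result;
use arrow::record_batch::RecordBatch;

/// What to read: either a whole table or an arbitrary query.
pub enum SqlSource {
    Table(String),
    Query(String),
}

/// Built from connection parameters plus a SqlSource; exposes the schema
/// up front, may be split into partitions (similar to Spark), and is
/// iterated to yield record batches.
pub trait SqlBatchReader: Iterator<Item = Result<RecordBatch>> {
    fn schema(&self) -> SchemaRef;
    fn num_partitions(&self) -> usize;
}

/// Built from connection parameters plus a target table; writes batches,
/// potentially from several threads in parallel.
pub trait SqlBatchWriter {
    fn write(&mut self, batch: &RecordBatch) -> Result<()>;
}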

In the case of a hypothetical interface with columnar databases like
ClickHouse, we would be able to leverage materialising Arrow arrays
directly from columns, instead of the column-wise conversions that would
otherwise have to be performed from row-based APIs.
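
To illustrate the difference (a toy example, not tied to any particular
ClickHouse client):

use arrow::array::Int64Array;

// Column-oriented source: the driver hands us a whole column at once,
// which maps directly onto an Arrow array.
fn from_column(col: Vec<i64>) -> Int64Array {
    Int64Array::from(col)
}

// Row-oriented source: every row has to be visited to peel out the i-th
// field before the array can be built.
fn from_rows(rows: &[(i64, String)]) -> Int64Array {
    let col: Vec<i64> = rows.iter().map(|(v, _)| *v).collect();
    Int64Array::from(col)
}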

Neville


On Sun, 27 Sep 2020 at 22:08, Adam Lippai <[email protected]> wrote:

> One more universal approach is to use ODBC; this is a recent Rust
> conversation (with an example) on the topic:
> https://github.com/Koka/odbc-rs/issues/140
>
> Honestly, I find the Python DB API too simple; all it provides is a
> row-by-row API. I miss four things:
>
>    - Batched or bulk processing both for data loading and dumping.
>    - Async support (python has asyncio and async web frameworks, but no
>    async DB spec). SQLAlchemy async support is coming soon and there is
>    https://github.com/encode/databases
>    - Connection pooling (it's common to use TLS, connection reuse would be
>    nice as TLS 1.3 is not here yet)
>    - Failover / load balancing support (this is connected to the previous)
>
> Best regards,
> Adam Lippai
>
> On Sun, Sep 27, 2020 at 9:57 PM Jorge Cardoso Leitão <
> [email protected]> wrote:
>
> > That would be awesome! I agree with this, and it would be really useful,
> > as it would leverage all the goodies that RDBMSs have wrt transactions,
> > etc.
> >
> > I would probably go for having database-specifics outside of the arrow
> > project, so that they can be used by other folks beyond arrow, and keep
> > the arrow-specifics (i.e. conversion from the format from the specific
> > databases to arrow) as part of the arrow crate. Ideally as Wes wrote,
> > with some standard to be easier to handle different DBs.
> >
> > I think that there are two layers: one is how to connect to a database,
> > the other is how to serialize/deserialize. AFAIK PEP 249 covers both
> > layers, as it standardizes things like `connect` and `tpc_begin`, as
> > well as how things should be serialized to Python objects (e.g. dates
> > should be datetime.date). This split is done by postgres for Rust
> > <https://github.com/sfackler/rust-postgres>, as it offers 5 crates:
> > * postgres-async
> > * postgres-sync (a blocking wrapper of postgres-async)
> > * postgres-types (to convert to native Rust <---- IMO this one is what
> > we want to offer in Arrow)
> > * postgres-TLS
> > * postgres-openssl
> >
> > `postgres-sync` implements Iterator<Row> (`client.query`), and
> > `postgres-async` implements Stream<Row>.
> >
> > One idea is to have a generic<T> iterator/stream adapter, that yields
> > RecordBatches. The implementation of this trait by different providers
> > would give support to be used in Arrow and DataFusion.
> >
> > Besides postgres, one idea is to pick the top from this list
> > <https://db-engines.com/en/ranking>:
> >
> > * Oracle
> > * MySQL
> > * MsSQL
> >
> > Another idea is to start by supporting SQLite, which is a good
> > development env to work with relational databases.
> >
> > Best,
> > Jorge
> >
> >
> >
> >
> >
> > On Sun, Sep 27, 2020 at 4:22 AM Neville Dipale <[email protected]>
> > wrote:
> >
> > > Hi Arrow developers
> > >
> > > I would like to gauge the appetite for an Arrow SQL connector that:
> > >
> > > * Reads and writes Arrow data to and from SQL databases
> > > * Reads tables and queries into record batches, and writes batches to
> > > tables (either append or overwrite)
> > > * Leverages binary SQL formats where available (e.g. PostgreSQL format
> > > is relatively easy and well-documented)
> > > * Provides a batch interface that abstracts away the different database
> > > semantics, and exposes a RecordBatchReader (
> > > https://docs.rs/arrow/1.0.1/arrow/record_batch/trait.RecordBatchReader.html
> > > ), and perhaps a RecordBatchWriter
> > > * Resides in the Rust repo as either an arrow::sql module (like
> > > arrow::csv, arrow::json, arrow::ipc) or alternatively is a separate
> > > crate in the workspace (*arrow-sql*?)
> > >
> > > I would be able to contribute a Postgres reader/writer as a start.
> > > I could make this a separate crate, but to drive adoption I would
> > > prefer this living in Arrow; it can also remain updated (sometimes we
> > > reorganise modules and end up breaking dependencies).
> > >
> > > Also, being developed next to DataFusion could allow DF to support SQL
> > > databases, as this would be yet another datasource.
> > >
> > > Some questions:
> > > * Should such library support async, sync or both IO methods?
> > > * Other than postgres, what other databases would be interesting? Here
> > > I'm hoping that once we've established a suitable API, it could be
> > > easier to natively support more database types.
> > >
> > > Potential concerns:
> > >
> > > * Sparse database support
> > > It's a lot of effort to write database connectors, especially if
> > > starting from scratch (unlike with, say, JDBC). What if we end up
> > > supporting 1 or 2 database servers?
> > > Perhaps in that case we could keep the module without publishing it to
> > > crates.io until we're happy with database support, or even its usage.
> > >
> > > * Dependency bloat
> > > We could feature-gate database types to reduce the number of
> > > dependencies if one only wants certain DB connectors
> > >
> > > * Why not use Java's JDBC adapter?
> > > I already do this, but sometimes if working on a Rust project, creating
> > > a separate JVM service solely to extract Arrow data is a lot of effort.
> > > I also don't think it's currently possible to use the adapter to save
> > > Arrow data in a database.
> > >
> > > * What about Flight SQL extensions?
> > > There have been discussions around creating Flight SQL extensions, and
> > > the Rust SQL adapter could implement that and co-exist well.
> > > From a crate dependency, *arrow-flight* depends on *arrow*, so it could
> > > also depend on this *arrow-sql* crate.
> > >
> > > Please let me know what you think
> > >
> > > Regards
> > > Neville
> > >
> >
>
