One more universal approach is to use ODBC. Here is a recent Rust conversation (with an example) on the topic: https://github.com/Koka/odbc-rs/issues/140
Honestly, I find the Python DB API too simple; all it provides is a row-by-row API. I miss four things:

- Batched or bulk processing, both for data loading and dumping.
- Async support (Python has asyncio and async web frameworks, but no async DB spec). SQLAlchemy async support is coming soon, and there is https://github.com/encode/databases
- Connection pooling (it's common to use TLS; connection reuse would be nice, as TLS 1.3 is not here yet)
- Failover / load balancing support (this is connected to the previous)

Best regards,
Adam Lippai

On Sun, Sep 27, 2020 at 9:57 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> That would be awesome! I agree with this, and it would be really useful,
> as it would leverage all the goodies that RDBMSs have wrt transactions,
> etc.
>
> I would probably go for having database-specifics outside of the arrow
> project, so that they can be used by other folks beyond arrow, and keep
> the arrow-specifics (i.e. conversion from the format of the specific
> databases to arrow) as part of the arrow crate. Ideally, as Wes wrote,
> with some standard to make it easier to handle different DBs.
>
> I think that there are two layers: one is how to connect to a database,
> the other is how to serialize/deserialize. AFAIK PEP 249 covers both
> layers, as it standardizes things like `connect` and `tpc_begin`, as well
> as how things should be serialized to Python objects (e.g. dates should
> be datetime.date). This split is done by postgres for Rust
> <https://github.com/sfackler/rust-postgres>, as it offers 5 crates:
> * postgres-async
> * postgres-sync (a blocking wrapper of postgres-async)
> * postgres-types (to convert to native rust <---- IMO this one is what we
> want to offer in Arrow)
> * postgres-TLS
> * postgres-openssl
>
> `postgres-sync` implements Iterator<Row> (`client.query`), and
> postgres-async implements Stream<Row>.
>
> One idea is to have a generic<T> iterator/stream adapter that yields
> RecordBatches.
> The implementation of this trait by different providers would give
> support to be used in Arrow and DataFusion.
>
> Besides postgres, one idea is to pick the top entries from this list
> <https://db-engines.com/en/ranking>:
>
> * Oracle
> * MySQL
> * MsSQL
>
> Another idea is to start by supporting SQLite, which is a good
> development environment to work with relational databases.
>
> Best,
> Jorge
>
> On Sun, Sep 27, 2020 at 4:22 AM Neville Dipale <nevilled...@gmail.com>
> wrote:
>
> > Hi Arrow developers,
> >
> > I would like to gauge the appetite for an Arrow SQL connector that:
> >
> > * Reads and writes Arrow data to and from SQL databases
> > * Reads tables and queries into record batches, and writes batches to
> > tables (either append or overwrite)
> > * Leverages binary SQL formats where available (e.g. the PostgreSQL
> > format is relatively easy and well-documented)
> > * Provides a batch interface that abstracts away the different database
> > semantics, and exposes a RecordBatchReader
> > (https://docs.rs/arrow/1.0.1/arrow/record_batch/trait.RecordBatchReader.html),
> > and perhaps a RecordBatchWriter
> > * Resides in the Rust repo as either an arrow::sql module (like
> > arrow::csv, arrow::json, arrow::ipc) or, alternatively, as a separate
> > crate in the workspace (*arrow-sql*?)
> >
> > I would be able to contribute a Postgres reader/writer as a start.
> > I could make this a separate crate, but to drive adoption I would
> > prefer this living in Arrow; it can also remain updated (sometimes we
> > reorganise modules and end up breaking dependencies).
> >
> > Also, being developed next to DataFusion could allow DF to support SQL
> > databases, as this would be yet another datasource.
> >
> > Some questions:
> > * Should such a library support async, sync, or both IO methods?
> > * Other than postgres, what other databases would be interesting?
> > Here I'm hoping that once we've established a suitable API, it could
> > be easier to natively support more database types.
> >
> > Potential concerns:
> >
> > * Sparse database support
> > It's a lot of effort to write database connectors, especially if
> > starting from scratch (unlike with, say, JDBC). What if we end up
> > supporting only 1 or 2 database servers?
> > Perhaps in that case we could keep the module without publishing it to
> > crates.io until we're happy with database support, or even its usage.
> >
> > * Dependency bloat
> > We could feature-gate database types to reduce the number of
> > dependencies if one only wants certain DB connectors.
> >
> > * Why not use Java's JDBC adapter?
> > I already do this, but if working on a Rust project, creating a
> > separate JVM service solely to extract Arrow data is a lot of effort.
> > I also don't think it's currently possible to use the adapter to save
> > Arrow data in a database.
> >
> > * What about Flight SQL extensions?
> > There have been discussions around creating Flight SQL extensions, and
> > the Rust SQL adapter could implement that and co-exist well.
> > From a crate dependency perspective, *arrow-flight* depends on *arrow*,
> > so it could also depend on this *arrow-sql* crate.
> >
> > Please let me know what you think.
> >
> > Regards,
> > Neville
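[Editor's note] On the dependency-bloat concern, Cargo's optional dependencies give feature-gating for free: each connector becomes a feature a user opts into. A hypothetical Cargo.toml fragment for an *arrow-sql* crate (crate names and versions are illustrative only):

```toml
# Sketch of a manifest where each connector is an optional dependency,
# so users pull in only the drivers they enable.
[dependencies]
postgres = { version = "0.17", optional = true }
mysql = { version = "20.0", optional = true }
rusqlite = { version = "0.24", optional = true }

[features]
default = []
# e.g. `cargo build --features postgres` enables only the Postgres connector
```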