One more universal approach is to use ODBC. Here is a recent Rust conversation (with an example) on the topic: https://github.com/Koka/odbc-rs/issues/140
Honestly, I find the Python DB API too simple; all it provides is a row-by-row API. I miss four things:

- Batched or bulk processing, both for data loading and dumping.
- Async support (Python has asyncio and async web frameworks, but no async DB spec). SQLAlchemy async support is coming soon, and there is https://github.com/encode/databases
- Connection pooling (it's common to use TLS; connection reuse would be nice, as TLS 1.3 is not here yet)
- Failover / load balancing support (this is connected to the previous)

Best regards,
Adam Lippai

On Sun, Sep 27, 2020 at 9:57 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> That would be awesome! I agree with this, and it would be really useful,
> as it would leverage all the goodies that RDBMSs have wrt transactions,
> etc.
>
> I would probably go for having database-specifics outside of the arrow
> project, so that they can be used by other folks beyond arrow, and keep
> the arrow-specifics (i.e. conversion from the format of the specific
> databases to arrow) as part of the arrow crate. Ideally, as Wes wrote,
> with some standard to make it easier to handle different DBs.
>
> I think that there are two layers: one is how to connect to a database,
> the other is how to serialize/deserialize. AFAIK PEP 249 covers both
> layers, as it standardizes things like `connect` and `tpc_begin`, as well
> as how things should be serialized to Python objects (e.g. dates should
> be datetime.date). This split is done by postgres for Rust
> <https://github.com/sfackler/rust-postgres>, as it offers 5 crates:
> * postgres-async
> * postgres-sync (a blocking wrapper of postgres-async)
> * postgres-types (to convert to native rust <---- IMO this one is what we
> want to offer in Arrow)
> * postgres-TLS
> * postgres-openssl
>
> `postgres-sync` implements Iterator<Row> (`client.query`), and
> postgres-async implements Stream<Row>.
>
> One idea is to have a generic<T> iterator/stream adapter that yields
> RecordBatches.
> The implementation of this trait by different providers would give
> support to be used in Arrow and DataFusion.
>
> Besides postgres, one idea is to pick the top entries from this list
> <https://db-engines.com/en/ranking>:
>
> * Oracle
> * MySQL
> * MsSQL
>
> Another idea is to start by supporting SQLite, which is a good
> development environment to work with relational databases.
>
> Best,
> Jorge
>
> On Sun, Sep 27, 2020 at 4:22 AM Neville Dipale <nevilled...@gmail.com>
> wrote:
>
> > Hi Arrow developers,
> >
> > I would like to gauge the appetite for an Arrow SQL connector that:
> >
> > * Reads and writes Arrow data to and from SQL databases
> > * Reads tables and queries into record batches, and writes batches to
> > tables (either append or overwrite)
> > * Leverages binary SQL formats where available (e.g. the PostgreSQL
> > format is relatively easy and well-documented)
> > * Provides a batch interface that abstracts away the different database
> > semantics, and exposes a RecordBatchReader
> > (https://docs.rs/arrow/1.0.1/arrow/record_batch/trait.RecordBatchReader.html),
> > and perhaps a RecordBatchWriter
> > * Resides in the Rust repo as either an arrow::sql module (like
> > arrow::csv, arrow::json, arrow::ipc) or, alternatively, as a separate
> > crate in the workspace (*arrow-sql*?)
> >
> > I would be able to contribute a Postgres reader/writer as a start.
> > I could make this a separate crate, but to drive adoption I would
> > prefer this living in Arrow; it can also remain updated (sometimes we
> > reorganise modules and end up breaking dependencies).
> >
> > Also, being developed next to DataFusion could allow DF to support SQL
> > databases, as this would be yet another datasource.
> >
> > Some questions:
> > * Should such a library support async, sync, or both IO methods?
> > * Other than postgres, what other databases would be interesting?
> > Here I'm hoping that once we've established a suitable API, it could
> > be easier to natively support more database types.
> >
> > Potential concerns:
> >
> > * Sparse database support
> > It's a lot of effort to write database connectors, especially if
> > starting from scratch (unlike with, say, JDBC). What if we end up
> > supporting only 1 or 2 database servers?
> > Perhaps in that case we could keep the module without publishing it to
> > crates.io until we're happy with database support, or even its usage.
> >
> > * Dependency bloat
> > We could feature-gate database types to reduce the number of
> > dependencies if one only wants certain DB connectors.
> >
> > * Why not use Java's JDBC adapter?
> > I already do this, but if working on a Rust project, creating a
> > separate JVM service solely to extract Arrow data is a lot of effort.
> > I also don't think it's currently possible to use the adapter to save
> > Arrow data in a database.
> >
> > * What about Flight SQL extensions?
> > There have been discussions around creating Flight SQL extensions, and
> > the Rust SQL adapter could implement that and co-exist well.
> > From a crate dependency perspective, *arrow-flight* depends on *arrow*,
> > so it could also depend on this *arrow-sql* crate.
> >
> > Please let me know what you think.
> >
> > Regards,
> > Neville
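[Editor's note] On the dependency-bloat concern, Cargo's optional dependencies give feature-gating for free: each connector becomes a feature a user opts into. A hypothetical Cargo.toml fragment for an *arrow-sql* crate (crate names and versions are illustrative only):

```toml
# Sketch of a manifest where each connector is an optional dependency,
# so users pull in only the drivers they enable.
[dependencies]
postgres = { version = "0.17", optional = true }
mysql = { version = "20.0", optional = true }
rusqlite = { version = "0.24", optional = true }

[features]
default = []
# e.g. `cargo build --features postgres` enables only the Postgres connector
```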