Thanks all for the valuable input!

I agree that following the plugin model makes a lot of sense for now (either
in the arrow-datafusion repo or somewhere external, for example in delta-rs,
if we're OK with it not being part of Apache right now).

In order to support certain Delta Lake features, including SQL syntax, we
probably need to make DataFusion a bit more extensible beyond what is
currently possible with the TableProvider, for example:

* Allow registering a custom data format (for supporting things like *create
external table t stored as parquet*)
* Allow parsing and/or handling custom SQL syntax like *optimize* /
*vacuum* / *select * from t version as of n*, etc.
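
To make the two points above a bit more concrete, here is a minimal sketch of
what such extension hooks could look like. It uses only the Rust standard
library and is not DataFusion's API: `ExtensionRegistry`, `TableSource`,
`CustomStatement` and `parse_custom` are hypothetical names made up for
illustration, and `TableSource` merely stands in for whatever a real
TableProvider implementation would return. The *version as of* syntax is left
out to keep it short.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a table source. In a real integration this
/// role would be played by a TableProvider implementation.
#[derive(Debug, PartialEq)]
pub struct TableSource {
    pub format: String,
    pub location: String,
}

/// Hypothetical factory type: builds a table source from a location.
pub type FormatFactory = Box<dyn Fn(&str) -> TableSource>;

/// Custom statements that an extension might want to handle itself.
#[derive(Debug, PartialEq)]
pub enum CustomStatement {
    Optimize { table: String },
    Vacuum { table: String },
}

/// Registry of custom data formats, keyed by lower-cased format name.
#[derive(Default)]
pub struct ExtensionRegistry {
    formats: HashMap<String, FormatFactory>,
}

impl ExtensionRegistry {
    /// Register a factory for a format name such as "delta".
    pub fn register_format(&mut self, name: &str, factory: FormatFactory) {
        self.formats.insert(name.to_ascii_lowercase(), factory);
    }

    /// Resolve the format named in a "stored as <format>" clause.
    /// Returns None when no extension has registered that format.
    pub fn create_table(&self, format: &str, location: &str) -> Option<TableSource> {
        self.formats
            .get(&format.to_ascii_lowercase())
            .map(|factory| factory(location))
    }
}

/// Recognise "optimize <table>" and "vacuum <table>" before the statement
/// reaches the regular SQL parser. Anything else returns None, so the
/// caller can fall through to normal parsing.
pub fn parse_custom(sql: &str) -> Option<CustomStatement> {
    let mut words = sql.trim().trim_end_matches(';').split_whitespace();
    let keyword = words.next()?.to_ascii_lowercase();
    let table = words.next()?.to_string();
    if words.next().is_some() {
        return None; // more tokens than this sketch understands
    }
    match keyword.as_str() {
        "optimize" => Some(CustomStatement::Optimize { table }),
        "vacuum" => Some(CustomStatement::Vacuum { table }),
        _ => None,
    }
}
```

An extension crate would then call `register_format("delta", ...)` once at
startup, and the engine would consult `create_table` and `parse_custom`
before falling back to its built-in formats and SQL parser.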

There are probably more that I can't think of right now. I think this is
useful work, as it would also enable other "extensions" to work in a similar
way (e.g. Apache Iceberg and other formats / readers / writers / syntax) and
make DataFusion a more flexible engine.

Best, Daniël

On Wed, 9 Jun 2021 at 20:07, Neville Dipale <nevilled...@gmail.com> wrote:

> The correct approach might be to improve DataFusion support in
> delta-rs. TableProvider is already implemented here:
> https://github.com/delta-io/delta-rs/blob/main/rust/src/delta_datafusion.rs
>
> I've pinged QP to ask for their advice.
>
> Neville
>
> On Wed, 9 Jun 2021 at 19:58, Andrew Lamb <al...@influxdata.com> wrote:
>
> > I think the idea of DataFusion + DeltaLake is quite compelling and likely
> > useful.
> >
> > However, I think DataFusion is ideally an "embeddable query engine"
> > rather than a database system in itself, so in that mental model Delta
> > Lake integration belongs somewhere other than the core DataFusion crate.
> >
> > My ideal structure would be a new crate (maybe not even part of the
> > Apache Arrow Project), perhaps called `datafusion-delta-rs`, that
> > contained the TableProvider and whatever else was needed to integrate
> > DataFusion with DeltaLake.
> >
> > This structure could also start a pattern of publishing plugins for
> > DataFusion separately from the core.
> >
> > Andrew
> > p.s. now that Arrow is publishing more incrementally (e.g. 4.1.0, 4.2.0,
> > etc), I think delta-rs[1] and datafusion both only specify `4.x` so they
> > should work together nicely
> >
> > [1] https://github.com/delta-io/delta-rs/blame/main/rust/Cargo.toml
> >
> > On Wed, Jun 9, 2021 at 2:29 AM Daniël Heres <danielhe...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > I would like to receive some feedback about adding Delta Lake support
> > > to DataFusion (https://github.com/apache/arrow-datafusion/issues/525).
> > > As you might know, Delta Lake <https://delta.io/> is a format adding
> > > features like ACID transactions, statistics, and storage optimization
> > > to Parquet and is getting quite some traction for managing data lakes.
> > > It seems a great feature to have in DataFusion as well.
> > >
> > > The delta-rs <https://github.com/delta-io/delta-rs> project provides a
> > > native, Apache licensed, Rust implementation of Delta Lake, already
> > > supporting a large part of the format and operations.
> > >
> > > The first integration I would like to propose is adding read support
> > > via a new TableProvider. There might be some work to do around
> > > dependencies as both DataFusion and delta-rs rely on (certain versions
> > > of) Arrow and Parquet.
> > >
> > > Let me know if you have any further ideas or concerns.
> > >
> > > Best regards,
> > >
> > > Daniël Heres
> > >
> >
>


-- 
Daniël Heres
