Re: Delta Lake support for DataFusion

Jorge Cardoso Leitão Thu, 10 Jun 2021 00:01:47 -0700

Hi,

I agree with all of you. ^_^


I created https://github.com/apache/arrow-datafusion/issues/533 to track
this. I tried to encapsulate the three main use-cases for the SQL
extension. Feel free to edit at will.

Best,
Jorge




On Thu, Jun 10, 2021 at 8:37 AM QP Hou <[email protected]> wrote:

> Thanks Daniël for starting the discussion!
>
> Looks like we are on the same page to take this as an opportunity to
> make datafusion more extensible :)
>
> I think Neville and Daniël nailed the biggest missing piece at the
> moment: being able to extend SQL parser and planner with new syntaxes
> and map them to custom plan/expression nodes.
>
> Another thing that I think we should do is to come up with a way to
> better surface these datafusion extensions to help with discovery.
> For example, pandas has a dedicated section [1] in their official doc
> for this. Perhaps we could start with adding a list of extensions in
> the readme.
>
> After thinking more on this, I feel like it's better to keep the
> extension within delta-rs for now. In the future, delta-rs will likely
> need to depend on ballista for processing delta table metadata using
> distributed compute. So if we move the extension code into
> arrow-datafusion, it might result in circular dependency. I don't see
> a lot of benefits in creating a dedicated datafusion-delta-rs repo at
> the moment. But I am happy to go that route if there are compelling
> reasons. My main goal is just to make sure we have a single officially
> maintained datafusion extension for delta lake.
>
> [1]: https://pandas.pydata.org/docs/ecosystem.html#io
>
> Thanks,
> QP Hou
>
> On Wed, Jun 9, 2021 at 11:30 AM Daniël Heres <[email protected]>
> wrote:
> >
> > Thanks all for the valuable input!
> >
> > I agree that following the plugin model makes a lot of sense for now
> > (either in the arrow-datafusion repo or somewhere external, for example in
> > delta-rs if we're OK with it not being part of Apache right now).
> >
> > In order to support certain Delta Lake features, including SQL syntax, we
> > probably need to make DataFusion a bit more extensible beyond what is
> > currently possible with the TableProvider, for example:
> >
> > * Allow registering a custom data format (for supporting things like
> > *create external table t stored as parquet*)
> > * Allow parsing and/or handling custom SQL syntax like *optimize* /
> > *vacuum* / *select * from t version as of n*, etc.
> >
> > And probably some more that I can't think of right now. I think this is
> > useful work, as it would also enable other "extensions" to work in a
> > similar way (e.g. Apache Iceberg and other formats / readers / writers /
> > syntax) and make DataFusion a more flexible engine.
> >
> > Best, Daniël
> >
> > On Wed, 9 Jun 2021 at 20:07, Neville Dipale <[email protected]> wrote:
> >
> > > The correct approach might be to improve DataFusion support in
> > > delta-rs. TableProvider is already implemented here:
> > >
> > > https://github.com/delta-io/delta-rs/blob/main/rust/src/delta_datafusion.rs
> > >
> > > I've pinged QP to ask for their advice.
> > >
> > > Neville
> > >
> > > On Wed, 9 Jun 2021 at 19:58, Andrew Lamb <[email protected]> wrote:
> > >
> > > > I think the idea of DataFusion + DeltaLake is quite compelling and
> > > > likely useful.
> > > >
> > > > However, I think DataFusion is ideally an "embeddable query engine"
> > > > rather than a database system in itself, so in that mental model Delta
> > > > Lake integration belongs somewhere other than the core DataFusion crate.
> > > >
> > > > My ideal structure would be a new crate (maybe not even part of the
> > > > Apache Arrow Project), perhaps called `datafusion-delta-rs`, that
> > > > contained the TableProvider and whatever else was needed to integrate
> > > > DataFusion with DeltaLake
> > > >
> > > > This structure could also start a pattern of publishing plugins for
> > > > DataFusion separately from the core.
> > > >
> > > > Andrew
> > > > p.s. now that Arrow is publishing more incrementally (e.g. 4.1.0, 4.2.0,
> > > > etc), I think delta-rs[1] and datafusion both only specify `4.x` so they
> > > > should work together nicely
> > > >
> > > > [1]: https://github.com/delta-io/delta-rs/blame/main/rust/Cargo.toml
> > > >
> > > > On Wed, Jun 9, 2021 at 2:29 AM Daniël Heres <[email protected]> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I would like to receive some feedback about adding Delta Lake support
> > > > > to DataFusion (https://github.com/apache/arrow-datafusion/issues/525).
> > > > > As you might know, Delta Lake <https://delta.io/> is a format adding
> > > > > features like ACID transactions, statistics, and storage optimization
> > > > > to Parquet and is getting quite some traction for managing data lakes.
> > > > > It seems a great feature to have in DataFusion as well.
> > > > >
> > > > > The delta-rs <https://github.com/delta-io/delta-rs> project provides a
> > > > > native, Apache licensed, Rust implementation of Delta Lake, already
> > > > > supporting a large part of the format and operations.
> > > > >
> > > > > The first integration I would like to propose is adding read support
> > > > > via a new TableProvider. There might be some work to do around
> > > > > dependencies as both DataFusion and delta-rs rely on (certain versions
> > > > > of) Arrow and Parquet.
> > > > >
> > > > > Let me know if you have any further ideas or concerns.
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Daniël Heres
> > > > >
> > > >
> > >
> >
> >
> > --
> > Daniël Heres
>
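
For illustration only: the custom SQL syntax mentioned in the thread
(*optimize*, *vacuum*, *select * from t version as of n*) could be recognised
by a thin layer that runs before the regular SQL parser. The sketch below is
a rough, standard-library-only outline of that idea. Every name in it
(`DeltaStatement`, `parse_extension`) is hypothetical and is not an API of
DataFusion, delta-rs, or any SQL parser crate; a real extension would most
likely hook into the parser's tokenizer rather than split on whitespace.

```rust
/// Hypothetical extension statements; not part of DataFusion or delta-rs.
#[derive(Debug, PartialEq)]
enum DeltaStatement {
    Optimize { table: String },
    Vacuum { table: String },
    /// A query with a trailing `VERSION AS OF n` suffix removed, plus `n`.
    TimeTravel { query: String, version: u64 },
}

/// Try to recognise one of the custom statements. Returns `None` when the
/// input should be handed to the regular SQL parser unchanged.
fn parse_extension(sql: &str) -> Option<DeltaStatement> {
    // Naive tokenization (assumption): whitespace-separated words only.
    let tokens: Vec<&str> = sql
        .trim()
        .trim_end_matches(';')
        .split_whitespace()
        .collect();
    // Upper-cased copy so keyword matching is case-insensitive.
    let upper: Vec<String> = tokens.iter().map(|t| t.to_ascii_uppercase()).collect();
    let kw: Vec<&str> = upper.iter().map(String::as_str).collect();

    match kw.as_slice() {
        ["OPTIMIZE", _] => Some(DeltaStatement::Optimize {
            table: tokens[1].to_string(),
        }),
        ["VACUUM", _] => Some(DeltaStatement::Vacuum {
            table: tokens[1].to_string(),
        }),
        // Require at least one token of base query before the suffix.
        [.., "VERSION", "AS", "OF", _] if kw.len() > 4 => {
            let version = tokens[tokens.len() - 1].parse::<u64>().ok()?;
            let query = tokens[..tokens.len() - 4].join(" ");
            Some(DeltaStatement::TimeTravel { query, version })
        }
        _ => None,
    }
}

fn main() {
    for sql in ["optimize t", "VACUUM t;", "select * from t version as of 3", "select 1"] {
        println!("{:?} -> {:?}", sql, parse_extension(sql));
    }
}
```

Anything that returns `None` would fall through to the existing parser, so
plain SQL keeps working; the recognised statements would then be mapped to
custom plan nodes, which is the part the thread identifies as missing.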
