Hi,

Some questions that come to mind:

1. If we add vendor X to DataFusion, will we be open to another vendor Y?
How do we compare vendors? How do we draw the line at "not sufficiently
relevant"?
2. How do we ensure that we do not distort the level playing field that
some people expect from DataFusion?
3. What is the challenge in creating a binary that uses DataFusion + a
Delta Lake custom table provider outside of the DataFusion codebase?

I see DataFusion's plugin system,

* custom nodes
* custom table providers
* custom physical optimizers
* custom logical optimizers
* UDFs
* UDAFs

as our answer to not bundling vendor-specific implementations (e.g. S3,
Azure, Oracle, IOx, IBM, Google, Delta Lake), and instead allowing users to
build applications on top of DataFusion with whatever vendor-specific
requirements they have. Rust lends itself really well to this, as
dependencies are declared in Cargo.toml, and applications that combine
DataFusion with plugins can be compiled and deployed to prod environments
as a single binary.
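To make that last point concrete, here is a minimal sketch of what
registering a provider looks like, assuming a recent DataFusion API
(SessionContext / register_table; older releases call the context
ExecutionContext). MemTable stands in only to keep the example
self-contained; a vendor-maintained crate would plug its own TableProvider
implementation into the same slot:

    use std::sync::Arc;

    use datafusion::arrow::array::Int32Array;
    use datafusion::arrow::datatypes::{DataType, Field, Schema};
    use datafusion::arrow::record_batch::RecordBatch;
    use datafusion::datasource::MemTable;
    use datafusion::error::Result;
    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> Result<()> {
        let ctx = SessionContext::new();

        // any type implementing the TableProvider trait can be registered;
        // MemTable stands in for a vendor-maintained provider (e.g. Delta Lake)
        let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int32, false)]));
        let batch = RecordBatch::try_new(
            schema.clone(),
            vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
        )?;
        let table = MemTable::try_new(schema, vec![vec![batch]])?;
        ctx.register_table("events", Arc::new(table))?;

        // the resulting binary carries the provider with it; no separate install step
        ctx.sql("SELECT count(*) FROM events").await?.show().await?;
        Ok(())
    }

The engine only ever sees the TableProvider contract; which crate
implements it is the application author's choice.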

AFAIK Delta Lake itself is not bundled with Spark, and is instead installed
separately (e.g. via the POM for Java, pip install delta-spark for Python)
[1]. I think that this is a sustainable model whereby we do not have to
know about Delta Lake specifics to be able to maintain the code, and
instead declare contracts for extensions, which others maintain for their
specific formats/systems.
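The Rust analogue of pip install delta-spark would simply be an extra
dependency in the application's Cargo.toml; the crate names and version
specifiers below are illustrative assumptions, not a recommendation:

    [package]
    name = "my-query-app"
    version = "0.1.0"
    edition = "2021"

    [dependencies]
    datafusion = "*"   # the engine, maintained here; pin a real version in practice
    deltalake = "*"    # vendor/community crate assumed to supply the Delta Lake TableProvider
    tokio = { version = "*", features = ["rt-multi-thread", "macros"] }

The point is that the vendor-specific crate lives in the application's
dependency tree, not in DataFusion's.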

Best,
Jorge

[1] https://docs.delta.io/1.0.0/quick-start.html
