Re: [DataFusion] [Discuss] Output Schema for queries with multiple relations

QP Hou Wed, 19 May 2021 00:09:23 -0700

Hi all,

Following up on this.

We have updated the output schema doc [1] and the invariant doc
[2] for the final round of review.

In the updated invariant doc, the main change we introduced compared
to the previous version is as follows:

We now enforce strict schema equality in all plan optimization
invariants. As a result, optimizations like reordering join sides need
to add an extra projection to maintain the schema field order. We
believe the extra projection should have minimal overhead. The upside
is that it keeps the field order semantics simple and easy for end
users to understand.
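
To make the join reordering case concrete, here is a minimal sketch of
how an optimizer rule could restore the original field order after
swapping join sides. It is not DataFusion code: the Schema alias and
the join_schema, restoring_projection and apply_projection helpers are
hypothetical names made up for illustration, and a schema is modeled
as a plain list of qualified field names.

```rust
/// Hypothetical stand-in for a plan schema: an ordered list of field names.
type Schema = Vec<String>;

/// Output schema of a join: left fields followed by right fields.
fn join_schema(left: &Schema, right: &Schema) -> Schema {
    left.iter().chain(right.iter()).cloned().collect()
}

/// For each field in `target`, find its position in `actual`.
/// Returns None if any field is missing, in which case the rewrite
/// cannot preserve the schema and should be rejected.
fn restoring_projection(actual: &Schema, target: &Schema) -> Option<Vec<usize>> {
    target
        .iter()
        .map(|name| actual.iter().position(|field| field == name))
        .collect()
}

/// Apply a projection, expressed as column indices, to a schema.
fn apply_projection(schema: &Schema, indices: &[usize]) -> Schema {
    indices.iter().map(|&i| schema[i].clone()).collect()
}

fn main() {
    let left: Schema = vec!["t1.id".to_string(), "t1.name".to_string()];
    let right: Schema = vec!["t2.id".to_string()];

    // Schema before the optimization and after swapping the join sides.
    let original = join_schema(&left, &right);
    let swapped = join_schema(&right, &left);
    assert_ne!(original, swapped);

    // The extra projection puts the fields back in their original order,
    // so strict schema equality holds again.
    let projection = restoring_projection(&swapped, &original)
        .expect("swapped join should contain every original field");
    assert_eq!(apply_projection(&swapped, &projection), original);
    println!("projection indices: {:?}", projection);
}
```

The projection here only reorders column references, which is why I
expect its cost to be small.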

In the draft PR [3], Andy raised a concern that referring to physical
columns by index instead of by name might limit our ability to
support schemaless data sources in the future. After thinking more
about this, I believe the current design can be extended to support
schemaless data sources by going one of the following two routes:

* Make the index field in physical columns optional. During physical
plan execution, we could fall back to the name field for schemaless
data sources while continuing to use indices for data sources that
have static schemas.
* Introduce a new type of physical column expression to refer to
columns in schemaless data sources.
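
As a rough sketch of the first route, the lookup could look something
like the code below. This is an illustration under assumptions, not
the actual DataFusion types: the Column and Row structs and the
evaluate method are hypothetical, and a row is simplified to parallel
lists of names and i64 values.

```rust
/// Hypothetical physical column reference with an optional index.
#[derive(Debug, Clone)]
struct Column {
    name: String,
    /// Some(i) for sources with a static schema, None for schemaless sources.
    index: Option<usize>,
}

/// Hypothetical row: field names and values stored in matching order.
struct Row {
    names: Vec<String>,
    values: Vec<i64>,
}

impl Column {
    /// Resolve the column against a row: use the index when it is known,
    /// otherwise fall back to a lookup by name.
    fn evaluate(&self, row: &Row) -> Option<i64> {
        match self.index {
            Some(i) => row.values.get(i).copied(),
            None => {
                let i = row.names.iter().position(|n| n == &self.name)?;
                row.values.get(i).copied()
            }
        }
    }
}

fn main() {
    let row = Row {
        names: vec!["a".to_string(), "b".to_string()],
        values: vec![10, 20],
    };

    // Static schema: the planner resolved the index ahead of time.
    let indexed = Column { name: "b".to_string(), index: Some(1) };
    // Schemaless source: only the name is known at planning time.
    let by_name = Column { name: "b".to_string(), index: None };

    assert_eq!(indexed.evaluate(&row), Some(20));
    assert_eq!(by_name.evaluate(&row), Some(20));
}
```

The second route would instead keep the index mandatory on the
existing column expression and put the name-based lookup in a
separate expression type.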

I intentionally left out discussion of schemaless data sources in the
updated invariant doc to keep the scope manageable for smaller
incremental deliverables and ease of review. My main goal here is to
make sure whatever design change we propose for multi-relation
support won't prevent us from supporting schemaless use cases in the
future.

If you have any feedback or concerns about the current design, now is a
good time to raise them :)

I am aiming to get the implementation PR out of draft mode in a week or so.

[1]: https://docs.google.com/document/d/1uviWavwEGD3qxwMk2AGkOgp6ENrvKGiMWQhHNbqPwhg/
[2]: https://docs.google.com/document/d/1dbK-3eaTHlzZcHzpTk1h-LA3b7dcxsVBcoZeVKYIPwI/
[3]: https://github.com/apache/arrow-datafusion/pull/55#issuecomment-829296665

Thanks,
QP Hou

On Wed, May 5, 2021 at 3:52 AM Andrew Lamb <al...@influxdata.com> wrote:
>
> I wanted to bring some additional attention to some discussion occurring on
> a PR [1], specifically the proposal of how to construct output field names
> from queries that have multiple relations (that may have the same input
> field).
>
> The documents are:
> * Document for output schema field name semantics with examples: [2]
> * Proposed change to @jorgecarleitao 's invariant doc [3]
> * Updated invariant doc with proposed changes applied [4]
>
> Please comment on the PR / in the docs if you are interested.
>
> Andrew
>
> [1]
> https://github.com/apache/arrow-datafusion/pull/55#issuecomment-831405269
> [2]
> https://docs.google.com/document/d/1uviWavwEGD3qxwMk2AGkOgp6ENrvKGiMWQhHNbqPwhg/edit?usp=sharing
> [3]
> https://docs.google.com/document/d/158gbfDp8pcakfriT2l7dHChwJB43_RV7lcWfxEC73ng/edit?usp=sharing
> [4]
> https://docs.google.com/document/d/1dbK-3eaTHlzZcHzpTk1h-LA3b7dcxsVBcoZeVKYIPwI/edit?usp=sharing
