Hi all,

Following up on this.
We have updated the output schema doc [1] and the invariant doc [2] for the final round of review.

In the updated invariant doc, the main change compared to the previous version is that we now enforce strict schema equality in all plan optimization invariants. As a result, optimizations like reordering join sides need to add an extra projection to maintain the schema field order. We believe the extra projection should have minimal overhead. The upside is that it keeps the field order semantics simple and easy for end users to understand.

In the draft PR [3], Andy raised a concern that referring to physical columns by index instead of by name might limit our ability to support schemaless data sources in the future. After thinking more about this, I think the current design can be extended to support schemaless data sources by going one of the following two routes:

* Make the index field in physical columns optional. During physical plan execution, we could fall back to the name field for schemaless data sources while continuing to use indices for data sources that have static schemas.
* Introduce a new type of physical column expression to refer to columns in schemaless data sources.

I intentionally left the discussion of schemaless data sources out of the updated invariant doc to keep the scope manageable, so that the work can land as smaller incremental deliverables that are easier to review. My main goal here is to make sure that whatever design change we propose for multi-relation support won't prevent us from supporting schemaless use cases in the future.

If you have any feedback on or concerns about the current design, now is a good time to raise them :) I am aiming to get the implementation PR out of draft mode in a week or so.
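
To make the first route (an optional index with a fallback to the name) a bit more concrete, here is a minimal sketch in Rust. It is for illustration only and is not code from the PR: `PhysicalColumn`, `Row`, and `evaluate` are hypothetical names, and the row layout is an assumption chosen to keep the example short.

```rust
/// Hypothetical physical column reference (not the DataFusion type).
/// `index` is set for sources with a static schema and left as `None`
/// for schemaless sources.
#[derive(Debug, Clone)]
struct PhysicalColumn {
    name: String,
    index: Option<usize>,
}

/// Hypothetical row: values are stored positionally, with the field
/// names kept alongside so that a name lookup is possible.
struct Row {
    names: Vec<String>,
    values: Vec<i64>,
}

impl PhysicalColumn {
    /// Resolve the column against a row.
    /// With an index, this is a direct positional lookup.
    /// Without one, it falls back to searching for the field by name.
    fn evaluate(&self, row: &Row) -> Option<i64> {
        match self.index {
            Some(i) => row.values.get(i).copied(),
            None => row
                .names
                .iter()
                .position(|n| n == &self.name)
                .and_then(|i| row.values.get(i).copied()),
        }
    }
}
```

In this sketch the name lookup only happens when no index is available, so sources with static schemas would keep the cheaper positional access.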
[1]: https://docs.google.com/document/d/1uviWavwEGD3qxwMk2AGkOgp6ENrvKGiMWQhHNbqPwhg/
[2]: https://docs.google.com/document/d/1dbK-3eaTHlzZcHzpTk1h-LA3b7dcxsVBcoZeVKYIPwI/
[3]: https://github.com/apache/arrow-datafusion/pull/55#issuecomment-829296665

Thanks,
QP Hou

On Wed, May 5, 2021 at 3:52 AM Andrew Lamb <al...@influxdata.com> wrote:
>
> I wanted to bring some additional attention to some discussion occurring on
> a PR [1], specifically the proposal of how to construct output field names
> from queries that have multiple relations (that may have the same input
> field).
>
> The documents are:
> * Document for output schema field name semantics with examples: [2]
> * Proposed change to @jorgecarleitao 's invariant doc [3]
> * Updated invariant doc with proposed changes applied [4]
>
> Please comment on the PR / in the docs if you are interested.
>
> Andrew
>
> [1]
> https://github.com/apache/arrow-datafusion/pull/55#issuecomment-831405269
> [2]
> https://docs.google.com/document/d/1uviWavwEGD3qxwMk2AGkOgp6ENrvKGiMWQhHNbqPwhg/edit?usp=sharing
> [3]
> https://docs.google.com/document/d/158gbfDp8pcakfriT2l7dHChwJB43_RV7lcWfxEC73ng/edit?usp=sharing
> [4]
> https://docs.google.com/document/d/1dbK-3eaTHlzZcHzpTk1h-LA3b7dcxsVBcoZeVKYIPwI/edit?usp=sharing