jorgecarleitao commented on pull request #55: URL: https://github.com/apache/arrow-datafusion/pull/55#issuecomment-828928481
@houqp, let me clarify the context of the invariants: back in August 2020, I started looking into improving the schema invariants, as DataFusion was not preserving any. Before committing to any work, I outlined the invariants that I thought were relevant, shared them around, and worked to enforce them. Please by no means consider them something that "the community" agreed on; they were more of a guideline for me, which some people read through at the time. I felt the need to have some design choices to guide my development and arrive at a consistent state.

With respect to your question, I think that you are right: taking the assumption `column names are unique`, it follows that the order of columns in a schema does not need to be preserved, because consumers are expected to access columns by name, not by index. So, I think that the `==` in `schema(plan) == schema(opt(plan))` should mean the same metadata and the same set (in the mathematical sense) of fields. We may want to define this equality somewhere in DataFusion so that we can use it whenever we want to assert the invariant (since the compiler can't do it for us).

I also agree that we could preserve column order instead: in the case of the join, we would need to make the side probe optimization a physical, not logical, optimization (and write the batches with the correct order at the end), and for `*` we would define a rule for column order.

I do not have an opinion in either direction; I just think that it is useful to be explicit about these so that our users can have a mental model of how column names behave and how they access them (can I rely on stable column names? can I rely on stable indexes? etc.).
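To make the order-insensitive reading of `==` concrete, here is a minimal sketch of what such an equality check could look like. The `Field` and `Schema` structs and the `schemas_equivalent` function are hypothetical stand-ins written for illustration only; they are not Arrow's or DataFusion's actual types or API, and the sketch assumes that column names are unique within a schema.

```rust
use std::collections::{BTreeMap, HashSet};

// Hypothetical, simplified stand-in for a schema field (not Arrow's `Field`).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct Field {
    name: String,
    data_type: String, // e.g. "Int64"; a real schema would use a type enum
    nullable: bool,
}

// Hypothetical, simplified stand-in for a schema (not Arrow's `Schema`).
#[derive(Debug, Clone)]
struct Schema {
    fields: Vec<Field>,
    metadata: BTreeMap<String, String>,
}

// Convenience constructor used only to keep examples short.
fn field(name: &str, data_type: &str, nullable: bool) -> Field {
    Field {
        name: name.to_string(),
        data_type: data_type.to_string(),
        nullable,
    }
}

// Returns true when both schemas carry the same metadata and the same
// set of fields, ignoring the order in which the fields are listed.
// Assumption: column names are unique within each schema.
fn schemas_equivalent(a: &Schema, b: &Schema) -> bool {
    if a.metadata != b.metadata || a.fields.len() != b.fields.len() {
        return false;
    }
    let left: HashSet<&Field> = a.fields.iter().collect();
    let right: HashSet<&Field> = b.fields.iter().collect();
    left == right
}
```

Under this definition, a schema listing `id, name` and one listing `name, id` compare as equivalent, while a change to any field's type or nullability, or to the metadata, does not. If column order were preserved instead, a plain element-wise comparison of the field lists would be the natural check.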
