jorgecarleitao commented on pull request #55:
URL: https://github.com/apache/arrow-datafusion/pull/55#issuecomment-828928481


   @houqp, let me clarify the context of the invariants: back in August 2020, 
I started looking into improving the schema invariants, as DataFusion was not 
preserving any. Before committing to any work, I outlined the invariants that I 
thought were relevant, shared them around, and worked to enforce them. Please by 
no means consider them something that "the community" agreed on; they were 
more of a guideline for me, which some people read through at the time. I 
needed some design choices to guide my development so that I could arrive at a 
consistent state.
   
   Regarding your question, I think that you are right: under the assumption 
`column names are unique`, it follows that the order of columns in a schema 
does not need to be preserved, because consumers are expected to access 
columns by name, not by index. So, I think that the `==` in `schema(plan) == 
schema(opt(plan))` should mean the same metadata and the same set (in a 
mathematical sense) of fields. We may want to define this equality somewhere 
in DataFusion so that we can use it whenever we want to assert the invariant 
(since the compiler can't do it for us).
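
   A minimal sketch of what such an order-insensitive equality could look 
like. The `Field` and `Schema` types below are simplified, hypothetical 
stand-ins (not Arrow's or DataFusion's actual types), and `schemas_equivalent` 
is a made-up name; the sketch assumes column names are unique.

```rust
use std::collections::{BTreeMap, HashSet};

// Hypothetical, simplified stand-in for a schema field (not Arrow's `Field`).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct Field {
    name: String,
    data_type: String,
    nullable: bool,
}

// Hypothetical, simplified stand-in for a schema (not Arrow's `Schema`).
#[derive(Debug, Clone)]
struct Schema {
    fields: Vec<Field>,
    metadata: BTreeMap<String, String>,
}

/// Returns true when both schemas carry the same metadata and the same
/// set of fields, regardless of field order.
/// Assumes column names are unique, so each field appears at most once.
fn schemas_equivalent(a: &Schema, b: &Schema) -> bool {
    if a.metadata != b.metadata || a.fields.len() != b.fields.len() {
        return false;
    }
    // Compare fields as sets, which ignores their position in the schema.
    let left: HashSet<&Field> = a.fields.iter().collect();
    let right: HashSet<&Field> = b.fields.iter().collect();
    left == right
}
```

   An assertion of the invariant would then read something like 
`assert!(schemas_equivalent(&schema_before, &schema_after))`, where both 
arguments are hypothetical names for the schemas before and after optimization.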
   
   I also agree that we could preserve column order instead: in the case of 
the join, we would need to make the side probe optimization a physical, not 
logical, optimization (and write the batches with the correct order at the 
end), and for `*` we would define a rule for column order.
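
   As an illustration of "write the batches with the correct order at the 
end", here is a rough sketch that maps the columns an operator produced back to 
the order the schema promises. Both function names are hypothetical and 
columns are matched purely by (unique) name; this is not how DataFusion 
implements it.

```rust
/// For each expected column name, find its position among the produced
/// columns. Returns `None` if an expected name is missing.
/// Assumes column names are unique.
fn reorder_indices(produced: &[&str], expected: &[&str]) -> Option<Vec<usize>> {
    expected
        .iter()
        .map(|name| produced.iter().position(|p| p == name))
        .collect()
}

/// Build the output columns in the expected order by applying the indices.
fn reorder_columns<T: Clone>(columns: &[T], indices: &[usize]) -> Vec<T> {
    indices.iter().map(|&i| columns[i].clone()).collect()
}
```

   For example, if a join with swapped sides produced `["b_id", "b_val", 
"a_id"]` while the schema promises `["a_id", "b_id", "b_val"]`, the index 
mapping is `[2, 0, 1]`.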
   
   I do not have an opinion in either direction; I just think that it is useful 
to be explicit about these choices so that our users can have a mental model of 
how column names behave and how to access columns (can I rely on stable column 
names? can I rely on stable indexes? etc.).

