[GitHub] [arrow-datafusion] houqp commented on pull request #55: Support qualified columns in queries

GitBox Mon, 26 Apr 2021 00:10:57 -0700


houqp commented on pull request #55:
URL: https://github.com/apache/arrow-datafusion/pull/55#issuecomment-826380176



   @jorgecarleitao your understanding of the column changes in the logical and 
physical plans is correct. I would add that in the physical plan, the string 
column name is probably not needed. We currently use only the index field for 
evaluation; I kept the name mostly for debugging purposes. But given that the 
column name is also available in the physical schema fields, I think it should 
be safe to store only the index in physical column expressions.
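   
   A minimal sketch of the index-only idea, in plain Rust without the arrow 
crates. `PhysicalColumn` and `Batch` here are hypothetical stand-ins, not the 
actual DataFusion types; the point is only that evaluation needs the index, 
while the name can still be looked up from the schema for debugging:

```rust
/// Hypothetical stand-in for a physical column expression that stores only an index.
#[derive(Debug, Clone, Copy, PartialEq)]
struct PhysicalColumn {
    index: usize,
}

/// Hypothetical stand-in for a record batch: field names plus one Vec<i64> per column.
struct Batch {
    field_names: Vec<String>,
    columns: Vec<Vec<i64>>,
}

impl PhysicalColumn {
    /// Evaluation only needs the index.
    fn evaluate<'a>(&self, batch: &'a Batch) -> Option<&'a [i64]> {
        batch.columns.get(self.index).map(|c| c.as_slice())
    }

    /// The name is recovered from the schema on demand, e.g. for debug output.
    fn name<'a>(&self, batch: &'a Batch) -> Option<&'a str> {
        batch.field_names.get(self.index).map(|s| s.as_str())
    }
}

fn main() {
    let batch = Batch {
        field_names: vec!["a".to_string(), "b".to_string()],
        columns: vec![vec![1, 2, 3], vec![10, 20, 30]],
    };
    let col_b = PhysicalColumn { index: 1 };
    println!("{:?} -> {:?}", col_b.name(&batch), col_b.evaluate(&batch));
}
```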
   
   The answer to your schema change question is a little bit tricky, so let me 
try to clarify the new behavior. In short, it changes the field names in 
logical plan schemas, because we require all columns to be normalized (fully 
qualified) when building the plan. For physical schemas, column names should 
not change, except when columns are wrapped with operators.
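   
   To illustrate what "normalized" means here, a rough sketch of qualifying an 
unqualified column against the tables in scope. The `Column` struct, the 
`normalize` function, and the error messages are assumptions made for 
illustration, not the code in this PR:

```rust
/// Hypothetical, simplified column reference: optional table qualifier plus a name.
#[derive(Debug, Clone, PartialEq)]
struct Column {
    relation: Option<String>,
    name: String,
}

/// Qualify `col` against `(table, field names)` pairs.
/// Fails if the name is unknown or matches more than one table.
fn normalize(col: &Column, tables: &[(&str, &[&str])]) -> Result<Column, String> {
    // Already qualified columns are left untouched.
    if col.relation.is_some() {
        return Ok(col.clone());
    }
    let matches: Vec<&str> = tables
        .iter()
        .filter(|(_, fields)| fields.contains(&col.name.as_str()))
        .map(|(table, _)| *table)
        .collect();
    match matches.as_slice() {
        [table] => Ok(Column {
            relation: Some(table.to_string()),
            name: col.name.clone(),
        }),
        [] => Err(format!("no field named '{}'", col.name)),
        _ => Err(format!("ambiguous reference to '{}'", col.name)),
    }
}

fn main() {
    let t1_fields: &[&str] = &["a", "b"];
    let tables = [("t1", t1_fields)];
    let a = Column { relation: None, name: "a".to_string() };
    // `a` resolves to the qualified column t1.a
    println!("{:?}", normalize(&a, &tables));
}
```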
   
   Using your SQL as an example:
   
   ```sql
   SELECT a FROM t1
   ```
   
   The logical schema field will be normalized to `t1.a`. However, the final 
execution output will have a physical/Arrow schema with field `a`. The 
qualifier is stripped during physical planning at: 
https://github.com/houqp/arrow-datafusion/blob/8ecc215bb7fe44d8cf9dcb4b90df753f0c50afb7/datafusion/src/physical_plan/planner.rs#L483-L486
   
   For the DataFrame API, the behavior is the same, since both SQL and 
DataFrame go through the same query builder interface:
   
   ```rust
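   // Assumes `ctx` is an ExecutionContext with a table "temp" that has a column `a`.
   let df = ctx.table("temp")?;
   let batches = df.select_columns(&["a"])?.collect().await?;
   batches[0].schema().fields()[0].name()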
   ```
   
   The above code will result in `a`. So far this is the same as what 
DataFusion does today. The difference appears when operators are used, for 
example:
   
   ```sql
   SELECT a, MAX(b) FROM t1 GROUP BY a
   ```
   
   This will result in two fields, `a` and `MAX(t1.b)`. Neither has a 
top-level qualifier, but the column inside the operator keeps its normalized 
name `t1.b`.
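   
   One way to model this naming rule, as a hedged sketch: the `Expr` enum and 
the `logical_name` / `output_field_name` functions below are hypothetical and 
simplified, and are not the code at the planner lines linked above.

```rust
/// Hypothetical, simplified logical expression; not DataFusion's `Expr`.
enum Expr {
    Column { relation: Option<String>, name: String },
    Max(Box<Expr>),
}

/// Name used in the logical schema: columns keep their qualifier.
fn logical_name(expr: &Expr) -> String {
    match expr {
        Expr::Column { relation: Some(r), name } => format!("{}.{}", r, name),
        Expr::Column { relation: None, name } => name.clone(),
        Expr::Max(inner) => format!("MAX({})", logical_name(inner)),
    }
}

/// Name used in the output schema: a bare column drops its qualifier,
/// anything wrapped in an operator keeps the logical name of its inputs.
fn output_field_name(expr: &Expr) -> String {
    match expr {
        Expr::Column { name, .. } => name.clone(),
        other => logical_name(other),
    }
}

/// Helper to build a qualified column.
fn col(relation: &str, name: &str) -> Expr {
    Expr::Column {
        relation: Some(relation.to_string()),
        name: name.to_string(),
    }
}

fn main() {
    let select_list = vec![col("t1", "a"), Expr::Max(Box::new(col("t1", "b")))];
    for e in &select_list {
        // prints "t1.a -> a" and then "MAX(t1.b) -> MAX(t1.b)"
        println!("{} -> {}", logical_name(e), output_field_name(e));
    }
}
```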
   
   Basically, I made sure the behavior is consistent with MySQL, PostgreSQL, 
and Spark.


