Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/9548#issuecomment-155835390
  
    @cloud-fan Before discussing the solution details, let us first talk about 
the design issues.  
    
    IMO, `DataFrame` is a query language, a kind of dialect of SQL. Or, 
maybe, SQL is a dialect of `DataFrame`. We need to formalize it and clearly 
define the concept behind each major class, such as `DataFrame` and `Column`. If 
`Column` represents a concept independent of `DataFrame`, could you define what 
it is? If one `Column` with the same ID can appear in different `DataFrame`s, 
how do we enforce such "referential integrity" between them? If 
two `Column`s with different IDs could represent the same entity, should we keep 
track of that relation to generate a better physical plan? 
    
    In the current implementation, each `Column` actually corresponds to an 
expression in logical plans, but we are unable to apply an expression above 
`Column` instances to generate a new expression. So far, `Column` is kind of a 
wrapper, but it is not a subclass of `TreeNode`.   
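
    A minimal sketch of the wrapper relationship described above, written as a toy Python model rather than Spark's actual classes. The names `TreeNode`, `Attr`, `Add`, `Column`, and `rename` here are hypothetical stand-ins chosen for illustration; the only thing taken from the comment is the shape of the problem: the expression is a tree node, while the `Column` that holds it is not.

```python
from dataclasses import dataclass


class TreeNode:
    """Toy stand-in for a tree node that supports generic rewrites."""

    def children(self):
        return []

    def with_children(self, new_children):
        return self

    def transform(self, rule):
        # Rewrite the children first, then apply the rule to this node.
        new_children = [c.transform(rule) for c in self.children()]
        return rule(self.with_children(new_children))


@dataclass(frozen=True)
class Attr(TreeNode):
    """Leaf expression: a reference to a named attribute."""
    name: str


@dataclass(frozen=True)
class Add(TreeNode):
    """Binary expression with two child expressions."""
    left: TreeNode
    right: TreeNode

    def children(self):
        return [self.left, self.right]

    def with_children(self, new_children):
        return Add(*new_children)


class Column:
    """Wrapper holding an expression; deliberately NOT a TreeNode."""

    def __init__(self, expr):
        self.expr = expr

    def __add__(self, other):
        # Combining two wrappers builds a new expression underneath.
        return Column(Add(self.expr, other.expr))


def rename(node):
    # Hypothetical rewrite rule: replace attribute "a" with "b".
    return Attr("b") if node == Attr("a") else node


col = Column(Attr("a")) + Column(Attr("c"))

# Tree rewrites work on the wrapped expression...
rewritten = col.expr.transform(rename)

# ...but the wrapper itself offers no such operation.
column_is_tree_node = isinstance(col, TreeNode)
```

    In this toy model, any rule-based rewrite has to unwrap the `Column` first, which is the kind of gap the paragraph above points at.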
    
    As more components are built on top of `DataFrame`, we have to think 
carefully about this problem. If possible, I think we need to 
resolve it in the Spark 2.0 release.  
    
    I will respond to your design suggestion in a separate post.  


