Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20387
  
    I dug into the commit history and recalled why I made these decisions:
    
    * having a mutable `DataSourceV2Relation`. This is mostly to avoid
    having to keep adding constructor parameters to `DataSourceV2Relation`,
    which keeps the code easy to maintain. I'm ok with making it immutable
    if there is a significant benefit (see the sketch after this list).
    * not using `PhysicalOperation`. This is because we will add more push-down
    optimizations (e.g. limit, aggregate, join), and we have a specific
    push-down order for them. It's hard to extend `PhysicalOperation` to
    support more operators and a specific push-down order, so I created a new
    one. Eventually all data sources will be implemented as data source v2,
    so `PhysicalOperation` will go away.
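
    To make the trade-off in the first point concrete, here is a minimal,
    self-contained sketch; the class and field names are hypothetical, not
    Spark's actual internals. With a mutable node, push-down rules set new
    state in place, while an immutable node grows a constructor parameter
    (and a copy) for every new kind of pushed-down state:

    ```scala
    // Hypothetical illustration of the trade-off, not Spark's real classes.

    // Mutable variant: push-down rules set new state in place, so adding
    // e.g. limit push-down later needs no new constructor parameter.
    case class MutableV2Relation(source: String) {
      var pushedFilters: Seq[String] = Nil
    }

    // Immutable variant: each new kind of pushed state becomes another
    // constructor parameter, and every rule must copy the node.
    case class ImmutableV2Relation(
        source: String,
        pushedFilters: Seq[String] = Nil,
        pushedLimit: Option[Int] = None) {
      def withFilters(fs: Seq[String]): ImmutableV2Relation =
        copy(pushedFilters = fs)
    }
    ```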
    
    
    > The output of DataSourceV2Relation should be what is returned by the 
reader, in case the reader can only partially satisfy the requested schema 
projection
    
    Good catch! Since `DataSourceV2Reader` is mutable, the output can't be
    fixed up front, as it may change when we apply data source optimizations.
    Using `lazy val output ...` can fix this.
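
    A minimal sketch of that fix, using hypothetical stand-ins for the
    internal classes: because push-down rules mutate the reader, the
    relation's output is derived lazily from whatever schema the reader
    finally reports.

    ```scala
    import org.apache.spark.sql.types.StructType

    // Hypothetical stand-in for the mutable reader interface.
    trait MutableReader {
      // may change while push-down rules mutate the reader
      def readSchema(): StructType
    }

    case class V2Relation(reader: MutableReader) {
      // Evaluated on first access, i.e. after the optimizer has finished
      // mutating the reader, so it matches what the reader will produce.
      lazy val output: Seq[String] = reader.readSchema().fieldNames.toSeq
    }
    ```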
    
    
    > The requested projection passed to the DataSourceV2Reader should include 
filter columns
    
    I did this intentionally. If a column is only referenced by pushed
    filters, Spark doesn't need this column. Even if we require this column
    from the data source, we just read it out and wait for it to be pruned
    by the next operator.
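
    For example (hypothetical table and columns, and assuming a
    `SparkSession` named `spark` is in scope): if `a` is referenced only by
    a filter that is fully pushed to the source, the requested projection
    only needs `b`; asking the source for `a` as well would read data that
    the very next project immediately drops.

    ```scala
    import spark.implicits._

    val df = spark.read.format("iceberg").option("table", "db.table").load()

    // If `a > 1` is fully pushed down, Spark never evaluates `a` itself,
    // so the projection requested from the reader can be just [b].
    df.filter($"a" > 1).select($"b")
    ```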
    
    
    > The push-down rule may be run more than once if filters are not pushed 
through projections
    
    This looks weird; do you have a query that reproduces the issue?
    
    
    > This updates DataFrameReader to parse locations that do not look like 
paths as table names and pass the result as "database" and "table" keys in v2 
options.
    
    Personally I'd suggest using `spark.read.format("iceberg").option("table",
    "db.table").load()`, as `load` is defined as `def load(paths: String*)`,
    but I think your usage looks better. The communication protocol between
    Spark and a data source is options, so I'd suggest that we just propagate
    the `paths` parameter into the options, and data source implementations
    are then free to interpret the path option however they want, e.g. as
    table and database names.
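
    A self-contained sketch of that protocol (the helper and option keys
    here are hypothetical): Spark would copy the `load()` path into the
    options map under a generic key, and the source decides how to
    interpret it.

    ```scala
    // Hypothetical helper inside a data source implementation: interpret
    // the generic "path" option (or an explicit "table" option) as a
    // database-qualified table identifier.
    def resolveTable(options: Map[String, String]): (Option[String], String) = {
      val ident = options.get("table").orElse(options.get("path"))
        .getOrElse(sys.error("no table or path option"))
      ident.split('.') match {
        case Array(db, table) => (Some(db), table)
        case Array(table)     => (None, table)
        case _                => sys.error(s"invalid identifier: $ident")
      }
    }

    // e.g. resolveTable(Map("path" -> "db.events")) == (Some("db"), "events")
    ```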

