[GitHub] spark pull request #20624: [SPARK-23445] ColumnStat refactoring

juliuszsompolski Thu, 15 Feb 2018 18:07:36 -0800

GitHub user juliuszsompolski opened a pull request:

    https://github.com/apache/spark/pull/20624


    [SPARK-23445] ColumnStat refactoring

    ## What changes were proposed in this pull request?
    
    Refactor ColumnStat to be more flexible.
    
    * Split `ColumnStat` and `CatalogColumnStat`, just as `CatalogStatistics` 
is split from `Statistics`. This decouples how the statistics are stored from 
how they are processed in the query plan. `CatalogColumnStat` keeps `min` and 
`max` as `String`, so it does not depend on dataType information.
    * For `CatalogColumnStat`, parse column names from property names in the 
metastore (using the `KEY_VERSION` property), not from the metastore schema. 
This means that `CatalogColumnStat`s can be created for columns even if the 
schema itself is not stored in the metastore.
    * Make all fields optional. `min`, `max` and `histogram` for columns were 
optional already. Making them all optional is more consistent and gives the 
flexibility to, e.g., drop some fields during transformations if they are 
difficult or impossible to calculate.
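
The split between a storage-side and a plan-side stat, with `min`/`max` stored as strings and every field optional, can be sketched roughly as follows. This is a simplified illustration only: `SimpleType`, `StoredColumnStat`, `PlanColumnStat` and `toPlanStat` are hypothetical names chosen for the example and are not claimed to match the classes or signatures in this pull request.

```scala
// Hypothetical, simplified stand-ins; not the actual Spark classes.
sealed trait SimpleType
case object IntType extends SimpleType
case object StringType extends SimpleType

// Plan-side form: min/max hold typed values, every field is optional.
final case class PlanColumnStat(
    distinctCount: Option[BigInt] = None,
    min: Option[Any] = None,
    max: Option[Any] = None,
    nullCount: Option[BigInt] = None)

// Storage-side form: min/max are kept as strings, so the stat can be
// read and written without knowing the column's data type.
final case class StoredColumnStat(
    distinctCount: Option[BigInt] = None,
    min: Option[String] = None,
    max: Option[String] = None,
    nullCount: Option[BigInt] = None) {

  // Convert to the plan-side form once the column's type is known.
  def toPlanStat(dataType: SimpleType): PlanColumnStat = {
    def parse(s: String): Any = dataType match {
      case IntType    => s.toInt
      case StringType => s
    }
    PlanColumnStat(distinctCount, min.map(parse), max.map(parse), nullCount)
  }
}
```

Because every field is an `Option`, a transformation that cannot compute a value (for example `nullCount`) can simply leave it as `None` instead of inventing a placeholder.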
    
    The added flexibility makes alternative implementations of stats 
possible, and separates stats collection from stats processing and 
estimation in plans.
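
Deriving column names from property names rather than from a schema, as described in the list above, might look roughly like this. The key layout (`colStats.<column>.version`) and the names `ColumnStatKeys`, `Prefix` and `VersionSuffix` are assumptions made for this sketch; the actual property keys used in the metastore are not given in this description.

```scala
// Hypothetical key layout: "<Prefix><columnName><VersionSuffix>".
// The prefix and suffix values are assumptions for illustration.
object ColumnStatKeys {
  val Prefix = "colStats."
  val VersionSuffix = ".version"

  // Collect the names of all columns that have a version entry in the
  // properties map, without consulting any schema.
  def columnNames(props: Map[String, String]): Set[String] =
    props.keys.collect {
      case k if k.startsWith(Prefix) &&
                k.endsWith(VersionSuffix) &&
                k.length > Prefix.length + VersionSuffix.length =>
        k.substring(Prefix.length, k.length - VersionSuffix.length)
    }.toSet
}
```

Keying discovery on a single per-column entry means a column is found whenever its stats were written, whether or not a schema is available alongside them.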
    
    ## How was this patch tested?
    
    Refactored existing tests to work with the refactored `ColumnStat` and 
`CatalogColumnStat`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/juliuszsompolski/apache-spark SPARK-23445

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20624.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20624
    
----
commit cf3602075dcee35494c72975e361b739939079b4
Author: Juliusz Sompolski <julek@...>
Date:   2018-01-19T13:57:46Z

    column stat refactoring

----

