[GitHub] spark pull request #16274: [SPARK-18853][SQL] Project (UnaryNode) is way too...

rxin Tue, 13 Dec 2016 17:37:27 -0800

GitHub user rxin opened a pull request:

    https://github.com/apache/spark/pull/16274


    [SPARK-18853][SQL] Project (UnaryNode) is way too aggressive in estimating statistics

    ## What changes were proposed in this pull request?
    This patch reduces the default estimated number of elements for arrays
    and maps from 100 to 1. The problem with 100 is that when these types are
    nested (e.g. an array of maps), 100 * 100 is used as the default size.
    This sounds like a mere overestimation, which doesn't seem that bad (since
    it is usually better to overestimate than to underestimate). However,
    Project scales its child's size by the ratio (new estimated column size /
    old estimated column size), so an inflated estimate for a column that the
    projection drops ends up in the denominator and turns the overestimation
    into an underestimation of the output size. In this case it is generally
    safer to assume 1 default element.
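
To make the arithmetic concrete, here is a minimal Python sketch of the ratio described above. It is not Spark's implementation: the tuple encoding of types, the 4-byte int size, the 1,000,000-byte child size, and the names `default_size` and `project_estimate` are all assumptions made for illustration, and any per-row overhead is ignored.

```python
# Simplified model of the Project size estimate described in this PR.
# NOT Spark's code: the type encoding, the 4-byte int size, and the
# function names below are assumptions made for illustration only.

def default_size(dtype, default_elems):
    """Estimated bytes for one value of `dtype`, e.g. ("array", elem_type)."""
    kind = dtype[0]
    if kind == "int":
        return 4
    if kind == "array":
        return default_elems * default_size(dtype[1], default_elems)
    if kind == "map":
        key, value = dtype[1], dtype[2]
        return default_elems * (
            default_size(key, default_elems) + default_size(value, default_elems)
        )
    raise ValueError("unknown type: %r" % (dtype,))


def project_estimate(child_bytes, child_cols, output_cols, default_elems):
    """Scale the child's size by (new estimated row size / old estimated row size)."""
    old_row = sum(default_size(c, default_elems) for c in child_cols)
    new_row = sum(default_size(c, default_elems) for c in output_cols)
    return child_bytes * new_row // old_row


INT = ("int",)
NESTED = ("array", ("map", INT, INT))  # an array of maps

child_cols = [INT, NESTED]   # child outputs: id, nested
output_cols = [INT]          # the projection keeps only id
child_bytes = 1_000_000      # hypothetical size estimate for the child

with_100 = project_estimate(child_bytes, child_cols, output_cols, 100)
with_1 = project_estimate(child_bytes, child_cols, output_cols, 1)

print(with_100)  # 49
print(with_1)    # 333333
```

With 100 default elements the nested column is assumed to take 80,000 bytes per row, so dropping it makes the projection look like it shrinks a 1,000,000-byte input to 49 bytes. With 1 default element the same projection is estimated at roughly a third of the input.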
    
    ## How was this patch tested?
    This should be covered by existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rxin/spark SPARK-18853

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16274.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16274
    
----
commit 4d33dd8211fc7279cdb2a90a40ce237838f27e25
Author: Reynold Xin <r...@databricks.com>
Date:   2016-12-14T01:33:45Z

    [SPARK-18853][SQL] Project (UnaryNode) is way too aggressive in estimating statistics

----


