[ 
https://issues.apache.org/jira/browse/IMPALA-7608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers reassigned IMPALA-7608:
-----------------------------------

    Assignee:     (was: Paul Rogers)

> Estimate row count from file size when no stats available
> ---------------------------------------------------------
>
>                 Key: IMPALA-7608
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7608
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 3.0
>            Reporter: Paul Rogers
>            Priority: Major
>
> Impala makes heavy use of stats, which is a good thing. Stats feed into query 
> planning where they allow the planner to choose among a fixed set of 
> alternatives such as: do I put t1 on the build or probe side of a join?
> Because the planner decisions tend to be discrete, we only need enough 
> information to decide whether to do A or B (or, more generally, to choose 
> among a set of choices A, B, C, ... N).
> Often data sizes are vastly different on different paths. Stats help refine 
> these numbers, but much of the information just needs to be in the ballpark: 
> is table t1 larger or smaller than t2? Often, one table is much larger than 
> the other, so even a rough size estimate will force the right decision (put 
> the smaller table on the build side of a join).
> Today, if Impala has no stats, it refuses to even consider table size. 
> Consider the following unit test:
> {noformat}
>     runTest("SELECT a FROM functional.tinytable;", -1);
> {noformat}
> This plans the given query, then verifies that the estimated result 
> cardinality matches the number given. In this case, {{tinytable}} has no 
> stats, so the cardinality is unknown (-1).
> The table turns out to have 3 rows. Suppose I join it to a hypothetical 
> {{hugetable}} of 1 million rows. Without even a guess at cardinality, Impala 
> can't choose a good plan.
> The suggestion is to use table size to estimate row cardinality. Pick an 
> assumed row width, say 100 bytes. Then, estimate row count as {{file size / 
> est. row width}}. This gives a ballpark number that would be plenty good for 
> the planner to choose the proper plan much of the time. 
> Since this is such an easy estimate to make, and will address the occasional 
> case in which stats are not available, it seems a shame to not take advantage 
> of this information.
> In terms of implementation, {{HdfsScanNode.computeCardinalities()}} already 
> uses some extrapolation, if enabled. It can be extended to do the last-ditch 
> extrapolation suggested above if, after the current techniques, the 
> cardinality is still undefined.
> If we apply this simple fix in a prototype build, the new test result is 
> closer to reality:
> {noformat}
>     runTest("SELECT a FROM functional.tinytable;", 1);
> {noformat}
> Given that the fix is so simple, is there any reason not to use the file 
> size when available? Is 100 a reasonable assumed row width? Should this 
> functionality always be on, not just when enabled using the back-end config?
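
The estimate described in the issue can be sketched in a few lines of Java. This is an illustration only, not the code in {{HdfsScanNode.computeCardinalities()}}: the class name, method names, and constants below are hypothetical, the 100-byte row width is the value floated in the issue, and rounding a non-empty file up to at least one row is an assumption made here so that a tiny table is not reported as empty.

```java
// Hypothetical sketch of a last-ditch row-count estimate from file size.
// None of these names come from Impala; they exist only for illustration.
final class RowCountEstimate {
    // Assumed average row width in bytes (the issue suggests 100).
    static final long ASSUMED_ROW_WIDTH_BYTES = 100;
    // Sentinel for "cardinality unknown", matching the -1 in the unit test.
    static final long UNKNOWN = -1;

    // Estimate row count as file size / assumed row width.
    static long estimateFromFileSize(long totalFileBytes, long rowWidthBytes) {
        if (totalFileBytes < 0 || rowWidthBytes <= 0) {
            return UNKNOWN; // no usable size information
        }
        if (totalFileBytes == 0) {
            return 0; // an empty file holds no rows
        }
        // Assumption: a non-empty file holds at least one row.
        return Math.max(1, totalFileBytes / rowWidthBytes);
    }

    // Use the stats-based cardinality when present; otherwise fall back
    // to the file-size estimate.
    static long cardinalityOrFallback(long statsCardinality, long totalFileBytes) {
        if (statsCardinality != UNKNOWN) {
            return statsCardinality;
        }
        return estimateFromFileSize(totalFileBytes, ASSUMED_ROW_WIDTH_BYTES);
    }

    public static void main(String[] args) {
        // A small file with no stats versus a large one with no stats.
        System.out.println(cardinalityOrFallback(UNKNOWN, 40));
        System.out.println(cardinalityOrFallback(UNKNOWN, 100_000_000L));
        // Stats, when available, always win over the estimate.
        System.out.println(cardinalityOrFallback(3, 100_000_000L));
    }
}
```

Even with a crude row width, the two no-stats estimates above differ by orders of magnitude, which is all the planner needs to decide which table goes on the build side of the join.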
