[ https://issues.apache.org/jira/browse/IMPALA-7608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paul Rogers reassigned IMPALA-7608: ----------------------------------- Assignee: (was: Paul Rogers) > Estimate row count from file size when no stats available > --------------------------------------------------------- > > Key: IMPALA-7608 > URL: https://issues.apache.org/jira/browse/IMPALA-7608 > Project: IMPALA > Issue Type: Improvement > Components: Frontend > Affects Versions: Impala 3.0 > Reporter: Paul Rogers > Priority: Major > > Impala makes heavy use of stats, which is a good thing. Stats feed into query > planning where they allow the planner to choose among a fixed set of > alternatives such as: do I put t1 on the build or probe side of a join? > Because the planner decisions tend to be discrete, we only need enough > information to decide whether to do A or B (or, more generally, to choose > among a set of choices A, B, C, ... N). > Often data sizes are vastly different on different paths. Stats help refine > these numbers, but much of the information just needs to be in the ball park: > is table t1 larger or smaller than t2? Often, one table is much larger than > the other, so even a rough size estimate will force the right decision (put > the smaller table on the build side of a join.) > Today, if Impala has no stats, it refuses to even consider table size. > Consider the following unit test: > {noformat} > runTest("SELECT a FROM functional.tinytable;", -1); > {noformat} > This plans the given query, then verifies that the expected result > cardinality is the number given. In this case, {{tinytable}} has no stats. > So, we don't know the cardinality. OK... > The table turns out to be 3 rows. Perhaps I join this to a hypothetical > {{hugetable}} of 1 million rows. Without even a guess at cardinality, Impala > can't choose a good plan. > The suggestion is to use table size to estimate row cardinality. Come up with > some assumed row width, say 100. Then, estimate row count as {{file size / > est. row width}}. This gives a ballpark number that would be plenty good for > the planner to choose the proper plan much of the time. > Since this is such an easy estimate to make, and will address the occasional > case in which stats are not available, it seems a shame to not take advantage > of this information. > In terms of implementation, {{HdfsScanNode.computeCardinalities()}} already > uses some extrapolation, if enabled. It can be extended to do the last-ditch > extrapolation suggested above if, after the current techniques, the > cardinality is still undefined. > If we apply this simple fix in a prototype build, the new test result is > closer to reality: > {noformat} > runTest("SELECT a FROM functional.tinytable;", 1); > {noformat} > Given that the fix is so simple, any reason not to use the file size, when > available? Is 100 a reasonable assumed row width? Should this functionality > always be on, not just when enabled using the back-end config? -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org