[Impala-ASF-CR] IMPALA-2373: Extrapolate row counts for HDFS tables.

Alex Behm (Code Review) Mon, 22 May 2017 23:37:06 -0700

Alex Behm has posted comments on this change.

Change subject: IMPALA-2373: Extrapolate row counts for HDFS tables.
......................................................................

Patch Set 3:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/6840/3/fe/src/main/java/org/apache/impala/catalog/View.java
File fe/src/main/java/org/apache/impala/catalog/View.java:

PS3, Line 118: tableStats_ = new TTableStats(-1);
: tableStats_.setTotal_file_bytes(-1);
> I believe you don't need it now that you put the initialization in Table's
Correct. Done.

http://gerrit.cloudera.org:8080/#/c/6840/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

PS3, Line 617: computeStats(Analyzer analyzer) {
> Significantly better. Thanks
Thanks for the suggestion!

PS3, Line 655: statsNumRows_, totalBytes_
> There are class members, why are you passing them as parameter? Is this fun
I find params easier to reason about than member vars. Removed the params.

PS3, Line 673: are ignored
> "because...". You may want to briefly mention why this is the case.
Done

PS3, Line 683: extrapolatedNumRows_ = tbl_.getExtrapolatedNumRows(totalBytes);
> This doesn't necessarily belong in this function. I'd move it to L654 since
Done. I think this means I need to move the comment around extrapolation to
computeStats(). Did that.

http://gerrit.cloudera.org:8080/#/c/6840/3/tests/metadata/test_explain.py
File tests/metadata/test_explain.py:

PS3, Line 127: 50
> Hm, I would expect this to be '100' since the query doesn't filter any part
This specific test case may seem clear cut, but I don't think the general issue
is that simple.

I'm not sure it is a good idea to special case the "all partitions selected"
case because I think it makes the behavior harder to reason about for users and
for us. Think about the following scenarios:

1. Select all partitions; all partitions have stats
2. Select all partitions; some partitions have stats
3. Select all partitions; no partitions have stats
4. Select some partitions; all selected partitions have stats
5. Select some partitions; some selected partitions have stats
6. Select some partitions; no selected partitions have stats

In the current code there are two cases: Always use the partition stats and
only fall back to table stats when no partition stats are available. That
behavior is easy to understand and explain.

The proposed change might break existing workflows around manually setting
stats. For example, if someone has a workflow that involves setting only the
per-partition row counts (but not the table row counts), then using the
table-level stats may break them. You would basically always need to update
both the partition and table-level stats or else risk inconsistent planning
behavior depending on how many partitions you are selecting.

We can continue thinking about this change, but if we decide to change anything
I think it should be done as part of a separate JIRA and documented very
carefully.

PS3, Line 131:
> nit: extra space
Done

--
To view, visit http://gerrit.cloudera.org:8080/6840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I972c8a03ed70211734631a7dc9085cb33622ebc4
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogian...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2373: Extrapolate row counts for HDFS tables.

Reply via email to