Alex Behm has posted comments on this change. Change subject: IMPALA-2373: Extrapolate row counts for HDFS tables. ......................................................................
Patch Set 3: (7 comments) http://gerrit.cloudera.org:8080/#/c/6840/3/fe/src/main/java/org/apache/impala/catalog/View.java File fe/src/main/java/org/apache/impala/catalog/View.java: PS3, Line 118: tableStats_ = new TTableStats(-1); : tableStats_.setTotal_file_bytes(-1); > I believe you don't need it now that you put the initialization in Table's Correct. Done. http://gerrit.cloudera.org:8080/#/c/6840/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: PS3, Line 617: computeStats(Analyzer analyzer) { > Significantly better. Thanks Thanks for the suggestion! PS3, Line 655: statsNumRows_, totalBytes_ > There are class members, why are you passing them as parameter? Is this fun I find params easier to reason about than member vars. Removed the params. PS3, Line 673: are ignored > "because...". You may want to briefly mention why this is the case. Done PS3, Line 683: extrapolatedNumRows_ = tbl_.getExtrapolatedNumRows(totalBytes); > This doesn't necessarily belong in this function. I'd move it to L654 since Done. I think this means I need to move the comment around extrapolation to computeStats(). Did that. http://gerrit.cloudera.org:8080/#/c/6840/3/tests/metadata/test_explain.py File tests/metadata/test_explain.py: PS3, Line 127: 50 > Hm, I would expect this to be '100' since the query doesn't filter any part This specific test case may seem clear cut, but I don't think the general issue is that simple. I'm not sure it is a good idea to special case the "all partitions selected" case because I think it makes the behavior harder to reason about for users and for us. Think about the following scenarios: 1. Select all partitions; all partitions have stats 2. Select all partitions; some partitions have stats 3. Select all partitions; no partitions have stats 4. Select some partitions; all selected partitions have stats 5. Select some partitions; some selected partitions have stats 6. Select some partitions; no selected partitions have stats In the current code there are two cases: Always use the partition stats and only fall back to table stats when no partition stats are available. That behavior is easy to understand and explain. The proposed change might break existing workflows around manually setting stats. For example, if someone has a workflow that involves setting only the per-partition row counts (but not the table row counts), then using the table-level stats may break them. You would basically always need to update both the partition and table-level stats or else risk inconsistent planning behavior depending on how many partitions you are selecting. We can continue thinking about this change, but if we decide to change anything I think it should be done as part of a separate JIRA and documented very carefully. PS3, Line 131: > nit: extra space Done -- To view, visit http://gerrit.cloudera.org:8080/6840 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I972c8a03ed70211734631a7dc9085cb33622ebc4 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Alex Behm <alex.b...@cloudera.com> Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com> Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogian...@cloudera.com> Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com> Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com> Gerrit-HasComments: Yes