Alex Behm has posted comments on this change.

Change subject: IMPALA-2373: Extrapolate row counts for HDFS tables.
......................................................................


Patch Set 3:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/6840/3/fe/src/main/java/org/apache/impala/catalog/View.java
File fe/src/main/java/org/apache/impala/catalog/View.java:

PS3, Line 118: tableStats_ = new TTableStats(-1);
             :       tableStats_.setTotal_file_bytes(-1);
> I believe you don't need it now that you put the initialization in Table's 
Correct. Done.


http://gerrit.cloudera.org:8080/#/c/6840/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

PS3, Line 617: computeStats(Analyzer analyzer) {
> Significantly better. Thanks
Thanks for the suggestion!


PS3, Line 655: statsNumRows_, totalBytes_
> There are class members, why are you passing them as parameter? Is this fun
I find params easier to reason about than member vars. Removed the params.


PS3, Line 673: are ignored
> "because...". You may want to briefly mention why this is the case.
Done


PS3, Line 683: extrapolatedNumRows_ = tbl_.getExtrapolatedNumRows(totalBytes);
> This doesn't necessarily belong in this function. I'd move it to L654 since
Done. I think this means I need to move the comment around extrapolation to 
computeStats(). Did that.


http://gerrit.cloudera.org:8080/#/c/6840/3/tests/metadata/test_explain.py
File tests/metadata/test_explain.py:

PS3, Line 127: 50
> Hm, I would expect this to be '100' since the query doesn't filter any part
This specific test case may seem clear cut, but I don't think the general issue 
is that simple.

I'm not sure it is a good idea to special case the "all partitions selected" 
case because I think it makes the behavior harder to reason about for users and 
for us. Think about the following scenarios:

1. Select all partitions; all partitions have stats
2. Select all partitions; some partitions have stats
3. Select all partitions; no partitions have stats
4. Select some partitions; all selected partitions have stats
5. Select some partitions; some selected partitions have stats
6. Select some partitions; no selected partitions have stats

In the current code there are two cases: Always use the partition stats and 
only fall back to table stats when no partition stats are available. That 
behavior is easy to understand and explain.

The proposed change might break existing workflows around manually setting 
stats. For example, if someone has a workflow that involves setting only the 
per-partition row counts (but not the table row counts), then using the 
table-level stats may break them. You would basically always need to update 
both the partition and table-level stats or else risk inconsistent planning 
behavior depending on how many partitions you are selecting.

We can continue thinking about this change, but if we decide to change anything 
I think it should be done as part of a separate JIRA and documented very 
carefully.


PS3, Line 131:   
> nit: extra space
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I972c8a03ed70211734631a7dc9085cb33622ebc4
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogian...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com>
Gerrit-HasComments: Yes

Reply via email to