Alex Behm has posted comments on this change.

Change subject: IMPALA-2373: Extrapolate row counts for HDFS tables.
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6840/1/testdata/workloads/functional-query/queries/QueryTest/compute-stats.test
File testdata/workloads/functional-query/queries/QueryTest/compute-stats.test:

Line 20: '2009','1',310,305,1,'24.56KB','NOT CACHED','NOT CACHED','TEXT','false',regex:.*
> what's the reason for the small deviations here, rounding? people might thi
Sorry, just saw this comment. The deviation arises because we use the 
table-level #rows and #bytes to estimate the number of rows for any given 
number of bytes. A strict subset of partitions can have a rows-per-byte ratio 
that differs from the table-wide one, so that math adds inaccuracy. I thought 
this was a design choice we were willing to live with?
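
To make the source of the deviation concrete, here is a minimal sketch of the 
extrapolation described above. It is not the code in the patch: the function 
name extrapolate_rows, the -1 "unknown" convention, and all numbers are 
assumptions chosen for illustration.

```python
def extrapolate_rows(table_rows: int, table_bytes: int, partition_bytes: int) -> int:
    """Estimate the rows in `partition_bytes` from the table-wide rows-per-byte ratio.

    Hypothetical helper: returns -1 when the table-level stats are unusable.
    """
    if table_rows < 0 or table_bytes <= 0:
        return -1
    return round(partition_bytes * table_rows / table_bytes)


# Illustrative table with two partitions of equal row count but different row width.
# Each entry is (actual_rows, size_in_bytes); the values are made up.
partitions = {"p1": (100, 1000), "p2": (100, 3000)}

table_rows = sum(rows for rows, _ in partitions.values())    # 200
table_bytes = sum(size for _, size in partitions.values())   # 4000

estimates = {
    name: extrapolate_rows(table_rows, table_bytes, size)
    for name, (_, size) in partitions.items()
}
# estimates == {"p1": 50, "p2": 150}: each partition is off by 50 rows,
# while the estimate for the whole table (all 4000 bytes) is exactly 200.
```

In this made-up case the estimate is exact for the full table and only 
deviates once a subset of partitions is considered.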

We could potentially avoid this by using the stored stats if we know they are 
accurate. So the problem becomes detecting changes in a partition. That could 
be done by storing a hash over the file names+sizes for each partition at the 
time of computing the stats. That way we can compare the current hash and the 
stored hash to detect underlying changes in the files. It can be done, but it 
adds complexity in various places in the code.
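
A rough sketch of the hashing idea, assuming a partition is represented as a 
list of (file name, size in bytes) pairs. The names partition_fingerprint and 
stats_still_valid, and the choice of SHA-256, are hypothetical and not part of 
the patch.

```python
import hashlib
from typing import Iterable, Tuple


def partition_fingerprint(files: Iterable[Tuple[str, int]]) -> str:
    """Hash the (name, size) pairs of a partition's files.

    Files are sorted first so the result does not depend on listing order.
    """
    digest = hashlib.sha256()
    for name, size in sorted(files):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")                      # separator between name and size
        digest.update(str(size).encode("ascii"))
        digest.update(b"\n")                      # separator between files
    return digest.hexdigest()


def stats_still_valid(stored_fingerprint: str, current_files: Iterable[Tuple[str, int]]) -> bool:
    """True if the files look unchanged since the stats were computed."""
    return stored_fingerprint == partition_fingerprint(current_files)


# At stats-computation time, store the fingerprint next to the row count.
files_at_compute = [("part-0000.txt", 12000), ("part-0001.txt", 13150)]
stored = partition_fingerprint(files_at_compute)

# Later: an appended file or a changed size invalidates the stored stats.
unchanged = stats_still_valid(stored, reversed(files_at_compute))
after_insert = stats_still_valid(stored, files_at_compute + [("part-0002.txt", 500)])
```

One caveat of hashing only names and sizes: a file rewritten in place with the 
same name and the same size would go undetected.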

Should I proceed with this route? Other ideas/suggestions?


-- 
To view, visit http://gerrit.cloudera.org:8080/6840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I972c8a03ed70211734631a7dc9085cb33622ebc4
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogian...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com>
Gerrit-HasComments: Yes
