Alex Behm has posted comments on this change.

Change subject: IMPALA-2373: Extrapolate row counts for HDFS tables.
......................................................................

Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6840/1/testdata/workloads/functional-query/queries/QueryTest/compute-stats.test
File testdata/workloads/functional-query/queries/QueryTest/compute-stats.test:

Line 20: '2009','1',310,305,1,'24.56KB','NOT CACHED','NOT CACHED','TEXT','false',regex:.*
> what's the reason for the small deviations here, rounding? people might thi

Sorry, just saw this comment.

The reason for this deviation is that we use the table-level #rows and
#bytes to estimate the number of rows for any given number of bytes. So
for a strict subset of partitions that math adds inaccuracy. I thought
this was a design choice we were willing to live with?

We could potentially avoid this by using the stored stats if we know
they are accurate. So the problem becomes detecting changes in a
partition. That could be done by storing a hash over the file
names+sizes for each partition at the time of computing the stats. That
way we can compare the current hash and the stored hash to detect
underlying changes in the files.

It can be done but adds complexity at various places in the code.
Should I proceed with this route? Other ideas/suggestions?

-- 
To view, visit http://gerrit.cloudera.org:8080/6840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I972c8a03ed70211734631a7dc9085cb33622ebc4
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogian...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com>
Gerrit-HasComments: Yes
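
For illustration only, the two ideas in the comment above (table-level extrapolation, and a per-partition hash over file names and sizes to decide when stored stats can still be trusted) might be sketched as follows. This is a rough Python sketch, not the code in this patch: every function name is hypothetical, SHA-256 is an assumed choice of hash, and the numbers in the usage note are made up.

```python
import hashlib


def extrapolate_rows(table_rows, table_bytes, scan_bytes):
    """Estimate rows in scan_bytes from the table-wide rows-per-byte ratio.

    Returns -1 when the table-level stats are unusable (assumed convention).
    """
    if table_rows < 0 or table_bytes <= 0:
        return -1
    # Multiply before dividing so exact ratios do not pick up float error.
    return round(table_rows * scan_bytes / table_bytes)


def partition_fingerprint(files):
    """Hash over (file name, size) pairs, independent of listing order."""
    digest = hashlib.sha256()
    for name, size in sorted(files):
        # NUL and newline separators keep adjacent fields from running together.
        digest.update(f"{name}\0{size}\n".encode("utf-8"))
    return digest.hexdigest()


def estimate_partition_rows(stored_rows, stored_fingerprint, files,
                            table_rows, table_bytes):
    """Use the stored row count if the files are unchanged, else extrapolate."""
    if stored_fingerprint == partition_fingerprint(files):
        return stored_rows
    current_bytes = sum(size for _, size in files)
    return extrapolate_rows(table_rows, table_bytes, current_bytes)
```

With made-up numbers: a table of 1000 rows in 50000 bytes has one partition holding 300 rows in 20000 bytes. Pure extrapolation gives 1000 * 20000 / 50000 = 400 rows for that partition, which is the kind of deviation discussed above. With a matching fingerprint the stored count of 300 is returned instead, and extrapolation is used only once the file list or sizes have changed.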