[Impala-ASF-CR] IMPALA-2373: Extrapolate row counts for HDFS tables.

Alex Behm (Code Review) Mon, 15 May 2017 22:46:54 -0700

Alex Behm has posted comments on this change.

Change subject: IMPALA-2373: Extrapolate row counts for HDFS tables.
......................................................................

Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6840/1/testdata/workloads/functional-query/queries/QueryTest/compute-stats.test
File testdata/workloads/functional-query/queries/QueryTest/compute-stats.test:

Line 20: '2009','1',310,305,1,'24.56KB','NOT CACHED','NOT CACHED','TEXT','false',regex:.*
> what's the reason for the small deviations here, rounding? people might thi
Sorry, just saw this comment. The deviation comes from using the table-level
#rows and #bytes to estimate the number of rows for any given number of bytes.
For a strict subset of partitions, whose rows-per-byte ratio can differ from
the table-wide ratio, that math adds inaccuracy. I thought this was a design
choice we were willing to live with?
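
To make the arithmetic concrete, here is a minimal sketch of the extrapolation
described above. It is not the code from the patch: the function name and the
partition sizes and row counts are hypothetical, chosen only to show how a
table-wide rows-per-byte ratio can be off for a single partition.

```python
def extrapolate_rows(table_rows: int, table_bytes: int, scanned_bytes: int) -> int:
    """Estimate the row count for scanned_bytes from table-level stats.

    Hypothetical helper: scales the table-wide rows-per-byte ratio to the
    given number of bytes. Returns -1 when the table stats are unusable.
    """
    if table_rows < 0 or table_bytes <= 0:
        return -1
    return round(scanned_bytes * table_rows / table_bytes)


# Hypothetical table with two partitions of different row widths.
partitions = {
    "p1": {"rows": 310, "bytes": 25000},
    "p2": {"rows": 300, "bytes": 25800},
}
table_rows = sum(p["rows"] for p in partitions.values())    # 610
table_bytes = sum(p["bytes"] for p in partitions.values())  # 50800

for name, p in partitions.items():
    estimate = extrapolate_rows(table_rows, table_bytes, p["bytes"])
    print(name, "stored:", p["rows"], "extrapolated:", estimate)
```

With these made-up numbers the estimate for the whole table is exact, while
each individual partition is off by a few rows, which is the kind of small
deviation the test output shows.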

We could potentially avoid this by using the stored stats when we know they
are accurate. The problem then becomes detecting changes in a partition. That
could be done by storing a hash over the file names and sizes of each partition
at the time the stats are computed. Comparing the current hash against the
stored hash would then reveal underlying changes in the files. It can be done,
but it adds complexity in various places in the code.
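
A rough sketch of the hashing idea, to show what would have to be stored and
compared. Everything here is an assumption for illustration: the function
names, the choice of SHA-256, and the (name, size) tuple layout are not from
the patch.

```python
import hashlib
from typing import Iterable, Tuple


def partition_fingerprint(files: Iterable[Tuple[str, int]]) -> str:
    """Hash the (file name, size in bytes) pairs of one partition.

    The pairs are sorted first so the result does not depend on the order
    in which the file system lists the files.
    """
    h = hashlib.sha256()
    for name, size in sorted(files):
        h.update(name.encode("utf-8"))
        h.update(b"\0")  # separator so name and size cannot run together
        h.update(str(size).encode("ascii"))
        h.update(b"\n")
    return h.hexdigest()


def stored_stats_usable(stored_fingerprint: str,
                        current_files: Iterable[Tuple[str, int]]) -> bool:
    """True if the partition's files look unchanged since stats were computed."""
    return stored_fingerprint == partition_fingerprint(current_files)


# At stats-computation time: remember the fingerprint next to the row count.
at_compute_stats = [("part-0.txt", 12000), ("part-1.txt", 13000)]
stored = partition_fingerprint(at_compute_stats)

# Later: a file was appended to, so the stored row count is no longer trusted.
print(stored_stats_usable(stored, [("part-1.txt", 13000), ("part-0.txt", 12000)]))
print(stored_stats_usable(stored, [("part-0.txt", 12000), ("part-1.txt", 13500)]))
```

Note that a hash over names and sizes alone would not notice a file rewritten
in place with the same name and size.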

Should I proceed with this route? Other ideas/suggestions?

--
To view, visit http://gerrit.cloudera.org:8080/6840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I972c8a03ed70211734631a7dc9085cb33622ebc4
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogian...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com>
Gerrit-HasComments: Yes
