[ https://issues.apache.org/jira/browse/IMPALA-8458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844229#comment-16844229 ]
Todd Lipcon commented on IMPALA-8458:
-------------------------------------

I paged this code back into my head and remember why we had the weird workaround. The hack there was to deal with our odd handling of boolean stats. The LocalCatalog flow is:

- catalogd fetches stats from Hive and converts them to our own internal ColumnStats object via ColumnStats.update:
{code}
BooleanColumnStatsData boolStats = statsData.getBooleanStats();
numNulls_ = boolStats.getNumNulls();
numDistinctValues_ = (numNulls_ > 0) ? 3 : 2;
{code}
- impalad fetches stats from catalogd in CatalogdMetaProvider. This interface was originally built for the "fetch directly from HMS" code path, so in this case the wire protocol requires catalogd to send back the Hive ColumnStatisticsObj type. So, we call ColumnStats.createHiveColStatsData() to convert the bool stats back to the Hive type:
{code}
case BOOLEAN:
  colStatsData.setBooleanStats(new BooleanColumnStatsData(1, -1, numNulls));
  break;
{code}

When this Hive object gets to the impalad, it gets converted _back_ to Impala's ColumnStats type with the first code snippet above.

This Hive->Impala->Hive->Impala round trip is somewhat lossy, particularly for bools, since Hive stores numFalse/numTrue whereas we want an NDV. I think we also end up with "lossiness" in the case that we didn't find compatible stats in the HMS, since we don't really have a clear distinction between "we have stats with unknown NDV" and "we don't have stats at all".

I'll see if I can clean this up. Perhaps the easiest route is to have the wire protocol for fetch-from-catalogd just use the Impala-internal stats object.

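To make the lossiness concrete, here is a minimal, self-contained Java sketch of the boolean round trip. HiveBoolStats, ImpalaColStats, and BoolStatsRoundTrip are hypothetical stand-ins invented for illustration, not the real Hive or Impala classes; only the bodies of fromHive and toHive follow the two snippets quoted in the comment. It is a simplified model of the flow, not the actual catalog code.

```java
final class BoolStatsRoundTrip {
    // Hypothetical stand-in for Hive's boolean column stats (true/false/null counts).
    static final class HiveBoolStats {
        final long numTrues;
        final long numFalses;
        final long numNulls;

        HiveBoolStats(long numTrues, long numFalses, long numNulls) {
            this.numTrues = numTrues;
            this.numFalses = numFalses;
            this.numNulls = numNulls;
        }
    }

    // Hypothetical stand-in for Impala's internal stats: an NDV plus a null count.
    static final class ImpalaColStats {
        final long numDistinctValues;
        final long numNulls;

        ImpalaColStats(long numDistinctValues, long numNulls) {
            this.numDistinctValues = numDistinctValues;
            this.numNulls = numNulls;
        }
    }

    // Follows the ColumnStats.update snippet: NDV is derived from numNulls alone,
    // so the true/false counts are dropped here.
    static ImpalaColStats fromHive(HiveBoolStats h) {
        long numNulls = h.numNulls;
        long ndv = (numNulls > 0) ? 3 : 2;
        return new ImpalaColStats(ndv, numNulls);
    }

    // Follows the createHiveColStatsData snippet: the true/false counts are
    // placeholder values (1, -1) because the internal object never stored them.
    static HiveBoolStats toHive(ImpalaColStats s) {
        return new HiveBoolStats(1, -1, s.numNulls);
    }

    // Hive -> Impala -> Hive, as in the catalogd half of the flow.
    static HiveBoolStats roundTrip(HiveBoolStats original) {
        return toHive(fromHive(original));
    }

    public static void main(String[] args) {
        HiveBoolStats after = roundTrip(new HiveBoolStats(700, 300, 5));
        System.out.println("numTrues=" + after.numTrues
            + " numFalses=" + after.numFalses
            + " numNulls=" + after.numNulls);

        // In this model, "no stats" (NDV and numNulls both -1) also comes back
        // from the round trip with a concrete NDV.
        ImpalaColStats unknown = new ImpalaColStats(-1, -1);
        System.out.println("ndv=" + fromHive(toHive(unknown)).numDistinctValues);
    }
}
```

In this model the original true/false counts never survive the trip, and an object that started with an unknown NDV is indistinguishable from one that had real stats. The real code path may guard against the second case; the sketch only shows what the two quoted snippets do when composed.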
> Can't set numNull/maxSize/avgSize column stats with local catalog without also setting NDV
> ------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-8458
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8458
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Catalog
>    Affects Versions: Impala 3.3.0
>            Reporter: Tim Armstrong
>            Assignee: Todd Lipcon
>            Priority: Critical
>
> Repro:
> {noformat}
> [tarmstrong-box2.ca.cloudera.com:21000] default> create table test_stats2(s string);
> +-------------------------+
> | summary                 |
> +-------------------------+
> | Table has been created. |
> +-------------------------+
> Fetched 1 row(s) in 0.36s
> [tarmstrong-box2.ca.cloudera.com:21000] default> show column stats test_stats2;
> +--------+--------+------------------+--------+----------+----------+
> | Column | Type   | #Distinct Values | #Nulls | Max Size | Avg Size |
> +--------+--------+------------------+--------+----------+----------+
> | s      | STRING | -1               | -1     | -1       | -1       |
> +--------+--------+------------------+--------+----------+----------+
> Fetched 1 row(s) in 0.02s
> [tarmstrong-box2.ca.cloudera.com:21000] default> alter table test_stats2 set column stats s('avgSize'='1234');
> +-----------------------------------------+
> | summary                                 |
> +-----------------------------------------+
> | Updated 0 partition(s) and 1 column(s). |
> +-----------------------------------------+
> Fetched 1 row(s) in 0.14s
> [tarmstrong-box2.ca.cloudera.com:21000] default> show column stats test_stats2;
> +--------+--------+------------------+--------+----------+----------+
> | Column | Type   | #Distinct Values | #Nulls | Max Size | Avg Size |
> +--------+--------+------------------+--------+----------+----------+
> | s      | STRING | -1               | -1     | -1       | -1       |
> +--------+--------+------------------+--------+----------+----------+
> Fetched 1 row(s) in 0.02s
> [tarmstrong-box2.ca.cloudera.com:21000] default> alter table test_stats2 set column stats s('maxSize'='1234');
> +-----------------------------------------+
> | summary                                 |
> +-----------------------------------------+
> | Updated 0 partition(s) and 1 column(s). |
> +-----------------------------------------+
> Fetched 1 row(s) in 0.10s
> [tarmstrong-box2.ca.cloudera.com:21000] default> show column stats test_stats2;
> +--------+--------+------------------+--------+----------+----------+
> | Column | Type   | #Distinct Values | #Nulls | Max Size | Avg Size |
> +--------+--------+------------------+--------+----------+----------+
> | s      | STRING | -1               | -1     | -1       | -1       |
> +--------+--------+------------------+--------+----------+----------+
> Fetched 1 row(s) in 0.02s
> [tarmstrong-box2.ca.cloudera.com:21000] default> invalidate metadata test_stats2;
> Fetched 0 row(s) in 0.03s
> [tarmstrong-box2.ca.cloudera.com:21000] default> show column stats test_stats2;
> Query: show column stats test_stats2
> +--------+--------+------------------+--------+----------+----------+
> | Column | Type   | #Distinct Values | #Nulls | Max Size | Avg Size |
> +--------+--------+------------------+--------+----------+----------+
> | s      | STRING | -1               | -1     | -1       | -1       |
> +--------+--------+------------------+--------+----------+----------+
> Fetched 1 row(s) in 0.07s
> {noformat}
> I expected that the updates would take effect. Weirdly it doesn't happen for NDV and NULLS:
> {noformat}
> [tarmstrong-box2.ca.cloudera.com:21000] default> alter table test_stats2 set column stats s('numDVs'='1234','numNulls'='12345');
> Query: alter table test_stats2 set column stats s('numDVs'='1234','numNulls'='12345')
> +-----------------------------------------+
> | summary                                 |
> +-----------------------------------------+
> | Updated 0 partition(s) and 1 column(s). |
> +-----------------------------------------+
> Fetched 1 row(s) in 0.12s
> [tarmstrong-box2.ca.cloudera.com:21000] default> show column stats test_stats2;
> Query: show column stats test_stats2
> +--------+--------+------------------+--------+----------+----------+
> | Column | Type   | #Distinct Values | #Nulls | Max Size | Avg Size |
> +--------+--------+------------------+--------+----------+----------+
> | s      | STRING | 1234             | 12345  | -1       | -1       |
> +--------+--------+------------------+--------+----------+----------+
> Fetched 1 row(s) in 0.02s
> {noformat}