[ https://issues.apache.org/jira/browse/TRAFODION-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674873#comment-15674873 ]

David Wayne Birdsall commented on TRAFODION-2359:
-------------------------------------------------

After debugging the issue, I have come to a different conclusion. The logic in 
question is in HSGlobalsClass::FixSamplingCounts. That method contains a 
heuristic: if it comes upon an interval where the computed UEC and rowcount 
are equal, it bypasses the estimators and instead scales up the UEC in the same 
way as the rowcount. So, the observation that the last-interval UEC was being 
set to the ratio of table size to sample size was spot on. But the guess about 
the estimators was incorrect; our problem is that we didn't use them at all!
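To illustrate the behavior described above, here is a hypothetical sketch (not the actual Trafodion C++ code; all names, signatures, and the estimator callback are illustrative assumptions) of the scaling step in FixSamplingCounts:

```python
# Hypothetical sketch of the scaling logic described above (NOT the real
# HSGlobalsClass::FixSamplingCounts code). Each interval's sampled rowcount
# is scaled up by the table-to-sample ratio; when sampled UEC == sampled
# rowcount, the heuristic applies the same linear scale-up to UEC instead
# of calling the UEC estimators.

def fix_sampling_counts(intervals, table_rows, sample_rows, estimate_uec):
    """intervals: list of (sample_rowcount, sample_uec) pairs.
    estimate_uec: illustrative stand-in for the real UEC estimators;
    returns an estimated full-table UEC for the interval.
    Returns a list of (scaled_rowcount, scaled_uec) pairs."""
    ratio = table_rows / sample_rows
    result = []
    for rc, uec in intervals:
        scaled_rc = rc * ratio
        if uec == rc:
            # Heuristic: the column looks unique within this interval, so
            # scale UEC linearly. Correct for a truly unique column, but a
            # lone sampled row (rc == uec == 1) gets its UEC inflated to
            # roughly the table/sample ratio -- the symptom in this issue.
            scaled_uec = uec * ratio
        else:
            scaled_uec = estimate_uec(rc, uec, ratio)
        # UEC can never exceed rowcount.
        result.append((scaled_rc, min(scaled_uec, scaled_rc)))
    return result
```

With a 10:1 table-to-sample ratio and a last interval holding a single sampled row, this sketch produces a last-interval UEC of about 10 even though the true UEC is 1, matching the observed symptom.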

I will guess that this heuristic is there for the case where we are 
constructing histograms on a column with unique values. For such a column, the 
estimators would most likely consistently underestimate UEC, while the 
heuristic gives correct results.

We may want to put in a second-order heuristic: if we used the UEC == rowcount 
shortcut on some interval, we should check how often we are doing it, and also 
whether the rowcounts for such intervals are abnormally small. If so, we should 
make a second pass and redo the estimation using the estimators.
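The proposed second-order check could look something like the following hypothetical sketch (function name, thresholds, and the decision rules are all illustrative assumptions, not an agreed design):

```python
# Hypothetical sketch of the proposed second-order heuristic: after the
# first pass, decide whether to redo estimation with the estimators for
# intervals where the UEC == rowcount shortcut fired. The thresholds are
# illustrative placeholders only.

def needs_second_pass(intervals, small_rowcount_threshold=2,
                      max_shortcut_fraction=0.5):
    """intervals: list of (sample_rowcount, sample_uec) pairs.
    Returns True if the shortcut fired only on intervals whose sampled
    rowcounts are suspiciously small (the 'one lucky row' case)."""
    shortcut = [(rc, uec) for rc, uec in intervals if uec == rc]
    if not shortcut:
        return False  # shortcut never fired; nothing to redo
    # Fired on a large fraction of intervals: plausibly a genuinely unique
    # column, where the linear scale-up is the right answer -- no redo.
    if len(shortcut) / len(intervals) > max_shortcut_fraction:
        return False
    # Fired only on sparsely sampled intervals: likely spurious, so a
    # second pass with the estimators is warranted.
    return all(rc <= small_rowcount_threshold for rc, _ in shortcut)
```

Under this sketch, a histogram where the shortcut fired on most intervals (a unique column) is left alone, while one where it fired only on a single-row last interval triggers re-estimation.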

> Unusually large UEC in last histogram interval for divisioning column
> ---------------------------------------------------------------------
>
>                 Key: TRAFODION-2359
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2359
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: sql-cmp
>    Affects Versions: any
>         Environment: All
>            Reporter: David Wayne Birdsall
>            Assignee: David Wayne Birdsall
>
> On a large table with a divisioning column, it was noticed that occasionally 
> UPDATE STATISTICS using sampling produces an unusually large UEC in the last 
> interval of the histogram. The circumstances under which this happened were 
> that the last divisioning value had just one row and, by luck of the draw, 
> that row happened to be included in the sample.
> This can be reproduced by manually creating such a sample (and using CQD 
> USTAT_SAMPLE_TABLE_NAME to cause UPDATE STATS to use this sample). For 
> example, run the following three scripts:
> Script ddl.sql:
> drop table if exists weird_table;
> CREATE TABLE weird_table
>   (
>     TIME_STAMP TIMESTAMP(6) NO DEFAULT NOT NULL NOT
>       DROPPABLE NOT SERIALIZED
>   , GROUPID LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE
>       NOT SERIALIZED
>   , COMPANYID LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE
>       NOT SERIALIZED
>   , CHECKSUM LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE
>       NOT SERIALIZED
>   , PRIMARY KEY (COMPANYID ASC, GROUPID ASC, TIME_STAMP DESC, CHECKSUM ASC)
>   )
>   SALT USING 4 PARTITIONS
>        ON (TIME_STAMP)
>   DIVISION BY (DATE_TRUNC('DAY',TIME_STAMP) DESC
>      NAMED AS ("_DIVISION_1_"))
>  ATTRIBUTES ALIGNED FORMAT
>   HBASE_OPTIONS
>   (
>     DATA_BLOCK_ENCODING = 'FAST_DIFF',
>     COMPRESSION = 'GZ',
>     MEMSTORE_FLUSH_SIZE = '1073741824'
>   )
> ;
> drop table if exists weird_sample;
> CREATE TABLE weird_sample
>   (
>     TIME_STAMP TIMESTAMP(6) NO DEFAULT NOT NULL NOT
>       DROPPABLE NOT SERIALIZED
>   , GROUPID LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE
>       NOT SERIALIZED
>   , COMPANYID LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE
>       NOT SERIALIZED
>   , CHECKSUM LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE
>       NOT SERIALIZED
>   , PRIMARY KEY (COMPANYID ASC, GROUPID ASC, TIME_STAMP DESC, CHECKSUM ASC)
>   )
>   SALT USING 4 PARTITIONS
>        ON (TIME_STAMP)
>   DIVISION BY (DATE_TRUNC('DAY',TIME_STAMP) DESC
>      NAMED AS ("_DIVISION_1_"))
>  ATTRIBUTES ALIGNED FORMAT
>   HBASE_OPTIONS
>   (
>     DATA_BLOCK_ENCODING = 'FAST_DIFF',
>     COMPRESSION = 'GZ',
>     MEMSTORE_FLUSH_SIZE = '1073741824'
>   )
> ;
> Script populate.sql:
> -- populate the table in such a way that the divisioning column
> -- has a few very frequent values but just one infrequent value
> upsert using load into weird_table
> select timestamp '1970-01-01 00:00:00' + 
> cast(c0+10*c1+100*c2+1000*c3+10000*c4 as interval minute(8)),
>        3*c0 + 7*c4,
>        c4,
>        11*c2 + 2*c3 + 3*c1
> from (values(1)) t
> transpose 0,1,2,3,4,5,6,7,8,9 as c0
> transpose 0,1,2,3,4,5,6,7,8,9 as c1
> transpose 0,1,2,3,4,5,6,7,8,9 as c2
> transpose 0,1,2,3,4,5,6,7,8,9 as c3
> transpose 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19 as c4;
> insert into weird_table select max(TIME_STAMP) + interval '1' day, 999, 999, 
> 999 from weird_table;
> upsert using load into weird_sample select * from weird_table sample random 
> 10 percent;
> upsert into weird_sample select * from weird_table where time_stamp = (select 
> max(time_stamp) from weird_table);
> Script repro.sql:
> cqd USTAT_SAMPLE_TABLE_NAME 'trafodion.sch.weird_sample' ; 
> -- below, use the number of rows in weird_sample
> update statistics for table weird_table on "_DIVISION_1_" sample 20080 rows; 
> showstats for table weird_table on "_DIVISION_1_" detail;
> In this particular example, the "showstats" showed a UEC of 9 for the last 
> interval. More dramatic values can be obtained by increasing the table size 
> relative to sample size; from the examples I have seen the UEC of the last 
> interval seems to be about equal to the ratio of table size to sample size.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)