Hi Dave, I am open to set the default to On for the CQD.
But before taking the action, we probably should run a performance test to make sure 1), the increased time is negligible now, and b), no bad plans because of the extra stats. Thanks --Qifan On Fri, Feb 12, 2016 at 12:19 PM, Dave Birdsall <[email protected]> wrote: > Hi Qifan, > > Thanks! So, we were making a trade-off of reducing the number of > multi-column histograms on salted tables gaining a reduction in UPDATE > STATS > overhead. > > Since that time, some improvements have been made to UPDATE STATS elapsed > time particularly for large tables. (I'm thinking primarily of JIRA > TRAFODION-1740, where we increase the degree of parallelism used by the > sampling query.) Perhaps this mitigates all or in part the original > concern? > > I'd like to propose leaving the CQD in the code, but changing the default > to > 'ON'. Does this sound reasonable? > > Hi all, > > A bit more background for others new to this issue. > > What we are describing here is the default behavior of UPDATE STATISTICS ON > EVERY KEY and UPDATE STATISTICS ON EVERY COLUMN. These two ON clauses are > syntactic sugars. Without salting (and before salting), the behavior of ON > EVERY KEY was: create a single-column histogram on each key column, and > multi-column histograms for prefixes of the key (so, the first two columns, > the first three, the first four, and up to five). ON EVERY COLUMN was > similar: It does a single column histogram on every column, along with the > key prefixes that ON EVERY KEY provides. > > For salted tables, this behavior was changed to only do a single > multi-column histogram on the entire primary key. > > Now, one could still get the old behavior by either flipping a CQD, or by > specifying the histograms directly (e.g., UPDATE STATSITICS ON > ("_SALT_",A), > ("_SALT_",A,B), etc.) > > So we are only talking about what semantics to associate with a syntactic > sugar. (A very convenient syntactic sugar though.) > > The trade-off made was to reduce UPDATE STATS overhead for salted tables by > doing fewer multi-column histograms by default, at the price of possibly > bad > query plans (which is the complaint of JIRA TRAFODION-1467). The question > here is, was this the right trade-off? And now that UPDATE STATS elapsed > times have improved for large tables, should we trade-off the other way? > That's the proposal I make above. > > Thanks, > > Dave > > -----Original Message----- > From: Qifan Chen [mailto:[email protected]] > Sent: Friday, February 12, 2016 9:15 AM > To: dev <[email protected]> > Subject: Re: Question: Multi-column histograms for key columns on salted > tables > > Hi Dave, > > The intension was to reduce the load on US and Barry and I figured the best > way is to use the CQD to control it. > > thanks --Qifan > > On Fri, Feb 12, 2016 at 11:08 AM, Dave Birdsall <[email protected]> > wrote: > > > Hi, > > > > > > > > Reaching out to the development community to learn a little history. > > > > > > > > For a single-partitioned table T, with primary key columns A, B, C, if > > I do an UPDATE STATISTICS ON EVERY KEY (or ON EVERY COLUMN), I will > > get multi-column histograms for prefixes of the key. So, I’ll get a > > multi-column histogram for (A,B) and for (A,B,C). > > > > > > > > However, if I now salt that table, then an additional column, > > “_SALT_”, is prepended to the primary key. > > > > > > > > And the behavior of UPDATE STATISTICS is different. Instead of getting > > multi-column histograms for (“_SALT_”,A), (“_SALT_”,A,B), and > > (“_SALT_”,A,B,C), I get only (“_SALT_”,A,B,C). > > > > > > > > Indeed, JIRA TRAFODION-1467 complains about this very behavior. See > > https://issues.apache.org/jira/browse/TRAFODION-1467. > > > > > > > > Now, I was looking into fixing this JIRA, and I found the following > > code in > > sql/ustat/hs_parser.cpp: > > > > > > > > // For salted table, generate only the longest MC for the key > > (subject > > > > // to max cols determined above) unless a cqd is set to gen > > all MCs of > > > > // allowable sizes. > > > > if > > (CmpCommon::getDefault(USTAT_ADD_SALTED_KEY_PREFIXES_FOR_MC) == DF_OFF > > && > > > > hs_globals->objDef->getColNum("_SALT_", FALSE) >= 0) > > > > minMCGroupSz = numKeys; > > > > > > > > And indeed this CQD has the default value of ‘OFF’. > > > > > > > > So it appears this behavior is intentional. > > > > > > > > My question is, why? On the surface it seems a bit arbitrary to me. > > > > > > > > Depending on the answer, I’ll address the JIRA in one of the following > > ways: > > > > > > > > 1. Say, it’s working as intended. And if you want all the > > multi-column histograms, set CQD USTAT_ADD_SALTED_KEY_PREFIXES_FOR_MC > ‘ON’ > > prior to issuing the UPDATE STATISTICS command. > > > > 2. Change the default to ‘ON’. And you’ll have to set CQD > > USTAT_ADD_SALTED_KEY_PREFIXES_FOR_MC ‘OFF’ to reduce the number of > > multi-column histograms on key prefixes in salted tables. > > > > 3. Remove the CQD altogether if there was no good reason for it. > > Which would give us the ‘ON’ behavior all the time. > > > > > > > > Dave > > > > > > -- > Regards, --Qifan > -- Regards, --Qifan
