Re: Question: Multi-column histograms for key columns on salted tables

Qifan Chen Fri, 12 Feb 2016 12:53:47 -0800

Hi Dave,

I am open to set the default to On for the CQD.


But before taking the action, we probably should run a performance test to
make sure 1), the increased time is negligible now, and b), no bad plans
because of the extra stats.

Thanks --Qifan


On Fri, Feb 12, 2016 at 12:19 PM, Dave Birdsall <[email protected]>
wrote:

> Hi Qifan,
>
> Thanks! So, we were making a trade-off of reducing the number of
> multi-column histograms on salted tables gaining a reduction in UPDATE
> STATS
> overhead.
>
> Since that time, some improvements have been made to UPDATE STATS elapsed
> time particularly for large tables. (I'm thinking primarily of JIRA
> TRAFODION-1740, where we increase the degree of parallelism used by the
> sampling query.) Perhaps this mitigates all or in part the original
> concern?
>
> I'd like to propose leaving the CQD in the code, but changing the default
> to
> 'ON'. Does this sound reasonable?
>
> Hi all,
>
> A bit more background for others new to this issue.
>
> What we are describing here is the default behavior of UPDATE STATISTICS ON
> EVERY KEY and UPDATE STATISTICS ON EVERY COLUMN. These two ON clauses are
> syntactic sugars. Without salting (and before salting), the behavior of ON
> EVERY KEY was: create a single-column histogram on each key column, and
> multi-column histograms for prefixes of the key (so, the first two columns,
> the first three, the first four, and up to five). ON EVERY COLUMN was
> similar: It does a single column histogram on every column, along with the
> key prefixes that ON EVERY KEY provides.
>
> For salted tables, this behavior was changed to only do a single
> multi-column histogram on the entire primary key.
>
> Now, one could still get the old behavior by either flipping a CQD, or by
> specifying the histograms directly (e.g., UPDATE STATSITICS ON
> ("_SALT_",A),
> ("_SALT_",A,B), etc.)
>
> So we are only talking about what semantics to associate with a syntactic
> sugar. (A very convenient syntactic sugar though.)
>
> The trade-off made was to reduce UPDATE STATS overhead for salted tables by
> doing fewer multi-column histograms by default, at the price of possibly
> bad
> query plans (which is the complaint of JIRA TRAFODION-1467). The question
> here is, was this the right trade-off? And now that UPDATE STATS elapsed
> times have improved for large tables, should we trade-off the other way?
> That's the proposal I make above.
>
> Thanks,
>
> Dave
>
> -----Original Message-----
> From: Qifan Chen [mailto:[email protected]]
> Sent: Friday, February 12, 2016 9:15 AM
> To: dev <[email protected]>
> Subject: Re: Question: Multi-column histograms for key columns on salted
> tables
>
> Hi Dave,
>
> The intension was to reduce the load on US and Barry and I figured the best
> way is to use the CQD to control it.
>
> thanks --Qifan
>
> On Fri, Feb 12, 2016 at 11:08 AM, Dave Birdsall <[email protected]>
> wrote:
>
> > Hi,
> >
> >
> >
> > Reaching out to the development community to learn a little history.
> >
> >
> >
> > For a single-partitioned table T, with primary key columns A, B, C, if
> > I do an UPDATE STATISTICS ON EVERY KEY (or ON EVERY COLUMN), I will
> > get multi-column histograms for prefixes of the key. So, I’ll get a
> > multi-column histogram for (A,B) and for (A,B,C).
> >
> >
> >
> > However, if I now salt that table, then an additional column,
> > “_SALT_”, is prepended to the primary key.
> >
> >
> >
> > And the behavior of UPDATE STATISTICS is different. Instead of getting
> > multi-column histograms for (“_SALT_”,A), (“_SALT_”,A,B), and
> > (“_SALT_”,A,B,C), I get only (“_SALT_”,A,B,C).
> >
> >
> >
> > Indeed, JIRA TRAFODION-1467 complains about this very behavior. See
> > https://issues.apache.org/jira/browse/TRAFODION-1467.
> >
> >
> >
> > Now, I was looking into fixing this JIRA, and I found the following
> > code in
> > sql/ustat/hs_parser.cpp:
> >
> >
> >
> >         // For salted table, generate only the longest MC for the key
> > (subject
> >
> >         // to max cols determined above) unless a cqd is set to gen
> > all MCs of
> >
> >         // allowable sizes.
> >
> >         if
> > (CmpCommon::getDefault(USTAT_ADD_SALTED_KEY_PREFIXES_FOR_MC) == DF_OFF
> > &&
> >
> >             hs_globals->objDef->getColNum("_SALT_", FALSE) >= 0)
> >
> >           minMCGroupSz = numKeys;
> >
> >
> >
> > And indeed this CQD has the default value of ‘OFF’.
> >
> >
> >
> > So it appears this behavior is intentional.
> >
> >
> >
> > My question is, why? On the surface it seems a bit arbitrary to me.
> >
> >
> >
> > Depending on the answer, I’ll address the JIRA in one of the following
> > ways:
> >
> >
> >
> > 1.       Say, it’s working as intended. And if you want all the
> > multi-column histograms, set CQD USTAT_ADD_SALTED_KEY_PREFIXES_FOR_MC
> ‘ON’
> > prior to issuing the UPDATE STATISTICS command.
> >
> > 2.       Change the default to ‘ON’. And you’ll have to set CQD
> > USTAT_ADD_SALTED_KEY_PREFIXES_FOR_MC ‘OFF’ to reduce the number of
> > multi-column histograms on key prefixes in salted tables.
> >
> > 3.       Remove the CQD altogether if there was no good reason for it.
> > Which would give us the ‘ON’ behavior all the time.
> >
> >
> >
> > Dave
> >
>
>
>
> --
> Regards, --Qifan
>



-- 
Regards, --Qifan

Re: Question: Multi-column histograms for key columns on salted tables

Reply via email to