Re: [PATCH] TODO “Allow LISTEN on patterns”

2024-07-28 Thread Alexander Cheshev
Hi Emanuel,

> I did a review on the new patch version and I observed that the identifier
> passed to the LISTEN command is handled differently between outer and
> inner levels.

We have the following grammar:

notify_channel:
   ColId
   { $$ = $1; }
   | notify_channel '.' ColId
   { $$ = psprintf("%s.%s", $1, $3); }

And ColId is truncated in the core scanner:

 ident = downcase_truncate_identifier(yytext, yyleng, true);

So each level is truncated independently, which explains the behaviour you
described above.
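The effect can be sketched in a few lines of Python (a simplified stand-in for the C scanner; NAMEDATALEN is 64 in a stock build, so identifiers are cut to 63 bytes):

```python
# Sketch of the per-level truncation the grammar above implies. Each
# level passes through downcase_truncate_identifier on its own, so the
# joined channel name is never checked against the limit here.
NAMEDATALEN = 64

def downcase_truncate_identifier(ident: str) -> str:
    # Simplified stand-in for the core scanner's function of the same
    # name: lowercase, then truncate to NAMEDATALEN - 1 bytes.
    return ident.lower()[:NAMEDATALEN - 1]

levels = ["A" * 70, "B" * 70, "c"]
channel = ".".join(downcase_truncate_identifier(l) for l in levels)
# Every level fits the limit on its own, yet the combined channel name
# is 129 bytes, which NOTIFY will later reject as too long.
print(len(channel))  # 129
```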

> Another observation, probably not strictly related to this patch itself but
> the async-notify tests, is that there is no test for
> "payload too long". Probably there is a reason why it isn't in the specs?
>

I believe it is simply because not all functionality is covered by tests.
But I have noticed a very interesting test "channel name too long":

SELECT
pg_notify('notify_async_channel_name_too_long__','sample_message1');
ERROR:  channel name too long

But the behaviour is inconsistent with the NOTIFY command:

NOTIFY notify_async_channel_name_too_long__
NOTICE:  identifier
"notify_async_channel_name_too_long__" will be
truncated to ...

> I guess that the expected behavior would be that if the outer level is
> truncated, the rest of the
> channel name should be ignored, as there won't be possible to notify it
> anyway.
>
> In the case of the inner levels creating a channel name too long, it may
> probably be sane to just
> check the length of the entire identifier, and truncate -- ensuring that
> channel name doesn't
> end with the level separator.
>
>
Well, I believe that we can forbid too long channel names altogether. That
provides consistent behaviour among the different ways we can send
notifications, and I agree with you that "there won't be possible to notify
it anyway". I created a patch for that and attached it to the email. In the
patch I relocated truncation from the core scanner to the parser. And as
the same core scanner is also used in plpgsql, I added three lines of code
to its scanner to truncate too long identifiers there. Here is an
example of the new behaviour:

-- Should fail. Too long channel names
NOTIFY notify_async_channel_name_too_long_._;
ERROR:  channel name too long
LISTEN notify_async_channel_name_too_long_%._;
ERROR:  channel name too long
UNLISTEN notify_async_channel_name_too_long_%._;
ERROR:  channel name too long

Regards,
Alexander Cheshev


On Sun, 21 Jul 2024 at 21:36, Emanuel Calvo <3man...@gmail.com> wrote:

>
> Hi Alexander,
>
> I did a review on the new patch version and I observed that the identifier
> passed to the LISTEN command is handled differently between outer and
> inner
> levels.
>
> When the outer level exceeds the 64 character limitation, the outer level
> of the channel name is truncated, but the inner levels are left in the
> channel name because they aren't parsed in the same way.
>
> Also, even if the outer level isn't truncated, it is allowed to add
> channel names
> that exceed the allowed identifier size.
>
> It can be reproduced just by:
>
>   # LISTEN a.a.a.a.a.lot.of.levels..; -- this doesn't fail at LISTEN,
> but fails in NOTIFY due to channel name too long
>
> In the following, the outer level is truncated, but it doesn't cut out the
> inner levels. This leaves
> listening channels that cannot receive any notifications in the queue:
>
>   # LISTEN
> notify_async_channel_name_too_long.a.a.
> ...
>   NOTICE: identifier  will be truncated
>
>   # select substring(c.channel,0,66), length(c.channel) from
> pg_listening_channels() c(channel) where length(c.channel) > 64;
>   substring | notify_async_channel_name_too_long_.a
>   length    | 1393
>
>
> I guess that the expected behavior would be that if the outer level is
> truncated, the rest of the
> channel name should be ignored, as there won't be possible to notify it
> anyway.
>
> In the case of the inner levels creating a channel name too long, it may
> probably be sane to just
> check the length of the entire identifier, and truncate -- ensuring that
> channel name doesn't
> end with the level separator.
>
> Another observation, probably not strictly related to this patch itself
> but the async-notify tests, is that there is no test for
> "payload too long". Probably there is a reason on why isn't in the specs?

Re: [PATCH] TODO “Allow LISTEN on patterns”

2024-07-15 Thread Alexander Cheshev
Hi Emanuel,

Changed the implementation of the function Exec_UnlistenCommit. v2 of the
patch contained a bug in the function Exec_UnlistenCommit (I added a test
case for that) and it was also not implemented in a form natural to C,
using pointers. Now it looks fine and works as expected.

In the previous email I forgot to mention that the new implementation of
the function Exec_UnlistenCommit has the same space and time complexities
as the original implementation (which doesn't support wildcards).

Regards,
Alexander Cheshev


On Sat, 13 Jul 2024 at 13:26, Alexander Cheshev 
wrote:

> Hi Emanuel,
>
>> I did a test over the "UNLISTEN >" behavior, and I'm not sure if this is
>> expected.
>> This command I assume should free all the listening channels, however, it
>> doesn't
>> seem to do so:
>
>
> TODO “Allow LISTEN on patterns” [1] is a bit vague about that feature. So
> I didn't implement it in the first version of the patch. Also I see that I
> made a mistake in the documentation and mentioned that it is actually
> supported. Sorry for the confusion.
>
> Besides obvious reasons I think that your finding is especially attractive
> for the following reason. We have an UNLISTEN * command. If we replace >
> with * in the patch (which I actually did in the new version of the patch)
> then we have a generalisation of the above command. For example, UNLISTEN
> a* cancels registration on all channels which start with a.
>
> I attached to the email the new version of the patch which supports the
> requested feature. Instead of > I use * for the reason which I mentioned
> above. Also I added test cases, changed documentation, etc.
>
> I appreciate your work, Emanuel! If you have any further findings I will
> be glad to adjust the patch accordingly.
>
> [1]
> https://www.postgresql.org/message-id/flat/52693FC5.7070507%40gmail.com
>
> Regards,
> Alexander Cheshev
>
>
>
> On Tue, 9 Jul 2024 at 11:01, Emanuel Calvo <3man...@gmail.com> wrote:
>
>>
>> Hello there,
>>
>>
>> On Fri, 15 Mar 2024 at 9:01, Alexander Cheshev (<
>> alex.ches...@gmail.com>) wrote:
>>
>>> Hello Hackers,
>>>
>>> I have implemented TODO “Allow LISTEN on patterns” [1] and attached
>>> the patch to the email. The patch basically consists of the following
>>> two parts.
>>>
>>> 1. Support wildcards in LISTEN command
>>>
>>> Notification channels can be composed of multiple levels in the form
>>> ‘a.b.c’ where ‘a’, ‘b’ and ‘c’ are identifiers.
>>>
>>> Listen channels can be composed of multiple levels in the form ‘a.b.c’
>>> where ‘a’, ‘b’ and ‘c’ are identifiers which can contain the following
>>> wildcards:
>>>  *  ‘%’ matches everything until the end of a level. Can only appear
>>> at the end of a level. For example, the notification channels ‘a.b.c’,
>>> ‘a.bc.c’ match against the listen channel ‘a.b%.c’.
>>>  * ‘>’ matches everything to the right. Can only appear at the end of
>>> the last level. For example, the notification channels ‘a.b’, ‘a.bc.d’
>>> match against the listen channel ‘a.b>’.
>>>
>>>
>> I did a test over the "UNLISTEN >" behavior, and I'm not sure if this is
>> expected.
>> This command I assume should free all the listening channels, however, it
>> doesn't
>> seem to do so:
>>
>> postgres=# LISTEN device1.alerts.%;
>> LISTEN
>> postgres=# ;
>> Asynchronous notification "device1.alerts.temp" with payload "80"
>> received from server process with PID 237.
>> postgres=# UNLISTEN >;
>> UNLISTEN
>> postgres=# ; -- Here I send a notification over the same channel
>> Asynchronous notification "device1.alerts.temp" with payload "80"
>> received from server process with PID 237.
>>
>> The same happens with "UNLISTEN %;", although I'm not sure if this should
>> have
>> the same behavior.
>>
>> It stops listening correctly if I do explicit UNLISTEN (exact channel
>> matching).
>>
>> I'll be glad to conduct more tests or checks on this.
>>
>> Cheers,
>>
>>
>> --
>> --
>> Emanuel Calvo
>> Database Engineering
>> https://tr3s.ma/aobut
>>
>>
From 403a432c129084ac4f17a92e29aa5398de0125f0 Mon Sep 17 00:00:00 2001
From: Alexander Cheshev 
Date: Thu, 14 Mar 2024 21:53:29 +0100
Subject: [PATCH v3] Support wildcards in LISTEN command
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Notification channels

Re: [PATCH] TODO “Allow LISTEN on patterns”

2024-07-13 Thread Alexander Cheshev
Hi Emanuel,

> I did a test over the "UNLISTEN >" behavior, and I'm not sure if this is
> expected.
> This command I assume should free all the listening channels, however, it
> doesn't
> seem to do so:


TODO “Allow LISTEN on patterns” [1] is a bit vague about that feature. So I
didn't implement it in the first version of the patch. Also I see that I
made a mistake in the documentation and mentioned that it is actually
supported. Sorry for the confusion.

Besides the obvious reasons, I think that your finding is especially attractive
for the following reason. We have an UNLISTEN * command. If we replace >
with * in the patch (which I actually did in the new version of the patch)
then we have a generalisation of the above command. For example, UNLISTEN
a* cancels registration on all channels which start with a.

I attached to the email the new version of the patch which supports the
requested feature. Instead of > I use * for the reason which I mentioned
above. Also I added test cases, changed documentation, etc.

I appreciate your work, Emanuel! If you have any further findings I will be
glad to adjust the patch accordingly.

[1] https://www.postgresql.org/message-id/flat/52693FC5.7070507%40gmail.com

Regards,
Alexander Cheshev



On Tue, 9 Jul 2024 at 11:01, Emanuel Calvo <3man...@gmail.com> wrote:

>
> Hello there,
>
>
> On Fri, 15 Mar 2024 at 9:01, Alexander Cheshev ()
> wrote:
>
>> Hello Hackers,
>>
>> I have implemented TODO “Allow LISTEN on patterns” [1] and attached
>> the patch to the email. The patch basically consists of the following
>> two parts.
>>
>> 1. Support wildcards in LISTEN command
>>
>> Notification channels can be composed of multiple levels in the form
>> ‘a.b.c’ where ‘a’, ‘b’ and ‘c’ are identifiers.
>>
>> Listen channels can be composed of multiple levels in the form ‘a.b.c’
>> where ‘a’, ‘b’ and ‘c’ are identifiers which can contain the following
>> wildcards:
>>  *  ‘%’ matches everything until the end of a level. Can only appear
>> at the end of a level. For example, the notification channels ‘a.b.c’,
>> ‘a.bc.c’ match against the listen channel ‘a.b%.c’.
>>  * ‘>’ matches everything to the right. Can only appear at the end of
>> the last level. For example, the notification channels ‘a.b’, ‘a.bc.d’
>> match against the listen channel ‘a.b>’.
>>
>>
> I did a test over the "UNLISTEN >" behavior, and I'm not sure if this is
> expected.
> This command I assume should free all the listening channels, however, it
> doesn't
> seem to do so:
>
> postgres=# LISTEN device1.alerts.%;
> LISTEN
> postgres=# ;
> Asynchronous notification "device1.alerts.temp" with payload "80" received
> from server process with PID 237.
> postgres=# UNLISTEN >;
> UNLISTEN
> postgres=# ; -- Here I send a notification over the same channel
> Asynchronous notification "device1.alerts.temp" with payload "80" received
> from server process with PID 237.
>
> The same happens with "UNLISTEN %;", although I'm not sure if this should
> have
> the same behavior.
>
> It stops listening correctly if I do explicit UNLISTEN (exact channel
> matching).
>
> I'll be glad to conduct more tests or checks on this.
>
> Cheers,
>
>
> --
> --
> Emanuel Calvo
> Database Engineering
> https://tr3s.ma/aobut
>
>
From e6e0bb6eb58723c8e82332fcf0e32c171d9f50f0 Mon Sep 17 00:00:00 2001
From: Alexander Cheshev 
Date: Thu, 14 Mar 2024 21:53:29 +0100
Subject: [PATCH v2] Support wildcards in LISTEN command
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Notification channels can be composed of multiple levels in the form ‘a.b.c’ where ‘a’, ‘b’ and ‘c’ are identifiers.

Listen and unlisten channels can be composed of multiple levels in the form ‘a.b.c’ where ‘a’, ‘b’ and ‘c’ are identifiers which can contain the following wildcards:
* The wildcard ‘%’ matches everything until the end of a level. Can only appear at the end of a level. For example, the notification channels ‘a.b.c’, ‘a.bc.c’ match against the listen channel ‘a.b%.c’.
* The wildcard ‘*’ matches everything to the right. Can only appear at the end of the last level. For example, the notification channels ‘a.b’, ‘a.bc.d’ match against the listen channel ‘a.b*’.

Use binary trie to match notification channels against listen channels.
---
 doc/src/sgml/ref/listen.sgml |  41 +-
 doc/src/sgml/ref/notify.sgml |   9 +-
 doc/src/sgml/ref/unlisten.sgml   |  46 +-
 src/backend/commands/async.c | 522 ++-
 src/backend/parser/gram.y|  52 +-
 src/backend
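The matching rules described in the commit message above can be sketched in plain Python as follows. This is only an illustration of the semantics, not the patch's binary-trie implementation, and the function name is mine:

```python
# Illustration of the wildcard rules: '%' matches to the end of one
# level, '*' at the end of the pattern matches the rest of that level
# and all deeper levels.
def channel_matches(pattern: str, channel: str) -> bool:
    plevels = pattern.split(".")
    clevels = channel.split(".")
    for i, p in enumerate(plevels):
        if i >= len(clevels):
            return False
        c = clevels[i]
        if i == len(plevels) - 1 and p.endswith("*"):
            # '*' swallows the rest of this level and any further levels
            return c.startswith(p[:-1])
        if p.endswith("%"):
            # '%' matches the remainder of this level only
            if not c.startswith(p[:-1]):
                return False
        elif p != c:
            return False
    return len(clevels) == len(plevels)

assert channel_matches("a.b%.c", "a.bc.c")   # '%' stays within its level
assert channel_matches("a.b*", "a.bc.d")     # '*' matches everything to the right
assert not channel_matches("a.b%", "a.b.c")  # '%' does not cross levels
```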

Re: Multidimensional Histograms

2024-01-08 Thread Alexander Cheshev
Hi Andrei,

> Maybe my wording needed to be more precise. I didn't implement
> multidimensional histograms before, so I don't know how expensive they
> are. I meant that for dependency statistics over about six columns, we
> have a lot of combinations to compute.

An Equi-Depth Histogram in a 6-dimensional case requires 6 times more
iterations. Postgres already uses an Equi-Depth Histogram. Even if you
increase the number of buckets from 100 to 1000, there will be no
overhead, as the time complexity of the Equi-Depth Histogram has no
dependence on the number of buckets. So, no overhead at all!
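A rough cost-model sketch of that claim (the function name and the constant-free comparison-count model are mine; it assumes the O(M*N*logN) bound for equi-depth construction discussed in this thread):

```python
# Equi-depth construction cost model: O(M * N * log N), dominated by
# sorting along each of the M dimensions. The bucket count B never
# enters the formula, so raising B from 100 to 1000 costs nothing extra.
def equi_depth_build_cost(M: int, N: int) -> int:
    # integer comparison-count model, constant factors ignored
    return M * N * N.bit_length()

N = 1_000_000
# Six dimensions cost exactly six times the one-dimensional build,
assert equi_depth_build_cost(6, N) == 6 * equi_depth_build_cost(1, N)
# and the cost is identical whether we want 100 or 1000 buckets,
# because B does not appear in the formula at all.
```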

Regards,
Alexander Cheshev

On Mon, 8 Jan 2024 at 04:12, Andrei Lepikhov  wrote:
>
> On 8/1/2024 01:36, Tomas Vondra wrote:
> > On 1/7/24 18:26, Andrei Lepikhov wrote:
> >> On 7/1/2024 17:51, Tomas Vondra wrote:
> >>> On 1/7/24 11:22, Andrei Lepikhov wrote:
> >>>> On 7/1/2024 06:54, Tomas Vondra wrote:
> >>>>> It's an interesting area for experiments, no doubt about it. And if you
> >>>>> choose to explore it, that's fine. But it's better to be aware it may
> >>>>> not end with a commit.
> >>>>> For the multi-dimensional case, I propose we first try to experiment
> >>>>> with the various algorithms, and figure out what works etc. Maybe
> >>>>> implementing them in python or something would be easier than C.
> >>>>
> >>>> Curiously, trying to utilize extended statistics for some problematic
> >>>> cases, I am experimenting with auto-generating such statistics by
> >>>> definition of indexes [1]. Doing that, I wanted to add some hand-made
> >>>> statistics like a multidimensional histogram or just a histogram which
> >>>> could help to perform estimation over a set of columns/expressions.
> >>>> I realized that current hooks get_relation_stats_hook and
> >>>> get_index_stats_hook are insufficient if I want to perform an estimation
> >>>> over a set of ANDed quals on different columns.
> >>>> In your opinion, is it possible to add a hook into the extended
> >>>> statistics to allow for an extension to propose alternative estimation?
> >>>>
> >>>> [1] https://github.com/danolivo/pg_index_stats
> >>>>
> >>>
> >>> No idea, I haven't thought about that very much. Presumably the existing
> >>> hooks are insufficient because they're per-attnum? I guess it would make
> >>> sense to have a hook for all the attnums of the relation, but I'm not
> >>> sure it'd be enough to introduce a new extended statistics kind ...
> >>
> >> I got stuck on the same problem Alexander mentioned: we usually have
> >> large tables with many uniformly distributed values. In this case, MCV
> >> doesn't help a lot.
> >>
> >> Usually, I face problems scanning a table with a filter containing 3-6
> >> ANDed quals. Here, Postgres multiplies selectivities and ends up with a
> >> less than 1 tuple selectivity. But such scans, in reality, mostly have
> >> some physical sense and return a bunch of tuples. It looks like the set
> >> of columns representing some value of composite type.
> >
> > Understood. That's a fairly common scenario, and it can easily end up
> > with rather terrible plan due to the underestimate.
> >
> >> Sometimes extended statistics on dependency helps well, but it expensive
> >> for multiple columns.
> >
> > Expensive in what sense? Also, how would a multidimensional histogram be
> > any cheaper?
> Maybe my wording needed to be more precise. I didn't implement
> multidimensional histograms before, so I don't know how expensive they
> are. I meant that for dependency statistics over about six columns, we
> have a lot of combinations to compute.
> >
> >> And sometimes I see that even a trivial histogram
> >> on a ROW(x1,x2,...) could predict a much more adequate value (kind of
> >> conservative upper estimation) for a clause like "x1=N1 AND x2=N2 AND
> >> ..." if somewhere in extension we'd transform it to ROW(x1,x2,...) =
> >> ROW(N1, N2,...).
> >
> > Are you guessing the histogram would help, or do you know it'd help? I
> > have no problem believing that for range queries, but I think it's far
> > less clear for simple equalities. I'm not sure a histograms would work
> > for that ...
>
> I added Teodor Sigaev to the CC of this email - He has much more user
> feedback on this technique. As I see, having a histogram over a set of
> columns, we have to

Re: Multidimensional Histograms

2024-01-07 Thread Alexander Cheshev
Hi Tomas,

> See section 3.2 in this paper (the "view PDF" does not work for me, but
> the "source PDF" does download a postscript):

I believe that you are referring to a dynamic programming approach. It
is a 1-dimensional case! To find an optimal solution in the
M-dimensional case is an NP-hard problem.

Regards,
Alexander Cheshev

On Mon, 8 Jan 2024 at 00:29, Tomas Vondra  wrote:
>
>
>
> On 1/7/24 23:53, Alexander Cheshev wrote:
> > Hi Tomas,
> >
> >> The thing I was proposing is that it should be possible to build
> >> histograms with bins adapted to density in the given region. With
> >> smaller buckets in areas with high density. So that queries intersect
> >> with fewer buckets in low-density parts of the histogram.
> >
> > This is how Equi-Depth Histogram works. Postgres has smaller buckets in
> > areas with high density:
> >
> > values[(i * (nvals - 1)) / (num_hist - 1)]
> >
> True, but the boundaries are somewhat random. Also, I was referring to
> the multi-dimensional case, it wasn't clear to me if the proposal is to
> do the same thing.
>
> >> I don't recall the details of the MHIST-2 scheme, but it's true
> >> calculating "perfect" V-optimal histogram would be quadratic O(N^2*B).
> >
> > In M-dimensional space "perfect" V-Optimal Histogram is an NP-hard
> > problem. In other words it is not possible to build it in polynomial
> > time. How did you come up with the estimate?!
> >
> See section 3.2 in this paper (the "view PDF" does not work for me, but
> the "source PDF" does download a postscript):
>
> https://citeseerx.ist.psu.edu/doc_view/pid/35e29cbc2bfe6662653bdae1fb89c091e2ece560
>
> >> But that's exactly why greedy/approximate algorithms exist. Yes, it's
> >> not going to produce the optimal V-optimal histogram, but so what?
> >
> > Greedy/approximate algorithm has time complexity O(M*N*B), where M
> > equals the number of dimensions. MHIST-2 is a greedy/approximate
> > algorithm.
> >
> >> And how does this compare to the approximate/greedy algorithms, both in
> >> terms of construction time and accuracy?
> >
> > Time complexity of Equi-Depth Histogram has no dependence on B.
> >
> Really? I'd expect that to build B buckets, the algorithm repeat some
> O(M*N) action B-times, roughly. I mean, it needs to pick a dimension by
> which to split, then do some calculation on the N tuples, etc.
>
> Maybe I'm imagining that wrong, though. It's been ages since I looked at
> big-O complexity and/or the histograms. I'd have to play with those
> algorithms for a bit again.
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company




Re: Multidimensional Histograms

2024-01-07 Thread Alexander Cheshev
Hi Tomas,

> The thing I was proposing is that it should be possible to build
> histograms with bins adapted to density in the given region. With
> smaller buckets in areas with high density. So that queries intersect
> with fewer buckets in low-density parts of the histogram.

This is how Equi-Depth Histogram works. Postgres has smaller buckets in
areas with high density:

values[(i * (nvals - 1)) / (num_hist - 1)]
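The boundary-selection expression quoted above can be sketched like this (a Python rendering of the same rank arithmetic; the helper name and sample data are mine):

```python
# With the values sorted, bucket boundaries are taken at evenly spaced
# ranks, so dense regions automatically get narrow buckets and sparse
# regions get wide ones.
def equi_depth_bounds(values, num_hist):
    vals = sorted(values)
    nvals = len(vals)
    return [vals[(i * (nvals - 1)) // (num_hist - 1)] for i in range(num_hist)]

# 90 values packed near zero plus a sparse tail: four of the five
# boundaries fall inside the dense cluster.
data = list(range(90)) + [1000 * i for i in range(1, 11)]
print(equi_depth_bounds(data, 5))  # [0, 24, 49, 74, 10000]
```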

> I don't recall the details of the MHIST-2 scheme, but it's true
> calculating "perfect" V-optimal histogram would be quadratic O(N^2*B).

In M-dimensional space "perfect" V-Optimal Histogram is an NP-hard
problem. In other words it is not possible to build it in polynomial
time. How did you come up with the estimate?!

> But that's exactly why greedy/approximate algorithms exist. Yes, it's
> not going to produce the optimal V-optimal histogram, but so what?

A greedy/approximate algorithm has time complexity O(M*N*B), where M
equals the number of dimensions. MHIST-2 is such a greedy/approximate
algorithm.

> And how does this compare to the approximate/greedy algorithms, both in
> terms of construction time and accuracy?

Time complexity of Equi-Depth Histogram has no dependence on B.
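As a back-of-envelope check on these complexity figures, with illustrative numbers (M = 3 dimensions, B = 1000 buckets, N = 300*B tuples; constant factors ignored):

```python
import math

# Compare the two construction costs discussed here.
M, B = 3, 1000
N = 300 * B

mhist2_iters = M * N * B                 # greedy V-Optimal (MHIST-2): O(M*N*B)
equi_depth_iters = M * N * math.log2(N)  # Equi-Depth: O(M*N*logN); B absent

print(mhist2_iters)  # 900000000, i.e. roughly 0.9 * 10^9
```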

> But all of this can apply to histograms in general, no? It's not somehow
> special to equi-depth histograms, a v-optimal histogram could be stored
> as an r-tree etc.

I agree.

Regards,
Alexander Cheshev

On Sun, 7 Jan 2024 at 00:54, Tomas Vondra  wrote:
>
>
>
> On 1/6/24 01:00, Alexander Cheshev wrote:
> > Hi Tomas,
> >
> >> Another reason was that the algorithm described in the two papers you
> >> reference (1988 paper by DeWitt and the evaluation by Carlson and
> >> Sutherland from ~2010) is simple but has pretty obvious weaknesses. It
> >> processes the columns one by one - first build bucket on column "a",
> >> then splits each bucket into buckets on "b". So it's not symmetrical,
> >> and it results in lower accuracy compared to an "ideal" histogram with
> >> the same number of buckets (especially for the dimensions split early).
> >
> > As stated in [1] Sum Square Error (SSE) is one of the most natural
> > error metrics. Equi-Depth Histogram is not ideal because it doesn't
> > minimize SSE. On the other hand V-Optimal Histogram indeed minimizes
> > SSE and from this point of view can be considered as an ideal
> > solution.
> >
> >> This does indeed produce an equi-depth histogram, which seems nice, but
> >> the mesh is regular in such a way that any value of the first dimension
> >> intersects all buckets on the second dimension. So for example with a
> >> histogram of 100x100 buckets on [a,b], any value "a=X" intersects with
> >> 100 buckets on "b", each representing 1/1 of tuples. But we don't
> >> know how the tuples are distributed in the tuple - so we end up using
> >> 0.5 of the bucket as unbiased estimate, but that can easily add-up in
> >> the wrong dimension.
> >
> > Suppose that we have a query box a=X and we know how data is
> > distributed in buckets. Let's consider only the buckets which are
> > intersected by the query box a=X. As we know how data is distributed
> > in buckets we can exclude the buckets which have no tuples which
> > intersect the query box a=X.
> >
> > As we store buckets with no information about data distribution we
> > have to make reasonable assumptions. If we keep buckets relatively
> > small then we can assume that buckets have uniform distribution.
> >
> > What I am trying to say is that the problem which you pointed out
> > comes from the fact that we store buckets with no information about
> > data distribution. Even in one dimensional case we have to assume how
> > data is distributed in buckets. By the way Postgres assumes that data
> > has uniform distribution in buckets.
> >
>
> It's not just what Postgres assumes, the assumption bucket uniformity is
> somewhat inherent to the whole concept of a histogram. Yes, maybe we
> could keep some "distribution" info about each bucket, but then maybe we
> could simply build histogram with more bins occupying the same space?
>
> The thing I was proposing is that it should be possible to build
> histograms with bins adapted to density in the given region. With
> smaller buckets in areas with high density. So that queries intersect
> with fewer buckets in low-density parts of the histogram.
>
> >> However, this is not the only way to build an equi-depth histogram,
> >> there are ways to make it more symmetrical. Also, it's not clear
> >> equi-de

Re: Multidimensional Histograms

2024-01-06 Thread Alexander Cheshev
Hi Tomas,

I am sorry I didn't look into the code carefully. Indeed Postgres uses
Equi-Depth Histogram:

delta = (nvals - 1) / (num_hist - 1);


Regards,
Alexander Cheshev

On Sat, 6 Jan 2024 at 01:00, Alexander Cheshev  wrote:
>
> Hi Tomas,
>
> > Another reason was that the algorithm described in the two papers you
> > reference (1988 paper by DeWitt and the evaluation by Carlson and
> > Sutherland from ~2010) is simple but has pretty obvious weaknesses. It
> > processes the columns one by one - first build bucket on column "a",
> > then splits each bucket into buckets on "b". So it's not symmetrical,
> > and it results in lower accuracy compared to an "ideal" histogram with
> > the same number of buckets (especially for the dimensions split early).
>
> As stated in [1] Sum Square Error (SSE) is one of the most natural
> error metrics. Equi-Depth Histogram is not ideal because it doesn't
> minimize SSE. On the other hand V-Optimal Histogram indeed minimizes
> SSE and from this point of view can be considered as an ideal
> solution.
>
> > This does indeed produce an equi-depth histogram, which seems nice, but
> > the mesh is regular in such a way that any value of the first dimension
> > intersects all buckets on the second dimension. So for example with a
> > histogram of 100x100 buckets on [a,b], any value "a=X" intersects with
> > 100 buckets on "b", each representing 1/1 of tuples. But we don't
> > know how the tuples are distributed in the tuple - so we end up using
> > 0.5 of the bucket as unbiased estimate, but that can easily add-up in
> > the wrong dimension.
>
> Suppose that we have a query box a=X and we know how data is
> distributed in buckets. Let's consider only the buckets which are
> intersected by the query box a=X. As we know how data is distributed
> in buckets we can exclude the buckets which have no tuples which
> intersect the query box a=X.
>
> As we store buckets with no information about data distribution we
> have to make reasonable assumptions. If we keep buckets relatively
> small then we can assume that buckets have uniform distribution.
>
> What I am trying to say is that the problem which you pointed out
> comes from the fact that we store buckets with no information about
> data distribution. Even in one dimensional case we have to assume how
> data is distributed in buckets. By the way Postgres assumes that data
> has uniform distribution in buckets.
>
> > However, this is not the only way to build an equi-depth histogram,
> > there are ways to make it more symmetrical. Also, it's not clear
> > equi-depth histograms are ideal with multiple columns. There are papers
> > proposing various other types of histograms (using different criteria to
> > build buckets optimizing a different metric). The most interesting one
> > seems to be V-Optimal histograms - see for example papers [1], [2], [3],
> > [4], [5] and [6]. I'm sure there are more. The drawback of course is
> > that it's more expensive to build such histograms.
>
> Tomas, thank you for sharing with me your ideas regarding V-Optimal
> Histogram. I read through the papers which you gave me and came up
> with the following conclusion.
>
> The problem can be formulated like this. We have N tuples in
> M-dimensional space. We need to partition space into buckets
> iteratively until SSE is less than E or we reach the limit of buckets
> B.
>
> In the case of M-dimensional space it seems to me like an NP-hard
> problem. A greedy heuristic MHIST-2 is proposed in [2]. First we
> sort N tuples in ascending order. Then we iteratively select a bucket
> which leads to the largest SSE reduction and split it into two parts.
> We repeat the process until SSE is less than E or we reach the limit
> of buckets B.
>
> If we assume that B is significantly less than N then the time
> complexity of MHIST-2 can be estimated as O(M*N*B). Suppose that M
> equals 3, B equals 1000 and N equals 300*B then it will take slightly
> over 0.9*10^9 iterations to build a V-Optimal Histogram.
>
> You can see that we have to keep B as low as possible in order to
> build V-Optimal Histogram in feasible time. And here is a paradox.
> From one side we use V-Optimal Histogram in order to minimize SSE but
> from the other side we have to keep B as low as possible which
> eventually leads to increase in SSE.
>
> On the other hand time complexity required to build an Equi-Depth
> Histogram doesn't depend on B and can be estimated as O(M*N*logN). SSE
> can be arbitrarily reduced by increasing B which in turn is only
> limited by the storage limit. Experimental results show low error
> metr

Re: Multidimensional Histograms

2024-01-05 Thread Alexander Cheshev
tersect a query box we can
store Equi-Depth Histogram in R-tree as proposed in [3]. On average it
takes O(logB) iterations to find buckets which intersect a query box.
As storage requirements are dominated by leaf nodes we can assume that
it takes slightly more than 2*integer_size*M*B.
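As a quick arithmetic check of that storage estimate (the 4-byte integer_size and the M = 3, B = 1000 figures are illustrative assumptions):

```python
# Leaf nodes dominate R-tree storage: each of the B buckets stores a low
# and a high corner in M dimensions, i.e. about 2 * integer_size * M * B
# bytes in total.
integer_size = 4  # bytes per coordinate, illustrative
M, B = 3, 1000
leaf_bytes = 2 * integer_size * M * B
print(leaf_bytes)  # 24000 bytes for a 3-dimensional histogram of 1000 buckets
```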

> IIRC the patch tried to do something like V-optimal histograms, but
> using a greedy algorithm and not the exhaustive stuff described in the
> various papers.

We should only consider computationally tractable solutions. In one
dimensional case V-Optimal Histogram is probably a good solution but
in multi-dimensional case I would only consider Equi-Width or
Equi-Depth Histograms. As stated in multiple papers Equi-Depth
Histogram proves to be more accurate than Equi-Width Histogram. By the
way Postgres uses something like Equi-Width Histogram.

> FWIW I did not intend to reject the idea of adding multi-dimensional
> histograms, but rather to provide some insight into the history of the
> past attempt, and also point some weaknesses of the algorithm described
> in the 1988 paper. If you choose to work on this, I can do a review etc.

Thank you very much Tomas. I am new in the community and I definitely
didn't expect to have such a warm welcome.

As I indicated above Equi-Depth Histogram proves to be more accurate
than Equi-Width Histogram and both have the same time and space
requirements. Postgres uses some sort of Equi-Width Histogram. Suppose
that:
 * I will create a patch which will replace Equi-Width Histogram with
Equi-Depth Histogram but only in 1-dimensional case.
 * I will show experimental results which will demonstrate improvement
of selectivity estimation.
Then will the patch be accepted by the community?

If the above patch is accepted by the community then I will proceed
further with M-dimensional Equi-Depth Histogram...


[1] https://www.vldb.org/conf/1998/p275.pdf
[2] https://www.vldb.org/conf/1997/P486.PDF
[3] https://dl.acm.org/doi/pdf/10.1145/50202.50205

Regards,
Alexander Cheshev

On Thu, 28 Dec 2023 at 15:25, Tomas Vondra
 wrote:
>
> On 12/27/23 22:19, Tomas Vondra wrote:
> > Hello Alexander,
> >
> > We actually had histograms in the early patches adding multivariate
> > statistics [1]. We however ended up removing histograms and only kept
> > the simpler types, for a couple reasons.
> >
> > It might be worth going through the discussion - I'm sure one of the
> > reasons was we needed to limit the scope of the patch, and histograms
> > were much more complex and possibly less useful compared to the other
> > statistics types.
> >
> > ...
>
> FWIW I did not intend to reject the idea of adding multi-dimensional
> histograms, but rather to provide some insight into the history of the
> past attempt, and also point some weaknesses of the algorithm described
> in the 1988 paper. If you choose to work on this, I can do a review etc.
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company




Multidimensional Histograms

2023-12-26 Thread Alexander Cheshev
Hello Hackers,

To improve the selectivity estimates of queries I suggest adding support
for multidimensional histograms as described in paper [1].

To query multidimensional histograms efficiently we can use H-trees as
described in paper [2].

Postgres has limited support for multivariate statistics:
 * MCV is only useful for columns with a small number of distinct values;
 * functional dependencies only reflect dependencies among columns
(not column values).

[1] http://www.cs.cmu.edu/~rcarlson/docs/RyanCarlson_databases.pdf
[2] https://dl.acm.org/doi/pdf/10.1145/50202.50205

-- 
Regards,
Alexander Cheshev