RE: temp table on commit delete rows performance issue
> It seems that in your patch, WAL logging is skipped for all tables, not just
> temporary tables.

This code path is only used in two cases though:

* for temporary tables with ON COMMIT DROP
* for truncating tables that were created in the same transaction, or which were already truncated in the same transaction (this is a special case in the TRUNCATE command)

In both cases I believe it's not necessary to log the lock, as the table doesn't exist on the replica yet, or the exclusive lock has already been obtained and logged previously. Regular TRUNCATE commands go through a completely different code path, as those need to be rollbackable if the transaction aborts.

> Upon further consideration, do we really need to acquire AccessExclusiveLocks
> for temporary tables? Since temporary tables can only be accessed within the
> current session, perhaps we can make the following optimizations:

I'm less sure this one is correct in all cases. Logically it makes sense that no other backends can access the table, but I see some threads [1] suggesting that it's technically possible for other backends to take locks on these tables, so it's not *that* obvious there are no edge cases.

[1] https://postgrespro.com/list/thread-id/2477885
RE: temp table on commit delete rows performance issue
> I also encountered a similar performance issue with temporary tables and
> provided a patch to optimize the truncate performance during commit in [1].

Interesting, that is definitely another good way to improve the performance, especially with a large number of temp tables. I think the two optimizations can actually work well together: your optimization truncates only the tables that were actually used, combined with a patch like the attached, which makes sure that no WAL is generated at all for the ON COMMIT DELETE ROWS operation. On my test system this reduces WAL generation for the pgbench test case I posted previously to zero (and therefore brought the WAL replay process from 100% CPU and lagging behind to 0% CPU usage).

-Floris

0001-Remove-any-WAL-logging-for-ON-COMMIT-DELETE-ROWS.patch
Description: 0001-Remove-any-WAL-logging-for-ON-COMMIT-DELETE-ROWS.patch
RE: temp table on commit delete rows performance issue
> I didn't investigate your particular issue but generally speaking creating a
> table, even a temporary one, is an expensive operation.
>
> Note that it's far from being a separate file on the disk. It affects catalog
> tables, shared buffers, all the corresponding locks, etc. If you have indexes
> for a temporary table it makes the situation even worse. Sooner or later
> VACUUM will happen for your bloated catalog, and this is not fun under
> heavy load.

Yes, creation is expensive, but note that this test case does not create the temp table for every transaction. It creates it once on startup of the connection, so there'll be no catalog bloat as the connections are generally long-lived. The part that remains expensive (and which was unexpected to me) is the truncate that does happen for every transaction.

> Is there any particular reason why you don't want to simply change the target
> table directly? If you do it in a transaction you are safe.

This is a fair point which I didn't explain in my earlier message. We find that for various use cases the temp table approach is a really convenient way of interacting with Postgres. You can easily binary COPY whatever data you want into it without worrying too much about data size, and then handle it from there on the db side. Without it, you'd need to resort to passing the data as query parameters, which has complications like a limit on the number of bound params per query. That's definitely possible to use as well, but the binary COPY to a temp table is quite convenient.

-Floris
temp table on commit delete rows performance issue
Hi hackers,

I'm looking for some input on an issue I've observed. A common pattern I've seen is using temporary tables to put data in before updating the real tables. Something roughly like:

On session start:

    CREATE TEMP TABLE temp_t1 (...) ON COMMIT DELETE ROWS;

On update:

    BEGIN;
    COPY temp_t1 FROM STDIN (FORMAT BINARY);
    INSERT INTO t1 (SELECT * FROM temp_t1 ...) ON CONFLICT DO UPDATE SET ...;
    -- potentially some other operations on temp table to put data into real table t1
    COMMIT;

This pattern starts to break down under certain exceptional circumstances of high concurrency. The ON COMMIT DELETE ROWS does a truncate that is fairly expensive and doesn't work well in high-concurrency scenarios. It's especially noticeable under the following circumstances:

- high max_connections setting
- high number of temp tables per session
- concurrent writers at fairly short intervals

The impact is twofold: TPS on the primary degrades, and the WAL replay process on the replica becomes completely overloaded (100% CPU even though not a lot of WAL is being generated).

Here is a very simple pgbench example that showcases the degradation (run with max_connections=2000 to clearly show it).
test_truncate.sql:

    begin;
    set client_min_messages=warning;
    create temp table if not exists ut0 (a int, b text, c text) on commit delete rows;
    commit;

test_no_truncate.sql:

    begin;
    set client_min_messages=warning;
    create temp table if not exists ut0 (a int, b text, c text); -- do not do anything on commit
    commit;

pgbench -n -f test_truncate.sql -T 10 postgres -c 1 -j 1
pgbench (18devel)
tps = 2693.806376 (without initial connection time)

pgbench -n -f test_truncate.sql -T 10 postgres -c 10 -j 10
pgbench (18devel)
tps = 12119.358458 (without initial connection time)

pgbench -n -f test_no_truncate.sql -T 10 postgres -c 1 -j 1
pgbench (18devel)
tps = 22685.877281 (without initial connection time)

pgbench -n -f test_no_truncate.sql -T 10 postgres -c 10 -j 10
pgbench (18devel)
tps = 200359.458154 (without initial connection time)

For test_truncate.sql with 10 threads, the WAL replay process is spinning at 100% CPU. The reason seems to be that, for a large part, the ON COMMIT DELETE ROWS follows the same code path as a regular truncate. This means:

* taking an exclusive lock on the temp table (WAL logged)
* invalidating the rel cache (WAL logged)

On the replica, these exclusive locks are very expensive to replay, so the replica spends most of its cycles trying to get LWLocks on the non-fast path. On the primary, I believe the degradation under concurrency comes from the rel cache invalidation. There doesn't seem to be a good reason for this behavior, as the replica has nothing to do for a temporary table truncation, so this lock doesn't need to be WAL logged. The generated WAL is just LOCK+INVALIDATION+COMMIT, and the replica spins at 100% CPU trying to replay it.
rmgr: Standby     len (rec/tot): 42/42, tx: 8154875, lsn: 0/3AFFE270, prev 0/3AFFE218, desc: LOCK xid 8154875 db 5 rel 18188
rmgr: Standby     len (rec/tot): 42/42, tx: 8154875, lsn: 0/3AFFE2A0, prev 0/3AFFE270, desc: LOCK xid 8154875 db 5 rel 18191
rmgr: Standby     len (rec/tot): 42/42, tx: 8154875, lsn: 0/3AFFE2D0, prev 0/3AFFE2A0, desc: LOCK xid 8154875 db 5 rel 18192
rmgr: Transaction len (rec/tot): 62/62, tx: 8154875, lsn: 0/3AFFE300, prev 0/3AFFE2D0, desc: INVALIDATION ; inval msgs: relcache 18191 relcache 18192
rmgr: Transaction len (rec/tot): 82/82, tx: 8154875, lsn: 0/3AFFE340, prev 0/3AFFE300, desc: COMMIT 2024-07-16 12:56:42.878147 CEST; inval msgs: relcache 18191 relcache 18192
rmgr: Standby     len (rec/tot): 42/42, tx: 8154876, lsn: 0/3AFFE398, prev 0/3AFFE340, desc: LOCK xid 8154876 db 5 rel 18188
rmgr: Standby     len (rec/tot): 42/42, tx: 8154876, lsn: 0/3AFFE3C8, prev 0/3AFFE398, desc: LOCK xid 8154876 db 5 rel 18191
rmgr: Standby     len (rec/tot): 42/42, tx: 8154876, lsn: 0/3AFFE3F8, prev 0/3AFFE3C8, desc: LOCK xid 8154876 db 5 rel 18192
rmgr: Transaction len (rec/tot): 62/62, tx: 8154876, lsn: 0/3AFFE428, prev 0/3AFFE3F8, desc: INVALIDATION ; inval msgs: relcache 18191 relcache 18192
rmgr: Transaction len (rec/tot): 82/82, tx: 8154876, lsn: 0/3AFFE468, prev 0/3AFFE428, desc: COMMIT 2024-07-16 12:56:42.878544 CEST; inval msgs: relcache 18191 relcache 18192

Have people observed this behavior before? Would the community be open to accepting a patch that changes the behavior of ON COMMIT DELETE ROWS to not WAL log the lock taken and/or the relcache invalidation? I think:

* An exclusive lock needs to be taken on the primary, but does not need to be WAL logged.
* Rel cache invalidation should not be necessary, I think (it currently happens only for the TOAST table + index, not for the regular temp table).

-Floris
RE: pg_init_privs corruption.
> Kirill Reshke writes:
> > As you can see, after drop role there are invalid records in
> > the pg_init_privs system relation. After this, pg_dump generates SQL
> > statements, some of which are based on the content of pg_init_privs,
> > resulting in an invalid dump.

As far as I can see, this is the same case as what I reported a few years ago here:
https://www.postgresql.org/message-id/flat/1574068566573.13088%40Optiver.com#488bd647ce6f5d2c92764673a7c58289

There was a discussion with some options, but no fix back then.

-Floris
RE: Commitfest 2021-11 Patch Triage - Part 2
> Daniel Gustafsson writes:
> > 2773: libpq compression
> > ===
> > This patch intended to provide libpq connection compression to
> > "replace SSL compression", which was doomed when the patch was written,
> > and has since been removed altogether. The initial approach didn't
> > get much traction but there was significant discussion and work, which
> > has since fizzled out. The patch has been updated but there hasn't
> > been meaningful review the past months; the last comments seem to
> > imply there being a fair amount of question marks left in here.
> > Robert, having been very involved in this, do you have any thoughts on
> > where we are and where to go (if at all IYO)?
>
> I'm not Robert, but I still have an opinion here, and it's that this feature
> would at best be an attractive nuisance. If you need compression on a
> database session, it probably means that the connection is over the open
> internet, which means that you need encryption even more. And we know
> that compression and encryption do not play well together. The reason
> compression was taken out of the latest TLS standards is not that they
> wouldn't have liked to have it, nor that applying compression in a separate
> code layer would be any safer. I fear offering this would merely lead people
> to build CVE-worthy setups.

I don't have a strong opinion on the patch, but I'd just like to mention that there are definitely use cases for using compression on private networks (with ssl off). We have a PG extension that hooks into the COPY ... FROM command and accepts lz4-compressed data for the COPY data message specifically. This has proven invaluable to us even on private networks, because it gives a good compression ratio for very little CPU decompression overhead.
RE: Why timestamptz_pl_interval and timestamptz_mi_interval are not immutable?
> What do I miss?
>
> --
> Best regards,
> Alexander Pyhalov,
> Postgres Professional

See for example around DST changes:

postgres=# begin;
BEGIN
postgres=# show timezone;
     TimeZone
------------------
 Europe/Amsterdam
(1 row)

postgres=# select '2021-03-27 15:00 +0100'::timestamptz + interval '1d';
        ?column?
------------------------
 2021-03-28 15:00:00+02
(1 row)

postgres=# set timezone to UTC;
SET
postgres=# select '2021-03-27 15:00 +0100'::timestamptz + interval '1d';
        ?column?
------------------------
 2021-03-28 14:00:00+00
(1 row)

postgres=# select '2021-03-28 15:00:00+02' = '2021-03-28 14:00:00+00';
 ?column?
----------
 f
(1 row)
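For what it's worth, the same session-timezone dependence can be sketched outside Postgres. Here is a rough Python analogue: zoneinfo's wall-clock arithmetic stands in for timestamptz + interval under a session timezone, so this mirrors rather than reuses Postgres's implementation.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

ams = ZoneInfo("Europe/Amsterdam")
start = datetime(2021, 3, 27, 15, 0, tzinfo=ams)  # '2021-03-27 15:00 +0100'

# "+ interval '1 day'" with TimeZone=Europe/Amsterdam: same wall-clock time
# the next day, which crosses the DST jump (offset becomes +02).
local_plus_day = start + timedelta(days=1)

# "+ interval '1 day'" with TimeZone=UTC: 24 absolute hours later.
utc_plus_day = start.astimezone(timezone.utc) + timedelta(days=1)

# The two results are different instants, so the addition depends on the
# session TimeZone setting and cannot be immutable.
print(local_plus_day.astimezone(timezone.utc).hour)  # 13  (== 15:00+02)
print(utc_plus_day.hour)                             # 14  (== 14:00+00)
```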
RE: visibility map corruption
> > I wonder if it's related to this issue:
> > https://www.postgresql.org/message-id/20210423234256.hwopuftipdmp3...@alap3.anarazel.de
>
> Have you increased autovacuum_freeze_max_age from its default? This
> already sounds like the kind of database where that would make sense.

autovacuum_freeze_max_age is indeed increased in our setup (it is set to 500M). However, we do regularly run manual VACUUM (FREEZE) on individual tables in the database, including this one. A lot of tables in the database follow an INSERT-only pattern, and since the server is not running v13 yet, these tables would only rarely be touched by autovacuum. Autovacuum would sometimes kick in on several of these tables at the same time, at unfortunate moments. Therefore we have a regular job that VACUUM (FREEZE)s tables with an xact age higher than a (low, 10M) threshold ourselves.
RE: visibility map corruption
> On Sun, Jul 4, 2021 at 1:44 PM Floris Van Nee wrote:
> > We recently ran into an issue where the visibility map of a relation was
> > corrupt, running Postgres 12.4. The error we'd get when running a SELECT *
> > from this table is:
> >
> > could not access status of transaction 3704450152
> > DETAIL: Could not open file "pg_xact/0DCC": No such file or directory.
>
> Have you ever used pg_upgrade on this database?

Yes. The last time (from v11 to v12) was in October 2020. The transaction id in the tuples (the one PG is trying to check in the tx log) dated from February 2021. I do believe (but am not 100% certain) that the affected table already existed at the time of the last pg_upgrade, though.
visibility map corruption
Hi hackers,

We recently ran into an issue where the visibility map of a relation was corrupt, running Postgres 12.4. The error we'd get when running a SELECT * from this table is:

could not access status of transaction 3704450152
DETAIL: Could not open file "pg_xact/0DCC": No such file or directory.

On the lists I could find several similar reports, but corruption like this could obviously have a very wide range of root causes; see [1] [2] [3] for example - not all of them have their root cause known. This particular case is similar to the reported cases above, but also has some differences.

The following query returns ~21,000 rows, which indicates something inconsistent between the visibility map and the pd_all_visible flag on the page:

select * from pg_check_frozen('tbl');

Looking at one of the affected pages with pageinspect:

=# SELECT lp, lp_off, lp_flags, lp_len, t_xmin, t_xmax, t_field3, t_ctid, t_infomask2, t_infomask, t_hoff, t_oid
   FROM heap_page_items(get_raw_page('tbl', 726127));

 lp | lp_off | lp_flags | lp_len |   t_xmin   |   t_xmax   | t_field3 |   t_ctid   | t_infomask2 | t_infomask | t_hoff | t_oid
----+--------+----------+--------+------------+------------+----------+------------+-------------+------------+--------+-------
  1 |   6328 |        1 |   1864 | 3704450155 | 3704450155 |        1 | (726127,1) |         249 |       8339 |     56 |
  2 |   4464 |        1 |   1864 | 3704450155 | 3704450155 |        1 | (726127,2) |         249 |       8339 |     56 |
  3 |   2600 |        1 |   1864 | 3704450155 | 3704450155 |        1 | (726127,3) |         249 |       8339 |     56 |
  4 |    680 |        1 |   1920 | 3704450155 | 3704450155 |        1 | (726127,4) |         249 |       8339 |     56 |
(4 rows)

t_infomask shows that the HEAP_XMIN_COMMITTED and HEAP_XMIN_INVALID bits are both unset. This pg_visibility() call shows the inconsistency between the VM and the page, with pd_all_visible=false:

=# select * from pg_visibility('tbl', 726127);
 all_visible | all_frozen | pd_all_visible
-------------+------------+----------------
 t           | t          | f
(1 row)

Looking at other affected pages shows the same information.
What's interesting is that out of the affected tuples returned by pg_check_frozen, over 99% belong to one transaction (3704450155, as seen above) and the remaining few are from one other transaction that occurred at roughly the same time. I find it hard to believe that this is due to some random bit flipping, because many pages are affected, but the "incorrect" ones in total come from only two specific transactions which occurred at roughly the same time. There were also no server crashes or other known failures around the time of these transactions.

I'm not ruling out any other kind of failure still, but I also cannot really explain how this could have happened. The server runs PG 12.4 with full_page_writes=on and data_checksums=off. It's a large analytics database. The VM inconsistencies also occur on the streaming replicas.

I realize these cases are pretty rare and hard to debug, but I wanted to share the information I found so far here for reference. Maybe someone has an idea what occurred, or maybe someone in the future finds it useful when they encounter something similar.

I have no idea how the inconsistency between the VM and PD_ALL_VISIBLE started - from looking through the code I can't really find any way this could occur. However, for it to lead to the problem described here, I believe there must be *no* SELECT that touches that particular page after the insert/update transaction and before the transaction log gets truncated. If the page is read before the transaction log gets truncated, then the hint bit HEAP_XMIN_COMMITTED will get set and future reads will succeed regardless of tx log truncation. One of the replicas had this happen to it: its affected pages are identical to the primary's, except that the HEAP_XMIN_COMMITTED flag is set (note that the VM inconsistency is still there on the replica: PD_ALL_VISIBLE=false even though the VM shows all_frozen=all_visible=true).
But I can query these rows on the replica without issues, because it doesn't check the tx log when it sees that HEAP_XMIN_COMMITTED is set.

-Floris

[1] https://postgrespro.com/list/thread-id/2422376
[2] https://postgrespro.com/list/thread-id/2501800
[3] https://postgrespro.com/list/thread-id/2321949
RE: non-HOT update not looking at FSM for large tuple update
Hi Noah,

Thanks for taking a look at this patch.

> In evaluating whether this is a good choice of value, I think about the
> expected page lifecycle. A tuple barely larger than fillfactor (roughly
> len=1+BLCKSZ*fillfactor/100) will start on a roughly-empty page. As long as
> the tuple exists, the server will skip that page for inserts. Updates can cause
> up to floor(99/fillfactor) same-size versions of the tuple to occupy the page
> simultaneously, creating that many line pointers. At the fillfactor=10
> minimum, it's good to accept otherwise-empty pages having at least nine line
> pointers, so a page can restart the aforementioned lifecycle. Tolerating even
> more line pointers helps when updates reduce tuple size or when the page
> was used for smaller tuples before it last emptied. At the BLCKSZ=8192
> default, this maxPaddedFsmRequest allows 36 line pointers (good or
> somewhat high). At the BLCKSZ=1024 minimum, it allows 4 line pointers
> (low). At the BLCKSZ=32768 maximum, 146 (likely excessive). I'm not
> concerned about optimizing non-default block sizes, so let's keep your
> proposal.

Agreed. You briefly mention this already, but the case that caused me to report this was exactly the one where under normal circumstances each UPDATE is small, while in rare cases the updated tuple grows to 1 kB (in the specific case we encountered, extra info would occasionally be written to a field that is otherwise NULL). Suppose this 1 kB UPDATE does not fit into the current page (so no HOT update); then a new page would be created (HEAD behavior). However, it is very likely that the next updates to this same tuple will be the regular size again. This causes the higher number of line pointers on the page.

-Floris
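The quoted lifecycle arithmetic can be spelled out in a few lines. This is a sketch only; the constants and formulas are taken from the quote above, not from the Postgres source.

```python
BLCKSZ = 8192  # default block size, as in the quote

def smallest_skipped_tuple_len(fillfactor):
    # A tuple barely larger than fillfactor: roughly len = 1 + BLCKSZ*fillfactor/100.
    return 1 + BLCKSZ * fillfactor // 100

def max_same_size_versions(fillfactor):
    # Updates can leave up to floor(99/fillfactor) same-size versions on the
    # page at once, each occupying a line pointer.
    return 99 // fillfactor

# At the fillfactor=10 minimum, an otherwise-empty page can be left with
# nine line pointers after such a lifecycle.
print(smallest_skipped_tuple_len(10), max_same_size_versions(10))  # 820 9
```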
RE: non-HOT update not looking at FSM for large tuple update
Hi,

> This patch fails to consider that len may be bigger than MaxHeapTupleSize *
> 0.98, which in this case triggers a reproducible PANIC:

Good catch! I've adapted the patch with your suggested fix.

> One different question I have, though, is why we can't "just" teach vacuum
> to clean up trailing unused line pointers. As in, can't we trim the line pointer
> array when vacuum detects that the trailing line pointers on the page are all
> unused?
>
> The only documentation that I could find that this doesn't happen is in the
> comment on PageIndexTupleDelete and PageRepairFragmentation, both not
> very descriptive on why we can't shrink the page->pd_linp array. One is
> "Unlike heap pages, we compact out the line pointer for the removed tuple."
> (Jan. 2002), and the other is "It doesn't remove unused line pointers! Please
> don't change this." (Oct. 2000), but I can't seem to find the documentation /
> conversations on the implications that such shrinking would have.

This is an interesting alternative indeed. I also can't find any documentation or conversation about this, and the message is rather cryptic. Hopefully someone on the list still remembers the reasoning behind the comment in PageRepairFragmentation.

-Floris

v3-Allow-inserting-tuples-into-almost-empty-pages.patch
Description: v3-Allow-inserting-tuples-into-almost-empty-pages.patch
RE: non-HOT update not looking at FSM for large tuple update
> I've added this to the commitfest as a bug fix and added you as an author.

Thanks. The patch looks good to me, but I guess there needs to be someone else reviewing too? Also, would this be a backpatchable bugfix?

-Floris
RE: non-HOT update not looking at FSM for large tuple update
> That makes sense, although the exact number seems precisely tailored to your
> use case. 2% gives 164 bytes of slack and doesn't seem too large. Updated
> patch attached.

Yeah, I tried picking it as conservatively as I could, but 2% is obviously great too. :-) I can't think of any large drawbacks of a slightly larger value either. Thanks for posting the patch!

-Floris
RE: non-HOT update not looking at FSM for large tuple update
> In this case, the vast majority has between 8050-8099 bytes free according to
> the FSM. That means that, for this particular case, for a fillfactor of 10,
> we'd need to subtract ~120 bytes or so in order to properly recycle the pages.

Also, I think this "fudge" factor would need to be defined as a percentage of the page size as well. 100 bytes on an 8 kB page is quite different from 100 bytes on a 1 kB page (although I have no idea if people ever actually compile PG with a different page size, it is supposed to be supported).

I also understand the temptation to define it based on the relation's fillfactor, as you did in the patch. However, upon some further thought I wonder if that's useful. A relation with a higher fillfactor will have a lower 'saveFreeSpace' variable, so it's less likely to run into issues finding a fresh page, except when the tuple you're inserting/updating is even larger. If that happens, though, you'll still want to look for a page that's completely empty (except for the line pointers). So the only proper metric is "how many unused line pointers do we expect on empty pages", and the fillfactor doesn't say much about this. Since this is probably difficult to estimate at all, we may be better off just defining it off MaxHeapTupleSize completely. For example, if we expect 1.5% of the page could be line pointers, then:

targetFreeSpace = MaxHeapTupleSize * 0.985

-Floris
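To make the percentages concrete, here is a small sketch. It assumes MaxHeapTupleSize = 8160 bytes for the default 8 kB block size and 4-byte line pointers; both constants are assumptions from memory, not values taken from the patch.

```python
MAX_HEAP_TUPLE_SIZE = 8160  # assumed value for BLCKSZ = 8192
LINE_POINTER_SIZE = 4       # assumed sizeof(ItemIdData)

def slack_bytes(factor):
    # targetFreeSpace = MaxHeapTupleSize * factor; whatever is left over is
    # slack that tolerates leftover unused line pointers on "empty" pages.
    return MAX_HEAP_TUPLE_SIZE - int(MAX_HEAP_TUPLE_SIZE * factor)

for factor in (0.985, 0.98):
    slack = slack_bytes(factor)
    print(factor, slack, slack // LINE_POINTER_SIZE)
```

Under these assumptions, 1.5% leaves 123 bytes (about 30 line pointers) and 2% leaves 164 bytes (about 41), which lines up with the "164 bytes of slack" figure mentioned elsewhere in the thread.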
RE: non-HOT update not looking at FSM for large tuple update
Hi John,

> One idea is to take your -50 idea and make it more general and safe, by
> scaling the fudge factor based on fillfactor, such that if fillfactor is less
> than 100, the requested freespace is a bit smaller than the max. It's still a
> bit of a hack, though. I've attached a draft of this idea.

You're right, that'd work better. Though, one thing I forgot to mention earlier is that in the "wild" where this occurred, the UPDATEs with these large tuple sizes are much rarer than UPDATEs with a much smaller tuple size. So in reality, when this happens, the empty pages contain more unused line pointers and we'd need to subtract more bytes in order to find those pages in the FSM. This is the (partial) output of the pg_freespace function, bucketed by free space, for a real-life table with fillfactor=10 under the mixed load that I've described:

 free |   count
------+---------
 7750 |    2003
 7800 |    7113
 7850 |    1781
 7900 |    6803
 7950 |   13643
 8000 |   64779
 8050 | 2469665
 8100 |   61869
(163 rows)

The 'free' column is the bucket, where the value is the lower limit. So free=7750 means between 7750-7799 bytes free, and count is the number of pages that have this amount free according to the FSM. In this case, the vast majority has between 8050-8099 bytes free according to the FSM. That means that, for this particular case, for a fillfactor of 10, we'd need to subtract ~120 bytes or so in order to properly recycle the pages.

-Floris
non-HOT update not looking at FSM for large tuple update
Hi hackers,

Recently we found a table that was slowly, but consistently, increasing in size. The table has a low fillfactor set and is updated very frequently. As expected, almost all updates are HOT updates, but for some of the non-HOT updates it always wanted to use a new page, rather than reuse an existing empty page. This led to a steady growth in table size (and a steady growth in the number of empty pages in the table).

I've managed to create a very simple example that reproduces the problem (the original problem occurred on 12.4, but I've tested this example on latest master). It only occurs for updates where the new tuple is larger than what "fillfactor" would normally allow. In real life, this would only be a very small portion of the updates to a table of course, but in this example every update is this large.

Create a table with a low fillfactor and insert one row into it. Note that, in this case, the row that we're inserting is by itself larger than the "max fill factor space".

create table t1 (a int primary key, b text) with (fillfactor=10);
insert into t1 select 1, (select string_agg('1', '') from generate_series(1,1000)); -- 1000 byte text field
vacuum t1;

postgres=# select * from pg_freespace('t1');
 blkno | avail
-------+-------
     0 |  7104
(1 row)

This looks alright - there's one page and the available space is indeed roughly 1000 bytes less, because of our tuple and the page header.

Now, in a different backend, initiate a longer query:

select pg_sleep(600); -- just sleep 600 seconds so that we have enough time to do some updates during this

Then, in the original backend, update the tuple 7 times:

-- execute this 7 times
update t1 set b=(select string_agg((random()*9)::int::text, '') from generate_series(1,1000)) where a=1;

Cancel the pg_sleep call. Then execute:

vacuum t1; -- cleans rows and updates the fsm

postgres=# select * from pg_freespace('t1');
 blkno | avail
-------+-------
     0 |  8128
     1 |  7104
(2 rows)

This still looks OK.
There's an extra page, because a total of 8 tuples needed to be kept alive for the pg_sleep query. These didn't fit on one page, so a new page was created.

Now, repeat the process (the pg_sleep, update 7 times, cancel the pg_sleep, and vacuum).

postgres=# select * from pg_freespace('t1');
 blkno | avail
-------+-------
     0 |  8128
     1 |  8128
     2 |  7104
(3 rows)

This does not look good anymore. The tuple was on page 1, so at first there were several HOT updates on page 1. Then, when page 1 was full, it needed to search for another page to put the tuple on. It did not consider page 0, but instead decided to create a new page 2. Repeating this process creates a new page each time, never reusing the empty old pages.

The reason it does not consider page 0 is this piece of code in the function RelationGetBufferForTuple in hio.c:

    /* Compute desired extra freespace due to fillfactor option */
    saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                   HEAP_DEFAULT_FILLFACTOR);
    ...
    if (len + saveFreeSpace > MaxHeapTupleSize)
    {
        /* can't fit, don't bother asking FSM */
        targetBlock = InvalidBlockNumber;
        use_fsm = false;
    }

The problem here is twofold. First, for any non-HOT update of a tuple that's larger than the fillfactor size, the FSM will not be used; instead a new page is chosen. This seems to rely on the false assumption that every existing page has at least one tuple on it. Secondly, and this is a bit trickier: even if we would ask the FSM to come up with a free page with a free size of MaxHeapTupleSize, it wouldn't find anything. This is because the FSM tracks free space excluding any unused line pointers.
In this example, if we look at block 0:

postgres=# select * from page_header(get_raw_page('t1', 0));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/16D75A0 |        0 |     5 |    52 |  8192 |    8192 |     8192 |       4 |         0
(1 row)

postgres=# select * from heap_page_items(get_raw_page('t1', 0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+--------
  1 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  2 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  3 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  4 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
RE: Index Skip Scan (new UniqueKeys)
> One UniqueKey can have multiple corresponding expressions, which gives us
> also the possibility of having one unique key with (t1.a, t2.a), and it looks
> now similar to EquivalenceClass.

I believe the current definition of a unique key with two expressions (t1.a, t2.a) means that it's unique on the tuple (t1.a, t2.a) - this gives weaker guarantees than uniqueness on (t1.a) and uniqueness on (t2.a).

> The idea behind this query sounds questionable to me; more transparent
> would be to do this without distinct, skipping will actually do exactly the
> same stuff just under another name. But if allowing skipping on constants
> does not bring significant changes in the code, probably it's fine.

Yeah indeed, I didn't say it's a query that people should generally write. :-) It's better to write it as a regular SELECT with LIMIT 1, of course. However, it's more about being consistent and predictable to the user: if SELECT DISTINCT ON (a) * FROM t1 runs fast, then it doesn't make sense to the user if SELECT DISTINCT ON (a) * FROM t1 WHERE a=2 runs slow. Supporting it also makes the implementation more consistent, with few code changes.

> > Yeah, there's definitely some double work there, but the actual impact may
> > be limited - it doesn't actually allocate a new path key, but looks it up
> > in root->canon_pathkeys and returns that path key.
> >
> > I wrote it like this because I couldn't find a way to identify, from a
> > certain PathKey, the actual location in the index of that column. The
> > constructed path keys list filters out all redundant path keys. An index
> > on (a,a,b,a,b) becomes path keys (a,b). Now if we skip on (a,b) we actually
> > need to use prefix=3. But how to get from PathKey=b to that number 3, I
> > didn't find a solid way except doing this. Maybe there is one though?
> I don't think there is a direct way, but why not modify build_index_paths to
> also provide this information, or compare index_pathkeys expressions with
> the indextlist without actually creating those pathkeys again?

I agree there could be other ways - I don't currently have a strong preference for either. I can have a look at this later.

> And a couple of words about this thread [1]. It looks to me like a strange
> way of interacting with the community. Are you going to duplicate everything
> there, or what are your plans? At the very least you could try to include
> everyone involved in the recipients list, not exclude some of the authors.

When I wrote the first mail in the thread, I went to this thread [1] and included everyone from there, but I see now that I only included the to: and cc: people and forgot the original thread author, you. I'm sorry about that - I should've looked better to make sure I had everyone. In any case, my plan is to keep the patch at least applicable to master, as I believe it can be helpful for discussions about both patches.

[1] https://www.postgresql.org/message-id/20200609102247.jdlatmfyeecg52fi%40localhost
RE: [PATCH] Keeps tracking the uniqueness with UniqueKey
Hi Andy,

A small thing I found:

+static List *
+get_exprs_from_uniqueindex(IndexOptInfo *unique_index,
+                           List *const_exprs,
+                           List *const_expr_opfamilies,
+                           Bitmapset *used_varattrs,
+                           bool *useful,
+                           bool *multi_nullvals)
...
+    indexpr_item = list_head(unique_index->indexprs);
+    for (c = 0; c < unique_index->ncolumns; c++)
+    {

I believe the for loop must be over unique_index->nkeycolumns, rather than ncolumns. It shouldn't include the extra non-key columns. This can currently also lead to invalid memory accesses a few lines later, when it accesses unique_index->opfamily[c]: this array only has nkeycolumns entries.

-Floris

From: Andy Fan
Sent: Sunday 19 July 2020 5:03 AM
To: Dmitry Dolgov <9erthali...@gmail.com>
Cc: David Rowley; PostgreSQL Hackers; Tom Lane; Ashutosh Bapat; rushabh.lat...@gmail.com
Subject: Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

Fixed a test case in v10.

--
Best Regards
Andy Fan
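To illustrate the loop-bound issue mentioned above with a toy model (Python standing in for the C here, and all numbers invented for illustration): an index with non-key columns has ncolumns > nkeycolumns, while per-key arrays such as opfamily only have nkeycolumns entries, so iterating up to ncolumns runs past the end.

```python
class ToyIndexInfo:
    """Toy stand-in for IndexOptInfo: an index on (a, b) INCLUDE (c)."""
    def __init__(self):
        self.ncolumns = 3             # key + included columns
        self.nkeycolumns = 2          # key columns only
        self.opfamily = [1984, 1994]  # per-key array: length == nkeycolumns

idx = ToyIndexInfo()

# Looping over ncolumns touches opfamily[2], one past the end
# (an IndexError here; a silent invalid memory access in C).
try:
    for c in range(idx.ncolumns):
        idx.opfamily[c]
    overran = False
except IndexError:
    overran = True

# The suggested fix loops over nkeycolumns and stays in bounds.
key_opfamilies = [idx.opfamily[c] for c in range(idx.nkeycolumns)]

print(overran, key_opfamilies)  # True [1984, 1994]
```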
RE: Index Skip Scan (new UniqueKeys)
> > > - the uniquekeys that is built, still contains some redundant keys, that are
> > > normally eliminated from the path keys lists.
> >
> > I guess you're talking about:
> >
> > + if (EC_MUST_BE_REDUNDANT(ec))
> > +     continue;
> >
> > Can you add some test cases to your changes to show the effect of it? It
> > seems to me redundant keys are already eliminated at this point by either
> > make_pathkeys_for_uniquekeys or even earlier for distinct on, but could be
> > I've missed something.

The build_uniquekeys function calls make_pathkeys_for_uniquekeys, which checks for uniqueness using pathkey_is_unique, but not for constantness. Consider a query like:

SELECT DISTINCT ON (a,b) * FROM t1 WHERE a=1 ORDER BY a,b,c

All the regular path keys filter out 'a' for constantness here - so they would end up with distinct pathkeys of (b) and sort pathkeys of (b,c). The unique keys would end up with (a,b) though. In the previous patch, the pathkeys were compared to the unique keys in create_distinct_paths, so it wouldn't consider the skip scan. Due to the other changes I made in create_distinct_paths/query_has_uniquekeys_for, it will choose a correct plan now, even without the EC_MUST_BE_REDUNDANT check, so it's difficult to give an actual failing test case now.

However, since all other code filters out constant keys, I think uniqueness should do it too - otherwise you could get into problems later on. It's also more consistent: if you already know something is unique by just (b), it doesn't make sense to store that it's unique by (a,b). Now that I think of it, the best place to do this EC_MUST_BE_REDUNDANT check is probably inside make_pathkeys_for_uniquekeys, rather than build_uniquekeys. It's probably good to move it there.
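The "constant keys are redundant" argument above, in miniature (plain Python with hypothetical data; not planner code): once the WHERE clause pins `a` to a single value, uniqueness on (a,b) and uniqueness on (b) alone are the same property, so storing (a,b) carries no extra information.

```python
# Rows satisfying WHERE a=1: column a is constant across the result set.
rows = [(1, 10), (1, 20), (1, 30)]

# The index guarantees rows are unique on (a, b):
unique_on_ab = len({(a, b) for a, b in rows}) == len(rows)

# With a constant column, uniqueness on (b) alone follows automatically:
unique_on_b = len({b for _, b in rows}) == len(rows)

assert unique_on_ab and unique_on_b
```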
> Along the lines I'm also curious about this part:
>
> -    ListCell   *k;
> -    List       *exprs = NIL;
> -
> -    foreach(k, ec->ec_members)
> -    {
> -        EquivalenceMember *mem = (EquivalenceMember *) lfirst(k);
> -        exprs = lappend(exprs, mem->em_expr);
> -    }
> -
> -    result = lappend(result, makeUniqueKey(exprs, false, false));
> +    EquivalenceMember *mem = (EquivalenceMember *) lfirst(list_head(ec->ec_members));
>
> I'm curious about this myself, maybe someone can clarify. It looks like
> generally speaking there could be more than one member (if not
> ec_has_volatile), which "representing knowledge that multiple items are
> effectively equal". Is this information not interesting enough to preserve it
> in unique keys?

Yeah, that's a good question. Hence my question about the choice for Expr rather than EquivalenceClass for the UniqueKeys patch to Andy/David. When storing just Expr, it is rather difficult to check equivalence in joins, for example. Suppose, later on, we decide to add join support to the distinct skip scan. Consider a query like this:

SELECT DISTINCT t1.a FROM t1 JOIN t2 ON t1.a=t2.a

As far as my understanding goes (I didn't look into it in detail though), I think here the distinct_pathkey will have an EqClass {t1.a, t2.a}. That results in a UniqueKey with expr (t1.a) (because currently we only take the first Expr in the list to construct the UniqueKey). We could also construct *two* unique keys, one with Expr (t1.a) and one with Expr (t2.a), but I don't think that's the correct approach either, as it will explode when you have multiple pathkeys, each having multiple Exprs inside their EqClasses. That makes it difficult to check if we can perform the DISTINCT skip scan on table t2 as well (theoretically we could, but for that we need to check equivalence classes rather than expressions).

> > - the distinct_pathkeys may be NULL, even though there's a possibility for
> > skipping. But it wouldn't create the uniquekeys in this case.
> > This makes the planner not choose skip scans even though it could. For
> > example in queries that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1
> > ORDER BY a,b; Since a is constant, it's eliminated from regular pathkeys.
>
> What would be the point of skipping if it's a constant?

For the query:

SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b;

There may be 1000s of records with a=1. We're only interested in the first one though. The traditional non-skip approach would still scan all records with a=1. Skip would just fetch the first one with a=1, then skip to the next prefix and stop the scan.

> > - to combat the issues mentioned earlier, there's now a check in
> > build_index_paths that checks if the query_pathkeys matches the
> > useful_pathkeys. Note that we have to use the path keys here rather than
> > any of the unique keys. The unique keys are only Expr nodes - they do not
> > contain the necessary information about ordering. Due to elimination of
> > some constant path keys, we have to search the attributes of the index to
> > find the correct prefix to use in skipping.
>
> IIUC here you mean this function, right?
>
> + prefix = find_index_prefix_for_pathkey(root,
RE: Index Skip Scan (new UniqueKeys)
Hi Dmitry,

Also took another look at the patch now, and found a case of incorrect data. It looks related to the new way of creating the paths, as I can't recall seeing this in earlier versions.

create table t1 as select a,b,b%5 as c, random() as d from generate_series(1, 10) a, generate_series(1,100) b;
create index on t1 (a,b,c);

postgres=# explain select distinct on (a) * from t1 order by a,b desc,c;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Sort  (cost=2.92..2.94 rows=10 width=20)
   Sort Key: a, b DESC, c
   ->  Index Scan using t1_a_b_c_idx on t1  (cost=0.28..2.75 rows=10 width=20)
         Skip scan: true
(4 rows)

With the 'order by a, b desc, c' we expect the value of column 'b' to always be 100. With index_skipscan enabled, it always gives 1 though. It's not correct that the planner chooses a skip scan followed by sort in this case.

-Floris
RE: Index Skip Scan
> * Suspicious performance difference between different type of workload,
> mentioned by Tomas (unfortunately I had no chance yet to investigate).

His benchmark results indeed most likely point to multiple comparisons being done. Since the most likely place where these occur is _bt_readpage, I suspect this is called multiple times. Looking at your patch, I think that's indeed the case. For example, suppose a page contains [1,2,3,4,5] and the planner makes a complete misestimation and chooses a skip scan here. The first call to _bt_readpage will compare every tuple on the page already and store everything in the workspace, which will now contain [1,2,3,4,5]. However, when a skip is done, the elements on the page (not the workspace) are compared to find the next one. Then, another _bt_readpage is done, starting at the new offnum. So we'll compare every tuple (except 1) on the page again. The workspace now contains [2,3,4,5]. For the next tuple we'll end up with [3,4,5], etc. So tuple 5 actually gets compared 5 times in _bt_readpage alone.

-Floris
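A toy illustration of the quadratic behaviour described in the message above (plain Python, not the actual nbtree code): if each skip triggers a fresh _bt_readpage starting one offset further on the same page, the tail of the page is compared again every time.

```python
page = [1, 2, 3, 4, 5]   # tuples on one leaf page
comparisons = 0

def read_page(start):
    """Stand-in for _bt_readpage: compares every tuple from `start` onwards."""
    global comparisons
    comparisons += len(page) - start

# One readpage per skip, each restarting at the next offset:
for offnum in range(len(page)):
    read_page(offnum)

# 5 + 4 + 3 + 2 + 1 = 15 comparisons, instead of 5 for a single pass:
assert comparisons == 15
```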
RE: Index Skip Scan
> Hm, I wasn't aware about this one, thanks for bringing this up. Btw, Floris, I
> would appreciate if in the future you can make it more visible that changes you
> suggest contain some fixes. E.g. it wasn't clear for me from your previous email
> that that's the case, and it doesn't make sense to pull into different direction
> when we're trying to achieve the same goal :)

I wasn't aware that this particular case could be triggered before I saw Dilip's email, otherwise I'd have mentioned it here of course. It's just that because my patch handles filter conditions in general, it works for this case too.

> > > In the patch I posted a week ago these cases are all handled
> > > correctly, as it introduces this extra logic in the Executor.
> >
> > Okay, so I think we can merge those fixes in Dmitry's patch set.
>
> I'll definitely take a look at suggested changes in filtering part.

It may be possible to just merge the filtering part into your patch, but I'm not entirely sure. Basically you have to pull the information about skipping one level up, out of the node, into the generic IndexNext code. I'm eager to get some form of skip scans into master - any kind of patch that makes this possible is fine by me. Long term I think my version provides a more generic approach, with which we can optimize a much broader range of queries. However, since many more eyes have seen your patch so far, I hope yours can be committed much sooner. My knowledge of the committer process is limited though. That's why I've just posted mine so far, in the hope of collecting some feedback, also on how we should continue with the process.

-Floris
RE: Index Skip Scan
> > > On Sun, Apr 05, 2020 at 04:30:51PM +0530, Dilip Kumar wrote:
> > >
> > > I was just wondering how the distinct will work with the "skip scan"
> > > if we have some filter? I mean every time we select the unique row
> > > based on the prefix key and that might get rejected by an external
> > > filter right?

Yeah, you're correct. This patch only handles the index conditions and doesn't handle any filters correctly. There's a check in the planner for the IndexScan, for example, that only columns that exist in the index are used. However, this check is not sufficient, as your example shows. There are a number of ways we can force a 'filter' rather than an 'index condition' and still choose a skip scan (WHERE b!=0 is another one, I think). This leads to incorrect query results. This patch would need some logic in the planner to never choose the skip scan in these cases. A better long-term solution is to adapt the rest of the executor to work correctly in the cases of external filters (this ties in with the previous visibility discussion as well, as that's basically also an external filter, although a special case). In the patch I posted a week ago these cases are all handled correctly, as it introduces this extra logic in the Executor.

-Floris
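A toy model of why a plain filter breaks a naive skip scan, as described in the message above (plain Python with hypothetical data; not executor code): the skip scan surfaces exactly one row per distinct prefix, so a filter applied afterwards can discard a prefix entirely instead of falling through to the next row with the same prefix.

```python
# (a, b) tuples as they appear in an index on (a, b); DISTINCT ON (a).
rows = [(1, 0), (1, 3), (2, 0), (2, 5)]
flt = lambda a, b: b != 0        # WHERE b != 0, applied as a filter

# Naive skip scan: pick the first row per prefix, THEN apply the filter.
skip_result, seen = [], set()
for a, b in rows:
    if a not in seen:
        seen.add(a)
        if flt(a, b):
            skip_result.append((a, b))

# Correct DISTINCT ON (a): apply the filter first, then take the
# first surviving row per prefix.
correct, seen = [], set()
for a, b in rows:
    if flt(a, b) and a not in seen:
        seen.add(a)
        correct.append((a, b))

assert correct == [(1, 3), (2, 5)]
assert skip_result == []   # both prefixes were wrongly dropped
```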
RE: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()
> Attached is v5, which inlines in a targeted fashion, pretty much in the same
> way as the earliest version. This is the same as v4 in every other way.
> Perhaps you can test this.

Thank you for the new patch. With the new one I am indeed able to reproduce a performance increase. It is very difficult to reliably reproduce it when running on a large number of cores though, due to the NUMA architecture. For tests with a small number of cores, I pin all of them to the same node. With that, I see a significant performance increase for v5 compared to master. However, if I pin pgbench to a different node than the node that Postgres is pinned to, this leads to a 20% performance degradation compared to having all of them on the same node, as well as the stddev increasing by a factor of 2 (regardless of patch). With that, it becomes very difficult to see any kind of performance increase due to the patch.

For a large number of pgbench workers, I cannot specifically pin the pgbench worker on the same node as the Postgres backend connection it's handling. Leaving it to the OS gives very unreliable results - when I run the 30 workers / 30 connections test, I sometimes see periods of up to 30 minutes where it's doing it 'correctly', but it could also randomly run at the -20% performance for a long time. So far my best bet at explaining this is the NUMA performance hit. I'd like to be able to specifically schedule some Postgres backends to run on one node, while other Postgres backends run on a different node, but this isn't straightforward. For now, I see no issues with the patch though. However, in real life situations there may be other, more important, optimizations for people that use big multi-node machines. Thoughts?

-Floris
RE: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()
> The cfbot thinks it doesn't even apply anymore --- conflict with the dedup
> patch, no doubt?

Minor conflict with that patch indeed. Attached is a rebased version. I'm running some tests now - will post the results here when finished.

-Floris

v4-0001-Avoid-pipeline-stall-in-_bt_compare.patch
v4-0002-Inline-_bt_compare.patch
RE: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()
> The interesting thing now is the role of the "negative infinity test" patch
> (the 0003-* patch) in all of this. I suspect that it may not be helping us
> much here. I wonder, could you test the following configurations to settle
> this question?
>
> * with 30 clients (i.e. repeat the test that you reported on most recently)
>
> * with 30 clients (i.e. repeat the test that you reported got us that nice
>   ~8.6% increase in TPS)
>
> * with 30 clients -- a new test, to see if performance is at all helped by
>   the "negative infinity test" patch (the 0003-* patch).
>
> It seems like a good idea to repeat the other two tests as part of performing
> this third test, out of general paranoia. Intel seem to roll out a microcode
> update for a spectre-like security issue about every other day.

I ran all the tests on two different machines, several times for 1 hour each time. I'm still having a hard time getting reliable results for the 30 clients case though. I'm pretty certain the patches bring a performance benefit, but how high exactly is difficult to say. As for applying only patch 1+2 or all three patches - I found no significant difference between these two cases. It looks like all the performance benefit is in the first two patches.

-Floris
RE: Index Skip Scan
> this point me and Jesper inclined to go with the second option. But maybe
> I'm missing something, are there any other suggestions?

Unfortunately I figured this would need a more invasive fix. I tend to agree that it'd be better not to skip in situations like this. I think it'd make most sense if any plan for these 'prepare/fetch' queries did not use a skip, but rather a materialize node, right?

-Floris
RE: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()
> Can you test this version, Floris? The second two patches are probably not
> helping here, so it would be useful if you could just test 0001-*, and then test
> all three together. I can toss the latter two patches if there is no additional
> speedup.

Here's the results for runs with respectively 1 client, 9 clients and 30 clients on master, v2-0001, v2-0001+0002+0003 and for completeness also the previous v1 version of the patches. I ran the tests for 45 minutes each this time which seems to give more stable results. I'd say applying just v2-0001 is actually slightly hurtful for single-core performance. Applying all of them gives a good improvement though. It looks like the performance improvement is also more noticeable at higher core counts now.

number of clients: 1
number of threads: 1
duration: 2700 s
number of transactions actually processed: 139314796
latency average = 0.019 ms
latency stddev = 0.001 ms
tps = 51598.071835 (including connections establishing)
tps = 51598.098715 (excluding connections establishing)

number of clients: 1
number of threads: 1
duration: 2700 s
number of transactions actually processed: 137257492
latency average = 0.020 ms
latency stddev = 0.001 ms
tps = 50836.107076 (including connections establishing)
tps = 50836.133137 (excluding connections establishing)

number of clients: 1
number of threads: 1
duration: 2700 s
number of transactions actually processed: 141721881
latency average = 0.019 ms
latency stddev = 0.001 ms
tps = 52489.584928 (including connections establishing)
tps = 52489.611373 (excluding connections establishing)

number of clients: 1
number of threads: 1
duration: 2700 s
number of transactions actually processed: 141663780
latency average = 0.019 ms
latency stddev = 0.001 ms
tps = 52468.065549 (including connections establishing)
tps = 52468.093018 (excluding connections establishing)

number of clients: 9
number of threads: 9
duration: 2700 s
number of transactions actually processed: 1197242115
latency average = 0.020 ms
latency stddev = 0.001 ms
tps = 443422.987601 (including connections establishing)
tps = 443423.306495 (excluding connections establishing)

number of clients: 9
number of threads: 9
duration: 2700 s
number of transactions actually processed: 1187890004
latency average = 0.020 ms
latency stddev = 0.002 ms
tps = 439959.241392 (including connections establishing)
tps = 439959.588125 (excluding connections establishing)

number of clients: 9
number of threads: 9
duration: 2700 s
number of transactions actually processed: 1203412941
latency average = 0.020 ms
latency stddev = 0.002 ms
tps = 445708.478915 (including connections establishing)
tps = 445708.798583 (excluding connections establishing)

number of clients: 9
number of threads: 9
duration: 2700 s
number of transactions actually processed: 1195359533
latency average = 0.020 ms
latency stddev = 0.001 ms
tps = 442725.734267 (including connections establishing)
tps = 442726.052676 (excluding connections establishing)

number of clients: 30
number of threads: 30
duration: 2700 s
number of transactions actually processed: 2617037081
latency average = 0.031 ms
latency stddev = 0.011 ms
tps = 969272.811990 (including connections establishing)
tps = 969273.960316 (excluding connections establishing)

number of clients: 30
number of threads: 30
duration: 2700 s
number of transactions actually processed: 2736881585
latency average = 0.029 ms
latency stddev = 0.011 ms
tps = 1013659.581348 (including connections establishing)
tps = 1013660.819277 (excluding connections establishing)

number of clients: 30
number of threads: 30
duration: 2700 s
number of transactions actually processed: 2844199686
latency average = 0.028 ms
latency stddev = 0.011 ms
tps = 1053407.074721 (including connections establishing)
tps = 1053408.220093 (excluding connections establishing)

number of clients: 30
number of threads: 30
duration: 2700 s
number of transactions actually processed: 2693765822
latency average = 0.030 ms
latency stddev = 0.011 ms
tps = 997690.883117 (including connections establishing)
tps = 997692.051005 (excluding connections establishing)
RE: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()
> > I could do some tests with the patch on some larger machines. What exact
> > tests do you propose? Are there some specific postgresql.conf settings and
> > pgbench initialization you recommend for this? And was the test above just
> > running 'pgbench -S' select-only with specific -T, -j and -c parameters?

With Andres' instructions I ran a couple of tests. With your patches I can reproduce a speedup of ~3% on single core tests reliably on a dual-socket 36-core machine for the pgbench select-only test case. When using the full scale test my results are way too noisy even for large runs unfortunately. I also tried some other queries (for example select's that return 10 or 100 rows instead of just 1), but can't see much of a speed-up there either, although it also doesn't hurt. So I guess the most noticeable one is the select-only benchmark for 1 core:

transaction type:
scaling factor: 300
query mode: prepared
number of clients: 1
number of threads: 1
duration: 600 s
number of transactions actually processed: 30255419
latency average = 0.020 ms
latency stddev = 0.001 ms
tps = 50425.693234 (including connections establishing)
tps = 50425.841532 (excluding connections establishing)

transaction type:
scaling factor: 300
query mode: prepared
number of clients: 1
number of threads: 1
duration: 600 s
number of transactions actually processed: 31363398
latency average = 0.019 ms
latency stddev = 0.001 ms
tps = 52272.326597 (including connections establishing)
tps = 52272.476380 (excluding connections establishing)

This is the one with 40 clients, 40 threads. Not really an improvement, and still quite noisy.
transaction type:
scaling factor: 300
query mode: prepared
number of clients: 40
number of threads: 40
duration: 600 s
number of transactions actually processed: 876846915
latency average = 0.027 ms
latency stddev = 0.015 ms
tps = 1461407.539610 (including connections establishing)
tps = 1461422.084486 (excluding connections establishing)

transaction type:
scaling factor: 300
query mode: prepared
number of clients: 40
number of threads: 40
duration: 600 s
number of transactions actually processed: 872633979
latency average = 0.027 ms
latency stddev = 0.038 ms
tps = 1454387.326179 (including connections establishing)
tps = 1454396.879195 (excluding connections establishing)

For tests that don't use the full machine (eg. 10 clients, 10 threads) I see speed-ups as well, but not as high as the single-core run. It seems there are other bottlenecks (on the machine) coming into play.

-Floris
RE: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()
> He reported that the problem went away with the patches applied. The
> following pgbench SELECT-only result was sent to me privately:
>
> before:
> single: tps = 30300.202363 (excluding connections establishing)
> all cores: tps = 1026853.447047 (excluding connections establishing)
>
> after:
> single: tps = 31120.452919 (excluding connections establishing)
> all cores: tps = 1032376.361006 (excluding connections establishing)
>
> (Actually, he tested something slightly different -- he inlined
> _bt_compare() in his own way instead of using my 0002-*, and didn't use the
> 0003-* optimization at all.)
>
> Apparently this was a large multi-socket machine. Those are hard to come by.

I could do some tests with the patch on some larger machines. What exact tests do you propose? Are there some specific postgresql.conf settings and pgbench initialization you recommend for this? And was the test above just running 'pgbench -S' select-only with specific -T, -j and -c parameters?

-Floris
RE: Index Skip Scan
Hi Dmitry,

Thanks for the new patch! I tested it and managed to find a case that causes some issues. Here's how to reproduce:

drop table if exists t;
create table t as select a,b,b%2 as c,10 as d from generate_series(1,5) a, generate_series(1,1000) b;
create index on t (a,b,c,d);

-- correct
postgres=# begin; declare c scroll cursor for select distinct on (a) a,b,c,d from t order by a desc, b desc; fetch forward all from c; fetch backward all from c; commit;
BEGIN
DECLARE CURSOR
 a |  b   | c | d
---+------+---+----
 5 | 1000 | 0 | 10
 4 | 1000 | 0 | 10
 3 | 1000 | 0 | 10
 2 | 1000 | 0 | 10
 1 | 1000 | 0 | 10
(5 rows)

 a |  b   | c | d
---+------+---+----
 1 | 1000 | 0 | 10
 2 | 1000 | 0 | 10
 3 | 1000 | 0 | 10
 4 | 1000 | 0 | 10
 5 | 1000 | 0 | 10
(5 rows)

-- now delete some rows
postgres=# delete from t where a=3;
DELETE 1000

-- and rerun: error is thrown
postgres=# begin; declare c scroll cursor for select distinct on (a) a,b,c,d from t order by a desc, b desc; fetch forward all from c; fetch backward all from c; commit;
BEGIN
DECLARE CURSOR
 a |  b   | c | d
---+------+---+----
 5 | 1000 | 0 | 10
 4 | 1000 | 0 | 10
 2 | 1000 | 0 | 10
 1 | 1000 | 0 | 10
(4 rows)

ERROR:  lock buffer_content is not held
ROLLBACK

A slightly different situation arises when executing the cursor with an ORDER BY a, b instead of the ORDER BY a DESC, b DESC:

-- recreate table again and execute the delete as above
postgres=# begin; declare c scroll cursor for select distinct on (a) a,b,c,d from t order by a, b; fetch forward all from c; fetch backward all from c; commit;
BEGIN
DECLARE CURSOR
 a | b | c | d
---+---+---+----
 1 | 1 | 1 | 10
 2 | 1 | 1 | 10
 4 | 1 | 1 | 10
 5 | 1 | 1 | 10
(4 rows)

 a |  b  | c | d
---+-----+---+----
 5 |   1 | 1 | 10
 4 |   1 | 1 | 10
 2 | 827 | 1 | 10
 1 |   1 | 1 | 10
(4 rows)

COMMIT

And lastly, you'll also get incorrect results if you do the delete slightly differently:

-- leave one row where a=3 and b=1000
postgres=# delete from t where a=3 and b<=999;
-- the cursor query above won't show any of the a=3 rows even though they should
-Floris
RE: Index Skip Scan
> Note in particular that index scans cannot return the same index tuple twice --
> processing a page at a time ensures that that cannot happen.
>
> Can a loose index scan return the same tuple (i.e. a tuple with the same heap
> TID) to the executor more than once?

The loose index scan shouldn't return a tuple twice. It should only be able to skip 'further', so that shouldn't be a problem. Out of curiosity, why can't index scans return the same tuple twice? Is there something in the executor that isn't able to handle this?

-Floris
Re: Index Skip Scan
Hi Dmitry,

> > On Wed, Jan 22, 2020 at 07:50:30AM +, Floris Van Nee wrote:
> >
> > Anyone please correct me if I'm wrong, but I think one case where the
> > current patch relies on some data from the page it has locked before it in
> > checking this hi/lo key. I think it's possible for the following sequence
> > to happen. Suppose we have a very simple one leaf-page btree containing
> > four elements: leaf page 1 = [2,4,6,8]
> > We do a backwards index skip scan on this and have just returned our first
> > tuple (8). The buffer is left pinned but unlocked. Now, someone else comes
> > in and inserts a tuple (value 5) into this page, but suppose the page
> > happens to be full. So a page split occurs. As far as I know, a page split
> > could happen at any random element in the page. One of the situations we
> > could be left with is:
> > Leaf page 1 = [2,4]
> > Leaf page 2 = [5,6,8]
> > However, our scan is still pointing to leaf page 1.
>
> In case if we just returned a tuple, the next action would be either
> check the next page for another key or search down to the tree. Maybe

But it won't look at the 'next page for another key', but rather at the 'same page for another key', right? In the _bt_scankey_within_page shortcut we're taking, there's no stepping to a next page involved. It just locks the page again that it previously also locked.

> I'm missing something in your scenario, but the latter will land us on a
> required page (we do not point to any leaf here), and before the former
> there is a check for high/low key. Is there anything else missing?

Let me try to clarify. After we return the first tuple, so->currPos.buf is pointing to page=1 in my example (it's the only page after all). We've returned item=8. Then the split happens and the items get rearranged as in my example. We're still pointing with so->currPos.buf to page=1, but the page now contains [2,4].
The split happened to the right, so there's a page=2 with [5,6,8], however the ongoing index scan is unaware of that. Now _bt_skip gets called to fetch the next tuple. It starts by checking _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir), the result of which will be 'true': we're comparing the skip key to the low key of the page. So it thinks the next key can be found on the current page. It locks the page and does a _binsrch, finding item=4 to be returned. The problem here is that _bt_scankey_within_page mistakenly returns true, thereby limiting the search to just the page that it's pointing to already. It may be fine to just fix this function to return the proper value (I guess it'd also need to look at the high key in this example). It could also be fixed by not looking at the lo/hi key of the page, but to use the local tuple buffer instead. We already did a _read_page once, so if we have any matching tuples on that specific page, we have them locally in the buffer already. That way we never need to lock the same page twice.
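The failure sequence above can be sketched with a toy model (plain Python; none of this is the real buffer/lock machinery, and the check below is a deliberate simplification of _bt_scankey_within_page for a backward scan): a check that only consults the page's low key cannot detect that the upper half of the page moved right during the split.

```python
def scankey_within_page(page, key):
    # Simplified stand-in for _bt_scankey_within_page on a backward scan:
    # it only checks the page's low key, so it cannot tell that tuples
    # above the split point were moved to a new right sibling.
    return page[0] <= key

def binsrch_backward(page, key):
    """Largest tuple on the page that sorts before `key`."""
    return max(t for t in page if t < key)

page1 = [2, 4, 6, 8]               # scan returned 8 from this page, pin kept
page1, page2 = [2, 4], [5, 6, 8]   # concurrent split moves 5,6,8 right

skip_key = 8
assert scankey_within_page(page1, skip_key)    # wrongly answers "yes"
assert binsrch_backward(page1, skip_key) == 4  # so the scan returns 4 ...
assert binsrch_backward(page2, skip_key) == 6  # ... although 6 is correct
```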
RE: Index Skip Scan
> Could you please elaborate why does it sound like that? If I understand
> correctly, to stop a scan only "between" pages one need to use only
> _bt_readpage/_bt_steppage? Other than that there is no magic with scan
> position in the patch, so I'm not sure if I'm missing something here.

Anyone please correct me if I'm wrong, but I think there is one case where the current patch relies on some data from the page it has locked before, in checking this hi/lo key. I think it's possible for the following sequence to happen. Suppose we have a very simple one leaf-page btree containing four elements:

leaf page 1 = [2,4,6,8]

We do a backwards index skip scan on this and have just returned our first tuple (8). The buffer is left pinned but unlocked. Now, someone else comes in and inserts a tuple (value 5) into this page, but suppose the page happens to be full. So a page split occurs. As far as I know, a page split could happen at any random element in the page. One of the situations we could be left with is:

Leaf page 1 = [2,4]
Leaf page 2 = [5,6,8]

However, our scan is still pointing to leaf page 1. For non-skip scans this is not a problem, as we already read all matching elements into our local buffer and we'll return those. But the skip scan currently:
a) checks the lo-key of the page to see if the next prefix can be found on leaf page 1
b) finds out that this is actually true
c) does a search on the page and returns value=4 (while it should have returned value=6)

Peter, is my understanding about the btree internals correct so far?

Now that I look at the patch again, I fear there currently may also be such a dependency in the "Advance forward but read backward" case. It saves the offset number of a tuple in a variable, then does a _bt_search (releasing the lock and pin on the page).
At this point, anything can happen to the tuples on this page - the page may be compacted by vacuum such that the offset number you have in your variable does not match the actual offset number of the tuple on the page anymore. Then, at the check for (nextOffset == startOffset) later, there's a possibility the offsets are different even though they relate to the same tuple. -Floris
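The offset-number hazard described above can be sketched with a toy model (hypothetical values; the real code works with OffsetNumbers on a buffer page): a saved offset identifies a slot, not a tuple, so compaction can shift the same tuple to a different offset while the lock is released.

```python
page = ["dead", "A", "B"]        # line-pointer slots 0..2
start_offset = page.index("B")   # remember where tuple "B" was: offset 2

# Lock and pin are released; vacuum removes the dead tuple and compacts:
page = [t for t in page if t != "dead"]   # page is now ["A", "B"]

# Re-searching for the same tuple now yields a different offset:
next_offset = page.index("B")             # offset 1

# A (nextOffset == startOffset) comparison no longer holds, even though
# both offsets referred to the same tuple "B":
assert next_offset != start_offset
assert page[next_offset] == "B"
```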
RE: Index Skip Scan
Hi all,

I reviewed the latest version of the patch. Overall some good improvements I think. Please find my feedback below.

- I think I mentioned this before - it's not that big of a deal, but it just looks weird and inconsistent to me:

create table t2 as (select a, b, c, 10 d from generate_series(1,5) a, generate_series(1,100) b, generate_series(1,1) c);
create index on t2 (a,b,c desc);

postgres=# explain select distinct on (a,b) a,b,c from t2 where a=2 and b>=5 and b<=5 order by a,b,c desc;
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Index Only Scan using t2_a_b_c_idx on t2  (cost=0.43..216.25 rows=500 width=12)
   Skip scan: true
   Index Cond: ((a = 2) AND (b >= 5) AND (b <= 5))
(3 rows)

postgres=# explain select distinct on (a,b) a,b,c from t2 where a=2 and b=5 order by a,b,c desc;
                                       QUERY PLAN
------------------------------------------------------------------------------------------
 Unique  (cost=0.43..8361.56 rows=500 width=12)
   ->  Index Only Scan using t2_a_b_c_idx on t2  (cost=0.43..8361.56 rows=9807 width=12)
         Index Cond: ((a = 2) AND (b = 5))
(3 rows)

When doing a distinct on (params) and having equality conditions for all params, it falls back to the regular index scan even though there's no reason not to use the skip scan here. It's much faster to write b between 5 and 5 now rather than writing b=5. I understand this was a limitation of the unique-keys patch at the moment, which could be addressed in the future. I think for the sake of consistency it would make sense if this eventually gets addressed.

- nodeIndexScan.c, line 126

This sets xs_want_itup to true in all cases (even for non skip-scans). I don't think this is acceptable given the side-effects this has (the page will never be unpinned in between returned tuples in _bt_drop_lock_and_maybe_pin).

- nbtsearch.c, _bt_skip, line 1440

_bt_update_skip_scankeys(scan, indexRel);

This function is called twice now - once in the else {} and immediately after that outside of the else. The second call can be removed I think.
- nbtsearch.c, _bt_skip, line 1597

LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
scan->xs_itup = (IndexTuple) PageGetItem(page, itemid);

This is an UNLOCK followed by a read of the unlocked page. That looks incorrect?

- nbtsearch.c, _bt_skip, line 1440

if (BTScanPosIsValid(so->currPos) &&
    _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))

Is it allowed to look at the high key / low key of the page without having a read lock on it?

- nbtsearch.c, line 1634

if (_bt_readpage(scan, indexdir, offnum)) ... else error()

Is it really guaranteed that a match can be found on the page itself? Isn't it possible that an extra index condition, not part of the scan key, makes none of the keys on the page match?

- nbtsearch.c in general

Most of the code seems to rely quite heavily on the fact that xs_want_itup forces _bt_drop_lock_and_maybe_pin to never release the buffer pin. Have you considered that compacting of a page may still happen even if you hold the pin? [1] I've been trying to come up with cases in which this may break the patch, but I haven't been able to produce such a scenario - so it may be fine. But it would be good to consider again. One thing I was thinking of was a scenario where page splits and/or compacting would happen in between returning tuples. Could this break the _bt_scankey_within_page check such that it thinks the scan key is within the current page, while it actually isn't? Mainly for backward and/or cursor scans. Forward scans shouldn't be a problem I think. Apologies for being a bit vague as I don't have a clear example ready when it would go wrong. It may well be fine, but it was one of the things on my mind.

[1] https://postgrespro.com/list/id/1566683972147.11...@optiver.com

-Floris
jsonb_set() strictness considered harmful to data
FWIW I've been bitten by this 'feature' more than once as well, accidentally erasing a column. Now I usually write

js = jsonb_set(js, '{<path>}', coalesce(new_column, 'null'::jsonb))

to prevent erasing the whole column, and instead set the value to a jsonb null, but I also found the STRICT behavior very surprising at first.

-Floris
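For illustration, the STRICT behavior and the coalesce workaround can be modeled in a few lines of Python (a simplified sketch of jsonb_set semantics, not the real implementation; the single-level path handling here is an assumption for brevity):

```python
import json

def jsonb_set_strict(target, path, new_value):
    """Mimic a STRICT SQL function: any NULL (None) argument makes
    the whole result NULL -- this is what silently erases the column."""
    if target is None or path is None or new_value is None:
        return None
    obj = json.loads(target)
    cur = obj
    for key in path[:-1]:          # walk down to the parent of the target key
        cur = cur[key]
    cur[path[-1]] = json.loads(new_value)
    return json.dumps(obj)

def jsonb_set_coalesce(target, path, new_value):
    """The workaround: coalesce a NULL new value to a JSON null literal,
    so the document survives and only the field becomes null."""
    return jsonb_set_strict(target, path,
                            new_value if new_value is not None else 'null')

# The surprising STRICT behavior: the whole document becomes NULL.
assert jsonb_set_strict('{"a": 1}', ['a'], None) is None
# The workaround keeps the document and stores a JSON null instead.
assert jsonb_set_coalesce('{"a": 1}', ['a'], None) == '{"a": null}'
```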
Re: Index Skip Scan
> Sorry for long delay. Here is more or less what I had in mind. After changing
> read_closest to read the whole page I couldn't resist to just merge it into
> readpage itself, since it's basically the same. It could raise questions about
> performance and how intrusive this patch is, but I hope it's not that much of a
> problem (in the worst case we can split it back). I've also added a few tests
> for the issue you've mentioned. Thanks again, I appreciate how much effort
> you put into reviewing!

Putting it into one function makes sense I think. Looking at the patch, I think in general there are some good improvements in there. I'm afraid I did manage to find another incorrect query result though, having to do with the keepPrev part and skipping to the first tuple on an index page:

postgres=# drop table if exists b;
create table b as select a,b::int2 b,(b%2)::int2 c from generate_series(1,5) a, generate_series(1,366) b;
create index on b (a,b,c);
analyze b;
DROP TABLE
SELECT 1830
CREATE INDEX
ANALYZE
postgres=# set enable_indexskipscan=1;
SET
postgres=# select distinct on (a) a,b,c from b where b>=1 and c=0 order by a,b;
 a | b | c
---+---+---
 1 | 2 | 0
 2 | 4 | 0
 3 | 4 | 0
 4 | 4 | 0
 5 | 4 | 0
(5 rows)

postgres=# set enable_indexskipscan=0;
SET
postgres=# select distinct on (a) a,b,c from b where b>=1 and c=0 order by a,b;
 a | b | c
---+---+---
 1 | 2 | 0
 2 | 2 | 0
 3 | 2 | 0
 4 | 2 | 0
 5 | 2 | 0
(5 rows)

-Floris
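For reference, the DISTINCT ON semantics that the skip scan has to reproduce can be written as a small Python model (a sketch for reasoning about correctness, not PostgreSQL code):

```python
def distinct_on_first_match(rows, qual):
    """Reference semantics for DISTINCT ON (a) ... WHERE qual ORDER BY a,b:
    for each value of the first column, return the first row in index order
    that satisfies the qual. A correct skip scan must match this, including
    right after skipping to a new prefix."""
    out = []
    seen = set()
    for row in sorted(rows):
        if row[0] not in seen and qual(row):
            seen.add(row[0])
            out.append(row)
    return out

# The test case from above: (a, b, b % 2), qual is b >= 1 and c = 0.
rows = [(a, b, b % 2) for a in range(1, 6) for b in range(1, 367)]
result = distinct_on_first_match(rows, lambda r: r[1] >= 1 and r[2] == 0)
# The first qualifying tuple per prefix is b=2 for every a -- the patched
# skip scan returning b=4 for a >= 2 is therefore wrong.
assert result == [(a, 2, 0) for a in range(1, 6)]
```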
Re: Optimize single tuple fetch from nbtree index
>> It seems that it contradicts the very idea of your patch, so probably we
>> should look for other ways to optimize this use-case.
>> Maybe this restriction can be relaxed for write-only tables, that never
>> have to reread the page because of visibility, or something like that.
>> Also we probably can add to IndexScanDescData info about expected number
>> of tuples, to allow index work more optimal and avoid the overhead for
>> other loads.

> The idea of the patch is exactly to relax this limitation. I forgot to update
> that README file though. The current implementation of the patch should be
> correct like this - that's why I added the look-back code on the page if the
> tuple couldn't be found anymore on the same location on the page. Similarly,
> it'll look on the page to the right if it detected a page split. These two
> measures combined should give a correct implementation of the 'it's possible
> that a scan stops in the middle of a page' relaxation. However, as Peter and
> Tom pointed out earlier, they feel that the performance advantage that this
> approach gives does not outweigh the extra complexity at this time. I'd be
> open to other suggestions though.

Although now that I think of it - do you mean the case where the tuple that we returned to the caller after _bt_first actually gets deleted (not moved) from the page? I guess that can theoretically happen if _bt_first returns a non-visible tuple (one not yet marked DEAD in the index at the time of _bt_first). For my understanding, would a situation like the following lead to this (in my patch)?

1) Backend 1 does an index scan and returns the first tuple on _bt_first - this tuple is actually deleted in the heap already, however it's not marked dead yet in the index.
2) Backend 1 does a heap fetch to check actual visibility and determines the tuple is actually dead.
3) While backend 1 is busy doing the heap fetch (so in between _bt_first and _bt_next), backend 2 comes in and manages to somehow do 1) a _bt_killitems on the page to mark tuples dead, as well as 2) compact items on the page, thereby actually removing this item from the page.
4) Now backend 1 tries to find the next tuple in _bt_next - it first tries to locate the tuple where it left off, but cannot find it anymore because it got removed completely by backend 2.

If this is indeed possible then it's a bad issue unfortunately, and quite hard to reproduce, as a lot of things need to happen concurrently while doing a visibility check.

As for your patch, I've had some time to take a look at it. For the two TODOs:

+ /* TODO Is it possible that currPage is not valid anymore? */
+ Assert(BTScanPosIsValid(so->currPos))

This Assert exists already a couple of lines earlier at the start of this function.

+ * TODO It is not clear to me
+ * why to check scanpos validity based on currPage value.
+ * I wonder, if we need currPage at all? Is there any codepath that
+ * assumes that currPage is not the same as BufferGetBlockNumber(buf)?
+ */

The comments in the source mention the following about this:

 * We note the buffer's block number so that we can release the pin later.
 * This allows us to re-read the buffer if it is needed again for hinting.
 */
so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);

As we figured out earlier, so->currPos.buf gets set to invalid when we release the pin by the unpin macro. So, if we don't store the currPage number somewhere else, we cannot obtain the pin again if we need it during killitems. I think that's the reason that currPage is stored.

Other than the two TODOs in the code, I think the comments really help clarify what's going on in the code - I'd be happy if this gets added.

-Floris
Re: Optimize single tuple fetch from nbtree index
> It seems that it contradicts the very idea of your patch, so probably we
> should look for other ways to optimize this use-case.
> Maybe this restriction can be relaxed for write-only tables, that never
> have to reread the page because of visibility, or something like that.
> Also we probably can add to IndexScanDescData info about expected number
> of tuples, to allow index work more optimal and avoid the overhead for
> other loads.

The idea of the patch is exactly to relax this limitation. I forgot to update that README file though. The current implementation of the patch should be correct like this - that's why I added the look-back code on the page if the tuple couldn't be found anymore on the same location on the page. Similarly, it'll look on the page to the right if it detected a page split. These two measures combined should give a correct implementation of the 'it's possible that a scan stops in the middle of a page' relaxation. However, as Peter and Tom pointed out earlier, they feel that the performance advantage that this approach gives does not outweigh the extra complexity at this time. I'd be open to other suggestions though.

> That's true. It took me quite some time to understand that existing code
> is correct.
> There is a comment for the structure's field that claims that
> BufferIsValid is the same as BufferIsPinned in ScanPos context.
> Attached patch contains some comments' updates. Any suggestions on how
> to improve them are welcome.

I'll have a look tomorrow. Thanks a lot for writing this up!

-Floris
Re: Optimize single tuple fetch from nbtree index
> Hello,
> welcome to hackers with your first patch)

Thank you.

> Though, I got interested in the comment inconsistency you have found.
> I added debug message into this code branch of the patch and was able to
> see it in regression.diffs after 'make check':
> Speaking of your patch, it seems that the buffer was unpinned and pinned
> again between two reads,
> and the condition of holding it continuously has not been met.

May I ask what makes you conclude that the condition of holding the pin continuously has not been met? Your reply encouraged me to dig a little bit more into this today. First, I wanted to check if indeed the pin was continuously held by the backend or not. I added some debug info to ReleaseBuffer for this: it turned out that the pin on the buffer was definitely never released by the backend between the calls to _bt_first and _bt_next. So the buffer got compacted while the backend held a pin on it.

After some more searching I found the following code: _bt_vacuum_one_page in nbtinsert.c. This function compacts one single page without taking a super-exclusive lock. It is used during inserts to make room on a page. I verified that if I comment out the calls to this function, the compacting never happens while I have a pin on the buffer. So I guess that answers my own question: cleaning up garbage during inserts is one of the cases where compacting may happen even while other backends hold a pin on the buffer. Perhaps this should also be phrased more clearly in the comments in eg. _bt_killitems? Because currently those comments make it look like this case never occurs.

> While reading the code to answer your question, I noticed that
> BTScanPosIsPinned macro name is misleading.
> It calls BufferIsValid(), not BufferIsPinned() as one could expect.
> And BufferIsValid in bufmgr.h comment explicitly states that it
> shouldn't be confused with BufferIsPinned.
> The same goes for BTScanPosUnpinIfPinned().

I agree the name is misleading.
It clearly does something else than how it's named. However, I don't believe this introduces problems in these particular pieces of code, as long as the macros are always used. BTScanPosIsPinned actually checks whether the position is valid and not necessarily whether it's pinned, as you mentioned. However, any time the buffer gets unpinned using the macro BTScanPosUnpin, the buffer gets set to Invalid by the macro as well. Therefore, any consecutive call to BTScanPosIsPinned should indeed return false. It'd definitely be nice if this gets clarified in the comments though.

-Floris
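As a small illustration of the invariant described above (a toy Python model, not the actual C macros):

```python
class ScanPos:
    """Toy model of the nbtree invariant discussed above: the unpin macro
    always resets the buffer to invalid, so 'is valid' doubles as
    'is pinned' -- provided the macros are used consistently."""
    INVALID = -1

    def __init__(self, buf):
        self.buf = buf

    def is_pinned(self):
        # BTScanPosIsPinned really checks validity, not the pin count.
        return self.buf != ScanPos.INVALID

    def unpin(self):
        # BTScanPosUnpin invalidates the buffer as it unpins, which is
        # what keeps the is_pinned() check honest.
        self.buf = ScanPos.INVALID

pos = ScanPos(buf=42)
assert pos.is_pinned()
pos.unpin()
assert not pos.is_pinned()   # valid-implies-pinned holds after unpin
```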
Re: Index Skip Scan
> Yes, the check should be for that. However, the query in question
> doesn't have any query_pathkeys, and hence query_uniquekeys in
> standard_qp_callback(), so therefore it isn't supported.
> Your scenario is covered by one of the test cases in case the
> functionality is supported. But, I think that is outside the scope of
> the current patch.

Ah alright, thanks. That makes it clear why it doesn't work. From a user point of view I think it's rather strange that

SELECT DISTINCT ON (a) a,b FROM a WHERE a BETWEEN 2 AND 2

would give a fast skip scan, even though the more likely query that someone would write,

SELECT DISTINCT ON (a) a,b FROM a WHERE a=2

would not. It is something we could leave up to the next patch though.

Something else I just noticed which I'm just writing here for awareness; I don't think it's that pressing at the moment and can be left to another patch. When there are multiple indexes on a table, the planner gets confused and doesn't select an index-only skip scan even though it could. I'm guessing it just takes the first available index based on the DISTINCT clause and then doesn't look further, eg. with an index on (a,b) and (a,c,b):

postgres=# explain select distinct on (a) a,c,b FROM a;
                             QUERY PLAN
--------------------------------------------------------------------
 Index Scan using a_a_b_idx on a  (cost=0.29..1.45 rows=5 width=12)
   Skip scan mode: true
(2 rows)

-> This could be an index-only scan with the (a,c,b) index.

> For the records, the purpose of `_bt_read_closest` is not so much to reduce
> amount of data we read from a page, but more to correctly handle those
> situations we were discussing before with reading forward/backward in cursors,
> since for that in some cases we need to remember previous values for stepping
> to the next. I've limited number of items, fetched in this function just
> because I was misled by having a check for dead tuples in `_bt_skip`.
> Of course
> we can modify it to read a whole page and leave the visibility check for the
> code after `index_getnext_tid` (although in case we know that all tuples on
> this page are visible I guess it's not strictly necessary, but we still get
> improvement from skipping itself).

I understand and I agree - the primary purpose why we chose this function was to make it work correctly. I don't think it would be something for this patch to use the optimization of partially reading a page. My point was however, if this optimization were allowed in a future patch, it would have great performance benefits. To fix the current patch, we'd indeed need to read the full page. It'd be good to take a close look at the implementation of this function then, because messing around with the previous/next item is also not trivial. I think the current implementation also has a problem when the item that is skipped to is the first item on the page. Eg. (this depends on page size):

postgres=# drop table if exists b;
create table b as select a,b from generate_series(1,5) a, generate_series(1,366) b;
create index on b (a,b);
analyze b;
DROP TABLE
SELECT 1830
CREATE INDEX
ANALYZE
postgres=# select distinct on(a) a,b from b;
 a | b
---+---
 1 | 1
 2 | 2   <-- (2,1) is the first item on the page and doesn't get selected by the
            read_closest function. It returns the second item on the page, which is (2,2).
 3 | 2
 4 | 2
 5 | 2
(5 rows)

-Floris
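A small reference model of what a corrected read has to produce (plain Python, just to pin down the expected output when a prefix starts exactly at a page boundary):

```python
def first_per_prefix(items):
    """Reference result for DISTINCT ON (a) over index order: the first
    tuple of each prefix. A correct (partial or full) page read must
    return this even when the first tuple of a prefix is also the first
    tuple on an index page."""
    out, seen = [], set()
    for a, b in items:
        if a not in seen:
            seen.add(a)
            out.append((a, b))
    return out

# Mirrors the repro above: 366 tuples per prefix, so each new prefix
# lands close to an index page boundary.
items = sorted((a, b) for a in range(1, 6) for b in range(1, 367))
# Correct output is (a, 1) for every a -- the buggy read_closest returned
# (2, 2) instead of (2, 1).
assert first_per_prefix(items) == [(a, 1) for a in range(1, 6)]
```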
Re: Optimize single tuple fetch from nbtree index
Hi Peter,

> Actually, having looked at the test case in more detail, that now
> seems less likely. The test case seems designed to reward making it
> cheaper to access one specific tuple among a fairly large group of
> related tuples -- reducing the number of inner scans is not going to
> be possible there.
> If this really is totally representative of the case that Floris cares
> about, I suppose that the approach taken more or less makes sense.
> Unfortunately, it doesn't seem like an optimization that many other
> users would find compelling, partly because it's only concerned with
> fixed overheads, and partly because most queries don't actually look
> like this.

Thanks for taking a look. Unfortunately this is exactly the case I care about. I'm a bit puzzled as to why this case wouldn't come up more often for other users though. We have many large tables with timeseries data, and it seems to me that with timeseries data, two of the most common queries are:

(1) What is the state of { a1, a2, a3 ... } at timepoint t? (but you don't know that there's an update *exactly* at timepoint t - so you're left with trying to find the latest update smaller than t)
(2) What is the state of { a } at timepoints { t1, t2, t3 ... }?

Given that a1, a2, a3 ... are independently updating, but similar, time series (eg. sensor a1 and sensor a2 both provide a temperature value and update independently from each other). Both of these can also be done with some kind of DISTINCT ON clause, but this is often already much slower than just doing a nested loop of fast index lookups with LIMIT 1 (this depends on the frequency of the timeseries data itself versus the sampling rate of your query though; for high-frequency time series and/or low-frequency sampling, the LIMIT 1 approach is much faster).

Note that there is actually some related work to this - in the Index Skip Scan thread [1] a function called _bt_read_closest was developed which also partially reads the page.
A Skip Scan has a very similar access pattern to the use case I describe here, because it's also very likely to just require one tuple from the page. Even though the implementation in that patch is currently incorrect, performance of the Skip Scan would likely also be quite a bit faster if it had a correct implementation of this partial page read and it wouldn't have to read the full page every time.

I have one further question about these index offsets. There are several comments in master that indicate that it's impossible that an item moves 'left' on a page if we continuously hold a pin on the page. For example, _bt_killitems has a comment like this:

 * Note that if we hold a pin on the target page continuously from initially
 * reading the items until applying this function, VACUUM cannot have deleted
 * any items from the page, and so there is no need to search left from the
 * recorded offset. (This observation also guarantees that the item is still
 * the right one to delete, which might otherwise be questionable since heap
 * TIDs can get recycled.) This holds true even if the page has been modified
 * by inserts and page splits, so there is no need to consult the LSN.

Still, exactly this case happens in practice. In my tests I was able to get behavior like:

1) pin + lock a page in _bt_first
2) read a tuple, record its indexOffset (for example offset=100) and heap tid
3) unlock the page, but *keep* the pin (end of _bt_first in my patch)
4) lock the page again in _bt_next (we still hold the pin, so vacuum shouldn't have occurred)
5) look inside the current page for the heap tid that we registered earlier
6) find that the tuple is now at indexOffset=98, eg. it moved left. This should not be possible.

This case sometimes randomly happens when running 'make check', which is why I added code in my patch to also look left on the page from the previous indexOffset. However, this is in contradiction with the comments (and code) of _bt_killitems.
Is the comment incorrect/outdated or is there a bug in vacuum or any other part of Postgres that might move index items left even though there are others holding a pin? -Floris [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGKW4dXTP9G%2BWBskjT09tzD%2B9aMWEm%3DFpeb6RS5SXfPyKw%40mail.gmail.com#21abe755d5cf36aabaaa048c8a282169
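The re-find logic described above can be sketched roughly like this (a simplified Python model with hypothetical names, not the actual patch code):

```python
def relocate_tid(page, remembered_off, tid):
    """Sketch of re-finding a tuple after re-locking a page while holding
    only the pin in between: the tuple may have shifted from the
    remembered offset. Check the remembered offset first, then search
    right (inserts push items right), then left (the surprising case:
    compaction by _bt_vacuum_one_page can move items left)."""
    n = len(page)
    if remembered_off < n and page[remembered_off] == tid:
        return remembered_off
    for off in range(remembered_off + 1, n):        # moved right
        if page[off] == tid:
            return off
    for off in range(remembered_off - 1, -1, -1):   # moved left
        if page[off] == tid:
            return off
    return None  # gone: page split moved it right, or it was removed

# 't3' was compacted away, so 't4' now sits left of its remembered offset.
page = ['t1', 't2', 't4', 't5']
assert relocate_tid(page, 3, 't5') == 3   # unmoved
assert relocate_tid(page, 3, 't4') == 2   # found left of remembered offset
```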
Re: Index Skip Scan
Thanks for the new patch. I've reviewed the skip scan patch, but haven't taken a close look at the uniquekeys patch yet.

In my previous review I mentioned that queries of the form

select distinct on(a) a,b from a where a=1;

do not lead to a skip scan with the patch, even though the skip scan would be much faster. It's about this piece of code in planner.c:

/* Consider index skip scan as well */
if (enable_indexskipscan &&
    IsA(path, IndexPath) &&
    ((IndexPath *) path)->indexinfo->amcanskip &&
    root->distinct_pathkeys != NIL)

The root->distinct_pathkeys is already filtered for redundant keys, so column 'a' is not in there anymore. Still, it'd be much faster to use the skip scan here, because a regular scan will scan all entries with a=1, even though we're really only interested in the first one. In previous versions, this would be fixed by changing the check in planner.c to use root->uniq_distinct_pathkeys instead of root->distinct_pathkeys, but things change a bit now that the patch is rebased on the unique-keys patch. Would it be valid to change this check to root->query_uniquekeys != NIL to consider skip scans also for this query?

Second is about the use of _bt_skip and _bt_read_closest in nbtsearch.c. I don't think _bt_read_closest is correctly implemented, and I'm not sure if it can be used at all, due to concerns by Tom and Peter about such an approach. I had a similar idea to only partially read items from a page for another use case, for which I submitted a patch last Friday. However, both Tom and Peter find this idea quite scary [1]. You could take a look at my patch on that thread to see the approach taken to correctly partially read a page (well, correct as far as I can see so far...), but perhaps we need to just use the regular _bt_readpage function which reads everything, although this is unfortunate from a performance point of view, since most of the time we're indeed just interested in the first tuple. Eg.
it looks like there are some mix-ups between index offsets and heap tids in _bt_read_closest:

/*
 * Save the current item and the previous, even if the
 * latter does not pass scan key conditions
 */
ItemPointerData tid = prevItup->t_tid;
OffsetNumber prevOffnum = ItemPointerGetOffsetNumber(&tid);

_bt_saveitem(so, itemIndex, prevOffnum, prevItup);
itemIndex++;
_bt_saveitem(so, itemIndex, offnum, itup);
itemIndex++;

The 'prevOffnum' is the offset number of the heap tid, and not the index offset, so it looks like just a random item is saved. Furthermore, index offsets may change due to insertions and vacuums, so if we, at any point, release the lock, these offsets are not necessarily valid anymore. However, currently the patch just reads the closest item and then doesn't consider the page at all anymore if the first tuple skipped to turns out to be not visible. Consider the following sql to illustrate:

create table a (a int, b int, c int);
insert into a (select vs, ks, 10 from generate_series(1,5) vs, generate_series(1, 1) ks);
create index on a (a,b);
analyze a;

select distinct on (a) a,b from a order by a,b;
 a | b
---+---
 1 | 1
 2 | 1
 3 | 1
 4 | 1
 5 | 1
(5 rows)

delete from a where a=2 and b=1;
DELETE 1

select distinct on (a) a,b from a order by a,b;
 a | b
---+-----
 1 | 1
 2 | 249   ->> this should be b=2, because we deleted a=2 && b=1. However, it
             doesn't consider any tuples from that page anymore and gives us the
             first tuple from the next page.
 3 | 1
 4 | 1
 5 | 1
(5 rows)

-Floris

[1] https://www.postgresql.org/message-id/flat/26641.1564778586%40sss.pgh.pa.us#dd8f23e1704f45447185894e1c2a4f2a
Re: Optimize single tuple fetch from nbtree index
Hi Tom,

Thanks for your quick reply!

> Regardless of whether there's actually a LIMIT 1? That seems disastrous
> for every other case than the narrow one where the optimization wins.
> Because every other case is now going to have to touch the index page
> twice. That's more CPU and about double the contention --- if you could
> not measure any degradation from that, you're not measuring the right
> thing.

I thought the same as well at first. Note that I did measure a degradation of 2-3% in some cases, as mentioned, but initially I also expected worse. Do you have any ideas on cases that would suffer the most? I thought the tight inner nested loop that I posted in my performance tests would have this index lookup as bottleneck. I know they are the bottleneck for the LIMIT 1 query (because these improve by a factor 2-3 with the patch). And my theory is that for a LIMIT 3, the price paid for this optimization is highest, because it would touch the page twice and read all items from it, while only returning three of them.

> In principle, you could pass down knowledge of whether there's a LIMIT,
> using the same mechanism used to enable top-N sorting. But it'd have to
> also go through the AM interface layer, so I'm not sure how messy this
> would be.

This was an idea I had as well and I would be willing to implement such a thing if this is deemed interesting enough by the community. However, I didn't want to do this for the first version of this patch, as it would be quite some extra work, which would be useless if the idea of the patch itself gets rejected already. :-) I'd appreciate any pointers in the right direction - I can take a look at how top-N sorting pushes the LIMIT down. Given enough interest for the basic idea of this patch, I will implement it.

>> This calls _bt_first in nbtsearch.c, which will, if there are scankeys,
>> descend the tree to a leaf page and read just the first (or possibly two)
>> tuples. It won't touch the rest of the page yet. If indeed just one tuple
>> was required, there won't be a call to _bt_next and we're done. If we do
>> need more than one tuple, _bt_next will resume reading tuples from the
>> index page at the point where we left off.

> How do you know how many index entries you have to fetch to get a tuple
> that's live/visible to the query?

Indeed we don't know that - that's why this initial patch does not make any assumptions about this and just assumes the good-weather scenario that everything is visible. I'm not sure if it's possible to give an estimation of this and whether or not that would be useful. Currently, if it turns out that the tuple is not visible, there'll just be another call to _bt_next which will resume reading the page as normal. I'm open to implementing any suggestions that may improve this.

>> - We need to take into account page splits, insertions and vacuums while we
>> do not have the read-lock in between _bt_first and the first call to
>> _bt_next. This made my patch quite a bit more complicated than my initial
>> implementation.

> Meh. I think the odds that you got this 100% right are small, and the
> odds that it would be maintainable are smaller. There's too much that
> can happen if you're not holding any lock --- and there's a lot of active
> work on btree indexes, which could break whatever assumptions you might
> make today.

Agreed, which is also why I posted this initial version of the patch here already, to get some input from the experts on this topic about what assumptions can be made now and in the future. If it turns out that it's completely not feasible to do an optimization like this, because of other constraints in the btree implementation, then we're done pretty quickly here. :-) For what it's worth: the patch at least passes make check consistently - I caught a lot of these edge cases related to page splits and insertions while running the regression tests, which run the modified bits of code quite often and in parallel. There may be plenty of edge cases left however...

> I'm not unalterably opposed to doing something like this, but my sense
> is that the complexity and probable negative performance impact on other
> cases are not going to look like a good trade-off for optimizing the
> case at hand.

I do think it could be a big win if we could get something like this working. Cases with a LIMIT seem common enough to me to make it possible to add some extra optimizations, especially if that could lead to 2-3x the TPS for these kinds of queries. However, it indeed needs to be within a reasonable complexity. If it turns out that in order to optimize this, we need to add a lot of extra complexity, it may not be worth it.

> BTW, you haven't even really made the case that optimizing a query that
> behaves this way is the right thing to be doing ... maybe some other
> plan shape that isn't a nestloop around a LIMIT query would be a better
> solution.

It is prett
Optimize single tuple fetch from nbtree index
Hi hackers,

While I was reviewing some code in another patch, I stumbled upon a possible optimization in the btree index code in nbtsearch.c for queries using 'LIMIT 1'. I have written a small patch that implements this optimization, but I'd like some feedback on the design of the patch, whether it is correct at all to use this optimization, and whether the performance tradeoffs are deemed worth it by the community.

Basically, an example of the case I'd like to optimize is the following. Given a table 'tbl' with an index on columns (k, ts DESC):

SELECT * FROM tbl WHERE k=:val AND ts<=:timestamp ORDER BY k, ts DESC LIMIT 1;

And, even more importantly, when this query gets called in an inner loop like:

SELECT * FROM generate_series(:start_ts, :end_ts, :interval) ts -- perhaps thousands of iterations; could also be a loop over values of 'k' rather than timestamps, this is just an example
CROSS JOIN LATERAL (
  SELECT * FROM tbl WHERE k=:val AND ts<=:timestamp ORDER BY k, ts DESC LIMIT 1
) _;

With time-series data, this case often arises, as you have a certain natural key for which you store updates as they occur. Getting the state of k at a specific time then boils down to the given query, which is almost always the fastest way to get this information, since the index scan with LIMIT 1 is very fast already. However, there seems to be a possibility to make this even faster (up to nearly 3x faster in test cases that use this nested loop of index scans).

Every time the index scan is done, all tuples from the leaf page are read in nbtsearch.c:_bt_readpage. The idea of this patch is to make an exception for this *only* the first time amgettuple gets called. This calls _bt_first in nbtsearch.c, which will, if there are scankeys, descend the tree to a leaf page and read just the first (or possibly two) tuples. It won't touch the rest of the page yet. If indeed just one tuple was required, there won't be a call to _bt_next and we're done.
If we do need more than one tuple, _bt_next will resume reading tuples from the index page at the point where we left off.

There are a few caveats:
- Possible performance decrease for queries that need a small number of tuples (but more than one), because they now need to lock the same page twice. This can happen in several cases, for example: LIMIT 3; LIMIT 1 but the first tuple returned does not match other scan conditions; LIMIT 1 but the tuple returned is not visible; no LIMIT at all but there are only a few matching rows.
- We need to take into account page splits, insertions and vacuums while we do not hold the read-lock in between _bt_first and the first call to _bt_next. This made my patch quite a bit more complicated than my initial implementation.

I did performance tests for some best-case and worst-case test scenarios. TPS results were stable and reproducible in re-runs on my, otherwise idle, server. Attached are the full results and how to reproduce. I picked test cases that show best performance as well as worst performance compared to master. Summary: the greatest performance improvement can be seen for the cases with the subquery in a nested loop. In a nested loop of 100 times, the performance is roughly two times better; for 1 times the performance is roughly three times better. For most test cases that don't use LIMIT 1, I couldn't find a noticeable difference, except for the nested loop with a LIMIT 3 (or similarly, a nested loop without any LIMIT clause that returns just three tuples). This is also theoretically the worst-case test case, because it has to lock the page again and then read it, just for one tuple. In this case, I saw TPS decrease by 2-3% in a few cases (details in the attached file), due to it having to lock/unlock the same page in both _bt_first and _bt_next.

A few concrete questions to the community:
- Does the community also see this as a useful optimization?
- Is the way it is currently implemented safe?
I struggled quite a bit to get everything working with respect to page splits and insertions. In particular, I don't know why, in my patch, _bt_find_offset_for_tid needs to consider searching for items with an offset *before* the passed offset. As far as my understanding goes, this could only happen when the index gets vacuumed in the meantime. However, we hold a pin on the buffer the whole time (we even assert this), so vacuum should not have been possible. Still, this code gets triggered sometimes, so it seems to be necessary. Perhaps someone in the community who's more of an expert on this can shed some light on it.

- What are considered acceptable performance tradeoffs in this case? Is a performance degradation in any part generally not acceptable at all?

I'd also welcome any feedback on the process - this is my first patch and while I tried to follow the guidelines, I may have missed something along the way.

Attachments:
- out_limit.txt: pgbench resu
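For context, the 'state of k at timepoint t' lookup that the LIMIT 1 query implements is essentially a binary search over the updates sorted by timestamp; a minimal Python model of the semantics (unrelated to the patch internals, just to show why exactly one index tuple is needed per lookup):

```python
import bisect

def state_at(timestamps, values, ts):
    """Latest update with timestamp <= ts (an 'as of' lookup).
    In SQL: ... WHERE k = :val AND ts <= :t ORDER BY k, ts DESC LIMIT 1.
    With a (k, ts DESC) index, the answer is the single tuple the
    descent lands next to -- reading the rest of the leaf page is wasted
    work in the common case."""
    i = bisect.bisect_right(timestamps, ts)
    return values[i - 1] if i > 0 else None

timestamps = [1, 3, 7]          # update times for one key, sorted
values = ['a', 'b', 'c']        # the state written at each update
assert state_at(timestamps, values, 5) == 'b'   # latest update <= 5
assert state_at(timestamps, values, 0) is None  # no update yet at t=0
```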
Re: Index Skip Scan
> For the general forward direction but for a backwards cursor scroll,
> we'd return the lowest value for each distinct prefix, but for the
> general backwards direction (DESC case) we'd return the highest value
> for each distinct prefix. Looking at IndexNext() the cursor direction
> seems to be estate->es_direction and the general scan direction is
> indicated by the plan's indexorderdir. Can't we just pass both of
> those to index_skip() to have it decide what to do? If we also pass in
> indexorderdir then index_skip() should know if it's to return the
> highest or lowest value, right?

Correct, with these two values correct behavior can be deduced. The implementation of this is a bit cumbersome though. Consider a case like:

SELECT DISTINCT ON (a) a,b,c FROM a WHERE c = 2

(with an index on (a,b,c))

Data (imagine every tuple here actually occurs 10.000 times in the index to see the benefit of skipping):
1,1,1
1,1,2
1,2,2
1,2,3
2,2,1
2,2,3
3,1,1
3,1,2
3,2,2
3,2,3

Creating a cursor on this query and then moving forward, you should get (1,1,2), (3,1,2). In the current implementation of the patch, after _bt_first, it skips over (1,1,2) to (2,2,1). It checks quals and moves forward one-by-one until it finds a match. This match only comes at (3,1,2) however. Then it skips to the end.

If you move the cursor backwards from the end of the cursor, you should still get (3,1,2), (1,1,2). A possible implementation would start at the end and do a skip to the beginning of the prefix: (3,1,1). Then it needs to move forward one-by-one to find the first matching (minimum) item (3,1,2). When it finds it, it needs to skip backwards to the beginning of prefix 2: (2,2,1). It needs to move forwards to find the minimum element, but should stop as soon as it detects that the prefix doesn't match anymore (because there is no match for prefix 2, it will move all the way from (2,2,1) to (3,1,1)).
It then needs to skip backwards again to the start of prefix 1: (1,1,1) and scan forward to find (1,1,2). Perhaps anyone can think of an easier way to implement it? I do think being able to use DISTINCT ON is very useful and it's worth the extra complications.

In the future we can add even more useful skipping features to it, for example:

SELECT DISTINCT ON (a) * FROM a WHERE b = 2

After skipping to the next prefix of column a, we can start a new search for (a,b)=(prefix,2) to avoid having to move one-by-one from the start of the prefix to the first matching element. There are many other useful optimizations possible. That won't have to be for this patch though :-)

-Floris
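The backward-cursor scheme described above can be sketched as a toy Python model (the data and the c = 2 qual come from the example; the function is an invented stand-in, not the actual nbtree code):

```python
# Toy model of the backward cursor scheme sketched above: for each prefix
# in descending order, jump to the start of the prefix and scan forward
# within it until the first tuple matching the quals is found.
DATA = [(1, 1, 1), (1, 1, 2), (1, 2, 2), (1, 2, 3),
        (2, 2, 1), (2, 2, 3),
        (3, 1, 1), (3, 1, 2), (3, 2, 2), (3, 2, 3)]  # sorted on (a, b, c)

def distinct_on_a_backward(data, qual):
    out = []
    for prefix in sorted({t[0] for t in data}, reverse=True):
        # "skip backward" to the start of this prefix, then move forward
        # one-by-one; a prefix with no matching tuple yields nothing
        for t in data:
            if t[0] != prefix:
                continue
            if qual(t):
                out.append(t)
                break
    return out

print(distinct_on_a_backward(DATA, lambda t: t[2] == 2))
# -> [(3, 1, 2), (1, 1, 2)]
```

Note how prefix 2 contributes nothing: the model scans (2,2,1) and (2,2,3) without a match, which mirrors the "moves all the way to (3,1,1)" behavior described above.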
Re: Index Skip Scan
> Thanks for testing! Could you provide a test case to show what exactly is the
> problem?

Note that in the case of a regular non-skip scan, this cursor backwards case works because the Unique node on top does not support backwards scanning at all. Therefore, when creating the cursor, the actual plan contains a Materialize node on top of the Unique + Index Scan nodes. The 'fetch backwards' therefore never reaches the index scan, as it just fetches tuples from the materialize node.

-Floris
Re: Index Skip Scan
> Thanks for testing! Could you provide a test case to show what exactly is the
> problem?

create table a (a int, b int, c int);
insert into a (select vs, ks, 10 from generate_series(1,5) vs, generate_series(1, 1) ks);
create index on a (a,b);
analyze a;
set enable_indexskipscan=1; -- setting this to 0 yields different results
set random_page_cost=1;
explain SELECT DISTINCT ON (a) a,b FROM a;

BEGIN;
DECLARE c SCROLL CURSOR FOR SELECT DISTINCT ON (a) a,b FROM a;
FETCH FROM c;
FETCH BACKWARD FROM c;
FETCH 6 FROM c;
FETCH BACKWARD 6 FROM c;
FETCH 6 FROM c;
FETCH BACKWARD 6 FROM c;
END;
Re: Index Skip Scan
> Here is finally a new version of the patch, where all the mentioned issues
> seems to be fixed and the corresponding new tests should keep it like that
> (I've skipped all the pubs at PostgresLondon for that).

Thanks for the new patch! Really appreciate the work you're putting into it. I verified that the backwards index scan is indeed functioning now. However, I'm afraid it's not that simple, as I think the cursor case is broken now. I think having just the 'scan direction' in the btree code is not enough to get this working properly, because we need to know whether we want the minimum or maximum element of a certain prefix. There are basically four cases:

- Forward Index Scan + Forward cursor: we want the minimum element within a prefix and we want to skip 'forward' to the next prefix
- Forward Index Scan + Backward cursor: we want the minimum element within a prefix and we want to skip 'backward' to the previous prefix
- Backward Index Scan + Forward cursor: we want the maximum element within a prefix and we want to skip 'backward' to the previous prefix
- Backward Index Scan + Backward cursor: we want the maximum element within a prefix and we want to skip 'forward' to the next prefix

These cases make it rather complicated unfortunately. They do somewhat tie in with the previous discussion on this thread about being able to skip to the min or max value. If we ever want to support a sort of minmax scan, we'll encounter the same issues.

Also, I think in planner.c, line 4831, we should actually be looking at whether uniq_distinct_pathkeys is NIL, rather than the current check for distinct_pathkeys. That'll make the planner pick the skip scan even with queries like "select distinct on (a) a,b where a=2". Currently, it doesn't pick the skip scan here, because distinct_pathkeys does not contain "a" anymore. This leads to it scanning every item for a=2 even though it can stop after the first one.

I'll do some more tests with the patch.

-Floris
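The four combinations above can be captured in a small lookup. This is only an illustrative sketch (function name and string encoding invented here, not patch code): the element wanted per prefix depends on the plan's scan direction alone, while the physical skip direction depends on whether the two directions agree.

```python
# The four index-direction x cursor-direction cases enumerated above.
def skip_strategy(index_dir, cursor_dir):
    # forward plan -> minimum element per prefix, backward plan -> maximum;
    # the skip moves to the next prefix only when both directions agree
    want = "min" if index_dir == "forward" else "max"
    skip = "next" if index_dir == cursor_dir else "previous"
    return want, skip

assert skip_strategy("forward", "forward") == ("min", "next")
assert skip_strategy("forward", "backward") == ("min", "previous")
assert skip_strategy("backward", "forward") == ("max", "previous")
assert skip_strategy("backward", "backward") == ("max", "next")
```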
Re: Index Skip Scan
> Try the attached patch -- it has DEBUG1 traces with the contents of
> index tuples from key points during index scans, allowing you to see
> what's going on both as a B-Tree is descended, and as a range scan is
> performed. It also shows details of the insertion scankey that is set
> up within _bt_first(). This hasn't been adopted to the patch at all,
> so you'll probably need to do that.

Thanks! That works quite nicely. I've pinpointed the problem to within _bt_skip. I'll try to illustrate with my test case. The data in the table is (a,b)=(1,1), (1,2) ... (1,1), (2, 1), (2,2), ... (2,1) until (5,1). Running the query:

SELECT DISTINCT ON (a) a,b FROM a WHERE b=2;

The flow is like this: _bt_first is called first - it sees there are no suitable scan keys to start at a custom location in the tree, so it just starts from the beginning and searches until it finds the first tuple (1,2). After the first tuple was yielded, _bt_skip kicks in. It constructs an insert scan key with a=1 and nextkey=true, so doing the _bt_search + _bt_binsrch on this, it finds the first tuple larger than this: (2,1). This is not the tuple that it stops at though, because after that it does this:

    if (ScanDirectionIsForward(dir))
        /* Move back for _bt_next */
        offnum = OffsetNumberPrev(offnum);

    /* Now read the data */
    if (!_bt_readpage(scan, dir, offnum))
    {
        /*
         * There's no actually-matching data on this page. Try to advance to
         * the next page. Return false if there's no matching data at all.
         */
        LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
        if (!_bt_steppage(scan, dir))

First, it takes the previous tuple with OffsetNumberPrev (so the tuple before (2,1), which is (1,1)). This is done because, if this tuple were to be returned, there would be a call to _bt_next afterwards, which would then conveniently be on the tuple (2,1) that we want. However, _bt_readpage messes things up, because it only reads tuples that match all the provided keys (so where b=2). The only tuple it'll return is (2,2).
This will be the tuple that is set; however, on the call to _bt_next, the tuple is first incremented, so we'll find (2,3) there, which doesn't match our keys. This leads it to skip (2,2) in our result set.

I was wondering about something else: don't we also have another problem with updating this current index tuple by skipping before calling btgettuple/_bt_next? I see there's some code in btgettuple to kill dead tuples when scan->kill_prior_tuple is true. I'm not too familiar with the concept of killing dead tuples while doing index scans, but by looking at the code it seems to be possible that btgettuple returns a tuple, the caller processes it and sets kill_prior_tuple to true in order to have it killed. However, then the skip scan kicks in, which sets the current tuple to a completely different tuple. Then, on the next call of btgettuple, the wrong tuple gets killed. Is my reasoning correct here or am I missing something?

-Floris
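To make the failure mode concrete, here is a toy Python replay of the flow described above, on a smaller data set (b running 1..3 per prefix). All names are stand-ins for the nbtree routines, not the real code:

```python
# Toy replay of the buggy flow: index tuples sorted on (a, b), qual b == 2.
# After each yielded match, "_bt_skip" jumps to the first tuple of the next
# prefix, steps one position back (OffsetNumberPrev, for _bt_next), but the
# page read keeps only qual-matching tuples, and _bt_next then advances
# *past* the current match -- losing one matching tuple per skip.
DATA = [(a, b) for a in range(1, 6) for b in range(1, 4)]

def buggy_skip_scan(data, qual):
    out = []
    pos = 0
    # _bt_first: scan from the start until the first qual-matching tuple
    while pos < len(data) and not qual(data[pos]):
        pos += 1
    out.append(data[pos])
    while True:
        cur_prefix = data[pos][0]
        # _bt_skip: locate the first tuple of the next prefix
        nxt = next((i for i, t in enumerate(data) if t[0] > cur_prefix), None)
        if nxt is None:
            break
        pos = nxt - 1                 # OffsetNumberPrev
        # _bt_readpage keeps only qual-matching tuples from here on
        matches = [i for i in range(pos, len(data)) if qual(data[i])]
        if len(matches) < 2:
            break
        # matches[0] becomes the "current" tuple but is never yielded;
        # _bt_next advances past it to matches[1]
        pos = matches[1]
        out.append(data[pos])
    return out

print(buggy_skip_scan(DATA, lambda t: t[1] == 2))
# -> [(1, 2), (3, 2), (5, 2)]
```

The model reproduces the reported symptom: every other prefix's match (here (2,2) and (4,2)) is silently dropped.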
Re: Index Skip Scan
> Thanks for testing! You're right, looks like in the current implementation in
> case of backwards scan there is one unnecessary extra step forward. It seems
> this mistake was made, since I was concentrating only on the backward scans
> with a cursor, and used not exactly correct approach to wrap up after a scan
> was finished. Give me a moment, I'll tighten it up.

Thanks. Looking forward to it. I think I found some other strange behavior. Given the same table as in my previous e-mail, the following queries also return inconsistent results. I spent some time trying to debug it, but can't easily pinpoint the cause. It looks like it also skips over one value too much; my guess is during the _bt_skippage call in _bt_skip?

Perhaps a question: when stepping through code in GDB, is there an easy way to pretty print for example the contents of an IndexTuple? I saw there's some tools out there that can pretty print plans, but viewing tuples is more complicated I guess.

-- this one is OK
postgres=# select distinct on (a) a,b from a where b>1;
 a | b
---+---
 1 | 2
 2 | 2
 3 | 2
 4 | 2
 5 | 2
(5 rows)

-- this one is not OK, it seems to skip too much
postgres=# select distinct on (a) a,b from a where b=2;
 a | b
---+---
 1 | 2
 3 | 2
 5 | 2
(3 rows)
Re: Index Skip Scan
The following sql statement seems to have incorrect results - some logic in the backwards scan is currently not entirely right.

-Floris

drop table if exists a;
create table a (a int, b int, c int);
insert into a (select vs, ks, 10 from generate_series(1,5) vs, generate_series(1, 1) ks);
create index on a (a,b);
analyze a;
select distinct on (a) a,b from a order by a desc, b desc;
explain select distinct on (a) a,b from a order by a desc, b desc;

DROP TABLE
CREATE TABLE
INSERT 0 5
CREATE INDEX
ANALYZE
 a | b
---+---
 5 | 1
 5 | 1
 4 | 1
 3 | 1
 2 | 1
 1 | 1
(6 rows)

                            QUERY PLAN
------------------------------------------------------------------------------
 Index Only Scan Backward using a_a_b_idx on a  (cost=0.29..1.45 rows=5 width=8)
   Scan mode: Skip scan
(2 rows)
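For reference, the result the query above should produce can be computed with a trivial Python model of DISTINCT ON (one row per value of a, ordered by a desc, b desc): five rows, not the six returned by the skip scan.

```python
# Expected output of: select distinct on (a) a,b from a order by a desc, b desc
# over the generated data (a in 1..5, b always 1): each prefix exactly once.
rows = [(a, 1) for a in range(1, 6)]
result = []
for a, b in sorted(rows, key=lambda t: (-t[0], -t[1])):
    if not result or result[-1][0] != a:   # keep the first row of each prefix
        result.append((a, b))
print(result)  # -> [(5, 1), (4, 1), (3, 1), (2, 1), (1, 1)]
```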
Re: Index Skip Scan
> To address this, probably we can do something like in the attached patch.
> Altogether with distinct_pathkeys uniq_distinct_pathkeys are stored, which is
> the same, but without the constants elimination. It's being used then for
> getting the real number of distinct keys, and to check the order of the columns
> to not consider index skip scan if it's different. Hope it doesn't look too hacky.

Thanks! I've verified that it works now. I was wondering if we're not too strict in some cases now though. Consider the following queries:

postgres=# explain(analyze) select distinct on (m,f) m,f from t where m='M2';
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Index Only Scan using t_m_f_t_idx on t  (cost=0.29..11.60 rows=40 width=5) (actual time=0.056..0.469 rows=10 loops=1)
   Scan mode: Skip scan
   Index Cond: (m = 'M2'::text)
   Heap Fetches: 10
 Planning Time: 0.095 ms
 Execution Time: 0.490 ms
(6 rows)

postgres=# explain(analyze) select distinct on (f) m,f from t where m='M2';
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=0.29..849.83 rows=10 width=5) (actual time=0.088..10.920 rows=10 loops=1)
   ->  Index Only Scan using t_m_f_t_idx on t  (cost=0.29..824.70 rows=10052 width=5) (actual time=0.087..8.524 rows=1 loops=1)
         Index Cond: (m = 'M2'::text)
         Heap Fetches: 1
 Planning Time: 0.078 ms
 Execution Time: 10.944 ms
(6 rows)

This is basically the opposite case - when distinct_pathkeys matches the filtered list of index keys, an index skip scan could be considered. Currently, the user needs to write 'distinct m,f' explicitly, even though he specifies in the WHERE clause that 'm' can only have one value anyway. Perhaps it's fine like this, but it could be a small improvement for consistency.

-Floris
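The consistency argument can be illustrated with toy data (invented here, not the actual t table): once m is pinned to a single value by the qual, DISTINCT ON (f) and DISTINCT ON (m,f) necessarily select the same number of rows, which is why a skip scan over the (m,f) index prefix would be valid for both.

```python
# With m fixed by the WHERE clause, DISTINCT ON (f) and DISTINCT ON (m, f)
# have the same cardinality over the filtered rows.
rows = [("M1", "a"), ("M2", "a"), ("M2", "a"), ("M2", "b"), ("M3", "b")]
filtered = [r for r in rows if r[0] == "M2"]

distinct_f = {f for _, f in filtered}
distinct_mf = {(m, f) for m, f in filtered}
print(len(distinct_f), len(distinct_mf))  # -> 2 2
```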
Re: Index Skip Scan
Hi,

Thanks for the helpful replies.

> Yes, good catch, I'll investigate. Looks like in the current implementation
> something is not quite right, when we have this order of columns in an index
> and where clause (e.g. in the examples above everything seems fine if we create
> index over (feedcode, market) and not the other way around).

I did a little bit of investigation and it seems to occur because in pathkeys.c the function pathkey_is_redundant considers pathkeys redundant if there is an equality condition with a constant in the corresponding WHERE clause.

 * 1. If the new pathkey's equivalence class contains a constant, and isn't
 * below an outer join, then we can disregard it as a sort key.  An example:
 *         SELECT ... WHERE x = 42 ORDER BY x, y;

In planner.c it builds the list of distinct_pathkeys, which is then used for the index skip scan to skip over the first length(distinct_pathkeys) columns when it does a skip. In my query, the distinct_pathkeys list only contains 'feedcode' and not 'market', because 'market' was considered redundant due to the WHERE clause. However, the index skip scan interprets this as that it has to skip over just the first column.

We need to get this number of prefix columns to skip differently while building the plan. We need the 'real' number of distinct keys without throwing away the redundant ones. However, I'm not sure if this information can still be obtained while calling create_skipscan_unique_path? But I'm sure people here will have much better ideas than me about this :-)

-Floris
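A toy model of the failure (invented data and helper, the real skipping of course happens inside nbtree): treating the skip prefix as length(distinct_pathkeys) = 1 skips whole market groups, while the correct prefix length of 2 skips (market, feedcode) groups.

```python
# Index on (market, feedcode); after each visited tuple, skip to the first
# tuple whose leading n_skip columns differ.  n_skip=1 (redundant pathkey
# dropped) loses all TEST feedcodes but the first.
INDEX = sorted([("A", "f1"), ("TEST", "f1"), ("TEST", "f2"),
                ("TEST", "f3"), ("Z", "f1")])

def skip_scan(index, qual, n_skip):
    out, pos = [], 0
    while pos < len(index):
        if qual(index[pos]):
            out.append(index[pos])
        prefix = index[pos][:n_skip]
        # skip to the first tuple with a different n_skip-column prefix
        while pos < len(index) and index[pos][:n_skip] == prefix:
            pos += 1
    return out

qual = lambda t: t[0] == "TEST"
print(skip_scan(INDEX, qual, n_skip=2))  # correct: all three TEST feedcodes
print(skip_scan(INDEX, qual, n_skip=1))  # buggy: only ("TEST", "f1")
```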
Re: Index Skip Scan
Actually I'd like to add something to this. I think I've found a bug in the current implementation. Would someone be able to check?

Given a table definition of (market text, feedcode text, updated_at timestamptz, value float8) and an index on (market, feedcode, updated_at desc) (note that this table slightly deviates from what I described in my previous mail), filled with data.

The following query uses an index skip scan and returns just 1 row (incorrect!):

select distinct on (market, feedcode) market, feedcode from streams.base_price where market='TEST'

The following query still uses the regular index scan and returns many more rows (correct):

select distinct on (market, feedcode) * from streams.base_price where market='TEST'

It seems that partially filtering on one of the distinct columns triggers incorrect behavior where too many rows in the index are skipped.

-Floris
Re: Index Skip Scan
After some talks with Jesper at PGCon about the Index Skip Scan, I started testing this patch, because it seems to have great potential in speeding up many of our queries (great conference by the way, really enjoyed my first time being there!). I haven't looked in depth at the code itself, but I focused on some testing with real data that we have.

Let me start by sketching our primary use case for this, as it is similar, but slightly different than what was discussed earlier in this thread. I think this use case is something a lot of people who handle timeseries data have. Our database has many tables with timeseries data. We don't update rows, but just insert new rows each time. One example of this would be a table with prices for instruments. Instruments are identified by a column called feedcode. Prices of instruments update with a certain frequency. Each time it updates, we insert a new row with the new value and the timestamp at that time. So in the simplest form, you could see it as a table like this:

create table prices (feedcode text, updated_at timestamptz, value float8);
-- there'll be some other columns as well, this is just an example
create unique index on prices (feedcode, updated_at desc);

This table perfectly fits the criteria for the index skip scan as there are relatively few distinct feedcodes, but each of them has quite a lot of price inserts (imagine 1000 distinct feedcodes, each of them having one price per second). We normally partition our data by a certain time interval, so let's say we're just looking at one day of prices here. We have other cases with higher update frequencies and/or more distinct values though.

Typical queries on this table involve querying the price at a certain point in time, or simply querying the latest update.
If we know the feedcode, this is easy:

select * from prices
where feedcode='A' and updated_at <= '2019-06-01 12:00'
order by feedcode, updated_at desc limit 1

Unfortunately, this gets hard if you want to know the price of everything at a certain point in time. The query then becomes:

select distinct on (feedcode) * from prices
where updated_at <= '2019-06-01 12:00'
order by feedcode, updated_at desc

Up until now (even with this patch) this uses a regular index scan + a unique node which scans the full index, which is terribly slow and is also not constant - as the table grows it becomes slower and slower. Obviously there are currently already ways to speed this up using the recursive loose index scan, but I think everybody agrees that those are pretty unreadable. However, since they're also several orders of magnitude faster, we actually use them everywhere. Eg.

-- certain point in time
-- first query *all* distinct feedcodes (disregarding time), then do an index scan
-- for every feedcode found to see if it has an update in the time window we're
-- interested in -- this essentially means we're doing 2*N index scans
with recursive t as (
   select feedcode from prices order by feedcode, updated_at desc limit 1
   union all
   select n.feedcode from t
   cross join lateral (select feedcode from prices where feedcode > t.feedcode
                       order by feedcode, updated_at desc limit 1) n
)
select n.* from t
cross join lateral (select * from prices
                    where feedcode=t.feedcode and updated_at <= '2019-06-01 12:00'
                    order by feedcode, updated_at desc limit 1) n

-- just latest value
-- if we're interested in just the latest value, it can actually be optimized to
-- just N index scans like this.
-- to make it even more confusing - there's a tradeoff here.. if you're querying
-- a timestamp close to the latest available timestamp, it is often faster to use
-- this method anyway and just put the filter for updated_at inside this query.
This avoids the overhead of 2*N index scans, at the expense that the LIMIT 1 may have to scan several tuples before finding one that matches the timestamp criteria. With the 2*N method above we're guaranteed that the first tuple it sees is the correct tuple, but we're doing many more scans...

with recursive t as (
   select * from prices order by feedcode, updated_at desc limit 1
   union all
   select n.* from t
   cross join lateral (select * from prices where feedcode > t.feedcode
                       order by feedcode, updated_at desc limit 1) _
)
select * from t

I hope this makes our current situation clear. Please do ask if I need to elaborate on something here.

So what changes with this patch? The great thing is that the recursive CTE is not required anymore! This is a big win for readability and it helps performance as well. It makes everything much better. I am really happy with these results. If the index skip scan triggers, it is easily over 100x faster than the naive 'distinct on' query in earlier versions of Postgres. It is also quite a bit faster than the recursive CTE version of the query. I have a few remarks though. I tested some of our queries with the patch and found that the following
Re: partitioning performance tests after recent patches
Hi David,

Thanks for your reply. I really appreciate your work on run-time pruning! Here's the output of explain/analyze for HEAD. At run-time, technically all partitions could be pruned directly. However, one partition remains in the output of explain/analyze because of other difficulties with removing all of them, if I remember correctly? Still, that partition is never executed. The only difference I can see is the Limit node on top, as well as apparently another partition appearing in the analyze output (p4096_4096, the last partition, remains in the first plan; p4096_1, the first partition, remains in the second plan).

-- select_now.sql
explain(analyze, verbose, buffers on) select * from :tbl where a='abc' and updated_at between now() and now()+interval '1d';

 Append  (cost=0.16..8949.61 rows=4096 width=112) (actual time=0.000..0.000 rows=0 loops=1)
   Subplans Removed: 4095
   ->  Index Scan using p4096_4096_a_updated_at_idx on public.p4096_4096  (cost=0.16..2.18 rows=1 width=112) (never executed)
         Output: p4096_4096.a, p4096_4096.b, p4096_4096.c, p4096_4096.d, p4096_4096.updated_at
         Index Cond: ((p4096_4096.a = 'abc'::text) AND (p4096_4096.updated_at >= now()) AND (p4096_4096.updated_at <= (now() + '1 day'::interval)))
 Planning Time: 237.603 ms
 Execution Time: 0.475 ms

-- select_now_limit.sql
explain(analyze, verbose, buffers on) select * from :tbl where a='abc' and updated_at between now() and now()+interval '1d' order by a, updated_at desc limit 1;

 Limit  (cost=645.53..647.56 rows=1 width=112) (actual time=0.002..0.002 rows=0 loops=1)
   Output: p4096_1.a, p4096_1.b, p4096_1.c, p4096_1.d, p4096_1.updated_at
   ->  Append  (cost=645.53..8949.61 rows=4096 width=112) (actual time=0.000..0.000 rows=0 loops=1)
         Subplans Removed: 4095
         ->  Index Scan using p4096_1_a_updated_at_idx on public.p4096_1  (cost=0.57..2.03 rows=1 width=54) (never executed)
               Output: p4096_1.a, p4096_1.b, p4096_1.c, p4096_1.d, p4096_1.updated_at
               Index Cond: ((p4096_1.a = 'abc'::text) AND (p4096_1.updated_at >= now()) AND
(p4096_1.updated_at <= (now() + '1 day'::interval)))
 Planning Time: 3897.687 ms
 Execution Time: 0.491 ms

Regarding the nested loops - thanks for your explanation. I can see this is more complicated than I initially thought. It may be doable to determine if your set of pruned partitions is still valid, but it's more difficult to determine if, on top of that, extra partitions must be included due to widening of the range.

-Floris

________
From: David Rowley
Sent: Monday, April 15, 2019 1:25 AM
To: Floris Van Nee
Cc: Pg Hackers
Subject: Re: partitioning performance tests after recent patches

[External]

On Mon, 15 Apr 2019 at 07:19, Floris Van Nee wrote:
> 3) What could be causing the big performance difference between case 7
> (simple SELECT) and 8 (simple SELECT with ORDER BY LIMIT 1)? For 4096
> partitions, TPS of 7) is around 5, while adding the ORDER BY LIMIT 1
> makes TPS drop well below 1. In theory, run-time pruning of the right chunk
> should take exactly the same amount of time in both cases, because both are
> pruning timestamp now() on the same number of partitions. The resulting plans
> are also identical with the exception of the top LIMIT-node (in PG11 they
> differ slightly as a MergeAppend is chosen for the ORDER BY instead of an
> Append, in HEAD with ordered append this is not necessary anymore). Am I
> missing something here?

With the information provided, I don't really see any reason why the ORDER BY LIMIT would slow it down if the plan is the same apart from the LIMIT node. Please share the EXPLAIN ANALYZE output of each.

> 4) A more general question about run-time pruning in nested loops, like the
> one for case 14. I believe I read in one of the previous threads that
> run-time pruning only reoccurs if it determines that the value that
> determines which partitions must be excluded has changed in between
> iterations. How is this defined? Eg.
> let's say partitions are 1-day wide and
> the first iteration of the loop filters on the partitioned table for
> timestamp between 14-04-2019 12:00 and 14-04-2019 20:00 (dynamically
> determined). Then the second iteration comes along and now filters on values
> between 14-04-2019 12:00 and 14-04-2019 19:00. The partition that should be
> scanned hasn't changed, because both timestamps fall into the same partition.
> Is the full process of run-time pruning applied again, or is there some kind
> of shortcut that first checks if the previous pruning result is still valid
> even if the value has changed slightly? If not, would this be a possible
> optimization, as I think it's a case that occurs very often?

I don't k
Re: speeding up planning with partitions
Thanks for the details! Indeed the versions with now()/current_date use the runtime pruning rather than planning time. I wasn't aware of the use of 'today' though - that could be useful in case we're sure statements won't be prepared. Previously (v10/partly v11) it was necessary to make sure that statements on partitioned tables were never prepared, because run-time pruning wasn't available - using a generic plan was almost always a bad option. Now in v12 it seems to be a tradeoff between whether or not run-time pruning can occur. If pruning is possible at planning time it's probably still better not to prepare statements, whereas if run-time pruning has to occur, it's better to prepare them.

One unrelated thing I noticed (but I'm not sure if it's worth a separate email thread) is that the changed default of jit=on in v12 doesn't work very well with a large number of partitions, of which a large number get excluded at run-time. A query that has an estimated cost above jit_optimize_above_cost takes about 30 seconds to run (for a table with 1000 partitions), because JIT is optimizing the full plan. Without JIT it's barely 20ms (+400ms planning). I can give more details in a separate thread if it's deemed interesting.

 Planning Time: 411.321 ms
 JIT:
   Functions: 5005
   Options: Inlining false, Optimization true, Expressions true, Deforming true
   Timing: Generation 721.472 ms, Inlining 0.000 ms, Optimization 16312.195 ms, Emission 12533.611 ms, Total 29567.278 ms

-Floris
Re: speeding up planning with partitions
Hi all,

First of all I would like to thank everyone involved in this patch for their hard work on this. This is a big step forward. I've done some performance and functionality testing with the patch that was committed to master and it looks very good.

I had a question about the performance of pruning of functions like now() and current_date. I know these are handled differently, as they cannot be excluded during the first phases of planning. However, currently, this new patch makes the performance difference between the static timestamp variant and now() very obvious (even more than before). Consider

select * from partitioned_table where ts >= now()

or

select * from partitioned_table where ts >= '2019-04-04'

The second plans in less than a millisecond, whereas the first takes +- 180ms for a table with 1000 partitions. Both end up with the same plan.

I'm not too familiar with the code that handles this, but is there a possibility for improvement in this area? Or is the stage at which exclusion for now()/current_date occurs already too far into the process to make any good improvements to this? My apologies if this is considered off-topic for this patch, but I ran into this issue specifically when I was testing this patch, so I thought I'd ask here about it. I do think a large number of use-cases for tables with a large number of partitions involve a timestamp for partition key, and naturally people will start writing queries for this that use functions such as now() and current_date.

Thanks again for your work on this patch!

-Floris

From: Amit Langote
Sent: Tuesday, April 2, 2019 7:50 AM
To: Tom Lane
Cc: David Rowley; Imai Yoshikazu; jesper.peder...@redhat.com; Imai, Yoshikazu; Amit Langote; Alvaro Herrera; Robert Haas; Justin Pryzby; Pg Hackers
Subject: Re: speeding up planning with partitions

[External]

Thanks for taking a look.
On 2019/04/02 2:34, Tom Lane wrote:
> Amit Langote writes:
>> On 2019/03/30 0:29, Tom Lane wrote:
>>> That seems like probably an independent patch --- do you want to write it?
>
>> Here is that patch.
>> It revises get_relation_constraints() such that the partition constraint
>> is loaded in only the intended cases.
>
> So I see the problem you're trying to solve here, but I don't like this
> patch a bit, because it depends on root->inhTargetKind which IMO is a
> broken bit of junk that we need to get rid of. Here is an example of
> why, with this patch applied:
>
> regression=# create table p (a int) partition by list (a);
> CREATE TABLE
> regression=# create table p1 partition of p for values in (1);
> CREATE TABLE
> regression=# set constraint_exclusion to on;
> SET
> regression=# explain select * from p1 where a = 2;
>                 QUERY PLAN
> ------------------------------------------
>  Result  (cost=0.00..0.00 rows=0 width=0)
>    One-Time Filter: false
> (2 rows)
>
> So far so good, but watch what happens when we include the same case
> in an UPDATE on some other partitioned table:
>
> regression=# create table prtab (a int, b int) partition by list (a);
> CREATE TABLE
> regression=# create table prtab2 partition of prtab for values in (2);
> CREATE TABLE
> regression=# explain update prtab set b=b+1 from p1 where prtab.a=p1.a and p1.a=2;
>                              QUERY PLAN
> ---------------------------------------------------------------------
>  Update on prtab  (cost=0.00..82.30 rows=143 width=20)
>    Update on prtab2
>    ->  Nested Loop  (cost=0.00..82.30 rows=143 width=20)
>          ->  Seq Scan on p1  (cost=0.00..41.88 rows=13 width=10)
>                Filter: (a = 2)
>          ->  Materialize  (cost=0.00..38.30 rows=11 width=14)
>                ->  Seq Scan on prtab2  (cost=0.00..38.25 rows=11 width=14)
>                      Filter: (a = 2)
> (8 rows)
>
> No constraint exclusion, while in v10 you get
>
>  Update on prtab  (cost=0.00..0.00 rows=0 width=0)
>    ->  Result  (cost=0.00..0.00 rows=0 width=0)
>          One-Time Filter: false
>
> The reason is that this logic supposes that root->inhTargetKind describes
> *all* partitioned tables in the query, which is obviously wrong.
>
> Now maybe we could make it work by doing something like
>
> if (rel->reloptkind == RELOPT_BASEREL &&
>     (root->inhTargetKind == INHKIND_NONE ||
>      rel->relid != root->parse->resultRelation))

Ah, you're right. inhTargetKind has to be checked in conjunction with checking whether the relation is the target relation.

> but I find that pretty messy, plus it's violating the concept that we
> shouldn't be allowing messiness from inheritance_planner to leak into
> other places.

I'm afraid that we'll have to live with this particular hack as long as we have inheritance_planner(), but we maybe could somewhat reduce the extent to which the hack is spread into other planner files. How about we move the part of get_relation_constraints() that loads the partition constrai