Re: Add PortalDrop in exec_execute_message

2021-06-10 Thread Yura Sokolov

Alvaro Herrera wrote 2021-06-08 00:07:

On 2021-May-27, Yura Sokolov wrote:


Alvaro Herrera wrote 2021-05-26 23:59:



> I don't think they should do that.  The portal remains open, and the
> libpq interface does that.  The portal gets closed at end of transaction
> without the need for any client message.  I think if the client wanted
> to close the portal ahead of time, it would need a new libpq entry point
> to send the message to do that.

- PQsendQuery issues a Query message, and exec_simple_query closes its
  portal.
- People don't expect PQsendQueryParams to be different from
  PQsendQuery aside from parameter sending. The fact that the portal
  remains open is a significant, unexpected and undesired difference.
- PQsendQueryGuts is used in PQsendQueryParams and PQsendQueryPrepared.
  It always sends an empty portal name and always a "send me all rows"
  limit (zero). Both usages are certainly meant to just perform a query,
  and certainly no one expects the portal to remain open.


Thinking about it some more, Yura's argument about PQsendQuery does make
sense -- since what we're doing is replacing the use of a 'Q' message
just because we can't use it when in pipeline mode, then it is
reasonable to think that the replacement ought to have the same
behavior.  Upon receipt of a 'Q' message, the portal is closed
automatically, and ISTM that that behavior should be preserved.

That change would not solve the problem he complains about, because IIUC
his framework is using PQsendQueryPrepared, which I'm not proposing to
change.  It just removes the other discrepancy that was discussed in the
thread.

The attached patch does it.  Any opinions?


I propose to change PQsendQueryParams and PQsendQueryPrepared
(through a change of PQsendQueryGuts), since they both have the semantics
"execute the unnamed portal till the end and send me all rows".

The extended protocol was introduced by Tom Lane on 2003-05-05
in 16503e6fa4a13051debe09698b6db9ce0d509af8.
That commit already had the Close ('C') message.
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=16503e6fa4a13051debe09698b6db9ce0d509af8

The libpq adoption of the extended protocol was made by Tom a month later,
on 2003-06-23, in efc3a25bb02ada63158fe7006673518b005261ba,
and there is already no Close message in PQsendQueryParams.
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=efc3a25bb02ada63158fe7006673518b005261ba

I didn't find any relevant discussion in pgsql-hackers in May
or June 2003.

This makes me think the Close message was intended to be used
but was simply forgotten when the libpq patch was made.

Tom, could that be right?

regards,
Yura.




Re: Clear empty space in a page.

2021-06-01 Thread Yura Sokolov

Hi,

Andres Freund wrote 2021-05-31 00:07:

Hi,

On 2021-05-30 03:10:26 +0300, Yura Sokolov wrote:
While this result does not directly apply to stock PostgreSQL, I believe
page compression is important for full_page_writes with wal_compression
enabled. And probably when PostgreSQL is used on a filesystem with
compression enabled (ZFS?).


I don't think the former is relevant, because the hole is skipped in wal page
compression (at some cost).


Ah, I forgot about that. Yep, you are right.


Therefore I propose clearing a page's empty space with zeroes in
PageRepairFragmentation, PageIndexMultiDelete, PageIndexTupleDelete and
PageIndexTupleDeleteNoCompact.

Sorry, I didn't measure the impact on raw performance yet.


I'm worried that this might cause O(n^2) behaviour in some cases, by
repeatedly memset'ing the same mostly already zeroed space to 0. Why do we
ever need to do memset_hole() instead of accurately just zeroing out the space
that was just vacated?


It is done exactly this way: memset_hole accepts "old_pd_upper" and clears
the space between the old and the new one.

regards,
Yura




Clear empty space in a page.

2021-05-29 Thread Yura Sokolov

Good day.

A long time ago I played with a proprietary "compressed storage"
patch on a heavily updated table, and found that empty pages (i.e. cleaned by
vacuum) were not compressed well enough.

When a table is stress-updated, pages for new row versions are allocated
in a round-robin manner, therefore some 1GB segments contain almost
no live tuples. Vacuum removes the dead tuples, but the segments remain large
after compression (>400MB) as if they were still full.

After some investigation I found it is because PageRepairFragmentation and
PageIndex*Delete* don't clear space that has just become empty, therefore it
still contains garbage data. Clearing it with memset greatly increases the
compression ratio: some compressed relation segments shrink to 30-60MB just
after vacuum removes the tuples in them.

While this result does not directly apply to stock PostgreSQL, I believe
page compression is important for full_page_writes with wal_compression
enabled. And probably when PostgreSQL is used on a filesystem with
compression enabled (ZFS?).

Therefore I propose clearing a page's empty space with zeroes in
PageRepairFragmentation, PageIndexMultiDelete, PageIndexTupleDelete and
PageIndexTupleDeleteNoCompact.

Sorry, I didn't measure the impact on raw performance yet.

regards,
Yura Sokolov aka funny_falcon

commit 6abfcaeb87fcb396c5e2dccd434ce2511314ff76
Author: Yura Sokolov 
Date:   Sun May 30 02:39:17 2021 +0300

Clear empty space in a page

Write zeroes to the just-cleared space in PageRepairFragmentation,
PageIndexTupleDelete, PageIndexMultiDelete and PageIndexTupleDeleteNoCompact.

It helps increase the compression ratio on compression-enabled filesystems
and with full_page_writes and wal_compression enabled.

diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 82ca91f5977..7deb6cc71a4 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -681,6 +681,17 @@ compactify_tuples(itemIdCompact itemidbase, int nitems, Page page, bool presorte
 	phdr->pd_upper = upper;
 }
 
+/*
+ * Clean up space between pd_lower and pd_upper for better page compression.
+ */
+static void
+memset_hole(Page page, LocationIndex old_pd_upper)
+{
+	PageHeader	phdr = (PageHeader) page;
+	if (phdr->pd_upper > old_pd_upper)
+		MemSet((char *)page + old_pd_upper, 0, phdr->pd_upper - old_pd_upper);
+}
+
 /*
  * PageRepairFragmentation
  *
@@ -797,6 +808,7 @@ PageRepairFragmentation(Page page)
 
 		compactify_tuples(itemidbase, nstorage, page, presorted);
 	}
+	memset_hole(page, pd_upper);
 
 	/* Set hint bit for PageAddItemExtended */
 	if (nunused > 0)
@@ -1114,6 +1126,7 @@ PageIndexTupleDelete(Page page, OffsetNumber offnum)
 
 	if (offset > phdr->pd_upper)
 		memmove(addr + size, addr, offset - phdr->pd_upper);
+	MemSet(addr, 0, size);
 
 	/* adjust free space boundary pointers */
 	phdr->pd_upper += size;
@@ -1271,6 +1284,7 @@ PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
 		compactify_tuples(itemidbase, nused, page, presorted);
 	else
 		phdr->pd_upper = pd_special;
+	memset_hole(page, pd_upper);
 }
 
 
@@ -1351,6 +1365,7 @@ PageIndexTupleDeleteNoCompact(Page page, OffsetNumber offnum)
 
 	if (offset > phdr->pd_upper)
 		memmove(addr + size, addr, offset - phdr->pd_upper);
+	MemSet(addr, 0, size);
 
 	/* adjust free space boundary pointer */
 	phdr->pd_upper += size;


Re: Add PortalDrop in exec_execute_message

2021-05-27 Thread Yura Sokolov

Tom Lane wrote 2021-05-27 00:19:

Alvaro Herrera  writes:
(I didn't add a Close Portal message to PQsendQueryInternal in pipeline
mode precisely because there is no such message in PQsendQueryGuts.
I think it would be wrong to unconditionally add a Close Portal message
to any of those places.)


Yeah, I'm not very comfortable with having libpq take it on itself
to do that, either.


But...

Tom Lane wrote 2021-05-21 21:23:

I'm inclined to think that your complaint would be better handled
by having the client send a portal-close command, if it's not
going to do something else immediately.


And given that PQsendQueryParams should not be different from
PQsendQuery (aside from parameter sending), why shouldn't it close
its portal immediately, as happens in exec_simple_query?

I really doubt a user of PQsendQueryPrepared is aware of the portal
either, since it is also unnamed and also exhausted (because
PQsendQueryGuts always sends the "send me all rows" limit).

And why should PQsendQueryInternal behave differently in pipelined
and non-pipelined mode? It closes the portal in non-pipelined mode,
but will not close the portal of the last query in pipelined mode (inside
a transaction).


Looking back at the original complaint, it seems like it'd be fair to
wonder why we're still holding a page pin in a supposedly completed
executor run.  Maybe the right fix is somewhere in the executor
scan logic.


Perhaps because the query is simple and the portal is created as seekable?



regards, tom lane


regards
Yura Sokolov




Re: Add PortalDrop in exec_execute_message

2021-05-27 Thread Yura Sokolov

Alvaro Herrera wrote 2021-05-26 23:59:

On 2021-May-25, Yura Sokolov wrote:


Tom Lane wrote 2021-05-21 21:23:
> Yura Sokolov  writes:
> > I propose to add PortalDrop at the 'if (completed)' branch of
> > exec_execute_message.
>
> This violates our wire protocol specification, which
> specifically says
>
> If successfully created, a named portal object lasts till the end of
> the current transaction, unless explicitly destroyed. An unnamed
> portal is destroyed at the end of the transaction, or as soon as the
> next Bind statement specifying the unnamed portal as destination is
> issued. (Note that a simple Query message also destroys the unnamed
> portal.)
>
> I'm inclined to think that your complaint would be better handled
> by having the client send a portal-close command, if it's not
> going to do something else immediately.

I thought about it as well. Then, if I understand correctly,
PQsendQueryGuts and PQsendQueryInternal in pipeline mode should send a
"close portal" (CP) message after the "execute" message, right?


I don't think they should do that.  The portal remains open, and the
libpq interface does that.  The portal gets closed at end of transaction
without the need for any client message.  I think if the client wanted
to close the portal ahead of time, it would need a new libpq entry point
to send the message to do that.


- PQsendQuery issues a Query message, and exec_simple_query closes its
  portal.
- People don't expect PQsendQueryParams to be different from
  PQsendQuery aside from parameter sending. The fact that the portal
  remains open is a significant, unexpected and undesired difference.
- PQsendQueryGuts is used in PQsendQueryParams and PQsendQueryPrepared.
  It always sends an empty portal name and always a "send me all rows"
  limit (zero). Both usages are certainly meant to just perform a query,
  and certainly no one expects the portal to remain open.


(I didn't add a Close Portal message to PQsendQueryInternal in pipeline
mode precisely because there is no such message in PQsendQueryGuts.


But PQsendQueryInternal should replicate the behavior of PQsendQuery, not
that of PQsendQueryParams. Even though it has to use the new protocol, it
should be indistinguishable to the user, therefore the portal should be closed.


I think it would be wrong to unconditionally add a Close Portal message
to any of those places.)


Why? If you foresee problems, please share your thoughts.

regards,
Sokolov Yura aka funny_falcon




Re: Add PortalDrop in exec_execute_message

2021-05-24 Thread Yura Sokolov

Tom Lane wrote 2021-05-21 21:23:

Yura Sokolov  writes:

I propose to add PortalDrop at the 'if (completed)' branch of
exec_execute_message.


This violates our wire protocol specification, which
specifically says

If successfully created, a named portal object lasts till the end of
the current transaction, unless explicitly destroyed. An unnamed
portal is destroyed at the end of the transaction, or as soon as the
next Bind statement specifying the unnamed portal as destination is
issued. (Note that a simple Query message also destroys the unnamed
portal.)

I'm inclined to think that your complaint would be better handled
by having the client send a portal-close command, if it's not
going to do something else immediately.


I thought about it as well. Then, if I understand correctly,
PQsendQueryGuts and PQsendQueryInternal in pipeline mode should send a
"close portal" (CP) message after the "execute" message, right?

regards,
Sokolov Yura




Add PortalDrop in exec_execute_message

2021-05-19 Thread Yura Sokolov

Hi, hackers.

I've been playing with the "autoprepared" patch, and got the isolation
test "freeze-the-dead" stuck on its first VACUUM FREEZE statement.
After some research I found the issue is reproducible with the unmodified
master branch if the extended protocol is used. I've prepared a ruby script for
demonstration (because ruby-pg has a simple interface to PQsendQueryParams).

Further investigation showed it happens because the portal is not dropped
inside exec_execute_message, and it is kept in the third session till
COMMIT is called. Therefore a heap page remains pinned, and VACUUM FREEZE
becomes stuck inside LockBufferForCleanup.

It seems that this is usually invisible to common users since either:
- the command is called standalone and then the transaction is closed
  immediately, or
- the next PQsendQueryParams will initiate another unnamed portal using
  CreatePortal("", true, true) and this action will drop the previous
  one.

But "freeze-the-dead" remains locked since third session could not
send COMMIT until VACUUM FULL finished.

I propose to add PortalDrop at the 'if (completed)' branch of
exec_execute_message.

--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2209,6 +2209,8 @@ exec_execute_message(const char *portal_name, long max_rows)


if (completed)
{
+   PortalDrop(portal, false);
+
if (is_xact_command)
{

With this change 'make check-world' runs without failures (at least
with a bare configure plus enable-cassert and enable-tap-tests).

There is a small chance that applications exist which abuse seekable
portals with the 'execute' protocol message, so not every completed
portal can be safely dropped. But I believe there are some sane
conditions that cover the common case. For example, isn't an empty-name
check enough? Can a client reset or seek a portal with an empty
name?
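
For illustration, a guarded variant of the same hunk could look roughly like
this (an untested sketch, assuming that eagerly dropping only the unnamed
portal is acceptable):

 	if (completed)
 	{
+		/* sketch: drop only the unnamed portal eagerly; named portals keep
+		 * the lifetime required by the protocol specification */
+		if (portal_name[0] == '\0')
+			PortalDrop(portal, false);
+
 		if (is_xact_command)
 		{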

regards,
Sokolov Yura aka funny_falcon

require 'pg'

c1 = PG.connect(host: "localhost", dbname: "postgres")
c2 = PG.connect(host: "localhost", dbname: "postgres")
c3 = PG.connect(host: "localhost", dbname: "postgres")

class PG::Connection
  def simple(sql)
puts sql
exec(sql)
  end
  def extended(sql)
puts "#{sql}"
exec_params(sql, [])
  end
end

c1.simple "DROP TABLE IF EXISTS tab_freeze" 
c1.simple "
CREATE TABLE tab_freeze (
  id int PRIMARY KEY,
  name char(3),
  x int);
INSERT INTO tab_freeze VALUES (1, '111', 0);
INSERT INTO tab_freeze VALUES (3, '333', 0);
"

c1.simple "BEGIN"
c2.simple "BEGIN"
c3.simple "BEGIN"
c1.simple "UPDATE tab_freeze SET x = x + 1 WHERE id = 3"
c2.extended "SELECT id FROM tab_freeze WHERE id = 3 FOR KEY SHARE"
c3.extended "SELECT id FROM tab_freeze WHERE id = 3 FOR KEY SHARE"
c1.simple "COMMIT"
c2.simple "COMMIT"
c2.simple "VACUUM FREEZE tab_freeze"
c1.simple "
BEGIN;
SET LOCAL enable_seqscan = false;
SET LOCAL enable_bitmapscan = false;
SELECT * FROM tab_freeze WHERE id = 3;
COMMIT;
"
c3.simple "COMMIT"
c2.simple "VACUUM FREEZE tab_freeze"
c1.simple "SELECT * FROM tab_freeze ORDER BY name, id"


Re: plan with result cache is very slow when work_mem is not enough

2021-05-09 Thread Yura Sokolov

David Rowley wrote 2021-05-09 04:01:
On Sun, 9 May 2021 at 03:29, Pavel Stehule  wrote:
Personally, I have no problem with too-slow assertions, although it is not
too practical. The main problem is some shock, and a feeling that something
is wrong. I spent 1 hour detecting whether it is a bug or not.


Thanks for spending the time figuring out where the slowness came from.


Can it be possible to identify this situation?

Maybe use some specific name for this routine - like

assert_only_check_

Then I can see this warning in perf, and I don't need to do other or
deeper checks


I don't think we need to go around leaving clues for people who run
perf on cassert builds. I think anyone doing that should just never
expect any meaningful results.


Occasionally there is a need to run cassert builds in production to
catch an issue. It is usually ok if a cassert build is O(1) slower than the
optimized build (i.e. slower by some constant factor C). But
if a cassert build is quadratically slower, it will be unusable.

regards,
Yura




Re: plan with result cache is very slow when work_mem is not enough

2021-05-08 Thread Yura Sokolov



8 May 2021 00:16:54 GMT+03:00, Tomas Vondra  wrote:
>On 5/7/21 11:04 PM, David Rowley wrote:
>> On Sat, 8 May 2021 at 08:18, Pavel Stehule 
>wrote:
>>>
>>> pá 7. 5. 2021 v 21:56 odesílatel David Rowley 
>napsal:
>>>> With USE_ASSERT_CHECKING builds, I did add some code that verifies
>the
>>>> memory tracking is set correctly when evicting from the cache. This
>>>> code is pretty expensive as it loops over the entire cache to check
>>>> the memory accounting every time we evict something from the cache.
>>>> Originally, I had this code only run when some other constant was
>>>> defined, but I ended up changing it to compile that code in for all
>>>> assert enabled builds.
>>>>
>>>> I considered that it might be too expensive as you can see from the
>>>> comment in [1].  I just wanted to get a few machines other than my
>own
>>>> to verify that the memory accounting code was working as expected.
>>>> There have been no complaints of any Assert failures yet, so maybe
>>>> it's safe to consider either removing the code entirely or just
>having
>>>> it run when some other more specific to purpose constant is
>defined.
>>>> If we did the latter, then I'd have concerns that nothing would
>ever
>>>> run the code to check the memory accounting, that's why I ended up
>>>> changing it to run with USE_ASSERT_CHECKING builds.
>>>
>>>
>>> I understand. I think this is too slow for generic assertions,
>because the overhead is about 50x.
>> 
>> I didn't expect it would show up quite that much.  If you scaled the
>> test up a bit more and increased work_mem further, then it would be
>> even more than 50x.
>> 
>> At one point when I was developing the patch, I had two high water
>> marks for cache memory. When we reached the upper of the two marks,
>> I'd reduce the memory down to the lower of two marks.  The lower of
>> the two marks was set to 98% of the higher mark.  In the end, I got
>> rid of that as I didn't really see what extra overhead there was from
>> just running the eviction code every time we require another byte.
>> However, if we did have that again, then the memory checking could
>> just be done when we run the eviction code. We'd then need to consume
>> that 2% more memory before it would run again.
>> 
>> My current thinking is that I don't really want to add that
>complexity
>> just for some Assert code. I'd only want to do it if there was
>another
>> valid reason to.
>> 
>
>Agreed. I think this approach to eviction (i.e. evicting more than you 
>need) would be useful if the actual eviction code was expensive, and 
>doing it in a "batch" would make it significantly cheaper. But I don't 
>think "asserts are expensive" is a good reason for it.
>
>> Another thought I have is that maybe it would be ok just to move
>> memory accounting debug code so it only runs once in
>> ExecEndResultCache.  I struggling to imagine that if the memory
>> tracking did go out of whack, that the problem would have
>accidentally
>> fixed itself by the time we got to ExecEndResultCache().  I guess
>even
>> if the accounting was counting far too much memory and we'd evicted
>> everything from the cache to try and get the memory usage down, we'd
>> still find the problem during ExecEndResultCache(), even if the cache
>> had become completely empty as a result.
>> 
>
>I don't think postponing the debug code until much later is a great 
>idea. When something goes wrong it's good to know ASAP, otherwise it's 
>much more difficult to identify the issue.
>
>Not sure we need to do something here - for regression tests this is
>not 
>an issue, because those generally work with small data sets. And if you
>
>run with asserts on large amounts of data, I think this is acceptable.
>
>I had the same dilemma with the new BRIN index opclasses, which also 
>have some extensive and expensive assert checks - for the regression 
>tests that's fine, and it proved very useful during development.
>
>I have considered enabling those extra checks only on request somehow, 
>but I'd bet no one would do that and I'd forget it exists pretty soon.
>
>
>regards

Perhaps there is a need for a "heavy asserts" flag, or an option
"--enable-cassert=heavy". Then USE_ASSERT_CHECKING could be defined as integer
1 or 2 depending on the "heaviness" of the enabled checks.
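
A rough sketch of the idea (USE_HEAVY_ASSERT_CHECKING and HeavyAssert are
invented names; in PostgreSQL itself this would build on Assert() rather than
the plain C assert()):

#include <assert.h>

#ifdef USE_HEAVY_ASSERT_CHECKING
#define HeavyAssert(condition) assert(condition)
#else
#define HeavyAssert(condition) ((void) 0)
#endif

/* the expensive cache-memory verification could then be written as, e.g.,
 * HeavyAssert(cache_memory_accounting_is_consistent(rcstate));
 * where the predicate name is likewise hypothetical. */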

regards
Yura Sokolov




Re: plan with result cache is very slow when work_mem is not enough

2021-05-07 Thread Yura Sokolov

Pavel Stehule wrote 2021-05-07 21:45:


Samples: 119K of event 'cycles', 4000 Hz, Event count (approx.):
Overhead  Shared Object Symbol
79.20%  postgres  [.] cache_reduce_memory
1.94%  [kernel]  [k] native_write_msr_safe
1.63%  [kernel]  [k] update_cfs_shares
1.00%  [kernel]  [k] trigger_load_balance
0.97%  [kernel]  [k] timerqueue_add
0.51%  [kernel]  [k] task_tick_fair
0.51%  [kernel]  [k] task_cputime
0.50%  [kernel]  [k] perf_event_task_tick
0.50%  [kernel]  [k] update_curr
0.49%  [kernel]  [k] hrtimer_active

Regards

Pavel


It is strange to see cache_reduce_memory itself consume a lot of CPU.
It doesn't contain CPU-hungry code.
It calls prepare_probe_slot, which does some tuple forming. Then
it calls resultcache_lookup, which may call ResultCacheHash_hash
and ResultCacheHash_equal. And finally it calls remove_cache_entry.
I suppose remove_cache_entry should consume most of the CPU time since
it does deallocations.
And if you compile with --enable-cassert, then remove_cache_entry
iterates through the whole cache hashtable, therefore it reaches
quadratic complexity easily (or more correctly M*N, where M is the size
of the table and N the eviction count).

regards,
Yura Sokolov




Re: Use simplehash.h instead of dynahash in SMgr

2021-05-07 Thread Yura Sokolov
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:not tested

I believe it is ready for committer.

The new status of this patch is: Ready for Committer


Re: Use simplehash.h instead of dynahash in SMgr

2021-04-28 Thread Yura Sokolov

David Rowley wrote 2021-04-29 02:51:
On Thu, 29 Apr 2021 at 00:28, Yura Sokolov  wrote:

The best result is with just:

return (uint32)rnode->node.relNode;

ie, relNode could be taken without mixing at all.
relNode is unique inside a single database, and almost unique among the whole
cluster, since it is an Oid.


I admit to having tried that too just to almost eliminate the cost of
hashing.  I just didn't consider it something we'd actually do.

The system catalogues are quite likely to all have the same
relfilenode in all databases, so for workloads that have a large
number of databases, looking up WAL records that touch the catalogues
is going to be pretty terrible.


The only workload that could touch system catalogues in different
databases is recovery (and autovacuum?). Client backends can't
be connected to more than one database.

But nevertheless, you're right. I oversimplified it intentionally.
Originally I wrote:

   hashcode = (uint32)rnode->node.dbNode * 0xc2b2ae35;
   hashcode ^= (uint32)rnode->node.relNode;
   return hashcode;

But before sending the mail I cut the dbNode part.
Still, the main point: relNode can be put unmixed into the final value.
That way fewer collisions happen.



The simplified hash function I wrote with just the relNode and dbNode
would be the least I'd be willing to entertain.  However, I just
wouldn't be surprised if there was a good reason for that being bad
too.


So, I've repeated the benchmark with different numbers of partitions (I tried
to catch a higher fillfactor) and a smaller amount of inserted data (since I
don't want to stress my SSD).

partitions/ | dynahash | dynahash +  | simplehash | simplehash + |
fillfactor  |          | simple func |            | simple func  |
------------+----------+-------------+------------+--------------+
  3500/0.43 |   3.73s  |    3.54s    |   3.58s    |    3.34s     |
  3200/0.78 |   3.64s  |    3.46s    |   3.47s    |    3.25s     |
  1500/0.74 |   3.18s  |    2.97s    |   3.03s    |    2.79s     |

Fillfactor here is the effective fillfactor in simplehash with that number of
partitions.
I wasn't able to measure with a fillfactor close to 0.9 since it looks like
simplehash tends to grow much earlier due to SH_GROW_MAX_MOVE.


Thanks for testing that.

I also ran some tests last night to test the 0.75 vs 0.9 fillfactor to
see if it made a difference.  The test was similar to last time, but I
trimmed the number of rows inserted from 100 million down to 25
million.  Last time I tested with 1000 partitions, this time with each
of: 100 200 300 400 500 600 700 800 900 1000 partitions.  There didn't
seem to be any point of testing lower than that as the minimum hash
table size is 512.

The averages over 10 runs were:

nparts   ff75    ff90
100      21.898  22.226
200      23.105  25.493
300      25.274  24.251
400      25.139  25.611
500      25.738  25.454
600      26.656  26.82
700      27.577  27.102
800      27.608  27.546
900      27.284  28.186
1000     29      28.153

Or to summarise a bit, we could just look at the sum of all the
results per fillfactor:

sum ff75 2592.79
sum ff90 2608.42 100.6%

fillfactor 75 did come out slightly faster, but only just. It seems
close enough that it might be better just to keep the 90 to save a
little memory and improve caching elsewhere.  Also, from below, you
can see that for 300, 500, 600, 700, 1000 tables tests, the hash
tables ended up the same size, yet there's a bit of variability in the
timing result.  So the 0.6% gain might just be noise.

I don't think it's worth making the fillfactor 75.


To be clear: I didn't change SH_FILLFACTOR. It was equal to 0.9.
I just was not able to catch a table with a fill factor of more than 0.78.
Looks like you've got it with 900 partitions :-)



Another line of thought for making it go faster would be to do
something like get rid of the hash status field from SMgrEntry.  That
could be either coded into a single bit we'd borrow from the hash
value, or it could be coded into the least significant bit of the data
field.  A pointer to palloc'd memory should always be MAXALIGNed,
which means at least the lower two bits are always zero.  We'd just
need to make sure and do something like "data & ~((uintptr_t) 3)" to
trim off the hash status bits before dereferencing the pointer.  That
would make the SMgrEntry 12 bytes on a 64-bit machine.  However, it
would also mean that some entries would span 2 cache lines, which
might affect performance a bit.


Then the data pointer itself will be unaligned to 8 bytes. While x86 is
mostly indifferent to this, I doubt this memory economy will pay
off.
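
For reference, a self-contained sketch of the bit-packing idea being discussed
(Entry and Payload are invented names for illustration; this is not the actual
patch):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct Payload { int value; } Payload;

typedef struct Entry
{
	uint32_t	hash;
	uintptr_t	data_and_status;	/* pointer bits | 2-bit status */
} Entry;

static void
entry_set(Entry *e, Payload *p, unsigned status)
{
	/* malloc'd (MAXALIGNed) pointers have their two low bits clear */
	e->data_and_status = (uintptr_t) p | (status & 3);
}

static Payload *
entry_data(const Entry *e)
{
	return (Payload *) (e->data_and_status & ~((uintptr_t) 3));
}

static unsigned
entry_status(const Entry *e)
{
	return (unsigned) (e->data_and_status & 3);
}

int
main(void)
{
	Payload    *p = malloc(sizeof(Payload));
	Entry		e;

	p->value = 42;
	entry_set(&e, p, 2);
	printf("status=%u value=%d\n", entry_status(&e), entry_data(&e)->value);
	free(p);
	return 0;
}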

regards,
Yura Sokolov.




Re: Use simplehash.h instead of dynahash in SMgr

2021-04-28 Thread Yura Sokolov

David Rowley wrote 2021-04-26 09:43:

I tried that and it got a median result of 113.795 seconds over 14
runs with this recovery benchmark test.

LOG:  size: 4096, members: 2032, filled: 0.496094, total chain: 1014,
max chain: 6, avg chain: 0.499016, total_collisions: 428,
max_collisions: 3, avg_collisions: 0.210630

I also tried the following hash function just to see how much
performance might be left from speeding it up:

static inline uint32
relfilenodebackend_hash(RelFileNodeBackend *rnode)
{
uint32 h;

h = pg_rotate_right32((uint32) rnode->node.relNode, 16) ^ ((uint32)
rnode->node.dbNode);
return murmurhash32(h);
}

I got a median of 112.685 seconds over 14 runs with:

LOG:  size: 4096, members: 2032, filled: 0.496094, total chain: 1044,
max chain: 7, avg chain: 0.513780, total_collisions: 438,
max_collisions: 3, avg_collisions: 0.215551


The best result is with just:

   return (uint32)rnode->node.relNode;

ie, relNode could be taken without mixing at all.
relNode is unique inside a single database, and almost unique among the whole
cluster, since it is an Oid.


I'd like to see the benchmark code. It is quite interesting that this place
became measurable at all.


Sure.

$ cat recoverybench_insert_hash.sh


David.


So, I've repeated the benchmark with different numbers of partitions (I tried
to catch a higher fillfactor) and a smaller amount of inserted data (since I
don't want to stress my SSD).

partitions/ | dynahash | dynahash +  | simplehash | simplehash + |
fillfactor  |          | simple func |            | simple func  |
------------+----------+-------------+------------+--------------+
  3500/0.43 |   3.73s  |    3.54s    |   3.58s    |    3.34s     |
  3200/0.78 |   3.64s  |    3.46s    |   3.47s    |    3.25s     |
  1500/0.74 |   3.18s  |    2.97s    |   3.03s    |    2.79s     |

Fillfactor here is the effective fillfactor in simplehash with that number of
partitions.

I wasn't able to measure with a fillfactor close to 0.9 since it looks like
simplehash tends to grow much earlier due to SH_GROW_MAX_MOVE.

The simple function is a hash function that returns only rnode->node.relNode.
I've tested it with both simplehash and dynahash.
For dynahash a custom match function was also made.

Conclusion:
- the trivial hash function gives better results for both simplehash and
  dynahash,
- simplehash improves performance for both the complex and the trivial hash
  function,
- simplehash + trivial function performs best.

I'd like to hear other people's comments on the trivial hash function. But
since generation of relation Oids is not subject to human intervention, I'd
recommend sticking with the trivial one.

Since the patch is simple, harmless and gives a measurable improvement,
I think it is ready for the commitfest.

regards,
Yura Sokolov.
Postgres Professional https://www.postgrespro.com

PS. David, please send the patch once again, since my mail client reattached
the files in previous messages, and the commitfest robot could think I'm the
author.




Re: Use simplehash.h instead of dynahash in SMgr

2021-04-26 Thread Yura Sokolov

Andres Freund wrote 2021-04-26 21:46:

Hi,

On 2021-04-25 01:27:24 +0300, Yura Sokolov wrote:

It is a quite interesting result. Simplehash, being open-addressing with
linear probing, is friendly to the CPU cache. I'd recommend defining
SH_FILLFACTOR with a value lower than the default (0.9). I believe 0.75 is
most suitable for this kind of hash table.


It's not a "plain" linear probing hash table (although it is on the 
lookup
side). During insertions collisions are reordered so that the average 
distance
from the "optimal" position is ~even ("robin hood hashing"). That 
allows a
higher load factor than a plain linear probed hash table (for which 
IIRC

there's data to show 0.75 to be a good default load factor).


Even for Robin Hood hashing a 0.9 fill factor is too high. It leads to too
many movements on insertion/deletion and a longer average collision chain.

Note that Robin Hood doesn't optimize the average case. Indeed, usually Robin Hood
has a worse (longer) average collision chain than simple linear probing.
Robin Hood hashing optimizes the worst case, i.e. it guarantees there is no
unnecessarily long collision chain.

See the discussion on the Rust hash table fill factor when it used Robin Hood:
https://github.com/rust-lang/rust/issues/38003
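
(For reference: with plain linear probing, Knuth's classic estimates are about
(1 + 1/(1-a))/2 probes per successful lookup and (1 + 1/(1-a)^2)/2 per
unsuccessful one, where a is the load factor -- roughly 8.5 unsuccessful probes
at a = 0.75 versus about 50 at a = 0.9, which is the kind of blowup being
discussed here.)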



There of course may still be a benefit in lowering the load factor, but I'd
not start there.

David's tests aren't really suited to benchmarking the load factor, but to me
the stats he showed didn't highlight a need to lower the load factor. Lowering
the fill factor does influence the cache hit ratio...

Greetings,

Andres Freund


regards,
Yura.




Re: Use simplehash.h instead of dynahash in SMgr

2021-04-25 Thread Yura Sokolov

David Rowley wrote 2021-04-25 16:36:
On Sun, 25 Apr 2021 at 18:48, Yura Sokolov  wrote:

If you read the comments in SH_START_ITERATE, you'll see:

   * Search for the first empty element. As deletions during iterations are
   * supported, we want to start/end at an element that cannot be affected
   * by elements being shifted.

   * Iterate backwards, that allows the current element to be deleted, even
   * if there are backward shifts.

Therefore, it is safe to delete during iteration, and it doesn't lead to
nor require a loop restart.


I had only skimmed that with a pre-loaded assumption that it wouldn't
be safe. I didn't do a very good job of reading it as I failed to
notice the lack of guarantees were about deleting items other than the
current one. I didn't consider the option of finding a free bucket
then looping backwards to avoid missing entries that are moved up
during a delete.

With that, I changed the patch to get rid of the SH_TRUNCATE and got
rid of the smgrclose_internal which skips the hash delete.  The code
is much more similar to how it was now.

In regards to the hashing stuff.  I added a new function to
pg_bitutils.h to rotate left and I'm using that instead of the other
expression that was taken from nodeHash.c

For the hash function, I've done some further benchmarking with:

1) The attached v2 patch
2) The attached + plus use_hash_combine.patch.txt which uses
hash_combine() instead of pg_rotate_left32()ing the hashkey each time.
3) The attached v2 with use_hash_bytes.patch.txt applied.
4) Master

I've also included the hash stats from each version of the hash function.


I hope the numbers help indicate the reason I picked the hash function
that I did.

1) v2 patch.
Median = 113.375 s

LOG:  size: 4096, members: 2032, filled: 0.496094, total chain: 997,
max chain: 5, avg chain: 0.490650, total_collisions: 422,
max_collisions: 3, avg_collisions: 0.207677

2) v2 patch + use_hash_combine.patch.txt
Median = 119.355 s

LOG:  size: 4096, members: 2032, filled: 0.496094, total chain: 971,
max chain: 6, avg chain: 0.477854, total_collisions: 433,
max_collisions: 4, avg_collisions: 0.213091

3) v2 patch + use_hash_bytes.patch.txt
Median = 117.685 s

LOG:  size: 4096, members: 2032, filled: 0.496094, total chain: 900,
max chain: 4, avg chain: 0.442913, total_collisions: 415,
max_collisions: 4, avg_collisions: 0.204232

4) master
Median = 124.49 s

You can see that the bare v2 patch is quite a bit faster than any of
the alternatives.  We'd be better of with hash_bytes than using
hash_combine() both for performance and for the seemingly better job
the hash function does at reducing the hash collisions.

David


Looks much better! Simpler is almost always better.

Minor remarks:

A comment for SH_ENTRY_INITIALIZER could be added, maybe like:
- SH_ENTRY_INITIALIZER(a) - if defined, this macro is called for new entries
  before the key or hash is stored in them. For example, it can be used to
  make necessary memory allocations.

`pg_rotate_left32(x, 1) == pg_rotate_right(x, 31)`, therefore there's
no need to add `pg_rotate_left32` at all. Moreover, for hash combining
there's not much difference between `pg_rotate_left32(x, 1)` and
`pg_rotate_right(x, 1)`. (To be honest, there is a bit of difference
due to the murmur construction, but it should not be very big.)
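
(A quick standalone check of that identity, using plain-C rotate helpers here
rather than the pg_bitutils ones:)

#include <stdint.h>
#include <stdio.h>

static uint32_t rotl32(uint32_t x, int n) { return (x << n) | (x >> (32 - n)); }
static uint32_t rotr32(uint32_t x, int n) { return (x >> n) | (x << (32 - n)); }

int
main(void)
{
	uint32_t	x = 0x12345678;

	/* rotating left by 1 and right by 31 give the same value */
	printf("%08x %08x\n", rotl32(x, 1), rotr32(x, 31));
	return 0;
}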

If your test is so sensitive to hash function speed, then I'd suggest
trying something even simpler:

static inline uint32
relfilenodebackend_hash(RelFileNodeBackend *rnode)
{
	uint32	h = 0;
#define step(x) h ^= (uint32)(x) * 0x85ebca6b; h = pg_rotate_right(h, 11); h *= 9;
	step(rnode->node.relNode);
	step(rnode->node.spcNode); // spcNode could be different for the same relNode
	                           // only during table movement. Does it pay to hash it?
	step(rnode->node.dbNode);
	step(rnode->backend);      // does it matter to hash backend?
	                           // It equals InvalidBackendId for non-temporary relations,
	                           // and temporary relations in the same database never
	                           // have the same relNode (have they?).
	return murmurhash32(h);
}

I'd like to see the benchmark code. It is quite interesting that this place
became measurable at all.

regards,
Yura Sokolov.

diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 8310f73212..33e5cadd03 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -34,8 +34,6 @@ typedef struct SMgrEntry
SMgrRelation data;  /* Pointer to the SMgrRelationData */
 } SMgrEntry;
 
-static inline uint32 relfilenodebackend_hash(RelFileNodeBackend *rnode);
-
 /*
  * Because simplehash.h does not provide a stable pointer to hash table
  * entries, we don't make the element type a SMgrRelation directly, instead we
@@ -51,7 +49,7 @@ static inline uint32 relfilenodebackend_hash(RelFileNodeBackend *rnode);
 #define SH_ELEMENT_TYPE 

Re: Use simplehash.h instead of dynahash in SMgr

2021-04-25 Thread Yura Sokolov

David Rowley wrote 2021-04-25 05:23:

Thanks for having a look at this.

 "On Sun, 25 Apr 2021 at 10:27, Yura Sokolov  
wrote:


It is a quite interesting result. Simplehash, being open-addressing with
linear probing, is friendly to the CPU cache. I'd recommend defining
SH_FILLFACTOR with a value lower than the default (0.9). I believe 0.75 is
most suitable for this kind of hash table.


You might be right there, although, with the particular benchmark I'm
using the size of the table does not change as a result of that. I'd
need to experiment with varying numbers of relations to see if
dropping the fillfactor helps or hinders performance.

FWIW, the hash stats at the end of recovery are:

LOG:  redo done at 3/C6E34F0 system usage: CPU: user: 107.00 s,
system: 5.61 s, elapsed: 112.67 s
LOG:  size: 4096, members: 2032, filled: 0.496094, total chain: 997,
max chain: 5, avg chain: 0.490650, total_collisions: 422,
max_collisions: 3, avg_collisions: 0.207677

Perhaps if try using a number of relations somewhere between 2048 *
0.75 and 2048 * 0.9 then I might see some gains.  Because I have 2032,
the hash table grew up to 4096 buckets.

I did a quick test dropping the fillfactor down to 0.4.  The aim there
was just to see if having 8192 buckets in this test would make it
faster or slower

LOG:  redo done at 3/C6E34F0 system usage: CPU: user: 109.61 s,
system: 4.28 s, elapsed: 113.93 s
LOG:  size: 8192, members: 2032, filled: 0.248047, total chain: 303,
max chain: 2, avg chain: 0.149114, total_collisions: 209,
max_collisions: 2, avg_collisions: 0.102854

it was slightly slower.


Certainly. That is because in the unmodified case you've got fillfactor 0.49,
because the table just grew. Somewhere below about 0.6 there is no gain from a
lower fillfactor. But if you test it when it is closer to the upper bound, you
will notice the difference. Try to test it with 3600 nodes, for example, if
going down to 1800 nodes is not possible.


> + /* rotate hashkey left 1 bit at each step */
> + hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
> + hashkey ^= murmurhash32((uint32) rnode->node.dbNode);

Why do you use such a strange rotation expression? I know compilers are able
to translate `h = (h << 1) | (h >> 31)` to a single rotate instruction.
Do they recognize the construction in your code as well?


Not sure about all compilers, I only checked the earliest version of
clang and gcc at godbolt.org and they both use a single "rol"
instruction. https://godbolt.org/z/1GqdE6T3q


Yep, looks like all compilers recognize such a construction, with the single
exception of the old icc compiler (both 13.0.1 and 16.0.3):
https://godbolt.org/z/qsrjY5Eof

and all compilers recognize `(h << 1) | (h >> 31)` well.

Your construction looks more like a "multiply-modulo" operation in the 32bit
Galois field. It is a widely used operation in cryptography, but it is
used modulo some primitive polynomial, and 0x10001 is not such a
polynomial. 0x100c5 is, therefore it should be:

 hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 0xc5 : 0);
or
 hashkey = (hashkey << 1) | ((uint32)((int32)hashkey >> 31) & 0xc5);


That does not really make sense to me.  If you're shifting a 32-bit
variable left 31 places then why would you AND with 0xc5? The only
possible result is 1 or 0 depending on if the most significant bit is
on or off.


That is why there is a cast to signed int before shifting: `(int32)hashkey >> 31`.
The shift is then also signed, i.e. arithmetic, and the results are 0 or
0xffffffff.


But why not just use hash_combine(uint32 a, uint32 b) instead (defined
in hashfn.h)? Yep, it could be a bit slower, but is it critical?


I had that function in the corner of my eye when writing this, but
TBH, the hash function performance was just too big a factor to slow
it down any further by using the more expensive hash_combine()
function. I saw pretty good performance gains from writing my own hash
function rather than using hash_bytes(). I didn't want to detract from
that by using hash_combine().  Rotating the bits left 1 slot seems
good enough for hash join and hash aggregate, so I don't have any
reason to believe it's a bad way to combine the hash values.  Do you?


Well, if I think a bit more, these hash values could be combined using
just addition: `hash(a) + hash(b) + hash(c)`.

I thought more about consistency in the codebase. But it looks like both ways
(`hash_combine(a,b)` and `rotl(a,1)^b`) are used in the code.
- hash_combine is used one time/three lines in hashTupleDesc at tupledesc.c
- rotl+xor six times:
-- three times/three lines in execGrouping.c with a construction like yours
-- three times in jsonb_util.c, multirangetypes.c and rangetypes.c with
   `(h << 1) | (h >> 31)`.
Therefore I step down from my recommendation in this place.

Looks like there is a possibility for a micropatch to unify hash combining :-)



If you grep the source for "ha

Re: Use simplehash.h instead of dynahash in SMgr

2021-04-24 Thread Yura Sokolov

David Rowley wrote 2021-04-24 18:58:

Hackers,

Last year, when working on making compactify_tuples() go faster for
19c60ad69, I did quite a bit of benchmarking of the recovery process.
The next thing that was slow after compactify_tuples() was the hash
lookups done in smgropen().

Currently, we use dynahash hash tables to store the SMgrRelation so we
can perform fast lookups by RelFileNodeBackend. However, I had in mind
that a simplehash table might perform better. So I tried it...

The attached converts the hash table lookups done in smgr.c to use
simplehash instead of dynahash.

This does require a few changes in simplehash.h to make it work.  The
reason being is that RelationData.rd_smgr points directly into the
hash table entries.  This works ok for dynahash as that hash table
implementation does not do any reallocations of existing items or move
any items around in the table, however, simplehash moves entries
around all the time, so we can't point any pointers directly at the
hash entries and expect them to be valid after adding or removing
anything else from the table.

To work around that, I've just made an additional type that serves as
the hash entry type that has a pointer to the SMgrRelationData along
with the hash status and hash value. It's just 16 bytes (or 12 on
32-bit machines).  I opted to keep the hash key in the
SMgrRelationData rather than duplicating it as it keeps the SMgrEntry
struct nice and small.  We only need to dereference the SMgrRelation
pointer when we find an entry with the same hash value.  The chances
are quite good that an entry with the same hash value is the one that
we want, so any additional dereferences to compare the key are not
going to happen very often.

I did experiment with putting the hash key in SMgrEntry and found it
to be quite a bit slower.  I also did try to use hash_bytes() but
found building a hash function that uses murmurhash32 to be quite a
bit faster.

Benchmarking
===

I did some of that.  It made my test case about 10% faster.

The test case was basically inserting 100 million rows one at a time
into a hash partitioned table with 1000 partitions and 2 int columns
and a primary key on one of those columns. It was about 12GB of WAL. I
used a hash partitioned table in the hope to create a fairly
random-looking SMgr hash table access pattern.  Hopefully something
similar to what might happen in the real world.

Over 10 runs of recovery, master took an average of 124.89 seconds.
The patched version took 113.59 seconds. About 10% faster.

I bumped shared_buffers up to 10GB, max_wal_size to 20GB and
checkpoint_timeout to 60 mins.

To make the benchmark more easily to repeat I patched with the
attached recovery_panic.patch.txt.  This just PANICs at the end of
recovery so that the database shuts down before performing the end of
recovery checkpoint.  Just start the database up again to do another
run.

I did 10 runs.  The end of recovery log message reported:

master (aa271209f)
CPU: user: 117.89 s, system: 5.70 s, elapsed: 123.65 s
CPU: user: 117.81 s, system: 5.74 s, elapsed: 123.62 s
CPU: user: 119.39 s, system: 5.75 s, elapsed: 125.20 s
CPU: user: 117.98 s, system: 4.39 s, elapsed: 122.41 s
CPU: user: 117.92 s, system: 4.79 s, elapsed: 122.76 s
CPU: user: 119.84 s, system: 4.75 s, elapsed: 124.64 s
CPU: user: 120.60 s, system: 5.82 s, elapsed: 126.49 s
CPU: user: 118.74 s, system: 5.71 s, elapsed: 124.51 s
CPU: user: 124.29 s, system: 6.79 s, elapsed: 131.14 s
CPU: user: 118.73 s, system: 5.67 s, elapsed: 124.47 s

master + v1 patch
CPU: user: 106.90 s, system: 4.45 s, elapsed: 111.39 s
CPU: user: 107.31 s, system: 5.98 s, elapsed: 113.35 s
CPU: user: 107.14 s, system: 5.58 s, elapsed: 112.77 s
CPU: user: 105.79 s, system: 5.64 s, elapsed: 111.48 s
CPU: user: 105.78 s, system: 5.80 s, elapsed: 111.63 s
CPU: user: 113.18 s, system: 6.21 s, elapsed: 119.45 s
CPU: user: 107.74 s, system: 4.57 s, elapsed: 112.36 s
CPU: user: 107.42 s, system: 4.62 s, elapsed: 112.09 s
CPU: user: 106.54 s, system: 4.65 s, elapsed: 111.24 s
CPU: user: 113.24 s, system: 6.86 s, elapsed: 120.16 s

I wrote this patch a few days ago. I'm only posting it now as I know a
couple of other people have expressed an interest in working on this.
I didn't really want any duplicate efforts, so thought I'd better post
it now before someone else goes and writes a similar patch.

I'll park this here and have another look at it when the PG15 branch opens.


David


Hi, David

It is a quite interesting result. Simplehash, being open-addressing with
linear probing, is friendly to the CPU cache. I'd recommend defining
SH_FILLFACTOR with a value lower than the default (0.9). I believe 0.75 is
most suitable for this kind of hash table.


+   /* rotate hashkey left 1 bit at each step */
+   hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
+   hashkey ^= murmurhash32((uint32) rnode->node.dbNode);


Why do you use such a strange rotation expression? I know compilers are able
to 

Re: Old Postgresql version on i7-1165g7

2021-04-19 Thread Yura Sokolov

Yura Sokolov wrote 2021-04-18 23:29:

Tom Lane wrote 2021-04-13 17:45:

Justin Pryzby  writes:

On Fri, Apr 09, 2021 at 04:28:25PM +0300, Yura Sokolov wrote:
Occasionally I found I'm not able to `make check` old PostgreSQL versions.


I've bisected between REL_11_0 and "Rename pg_rewind's copy_file_range()"
and found 372728b0d49552641f0ea83d9d2e08817de038fa
"Replace our traditional initial-catalog-data format with a better design."
This is the first commit where `make check` doesn't fail during initdb on my
machine.


This doesn't make much sense or help much, since 372728b doesn't actually
change the catalogs, or any .c file.


It could make sense if some part of the toolchain that was previously
used to generate postgres.bki doesn't work right on that machine.
Overall though I'd have thought that 372728b would increase not
decrease our toolchain footprint.  It also seems unlikely that a
recent Ubuntu release would contain toolchain bugs that we hadn't
already heard about.


You used make clean too, right ?


Really, when bisecting, you need to use "make distclean" or even
"git clean -dfx" between steps, or you may get bogus results,
because our makefiles aren't that great about tracking dependencies,
especially when you move backwards in the history.


Yep, "git clean -dfx" did the job. "make distclean" didn't, btw.
I've had "src/backend/catalog/schemapg.h" file in source tree
generated with "make submake-generated-headers" on REL_13_0.
It were not shown with "git status", therefore I didn't notice its
existence. It were not deleted neither with "make distclean", nor with
"git clean -dx" I tried before. Only "git clean -dfx" deletes it.

Thank you for the suggestion, Tom. You've saved my sanity.

Regards,
Yura Sokolov.




Re: Old Postgresql version on i7-1165g7

2021-04-18 Thread Yura Sokolov

Tom Lane wrote 2021-04-13 17:45:

Justin Pryzby  writes:

On Fri, Apr 09, 2021 at 04:28:25PM +0300, Yura Sokolov wrote:
Occasionally I found I'm not able to `make check` old PostgreSQL versions.


I've bisected between REL_11_0 and "Rename pg_rewind's copy_file_range()"
and found 372728b0d49552641f0ea83d9d2e08817de038fa
"Replace our traditional initial-catalog-data format with a better design."
This is the first commit where `make check` doesn't fail during initdb on my
machine.


This doesn't make much sense or help much, since 372728b doesn't actually
change the catalogs, or any .c file.


It could make sense if some part of the toolchain that was previously
used to generate postgres.bki doesn't work right on that machine.
Overall though I'd have thought that 372728b would increase not
decrease our toolchain footprint.  It also seems unlikely that a
recent Ubuntu release would contain toolchain bugs that we hadn't
already heard about.


You used make clean too, right ?


Really, when bisecting, you need to use "make distclean" or even
"git clean -dfx" between steps, or you may get bogus results,
because our makefiles aren't that great about tracking dependencies,
especially when you move backwards in the history.

So perhaps a more plausible theory is that this bisection result
is wrong because you weren't careful enough.

regards, tom lane


Sorry for missing the mail for a week.

I believe I cleaned before each step, since I'm building in an external
directory and the cleanup is just `rm * -r`.

But I'll repeat the bisecting tomorrow to be sure.

I don't think it is really a PostgreSQL or toolchain bug. I believe it is some
corner case whose behavior changed in the new Intel CPUs.

With regards,
Yura Sokolov.




Re: Old Postgresql version on i7-1165g7

2021-04-13 Thread Yura Sokolov

Yura Sokolov wrote 2021-04-09 16:28:

Good day, hackers.

I've got an HP ProBook 640g8 with an i7-1165g7. I've installed Ubuntu 20.04
LTS on it and started to play with the PostgreSQL sources.

Occasionally I found I'm not able to `make check` old PostgreSQL versions.
At least 9.6 and 10. They fail at the initdb stage, in the call to postgres.

Raw postgresql version 9.6.8 and 10.0 fails in boostrap stage:

running bootstrap script ... 2021-04-09 12:33:26.424 MSK [161121]
FATAL:  could not find tuple for opclass 1
2021-04-09 12:33:26.424 MSK [161121] PANIC:  cannot abort
transaction 1, it was already committed
Aborted (core dumped)
child process exited with exit code 134

Our modified custom version of 9.6 fails inside libc's __strncmp_avx2
during post-bootstrap with a segmentation fault:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  __strncmp_avx2 ()
#1  0x557168a7eeda in nameeq
#2  0x557168b4c4a0 in FunctionCall2Coll
#3  0x557168659555 in heapgettup_pagemode
#4  0x55716865a617 in heap_getnext
#5  0x557168678cf1 in systable_getnext
#6  0x557168b5651c in GetDatabaseTuple
#7  0x557168b574a4 in InitPostgres
#8  0x5571689dcb7d in PostgresMain
#9  0x5571688844d5 in main

I've bisected between REL_11_0 and "Rename pg_rewind's copy_file_range()" and
found 372728b0d49552641f0ea83d9d2e08817de038fa
"Replace our traditional initial-catalog-data format with a better design."


https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=372728b0d49552641f0ea83d9d2e08817de038fa

This is the first commit where `make check` doesn't fail during initdb on
my machine.
Therefore 02f3e558f21c0fbec9f94d5de9ad34f321eb0e57 is the last one
where `make check` fails.

I've tried with gcc9, gcc10 and clang10.
I've configured either without parameters or with `CFLAGS=-O0
./configure --enable-debug`.

The thing doesn't happen on 10th-series Intel CPUs (i7-10510U and i9-10900K).
Unfortunately, I have no fellows or colleagues with an 11th-series Intel CPU,
therefore I can't tell whether this is a bug of the 11th series or of the
concrete CPU installed in the notebook.

It would be great if someone with an i7-11* could try make check and report
whether it also fails or not.


BTW, the problem remains in Debian stable (10.4) inside Docker on the same
machine.




With regards,
Yura Sokolov
PostgresPro





Old Postgresql version on i7-1165g7

2021-04-09 Thread Yura Sokolov

Good day, hackers.

I've got an HP ProBook 640g8 with an i7-1165g7. I've installed Ubuntu 20.04
LTS on it and started to play with the PostgreSQL sources.

Occasionally I found I'm not able to `make check` old PostgreSQL versions.
At least 9.6 and 10. They fail at the initdb stage, in the call to postgres.


Raw postgresql version 9.6.8 and 10.0 fails in boostrap stage:

running bootstrap script ... 2021-04-09 12:33:26.424 MSK [161121] 
FATAL:  could not find tuple for opclass 1
2021-04-09 12:33:26.424 MSK [161121] PANIC:  cannot abort 
transaction 1, it was already committed

Aborted (core dumped)
child process exited with exit code 134

Our modified custom version of 9.6 fails inside libc's __strncmp_avx2
during post-bootstrap with a segmentation fault:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  __strncmp_avx2 ()
#1  0x557168a7eeda in nameeq
#2  0x557168b4c4a0 in FunctionCall2Coll
#3  0x557168659555 in heapgettup_pagemode
#4  0x55716865a617 in heap_getnext
#5  0x557168678cf1 in systable_getnext
#6  0x557168b5651c in GetDatabaseTuple
#7  0x557168b574a4 in InitPostgres
#8  0x5571689dcb7d in PostgresMain
#9  0x5571688844d5 in main

I've bisected between REL_11_0 and "Rename pg_rewind's copy_file_range()" and
found 372728b0d49552641f0ea83d9d2e08817de038fa
"Replace our traditional initial-catalog-data format with a better design."

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=372728b0d49552641f0ea83d9d2e08817de038fa

This is the first commit where `make check` doesn't fail during initdb on my
machine.
Therefore 02f3e558f21c0fbec9f94d5de9ad34f321eb0e57 is the last one where
`make check` fails.


I've tried with gcc9, gcc10 and clang10.
I've configured either without parameters or with `CFLAGS=-O0 ./configure --enable-debug`.


The thing doesn't happen on 10th-series Intel CPUs (i7-10510U and i9-10900K).
Unfortunately, I have no fellows or colleagues with an 11th-series Intel CPU,
therefore I can't tell whether this is a bug of the 11th series or of the
concrete CPU installed in the notebook.

It would be great if someone with an i7-11* could try make check and report
whether it also fails or not.

With regards,
Yura Sokolov
PostgresPro




Re: [HACKERS] Block level parallel vacuum

2018-10-31 Thread Yura Sokolov
Excuse me for being noisy.

Increasing vacuum's ring buffer improves vacuum up to 6 times:
https://www.postgresql.org/message-id/flat/20170720190405.GM1769%40tamriel.snowman.net
This is a one-line change.

How much improvement does parallel vacuum give?

31.10.2018 3:23, Masahiko Sawada wrote:
> On Tue, Oct 30, 2018 at 5:30 PM Masahiko Sawada  wrote:
>>
>> On Tue, Aug 14, 2018 at 9:31 AM Masahiko Sawada  
>> wrote:
>>>
>>> On Thu, Nov 30, 2017 at 11:09 AM, Michael Paquier
>>>  wrote:
 On Tue, Oct 24, 2017 at 5:54 AM, Masahiko Sawada  
 wrote:
> Yeah, I was thinking the commit is relevant with this issue but as
> Amit mentioned this error is emitted by DROP SCHEMA CASCASE.
> I don't find out the cause of this issue yet. With the previous
> version patch, autovacuum workers were woking with one parallel worker
> but it never drops relations. So it's possible that the error might
> not have been relevant with the patch but anywayI'll continue to work
> on that.

 This depends on the extension lock patch from
 https://www.postgresql.org/message-id/flat/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=gn...@mail.gmail.com/
 if I am following correctly. So I propose to mark this patch as
 returned with feedback for now, and come back to it once the root
 problems are addressed. Feel free to correct me if you think that's
 not adapted.
>>>
>>> I've re-designed the parallel vacuum patch. Attached the latest
>>> version patch. As the discussion so far, this patch depends on the
>>> extension lock patch[1]. However I think we can discuss the design
>>> part of parallel vacuum independently from that patch. That's way I'm
>>> proposing the new patch. In this patch, I structured and refined the
>>> lazy_scan_heap() because it's a single big function and not suitable
>>> for making it parallel.
>>>
>>> The parallel vacuum worker processes keep waiting for commands from
>>> the parallel vacuum leader process. Before entering each phase of lazy
>>> vacuum such as scanning heap, vacuum index and vacuum heap, the leader
>>> process changes the all workers state to the next state. Vacuum worker
>>> processes do the job according to the their state and wait for the
>>> next command after finished. Also in before entering the next phase,
>>> the leader process does some preparation works while vacuum workers is
>>> sleeping; for example, clearing shared dead tuple space before
>>> entering the 'scanning heap' phase. The status of vacuum workers are
>>> stored into a DSM area pointed by WorkerState variables, and
>>> controlled by the leader process. FOr the basic design and performance
>>> improvements please refer to my presentation at PGCon 2018[2].
>>>
>>> The number of parallel vacuum workers is determined according to
>>> either the table size or PARALLEL option in VACUUM command. The
>>> maximum of parallel workers is max_parallel_maintenance_workers.
>>>
>>> I've separated the code for vacuum worker process to
>>> backends/commands/vacuumworker.c, and created
>>> includes/commands/vacuum_internal.h file to declare the definitions
>>> for the lazy vacuum.
>>>
>>> For autovacuum, this patch allows autovacuum worker process to use the
>>> parallel option according to the relation size or the reloption. But
>>> autovacuum delay, since there is no slots for parallel worker of
>>> autovacuum in AutoVacuumShmem this patch doesn't support the change of
>>> the autovacuum delay configuration during running.
>>>
>>
>> Attached rebased version patch to the current HEAD.
>>
>>> Please apply this patch with the extension lock patch[1] when testing
>>> as this patch can try to extend visibility map pages concurrently.
>>>
>>
>> Because the patch leads performance degradation in the case where
>> bulk-loading to a partitioned table I think that the original
>> proposal, which makes group locking conflict when relation extension
>> locks, is more realistic approach. So I worked on this with the simple
>> patch instead of [1]. Attached three patches:
>>
>> * 0001 patch publishes some static functions such as
>> heap_paralellscan_startblock_init so that the parallel vacuum code can
>> use them.
>> * 0002 patch makes the group locking conflict when relation extension locks.
>> * 0003 patch add paralel option to lazy vacuum.
>>
>> Please review them.
>>
> 
> Oops, forgot to attach patches.
> 
> Regards,
> 
> --
> Masahiko Sawada
> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
> NTT Open Source Software Center
> 




Re: [HACKERS] Two pass CheckDeadlock in contentent case

2018-07-20 Thread Yura Sokolov
11.07.2018 17:01, Ashutosh Bapat wrote:
> The patch still applies and it's part of this commitfest.
> 
> On Tue, Oct 3, 2017 at 8:36 PM, Sokolov Yura  wrote:
>> On 2017-10-03 17:30, Sokolov Yura wrote:
>>>
>>> Good day, hackers.
>>>
>>> During hard workload sometimes process reaches deadlock timeout
>>> even if no real deadlock occurred. It is easily reproducible with
>>> pg_xact_advisory_lock on same value + some time consuming
>>> operation (update) and many clients.
>>>
>>> When backend reaches deadlock timeout, it calls CheckDeadlock,
>>> which locks all partitions of lock hash in exclusive mode to
>>> walk through graph and search for deadlock.
>>>
>>> If hundreds of backends reaches this timeout trying to acquire
>>> advisory lock on a same value, it leads to hard-stuck for many
>>> seconds, cause they all traverse same huge lock graph under
>>> exclusive lock.
>>> During this stuck there is no possibility to do any meaningful
>>> operations (no new transaction can begin).
>>>
>>> Attached patch makes CheckDeadlock to do two passes:
>>> - first pass uses LW_SHARED on partitions of lock hash.
>>>   DeadLockCheck is called with flag "readonly", so it doesn't
>>>   modify anything.
>>> - If there is possibility of "soft" or "hard" deadlock detected,
>>>   ie if there is need to modify lock graph, then partitions
>>>   relocked with LW_EXCLUSIVE, and DeadLockCheck is called again.
>>>
>>> It fixes hard-stuck, cause backends walk lock graph under shared
>>> lock, and found that there is no real deadlock.
>>>
>>> Test on 4 socket xeon machine:
>>> pgbench_zipf -s 10 -c 800  -j 100 -M prepared -T 450 -f
>>> ./ycsb_read_zipf.sql@50 -f ./ycsb_update_lock2_zipf.sql@50 -P 5
>>>
>>> ycsb_read_zipf.sql:
>>> \set i random_zipfian(1, 10 * :scale, 1.01)
>>> SELECT abalance FROM pgbench_accounts WHERE aid = :i;
>>> ycsb_update_lock2_zipf.sql:
>>> \set i random_zipfian(1, 10 * :scale, 1.01)
>>> select lock_and_update( :i );
>>>
>>> CREATE OR REPLACE FUNCTION public.lock_and_update(i integer)
>>>  RETURNS void
>>>  LANGUAGE sql
>>> AS $function$
>>> SELECT pg_advisory_xact_lock( $1 );
>>> UPDATE pgbench_accounts SET abalance = 1 WHERE aid = $1;
>>> $function$
>>>
>>> Without attached patch:
>>>
>>> progress: 5.0 s, 45707.0 tps, lat 15.599 ms stddev 83.757
>>> progress: 10.0 s, 51124.4 tps, lat 15.681 ms stddev 78.353
>>> progress: 15.0 s, 52293.8 tps, lat 15.327 ms stddev 77.017
>>> progress: 20.0 s, 51280.4 tps, lat 15.603 ms stddev 78.199
>>> progress: 25.0 s, 47278.6 tps, lat 16.795 ms stddev 83.570
>>> progress: 30.0 s, 41792.9 tps, lat 18.535 ms stddev 93.697
>>> progress: 35.0 s, 12393.7 tps, lat 33.757 ms stddev 169.116
>>> progress: 40.0 s, 0.0 tps, lat -nan ms stddev -nan
>>> progress: 45.0 s, 0.0 tps, lat -nan ms stddev -nan
>>> progress: 50.0 s, 1.2 tps, lat 2497.734 ms stddev 5393.166
>>> progress: 55.0 s, 0.0 tps, lat -nan ms stddev -nan
>>> progress: 60.0 s, 27357.9 tps, lat 160.622 ms stddev 1866.625
>>> progress: 65.0 s, 38770.8 tps, lat 20.829 ms stddev 104.453
>>> progress: 70.0 s, 40553.2 tps, lat 19.809 ms stddev 99.741
>>>
>>> (autovacuum led to trigger deadlock timeout,
>>>  and therefore, to stuck)
>>>
>>> Patched:
>>>
>>> progress: 5.0 s, 56264.7 tps, lat 12.847 ms stddev 66.980
>>> progress: 10.0 s, 55389.3 tps, lat 14.329 ms stddev 71.997
>>> progress: 15.0 s, 50757.7 tps, lat 15.730 ms stddev 78.222
>>> progress: 20.0 s, 50797.3 tps, lat 15.736 ms stddev 79.296
>>> progress: 25.0 s, 48485.3 tps, lat 16.432 ms stddev 82.720
>>> progress: 30.0 s, 45202.1 tps, lat 17.733 ms stddev 88.554
>>> progress: 35.0 s, 40035.8 tps, lat 19.343 ms stddev 97.787
>>> progress: 40.0 s, 14240.1 tps, lat 47.625 ms stddev 265.465
>>> progress: 45.0 s, 30150.6 tps, lat 31.140 ms stddev 196.114
>>> progress: 50.0 s, 38975.0 tps, lat 20.543 ms stddev 104.281
>>> progress: 55.0 s, 40573.9 tps, lat 19.770 ms stddev 99.487
>>> progress: 60.0 s, 38361.1 tps, lat 20.693 ms stddev 103.141
>>> progress: 65.0 s, 39793.3 tps, lat 20.216 ms stddev 101.080
>>> progress: 70.0 s, 38387.9 tps, lat 20.886 ms stddev 104.482
> 
> I am confused about interpreting these
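For readers skimming the archive, a minimal sketch of the two-pass flow
described in the quoted message; DeadLockCheckReadOnly() is a hypothetical
name for a DeadLockCheck() variant that only reads the wait-for graph, and
this is an illustration, not the attached patch:

    #include "storage/lock.h"
    #include "storage/lwlock.h"
    #include "storage/proc.h"

    static void
    CheckDeadlockTwoPass(PGPROC *proc)
    {
        int     i;

        /* Pass 1: shared locks are enough to merely read the lock graph. */
        for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
            LWLockAcquire(LockHashPartitionLockByIndex(i), LW_SHARED);

        if (DeadLockCheckReadOnly(proc) == DS_NO_DEADLOCK)
        {
            /* Common contended case: no deadlock, nothing to rewrite. */
            for (i = NUM_LOCK_PARTITIONS; --i >= 0;)
                LWLockRelease(LockHashPartitionLockByIndex(i));
            return;
        }

        /* Possible soft/hard deadlock: retake exclusively and recheck. */
        for (i = NUM_LOCK_PARTITIONS; --i >= 0;)
            LWLockRelease(LockHashPartitionLockByIndex(i));
        for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
            LWLockAcquire(LockHashPartitionLockByIndex(i), LW_EXCLUSIVE);

        (void) DeadLockCheck(proc);     /* may now rearrange wait queues */

        for (i = NUM_LOCK_PARTITIONS; --i >= 0;)
            LWLockRelease(LockHashPartitionLockByIndex(i));
    }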

Re: [HACKERS] Clock with Adaptive Replacement

2018-05-06 Thread Yura Sokolov
06.05.2018 20:28, Andrey Borodin wrote:
> 
>> On 6 May 2018, at 20:38, Yura Sokolov <funny.fal...@gmail.com> wrote:
>>
>> I've been playing with logarithmic scale in postgresql roughly year ago.
>> I didn't found any benefits compared to current code. In fact, it looked
>> to perform worse than current code.
>> That is why I didn't wrote about that experiment to pgsql-hackers.
> Is there a feature of testgres to benchmark pgbench tps against shared buffer 
> size? Let's request this feature if it is not there :)

That would be great. Will you?

But pgbench is a bit... far from the real world. I used pgbench to test
the log scale, and it showed that the log scale performs worse.
It would be more valuable to test on real-world installations, or to
validate the algorithm against real-world traces.

>> But probably I measured in a wrong way. And why I dream to have
>> real-world traces in hands.
>>
>> Consider all known to be effective algorithms: 2Q, ARC, Clock-PRO,
>> CAR, - they all consider buffer "hot" if it has more temporal frequency
>> in opposite to raw access count. They all mostly ignores spike of usages
>> during first moments after placement into cache, and moves buffer to hot
>> if it is accessed at some time after.
> These algorithms do not ignore spikes, they ignore spike's amplitude. And 
> this does not mean that amplitude is irrelevant information, even if these 
> algorithms perform almost like Belady's.

More important is their consideration of temporal frequency.

> Building a complicated heuristics (with a lot of magic numbers) to merge this 
> spikes into one event does not look promising to me. But this is just my 
> superstition, chances are that you can tune your algorithm into serious 
> advancement. But please publish benchmarks, whatever result will be.

Yes, these numbers look like magic. They are hand-tuned on traces that are
probably irrelevant to PostgreSQL behavior.

And since you have already presented a patch, could you publish its
benchmark results?

With regards,
Sokolov Yura.



Re: [HACKERS] Clock with Adaptive Replacement

2018-05-06 Thread Yura Sokolov
06.05.2018 11:20, Andrey Borodin wrote:
> 
> 
>> On 5 May 2018, at 13:25, Yura Sokolov <funny.fal...@gmail.com> wrote:
>>
>> 05.05.2018 09:16, Andrey Borodin wrote:
>>> Hi!
>>>
>>>> On 4 May 2018, at 16:05, Юрий Соколов <funny.fal...@gmail.com>
>>>> wrote:
>>>>
>>>> I didn't suggest log scale of usages, but rather
>>>> "replacement-period based" increment: usage count could be
>>>> incremented at most once in NBlocks/32 replaced items. Once it is
>>>> incremented, its "replacement time" is remembered, and next
>>>> NBlocks/32 replacements usage count of this buffer doesn't
>>>> increment. This way, increment is synchronized with replacement
>>>> activity.
>>>
>>> But you loose difference between "touched once" and "actively used".
>>> Log scale of usage solves this: usage count grows logarithmically,
>>> but drains linearly.
>> No, I didn't loose difference. But instead of absolute value (log scale
>> or linear) I count how often in time block is used:
>> - if buffer were touched 1000 times just after placing into
>> shared_buffers should it live 500 times longer than neighbor that were
>> touched only 2 times? or 10 times longer? or 5 times longer?
>> - but what if that "1000 times" buffer were not touched in next period
>> of time, but neighbor were touched again 2 times?
>> All effective algorithms answers: "1000 times" buffer should be evicted
>> first, but its neighbor is a really hot buffer that should be saved for
>> longer period.
> It is hard to tell actually what is right decision here. It is better to 
> evict buffer that will not be needed longer, and it is not obvious that is is 
> true for buffer that you called hot.
> 
> Assume we have buffer A who is touched 1024 times (and then forgotten 
> forever) and buffer B which is touched 2 times every clock cycle.
>   A   B
> Usage count   0x400   0x2
> 1 Eviction time!  0x100   0x0 E
> Usage count   0x100   0x2
> 2 Eviction time!  0x080   0x0 E
> Usage count   0x080   0x2
> 3 Eviction time!  0x020   0x0 E
> Usage count   0x020   0x2
> 4 Eviction time!  0x00A   0x0 E
> Usage count   0x00A   0x2
> 5 Eviction time!  0x004   0x0 E
> Usage count   0x004   0x2
> 6 Eviction time!  0x001   0x0 E
> Usage count   0x001   0x2
> 7 Eviction time!  0x000 E 0x2
> So, buffer used 512 times more survived only 7 stale cycles. Looks fair.

No, it is not fair. In fact, it is worse than with the current strategy:
with the current GClock it would survive only 5 stale cycles.

And in my approach I was talking about NBlocks/32 periods, so it looks
more like this (let the increment be 1):

Period       A               B              C
          touch UseCnt   touch  cnt     touch  cnt
 1/64       512    0       16    0        1     0
 2/64       512    0        0    0        1     0
 3/64         0    0        0    0        0     0
 4/64         0    0        0    0        1     1
 5/64         0    0        0    0        1     1
 ...
32/64         0    0       16    1        0     1
33/64         0    0       16    1        1     2
34/64         0    0        0    1        1     2
 ...
63/64         0    0        0    1        1     3
64/64         0    0        0    1        1     3
Eviction time!
            A evicted      B 1            C 2

So, in my algorithm A will be evicted at the first eviction time.
And while B is accessed more times, its access distance is much longer
than C's, so it has less chance to survive in the future.

>> Log scale doesn't solve this. But increment "once in period" solves.
>> Especially if block is placed first with zero count (instead of 1 as
>> currently).
>>
>>>> Digging further, I suggest as improvement of GClock algorithm: -
>>>> placing new buffer with usage count = 0 (and next NBlock/32
>>>> replacements its usage count doesn't increased)
>>>> - increment not by 1, but by 8 (it simulates "hot queue" of
>>>> popular algorithms) with limit 32.
>>>> - scan at most 25 buffers for eviction. If no buffer with zero
>>>> usage count found, the least used buffer (among scanned 25) is evicted.
>>>> (new buffers are not evicted during their first NBlock/32
>>>> replacements).
>>>>
>>>
>>> I do not understand where these numbers come from...
>>
>> I found this number by testing with several artificial traces found in web.
>> I don't claim this number are best. Even on that traces best values may
>> vary on cache size: for small cache size increment and limit tends to be
>> higher, for huge cac

Re: [HACKERS] Clock with Adaptive Replacement

2018-05-05 Thread Yura Sokolov
05.05.2018 09:16, Andrey Borodin wrote:
> Hi!
> 
>> On 4 May 2018, at 16:05, Юрий Соколов <funny.fal...@gmail.com>
>> wrote:
>> 
>> I didn't suggest log scale of usages, but rather
>> "replacement-period based" increment: usage count could be
>> incremented at most once in NBlocks/32 replaced items. Once it is
>> incremented, its "replacement time" is remembered, and next
>> NBlocks/32 replacements usage count of this buffer doesn't
>> increment. This way, increment is synchronized with replacement
>> activity.
> 
> But you loose difference between "touched once" and "actively used".
> Log scale of usage solves this: usage count grows logarithmically,
> but drains linearly.
No, I don't lose the difference. But instead of an absolute value (log
scale or linear) I count how often in time a block is used:
- If a buffer was touched 1000 times just after being placed into
shared_buffers, should it live 500 times longer than a neighbor that was
touched only 2 times? Or 10 times longer? Or 5 times longer?
- But what if that "1000 times" buffer is not touched in the next period
of time, while the neighbor is touched again 2 times?
All effective algorithms answer: the "1000 times" buffer should be evicted
first, but its neighbor is a really hot buffer that should be kept for a
longer period.

A log scale doesn't solve this, but a "once per period" increment does,
especially if a block is first placed with a zero count (instead of 1, as
currently).

>> Digging further, I suggest as improvement of GClock algorithm: -
>> placing new buffer with usage count = 0 (and next NBlock/32
>> replacements its usage count doesn't increased)
>> - increment not by 1, but by 8 (it simulates "hot queue" of
>> popular algorithms) with limit 32.
>> - scan at most 25 buffers for eviction. If no buffer with zero
>> usage count found, the least used buffer (among scanned 25) is evicted.
>> (new buffers are not evicted during their first NBlock/32
>> replacements).
>> 
> 
> I do not understand where these numbers come from...

I found these numbers by testing with several artificial traces found on
the web. I don't claim these numbers are the best. Even on those traces
the best values may vary with cache size: for a small cache the increment
and limit tend to be higher, for a huge cache smaller. But these were the
most balanced.

And I don't claim those traces are representative of PostgreSQL; that is
why I'm pushing this discussion to collect more real-world PostgreSQL
traces and make them public.

And I believe my algorithm is not the best. Clock-Pro and ARC show better
results on those traces. Tiny-LFU (a cache admission algorithm) may be
even more efficient (in terms of evictions). But the results should be
rechecked with PostgreSQL traces.

My algorithm would simply be the least invasive for the current code, imho.

With regards,
Sokolov Yura



signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] Clock with Adaptive Replacement

2018-05-02 Thread Yura Sokolov
02.05.2018 01:37, Peter Geoghegan wrote:
> On Tue, May 1, 2018 at 11:58 AM, Robert Haas  wrote:
>> I agree that double-counting correlated accesses is a a problem, and I
>> agree that we probably want to do something about it.  I am skeptical
>> that using wall-clock time is the right way to solve that problem
>> because it leads to edge effects
> 
> I agree that wall-clock time is a bad approach, actually. If we were
> to use wall-clock time, we'd only do so because it can be used to
> discriminate against a thing that we actually care about in an
> approximate, indirect way. If there is a more direct way of
> identifying correlated accesses, which I gather that there is, then we
> should probably use it.

I suggest incrementing only once per 1/32 of the replacements in
shared_buffers. I.e. if the size of shared_buffers is 1024 buffers, and a
page was put into shared_buffers as replacement number 21200, then its
usage count will next be incremented only after page number 21232 has been
put into shared buffers, and a second time only after page number 21264.
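A hedged, self-contained sketch of that rule, just to pin down the
arithmetic; the counter and field names are made up for illustration and
this is not the actual buffer manager code:

    #include <stdint.h>

    /* Hypothetical global counter, bumped on every buffer replacement. */
    static uint64_t replacement_tick = 0;

    typedef struct SketchBufDesc
    {
        uint32_t    usage_count;
        uint64_t    last_inc_tick;  /* replacement_tick at the last bump */
    } SketchBufDesc;

    static void
    sketch_bump_usage(SketchBufDesc *buf, uint32_t nbuffers)
    {
        /*
         * With nbuffers = 1024, a page placed at tick 21200 can be bumped
         * again only at tick 21232, then at 21264, and so on.
         */
        if (replacement_tick - buf->last_inc_tick >= nbuffers / 32)
        {
            buf->usage_count++;
            buf->last_inc_tick = replacement_tick;
        }
    }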

regards,
Yura.



Re: [HACKERS] Lazy hash table for XidInMVCCSnapshot (helps Zipfian a bit)

2018-04-01 Thread Yura Sokolov
23.03.2018 17:59, Amit Kapila wrote:
> On Sat, Mar 10, 2018 at 7:41 AM, Yura Sokolov <funny.fal...@gmail.com> wrote:
>> 08.03.2018 03:42, Tomas Vondra wrote:
>>> One reason against building the hash table in GetSnapshotData is that
>>> we'd build it even when the snapshot is never queried. Or when it is
>>> queried, but we only need to check xmin/xmax.
>>
>> Thank you for analyze, Tomas.
>>
>> Stephen is right about bug in snapmgr.c
>> Attached version fixes bug, and also simplifies XidInXip a bit.
>>
> 
> @@ -2167,8 +2175,7 @@ RestoreSnapshot(char *start_address)
>   /* Copy SubXIDs, if present. */
>   if (serialized_snapshot.subxcnt > 0)
>   {
> - snapshot->subxip = ((TransactionId *) (snapshot + 1)) +
> - serialized_snapshot.xcnt;
> + snapshot->subxip = ((TransactionId *) (snapshot + 1)) + xcnt;
>   memcpy(snapshot->subxip, serialized_xids + serialized_snapshot.xcnt,
> serialized_snapshot.subxcnt * sizeof(TransactionId));
>   }
> 
> 
> It is not clear why you want to change this in RestoreSnapshot when
> nothing related is changed in SerializeSnapshot?  Can you please add
> some comments to clarify it?
> 

I didn't change the serialized format, therefore there is no need to
change SerializeSnapshot. But the in-memory representation has changed, so
RestoreSnapshot is changed.

With regards,
Sokolov Yura.





Re: [HACKERS] Lazy hash table for XidInMVCCSnapshot (helps Zipfian a bit)

2018-04-01 Thread Yura Sokolov
17.03.2018 03:36, Tomas Vondra wrote:
> 
> On 03/17/2018 12:03 AM, Yura Sokolov wrote:
>> 16.03.2018 04:23, Tomas Vondra wrote:
>>>
>>> ...
>>>
>>> OK, a few more comments.
>>>
>>> 1) The code in ExtendXipSizeForHash seems somewhat redundant with
>>> my_log2 (that is, we could just call the existing function).
>>
>> Yes, I could call my_log2 from ExtendXipSizeForHash. But wouldn't one
>> more call be more expensive than loop itself?
>>
> 
> I very much doubt it there would be a measurable difference. Firstly,
> function calls are not that expensive, otherwise we'd be running around
> and removing function calls from the hot paths. Secondly, the call only
> happens with many XIDs, and in that case the cost should be out-weighted
> by faster lookups. And finally, the function does stuff that seems far
> more expensive than a single function call (for example allocation,
> which may easily trigger a malloc).
> 
> In fact, in the interesting cases it's pretty much guaranteed to hit a
> malloc, because the number of backend processes needs to be high enough
> (say, 256 or more), which means
> 
> GetMaxSnapshotSubxidCount()
> 
> will translate to something like
> 
> ((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS)
> = (64 + 1) * 256
> = 16640
> 
> and because XIDs are 4B each, that's ~65kB of memory (even ignoring the
> ExtendXipSizeForHash business). And aset.c treats chunks larger than 8kB
> as separate blocks, that are always malloc-ed independently.
> 
> But I may be missing something, and the extra call actually makes a
> difference. But I think the onus of proving that is on you, and the
> default should be not to duplicate code.
> 
>>> 2) Apparently xhlog/subxhlog fields are used for two purposes - to
>>> store log2 of the two XID counts, and to remember if the hash table
>>> is built. That's a bit confusing (at least it should be explained
>>> in a comment) but most importantly it seems a bit unnecessary.
>>
>> Ok, I'll add comment.
>>
>>> I assume it was done to save space, but I very much doubt that
>>> makes any difference. So we could easily keep a separate flag. I'm
>>> pretty sure there are holes in the SnapshotData struct, so we could
>>> squeeze it the flags in one of those.
>>
>> There's hole just right after xhlog. But it will be a bit strange to 
>> move fields around.
>>
> 
> Is SnapshotData really that sensitive to size increase? I have my doubts
> about that, TBH. The default stance should be to make the code easy to
> understand. That is, we should only move fields around if it actually
> makes a difference.
> 
>>> But do we even need a separate flag? We could just as easily keep 
>>> the log2 fields set to 0 and only set them after actually building 
>>> the hash table.
>>
>> There is a need to signal that there is space for hash. It is not
>> always allocated. iirc, I didn't cover the case where snapshot were
>> restored from file, and some other place or two.
>> Only if all places where snapshot is allocated are properly modified
>> to allocate space for hash, then flag could be omitted, and log2
>> itself used as a flag.
>>
> 
> Hmmm, that makes it a bit inconsistent, I guess ... why not to do the
> same thing on all those places?
> 
>>> Or even better, why not to store the mask so that XidInXip does not
>>> need to compute it over and over (granted, that's uint32 instead
>>> of just uint8, but I don't think SnapshotData is particularly
>>> sensitive to this, especially considering how much larger the xid
>>> hash table is).
>>
>> I don't like unnecessary sizeof struct increase. And I doubt that 
>> computation matters. I could be mistaken though, because it is hot
>> place. Do you think it will significantly improve readability?
>>
> 
> IMHO the primary goal is to make the code easy to read and understand,
> and only optimize struct size if it actually makes a difference. We have
> no such proof here, and I very much doubt you'll be able to show any
> difference because even with separate flags pahole says this:
> 
> struct SnapshotData {
> SnapshotSatisfiesFunc  satisfies;/* 0 8 */
> TransactionId  xmin; /* 8 4 */
> TransactionId  xmax; /*12 4 */
> TransactionId *xip;  /*16 8 */
> uint32 xcnt; /*24 4 */
> uint8  xhlog;

Re: [HACKERS] Lazy hash table for XidInMVCCSnapshot (helps Zipfian a bit)

2018-03-16 Thread Yura Sokolov
16.03.2018 04:23, Tomas Vondra wrote:
> 
> 
> On 03/10/2018 03:11 AM, Yura Sokolov wrote:
>> 08.03.2018 03:42, Tomas Vondra wrote:
>>> On 03/06/2018 06:23 AM, Yura Sokolov wrote:
>>>> 05.03.2018 18:00, Tom Lane wrote:
>>>>> Tomas Vondra <tomas.von...@2ndquadrant.com> writes:
>>>>>> Snapshots are static (we don't really add new XIDs into existing ones,
>>>>>> right?), so why don't we simply sort the XIDs and then use bsearch to
>>>>>> lookup values? That should fix the linear search, without need for any
>>>>>> local hash table.
>>>>>
>>>>> +1 for looking into that, since it would avoid adding any complication
>>>>> to snapshot copying.  In principle, with enough XIDs in the snap, an
>>>>> O(1) hash probe would be quicker than an O(log N) bsearch ... but I'm
>>>>> dubious that we are often in the range where that would matter.
>>>>> We do need to worry about the cost of snapshot copying, too.
>>>>
>>>> Sorting - is the first thing I've tried. Unfortunately, sorting
>>>> itself eats too much cpu. Filling hash table is much faster.
>>>>
>>>
>>> I've been interested in the sort-based approach, so I've spent a bit of
>>> time hacking on it (patch attached). It's much less invasive compared to
>>> the hash-table, but Yura is right the hashtable gives better results.
>>>
>>> I've tried to measure the benefits using the same script I shared on
>>> Tuesday, but I kept getting really strange numbers. In fact, I've been
>>> unable to even reproduce the results I shared back then. And after a bit
>>> of head-scratching I think the script is useless - it can't possibly
>>> generate snapshots with many XIDs because all the update threads sleep
>>> for exactly the same time. And first one to sleep is first one to wake
>>> up and commit, so most of the other XIDs are above xmax (and thus not
>>> included in the snapshot). I have no idea why I got the numbers :-/
>>>
>>> But with this insight, I quickly modified the script to also consume
>>> XIDs by another thread (which simply calls txid_current). With that I'm
>>> getting snapshots with as many XIDs as there are UPDATE clients (this
>>> time I actually checked that using gdb).
>>>
>>> And for a 60-second run the tps results look like this (see the attached
>>> chart, showing the same data):
>>>
>>> writers   master   hash   sort
>>> -------------------------------
>>>      16     1068   1057   1070
>>>      32     1005    995   1033
>>>      64     1064   1042   1117
>>>     128     1058   1021   1051
>>>     256      977   1004    928
>>>     512      773    935    808
>>>     768      576    815    670
>>>    1000      413    752    573
>>>
>>> The sort certainly does improve performance compared to master, but it's
>>> only about half of the hashtable improvement.
>>>
>>> I don't know how much this simple workload resembles the YCSB benchmark,
>>> but I seem to recall it's touching individual rows. In which case it's
>>> likely worse due to the pg_qsort being more expensive than building the
>>> hash table.
>>>
>>> So I agree with Yura the sort is not a viable alternative to the hash
>>> table, in this case.
>>>
>>>> Can I rely on snapshots being static? May be then there is no need
>>>> in separate raw representation and hash table. I may try to fill hash
>>>> table directly in GetSnapshotData instead of lazy filling. Though it
>>>> will increase time under lock, so it is debatable and should be
>>>> carefully measured.
>>>>
>>>
>>> Yes, I think you can rely on snapshots not being modified later. That's
>>> pretty much the definition of a snapshot.
>>>
>>> You may do that in GetSnapshotData, but you certainly can't do that in
>>> the for loop there. Not only we don't want to add more work there, but
>>> you don't know the number of XIDs in that loop. And we don't want to
>>> build hash tables for small number of XIDs.
>>>
>>> One reason against building the hash table in GetSnapshotData is that
>>> we'd build it even when the snapshot is never queried. Or when it is
>>> queried, but we only need to check xmin/

Re: [HACKERS] Lazy hash table for XidInMVCCSnapshot (helps Zipfian a bit)

2018-03-09 Thread Yura Sokolov
08.03.2018 03:42, Tomas Vondra wrote:
> On 03/06/2018 06:23 AM, Yura Sokolov wrote:
>> 05.03.2018 18:00, Tom Lane wrote:
>>> Tomas Vondra <tomas.von...@2ndquadrant.com> writes:
>>>> Snapshots are static (we don't really add new XIDs into existing ones,
>>>> right?), so why don't we simply sort the XIDs and then use bsearch to
>>>> lookup values? That should fix the linear search, without need for any
>>>> local hash table.
>>>
>>> +1 for looking into that, since it would avoid adding any complication
>>> to snapshot copying.  In principle, with enough XIDs in the snap, an
>>> O(1) hash probe would be quicker than an O(log N) bsearch ... but I'm
>>> dubious that we are often in the range where that would matter.
>>> We do need to worry about the cost of snapshot copying, too.
>>
>> Sorting - is the first thing I've tried. Unfortunately, sorting
>> itself eats too much cpu. Filling hash table is much faster.
>>
> 
> I've been interested in the sort-based approach, so I've spent a bit of
> time hacking on it (patch attached). It's much less invasive compared to
> the hash-table, but Yura is right the hashtable gives better results.
> 
> I've tried to measure the benefits using the same script I shared on
> Tuesday, but I kept getting really strange numbers. In fact, I've been
> unable to even reproduce the results I shared back then. And after a bit
> of head-scratching I think the script is useless - it can't possibly
> generate snapshots with many XIDs because all the update threads sleep
> for exactly the same time. And first one to sleep is first one to wake
> up and commit, so most of the other XIDs are above xmax (and thus not
> included in the snapshot). I have no idea why I got the numbers :-/
> 
> But with this insight, I quickly modified the script to also consume
> XIDs by another thread (which simply calls txid_current). With that I'm
> getting snapshots with as many XIDs as there are UPDATE clients (this
> time I actually checked that using gdb).
> 
> And for a 60-second run the tps results look like this (see the attached
> chart, showing the same data):
> 
> writers   master   hash   sort
> -------------------------------
>      16     1068   1057   1070
>      32     1005    995   1033
>      64     1064   1042   1117
>     128     1058   1021   1051
>     256      977   1004    928
>     512      773    935    808
>     768      576    815    670
>    1000      413    752    573
> 
> The sort certainly does improve performance compared to master, but it's
> only about half of the hashtable improvement.
> 
> I don't know how much this simple workload resembles the YCSB benchmark,
> but I seem to recall it's touching individual rows. In which case it's
> likely worse due to the pg_qsort being more expensive than building the
> hash table.
> 
> So I agree with Yura the sort is not a viable alternative to the hash
> table, in this case.
> 
>> Can I rely on snapshots being static? May be then there is no need
>> in separate raw representation and hash table. I may try to fill hash
>> table directly in GetSnapshotData instead of lazy filling. Though it
>> will increase time under lock, so it is debatable and should be
>> carefully measured.
>>
> 
> Yes, I think you can rely on snapshots not being modified later. That's
> pretty much the definition of a snapshot.
> 
> You may do that in GetSnapshotData, but you certainly can't do that in
> the for loop there. Not only we don't want to add more work there, but
> you don't know the number of XIDs in that loop. And we don't want to
> build hash tables for small number of XIDs.
> 
> One reason against building the hash table in GetSnapshotData is that
> we'd build it even when the snapshot is never queried. Or when it is
> queried, but we only need to check xmin/xmax.

Thank you for the analysis, Tomas.

Stephen is right about the bug in snapmgr.c.
The attached version fixes the bug and also simplifies XidInXip a bit.

With regards,
Sokolov Yura.
From 8484ae3f2c2f8af20ae8ce2f6d7960b6519e65c0 Mon Sep 17 00:00:00 2001
From: Sokolov Yura aka funny_falcon <funny.fal...@gmail.com>
Date: Fri, 9 Mar 2018 22:49:01 +0300
Subject: [PATCH] Make hash table for xip in XidInMVCCSnapshot

When a lot of concurrent transactions attempt to update a single
row, the linear scan through the running-transactions list in
XidInMVCCSnapshot becomes a noticeable bottleneck.

With this change, a hash table is built on the first search of an
xid in the snapshot->xip and snapshot->subxip arrays.

If size of array is smaller than 60, then lin

Re: [HACKERS] Lazy hash table for XidInMVCCSnapshot (helps Zipfian a bit)

2018-03-05 Thread Yura Sokolov
05.03.2018 18:00, Tom Lane wrote:
> Tomas Vondra <tomas.von...@2ndquadrant.com> writes:
>> Snapshots are static (we don't really add new XIDs into existing ones,
>> right?), so why don't we simply sort the XIDs and then use bsearch to
>> lookup values? That should fix the linear search, without need for any
>> local hash table.
> 
> +1 for looking into that, since it would avoid adding any complication
> to snapshot copying.  In principle, with enough XIDs in the snap, an
> O(1) hash probe would be quicker than an O(log N) bsearch ... but I'm
> dubious that we are often in the range where that would matter.
> We do need to worry about the cost of snapshot copying, too.

Sorting is the first thing I tried. Unfortunately, the sorting itself
eats too much CPU. Filling a hash table is much faster.

Can I rely on snapshots being static? Maybe then there is no need for a
separate raw representation plus a hash table. I may try to fill the hash
table directly in GetSnapshotData instead of filling it lazily, though
that would increase the time spent under the lock, so it is debatable and
should be carefully measured.
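To make that trade-off concrete, a hedged, self-contained sketch of the
lazy variant (linear scan for small arrays, hash built only on the first
lookup); names, the cutoff, and the open-addressing details are
illustrative, not the attached patch, and it assumes xid 0 never appears
in xip so it can mark empty slots:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    typedef uint32_t TransactionId;

    typedef struct SketchSnapshot
    {
        TransactionId  *xip;        /* running xids, as in SnapshotData */
        uint32_t        xcnt;
        TransactionId  *xip_hash;   /* open-addressing table, lazily built */
        uint32_t        hash_size;  /* power of two, 0 until built */
    } SketchSnapshot;

    static void
    build_xip_hash(SketchSnapshot *snap)
    {
        uint32_t    size = 2;

        while (size < snap->xcnt * 2)       /* keep load factor below 50% */
            size <<= 1;
        snap->xip_hash = calloc(size, sizeof(TransactionId));
        snap->hash_size = size;
        for (uint32_t i = 0; i < snap->xcnt; i++)
        {
            uint32_t    h = (snap->xip[i] * 2654435761u) & (size - 1);

            while (snap->xip_hash[h] != 0)      /* 0 marks an empty slot */
                h = (h + 1) & (size - 1);
            snap->xip_hash[h] = snap->xip[i];
        }
    }

    static bool
    xid_in_xip(SketchSnapshot *snap, TransactionId xid)
    {
        if (snap->xcnt < 60)                /* small arrays: scan linearly */
        {
            for (uint32_t i = 0; i < snap->xcnt; i++)
                if (snap->xip[i] == xid)
                    return true;
            return false;
        }
        if (snap->hash_size == 0)           /* built on first lookup only */
            build_xip_hash(snap);
        for (uint32_t h = (xid * 2654435761u) & (snap->hash_size - 1);
             snap->xip_hash[h] != 0;
             h = (h + 1) & (snap->hash_size - 1))
            if (snap->xip_hash[h] == xid)
                return true;
        return false;
    }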

I'll look at the bug in CopySnapshot soon. Excuse me for the delay.

With regards,
Sokolov Yura



signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] Small improvement to compactify_tuples

2018-03-04 Thread Yura Sokolov
01.03.2018 22:22, Andres Freund wrote:
> Hi,
> 
> On 2018-02-25 21:39:46 +0300, Yura Sokolov wrote:
>>> If that's the case then does it really make sense to make this change..?
>>
>> I don't think it is really necessary to implement generic version
>> through templated.
> 
> Why?

It is better to replace uses of the generic version with the templated one
in the appropriate places.
The generic version operates on a variable element size, which would be
difficult to express through a template.
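As a rough illustration of the difference (not the attached patch): a
"template" in C is a macro that fixes the element type and comparator at
compile time, so both can be inlined, whereas the generic qsort receives
the element size at run time and swaps bytes. The body below is a trivial
insertion sort only to keep the example short:

    #include <stddef.h>
    #include <stdint.h>

    #define DEFINE_TYPED_SORT(NAME, TYPE, LESS)                 \
    static void NAME(TYPE *a, size_t n)                         \
    {                                                           \
        for (size_t i = 1; i < n; i++)                          \
        {                                                       \
            TYPE    v = a[i];                                   \
            size_t  j = i;                                      \
                                                                \
            while (j > 0 && LESS(v, a[j - 1]))                  \
            {                                                   \
                a[j] = a[j - 1];                                \
                j--;                                            \
            }                                                   \
            a[j] = v;                                           \
        }                                                       \
    }

    /* e.g. 16-bit item offsets, largest first */
    #define OFF_DESC(x, y) ((x) > (y))
    DEFINE_TYPED_SORT(sort_offsets_desc, uint16_t, OFF_DESC)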

> 
>> Updated numbers are (same benchmark on same notebook, but with new
>> master, new ubuntu and later patch version) (average among 6 runs):
>>
>> master   - 16135tps
>> with templated qsort - 16199tps
>> with bucket sort - 16956tps
>>
>> Difference is still measurable, but less significant. I don't know why.
>>
>> Rebased version of first patch (qsorted tamplate) is in atttach.
> 
> Hm, that's a bit underwhelming. It's nice to deduplicate, but 16135tps
> -> 16199tps is barely statistically significant?

I mean the bucket sort is measurably faster than both the generic and the
templated sort (16956 tps vs. 16199 and 16135). So the initial goal
remains: to add a bucket sort in this place.
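For readers who wonder what the bucket sort amounts to here, a hedged,
self-contained sketch of the general idea (an illustration of the
technique, not the attached patch): line-pointer offsets are bounded by
the page size, so one counting pass can place them into a fixed number of
buckets in descending order, and only the short per-bucket runs still need
a comparison sort.

    #include <stdint.h>
    #include <string.h>

    #define SKETCH_BLCKSZ   8192
    #define NBUCKETS        256                 /* 32-byte-wide buckets */
    #define BUCKET_OF(off)  ((off) / (SKETCH_BLCKSZ / NBUCKETS))

    typedef struct
    {
        uint16_t    offset;     /* tuple start offset within the page */
        uint16_t    len;
    } SketchItem;

    static void
    bucket_order_desc(SketchItem *items, int nitems, SketchItem *scratch)
    {
        int     counts[NBUCKETS] = {0};
        int     starts[NBUCKETS];
        int     i, pos;

        for (i = 0; i < nitems; i++)
            counts[BUCKET_OF(items[i].offset)]++;

        pos = 0;
        for (i = NBUCKETS - 1; i >= 0; i--)     /* highest bucket first */
        {
            starts[i] = pos;
            pos += counts[i];
        }

        for (i = 0; i < nitems; i++)
            scratch[starts[BUCKET_OF(items[i].offset)]++] = items[i];

        memcpy(items, scratch, nitems * sizeof(SketchItem));
        /* items within a single bucket still need a small insertion sort */
    }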

BTW, I have a small change to the templated version that improves sorting
of random tuples a bit (1-1.5%). I will post it a bit later with a test.

> - Andres
> 

With regards,
Yura.



signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] Small improvement to compactify_tuples

2018-02-25 Thread Yura Sokolov
23.01.2018 06:34, Stephen Frost wrote:
> Greetings,
> 
> * Юрий Соколов (funny.fal...@gmail.com) wrote:
>> On Wed, Nov 29, 2017 at 8:00 AM, Peter Geoghegan <p...@bowt.ie> wrote:
>>> On Tue, Nov 28, 2017 at 2:41 PM, Andres Freund <and...@anarazel.de> wrote:
>>>> Maybe it's a stupid question. But would we still want to have this after
>>>> the change? These should be just specializations of the template version
>>>> imo.
>>
>> "generic" version operates on bytes, and it will be a bit hard to combine
>> it with
>> templated version. Not impossible, but it will look ugly.
> 
> If that's the case then does it really make sense to make this change..?

I don't think it is really necessary to implement the generic version
through the templated one. It is much better to replace the generic
version with the templated one in the places where it matters for
performance.

> 
>> In attach fixed qsort_template version.
>> And version for compactify_tuples with bucket_sort and templated qsort.
> 
> While having the patch is handy, I'm not seeing any performance numbers
> on this version, and I imagine others watching this thread are also
> wondering about things like a test run that just uses the specialized
> qsort_itemIds() without the bucketsort.
> 
> Are you planning to post some updated numbers and/or an updated test
> case that hopefully shows best/worst case with this change?  Would be
> good to get that on a couple of platforms too, if possible, since we've
> seen that the original benchmarks weren't able to be consistently
> repeated across different platforms.  Without someone doing that
> leg-work, this doesn't seem like it'll be moving forward.

Updated numbers (same benchmark on the same notebook, but with a new
master, a new Ubuntu, and a later patch version; average over 6 runs):

master               - 16135 tps
with templated qsort - 16199 tps
with bucket sort     - 16956 tps

The difference is still measurable, but less significant. I don't know why.

A rebased version of the first patch (qsort template) is attached.

With regards,
Sokolov Yura.


0001-Header-for-customized-qsort.patch.gz
Description: application/gzip


signature.asc
Description: OpenPGP digital signature

