Re: Converting plpgsql to use DTYPE_REC for named composite types

2017-12-30 Thread Pavel Stehule
2017-12-30 0:16 GMT+01:00 Tom Lane :

> I wrote:
> > I'll stick this into the January commitfest, but I'd like to get it
> > reviewed and committed pretty soon, because there are follow-on patches
> > that need to get done in time for v11 --- in particular, we need to close
> > out the lack of plpgsql support for domains-over-composite.
>
> I hacked on the domain-support problem a bit and it worked out well,
> so attached is a revised patch that incorporates that.  This caused
> me to revise some assumptions about what expandedrecord.c's internal
> invariants ought to be, so it's probably better to look at this as
> a new patch rather than a diff from v1.
>
> Mostly this is changes internal to expandedrecord.c, but I cleaned up
> some code in plpgsql that tests for domain-ness, and I also added support
> in ExecEvalFieldSelect for extracting a field directly from an expanded
> record without flattening it into a tuple first.  It hadn't been clear
> that that was worth the trouble before, but it definitely does come up
> while applying domain constraints.  For example, having that fast path
> makes about a 2X speed difference in a test like this:
>
> create type two_ints as (f1 int, f2 int);
> create domain ordered_ints as two_ints check((value).f1 <= (value).f2);
>
> \timing on
>
> do $$
> declare d ordered_ints;
> begin
>   for i in 1..300 loop
> d.f2 := i;
> d.f1 := i;
>   end loop;
> end$$;
>
>
> There are still a couple of soft spots having to do with enforcing
> domain constraints against null composite values, e.g. if there's
> a constraint that would reject a null value we don't notice that
> at DECLARE time.  I think there's not much point in being strict
> about that until we have support for initializers for composite
> variables.  Which is definitely worth doing but it seems like it
> could be a separate patch.
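>
> For instance (a sketch building on the example above; the domain name is
> just illustrative), declaring a variable of a domain whose check rejects
> nulls currently succeeds without complaint:
>
> create domain nn_two_ints as two_ints check(value is not null);
>
> do $$
> declare d nn_two_ints;  -- d starts out null, yet no error is raised here
> begin
>   raise notice 'declared without error despite the not-null check';
> end$$;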
>
> The patches in <11986.1514407...@sss.pgh.pa.us> still apply over this
> with just some line-number offsets, so I won't post a new version
> of those.
>
>
All tests passed on my machine too.

I think these patches can be useful for schema variables too.

Regards

Pavel


> regards, tom lane
>
>


Re: Why standby restores some WALs many times from archive?

2017-12-30 Thread Sergey Burladyan
Michael Paquier  writes:

> On Sat, Dec 30, 2017 at 04:30:07AM +0300, Sergey Burladyan wrote:
> > We use this scripts:
> > https://github.com/avito-tech/dba-utils/tree/master/pg_archive
> > 
> > But I can reproduce problem with simple cp & mv:
> > archive_command:
> >   test ! -f /var/lib/postgresql/wals/%f && \
> >   test ! -f /var/lib/postgresql/wals/%f.tmp && \
> >   cp %p /var/lib/postgresql/wals/%f.tmp && \
> >   mv /var/lib/postgresql/wals/%f.tmp /var/lib/postgresql/wals/%f
>
> This is unsafe. PostgreSQL expects the WAL segment archived to be
> flushed to disk once the archive command has returned its result to the
> backend.

Yes, you are right, thank you for pointing that out! I uploaded a new
version with sync to GitHub.

> Don't be surprised if you get corrupted instances or that you
> are not able to recover up to a consistent point if you need to roll in
> a backup.

But only if the archive host was rebooted unexpectedly, am I right?

-- 
Sergey Burladyan



Faster inserts with mostly-monotonically increasing values

2017-12-30 Thread Pavan Deolasee
Hello All,

On a per-session basis, we cache the last heap block used for inserts and
try to use the same block for subsequent inserts. We don't do that for
indexes because the target block in the index is determined by the overall
structure of the index and the index key being inserted and hence caching
is usually not very useful. But for certain common usage patterns we can
make a similar optimisation. For example, consider a btree index on a
"timestamp" column or on a "serial" column. In such cases, each new index
entry typically goes beyond the rightmost entry in the index. If we cache
the rightmost block and check that first, we can avoid the cost of
descending and searching the index.
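
A btree on an always-increasing key is the canonical case (schema is mine,
for illustration):

create table events (
    id      bigserial,   -- monotonically increasing key
    payload text
);
create index events_id_idx on events (id);
-- each new id sorts after the current rightmost index entry, so a cached
-- rightmost block would be the correct target almost every time
insert into events (payload) select 'x' from generate_series(1, 1000);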

Here is a patch that implements the idea. If the last insert happens to be
in the rightmost block of an index, then we cache the block and check that
first for the next insert. For the cached block to be usable for the
insert, it must still be the rightmost block, the to-be-inserted key must
belong in the cached block, and the block must have enough free space. If
any of these conditions is not met, we fall back to the regular code path,
i.e. descending the index from the top.

I benchmarked the patch by creating a simple table with just one bigint
column and a btree index on it. Here are the results:

Monotonically Increasing 10M Inserts (time in ms)
=================================================
Run    Patched       Master
1      17656.222     25737.593
2      18765.002     26843.314
3      20629.567     27058.358
4      21174.998     26680.003
5      21118.865     26660.557

Avg    19868.9308    26595.965    (33.86% improvement)


Monotonically Increasing 100M Inserts (time in ms)
==================================================
Run    Patched       Master
1      159861.58     248885.763
2      160138.432    256438.625
3      160426.135    250379.589
4      163218.405    249474.695
5      161125.523    247805.675

Avg    160954.015    250596.8694    (55.69% improvement)


So as the size of the index increases, the benefits of the patch also tend
to increase. This is most likely because as the index becomes taller, the
costs associated with the index descent become higher.

Patch credit: this work is based on Simon Riggs's original ideas and
research.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


pg_btree_target_block_v1.patch
Description: Binary data


Re: [HACKERS] Commits don't block for synchronous replication

2017-12-30 Thread Masahiko Sawada
On Sat, Dec 30, 2017 at 9:25 PM, Michael Paquier wrote:
> On Fri, Dec 29, 2017 at 02:45:09PM +, Simon Riggs wrote:
>> Applied, thanks.

Thank you!

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Better testing coverage and unified coding for plpgsql loops

2017-12-30 Thread Tom Lane
While I've been fooling around with plpgsql, I've been paying close
attention to code coverage reports to make sure that the regression tests
exercise all that new code.  It started to bug me that there were some
serious gaps in the test coverage for existing code in pl_exec.c.
One thing I noticed in particular was that coverage for the
PLPGSQL_RC_EXIT/PLPGSQL_RC_RETURN/PLPGSQL_RC_CONTINUE management code
in the various looping statements was nearly nonexistent, and coverage
for integer FOR loops was pretty awful too (eg, not a single BY clause
in the whole test corpus :-().  So I said to myself "let's fix that",
and wrote up a new regression test file plpgsql_control.sql with a
charter to test plpgsql's control structures.  I moved a few existing
tests that seemed to fall into that charter into the new file, and
added tests until I was happier with the state of the coverage report.
The result is attached as plpgsql-better-code-coverage.patch.
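
For instance, one of the previously-untested paths is CONTINUE with a label
propagating out of an inner loop; a simplified version of such a test looks
like this (the exact tests are in the patch):

do $$
begin
  <<outer>>
  for i in 1..3 loop
    for j in 1..3 loop
      continue outer when j = 2;  -- exercises RC_CONTINUE across loop levels
      raise notice 'i = %, j = %', i, j;
    end loop;
  end loop;
end$$;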

However, while I was doing that, it seemed like the tests I was adding
were mighty repetitive, as many of them were just exactly the same thing
adjusted for a different kind of loop statement.  And so I began to wonder
why it was that we had five copies of the RC_FOO management logic, no two
quite alike.  If we only had *one* copy then it would not seem necessary
to have such duplicative test cases for it.  A bit of hacking later, and
I had the management logic expressed as a macro, with only one copy for
all five kinds of loop.  I verified it still passes the previous set of
tests and then removed the ones that seemed redundant, yielding
plpgsql-unify-loop-rc-code.patch below.  So what I propose actually
committing is the combination of these two patches.

Anyone feel like reviewing this, or should I just push it?  It seems
pretty noncontroversial to me, but maybe I'm wrong about that.

regards, tom lane

diff --git a/src/pl/plpgsql/src/Makefile b/src/pl/plpgsql/src/Makefile
index 76ac247..14a4d83 100644
*** a/src/pl/plpgsql/src/Makefile
--- b/src/pl/plpgsql/src/Makefile
*** DATA = plpgsql.control plpgsql--1.0.sql 
*** 26,32 
  
  REGRESS_OPTS = --dbname=$(PL_TESTDB)
  
! REGRESS = plpgsql_call
  
  all: all-lib
  
--- 26,32 
  
  REGRESS_OPTS = --dbname=$(PL_TESTDB)
  
! REGRESS = plpgsql_call plpgsql_control
  
  all: all-lib
  
diff --git a/src/pl/plpgsql/src/expected/plpgsql_control.out b/src/pl/plpgsql/src/expected/plpgsql_control.out
index ...b7089e3 .
*** a/src/pl/plpgsql/src/expected/plpgsql_control.out
--- b/src/pl/plpgsql/src/expected/plpgsql_control.out
***
*** 0 
--- 1,833 
+ --
+ -- Tests for PL/pgSQL control structures
+ --
+ -- integer FOR loop
+ do $$
+ begin
+   -- basic case
+   for i in 1..3 loop
+ raise notice '1..3: i = %', i;
+   end loop;
+   -- with BY, end matches exactly
+   for i in 1..10 by 3 loop
+ raise notice '1..10 by 3: i = %', i;
+   end loop;
+   -- with BY, end does not match
+   for i in 1..11 by 3 loop
+ raise notice '1..11 by 3: i = %', i;
+   end loop;
+   -- zero iterations
+   for i in 1..0 by 3 loop
+ raise notice '1..0 by 3: i = %', i;
+   end loop;
+   -- REVERSE
+   for i in reverse 10..0 by 3 loop
+ raise notice 'reverse 10..0 by 3: i = %', i;
+   end loop;
+   -- potential overflow
+   for i in 2147483620..2147483647 by 10 loop
+ raise notice '2147483620..2147483647 by 10: i = %', i;
+   end loop;
+   -- potential overflow, reverse direction
+   for i in reverse -2147483620..-2147483647 by 10 loop
+ raise notice 'reverse -2147483620..-2147483647 by 10: i = %', i;
+   end loop;
+ end$$;
+ NOTICE:  1..3: i = 1
+ NOTICE:  1..3: i = 2
+ NOTICE:  1..3: i = 3
+ NOTICE:  1..10 by 3: i = 1
+ NOTICE:  1..10 by 3: i = 4
+ NOTICE:  1..10 by 3: i = 7
+ NOTICE:  1..10 by 3: i = 10
+ NOTICE:  1..11 by 3: i = 1
+ NOTICE:  1..11 by 3: i = 4
+ NOTICE:  1..11 by 3: i = 7
+ NOTICE:  1..11 by 3: i = 10
+ NOTICE:  reverse 10..0 by 3: i = 10
+ NOTICE:  reverse 10..0 by 3: i = 7
+ NOTICE:  reverse 10..0 by 3: i = 4
+ NOTICE:  reverse 10..0 by 3: i = 1
+ NOTICE:  2147483620..2147483647 by 10: i = 2147483620
+ NOTICE:  2147483620..2147483647 by 10: i = 2147483630
+ NOTICE:  2147483620..2147483647 by 10: i = 2147483640
+ NOTICE:  reverse -2147483620..-2147483647 by 10: i = -2147483620
+ NOTICE:  reverse -2147483620..-2147483647 by 10: i = -2147483630
+ NOTICE:  reverse -2147483620..-2147483647 by 10: i = -2147483640
+ -- BY can't be zero or negative
+ do $$
+ begin
+   for i in 1..3 by 0 loop
+ raise notice '1..3 by 0: i = %', i;
+   end loop;
+ end$$;
+ ERROR:  BY value of FOR loop must be greater than zero
+ CONTEXT:  PL/pgSQL function inline_code_block line 3 at FOR with integer loop variable
+ do $$
+ begin
+   for i in 1..3 by -1 loop
+ raise notice '1..3 by -1: i = %', i;
+   end loop;
+ end$$;
+ ERROR:  BY value of FOR loop must be greater than zero
+ CONTEXT:  PL/pgSQL function inline_code_block line 3 at FOR with integer loop variable

Re: What does Time.MAX_VALUE actually represent?

2017-12-30 Thread Gavin Flower

On 12/31/2017 03:07 AM, Dave Cramer wrote:
We are having a discussion on the jdbc project about dealing with 
24:00:00.


https://github.com/pgjdbc/pgjdbc/pull/992#issuecomment-354507612

Dave Cramer


In Dublin (I was there 2001 to 2004), timetables show buses just after 
midnight, such as 1:20am, as running at the time 2520 - so they are 
visible close to the end of the day.  If you are looking for buses 
around midnight this is very user friendly - better than looking at the 
other end of the timetable for 0120.


I think logically that 24:00:00 is exactly one day later than 00:00:00 - 
but I see from following the URL that there are other complications...
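
For what it's worth, the server itself accepts 24:00:00 and treats it as
distinct from midnight (quick checks of mine):

select '24:00:00'::time;                     -- accepted: the maximum time value
select '24:00:00'::time = '00:00:00'::time;  -- false: they are distinct
select '2017-12-31 24:00:00'::timestamp;     -- normalized to 2018-01-01 00:00:00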



Cheers,
Gavin




Re: [HACKERS] Re: [HACKERS] generated columns

2017-12-30 Thread Joe Conway
On 12/27/2017 09:31 AM, Peter Eisentraut wrote:
> On 9/12/17 15:35, Jaime Casanova wrote:
>> On 10 September 2017 at 00:08, Jaime Casanova
>>  wrote:
>>>
>>> During my own tests, though, I found some problems:
> 
> Here is an updated patch that should address the problems you have found.

In the commit message it says:

  "The plan to is implement two kinds of generated columns:
   virtual (computed on read) and stored (computed on write).  This
   patch only implements the virtual kind, leaving stubs to implement
   the stored kind later."

and in the patch itself:

+
+ The generation expression can refer to other columns in the table, but
+ not other generated columns.  Any functions and operators used must be
+ immutable.  References to other tables are not allowed.
+

Question -- when the "stored" kind of generated column is implemented,
will the immutable restriction be relaxed? I would like, for example, to
be able to have a stored generated column that executes now() whenever the
row is written/rewritten.
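
That is, roughly this (hypothetical, borrowing the patch's syntax; exactly
what the immutable restriction currently forbids):

create table audit_demo (
  data         text,
  -- hypothetical stored generated column capturing the write time
  last_written timestamptz generated always as (now()) stored
);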

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



signature.asc
Description: OpenPGP digital signature


Re: TAP test module - PostgresClient

2017-12-30 Thread Andrew Dunstan


On 12/30/2017 10:45 AM, Tom Lane wrote:
> Andrew Dunstan  writes:
>> As for out-dating, if we used DBD::PgPP we'd not be in great danger
>> there - it doesn't appear to have changed for many years - latest
>> version is dated 2010. If we were to use it we'd have a dependency on
>> DBI, but that in itself doesn't seem a great burden.
> [ blowing the dust off my old red fedora... ]  Actually, there's a
> different problem with this proposal: you can bet that DBD::Pg has got a
> build dependency on Postgres.  If Postgres starts to depend on DBD::Pg
> then we've created circular-dependency hell for packagers.  


The Pure Perl driver has no such dependency, since it doesn't require
libpq. But ...

> I much prefer the other line of thought about doing whatever we need
> to do to make psql workable for the desired type of tests. 


... agreed ...

>  Or just
> write a bespoke testing tool.
>
>   

... that's pretty much where we came in.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: TAP test module - PostgresClient

2017-12-30 Thread Tom Lane
Andrew Dunstan  writes:
> As for out-dating, if we used DBD::PgPP we'd not be in great danger
> there - it doesn't appear to have changed for many years - latest
> version is dated 2010. If we were to use it we'd have a dependency on
> DBI, but that in itself doesn't seem a great burden.

[ blowing the dust off my old red fedora... ]  Actually, there's a
different problem with this proposal: you can bet that DBD::Pg has got a
build dependency on Postgres.  If Postgres starts to depend on DBD::Pg
then we've created circular-dependency hell for packagers.  We could only
make that work if we carefully kept the DBD::Pg requirement *out* of
"make check" and anything else that a packager might care to run during
package sanity checks.  I suppose maybe we could live with a restriction
like that, if we treat this like the SSL tests as something that doesn't
get run except by special manual invocation --- but that'd reduce its
utility greatly don't you think?  And I fear there would be quite a risk
of somebody breaking the restriction because they weren't thinking about
it.  I note that there are no buildfarm members running any distro
packaging script, so we wouldn't find out about unintended-dependency bugs
until packagers were trying to build a release.

I much prefer the other line of thought about doing whatever we need
to do to make psql workable for the desired type of tests.  Or just
write a bespoke testing tool.

regards, tom lane



Re: [HACKERS] taking stdbool.h into use

2017-12-30 Thread Tom Lane
Michael Paquier  writes:
> So, looking at 0001 now... Shouldn't there be a DatumGetBool8(), with
> the existing DatumGetBool() which should depend on the size of bool? I
> can see that all the catalogs are correctly updated with bool8 in the
> patch.

Surely bool and bool8 should have identical Datum representations,
so I'm not sure they need different DatumGet/GetDatum macros.

Although this opens up another point: just above those macros,
postgres.h says

 * When a type narrower than Datum is stored in a Datum, we place it in the
 * low-order bits and are careful that the DatumGetXXX macro for it discards
 * the unused high-order bits (as opposed to, say, assuming they are zero).
 * This is needed to support old-style user-defined functions, since depending
 * on architecture and compiler, the return value of a function returning char
 * or short may contain garbage when called as if it returned Datum.

Since we flushed support for V0 functions, the stated rationale doesn't
apply anymore.  I wonder whether there is anything to be gained by
changing DatumGetXXX and XXXGetDatum to be simple casts (as, if memory
serves, they once were until we noticed the stated hazard).  Or, if
there's still a reason to keep the masking steps in place, we'd better
update this comment.

regards, tom lane



Re: TAP test module - PostgresClient

2017-12-30 Thread Andrew Dunstan


On 12/29/2017 08:12 AM, Tels wrote:
> On Thu, December 28, 2017 10:14 pm, Tom Lane wrote:
>> Craig Ringer  writes:
>>> Another option might be to teach the TAP infrastructure and the
>>> buildfarm
>>> client how to fetch cpanminus and build DBD::Pg against our build-tree,
>>> so
>>> we could use Perl DBI.
>> As a buildfarm owner, I'd take a *really* dim view of the buildfarm
>> trying to auto-fetch code off the internet.  As a developer, the
>> idea that "make check-world" would try to do that is even scarier.
>> Buildfarm owners are likely to have taken at least some steps to
>> sandbox their builds, but developers not so much.
>>
>> I do not think we should even think about going there.
> Well, while I couldn't agree more on the "running code from the internet
> is dangerous" thing, there are a few points for it, tho:
>
> * if you use Perl modules on your system, you are likely doing that already,
> anyway, as the Perl modules come, you guessed it, from the internet :)
> Just because a random $PackageMaintainer signed it doesn't mean it is really
> safe.
>
> * And a lot of Perl modules are not in, say, Debian repositories, so you
> need to use CPAN (or re-invent everything). Unfortunately, the trend for
> other languages seems to go into the same direction, with Ruby gems, the
> python package manager, and almost everybody else re-inventing their own
> packaging system, often poorly. So you might already have fallen in the
> trap of "use random code from the internet". (Of course, that is not
> really an argument for doing it, too...)
>
> * the other option seems to be "re-invent the wheel, again, locally",
> which isn't always the best, either.
>
> I do agree, though, that having "setup" or "make check" auto-fetching stuff
> from the internet is not a good idea. Mostly because it becomes
> suddenly much harder to run in closed networks w/o access, and such
> side-loading installations can bypass your system's packaging system, which
> doesn't sound good, either.
>
> OTOH, it is possible to set up local repositories, or maybe even
> pre-bundling modules into some sort of "approved TAP bundle" hosted on an
> official server.
>
> The last resort would be to pre-bundle the wanted modules, but this has
> the risk of outdating them fast. Plus, pre-bundled modules are not more
> security vetted than the ones from the internet, so you might as well use
> the CPAN version directly.
>
> The best course seems to me to have dependencies on the OS packages for
> the Perl modules you want to use. Not sure, however, if the buildfarm
> client has "proper" Debian etc. packages and if it is even possible to add
> these dependencies in this way.
>


The buildfarm client isn't even packaged as a CPAN module let alone as a
bunch of OS-level packages (and if I supported Debian packaging I'd need
to support every other packaging system on the planet too, including
Windows).

It's always seemed to me unnecessary to use something beyond a simple
tarball for something that has a target of less than 100 tolerably savvy
users and which requires no actual build.

In any case, I agree with Craig that we'd be much better off spending
time working out why we can't get IPC::Run to do everything we want on
Windows.

As for out-dating, if we used DBD::PgPP we'd not be in great danger
there - it doesn't appear to have changed for many years - latest
version is dated 2010. If we were to use it we'd have a dependency on
DBI, but that in itself doesn't seem a great burden.


cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




What does Time.MAX_VALUE actually represent?

2017-12-30 Thread Dave Cramer
We are having a discussion on the jdbc project about dealing with 24:00:00.

https://github.com/pgjdbc/pgjdbc/pull/992#issuecomment-354507612

Dave Cramer


Re: [HACKERS] taking stdbool.h into use

2017-12-30 Thread Michael Paquier
On Sat, Dec 30, 2017 at 08:29:09AM +0900, Michael Paquier wrote:
> On Fri, Dec 29, 2017 at 12:33:24PM -0500, Tom Lane wrote:
>> It does make sense, probably, to push 0001-0003 first and see if
>> anything turns up from that, then 0004.
> 
> I have not looked at 0001 in details yet, which was going to be my next
> step. If you could wait for at least two days that would be nice to give
> me some room.

So, looking at 0001 now... Shouldn't there be a DatumGetBool8(), with
the existing DatumGetBool() which should depend on the size of bool? I
can see that all the catalogs are correctly updated with bool8 in the
patch.
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] Commits don't block for synchronous replication

2017-12-30 Thread Michael Paquier
On Fri, Dec 29, 2017 at 02:45:09PM +, Simon Riggs wrote:
> Applied, thanks.

Thanks, Simon.
--
Michael


signature.asc
Description: PGP signature


Re: Why standby restores some WALs many times from archive?

2017-12-30 Thread Michael Paquier
On Sat, Dec 30, 2017 at 04:30:07AM +0300, Sergey Burladyan wrote:
> We use this scripts:
> https://github.com/avito-tech/dba-utils/tree/master/pg_archive
> 
> But I can reproduce problem with simple cp & mv:
> archive_command:
>   test ! -f /var/lib/postgresql/wals/%f && \
>   test ! -f /var/lib/postgresql/wals/%f.tmp && \
>   cp %p /var/lib/postgresql/wals/%f.tmp && \
>   mv /var/lib/postgresql/wals/%f.tmp /var/lib/postgresql/wals/%f

This is unsafe. PostgreSQL expects the WAL segment archived to be
flushed to disk once the archive command has returned its result to the
backend. Don't be surprised if you get corrupted instances or that you
are not able to recover up to a consistent point if you need to roll in
a backup. Note that the PostgreSQL documentation provides a simple
example of archive_command which is itself bad enough that it should not be used.
--
Michael


signature.asc
Description: PGP signature


Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2017-12-30 Thread Simon Riggs
Comments in ReserveXLogInsertLocation() say
"* This is the performance critical part of XLogInsert that must be serialized
  * across backends. The rest can happen mostly in parallel. Try to keep this
  * section as short as possible, insertpos_lck can be heavily contended on a
  * busy system."

We've worked out a way of reducing contention during
ReserveXLogInsertLocation() by using atomic operations.

The mechanism requires that we remove the xl_prev field in each WAL
record. This also has the effect of reducing WAL volume by a few %.

Currently, we store the start location of the previous WAL record in
the xl_prev field of the WAL record header. Typically, redo recovery
is a forward moving process and hence we don't ever need to consult
xl_prev and read WAL backwards (there is one exception, more on that
later [1]). So in theory, we should be able to remove this field
completely without compromising any functionality or correctness.

But the presence of xl_prev field enables us to guard against torn WAL
pages, when a WAL record starts on a sector boundary. In case of a
torn page, even though the WAL page looks sane, the WAL record could
actually be a stale record retained from the older, recycled WAL file.
The system usually guards against this by comparing xl_prev field
stored in the WAL record header with the WAL location of the previous
record read. Any mismatch is treated as end-of-WAL-stream.

So we can't completely remove xl_prev field, without giving up some
functionality. But we don't really need to store the 8-byte previous
WAL pointer in order to detect torn pages. Something else which can
tell us that the WAL record does not belong to the current WAL segno
would be enough as well. I propose that we replace it with a much smaller
2-byte field (let's call it xl_walid). The "xl_walid" (or whatever we
decide to call it) is the low-order 16 bits of the WAL segno to which
the WAL record belongs. While reading WAL, we check that the "xl_walid"
value stored in each WAL record matches the current segno's low-order
16 bits, and if not, we treat that as the end of the stream.

For this to work, we must ensure that WAL files are either recycled in
such a way that the "xl_walid" of the previous (to be recycled) WAL
differs from the new WAL or we zero-out the new WAL file. Seems quite
easy to do with the existing infrastructure.
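
To put a number on the aliasing risk: the low-order 16 bits repeat only
every 2^16 segments, so with the default 16 MB segment size two segments
can share an "xl_walid" only if they are 1 TB of WAL apart (my arithmetic,
not from the patch):

-- 2^16 segments * 16 MB per segment between xl_walid collisions
select 65536 * 16 as mb_between_collisions,                    -- 1048576 MB
       65536 * 16 / (1024 * 1024.0) as tb_between_collisions;  -- 1.0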

Because of padding and alignment, replacing 8-byte xl_prev with 2-byte
xl_walid effectively reduces the WAL record header by a full 8-bytes
on a 64-bit machine. Obviously, this reduces the amount of WAL
produced and transferred to the standby. On a pgbench test, we see
about 3-5% reduction in WAL traffic, though in some tests higher -
depending on workload.

There is yet another important benefit of removing the xl_prev field.
We no longer need to track PrevBytePos field in XLogCtlInsert. The
insertpos_lck spinlock is now only guarding CurrBytePos. So we can
replace that with an atomic 64-bit integer and completely remove the
spinlock. The comment at the top of ReserveXLogInsertLocation()
clearly mentions the importance of keeping the critical section as
small as possible and this patch achieves that by using atomic
variables.

Pavan ran some micro-benchmarks to measure the effectiveness of the
approach. I (Pavan) wrote a wrapper on top of
ReserveXLogInsertLocation() and exposed that as a SQL-callable
function. I then used pgbench with 1-16 clients where each client
effectively calls ReserveXLogInsertLocation() 1M times. Following are
the results from the master and the patched code, averaged across 5
runs. The tests are done on a i2.2xlarge AWS instance.

HEAD
 1 ... 24.24 tps
 2 ... 18.12 tps
 4 ... 10.95 tps
 8 ...  9.05 tps
16 ... 8.44 tps

As you would notice, the spinlock contention is immediately evident
even when running with just 2 clients and gets worse with 4 or more
clients.

Patched
 1 ... 35.08 tps
 2 ... 31.99 tps
 4 ... 30.48 tps
 8 ... 40.44 tps
16 ... 50.14 tps

The patched code on the other hand scales to higher numbers of clients
much better.
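
For reference, the custom pgbench script described above amounts to a single
statement (the wrapper's name here is illustrative, not the actual function
name):

-- each transaction reserves WAL insert positions 1M times via the
-- SQL-callable wrapper around ReserveXLogInsertLocation()
select reserve_xlog_insert_location(1000000);
-- run as, for example: pgbench -n -f reserve.sql -c 16 -j 16 -T 60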

Those are microbenchmarks. You need to run a multi-CPU workload with
heavy WAL inserts to show benefits.

[1] pg_rewind is the only exception which uses xl_prev to find the
previous checkpoint record. But we can always start from the beginning
of the WAL segment and read forward until we find the checkpoint
record. The patch does just the same and passes pg_rewind's tap tests.

Patch credit: Simon Riggs and Pavan Deolasee

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


pg_wal_header_reduction_v1.patch
Description: Binary data


Re: New gist vacuum.

2017-12-30 Thread Andrey Borodin
Hi!

> 28 Dec 2017, at 16:37, Andrey Borodin wrote:
> Here is a new version of the patch for GiST VACUUM.
> There are two main changes:
> 1. During the rescan for page deletion, only pages known to be recently
> empty are rescanned.
> 2. I've re-implemented the physical scan with an array instead of a hash table.

There is one more minor spot in GiST VACUUM. It uses the heap tuple count
for the statistics of partial indexes, while it should not.

If gistvacuumcleanup() is not given statistics gathered by gistbulkdelete(),
it returns an incorrect tuple count for a partial index.
Here's the micropatch which fixes that corner case.
To reproduce this effect I used this query:
create table y as select cube(random()) c from generate_series(1,10000) y;
create index on y using gist(c) where c~>1 > 0.5;
vacuum verbose y;
Before the patch it will report 10000 tuples; with the patch it will report
values around 5000, matching the roughly 50% selectivity of the partial-index
predicate.

I do not know whether I should register a separate commitfest entry. The code
is very close to the main GiST VACUUM patch, but solves a slightly different
problem.

Best regards, Andrey Borodin.


0001-Count-tuples-correctly-during-GiST-VACUUM-of-partial.patch
Description: Binary data