Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-30 Thread John Naylor
On Wed, May 1, 2019 at 11:43 AM Amit Kapila  wrote:
>
> On Tue, Apr 30, 2019 at 7:52 PM Alvaro Herrera  
> wrote:
> > but that seems correct.
> > Sounds better than keeping outdated entries indicating no-space-available.
>
> Agreed, but as mentioned in one of the above emails, I am also a bit
> worried that it could lead to many invalidation messages for small
> relations, so maybe we should send the invalidation message only when
> the entire page is empty.

One way would be to send the inval if the new free space is greater
than some percentage of BLCKSZ.
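For illustration only, a minimal sketch of what such a check could look like inside RecordPageWithFreeSpace(); the 20% figure and the relation_has_fsm() helper are assumptions made up for this sketch, not anything from the patch:

    #include "postgres.h"
    #include "storage/freespace.h"
    #include "utils/inval.h"

    /* Assumed threshold: only invalidate when at least ~20% of a block is freed. */
    #define FSM_LOCAL_INVAL_THRESHOLD   (BLCKSZ / 5)

    void
    RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk, Size spaceAvail)
    {
        if (!relation_has_fsm(rel))     /* hypothetical helper */
        {
            /*
             * Small relation tracked only in the local map: tell other
             * backends about the new space, but only when enough was freed
             * to matter, so tiny relations don't generate a stream of
             * invalidation messages.
             */
            if (spaceAvail >= FSM_LOCAL_INVAL_THRESHOLD)
                CacheInvalidateRelcache(rel);
            return;
        }

        /* ... existing FSM update path ... */
    }

Setting the threshold to a full page's worth of free space would give Amit's "only when the entire page is empty" behavior instead.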

-- 
John Naylor  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-04-30 Thread Tom Lane
Andres Freund  writes:
> On 2019-04-30 15:53:07 -0700, Andres Freund wrote:
>> I'll move the test into a new "reindex_catalog" test, with a comment
>> explaining that the failure cases necessitating that are somewhere
>> between bugs, ugly warts, and hard-to-fix edge cases.

> Just pushed that.

locust is kind of unimpressed:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2019-05-01%2003%3A12%3A13

The relevant bit of log is

2019-05-01 05:24:47.527 CEST [97690:429] pg_regress/create_view LOG:  
statement: DROP SCHEMA temp_view_test CASCADE;
2019-05-01 05:24:47.605 CEST [97690:430] pg_regress/create_view LOG:  
statement: DROP SCHEMA testviewschm2 CASCADE;
2019-05-01 05:24:47.858 CEST [97694:1] [unknown] LOG:  connection received: 
host=[local]
2019-05-01 05:24:47.863 CEST [97694:2] [unknown] LOG:  connection authorized: 
user=pgbuildfarm database=regression
2019-05-01 05:24:47.878 CEST [97694:3] pg_regress/reindex_catalog LOG:  
statement: REINDEX TABLE pg_class;
2019-05-01 05:24:48.887 CEST [97694:4] pg_regress/reindex_catalog ERROR:  
deadlock detected
2019-05-01 05:24:48.887 CEST [97694:5] pg_regress/reindex_catalog DETAIL:  
Process 97694 waits for ShareLock on transaction 2559; blocked by process 97690.
Process 97690 waits for RowExclusiveLock on relation 1259 of database 
16387; blocked by process 97694.
Process 97694: REINDEX TABLE pg_class;
Process 97690: DROP SCHEMA testviewschm2 CASCADE;
2019-05-01 05:24:48.887 CEST [97694:6] pg_regress/reindex_catalog HINT:  See 
server log for query details.
2019-05-01 05:24:48.887 CEST [97694:7] pg_regress/reindex_catalog CONTEXT:  
while checking uniqueness of tuple (12,71) in relation "pg_class"
2019-05-01 05:24:48.887 CEST [97694:8] pg_regress/reindex_catalog STATEMENT:  
REINDEX TABLE pg_class;
2019-05-01 05:24:48.904 CEST [97690:431] pg_regress/create_view LOG:  
disconnection: session time: 0:00:03.748 user=pgbuildfarm database=regression 
host=[local]

which is mighty confusing at first glance, but I think the explanation is
that what the postmaster is reporting is process 97690's *latest* query,
not what it's currently doing.  What it's really currently doing at the
moment of the deadlock is cleaning out its temporary schema after the
client disconnected.  So this says you were careless about where to insert
the reindex_catalog test in the test schedule: it can't be after anything
that creates any temp objects.  That seems like kind of a problem :-(.
We could put it second, after the tablespace test, but that would mean
that we're reindexing after very little churn has happened in the
catalogs, which doesn't seem like much of a stress test.

Another fairly interesting thing is that this log includes the telltale

2019-05-01 05:24:48.887 CEST [97694:7] pg_regress/reindex_catalog CONTEXT:  
while checking uniqueness of tuple (12,71) in relation "pg_class"

Why did I have to dig to find that information in HEAD?  Have we lost
some useful context reporting?  (Note this run is in the v10 branch.)

regards, tom lane




Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-30 Thread Amit Kapila
On Wed, May 1, 2019 at 9:57 AM John Naylor  wrote:
>
> On Tue, Apr 30, 2019 at 12:48 PM Amit Kapila  wrote:
> >
> > On Fri, Apr 26, 2019 at 10:46 AM John Naylor
> >  wrote:
> > I don't much like the new function name GetAlternatePage; maybe
> > GetPageFromLocalFSM or something like that.  OTOH, I am not sure if we
> > should go that far to address this concern of Andres's, maybe just
> > adding a proper comment is sufficient.
>
> That's a clearer name. I think two functions are easier to follow than
> the boolean parameter.
>

Okay, but then add a few comments where you are calling that function.

> > > Putting the thresholds in 3 files with completely different purposes
> > > is a mess, and serves no example for future access methods, but I
> > > don't have a better idea.
> > >
> >
> > Yeah, I am also not sure if it is a good idea because it still won't
> > be easy for pluggable storage, especially the pg_upgrade part.  I think
> > if we really want to make it easy for pluggable storage to define
> > this, then we might need to build something along the lines of how
> > relation size estimation works.
> >
> > See how table_relation_estimate_size is defined and used
> > and TableAmRoutine heapam_methods
> > {
> > ..
> > relation_estimate_size
> > }
>
> That might be the best way for table ams, but I guess we'd still need
> to keep the hard-coding for indexes to always have a FSM. That might
> not be too bad.
>

I also think so.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-30 Thread John Naylor
On Tue, Apr 30, 2019 at 12:48 PM Amit Kapila  wrote:
>
> On Fri, Apr 26, 2019 at 10:46 AM John Naylor
>  wrote:
> >
> > On Wed, Apr 17, 2019 at 2:04 AM Andres Freund  wrote:
> > >
> > > Hi,
> > >
> > > I'm somewhat unhappy about how much the no-fsm-for-small-rels patch
> > > exposed complexity that looks like it should be purely in freespacemap.c
> > > to callers.
> >
> > I took a stab at untying the free space code from any knowledge about
> > heaps, and made it the responsibility of each access method that calls
> > these free space routines to specify their own threshold (possibly
> > zero). The attached applies on top of HEAD, because it's independent
> > of the relcache logic being discussed. If I can get that working, then
> > I'll rebase it on top of this API, if you like.
> >
> > >  extern Size GetRecordedFreeSpace(Relation rel, BlockNumber heapBlk);
> > > -extern BlockNumber GetPageWithFreeSpace(Relation rel, Size spaceNeeded);
> > > +extern BlockNumber GetPageWithFreeSpace(Relation rel, Size spaceNeeded,
> > > +bool check_fsm_only);
> > >
> > > So now freespace.c has an argument that says we should only check the
> > > fsm. That's confusing. And it's not explained to callers what that
> > > argument means, and when it should be set.
> >
> > I split this up into 2 routines: GetPageWithFreeSpace() is now exactly
> > like it is in v11, and GetAlternatePage() is available for access
> > methods that can use it.
> >
>
> I don't much like the new function name GetAlternatePage; maybe
> GetPageFromLocalFSM or something like that.  OTOH, I am not sure if we
> should go that far to address this concern of Andres's, maybe just
> adding a proper comment is sufficient.

That's a clearer name. I think two functions are easier to follow than
the boolean parameter.

> > Putting the thresholds in 3 files with completely different purposes
> > is a mess, and serves no example for future access methods, but I
> > don't have a better idea.
> >
>
> Yeah, I am also not sure if it is a good idea because it still won't
> be easy for pluggable storage, especially the pg_upgrade part.  I think
> if we really want to make it easy for pluggable storage to define
> this, then we might need to build something along the lines of how
> relation size estimation works.
>
> See how table_relation_estimate_size is defined and used
> and TableAmRoutine heapam_methods
> {
> ..
> relation_estimate_size
> }

That might be the best way for table ams, but I guess we'd still need
to keep the hard-coding for indexes to always have a FSM. That might
not be too bad.
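To make the analogy with relation_estimate_size concrete, here is a purely hypothetical sketch of the shape such a per-AM setting could take; the fsm_creation_threshold field does not exist anywhere, it is only an illustration of the idea:

    /* Hypothetical extension of TableAmRoutine (sketch, not real code) */
    typedef struct TableAmRoutine
    {
        /* ... existing callbacks ... */

        void        (*relation_estimate_size) (Relation rel, int32 *attr_widths,
                                               BlockNumber *pages, double *tuples,
                                               double *allvisfrac);

        /* invented: smallest heap size (in blocks) at which a FSM is created */
        BlockNumber fsm_creation_threshold;
    } TableAmRoutine;

    static const TableAmRoutine heapam_methods = {
        /* ... */
        .relation_estimate_size = heapam_estimate_rel_size,
        .fsm_creation_threshold = 4,    /* would mirror HEAP_FSM_CREATION_THRESHOLD */
    };

Indexes, as noted above, could simply bypass any such knob and keep their FSMs unconditionally.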

-- 
John Naylor  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-30 Thread John Naylor
On Tue, Apr 30, 2019 at 6:06 PM Amit Kapila  wrote:
>
> On Tue, Apr 30, 2019 at 2:24 PM Dilip Kumar  wrote:
> >
> >  insert into atacc1 values (21, 22, 23);
> > +ERROR:  could not read block 0 in file "base/16384/31379": read only
> > 0 of 8192 bytes
> >
> > I have analysed this failure.  Seems that we have not reset the
> > rel->fsm_local_map while truncating the relation pages by vacuum
> > (lazy_truncate_heap). So the next time we access it, we get the
> > error.  I think we need a mechanism to invalidate this
> > when we truncate the relation pages.   I am not sure whether we should
> > invalidate the relcache entry here or just reset the
> > rel->fsm_local_map?
> >
>
> Thanks, this appears to be the missing case where we need to
> invalidate the cache.  So, as discussed above, if we issue an invalidation
> call (in RecordPageWithFreeSpace) when the page becomes empty, then we
> shouldn't encounter this.  John, can we try this out and see if the
> failure goes away?

I added a clear/inval call in RecordPageWithFreeSpace and the failure
goes away. Thanks for the analysis, Dilip!

-- 
John Naylor  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: POC: Cleaning up orphaned files using undo logs

2019-04-30 Thread Amit Kapila
On Wed, May 1, 2019 at 6:02 AM Robert Haas  wrote:
>
> Replying to myself to resend to the list, since my previous attempt
> seems to have been eaten by a grue.
>
> On Tue, Apr 30, 2019 at 11:14 AM Robert Haas  wrote:
> >
> > On Tue, Apr 30, 2019 at 2:16 AM Dilip Kumar  wrote:
> > > Like previous version these patch set also applies on:
> > > https://github.com/EnterpriseDB/zheap/tree/undo
> > > (b397d96176879ed5b09cf7322b8d6f2edd8043a5)
> >
> > Some more review of 0003:
> >

Another suggestion:

+/*
+ * Insert previously-prepared undo records.  This will write the actual undo
+ * record into the buffers already pinned and locked in PreparedUndoInsert,
+ * and mark them dirty.  This step should be performed after entering a
+ * critical section; it should never fail.
+ */
+void
+InsertPreparedUndo(void)
+{
..
..
+
+ /* Advance the insert pointer past this record. */
+ UndoLogAdvance(urp, size);
+ }
..
}

UndoLogAdvance internally takes an LWLock, and we don't recommend doing
that inside a critical section; yet that is exactly what will happen here,
since this function is supposed to be invoked in a critical section, as the
comment above says.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-30 Thread Amit Kapila
On Tue, Apr 30, 2019 at 7:52 PM Alvaro Herrera  wrote:
>
> On 2019-Apr-30, John Naylor wrote:
>
> > On Fri, Apr 26, 2019 at 11:52 AM Amit Kapila  
> > wrote:
> > > As discussed above, we need to issue an
> > > invalidation for the following points:  (a) when vacuum finds there is no
> > > FSM and a page has more space now; I think you can detect this in
> > > RecordPageWithFreeSpace
> >
> > I took a brief look and we'd have to know how much space was there
> > before. That doesn't seem possible without first implementing the idea
> > to save free space locally in the same way the FSM does. Even if we
> > have consensus on that, there's no code for it, and we're running out
> > of time.
>
> Hmm ... so, if vacuum runs and frees up any space from any of the pages,
> then it should send out an invalidation -- it doesn't matter what the
> FSM had, just that there is more free space now.  That means every other
> process will need to determine a fresh FSM,
>

I think you mean the local space map, because once the FSM is created
we will send an invalidation and will no longer build a relcache entry
with a local space map.

> but that seems correct.
> Sounds better than keeping outdated entries indicating no-space-available.
>

Agreed, but as mentioned in one of the above emails, I am also a bit
worried that it could lead to many invalidation messages for small
relations, so maybe we should send the invalidation message only when
the entire page is empty.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Adding a test for speculative insert abort case

2019-04-30 Thread Thomas Munro
On Wed, May 1, 2019 at 12:16 PM Melanie Plageman
 wrote:
> s1: insert into t1 values(1, 'someval') on conflict(id) do update set val = 'someotherval';
> s1: pause in ExecInsert before calling ExecInsertIndexTuples
> s2: insert into t1 values(1, 'someval');
> s2: continue
>
> We don't know of a way to add this scenario to the current isolation 
> framework.
>
> Can anyone think of a good way to put this codepath under test?

Hi Melanie,

I think it'd be nice to have a set of macros that can create wait
points in the C code that isolation tests can control, in a special
build.  Perhaps there could be a hash table of named wait points in
shared memory; if DEBUG_WAIT_POINT("foo") finds that "foo" is not
present, it continues, but if it finds an entry it waits for it to go
away.  Then isolation tests could add/remove names and signal a
condition variable to release waiters.
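None of this exists today, but a rough sketch of the mechanism described above (ignoring shared-memory setup, locking of the hash table, and the SQL-level functions a test would use to arm and release wait points; all names here are invented) might look something like:

    #ifdef USE_DEBUG_WAIT_POINTS

    #include "pgstat.h"
    #include "storage/condition_variable.h"
    #include "utils/hsearch.h"

    /* Shared hash of currently-armed wait point names (setup not shown). */
    static HTAB *WaitPointHash;
    /* Broadcast whenever a wait point is removed. */
    static ConditionVariable *WaitPointCV;

    #define DEBUG_WAIT_POINT(pointname) \
        do { \
            char    _key[NAMEDATALEN]; \
            bool    _found; \
            strlcpy(_key, (pointname), NAMEDATALEN); \
            ConditionVariablePrepareToSleep(WaitPointCV); \
            for (;;) \
            { \
                (void) hash_search(WaitPointHash, _key, HASH_FIND, &_found); \
                if (!_found) \
                    break;      /* point not armed: fall straight through */ \
                ConditionVariableSleep(WaitPointCV, WAIT_EVENT_PG_SLEEP); \
            } \
            ConditionVariableCancelSleep(); \
        } while (0)

    #else
    #define DEBUG_WAIT_POINT(pointname) ((void) 0)
    #endif

In a normal build the macro compiles to nothing, so the wait points cost nothing outside the special test build.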

I contemplated that while working on SKIP LOCKED, which had a bunch of
weird edge cases that I tested by inserting throw-away wait-point code
like this:

https://www.postgresql.org/message-id/CADLWmXXss83oiYD0pn_SfQfg%2ByNEpPbPvgDb8w6Fh--jScSybA%40mail.gmail.com

-- 
Thomas Munro
https://enterprisedb.com




Re: Adding a test for speculative insert abort case

2019-04-30 Thread Andres Freund
Hi,

On April 30, 2019 6:43:08 PM PDT, Peter Geoghegan  wrote:
>On Tue, Apr 30, 2019 at 5:16 PM Melanie Plageman
> wrote:
>> Can anyone think of a good way to put this codepath under test?
>
>During the initial development of ON CONFLICT, speculative insertion
>itself was tested using custom stress testing that you can still get
>here:
>
>https://github.com/petergeoghegan/jjanes_upsert
>
>I'm not sure that this is something that you can adopt, but I
>certainly found it very useful at the time. It tests whether or not
>there is agreement among concurrent speculative inserters, and whether
>or not there are "unprincipled deadlocks" (user hostile deadlocks that
>cannot be fixed by reordering something in application code).

I think we want a deterministic case. I recall asking for that back then...

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: performance regression when filling in a table

2019-04-30 Thread Fabien COELHO



Hello Andres,


The effect is that the first generation seems to take more time, but
dropping the table and regenerating again much less, with a typical 40%
performance improvement between first and second run, independently of the
version. The reported figures above were comparisons between first for pg12
and second or later for pg11.


Yea, that's pretty normal. The likely difference is that in the repeated
case you'll have WAL files ready to be recycled. I'd assume that the
difference between the runs would be much smaller if you used unlogged
tables (or WAL on a ramdisk or somesuch).


I tried unlogged, and indeed the first run is no different from subsequent 
ones.



Performance is quite unstable, with index generation on the same scale-100
data taking anything from 6 to 15 seconds across runs.


How comparable are the runs?


See below for a taste.


Are you restarting postgres inbetween?


Nope. Trying once did not change the measures.


Perform checkpoints?


Nope, but with the default settings there is one every five minutes. I'm 
not sure a checkpoint should have a significant impact on a COPY 
initialization.



VACUUM and checksums interact badly: vacuum time jumps from 3
seconds to 30 seconds :-(


Hm, that's more than I normally see. What exactly did you do there?


I simply ran "pgbench -i -s 100" on master, with 
https://commitfest.postgresql.org/23/2085/ thrown in for detailed stats.


Without checksums:

  # init
  37.68 s (drop tables 0.00 s, create tables 0.02 s, generate 27.12 s, vacuum 
2.97 s, primary keys 7.56 s)
  30.53 s (drop tables 0.25 s, create tables 0.01 s, generate 16.64 s, vacuum 
3.47 s, primary keys 10.16 s)
  36.31 s (drop tables 0.25 s, create tables 0.01 s, generate 18.94 s, vacuum 
3.40 s, primary keys 13.71 s)
  31.34 s (drop tables 0.23 s, create tables 0.01 s, generate 19.07 s, vacuum 
3.00 s, primary keys 9.03 s)
  # reinit
  38.25 s (drop tables 0.00 s, create tables 0.03 s, generate 29.33 s, vacuum 
3.10 s, primary keys 5.80 s)
  35.16 s (drop tables 0.25 s, create tables 0.01 s, generate 17.62 s, vacuum 
2.62 s, primary keys 14.67 s)
  29.15 s (drop tables 0.25 s, create tables 0.01 s, generate 17.35 s, vacuum 
2.98 s, primary keys 8.55 s)
  32.70 s (drop tables 0.25 s, create tables 0.01 s, generate 21.49 s, vacuum 
2.65 s, primary keys 8.29 s)
  # reinit
  42.39 s (drop tables 0.00 s, create tables 0.03 s, generate 33.98 s, vacuum 
2.16 s, primary keys 6.23 s)
  31.24 s (drop tables 0.24 s, create tables 0.01 s, generate 17.34 s, vacuum 
4.74 s, primary keys 8.91 s)
  26.91 s (drop tables 0.24 s, create tables 0.01 s, generate 16.83 s, vacuum 
2.89 s, primary keys 6.94 s)
  29.00 s (drop tables 0.25 s, create tables 0.01 s, generate 17.78 s, vacuum 
2.97 s, primary keys 7.98 s)

With checksum enabled:

  # init
  73.84 s (drop tables 0.00 s, create tables 0.03 s, generate 32.81 s, vacuum 
34.95 s, primary keys 6.06 s)
  61.49 s (drop tables 0.24 s, create tables 0.01 s, generate 18.55 s, vacuum 
33.26 s, primary keys 9.42 s)
  62.79 s (drop tables 0.24 s, create tables 0.01 s, generate 21.08 s, vacuum 
33.50 s, primary keys 7.96 s)
  58.77 s (drop tables 0.23 s, create tables 0.06 s, generate 21.98 s, vacuum 
31.21 s, primary keys 5.30 s)
  # restart
  63.77 s (drop tables 0.04 s, create tables 0.02 s, generate 17.37 s, vacuum 
40.84 s, primary keys 5.51 s)
  64.48 s (drop tables 0.22 s, create tables 0.01 s, generate 19.84 s, vacuum 
33.43 s, primary keys 10.98 s)
  64.10 s (drop tables 0.23 s, create tables 0.01 s, generate 22.11 s, vacuum 
33.17 s, primary keys 8.57 s)
  # reinit
  71.65 s (drop tables 0.00 s, create tables 0.03 s, generate 

Re: Adding a test for speculative insert abort case

2019-04-30 Thread Peter Geoghegan
On Tue, Apr 30, 2019 at 5:16 PM Melanie Plageman
 wrote:
> Can anyone think of a good way to put this codepath under test?

During the initial development of ON CONFLICT, speculative insertion
itself was tested using custom stress testing that you can still get
here:

https://github.com/petergeoghegan/jjanes_upsert

I'm not sure that this is something that you can adopt, but I
certainly found it very useful at the time. It tests whether or not
there is agreement among concurrent speculative inserters, and whether
or not there are "unprincipled deadlocks" (user hostile deadlocks that
cannot be fixed by reordering something in application code).

-- 
Peter Geoghegan




Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-04-30 Thread Andres Freund
On 2019-04-30 15:53:07 -0700, Andres Freund wrote:
> Hi,
> 
> On 2019-04-30 18:42:36 -0400, Tom Lane wrote:
> > markhor just reported in with results showing that we have worse
> > problems than deadlock-prone tests in the back branches: 9.4
> > for example looks like
> 
> >   -- whole tables
> >   REINDEX TABLE pg_class; -- mapped, non-shared, critical
> > + ERROR:  could not read block 0 in file "base/16384/27769": read only 0 of 
> > 8192 bytes
> 
> Ugh.  Also failed on 9.6.

I see the bug. Turns out we need to figure out another way to solve the
assertion triggered by doing catalog updates within
RelationSetNewRelfilenode() - we can't just move the
SetReindexProcessing() before it.  When CCA is enabled, the
CommandCounterIncrement() near the tail of RelationSetNewRelfilenode()
triggers a rebuild of the catalog entries - but without the
SetReindexProcessing() those scans will try to use the index currently
being rebuilt. Which then predictably fails:

#0  mdread (reln=0x5600aea36498, forknum=MAIN_FORKNUM, blocknum=0, 
buffer=0x7f71037db800 "") at 
/home/andres/src/postgresql/src/backend/storage/smgr/md.c:633
#1  0x5600ae3f656f in smgrread (reln=0x5600aea36498, forknum=MAIN_FORKNUM, 
blocknum=0, buffer=0x7f71037db800 "")
at /home/andres/src/postgresql/src/backend/storage/smgr/smgr.c:590
#2  0x5600ae3b4c13 in ReadBuffer_common (smgr=0x5600aea36498, 
relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=0, mode=RBM_NORMAL, 
strategy=0x0, 
hit=0x7fff5bb11cab) at 
/home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:896
#3  0x5600ae3b44ab in ReadBufferExtended (reln=0x7f7107972540, 
forkNum=MAIN_FORKNUM, blockNum=0, mode=RBM_NORMAL, strategy=0x0)
at /home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:664
#4  0x5600ae3b437f in ReadBuffer (reln=0x7f7107972540, blockNum=0) at 
/home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:596
#5  0x5600ae00e0b3 in _bt_getbuf (rel=0x7f7107972540, blkno=0, access=1) at 
/home/andres/src/postgresql/src/backend/access/nbtree/nbtpage.c:805
#6  0x5600ae00dd2a in _bt_heapkeyspace (rel=0x7f7107972540) at 
/home/andres/src/postgresql/src/backend/access/nbtree/nbtpage.c:694
#7  0x5600ae01679c in _bt_first (scan=0x5600aea0, 
dir=ForwardScanDirection) at 
/home/andres/src/postgresql/src/backend/access/nbtree/nbtsearch.c:1237
#8  0x5600ae012617 in btgettuple (scan=0x5600aea0, 
dir=ForwardScanDirection) at 
/home/andres/src/postgresql/src/backend/access/nbtree/nbtree.c:247
#9  0x5600ae005572 in index_getnext_tid (scan=0x5600aea0, 
direction=ForwardScanDirection)
at /home/andres/src/postgresql/src/backend/access/index/indexam.c:550
#10 0x5600ae00571e in index_getnext_slot (scan=0x5600aea0, 
direction=ForwardScanDirection, slot=0x5600ae9c6ed0)
at /home/andres/src/postgresql/src/backend/access/index/indexam.c:642
#11 0x5600ae003e54 in systable_getnext (sysscan=0x5600aea44080) at 
/home/andres/src/postgresql/src/backend/access/index/genam.c:450
#12 0x5600ae564292 in ScanPgRelation (targetRelId=1259, indexOK=true, 
force_non_historic=false)
at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:365
#13 0x5600ae568203 in RelationReloadNailed (relation=0x5600aea0c4d0) at 
/home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2292
#14 0x5600ae568621 in RelationClearRelation (relation=0x5600aea0c4d0, 
rebuild=true) at 
/home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2425
#15 0x5600ae569081 in RelationCacheInvalidate () at 
/home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2858
#16 0x5600ae55b32b in InvalidateSystemCaches () at 
/home/andres/src/postgresql/src/backend/utils/cache/inval.c:649
#17 0x5600ae55b408 in AcceptInvalidationMessages () at 
/home/andres/src/postgresql/src/backend/utils/cache/inval.c:708
#18 0x5600ae3d7b22 in LockRelationOid (relid=1259, lockmode=1) at 
/home/andres/src/postgresql/src/backend/storage/lmgr/lmgr.c:136
#19 0x5600adf85ad2 in relation_open (relationId=1259, lockmode=1) at 
/home/andres/src/postgresql/src/backend/access/common/relation.c:56
#20 0x5600ae040337 in table_open (relationId=1259, lockmode=1) at 
/home/andres/src/postgresql/src/backend/access/table/table.c:43
#21 0x5600ae564215 in ScanPgRelation (targetRelId=2662, indexOK=false, 
force_non_historic=false)
at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:348
#22 0x5600ae567ecf in RelationReloadIndexInfo (relation=0x7f7107972540) at 
/home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2170
#23 0x5600ae5681d3 in RelationReloadNailed (relation=0x7f7107972540) at 
/home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2270
#24 0x5600ae568621 in RelationClearRelation (relation=0x7f7107972540, 
rebuild=true) at 
/home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2425
#25 0x5600ae568d19 in RelationFlushRelation (relation=0x7f7107972540) at 

Re: Adding a test for speculative insert abort case

2019-04-30 Thread Melanie Plageman
On Tue, Apr 30, 2019 at 5:22 PM Andres Freund  wrote:

>
> Not easily so - that's why the ON CONFLICT patch didn't add code
> coverage for it :(. I wonder if you could whip something up by having
> another non-unique expression index, where the expression acquires an
> advisory lock? If that advisory lock were previously acquired by
> another session, that should make it possible to write a reliable isolation test?
>
>
So, I took a look at one of the existing tests that does something like what
you mentioned and tried the following:
--
create table t1(key int, val text);
create unique index t1_uniq_idx on t1(key);
create or replace function t1_lock_func(int) returns int immutable language
sql AS
'select pg_advisory_xact_lock_shared(1); select $1';
create index t1_lock_idx ON t1(t1_lock_func(key));
--
s1:
begin isolation level read committed;
insert into t1 values(1, 'someval');
s2:
set default_transaction_isolation = 'read committed';
insert into t1 values(1, 'anyval') on conflict(key) do update set val =
'updatedval';
--

So, the above doesn't work because s2 waits to acquire the lock in the first
phase of the speculative insert -- when it is just checking the index, before
inserting into the table and before inserting into the index.

Then, when s1 is committed, we won't execute the speculative insert code at
all and will go into ExecOnConflictUpdate instead.

Maybe I just need a different kind of advisory lock to allow
ExecCheckIndexConstraints to be able to check the index here. I figured it is
a read operation, so a shared advisory lock should be okay, but it seems like
it is not okay.

Without knowing any of the context, on an initial pass of debugging, I did
notice that, in the initial check of the index by s2, XactLockTableWait is
called with reason_wait as XLTW_InsertIndex (even though we are just trying
to check it, so maybe it knows our intentions :))

Is there something I can do in the test to allow my check to go
through but the insert to have to wait?

-- 
Melanie Plageman


Re: ERROR: failed to add item to the index page

2019-04-30 Thread Peter Geoghegan
On Tue, Apr 30, 2019 at 10:58 AM Peter Geoghegan  wrote:
> I have found what the problem is. I simply neglected to make a
> conservative assumption about suffix truncation needing to add a heap
> TID to a leaf page's new high key in nbtsort.c (following commit
> dd299df8189), even though I didn't make the same mistake in
> nbtsplitloc.c. Not sure how I managed to make such a basic error.

Attached is a much more polished version of the same patch. I tried to
make clear how the "page full" test (the test that has been fixed to
take heap TID space for high key into account) is related to other
close-by code, such as the tuple space limit budget within
_bt_check_third_page(), and the code that sets up an actual call to
_bt_truncate().

I'll wait a few days before pushing this. This version doesn't feel
too far off being committable. I tested it with some of the CREATE
INDEX tests that I developed during development of the nbtree unique
keys project, including a test with tuples that are precisely at the
1/3 of a page threshold. The new definition of 1/3 of a page takes
high key heap TID overhead into account -- see _bt_check_third_page().

-- 
Peter Geoghegan


v2-0001-Fix-nbtsort.c-s-page-space-accounting.patch
Description: Binary data


Re: Improve search for missing parent downlinks in amcheck

2019-04-30 Thread Peter Geoghegan
On Sun, Apr 28, 2019 at 10:15 AM Alexander Korotkov
 wrote:
> I think this is definitely not a bug fix.  The Bloom filter was designed to be
> lossy, no way blaming it for that :)

I will think about a simple fix, but after the upcoming point release.
There is no hurry.


-- 
Peter Geoghegan




Re: [HACKERS] Weaker shmem interlock w/o postmaster.pid

2019-04-30 Thread Noah Misch
On Tue, Apr 30, 2019 at 10:47:37PM +1200, Thomas Munro wrote:
> On Mon, Apr 22, 2019 at 4:59 AM Noah Misch  wrote:
> > On Thu, Apr 18, 2019 at 04:30:46PM +1200, Thomas Munro wrote:
> > > This causes make check-world to deliver a flurry of pop-ups from
> > > macOS's built-in Firewall asking if perl should be allowed to listen
> > > to all interfaces [...].
> >
> > That is unfortunate.  The "Allowing specific applications" section of
> > https://support.apple.com/en-us/HT201642 appears to offer a way to allow 
> > perl
> > permanently.  Separately, it wouldn't cost much for us to abandon that check
> > on !$use_tcp (non-Windows) configurations.
> 
> My system is set up not to allow that and I suppose I could go and
> argue with my IT department about that, but I'm interested in your
> second suggestion if the test is in fact not serving any useful
> purpose for non-Windows systems anyway.  Do you mean like this?
> 
> --- a/src/test/perl/PostgresNode.pm
> +++ b/src/test/perl/PostgresNode.pm
> @@ -1098,17 +1098,12 @@ sub get_new_node
> # native Perl (https://stackoverflow.com/a/14388707),
> so we also test
> # individual addresses.
> #
> -   # This seems like a good idea on Unixen as well, even
> though we don't
> -   # ask the postmaster to open a TCP port on Unix.  On 
> Non-Linux,
> -   # non-Windows kernels, binding to 127.0.0.1/24
> addresses other than
> -   # 127.0.0.1 fails with EADDRNOTAVAIL.
> -   #

Deleting that comment paragraph isn't quite right, since we're still testing
127.0.0.1 everywhere.  The paragraph does have cause to change.

> # XXX A port available now may become unavailable by
> the time we start
> # the postmaster.
> if ($found == 1)
> {
> -   foreach my $addr (qw(127.0.0.1 0.0.0.0),
> -   $use_tcp ? qw(127.0.0.2 127.0.0.3) : ())
> +   foreach my $addr (qw(127.0.0.1),
> +   $use_tcp ? qw(0.0.0.0 127.0.0.2 127.0.0.3) : 
> ())

This is what I meant, yes.

> {
> can_bind($addr, $port) or $found = 0;
> }




Re: Initializing LWLock Array from Server Code

2019-04-30 Thread Robert Haas
Replying to myself to resend to the list, since my previous attempt
seems to have been eaten by a grue.

On Mon, Apr 29, 2019 at 1:59 PM Robert Haas  wrote:
>
> On Fri, Apr 26, 2019 at 2:58 PM Souvik Bhattacherjee  
> wrote:
> > I have created a shared hash table in partitioned mode inside the postgres 
> > server code. In order to guard the partitions, I'm trying to initialize an 
> > array of LWLocks. The code that I'm trying to use for that is
> >
> > void RequestNamedLWLockTranche(const char *tranche_name, int num_lwlocks);
> > LWLockPadded *GetNamedLWLockTranche(const char *tranche_name);
> >
> > I'm not sure where exactly should this code be called from the server code. 
> > So I had placed it in
> >
> > void CreateSharedMemoryAndSemaphores(bool makePrivate, int port);
> >
> > within ipci.c. However, I'm getting the following error message when 
> > starting the database:
> >
> > FATAL:  requested tranche is not registered
> >
> > So at this point, I'm a little confused as to where the methods should be 
> > called from inside the server code. Any pointers would be appreciated.
>
> RequestNamedLWLockTranche() changes the behavior of LWLockShmemSize()
> and CreateLWLocks(), so must be called before either of those.
> GetNamedLWLockTranche() must be called after CreateLWLocks().
>
> This machinery works for pg_stat_statements, so see that as an example
> of how to make this work from an extension.  If you're hacking the
> core server code, then look at the places where the corresponding bits
> of pg_stat_statements code get called.  IIRC, _PG_init() gets called
> from process_shared_preload_libraries(), so you might look at the
> placement of the calls to that function.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
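As a condensed illustration of the ordering described above, the extension pattern used by pg_stat_statements boils down to roughly the following; the tranche name, the requested size, and the my_* names are placeholders:

    #include "postgres.h"
    #include "fmgr.h"
    #include "miscadmin.h"
    #include "storage/ipc.h"
    #include "storage/lwlock.h"
    #include "storage/shmem.h"

    PG_MODULE_MAGIC;

    static LWLock *my_lock;
    static shmem_startup_hook_type prev_shmem_startup_hook = NULL;

    static void my_shmem_startup(void);

    void
    _PG_init(void)
    {
        if (!process_shared_preload_libraries_in_progress)
            return;

        /* Must run before shared memory (and hence CreateLWLocks()) is set up. */
        RequestAddinShmemSpace(MAXALIGN(1024));
        RequestNamedLWLockTranche("my_tranche", 1);

        prev_shmem_startup_hook = shmem_startup_hook;
        shmem_startup_hook = my_shmem_startup;
    }

    static void
    my_shmem_startup(void)
    {
        if (prev_shmem_startup_hook)
            prev_shmem_startup_hook();

        /* Only valid after CreateLWLocks() has run during shmem creation. */
        my_lock = &(GetNamedLWLockTranche("my_tranche")->lock);
    }

For code living inside the server itself, the constraint Robert states is the same: the request has to happen before CreateLWLocks() runs, and the lookup only afterwards.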



-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-04-30 Thread Robert Haas
Replying to myself to resend to the list, since my previous attempt
seems to have been eaten by a grue.

On Tue, Apr 30, 2019 at 11:14 AM Robert Haas  wrote:
>
> On Tue, Apr 30, 2019 at 2:16 AM Dilip Kumar  wrote:
> > Like previous version these patch set also applies on:
> > https://github.com/EnterpriseDB/zheap/tree/undo
> > (b397d96176879ed5b09cf7322b8d6f2edd8043a5)
>
> Some more review of 0003:
>
> There is a whitespace-only hunk in xact.c.
>
> It would be nice if AtAbort_ResetUndoBuffers didn't need to exist at
> all.  Then, this patch would make no changes whatsoever to xact.c.
> We'd still need such changes in other patches, because the whole idea
> of undo is tightly bound up with the concept of transactions.  Yet,
> this particular patch wouldn't touch that file, and that would be
> nice.  In general, the reason why we need AtCommit/AtAbort/AtEOXact
> callbacks is to adjust the values of global variables (or the data
> structures to which they point) at commit or abort time.  And that is
> also the case here.  The thing is, I'm not sure why we need these
> particular global variables.  Is there some way that we can get by
> without them? The obvious thing to do seems to be to invent yet
> another context object, allocated via a new function, which can then
> be passed to PrepareUndoInsert, InsertPreparedUndo,
> UndoLogBuffersSetLSN, UnlockReleaseUndoBuffers, etc.  This would
> obsolete UndoSetPrepareSize, since you'd instead pass the size to the
> context allocator function.
>
> UndoRecordUpdateTransInfo should declare a local variable
> XactUndoRecordInfo *something = _urec_info[idx] and use that
> variable wherever possible.
>
> It should also probably use while (1) { ... } rather than do { ... }
> while (true).
>
> In UndoBufferGetSlot you could replace 'break;' with 'return i;' and
> then more than half the function would need one less level of
> indentation.  This function should also declare PreparedUndoBuffer
> *something and set that variable equal to _undo_buffers[i] at
> the top of the loop and again after the loop, and that variable should
> then be used whenever possible.
>
> UndoRecordRelationDetails seems to need renaming now that it's down to
> a single member.
>
> The comment for UndoRecordBlock needs updating, as it still refers to blkprev.
>
> The comment for UndoRecordBlockPrev refers to "Identifying
> information" but it's not identifying anything.  I think you could
> just delete "Identifying information for" from this sentence and not
> lose anything.  And I think several of the nearby comments that refer
> to "Identifying information" could more simply and more correctly just
> refer to "Information".
>
> I don't think that SizeOfUrecNext needs to exist.
>
> xact.h should not add an include for undolog.h.  There are no other
> changes to this header, so presumably the header does not depend on
> anything in undolog.h.  If .c files that include xact.h need
> undolog.h, then the header should be included in those files, not in
> the header itself.  That way, we avoid making partial recompiles more
> painful than necessary.
>
> UndoGetPrevUndoRecptr looks like a strange interface.  Why not just
> arrange not to call the function in the first place if prevurp is
> valid?
>
> Every use of palloc0 in this code should be audited to check whether
> it is really necessary to zero the memory before use.  If it will be
> initialized before it's used for anything anyway, it doesn't need to
> be pre-zeroed.
>
> FinishUnpackUndo looks nice!  But it is missing a blank line in one
> place, and 'if it presents' should be 'if it is present' in a whole
> bunch of places.
>
> BeginInsertUndo also looks to be missing a few blank lines.
>
> Still need to do some cleanup of the UnpackedUndoRecord, e.g. unifying
> uur_xid and uur_xidepoch into uur_fxid.
>
> InsertUndoData ends in an unnecessary return, as does SkipInsertingUndoData.
>
> I think the argument to SkipInsertingUndoData should be the number of
> bytes to skip, rather than the starting byte within the block.
>
> Function header comment formatting is not consistent.  Compare
> FinishUnpackUndo (function name recapitulated on first line of
> comment) to ReadUndoBytes (not recapitulated) to UnpackUndoData
> (entire header comment jammed into one paragraph with function name at
> start).  I prefer the style where the function name is not restated,
> but YMMV.  Anyway, it has to be consistent.
>
> UndoGetPrevRecordLen should not declare char *page instead of Page
> page, I think.
>
> UndoRecInfo looks a bit silly, I think.  Isn't index just the index of
> this entry in the array?  You can always figure that out by ptr -
> array_base. Instead of having UndoRecPtr urp in this structure, how
> about adding that to UnpackedUndoRecord?  When inserting, caller
> leaves it unset and InsertPreparedUndo sets it; when retrieving,
> UndoFetchRecord or UndoRecordBulkFetch sets it.  With those two
> changes, I think you can get rid of UndoRecInfo 

Re: message style

2019-04-30 Thread Robert Haas
Replying to myself to resend to the list, since my previous attempt
seems to have been eaten by a grue.


On Tue, Apr 30, 2019 at 12:05 PM Robert Haas  wrote:
>
> On Tue, Apr 30, 2019 at 10:58 AM Alvaro Herrera
>  wrote:
> > My problem here is not really the replacement of the name to %s, but the
> > "may not be" part of it.  We don't use "may not be" anywhere else; most
> > places seem to use "foo cannot be X" and a small number of other places
> > use "foo must not be Y".  I'm not able to judge which of the two is
> > better (so change all messages to use that form), or if there's a
> > semantic difference and if so which one to use in this case.
>
> The message style guidelines specifically discourage the use of "may",
> IMHO for good reason.  "mumble may not be flidgy" could be trying to
> tell you that something is impermissible, as in "the children may not
> run in the house."  But it could also be warning you that something is
> doubtful, as in "the children may not receive Easter candy this year
> because there is a worldwide chocolate shortage."  Sometimes the same
> sentence can be read either way, like "this table may not be
> truncated," which can mean either that TRUNCATE is going to fail if
> run in the future, or that it is unclear whether TRUNCATE was already
> run in the past.
>
> As for "cannot" versus "must not", that's murkier, but it looks to me as
> though we prefer "cannot" to "must not" about 9:1, so most often
> "cannot" is the right thing, but not always.  The distinction seems to
> be that we use "cannot" to talk about things that we are unwilling or
> unable to do in the future, whereas "must not" is used to admonish the
> user about what has already taken place.  Consider:
>
> array must not contain nulls
> header key must not contain newlines
> cast function must not return a set
> interval time zone \"%s\" must not include months or days
> function \"%s\" must not be SECURITY DEFINER
>
> vs.
>
> cannot drop %s because %s requires it
> cannot PREPARE a transaction that has manipulated logical replication workers
> cannot reindex temporary tables of other sessions
> cannot inherit from partitioned table \"%s\"
>
> The first set of messages are essentially complaints about the past.
> The array shouldn't have contained nulls, but it did!  The header key
> should not have contained newlines, but it did!  The cast function
> should not return a set, but it does!  Hence, we are sad and are
> throwing an error.  The second set are statements that we are
> unwilling or unable to proceed, but they don't necessarily carry the
> connotation that there is a problem already in the past.  You've just
> asked for something you are not going to get.
>
> I think that principle still leaves some ambiguity.  For example, you
> could phrase that second of the "cannot" message as "must not try to
> PREPARE a transaction that has manipulated logical replication
> workers."  That's grammatical and everything, but it sounds a bit
> accusatory, like the user is in trouble or something.  I think that's
> probably why we tend to prefer "cannot" in most cases.  But sometimes
> that would lead to a longer or less informative message.  For example,
> you can't just change
>
> function \"%s\" must not be SECURITY DEFINER
>
> to
>
> function \"%s\" can not be SECURITY DEFINER
>
> ...because the user will rightly respond "well, I guess it can,
> because it is."  We could say
>
> can not execute security definer functions from PL/Tcl
>
> ...but that sucks because we now have no reasonable place to put the
> function name.  We could say
>
> can not execute security definer function \"%s\" from PL/Tcl
>
> ...but that also sucks because now the message only says that this one
> particular security definer function cannot be executed, rather than
> saying that ALL security definer functions cannot be executed.   To
> really get there, you'd have to do something like
>
> function \"%s\" cannot be executed by PL/Tcl because it is a security
> definer function
>
> ...which is fine, but kinda long.  On the plus side it's more clear
> about the source of the error (PL/Tcl) than the current message which
> doesn't state that explicitly, so perhaps it's an improvement anyway,
> but the general point is that sometimes I think there is no succinct
> way of expressing the complaint clearly without using "must not".
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-04-30 Thread Robert Haas
Replying to myself to resend to the list, since my previous attempt
seems to have been eaten by a grue.

On Tue, Apr 30, 2019 at 1:01 PM Robert Haas  wrote:
>
> On Tue, Apr 30, 2019 at 1:38 AM Masahiko Sawada  wrote:
> > It seems to me that encrypting table data in WAL with multiple keys
> > reduces damage in case where a key is theft. However user who has an
> > access privilege of reading WAL can obtain all keys encrypting table
> > data in WAL in the first place.
>
> That better not be true.  If you have a design where reading the WAL
> lets you get *any* encryption key, you have a bad design, I think.  If
> you have a design where reading the WAL lets you get *every*
> encryption key, that's truly terrible.  That's strictly worse than
> full-disk encryption, which at least protects against the disk being
> stolen.
>
> > So as long as postgres's access
> > control facility works fine it's the same as having single encryption
> > key dedicated for WAL.
>
> I think if postgres's access control facility works fine there is no
> need for encryption in the first place.  If we can just use REVOKE to
> block access, we should do that and forget about encryption.  The
> value of encryption only enters the picture when someone can bypass
> the database server permissions in some way, such as by reading the
> files directly.
>
> > Or do you mean the case where a malicious user
> > steals both WAL and key A? It completely depends on from what threat
> > we want to protect data by transparent data encryption but I think we
> > should rather protect from such threat by the access control in that
> > situation. I personally don't think the having an encryption key
> > dedicated for WAL would increase risk much.
>
> Well, what threat are you trying to protect against?
>
> > FWIW, binary log encryption of MySQL uses different encryption key
> > from a key used for table[1]. The key is encrypted by the master key
> > for binary log encryption and is stored in each file headers.
>
> So, if you steal the master key for binary log encryption, you can
> decrypt everything, it sounds like.
>
> > If we use the same key to encrypt the WAL for the table and indexes
> > and TOAST table for the table, what encryption key should we use for
> > temporary files for an intermediate result?
>
> For temporary files, we can just use some temporary key that is only
> stored in server memory and only for the lifetime of the session.
> Once the session ends, we don't ever need to read that data again.
>
> > And should we use each
> > different encryption keys for WAL other than table and indexes
> > resource manager?
>
> Data other than table and index data seems like it is not very
> security-sensitive.  I'm not sure we need to encrypt it at all.  If we
> do, using one key seems fine.
>
> > The advantage of having the dedicated key for WAL
> > encryption would be to make WAL encryption more simple. If we do that,
> > we can encrypt the whole WAL page by key A like the patch proposed by
> > Antonin does).
>
> Yeah.  His design is simpler because it is coarse-grained: the whole
> cluster is encrypted, so there's no need to worry about encrypting
> data differently for different tables.
>
> > Also the advantage of having multiple tablespace keys
> > would be to make postgres enable to re-encryption without rebuilt
> > database cluster.
>
> I don't understand this part, sorry.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




walsender vs. XLogBackgroundFlush during shutdown

2019-04-30 Thread Tomas Vondra

Hi,

I see there's an ongoing discussion about race conditions in walreceiver
blocking shutdown, so let me start a new thread about a race condition in
walsender during shutdown.

The symptoms are rather simple - 'pg_ctl -m fast shutdown' gets stuck,
waiting for walsender processes to catch-up and terminate indefinitely.

The reason for that is pretty simple - the walsenders are doing logical
decoding, and during shutdown they're waiting for WalSndCaughtUp=true.
But per XLogSendLogical() this only happens if

   if (logical_decoding_ctx->reader->EndRecPtr >= GetFlushRecPtr())
   {
   WalSndCaughtUp = true;
   ...
   }

That is, we need to get beyond GetFlushRecPtr(). But that may never
happen, because there may be incomplete (and unflushed) page in WAL
buffers, not forced out by any transaction. So if there's a WAL record
overflowing to the unflushed page, the walsender will never catch up.

Now, this situation is apparently expected, because WalSndWaitForWal()
does this:

   /*
* If we're shutting down, trigger pending WAL to be written out,
* otherwise we'd possibly end up waiting for WAL that never gets
* written, because walwriter has shut down already.
*/
   if (got_STOPPING)
   XLogBackgroundFlush();

but unfortunately that does not actually do anything, because right at
the very beginning XLogBackgroundFlush() does this:

   /* back off to last completed page boundary */
   WriteRqst.Write -= WriteRqst.Write % XLOG_BLCKSZ;

That is, it intentionally ignores the incomplete page, which means the
walsender can't read the record and reach GetFlushRecPtr().

XLogBackgroundFlush() does this since (at least) 2007, so how come we
never had issues with this before?

Firstly, walsenders used for physical replication don't have this issue,
because they don't need to decode the WAL record overflowing to the
incomplete/unflushed page. So it seems only walsenders used for logical
decoding are vulnerable to this.

Secondly, this seems to happen only when a bit of WAL is generated just
at the right moment during shutdown. I initially ran into this issue
with the failover slots (i.e. the patch that was committed and reverted
in the 9.6 cycle, IIRC), which is doing exactly this - it's writing the
slot positions into WAL during shutdown. Which made it pretty trivial to
trigger this issue.

But I don't think we're safe without the failover slots patch, because
any output plugin can do something similar - say, LogLogicalMessage() or
something like that. I'm not aware of a plugin doing such things, but I
don't think it's illegal / prohibited either. (Of course, plugins that
generate WAL won't be useful for decoding on standby in the future.)

So what I think we should do is to tweak XLogBackgroundFlush() so that
during shutdown it skips the back-off to page boundary, and flushes even
the last piece of WAL. There are only two callers anyway, so something
like XLogBackgroundFlush(bool) would be simple enough.
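In other words, a hand-wavy sketch only (the real function also handles the flush bookkeeping and walwriter hibernation), the proposed tweak amounts to something like:

    /* Sketch of the proposed signature change; not the actual xlog.c code. */
    void
    XLogBackgroundFlush(bool flush_partial_page)
    {
        XLogwrtRqst WriteRqst;

        /* ... existing code that computes how far WAL has been inserted ... */

        if (!flush_partial_page)
        {
            /* current behavior: back off to last completed page boundary */
            WriteRqst.Write -= WriteRqst.Write % XLOG_BLCKSZ;
        }

        /*
         * With flush_partial_page = true (the shutdown path in
         * WalSndWaitForWal), the trailing partial page is written and flushed
         * too, so the logical walsender can read the last record and finally
         * reach GetFlushRecPtr().
         */

        /* ... existing write/flush logic ... */
    }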


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Initializing LWLock Array from Server Code

2019-04-30 Thread Robert Haas
On Tue, Apr 30, 2019 at 5:52 PM Souvik Bhattacherjee  wrote:
> Thank you for your reply and sorry that I couldn't reply earlier.
> Since I didn't get any response within a couple of days, I took the longer 
> route -- changed the lwlock.h and lwlock.c
> for accommodating the lw locks for the shared hash table.
>
> I'll describe what I modified in the lwlock.h and lwlock.c and it seems to be 
> working fine. Although I haven't got an opportunity
> to test it extensively. If you could let me know if I missed out anything 
> that might cause problems later that would be great.

Well, I can't really vouch for your code on a quick look, but it
doesn't look insane.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Adding a test for speculative insert abort case

2019-04-30 Thread Andres Freund
Hi,

On 2019-04-30 17:15:55 -0700, Melanie Plageman wrote:
> Today, while poking around the table_complete_speculative code which Ashwin
> mentioned in [1], we were trying to understand when exactly we would complete
> a speculative insert by aborting.

(FWIW, it's on my todo queue to look at this)


> We added a logging message to heapam_tuple_complete_speculative before it
> calls heap_abort_speculative and ran the regression and isolation tests to
> see what test failed so we knew how to exercise this codepath.
> No tests failed, so we spent some time trying to understand when 'succeeded'
> would be true coming into heap_tuple_complete_speculative.
> 
> Eventually, we figured out that if one transaction speculatively inserts a
> tuple into a table with a unique index and then pauses before inserting the
> value into the index, and while it is paused, another transaction
> successfully inserts a value which would conflict with that value, it
> would result in an aborted speculative insertion.
> 
> t1(id,val)
> unique index t1(id)
> 
> s1: insert into t1 values(1, 'someval') on conflict(id) do update set val = 'someotherval';
> s1: pause in ExecInsert before calling ExecInsertIndexTuples
> s2: insert into t1 values(1, 'someval');
> s2: continue
> 
> We don't know of a way to add this scenario to the current isolation
> framework.

> Can anyone think of a good way to put this codepath under test?

Not easily so - that's why the ON CONFLICT patch didn't add code
coverage for it :(. I wonder if you could whip something up by having
another non-unique expression index, where the expression acquires an
advisory lock? If that advisory lock were previously acquired by
another session, that should make it possible to write a reliable isolation test?

Alternatively, as a fallback, there's a short pgbench test, I wonder if we
could just adapt that to use ON CONFLICT UPDATE?

Greetings,

Andres Freund




Adding a test for speculative insert abort case

2019-04-30 Thread Melanie Plageman
Today, while poking around the table_complete_speculative code which Ashwin
mentioned in [1], we were trying to understand when exactly we would complete
a speculative insert by aborting.

We added a logging message to heapam_tuple_complete_speculative before it
calls heap_abort_speculative and ran the regression and isolation tests to
see what test failed so we knew how to exercise this codepath.
No tests failed, so we spent some time trying to understand when 'succeeded'
would be true coming into heap_tuple_complete_speculative.

Eventually, we figured out that if one transaction speculatively inserts a
tuple into a table with a unique index and then pauses before inserting the
value into the index, and while it is paused, another transaction
successfully inserts a value which would conflict with that value, it would
result in an aborted speculative insertion.

t1(id,val)
unique index t1(id)

s1: insert into t1 values(1, 'someval') on conflict(id) do update set val = 'someotherval';
s1: pause in ExecInsert before calling ExecInsertIndexTuples
s2: insert into t1 values(1, 'someval');
s2: continue

We don't know of a way to add this scenario to the current isolation
framework.

Can anyone think of a good way to put this codepath under test?

- Melanie & Ashwin

[1]
https://www.postgresql.org/message-id/CALfoeitk7-TACwYv3hCw45FNPjkA86RfXg4iQ5kAOPhR%2BF1Y4w%40mail.gmail.com


Re: doc: improve PG 12 to_timestamp()/to_date() wording

2019-04-30 Thread Justin Pryzby
On Tue, Apr 30, 2019 at 09:48:14PM +0300, Alexander Korotkov wrote:
> I'd like to add couple of comments from my side.

> > -   returns an error because the second template string space is 
> > consumed
> > -   by the letter J in the input string.
> > +   returns an error because the second space in the template string 
> > consumes
> > +   the letter M from the input string.
> 
> Why M?  There is no letter "M" in the input string.
> The issue here is that we already consumed "J" from "JUN" and are trying
> to match "UN" to "MON".  So, I think we should leave
> J here.  The rest of this change looks good.

Seems like I confused myself while resolving rebase conflict.

Thanks for checking.

Justin




Re: clean up docs for v12

2019-04-30 Thread Andres Freund
Hi,

On 2019-04-08 09:18:28 -0500, Justin Pryzby wrote:
> From aae1a84b74436951222dba42b21de284ed8b1ac9 Mon Sep 17 00:00:00 2001
> From: Justin Pryzby 
> Date: Sat, 30 Mar 2019 17:24:35 -0500
> Subject: [PATCH v2 03/12] JIT typos..
> 
> ..which I sent to Andres some time ago and which I noticed were never applied
> (nor rejected).
> 
> https://www.postgresql.org/message-id/20181127184133.GM10913%40telsasoft.com
> ---
>  src/backend/jit/llvm/llvmjit_deform.c   | 22 +++---
>  src/backend/jit/llvm/llvmjit_inline.cpp |  2 +-
>  2 files changed, 12 insertions(+), 12 deletions(-)

I pushed these, minus the ones that were obsoleted by the slightly
larger changes resulting from the discussion of this patch.

Thanks for the patch!

Greetings,

Andres Freund




Re: clean up docs for v12

2019-04-30 Thread Andres Freund
Hi,

On 2019-04-22 14:17:48 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2019-04-22 13:27:17 -0400, Tom Lane wrote:
> >> I wonder
> >> also if it wouldn't be smart to explicitly check that the "guaranteeing"
> >> column is not attisdropped.
> 
> > Yea, that probably would be smart.  I don't think there's an active
> > problem, because we remove NOT NULL when deleting an attribute, but it
> > seems good to be doubly sure / explain why that's safe:
> > /* Remove any NOT NULL constraint the column may have */
> > attStruct->attnotnull = false;
> > I'm a bit unsure whether to make it an assert, elog(ERROR) or just not
> > assume column presence?
> 
> I'd just make the code look like
> 
> /*
>  * If it's NOT NULL then it must be present in every tuple,
>  * unless there's a "missing" entry that could provide a non-null
>  * value for it.  Out of paranoia, also check !attisdropped.
>  */
> if (att->attnotnull &&
> !att->atthasmissing &&
> !att->attisdropped)
> guaranteed_column_number = attnum;
> 
> I don't think the extra check is so expensive as to be worth obsessing
> over.

Pushed. Did so separately from Justin's changes, since this is a small
functional change.

Greetings,

Andres Freund




Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-04-30 Thread Andres Freund
Hi,

On 2019-04-30 18:42:36 -0400, Tom Lane wrote:
> markhor just reported in with results showing that we have worse
> problems than deadlock-prone tests in the back branches: 9.4
> for example looks like

>   -- whole tables
>   REINDEX TABLE pg_class; -- mapped, non-shared, critical
> + ERROR:  could not read block 0 in file "base/16384/27769": read only 0 of 
> 8192 bytes

Ugh.  Also failed on 9.6.


> Given this, I'm rethinking my position that we can dispense with these
> test cases.  Let's try putting them in a standalone test script, and
> see whether that leads to failures or not.  Even if it does, we'd
> better keep them until we've got a fully clean bill of health from
> the buildfarm.

Yea. Seems likely this indicates a proper, distinct, bug :/

I'll move the test into a new "reindex_catalog" test, with a comment
explaining that the failure cases necessitating that are somewhere
between bugs, ugly warts, and hard-to-fix edge cases.

Greetings,

Andres Freund




Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-04-30 Thread Tom Lane
Just when you thought it was safe to go back in the water ...

markhor just reported in with results showing that we have worse
problems than deadlock-prone tests in the back branches: 9.4
for example looks like

  --
  -- whole tables
  REINDEX TABLE pg_class; -- mapped, non-shared, critical
+ ERROR:  could not read block 0 in file "base/16384/27769": read only 0 of 
8192 bytes
  REINDEX TABLE pg_index; -- non-mapped, non-shared, critical
+ ERROR:  could not read block 0 in file "base/16384/27769": read only 0 of 
8192 bytes
  REINDEX TABLE pg_operator; -- non-mapped, non-shared, critical
+ ERROR:  could not read block 0 in file "base/16384/27769": read only 0 of 
8192 bytes
  REINDEX TABLE pg_database; -- mapped, shared, critical
+ ERROR:  could not read block 0 in file "base/16384/27769": read only 0 of 
8192 bytes
  REINDEX TABLE pg_shdescription; -- mapped, shared non-critical
+ ERROR:  could not read block 0 in file "base/16384/27769": read only 0 of 
8192 bytes
  -- Check that individual system indexes can be reindexed. That's a bit
  -- different from the entire-table case because reindex_relation
  -- treats e.g. pg_class special.
  REINDEX INDEX pg_class_oid_index; -- mapped, non-shared, critical
+ ERROR:  could not read block 0 in file "base/16384/27769": read only 0 of 
8192 bytes
  REINDEX INDEX pg_class_relname_nsp_index; -- mapped, non-shared, non-critical
+ ERROR:  could not read block 0 in file "base/16384/27769": read only 0 of 
8192 bytes
  REINDEX INDEX pg_index_indexrelid_index; -- non-mapped, non-shared, critical
+ ERROR:  could not read block 0 in file "base/16384/27769": read only 0 of 
8192 bytes
  REINDEX INDEX pg_index_indrelid_index; -- non-mapped, non-shared, non-critical
+ ERROR:  could not read block 0 in file "base/16384/27769": read only 0 of 
8192 bytes
  REINDEX INDEX pg_database_oid_index; -- mapped, shared, critical
+ ERROR:  could not read block 0 in file "base/16384/27769": read only 0 of 
8192 bytes
  REINDEX INDEX pg_shdescription_o_c_index; -- mapped, shared, non-critical
+ ERROR:  could not read block 0 in file "base/16384/27769": read only 0 of 
8192 bytes

No doubt this is triggered by CLOBBER_CACHE_ALWAYS.

Given this, I'm rethinking my position that we can dispense with these
test cases.  Let's try putting them in a standalone test script, and
see whether that leads to failures or not.  Even if it does, we'd
better keep them until we've got a fully clean bill of health from
the buildfarm.

regards, tom lane




Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-04-30 Thread Tom Lane
Andres Freund  writes:
> I'm not wild to go for a separate TAP test. A separate initdb cycle for
> a test that takes about 30ms seems a bit over the top.

Fair enough.

> So I'm
> inclined to either try running it in a serial step on the buildfarm
> (survived a few dozen cycles with -DRELCACHE_FORCE_RELEASE
> -DCATCACHE_FORCE_RELEASE, and a few with -DCLOBBER_CACHE_ALWAYS), or
> just remove them altogether. Or remove it altogether until we fix
> this.  Since you indicated a preference against the former, I'll remove
> it in a bit until I hear otherwise.

> I'll add it to my todo list to try to fix the concurrency issues for 13.

If you're really expecting to have a go at that during the v13 cycle,
I think we could live without these test cases till then.

regards, tom lane




Re: Calling PrepareTempTablespaces in BufFileCreateTemp

2019-04-30 Thread Tom Lane
I wrote:
> So maybe a reasonable compromise is to add the Assert(s) in fd.c as
> per previous patch, but *also* add PrepareTempTablespaces in
> BufFileCreateTemp, so that at least users of buffile.c are insulated
> from the issue.  buffile.c is still kind of low-level, but it's not
> part of core infrastructure in the same way as fd.c, so probably I could
> hold my nose for this solution from the system-structural standpoint.

Actually, after digging around in the related code some more, I'm having
second thoughts about those Asserts.  PrepareTempTablespaces is pretty
clear about what it thinks the contract is:

/*
 * Can't do catalog access unless within a transaction.  This is just a
 * safety check in case this function is called by low-level code that
 * could conceivably execute outside a transaction.  Note that in such a
 * scenario, fd.c will fall back to using the current database's default
 * tablespace, which should always be OK.
 */
if (!IsTransactionState())
return;

If we just add the discussed assertions and leave this bit alone,
the net effect would be that any tempfile usage outside a transaction
would suffer an assertion failure, *even if* it had called
PrepareTempTablespaces.  There doesn't seem to be any such usage in
the core code, but do we really want to forbid the case?  It seems
like fd.c shouldn't be imposing such a restriction, if it never has
before.

So now I'm feeling more favorable about the idea of adding a
PrepareTempTablespaces call to BufFileCreateTemp, and just stopping
with that.  If we want to do more, I feel like it requires a
significant amount of rethinking about what the expectations are for
fd.c, and some rejiggering of PrepareTempTablespaces's API too.
I'm not sufficiently excited about this issue to do that.
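
A minimal sketch of what that might look like (illustrative only, not an
actual patch; the rest of BufFileCreateTemp's body is assumed unchanged):

BufFile *
BufFileCreateTemp(bool interXact)
{
    /*
     * Populate the temp-tablespace cache up front.  Per the comment quoted
     * above, this is harmless outside a transaction: PrepareTempTablespaces
     * simply returns, and fd.c falls back to the database's default
     * tablespace.
     */
    PrepareTempTablespaces();

    /* ... existing body: OpenTemporaryFile(interXact), wrap the fd ... */
}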

regards, tom lane




Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-04-30 Thread Andres Freund
Hi,

On 2019-04-30 15:11:43 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2019-04-30 14:41:00 -0400, Tom Lane wrote:
> > > On 2019-04-30 12:03:08 -0700, Andres Freund wrote:
> > > > This is a pretty finnicky area of the code, with obviously not enough
> > > > test coverage.  I'm inclined to remove them from the back branches, and
> > > > try to get them working in master?
> > >
> > > I think trying to get this "working" is a v13 task now.  We've obviously
> > > never tried to stress the case before, so you're neither fixing a
> > > regression nor fixing a new-in-v12 issue.
> 
> > Well, the tests *do* test that a previously existing all-branches bug
> > doesn't exist, no (albeit one just triggering an assert)?  I'm not
> > talking about making this concurrency safe, just about whether it's
> > possible to somehow keep the tests.
> 
> Well, I told you what I thought was a safe way to run the tests.

Shrug. I was responding to you talking about "neither fixing a
regression nor fixing a new-in-v12 issue", when I explicitly was talking
about tests for the bug this thread is about. Not sure why "Well, I told
you what I thought was a safe way to run the tests." is a helpful answer
in turn.

I'm not wild to go for a separate TAP test. A separate initdb cycle for
a test that takes about 30ms seems a bit over the top.  So I'm
inclined to either try running it in a serial step on the buildfarm
(survived a few dozen cycles with -DRELCACHE_FORCE_RELEASE
-DCATCACHE_FORCE_RELEASE, and a few with -DCLOBBER_CACHE_ALWAYS), or
just remove them altogether. Or remove it altogether until we fix
this.  Since you indicated a preference against the former, I'll remove
it in a bit until I hear otherwise.

I'll add it to my todo list to try to fix the concurrency issues for 13.

Greetings,

Andres Freund




Re: Initializing LWLock Array from Server Code

2019-04-30 Thread Souvik Bhattacherjee
Hi Robert,

Thank you for your reply and sorry that I couldn't reply earlier.
Since I didn't get any response within a couple of days, I took the longer
route -- I changed lwlock.h and lwlock.c to accommodate the LWLocks for the
shared hash table.

I'll describe what I modified in lwlock.h and lwlock.c; it seems to be working
fine, although I haven't had an opportunity to test it extensively. If you
could let me know whether I missed anything that might cause problems later,
that would be great.

In lwlock.h, the following modifications were made:

/* the number of partitions of the shared hash table */
#define NUM_FILTERHASH_PARTITIONS  64

/* Offsets for various chunks of preallocated lwlocks. */
#define BUFFER_MAPPING_LWLOCK_OFFSET        NUM_INDIVIDUAL_LWLOCKS
#define LOCK_MANAGER_LWLOCK_OFFSET          \
    (BUFFER_MAPPING_LWLOCK_OFFSET + NUM_BUFFER_PARTITIONS)
#define PREDICATELOCK_MANAGER_LWLOCK_OFFSET \
    (LOCK_MANAGER_LWLOCK_OFFSET + NUM_LOCK_PARTITIONS)

/* offset for shared filterhash lwlocks in the MainLWLockArray */
#define FILTERHASH_OFFSET                   \
    (PREDICATELOCK_MANAGER_LWLOCK_OFFSET + NUM_PREDICATELOCK_PARTITIONS)
/* modified NUM_FIXED_LWLOCKS */
#define NUM_FIXED_LWLOCKS                   \
    (FILTERHASH_OFFSET + NUM_FILTERHASH_PARTITIONS)

/* added LWTRANCHE_FILTERHASH in the BuiltinTrancheIds */
typedef enum BuiltinTrancheIds
{
LWTRANCHE_CLOG_BUFFERS = NUM_INDIVIDUAL_LWLOCKS,
LWTRANCHE_COMMITTS_BUFFERS,
LWTRANCHE_SUBTRANS_BUFFERS,
LWTRANCHE_MXACTOFFSET_BUFFERS,
LWTRANCHE_MXACTMEMBER_BUFFERS,
LWTRANCHE_ASYNC_BUFFERS,
LWTRANCHE_OLDSERXID_BUFFERS,
LWTRANCHE_WAL_INSERT,
LWTRANCHE_BUFFER_CONTENT,
LWTRANCHE_BUFFER_IO_IN_PROGRESS,
LWTRANCHE_REPLICATION_ORIGIN,
LWTRANCHE_REPLICATION_SLOT_IO_IN_PROGRESS,
LWTRANCHE_PROC,
LWTRANCHE_BUFFER_MAPPING,
LWTRANCHE_LOCK_MANAGER,
LWTRANCHE_PREDICATE_LOCK_MANAGER,
LWTRANCHE_FILTERHASH,
LWTRANCHE_PARALLEL_HASH_JOIN,
LWTRANCHE_PARALLEL_QUERY_DSA,
LWTRANCHE_SESSION_DSA,
LWTRANCHE_SESSION_RECORD_TABLE,
LWTRANCHE_SESSION_TYPMOD_TABLE,
LWTRANCHE_SHARED_TUPLESTORE,
LWTRANCHE_TBM,
LWTRANCHE_PARALLEL_APPEND,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;


In lwlock.c, the following modifications were made:

In function, static void InitializeLWLocks(void), the following lines were
added:

/* Initialize filterhash LWLocks in main array */
lock = MainLWLockArray + NUM_INDIVIDUAL_LWLOCKS +
    NUM_BUFFER_PARTITIONS + NUM_LOCK_PARTITIONS +
    NUM_FILTERHASH_PARTITIONS;
for (id = 0; id < NUM_FILTERHASH_PARTITIONS; id++, lock++)
    LWLockInitialize(&lock->lock, LWTRANCHE_FILTERHASH);

In function void RegisterLWLockTranches(void), the following line was added:

LWLockRegisterTranche(LWTRANCHE_FILTERHASH, "filter_hash");

All of this was done after allocating the required space in shared memory.

Thanks,
-SB


On Mon, Apr 29, 2019 at 1:59 PM Robert Haas  wrote:

> On Fri, Apr 26, 2019 at 2:58 PM Souvik Bhattacherjee 
> wrote:
> > I have created a shared hash table in partitioned mode inside the
> postgres server code. In order to guard the partitions, I'm trying to
> initialize an array of LWLocks. The code that I'm trying to use for that is
> >
> > void RequestNamedLWLockTranche(const char *tranche_name, int
> num_lwlocks);
> > LWLockPadded *GetNamedLWLockTranche(const char *tranche_name);
> >
> > I'm not sure where exactly should this code be called from the server
> code. So I had placed it in
> >
> > void CreateSharedMemoryAndSemaphores(bool makePrivate, int port);
> >
> > within ipic.c. However, I'm getting the following error message when
> starting the database:
> >
> > FATAL:  requested tranche is not registered
> >
> > So at this point, I'm a little confused as to where the methods should
> be called from inside the server code. Any pointers would be appreciated.
>
> RequestNamedLWLockTranche() changes the behavior of LWLockShmemSize()
> and CreateLWLocks(), so must be called before either of those.
> GetNamedLWLockTranche() must be called after CreateLWLocks().
>
> This machinery works for pg_stat_statements, so see that as an example
> of how to make this work from an extension.  If you're hacking the
> core server code, then look at the places where the corresponding bits
> of pg_stat_statements code get called.  IIRC, _PG_init() gets called
> from process_shared_preload_libraries(), so you might look at the
> placement of the calls to that function.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
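
For reference, a rough sketch of the extension-style pattern Robert describes,
modelled on pg_stat_statements (the "filter_hash" tranche name, the startup
hook function, and the local NUM_FILTERHASH_PARTITIONS define are illustrative
assumptions, not part of any existing code):

#include "postgres.h"
#include "fmgr.h"
#include "miscadmin.h"
#include "storage/ipc.h"
#include "storage/lwlock.h"

PG_MODULE_MAGIC;

#define NUM_FILTERHASH_PARTITIONS 64    /* as in the core patch above */

static shmem_startup_hook_type prev_shmem_startup_hook = NULL;
static LWLockPadded *filterhash_locks = NULL;

static void
filterhash_shmem_startup(void)
{
    if (prev_shmem_startup_hook)
        prev_shmem_startup_hook();

    /* Only valid once CreateLWLocks() has run. */
    filterhash_locks = GetNamedLWLockTranche("filter_hash");
    /* ... index filterhash_locks[i].lock per hash-table partition ... */
}

void
_PG_init(void)
{
    if (!process_shared_preload_libraries_in_progress)
        return;

    /* Must run before shared memory is sized and created. */
    RequestNamedLWLockTranche("filter_hash", NUM_FILTERHASH_PARTITIONS);

    prev_shmem_startup_hook = shmem_startup_hook;
    shmem_startup_hook = filterhash_shmem_startup;
}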


Re: Calling PrepareTempTablespaces in BufFileCreateTemp

2019-04-30 Thread Tom Lane
Melanie Plageman  writes:
> I also think that if there is a step that a caller should always take before
> calling a function, then there needs to be a very compelling reason not to
> move that step into the function itself.

Fair complaint.

> PrepareTempTablespaces should not be called in BufFileCreateTemp because
> it is not concerned with temp tablespaces.

Actually, my reason for thinking that was mostly "that won't fix the
problem, because what about other callers of OpenTemporaryFile?"

However, looking around, there aren't any others --- buffile.c is it.

So maybe a reasonable compromise is to add the Assert(s) in fd.c as
per previous patch, but *also* add PrepareTempTablespaces in
BufFileCreateTemp, so that at least users of buffile.c are insulated
from the issue.  buffile.c is still kind of low-level, but it's not
part of core infrastructure in the same way as fd.c, so probably I could
hold my nose for this solution from the system-structural standpoint.

regards, tom lane




Re: Calling PrepareTempTablespaces in BufFileCreateTemp

2019-04-30 Thread Melanie Plageman
On Fri, Apr 26, 2019 at 8:05 AM Tom Lane  wrote:

> The version that I posted left it to GetNextTempTableSpace to assert
> that.  That seemed cleaner to me than an Assert that has to depend
> on interXact.
>
Running `make check` with [1] applied and one of the calls to
PrepareTempTablespaces commented out, I felt like I deserved more as a
developer than the assertion in this case.

Assertions are especially good to protect against regressions, but, in this
case, I'm just trying to use an API that is being provided.

Assertions don't give me a nice, easy-to-understand test failure. I see that
there was a crash halfway through make check and now I have to figure out
why.

If that is the default way for developers to find out that they are missing
something when using the API, it would be nice if it gave me some sort of
understandable diff or error message.

I also think that if there is a step that a caller should always take before
calling a function, then there needs to be a very compelling reason not to
move that step into the function itself.

So, just to make sure I understand this case:

PrepareTempTablespaces should not be called in BufFileCreateTemp because it
is not concerned with temp tablespaces.

OpenTemporaryFile is concerned with temp tablespaces, so any reference to
those should be there.

However, PrepareTempTablespaces should not be called in OpenTemporaryFile
because it is in fd.c and no functions that make up part of the file
descriptor API should do catalog lookups.

Is this correct?

[1] https://www.postgresql.org/message-id/11777.1556133426%40sss.pgh.pa.us

-- 
Melanie Plageman


Sv: Re: ERROR: failed to add item to the index page

2019-04-30 Thread Andreas Joseph Krogh
On Tuesday 30 April 2019 at 21:26:52, Andres Freund <and...@anarazel.de> wrote:
> [...]
> The standard on pg lists is to write in a manner that's usable for both
> text mail readers and the archive. Doesn't terribly matter to the
> occasional one-off poster on -general, but you're not that...  So please
> try to write readable mails for the PG lists.
>
> Greetings,
>
> Andres Freund

ACK.
--
Andreas Joseph Krogh

Re: ERROR: failed to add item to the index page

2019-04-30 Thread Andres Freund
Hi,

On 2019-04-30 21:23:21 +0200, Andreas Joseph Krogh wrote:
> På tirsdag 30. april 2019 kl. 20:59:43, skrev Andres Freund 
>  >: [...]
>  Andreas, unfortunately your emails are pretty unreadable. Check the
>  quoted email, and the web archive:
> 
>  
> https://www.postgresql.org/message-id/VisenaEmail.41.51d7719d814a1f54.16a6f98a5e9%40tc7-visena
> 
>  Greetings,
> 
>  Andres Freund

> I know that the text-version is quite unreadable, especially when quoting. My 
> MUA is web-based and uses CKEditor for composing, and it doesn't care much to 
> try to format the text/plain version (I know because I've written it, yes and 
> have yet to fix the Re: Sv: Re: Sv: subject issue...). But it has tons of 
> benefits CRM- and usage-wise so I prefer to use it. But - how use text/plain 
> these days:-) --

The standard on pg lists is to write in a manner that's usable for both
text mail readers and the archive. Doesn't terribly matter to the
occasional one-off poster on -general, but you're not that...  So please
try to write readable mails for the PG lists.

Greetings,

Andres Freund




Re: ERROR: failed to add item to the index page

2019-04-30 Thread Andreas Joseph Krogh
På tirsdag 30. april 2019 kl. 20:59:43, skrev Andres Freund mailto:and...@anarazel.de>>: [...]
 Andreas, unfortunately your emails are pretty unreadable. Check the
 quoted email, and the web archive:

 
https://www.postgresql.org/message-id/VisenaEmail.41.51d7719d814a1f54.16a6f98a5e9%40tc7-visena

 Greetings,

 Andres Freund
I know that the text-version is quite unreadable, especially when quoting. My 
MUA is web-based and uses CKEditor for composing, and it doesn't care much to 
try to format the text/plain version (I know because I've written it, yes and 
have yet to fix the Re: Sv: Re: Sv: subject issue...). But it has tons of 
benefits CRM- and usage-wise so I prefer to use it. But - how use text/plain 
these days:-) --
 Andreas Joseph Krogh

Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-04-30 Thread Tom Lane
Andres Freund  writes:
> On 2019-04-30 14:41:00 -0400, Tom Lane wrote:
>> I think trying to get this "working" is a v13 task now.  We've obviously
>> never tried to stress the case before, so you're neither fixing a
>> regression nor fixing a new-in-v12 issue.

> Well, the tests *do* test that a previously existing all-branches bug
> doesn't exist, no (albeit one just triggering an assert)?  I'm not
> talking about making this concurrency safe, just about whether it's
> possible to somehow keep the tests.

Well, I told you what I thought was a safe way to run the tests.

regards, tom lane




Re: Turning off enable_partition_pruning doesn't

2019-04-30 Thread Tom Lane
Alvaro Herrera  writes:
> On 2019-Apr-30, Tom Lane wrote:
>> regression=# explain select * from p1 where a = 3; 

> But you're reading from the partition, not from the partitioned table ...

Argh!  Where's my brown paper bag?

regards, tom lane




Re: speeding up planning with partitions

2019-04-30 Thread Tom Lane
Amit Langote  writes:
> On Tue, Apr 30, 2019 at 1:26 PM Amit Langote  wrote:
>> It would be nice if at least we fix the bug that directly accessed
>> partitions are not excluded with constraint_exclusion = on, without
>> removing PG 11's contortions in relation_excluded_by_constraints to
>> work around the odd requirements imposed by inheritance_planner, which
>> is causing the additional diffs in the regression expected output.

> FWIW, attached is a delta patch that applies on top of your patch for
> v11 branch that shows what may be one way to go about this.

OK, I tweaked that a bit and pushed both versions.

regards, tom lane




Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-04-30 Thread Andres Freund
Hi,

On 2019-04-30 14:41:00 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2019-04-30 14:05:50 -0400, Tom Lane wrote:
> >> Possibly we could run them in a TAP test that configures a cluster
> >> with autovac disabled?
> 
> > Hm. Would it be sufficient to instead move them to a non-concurrent
> > test group, and stick a BEGIN; LOCK pg_class, ; COMMIT; around it?
> 
> Doubt it.  Maybe you could get away with it given that autovacuum and
> autoanalyze only do non-transactional updates to pg_class, but that
> seems like a pretty shaky assumption.

I was pondering that autovacuum shouldn't play a role because it ought
to never cause a DELETE_IN_PROGRESS, because it shouldn't affect the
OldestXmin horizon. But that reasoning, even if correct, doesn't hold
for analyze, which (much to my chagrin) holds a full-blown snapshot.


> > This is a pretty finnicky area of the code, with obviously not enough
> > test coverage.  I'm inclined to remove them from the back branches, and
> > try to get them working in master?
> 
> I think trying to get this "working" is a v13 task now.  We've obviously
> never tried to stress the case before, so you're neither fixing a
> regression nor fixing a new-in-v12 issue.

Well, the tests *do* test that a previously existing all-branches bug
doesn't exist, no (albeit one just triggering an assert)?  I'm not
talking about making this concurrency safe, just about whether it's
possible to somehow keep the tests.

Greetings,

Andres Freund




Re: ERROR: failed to add item to the index page

2019-04-30 Thread Andres Freund
Hi,

On 2019-04-30 20:54:45 +0200, Andreas Joseph Krogh wrote:
> På tirsdag 30. april 2019 kl. 19:58:31, skrev Peter Geoghegan  >: On Tue, Apr 30, 2019 at 9:59 AM Peter Geoghegan 
>  wrote:
>  > I'll start investigating the problem right away.
> 
>  I have found what the problem is. I simply neglected to make a
>  conservative assumption about suffix truncation needing to add a heap
>  TID to a leaf page's new high key in nbtsort.c (following commit
>  dd299df8189), even though I didn't make the same mistake in
>  nbtsplitloc.c. Not sure how I managed to make such a basic error.
> 
>  Andreas' test case works fine with the attached patch. I won't push a
>  fix for this today.
> 
>  --
>  Peter Geoghegan Nice, thanks! --
>  Andreas Joseph Krogh

Andreas, unfortunately your emails are pretty unreadable. Check the
quoted email, and the web archive:

https://www.postgresql.org/message-id/VisenaEmail.41.51d7719d814a1f54.16a6f98a5e9%40tc7-visena

Greetings,

Andres Freund




Re: Re: ERROR: failed to add item to the index page

2019-04-30 Thread Peter Geoghegan
On Tue, Apr 30, 2019 at 11:54 AM Andreas Joseph Krogh
 wrote:
> Nice, thanks!

Thanks for the report!

-- 
Peter Geoghegan




Sv: Re: ERROR: failed to add item to the index page

2019-04-30 Thread Andreas Joseph Krogh
På tirsdag 30. april 2019 kl. 19:58:31, skrev Peter Geoghegan mailto:p...@bowt.ie>>: On Tue, Apr 30, 2019 at 9:59 AM Peter Geoghegan 
 wrote:
 > I'll start investigating the problem right away.

 I have found what the problem is. I simply neglected to make a
 conservative assumption about suffix truncation needing to add a heap
 TID to a leaf page's new high key in nbtsort.c (following commit
 dd299df8189), even though I didn't make the same mistake in
 nbtsplitloc.c. Not sure how I managed to make such a basic error.

 Andreas' test case works fine with the attached patch. I won't push a
 fix for this today.

 --
 Peter Geoghegan Nice, thanks! --
 Andreas Joseph Krogh

Match table_complete_speculative() code to comment

2019-04-30 Thread Ashwin Agrawal
Comment for table_complete_speculative() says

/*
 * Complete "speculative insertion" started in the same transaction. If
 * succeeded is true, the tuple is fully inserted, if false, it's removed.
 */
static inline void
table_complete_speculative(Relation rel, TupleTableSlot *slot,
                           uint32 specToken, bool succeeded)
{
    rel->rd_tableam->tuple_complete_speculative(rel, slot, specToken,
                                                succeeded);
}

but the code really treats succeeded as a failure flag. Since that argument is
passed as specConflict, true means a conflict happened and hence the tuple
should be removed. It would be better to fix the code to match the comment,
since at the AM layer it is cleaner for succeeded to mean "finish the
insertion" rather than the other way round.

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 4d179881f27..241639cfc20 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -282,7 +282,7 @@ heapam_tuple_complete_speculative(Relation relation, TupleTableSlot *slot,
 	HeapTuple	tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);
 
 	/* adjust the tuple's state accordingly */
-	if (!succeeded)
+	if (succeeded)
 		heap_finish_speculative(relation, &slot->tts_tid);
 	else
 		heap_abort_speculative(relation, &slot->tts_tid);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 444c0c05746..d545bbce8a2 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -556,7 +556,7 @@ ExecInsert(ModifyTableState *mtstate,
 
 			/* adjust the tuple's state accordingly */
 			table_complete_speculative(resultRelationDesc, slot,
-									   specToken, specConflict);
+									   specToken, !specConflict);
 
 			/*
 			 * Wake up anyone waiting for our decision.  They will re-check

- Ashwin and Melanie




Re: Turning off enable_partition_pruning doesn't

2019-04-30 Thread Alvaro Herrera
On 2019-Apr-30, Tom Lane wrote:

> regression=# explain select * from p1 where a = 3; 

But you're reading from the partition, not from the partitioned table ...
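
(Presumably the intended test was against the partitioned table itself, e.g.:

regression=# explain select * from p where a = 3;

which is where enable_partition_pruning would actually matter.)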

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: doc: improve PG 12 to_timestamp()/to_date() wording

2019-04-30 Thread Justin Pryzby
Hi Bruce

I saw this commit;
commit ad23adc5a169b114f9ff325932cbf2ce1c5e69c1
|Author: Bruce Momjian 
|Date:   Tue Apr 30 14:06:57 2019 -0400
|
|doc:  improve PG 12 to_timestamp()/to_date() wording

which cleans up language added at cf984672.

Can I suggest this additional change? It is updated and extracted from my
larger set of documentation fixes.

Justin

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 96fafdd..b420585 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -6400,20 +6400,20 @@ SELECT regexp_match('abc01234xyz', 
'(?:(.*?)(\d+)(.*)){1,1}');
   
   
If FX is specified, a separator in the template 
string
-   matches exactly one character in input string.  Notice we don't insist 
the
-   input string character be the same as the template string separator.
+   matches exactly one character in the input string.  But note that the
+   input string character is not required to be the same as the separator 
from the template string.
For example, to_timestamp('2000/JUN', 'FX MON')
works, but to_timestamp('2000/JUN', 
'FXMON')
-   returns an error because the second template string space is consumed
-   by the letter J in the input string.
+   returns an error because the second space in the template string 
consumes
+   the letter M from the input string.
   
  
 
  
   
A TZH template pattern can match a signed number.
-   Without the FX option, it can lead to ambiguity in
-   interpretation of the minus sign, which can also be interpreted as a 
separator.
+   Without the FX option, minus signs may be ambiguous,
+   and could be interpreted as a separator.
This ambiguity is resolved as follows:  If the number of separators 
before
TZH in the template string is less than the number of
separators before the minus sign in the input string, the minus sign
>From 0d304935aa5b83e56127b5a63aa3d180b4c42e16 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 30 Mar 2019 03:42:35 -0500
Subject: [PATCH v3] Clean up language in cf984672: Improve behavior of
 to_timestamp()/to_date() functions

---
 doc/src/sgml/func.sgml | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 96fafdd..b420585 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -6400,20 +6400,20 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
   
   
If FX is specified, a separator in the template string
-   matches exactly one character in input string.  Notice we don't insist the
-   input string character be the same as the template string separator.
+   matches exactly one character in the input string.  But note that the
+   input string character is not required to be the same as the separator from the template string.
For example, to_timestamp('2000/JUN', 'FX MON')
works, but to_timestamp('2000/JUN', 'FXMON')
-   returns an error because the second template string space is consumed
-   by the letter J in the input string.
+   returns an error because the second space in the template string consumes
+   the letter M from the input string.
   
  
 
  
   
A TZH template pattern can match a signed number.
-   Without the FX option, it can lead to ambiguity in
-   interpretation of the minus sign, which can also be interpreted as a separator.
+   Without the FX option, minus signs may be ambiguous,
+   and could be interpreted as a separator.
This ambiguity is resolved as follows:  If the number of separators before
TZH in the template string is less than the number of
separators before the minus sign in the input string, the minus sign
-- 
2.7.4



Turning off enable_partition_pruning doesn't

2019-04-30 Thread Tom Lane
I'd have thought that disabling enable_partition_pruning would,
um, disable partition pruning.  It does not:

regression=# create table p (a int) partition by list (a);
CREATE TABLE
regression=# create table p1 partition of p for values in (1);
CREATE TABLE
regression=# create table p2 partition of p for values in (2);
CREATE TABLE
regression=# explain select * from p1 where a = 3; 
 QUERY PLAN 

 Seq Scan on p1  (cost=0.00..41.88 rows=13 width=4)
   Filter: (a = 3)
(2 rows)

regression=# set enable_partition_pruning TO off;
SET
regression=# explain select * from p1 where a = 3;
 QUERY PLAN 

 Seq Scan on p1  (cost=0.00..41.88 rows=13 width=4)
   Filter: (a = 3)
(2 rows)


The fact that we fail to prune the first child is a separate issue
driven by some ruleutils.c limitations, cf
https://www.postgresql.org/message-id/flat/001001d4f44b$2a2cca50$7e865ef0$@lab.ntt.co.jp

My point here is that the second EXPLAIN should have shown scanning
both partitions, shouldn't it?

(v11 behaves the same as HEAD here; didn't try v10.)

regards, tom lane




Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-04-30 Thread Andres Freund
Hi,

On 2019-04-30 14:05:50 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > It's the lock-upgrade problem I theorized about
> > upthread. ReindexIndex(), via RangeVarCallbackForReindexIndex(), takes a
> > ShareLock on pg_class, and then goes on to upgrade to RowExclusiveLock
> > in RelationSetNewRelfilenode(). But at that time another session
> > obviously can already have the ShareLock and would also want to upgrade.
> 
> Hmm.  Note that this is totally independent of the deadlock mechanism
> I reported in my last message on this thread.

Yea :(


> > I'm not sure it's worth fixing this.
> 
> I am not sure it's even *possible* to fix all these cases.

I think it's worth fixing the most common ones though.  It sure sucks
that a plain REINDEX TABLE pg_class; isn't safe to run.


> Even if we could, it's out of scope for v12 let alone the back branches.

Unfortunately agreed. It's possible we could come up with a fix to
backpatch after maturing some, but certainly not before the release.


> I think the only practical solution is to remove those reindex tests.
> Even if we ran them in a script with no concurrent scripts, there'd
> be risk of failures against autovacuum, I'm afraid.  Not often, but
> often enough to be annoying.

> Possibly we could run them in a TAP test that configures a cluster
> with autovac disabled?

Hm. Would it be sufficient to instead move them to a non-concurrent
test group, and stick a BEGIN; LOCK pg_class, ; COMMIT; around it? I
think that ought to make it safe against autovacuum, and theoretically
there shouldn't be any overlapping pg_class/index updates that we'd need
to wait for?

This is a pretty finnicky area of the code, with obviously not enough
test coverage.  I'm inclined to remove them from the back branches, and
try to get them working in master?

Greetings,

Andres Freund




Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-04-30 Thread Tom Lane
Andres Freund  writes:
> It's the lock-upgrade problem I theorized about
> upthread. ReindexIndex(), via RangeVarCallbackForReindexIndex(), takes a
> ShareLock on pg_class, and then goes on to upgrade to RowExclusiveLock
> in RelationSetNewRelfilenode(). But at that time another session
> obviously can already have the ShareLock and would also want to upgrade.

Hmm.  Note that this is totally independent of the deadlock mechanism
I reported in my last message on this thread.

I also wonder whether clobber-cache testing would expose cases
we haven't seen that trace to the additional catalog accesses
caused by cache reloads.

> I'm not sure it's worth fixing this.

I am not sure it's even *possible* to fix all these cases.  Even
if we could, it's out of scope for v12 let alone the back branches.

I think the only practical solution is to remove those reindex tests.
Even if we ran them in a script with no concurrent scripts, there'd
be risk of failures against autovacuum, I'm afraid.  Not often, but
often enough to be annoying.

Possibly we could run them in a TAP test that configures a cluster
with autovac disabled?

regards, tom lane




Re: CHAR vs NVARCHAR vs TEXT performance

2019-04-30 Thread Steve Crawford
>
>
> > On Tue, Apr 30, 2019 at 5:44 AM Tom Lane  wrote:
> >> FWIW, my recommendation for this sort of thing is almost always
> >> to not use CHAR(n).  The use-case for that datatype pretty much
> >> disappeared with the last IBM Model 029 card punch.
> ...
>
>
>
Perhaps the "tip" on the character datatype page (
https://www.postgresql.org/docs/11/datatype-character.html) should be
updated as the statement "There is no performance difference among these
three types..." could easily lead a reader down the wrong path. The
statement may be true if one assumes the planner is able to make an optimal
choice but clearly there are cases that prevent that. If the situation is
better explained elsewhere in the documentation then just a link to that
explanation may be all that is needed.

Cheers,
Steve


Re: ERROR: failed to add item to the index page

2019-04-30 Thread Peter Geoghegan
On Tue, Apr 30, 2019 at 9:59 AM Peter Geoghegan  wrote:
> I'll start investigating the problem right away.

I have found what the problem is. I simply neglected to make a
conservative assumption about suffix truncation needing to add a heap
TID to a leaf page's new high key in nbtsort.c (following commit
dd299df8189), even though I didn't make the same mistake in
nbtsplitloc.c. Not sure how I managed to make such a basic error.

Andreas' test case works fine with the attached patch. I won't push a
fix for this today.

-- 
Peter Geoghegan


0001-Tentative-fix-for-nbtsort.c-space-bug.patch
Description: Binary data


Re: performance regression when filling in a table

2019-04-30 Thread Andres Freund
Hi,

On 2019-04-30 12:32:13 +0200, Fabien COELHO wrote:
> The effect is that the first generation seems to take more time, but
> dropping the table and regenerating again much less, with a typical 40%
> performance improvement between first and second run, independently of the
> version. The reported figures above were comparisons between first for pg12
> and second or later for pg11.

Yea, that's pretty normal. The likely difference is that in the repeated
case you'll have WAL files ready to be recycled. I'd assume that the
difference between the runs would be much smaller if you used unlogged
tables (or WAL on a ramdisk or somesuch).
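
A quick way to test that theory would be to take WAL out of the picture, e.g.
(assuming a pgbench-style fill):

  pgbench -i -s 100 --unlogged-tables

or an equivalent CREATE UNLOGGED TABLE for a hand-rolled data load.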


> Performances are quite unstable, with index generation on the same scale 100
> data taking anything from 6 to 15 seconds over runs.

How comparable are the runs? Are you restarting postgres in between?
Perform checkpoints?


> Doing a VACUUM and checksums interact badly: vacuum time jumps from 3
> seconds to 30 seconds:-(

Hm, that's more than I normally see. What exactly did you do there?

Greetings,

Andres Freund




Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-04-30 Thread Andres Freund
On 2019-04-30 11:51:10 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2019-04-30 00:50:20 -0400, Tom Lane wrote:
> > I suspect the problem isn't REINDEX INDEX in general, it's REINDEX INDEX
> > over catalog tables modified during reindex.
>
> So far, every one of the failures in the buildfarm looks like the REINDEX
> is deciding that it needs to wait for some other transaction, eg here
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion=2019-04-30%2014%3A43%3A11
>
> the relevant bit of postmaster log is
>
> 2019-04-30 14:44:13.478 UTC [16135:450] pg_regress/create_index LOG:  
> statement: REINDEX TABLE pg_class;
> 2019-04-30 14:44:14.478 UTC [16137:430] pg_regress/create_view LOG:  process 
> 16137 detected deadlock while waiting for AccessShareLock on relation 2662 of 
> database 16384 after 1000.148 ms
> 2019-04-30 14:44:14.478 UTC [16137:431] pg_regress/create_view DETAIL:  
> Process holding the lock: 16135. Wait queue: .
> 2019-04-30 14:44:14.478 UTC [16137:432] pg_regress/create_view STATEMENT:  
> DROP SCHEMA temp_view_test CASCADE;
> 2019-04-30 14:44:14.478 UTC [16137:433] pg_regress/create_view ERROR:  
> deadlock detected
> 2019-04-30 14:44:14.478 UTC [16137:434] pg_regress/create_view DETAIL:  
> Process 16137 waits for AccessShareLock on relation 2662 of database 16384; 
> blocked by process 16135.
>   Process 16135 waits for ShareLock on transaction 2875; blocked by 
> process 16137.
>   Process 16137: DROP SCHEMA temp_view_test CASCADE;
>   Process 16135: REINDEX TABLE pg_class;
> 2019-04-30 14:44:14.478 UTC [16137:435] pg_regress/create_view HINT:  See 
> server log for query details.
> 2019-04-30 14:44:14.478 UTC [16137:436] pg_regress/create_view STATEMENT:  
> DROP SCHEMA temp_view_test CASCADE;
>
> I haven't been able to reproduce this locally yet, but my guess is that
> the REINDEX wants to update some row that was already updated by the
> concurrent transaction, so it has to wait to see if the latter commits
> or not.  And, of course, waiting while holding AccessExclusiveLock on
> any index of pg_class is a Bad Idea (TM).  But I can't quite see why
> we'd be doing something like that during the reindex ...

I've reproduced something similar locally by running "REINDEX INDEX
pg_class_oid_index;" via pgbench. Fails over pretty much immediately.

It's the lock-upgrade problem I theorized about
upthread. ReindexIndex(), via RangeVarCallbackForReindexIndex(), takes a
ShareLock on pg_class, and then goes on to upgrade to RowExclusiveLock
in RelationSetNewRelfilenode(). But at that time another session
obviously can already have the ShareLock and would also want to upgrade.

The same problem exists with reindexing indexes on pg_index.

ReindexTable is also affected. It locks the table with ShareLock, but
then subsidiary routines upgrade to RowExclusiveLock. The way to fix it
would be a bit different than for ReindexIndex(), as the locking happens
via RangeVarGetRelidExtended() directly, rather than in the callback.

There's a somewhat related issue in the new REINDEX CONCURRENTLY. See
https://www.postgresql.org/message-id/20190430151735.wi52sxjvxsjvaxxt%40alap3.anarazel.de

Attached is a *hacky* prototype patch that fixes the issues for me. This
is *not* meant as an actual fix, just a demonstration.

Turns out it's not even sufficient to take a ShareRowExclusive for
pg_class. That prevents issues of concurrent REINDEX INDEX
pg_class_oid_index blowing up, but if one runs REINDEX INDEX
pg_class_oid_index; and REINDEX TABLE pg_class; (or just the latter)
concurrently it still blows up, albeit taking longer to do so.

The problem is that other codepaths expect to be able to hold an
AccessShareLock on pg_class, and multiple pg_class indexes
(e.g. catcache initialization which is easy to hit with -C, [1]). If we
were to want this concurrency safe, I think it requires an AEL on at
least pg_class for reindex (I think ShareRowExclusiveLock might suffice
for pg_index).

I'm not sure it's worth fixing this. It's crummy and somewhat fragile
that we'd have special locking rules for catalog tables. OTOH,
it really also sucks that a lone REINDEX TABLE pg_class; can deadlock
with another session doing nothing more than establishing a connection.

I guess it's not that common, and can be fixed by users by doing an
explicit BEGIN;LOCK pg_class;REINDEX TABLE pg_class;COMMIT;, but that's
not something anybody will know to do.

Pragmatically I don't think there's a meaningful difference between
holding a ShareLock on pg_class + AEL on one or more indexes, to holding
an AEL on pg_class. Just about every pg_class access is through an
index.

Greetings,

Andres Freund


[1]
#6  0x561dac7f9a36 in WaitOnLock (locallock=0x561dae101878, 
owner=0x561dae112ee8) at 
/home/andres/src/postgresql/src/backend/storage/lmgr/lock.c:1768
#7  0x561dac7f869e in LockAcquireExtended (locktag=0x7ffd7a128650, 
lockmode=1, sessionLock=false, dontWait=false, reportMemoryError=true,

Re: ERROR: failed to add item to the index page

2019-04-30 Thread Peter Geoghegan
On Tue, Apr 30, 2019 at 9:56 AM Andreas Joseph Krogh  wrote:
> I've sent you guys a link (Google Drive) off-list.

I'll start investigating the problem right away.

Thanks
-- 
Peter Geoghegan




Re: ERROR: failed to add item to the index page

2019-04-30 Thread Andreas Joseph Krogh
On Tuesday 30 April 2019 at 18:48:45, Peter Geoghegan <p...@bowt.ie> wrote:
> On Tue, Apr 30, 2019 at 9:47 AM Tom Lane  wrote:
> > Andreas Joseph Krogh  writes:
> > > I have a 1.4GB dump (only one table) which reliably reproduces this error.
> > > Shall I share it off-list? --
> >
> > That's awfully large :-(. How do you have in mind to transmit it?
>
> I've sent dumps that were larger than that by providing a Google drive
> link. Something like that should work reasonably well.

I've sent you guys a link (Google Drive) off-list.
--
Andreas Joseph Krogh

Re: ERROR: failed to add item to the index page

2019-04-30 Thread Peter Geoghegan
On Tue, Apr 30, 2019 at 9:47 AM Tom Lane  wrote:
> Andreas Joseph Krogh  writes:
> > I have a 1.4GB dump (only one table) which reliably reproduces this error.
> > Shall I share it off-list? --
>
> That's awfully large :-(.  How do you have in mind to transmit it?

I've sent dumps that were larger than that by providing a Google drive
link. Something like that should work reasonably well.

-- 
Peter Geoghegan




Re: ERROR: failed to add item to the index page

2019-04-30 Thread Tom Lane
Andreas Joseph Krogh  writes:
> I have a 1.4GB dump (only one table) which reliably reproduces this error.
> Shall I share it off-list? --

That's awfully large :-(.  How do you have in mind to transmit it?

Maybe you could write a short script that generates dummy data
to reproduce the problem?

regards, tom lane




Re: ERROR: failed to add item to the index page

2019-04-30 Thread Peter Geoghegan
On Tue, Apr 30, 2019 at 9:44 AM Andreas Joseph Krogh
 wrote:
> I have a 1.4GB dump (only one table) which reliably reproduces this error.
> Shall I share it off-list?

I would be quite interested in this, too, since there is a chance that
it's my bug.

-- 
Peter Geoghegan




Re: ERROR: failed to add item to the index page

2019-04-30 Thread Andreas Joseph Krogh
On Tuesday 30 April 2019 at 16:27:05, Andreas Joseph Krogh <andr...@visena.com> wrote:
> [snip] Yep, happens without --with-llvm also. I'll try to load only the
> necessary table(s) to reproduce.

I have a 1.4GB dump (only one table) which reliably reproduces this error.
Shall I share it off-list?
--
Andreas Joseph Krogh

Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-04-30 Thread Tom Lane
I wrote:
> I haven't been able to reproduce this locally yet, but my guess is that
> the REINDEX wants to update some row that was already updated by the
> concurrent transaction, so it has to wait to see if the latter commits
> or not.  And, of course, waiting while holding AccessExclusiveLock on
> any index of pg_class is a Bad Idea (TM).  But I can't quite see why
> we'd be doing something like that during the reindex ...

Ah-hah: the secret to making it reproducible is what prion is doing:
-DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE

Here's a stack trace from reindex's side:

#0  0x0033968e9223 in __epoll_wait_nocancel ()
at ../sysdeps/unix/syscall-template.S:82
#1  0x00787cb5 in WaitEventSetWaitBlock (set=0x22d52f0, timeout=-1, 
occurred_events=0x7ffc77117c00, nevents=1, 
wait_event_info=) at latch.c:1080
#2  WaitEventSetWait (set=0x22d52f0, timeout=-1, 
occurred_events=0x7ffc77117c00, nevents=1, 
wait_event_info=) at latch.c:1032
#3  0x007886da in WaitLatchOrSocket (latch=0x7f90679077f4, 
wakeEvents=, sock=-1, timeout=-1, 
wait_event_info=50331652) at latch.c:407
#4  0x0079993d in ProcSleep (locallock=, 
lockMethodTable=) at proc.c:1290
#5  0x00796ba2 in WaitOnLock (locallock=0x2200600, owner=0x2213470)
at lock.c:1768
#6  0x00798719 in LockAcquireExtended (locktag=0x7ffc77117f90, 
lockmode=, sessionLock=, 
dontWait=false, reportMemoryError=true, locallockp=0x0) at lock.c:1050
#7  0x007939b7 in XactLockTableWait (xid=2874, 
rel=, ctid=, 
oper=XLTW_InsertIndexUnique) at lmgr.c:658
#8  0x004d4841 in heapam_index_build_range_scan (
heapRelation=0x7f905eb3fcd8, indexRelation=0x7f905eb3c5b8, 
indexInfo=0x22d50c0, allow_sync=, anyvisible=false, 
progress=true, start_blockno=0, numblocks=4294967295, 
callback=0x4f8330 <_bt_build_callback>, callback_state=0x7ffc771184f0, 
scan=0x2446fb0) at heapam_handler.c:1527
#9  0x004f9db0 in table_index_build_scan (heap=0x7f905eb3fcd8, 
index=0x7f905eb3c5b8, indexInfo=0x22d50c0)
at ../../../../src/include/access/tableam.h:1437
#10 _bt_spools_heapscan (heap=0x7f905eb3fcd8, index=0x7f905eb3c5b8, 
indexInfo=0x22d50c0) at nbtsort.c:489
#11 btbuild (heap=0x7f905eb3fcd8, index=0x7f905eb3c5b8, indexInfo=0x22d50c0)
at nbtsort.c:337
#12 0x00547e33 in index_build (heapRelation=0x7f905eb3fcd8, 
indexRelation=0x7f905eb3c5b8, indexInfo=0x22d50c0, isreindex=true, 
parallel=) at index.c:2724
#13 0x00548b97 in reindex_index (indexId=2662, 
skip_constraint_checks=false, persistence=112 'p', options=0)
at index.c:3349
#14 0x005490f1 in reindex_relation (relid=, 
flags=5, options=0) at index.c:3592
#15 0x005ed295 in ReindexTable (relation=0x21e2938, options=0, 
concurrent=) at indexcmds.c:2422
#16 0x007b5f69 in standard_ProcessUtility (pstmt=0x21e2cf0, 
queryString=0x21e1f18 "REINDEX TABLE pg_class;", 
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, 
dest=0x21e2de8, completionTag=0x7ffc77118d80 "") at utility.c:790
#17 0x007b1689 in PortalRunUtility (portal=0x2247c38, pstmt=0x21e2cf0, 
isTopLevel=, setHoldSnapshot=, 
dest=0x21e2de8, completionTag=) at pquery.c:1175
#18 0x007b2611 in PortalRunMulti (portal=0x2247c38, isTopLevel=true, 
setHoldSnapshot=false, dest=0x21e2de8, altdest=0x21e2de8, 
completionTag=0x7ffc77118d80 "") at pquery.c:1328
#19 0x007b2eb0 in PortalRun (portal=0x2247c38, 
count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x21e2de8, 
altdest=0x21e2de8, completionTag=0x7ffc77118d80 "") at pquery.c:796
#20 0x007af2ab in exec_simple_query (
query_string=0x21e1f18 "REINDEX TABLE pg_class;") at postgres.c:1215

So basically, the problem here lies in trying to re-verify uniqueness
of pg_class's indexes --- there could easily be entries in pg_class that
haven't committed yet.

I don't think there's an easy way to make this not deadlock against
concurrent DDL.  For sure I don't want to disable the uniqueness
checks.

regards, tom lane




Re: Sv: Sv: Re: Sv: Re: ERROR: failed to add item to the index page

2019-04-30 Thread Andrew Dunstan


Please fix or abstain from using the MUA that produces this monstrosity
of a Subject: "Sv: Sv: Re: Sv: Re: ERROR: failed to add item to the
index page"


cheers


andrew


-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Failure in contrib test _int on loach

2019-04-30 Thread Teodor Sigaev

Hi!

So, if other hackers are agreed with my reasoning, the suggested fix is 
sufficient and can be committed.




Patch looks right, but I think that the comment should be improved in the following piece:

if (stack->blkno != GIST_ROOT_BLKNO &&
-   stack->parent->lsn < GistPageGetNSN(stack->page))
+   ((stack->parent->lsn < GistPageGetNSN(stack->page)) ||
+stack->retry_from_parent == true))
{
/*
 * Concurrent split detected. There's no guarantee that the

It is not only a concurrent split that can be detected here, and that was missed
long ago. But this patch seems like a good chance to change this comment.


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/




Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-04-30 Thread Tom Lane
Andres Freund  writes:
> On 2019-04-30 00:50:20 -0400, Tom Lane wrote:
>> Hm?  REINDEX INDEX is deadlock-prone by definition, because it starts
>> by opening/locking the index and then it has to open/lock the index's
>> table.  Every other operation locks tables before their indexes.

> We claim to have solved that:

Oh, okay ...

> I suspect the problem isn't REINDEX INDEX in general, it's REINDEX INDEX
> over catalog tables modified during reindex.

So far, every one of the failures in the buildfarm looks like the REINDEX
is deciding that it needs to wait for some other transaction, eg here

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion=2019-04-30%2014%3A43%3A11

the relevant bit of postmaster log is

2019-04-30 14:44:13.478 UTC [16135:450] pg_regress/create_index LOG:  
statement: REINDEX TABLE pg_class;
2019-04-30 14:44:14.478 UTC [16137:430] pg_regress/create_view LOG:  process 
16137 detected deadlock while waiting for AccessShareLock on relation 2662 of 
database 16384 after 1000.148 ms
2019-04-30 14:44:14.478 UTC [16137:431] pg_regress/create_view DETAIL:  Process 
holding the lock: 16135. Wait queue: .
2019-04-30 14:44:14.478 UTC [16137:432] pg_regress/create_view STATEMENT:  DROP 
SCHEMA temp_view_test CASCADE;
2019-04-30 14:44:14.478 UTC [16137:433] pg_regress/create_view ERROR:  deadlock 
detected
2019-04-30 14:44:14.478 UTC [16137:434] pg_regress/create_view DETAIL:  Process 
16137 waits for AccessShareLock on relation 2662 of database 16384; blocked by 
process 16135.
Process 16135 waits for ShareLock on transaction 2875; blocked by 
process 16137.
Process 16137: DROP SCHEMA temp_view_test CASCADE;
Process 16135: REINDEX TABLE pg_class;
2019-04-30 14:44:14.478 UTC [16137:435] pg_regress/create_view HINT:  See 
server log for query details.
2019-04-30 14:44:14.478 UTC [16137:436] pg_regress/create_view STATEMENT:  DROP 
SCHEMA temp_view_test CASCADE;

I haven't been able to reproduce this locally yet, but my guess is that
the REINDEX wants to update some row that was already updated by the
concurrent transaction, so it has to wait to see if the latter commits
or not.  And, of course, waiting while holding AccessExclusiveLock on
any index of pg_class is a Bad Idea (TM).  But I can't quite see why
we'd be doing something like that during the reindex ...

regards, tom lane




Heap lock levels for REINDEX INDEX CONCURRENTLY not quite right?

2019-04-30 Thread Andres Freund
Hi,

While looking at 
https://www.postgresql.org/message-id/20190430070552.jzqgcy4ihalx7nur%40alap3.anarazel.de
I noticed that

/*
 * ReindexIndex
 *  Recreate a specific index.
 */
void
ReindexIndex(RangeVar *indexRelation, int options, bool concurrent)
{
Oid indOid;
Oid heapOid = InvalidOid;
Relationirel;
charpersistence;

/*
 * Find and lock index, and check permissions on table; use callback to
 * obtain lock on table first, to avoid deadlock hazard.  The lock level
 * used here must match the index lock obtained in reindex_index().
 */
	indOid = RangeVarGetRelidExtended(indexRelation,
									  concurrent ? ShareUpdateExclusiveLock : AccessExclusiveLock,
									  0,
									  RangeVarCallbackForReindexIndex,
									  (void *) &heapOid);

doesn't pass concurrent-ness to RangeVarCallbackForReindexIndex(). Which
then goes on to lock the table

static void
RangeVarCallbackForReindexIndex(const RangeVar *relation,
Oid relId, Oid 
oldRelId, void *arg)

if (OidIsValid(*heapOid))
LockRelationOid(*heapOid, ShareLock);

without knowing that it should use ShareUpdateExclusive. Which
e.g. ReindexTable knows:

/* The lock level used here should match reindex_relation(). */
	heapOid = RangeVarGetRelidExtended(relation,
									   concurrent ? ShareUpdateExclusiveLock : ShareLock,
									   0,
									   RangeVarCallbackOwnsTable, NULL);

so there's a lock upgrade hazard.

Creating a table
CREATE TABLE blarg(id serial primary key);
and then using pgbench to reindex it:
REINDEX INDEX CONCURRENTLY blarg_pkey;

indeed proves that there's a problem:

2019-04-30 08:12:58.679 PDT [30844][7/925] ERROR:  40P01: deadlock detected
2019-04-30 08:12:58.679 PDT [30844][7/925] DETAIL:  Process 30844 waits for 
ShareUpdateExclusiveLock on relation 50661 of database 13408; blocked by 
process 30848.
Process 30848 waits for ShareUpdateExclusiveLock on relation 50667 of 
database 13408; blocked by process 30844.
Process 30844: REINDEX INDEX CONCURRENTLY blarg_pkey;
Process 30848: REINDEX INDEX CONCURRENTLY blarg_pkey;
2019-04-30 08:12:58.679 PDT [30844][7/925] HINT:  See server log for query 
details.
2019-04-30 08:12:58.679 PDT [30844][7/925] LOCATION:  DeadLockReport, 
deadlock.c:1140
2019-04-30 08:12:58.679 PDT [30844][7/925] STATEMENT:  REINDEX INDEX 
CONCURRENTLY blarg_pkey;

I assume the fix would be to pass a struct {LOCKMODE lockmode; Oid
heapOid;} to RangeVarCallbackForReindexIndex().
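
A rough sketch of that idea (struct and field names are illustrative, not
taken from any actual patch):

typedef struct
{
	LOCKMODE	lockmode;			/* lock level to take on the index's table */
	Oid			locked_table_oid;	/* replaces the bare heapOid */
} ReindexIndexCallbackState;

/* ReindexIndex() would then do something like: */
state.lockmode = concurrent ? ShareUpdateExclusiveLock : ShareLock;
state.locked_table_oid = InvalidOid;
indOid = RangeVarGetRelidExtended(indexRelation,
								  concurrent ? ShareUpdateExclusiveLock : AccessExclusiveLock,
								  0,
								  RangeVarCallbackForReindexIndex,
								  (void *) &state);

/*
 * ...and RangeVarCallbackForReindexIndex() would lock the table with
 * state->lockmode instead of the hard-coded ShareLock.
 */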

Greetings,

Andres Freund




Re: message style

2019-04-30 Thread Andres Freund
Hi,

On 2019-04-30 10:58:13 -0400, Alvaro Herrera wrote:
> I have these two message patches that I've been debating with myself
> about:
> 
> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c
> @@ -1282,7 +1282,7 @@ heap_getnext(TableScanDesc sscan, ScanDirection 
> direction)
>   if (unlikely(sscan->rs_rd->rd_tableam != GetHeapamTableAmRoutine()))
>   ereport(ERROR,
>   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> -  errmsg("only heap AM is supported")));
> +  errmsg("only heap table access method is 
> supported")));
>  
> 
> I think the original is not great

Agreed.


> but I'm not sure that the new is much
> better either.  I think this message says "only AMs that behave using
> the heapam routines are supported"; we cannot say use the literal
> "heapam" AM name because, as the comment two lines above says, it's
> possible to copy the AM with a different name and it would be
> acceptable.

I'm not sure that's something worth being bothered about - the only
reason to do that is for testing. I don't think that needs to be
reflected in error messages.


> OTOH maybe this code will not survive for long, so it
> doesn't matter much that the message is 100% correct; perhaps we should
> just change errmsg() to errmsg_internal() and be done with it.

I'd suspect some of them will survive for a while. What should a heap
specific pageinspect function do if not called for heap etc?


> diff --git a/src/backend/access/table/tableamapi.c 
> b/src/backend/access/table/tableamapi.c
> index 0053dc95cab..c8b7598f785 100644
> --- a/src/backend/access/table/tableamapi.c
> +++ b/src/backend/access/table/tableamapi.c
> @@ -103,7 +103,8 @@ check_default_table_access_method(char **newval, void 
> **extra, GucSource source)
>  {
>   if (**newval == '\0')
>   {
> - GUC_check_errdetail("default_table_access_method may not be 
> empty.");
> + GUC_check_errdetail("%s may not be empty.",
> + 
> "default_table_access_method");
>   return false;
>   }
> 
> My problem here is not really the replacement of the name to %s, but the
> "may not be" part of it.  We don't use "may not be" anywhere else; most
> places seem to use "foo cannot be X" and a small number of other places
> use "foo must not be Y".  I'm not able to judge which of the two is
> better (so change all messages to use that form), or if there's a
> semantic difference and if so which one to use in this case.

No idea about what's better here either. I don't think there's an
intentional semantic difference.

Thanks for looking at this!

Greetings,

Andres Freund




Re: "long" type is not appropriate for counting tuples

2019-04-30 Thread Tom Lane
Alvaro Herrera  writes:
> I don't know if anybody plans to do progress report for COPY, but I hope
> we don't find ourselves in a problem when some user claims that they are
> inserting more than 2^63 but less than 2^64 tuples.

At one tuple per nanosecond, it'd take a shade under 300 years to
reach 2^63.  Seems like a problem for our descendants to worry about.
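
(For the record, the back-of-envelope arithmetic:

    $$ \frac{2^{63}\ \text{ns}}{10^{9}\ \text{ns/s} \times 31\,557\,600\ \text{s/yr}} \approx 292\ \text{years} $$

i.e. a shade under 300 years at one tuple per nanosecond.)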

regards, tom lane




Re: "long" type is not appropriate for counting tuples

2019-04-30 Thread Alvaro Herrera
On 2019-Apr-30, David Rowley wrote:

> On Tue, 30 Apr 2019 at 06:28, Peter Geoghegan  wrote:
> >
> > On Mon, Apr 29, 2019 at 11:20 AM Alvaro Herrera
> >  wrote:
> > > Agreed.  Here's a patch.  I see downthread that you also discovered the
> > > same mistake in _h_indexbuild by grepping for "long"; I got to it by
> > > examining callers of pgstat_progress_update_param and
> > > pgstat_progress_update_multi_param.  I didn't find any other mistakes of
> > > the same ilk.  Some codesites use "double" instead of "int64", but those
> > > are not broken.
> >
> > This seems fine, though FWIW I probably would have gone with int64
> > instead of uint64. There is generally no downside to using int64, and
> > being able to support negative integers can be useful in some contexts
> > (though not this context).
> 
> CopyFrom() returns uint64. I think it's better to be consistent in the
> types we use to count tuples in commands.

That's not a bad argument ... but I still committed it as int64, mostly
because that's what pgstat_progress_update_param takes.  Anyway, these
are just local variables, not return values, so it's easily changeable
if we determine (??) that unsigned is better.

I don't know if anybody plans to do progress report for COPY, but I hope
we don't find ourselves in a problem when some user claims that they are
inserting more than 2^63 but less than 2^64 tuples.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Sv: Sv: Re: Sv: Re: ERROR: failed to add item to the index page

2019-04-30 Thread Andreas Joseph Krogh
On Tuesday 30 April 2019 at 16:03:04, Andreas Joseph Krogh <andr...@visena.com> wrote:
> On Tuesday 30 April 2019 at 15:53:50, Tom Lane <t...@sss.pgh.pa.us> wrote:
> > Andreas Joseph Krogh  writes:
> > > I built with this: make distclean && ./configure
> > > --prefix=$HOME/programs/postgresql-master --with-openssl --with-llvm && make -j
> > > 8 install-world-contrib-recurse install-world-doc-recurse
> >
> > --with-llvm, eh?  Does it reproduce without that?  What platform is
> > this on, what LLVM version?
> >
> > > I'll see if I can create a self contained example.
> >
> > Please.
> >
> > regards, tom lane
>
> Ubuntu 19.04
> $ llvm-config --version
> 8.0.0
>
> "--with-llvm" was something I had from when pg-11 was master. It might not be
> needed anymore? I'm trying a fresh build without --with-llvm and reload of data
> now.

Yep, happens without --with-llvm also. I'll try to load only the necessary
table(s) to reproduce.

--
Andreas Joseph Krogh

Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-30 Thread Alvaro Herrera
On 2019-Apr-30, John Naylor wrote:

> On Fri, Apr 26, 2019 at 11:52 AM Amit Kapila  wrote:
> > As discussed above, we need to issue an
> > invalidation for following points:  (a) when vacuum finds there is no
> > FSM and page has more space now, I think you can detect this in
> > RecordPageWithFreeSpace
> 
> I took a brief look and we'd have to know how much space was there
> before. That doesn't seem possible without first implementing the idea
> to save free space locally in the same way the FSM does. Even if we
> have consensus on that, there's no code for it, and we're running out
> of time.

Hmm ... so, if vacuum runs and frees up any space from any of the pages,
then it should send out an invalidation -- it doesn't matter what the
FSM had, just that there is more free space now.  That means every other
process will need to determine a fresh FSM, but that seems correct.
Sounds better than keeping outdated entries indicating no-space-available.
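
Concretely, something along these lines at the point where vacuum records
new free space for a relation below the FSM-creation threshold (a sketch of
the idea only; the helper and predicate names are made up, not the code in
the patch):

    /*
     * Hypothetical helper a vacuum path could call after freeing space on a
     * page of a small, FSM-less relation, so that other backends discard
     * whatever cached "no space available" state they hold for it.
     */
    static void
    FSMVacuumNotifyFreeSpace(Relation rel, Size spaceFreed)
    {
        /* rel_has_no_fsm() is a made-up stand-in for "small rel, no FSM file" */
        if (spaceFreed > 0 && rel_has_no_fsm(rel))
            CacheInvalidateRelcache(rel);   /* existing inval.c primitive */
    }

That keeps the decision local to the free-space code and reuses an existing
invalidation mechanism.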

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Sv: Re: Sv: Re: ERROR: failed to add item to the index page

2019-04-30 Thread Andreas Joseph Krogh
On Tuesday 30 April 2019 at 15:53:50, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Andreas Joseph Krogh  writes:
> > I built with this: make distclean && ./configure
> > --prefix=$HOME/programs/postgresql-master --with-openssl --with-llvm && make -j
> > 8 install-world-contrib-recurse install-world-doc-recurse
>
> --with-llvm, eh?  Does it reproduce without that?  What platform is
> this on, what LLVM version?
>
> > I'll see if I can create a self contained example.
>
> Please.
>
> regards, tom lane

Ubuntu 19.04
$ llvm-config --version
8.0.0

"--with-llvm" was something I had from when pg-11 was master. It might not be
needed anymore? I'm trying a fresh build without --with-llvm and reload of data
now.

--
Andreas Joseph Krogh

Re: Sv: Re: ERROR: failed to add item to the index page

2019-04-30 Thread Tom Lane
Andreas Joseph Krogh  writes:
> I built with this: make distclean && ./configure 
> --prefix=$HOME/programs/postgresql-master --with-openssl --with-llvm && make 
> -j 
> 8 install-world-contrib-recurse install-world-doc-recurse

--with-llvm, eh?  Does it reproduce without that?  What platform is
this on, what LLVM version?

> I'll see if I can create a self contained example. 

Please.

regards, tom lane




Sv: Re: ERROR: failed to add item to the index page

2019-04-30 Thread Andreas Joseph Krogh
On Tuesday 30 April 2019 at 15:43:16, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Andreas Joseph Krogh  writes:
> > With master (3dbb317d32f4f084ef7badaed8ef36f5c9b85fe6) I'm getting this:
> > visena=# CREATE INDEX origo_email_part_hdrvl_value_idx ON
> > public.origo_email_part_headervalue USING btree
> > (lower(substr((header_value)::text, 0, 1000)) varchar_pattern_ops);
> > psql: ERROR: failed to add item to the index page
>
> Hm, your example works for me on HEAD.
>
> Usually, the first thing to suspect when you're tracking HEAD and get
> bizarre failures is that you have a messed-up build.  Before spending
> any time diagnosing more carefully, do "make distclean", reconfigure,
> rebuild, reinstall, then see if problem is still there.
>
> (In theory, you can avoid this sort of failure with appropriate use
> of --enable-depend, but personally I don't trust that too much.
> I find that with ccache + autoconf cache + parallel build, rebuilding
> completely is fast enough that it's something I just do routinely
> after any git pull.  I'd rather use up my remaining brain cells on
> other kinds of problems...)
>
> regards, tom lane

I built with this: make distclean && ./configure
--prefix=$HOME/programs/postgresql-master --with-openssl --with-llvm && make -j
8 install-world-contrib-recurse install-world-doc-recurse

It's probably caused by the data:

visena=# select count(*) from origo_email_part_headervalue;
  count
----------
 14609516
(1 row)

I'll see if I can create a self contained example.

--
Andreas Joseph Krogh

Re: [PATCH v4] Add \warn to psql

2019-04-30 Thread Fabien COELHO




About v5: applies, compiles, global & local make check ok, doc gen ok.

Very minor comment: \qecho is just before \o in the embedded help, 
where it should be just after. Sorry I did not see it on the preceding 
submission.


> Unfortunately new TAP test doesn't pass on my machine. I'm not good at Perl
> and didn't get the reason of the failure quickly.


I guess that you have a verbose ~/.psqlrc.

Can you try adding -X to the psql options when calling psql from the TAP
test?


--
Fabien.




Re: ERROR: failed to add item to the index page

2019-04-30 Thread Tom Lane
Andreas Joseph Krogh  writes:
> With master (3dbb317d32f4f084ef7badaed8ef36f5c9b85fe6) I'm getting this: 
> visena=# CREATE INDEX origo_email_part_hdrvl_value_idx ON 
> public.origo_email_part_headervalue USING btree 
> (lower(substr((header_value)::text, 0, 1000)) varchar_pattern_ops);
>  psql: ERROR: failed to add item to the index page

Hm, your example works for me on HEAD.

Usually, the first thing to suspect when you're tracking HEAD and get
bizarre failures is that you have a messed-up build.  Before spending
any time diagnosing more carefully, do "make distclean", reconfigure,
rebuild, reinstall, then see if problem is still there.

(In theory, you can avoid this sort of failure with appropriate use
of --enable-depend, but personally I don't trust that too much.
I find that with ccache + autoconf cache + parallel build, rebuilding
completely is fast enough that it's something I just do routinely
after any git pull.  I'd rather use up my remaining brain cells on
other kinds of problems...)

regards, tom lane




Re: [Patch] Base backups and random or zero pageheaders

2019-04-30 Thread Michael Banck
Hi,

Am Mittwoch, den 27.03.2019, 11:37 +0100 schrieb Michael Banck:
> Am Dienstag, den 26.03.2019, 19:23 +0100 schrieb Michael Banck:
> > Am Dienstag, den 26.03.2019, 10:30 -0700 schrieb Andres Freund:
> > > On 2019-03-26 18:22:55 +0100, Michael Banck wrote:
> > > > /*
> > > > -* Only check pages which have not been 
> > > > modified since the
> > > > -* start of the base backup. Otherwise, 
> > > > they might have been
> > > > -* written only halfway and the 
> > > > checksum would not be valid.
> > > > -* However, replaying WAL would 
> > > > reinstate the correct page in
> > > > -* this case. We also skip completely 
> > > > new pages, since they
> > > > -* don't have a checksum yet.
> > > > +* We skip completely new pages after 
> > > > checking they are
> > > > +* all-zero, since they don't have a 
> > > > checksum yet.
> > > >  */
> > > > -   if (!PageIsNew(page) && 
> > > > PageGetLSN(page) < startptr)
> > > > +   if (PageIsNew(page))
> > > > {
> > > > -   checksum = 
> > > > pg_checksum_page((char *) page, blkno + segmentno * RELSEG_SIZE);
> > > > -   phdr = (PageHeader) page;
> > > > -   if (phdr->pd_checksum != 
> > > > checksum)
> > > > +   all_zeroes = true;
> > > > +   pagebytes = (size_t *) page;
> > > > +   for (int i = 0; i < (BLCKSZ / 
> > > > sizeof(size_t)); i++)
> > > 
> > > Can we please abstract the zeroeness check into a separate function to
> > > be used both by PageIsVerified() and this?
> > 
> > Ok, done so as PageIsZero further down in bufpage.c.
> 
> It turns out that pg_checksums (current master and back branches, not
> just the online version) needs this treatment as well as it won't catch
> zeroed-out pageheader corruption, see attached patch to its TAP tests
> which trigger it (I also added a random data check similar to
> pg_basebackup as well which is not a problem for the current codebase).
> 
> Any suggestion on how to handle this? Should I duplicate the
> PageIsZero() code in pg_checksums? Should I move PageIsZero into
> something like bufpage_impl.h for use by external programs, similar to
> pg_checksum_page()?
> 
> I've done the latter as a POC in the second attached patch.

This is still an open item for the back branches, I guess, i.e. the zero
page header case for pg_verify_checksums and additionally the random page
header case for pg_basebackup's base backup.

Do you plan to work on the patch you have outlined? What would I need to
change in the patches I submitted, or is another approach warranted
entirely?  Should I add my patches to the next commitfest in order to
track them?
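
For anyone skimming the thread, the helper in question is roughly this
shape (a sketch of my understanding, not the attached POC):

    /*
     * Return true if the page is entirely zeros.  Scans the page
     * word-by-word, the same way the pg_basebackup hunk quoted above does.
     */
    static inline bool
    PageIsZero(const char *page)
    {
        const size_t *pagewords = (const size_t *) page;

        for (int i = 0; i < BLCKSZ / sizeof(size_t); i++)
        {
            if (pagewords[i] != 0)
                return false;
        }
        return true;
    }

The main question is really just where it should live so that both the
backend and frontend tools like pg_checksums can share it.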


Michael

-- 
Michael Banck
Project Manager / Senior Consultant
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
VAT ID: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Managing directors: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to
the following policy: https://www.credativ.de/datenschutz




Re: performance regression when filling in a table

2019-04-30 Thread Fabien COELHO



Hello Andres,


  ## pg 11.2 done in 31.51 s
  ## pg 12devel (cd3e2746) real 0m38.695s

What change could explain such a significant performance regression?


I think the pre-release packages have had assertions enabled at some
point. I suggest checking that. If it's not that, profiles would be
helpful.


Thanks for the pointer.

After some more tests based on versions compiled from sources, the 
situation is different, and I was (maybe) mostly identifying another 
effect not related to postgres version.


The effect is that the first generation seems to take more time, but 
dropping the table and regenerating it again takes much less, with a typical 
40% performance improvement between the first and second run, independently 
of the version. The reported figures above were comparisons between a first 
run for pg12 and a second or later run for pg11.


So I was wrong, there is no significant performance regression per se, 
the two versions behave mostly the same.


I'm interested if someone has an explanation about why the first run is so 
bad or the others are so good. My wild guess is that there is some space 
reuse under the hood, although I do not know enough about the details to 
confirm.


A few relatively bad news nevertheless:

Performance is quite unstable, with index generation on the same scale-100 
data taking anything from 6 to 15 seconds across runs.


VACUUM and checksums interact badly: vacuum time jumps from 3 
seconds to 30 seconds :-(


--
Fabien.




Re: CHAR vs NVARCHAR vs TEXT performance

2019-04-30 Thread Rob

I agree in principle; however, in this particular scenario it's not
our schema, so we're a little reluctant to migrate the types etc.

We're in a bit of a bad place because the combination of NHibernate
+ npgsql3/4 + this table = seqScans everywhere. Basically, when npgsql
changed its default type for strings from VARCHAR to TEXT, it caused
this behaviour.

I suppose the follow-up question is: should drivers
default to sending types that are preferred by postgres (i.e. TEXT)
rather than compatible types (VARCHAR)? If so, is there a reason why
the JDBC driver doesn't send TEXT (possibly a question for the JDBC
guys rather than here)?

Thanks,
Rob

On 2019-04-30 00:16, Thomas Munro wrote:

On Tue, Apr 30, 2019 at 5:44 AM Tom Lane  wrote:

FWIW, my recommendation for this sort of thing is almost always
to not use CHAR(n).  The use-case for that datatype pretty much
disappeared with the last IBM Model 029 card punch.


+1 on the recommendation for PostgreSQL.

I do think it's useful on slightly more recent IBM technology than the
029 though.  It's been a few years since I touched it, but DB2 manuals
and experts in this decade recommended fixed size types in some
circumstances, and they might in theory be useful on any
in-place-update system (and maybe us in some future table AM?).  For
example, you can completely exclude the possibility of having to spill
to another page when updating (DB2 DBAs measure and complain about
rate of 'overflow' page usage which they consider failure and we
consider SOP), you can avoid wasting space on the length (at the cost
of wasting space on trailing spaces, if the contents vary in length),
you can get O(1) access to fixed sized attributes (perhaps even
updating single attributes).  These aren't nothing, and I've seen DB2
DBAs get TPS improvements from that kind of stuff.  (From memory this
type of thing was also a reason to think carefully about which tables
should use compression, because the fixed size space guarantees went
out the window.).






ERROR: tuple concurrently updated when modifying privileges

2019-04-30 Thread nickb
Hello, hackers

we witnessed this slightly misleading error in production and it took us a 
while to figure out what was taking place.
Below are reproduction steps:


-- setup
create table trun(cate int4);

-- session 1
begin;
truncate table trun;

-- session 2
grant insert on table trun to postgres;

-- session 1
end;

-- session 2:
ERROR: XX000: tuple concurrently updated
LOCATION: simple_heap_update, heapam.c:4474

Apparently the tuple in question is the pg_class entry of the table being 
truncated. I didn't look too deeply into the cause, but I'm certain that, at 
the very least, the error message could be improved.

Regards,
Nick.




ERROR: failed to add item to the index page

2019-04-30 Thread Andreas Joseph Krogh
With master (3dbb317d32f4f084ef7badaed8ef36f5c9b85fe6) I'm getting this:

visena=# CREATE INDEX origo_email_part_hdrvl_value_idx ON
public.origo_email_part_headervalue USING btree
(lower(substr((header_value)::text, 0, 1000)) varchar_pattern_ops);
psql: ERROR: failed to add item to the index page

The schema looks like this:

create table origo_email_part_headervalue (
    entity_id    BIGSERIAL PRIMARY KEY,
    version      int8 not null,
    header_value varchar NOT NULL,
    header_id    int8 references origo_email_part_header (entity_id),
    value_index  int NOT NULL DEFAULT 0,
    UNIQUE (header_id, value_index)
);

CREATE INDEX origo_email_part_hdrvl_hdr_id_idx ON origo_email_part_headervalue (header_id);
CREATE INDEX origo_email_part_hdrvl_value_idx ON origo_email_part_headervalue
    (lower(substr(header_value, 0, 1000)) varchar_pattern_ops);

(haven't tried any other version so I'm not sure when this started to happen)

--
Andreas Joseph Krogh

Re: performance regression when filling in a table

2019-04-30 Thread Andres Freund
Hi,

On 2019-04-30 07:12:03 +0200, Fabien COELHO wrote:
> On my SSD Ubuntu laptop, with postgres-distributed binaries and unmodified
> default settings using local connections:

>   ## pg 11.2
>   > time pgbench -i -s 100
>   ...
>   done in 31.51 s
>   # (drop tables 0.00 s, create tables 0.01 s, generate 21.30 s, vacuum 3.32 
> s, primary keys 6.88 s).
>   # real 0m31.524s
> 
>   ## pg 12devel (cd3e2746)
>   > time pgbench -i -s 100
>   # done in 38.68 s
>   # (drop tables 0.00 s, create tables 0.02 s, generate 29.70 s, vacuum 2.92 
> s, primary keys 6.04 s).
>   real 0m38.695s
> 
> That is an overall +20% regression, and about 40% on the generate phase
> alone. This is not a fluke, repeating the procedure shows similar results.
> 
> Is it the same for other people out there, or is it only something related
> to my setup?
> 
> What change could explain such a significant performance regression?

I think the pre-release packages have had assertions enabled at some
point. I suggest checking that. If it's not that, profiles would be
helpful.

Greetings,

Andres Freund




Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-04-30 Thread Andres Freund
Hi,

On 2019-04-30 00:50:20 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On April 29, 2019 9:37:33 PM PDT, Tom Lane  wrote:
> >> Seems like putting reindexes of pg_class into a test script that runs
> >> in parallel with other DDL wasn't a hot idea.
> 
> > Saw that. Will try to reproduce (and if necessary either run separately or 
> > revert). But isn't that somewhat broken? They're not run in a transaction, 
> > so the locking shouldn't be deadlock prone.
> 
> Hm?  REINDEX INDEX is deadlock-prone by definition, because it starts
> by opening/locking the index and then it has to open/lock the index's
> table.  Every other operation locks tables before their indexes.

We claim to have solved that:

/*
 * ReindexIndex
 *  Recreate a specific index.
 */
void
ReindexIndex(RangeVar *indexRelation, int options, bool concurrent)


/*
 * Find and lock index, and check permissions on table; use callback to
 * obtain lock on table first, to avoid deadlock hazard.  The lock level
 * used here must match the index lock obtained in reindex_index().
 */
    indOid = RangeVarGetRelidExtended(indexRelation,
                                      concurrent ? ShareUpdateExclusiveLock : AccessExclusiveLock,
                                      0,
                                      RangeVarCallbackForReindexIndex,
                                      (void *) &heapOid);

and I don't see an obvious hole in the general implementation. Minus the
comment that code exists back to 9.4.

I suspect the problem isn't REINDEX INDEX in general, it's REINDEX INDEX
over catalog tables modified during reindex. The callback acquires a
ShareLock on the index's table, but the reindex *also* needs
a RowExclusiveLock on pg_class, etc. E.g. in RelationSetNewRelfilenode()
on pg_class, and on pg_index in index_build(). Which means there's a
lock-upgrade hazard (Share to RowExclusive - well, that's more a
side-grade, but still deadlock prone).

I can think of ways to fix that (e.g. if reindex is on pg_class or
index, use SHARE ROW EXCLUSIVE, rather than SHARE), but we'd probably
not want to backpatch that.
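
Sketching the first of those ideas (an assumption about its shape, not a
worked-out or tested patch):

    /*
     * Hypothetical helper: choose the heap lock level for REINDEX.  For
     * catalogs that reindex itself must write (pg_class, pg_index), take
     * ShareRowExclusiveLock up front so no Share -> RowExclusive upgrade is
     * needed later.
     */
    static LOCKMODE
    reindex_heap_lockmode(Oid heapOid, bool concurrent)
    {
        if (concurrent)
            return ShareUpdateExclusiveLock;

        if (heapOid == RelationRelationId || heapOid == IndexRelationId)
            return ShareRowExclusiveLock;

        return ShareLock;
    }

(ShareRowExclusiveLock is self-conflicting, so two such REINDEXes would queue
behind each other instead of deadlocking on the lock upgrade.)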

I'll try to reproduce tomorrow.

Greetings,

Andres Freund




Re: standby recovery fails (tablespace related) (tentative patch and discussion)

2019-04-30 Thread Paul Guo
I updated the original patch to

1) skip copydir() if either src path or dst parent path is missing in
dbase_redo(). Both missing cases seem to be possible. For the src path
missing case, mkdir_p() is meaningless. It seems that moving the directory
existence check step to dbase_redo() has less impact on other code.

2) Fixed dbase_desc(). Now the xlog output looks correct.

rmgr: Database  len (rec/tot): 42/42, tx: 486, lsn: 0/016386A8, prev 0/01638630,
  desc: CREATE copy dir base/1 to pg_tblspc/16384/PG_12_201904281/16386

rmgr: Database  len (rec/tot): 34/34, tx: 487, lsn: 0/01638EB8, prev 0/01638E40,
  desc: DROP dir pg_tblspc/16384/PG_12_201904281/16386
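
In case it helps review before looking at the attachment, the fix is
essentially to build real paths in dbase_desc(), along these lines (a sketch
of the idea only, not the exact patch):

    void
    dbase_desc(StringInfo buf, XLogReaderState *record)
    {
        char   *rec = XLogRecGetData(record);
        uint8   info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;

        if (info == XLOG_DBASE_CREATE)
        {
            xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;

            appendStringInfo(buf, "copy dir %s to %s",
                             GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id),
                             GetDatabasePath(xlrec->db_id, xlrec->tablespace_id));
        }
        else if (info == XLOG_DBASE_DROP)
        {
            xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) rec;

            appendStringInfo(buf, "dir %s",
                             GetDatabasePath(xlrec->db_id, xlrec->tablespace_id));
        }
    }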

I was not familiar with the TAP test details previously. I learned a lot
about how to test such a case from Kyotaro's patch series.

On Sun, Apr 28, 2019 at 3:33 PM Paul Guo  wrote:

>
> On Wed, Apr 24, 2019 at 4:14 PM Kyotaro HORIGUCHI <
> horiguchi.kyot...@lab.ntt.co.jp> wrote:
>
>> Mmm. I posted to wrong thread. Sorry.
>>
>> At Tue, 23 Apr 2019 16:39:49 +0900 (Tokyo Standard Time), Kyotaro
>> HORIGUCHI  wrote in <
>> 20190423.163949.36763221.horiguchi.kyot...@lab.ntt.co.jp>
>> > At Tue, 23 Apr 2019 13:31:58 +0800, Paul Guo  wrote
>> in 
>> > > Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this
>> database
>> > > create redo error, but I suspect some other kind of redo, which
>> depends on
>> > > the files under the directory (they are not copied since the
>> directory is
>> > > not created) and also cannot be covered by the invalid page mechanism,
>> > > could fail. Thanks.
>> >
>> > If recovery starts from just after tablespace creation, that's
>> > simple. The Symlink to the removed tablespace is already removed
>> > in the case. Hence server innocently create files directly under
>> > pg_tblspc, not in the tablespace. Finally all files that were
>> > supposed to be created in the removed tablespace are removed
>> > later in recovery.
>> >
>> > If recovery starts from recalling page in a file that have been
>> > in the tablespace, XLogReadBufferExtended creates one (perhaps
>> > directly in pg_tblspc as described above) and the files are
>> > removed later in recoery the same way to above. This case doen't
>> > cause FATAL/PANIC during recovery even in master.
>> >
>> > XLogReadBufferExtended@xlogutils.c
>> > | * Create the target file if it doesn't already exist.  This lets us
>> cope
>> > | * if the replay sequence contains writes to a relation that is later
>> > | * deleted.  (The original coding of this routine would instead
>> suppress
>> > | * the writes, but that seems like it risks losing valuable data if the
>> > | * filesystem loses an inode during a crash.  Better to write the data
>> > | * until we are actually told to delete the file.)
>> >
>> > So buffered access cannot be a problem for the reason above. The
>> > remaining possible issue is non-buffered access to files in
>> > removed tablespaces. This is what I mentioned upthread:
>> >
>> > me> but I haven't checked this covers all places where need the same
>> > me> treatment.
>>
>> RM_DBASE_ID is fixed by the patch.
>>
>> XLOG/XACT/CLOG/MULTIXACT/RELMAP/STANDBY/COMMIT_TS/REPLORIGIN/LOGICALMSG:
>>   - are not relevant.
>>
>> HEAP/HEAP2/BTREE/HASH/GIN/GIST/SEQ/SPGIST/BRIN/GENERIC:
>>   - Resources works on buffer is not affected.
>>
>> SMGR:
>>   - Both CREATE and TRUNCATE seems fine.
>>
>> TBLSPC:
>>   - We don't nest tablespace directories. No Problem.
>>
>> I don't find a similar case.
>
>
> I took some time digging into the related code. It seems that ignoring the
> case where the dst directory cannot be created directly
> should be fine, since the smgr redo code eventually creates the tablespace
> path by calling TablespaceCreateDbspace().
> What's more, I found some more issues.
>
> 1) The below error message is actually misleading.
>
> 2019-04-17 14:52:14.951 CST [23030] FATAL:  could not create directory
> "pg_tblspc/65546/PG_12_201904072/65547": No such file or directory
> 2019-04-17 14:52:14.951 CST [23030] CONTEXT:  WAL redo at 0/3011650 for
> Database/CREATE: copy dir 1663/1 to 65546/65547
>
> That should be due to dbase_desc(). It could be simply fixed following the
> code logic in GetDatabasePath().
>
> 2) It seems that src directory could be missing then
> dbase_redo()->copydir() could error out. For example,
>
> \!rm -rf /tmp/tbspace1
> \!mkdir /tmp/tbspace1
> \!rm -rf /tmp/tbspace2
> \!mkdir /tmp/tbspace2
> create tablespace tbs1 location '/tmp/tbspace1';
> create tablespace tbs2 location '/tmp/tbspace2';
> create database db1 tablespace tbs1;
> alter database db1 set tablespace tbs2;
> drop tablespace tbs1;
>
> Let's say the standby finishes all replay, but the redo lsn in pg_control
> still points at 'alter database', and then postgres is killed. Then, in
> theory, at startup dbase_redo()->copydir() will ERROR out, since 'drop
> tablespace tbs1' has removed the directories (and symlink) of tbs1. The
> simple code change below could fix that.
>
> diff 

Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-30 Thread John Naylor
On Fri, Apr 26, 2019 at 11:52 AM Amit Kapila  wrote:
> As discussed above, we need to issue an
> invalidation for following points:  (a) when vacuum finds there is no
> FSM and page has more space now, I think you can detect this in
> RecordPageWithFreeSpace

I took a brief look and we'd have to know how much space was there
before. That doesn't seem possible without first implementing the idea
to save free space locally in the same way the FSM does. Even if we
have consensus on that, there's no code for it, and we're running out
of time.

> (b) invalidation to notify the existence of
> FSM, this can be done both by vacuum and backend.

I still don't claim to be anything but naive in this area, but does
the attached get us any closer?

-- 
John Naylor                   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


clear-upon-fsm-creation.patch
Description: Binary data


Re: generate documentation keywords table automatically

2019-04-30 Thread Peter Eisentraut
On 2019-04-29 21:19, Chapman Flack wrote:
> If the reorganization happening in this thread were to make possible
> run-time-enumerable keyword lists that could be filtered for SQL92ness
> or SQL:2003ness, that might relieve an implementation headache that,
> at present, both PgJDBC and PL/Java have to deal with.

Good information, but probably too big of a change at this point.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services