Re: [HACKERS] FSM corruption leading to errors

2016-10-19 Thread Pavan Deolasee
On Thu, Oct 20, 2016 at 11:34 AM, Michael Paquier  wrote:

> On Thu, Oct 20, 2016 at 2:50 PM, Pavan Deolasee
>  wrote:
> > Actually, if we could add an API which can truncate FSM to the given heap
> > block, then the user may not even need to run VACUUM, which could be costly
> > for very large tables.
>
> FreeSpaceMapTruncateRel()?
>

Right. We need a SQL-callable function to invoke that.
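
For illustration, such a wrapper might be declared along these lines in a
contrib extension script. This is only a sketch: the name pg_truncate_fsm
and its packaging are hypothetical, and the underlying C function (which
would open the relation, call FreeSpaceMapTruncateRel(), and close it
again) is not shown.

    -- hypothetical declaration; no such function exists yet
    CREATE FUNCTION pg_truncate_fsm(rel regclass)
    RETURNS void
    AS 'MODULE_PATHNAME', 'pg_truncate_fsm'
    LANGUAGE C STRICT;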


>
> > Also, AFAICS we will need to backport
> > pg_truncate_visibility_map() to older releases because unless the VM is
> > truncated along with the FSM, VACUUM may not scan all pages and the FSM for
> > those pages won't be recorded.
>
> This would need a careful lookup, and it would not be that complicated
> to implement. And this bug justifies the presence of a tool like that
> actually... But usually those don't get backpatched, so the
> probability of getting that backported is low IMO; personally, I am not
> sure I like that either.
>

Just to clarify, I meant that if we truncate the entire FSM then we'll need
an API to truncate the VM as well, so that VACUUM rebuilds everything
completely. OTOH, if we provide a function just to truncate the FSM to match
the size of the table, then we don't need to rebuild the FSM. So that's
probably a better way to handle FSM corruption, as far as this particular
issue is concerned.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] FSM corruption leading to errors

2016-10-19 Thread Michael Paquier
On Thu, Oct 20, 2016 at 2:50 PM, Pavan Deolasee
 wrote:
> Actually, if we could add an API which can truncate FSM to the given heap
> block, then the user may not even need to run VACUUM, which could be costly
> for very large tables.

FreeSpaceMapTruncateRel()?

> Also, AFAICS we will need to backport
> pg_truncate_visibility_map() to older releases because unless the VM is
> truncated along with the FSM, VACUUM may not scan all pages and the FSM for
> those pages won't be recorded.

This would need a careful lookup, and it would not be that complicated
to implement. And this bug justifies the presence of a tool like that
actually... But usually those don't get backpatched, so the
probability of getting that backported is low IMO; personally, I am not
sure I like that either.
-- 
Michael




Re: [HACKERS] FSM corruption leading to errors

2016-10-19 Thread Pavan Deolasee
On Thu, Oct 20, 2016 at 10:50 AM, Michael Paquier  wrote:

>
>  For VMs a good way would
> be to use pg_visibility's pg_truncate_visibility_map(), but only for
> 9.6~.


Ah ok..


> For FSM there is no real solution, and actually a
> pg_truncate_fsm would prove to be useful here.


Right, that's what I proposed as an alternate idea. I agree this is a much
cleaner and less error-prone approach.

Actually, if we could add an API which can truncate the FSM to the given heap
block, then the user may not even need to run VACUUM, which could be costly
for very large tables. Also, AFAICS we will need to backport
pg_truncate_visibility_map() to older releases, because unless the VM is
truncated along with the FSM, VACUUM may not scan all pages and the FSM
entries for those pages won't be recorded.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] FSM corruption leading to errors

2016-10-19 Thread Michael Paquier
On Thu, Oct 20, 2016 at 2:11 PM, Pavan Deolasee
 wrote:
> 4. Remove FSM and VM files for the affected tables (I don't know if it's
> safe to do this on a running server)

Definitely not while the server is running... For VMs a good way would
be to use pg_visibility's pg_truncate_visibility_map(), but only for
9.6~. For FSM there is no real solution, and actually a
pg_truncate_fsm would prove to be useful here. So users would need to
stop the server if there are corrupted VMs or FSMs, delete the files
manually, and then run VACUUM to rebuild them.
-- 
Michael




Re: [HACKERS] FSM corruption leading to errors

2016-10-19 Thread Pavan Deolasee
On Wed, Oct 19, 2016 at 6:54 PM, Tom Lane  wrote:

>
>
> Can we document an existing procedure for repairing FSM corruption?
> (VACUUM, maybe?)


I'm afraid it may not be easy to repair the corruption with existing
facilities. Most often the corruption will be on the standby, and a VACUUM
may not actually touch the affected pages on the master (because they may
not even exist on the master, or may be skipped because of the visibility
map). It may not even trigger relation truncation. So AFAICS it may not
generate any WAL activity that can fix the corruption on the standby.

One possible way would be to delete the FSM (and VM) information on the
master and standby and then run VACUUM so these maps are rebuilt. We
obviously don't need to do this for all tables, but we need a way to find
the tables with corrupt FSM [1].

A suggested procedure could be:

1. Upgrade master and standby to the latest minor release (which involves a
restart)
2. Install the pg_freespacemap extension and run the query [1] on the master
to find possible corruption cases. The query checks whether the FSM reports
free space in a block beyond the current size of the relation. Unfortunately,
we might get false positives if the relation is extended while the query is
running.
3. Repeat the same query on the standby (if it's running in hot standby mode;
otherwise the corruption can only be detected once it's promoted to master)
4. Remove the FSM and VM files for the affected tables (I don't know if it's
safe to do this on a running server)
5. VACUUM the affected tables so that the FSM and VM are rebuilt.

Another idea is to implement a pg_freespace_repair() function in
pg_freespacemap which takes an AccessExclusiveLock on the table and truncates
the FSM to the table's current size, thus generating a WAL record that the
standby will replay to fix the corruption. This probably looks more
promising, and is easier to explain and less error prone.


[1] SELECT *
      FROM (
           SELECT oid::regclass AS relname,
                  EXISTS (
                      SELECT *
                        FROM (
                             SELECT blkno,
                                    pg_freespace(oid::regclass, blkno)
                               FROM generate_series(
                                        pg_relation_size(oid::regclass) /
                                            current_setting('block_size')::bigint,
                                        pg_relation_size(oid::regclass, 'fsm') / 2
                                    ) AS blkno
                             ) AS avail
                       WHERE pg_freespace > 0
                  ) AS corrupt_fsm
             FROM pg_class
            WHERE relkind = 'r'
           ) b
     WHERE b.corrupt_fsm = true;
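
For illustration, a repair pass built around this query might look like the
following ("affected_table" stands for any relation the query flags;
pg_freespace() comes from the contrib module pg_freespacemap):

    -- step 2: install the module that provides pg_freespace()
    CREATE EXTENSION pg_freespacemap;
    -- ... run the query above and note the relations it reports ...
    -- step 5, after any file-level cleanup from step 4:
    VACUUM VERBOSE affected_table;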


Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Parallel bitmap heap scan

2016-10-19 Thread Amit Kapila
On Wed, Oct 19, 2016 at 9:23 PM, Dilip Kumar  wrote:
> On Wed, Oct 19, 2016 at 12:39 PM, Andres Freund  wrote:
>> Try measuring with something more heavy on bitmap scan time
>> itself. E.g.
>> SELECT SUM(l_extendedprice) FROM lineitem WHERE (l_shipdate >= 
>> '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date);
>> or similar.  The tpch queries don't actually spend that much time in the
>> bitmapscan itself - the parallelization of the rest of the query is what
>> matters...
>
> Yeah, I agree.
>
> I have tested with this query. With the exact filter condition it was
> choosing a parallel sequential scan, so I modified the filter a bit and
> tested.
>
> I tested with all default configuration on my local machine. I think I
> will generate more such test cases and do detailed testing on my
> performance machine.
>
>
> Explain Analyze results:
> -
> On Head:
> 
> postgres=# explain analyze  SELECT SUM(l_extendedprice) FROM lineitem
> WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <=
> '1996-03-31'::date);
>
>  QUERY PLAN
> ---
>  Aggregate  (cost=848805.90..848805.91 rows=1 width=32) (actual
> time=12440.165..12440.166 rows=1 loops=1)
>->  Bitmap Heap Scan on lineitem  (cost=143372.40..834833.25
> rows=5589057 width=8) (actual time=1106.217..11183.722 rows=5678841
> loops=1)
>  Recheck Cond: ((l_shipdate >= '1995-01-01'::date) AND
> (l_shipdate <= '1996-03-31'::date))
>  Rows Removed by Index Recheck: 20678739
>  Heap Blocks: exact=51196 lossy=528664
>  ->  Bitmap Index Scan on idx_lineitem_shipdate
> (cost=0.00..141975.13 rows=5589057 width=0) (actual
> time=1093.376..1093.376 rows=5678841 loops=1)
>Index Cond: ((l_shipdate >= '1995-01-01'::date) AND
> (l_shipdate <= '1996-03-31'::date))
>  Planning time: 0.185 ms
>  Execution time: 12440.819 ms
> (9 rows)
>
> After Patch:
> ---
> postgres=# explain analyze  SELECT SUM(l_extendedprice) FROM lineitem
> WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <=
> '1996-03-31'::date);
>
>QUERY PLAN
>
> ---------------------------------------------------------------------------
>  Finalize Aggregate  (cost=792751.16..792751.17 rows=1 width=32)
> (actual time=6660.157..6660.157 rows=1 loops=1)
>->  Gather  (cost=792750.94..792751.15 rows=2 width=32) (actual
> time=6659.378..6660.117 rows=3 loops=1)
>  Workers Planned: 2
>  Workers Launched: 2
>  ->  Partial Aggregate  (cost=791750.94..791750.95 rows=1
> width=32) (actual time=6655.941..6655.941 rows=1 loops=3)
>->  Parallel Bitmap Heap Scan on lineitem
> (cost=143372.40..785929.00 rows=2328774 width=8) (actual
> time=1980.797..6204.232 rows=1892947 loops=3)
>  Recheck Cond: ((l_shipdate >= '1995-01-01'::date)
> AND (l_shipdate <= '1996-03-31'::date))
>  Rows Removed by Index Recheck: 6930269
>  Heap Blocks: exact=17090 lossy=176443
>  ->  Bitmap Index Scan on idx_lineitem_shipdate
> (cost=0.00..141975.13 rows=5589057 width=0) (actual
> time=1933.454..1933.454 rows=5678841 loops=1)
>Index Cond: ((l_shipdate >=
> '1995-01-01'::date) AND (l_shipdate <= '1996-03-31'::date))
>  Planning time: 0.207 ms
>  Execution time: 6669.195 ms
> (13 rows)
>
>
> Summary:
> -> With the patch, overall execution is 2 times faster compared to head.
> -> Bitmap creation with the patch is a bit slower compared to head, and
> that's because of DHT vs. the efficient hash table.
>

I think the impact of the slowness due to the Bitmap Index Scan is not
very visible here, as the time it takes is small compared to the overall
time. However, I think there is an advantage to using DHT, as that will
allow us to build the hash table with multiple workers using a parallel
index scan in the future.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Avoiding pin scan during btree vacuum

2016-10-19 Thread Amit Kapila
On Thu, Oct 20, 2016 at 4:00 AM, Alvaro Herrera
 wrote:
> Robert Haas wrote:
>> On Mon, Jan 4, 2016 at 10:30 AM, Tom Lane  wrote:
>> >> This seems like a mighty subtle thing to backpatch. If we really want to
>> >> go there, ISTM that the relevant code should stew in an unreleased
>> >> branch for a while, before being backpatched.
>> >
>> > I'm definitely -1 on back-patching such a thing.  Put it in HEAD for
>> > awhile.  If it survives six months or so then we could discuss it again.
>>
>> I agree with Tom.
>
> Okay, several months have passed with this in the development branch and
> now seems a good time to backpatch this all the way back to 9.4.
>

Are you talking about commit
3e4b7d87988f0835f137f15f5c1a40598dd21f3d?  If so, do we want to
retain this code in its current form under #define UNUSED? Is there any
advantage to doing so?  Another point: won't this commit make the
information in xl_btree_vacuum redundant, so shouldn't we remove its
usage during WAL writing as well?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel Index Scans

2016-10-19 Thread Peter Geoghegan
On Wed, Oct 19, 2016 at 8:07 PM, Amit Kapila  wrote:
> I have also checked and found that you are right.  In SQL Server, they
> are using the max degree of parallelism (MAXDOP) parameter, which I
> think is common for all SQL statements.

It's not just that one that does things this way, for what it's worth.

> I can understand that it can be confusing to users, so another option
> could be to provide separate parameters like parallel_workers_build
> and parallel_workers, where the first can be used for index builds and
> the second for scans.  My personal opinion is to have one
> parameter, so that users have one less thing to learn about
> parallelism.

That's my first instinct too, but I don't really have an opinion yet.

I think that this is the kind of thing where it could make sense to
take a "wait and see" approach, and then make a firm decision
immediately prior to beta. This is what we did in deciding the name of
and fine details around what ultimately became the
max_parallel_workers_per_gather GUC (plus related GUCs and storage
parameters).

-- 
Peter Geoghegan




Re: [HACKERS] Parallel Index Scans

2016-10-19 Thread Amit Kapila
On Thu, Oct 20, 2016 at 7:39 AM, Peter Geoghegan  wrote:
> On Mon, Oct 17, 2016 at 8:08 PM, Amit Kapila  wrote:
>> Create Index  With (parallel_workers = 4);
>>
>> If above syntax looks sensible, then we might need to think what
>> should be used for parallel index build.  It seems to me that parallel
>> tuple sort patch [1] proposed by Peter G. is using above syntax for
>> getting the parallel workers input from user for parallel index
>> builds.
>
> Apparently you see a similar issue with other major database systems,
> where similar storage parameter things are kind of "overloaded" like
> this (they are used by both index creation, and by the optimizer in
> considering whether it should use a parallel index scan). That can be
> a kind of a gotcha for their users, but maybe it's still worth it.
>

I have also checked and found that you are right.  In SQL Server, they
are using the max degree of parallelism (MAXDOP) parameter, which I
think is common for all SQL statements.

> In
> any case, the complaints I saw about that were from users who used
> parallel CREATE INDEX with the equivalent of my parallel_workers index
> storage parameter, and then unexpectedly found this also forced the
> use of parallel index scan. Not the other way around.
>

I can understand that it can be confusing to users, so another option
could be to provide separate parameters like parallel_workers_build
and parallel_workers, where the first can be used for index builds and
the second for scans.  My personal opinion is to have one
parameter, so that users have one less thing to learn about
parallelism.

> Ideally, the parallel_workers storage parameter will rarely be
> necessary because the optimizer will generally do the right thing in
> all cases.
>

Yeah, we can choose not to provide any parameter for parallel index
scans, but some users might want a parameter similar to the one for
parallel table scans, so it could be handy for them to use.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Disable autovacuum guc?

2016-10-19 Thread Michael Paquier
On Thu, Oct 20, 2016 at 11:55 AM, Craig Ringer  wrote:
> On 20 October 2016 at 09:27, Joshua D. Drake  wrote:
>> 1. It does not hurt anyone
>
> I disagree.
>
> It's important for a number of tests, and can be very useful when
> doing diagnostic work. It's also important for some kinds of data
> recovery efforts.

+1. From time to time, it is really helpful to disable it for
development purposes.
-- 
Michael




Re: [HACKERS] Query cancel seems to be broken in master since Oct 17

2016-10-19 Thread Noah Misch
On Tue, Oct 18, 2016 at 10:03:39AM -0400, Tom Lane wrote:
> Heikki Linnakangas  writes:
> > On 10/18/2016 04:13 PM, Tom Lane wrote:
> >> There's a smoking gun in the postmaster log:
> >> 2016-10-18 09:10:34.547 EDT [18502] LOG:  wrong key in cancel request for 
> >> process 18491
> 
> > Ok, I've reverted that commit for now. It clearly needs more thought, 
> > because of this, and the pademelon failure discussed on the other thread.
> 
> I think that was an overreaction.  The problem is pretty obvious after
> adding some instrumentation:

I endorse Heikki's revert.




Re: [HACKERS] Disable autovacuum guc?

2016-10-19 Thread Craig Ringer
On 20 October 2016 at 09:27, Joshua D. Drake  wrote:
> Hello,
>
> After all these years, we are still regularly running into people who say,
> "performance was bad so we disabled autovacuum". I am not talking about once
> in a while; it happens often. I would like us to consider removing the autovacuum
> option. Here are a few reasons:
>
> 1. It does not hurt anyone

I disagree.

It's important for a number of tests, and can be very useful when
doing diagnostic work. It's also important for some kinds of data
recovery efforts.

However, I do think the documentation needs to loudly warn users not
to turn it off, and that if they think vacuum is slowing their system
down they probably actually have to make it run MORE. The problem is
people not understanding autovac, and taking away the option to turn
it off won't help. They'll just find other ways to cripple it.
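
As a sketch of what "make it run MORE" can look like in practice, the
settings below make autovacuum wake up more often and do more work per
cycle (the values are purely illustrative, not recommendations):

    ALTER SYSTEM SET autovacuum_naptime = '15s';
    ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 1000;
    SELECT pg_reload_conf();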

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Disable autovacuum guc?

2016-10-19 Thread Josh Berkus
On 10/19/2016 06:27 PM, Joshua D. Drake wrote:
> Hello,
> 
> After all these years, we are still regularly running into people who
> say, "performance was bad so we disabled autovacuum". I am not talking
> about once in a while; it happens often. I would like us to consider removing
> the autovacuum option. Here are a few reasons:
> 
> 1. It does not hurt anyone
> 2. It removes a foot gun
> 3. Autovacuum is *not* optional, we shouldn't let it be
> 4. People could still disable it at the table level for those tables
> that fall into the small window where no maintenance is o.k.
> 5. People would still have the ability to decrease the max_workers to 1
> (although I could argue about that too).

People who run data warehouses where all of the data comes in as batch
loads regularly disable autovacuum, and should do so.  For the DW/batch
load use-case, it makes far more sense to do batch loads interspersed
with ANALYZEs and VACUUMs of the loaded/updated tables.
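
For instance, a batch-load cycle along those lines might look like this
(table and file names are placeholders):

    COPY sales_fact FROM '/stage/batch.csv' (FORMAT csv);
    ANALYZE sales_fact;  -- refresh planner statistics after the load
    VACUUM sales_fact;   -- set hint bits and visibility map bits in one pass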

-- 
--
Josh Berkus
Red Hat OSAS
(any opinions are my own)




Re: [HACKERS] Parallel Index Scans

2016-10-19 Thread Peter Geoghegan
On Mon, Oct 17, 2016 at 8:08 PM, Amit Kapila  wrote:
> Create Index  With (parallel_workers = 4);
>
> If above syntax looks sensible, then we might need to think what
> should be used for parallel index build.  It seems to me that parallel
> tuple sort patch [1] proposed by Peter G. is using above syntax for
> getting the parallel workers input from user for parallel index
> builds.

Apparently you see a similar issue with other major database systems,
where similar storage parameter things are kind of "overloaded" like
this (they are used by both index creation, and by the optimizer in
considering whether it should use a parallel index scan). That can be
a kind of a gotcha for their users, but maybe it's still worth it. In
any case, the complaints I saw about that were from users who used
parallel CREATE INDEX with the equivalent of my parallel_workers index
storage parameter, and then unexpectedly found this also forced the
use of parallel index scan. Not the other way around.

Ideally, the parallel_workers storage parameter will rarely be
necessary because the optimizer will generally do the right thing in
all cases.
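
To make the overloading concrete: under the syntax proposed upthread (not
committed anywhere at this point), a single storage parameter would drive
both behaviors. A hypothetical example:

    -- the parameter would be honored by the parallel CREATE INDEX build,
    -- and later also consulted by the optimizer for parallel index scans
    CREATE INDEX idx_lineitem_shipdate ON lineitem (l_shipdate)
        WITH (parallel_workers = 4);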

-- 
Peter Geoghegan




[HACKERS] Disable autovacuum guc?

2016-10-19 Thread Joshua D. Drake

Hello,

After all these years, we are still regularly running into people who 
say, "performance was bad so we disabled autovacuum". I am not talking 
about once in a while; it happens often. I would like us to consider removing
the autovacuum option. Here are a few reasons:


1. It does not hurt anyone
2. It removes a foot gun
3. Autovacuum is *not* optional, we shouldn't let it be
4. People could still disable it at the table level for those tables
that fall into the small window where no maintenance is o.k. (see the
example below)
5. People would still have the ability to decrease the max_workers to 1 
(although I could argue about that too).
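
The table-level escape hatch mentioned in point 4 already exists as a
storage parameter; for example (the table name is a placeholder):

    ALTER TABLE append_only_log SET (autovacuum_enabled = false);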


Sincerely,

JD
--
Command Prompt, Inc.  http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.




Re: [HACKERS] Remove vacuum_defer_cleanup_age

2016-10-19 Thread Bruce Momjian
On Wed, Oct 19, 2016 at 08:17:46PM -0400, Robert Haas wrote:
> On Wed, Oct 19, 2016 at 12:59 PM, Bruce Momjian  wrote:
> > Uh, vacuum_defer_cleanup_age sets an upper limit on how long, in terms
> > of xids, that a standby query can run before cancel, like
> > old_snapshot_threshold, no?
> 
> No, not really.  It affects the behavior of the master, not the standby.

But it controls when/if cancels happen on the standby.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [HACKERS] On conflict update & hint bits

2016-10-19 Thread Peter Geoghegan
On Sat, Oct 1, 2016 at 5:15 AM, Peter Geoghegan  wrote:
> On Fri, Sep 30, 2016 at 5:33 PM, Konstantin Knizhnik
>  wrote:
>> So the question is whether it is correct that ExecOnConflictUpdate tries to
>> access and update a tuple without holding a lock on the buffer?
>
> You're right -- this is a bug in Postgres.

(Thomas Munro and Kevin Grittner added to CC list.)

Attached is a revision of Thomas' patch to fix a related issue; it now
also fixes this no-buffer-lock-held issue. He posted his version of
this to a -general mailing list thread a week ago [1]. This was a
thread about spurious serialization failure with ON CONFLICT DO
NOTHING. I understand that Kevin is still not happy with the behavior
under SSI even with our fix, since serialization failures will still
occur more often than necessary (see other thread for details of what
I'm talking about).

I believe this patch to be a strict improvement on master, and I
suggest it be applied in advance of the deadline to get fixes in for
the next set of point releases. Note that this revision includes
Thomas' isolation test, which is completely unchanged. I still intend
to work with Kevin to do better than this under SSI, though that will
probably be for a future point release. The behavior proposed by my
fix here is the right thing for REPEATABLE READ isolation level, which
has nothing to do with Kevin's proposed more careful handling under
SSI. Besides, the buffer pin bug reported by Konstantin on this thread
really should be addressed ASAP.

[1] 
https://www.postgresql.org/message-id/CAEepm=3ra9ngdhocdbtb4iib7mwdavqybns3f47svkh1mk-...@mail.gmail.com
-- 
Peter Geoghegan
From 648de70f226831af04e1d31329118c325161da0b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Wed, 19 Oct 2016 13:59:20 -0700
Subject: [PATCH] Fix ON CONFLICT bugs at higher isolation levels.

When used at higher isolation levels, ON CONFLICT DO NOTHING could raise
spurious serialization failure errors.  This happened when duplicate
values were proposed for insertion within a single ON CONFLICT DO
NOTHING command. (Similar ON CONFLICT DO UPDATE cases typically raised
valid cardinality violation errors instead.)

Separately, though in the same code path, insufficient care was taken
when a visibility check was performed on some existing, conflicting
tuple.  At least a shared buffer lock is required on any buffer
containing a tuple whose visibility is checked through tqual.c routines,
which may set hint bits (a buffer pin is therefore insufficient).

To fix the first issue, check if tuple was created by current
transaction as a further condition of raising a serialization failure
in relevant test.  To fix the second issue, outsource visibility test to
preexisting heap_fetch() call.

Patch by Thomas Munro and Peter Geoghegan.  First bug reported by Jason
Dusek.  Second bug reported by Konstantin Knizhnik.

Report: 
Report: <57ee93c8.8080...@postgrespro.ru>
Backpatch-through: 9.5
---
 src/backend/executor/nodeModifyTable.c | 41 ++
 .../expected/insert-conflict-do-nothing-2.out  | 29 +++
 src/test/isolation/isolation_schedule  |  1 +
 .../specs/insert-conflict-do-nothing-2.spec| 34 ++
 4 files changed, 82 insertions(+), 23 deletions(-)
 create mode 100644 src/test/isolation/expected/insert-conflict-do-nothing-2.out
 create mode 100644 src/test/isolation/specs/insert-conflict-do-nothing-2.spec

diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index af7b26c..3c4f458 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -179,7 +179,7 @@ ExecProcessReturning(ResultRelInfo *resultRelInfo,
 }
 
 /*
- * ExecCheckHeapTupleVisible -- verify heap tuple is visible
+ * ExecCheckTIDVisible -- fetch tuple, and verify its visibility
  *
  * It would not be consistent with guarantees of the higher isolation levels to
  * proceed with avoiding insertion (taking speculative insertion's alternative
@@ -187,23 +187,6 @@ ExecProcessReturning(ResultRelInfo *resultRelInfo,
  * Check for the need to raise a serialization failure, and do so as necessary.
  */
 static void
-ExecCheckHeapTupleVisible(EState *estate,
-		  HeapTuple tuple,
-		  Buffer buffer)
-{
-	if (!IsolationUsesXactSnapshot())
-		return;
-
-	if (!HeapTupleSatisfiesVisibility(tuple, estate->es_snapshot, buffer))
-		ereport(ERROR,
-(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-			 errmsg("could not serialize access due to concurrent update")));
-}
-
-/*
- * ExecCheckTIDVisible -- convenience variant of ExecCheckHeapTupleVisible()
- */
-static void
 ExecCheckTIDVisible(EState *estate,
 	ResultRelInfo *relinfo,
 	ItemPointer tid)
@@ -212,14 +195,26 @@ ExecCheckTIDVisible(EState *estate,
 	Buffer		buffer;
 	HeapTupleData tuple;
 
-	/* Redundantly check isolation level */
 	if (!IsolationUsesXactSnapshot())
 		return;
 
 	tuple.t_self = *tid;
-	if (!heap_fetch(rel, S

Re: [HACKERS] Remove vacuum_defer_cleanup_age

2016-10-19 Thread Robert Haas
On Wed, Oct 19, 2016 at 12:59 PM, Bruce Momjian  wrote:
> Uh, vacuum_defer_cleanup_age sets an upper limit on how long, in terms
> of xids, that a standby query can run before cancel, like
> old_snapshot_threshold, no?

No, not really.  It affects the behavior of the master, not the standby.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-10-19 Thread Robert Haas
On Wed, Oct 19, 2016 at 12:04 PM, Robert Haas  wrote:
> On Thu, Oct 13, 2016 at 5:53 AM, Victor Wagner  wrote:
>> On Thu, 13 Oct 2016 12:30:59 +0530
>> Mithun Cy  wrote:
>>> On Fri, Sep 30, 2016 at 2:14 PM, Victor Wagner 
>>> wrote:
>>> Okay but for me consistency is also important. Since we agree to
>>> disagree on some of the comments and others have not expressed any
>>> problem I am moving it to committer.
>>
>> Thank you for your efforts improving my patch
>
> I haven't deeply reviewed this patch, but on a quick read-through it
> certainly seems to need a lot of cleanup work.

Some more comments:

- I am pretty doubtful that the changes to connectOptions2() work as
intended.  I think that's only going to be called before actualhost
and actualport could possibly get set.  I don't think we keep
reinvoking this function for every new host we try.

- It's pretty clear that this isn't going to work if the host list
includes a mix of hostnames and UNIX socket addresses.  The code that
handles the UNIX socket address case is totally separate from the code
that handles the multiple-hostname case.

- This creates a sort of definitional problem for functions like
PQhost().  The documentation says of this and other related functions:
"These values are fixed for the life of the PGconn object."  But with
this patch, that would no longer be true.  So at least the
documentation needs an update.  Actually, though, I think the problem
is more fundamental than that.  If the user asks for PQhost(conn), do
they want the list of all hosts, or the one to which we're currently
connected?  It might be either, depending on what they intend to do
with the information.  If they mean to construct a new connection
string, they might well want them all; but something like psql's
\conninfo only shows the current one.  There's also the question of
what "the current one" means if we're not connected to any server at
all.  I'm not sure exactly how to solve these problems, but they need
some thought.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-10-19 Thread Peter van Hardenberg
On Wed, Oct 19, 2016 at 3:08 PM, Robert Haas  wrote:

> On Mon, Oct 26, 2015 at 4:25 PM, Peter Eisentraut  wrote:
> > On 10/14/15 6:41 AM, Victor Wagner wrote:
> All in all, I'm still feeling pretty good about trying to support the
> same syntax that our JDBC driver already does.  It's certainly not a
> perfect solution, but it is at least compatible with MySQL's JDBC
> driver and with MongoDB, and in a world where everybody has picked a
> different approach that's not too bad.  Hey, maybe if we use the same
> syntax as MongoDB they'll let us hang out with the cool kids...
>
>
They will never let us hang out with the cool kids. Don't worry though, the
cool kids are too busy figuring out why their cluster is out of consensus
to pay attention to much else.

Supporting different ports on different servers would be a much-appreciated
feature (I can't remember if it was Kafka or Cassandra that didn't do this,
and it was very annoying).

Remember, as the connection string gets more complicated, psql supports the
Postgres URL format as a single command-line argument and we may want to
begin encouraging people to use that syntax instead.


Re: [HACKERS] Indirect indexes

2016-10-19 Thread Alvaro Herrera
Alexander Korotkov wrote:

Hi,

> Thank you for your proposal.  One question about vacuum concerns me most.

> Imagine another situation: PK column was not updated, but indirect indexed
> column was updated.
> Thus, for a single heap tuple we would have a single PK tuple and two
> indirect index tuples (correct me if I'm wrong).
> How are we going to delete the old indirect index tuple?

Yeah, this is a tough question all right.

Another thing to note is that the initial heap tuple might be removed
via HOT page pruning, so it will not be there during vacuuming either;
there would only be a redirect line pointer.  So I don't think we can
rely on that to figure out an exact cleanup of the indirect index.

One possibility is to forget about vacuuming the index altogether.
Instead we can rely solely on the killtuple interface, that is, have
obsolete tuples be removed during index scanning.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Avoiding pin scan during btree vacuum

2016-10-19 Thread Alvaro Herrera
Robert Haas wrote:
> On Mon, Jan 4, 2016 at 10:30 AM, Tom Lane  wrote:
> >> This seems like a mighty subtle thing to backpatch. If we really want to
> >> go there, ISTM that the relevant code should stew in an unreleased
> >> branch for a while, before being backpatched.
> >
> > I'm definitely -1 on back-patching such a thing.  Put it in HEAD for
> > awhile.  If it survives six months or so then we could discuss it again.
> 
> I agree with Tom.

Okay, several months have passed with this in the development branch and
now seems a good time to backpatch this all the way back to 9.4.

Any objections?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Indirect indexes

2016-10-19 Thread Alexander Korotkov
On Wed, Oct 19, 2016 at 12:53 PM, Simon Riggs  wrote:

> On 18 October 2016 at 23:46, Alexander Korotkov
>  wrote:
>
> > Then vacuum removes (0;1) from heap, reference to (0;1) from tbl_pk_idx.
> > But how will it remove (1,1) tuple from tbl_val_indirect_idx?  Thus, before
> > vacuuming tbl_val_indirect_idx we should know not only the values of id
> > which are being removed, but actually the (id, val) pairs which are being
> > removed.  Should we collect those pairs while scanning the heap?  But we
> > should also take into account that multiple heap tuples might have the
> > same (id, val) pair values (assuming there could be other columns being
> > updated).  Therefore, we should take into account when the last instance
> > of a particular (id, val) pair value was deleted from the heap.  That
> > would be a very huge change to vacuum, maybe even writing a far more
> > complex vacuum algorithm from scratch.  Probably, you see a better
> > solution to this problem.
>
> The best way to sum up the problem is to consider how we deal with
> repeated updates to a single tuple that flip the value from A to B
> then back to A then to B then A etc.. Any value in the index can point
> to multiple versions of the same tuple and multiple index values can
> point to the same tuple (PK value). This problem behaviour was already
> known to me from Claudio's earlier analysis of WARM (thanks Claudio).
>

Thank you for the pointer.  I didn't follow the details of the WARM discussion.

> Yes, VACUUMing that is likely to be a complex issue, as you say. At
> the moment I don't have a plan for that, but am not worried.
>

AFAICS, the main goal of indirect indexes is to reduce their maintenance
cost.  Indirect indexes are much easier to maintain during UPDATEs and this
is good.  But it's harder to VACUUM them.  So, we need to figure out how
much maintenance cost would be reduced for indirect indexes.  This is why I
think digging into VACUUM problems is justified for now.

> Indirect indexes produce fewer index entries in general than current,
> so the problem is by-design much smaller than current situation.
> Indirect indexes can support killed-tuple interface, so scanning the
> index by users will result in natural index maintenance, further
> reducing the problem.
>

That makes sense.  But that is not necessarily true for every workload.  For
instance, the keys which are frequently updated are not necessarily the same
keys which are frequently selected.  Thus, there is still some risk of
bloat.

> So there will be a much reduced need for bulk maintenance. Bulk
> maintenance of the index, when needed, can be performed by scanning
> the whole table via the index, after the PK index has been vacuumed.
>

That's possible, but such a vacuum is going to be very I/O-intensive when the
heap doesn't fit in cache.  It's even possible that rebuilding the index
would be cheaper.


> That can be optimized using an index-only scan of the PK to avoid
> touching the heap, which should be effective since the VM has been so
> recently refreshed.


But we can't tell which indirect index keys still persist in the heap by
using an index-only scan of the PK, because the PK doesn't contain those
keys.  So, we still need to scan the heap for that.


> For correctness it would require the index blocks
> to be locked against write while checking for removal, so bulk
> collection of values to optimize the underlying index doesn't seem
> useful. The index scan could also be further optimized by introducing
> a visibility map for the index, which is something that would also
> optimize normal index VACUUMs as well, but that is a later project and
> not something for 10.x
>

A visibility map for indexes sounds interesting.  And that means including
visibility information in the index.  It's an important property of
PostgreSQL's current MVCC implementation that, while updating a heap tuple,
we don't have to find the locations of the old index tuples referring to it;
we only have to insert new index tuples.  Finding the locations of old index
tuples, even merely to update an index visibility map, would be a
substantial change.

> At this stage, the discussion should be 1) can it work? 2) do we want
> it?


I think that we definitely need indirect indexes, and they might work.  The
question is the design.  PostgreSQL MVCC is designed so that indexes contain
no visibility information.  So, currently we're discussing an approach which
extends this design to indirect indexes.  The downsides of this approach are
(at least): 1) we must always recheck results obtained from the index, and
2) VACUUM becomes very difficult.

There is also an alternative approach: include visibility information in the
indirect index.  In this approach we would include the fields required for
visibility (xmin, xmax, etc.) in the indirect index tuple and keep them up to
date.  Then, while updating an indexed column, we would have to update the
old index tuple as well.  This is the downside.  But we would be able to scan
without rechecks, and VACUUM would be much easier.  We would even be able to
VACUUM indire

Re: [HACKERS] Indirect indexes

2016-10-19 Thread Claudio Freire
On Wed, Oct 19, 2016 at 2:04 PM, Bruce Momjian  wrote:
>> What we should ask is what is the difference between indirect indexes
>> and WARM and to what extent they overlap.
>>
>> My current understanding is that WARM won't help you if you update
>> parts of a JSON document and/or use GIN indexes, but is effective
>> without needing to add a new index type and will be faster for
>> retrieval than indirect indexes.
>>
>> So everybody please chirp in with benefits or comparisons.
>
> I am not sure we have even explored all the limits of WARM with btree
> indexes --- I haven't heard anyone talk about non-btree indexes yet.

AFAIK there's no fundamental reason why it wouldn't work for other
index ams, but it does require quite a bit of legwork to get
everything working there.




Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-10-19 Thread Robert Haas
On Mon, Oct 26, 2015 at 4:25 PM, Peter Eisentraut  wrote:
> On 10/14/15 6:41 AM, Victor Wagner wrote:
>> 1. It is allowed to specify several hosts in the connect string, either
>> in URL-style (separated by comma) or in param=value form (several host
>> parameters).
>
> I'm not fond of having URLs that are not valid URLs according to the
> applicable standards.  Because then they can't be parsed or composed by
> standard libraries.

I did a little bit more research on this topic and found out a few
things that are interesting, at least to me.  First, our documentation
references RFC3986.  According to RFC3986:

URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

hier-part   = "//" authority path-abempty
  / path-absolute
  / path-rootless
  / path-empty

authority   = [ userinfo "@" ] host [ ":" port ]

host= IP-literal / IPv4address / reg-name

reg-name= *( unreserved / pct-encoded / sub-delims )

sub-delims include comma but not colon, so I think that
postgresql://host1,host2,host3/ is a perfectly good URL, and so is
postgresql://host1,host2,host3:/ but
postgresql://host1:1234,host2:3456/ is not a valid URL because the :
after host1 terminates the "host" portion of the URL.  The port can't
contain anything but digits, so 1234 has to be the port, but then
there's nothing to do with the portion between the comma and the
following slash, so it is indeed an invalid URI as far as I can tell.

However, PostgreSQL's JDBC driver isn't alone in supporting something
like this.  The MySQL JDBC driver does the same thing:

http://lists.mysql.com/cluster/249
https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-configuration-properties.html

MongoDB refers to their connection string as a URI, but it uses
exactly this syntax:

https://docs.mongodb.com/manual/reference/connection-string/

Couchbase's syntax is also quite similar, though it's not clear that
they allow port numbers:

http://developer.couchbase.com/documentation/server/4.1/developer-guide/connecting.html

Of course, since this is a very common need and it's not obvious how
to satisfy it within the confines of the URI specification, people
have developed a variety of other answers, such as (a) multiple URIs
separated by commas, which is a terrible idea because comma can occur
*within* URIs, (b) separating multiple host names with a double-dash,
(c) including a parameter in the "query" portion of the URI to specify
alternate host names, (d) defining a new failover: URI scheme that
acts as a container for multiple connection URIs of whatever form is
normally supported, and in the case of Oracle (e) creating some
frightening monstrosity of proprietary syntax that I don't (care to)
understand.

All in all, I'm still feeling pretty good about trying to support the
same syntax that our JDBC driver already does.  It's certainly not a
perfect solution, but it is at least compatible with MySQL's JDBC
driver and with MongoDB, and in a world where everybody has picked a
different approach that's not too bad.  Hey, maybe if we use the same
syntax as MongoDB they'll let us hang out with the cool kids...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] emergency outage requiring database restart

2016-10-19 Thread Merlin Moncure
On Wed, Oct 19, 2016 at 9:56 AM, Bruce Momjian  wrote:
> On Wed, Oct 19, 2016 at 08:54:48AM -0500, Merlin Moncure wrote:
>> > Yeah.  Believe me -- I know the drill.  Most or all the damage seemed
>> > to be to the system catalogs with at least two critical tables dropped
>> > or inaccessible in some fashion.  A lot of the OIDs seemed to be
>> > pointing at the wrong thing.  Couple more datapoints here.
>> >
>> > *) This database is OLTP, doing ~ 20 tps avg (but very bursty)
>> > *) Another database on the same cluster was not impacted.  However
>> > it's more olap style and may not have been written to during the
>> > outage
>> >
>> > Now, this infrastructure running this system is running maybe 100ish
>> > postgres clusters and maybe 1000ish sql server instances with
>> > approximately zero unexplained data corruption issues in the 5 years
>> > I've been here.  Having said that, this definitely smells and feels
>> > like something on the infrastructure side.  I'll follow up if I have
>> > any useful info.
>>
>> After a thorough investigation I now have credible evidence the source
>> of the damage did not originate from the database itself.
>> Specifically, this database is mounted on the same volume as the
>> operating system (I know, I know) and something non database driven
>> sucked up disk space very rapidly and exhausted the volume -- fast
>> enough that sar didn't pick it up.  Oh well :-) -- thanks for the help
>
> However, disk space exhaustion should not lead to corruption unless the
> underlying layers lied in some way.

I agree -- however I'm sufficiently separated from the things doing
the things that I can't verify that in any real way.   In the meantime
I'm going to take standard precautions (enable checksums/dedicated
volume/replication).  Low disk space also does not explain the bizarre
outage I had last Friday.

merlin




Re: [HACKERS] Remove vacuum_defer_cleanup_age

2016-10-19 Thread Josh Berkus
On 10/19/2016 09:59 AM, Bruce Momjian wrote:
> On Wed, Oct 19, 2016 at 09:00:06AM -0400, Robert Haas wrote:
>> On Wed, Oct 19, 2016 at 8:47 AM, Bruce Momjian  wrote:
>>> On Wed, Oct 19, 2016 at 08:33:20AM -0400, Robert Haas wrote:

 Actually, I think vacuum_defer_cleanup_age is, and always has been, an
 ugly hack.  But for some people it may be the ugly hack that is
 letting them continue to use PostgreSQL.
>>>
>>> I see vacuum_defer_cleanup_age as old_snapshot_threshold for standby
>>> servers --- it cancels transactions rather than delaying cleanup.
>>
>> I think it's the opposite, isn't it?  vacuum_defer_cleanup_age
>> prevents cancellations.
> 
> Uh, vacuum_defer_cleanup_age sets an upper limit on how long, in terms
> of xids, that a standby query can run before cancel, like
> old_snapshot_threshold, no?  After that, we can cancel standby queries. 
> I see hot_standby_feedback as our current behavior on the master where
> we never cancel standby queries.
> 
> To me, hot_standby_feedback extends no-cleanup-no-cancel from the
> standby to the master, while vacuum_defer_cleanup_age behaves like
> old_snapshot_threshold in that it causes cancel for long-running
> queries.

See Andres' response on this thread.  He's already covered why the
setting is still useful but also why we might want to remove it anyway.


-- 
--
Josh Berkus
Red Hat OSAS
(any opinions are my own)




Re: [HACKERS] Gather Merge

2016-10-19 Thread Peter Geoghegan
On Tue, Oct 4, 2016 at 11:05 PM, Rushabh Lathia
 wrote:
> Query 4:  With GM 7901.480 -> Without GM 9064.776
> Query 5:  With GM 53452.126 -> Without GM 55059.511
> Query 9:  With GM 52613.132 -> Without GM 98206.793
> Query 15: With GM 68051.058 -> Without GM 68918.378
> Query 17: With GM 129236.075 -> Without GM 160451.094
> Query 20: With GM 259144.232 -> Without GM 306256.322
> Query 21: With GM 153483.497 -> Without GM 168169.916
>
> Here from the results we can see that query 9, 17 and 20 are the one which
> show good performance benefit with the Gather Merge.

Were all other TPC-H queries unaffected? IOW, did they have the same
plan as before with your patch applied? Did you see any regressions?

I assume that this patch has each worker use work_mem for its own
sort, as with hash joins today. One concern with that model when
testing is that you could end up with a bunch of internal sorts for
cases with a GM node, where you get one big external sort for cases
without one. Did you take that into consideration?
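
For what it's worth, the distinction is visible in EXPLAIN ANALYZE output as
"Sort Method: quicksort" (in memory) versus "Sort Method: external merge"
(spilled to disk), so one way to check is to compare plans at the same
work_mem setting; the query shape below is purely illustrative:

    SET work_mem = '4MB';
    EXPLAIN (ANALYZE) SELECT * FROM lineitem ORDER BY l_shipdate;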

-- 
Peter Geoghegan




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-10-19 Thread Peter Geoghegan
On Wed, Oct 19, 2016 at 7:39 AM, Robert Haas  wrote:
> Gather Merge can't emit a tuple unless it has buffered at least one
> tuple from every producer; otherwise, the next tuple it receives from
> one of those producers might precede whichever tuple it chooses to
> emit.  However, it doesn't need to wait until all of the workers are
> completely done.  The leader only needs to be at least slightly ahead
> of the slowest worker.  I'm not sure how that compares to Peter's
> approach.

I don't think that eager merging will prove all that effective,
however it's implemented. I see a very I/O bound system when parallel
CREATE INDEX merges serially. There is no obvious reason why you'd
have a straggler worker process with CREATE INDEX, really.

> What I'm worried about is that we're implementing two separate systems
> to do the same thing, and that the parallel sort approach is actually
> a lot less general.  I think it's possible to imagine a Parallel Sort
> implementation which does things Gather Merge can't.  If all of the
> workers collaborate to sort all of the data rather than each worker
> sorting its own data, then you've got something which Gather Merge
> can't match.  But this is not that.

It's not that yet, certainly. I think I've sketched a path forward for
making partitioning a part of logtape.c that is promising. The sharing
of ranges within tapes and so on will probably have a significant
amount in common with what I've come up with.

I don't think that any executor infrastructure is a particularly good
model when *batch output* is needed -- the tuple queue mechanism will
be a significant bottleneck, particularly because it does not
integrate read-ahead, etc. The best case that I saw advertised for
Gather Merge was TPC-H query 9 [1]. That doesn't look like a good
proxy for how Gather Merge adapted to parallel CREATE INDEX would do,
since it benefits from the GroupAggregate merge having many equal
values, possibly with a clustering in the original tables that can
naturally be exploited (no TID tiebreaker needed, since IndexTuples
are not being merged). Also, it looks like Gather Merge may do that
well by enabling things, rather than parallelizing the sort
effectively per se. Besides, the query 9 case is significantly less
scalable than good cases for this parallel CREATE INDEX patch have
already been shown to be.

I think I've been pretty modest about what this parallel CREATE INDEX
patch gets us from the beginning. It is a generalization of
tuplesort.c to work in parallel; we need a lot more for that to make
things like GroupAggregate as scalable as possible, and I don't
pretend that this helps much with that. There are actually more
changes to nbtsort.c to coordinate all of this than there are to
tuplesort.c in the latest version, so I think that this simpler
approach for parallel CREATE INDEX and CLUSTER is worthwhile.

The bottom line is that it's inherently difficult for me to refute the
idea that Gather Merge could do just as well as what I have here,
because proving that involves adding a significant amount of new
infrastructure (e.g., to teach the executor about IndexTuples). I
think that the argument for this basic approach is sound (it appears
to offer comparable scalability to the parallel CREATE INDEX
implementations of other systems), but it's simply impractical for me
to offer much reassurance beyond that.

[1] https://github.com/tvondra/pg_tpch/blob/master/dss/templates/9.sql
-- 
Peter Geoghegan




Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-10-19 Thread Thom Brown
On 13 October 2016 at 10:53, Victor Wagner  wrote:
> On Thu, 13 Oct 2016 12:30:59 +0530
> Mithun Cy  wrote:
>
>> On Fri, Sep 30, 2016 at 2:14 PM, Victor Wagner 
>> wrote:
>
>> Okay but for me consistency is also important. Since we agree to
>> disagree on some of the comments and others have not expressed any
>> problem I am moving it to committer.
>
>
> Thank you for your efforts improving my patch

I've found various spelling and grammatical issues in the docs
section, which I've corrected.  I've attached a revised patch with
these changes.

One thing I'm wondering is whether we should be using the term
"master", as that's usually paired with "slave", whereas, nowadays,
there's a tendency to refer to them as "primary" and "standby".  I
know we use "master" in some other places in the documentation, but it
seems inconsistent to have "master" as a parameter value, but then
having "primary" used in some other configuration parameters.

I'd also avoid referring to "the library", and just describe what
happens without making reference to what's making it happen.

Regards

Thom
1,6d0
< commit 52270559f28ab673e14124b813b7cdfb772cd1bd
< Author: mithun 
< Date:   Thu Oct 13 12:07:15 2016 +0530
< 
< libpq10
< 
8c2
< index 4e34f00..fd2fa88 100644
---
> index 4e34f00..adbad75 100644
44c38
< +   There can be serveral host specifications, optionally accompanied
---
> +   There can be several host specifications, optionally accompanied
56,59c50,53
< +   the connect string. In this case these hosts would be considered
< +   alternate entries into same database and if connect to first one
< +   fails, library would try to connect second etc. This can be used
< +   for high availability cluster or for load balancing. See
---
> +   the connection string. In this case these hosts would be considered
> +   alternate entries into same database and if connection to first one
> +   fails, the second would be attemped, and so on. This can be used
> +   for high availability clusters or for load balancing. See the
63,66c57,60
< +   Network host name can be accompanied with port number, separated by
< +   colon. If so, this port number is used only when connected to
< +   this host. If there is no port number, port specified in the
< +parameter would be used.
---
> +   The network host name can be accompanied by a port number, separated 
> by
> +   colon. This port number is used only when connected to
> +   this host. If there is no port number, the port specified in the
> +parameter would be used instead.
71c65
< @@ -943,7 +964,43 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
---
> @@ -943,7 +964,42 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
79c73
< +  Specifies how to choose host from list of alternate hosts,
---
> +  Specifies how to choose the host from the list of alternate hosts,
83,86c77,79
< +  If value of this argument is sequential (the
< +  default) library connects to the hosts in order they specified,
< +  and tries to connect second one only if connection to the first
< +  fails.
---
> +  If the value of this argument is sequential (the
> +  default) connections to the hosts will be attempted in the order in
> +  which they are specified.
89,93c82,86
< +  If value is random host to connect is randomly
< +  picked from the list. It allows to balance load between several
< +  cluster nodes. However, currently PostgreSQL doesn't support
< +  multimaster clusters. So, without use of third-party products,
< +  only read-only connections can take advantage from the
---
> +  If the value is random, the host to connect to
> +  will be randomly picked from the list. It allows load balacing between
> +  several cluster nodes. However, PostgreSQL doesn't currently support
> +  multimaster clusters. So, without the use of third-party products,
> +  only read-only connections can take advantage from
102,104c95,97
< +  If this parameter is master upon successful 
connection
< +  library checks if host is in recovery state, and if it is so,
< +  tries next host in the connect string. If this parameter is
---
> +  If this parameter is master, upon successful connection
> +  the host is checked to determine whether it is in a recovery state.  If it
> +  is, it then tries next host in the connection string. If this parameter is
115c108
< @@ -985,7 +1042,6 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
---
> @@ -985,7 +1041,6 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
123c116
< @@ -996,7 +1052,27 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
---
> @@ -996,7 +1051,28 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
132c125
< + Maximum time to cyclically retry all the hosts in connect string.
---
> + Maximum time to cyclically retry all the hosts in the connection string.
138,141c131,134
< + take some time to promote one of

Re: [HACKERS] Portable check for unportable macro usage

2016-10-19 Thread Tom Lane
Andres Freund  writes:
> Hm. I'd be kind of inclined to instead do something akin to

> #include <ctype.h>
> #define system_isupper(c) isupper(c)
> #undef isupper

Note that that doesn't do what you are probably thinking it does.

What is actually happening there, I believe, is that you're forcing
a fallback to the plain-function definition of isupper().  Ralph's
proposal accomplishes the same thing less messily by parenthesizing
the new macro's reference to isupper.  But in either case, we're
disabling any possible macro optimization in <ctype.h>, which makes
me not want to say it's something we'd enable unconditionally.

> #define isupper(c) (AssertVariableIsOfTypeMacro(c, unsigned char), isupper(c))
> =>
> /home/andres/src/postgresql/src/include/c.h:745:7: error: static assertion 
> failed: "c does not have type unsigned char"

Not sure we really want that; it breaks the standard-intended usage
of applying these functions to the result of getc().  It might be
all right for the core code, but I could see third-party authors
getting pretty upset with us for unilaterally imposing non-POSIX
semantics on these functions.
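
For reference, the idiom at stake, which a strict unsigned-char
assertion would reject, looks roughly like this (a minimal sketch,
not code from the tree):

#include <ctype.h>
#include <stdio.h>

/* getc() returns either an unsigned-char value or EOF (an int), and
 * POSIX defines the ctype functions to accept exactly that range,
 * so no cast is needed here.  A macro that insists on an argument
 * of type "unsigned char" would reject this perfectly legal call. */
static int
count_upper(FILE *fp)
{
	int			c;
	int			n = 0;

	while ((c = getc(fp)) != EOF)
	{
		if (isupper(c))
			n++;
	}
	return n;
}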

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] incorrect libpq comment

2016-10-19 Thread Bruce Momjian
On Wed, Oct 19, 2016 at 01:16:28PM -0400, Robert Haas wrote:
> Bruce's commit 5d305d86bd917723f09ab4f15c075d90586a210a back in April
> of 2014 includes this change:
> 
>  /* See PQconnectPoll() for how we use 'int' and not 'pgsocket'. */
> -int sock;   /* Unix FD for socket, -1 if not connected */
> +pgsocket	sock;   /* FD for socket, PGINVALID_SOCKET if unconnected */
> 
> I suppose Bruce must have overlooked the fact that the comment on the
> previous line is now false.  I think we should remove it, because it
> makes no sense to say how we are using 'int' rather than 'pgsocket'
> when in fact we are not using 'int' any more.

Agreed.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Move pg_largeobject to a different tablespace *without* turning on system_table_mods.

2016-10-19 Thread Bruce Momjian
On Wed, Oct 19, 2016 at 01:25:33PM -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > On Wed, Oct 19, 2016 at 09:44:05AM -0700, David G. Johnston wrote:
> >> ​I think the theory of having all system tables except LO on SSD storage, 
> >> and
> >> having LO on a less performant device, makes sense.
> 
> > OK, so is this a TODO item?
> 
> According to
> https://www.postgresql.org/message-id/200407110311.i6B3BBW24899%40candle.pha.pa.us
> it already is ;-)
> 
> Personally though I'd narrow the scope to just consider pg_largeobject
> (and maybe pg_largeobject_metadata).  I see no use-case for moving other
> catalogs out of the DB's default tablespace, and doing so would create a
> large space of poorly-tested opportunities for failure.

Ah, I see it now:

Allow toast tables to be moved to a different tablespace

moving toast table to its own tablespace
patch : Allow toast tables to be moved to a different tablespace 

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Portable check for unportable macro usage

2016-10-19 Thread Andres Freund
On 2016-10-19 12:14:50 -0400, Tom Lane wrote:
> A portability hazard that we fight constantly is that the <ctype.h>
> macros such as isalpha() can't safely be called on "char" values.
> Per POSIX, you have to make sure the input is cast to "unsigned char".
> That's far too easy to forget, but on most machines you won't get any
> warning about it.

> We have a couple of buildfarm machines that will generate "char value
> used as array subscript" warnings for this, because their implementations
> of these macros are indeed just array lookups.  But I think gaur is
> the only one that's still active, and it's not going to last forever.
> 
> Ralph Corderoy, one of the nmh hackers, was looking at this problem
> and came up with this cute solution [1]:
> 
> #ifndef NDEBUG
> #if EOF != -1
> #error "Please report this to nmh's authors."
> #endif
> 
> extern int ctype_identity[257]; /* [n] = n-1 */
> #define isupper(c) ((isupper)((ctype_identity + 1)[c]))
> ...
> #endif

Hm. I'd be kind of inclined to instead do something akin to

#include <ctype.h>
#define system_isupper(c) isupper(c)
#undef isupper
#define isupper(c) (AssertVariableIsOfTypeMacro(c, unsigned char), isupper(c))
=>
/home/andres/src/postgresql/src/include/c.h:745:7: error: static assertion 
failed: "c does not have type unsigned char"

that would probably result in better error messages, and we can leave it
enabled regardless of debug mode.

That requires adapting one usage in wparser_def.c's p_iswhat().
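
For anyone unfamiliar with the macro, here is a minimal standalone
sketch of how such a type assertion can be spelled with gcc/clang
builtins; the real AssertVariableIsOfTypeMacro in c.h is assembled
from similar pieces but differs in detail:

#include <ctype.h>

/* Compile-time check: the negative bit-field width makes the build
 * fail when the condition is false.  Assumes __typeof__ and
 * __builtin_types_compatible_p, i.e. gcc or clang. */
#define StaticAssertExprSketch(cond) \
	((void) sizeof(struct { int _failed : (cond) ? 1 : -1; }))

#define isupper_uchar(c) \
	(StaticAssertExprSketch(__builtin_types_compatible_p(__typeof__(c), \
							unsigned char)), \
	 isupper(c))

/*
 * unsigned char uc = 'A';
 * isupper_uchar(uc);		-- compiles
 * char ch = 'A';
 * isupper_uchar(ch);		-- fails at compile time
 */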

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Move pg_largeobject to a different tablespace *without* turning on system_table_mods.

2016-10-19 Thread Tom Lane
Bruce Momjian  writes:
> On Wed, Oct 19, 2016 at 09:44:05AM -0700, David G. Johnston wrote:
>> I think the theory of having all system tables except LO on SSD storage, and
>> having LO on a less performant device, makes sense.

> OK, so is this a TODO item?

According to
https://www.postgresql.org/message-id/200407110311.i6B3BBW24899%40candle.pha.pa.us
it already is ;-)

Personally though I'd narrow the scope to just consider pg_largeobject
(and maybe pg_largeobject_metadata).  I see no use-case for moving other
catalogs out of the DB's default tablespace, and doing so would create a
large space of poorly-tested opportunities for failure.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] incorrect libpq comment

2016-10-19 Thread Robert Haas
Bruce's commit 5d305d86bd917723f09ab4f15c075d90586a210a back in April
of 2014 includes this change:

 /* See PQconnectPoll() for how we use 'int' and not 'pgsocket'. */
-int sock;   /* Unix FD for socket, -1 if not connected */
+pgsocket	sock;   /* FD for socket, PGINVALID_SOCKET if unconnected */

I suppose Bruce must have overlooked the fact that the comment on the
previous line is now false.  I think we should remove it, because it
makes no sense to say how we are using 'int' rather than 'pgsocket'
when in fact we are not using 'int' any more.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Move pg_largeobject to a different tablespace *without* turning on system_table_mods.

2016-10-19 Thread Bruce Momjian
On Wed, Oct 19, 2016 at 09:44:05AM -0700, David G. Johnston wrote:
> On Wed, Oct 19, 2016 at 9:29 AM, Bruce Momjian  wrote:
> 
> On Tue, Oct 18, 2016 at 04:51:54PM +0200, Andreas Joseph Krogh wrote:
> >     > 2. Being able to move pg_largeobject to a different tablespace
> >     >    *without* turning on system_table_mods. This is important for
> >     >    people storing LOTS of large-objects on separate
> >     >    disks (non-SSD) and hence in a different tablespace.
>
> I think an open question is why you would not want to move the other
> system tables at the same time you move pg_largeobject.
> 
> 
> 
> I think the theory of having all system tables except LO on SSD storage, and
> having LO on a less performant device, makes sense.

OK, so is this a TODO item?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transactions involving multiple postgres foreign servers

2016-10-19 Thread Bruce Momjian
On Wed, Oct 19, 2016 at 11:47:25AM -0400, Robert Haas wrote:
> It seems to me that the only way to really make this feature robust is
> to have a background worker as part of the equation.  The background
> worker launches at startup and looks around for local state that tells
> it whether there are any COMMIT PREPARED or ROLLBACK PREPARED
> operations pending that weren't completed during the last server
> lifetime, whether because of a local crash or remote unavailability.

Yes, you really need both commit on foreign servers before acknowledging
commit to the client, and a background process to clean things up from
an abandoned server.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Indirect indexes

2016-10-19 Thread Bruce Momjian
On Wed, Oct 19, 2016 at 06:58:05PM +0200, Simon Riggs wrote:
> >> I agree. Also, I think the recheck mechanism will have to be something like
> >> what I wrote for WARM i.e. only checking for index quals won't be enough 
> >> and we
> >> would actually need to verify that the heap tuple satisfies the key in the
> >> indirect index.
> >
> > I personally would like to see how far we get with WARM before adding
> > this feature that requires a DBA to evaluate and enable it.
> 
> Assuming WARM is accepted, that *might* be fine.

First, I love WARM because everyone gets the benefits by default.  For
example, a feature that improves performance by 10% but is only used by
1% of users has a usefulness of 0.1% --- at least that is how I think of
it.

> What we should ask is what is the difference between indirect indexes
> and WARM and to what extent they overlap.
> 
> My current understanding is that WARM won't help you if you update
> parts of a JSON document and/or use GIN indexes, but is effective
> without needing to add a new index type and will be faster for
> retrieval than indirect indexes.
> 
> So everybody please chirp in with benefits or comparisons.

I am not sure we have even explored all the limits of WARM with btree
indexes --- I haven't heard anyone talk about non-btree indexes yet.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove vacuum_defer_cleanup_age

2016-10-19 Thread Bruce Momjian
On Wed, Oct 19, 2016 at 09:00:06AM -0400, Robert Haas wrote:
> On Wed, Oct 19, 2016 at 8:47 AM, Bruce Momjian  wrote:
> > On Wed, Oct 19, 2016 at 08:33:20AM -0400, Robert Haas wrote:
> >> On Tue, Oct 18, 2016 at 4:33 PM, Josh Berkus  wrote:
> >> > Based on that argument, we would never be able to remove any
> >> > configuration parameter ever.
> >>
> >> Well... no.  Based on that argument, we should only remove
> >> configuration parameters if we're fairly certain that they are not
> >> useful any more, which will be rare, but is not never.  I agree that
> >> *if* vacuum_defer_cleanup_age is no longer useful, it should be
> >> removed.  I'm just not convinced that it's truly obsolete, and you
> >> haven't really offered much of an argument for that proposition.  It
> >> does something sufficiently different from hot_standby_feedback that
> >> I'm not sure it's accurate to say that one can substitute for the
> >> other, and indeed, I see Andres has already suggested some scenarios
> >> where it could still be useful.
> >>
> >> Actually, I think vacuum_defer_cleanup_age is, and always has been, an
> >> ugly hack.  But for some people it may be the ugly hack that is
> >> letting them continue to use PostgreSQL.
> >
> > I see vacuum_defer_cleanup_age as old_snapshot_threshold for standby
> > servers --- it cancels transactions rather than delaying cleanup.
> 
> I think it's the opposite, isn't it?  vacuum_defer_cleanup_age
> prevents cancellations.

Uh, vacuum_defer_cleanup_age sets an upper limit on how long, in terms
of xids, a standby query can run before being canceled, like
old_snapshot_threshold, no?  After that, we can cancel standby queries.
I see hot_standby_feedback as our current behavior on the master where
we never cancel standby queries.

To me, hot_standby_feedback extends no-cleanup-no-cancel from the
standby to the master, while vacuum_defer_cleanup_age behaves like
old_snapshot_threshold in that it causes cancellation of long-running
queries.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Indirect indexes

2016-10-19 Thread Simon Riggs
On 19 October 2016 at 18:40, Bruce Momjian  wrote:
> On Wed, Oct 19, 2016 at 07:23:28PM +0530, Pavan Deolasee wrote:
>> AFAICS, even without considering VACUUM, indirect indexes would be always
>> used with recheck.
>> As long as they don't contain visibility information.  When indirect
>> indexed column was updated, indirect index would refer same PK with
>> different index keys.
>> There is no direct link between indirect index tuple and heap tuple, only
>> logical link using PK.  Thus, you would anyway have to recheck.
>>
>>
>>
>> I agree. Also, I think the recheck mechanism will have to be something like
>> what I wrote for WARM i.e. only checking for index quals won't be enough and 
>> we
>> would actually need to verify that the heap tuple satisfies the key in the
>> indirect index.
>
> I personally would like to see how far we get with WARM before adding
> this feature that requires a DBA to evaluate and enable it.

Assuming WARM is accepted, that *might* be fine.

What we should ask is what is the difference between indirect indexes
and WARM and to what extent they overlap.

My current understanding is that WARM won't help you if you update
parts of a JSON document and/or use GIN indexes, but is effective
without needing to add a new index type and will be faster for
retrieval than indirect indexes.

So everybody please chirp in with benefits or comparisons.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Move pg_largeobject to a different tablespace *without* turning on system_table_mods.

2016-10-19 Thread Andreas Joseph Krogh
On Wednesday 19 October 2016 at 18:44:24, Tom Lane <t...@sss.pgh.pa.us> wrote:
Andreas Joseph Krogh  writes:
 > On Wednesday 19 October 2016 at 18:29:31, Bruce Momjian wrote:
  I think an open question is why you would not want to move the other
 >  system tables at the same time you move pg_largeobject.

 > Are you saying that if I move all system-tables to the tablespace I moved
 > pg_largeobject to it'll work? If so, is there a convenient way to move all
 > system-tables to a tablespace?

 Not sure about moving them after the fact, but you could create the
 database with its default tablespace being the one you want pg_largeobject
 in.

 I think though that there's a fairly clear counterexample to Bruce's
 question: if you're worried about moving pg_largeobject at all, you
 probably are trying to put it on a relatively large and slow storage
 device.  You don't necessarily want all the system catalogs there.

 regards, tom lane
 
Thanks for the tip. How do I conveniently move all the 
tables/indexes/sequences etc. (basically everything in schema=public) except 
the system-tables to another tablespace?
I don't see any "ALTER SCHEMA public SET TABLESPACE myspace" command...
 
This is great when dealing with new databases, but do you have any hints
to help me get pg_upgrade working now that I have already moved
pg_largeobject (see my answer to Bruce)?
 
Thanks.
 
-- 
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com 
www.visena.com 
 


 


Re: [HACKERS] Indirect indexes

2016-10-19 Thread Claudio Freire
On Wed, Oct 19, 2016 at 10:21 AM, Simon Riggs  wrote:
>> Simon objected that putting the PK
>> into the index tuple would disable HOT, but I don't think that's a
>> valid objection.
>
> Just to be clear, that's not what I objected to. Claudio appeared to
> be suggesting that an indirect index is the same thing as an index
> with PK tacked onto the end, which I re-confirm is not the case since
> doing that would not provide the primary objective of indirect
> indexes.

No, I was suggesting using the storage format of those indexes.
Perhaps I wasn't clear.

CREATE INDEX could be implemented entirely as the rewrite I mention, I
believe. But everything else can't, as you say.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] packing/alignment annotation for ItemPointerData redux

2016-10-19 Thread Tom Lane
Greg Stark  writes:
> Sorry -- with the obvious error fixed:

You didn't show -E output from this version, but the other one had

>  __attribute__((packed))
>  __attribute__((aligned(2)))

so it appears that clang 4.0 does accept these attributes but then
produces the warning anyway.  I suggest filing this as a bug in clang 4.0,
and marking it as a regression from older versions which did not produce
such a warning.

If you get pushback claiming it's intentional, I'd be inclined to hack
our macro definitions so that we don't believe clang understands
attribute(aligned), because it evidently doesn't.  But let's see
their response first.
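
Purely as an illustration, assuming we keyed off clang's
__clang_major__ macro (hypothetical, not committed code), that hack
could look something like:

/* Pretend the attributes are unavailable on clang >= 4, so the
 * packed/aligned guard around ItemPointerData falls back to the
 * unpacked layout and the bogus warnings go away. */
#if defined(__clang__) && __clang_major__ >= 4
#undef pg_attribute_packed
#undef pg_attribute_aligned
#endif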

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Move pg_largeobject to a different tablespace *without* turning on system_table_mods.

2016-10-19 Thread Andreas Joseph Krogh
On Wednesday 19 October 2016 at 18:42:11, Bruce Momjian <br...@momjian.us> wrote:
On Wed, Oct 19, 2016 at 06:33:55PM +0200, Andreas Joseph Krogh wrote:
 > On Wednesday 19 October 2016 at 18:29:31, Bruce Momjian <br...@momjian.us> wrote:
 >
 >     On Tue, Oct 18, 2016 at 04:51:54PM +0200, Andreas Joseph Krogh wrote:
 >     >     > 2. Being able to move pg_largeobject to a different tablespace
 >     >     >    *without* turning on system_table_mods. This is important for
 >     >     >    people storing LOTS of large-objects on separate
 >     >     >    disks (non-SSD) and hence in a different tablespace.
 >     >     > Anyone willing to discuss this?
 >     >     > 
 >     >     This was proposed a few years ago but no one cared to draft a patch.
 >     >
 >     >  
 >     > So that why I'm re-raising the issue:-)
 >     > Having "everything in the database" adds lots of benefits, conceptually
 >     > (follows tx-semantics, consistent backups etc.), however it's currently
 >     > not so easy in practice.
 >
 >     Yeah, rereading that old thread was interesting, and unfortunate that no
 >     one mentioned the system catalog change would break pg_upgrade, though
 >     pg_upgrade was not popular at the time that thread was started.
 >
 >     I think an open question is why you would not want to move the other
 >     system tables at the same time you move pg_largeobject.
 >
 >  
 > The thing is that I don't understand what the problem really is. I have no
 > problem moving the other system-tables as well if that fixes the problem.
 > I tried moving pg_largeobject back to the same tablespace as the database but
 > that too gave the error.
 >  
 > Are you saying that if I move all system-tables to the tablespace I moved
 > pg_largeobject to it'll work? If so, is there a convenient way to move all
 > system-tables to a tablespace?

 Sure, use:

       ALTER DATABASE name SET TABLESPACE new_tablespace

       ...

       The fourth form changes the default tablespace of the database. Only
       the database owner or a superuser can do this; you must also have
       create privilege for the new tablespace. This command physically
       moves any tables or indexes in the database's old default tablespace
       to the new tablespace. The new default tablespace must be empty for
       this database, and no one can be connected to the database. Tables
       and indexes in non-default tablespaces are unaffected.
 
The thing is; I've created the database with explicit tablespace, like this:
createdb --tablespace=mydb -O andreak mydb
 
Then I've moved pg_largeobject:
 
alter table pg_largeobject set tablespace mydb_lo
 
What options do I now have to make pg_upgrade work? I have a 6TB db which I'd
like to upgrade to 9.6 using pg_upgrade, so any help accomplishing that is
greatly appreciated :-)
 
Thanks.
 
-- 
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com 
www.visena.com 
 


 


Re: [HACKERS] Move pg_largeobject to a different tablespace *without* turning on system_table_mods.

2016-10-19 Thread Tom Lane
Andreas Joseph Krogh  writes:
> On Wednesday 19 October 2016 at 18:29:31, Bruce Momjian wrote:
>   I think an open question is why you would not want to move the other
>  system tables at the same time you move pg_largeobject.

> Are you saying that if I move all system-tables to the tablespace I moved 
> pg_largeobject to it'll work? If so, is there a convenient way to move all 
> system-tables to a tablespace?

Not sure about moving them after the fact, but you could create the
database with its default tablespace being the one you want pg_largeobject
in.

I think though that there's a fairly clear counterexample to Bruce's
question: if you're worried about moving pg_largeobject at all, you
probably are trying to put it on a relatively large and slow storage
device.  You don't necessarily want all the system catalogs there.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Move pg_largeobject to a different tablespace *without* turning on system_table_mods.

2016-10-19 Thread David G. Johnston
On Wed, Oct 19, 2016 at 9:29 AM, Bruce Momjian  wrote:

> On Tue, Oct 18, 2016 at 04:51:54PM +0200, Andreas Joseph Krogh wrote:
> > > 2. Being able to move pg_largeobject to a different tablespace
> > >*without* turning on system_table_mods. This is important for
> > >people storing LOTS of large-objects on separate
> > >disks (non-SSD) and hence in a different tablespace.
>
> I think an open question is why you would not want to move the other
> system tables at the same time you move pg_largeobject.
>
>
I think the theory of having all system tables except LO on SSD storage,
and having LO on a less performant device, makes sense.

David J.


Re: [HACKERS] Move pg_largeobject to a different tablespace *without* turning on system_table_mods.

2016-10-19 Thread Bruce Momjian
On Wed, Oct 19, 2016 at 06:33:55PM +0200, Andreas Joseph Krogh wrote:
> On Wednesday 19 October 2016 at 18:29:31, Bruce Momjian wrote:
> 
> On Tue, Oct 18, 2016 at 04:51:54PM +0200, Andreas Joseph Krogh wrote:
> >     > 2. Being able to move pg_largeobject to a different tablespace
> >     >    *without* turning on system_table_mods. This is important for
> >     >    people storing LOTS of large-objects on separate
> >     >    disks (non-SSD) and hence in a different tablespace.
> >     > Anyone willing to discuss this?
> >     > 
> >     This was proposed a few years ago but no one cared to draft a patch.
> >
> >  
> > So that why I'm re-raising the issue:-)
> > Having "everything in the database" adds lots of benefits, conceptually
> > (follows tx-semantics, consistent backups etc.), however it's currently
> not so
> > easy in practice.
> 
> Yeah, rereading that old thread was interesting, and unfortunate that no
> one mentioned the system catalog change would break pg_upgrade, though
> pg_upgrade was not popular at the time that thread was started.
> 
> I think an open question is why you would not want to move the other
> system tables at the same time you move pg_largeobject.
> 
>  
> The thing is that I don't understand what the problem really is. I have no
> problem moving the other system-tables as well if that fixes the problem.
> I tried moving pg_largeobject back to the same tablespace as the database but
> that too gave the error.
>  
> Are you saying that if I move all system-tables to the tablespace I moved
> pg_largeobject to it'll work? If so, is there a convenient way to move all
> system-tables to a tablespace?

Sure, use:

  ALTER DATABASE name SET TABLESPACE new_tablespace

  ...

  The fourth form changes the default tablespace of the database. Only
  the database owner or a superuser can do this; you must also have
  create privilege for the new tablespace. This command physically
  moves any tables or indexes in the database's old default tablespace
  to the new tablespace. The new default tablespace must be empty for
  this database, and no one can be connected to the database. Tables
  and indexes in non-default tablespaces are unaffected.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Indirect indexes

2016-10-19 Thread Bruce Momjian
On Wed, Oct 19, 2016 at 07:23:28PM +0530, Pavan Deolasee wrote:
> AFAICS, even without considering VACUUM, indirect indexes would be always
> used with recheck.
> As long as they don't contain visibility information.  When indirect
> indexed column was updated, indirect index would refer same PK with
> different index keys.
> There is no direct link between indirect index tuple and heap tuple, only
> logical link using PK.  Thus, you would anyway have to recheck.
> 
> 
> 
> I agree. Also, I think the recheck mechanism will have to be something like
> what I wrote for WARM i.e. only checking for index quals won't be enough and 
> we
> would actually need to verify that the heap tuple satisfies the key in the
> indirect index. 

I personally would like to see how far we get with WARM before adding
this feature that requires a DBA to evaluate and enable it.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] packing/alignment annotation for ItemPointerData redux

2016-10-19 Thread Greg Stark
Sorry -- with the obvious error fixed:


$ /usr/bin/clang-4.0 -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -Wno-unused-command-line-argument -Wall -c  clang-bug.c
clang-bug.c:55:9: warning: taking address of packed member 'ip_blkid'
of class or structure
  'ItemPointerData' may result in an unaligned pointer value
[-Waddress-of-packed-member]
return ItemPointerGetBlockNumber(&ip);
   ^~
clang-bug.c:49:25: note: expanded from macro 'ItemPointerGetBlockNumber'
BlockIdGetBlockNumber(&(pointer)->ip_blkid) \
~~~^~~~
clang-bug.c:39:19: note: expanded from macro 'BlockIdGetBlockNumber'
(BlockNumber) (((blockId)->bi_hi << 16) | ((uint16) (blockId)->bi_lo)) \
 ^~~
clang-bug.c:55:9: warning: taking address of packed member 'ip_blkid'
of class or structure
  'ItemPointerData' may result in an unaligned pointer value
[-Waddress-of-packed-member]
return ItemPointerGetBlockNumber(&ip);
   ^~
clang-bug.c:49:25: note: expanded from macro 'ItemPointerGetBlockNumber'
BlockIdGetBlockNumber(&(pointer)->ip_blkid) \
~~~^~~~
clang-bug.c:39:55: note: expanded from macro 'BlockIdGetBlockNumber'
(BlockNumber) (((blockId)->bi_hi << 16) | ((uint16) (blockId)->bi_lo)) \
 ^~~
2 warnings generated.
#if defined(__GNUC__) || defined(__SUNPRO_C) || defined(__IBMC__)
#define pg_attribute_aligned(a) __attribute__((aligned(a)))
#define pg_attribute_packed() __attribute__((packed))
#endif

typedef unsigned BlockNumber;
typedef unsigned short uint16;
typedef uint16 OffsetNumber;

typedef struct BlockIdData
{
	uint16		bi_hi;
	uint16		bi_lo;
} BlockIdData;


typedef struct ItemPointerData
{
	BlockIdData ip_blkid;
	OffsetNumber ip_posid;
}
/* If compiler understands packed and aligned pragmas, use those */
#if defined(pg_attribute_packed) && defined(pg_attribute_aligned)
pg_attribute_packed()
pg_attribute_aligned(2)
#endif
ItemPointerData;

typedef ItemPointerData *ItemPointer;



/*
 * BlockIdGetBlockNumber
 *		Retrieve the block number from a block identifier.
 */
#define BlockIdGetBlockNumber(blockId) \
( \
	(BlockNumber) (((blockId)->bi_hi << 16) | ((uint16) (blockId)->bi_lo)) \
)


/*
 * ItemPointerGetBlockNumber
 *		Returns the block number of a disk item pointer.
 */
#define ItemPointerGetBlockNumber(pointer) \
( \
	BlockIdGetBlockNumber(&(pointer)->ip_blkid) \
)

int main() {
	ItemPointerData ip;

	return ItemPointerGetBlockNumber(&ip);
}

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] packing/alignment annotation for ItemPointerData redux

2016-10-19 Thread Greg Stark
Ah. Here we go:

$ /usr/bin/clang-4.0 -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -Wno-unused-command-line-argument -Wall -c  clang-bug.c
clang-bug.c:54:9: error: use of undeclared identifier 'BlockNumber'
return ItemPointerGetBlockNumber(&ip);
   ^
clang-bug.c:48:2: note: expanded from macro 'ItemPointerGetBlockNumber'
BlockIdGetBlockNumber(&(pointer)->ip_blkid) \
^
clang-bug.c:38:3: note: expanded from macro 'BlockIdGetBlockNumber'
(BlockNumber) (((blockId)->bi_hi << 16) | ((uint16) (blockId)->bi_lo)) \
 ^
1 error generated.


Preprocessor output:

# 1 "clang-bug.c"
# 1 "<built-in>" 1
# 1 "<built-in>" 3
# 317 "<built-in>" 3
# 1 "<command line>" 1
# 1 "<built-in>" 2
# 1 "clang-bug.c" 2





typedef unsigned short uint16;
typedef uint16 OffsetNumber;

typedef struct BlockIdData
{
 uint16 bi_hi;
 uint16 bi_lo;
} BlockIdData;


typedef struct ItemPointerData
{
 BlockIdData ip_blkid;
 OffsetNumber ip_posid;
}


__attribute__((packed))
__attribute__((aligned(2)))

ItemPointerData;

typedef ItemPointerData *ItemPointer;
# 51 "clang-bug.c"
int main() {
 ItemPointerData ip;

 return ( ( (BlockNumber) (((&(&ip)->ip_blkid)->bi_hi << 16) |
((uint16) (&(&ip)->ip_blkid)->bi_lo)) ) );
}
#if defined(__GNUC__) || defined(__SUNPRO_C) || defined(__IBMC__)
#define pg_attribute_aligned(a) __attribute__((aligned(a)))
#define pg_attribute_packed() __attribute__((packed))
#endif

typedef unsigned short uint16;
typedef uint16 OffsetNumber;

typedef struct BlockIdData
{
	uint16		bi_hi;
	uint16		bi_lo;
} BlockIdData;


typedef struct ItemPointerData
{
	BlockIdData ip_blkid;
	OffsetNumber ip_posid;
}
/* If compiler understands packed and aligned pragmas, use those */
#if defined(pg_attribute_packed) && defined(pg_attribute_aligned)
pg_attribute_packed()
pg_attribute_aligned(2)
#endif
ItemPointerData;

typedef ItemPointerData *ItemPointer;



/*
 * BlockIdGetBlockNumber
 *		Retrieve the block number from a block identifier.
 */
#define BlockIdGetBlockNumber(blockId) \
( \
	(BlockNumber) (((blockId)->bi_hi << 16) | ((uint16) (blockId)->bi_lo)) \
)


/*
 * ItemPointerGetBlockNumber
 *		Returns the block number of a disk item pointer.
 */
#define ItemPointerGetBlockNumber(pointer) \
( \
	BlockIdGetBlockNumber(&(pointer)->ip_blkid) \
)

int main() {
	ItemPointerData ip;

	return ItemPointerGetBlockNumber(&ip);
}

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Move pg_largeobject to a different tablespace *without* turning on system_table_mods.

2016-10-19 Thread Andreas Joseph Krogh
On Wednesday 19 October 2016 at 18:29:31, Bruce Momjian <br...@momjian.us> wrote:
On Tue, Oct 18, 2016 at 04:51:54PM +0200, Andreas Joseph Krogh wrote:
 >     > 2. Being able to move pg_largeobject to a different tablespace
 >     >    *without* turning on system_table_mods. This is important for
 >     >    people storing LOTS of large-objects on separate
 >     >    disks (non-SSD) and hence in a different tablespace.
 >     > Anyone willing to discuss this?
 >     > 
 >     This was proposed a few years ago but no one cared to draft a patch.
 >
 >  
 > So that why I'm re-raising the issue:-)
 > Having "everything in the database" adds lots of benefits, conceptually
 > (follows tx-semantics, consistent backups etc.), however it's currently not 
so
 > easy in practice.

 Yeah, rereading that old thread was interesting, and unfortunate that no
 one mentioned the system catalog change would break pg_upgrade, though
 pg_upgrade was not popular at the time that thread was started.

 I think an open question is why you would not want to move the other
 system tables at the same time you move pg_largeobject.
 
The thing is that I don't understand what the problem really is. I have no 
problem moving the other system-tables as well if that fixes the problem.
I tried moving pg_largeobject back to the same tablespace as the database but 
that too gave the error.
 
Are you saying that if I move all system-tables to the tablespace I moved 
pg_largeobject to it'll work? If so, is there a convenient way to move all 
system-tables to a tablespace?
 
Thanks.
 
-- 
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com 
www.visena.com 
 


 


Re: [HACKERS] [COMMITTERS] packing/alignment annotation for ItemPointerData redux

2016-10-19 Thread Greg Stark
On Wed, Oct 19, 2016 at 5:20 PM, Tom Lane  wrote:
> Don't know how that version number compares to "3.8.0".

Argh. Just to further confuse matters apparently the warnings are from
clang 4.0. I had been testing with 3.8.0 earlier but updated at some
point.

And I haven't been able to reproduce them with a minimal test case yet.

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Move pg_largeobject to a different tablespace *without* turning on system_table_mods.

2016-10-19 Thread Bruce Momjian
On Tue, Oct 18, 2016 at 04:51:54PM +0200, Andreas Joseph Krogh wrote:
> > 2. Being able to move pg_largeobject to a different tablespace
> >    *without* turning on system_table_mods. This is important for
> >    people storing LOTS of large-objects on separate
> >    disks (non-SSD) and hence in a different tablespace.
> > Anyone willing to discuss this?
> > 
> This was proposed a few years ago but no one cared to draft a patch.
> 
>  
> So that why I'm re-raising the issue:-)
> Having "everything in the database" adds lots of benefits, conceptually
> (follows tx-semantics, consistent backups etc.), however it's currently not so
> easy in practice.

Yeah, rereading that old thread was interesting, and unfortunate that no
one mentioned the system catalog change would break pg_upgrade, though
pg_upgrade was not popular at the time that thread was started.

I think an open question is why you would not want to move the other
system tables at the same time you move pg_largeobject.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] packing/alignment annotation for ItemPointerData redux

2016-10-19 Thread Tom Lane
[ moved to -hackers ]

Greg Stark  writes:
> Back in 2001 a hack to add __attribute__((packed)) to ItemPtr was
> added with a comment "Appropriate whack upside the head for ARM"
> (dcbbdb1b3ee). I don't know if this is still a factor in 2016 or not
> but it has already resulted in some collateral damage in 2015 when
> some compiler took that as license to align the whole struct on single
> byte alignment when it was buried inside another struct
> (d4b538ea367de).

> I just tried compiling with Clang 3.8.0 and got tons of warnings about
> this because:

>   'ItemPointerData' may result in an unaligned pointer value
> [-Waddress-of-packed-member]
>   ...ItemPointerGetBlockNumber(&(xlrec->target_tid)),
>  ^~~
> ../../../src/include/storage/itemptr.h:69:25: note: expanded from macro
>   'ItemPointerGetBlockNumber'
> BlockIdGetBlockNumber(&(pointer)->ip_blkid) \
> ~~~^~~~
> ../../../src/include/storage/block.h:118:19: note: expanded from macro
> 'BlockIdGetBlockNumber'
> (BlockNumber) (((blockId)->bi_hi << 16) | ((uint16) 
> (blockId)->bi_lo)) \
>  ^~~

> Which seems to indicate that clang may not understand the
> "pg_attribute_aligned(2)" or perhaps it does and just doesn't take it
> into account when generating these warnings.

Ick.  Can you look to see how those macros are expanding on your clang?

> I'm sure there are other people testing clang -- isn't it the default
> on MacOS? Do they not see these warnings?

I've never seen this on MacOS.  The current compiler version (on Sierra)
is

$ gcc -v
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr 
--with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk/usr/include/c++/4.2.1
Apple LLVM version 8.0.0 (clang-800.0.38)
Target: x86_64-apple-darwin16.0.0
Thread model: posix
InstalledDir: 
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

Don't know how that version number compares to "3.8.0".

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Question about behavior of snapshot too old feature

2016-10-19 Thread Kevin Grittner
On Wed, Oct 19, 2016 at 11:11 AM, Bruce Momjian  wrote:
> On Wed, Oct 19, 2016 at 11:08:28AM -0500, Kevin Grittner wrote:
>> On Wed, Oct 19, 2016 at 10:04 AM, Bruce Momjian  wrote:
>>
>>> Slide 10 of this presentation has an example showing
>>> old_snapshot_threshold set to '1min':
>>>
>>> http://momjian.us/main/writings/pgsql/features.pdf
>>
>> If the presentation is intending to show reasonable values for
>> production use, that would be better as, maybe, '2h'.

> The example is just to illustrate the activities required to trigger it,
> e.g.  pg_sleep(), VACUUM.

Yeah, in such a demonstration you probably don't want to have
everyone sit for 2 hours, so 1 minute makes perfect sense.  :-)

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] packing/alignment annotation for ItemPointerData redux

2016-10-19 Thread Greg Stark
[resending to -hackers sorry]

Back in 2001 a hack to add __attribute__((packed)) to ItemPtr was
added with a comment "Appropriate whack upside the head for ARM"
(dcbbdb1b3ee). I don't know if this is still a factor in 2016 or not
but it has already resulted in some collateral damage in 2015 when
some compiler took that as license to align the whole struct on single
byte alignment when it was buried inside another struct
(d4b538ea367de).

I just tried compiling with Clang 3.8.0 and got tons of warnings about
this because:

  'ItemPointerData' may result in an unaligned pointer value
[-Waddress-of-packed-member]
  ...ItemPointerGetBlockNumber(&(xlrec->target_tid)),
 ^~~
../../../src/include/storage/itemptr.h:69:25: note: expanded from macro
  'ItemPointerGetBlockNumber'
BlockIdGetBlockNumber(&(pointer)->ip_blkid) \
~~~^~~~
../../../src/include/storage/block.h:118:19: note: expanded from macro
'BlockIdGetBlockNumber'
(BlockNumber) (((blockId)->bi_hi << 16) | ((uint16) (blockId)->bi_lo)) \
 ^~~

Which seems to indicate that clang may not understand the
"pg_attribute_aligned(2)" or perhaps it does and just doesn't take it
into account when generating these warnings.

I'm sure there are other people testing clang -- isn't it the default
on MacOS? Do they not see these warnings?

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Portable check for unportable macro usage

2016-10-19 Thread Tom Lane
A portability hazard that we fight constantly is that the <ctype.h>
macros such as isalpha() can't safely be called on "char" values.
Per POSIX, you have to make sure the input is cast to "unsigned char".
That's far too easy to forget, but on most machines you won't get any
warning about it.

We have a couple of buildfarm machines that will generate "char value
used as array subscript" warnings for this, because their implementations
of these macros are indeed just array lookups.  But I think gaur is
the only one that's still active, and it's not going to last forever.

Ralph Corderoy, one of the nmh hackers, was looking at this problem
and came up with this cute solution [1]:

#ifndef NDEBUG
#if EOF != -1
#error "Please report this to nmh's authors."
#endif

extern int ctype_identity[257]; /* [n] = n-1 */
#define isupper(c) ((isupper)((ctype_identity + 1)[c]))
...
#endif

This essentially converts the macros to always have an array lookup
as the first step, and thus you'll get the char-as-array-subscript
warning if your compiler has one.  (AFAIK, all versions of gcc will
emit that with -Wall.)
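
To make that concrete, here is a compile-only sketch of the wrapper in
action, assuming gcc or clang with -Wall (for real use the array would
need the identity initialization the comment describes):

#include <ctype.h>

int ctype_identity[257];		/* would be filled so that [n] == n - 1 */

#undef isupper
#define isupper(c) ((isupper)((ctype_identity + 1)[c]))

int
has_upper_bug(char ch)
{
	/* gcc -Wall (-Wchar-subscripts) warns here: array subscript has
	 * type 'char'.  Writing isupper((unsigned char) ch) is the fix. */
	return isupper(ch);
}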

You wouldn't want this in production, of course, since it's pure overhead
and the only value is as a compile-time check.  For our purposes, rather
than "#ifndef NDEBUG" we could either use "#ifdef USE_ASSERT_CHECKING"
or invent a separate macro to control this replacement.  I'd be a bit
inclined to invent a separate macro and set up a buildfarm member to
#define that symbol and throw an error on char-as-array-subscript.
Individual developers could #define it for test purposes but I wouldn't
expect most of us to bother, as long as there's a backstop in the build
farm.

Thoughts, objections?

regards, tom lane

[1] http://lists.nongnu.org/archive/html/nmh-workers/2016-10/msg00313.html


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Question about behavior of snapshot too old feature

2016-10-19 Thread Bruce Momjian
On Wed, Oct 19, 2016 at 11:08:28AM -0500, Kevin Grittner wrote:
> On Wed, Oct 19, 2016 at 10:04 AM, Bruce Momjian  wrote:
> 
> > Slide 10 of this presentation has an example showing
> > old_snapshot_threshold set to '1min':
> >
> > http://momjian.us/main/writings/pgsql/features.pdf
> 
> If the presentation is intending to show reasonable values for
> production use, that would be better as, maybe, '2h'.
> 
> As the documentation says at:
> 
> https://www.postgresql.org/docs/current/static/runtime-config-resource.html#GUC-OLD-SNAPSHOT-THRESHOLD

The example is just to illustrate the activities required to trigger it,
e.g.  pg_sleep(), VACUUM.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Question about behavior of snapshot too old feature

2016-10-19 Thread Kevin Grittner
On Wed, Oct 19, 2016 at 10:04 AM, Bruce Momjian  wrote:

> Slide 10 of this presentation has an example showing
> old_snapshot_threshold set to '1min':
>
> http://momjian.us/main/writings/pgsql/features.pdf

If the presentation is intending to show reasonable values for
production use, that would be better as, maybe, '2h'.

As the documentation says at:

https://www.postgresql.org/docs/current/static/runtime-config-resource.html#GUC-OLD-SNAPSHOT-THRESHOLD

| Useful values for production work probably range from a small
| number of hours to a few days. The setting will be coerced to a
| granularity of minutes, and small numbers (such as 0 or 1min) are
| only allowed because they may sometimes be useful for testing.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-10-19 Thread Robert Haas
On Thu, Oct 13, 2016 at 5:53 AM, Victor Wagner  wrote:
> On Thu, 13 Oct 2016 12:30:59 +0530
> Mithun Cy  wrote:
>> On Fri, Sep 30, 2016 at 2:14 PM, Victor Wagner 
>> wrote:
>> Okay but for me consistency is also important. Since we agree to
>> disagree on some of the comments and others have not expressed any
>> problem I am moving it to committer.
>
> Thank you for your efforts improving my patch

I haven't deeply reviewed this patch, but on a quick read-through it
certainly seems to need a lot of cleanup work.

+  
+  hostorder
+  

I don't think that putting everything at the same indentation level
matches the style of the surrounding documentation.

@@ -985,7 +1042,6 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
   
   
  
-
  
   connect_timeout
   

Spurious hunk.

+/* Host has explicitely specified port */

Spelling.  This same error is repeated elsewhere.

+/* Check if port is numeric */
+for (nptr=r+1;nptr'9')

Formatting.  Consider using pgindent to get this into correct PostgreSQL style.

+ * behavoir.

Spelling.

+ * the beginning (and putting its addres into given pointer.

Spelling.

+ *Returns 1 if address is choosen and 0 if there are no more addresses

Spelling.

+ * Initialize random number generator in case if nobody have done it
+ * before. Use value from rand along with time in case random number

Grammar ("in case nobody has done it already").  Also, is it really OK
for libpq to call srand() as a side effect?  I think that seems rather
unfriendly.
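
A side-effect-free alternative, sketched here with hypothetical names
and ignoring thread-safety for brevity, would be for libpq to keep
private generator state rather than touching the application's
rand()/srand():

#include <time.h>
#include <unistd.h>

static unsigned int pq_rand_state = 0;

static unsigned int
pq_private_rand(void)
{
	if (pq_rand_state == 0)
		pq_rand_state = (unsigned int) time(NULL) ^
			(unsigned int) getpid();
	/* classic LCG step; good enough for picking a host */
	pq_rand_state = pq_rand_state * 1103515245u + 12345u;
	return (pq_rand_state >> 16) & 0x7fff;
}

static int
pick_random_host(int nhosts)
{
	return (int) (pq_private_rand() % (unsigned int) nhosts);
}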

+ * taget_server_type is set to 'any' or is not exlicitely

Multiple spelling mistakes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2016-10-19 Thread Dilip Kumar
On Wed, Oct 19, 2016 at 12:39 PM, Andres Freund  wrote:
> Try measuring with something more heavy on bitmap scan time
> itself. E.g.
> SELECT SUM(l_extendedprice) FROM lineitem WHERE (l_shipdate >= 
> '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date);
> or similar.  The tpch queries don't actually spend that much time in the
> bitmapscan itself - the parallization of the rest of the query is what
> matters...

Yeah, I agree.

I have tested with this query; with the exact filter condition it was
taking a parallel sequential scan, so I modified the filter a bit and
tested.

Tested with all default configuration on my local machine. I think I
will generate more such test cases and do detailed testing on my
performance machine.


Explain Analyze results:
-
On Head:

postgres=# explain analyze  SELECT SUM(l_extendedprice) FROM lineitem
WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <=
'1996-03-31'::date);

 QUERY PLAN
---
 Aggregate  (cost=848805.90..848805.91 rows=1 width=32) (actual
time=12440.165..12440.166 rows=1 loops=1)
   ->  Bitmap Heap Scan on lineitem  (cost=143372.40..834833.25
rows=5589057 width=8) (actual time=1106.217..11183.722 rows=5678841
loops=1)
 Recheck Cond: ((l_shipdate >= '1995-01-01'::date) AND
(l_shipdate <= '1996-03-31'::date))
 Rows Removed by Index Recheck: 20678739
 Heap Blocks: exact=51196 lossy=528664
 ->  Bitmap Index Scan on idx_lineitem_shipdate
(cost=0.00..141975.13 rows=5589057 width=0) (actual
time=1093.376..1093.376 rows=5678841 loops=1)
   Index Cond: ((l_shipdate >= '1995-01-01'::date) AND
(l_shipdate <= '1996-03-31'::date))
 Planning time: 0.185 ms
 Execution time: 12440.819 ms
(9 rows)

After Patch:
---
postgres=# explain analyze  SELECT SUM(l_extendedprice) FROM lineitem
WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <=
'1996-03-31'::date);

   QUERY PLAN

--
-
 Finalize Aggregate  (cost=792751.16..792751.17 rows=1 width=32)
(actual time=6660.157..6660.157 rows=1 loops=1)
   ->  Gather  (cost=792750.94..792751.15 rows=2 width=32) (actual
time=6659.378..6660.117 rows=3 loops=1)
 Workers Planned: 2
 Workers Launched: 2
 ->  Partial Aggregate  (cost=791750.94..791750.95 rows=1
width=32) (actual time=6655.941..6655.941 rows=1 loops=3)
   ->  Parallel Bitmap Heap Scan on lineitem
(cost=143372.40..785929.00 rows=2328774 width=8) (actual
time=1980.797..6204.232 rows=1892947 loops=
3)
 Recheck Cond: ((l_shipdate >= '1995-01-01'::date)
AND (l_shipdate <= '1996-03-31'::date))
 Rows Removed by Index Recheck: 6930269
 Heap Blocks: exact=17090 lossy=176443
 ->  Bitmap Index Scan on idx_lineitem_shipdate
(cost=0.00..141975.13 rows=5589057 width=0) (actual
time=1933.454..1933.454 rows=5678841
loops=1)
   Index Cond: ((l_shipdate >=
'1995-01-01'::date) AND (l_shipdate <= '1996-03-31'::date))
 Planning time: 0.207 ms
 Execution time: 6669.195 ms
(13 rows)


Summary:
-> With the patch, overall execution is 2 times faster compared to head.
-> Bitmap creation with the patch is a bit slower compared to head, and that's
because of DHT vs. the efficient hash table.

I found one defect in the v2 patch, which I introduced during the last rebase.
That is fixed in v3.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


parallel-bitmap-heap-scan-v3.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transactions involving multiple postgres foreign servers

2016-10-19 Thread Robert Haas
On Thu, Oct 13, 2016 at 7:27 AM, Amit Langote
 wrote:
> However, when I briefly read the description in "Transaction Management in
> the R* Distributed Database Management System (C. Mohan et al)" [2], it
> seems that what Ashutosh is saying might be a correct way to proceed after
> all:

I think Ashutosh is mostly right, but I think there's a lot of room to
doubt whether the design of this patch is good enough that we should
adopt it.

Consider two possible designs.  In design #1, the leader performs the
commit locally and then tries to send COMMIT PREPARED to every standby
server afterward, and only then acknowledges the commit to the client.
In design #2, the leader performs the commit locally and then
acknowledges the commit to the client at once, leaving the task of
running COMMIT PREPARED to some background process.  Design #2
involves a race condition, because it's possible that the background
process might not complete COMMIT PREPARED on every node before the
user submits the next query, and that query might then fail to see
supposedly-committed changes.  This can't happen in design #1.  On the
other hand, there's always the possibility that the leader's session
is forcibly killed, even perhaps by pulling the plug.  If the
background process contemplated by design #2 is well-designed, it can
recover and finish sending COMMIT PREPARED to each relevant server
after the next restart.  In design #1, that background process doesn't
necessarily exist, so inevitably there is a possibility of orphaning
prepared transactions on the remote servers, which is not good. Even
if the DBA notices them, it won't be easy to figure out whether to
commit them or roll them back.

I think this thought experiment shows that, on the one hand, there is
a point to waiting for commits on the foreign servers, because it can
avoid the anomaly of not seeing the effects of your own commits.  On
the other hand, it's ridiculous to suppose that every case can be
handled by waiting, because that just isn't true.  You can't be sure
that you'll be able to wait long enough for COMMIT PREPARED to
complete, and even if that works out, you may not want to wait
indefinitely for a dead server.  Waiting for a ROLLBACK PREPARED has
no value whatsoever unless the system design is such that failing to
wait for it results in the ROLLBACK PREPARED never getting performed
-- which is a pretty poor excuse.

Moreover, there are good reasons to think that doing this kind of
cleanup work in the post-commit hooks is never going to be acceptable.
Generally, the post-commit hooks need to be no-fail, because it's too
late to throw an ERROR.  But there's very little hope that a
connection to a remote server can be no-fail; anything that involves a
network connection is, by definition, prone to failure.  We can try to
guarantee that every single bit of code that runs in the path that
sends COMMIT PREPARED only raises a WARNING or NOTICE rather than an
ERROR, but that's going to be quite difficult to do: even palloc() can
throw an error.  And what about interrupts?  We don't want to be stuck
inside this code for a long time without any hope of the user
recovering control of the session by pressing ^C, but of course the
way that works is it throws an ERROR, which we can't handle here.  We
fixed a similar issue for synchronous replication in
9a56dc3389b9470031e9ef8e45c95a680982e01a by making an interrupt emit a
WARNING in that case and then return control to the user.  But if we
do that here, all of the code that every FDW emits has to be aware of
that rule and follow it, and it just adds to the list of ways that the
user backend can escape this code without having cleaned up all of the
prepared transactions on the remote side.

It seems to me that the only way to really make this feature robust is
to have a background worker as part of the equation.  The background
worker launches at startup and looks around for local state that tells
it whether there are any COMMIT PREPARED or ROLLBACK PREPARED
operations pending that weren't completed during the last server
lifetime, whether because of a local crash or remote unavailability.
It attempts to complete those and retries periodically.  When a new
transaction needs this type of coordination, it adds the necessary
crash-proof state and then signals the background worker.  If
appropriate, it can wait for the background worker to complete, just
like a CHECKPOINT waits for the checkpointer to finish -- but if the
CHECKPOINT command is interrupted, the actual checkpoint is
unaffected.
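
To sketch the shape of that in miniature (illustrative C only --
PendingTxn, try_remote(), and the in-memory list stand in for what would
really be PostgreSQL's bgworker machinery and crash-proof on-disk state):

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

typedef enum { ACTION_COMMIT_PREPARED, ACTION_ROLLBACK_PREPARED } Action;

typedef struct
{
    const char *server;  /* remote server to contact */
    const char *gid;     /* prepared transaction identifier */
    Action      action;  /* what still needs to happen there */
    bool        done;    /* resolved yet? */
} PendingTxn;

/* Stand-in for sending COMMIT/ROLLBACK PREPARED 'gid' to a server.
 * A real implementation talks over a network connection that can fail;
 * on failure we simply leave the entry pending and retry later. */
static bool try_remote(const PendingTxn *t)
{
    printf("%s PREPARED '%s' on %s\n",
           t->action == ACTION_COMMIT_PREPARED ? "COMMIT" : "ROLLBACK",
           t->gid, t->server);
    return true;
}

int main(void)
{
    /* A real worker reloads this list from crash-proof local state at
     * startup, so the work survives restarts and plug-pulls. */
    PendingTxn pending[] = {
        { "node1", "fdw_xact_42", ACTION_COMMIT_PREPARED,   false },
        { "node2", "fdw_xact_42", ACTION_COMMIT_PREPARED,   false },
        { "node3", "fdw_xact_17", ACTION_ROLLBACK_PREPARED, false },
    };
    int remaining = 3;

    while (remaining > 0)
    {
        for (int i = 0; i < 3; i++)
        {
            if (!pending[i].done && try_remote(&pending[i]))
            {
                pending[i].done = true;  /* a real worker persists this */
                remaining--;
            }
        }
        if (remaining > 0)
            sleep(10);   /* retry period for unreachable servers */
    }
    return 0;
}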

More broadly, the question has been raised as to whether it's right to
try to handle atomic commit and atomic visibility as two separate
problems.  The XTM API proposed by Postgres Pro aims to address both
with a single stroke.  I don't think that API was well-designed, but
maybe the idea is good even if the code is not.  Generally, there are
two ways in which you could imagine that a distributed version of
PostgreSQL might work.  One p

Re: [HACKERS] Question about behavior of snapshot too old feature

2016-10-19 Thread Bruce Momjian
On Mon, Oct 17, 2016 at 08:04:43AM -0500, Kevin Grittner wrote:
> > We have a regression test for this feature, but it sets
> > old_snapshot_threshold = 0; I doubt we can test it properly that way.
> > Am I missing something?
> 
> This is a hard feature to test properly, and certainly hard to test
> without the test running for a long time.  The zero setting is
> really not intended to be used in production, but only to allow
> some half-way decent testing that doesn't take extreme lengths of
> time.  If you add some delays of a few minutes each at key points
> in a test, you should be able to get a test that works with a
> setting of 1min.  It is not impossible that we might need to add a
> memory barrier to one or two places to get such tests to behave
> consistently, but I have not been able to spot where, if anywhere,
> that would be.

Slide 10 of this presentation has an example showing
old_snapshot_threshold set to '1min':

http://momjian.us/main/writings/pgsql/features.pdf

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-19 Thread Bruce Momjian
On Wed, Oct 19, 2016 at 08:54:48AM -0500, Merlin Moncure wrote:
> > Yeah.  Believe me -- I know the drill.  Most or all the damage seemed
> > to be to the system catalogs with at least two critical tables dropped
> > or inaccessible in some fashion.  A lot of the OIDs seemed to be
> > pointing at the wrong thing.  Couple more datapoints here.
> >
> > *) This database is OLTP, doing ~ 20 tps avg (but very bursty)
> > *) Another database on the same cluster was not impacted.  However
> > it's more olap style and may not have been written to during the
> > outage
> >
> > Now, this infrastructure running this system is running maybe 100ish
> > postgres clusters and maybe 1000ish sql server instances with
> > approximately zero unexplained data corruption issues in the 5 years
> > I've been here.  Having said that, this definitely smells and feels
> > like something on the infrastructure side.  I'll follow up if I have
> > any useful info.
> 
> After a thorough investigation I now have credible evidence the source
> of the damage did not originate from the database itself.
> Specifically, this database is mounted on the same volume as the
> operating system (I know, I know) and something non database driven
> sucked up disk space very rapidly and exhausted the volume -- fast
> enough that sar didn't pick it up.  Oh well :-) -- thanks for the help

However, disk space exhaustion should not lead to corruption unless the
underlying layers lied in some way.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-10-19 Thread Robert Haas
On Mon, Oct 17, 2016 at 8:36 AM, Amit Kapila  wrote:
>> This project of mine is about parallelizing tuplesort.c, which isn't
>> really what you want for parallel query -- you shouldn't try to scope
>> the problem as "make the sort more scalable using parallelism" there.
>> Rather, you want to scope it at "make the execution of the entire
>> query more scalable using parallelism", which is really quite a
>> different thing, which necessarily involves the executor having direct
>> knowledge of partition boundaries.
>
> Okay, but what is the proof, or why do you think the second is going
> to be better than the first?  One thing which strikes me as a major
> difference between your approach and Gather Merge is that in your
> approach the leader has to wait till all the workers are done with
> their sorting, whereas with Gather Merge, as soon as the first one is
> done, the leader starts merging.  I could be wrong here, but if I
> understood it correctly, then there is an argument that a Gather Merge
> kind of approach can win in cases where some of the workers can produce
> sorted outputs ahead of others, and I am not sure we can dismiss such cases.

Gather Merge can't emit a tuple unless it has buffered at least one
tuple from every producer; otherwise, the next tuple it receives from
one of those producers might precede whichever tuple it chooses to
emit.  However, it doesn't need to wait until all of the workers are
completely done.  The leader only needs the slowest worker to stay
slightly ahead of it.  I'm not sure how that compares to Peter's
approach.
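
Here's the invariant reduced to a toy standalone merge (illustrative C,
not the executor code): the leader may emit the smallest buffered head
only because it can see a head tuple from every non-exhausted producer.

#include <limits.h>
#include <stdio.h>

#define NPRODUCERS 3

/* Each "producer" delivers a sorted stream; -1 marks exhaustion.  In the
 * real executor these arrive incrementally, and the leader must wait
 * until every producer has buffered at least one tuple before comparing. */
static int streams[NPRODUCERS][4] = {
    { 1, 4, 7, -1 },
    { 2, 3, 9, -1 },
    { 5, 6, 8, -1 },
};

int main(void)
{
    int pos[NPRODUCERS] = { 0, 0, 0 };

    for (;;)
    {
        int best = -1;
        int bestval = INT_MAX;

        for (int i = 0; i < NPRODUCERS; i++)
        {
            int v = streams[i][pos[i]];

            if (v >= 0 && v < bestval)
            {
                bestval = v;
                best = i;
            }
        }
        if (best < 0)
            break;                /* every producer is exhausted */
        printf("%d\n", bestval);  /* safe: no later tuple can precede it */
        pos[best]++;
    }
    return 0;
}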

What I'm worried about is that we're implementing two separate systems
to do the same thing, and that the parallel sort approach is actually
a lot less general.  I think it's possible to imagine a Parallel Sort
implementation which does things Gather Merge can't.  If all of the
workers collaborate to sort all of the data rather than each worker
sorting its own data, then you've got something which Gather Merge
can't match.  But this is not that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Indirect indexes

2016-10-19 Thread Robert Haas
On Wed, Oct 19, 2016 at 9:21 AM, Simon Riggs  wrote:
> The main problem IMV is GIN indexes. It's relatively easy to discuss
> variable-length PKs with btrees, but the GIN format is designed around
> the use of 6-byte values, so expanding beyond that would require
> significant redesign/reimplementation. That would be at least a year's
> work for not much benefit, so it cannot happen in the first release.

That doesn't bother me.  We can add an amcanindirect flag or similar.

>> The whole point of an indirect index is that it
>> doesn't disable HOT, and the physical location within the index page
>> you stick the PK value doesn't have any impact on whether that's safe.
>
> Agreed, though it does have an impact on the length of the index tuple
> and thus the size and effectiveness of the index.
>
> Perhaps it's best to see the restriction to 6-byte PKs as both the first
> phase of implementation and an optimization, rather than an ugly wart.

Perhaps, but I'm not ready to rule out "ugly wart".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Draft for next update release (scheduled for 27th Oct)

2016-10-19 Thread Umair Shahid
On Wed, Oct 19, 2016 at 6:34 PM, Tom Lane  wrote:

> Umair Shahid  writes:
> > We are in the process of writing up draft notes for the upcoming update
> > releases. To initiate the process, we need to list out the major fixes
> > going into the release. Need help from -hackers to generate that list.
>
> Why don't you just wait for the release notes to be written?  I'll
> probably do that towards the end of the week.  But the raw material
> for that comes from "git log", or even more specifically
> src/tools/git_changelog.
>

Yes, I'll lean on the release notes for the 'Additional Bug Fixes' section.
This mail, however, is intended to gather info about the major issues that
are being addressed.


> regards, tom lane


Re: [HACKERS] emergency outage requiring database restart

2016-10-19 Thread Merlin Moncure
On Tue, Oct 18, 2016 at 8:45 AM, Merlin Moncure  wrote:
> On Mon, Oct 17, 2016 at 2:04 PM, Alvaro Herrera
>  wrote:
>> Merlin Moncure wrote:
>>
>>> castaging=# CREATE OR REPLACE VIEW vw_ApartmentSample AS
>>> castaging-#   SELECT ...
>>> ERROR:  42809: "pg_cast_oid_index" is an index
>>> LINE 11:   FROM ApartmentSample s
>>> ^
>>> LOCATION:  heap_openrv_extended, heapam.c:1304
>>>
>>> should I be restoring from backups?
>>
>> It's pretty clear to me that you've got catalog corruption here.  You
>> can try to fix things manually as they emerge, but that sounds like a
>> fool's errand.
>
> Yeah.  Believe me -- I know the drill.  Most or all the damage seemed
> to be to the system catalogs with at least two critical tables dropped
> or inaccessible in some fashion.  A lot of the OIDs seemed to be
> pointing at the wrong thing.  Couple more datapoints here.
>
> *) This database is OLTP, doing ~ 20 tps avg (but very bursty)
> *) Another database on the same cluster was not impacted.  However
> it's more olap style and may not have been written to during the
> outage
>
> Now, this infrastructure running this system is running maybe 100ish
> postgres clusters and maybe 1000ish sql server instances with
> approximately zero unexplained data corruption issues in the 5 years
> I've been here.  Having said that, this definitely smells and feels
> like something on the infrastructure side.  I'll follow up if I have
> any useful info.

After a thorough investigation I now have credible evidence the source
of the damage did not originate from the database itself.
Specifically, this database is mounted on the same volume as the
operating system (I know, I know) and something non database driven
sucked up disk space very rapidly and exhausted the volume -- fast
enough that sar didn't pick it up.  Oh well :-) -- thanks for the help

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Indirect indexes

2016-10-19 Thread Pavan Deolasee
On Wed, Oct 19, 2016 at 7:19 PM, Alexander Korotkov
<a.korot...@postgrespro.ru> wrote:

> On Wed, Oct 19, 2016 at 3:52 PM, Robert Haas wrote:
>
>> The VACUUM problems seem fairly serious.  It's true that these indexes
>> will be less subject to bloat, because they only need updating when
>> the PK or the indexed columns change, not when other indexed columns
>> change.  On the other hand, there's nothing to prevent a PK from being
>> recycled for an unrelated tuple.  We can guarantee that a TID won't be
>> recycled until all index references to the TID are gone, but there's
>> no such guarantee for a PK.  AFAICT, that would mean that an indirect
>> index would have to be viewed as unreliable: after looking up the PK,
>> you'd *always* have to recheck that it actually matched the index
>> qual.
>>
>
> AFAICS, even without considering VACUUM, indirect indexes would always be
> used with recheck, as long as they don't contain visibility information.
> When an indirectly indexed column is updated, the indirect index will
> refer to the same PK under different index keys.
> There is no direct link between an indirect index tuple and the heap
> tuple, only a logical link through the PK.  Thus, you would have to
> recheck anyway.
>
>
I agree. Also, I think the recheck mechanism will have to be something like
what I wrote for WARM, i.e. only checking the index quals won't be enough;
we would actually need to verify that the heap tuple satisfies the key in
the indirect index.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Indirect indexes

2016-10-19 Thread Alexander Korotkov
On Wed, Oct 19, 2016 at 3:52 PM, Robert Haas  wrote:

> The VACUUM problems seem fairly serious.  It's true that these indexes
> will be less subject to bloat, because they only need updating when
> the PK or the indexed columns change, not when other indexed columns
> change.  On the other hand, there's nothing to prevent a PK from being
> recycled for an unrelated tuple.  We can guarantee that a TID won't be
> recycled until all index references to the TID are gone, but there's
> no such guarantee for a PK.  AFAICT, that would mean that an indirect
> index would have to be viewed as unreliable: after looking up the PK,
> you'd *always* have to recheck that it actually matched the index
> qual.
>

AFAICS, even without considering VACUUM, indirect indexes would always be
used with recheck, as long as they don't contain visibility information.
When an indirectly indexed column is updated, the indirect index will
refer to the same PK under different index keys.
There is no direct link between an indirect index tuple and the heap
tuple, only a logical link through the PK.  Thus, you would have to
recheck anyway.
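
A tiny standalone model of why (contents invented for illustration): the
indirect index still carries a stale ("banana" -> pk 2) entry from before
that tuple was updated, and only the recheck against the heap filters it out.

#include <stdio.h>
#include <string.h>

typedef struct { int pk; const char *val; } HeapTuple;

/* The heap; the PK index is modeled as simple array position here. */
static HeapTuple heap[] = {
    { 1, "apple"  },
    { 2, "pear"   },   /* was "banana" when its index entry was made */
    { 3, "banana" },
};

/* Indirect index on val: maps indexed value -> PK, never a TID. */
typedef struct { const char *key; int pk; } IndirectEntry;
static IndirectEntry idx[] = {
    { "apple",  1 },
    { "banana", 2 },   /* stale: heap tuple 2 no longer matches */
    { "banana", 3 },
};

int main(void)
{
    const char *search = "banana";

    for (int i = 0; i < 3; i++)
    {
        const HeapTuple *tup;

        if (strcmp(idx[i].key, search) != 0)
            continue;
        tup = &heap[idx[i].pk - 1];         /* fetch heap tuple via PK */

        /* Mandatory recheck: the PK may now point at a tuple whose
         * indexed column no longer matches the search key. */
        if (strcmp(tup->val, search) == 0)
            printf("match: pk=%d val=%s\n", tup->pk, tup->val);
        else
            printf("filtered false positive: pk=%d val=%s\n",
                   tup->pk, tup->val);
    }
    return 0;
}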

Another approach would be to include visibility information in the indirect
indexes themselves.  In this case, the index would support snapshots by
itself and not produce false positives.
This approach would require far more intrusive changes in the index AMs.
We would probably not be able to reuse the same index AM and would have to
make a new one.  But it seems like a rather better design to me.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Draft for next update release (scheduled for 27th Oct)

2016-10-19 Thread Tom Lane
Umair Shahid  writes:
> We are in the process of writing up draft notes for the upcoming update
> releases. To initiate the process, we need to list out the major fixes
> going into the release. Need help from -hackers to generate that list.

Why don't you just wait for the release notes to be written?  I'll
probably do that towards the end of the week.  But the raw material
for that comes from "git log", or even more specifically
src/tools/git_changelog.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FSM corruption leading to errors

2016-10-19 Thread Pavan Deolasee
On Wed, Oct 19, 2016 at 6:44 PM, Heikki Linnakangas  wrote:

> On 10/19/2016 02:32 PM, Heikki Linnakangas wrote:
>> Oh, forgot that this needs to be backported, of course. Will do that
>> shortly...
>
> Done.

Thanks!


>
> This didn't include anything to cope with an already-corrupt FSM, BTW. Do
> we still want to try something for that? I think it's good enough if we
> prevent the FSM corruption from happening, but not sure what the consensus
> on that might be..
>
>
I thought it would be nice to handle an already-corrupt FSM, since our
customer found this immediately after a failover and it was a serious
issue. In one case a system table was affected, preventing all DDLs from
running. Having said that, I don't have a better idea for handling the
problem without causing non-trivial overhead in normal cases (see my
original patch). If you have better ideas, they might be worth pursuing.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] FSM corruption leading to errors

2016-10-19 Thread Tom Lane
Heikki Linnakangas  writes:
> This didn't include anything to cope with an already-corrupt FSM, BTW. 
> Do we still want to try something for that? I think it's good enough if 
> we prevent the FSM corruption from happening, but not sure what the 
> consensus on that might be..

Can we document an existing procedure for repairing FSM corruption?
(VACUUM, maybe?)  We're going to need to be able to explain how to
fix busted visibility maps in 9.6.1, too.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using pg_ctl promote -w in TAP tests

2016-10-19 Thread Peter Eisentraut
On 10/10/16 8:37 AM, Michael Paquier wrote:
> Now that we have support for the wait mode of pg_ctl promote, I think
> that it would be a good idea to switch to it in the TAP tests. This
> allows avoiding extra logic with poll_query_until() to be sure that a
> promoted standby is ready for read-write queries.
> See the patch attached.

committed

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Indirect indexes

2016-10-19 Thread Simon Riggs
On 19 October 2016 at 14:52, Robert Haas  wrote:
> On Tue, Oct 18, 2016 at 2:28 PM, Alvaro Herrera
>  wrote:
>> I propose we introduce the concept of "indirect indexes".  I have a toy
>> implementation and before I go further with it, I'd like this assembly's
>> input on the general direction.
>>
>> Indirect indexes are similar to regular indexes, except that instead of
>> carrying a heap TID as payload, they carry the value of the table's
>> primary key.  Because this is laid out on top of existing index support
>> code, values indexed by the PK can only be six bytes long (the length of
>> ItemPointerData); in other words, 281,474,976,710,656 rows are
>> supported, which should be sufficient for most use cases.[1]
>
> So, I think that this is a really promising direction, but also that
> you should try very hard to try to get out from under this 6-byte PK
> limitation.  That seems really ugly, and in practice it probably means
> your PK is probably going to be limited to int4, which is kind of sad
> since it leaves people using int8 or text PKs out in the cold.  I
> believe Claudio Freire is on to something when he suggests storing the
> PK in the index tuple; one could try to skip storing the TID, or
> always store it as all-zeroes.

The main problem IMV is GIN indexes. It's relatively easy to discuss
variable-length PKs with btrees, but the GIN format is designed around
the use of 6-byte values, so expanding beyond that would require
significant redesign/reimplementation. That would be at least a year's
work for not much benefit, so it cannot happen in the first release.

A limit of 281 trillion rows means the row headers alone would occupy
about 9 petabytes (2^48 rows at roughly 32 bytes of per-row overhead),
before we even include block wastage and data payload per row. So that
is more than we'll need for a few years. Setting the max sequence value
to 281474976710656 should be easy enough.

> Simon objected that putting the PK
> into the index tuple would disable HOT, but I don't think that's a
> valid objection.

Just to be clear, that's not what I objected to. Claudio appeared to
be suggesting that an indirect index is the same thing as an index with
the PK tacked onto the end, which, I reconfirm, is not the case, since
doing that would not provide the primary objective of indirect indexes.

> The whole point of an indirect index is that it
> doesn't disable HOT, and the physical location within the index page
> you stick the PK value doesn't have any impact on whether that's safe.

Agreed, though it does have an impact on the length of the index tuple
and thus the size and effectiveness of the index.

Perhaps it's best to see the restriction to 6-byte PKs as both the first
phase of implementation and an optimization, rather than an ugly wart.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FSM corruption leading to errors

2016-10-19 Thread Heikki Linnakangas

On 10/19/2016 02:32 PM, Heikki Linnakangas wrote:
> On 10/19/2016 02:29 PM, Heikki Linnakangas wrote:
>> On 10/19/2016 01:07 PM, Pavan Deolasee wrote:
>>> Anyways, we seem good to go with the patch.
>>
>> Ok, committed. Thanks for the analysis!
>
> Oh, forgot that this needs to be backported, of course. Will do that
> shortly...

Done.

This didn't include anything to cope with an already-corrupt FSM, BTW.
Do we still want to try something for that? I think it's good enough if
we prevent the FSM corruption from happening, but not sure what the
consensus on that might be..


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove vacuum_defer_cleanup_age

2016-10-19 Thread Robert Haas
On Wed, Oct 19, 2016 at 8:47 AM, Bruce Momjian  wrote:
> On Wed, Oct 19, 2016 at 08:33:20AM -0400, Robert Haas wrote:
>> On Tue, Oct 18, 2016 at 4:33 PM, Josh Berkus  wrote:
>> > Based on that argument, we would never be able to remove any
>> > configuration parameter ever.
>>
>> Well... no.  Based on that argument, we should only remove
>> configuration parameters if we're fairly certain that they are not
>> useful any more, which will be rare, but is not never.  I agree that
>> *if* vacuum_defer_cleanup_age is no longer useful, it should be
>> removed.  I'm just not convinced that it's truly obsolete, and you
>> haven't really offered much of an argument for that proposition.  It
>> does something sufficiently different from hot_standby_feedback that
>> I'm not sure it's accurate to say that one can substitute for the
>> other, and indeed, I see Andres has already suggested some scenarios
>> where it could still be useful.
>>
>> Actually, I think vacuum_defer_cleanup_age is, and always has been, an
>> ugly hack.  But for some people it may be the ugly hack that is
>> letting them continue to use PostgreSQL.
>
> I see vacuum_defer_cleanup_age as old_snapshot_threshold for standby
> servers --- it cancels transactions rather than delaying cleanup.

I think it's the opposite, isn't it?  vacuum_defer_cleanup_age
prevents cancellations.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Indirect indexes

2016-10-19 Thread Robert Haas
On Tue, Oct 18, 2016 at 2:28 PM, Alvaro Herrera
 wrote:
> I propose we introduce the concept of "indirect indexes".  I have a toy
> implementation and before I go further with it, I'd like this assembly's
> input on the general direction.
>
> Indirect indexes are similar to regular indexes, except that instead of
> carrying a heap TID as payload, they carry the value of the table's
> primary key.  Because this is laid out on top of existing index support
> code, values indexed by the PK can only be six bytes long (the length of
> ItemPointerData); in other words, 281,474,976,710,656 rows are
> supported, which should be sufficient for most use cases.[1]

So, I think that this is a really promising direction, but also that
you should try very hard to try to get out from under this 6-byte PK
limitation.  That seems really ugly, and in practice it probably means
your PK is probably going to be limited to int4, which is kind of sad
since it leaves people using int8 or text PKs out in the cold.  I
believe Claudio Freire is on to something when he suggests storing the
PK in the index tuple; one could try to skip storing the TID, or
always store it as all-zeroes.  Simon objected that putting the PK
into the index tuple would disable HOT, but I don't think that's a
valid objection.  The whole point of an indirect index is that it
doesn't disable HOT, and the physical location within the index page
you stick the PK value doesn't have any impact on whether that's safe.

The VACUUM problems seem fairly serious.  It's true that these indexes
will be less subject to bloat, because they only need updating when
the PK or the indexed columns change, not when other indexed columns
change.  On the other hand, there's nothing to prevent a PK from being
recycled for an unrelated tuple.  We can guarantee that a TID won't be
recycled until all index references to the TID are gone, but there's
no such guarantee for a PK.  AFAICT, that would mean that an indirect
index would have to be viewed as unreliable: after looking up the PK,
you'd *always* have to recheck that it actually matched the index
qual.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: two slab-like memory allocators

2016-10-19 Thread Tomas Vondra


On 10/19/2016 12:27 AM, Petr Jelinek wrote:
> On 18/10/16 22:25, Robert Haas wrote:
>> On Wed, Oct 5, 2016 at 12:22 AM, Tomas Vondra
>>  wrote:
>>> attached is v3 of the patches, with a few minor fixes in Slab, and much
>>> larger fixes in GenSlab.
>>>
>>> Slab (minor fixes)
>>> --
>>> - Removed the unnecessary memset() of new blocks in SlabAlloc(), although we
>>> still need to zero the free bitmap at the end of the block.
>>> - Renamed minFreeCount to minFreeChunks, added a few comments explaining
>>> why/how firstFreeChunk and minFreeChunks are maintained.
>>> - Fixed / improved a bunch of additional comments, based on feedback.
>>
>> I had a look at 0001 today, but it seems to me that it still needs
>> work.  It's still got a lot of remnants of where you've
>> copy-and-pasted aset.c.  I dispute this allegation:
>>
>> + * SlabContext is our standard implementation of MemoryContext.
>>
> 
> Are you looking at old version of the patch? I complained about this as
> well and Tomas has changed that.
> 
>> And then there's this:
>>
>> +#ifdef HAVE_ALLOCINFO
>> +#define SlabFreeInfo(_cxt, _chunk) \
>> +fprintf(stderr, "AllocFree: %s: %p, %d\n", \
>> +(_cxt)->header.name, (_chunk), (_chunk)->size)
>> +#define SlabAllocInfo(_cxt, _chunk) \
>> +fprintf(stderr, "AllocAlloc: %s: %p, %d\n", \
>> +(_cxt)->header.name, (_chunk), (_chunk)->size)
>>
>> Well, it's pretty stupid that AllocSetAlloc is reporting its name as
>> AllocAlloc, a thing that, as far as I can tell, is not real. But
>> having this new context type also pretend to be AllocAlloc is even
>> dumber.
> 
> You are definitely looking at old version.
> 

Yeah.

>>
>> More broadly, I'm not sure I like this design very much.  The whole
>> point of a slab context is that all of the objects are the same size.
>> I wouldn't find it too difficult to support this patch if we were
>> adding an allocator for fixed-size objects that was then being used to
>> allocate objects which are fixed size.  However, what we seem to be
>> doing is creating an allocator for fixed-size objects and then using
>> it for variable-size tuples.  That's really pretty weird.  Isn't the
>> root of this problem that aset.c is utterly terrible at handling large
>> number of allocations?  Maybe we should try to attack that problem
>> more directly.
> 
> It's used for TXNs, which are fixed-size, and some tuples (there is an
> assumption that the decoded tuples have a more or less normal
> distribution).
> 

Yeah. There are three contexts in reorder buffers:

- changes (fixed size)
- txns (fixed size)
- tuples (variable size)

The first two work perfectly fine with Slab.
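
For reference, the fixed-size case really is that simple -- a toy
standalone sketch of the core idea (illustrative only, not the patch's
SlabContext; no accounting, and blocks are never returned to the OS):

#include <stdio.h>
#include <stdlib.h>

#define CHUNK_SIZE       64   /* every allocation is exactly this size */
#define CHUNKS_PER_BLOCK 128

/* a free chunk doubles as a freelist link; a live chunk is raw payload */
typedef union FreeChunk
{
    union FreeChunk *next;
    char payload[CHUNK_SIZE];
} FreeChunk;

typedef struct { FreeChunk *freelist; } Slab;

static void slab_grow(Slab *s)
{
    FreeChunk *block = malloc(sizeof(FreeChunk) * CHUNKS_PER_BLOCK);

    /* thread every chunk of the new block onto the freelist */
    for (int i = 0; i < CHUNKS_PER_BLOCK; i++)
    {
        block[i].next = s->freelist;
        s->freelist = &block[i];
    }
}

static void *slab_alloc(Slab *s)
{
    FreeChunk *c;

    if (s->freelist == NULL)
        slab_grow(s);
    c = s->freelist;
    s->freelist = c->next;
    return c;
}

static void slab_free(Slab *s, void *p)
{
    FreeChunk *c = p;

    c->next = s->freelist;   /* O(1), no per-chunk header needed */
    s->freelist = c;
}

int main(void)
{
    Slab s = { NULL };
    void *a = slab_alloc(&s);
    void *b = slab_alloc(&s);

    slab_free(&s, a);
    void *c = slab_alloc(&s);   /* reuses a's chunk */
    printf("%p %p %p (c == a: %s)\n", a, b, c, c == a ? "yes" : "no");
    return 0;
}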

The last one (tuples) is used to allocate variable-sized bits, so I've
tried to come up with something smart - a sequence of Slabs + an overflow
AllocSet. I agree that in hindsight it's a bit strange, and that the
"generational" aspect is the key here - i.e. it might be possible to
implement a memory context that allocates variable-length chunks but
still segregates them into generations. That is, don't build this on top
of Slab. That would also fix the issue with two allocators in GenSlab.
I'll think about this.

> I agree though that the usability beyond the ReorderBuffer is limited,
> because everything that wants to use it for part of its allocations
> will get much more complicated by the fact that it has to use two
> different allocators.
>

It wasn't my (primary) goal to provide allocators usable outside
ReorderBuffer. I intended to show that perhaps using AllocSet and
then trying to compensate for the pfree() issues is the wrong direction,
and that perhaps a different allocation strategy (exploiting the
ReorderBuffer specifics) would work much better. And I think the two
allocators prove that.

>
> I was wondering if, rather than trying to implement a new allocator, we
> should maybe implement palloc_fixed, which would use some optimized
> algorithm for fixed-size objects in our current allocator. The
> advantage of that would be that we could, for example, use it for
> things like ListCell easily (memory management of which I see quite
> often in profiles).
> 

I don't see how inventing palloc_fixed() solves any of the problems, and
I think it's going to be much more complicated than you think. The idea
of injecting this into AllocSet seems like a dead end to me, as the code
is already complex enough and it's likely to cause regressions no matter
what you do.

I prefer the idea of implementing separate specialized memory contexts.
If the bar is moved to "implement palloc_fixed()" or something like
that, someone else will have to do that - I'm not all that interested in
ReorderBuffer (this was the first time I actually saw that code), so my
motivation to spend much more time on this is rather small.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Long options for pg_ctl waiting

2016-10-19 Thread Peter Eisentraut
On 9/7/16 5:08 PM, Vik Fearing wrote:
> On 09/07/2016 10:41 PM, Alvaro Herrera wrote:
>> Gavin Flower wrote:
>>
>>> possibly '--nosync' (& any similar) should have a '--no-sync' variation
>>> added, with the '--nosync' variation documented as deprecated?
>>
>> I agree -- I would go as far as just documenting --no-sync only and
>> keeping the --nosync one working with minimal (if any) visibility in
>> docs.
> 
> Okay, here's a patch to do that.  I don't think it's the other patch's
> job to do it.
> 
> I also changed --noclean to --no-clean, and --no-locale was already correct.

committed

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Long options for pg_ctl waiting

2016-10-19 Thread Peter Eisentraut
On 9/3/16 4:57 PM, Vik Fearing wrote:
> One thing that has been irking me ever since I came to PostgreSQL is the
> fact that pg_ctl -w (and -W) don't have longhand equivalents.  I like to
> use the long version in scripts and such as extra documentation, and
> I've never been able to with these.  What's more, I keep forgetting that
> --wait (and --no-wait) aren't a thing.

committed

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove vacuum_defer_cleanup_age

2016-10-19 Thread Bruce Momjian
On Wed, Oct 19, 2016 at 08:33:20AM -0400, Robert Haas wrote:
> On Tue, Oct 18, 2016 at 4:33 PM, Josh Berkus  wrote:
> > Based on that argument, we would never be able to remove any
> > configuration parameter ever.
> 
> Well... no.  Based on that argument, we should only remove
> configuration parameters if we're fairly certain that they are not
> useful any more, which will be rare, but is not never.  I agree that
> *if* vacuum_defer_cleanup_age is no longer useful, it should be
> removed.  I'm just not convinced that it's truly obsolete, and you
> haven't really offered much of an argument for that proposition.  It
> does something sufficiently different from hot_standby_feedback that
> I'm not sure it's accurate to say that one can substitute for the
> other, and indeed, I see Andres has already suggested some scenarios
> where it could still be useful.
> 
> Actually, I think vacuum_defer_cleanup_age is, and always has been, an
> ugly hack.  But for some people it may be the ugly hack that is
> letting them continue to use PostgreSQL.

I see vacuum_defer_cleanup_age as old_snapshot_threshold for standby
servers --- it cancels transactions rather than delaying cleanup.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Draft for next update release (scheduled for 27th Oct)

2016-10-19 Thread Michael Paquier
On Wed, Oct 19, 2016 at 9:35 PM, Heikki Linnakangas  wrote:
> Off the top of my head, these two recent commits were quite significant:
>
> 1c02ee314bc17ba0ff226c33a64935d02ff1208a (FSM corruption leading to errors)
> a5f0bd77a2fab60a52dc335a63efc21abc806aa7 (use-after-free around DISTINCT
> transition function calls.)
>
> There have been a lot of other commits, though.

Another bad boy is dca25c2562199ce1e7e26367613912a8eadbbde8.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Draft for next update release (scheduled for 27th Oct)

2016-10-19 Thread Heikki Linnakangas

On 10/19/2016 12:06 PM, Umair Shahid wrote:
> Hi,
>
> We are in the process of writing up draft notes for the upcoming update
> releases. To initiate the process, we need to list out the major fixes
> going into the release. Need help from -hackers to generate that list.

I think you need to read through the commit log since the last minor
releases. There's a script to help with that, in src/tools/git_changelog.
You can do something like "src/tools/git_changelog --since=2016-08-09" to
list all commits since the last minor releases. Compared to plain "git
log", it lists commits from all the stable branches, and back-ported
commits are listed only once.

Off the top of my head, these two recent commits were quite significant:

1c02ee314bc17ba0ff226c33a64935d02ff1208a (FSM corruption leading to errors)

a5f0bd77a2fab60a52dc335a63efc21abc806aa7 (use-after-free around DISTINCT
transition function calls.)

There have been a lot of other commits, though.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove vacuum_defer_cleanup_age

2016-10-19 Thread Robert Haas
On Tue, Oct 18, 2016 at 4:33 PM, Josh Berkus  wrote:
> Based on that argument, we would never be able to remove any
> configuration parameter ever.

Well... no.  Based on that argument, we should only remove
configuration parameters if we're fairly certain that they are not
useful any more, which will be rare, but is not never.  I agree that
*if* vacuum_defer_cleanup_age is no longer useful, it should be
removed.  I'm just not convinced that it's truly obsolete, and you
haven't really offered much of an argument for that proposition.  It
does something sufficiently different from hot_standby_feedback that
I'm not sure it's accurate to say that one can substitute for the
other, and indeed, I see Andres has already suggested some scenarios
where it could still be useful.

Actually, I think vacuum_defer_cleanup_age is, and always has been, an
ugly hack.  But for some people it may be the ugly hack that is
letting them continue to use PostgreSQL.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FSM corruption leading to errors

2016-10-19 Thread Michael Paquier
On Wed, Oct 19, 2016 at 8:29 PM, Heikki Linnakangas  wrote:
> On 10/19/2016 01:07 PM, Pavan Deolasee wrote:
>>
>> Anyways, we seem good to go with the patch.
>
> Ok, committed. Thanks for the analysis!

Thanks! I am surprised that you kept the TAP test in the end.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FSM corruption leading to errors

2016-10-19 Thread Heikki Linnakangas

On 10/19/2016 02:29 PM, Heikki Linnakangas wrote:
> On 10/19/2016 01:07 PM, Pavan Deolasee wrote:
>> Anyways, we seem good to go with the patch.
>
> Ok, committed. Thanks for the analysis!

Oh, forgot that this needs to be backported, of course. Will do that
shortly...


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FSM corruption leading to errors

2016-10-19 Thread Heikki Linnakangas

On 10/19/2016 01:07 PM, Pavan Deolasee wrote:
> Anyways, we seem good to go with the patch.

Ok, committed. Thanks for the analysis!

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Renaming of pg_xlog and pg_clog

2016-10-19 Thread Christoph Berg
Re: Bruce Momjian 2016-10-19 <20161018220909.ga11...@momjian.us>
> > There's actually another instance of "rename so people shoot their
> > feet less often" here: pg_resetxlog, which is a user-facing tool.
> > Folks on #postgresql have repeatedly been joking that it should rather
> > be named pg_eatmydata, given the number of "howtos" on the web that
> > make use of it. Maybe this is the chance to find a less
> > innocent-sounding name. (Or add a mandatory --rewind-my-data switch.)
> 
> I wonder how many of the uses of pg_resetxlog were caused by mistakenly
> removing the pg_xlog directory.  My point is that renaming pg_xlog might
> reduce mistaken use of pg_resetxlog.

I don't think there's much of a connection. There are people who
clean up disk space by removing everything that sounds like log files.
For those people renaming the directories makes sense so they don't
have that idea.

There are other people who have random database startup problems, ask
google, and end up with applying some pg_resetxlog advice (that
doesn't necessarily fit their problem). For those people renaming
pg_resetxlog to something that sounds more like "it will break your
data, use as last resort only" might make sense. (Though I don't have
a good suggestion, and the cost of renaming utilities is higher than
renaming some internal directory names.)

The question now is whether there are people who used pg_resetxlog
because they thought it freed up disk space, and whether renaming either
would have prevented that. I'm less sure about that.

(tl;dr: rename pg_xlog yes, rename pg_resetxlog only if we have a good
alternative.)

Christoph


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LLVM Address Sanitizer (ASAN) and valgrind support

2016-10-19 Thread Greg Stark
On Sat, Feb 6, 2016 at 4:52 AM, Noah Misch  wrote:
> aset.c relies on the fact that VALGRIND_MEMPOOL_ALLOC() has an implicit
> VALGRIND_MAKE_MEM_UNDEFINED() and VALGRIND_MEMPOOL_FREE() has an implicit
> VALGRIND_MAKE_MEM_NOACCESS().  #define those two accordingly.  If ASAN has no

Actually this is confusing.

aset.c doesn't actually use the MEMPOOL_ALLOC macro at all; it just
calls UNDEFINED, DEFINED, and NOACCESS. mcxt.c, however, does do the
MEMPOOL_ALLOC/FREE. So both layers are calling these macros for
overlapping memory areas, which I find very confusing, and I'm not sure
what the net effect is.

The MEMPOOL_FREE macro doesn't take any size argument, and mcxt.c doesn't
have convenient access to one. It could call the GetChunkSpace method,
but that would include the allocation overhead, and in any case isn't
this memory already marked noaccess by aset.c?
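
For what it's worth, a shim along the lines Noah suggested might look
like the sketch below (untested; ASAN_POISON_MEMORY_REGION and
ASAN_UNPOISON_MEMORY_REGION are from <sanitizer/asan_interface.h> and
become no-ops when not compiling under ASan). The size-less
MEMPOOL_FREE is exactly where it falls apart:

/* Hypothetical mapping of the Valgrind client requests used by mcxt.c
 * and aset.c onto ASan's poisoning interface. */
#include <sanitizer/asan_interface.h>

/* no-access memory: poisoned, so any touch is reported */
#define VALGRIND_MAKE_MEM_NOACCESS(addr, size) \
    ASAN_POISON_MEMORY_REGION((addr), (size))

/* ASan has no defined-vs-undefined distinction; both just unpoison */
#define VALGRIND_MAKE_MEM_UNDEFINED(addr, size) \
    ASAN_UNPOISON_MEMORY_REGION((addr), (size))
#define VALGRIND_MAKE_MEM_DEFINED(addr, size) \
    ASAN_UNPOISON_MEMORY_REGION((addr), (size))

/* Per Noah's note, MEMPOOL_ALLOC implies MAKE_MEM_UNDEFINED... */
#define VALGRIND_MEMPOOL_ALLOC(cxt, addr, size) \
    ASAN_UNPOISON_MEMORY_REGION((addr), (size))

/* ...but MEMPOOL_FREE implies MAKE_MEM_NOACCESS, and there is no size
 * argument to poison with.  A shim would have to recover the chunk size
 * from somewhere (e.g. the chunk header), which is the sticking point. */
#define VALGRIND_MEMPOOL_FREE(cxt, addr) \
    ((void) (addr))   /* cannot poison without a size */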

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FSM corruption leading to errors

2016-10-19 Thread Pavan Deolasee
On Wed, Oct 19, 2016 at 2:37 PM, Heikki Linnakangas  wrote:

> Actually, this is still not 100% safe. Flushing the WAL before modifying
> the FSM page is not enough. We also need to WAL-log a full-page image of
> the FSM page, otherwise we are still vulnerable to the torn page problem.
>
> I came up with the attached. This is fortunately much simpler than my
> previous attempt. I replaced the MarkBufferDirtyHint() calls with
> MarkBufferDirty(), to fix the original issue, plus WAL-logging a full-page
> image to fix the torn page issue.
>
>
Looks good to me.


>> BTW any thoughts on the race condition on the primary? Comments at
>> MarkBufferDirtyHint() seem to suggest that a race condition is possible
>> which might leave the buffer without the DIRTY flag, but I'm not sure if
>> that can only happen when the page is locked in shared mode.
>>
>
> I think the race condition can only happen when the page is locked in
> shared mode. In any case, with this proposed fix, we'll use
> MarkBufferDirty() rather than MarkBufferDirtyHint(), so it's moot.
>
>
Yes, the fix will cover that problem (if it exists). The reason I was
curious is that there are several reports of similar errors in the past,
and some of them did not involve a standby. Those reports mostly remained
unresolved, and I wondered if this explains them. But yeah, my conclusion
was that the race is not possible with the page locked in EXCLUSIVE mode.
So maybe there is another problem somewhere, or crash recovery may have
left the FSM in an inconsistent state.

Anyways, we seem good to go with the patch.

Thanks,
Pavan
-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Indirect indexes

2016-10-19 Thread Simon Riggs
On 18 October 2016 at 23:46, Alexander Korotkov
 wrote:

> Then vacuum removes (0;1) from the heap, and the reference to (0;1) from
> tbl_pk_idx.  But how will it remove the (1,1) tuple from
> tbl_val_indirect_idx?  Thus, before vacuuming tbl_val_indirect_idx we
> should know not only the values of id which are being removed, but the
> actual (id, val) pairs which are being removed.  Should we collect those
> pairs while scanning the heap?  But we should also take into account that
> multiple heap tuples might have the same (id, val) pair values (assuming
> there could be other columns being updated).  Therefore, we should take
> into account when the last instance of a particular (id, val) pair was
> deleted from the heap.  That would be a very big change to vacuum, maybe
> even writing a far more complex vacuum algorithm from scratch.  Probably,
> you see a better solution to this problem.

The best way to sum up the problem is to consider how we deal with
repeated updates to a single tuple that flip the value from A to B,
then back to A, then to B, then A, etc. Any value in the index can point
to multiple versions of the same tuple, and multiple index values can
point to the same tuple (PK value). This problem behaviour was already
known to me from Claudio's earlier analysis of WARM (thanks Claudio).

Yes, VACUUMing that is likely to be a complex issue, as you say. At
the moment I don't have a plan for that, but am not worried.

Indirect indexes produce fewer index entries in general than regular
indexes, so the problem is by design much smaller than the current
situation. Indirect indexes can support the killed-tuple interface, so
index scans by users will result in natural index maintenance, further
reducing the problem.

So there will be a much reduced need for bulk maintenance. Bulk
maintenance of the index, when needed, can be performed by scanning
the whole table via the index, after the PK index has been vacuumed.
That can be optimized using an index-only scan of the PK to avoid
touching the heap, which should be effective since the VM has been so
recently refreshed. For correctness it would require the index blocks
to be locked against writes while checking for removal, so bulk
collection of values to optimize the underlying index doesn't seem
useful. The index scan could also be further optimized by introducing
a visibility map for the index, which is something that would also
optimize normal index VACUUMs as well, but that is a later project and
not something for 10.x

At this stage, the discussion should be 1) can it work? 2) do we want
it? At present, I see that it can work and we just need to be careful
to measure the effectiveness of it to demonstrate that it really is a
better way of doing things in some cases. Indirect indexes are
definitely not a panacea for all problems, for me it is just another
option to add into the rich world of Postgres indexing. Getting
traction can be difficult if people don't understand the pros and cons
of the different index types, but I'm happy we have a wide spread of
knowledge now.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FSM corruption leading to errors

2016-10-19 Thread Heikki Linnakangas

On 10/18/2016 07:01 AM, Pavan Deolasee wrote:
> On Mon, Oct 17, 2016 at 4:34 PM, Heikki Linnakangas wrote:
>> visibilitymap_truncate is actually also wrong, in a different way. The
>> truncation WAL record is written only after the VM (and FSM) are
>> truncated. But visibilitymap_truncate() has already modified and dirtied
>> the page. If the VM page change is flushed to disk before the WAL
>> record, and you crash, you might have a torn VM page and a checksum
>> failure.
>
> Right, I missed the problem with checksums.
>
>> Simply replacing the MarkBufferDirtyHint() call with MarkBufferDirty()
>> in FreeSpaceMapTruncateRel would have the same issue. If you call
>> MarkBufferDirty(), you must WAL-log the change, and also set the page's
>> LSN to make sure the WAL record is flushed first.
>
> I'm not sure I fully understand this. If the page is flushed before the
> WAL record, we might have a truncated FSM whereas the relation truncation
> is not replayed. But that's not the same problem, right? I mean, you
> might have an FSM which is not accurate, but at least it won't return
> blocks outside the range. Having said that, I agree your proposed changes
> are more robust.

Actually, this is still not 100% safe. Flushing the WAL before modifying
the FSM page is not enough. We also need to WAL-log a full-page image of
the FSM page, otherwise we are still vulnerable to the torn page problem.

I came up with the attached. This is fortunately much simpler than my
previous attempt. I replaced the MarkBufferDirtyHint() calls with
MarkBufferDirty(), to fix the original issue, plus WAL-logging a
full-page image to fix the torn page issue.
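
To see why the full-page image matters, here's a toy standalone
simulation (illustrative C only, nothing from the patch): the page is
written as two "sectors", a crash tears the write, the checksum catches
the mismatch, and replaying the FPI from WAL repairs the page wholesale.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 8   /* toy page: two 4-byte "sectors" */

/* toy checksum, standing in for the real page checksum */
static uint32_t checksum(const uint8_t *page)
{
    uint32_t sum = 0;

    for (int i = 0; i < PAGE_SIZE; i++)
        sum = sum * 31 + page[i];
    return sum;
}

int main(void)
{
    uint8_t old_page[PAGE_SIZE], new_page[PAGE_SIZE];
    uint8_t disk[PAGE_SIZE], wal_fpi[PAGE_SIZE];

    memset(old_page, 0xAA, PAGE_SIZE);
    memset(new_page, 0xBB, PAGE_SIZE);

    /* WAL-before-data: the full image of the new page is made durable
     * in WAL before the data file is touched. */
    memcpy(wal_fpi, new_page, PAGE_SIZE);

    /* Crash mid-write: only the first sector of the new page hits disk;
     * the second sector still holds the old contents -- a torn page. */
    memcpy(disk, new_page, PAGE_SIZE / 2);
    memcpy(disk + PAGE_SIZE / 2, old_page + PAGE_SIZE / 2, PAGE_SIZE / 2);

    printf("torn page detected by checksum: %s\n",
           checksum(disk) != checksum(new_page) ? "yes" : "no");

    /* Recovery: replaying the FPI overwrites the torn page wholesale,
     * so its intermediate on-disk state never matters. */
    memcpy(disk, wal_fpi, PAGE_SIZE);
    printf("page restored from FPI: %s\n",
           memcmp(disk, new_page, PAGE_SIZE) == 0 ? "yes" : "no");
    return 0;
}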



> BTW any thoughts on the race condition on the primary? Comments at
> MarkBufferDirtyHint() seem to suggest that a race condition is possible
> which might leave the buffer without the DIRTY flag, but I'm not sure if
> that can only happen when the page is locked in shared mode.


I think the race condition can only happen when the page is locked in 
shared mode. In any case, with this proposed fix, we'll use 
MarkBufferDirty() rather than MarkBufferDirtyHint(), so it's moot.


- Heikki



truncation-vm-fsm-2.patch
Description: application/download

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Draft for next update release (scheduled for 27th Oct)

2016-10-19 Thread Umair Shahid
Hi,

We are in the process of writing up draft notes for the upcoming update
releases. To initiate the process, we need to list out the major fixes
going into the release. Need help from -hackers to generate that list.

Thanks!

- Umair


Re: [HACKERS] minor issue: \c without parameter disconnect current user

2016-10-19 Thread Pavel Stehule
2016-10-18 20:11 GMT+02:00 Alvaro Herrera :

> Robert Haas wrote:
> > On Mon, Oct 17, 2016 at 12:49 AM, Pavel Stehule wrote:
> > > I expected \c without parameters to be only informational. But \c
> > > resets the user.
> >
> > Yeah, I use that feature all the time.
>
> And so do the regression tests.
>
> I think what Pavel wants is \conninfo.
>

yes - still something new for me :)

Pavel


>
> --
> Álvaro Herrera    https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: [HACKERS] Parallel bitmap heap scan

2016-10-19 Thread Andres Freund
On 2016-10-19 09:43:10 +0530, Dilip Kumar wrote:
> On Tue, Oct 18, 2016 at 1:42 AM, Robert Haas  wrote:
> > But what's the impact on performance?  Presumably parallel bitmap heap
> > scan was already slower than the non-parallel version, and that commit
> > presumably widens the gap.  Seems like something to worry about...
> 
> I have checked the performance on my local machine and there is no
> impact on the gap.

Try measuring with something heavier on bitmap scan time itself. E.g.
SELECT SUM(l_extendedprice) FROM lineitem WHERE (l_shipdate >=
'1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date);
or similar.  The tpch queries don't actually spend that much time in the
bitmap scan itself - the parallelization of the rest of the query is what
matters...

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers