Re: [HACKERS] checkpointer continuous flushing

2015-06-02 Thread Amit Langote

Fabien,

On 2015-06-03 PM 02:53, Fabien COELHO wrote:
> 
>>
>> It seems 'sync' gets closer to what I really wanted 'flush' to mean. If I
>> understand this and the previous discussion(s) correctly, the patch tries to
>> alleviate the problems caused by one-big-sync-at-the-end-of-writes by doing
>> the sync in step with writes (which do abide by the
>> checkpoint_completion_target). Given that impression, it seems
>> *_sync_on_write may even do the job.
> 
> I disagree with this one, because the sync is only *initiated*, not done. For
> this reason I think that "flush" seems a better word. I understand "sync" as
> "committed to disk". For the data to actually be synced, the call would need
> the "wait after" option, which amounts to a partial "fsync", but that would be
> terrible for performance as all checkpointed pages would be written one by one,
> without any opportunity for reordering them.
> 
> For what it's worth and for the record, Linux sync_file_range documentation
> says "This is an asynchronous flush-to-disk operation" to describe the
> corresponding option. This is probably where I took it.
> 

Ah, okay! I hadn't quite thought about the async aspect here. But I sure do
hope that the added mechanism turns out to be *less* async than the kernel's own
dirty cache handling, to achieve the hoped-for gain.

> So two contenders:
> 
>   *_flush_to_disk
>   *_flush_on_write
> 

Yep!

Regards,
Amit





Re: [HACKERS] checkpointer continuous flushing

2015-06-02 Thread Fabien COELHO



I agree with you that if we have to add a sort phase, there is additional
work, and that work could be significant depending on the design we
choose. However, without that, this patch can have an impact on many kinds
of workloads; even in your mail, one of the tests
("pgbench -M prepared -N -T 100 -j 2 -c 4 -P 1" over 32 runs (4 clients))
showed a 20% degradation, which is quite significant, and the test also
seems representative of the workload that many real-world users will run.


Yes, I do agree about the 4 clients, but I doubt that many users run their 
application at maximum available throughput all the time (like always 
driving with the foot to the floor). So for me throttled runs are more 
representative of real life.
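
For instance, a throttled run can be requested with pgbench's rate limiting,
along these lines (the 100 tps target is just an example):

  pgbench -M prepared -N -T 100 -j 2 -c 4 -P 1 -R 100   # aim for ~100 tps overall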



Now one could say that for such workloads the new knob should be turned off, but
in reality it can be difficult to predict whether the load will always be moderate.


Hmmm. The switch says "I prefer stable (say, latency-bounded) performance"; 
if you run a web site you probably want that.


Anyway, I'll look at sorting when I have some time.

--
Fabien.




Re: [HACKERS] checkpointer continuous flushing

2015-06-02 Thread Fabien COELHO


Hello Amit,


It is currently "*_flush_to_disk". In Andres Freund's version the name is
"sync_on_checkpoint_flush", but I did not find it very clear. Using
"*_flush_on_write" instead, as you suggest, would be fine as well; it
emphasizes the "when/how" it occurs instead of the final "destination", why
not...

[...]

It seems 'sync' gets closer to what I really wanted 'flush' to mean. If 
I understand this and the previous discussion(s) correctly, the patch 
tries to alleviate the problems caused by one-big-sync-at-the-end-of-writes 
by doing the sync in step with writes (which do abide by 
the checkpoint_completion_target). Given that impression, it seems 
*_sync_on_write may even do the job.


I disagree with this one, because the sync is only *initiated*, not done. 
For this reason I think that "flush" seems a better word. I understand 
"sync" as "committed to disk". For the data to actually be synced, the call 
would need the "wait after" option, which amounts to a partial "fsync", but 
that would be terrible for performance as all checkpointed pages would be 
written one by one, without any opportunity for reordering them.


For what it's worth and for the record, Linux sync_file_range 
documentation says "This is an asynchronous flush-to-disk operation" to 
describe the corresponding option. This is probably where I took it.
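
As a rough sketch of what "initiated, not done" means in practice (this is
illustrative only, not the patch's actual code): right after write()-ing a
block, the kernel is asked to start writing that range back, without waiting
for it:

/*
 * Sketch only.  SYNC_FILE_RANGE_WRITE starts writeback of the range and
 * returns immediately; adding SYNC_FILE_RANGE_WAIT_AFTER would turn this
 * into a blocking, per-range fsync-like call, which is what I argue
 * against above.
 */
#define _GNU_SOURCE
#include <fcntl.h>

static void
flush_after_write(int fd, off_t offset, off_t nbytes)
{
    (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}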


So two contenders:

  *_flush_to_disk
  *_flush_on_write
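
In postgresql.conf terms the contenders would presumably read something like
this (names still under discussion, shown here only for illustration):

  checkpoint_flush_to_disk = on    # or: checkpoint_flush_on_write = on
  bgwriter_flush_to_disk = on      # or: bgwriter_flush_on_write = on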

--
Fabien.




Re: [HACKERS] checkpointer continuous flushing

2015-06-02 Thread Fabien COELHO


That might be the case in a database with a single small table; i.e. 
where all the writes go to a single file. But as soon as you have 
large tables (i.e. many segments) or multiple tables, a significant 
part of the writes issued independently from checkpointing will be 
outside the processing of the individual segment.


Statistically, I think that it would reduce the number of unrelated writes
taken in a fsync by about half: the last table to be written on a
tablespace, at the end of the checkpoint, will have accumulated
checkpoint-unrelated writes (bgwriter, whatever) from the whole checkpoint
time, while the first table will have avoided most of them.


That's disregarding that a buffer written out by a backend starts to get
written out by the kernel after ~5-30s, even without a fsync triggering
it.


I meant my argument with "continuous flushing" activated, so there is no 
up-to-30-second delay induced by the memory manager. Hmmm, maybe I did not 
understand your argument.


--
Fabien.




Re: [HACKERS] checkpointer continuous flushing

2015-06-02 Thread Amit Kapila
On Tue, Jun 2, 2015 at 6:45 PM, Fabien COELHO  wrote:
>
>
> Hello Amit,
>
>> [...]
>>>
>>> The objective is to help avoid PG stalling when fsyncing on checkpoints,
>>> and in general to get better latency-bound performance.
>>
>>
>> Won't this lead to more-unsorted writes (random I/O) as the
>> FlushBuffer requests (by checkpointer or bgwriter) are not sorted as
>> per files or order of blocks on disk?
>
>
> Yep, probably. Under "moderate load" this is not an issue. The I/O
> scheduler and other HD firmware will probably reorder writes anyway.
> Also, if several data items are updated together, they are likely to be
> neighbours in memory as well as on disk.
>
>> I remember some time back there was some discussion regarding
>> sorting writes during checkpoint; one idea could be to try that
>> along with this patch.  I just saw that Andres has also given the
>> same suggestion, which indicates that it is important
>> to see both things together.
>
>
> I would rather separate them, unless this is a blocker. This version
> already seems quite effective and very light. ISTM that adding a sort phase
> would mean significantly reworking how the checkpointer processes pages.
>

I agree with you that if we have to add a sort phase, there is additional
work, and that work could be significant depending on the design we
choose. However, without that, this patch can have an impact on many kinds
of workloads; even in your mail, one of the tests
("pgbench -M prepared -N -T 100 -j 2 -c 4 -P 1" over 32 runs (4 clients))
showed a 20% degradation, which is quite significant, and the test also
seems representative of the workload that many real-world users will run.
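
As a rough sketch, the sort phase being discussed would amount to ordering
the to-be-written buffers by file and block before the write loop; the struct
and function names below are illustrative, not actual checkpointer code:

/*
 * Sketch only: order dirty buffers by tablespace, relation file and block so
 * that checkpoint writes become mostly sequential per segment.  Oid and
 * BlockNumber stand in here for the PostgreSQL typedefs of the same names.
 */
#include <stdlib.h>

typedef unsigned int Oid;
typedef unsigned int BlockNumber;

typedef struct CkptSortItem
{
    Oid         tsId;       /* tablespace */
    Oid         relNode;    /* relation file node */
    BlockNumber blockNum;   /* block within the file */
    int         buf_id;     /* shared buffer holding the page */
} CkptSortItem;

static int
ckpt_buforder_cmp(const void *pa, const void *pb)
{
    const CkptSortItem *a = (const CkptSortItem *) pa;
    const CkptSortItem *b = (const CkptSortItem *) pb;

    if (a->tsId != b->tsId)
        return (a->tsId < b->tsId) ? -1 : 1;
    if (a->relNode != b->relNode)
        return (a->relNode < b->relNode) ? -1 : 1;
    if (a->blockNum != b->blockNum)
        return (a->blockNum < b->blockNum) ? -1 : 1;
    return 0;
}

/* ... later: qsort(items, num_items, sizeof(CkptSortItem), ckpt_buforder_cmp); ... */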

Now one could say that for such workloads the new knob should be turned off, but
in reality it can be difficult to predict whether the load will always be moderate.
I think users might be able to predict that at the table level, but in spite of
that I don't think having any such knob gives us a ticket to flush the buffers
in random order.

>> Also, another related point here is that I think currently even the fsync
>> requests are not in the order of the files as they are stored on disk, so
>> that also might cause random I/O?
>
>
> I think that currently the fsync is on the file handle, so what happens
> depends on how fsync is implemented by the system.
>

That can also lead to random I/O if the fsyncs for the different files are not
issued in the order in which the files are actually stored on disk.


>> Yet another idea could be to allow BGWriter to also fsync the dirty
>> buffers,
>
>
> ISTM that it is done with this patch with "bgwriter_flush_to_disk=on".
>

I think the patch just issues an async operation, not the actual flush.  The
reason I have suggested this is that in your tests, when checkpoint_timeout
is small, there seems to be a good gain in performance.  That suggests that if
we keep flushing dirty buffers at regular intervals, the system's performance
is good, and BGWriter is the process where that can be done conveniently,
apart from checkpoint.  One might think that if the same can be achieved by
using a shorter checkpoint_timeout interval, then why do these incremental
flushes in bgwriter; but in reality I think checkpoint is responsible for
other things besides dirty buffers, so we can't leave everything until the
checkpoint happens.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file

2015-06-02 Thread Amit Kapila
On Tue, Jun 2, 2015 at 10:26 PM, Andrew Dunstan  wrote:

>
> On 05/15/2015 02:21 AM, Amit Kapila wrote:
>>
>>
>> Find the patch which gets rid of rmtree usage.  I have made it as
>> a separate function because the same code is used from
>> create_tablespace_directories() as well.  I thought of extending the
>> same API for using it from destroy_tablespace_directories() as well,
>> but due to special handling (especially for ENOENT) in that function,
>> I left it as of now.
>>
>>
>>
>>
>
>
> Well, it seems to me the new function is being altogether way too trusting
> about the nature of what it's being asked to remove. In the first place,
> the S_ISDIR/rmdir branch should only be for Windows, and secondly in the
> other branch we should be checking that S_ISLNK is true. It would actually
> be nice if we could test for a junction point on Windows, but that seems to
> be a bit difficult.


I think that during recovery, for tablespace-related operations, it is
quite possible to have a directory instead of a symlink in some
special cases (see function TablespaceCreateDbspace() and the comments
in destroy_tablespace_directories() { ..Try to remove the symlink.. }).
Also, this new function is being called from create_tablespace_directories(),
which uses the same code as written in the new function, so it doesn't make
much sense to split it into Windows-specific and non-Windows-specific code.



> The function should probably return a boolean to indicate any error,
> rather than void, so that the caller can take action accordingly.


I think returning a boolean is good if the function has Windows-specific
and non-Windows-specific code; otherwise we might need a short, int16
or something like that, as there will be three return values
(0 - success, 1 - failed to remove directory, 2 - failed to remove
symlink).

On the whole, this function is mainly usable for tablespace-related
paths during recovery; it isn't generic enough to be used from all
paths.  I think the proposed name of the new function (rmsymlink) looks
quite generic, so either we can change it to rm_tablespace_symlinks, or
maybe I can just duplicate the code in the read_tablespace_map()-related
check and then we don't need this new function.



> In the present case we should probably be stopping recovery dead in its
> tracks, but I certainly don't think we should just be ignoring it.
>
>
Do you think we should do anything before returning the error?
This operation happens at the beginning of recovery, before
replay of any records, so throwing an ERROR from here shouldn't
have any impact unless we have done some operation which
could create a problem when the user next starts the recovery.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Alvaro Herrera
Thomas Munro wrote:
> On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera  
> wrote:
> > My guess is that the file existed, and perhaps had one or more pages,
> > but the wanted page doesn't exist, so we tried to read but got 0 bytes
> > back.  read() returns 0 in this case but doesn't set errno.
> >
> > I didn't find a way to set things so that the file exists but is of
> > shorter contents than oldestMulti by the time the checkpoint record is
> > replayed.
> 
> I'm just starting to learn about the recovery machinery, so forgive me
> if I'm missing something basic here, but I just don't get this.  As I
> understand it, offsets/0046 should either have been copied with that
> page present in it if it existed before the backup started (apparently
> not in this case), or extended to contain it by WAL records that come
> after the backup label but before the checkpoint record that
> references it (also apparently not in this case).

Exactly --- that's the spot at which I am, also.  I have had this
spinning in my head for three days now, and tried every single variation
that I could think of, but like you I was unable to reproduce the issue.
However, our customer took a second base backup and it failed in exactly
the same way, modulo some changes to the counters (the file that
didn't exist was 004B rather than 0046).  I'm still at a loss as to what
the failure mode is.  We must be missing some crucial detail ...


-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] why does txid_current() assign new transaction-id?

2015-06-02 Thread Fujii Masao
On Wed, Jun 3, 2015 at 9:04 AM, Naoya Anzai  wrote:
>> > diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
>> > index 89a609f..6485b42 100644
>> > --- a/doc/src/sgml/func.sgml
>> > +++ b/doc/src/sgml/func.sgml
>> > @@ -16233,7 +16233,7 @@ SELECT collation for ('foo' COLLATE "de_DE");
>> >
>> > 
>> > txid_current()
>> > bigint
>> > -   get current transaction ID
>> > +   get current transaction ID, assigning a new one if current 
>> > transaction does not have one
>> ^ the
>
> That looks good.

Committed. Thanks!
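
For the record, the behaviour the updated wording describes looks like this
(a sketch; the XID values are just for illustration):

BEGIN;
SELECT txid_current();   -- forces assignment of an XID, e.g. returns 1234
SELECT txid_current();   -- same transaction, same XID: 1234
COMMIT;
SELECT txid_current();   -- new (implicit) transaction, new XID, e.g. 1235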

Regards,

-- 
Fujii Masao




Re: [HACKERS] nested loop semijoin estimates

2015-06-02 Thread Tom Lane
Tomas Vondra  writes:
> On 06/02/15 23:27, Tom Lane wrote:
>> Do we have instructions around here anyplace on how to set up/use
>> TPC-DS? I couldn't find anything about it on the wiki ...

> Not that I'm aware of, but it's not really all that difficult.
> [ instructions... ]

Thanks.  I added some debugging printouts to trace the logic, and found
that once in a while we get cases like this:

patch would change precheck result to fail for path cost 133752.12..136070.23 
vs old path 135378.72..135550.22
phantom path causes rejection of old path with cost 135378.72..135550.22
phantom path has final cost 133752.12..136161.64, accepted

That is, we have an unparameterized path whose (initial estimate of) total
cost is just slightly more than the total cost of some old path, but whose
startup cost is less.  The existing logic in add_path_precheck allows
construction of this path to proceed.  If its finished total cost is still
within 1% of the old path's cost, it is able to supplant the old path on
the grounds that the total costs are fuzzily the same while its startup
cost is less.  The patch's change to add_path_precheck causes the startup
cost to be disregarded so that the new path is rejected immediately, and
the path replacement doesn't happen.
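
(For concreteness, "fuzzily the same" boils down to comparisons of roughly
this shape, with a 1% fuzz factor; this is just a sketch, not the actual
compare_path_costs_fuzzily code.)

#include <stdbool.h>

#define FUZZ_FACTOR 1.01

/* cost1 is clearly worse than cost2 only if it exceeds it by more than 1% */
static bool
fuzzily_greater(double cost1, double cost2)
{
    return cost1 > cost2 * FUZZ_FACTOR;
}

/* the costs are "fuzzily the same" when neither is fuzzily greater */
static bool
fuzzily_same(double cost1, double cost2)
{
    return !fuzzily_greater(cost1, cost2) && !fuzzily_greater(cost2, cost1);
}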

Now, what this demonstrates is that add_path_precheck is not operating as
intended --- as I said earlier, it's supposed to just save path
construction cycles, not change the outcome of any path replacement tests.
What we ought to do is fix it to do the cost comparisons fuzzily, and then
it should be possible for it to disregard startup cost when appropriate
without changing the results beyond that.  However making the comparisons
fuzzy will certainly result in some plan changes at the margin, because it
will now let through some paths that it should have let through and
didn't.  Some of those changes will be for the better and others for the
worse.  (I don't have any faith in your conclusion that the patch
as-posted always results in better plans.  There's no reason to believe
that sub-one-percent differences in planner cost estimates will reliably
predict real-world wins.)

What it seems like we should do, if we want to back-patch this, is apply
it without the add_path_precheck changes.  Then as an independent
HEAD-only patch, change add_path_precheck so that it's behaving as
designed.  It looks to me like that will save some planning time in any
case --- changing add_path_precheck to disregard startup cost when
appropriate seems to let it reject a lot more paths than it used to.

regards, tom lane




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Thomas Munro
On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera  wrote:
> My guess is that the file existed, and perhaps had one or more pages,
> but the wanted page doesn't exist, so we tried to read but got 0 bytes
> back.  read() returns 0 in this case but doesn't set errno.
>
> I didn't find a way to set things so that the file exists but is of
> shorter contents than oldestMulti by the time the checkpoint record is
> replayed.

I'm just starting to learn about the recovery machinery, so forgive me
if I'm missing something basic here, but I just don't get this.  As I
understand it, offsets/0046 should either have been copied with that
page present in it if it existed before the backup started (apparently
not in this case), or extended to contain it by WAL records that come
after the backup label but before the checkpoint record that
references it (also apparently not in this case).  If neither of these
things happened then that is completely different from the
segment-does-not-exist case where we read zeroes if in recovery on the
assumption that later WAL records must be about to delete the file.
There is no way that future WAL records will make an existing segment
file shorter! So at this point don't we know that there is something
wrong with the backup itself?

Put another way, if you bring this up under 9.4.1, won't it also be
unable to access multixact 4624559 at this point?  Of course it won't
try to do so during recovery like 9.4.2 does, but I'm just trying to
understand how this is supposed to work for 9.4.1 if it needs to
access that multixact for other reasons once normal running is reached
(say you recover up to that checkpoint, and then run
pg_get_multixact_members, or a row has that xmax and its members to be
looked up by a vacuum or any normal transaction).  In other words,
isn't this a base backup that is somehow broken, not at all like the
pg_upgrade corruption case which involved the specific case of
multixact 1 and an entirely missing segment file, and 9.4.2 just tells
you about it sooner than 9.4.1?

For what it's worth, I've also spent a lot of time trying to reproduce
basebackup problems with multixact creation, vacuums and checkpoints
injected at various points between copying backup label, pg_multixact,
and pg_control.  I have so far failed to produce anything more
interesting than the 'reading zeroes' case (see attached
"copy-after-truncation.sh") and a case where the control file points at
a segment that doesn't exist, but it doesn't matter because the backup
label points at a checkpoint from a time when it did and
oldestMultiXactId is updated from there, and then proceeds exactly as
it should (see "copy-before-truncation.sh").  I updated my scripts to
look a bit more like your nicely automated example (though mine use a
different trick to create small quantities of multixacts so they run
against unpatched master).  I have also been considering a scenario
where multixact ID wraparound occurs during basebackup with some
ordering that causes trouble, but I don't yet see why it would break
if you replay the WAL from the backup label checkpoint (and I think
the repro would take days/weeks to run...)

-- 
Thomas Munro
http://www.enterprisedb.com


copy-after-truncation.sh
Description: Bourne shell script


copy-before-truncation.sh
Description: Bourne shell script



Re: [HACKERS] checkpointer continuous flushing

2015-06-02 Thread Amit Langote

Hi,

On 2015-06-02 PM 07:19, Fabien COELHO wrote:
> 
>> Not that the GUC naming is the most pressing issue here, but do you think
>> "*_flush_on_write" describes what the patch does?
> 
> It is currently "*_flush_to_disk". In Andres Freund's version the name is
> "sync_on_checkpoint_flush", but I did not find it very clear. Using
> "*_flush_on_write" instead, as you suggest, would be fine as well; it
> emphasizes the "when/how" it occurs instead of the final "destination", why
> not...
> 
> About words: checkpoint "write"s pages, but this really means passing the pages
> to the memory manager, which will think about it... "flush" seems to suggest a
> more effective write, but really it may mean the same, the page is just passed
> to the OS. So "write/flush" is really "to OS" and not "to disk". I like the
> data to be on "disk" in the end, and as soon as possible, hence the choice to
> emphasize that point.
> 
> Now I would really be okay with anything that people find simple to
> understand, so any opinion is welcome!
> 

It seems 'sync' gets closer to what I really wanted 'flush' to mean. If I
understand this and the previous discussion(s) correctly, the patch tries to
alleviate the problems caused by one-big-sync-at-the-end-of-writes by doing
the sync in step with writes (which do abide by the
checkpoint_completion_target). Given that impression, it seems *_sync_on_write
may even do the job.


Again, this is a minor issue.

By the way, I tend to agree with others here that a good balance needs to be
found so that this sync-blocks-one-at-a-time-in-random-order approach does not
hurt generalized workloads too much, although it does seem to help with the
latency problem that you set out to solve.

Thanks,
Amit





Re: [HACKERS] why does txid_current() assign new transaction-id?

2015-06-02 Thread Naoya Anzai
> > diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
> > index 89a609f..6485b42 100644
> > --- a/doc/src/sgml/func.sgml
> > +++ b/doc/src/sgml/func.sgml
> > @@ -16233,7 +16233,7 @@ SELECT collation for ('foo' COLLATE "de_DE");
> >
> > 
> > txid_current()
> > bigint
> > -   get current transaction ID
> > +   get current transaction ID, assigning a new one if current 
> > transaction does not have one
> ^ the

That looks good.

Regards,

Naoya

---
Naoya Anzai
Engineering Department
NEC Solution Innovators, Ltd.
E-Mail: nao-an...@xc.jp.nec.com
---




Re: [HACKERS] Minor issue with BRIN regression tests

2015-06-02 Thread Peter Geoghegan
On Mon, Jun 1, 2015 at 7:35 PM, Peter Geoghegan  wrote:
> Attached patch adjusts BRIN regression tests to make a non-obvious
> dependency on tuple order explicit. Currently, an index-only scan plan
> is used by the query that I've adjusted. I'd rather be sure that that
> continues.

Here is another patch, this time removing a useless ON CONFLICT DO UPDATE test.

-- 
Peter Geoghegan
diff --git a/src/test/regress/expected/insert_conflict.out b/src/test/regress/expected/insert_conflict.out
index eca9690..325e88b 100644
--- a/src/test/regress/expected/insert_conflict.out
+++ b/src/test/regress/expected/insert_conflict.out
@@ -245,7 +245,6 @@ ERROR:  there is no unique or exclusion constraint matching the ON CONFLICT spec
 insert into insertconflicttest values (6, 'Passionfruit') on conflict (lower(fruit)) do update set fruit = excluded.fruit;
 ERROR:  there is no unique or exclusion constraint matching the ON CONFLICT specification
 -- Check the target relation can be aliased
-insert into insertconflicttest values (6, 'Passionfruits') on conflict (key) do update set fruit = excluded.fruit;
 insert into insertconflicttest AS ict values (6, 'Passionfruit') on conflict (key) do update set fruit = excluded.fruit; -- ok, no reference to target table
 insert into insertconflicttest AS ict values (6, 'Passionfruit') on conflict (key) do update set fruit = ict.fruit; -- ok, alias
 insert into insertconflicttest AS ict values (6, 'Passionfruit') on conflict (key) do update set fruit = insertconflicttest.fruit; -- error, references aliased away name
diff --git a/src/test/regress/sql/insert_conflict.sql b/src/test/regress/sql/insert_conflict.sql
index a0bdd7f..7dd5032 100644
--- a/src/test/regress/sql/insert_conflict.sql
+++ b/src/test/regress/sql/insert_conflict.sql
@@ -102,7 +102,6 @@ insert into insertconflicttest values (5, 'Lemon') on conflict (fruit) do update
 insert into insertconflicttest values (6, 'Passionfruit') on conflict (lower(fruit)) do update set fruit = excluded.fruit;
 
 -- Check the target relation can be aliased
-insert into insertconflicttest values (6, 'Passionfruits') on conflict (key) do update set fruit = excluded.fruit;
 insert into insertconflicttest AS ict values (6, 'Passionfruit') on conflict (key) do update set fruit = excluded.fruit; -- ok, no reference to target table
 insert into insertconflicttest AS ict values (6, 'Passionfruit') on conflict (key) do update set fruit = ict.fruit; -- ok, alias
 insert into insertconflicttest AS ict values (6, 'Passionfruit') on conflict (key) do update set fruit = insertconflicttest.fruit; -- error, references aliased away name



Re: [HACKERS] pg_xlog -> pg_xjournal?

2015-06-02 Thread David Steele
On 5/31/15 1:46 PM, Tom Lane wrote:
> Hm.  I think the impact on third-party backup tools would be rather bad,
> but there's a simple modification of the idea that might fix that:
> just always create pg_xlog as a symlink to pg_xjournal during initdb.
> Anybody who blindly removes pg_xlog won't have done anything
> irreversible.  We could deprecate pg_xlog and stop creating the symlink
> after a few releases, once third-party tools have had a reasonable
> amount of time to adjust.

As the author of a third-party backup tool I'd prefer to make a clean
break and just rename the directories in a single release.  9.5 has
similar backup/restore related changes with no nod to backwards
compatibility.

And that's fine.  Applications can iterate faster than databases and
they should.

Two options to make lives easier:

1) An initdb option to create the necessary symlinks as Tom suggests,
just not by default.
2) Instructions in the release notes for users who did not see the
initdb option in the first place.
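
Option 1 would amount to initdb leaving behind a layout along these lines
(purely illustrative; no such switch exists today):

  # inside a freshly initialized $PGDATA
  ls -ld pg_xjournal pg_xlog
  # drwx------ ... pg_xjournal
  # lrwxrwxrwx ... pg_xlog -> pg_xjournal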

-- 
- David Steele
da...@pgmasters.net





Re: [HACKERS] nested loop semijoin estimates

2015-06-02 Thread Tom Lane
I wrote:
> Hm.  In principle, add_path_precheck shouldn't be doing anything except
> rejecting paths that would get rejected anyway later on.  However it
> strikes me that there's a logic error in it; specifically, this
> simplifying assumption is wrong:
>  * For speed, we make exact rather than fuzzy cost comparisons. If an
>  * old path dominates the new path exactly on both costs, it will
>  * surely do so fuzzily.

After staring at this awhile longer, I still don't see exactly why it's
changing the results.  Do we have instructions around here anyplace on
how to set up/use TPC-DS?  I couldn't find anything about it on the
wiki ...

regards, tom lane




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
> > Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning
> > the disk it'll always get one at a segment boundary, right? I'm not sure
> > that's actually ok; because the value at the beginning of the segment
> > can very well end up being a 0, as MaybeExtendOffsetSlru() will have
> > filled the page with zeros.
> >
> > I think that should be harmless, the worst that can happen is that
> > oldestOffset erroneously is 0, which should be correct, even though
> > possibly overly conservative, in these cases.
> 
> Uh oh.  That seems like a real bad problem for this approach.  What
> keeps that from being the opposite of too conservative?  There's no
> "safe" value in a circular numbering space.

I think it *might* (I'm really jetlagged) be fine because that'll only
happen after an upgrade from < 9.3. And in that case we initialize
nextOffset to 0. That ought to save us?

Greetings,

Andres Freund




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 4:19 PM, Andres Freund  wrote:
> I'm not really convinced tying things closer to having done trimming is
> easier to understand than tying things to recovery having finished.
>
> E.g.
> if (did_trim)
> oldestOffset = GetOldestReferencedOffset(oldest_datminmxid);
> imo is harder to understand than if (!InRecovery).
>
> Maybe we could just name it finishedStartup and rename the functions 
> accordingly?

Basing that particular call site on InRecovery doesn't work, because
InRecovery isn't set early enough.  But I'm fine to rename it to
whatever.

> Maybe it's worthwhile to add a 'NB: At this stage the data directory is
> not yet necessarily consistent' note to StartupMultiXact's comments, to avoid
> reintroducing problems like this?

Sure.

>>   /*
>> +  * We can read this without a lock, because it only changes when 
>> nothing
>> +  * else is running.
>> +  */
>> + did_trim = MultiXactState->didTrimMultiXact;
>
> Err, Hot Standby? It might be ok to not lock, but the comment is
> definitely wrong. I'm inclined to simply use locking, this doesn't look
> sufficiently critical performancewise.

/me nods.  Good point.

> Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning
> the disk it'll always get one at a segment boundary, right? I'm not sure
> that's actually ok; because the value at the beginning of the segment
> can very well end up being a 0, as MaybeExtendOffsetSlru() will have
> filled the page with zeros.
>
> I think that should be harmless, the worst that can happen is that
> oldestOffset erroneously is 0, which should be correct, even though
> possibly overly conservative, in these cases.

Uh oh.  That seems like a real bad problem for this approach.  What
keeps that from being the opposite of too conservative?  There's no
"safe" value in a circular numbering space.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-01 14:22:32 -0400, Robert Haas wrote:

> commit d33b4eb0167f465edb00bd6c0e1bcaa67dd69fe9
> Author: Robert Haas 
> Date:   Fri May 29 14:35:53 2015 -0400
> 
> foo

Hehe!

> diff --git a/src/backend/access/transam/multixact.c 
> b/src/backend/access/transam/multixact.c
> index 9568ff1..aca829d 100644
> --- a/src/backend/access/transam/multixact.c
> +++ b/src/backend/access/transam/multixact.c
> @@ -199,8 +199,9 @@ typedef struct MultiXactStateData
>   MultiXactOffset nextOffset;
>  
>   /*
> -  * Oldest multixact that is still on disk.  Anything older than this
> -  * should not be consulted.  These values are updated by vacuum.
> +  * Oldest multixact that may still be referenced from a relation.
> +  * Anything older than this should not be consulted.  These values are
> +  * updated by vacuum.
>*/
>   MultiXactId oldestMultiXactId;
>   Oid oldestMultiXactDB;
> @@ -213,6 +214,18 @@ typedef struct MultiXactStateData
>*/
>   MultiXactId lastCheckpointedOldest;
>  
> + /*
> +  * This is the oldest file that actually exist on the disk.  This value
> +  * is initialized by scanning pg_multixact/offsets, and subsequently
> +  * updated each time we complete a truncation.  We need a flag to
> +  * indicate whether this has been initialized yet.
> +  */
> + MultiXactId oldestMultiXactOnDisk;
> + bool        oldestMultiXactOnDiskValid;
> +
> + /* Has TrimMultiXact been called yet? */
> + bool        didTrimMultiXact;

I'm not really convinced tying things closer to having done trimming is
easier to understand than tying things to recovery having finished.

E.g.
if (did_trim)
oldestOffset = GetOldestReferencedOffset(oldest_datminmxid);
imo is harder to understand than if (!InRecovery).

Maybe we could just name it finishedStartup and rename the functions 
accordingly?

> @@ -1956,12 +1971,6 @@ StartupMultiXact(void)
>*/
>   pageno = MXOffsetToMemberPage(offset);
>   MultiXactMemberCtl->shared->latest_page_number = pageno;
> -
> - /*
> -  * compute the oldest member we need to keep around to avoid old member
> -  * data overrun.
> -  */
> - DetermineSafeOldestOffset(MultiXactState->oldestMultiXactId);
>  }

Maybe it's worthwhile to add a 'NB: At this stage the data directory is
not yet necessarily consistent' note to StartupMultiXact's comments, to avoid
reintroducing problems like this?

>   /*
> +  * We can read this without a lock, because it only changes when nothing
> +  * else is running.
> +  */
> + did_trim = MultiXactState->didTrimMultiXact;

Err, Hot Standby? It might be ok to not lock, but the comment is
definitely wrong. I'm inclined to simply use locking, this doesn't look
sufficiently critical performancewise.

> +static MultiXactOffset
> +GetOldestReferencedOffset(MultiXactId oldestMXact)
> +{
> + MultiXactId earliest;
> + MultiXactOffset oldestOffset;
> +
> + /*
> +  * Because of bugs in early 9.3.X and 9.4.X releases (see comments in
> +  * TrimMultiXact for details), oldest_datminmxid might point to a
> +  * nonexistent multixact.  If so, use the oldest one that actually 
> +  * exists.  Anything before this can't be successfully used anyway.
> +  */
> + earliest = GetOldestMultiXactOnDisk();
> + if (MultiXactIdPrecedes(oldestMXact, earliest))
> + oldestMXact = earliest;

Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning
the disk it'll always get one at a segment boundary, right? I'm not sure
that's actually ok; because the value at the beginning of the segment
can very well end up being a 0, as MaybeExtendOffsetSlru() will have
filled the page with zeros.

I think that should be harmless, the worst that can happen is that
oldestOffset erroneously is 0, which should be correct, even though
possibly overly conservative, in these cases.

Greetings,

Andres Freund




Re: [HACKERS] nested loop semijoin estimates

2015-06-02 Thread Tom Lane
Tomas Vondra  writes:
> So, if required_outer=false and both p->_startup=false, we get 
> consider_startup=false irrespective of the required_outer value, so
>  (!consider_startup) != required_outer
> so that the conditions are not equivalent. And indeed, by reverting the 
> if condition to the previous form, we get the same plans as on master.

Ah, I see we arrived at the same conclusions independently.

> I don't know whether this is correct behavior or not, but in all three 
> cases I've observed on TPC-DS, the new plans performed better

Hm.  In principle, add_path_precheck shouldn't be doing anything except
rejecting paths that would get rejected anyway later on.  However it
strikes me that there's a logic error in it; specifically, this
simplifying assumption is wrong:

 * For speed, we make exact rather than fuzzy cost comparisons. If an
 * old path dominates the new path exactly on both costs, it will
 * surely do so fuzzily.

The old path's cost might be fuzzily the same, not fuzzily better.  This
is less likely to be an issue for total cost (since even if they're
fuzzily the same at this point, they probably won't be after we get done
adding more components to the new path's cost estimate).  But it could
well be an issue for startup cost since that's often exactly zero.  So
I guess this must be rejecting or letting through a slightly different set
of non-parameterized paths than it did before.  I'm not prepared to assume
the results are always superior, because we haven't fixed the actual logic
oversight.

> 1) Shouldn't the CONSIDER_PATH_STARTUP_COST(p) be defined rather like
> this? I mean, if it's a parametric query, use any of the flags?

No.  The paths are parametric, not the query as a whole.

> 2) Is it really possible to get different values for the startup
> flags on path1 and path2 in compare_path_costs_fuzzily()?

If we were comparing a parameterized and an unparameterized path,
it might be that one's startup cost is of interest while the other
one's isn't.  This is written to let a path win on startup cost
as long as its startup cost is actually of interest.

> If not
> (if I get it right, path1->parent == path2->parent anyway), then
> maybe passing that as a function parameter (just like before)
> would be better. The macro probably won't be as nice, but I find
> it easier to understand.

Well, we could pass the parent RelOptInfo explicitly instead of passing
the two flags, but I don't find that to be an improvement particularly.

regards, tom lane




Re: [HACKERS] nested loop semijoin estimates

2015-06-02 Thread Tomas Vondra



On 06/02/15 21:39, Tom Lane wrote:

I wrote:

I'm a bit disturbed by that, because AFAICS from the plans, these queries
did not involve any semi or anti joins, which should mean that the patch
would not have changed the planner's behavior.


After contemplating my navel for awhile, it occurred to me that maybe the
patch's changes in add_path_precheck could have caused behavioral changes
at the margins.  Could you try a run without that (without the last two
hunks in pathnode.c)?


Yes, I think that's the cause. See the end of the message I sent just 
before you posted this.



--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] nested loop semijoin estimates

2015-06-02 Thread Tom Lane
I wrote:
> I'm a bit disturbed by that, because AFAICS from the plans, these queries
> did not involve any semi or anti joins, which should mean that the patch
> would not have changed the planner's behavior.

After contemplating my navel for awhile, it occurred to me that maybe the
patch's changes in add_path_precheck could have caused behavioral changes
at the margins.  Could you try a run without that (without the last two
hunks in pathnode.c)?

regards, tom lane




Re: [HACKERS] checkpointer continuous flushing

2015-06-02 Thread Andres Freund
On 2015-06-02 18:59:05 +0200, Fabien COELHO wrote:
> 
> >>>IMO this feature, if done correctly, should result in better performance
> >>>in 95+% of the workloads
> >>
> >>To demonstrate that would require time...
> >
> >Well, that's part of the contribution process. Obviously you can't test
> >100% of the problems, but you can work hard at coming up with very
> >adversarial scenarios and evaluate performance for those.
> 
> I did spend time (well, a machine spent time, really) to collect some
> convincing data for the simple version without sorting to demonstrate that
> it brings a clear value, which seems not to be enough...

"which seems not to be enough" - man. It's trivial to make things
faster/better/whatever if you don't care about regressions in other
parts. And if we'd add a guc for each of these cases we'd end up with
thousands of them.

> My opinion is that throughput is given too much attention in general, but if
> both can be kept/improved, this would be easier to sell, obviously.

Your priorities are not everyone's. That's life.


> >That might be the case in a database with a single small table;
> >i.e. where all the writes go to a single file. But as soon as you have
> >large tables (i.e. many segments) or multiple tables, a significant part
> >of the writes issued independently from checkpointing will be outside
> >the processing of the individual segment.
> 
> Statistically, I think that it would reduce the number of unrelated writes
> taken in a fsync by about half: the last table to be written on a
> tablespace, at the end of the checkpoint, will have accumulated
> checkpoint-unrelated writes (bgwriter, whatever) from the whole checkpoint
> time, while the first table will have avoided most of them.

That's disregarding that a buffer written out by a backend starts to get
written out by the kernel after ~5-30s, even without a fsync triggering
it.
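
(The 5-30s window corresponds to the kernel's background writeback settings;
on Linux the usual defaults are:)

  $ sysctl vm.dirty_writeback_centisecs vm.dirty_expire_centisecs
  vm.dirty_writeback_centisecs = 500     # flusher threads wake every ~5 seconds
  vm.dirty_expire_centisecs = 3000       # dirty pages older than ~30 seconds get written back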




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-02 11:49:56 -0400, Robert Haas wrote:
> On Tue, Jun 2, 2015 at 11:44 AM, Andres Freund  wrote:
> > On 2015-06-02 11:37:02 -0400, Robert Haas wrote:
> >> The exact circumstances under which we're willing to replace a
> >> relminmxid with a newly-computed one that differs are not altogether
> >> clear to me, but there's an "if" statement protecting that logic, so
> >> there are some circumstances in which we'll leave the existing value
> >> intact.
> >
> > I guess we'd have to change that then.
> 
> Yeah, but first we'd need to assess why it's like that.  Tom was the
> one who installed the current logic, but I haven't been able to fully
> understand it.

We're talking about:
/* Similarly for relminmxid */
if (MultiXactIdIsValid(minmulti) &&
pgcform->relminmxid != minmulti &&
(MultiXactIdPrecedes(pgcform->relminmxid, minmulti) ||
 MultiXactIdPrecedes(ReadNextMultiXactId(), 
pgcform->relminmxid)))
{
pgcform->relminmxid = minmulti;
dirty = true;
}

right? Before that change (78db307bb/87f830e0ce) we only updated
relminmxid if the new value was newer than the old one. That's to avoid
values from going backwards, e.g. when a relation is first VACUUM
FREEZEd and then a normal VACUUM is performed (these values are
unfortunately based on the cutoff values instead of the observed
minima). The new thing is the || MultiXactIdPrecedes(ReadNextMultiXactId(), 
pgcform->relminmxid)
which is there to recognize values from the future. E.g. the 1
erroneously left in place by pg_upgrade.

Makes sense?

> >> It would similarly do so when the oldest MXID reference in the
> >> relation is in fact 1, but that value can't be vacuumed away yet.
> >
> > I'd thought of just forcing consumption of one multixact in that
> > case. Not pretty, but imo acceptable.
> 
> What if multixact 1 still has living members?

I think we should just trigger the logic if 1 is below the multi
freezing horizon - in which case a) the 1 will automatically be
replaced, because the horizon is newer b) it can't have a living member
anyway.

- Andres




Re: [HACKERS] pg_xlog -> pg_xjournal?

2015-06-02 Thread Christian Ullrich
* Michael Nolan wrote:

> Why not take a simpler approach and create a zero length file in directories 
> that should not be fiddled with by non-experts using a file name something 
> like "DO.NOT.DELETE.THESE.FILES"? 

Move the critical things into a new subdirectory?
$PGDATA/pg_critical/pg_xlog? Perhaps "pg_internal" instead, or
"pg_indispensable", or "pg_here_be_dragons", or
"pg_delete_this_or_anything_in_here_and_you_will_have_utterly_destroyed_your_database"?
That last one should be clear enough, if a tad long for deeply nested
PGDATA locations.

Symlinking the old path to the new one would be impossible, of course,
if the intent is to protect against "rm -rf *log*/*" fiends.

-- 

Christian





Re: [HACKERS] checkpointer continuous flushing

2015-06-02 Thread Fabien COELHO



IMO this feature, if done correctly, should result in better performance
in 95+% of the workloads


To demonstrate that would require time...


Well, that's part of the contribution process. Obviously you can't test
100% of the problems, but you can work hard at coming up with very
adversarial scenarios and evaluate performance for those.


I did spend time (well, a machine spent time, really) to collect some 
convincing data for the simple version without sorting to demonstrate that 
it brings a clear value, which seems not to be enough...



I don't think we want yet another tuning knob that's hard to tune
because it's critical for one factor (latency) but bad for another
(throughput); especially when it's completely unnecessary.


Hmmm.

My opinion is that throughput is given too much attention in general, but 
if both can be kept/improved, this would be easier to sell, obviously.




It's also not just the sequential writes making this important, it's also
that it allows to do the final fsync() of the individual segments as soon
as their last buffer has been written out.


Hmmm... I'm not sure this would have a large impact. The writes are
throttled as much as possible, so fsync will catch plenty of other writes
anyway, if there are some.


That might be the case in a database with a single small table;
i.e. where all the writes go to a single file. But as soon as you have
large tables (i.e. many segments) or multiple tables, a significant part
of the writes issued independently from checkpointing will be outside
the processing of the individual segment.


Statistically, I think that it would reduce the number of unrelated writes 
taken in a fsync by about half: the last table to be written on a 
tablespace, at the end of the checkpoint, will have accumulated 
checkpoint-unrelated writes (bgwriter, whatever) from the whole checkpoint 
time, while the first table will have avoided most of them.


--
Fabien.




Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file

2015-06-02 Thread Andrew Dunstan


On 05/15/2015 02:21 AM, Amit Kapila wrote:
On Thu, May 14, 2015 at 10:29 PM, Andrew Dunstan  wrote:

>
>
> On 05/14/2015 10:52 AM, Robert Haas wrote:
>>
>>> On Thu, May 14, 2015 at 12:12 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:

>>>
>>> On Thu, May 14, 2015 at 2:10 AM, Andrew Dunstan <and...@dunslane.net> wrote:


 How about if we simply abort if we find a non-symlink where we want the
 symlink to be, and only remove something that is actually a symlink (or a
 junction point, which is more or less the same thing)?
>>>
>>> We can do that way and for that I think we need to use rmdir
>>> instead of rmtree in the code being discussed (recovery path),
>>> OTOH we should try to minimize the errors raised during
>>> recovery.
>>
>> I'm not sure I understand this issue in detail, but why would using
>> rmtree() on something you expect to be a symlink ever be a good idea?
>> It seems like if things are the way you expect them to be, it has no
>> benefit, but if they are different from what you expect, you might
>> blow away a ton of important data.
>>
>> Maybe I am just confused.
>>
>
>
> The suggestion is to get rid of using rmtree. Instead, if we find a 
non-symlink in pg_tblspc we'll make the user clean it up before we can 
continue. So your instinct is in tune with my suggestion.

>

Find the patch which gets rid of rmtree usage.  I have made it as
a separate function because the same code is used from
create_tablespace_directories() as well.  I thought of extending the
same API for using it from destroy_tablespace_directories() as well,
but due to special handling (especially for ENOENT) in that function,
I left it as of now.







Well, it seems to me the new function is being altogether way too 
trusting about the nature of what it's being asked to remove. In the 
first place, the S_ISDIR/rmdir branch should only be for Windows, and 
secondly in the other branch we should be checking that S_ISLNK is true. 
It would actually be nice if we could test for a junction point on 
Windows, but that seems to be a bit difficult. The function should 
probably return a boolean to indicate any error, rather than void, so 
that the caller can take action accordingly. In the present case we 
should probably be stopping recovery dead in its tracks, but I certainly 
don't think we should just be ignoring it.
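
To make the shape concrete, a stricter version along those lines might look 
roughly like this (only a sketch, not the patch; the name is illustrative):

/*
 * Sketch only: remove linkloc only if it really is a symlink (or, on Windows,
 * a junction point, which stat-wise looks like a directory and is removed
 * with rmdir).  Return false on anything unexpected so the caller can decide
 * to stop instead of silently ignoring it.
 */
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

static bool
remove_tablespace_symlink(const char *linkloc)
{
    struct stat st;

    if (lstat(linkloc, &st) < 0)
        return false;

#ifdef WIN32
    if (S_ISDIR(st.st_mode))        /* junction points appear as directories */
        return rmdir(linkloc) == 0;
#else
    if (S_ISLNK(st.st_mode))        /* elsewhere, insist on a real symlink */
        return unlink(linkloc) == 0;
#endif

    return false;                   /* plain file or directory: refuse */
}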


cheers

andrew







Re: [HACKERS] RFC: Remove contrib entirely

2015-06-02 Thread David Fetter
On Mon, Jun 01, 2015 at 05:48:16PM -0300, Alvaro Herrera wrote:
> Peter Eisentraut wrote:
> 
> > But it is a valid point that we'd need to build up more extension
> > API tests before we start cutting back significantly on the
> > maybe-example-maybe-real extensions that we ship in contrib.  We
> > need to find a good middle ground.
> 
> Nowadays we can test concurrent behavior with isolationtester, and
> we can test APIs with src/test/modules.  We can now easily add tests
> for lots of stuff that we don't currently test, without having
> modules that actually produce for-install programs or libraries.
> 
> In any case, +1 on moving useful extensions to src/extensions/.  I
> just read somewhere (not this thread?) that Debian for their 9.5
> packaging is going to get rid of the postgresql-contrib package, and
> instead they are going to ship all that stuff in the main package.
> Seems they were prompted to do this because of our inaction in this
> area ...

I suspect they may also have realized that anything not shipped in the
main distribution can cause enough bureaucratic pain, as in "write a
justification for the legal department for why we should install this
software," to cause things simply never to get installed.

Cheers,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [HACKERS] Missing "-i / --ignore-version" in pg_dump help

2015-06-02 Thread Fujii Masao
On Fri, May 22, 2015 at 11:01 PM, Tom Lane  wrote:
> Fujii Masao  writes:
>> On Fri, May 22, 2015 at 8:59 PM, Fabrízio de Royes Mello
>>  wrote:
>>> There are some reason to "-i, --ignore-version" option doesn't appear in
>>> pg_dump help?
>
>> Because it's obsolete option, I think.
>
> Yeah, see commit c22ed3d523782c43836c163c16fa5a7bb3912826.
>
> Considering we shipped that in 8.4, maybe it's time to remove all
> trace of those switches.  We'd still have to wait a couple releases
> before it'd be safe to use -i for something else, but it'd be a start.

+1

Barring any objection, I will apply the attached patch and
remove that obsolete option.

Regards,

-- 
Fujii Masao
*** a/doc/src/sgml/ref/pg_dump.sgml
--- b/doc/src/sgml/ref/pg_dump.sgml
***
*** 280,295  PostgreSQL documentation
   
  
   
-   -i
-   --ignore-version
-   
-
- A deprecated option that is now ignored.
-
-   
-  
- 
-  
-j njobs
--jobs=njobs

--- 280,285 
*** a/doc/src/sgml/ref/pg_dumpall.sgml
--- b/doc/src/sgml/ref/pg_dumpall.sgml
***
*** 121,136  PostgreSQL documentation
   
  
   
-   -i
-   --ignore-version
-   
-
- A deprecated option that is now ignored.
-
-   
-  
- 
-  
-o
--oids

--- 121,126 
*** a/doc/src/sgml/ref/pg_restore.sgml
--- b/doc/src/sgml/ref/pg_restore.sgml
***
*** 217,232 
   
  
   
-   -i
-   --ignore-version
-   
-
- A deprecated option that is now ignored.
-
-   
-  
- 
-  
-I index
--index=index

--- 217,222 
*** a/src/bin/pg_dump/pg_dump.c
--- b/src/bin/pg_dump/pg_dump.c
***
*** 291,297  main(int argc, char **argv)
  		{"file", required_argument, NULL, 'f'},
  		{"format", required_argument, NULL, 'F'},
  		{"host", required_argument, NULL, 'h'},
- 		{"ignore-version", no_argument, NULL, 'i'},
  		{"jobs", 1, NULL, 'j'},
  		{"no-reconnect", no_argument, NULL, 'R'},
  		{"oids", no_argument, NULL, 'o'},
--- 291,296 
***
*** 377,383  main(int argc, char **argv)
  
  	InitDumpOptions(&dopt);
  
! 	while ((c = getopt_long(argc, argv, "abcCd:E:f:F:h:ij:n:N:oOp:RsS:t:T:U:vwWxZ:",
  			long_options, &optindex)) != -1)
  	{
  		switch (c)
--- 376,382 
  
  	InitDumpOptions(&dopt);
  
! 	while ((c = getopt_long(argc, argv, "abcCd:E:f:F:h:j:n:N:oOp:RsS:t:T:U:vwWxZ:",
  			long_options, &optindex)) != -1)
  	{
  		switch (c)
***
*** 418,427  main(int argc, char **argv)
  dopt.pghost = pg_strdup(optarg);
  break;
  
- 			case 'i':
- /* ignored, deprecated option */
- break;
- 
  			case 'j':			/* number of dump jobs */
  numWorkers = atoi(optarg);
  break;
--- 417,422 
*** a/src/bin/pg_dump/pg_dumpall.c
--- b/src/bin/pg_dump/pg_dumpall.c
***
*** 95,101  main(int argc, char *argv[])
  		{"file", required_argument, NULL, 'f'},
  		{"globals-only", no_argument, NULL, 'g'},
  		{"host", required_argument, NULL, 'h'},
- 		{"ignore-version", no_argument, NULL, 'i'},
  		{"dbname", required_argument, NULL, 'd'},
  		{"database", required_argument, NULL, 'l'},
  		{"oids", no_argument, NULL, 'o'},
--- 95,100 
***
*** 195,201  main(int argc, char *argv[])
  
  	pgdumpopts = createPQExpBuffer();
  
! 	while ((c = getopt_long(argc, argv, "acd:f:gh:il:oOp:rsS:tU:vwWx", long_options, &optindex)) != -1)
  	{
  		switch (c)
  		{
--- 194,200 
  
  	pgdumpopts = createPQExpBuffer();
  
! 	while ((c = getopt_long(argc, argv, "acd:f:gh:l:oOp:rsS:tU:vwWx", long_options, &optindex)) != -1)
  	{
  		switch (c)
  		{
***
*** 226,235  main(int argc, char *argv[])
  pghost = pg_strdup(optarg);
  break;
  
- 			case 'i':
- /* ignored, deprecated option */
- break;
- 
  			case 'l':
  pgdb = pg_strdup(optarg);
  break;
--- 225,230 
*** a/src/bin/pg_dump/pg_restore.c
--- b/src/bin/pg_dump/pg_restore.c
***
*** 88,94  main(int argc, char **argv)
  		{"format", 1, NULL, 'F'},
  		{"function", 1, NULL, 'P'},
  		{"host", 1, NULL, 'h'},
- 		{"ignore-version", 0, NULL, 'i'},
  		{"index", 1, NULL, 'I'},
  		{"jobs", 1, NULL, 'j'},
  		{"list", 0, NULL, 'l'},
--- 88,93 
***
*** 147,153  main(int argc, char **argv)
  		}
  	}
  
! 	while ((c = getopt_long(argc, argv, "acCd:ef:F:h:iI:j:lL:n:Op:P:RsS:t:T:U:vwWx1",
  			cmdopts, NULL)) != -1)
  	{
  		switch (c)
--- 146,152 
  		}
  	}
  
! 	while ((c = getopt_long(argc, argv, "acCd:ef:F:h:I:j:lL:n:Op:P:RsS:t:T:U:vwWx1",
  			cmdopts, NULL)) != -1)
  	{
  		switch (c)
***
*** 178,186  main(int argc, char **argv)
  if (strlen(optarg) != 0)
  	opts->pghost = pg_strdup(optarg);
  break;
- 			case 'i':
- /* ignored, deprecated option */
- break;
  
  			case 'j':

Re: [HACKERS] [PATCH] Add error handling to byteaout.

2015-06-02 Thread Andreas Seltenreich
Tom Lane  writes:

> Andreas Seltenreich  writes:
>> The scary one is due to an integer overflow the attached patch also
>> fixes.
>
> s/int/Size/ doesn't fix anything on 32-bit machines.

Well, it changes the signedness of the computation on 32-bit; combined
with the fact that "len" is always smaller than 2^32 but may exceed
2^31-1, the change avoids depending on the undefined behavior of signed
integer overflow in C on 32-bit as well.  But I admit that this might be
a rather academic point...
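(For illustration, a minimal standalone sketch, not the PostgreSQL source;
the helper name and the 4-bytes-per-input-byte worst case are assumptions.
Doing the length arithmetic in a signed int can overflow for inputs
approaching 1GB, while doing it in an unsigned size_t and checking the
result against the allocation limit reports a clean "too big" error
instead.)

#include <stddef.h>
#include <stdbool.h>

/* Illustrative stand-in for PostgreSQL's ~1GB MaxAllocSize. */
#define MAX_ALLOC ((size_t) 0x3fffffff)

/*
 * Compute the worst-case escaped length (assumed here to be at most 4
 * output bytes per input byte, plus a terminating NUL) as an unsigned
 * size_t, so values above INT_MAX stay well defined, and reject anything
 * the allocator could not satisfy instead of letting the length wrap.
 */
static bool
escaped_len_ok(size_t nbytes, size_t *out_len)
{
    size_t len = nbytes * 4 + 1;

    if (len > MAX_ALLOC)
        return false;
    *out_len = len;
    return true;
}

(The patch posted later in this digest does the equivalent check with
AllocSizeIsValid() before palloc().)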

Anyway, my motivation for the patch was the improved error reporting.
Is the drive-by type change a problem here?

regards,
Andreas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Enhanced ALTER OPERATOR

2015-06-02 Thread Robert Haas
On Fri, May 29, 2015 at 4:28 AM, Alexander Korotkov
 wrote:
> On Thu, May 28, 2015 at 6:43 PM, Tom Lane  wrote:
>>
>> Alexander Korotkov  writes:
>> > Could we address both these problems by denying changing existing
>> > commutators and negators? ISTM that setting an absent commutator and negator
>> > is quite enough for ALTER OPERATOR. User extensions could need setting of
>> > commutator and negator because they could add new operators which didn't
>> > exist before. But it's rather uncommon to unset or change a commutator or
>> > negator.
>>
>> Note that this functionality is already covered, in that you can specify
>> the commutator/negator linkage when you create the second operator.
>> I'm not particularly convinced that we need to have it in ALTER OPERATOR.
>
>
> Yeah, I didn't notice that CREATE OPERATOR sets commutator and negator on
> opposite side as well. Then we probably can leave ALTER OPERATOR without
> altering commutator and negator.
>
> BTW, does DROP OPERATOR works correctly?
>
> # create operator == (procedure = int8eq, leftarg = bigint, rightarg =
> bigint);
> CREATE OPERATOR
> # create operator !== (procedure = int8ne, leftarg = bigint, rightarg =
> bigint, negator = ==);
> CREATE OPERATOR
> # select oid, * from pg_operator where oprnamespace = (select oid from
> pg_namespace where nspname = 'public');
>   oid  | oprname | oprnamespace | oprowner | oprkind | oprcanmerge |
> oprcanhash | oprleft | oprright | oprresult | oprcom | oprnegate | oprcode |
> oprrest | oprjoin
> ---+-+--+--+-+-++-+--+---++---+-+-+-
>  46355 | !== | 2200 |   10 | b   | f   | f
> |  20 |   20 |16 |  0 | 46354 | int8ne  | -   |
> -
>  46354 | ==  | 2200 |   10 | b   | f   | f
> |  20 |   20 |16 |  0 | 46355 | int8eq  | -   |
> -
> (2 rows)
> # create table test (id bigint);
> CREATE TABLE
> # explain verbose select * from test where not id == 10::bigint;
>   QUERY PLAN
> ---
>  Seq Scan on public.test  (cost=0.00..38.25 rows=1130 width=8)
>Output: id
>Filter: (test.id !== '10'::bigint)
> (3 rows)
> # drop operator !== (int8, int8);
> DROP OPERATOR
> # select oid, * from pg_operator where oprnamespace = (select oid from
> pg_namespace where nspname = 'public');
>   oid  | oprname | oprnamespace | oprowner | oprkind | oprcanmerge |
> oprcanhash | oprleft | oprright | oprresult | oprcom | oprnegate | oprcode |
> oprrest | oprjoin
> ---+-+--+--+-+-++-+--+---++---+-+-+-
>  46354 | ==  | 2200 |   10 | b   | f   | f
> |  20 |   20 |16 |  0 | 46355 | int8eq  | -   |
> -
> (1 row)
> # explain verbose select * from test where not id == 10::bigint;
> ERROR:  cache lookup failed for function 0
>
> DROP OPERATOR leaves broken reference in oprnegate. Should we fix it?

Yeah, that doesn't seem good.
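(As an aside, for anyone wanting to check an existing catalog for this kind
of damage, here is an illustrative query, not part of any patch in this
thread, that lists operators whose oprcom or oprnegate points at a
pg_operator row that no longer exists:)

SELECT o.oid, o.oprname
FROM pg_operator o
WHERE (o.oprnegate <> 0
       AND NOT EXISTS (SELECT 1 FROM pg_operator n WHERE n.oid = o.oprnegate))
   OR (o.oprcom <> 0
       AND NOT EXISTS (SELECT 1 FROM pg_operator c WHERE c.oid = o.oprcom));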

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Noah Misch
On Tue, Jun 02, 2015 at 11:16:22AM -0400, Robert Haas wrote:
> On Tue, Jun 2, 2015 at 1:21 AM, Noah Misch  wrote:
> > On Mon, Jun 01, 2015 at 02:06:05PM -0400, Robert Haas wrote:

> > Granted.  Would it be better to update both functions at the same time, and
> > perhaps to make that a master-only change?  Does the status quo cause more
> > practical harm via SetMultiXactIdLimit() than via SetTransactionIdLimit()?
> 
> It does in the case of the call to find_multixact_start().  If that
> fails, we take the server down for no good reason, as demonstrated by
> the original report. I'll revert the changes to the other two things
> in this function that I changed to be protected by did_trim.

Sounds good.

> > There's no data loss if 1 <= true_minmxid <= 2^31 at the time minmxid got 
> > set
> > to 1.  Otherwise, data loss is possible.
> 
> Yes, but in that scenario, the log message you propose wouldn't be
> triggered.  If true_minmxid > 2^31, then the stored minmxid will not
> precede the files on disk; it will follow it (assuming the older stuff
> hasn't been truncated, as is likely).

Ah, quite right.

> >   "missing pg_multixact/members files; begins at MultiXactId %u, expected 
> > %u"
> 
> This seems misleading.  In the known failure case, it's not that the
> pg_multixact files are unexpectedly missing; it's that we incorrectly
> think that they should still be there.  Maybe:
> 
> oldest MultiXactId on disk %u follows expected oldest MultiXact %u

Your wording is better.

> > For good measure, perhaps emit this when lastCheckpointedOldest > earliest 
> > by
> > more than one segment:
> >
> >   "excess pg_multixact/members files; begins at MultiXactId %u, expected %u"
> 
> So, this scenario will happen whenever the system was interrupted in
> the middle of a truncation, or when the system is started from a base
> backup and a truncation happened after files were copied.  I'm wary of
> giving users the idea that this is an atypical event.  Perhaps a
> message at DEBUG1?

DEBUG1 works for me, or feel free to leave it out.

> >> I can see that there might be an issue there, but I can't quite put my
> >> finger on it well enough to say that it definitely is an issue.  This
> >> code is examining the offsets space rather than the members space, and
> >> the protections against offsets wraparound have been there since the
> >> original commit of this feature
> >> (0ac5ad5134f2769ccbaefec73844f8504c4d6182).  To my knowledge we have
> >> no concrete evidence that there's ever been a problem in this area.
> >> It seems like it might be safer to rejigger that code so that it
> >> considers distance-behind-current rather than using the wrapped
> >> comparison logic, but I'm reluctant to start rejiggering more things
> >> without knowing what specifically I'm fixing.
> >
> > Anything that could cause the pg_multixact/offsets tail to rotate from being
> > in the past to being in the future poses this risk.  (This is the tail from
> > the perspective of files on disk; pg_control, datminmxid, and MultiXactState
> > notions of the tail play no part here.)  I had in mind that the pg_upgrade
> > datminmxid=1 bug could be a tool for achieving that, but I've been
> > unsuccessful so far at building a credible thought experiment around it.  
> > Near
> > the beginning of your reply, you surmised that this could happen between a
> > VACUUM's SetMultiXactIdLimit() and the next checkpoint's 
> > TruncateMultiXact().
> > Another vector is unlink() failure on a segment file.  SlruDeleteSegment()
> > currently ignores the unlink() return value; the only harm has been some 
> > disk
> > usage.  With GetOldestMultiXactOnDisk() as-proposed, successful unlink() is
> > mandatory to forestall the appearance of a wrapped state.
> 
> I'm having trouble figuring out what to do about this.  I mean, the
> essential principle of this patch is that if we can't count on
> relminmxid, datminmxid, or the control file to be accurate, we can at
> least look at what is present on the disk.  If we also cannot count on
> that to be accurate, we are left without any reliable source of
> information.  Consider a hypothetical cluster where all our stored
> minmxids of whatever form are corrupted (say, all change to 1) and in
> addition there are stray files in pg_multixact.  I don't think there's
> really any way to get ourselves out of trouble in that scenario.

We could notice the problem and halt.  You mentioned above the possibility of
having SlruScanDirCbFindEarliest() check "distance-behind-current".  Suppose it
used the current oldestMulti from pg_control as a reference point and
discovered the multixact on disk (a) farthest behind that reference point and
(b) farthest ahead of that reference point.  If MultiXactIdPrecedes(a, b),
we're good; adopt (a) as the new datminmxid.  Account for the possibility that
the space isn't really wrapped, but our reference point was totally wrong.
Once we determine that the space is wrapped, bail.
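(For illustration, a standalone sketch of the circular comparison being
relied on above, with hypothetical names rather than the actual server code:
distance is taken modulo 2^32, so "a precedes b" is decided by the sign of
the 32-bit difference, the same convention used for transaction IDs.)

#include <stdint.h>
#include <stdbool.h>

typedef uint32_t DemoMultiXactId;   /* hypothetical stand-in */

/* "a precedes b" on the 2^32 circle: decided by the sign of the
 * 32-bit difference. */
static bool
demo_multi_precedes(DemoMultiXactId a, DemoMultiXactId b)
{
    int32_t diff = (int32_t) (a - b);

    return diff < 0;
}

/*
 * If (a) is the multixact found on disk farthest behind the reference
 * point and (b) the one farthest ahead, the on-disk range is usable only
 * when demo_multi_precedes(a, b); otherwise it looks wrapped and the
 * safe reaction is to bail out.
 */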

As you discuss dow

Re: [HACKERS] nested loop semijoin estimates

2015-06-02 Thread Tom Lane
Tomas Vondra  writes:
> On 06/02/15 16:37, Tom Lane wrote:
>> It's possible that the change was due to random variation in ANALYZE
>> statistics, in which case it was just luck.

> I don't think so. I simply loaded the data, ran ANALYZE, and then simply 
> started either master or patched master. There should be no difference 
> in statistics, I believe. Also, the plans contain pretty much the same 
> row counts, but the costs differ.

> For example look at the 'cs_ui' CTE, right at the beginning of the 
> analyze logs. The row counts are exactly the same, but the costs are 
> different. And it's not using semijoins or not nested loops ...

The cost estimates in that CTE all look exactly the same to me, and the
actual runtime's not all that different either.  The key runtime
difference in the first query seems to be in the first couple of joins
to the cs_ui CTE:

->  Nested Loop  (cost=4.63..724.34 rows=16 width=67) (actual 
time=8346.904..14024.947 rows=117 loops=1)
  Join Filter: (item.i_item_sk = store_returns.sr_item_sk)
  ->  Nested Loop  (cost=0.29..662.07 rows=1 width=59) (actual 
time=8324.352..13618.106 rows=8 loops=1)
->  CTE Scan on cs_ui  (cost=0.00..2.16 rows=108 width=4) 
(actual time=7264.670..7424.096 rows=17169 loops=1)
->  Index Scan using item_pkey on item  (cost=0.29..6.10 
rows=1 width=55) (actual time=0.356..0.356 rows=0 loops=17169)
  Index Cond: (i_item_sk = cs_ui.cs_item_sk)
  Filter: ((i_current_price >= '79'::numeric) AND 
(i_current_price <= '89'::numeric) AND (i_current_price >= '80'::numeric) AND 
(i_current_price <= '94'::numeric) AND (i_color = ANY 
('{navajo,burlywood,cornflower,olive,turquoise,linen}'::bpchar[])))
  Rows Removed by Filter: 1
  ->  Bitmap Heap Scan on store_returns  (cost=4.34..62.05 rows=18 
width=8) (actual time=32.525..50.662 rows=15 loops=8)
Recheck Cond: (sr_item_sk = cs_ui.cs_item_sk)
Heap Blocks: exact=117
->  Bitmap Index Scan on idx_sr_item_sk  (cost=0.00..4.34 
rows=18 width=0) (actual time=19.747..19.747 rows=15 loops=8)
  Index Cond: (sr_item_sk = cs_ui.cs_item_sk)

vs patched

->  Nested Loop  (cost=4.63..724.34 rows=16 width=67) (actual 
time=6867.921..7001.417 rows=117 loops=1)
  Join Filter: (item.i_item_sk = store_returns.sr_item_sk)
  ->  Nested Loop  (cost=0.29..662.07 rows=1 width=59) (actual 
time=6867.874..7000.211 rows=8 loops=1)
->  CTE Scan on cs_ui  (cost=0.00..2.16 rows=108 width=4) 
(actual time=6865.792..6924.816 rows=17169 loops=1)
->  Index Scan using item_pkey on item  (cost=0.29..6.10 
rows=1 width=55) (actual time=0.003..0.003 rows=0 loops=17169)
  Index Cond: (i_item_sk = cs_ui.cs_item_sk)
  Filter: ((i_current_price >= '79'::numeric) AND 
(i_current_price <= '89'::numeric) AND (i_current_price >= '80'::numeric) AND 
(i_current_price <= '94'::numeric) AND (i_color = ANY 
('{navajo,burlywood,cornflower,olive,turquoise,linen}'::bpchar[])))
  Rows Removed by Filter: 1
  ->  Bitmap Heap Scan on store_returns  (cost=4.34..62.05 rows=18 
width=8) (actual time=0.025..0.116 rows=15 loops=8)
Recheck Cond: (sr_item_sk = cs_ui.cs_item_sk)
Heap Blocks: exact=117
->  Bitmap Index Scan on idx_sr_item_sk  (cost=0.00..4.34 
rows=18 width=0) (actual time=0.017..0.017 rows=15 loops=8)
  Index Cond: (sr_item_sk = cs_ui.cs_item_sk)

Those are the exact same plans, and same cost estimates, but for some
reason the master run is 2X slower.  The only explanation I can think of
is disk caching effects, or maybe hint-bit setting during the first run.
It's certainly not the planner that deserves either credit or blame for
that.

The part of this plan that's actually different is a different join order
sequence for a lot of follow-on joins that are expected to get single rows
from their other table.  I think that's basically irrelevant really, but
again, this patch should not have changed anything there, since those were
plain joins not semijoins.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 11:44 AM, Andres Freund  wrote:
> On 2015-06-02 11:37:02 -0400, Robert Haas wrote:
>> The exact circumstances under which we're willing to replace a
>> relminmxid with a newly-computed one that differs are not altogether
>> clear to me, but there's an "if" statement protecting that logic, so
>> there are some circumstances in which we'll leave the existing value
>> intact.
>
> I guess we'd have to change that then.

Yeah, but first we'd need to assess why it's like that.  Tom was the
one who installed the current logic, but I haven't been able to fully
understand it.

>> It would similarly do so when the oldest MXID reference in the
>> relation is in fact 1, but that value can't be vacuumed away yet.
>
> I'd thought of just forcing consumption of one multixact in that
> case. Not pretty, but imo acceptable.

What if multixact 1 still has living members?

>> Also, the database might be really big.  Even if it were true that a
>> full scan of every table would get us out of this state, describing
>> the time that it would take to do that as "relatively short" seems to
>> me to be considerably understating the impact of a full-cluster
>> VACUUM.
>
> Well. We're dealing with a corrupted cluster. Having a way out that's
> done automatically, even if it takes a long while, is pretty good
> already. In many cases the corruption (i.e. pg_upgrade) happened long
> ago, so the table's relminmxid will already have been recomputed.  I
> think that's acceptable.

I'm a long way from being convinced the automated recovery is
possible.  There are so many different scenarios here that it's very
difficult to reason generally about what the "right" thing to do is.
I agree it would be nice if we had it, though.

>> With regard to the more general question of WAL-logging this, are you
>> going to work on that?  Are you hoping Alvaro or I will work on that?
>> Should we draw straws?  It seems like somebody needs to do it.
>
> I'm willing to invest the time to develop an initial version, but I'll
> need help evaluating it. I don't have many testing resources available
> atm, and I'm not going to trust stuff I developed while travelling by
> just looking at the code.

I'm willing to help with that.  Hopefully I'm not the only one, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 11:36 AM, Andres Freund  wrote:
>> That would be a departure from the behavior of every existing release
>> that includes this code based on, to my knowledge, zero trouble
>> reports.
>
> On the other hand we're now at about bug #5 attributable to the odd way
> truncation works for multixacts. It's obviously complex and hard to get
> right. It makes it harder to cope with the wrong values left in
> datminmxid etc. So I'm still wondering whether fixing this for good isn't
> the better approach.

It may well be.  But I think we should do something more surgical
first.  Perhaps we can justify the pain and risk of making changes to
the WAL format in the back-branches, but let's not do it in a rush.
If we can get this patch to a state where it undoes the damage
inflicted in 9.3.7/9.4.2, then we will be in a state where we have as
much reliability as we had in 9.3.6 plus the protections against
member-space wraparound added in 9.3.7 - which, like the patch I'm
proposing now, were directly motivated by multiple, independent bug
reports.  That seems like a good place to get to.  If nothing else, it
will buy us some time to figure out what else we want to do.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-02 11:37:02 -0400, Robert Haas wrote:
> The exact circumstances under which we're willing to replace a
> relminmxid with a newly-computed one that differs are not altogether
> clear to me, but there's an "if" statement protecting that logic, so
> there are some circumstances in which we'll leave the existing value
> intact.

I guess we'd have to change that then.

> It would similarly do so when the oldest MXID reference in the
> relation is in fact 1, but that value can't be vacuumed away yet.

I'd thought of just forcing consumption of one multixact in that
case. Not pretty, but imo acceptable.

> Also, the database might be really big.  Even if it were true that a
> full scan of every table would get us out of this state, describing
> the time that it would take to do that as "relatively short" seems to
> me to be considerably understating the impact of a full-cluster
> VACUUM.

Well. We're dealing with a corrupted cluster. Having a way out that's
done automatically, even if it takes a long while, is pretty good
already. In many cases the corruption (i.e. pg_upgrade) happened long
ago, so the table's relminmxid will already have been recomputed.  I
think that's acceptable.

> With regard to the more general question of WAL-logging this, are you
> going to work on that?  Are you hoping Alvaro or I will work on that?
> Should we draw straws?  It seems like somebody needs to do it.

I'm willing to invest the time to develop an initial version, but I'll
need help evaluating it. I don't have many testing resources available
atm, and I'm not going to trust stuff I developed while travelling by
just looking at the code.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 11:27 AM, Andres Freund  wrote:
> On 2015-06-02 11:16:22 -0400, Robert Haas wrote:
>> I'm having trouble figuring out what to do about this.  I mean, the
>> essential principle of this patch is that if we can't count on
>> relminmxid, datminmxid, or the control file to be accurate, we can at
>> least look at what is present on the disk.  If we also cannot count on
>> that to be accurate, we are left without any reliable source of
>> information.  Consider a hypothetical cluster where all our stored
>> minmxids of whatever form are corrupted (say, all change to 1) and in
>> addition there are stray files in pg_multixact.  I don't think there's
>> really any way to get ourselves out of trouble in that scenario.
>
> If we were to truncate after vacuum, and only on the primary (via WAL
> logging), we could, afaics, just rely on all the values to be
> recomputed. I mean relminmxid will be recomputed after a vacuum, and
> thus, after some time, will datminmxid and the control file value.  We
> could just force a value of 1 to always trigger anti-wraparound vacuums
> (or wait for that to happen implicitly, to delay the impact?). That
> should then fix the problem in a relatively short amount of time?

The exact circumstances under which we're willing to replace a
relminmxid with a newly-computed one that differs are not altogether
clear to me, but there's an "if" statement protecting that logic, so
there are some circumstances in which we'll leave the existing value
intact.  If we force non-stop vacuuming in that scenario, autovacuum
will just run like crazy without accomplishing anything, which
wouldn't be good.  It would similarly do so when the oldest MXID
reference in the relation is in fact 1, but that value can't be
vacuumed away yet.

Also, the database might be really big.  Even if it were true that a
full scan of every table would get us out of this state, describing
the time that it would take to do that as "relatively short" seems to
me to be considerably understating the impact of a full-cluster
VACUUM.

With regard to the more general question of WAL-logging this, are you
going to work on that?  Are you hoping Alvaro or I will work on that?
Should we draw straws?  It seems like somebody needs to do it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-02 11:29:24 -0400, Robert Haas wrote:
> On Tue, Jun 2, 2015 at 8:56 AM, Andres Freund  wrote:
> > But what *definitely* looks wrong to me is that a TruncateMultiXact() in
> > this scenario now (since a couple weeks ago) does a
> > SimpleLruReadPage_ReadOnly() in the members slru via
> > find_multixact_start(). That just won't work acceptably when we're not
> > yet consistent. There very well could not be a valid members segment at
> > that point?  Am I missing something?
> 
> Yes: that code isn't new.

Good point.

> TruncateMultiXact() called SimpleLruReadPage_ReadOnly() directly in
> 9.3.0 and every subsequent release until 9.3.7/9.4.2.

But back then TruncateMultiXact() wasn't called during recovery. You're
right, though, that it doesn't seem to have produced attributable bug
reports since 9.3.2, where vacuuming during recovery was introduced. So it
indeed doesn't seem as urgent as fixing the new call sites.

> That would be a departure from the behavior of every existing release
> that includes this code based on, to my knowledge, zero trouble
> reports.

On the other hand we're now at about bug #5 attributable to the odd way
truncation works for multixacts. It's obviously complex and hard to get
right. It makes it harder to cope with the wrong values left in
datminmxid etc. So I'm still wondering whether fixing this for good isn't
the better approach.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] nested loop semijoin estimates

2015-06-02 Thread Tomas Vondra


On 06/02/15 16:37, Tom Lane wrote:

Tomas Vondra  writes:

OK, so I did the testing today - with TPC-H and TPC-DS benchmarks. The
results are good, IMHO.


I'm a bit disturbed by that, because AFAICS from the plans, these
queries did not involve any semi or anti joins, which should mean
that the patch would not have changed the planner's behavior. You
were using the second patch as-posted, right, without further hacking
on compare_path_costs_fuzzily?


Yes, no additional changes.


It's possible that the change was due to random variation in ANALYZE
statistics, in which case it was just luck.


I don't think so. I simply loaded the data, ran ANALYZE, and then simply 
started either master or patched master. There should be no difference 
in statistics, I believe. Also, the plans contain pretty much the same 
row counts, but the costs differ.


For example look at the 'cs_ui' CTE, right at the beginning of the 
analyze logs. The row counts are exactly the same, but the costs are 
different. And it's not using semijoins or not nested loops ...


--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 8:56 AM, Andres Freund  wrote:
> But what *definitely* looks wrong to me is that a TruncateMultiXact() in
> this scenario now (since a couple weeks ago) does a
> SimpleLruReadPage_ReadOnly() in the members slru via
> find_multixact_start(). That just won't work acceptably when we're not
> yet consistent. There very well could not be a valid members segment at
> that point?  Am I missing something?

Yes: that code isn't new.

TruncateMultiXact() called SimpleLruReadPage_ReadOnly() directly in
9.3.0 and every subsequent release until 9.3.7/9.4.2.  The only thing
that's changed is that we've moved that logic into a function called
find_multixact_start() instead of having it directly inside that
function.  We did that because we needed to use the same logic in some
other places.  The reason why 9.3.7/9.4.2 are causing problems for
people that they didn't have previously is because those new,
additional call sites were poorly chosen and didn't include adequate
protection against calling that function with an invalid input value.
What this patch is about is getting back to the situation that we were
in from 9.3.0 - 9.3.6 and 9.4.0 - 9.4.1, where TruncateMultiXact() did
the thing that you're complaining about here but no one else did.

From my point of view, I think that you are absolutely right to
question what's going on in TruncateMultiXact().  It's kooky, and
there may well be bugs buried there.  But I don't think fixing that
should be the priority right now, because we have zero reports of
problems attributable to that logic.  I think the priority should be
on undoing the damage that we did in 9.3.7/9.4.2, when we made other
places to do the same thing.  We started getting trouble reports
attributable to those changes *almost immediately*, which means that
whether or not TruncateMultiXact() is broken, these new call sites
definitely are.  I think we really need to fix those new places ASAP.

> I think at the very least we'll have to skip this step while not yet
> consistent. That really sucks, because we'll possibly end up with
> multixacts that are completely filled by the time we've reached
> consistency.

That would be a departure from the behavior of every existing release
that includes this code based on, to my knowledge, zero trouble
reports.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-02 11:16:22 -0400, Robert Haas wrote:
> I'm having trouble figuring out what to do about this.  I mean, the
> essential principle of this patch is that if we can't count on
> relminmxid, datminmxid, or the control file to be accurate, we can at
> least look at what is present on the disk.  If we also cannot count on
> that to be accurate, we are left without any reliable source of
> information.  Consider a hypothetical cluster where all our stored
> minmxids of whatever form are corrupted (say, all change to 1) and in
> addition there are stray files in pg_multixact.  I don't think there's
> really any way to get ourselves out of trouble in that scenario.

If we were to truncate after vacuum, and only on the primary (via WAL
logging), we could, afaics, just rely on all the values to be
recomputed. I mean relminmxid will be recomputed after a vacuum, and
thus, after some time, will datminmxid and the control file value.  We
could just force a value of 1 to always trigger anti-wraparound vacuums
(or wait for that to happen implicitly, to delay the impact?). That
should then fix the problem in a relatively short amount of time?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [PATCH] Fix documentation bug in how to calculate the quasi-unique pg_log session_id

2015-06-02 Thread Joel Jacobson
Fix documentation bug in how to calculate the quasi-unique pg_log session_id

While the code truncates the backend_start time, the query example in
the documentation rounds it.
Fix by doing the same thing in the documentation as in the code, i.e.
truncating backend_start.

elog.c:
snprintf(strfbuf, sizeof(strfbuf) - 1, "%lx.%x",
(long) (MyStartTime), MyProcPid);

Example:

2015-06-02 16:50:29.045
CEST,"joel","gluepay",49262,"93.158.127.42:41093",556d7a0a.c06e,5062,"idle
in transaction",2015-06-02 11:40:26
CEST,17/19970778,0,LOG,0,"statement: select * from foo where bar =
'baz';""exec_simple_query, postgres.c:876","psql"

select backend_start, procpid, to_hex(EXTRACT(EPOCH FROM
backend_start)::integer) || '.' || to_hex(procpid) as invalid_session_id
from pg_stat_activity_log where procpid = 49262;
 backend_start | procpid | invalid_session_id
---+-+
 2015-06-02 11:40:26.516373+02 |   49262 | 556d7a0b.c06e

select backend_start, procpid, to_hex(trunc(EXTRACT(EPOCH FROM
backend_start))::integer) || '.' || to_hex(procpid) as valid_session_id
from pg_stat_activity_log where procpid = 49262;
 backend_start | procpid | valid_session_id
---+-+--
 2015-06-02 11:40:26.516373+02 |   49262 | 556d7a0a.c06e

---

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5549b7d..1da7dfb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4707,7 +4707,7 @@ local0.*/var/log/postgresql
  of printing those items.  For example, to generate the session
  identifier from pg_stat_activity, use this query:
 
-SELECT to_hex(EXTRACT(EPOCH FROM backend_start)::integer) || '.' ||
+SELECT to_hex(trunc(EXTRACT(EPOCH FROM backend_start))::integer) || '.' ||
to_hex(pid)
 FROM pg_stat_activity;
 


Re: [HACKERS] checkpointer continuous flushing

2015-06-02 Thread Andres Freund
On 2015-06-02 17:01:50 +0200, Fabien COELHO wrote:
> >The actual problem is sorting & fsyncing in a way that deals efficiently
> >with tablespaces, i.e. doesn't write to tablespaces one-by-one.
> >Not impossible, but it requires some thought.
> 
> Hmmm... I would have neglected this point in a first approximation,
> but I agree that not interleaving tablespaces could indeed lose some
> performance.

I think it'll be a hard-to-diagnose performance regression. So we'll
have to fix it. That argument actually was the blocker in previous
attempts...

> >IMO this feature, if done correctly, should result in better performance
> >in 95+% of the workloads
> 
> To demonstrate that would require time...

Well, that's part of the contribution process. Obviously you can't test
100% of the problems, but you can work hard at coming up with very
adversarial scenarios and evaluate performance for those.

> >and be enabled by default.
> 
> I did not have such an ambition with the submitted patch :-)

I don't think we want yet another tuning knob that's hard to tune
because it's critical for one factor (latency) but bad for another
(throughput), especially when that's completely unnecessary.

> >And that'll not be possible without actually writing mostly sequentially.
> 
> >It's also not just the sequential writes making this important, it's also
> >that it allows to do the final fsync() of the individual segments as soon
> >as their last buffer has been written out.
> 
> Hmmm... I'm not sure this would have a large impact. The writes are
> throttled as much as possible, so fsync will catch plenty of other writes
> anyway, if there are some.

That might be the case in a database with a single small table;
i.e. where all the writes go to a single file. But as soon as you have
large tables (i.e. many segments) or multiple tables, a significant part
of the writes issued independently from checkpointing will be outside
the processing of the individual segment.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 1:21 AM, Noah Misch  wrote:
> On Mon, Jun 01, 2015 at 02:06:05PM -0400, Robert Haas wrote:
>> On Mon, Jun 1, 2015 at 12:46 AM, Noah Misch  wrote:
>> > On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote:
>> >> SetMultiXactIdLimit() bracketed certain parts of its
>> >> logic with if (!InRecovery), but those guards were ineffective because
>> >> it gets called before InRecovery is set in the first place.
>> >
>> > SetTransactionIdLimit() checks InRecovery for the same things, and it is
>> > called at nearly the same moments as SetMultiXactIdLimit().  Do you have a
>> > sense of whether it is subject to similar problems as a result?
>>
>> Well, I think it's pretty weird that those things will get done before
>> beginning recovery, even on an inconsistent cluster, but not during
>> recovery.  That is pretty strange.  I don't see that it can actually
>> do any worse than emit a few log messages at the start of recovery
>> that won't show up again until the end of recovery, though.
>
> Granted.  Would it be better to update both functions at the same time, and
> perhaps to make that a master-only change?  Does the status quo cause more
> practical harm via SetMultiXactIdLimit() than via SetTransactionIdLimit()?

It does in the case of the call to find_multixact_start().  If that
fails, we take the server down for no good reason, as demonstrated by
the original report. I'll revert the changes to the other two things
in this function that I changed to be protected by did_trim.  There's
no special reason to think that's a necessary change.

>> >> 1. Moves the call to DetermineSafeOldestOffset() that appears in
>> >> StartupMultiXact() to TrimMultiXact(), so that we don't try to do this
>> >> until we're consistent.  Also, instead of passing
>> >> MultiXactState->oldestMultiXactId, pass the newer of that value and
>> >> the earliest offset that exists on disk.  That way, it won't try to
>> >> read data that's not there.
>> >
>> > Perhaps emit a LOG message when we do that, since it's our last 
>> > opportunity to
>> > point to the potential data loss?
>>
>> If the problem is just that somebody minmxid got set to 1 instead of
>> the real value, I think that there is no data loss, because none of
>> those older values are actually present there.  But we could add a LOG
>> message anyway.  How do you suggest that we phrase that?
>
> There's no data loss if 1 <= true_minmxid <= 2^31 at the time minmxid got set
> to 1.  Otherwise, data loss is possible.

Yes, but in that scenario, the log message you propose wouldn't be
triggered.  If true_minmxid > 2^31, then the stored minmxid will not
precede the files on disk; it will follow it (assuming the older stuff
hasn't been truncated, as is likely).  So the message would be
essentially:

LOG: you didn't lose data.  but if exactly the opposite of what this
message is telling you about had happened, then you would have.
DETAIL: Have a nice day.

> I don't hope for an actionable
> message, but we might want a reporter to grep logs for it when we diagnose
> future reports.  Perhaps this:
>
>   "missing pg_multixact/members files; begins at MultiXactId %u, expected %u"

This seems misleading.  In the known failure case, it's not that the
pg_multixact files are unexpectedly missing; it's that we incorrectly
think that they should still be there.  Maybe:

oldest MultiXactId on disk %u follows expected oldest MultiXact %u

> For good measure, perhaps emit this when lastCheckpointedOldest > earliest by
> more than one segment:
>
>   "excess pg_multixact/members files; begins at MultiXactId %u, expected %u"

So, this scenario will happen whenever the system was interrupted in
the middle of a truncation, or when the system is started from a base
backup and a truncation happened after files were copied.  I'm wary of
giving users the idea that this is an atypical event.  Perhaps a
message at DEBUG1?

>> I'm not sure what you mean about it becoming too old.  At least with
>> that fix, it should get updated to exactly the first file that we
>> didn't remove.  Isn't that right?
>
> Consider a function raw_GOMXOD() that differs from GetOldestMultiXactOnDisk()
> only in that it never reads or writes the cache.  I might expect
> oldestMultiXactOnDisk==raw_GOMXOD() if oldestMultiXactOnDiskValid, and that
> does hold most of the time.  It does not always hold between the start of the
> quoted code's SimpleLruTruncate() and its oldestMultiXactOnDisk assignment.
> That's because raw_GOMXOD() changes at the instant we unlink the oldest
> segment, but oldestMultiXactOnDisk changes later.  Any backend may call
> GetOldestMultiXactOnDisk() via SetMultiXactIdLimit().  If a backend does that
> concurrent with the checkpointer running TruncateMultiXact() and sees a stale
> oldestMultiXactOnDisk, is that harmless?

As far as I can tell, it's pretty much harmless.  I mean, we've
already discussed the risk that the head and tail could get too far
apart, because really it should

Re: [HACKERS] [PATCH] Add error handling to byteaout.

2015-06-02 Thread Tom Lane
Andreas Seltenreich  writes:
> The scary one is due to an integer overflow the attached patch also
> fixes.

s/int/Size/ doesn't fix anything on 32-bit machines.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2015-06-02 Thread Fabien COELHO



Hmmm. I think it should be implemented as Tom suggested, that is, per chunk
of shared buffers, in order to avoid allocating a "large" amount of memory.


I don't necessarily agree. But that's really just a minor implementation
detail.


Probably.

The actual problem is sorting & fsyncing in a way that deals efficiently 
with tablespaces, i.e. doesn't write to tablespaces one-by-one.

Not impossible, but it requires some thought.


Hmmm... I would have neglected this point in a first approximation,
but I agree that not interleaving tablespaces could indeed lose some
performance.



ISTM that the two aspects are orthogonal, which would suggest two gucs
anyway.


They're pretty closely linked from their performance impact.


Sure.

IMO this feature, if done correctly, should result in better performance 
in 95+% of the workloads


To demonstrate that would require time...


and be enabled by default.


I did not have such an ambition with the submitted patch :-)

And that'll not be possible without actually writing mostly 
sequentially.


It's also not just the sequential writes making this important, it's 
also that it allows to do the final fsync() of the individual segments 
as soon as their last buffer has been written out.


Hmmm... I'm not sure this would have a large impact. The writes are 
throttled as much as possible, so fsync will catch plenty of other writes
anyway, if there are some.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [PATCH] Add error handling to byteaout.

2015-06-02 Thread Andreas Seltenreich
Hi,

when dealing with bytea values that are below the 1GB limit, but too
large to be escaped, the error messages emitted are not very helpful for
users.  Especially if they appear in an unrelated context such as during
pg_dump.  I've attached a patch that adds ereport()ing that would have
prevented some confusion.

Example:

,[ master ]
| ase=# select mkbytea(29);
| ERROR:  invalid memory alloc request size 1073741827
| ase=# set bytea_output to escape;
| ase=# select mkbytea(29);
| ERROR:  invalid memory alloc request size 18446744071562067969
`

The scary one is due to an integer overflow the attached patch also
fixes.  I don't see any security implications though as it's only the
sign bit that is affected.

,[ with patch applied ]
| ase=# set bytea_output to 'escape';
| ase=# select mkbytea(29);
| ERROR:  escaped bytea value would be too big
| DETAIL:  Value would require 2147483649 bytes.
| HINT:  Use a different bytea_output setting or binary methods such as COPY 
BINARY.
`

regards,
Andreas

create function mkbytea(power int, part bytea default '\x00')
   returns bytea as
   $$select case when power>0 then mkbytea(power-1,part||part) else part 
end;$$
   language sql;

>From f62a101a690fc9251c4c2de9c87323cedd0e9a54 Mon Sep 17 00:00:00 2001
From: Andreas Seltenreich 
Date: Mon, 1 Jun 2015 16:17:21 +0200
Subject: [PATCH] Add error handling to byteaout.

Emit a comprehensible error message when escaping fails due to
MaxAllocSize instead of waiting for palloc to fail.  Also use size_t
for size computations to prevent integer overflow in
BYTEA_OUTPUT_ESCAPE branch.
---
 src/backend/utils/adt/varlena.c | 28 +---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 779729d..ec2594e 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -347,8 +347,21 @@ byteaout(PG_FUNCTION_ARGS)
 
if (bytea_output == BYTEA_OUTPUT_HEX)
{
+   Size len;
/* Print hex format */
-   rp = result = palloc(VARSIZE_ANY_EXHDR(vlena) * 2 + 2 + 1);
+   len = (Size)VARSIZE_ANY_EXHDR(vlena) * 2 + 2 + 1;
+
+   if (!AllocSizeIsValid(len)) {
+   ereport(ERROR,
+   (errcode(ERRCODE_OUT_OF_MEMORY),
+errmsg("escaped bytea value would be 
too big"),
+errdetail("Value would require %zu 
bytes.",
+  len),
+errhint("Use a different bytea_output 
setting or"
+" binary methods such 
as COPY BINARY.")));
+   }
+   
+   rp = result = palloc(len);
*rp++ = '\\';
*rp++ = 'x';
rp += hex_encode(VARDATA_ANY(vlena), VARSIZE_ANY_EXHDR(vlena), 
rp);
@@ -357,8 +370,8 @@ byteaout(PG_FUNCTION_ARGS)
{
/* Print traditional escaped format */
char   *vp;
-   int len;
-   int i;
+   Sizelen;
+   Sizei;
 
len = 1;/* empty string has 1 
char */
vp = VARDATA_ANY(vlena);
@@ -371,6 +384,15 @@ byteaout(PG_FUNCTION_ARGS)
else
len++;
}
+   if (!AllocSizeIsValid(len)) {
+   ereport(ERROR,
+   (errcode(ERRCODE_OUT_OF_MEMORY),
+errmsg("escaped bytea value would be 
too big"),
+errdetail("Value would require %zu 
bytes.",
+  len),
+errhint("Use a different bytea_output 
setting or"
+" binary methods such 
as COPY BINARY.")));
+   }
rp = result = (char *) palloc(len);
vp = VARDATA_ANY(vlena);
for (i = VARSIZE_ANY_EXHDR(vlena); i != 0; i--, vp++)
-- 
2.1.4



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2015-06-02 Thread Andres Freund
On 2015-06-02 15:42:14 +0200, Fabien COELHO wrote:
> >>This version seems already quite effective and very light. ISTM that
> >>adding a sort phase would mean reworking significantly how the
> >>checkpointer processes pages.
> >
> >Meh. The patch for that wasn't that big.
> 
> Hmmm. I think it should be implemented as Tom suggested, that is, per chunk
> of shared buffers, in order to avoid allocating a "large" amount of memory.

I don't necessarily agree. But that's really just a minor implementation
detail. The actual problem is sorting & fsyncing in a way that deals
efficiently with tablespaces, i.e. doesn't write to tablespaces
one-by-one.  Not impossible, but it requires some thought.

> >The problem with doing this separately is that without the sorting this
> >will be slower for throughput in a good number of cases. So we'll have
> >yet another GUC that's very hard to tune.
> 
> ISTM that the two aspects are orthogonal, which would suggest two gucs
> anyway.

They're pretty closely linked from their performance impact. IMO this
feature, if done correctly, should result in better performance in 95+%
of the workloads and be enabled by default. And that'll not be possible
without actually writing mostly sequentially.

It's also not just the sequential writes making this important, it's
also that it allows to do the final fsync() of the individual segments
as soon as their last buffer has been written out. That's important
because it means the file will get fewer writes done independently
(i.e. backends writing out dirty buffers) which will make the final
fsync more expensive.

It might be that we want two different gucs, but I don't think we can
release without both features.
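(For illustration, a minimal sketch of the combination argued for above,
using hypothetical names (DirtyRef, write_buffer, fsync_file) and none of
the real buffer-manager plumbing: sort the checkpoint's dirty buffers by
file and block so the writes become mostly sequential, and fsync each file
as soon as its last buffer has been written rather than at the very end.)

#include <stdlib.h>

typedef struct DirtyRef
{
    int      file_id;   /* stands in for tablespace/relation/segment */
    unsigned block_no;
} DirtyRef;

static int
cmp_dirty_ref(const void *pa, const void *pb)
{
    const DirtyRef *a = pa;
    const DirtyRef *b = pb;

    if (a->file_id != b->file_id)
        return (a->file_id > b->file_id) - (a->file_id < b->file_id);
    return (a->block_no > b->block_no) - (a->block_no < b->block_no);
}

static void
checkpoint_write_sorted(DirtyRef *refs, size_t n,
                        void (*write_buffer) (const DirtyRef *),
                        void (*fsync_file) (int file_id))
{
    size_t i;

    /* Sort by file and block so the writes are mostly sequential. */
    qsort(refs, n, sizeof(DirtyRef), cmp_dirty_ref);

    for (i = 0; i < n; i++)
    {
        write_buffer(&refs[i]);
        /* Last buffer of this file: sync it now, not at the very end. */
        if (i + 1 == n || refs[i + 1].file_id != refs[i].file_id)
            fsync_file(refs[i].file_id);
    }
}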

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_xlog -> pg_xjournal?

2015-06-02 Thread Amit Langote
On Tue, Jun 2, 2015 at 11:03 PM, Abhijit Menon-Sen  wrote:
> At 2015-06-01 23:35:23 -0500, htf...@gmail.com wrote:
>>
>> No, it won't prevent the incredibly stupid from doing incredibly
>> stupid things, nothing will.
>
> I hate to speechify, but I think we should try hard to avoid framing
> such questions in terms of "incredibly stupid" people and the things
> they might do.
>
> We have anecdotal and circumstantial evidence that the names pg_xlog and
> pg_clog have given some people the impression that they can delete files
> therein. Sometimes they do this when their server is in imminent danger of
> running out of space, sometimes not. But our documentation makes it
> clear that these files are important.
>
> I think naming these directories to convey the right impression is a
> straightforward interface design problem, and we also know that big
> flashing red warnings are less effective than one might hope for. I
> do not think a bigger, stripier warning is worth doing in isolation.
> I do think it's worth choosing better names.
>

+1.

Regards,
Amit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] nested loop semijoin estimates

2015-06-02 Thread Tom Lane
Tomas Vondra  writes:
> OK, so I did the testing today - with TPC-H and TPC-DS benchmarks. The 
> results are good, IMHO.

> With TPC-H, I've used 1GB and 4GB datasets, and I've seen no plan 
> changes at all. I don't plan to run the tests on larger data sets, I do 
> expect the behavior to remain the same, considering the uniformity of 
> TPC-H data sets.

> With TPC-DS (using the 63 queries supported by PostgreSQL), I've seen 
> two cases of plan changes - see the plans attached. In both cases 
> however the plan change results in much better performance. While on 
> master the queries took 23 and 18 seconds, with the two patches it's 
> only 7 and 3. This is just the 1GB dataset. I'll repeat the test with 
> the 4GB dataset and post an update if there are any changes.

I'm a bit disturbed by that, because AFAICS from the plans, these queries
did not involve any semi or anti joins, which should mean that the patch
would not have changed the planner's behavior.  You were using the second
patch as-posted, right, without further hacking on
compare_path_costs_fuzzily?

It's possible that the change was due to random variation in ANALYZE
statistics, in which case it was just luck.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_xlog -> pg_xjournal?

2015-06-02 Thread Bruce Momjian
On Tue, Jun  2, 2015 at 07:33:19PM +0530, Abhijit Menon-Sen wrote:
> At 2015-06-01 23:35:23 -0500, htf...@gmail.com wrote:
> >
> > No, it won't prevent the incredibly stupid from doing incredibly
> > stupid things, nothing will.
> 
> I hate to speechify, but I think we should try hard to avoid framing
> such questions in terms of "incredibly stupid" people and the things
> they might do.
> 
> We have anecdotal and circumstantial evidence that the names pg_xlog and
> pg_clog have given some people the impression that they can delete files
> therein. Sometimes they do this when their server is in imminent danger of
> running out of space, sometimes not. But our documentation makes it
> clear that these files are important.
> 
> I think naming these directories to convey the right impression is a
> straightforward interface design problem, and we also know that big
> flashing red warnings are less effective than one might hope for. I
> do not think a bigger, stripier warning is worth doing in isolation.
> I do think it's worth choosing better names.

I think we also have to consider that these admins are in crisis mode
when they run out of disk space, so they are not in a calm environment
where decisions are considered carefully.  You can say decisions should
always be considered carefully, but I am not sure how realistic that is.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_xlog -> pg_xjournal?

2015-06-02 Thread Joshua D. Drake


On 06/01/2015 09:35 PM, Michael Nolan wrote:

Why not take a simpler approach and create a zero length file in
directories that should not be fiddled with by non-experts using a file
name something like "DO.NOT.DELETE.THESE.FILES"?


+1



No, it won't prevent the incredibly stupid from doing incredibly stupid
things, nothing will.


Darwin?

JD



--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] nested loop semijoin estimates

2015-06-02 Thread Tomas Vondra

On 06/02/15 01:47, Josh Berkus wrote:

On 06/01/2015 03:22 PM, Tomas Vondra wrote:


On 06/01/15 23:47, Josh Berkus wrote:

On 06/01/2015 02:18 PM, Tom Lane wrote:


Anybody else want to speak for or against back-patching the patch as
posted?  I intentionally didn't push it in before today's releases,
but I will push it later this week if there are not objections.


I would like Mark Wong to test this on DBT3, if that's possible
for him. I'm very worried about an unanticipated regression.


AFAIK Mark is busy with other stuff at the moment, but I can do
the TPC-H (which DBT3 is equal to, IIRC).


Yeah, I just want something which has a chance of catching
unanticipated regression in queries which should be unaffected by the
patch.


OK, so I did the testing today - with TPC-H and TPC-DS benchmarks. The 
results are good, IMHO.


With TPC-H, I've used 1GB and 4GB datasets, and I've seen no plan 
changes at all. I don't plan to run the tests on larger data sets, I do 
expect the behavior to remain the same, considering the uniformity of 
TPC-H data sets.


With TPC-DS (using the 63 queries supported by PostgreSQL), I've seen 
two cases of plan changes - see the plans attached. In both cases 
however the plan change results in much better performance. While on 
master the queries took 23 and 18 seconds, with the two patches it's 
only 7 and 3. This is just the 1GB dataset. I'll repeat the test with 
the 4GB dataset and post an update if there are any changes.


While this can't prove there are no regressions, in these two benchmarks 
the patches actually improve some of the queries.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


  QUERY PLAN

   
---
 Sort
   Sort Key: cs1.product_name, cs1.store_name, cs2.cnt
   CTE cs_ui
 ->  HashAggregate
   Group Key: catalog_sales.cs_item_sk
   Filter: (sum(catalog_sales.cs_ext_list_price) > ('2'::numeric * 
sum(((catalog_returns.cr_refunded_cash + catalog_returns.cr_reversed_charge) + 
catalog_returns.cr_store_credit
   ->  Merge Join
 Merge Cond: (catalog_returns.cr_order_number = 
catalog_sales.cs_order_number)
 Join Filter: (catalog_sales.cs_item_sk = 
catalog_returns.cr_item_sk)
 ->  Index Scan using idx_cr_order_number on catalog_returns
 ->  Materialize
   ->  Index Scan using idx_cs_order_number on catalog_sales
   CTE cross_sales
 ->  HashAggregate
   Group Key: item.i_product_name, item.i_item_sk, store.s_store_name, 
store.s_zip, ad1.ca_street_number, ad1.ca_street_name, ad1.ca_city, ad1.ca_zip, 
ad2.ca_street_number, ad2.ca_street_name, ad2.ca_city, ad2.ca_zip, d1.d_year, 
d2.d_year, d3.d_year
   ->  Nested Loop
 ->  Nested Loop
   ->  Nested Loop
 Join Filter: (store_sales.ss_store_sk = 
store.s_store_sk)
 ->  Nested Loop
   ->  Nested Loop
 ->  Nested Loop
   Join Filter: 
(cd1.cd_marital_status <> cd2.cd_marital_status)
   ->  Nested Loop
 ->  Nested Loop
   ->  Nested Loop
 ->  Nested Loop
   ->  
Nested Loop
 -> 
 Nested Loop

   ->  Nested Loop

 ->  Nested Loop

   ->  Nested Loop

 Join Filter: (item.i_item_sk = store_sales.ss_item_sk)
 

Re: [HACKERS] pg_xlog -> pg_xjournal?

2015-06-02 Thread Abhijit Menon-Sen
At 2015-06-01 23:35:23 -0500, htf...@gmail.com wrote:
>
> No, it won't prevent the incredibly stupid from doing incredibly
> stupid things, nothing will.

I hate to speechify, but I think we should try hard to avoid framing
such questions in terms of "incredibly stupid" people and the things
they might do.

We have anecdotal and circumstantial evidence that the names pg_xlog and
pg_clog have given some people the impression that they can delete files
therein. Sometimes they do this when their server is in imminent danger of
running out of space, sometimes not. But our documentation makes it
clear that these files are important.

I think naming these directories to convey the right impression is a
straightforward interface design problem, and we also know that big
flashing red warnings are less effective than one might hope for. I
do not think a bigger, stripier warning is worth doing in isolation.
I do think it's worth choosing better names.

-- Abhijit

P.S. Unrelated to Michael's mail, but I also don't think it's worth
debating whether people will run "rm -rf *log" or "rm -rf log/*" or
whatever other variant you can think of. I'm arguing for correcting
a misperception, not trying to dodge specific harmful commands. Tom's
proposal of using a symlink but dropping it after third-party tools
have had time to catch up seems like the best approach to me.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2015-06-02 Thread Fabien COELHO


Hello Andres,


>> I would rather separate them, unless this is a blocker.
>
> I think it is a blocker.

Hmmm. This is an argument...

>> This version seems already quite effective and very light. ISTM that
>> adding a sort phase would mean reworking significantly how the
>> checkpointer processes pages.
>
> Meh. The patch for that wasn't that big.

Hmmm. I think it should be implemented as Tom suggested, that is per
chunks of shared buffers, in order to avoid allocating a "large" amount
of memory.
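
For what it's worth, a rough sketch of the kind of per-chunk sorting I have
in mind (names and chunk size are made up, this is only a sketch, not an
actual patch): collect the dirty buffers of one chunk of shared buffers,
then qsort them by relation/fork/block so that writes are sequential within
each file, and reuse the same small array for the next chunk.

/*
 * Hypothetical sketch only: per-chunk sorting of to-be-written buffers.
 */
#include "postgres.h"
#include "storage/buf_internals.h"

#define WRITE_CHUNK_SIZE 1024		/* buffers per batch; arbitrary value */

typedef struct BufferWriteItem
{
	RelFileNode	rnode;
	ForkNumber	forkNum;
	BlockNumber	blockNum;
	int			buf_id;
} BufferWriteItem;

static int
buffer_write_cmp(const void *pa, const void *pb)
{
	const BufferWriteItem *a = pa;
	const BufferWriteItem *b = pb;

	/* order by relation, then fork, then block, so writes are sequential */
	if (a->rnode.spcNode != b->rnode.spcNode)
		return a->rnode.spcNode < b->rnode.spcNode ? -1 : 1;
	if (a->rnode.dbNode != b->rnode.dbNode)
		return a->rnode.dbNode < b->rnode.dbNode ? -1 : 1;
	if (a->rnode.relNode != b->rnode.relNode)
		return a->rnode.relNode < b->rnode.relNode ? -1 : 1;
	if (a->forkNum != b->forkNum)
		return a->forkNum < b->forkNum ? -1 : 1;
	if (a->blockNum != b->blockNum)
		return a->blockNum < b->blockNum ? -1 : 1;
	return 0;
}

/* in the checkpointer, once a chunk of up to WRITE_CHUNK_SIZE dirty
 * buffers has been collected into items[0..nitems-1]:
 *     qsort(items, nitems, sizeof(BufferWriteItem), buffer_write_cmp);
 */

The point is that the array never needs to be larger than one chunk,
whatever the size of shared_buffers.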



> The problem with doing this separately is that without the sorting this
> will be slower for throughput in a good number of cases. So we'll have
> yet another GUC that's very hard to tune.


ISTM that the two aspects are orthogonal, which would suggest two GUCs
anyway.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2015-06-02 Thread Andres Freund
On 2015-06-02 15:15:39 +0200, Fabien COELHO wrote:
> >Won't this lead to more-unsorted writes (random I/O) as the
> >FlushBuffer requests (by checkpointer or bgwriter) are not sorted as
> >per files or order of blocks on disk?
> 
> Yep, probably. Under "moderate load" this is not an issue. The io-scheduler
> and other hd firmware will probably reorder writes anyway.

They pretty much can't if you flush things frequently. That's why I
think this won't be acceptable without the sorting in the checkpointer.

> Also, if several
> data are updated together, they are probably already neighbours
> in memory as well as on disk.

No, that's not how it'll happen outside of simplistic cases where you
start with an empty shared_buffers. Shared buffers are maintained by a
simplified LRU, so how often individual blocks are touched will define
the buffer replacement.

> >I remember sometime back there was some discusion regarding
> >sorting writes during checkpoint, one idea could be try to
> >check this idea along with that patch.  I just saw that Andres has
> >also given same suggestion which indicates that it is important
> >to see both the things together.
> 
> I would rather separate them, unless this is a blocker.

I think it is a blocker.

> This version seems
> already quite effective and very light. ISTM that adding a sort phase would
> mean reworking significantly how the checkpointer processes pages.

Meh. The patch for that wasn't that big.

The problem with doing this separately is that without the sorting this
will be slower for throughput in a good number of cases. So we'll have
yet another GUC that's very hard to tune.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2015-06-02 Thread Fabien COELHO


Hello Amit,


>> [...]
>>
>> The objective is to help avoid PG stalling when fsyncing on checkpoints,
>> and in general to get better latency-bound performance.
>
> Won't this lead to more-unsorted writes (random I/O) as the
> FlushBuffer requests (by checkpointer or bgwriter) are not sorted as
> per files or order of blocks on disk?


Yep, probably. Under "moderate load" this is not an issue. The 
io-scheduler and other hd firmware will probably reorder writes anyway. 
Also, if several data are updated together, they are probably already
neighbours in memory as well as on disk.



> I remember sometime back there was some discussion regarding
> sorting writes during checkpoint; one idea could be to try to
> check this idea along with that patch.  I just saw that Andres has
> also given the same suggestion, which indicates that it is important
> to see both things together.


I would rather separate them, unless this is a blocker. This version seems 
already quite effective and very light. ISTM that adding a sort phase 
would mean reworking significantly how the checkpointer processes pages.



> Also here another related point is that I think currently even fsync
> requests are not in order of the files as they are stored on disk, so
> that also might cause random I/O?


I think that currently the fsync is on the file handle, so what happens
depends on how fsync is implemented by the system.



> Yet another idea could be to allow BGWriter to also fsync the dirty
> buffers,


ISTM that it is done with this patch with "bgwriter_flush_to_disk=on".

> that may have the side impact of not being able to clear the dirty pages
> at the speed required by the system, but I think if that happens one can
> think of having multiple BGWriter tasks.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2015-06-02 Thread Fabien COELHO


Hello Andres,


> If I see correctly you picked up the version without sorting during
> checkpoints. I think that's not going to work - there'll be too many
> situations where the new behaviour will be detrimental.  Did you
> consider combining both approaches?


Yes, I thought that it was a more complex patch with uncertain/less clear
benefits, and as this simpler version was already effective enough as it
was, I decided to start with that and try to have reasonable proof of
benefits so that it could get through.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-01 14:22:32 -0400, Robert Haas wrote:
> On Mon, Jun 1, 2015 at 4:58 AM, Andres Freund  wrote:
> > The lack of WAL logging actually has caused problems in the 9.3.3 (?)
> > era, where we didn't do any truncation during recovery...
> 
> Right, but now we're piggybacking on the checkpoint records, and I
> don't have any evidence that this approach can't be made robust.  It's
> possible that it can't be made robust, but that's not currently clear.

Well, it's possible that it can be made to work without problems. But I
think robust is something different. Having to look at slrus, during
recovery, to find out what to truncate puts more intelligence/complexity
in the recovery path than I'm comfortable with.

> >> By the time we've reached the minimum recovery point, they will have
> >> been recreated by the same WAL records that created them in the first
> >> place.
> >
> > I'm not sure that's true. I think we could end up erroneously removing
> > files that were included in the base backup. Anyway, let's focus on your
> > patch for now.
> 
> OK, but I am interested in discussing the other thing too.  I just
> can't piece together the scenario myself - there may well be one.  The
> base backup will begin replay from the checkpoint caused by
> pg_start_backup() and remove anything that wasn't there at the start
> of the backup.  But all of that stuff should get recreated by the time
> we reach the minimum recovery point (end of backup).

I'm not sure if it's reproducibly borked. What about this scenario:
1) pg_start_backup() is called, creates a checkpoint.
2) 2**31 multixacts are created, possibly with several checkpoints
   in between
3) pg_multixact is copied
4) basebackup finishes

Unless I'm missing something this will lead to a crash recovery startup
where the first call to TruncateMultiXact() will have
MultiXactState->lastCheckpointedOldest wildly inconsistent with
GetOldestMultiXactOnDisk() return value. Possibly leading to truncation
being skipped erroneously.  Whether that's a problem I'm not yet
entirely sure.

But what *definitely* looks wrong to me is that a TruncateMultiXact() in
this scenario now (since a couple weeks ago) does a
SimpleLruReadPage_ReadOnly() in the members slru via
find_multixact_start(). That just won't work acceptably when we're not
yet consistent. There very well might not be a valid members segment at
that point?  Am I missing something?

> > I'm more worried about the cases where we didn't ever actually "badly
> > wrap around" (i.e. overwrite needed data); but where that's not clear on
> > the standby because the base backup isn't in a consistent state.
> 
> I agree. The current patch tries to make it so that we never call
> find_multixact_start() while in recovery, but it doesn't quite
> succeed: the call in TruncateMultiXact still happens during recovery,
> but only once we're sure that the mxact we plan to call it on actually
> exists on disk.  That won't be called until we replay the first
> checkpoint, but that might still be prior to consistency.

It'll pretty much *always* be before we reach consistency, right? It'll
be called on the checkpoint created by pg_start_backup()?

I don't think the presence check (that's GetOldestMultiXactOnDisk() in
TruncateMultiXact(), right) helps very much. There's no guarantee at all
that offsets and members are in any way consistent with each other. Or
in themselves for that matter; the copy could very well have been in the
middle of writing an slru page.

I think at the very least we'll have to skip this step while not yet
consistent. That really sucks, because we'll possibly end up with
multixacts that are completely filled by the time we've reached
consistency.
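
To be concrete, I mean something like the following guard at the top of the
truncation step (a sketch only, assuming the existing reachedConsistency
flag is the right thing to test here):

	/* Sketch: skip truncation until replay has reached a consistent state. */
	if (RecoveryInProgress() && !reachedConsistency)
		return;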

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_xlog -> pg_xjournal?

2015-06-02 Thread Andrew Dunstan


On 06/02/2015 03:06 AM, Joel Jacobson wrote:

> On Tue, Jun 2, 2015 at 6:35 AM, Michael Nolan  wrote:
>> Why not take a simpler approach and create a zero length file in directories
>> that should not be fiddled with by non-experts using a file name something
>> like "DO.NOT.DELETE.THESE.FILES"?
>
> Then the smart sysadmin will say "but I didn't delete any *files*, I
> just deleted a log *directory*" :-)





Fire that sysadmin, not because they are incompetent but because either 
they are an idiot or they think they can treat you as one.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Streaming replication for psycopg2

2015-06-02 Thread Shulgin, Oleksandr
Hello,

I've submitted a patch to psycopg2 to support streaming replication
protocol (COPY_BOTH): https://github.com/psycopg/psycopg2/pull/322

It would be great if more people had a chance to take a look and provide
feedback about it.  In particular, please see example usage at this github
comment[1]. Some bikeshedding is really needed here. :-)

Thank you.
--
Alex

[1] https://github.com/psycopg/psycopg2/pull/322#issuecomment-107911085


Re: [HACKERS] checkpointer continuous flushing

2015-06-02 Thread Amit Kapila
On Mon, Jun 1, 2015 at 5:10 PM, Fabien COELHO  wrote:

>
> Hello pg-devs,
>
> This patch is a simplified and generalized version of Andres Freund's
> August 2014 patch for flushing while writing during checkpoints, with some
> documentation and configuration warnings added.
>
> For the initial patch, see:
>
>
> http://www.postgresql.org/message-id/20140827091922.gd21...@awork2.anarazel.de
>
> For the whole thread:
>
>
> http://www.postgresql.org/message-id/alpine.DEB.2.10.1408251900211.11151@sto
>
> The objective is to help avoid PG stalling when fsyncing on checkpoints,
> and in general to get better latency-bound performance.
>
>
-FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln, bool flush_to_disk)
 {
  XLogRecPtr recptr;
  ErrorContextCallback errcallback;
@@ -2410,7 +2417,8 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
   buf->tag.forkNum,
   buf->tag.blockNum,
   bufToWrite,
-  false);
+  false,
+  flush_to_disk);

Won't this lead to more-unsorted writes (random I/O) as the
FlushBuffer requests (by checkpointer or bgwriter) are not sorted as
per files or order of blocks on disk?

I remember sometime back there was some discussion regarding
sorting writes during checkpoint; one idea could be to try to
check this idea along with that patch.  I just saw that Andres has
also given the same suggestion, which indicates that it is important
to see both things together.

Also here another related point is that I think currently even fsync
requests are not in order of the files as they are stored on disk, so
that also might cause random I/O?

Yet another idea could be to allow BGWriter to also fsync the dirty
buffers; that may have the side impact of not being able to clear the dirty
pages at the speed required by the system, but I think if that happens one
can think of having multiple BGWriter tasks.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] checkpointer continuous flushing

2015-06-02 Thread Andres Freund
Hi,

It's nice to see the topic being picked up.

If I see correctly you picked up the version without sorting during
checkpoints. I think that's not going to work - there'll be too many
situations where the new behaviour will be detrimental.  Did you
consider combining both approaches?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2015-06-02 Thread Fabien COELHO


Hello Amit,


> Not that the GUC naming is the most pressing issue here, but do you think
> "*_flush_on_write" describes what the patch does?


It is currently "*_flush_to_disk". In Andres Freund's version the name is
"sync_on_checkpoint_flush", but I did not find it very clear. Using
"*_flush_on_write" instead, as you suggest, would be fine as well; it
emphasizes the "when/how" it occurs instead of the final "destination",
why not...


About words: a checkpoint "write"s pages, but this really means passing the
pages to the memory manager, which will think about it... "flush" seems to
suggest a more effective write, but really it may mean the same thing, the
page is just passed to the OS. So "write/flush" is really "to OS" and not
"to disk". I like the data to be on "disk" in the end, and as soon as
possible, hence the choice to emphasize that point.
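
To make the distinction concrete, here is a minimal sketch of the three
levels (not the patch itself; it assumes Linux, where the asynchronous
flush can be initiated with sync_file_range):

/*
 * Sketch only, error handling omitted for brevity.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static void
write_page(int fd, const char *page, size_t len, off_t offset)
{
	/* 1. "write": the page only reaches the kernel's page cache */
	pwrite(fd, page, len, offset);

	/*
	 * 2. "flush": ask the kernel to start writing that range back now,
	 * without waiting for completion (asynchronous).
	 */
	sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WRITE);
}

static void
wait_until_on_disk(int fd)
{
	/* 3. "disk": only this guarantees the data is durable */
	fsync(fd);
}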


Now I would really be okay with anything that people find simple to 
understand, so any opinion is welcome!


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_xlog -> pg_xjournal?

2015-06-02 Thread Mark Kirkwood

On 01/06/15 05:29, Joel Jacobson wrote:

> While anyone who is familiar with postgres would never do something as
> stupid as to delete pg_xlog,
> according to Google, there appears to be a fair amount of end-users out
> there having made the irrevocable mistake of deleting the precious
> directory,
> a decision made on the assumption that since "it has *log* in the name
> so it must be unimportant"
> (http://stackoverflow.com/questions/12897429/what-does-pg-resetxlog-do-and-how-does-it-work).
>
> If we could turn back time, would we have picked "pg_xlog" as the most
> optimal name for this important directory, or would we have come up with
> a more user-friendly name?
>
> Personally, I have never had any problems with pg_xlog, but I realize
> there are quite a few unlucky new users who end up in trouble.
>
> My suggestion is to use "pg_xjournal" instead of "pg_xlog" when new
> users create a new data directory using initdb, and allow for both
> directories to exist (exclusive or, i.e. either one or the other, but
> not both). That way we don't complicate the life for any existing users,
> all their tools will continue to work who rely on pg_xlog to be named
> pg_xlog, but only force new users to do a bit of googling when they
> can't use whatever tool that can't find pg_xlog. When they find out it's
> an important directory, they can simply create a symlink and their old
> not yet updated tool will work again.
>
> Thoughts?



+1

Strongly agree - I have had people on my dba course ask about deleting 
these pesky 'log' directories (obvious confusion/conflation with pg_log 
...)! A change of name would help reduce the temptation!


regards

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2015-06-02 Thread Amit Langote

Hi Fabien,

On 2015-06-01 PM 08:40, Fabien COELHO wrote:
> 
> Turning "checkpoint_flush_to_disk = on" reduces significantly the number
> of late transactions. These late transactions are not uniformly distributed,
> but are rather clustered around times when pg is stalled, i.e. more or less
> unresponsive.
> 
> bgwriter_flush_to_disk does not seem to have a significant impact on these
> tests, maybe because pg shared_buffers size is much larger than the database,
> so the bgwriter is seldom active.
> 

Not that the GUC naming is the most pressing issue here, but do you think
"*_flush_on_write" describes what the patch does?

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] psql weird behaviour with charset encodings

2015-06-02 Thread Michael Paquier
On Sun, May 24, 2015 at 2:43 AM, Noah Misch  wrote:
> On Sat, May 08, 2010 at 09:24:45PM -0400, Tom Lane wrote:
>> hgonza...@gmail.com writes:
>> > http://sources.redhat.com/bugzilla/show_bug.cgi?id=649
>>
>> > The last explains why they do not consider it a bug:
>>
>> > ISO C99 requires for %.*s to only write complete characters that fit below the
>> > precision number of bytes. If you are using say UTF-8 locale, but ISO-8859-1
>> > characters as shown in the input file you provided, some of the strings are
>> > not valid UTF-8 strings, therefore sprintf fails with -1 because of the
>> > encoding error. That's not a bug in glibc.
>>
>> Yeah, that was about the position I thought they'd take.
>
> GNU libc eventually revisited that conclusion and fixed the bug through commit
> 715a900c9085907fa749589bf738b192b1a2bda5.  RHEL 7.1 is fixed, but RHEL 6.6 and
> RHEL 5.11 are still affected; the bug will be relevant for another 8+ years.
>
>> So the bottom line here is that we're best off to avoid %.*s because
>> it may fail if the string contains data that isn't validly encoded
>> according to libc's idea of the prevailing encoding.
>
> Yep.  Immediate precisions like %.10s trigger the bug as effectively as %.*s,
> so tarCreateHeader() [_tarWriteHeader() in 9.2 and earlier] is also affected.
> Switching to strlcpy(), as attached, fixes the bug while simplifying the code.
> The bug symptom is error 'pg_basebackup: unrecognized link indicator "0"' when
> the name of a file in the data directory is not a valid multibyte string.
>
> Commit 6dd9584 introduced a new use of %.*s, to pg_upgrade.  It works reliably
> for now, because it always runs in the C locale.  pg_upgrade never calls
> set_pglocale_pgservice() or otherwise sets its permanent locale.  It would be
> natural for us to fix that someday, at which point non-ASCII database names
> would perturb this status output.
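
To make the failure mode easy to reproduce, here is a minimal standalone
sketch (not part of the attached patch; it assumes a glibc still affected
by the bug and that an en_US.UTF-8 locale is installed):

/*
 * Under a UTF-8 locale, an affected glibc fails with -1 when a precision
 * is applied to "%s" and the argument is not a valid multibyte string;
 * plain byte-wise truncation does not care about the encoding.
 */
#include <stdio.h>
#include <string.h>
#include <locale.h>

int
main(void)
{
	char		buf[8];
	const char	bad[] = {(char) 0xe9, 't', 'e', '\0'};	/* not valid UTF-8 */
	int			n;

	setlocale(LC_ALL, "en_US.UTF-8");	/* assumes this locale exists */

	n = snprintf(buf, sizeof(buf), "%.2s", bad);
	printf("snprintf returned %d\n", n);	/* -1 on affected glibc versions */

	/* byte-wise truncation, which is what strlcpy() does */
	strncpy(buf, bad, sizeof(buf) - 1);
	buf[sizeof(buf) - 1] = '\0';
	printf("copied %zu bytes\n", strlen(buf));
	return 0;
}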

I caught up the following places that need attention on top of the 4 ones
in tar.c:
src/backend/nodes/read.c:  elog(ERROR, "unrecognized integer: \"%.*s\"",
src/backend/nodes/read.c:  elog(ERROR, "unrecognized OID: \"%.*s\"",
src/backend/nodes/read.c:  elog(ERROR, "unrecognized token: \"%.*s\"", tok_len, token);
src/backend/nodes/readfuncs.c:  elog(ERROR, "unrecognized token: \"%.*s\"", length, token);
src/backend/nodes/readfuncs.c:  elog(ERROR, "unrecognized token: \"%.*s\"", length, token);
src/backend/nodes/readfuncs.c:  elog(ERROR, "unrecognized integer: \"%.*s\"", length, token);
src/backend/nodes/readfuncs.c:  elog(ERROR, "unrecognized boolop \"%.*s\"", length, token);
src/backend/nodes/readfuncs.c:  elog(ERROR, "badly formatted node string \"%.32s\"...", token);
src/backend/tsearch/wparser_def.c:  * Use of %.*s here is a bit risky since it can misbehave if the data is
src/backend/tsearch/wparser_def.c:  fprintf(stderr, "parsing \"%.*s\"\n", len, str);
src/backend/tsearch/wparser_def.c:  /* See note above about %.*s */
src/backend/tsearch/wparser_def.c:  fprintf(stderr, "parsing copy of \"%.*s\"\n", prs->lenstr, prs->str);
src/backend/utils/adt/datetime.c:  * Note: the uses of %.*s in this function would be risky if the
src/backend/utils/adt/datetime.c:  sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
src/backend/utils/adt/datetime.c:  sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
src/backend/utils/adt/datetime.c:  sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
src/backend/utils/adt/datetime.c:  /* %.*s is safe since all our tokens are ASCII */
src/backend/utils/adt/datetime.c:  elog(LOG, "token too long in %s table: \"%.*s\"",
src/interfaces/ecpg/ecpglib/error.c:  /* %.*s is safe here as long as sqlstate is all-ASCII */
src/interfaces/ecpg/ecpglib/error.c:  ecpg_log("raising sqlstate %.*s (sqlcode %ld): %s\n",
src/interfaces/ecpg/pgtypeslib/dt_common.c:  * Note: the uses of %.*s in this function would be risky if the
src/interfaces/ecpg/pgtypeslib/dt_common.c:  sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
src/interfaces/ecpg/pgtypeslib/dt_common.c:  sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
src/interfaces/ecpg/pgtypeslib/dt_common.c:  sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
src/bin/pg_basebackup/pg_basebackup.c:  ngettext("%*s/%s kB (%d%%), %d/%d tablespace (%s%-*.*s)",
src/bin/pg_basebackup/pg_basebackup.c:  "%*s/%s kB (%d%%), %d/%d tablespaces (%s%-*.*s)",
src/bin/pg_upgrade/util.c:  printf("  %s%-*.*s\r",

> It would be good to purge the code of precisions on "s" conversion specifiers,
> then Assert(!pointflag) in fmtstr() to catch new introductions.  I won't plan
> to do it myself, but it would be a nice little defensive change.

This sounds like a good protection idea, but as it impacts existing backend
code relying on sprintf's port version we should on

Re: [HACKERS] auto_explain sample rate

2015-06-02 Thread Pavel Stehule
2015-06-02 9:07 GMT+02:00 Craig Ringer :

> On 29 May 2015 at 11:35, Tom Lane  wrote:
>
>> Craig Ringer  writes:
>> > It's sometimes desirable to collect auto_explain data with ANALYZE in
>> order
>> > to track down hard-to-reproduce issues, but the performance impacts can
>> be
>> > pretty hefty on the DB.
>>
>> > I'm inclined to add a sample rate to auto_explain so that it can trigger
>> > only on x percent of queries,
>>
>> That sounds reasonable ...
>>
>
> Cool, I'll cook that up then. Thanks for the sanity check.
>
>
>> > and also add a sample test hook that can be
>> > used to target statements of interest more narrowly (using a C hook
>> > function).
>>
>> You'd have to be pretty desperate, *and* knowledgeable, to write a
>> C function for that.  Can't we invent something a bit more user-friendly
>> for the purpose?  No idea what it should look like though.
>>
>
> I've been that desperate.
>
> For the majority of users I'm sure it's sufficient to just have a sample
> rate.
>
> Anything that's trying to match individual queries could be interested in
> all sorts of different things. Queries that touch a particular table being
> one of the more obvious things, or queries that mention a particular
> literal. Rather than try to design something complicated in advance that
> anticipates all needs, I'm thinking it makes sense to just throw a hook in
> there. If some patterns start to emerge in terms of useful real world
> filtering criteria then that'd better inform any more user accessible
> design down the track.
>

The same method could be interesting for interactive EXPLAIN ANALYZE too.
TIMING has about 20%-30% overhead and usually we don't need perfectly exact
numbers.

Regards

Pavel


>
>
> --
>  Craig Ringer   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>


Re: [HACKERS] auto_explain sample rate

2015-06-02 Thread Craig Ringer
On 29 May 2015 at 11:35, Tom Lane  wrote:

> Craig Ringer  writes:
> > It's sometimes desirable to collect auto_explain data with ANALYZE in
> order
> > to track down hard-to-reproduce issues, but the performance impacts can
> be
> > pretty hefty on the DB.
>
> > I'm inclined to add a sample rate to auto_explain so that it can trigger
> > only on x percent of queries,
>
> That sounds reasonable ...
>

Cool, I'll cook that up then. Thanks for the sanity check.


> > and also add a sample test hook that can be
> > used to target statements of interest more narrowly (using a C hook
> > function).
>
> You'd have to be pretty desperate, *and* knowledgeable, to write a
> C function for that.  Can't we invent something a bit more user-friendly
> for the purpose?  No idea what it should look like though.
>

I've been that desperate.

For the majority of users I'm sure it's sufficient to just have a sample
rate.

Anything that's trying to match individual queries could be interested in
all sorts of different things. Queries that touch a particular table being
one of the more obvious things, or queries that mention a particular
literal. Rather than try to design something complicated in advance that
anticipates all needs, I'm thinking it makes sense to just throw a hook in
there. If some patterns start to emerge in terms of useful real world
filtering criteria then that'd better inform any more user accessible
design down the track.
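
To be clear about the sampling part, the per-query decision itself is
trivial; a tiny standalone sketch of just that decision (the GUC name and
the ~1% figure are placeholders, not a proposal):

/*
 * Standalone illustration of the sampling decision only (not auto_explain
 * code): decide once per top-level query whether to collect and log the
 * plan, so the ANALYZE overhead is paid for roughly sample_rate of queries.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int
main(void)
{
	double	sample_rate = 0.01;		/* hypothetical auto_explain.sample_rate */
	int		logged = 0;
	int		total = 100000;

	srand((unsigned) time(NULL));
	for (int i = 0; i < total; i++)
	{
		/* the cheap decision, taken at ExecutorStart time in the real thing */
		if ((double) rand() / RAND_MAX < sample_rate)
			logged++;
	}
	printf("would have logged %d of %d queries (~%.2f%%)\n",
		   logged, total, 100.0 * logged / total);
	return 0;
}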


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] pg_xlog -> pg_xjournal?

2015-06-02 Thread Joel Jacobson
On Tue, Jun 2, 2015 at 6:35 AM, Michael Nolan  wrote:
> Why not take a simpler approach and create a zero length file in directories
> that should not be fiddled with by non-experts using a file name something
> like "DO.NOT.DELETE.THESE.FILES"?

Then the smart sysadmin will say "but I didn't delete any *files*, I
just deleted a log *directory*" :-)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [NOVICE] psql readline Tab insert tab

2015-06-02 Thread Albe Laurenz
Peter Eisentraut wrote:
> On 6/1/15 7:00 AM, Albe Laurenz wrote:
>>>> Hans Ginzel wrote:
>>>>> how to make psql (readline) insert a tab when Tab is pressed? E.g. for
>>>>> pasting. I know there is the -n option. But then the history is not
>>>>> accessible.
>>>>
>>>> It could be done by adding the following lines to your ~/.inputrc file:
>>>>
>>>> $if Psql
>>>> TAB: tab-insert
>>>> $endif

>>> Great! Thank you very much. Could this be added as note to the -n option
>>> of the page http://www.postgresql.org/docs/current/static/app-psql.html,
>>> please?

>> Would that be worth an addition to the documentation?

> There is already a section about this on the psql man page.  (Look for
> "disable-completion".)

Right, I didn't notice that.

Yours,
Laurenz Albe

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers