Re: [HACKERS] tablespaces inside $PGDATA considered harmful

2017-10-07 Thread Mark Kirkwood

On 26/09/17 20:44, Mark Kirkwood wrote:




$ pg_basebackup -D .
WARNING:  could not read symbolic link "pg_tblspc/space1": Invalid argument
pg_basebackup: directory "/data0/pgdata/11/pg_tblspc/space1" exists but is not empty

pg_basebackup: removing contents of data directory "."



Err - actually this example is wrong, sorry. In fact pg_basebackup is 
complaining because it does not want to overwrite the contents of the 
tablespace (I need to use the -T option, since I'm on the same host)!
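
For the record, a sketch of that same-host workaround using -T / 
--tablespace-mapping (target paths illustrative):

$ pg_basebackup -D /data0/backup/11 \
    -T /data0/pgdata/11/pg_tblspc/space1=/data0/backup/space1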


A correct example of pg_basebackup failing due to tablespaces inside 
$PGDATA/pg_tblspc can be easily demonstrated by trying to set up 
streaming replication on another host:


$ pg_basebackup -h 10.0.119.100 -P -D .
WARNING:  could not read symbolic link "pg_tblspc/space1": Invalid argument
pg_basebackup: could not create directory "./pg_tblspc": File exists

Fortunately this can be worked around by changing to tar format:

$ pg_basebackup -h 10.0.119.100 -Ft -P -D .
WARNING:  could not read symbolic link "pg_tblspc/space1": Invalid argument
1560632/1560632 kB (100%), 2/2 tablespaces
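
In tar format each tablespace is written out as its own <oid>.tar alongside 
base.tar, so the backup directory ends up looking something like this 
(OID and pg_wal.tar presence illustrative):

$ ls
16385.tar  base.tar  pg_wal.tar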

...however, it is still not great that plain mode is busted.

regards

Mark






Re: [HACKERS] Bug with pg_basebackup and 'shared' tablespace

2017-09-30 Thread Mark Kirkwood



On 30/09/17 06:43, Robert Haas wrote:

On Fri, Sep 29, 2017 at 2:06 AM, Michael Paquier wrote:

My tendency about this patch is still that it should be rejected. This
is presenting additional handling for no real gain.

I vehemently disagree.  If the server lets you create a tablespace,
then everything that happens after that ought to work.

On another thread, there is the issue that if you create a tablespace
inside $PGDATA, things break.  We should either unbreak those things
or not allow creating the tablespace in the first place.  On this
thread, there is the issue that if you create two tablespaces for
different PG versions in the same directory, things break.  We should
either unbreak those things or not allow creating the tablespace in
the first place.

It is completely awful behavior to let users do things and then punish
them later for having done them.  Users are not obliged to read the
minds of the developers and guess what things the developers consider
"reasonable".  They should be able to count on the principle that if
they do something that we consider wrong, they'll get an error when
they try to do it -- not have it superficially appear to work and then
explode later.

To put that another way, there should be ONE rule about what is or is
not allowable in a particular situation, and all commands, utilities,
etc. that deal with that situation should handle it in a uniform
fashion.  Each .c file shouldn't get to make up its own notion of what
is or is not supported.



+1

It seems simply wrong (and potentially dangerous) to allow users to 
arrange a system state that cannot be backed up easily, or without 
manual surgery.


Also, the customer concerned, who did exactly that, opened the 
conversation about it with me like this (paraphrasing): 'So this 
pg_basebackup thing is a bit temperamental...'. I'm thinking we do not 
want to be giving users this impression.


regards

Mark




Re: [HACKERS] tablespaces inside $PGDATA considered harmful

2017-09-26 Thread Mark Kirkwood

On 29/04/15 09:35, Bruce Momjian wrote:


On Fri, Apr 24, 2015 at 01:05:03PM -0400, Bruce Momjian wrote:

This way, both pg_dump and pg_upgrade will issue warnings, though, of
course, those warnings can be ignored.  I am hopeful these two warnings
will be sufficient and we will not need to make these errors, with the
possible inconvenience it will cause.  I am still afraid that someone
will ignore the new errors pg_dump would generate and lose data.  I just
don't remember enough cases where we threw new errors on _data_ restore.

Frankly, those using pg_upgrade already will have to move the old
tablespaces out of the old cluster if they ever want to delete those
clusters, so I am hopeful these additional warnings will help eliminate
this practice, which is already cumbersome and useless.  I am not
planning to revisit this for 9.6.




(resurrecting an old thread) I encountered this the other day: a 
customer had created tablespaces with directories inside 
$PGDATA/pg_tblspc. This is just pathological - e.g. (v11 checkout with 
PGDATA=/data0/pgdata/11):


bench=# CREATE TABLESPACE space1 LOCATION 
'/data0/pgdata/11/pg_tblspc/space1';

WARNING:  tablespace location should not be inside the data directory
CREATE TABLESPACE
bench=# ALTER TABLE pgbench_accounts SET TABLESPACE space1;
ALTER TABLE

Ok, so I've been warned:

$ pg_basebackup -D .
WARNING:  could not read symbolic link "pg_tblspc/space1": Invalid argument
pg_basebackup: directory "/data0/pgdata/11/pg_tblspc/space1" exists but 
is not empty

pg_basebackup: removing contents of data directory "."

So pg_basebackup is completely broken by this construction - should we 
not prohibit the creation of tablespace directories under $PGDATA (or at 
least $PGDATA/pg_tblspc) at this point?


regards

Mark






Re: [HACKERS] MAIN, Uncompressed?

2017-08-25 Thread Mark Kirkwood

On 26/08/17 12:18, Simon Riggs wrote:


On 25 August 2017 at 20:53, Tom Lane  wrote:

Greg Stark  writes:

I think this is a particularly old piece of code and we're lucky the
default heuristics have served well for all this time because I doubt
many people fiddle with these storage attributes. The time may have
come to come up with a better UI for the storage attributes because
people are doing new things (like json) and wanting more control over
this heuristic.

Yeah, I could get behind a basic rethinking of the tuptoaster control
knobs.  I'm just not in love with Simon's specific proposal, especially
not if we can't think of a better name for it than "MAINU".

Extended/External would be just fine if you could set the toast
target, so I think a better suggestion would be to make "toast_target"
a per-attribute option.



+1, have thought about this myself previously...thank you for bringing 
it up!
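
To make the idea concrete, the proposed knob might look something like 
this (hypothetical syntax - no such per-attribute option exists today, 
so the name and placement here are illustrative only):

-- hypothetical: per-column toast target, per the suggestion above
ALTER TABLE blobs ALTER COLUMN payload SET (toast_target = 512);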


regards

Mark




Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

2017-07-20 Thread Mark Kirkwood

On 21/07/17 15:58, Joshua D. Drake wrote:


On 07/19/2017 07:57 PM, Tom Lane wrote:

Peter Geoghegan  writes:

My argument for the importance of index bloat to the more general
bloat problem is simple: any bloat that accumulates, that cannot be
cleaned up, will probably accumulate until it impacts performance
quite noticeably.


But that just begs the question: *does* it accumulate indefinitely, or
does it eventually reach a more-or-less steady state?  The traditional
wisdom about btrees, for instance, is that no matter how full you pack
them to start with, the steady state is going to involve something like
1/3rd free space.  You can call that bloat if you want, but it's not
likely that you'll be able to reduce the number significantly without
paying exorbitant costs.

I'm not claiming that we don't have any problems, but I do think it's
important to draw a distinction between bloat and normal operating
overhead.


Agreed, but I don't think we are talking about 30% here. This is where I 
am at - it took until 30 minutes ago for the tests to finish:


                name                 |  setting
-------------------------------------+-----------
 autovacuum                          | on
 autovacuum_analyze_scale_factor     | 0.1
 autovacuum_analyze_threshold        | 50
 autovacuum_freeze_max_age           | 200000000
 autovacuum_max_workers              | 3
 autovacuum_multixact_freeze_max_age | 400000000
 autovacuum_naptime                  | 60
 autovacuum_vacuum_cost_delay        | 20
 autovacuum_vacuum_cost_limit        | -1
 autovacuum_vacuum_scale_factor      | 0.2
 autovacuum_vacuum_threshold         | 50
 autovacuum_work_mem                 | -1
 log_autovacuum_min_duration         | -1
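
(Presumably gathered with a query along these lines:)

SELECT name, setting
FROM pg_settings
WHERE name LIKE '%autovacuum%'
ORDER BY name;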


Test 1: 55G  /srv/main
TPS: 955

Test 2: 112G /srv/main
TPS: 531 (Not sure what happened here, long checkpoint?)

Test 3: 109G /srv/main
TPS: 868

Test 4: 143G
TPS: 840

Test 5: 154G
TPS: 722

I am running the query here:

https://wiki.postgresql.org/wiki/Index_Maintenance#Summarize_keyspace_of_a_B-Tree_index 



And will post a followup. Once the query finishes I am going to launch 
the tests with autovacuum_vacuum_cost_limit of 5000. Is there anything 
else you folks would like me to change?







I usually advise setting autovacuum_naptime = 10s (or even 5s) for 
workloads that do a lot of updates (or inserts + deletes) - as on modern 
HW a lot of churn can happen in 1 minute, and that just makes vacuum's 
job harder.
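
For example, one way to apply that, via ALTER SYSTEM:

ALTER SYSTEM SET autovacuum_naptime = '10s';
SELECT pg_reload_conf();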


regards
Mark





Re: [HACKERS] New partitioning - some feedback

2017-07-16 Thread Mark Kirkwood

On 16/07/17 05:24, David Fetter wrote:


On Fri, Jul 14, 2017 at 09:49:25PM -0500, Robert Haas wrote:

On Mon, Jul 10, 2017 at 5:46 PM, David Fetter  wrote:

With utmost respect, it's less messy than adding '!' to the already
way too random and mysterious syntax of psql's \ commands.  What
should '\det!' mean?  What about '\dT!'?

Since \det lists foreign tables, \det! would list foreign tables even
if they are partitions.  Plain \det would show only the ones that are
not partitions.

\dT! wouldn't be meaningful, since \dT lists data types and data types
can't be partitions.  If you're trying to conjure up a rule that every
\d command must accept the same set of modifiers, a quick
look at the output of \? and a little experimentation will quickly
show you that neither S nor + apply to all command types, so I see no
reason why that would need to be true for a new modifier either.

TBH, I think we should just leave this well enough alone.  We're
post-beta2 now, there's no clear consensus on what to do here, and
there will be very little opportunity for users to give us feedback if
we stick a change into an August beta3 before a September final
release.

I think a new modifier would be too rushed at this stage, but there's
no reason to throw out the progress on \d vs \dt.




+1

And similarly, there seemed to be a reasonably clear push to label the 
'partitions' as such.


regards

Mark




Re: [HACKERS] New partitioning - some feedback

2017-07-07 Thread Mark Kirkwood

On 07/07/17 19:54, Michael Banck wrote:


On Fri, Jul 07, 2017 at 07:40:55PM +1200, Mark Kirkwood wrote:

On 07/07/17 13:29, Amit Langote wrote:

Someone complained about this awhile back [1].  And then it came up again
[2], where Noah appeared to take a stance that partitions should be
visible in views / output of commands that list "tables".

Although I too tend to prefer not filling up the \d output space by
listing partitions (pg_class.relispartition = true relations), there
wasn't perhaps enough push for creating a patch for that.  If some
committer is willing to consider such a patch, I can make one.

Yeah, me too (clearly). However if the consensus is that all these partition
tables *must* be shown in \d output, then I'd be happy if they were
identified as such rather than just 'table' (e.g. 'partition table').

+1.

Or maybe just 'partition' is enough if 'partition table' would widen the
column output unnecessarily.




Yeah, that is probably better (and 'partition table' is potentially 
confusing as Robert pointed out).


Cheers

Mark





Re: [HACKERS] New partitioning - some feedback

2017-07-07 Thread Mark Kirkwood

On 07/07/17 13:29, Amit Langote wrote:



Someone complained about this awhile back [1].  And then it came up again
[2], where Noah appeared to take a stance that partitions should be
visible in views / output of commands that list "tables".

Although I too tend to prefer not filling up the \d output space by
listing partitions (pg_class.relispartition = true relations), there
wasn't perhaps enough push for creating a patch for that.  If some
committer is willing to consider such a patch, I can make one.




Yeah, me too (clearly). However if the consensus is that all these 
partition tables *must* be shown in \d output, then I'd be happy if they 
were identified as such rather than just 'table' (e.g. 'partition table').


regards

Mark




[HACKERS] New partitioning - some feedback

2017-07-06 Thread Mark Kirkwood
I've been trying out the new partitioning in version 10. Firstly, I must 
say this is excellent - so much nicer than the old inheritance-based method!


My only niggle is the display of partitioned tables via \d etc., e.g.:

part=# \d
                 List of relations
 Schema |         Name         | Type  |  Owner
--------+----------------------+-------+----------
 public | date_fact            | table | postgres
 public | date_fact_201705     | table | postgres
 public | date_fact_201706     | table | postgres
 public | date_fact_20170601   | table | postgres
 public | date_fact_2017060100 | table | postgres
 public | date_fact_201707     | table | postgres
 public | date_fact_rest       | table | postgres
(7 rows)

Now it can be inferred from the names that date_fact is a partitioned 
table and the various date_fact_* relations are its partitions - but \d 
is not providing any hints of this. The more detailed individual 
describe is fine:


part=# \d date_fact
                      Table "public.date_fact"
 Column |           Type           | Collation | Nullable | Default
--------+--------------------------+-----------+----------+---------
 id     | integer                  |           | not null |
 dte    | timestamp with time zone |           | not null |
 val    | integer                  |           | not null |
Partition key: RANGE (dte)
Number of partitions: 6 (Use \d+ to list them.)

I'd prefer *not* to see a table and its partitions all intermixed in the 
same display (especially with nothing indicating which are partitions) - 
as this will make for unwieldy long lists when tables have many 
partitions. Also it would be good if the 'main' partitioned table and 
its 'partitions' showed up as a different type in some way.


I note that they do in pg_class:

part=# SELECT relname, relkind, relispartition FROM pg_class
       WHERE relname LIKE 'date_fact%';

       relname        | relkind | relispartition
----------------------+---------+----------------
 date_fact            | p       | f
 date_fact_201705     | r       | t
 date_fact_201706     | r       | t
 date_fact_20170601   | r       | t
 date_fact_2017060100 | r       | t
 date_fact_201707     | r       | t
 date_fact_rest       | r       | t
(7 rows)

...so it looks to be possible to hide the partitions from the main 
display and/or mark them as such. Now I realize that making this comment 
now that beta is out is a bit annoying - apologies, but I think seeing a 
huge list of 'tables' is going to make \d frustrating for folk doing 
partitioning.
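
e.g. a filter along these lines would hide partitions from such a 
listing (just a sketch - psql's real \d query is considerably more 
involved):

SELECT n.nspname AS schema, c.relname AS name
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r', 'p')
  AND NOT c.relispartition
  AND n.nspname = 'public';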


regards

Mark





Re: [HACKERS] logical replication: \dRp+ and "for all tables"

2017-06-14 Thread Mark Kirkwood

On 15/06/17 11:10, Tom Lane wrote:


Jeff Janes  writes:

On Sat, Jun 10, 2017 at 7:42 AM, Tom Lane  wrote:

In the second place, this really fails to respond to what I'd call
the main usability problem with \dRp+, which is that the all-tables
property is likely to lead to an unreadably bulky list of affected tables.
What I'd say the patch ought to do is *replace* \dRp+'s list of affected
tables with a notation like "(all tables)" when puballtables is true.

I'd considered that, but I find the pager does a fine job of dealing with
the bulkiness of the list.

Have you tried it with a few tens of thousands of tables?  Even if your
pager makes it work comfortably, others might find it less satisfactory.


I thought it might be a good idea to not only
point out that it is all tables, but also remind people of exactly what
tables those are currently (in case it had slipped their mind that all
tables will include table from other schemas not in their search_path, for
example)

I'm not really buying that.  If they don't know what "all tables" means,
a voluminous list isn't likely to help much.

I was hoping we'd get some more votes in this thread, but it seems like
we've only got three, and by my count two of them are for just printing
"all tables".




I'd certainly prefer to see 'all tables' - in addition to being more 
compact, it also reflects more correctly how the publication was defined.
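
e.g. a publication created as:

CREATE PUBLICATION alltables FOR ALL TABLES;

...is (arguably) better summarized by '(all tables)' than by enumerating 
whatever tables happen to exist at display time.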


regards

Mark





Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?

2017-06-10 Thread Mark Kirkwood

On 06/06/17 10:12, Gavin Flower wrote:


On 06/06/17 09:41, Mark Kirkwood wrote:

On 05/06/17 09:30, Tom Lane wrote:


I've been thinking about the behavior discussed in
https://www.postgresql.org/message-id/flat/20170522132017.29944.48391%40wrigleys.postgresql.org
and it seems to me that there are a couple of things we ought to do about
it.

First, I think we need a larger hard floor on the number of occurrences
of a value that're required to make ANALYZE decide it is a "most common
value".  The existing coding is willing to believe that anything that
appears at least twice in the sample is a potential MCV, but that design
originated when we were envisioning stats samples of just a few thousand
rows --- specifically, default_statistics_target was originally just 10,
leading to a 3000-row sample size.  So accepting two-appearance values as
MCVs would lead to a minimum MCV frequency estimate of 1/1500.  Now it
could be a tenth or a hundredth of that.

As a round number, I'm thinking that a good floor would be a frequency
estimate of 1/1000.  With today's typical sample size of 30000 rows,
a value would have to appear at least 30 times in the sample to be
believed to be an MCV.  That seems like it gives us a reasonable margin
of error against the kind of sampling noise seen in the above-cited
thread.

Second, the code also has a rule that potential MCVs need to have an
estimated frequency at least 25% larger than what it thinks the "average"
value's frequency is.  A rule of that general form seems like a good idea,
but I now think the 25% threshold is far too small to do anything useful.
In particular, in any case like this where there are more distinct values
than there are sample rows, the "average frequency" estimate will
correspond to less than one occurrence in the sample, so that this rule is
totally useless to filter anything that we would otherwise consider as an
MCV.  I wonder if we shouldn't make it be "at least double the estimated
average frequency".



Or possibly calculate the sample standard deviation and make use of 
that to help decide on a more flexible cutoff than twice the avg 
frequency?


Are there any research papers that might help us here (I'm drowning 
in a sea of barely relevant search results for most phrases I've 
tried so far)? I recall there were some that Tom referenced when this 
stuff was originally written.


On the other hand I do have access to some mathematicians 
specializing in statistics - so can get their thoughts on this issue 
if you feel it would be worthwhile.


Cheers

Mark


The standard deviation (sd) is proportional to the square root of the 
number in the sample in a Normal Distribution.


In a Normal Distribution, about 2/3 the values will be within plus or 
minus one sd of the mean.


There seems to be an implicit assumption that the distribution of 
values follows the Normal Distribution - has this been verified? I 
suspect that real data will have a skewed distribution of values, and 
may even be multi-modal (multiple peaks).  The Normal Distribution has 
one central peak with 2 tails of the same shape & size.


So in a sample of 100 the sd is proportional to 10%,
for 10,000 the sd is proportional to 1%.

So essentially, the larger the sample size the more reliable is 
knowledge of the most common values (ignoring pathologically extreme 
distributions!) - the measure of reliability depends on the distribution.


How about selecting the cut off as the mean plus one sd, or something 
of that nature?  Note that the cut off point may result in no mcv's 
being selected - especially for small samples.


If practicable, it would be good to sample real datasets. Suggest 
looking at datasets where the current mechanism looks reasonable, and 
ones where the estimates are too far off.  Also, if possible, try any 
new selection method on the datasets and see what the difference is.


The above is based on what I remember from my university statistics 
papers; I took it up to 4th year level many moons ago.






With respect to the assumption of Normal distribution - it is easy to 
come up with an example that shows it need not hold: consider our 
pgbench 'accounts' table. The distribution of 'bid' values is not Normal 
(it is Uniform).


Now I realize I made people think about Normal distributions by 
mentioning computing the standard deviation of the sample (I probably 
should have said 'standard deviation of the *sample frequencies*') - but 
this was only a suggestion for working out how to (maybe) be less 
arbitrary about how we decide which values to put in the MCV list 
(currently 25% different from the mean frequency).
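
For anyone wanting to eyeball what ANALYZE currently decides for the 
pgbench example, the MCV list and frequencies are visible in pg_stats, 
e.g.:

SELECT most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'pgbench_accounts' AND attname = 'bid';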


regards

Mark





Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?

2017-06-05 Thread Mark Kirkwood

On 05/06/17 09:30, Tom Lane wrote:


I've been thinking about the behavior discussed in
https://www.postgresql.org/message-id/flat/20170522132017.29944.48391%40wrigleys.postgresql.org
and it seems to me that there are a couple of things we ought to do about
it.

First, I think we need a larger hard floor on the number of occurrences
of a value that're required to make ANALYZE decide it is a "most common
value".  The existing coding is willing to believe that anything that
appears at least twice in the sample is a potential MCV, but that design
originated when we were envisioning stats samples of just a few thousand
rows --- specifically, default_statistics_target was originally just 10,
leading to a 3000-row sample size.  So accepting two-appearance values as
MCVs would lead to a minimum MCV frequency estimate of 1/1500.  Now it
could be a tenth or a hundredth of that.

As a round number, I'm thinking that a good floor would be a frequency
estimate of 1/1000.  With today's typical sample size of 30000 rows,
a value would have to appear at least 30 times in the sample to be
believed to be an MCV.  That seems like it gives us a reasonable margin
of error against the kind of sampling noise seen in the above-cited
thread.

Second, the code also has a rule that potential MCVs need to have an
estimated frequency at least 25% larger than what it thinks the "average"
value's frequency is.  A rule of that general form seems like a good idea,
but I now think the 25% threshold is far too small to do anything useful.
In particular, in any case like this where there are more distinct values
than there are sample rows, the "average frequency" estimate will
correspond to less than one occurrence in the sample, so that this rule is
totally useless to filter anything that we would otherwise consider as an
MCV.  I wonder if we shouldn't make it be "at least double the estimated
average frequency".



Or possibly calculate the sample standard deviation and make use of that 
to help decide on a more flexible cutoff than twice the avg frequency?


Are there any research papers that might help us here (I'm drowning in a 
sea of barely relevant search results for most phrases I've tried so 
far)? I recall there were some that Tom referenced when this stuff was 
originally written.


On the other hand I do have access to some mathematicians specializing 
in statistics - so can get their thoughts on this issue if you feel it 
would be worthwhile.


Cheers

Mark




Re: [HACKERS] logical replication - still unstable after all these months

2017-06-04 Thread Mark Kirkwood



On 05/06/17 13:08, Mark Kirkwood wrote:

On 05/06/17 00:04, Erik Rijkers wrote:


On 2017-05-31 16:20, Erik Rijkers wrote:

On 2017-05-31 11:16, Petr Jelinek wrote:
[...]
Thanks to Mark's offer I was able to study the issue as it happened and
found the cause of this.

[0001-Improve-handover-logic-between-sync-and-apply-worker.patch]


This looks good:

-- out_20170531_1141.txt
100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n   --  scale 25
100 -- All is well.

So this is 100x a 1-minute test with 100x success. (This on the most
fastidious machine (slow disks, meagre specs) that used to give 15%
failures)


[Improve-handover-logic-between-sync-and-apply-worker-v2.patch]

No errors after (several days of) running variants of this. (2500x 1 
minute runs; 12x 1-hour runs)


Same here, no errors with the v2 patch applied (approx 2 days - all 1 
minute runs)




Further, reapplying the v1 patch (with a bit of editing, as I wanted to 
apply it to my current master) gets a failure with missing rows in the 
history table quite quickly. I'll put back the v2 patch and resume runs 
with that; I'm cautiously optimistic that the v2 patch solves the issue.


regards

Mark





Re: [HACKERS] logical replication - still unstable after all these months

2017-06-04 Thread Mark Kirkwood

On 05/06/17 00:04, Erik Rijkers wrote:


On 2017-05-31 16:20, Erik Rijkers wrote:

On 2017-05-31 11:16, Petr Jelinek wrote:
[...]

Thanks to Mark's offer I was able to study the issue as it happened and
found the cause of this.

[0001-Improve-handover-logic-between-sync-and-apply-worker.patch]


This looks good:

-- out_20170531_1141.txt
100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n   --  scale 25
100 -- All is well.

So this is 100x a 1-minute test with 100x success. (This on the most
fastidious machine (slow disks, meagre specs) that used to give 15%
failures)


[Improve-handover-logic-between-sync-and-apply-worker-v2.patch]

No errors after (several days of) running variants of this. (2500x 1 
minute runs; 12x 1-hour runs)


Same here, no errors with the v2 patch applied (approx 2 days - all 1 
minute runs)


regards

Mark




Re: [HACKERS] logical replication - still unstable after all these months

2017-06-02 Thread Mark Kirkwood

On 03/06/17 11:10, Petr Jelinek wrote:


On 02/06/17 22:29, Petr Jelinek wrote:

On 02/06/17 08:55, Mark Kirkwood wrote:

On 02/06/17 17:11, Erik Rijkers wrote:


On 2017-06-02 00:46, Mark Kirkwood wrote:

On 31/05/17 21:16, Petr Jelinek wrote:

I'm seeing a new failure with the patch applied - this time the
history table has missing rows. Petr, I'll put back your access :-)

Is this error during 1-minute runs?

I'm asking because I've moved back to longer (1-hour) runs (no errors
so far), and I'd like to keep track of what the most 'vulnerable'
parameters are.


Yeah, still using your test config (with my minor modifications).

When I got the error the 1st time, I did a complete make clean and
rebuild...but it is still possible I've 'done it wrong' - so
independent confirmation would be good!

Well, I've seen this issue as well while I was developing the fix, but
the patch I proposed fixed it for me as well as the original issue.


While I was testing something for a different thread I noticed that I
manage transactions incorrectly in this patch. Fixed here, I didn't test
it much yet (it takes a while as you know :) ). Not sure if it's related
to the issue you've seen though.


Ok - I've applied this version, and am running tests again. I needed to 
do a git pull to apply the patch, so I am getting some other changes too!


Cheers

Mark





Re: [HACKERS] logical replication - still unstable after all these months

2017-06-02 Thread Mark Kirkwood

On 02/06/17 17:11, Erik Rijkers wrote:


On 2017-06-02 00:46, Mark Kirkwood wrote:

On 31/05/17 21:16, Petr Jelinek wrote:

I'm seeing a new failure with the patch applied - this time the
history table has missing rows. Petr, I'll put back your access :-)


Is this error during 1-minute runs?

I'm asking because I've moved back to longer (1-hour) runs (no errors 
so far), and I'd like to keep track of what the most 'vulnerable' 
parameters are.




Yeah, still using your test config (with my minor modifications).

When I got the error the 1st time, I did a complete make clean and 
rebuild...but it is still possible I've 'done it wrong' - so 
independent confirmation would be good!


Cheers

Mark




Re: [HACKERS] logical replication - still unstable after all these months

2017-06-01 Thread Mark Kirkwood

On 31/05/17 21:16, Petr Jelinek wrote:


On 29/05/17 23:06, Mark Kirkwood wrote:

On 29/05/17 23:14, Petr Jelinek wrote:


On 29/05/17 03:33, Jeff Janes wrote:


What would you want to look at?  Would saving the WAL from the master be
helpful?


Useful info is, logs from provider (mainly the logical decoding logs
that mention LSNs), logs from subscriber (the lines about when sync
workers finished), contents of the pg_subscription_rel (with srrelid
casted to regclass so we know which table is which), and pg_waldump
output around the LSNs found in the logs and in the pg_subscription_rel
(+ few lines before and some after to get context). It's enough to only
care about LSNs for the table(s) that are out of sync.


I have a run that aborted with failure (accounts table md5 mismatch).
Petr - would you like to have access to the machine? If so send me your
public key and I'll set it up.

Thanks to Mark's offer I was able to study the issue as it happened and
found the cause of this.

The busy loop in apply stops at the point when worker shmem state
indicates that table synchronization was finished, but that might not be
visible in the next transaction if it takes long to flush the final
commit to disk so we might ignore couple of transactions for given table
in the main apply because we think it's still being synchronized. This
also explains why I could not reproduce it on my testing machine (fast
ssd disk array where flushes were always fast) and why it happens
relatively rarely because it's one specific commit during the whole
synchronization process that needs to be slow.

So as solution I changed the busy loop in the apply to wait for
in-catalog status rather than in-memory status to make sure things are
really there and flushed.

While working on this I realized that the handover itself is a bit more
complex than necessary (especially for debugging and for other people
understanding it) so I did some small changes as part of this patch to
make the sequences of states table goes during synchronization process
to always be the same. This might cause unnecessary update per one table
synchronization process in some cases but that seems like small enough
price to pay for clearer logic. And it also fixes another potential bug
that I identified where we might write wrong state to catalog if main
apply crashed while sync worker was waiting for status update.

I've been running tests on this overnight on another machine where I was
able to reproduce the original issue within a few runs (once I found what
causes it) and so far looks good.





I'm seeing a new failure with the patch applied - this time the history 
table has missing rows. Petr, I'll put back your access :-)


regards

Mark





Re: [HACKERS] logical replication - still unstable after all these months

2017-05-29 Thread Mark Kirkwood

On 29/05/17 23:14, Petr Jelinek wrote:


On 29/05/17 03:33, Jeff Janes wrote:


What would you want to look at?  Would saving the WAL from the master be
helpful?


Useful info is, logs from provider (mainly the logical decoding logs
that mention LSNs), logs from subscriber (the lines about when sync
workers finished), contents of the pg_subscription_rel (with srrelid
casted to regclass so we know which table is which), and pg_waldump
output around the LSNs found in the logs and in the pg_subscription_rel
(+ few lines before and some after to get context). It's enough to only
care about LSNs for the table(s) that are out of sync.



I have a run that aborted with failure (accounts table md5 mismatch). 
Petr - would you like to have access to the machine? If so send me your 
public key and I'll set it up.


Cheers

Mark




Re: [HACKERS] logical replication - still unstable after all these months

2017-05-28 Thread Mark Kirkwood



On 29/05/17 13:33, Jeff Janes wrote:
On Sun, May 28, 2017 at 3:17 PM, Mark Kirkwood
<mark.kirkw...@catalyst.net.nz> wrote:


On 28/05/17 19:01, Mark Kirkwood wrote:


So running in cloud land now...so far no errors - will update.




The framework ran 600 tests last night, and I see 3 'NOK' results,
i.e. 3 failed test runs (all scale 25 and 8 pgbench clients). Given
the way the test decides on failure (gets tired of waiting for the
table md5's to match) - it begs the question 'What if it had
waited a bit longer'? However from what I can see in all cases:

- the rowcounts were the same in master and replica
- the md5 of pgbench_accounts was different


All four tables should be wrong if there is still a transaction it is 
waiting for, as all the changes happen in a single transaction.




Yeah, my thought exactly.

I also got a failure, after 87 iterations of a similar test case.  It 
waited for hours, as mine requires manual intervention to stop 
waiting.  On the subscriber, one account still had a zero balance, 
while the history table on the subscriber agreed with both history and 
accounts on the publisher and the account should not have been zero, 
so definitely a transaction atomicity got busted.




Interesting, great that we are seeing the same thing.

I altered the script to also save the tellers and branches tables and 
repeated the runs, but so far it hasn't failed again in over 800 
iterations using the altered script.



...so does seem possible that there is some bug being tickled
here. Unfortunately the test framework blasts away the failed
tables and subscription and continues on...I'm going to amend it
to stop on failure so I can have a closer look at what happened.


What would you want to look at?  Would saving the WAL from the master 
be helpful?





Good question - I initially wanted to see if anything changed if I 
waited longer (which you have already figured out) and what was actually 
wrong with the accounts record (which you have gotten to as well)! Nice. 
After that, I'd thought of doing another transaction on an accounts 
record that lives in the same page as the 'gammy' one on the master - 
generally poke about and see what is happening :-)


regards

Mark




Re: [HACKERS] logical replication - still unstable after all these months

2017-05-28 Thread Mark Kirkwood

On 29/05/17 16:26, Erik Rijkers wrote:


On 2017-05-29 00:17, Mark Kirkwood wrote:

On 28/05/17 19:01, Mark Kirkwood wrote:


So running in cloud land now...so far no errors - will update.


The framework ran 600 tests last night, and I see 3 'NOK' results, i.e.
3 failed test runs (all scale 25 and 8 pgbench clients). Given the way


Could you also give the params for the successful runs?


Scale 25, clients 90 and 64; scale 5, clients 90, 64 and 8 (i.e. the 
defaults in the driver script).


Can you say anything about hardware?  (My experience is that older, 
slower, 'worse' hardware makes for more fails.)


It's running in an openstack cloud (so is a libvirt guest): 4 cpus, 4 GB 
ram and 2 disks: one for each Postgres instance, both formatted xfs. Hmm 
so maybe I should run a VM on my workstation and crank the IOPS limit 
way down...in the meantime I'll just let it run :-)



Many thanks, by the way.  I'm glad that it turns out I'm probably not 
doing something uniquely stupid (although I'm not glad that there 
seems to be a bug, and an elusive one at that)





Yeah, looks like something subtle :-( Hopefully now it's out in the open 
we'll all figure it out together!


regards

Mark




Re: [HACKERS] logical replication - still unstable after all these months

2017-05-28 Thread Mark Kirkwood

On 28/05/17 19:01, Mark Kirkwood wrote:



So running in cloud land now...so far no errors - will update.





The framework ran 600 tests last night, and I see 3 'NOK' results, i.e. 3 
failed test runs (all scale 25 and 8 pgbench clients). Given the way the 
test decides on failure (gets tired of waiting for the table md5's to 
match) - it begs the question 'What if it had waited a bit longer'? 
However from what I can see in all cases:


- the rowcounts were the same in master and replica
- the md5 of pgbench_accounts was different

...so does seem possible that there is some bug being tickled here. 
Unfortunately the test framework blasts away the failed tables and 
subscription and continues on...I'm going to amend it to stop on failure 
so I can have a closer look at what happened.


regards

Mark




Re: [HACKERS] logical replication - still unstable after all these months

2017-05-28 Thread Mark Kirkwood

On 27/05/17 20:30, Erik Rijkers wrote:




Here is what I have:

instances.sh:
  starts up 2 assert enabled sessions

instances_fast.sh:
  alternative to instances.sh
  starts up 2 assert disabled 'fast' sessions

testset.sh
  loop to call pgbench_derail2.sh with varying params

pgbench_derail2.sh
  main test program
  can be called 'standalone'
./pgbench_derail2.sh $scale $clients $duration $date_str
  so for instance this should work:
./pgbench_derail2.sh 25 64 60 20170527_1019
  to remove publication and subscription from sessions, add a 5th 
parameter 'clean'

./pgbench_derail2.sh 1 1 1 1 'clean'

pubsub.sh
  displays replication state. also called by pgbench_derail2.sh
  must be in path

result.sh
  display results
  I keep this in a screen-session as:
  watch -n 20 './result.sh 201705'


Peculiar to my setup also:
  server version at compile time stamped with date + commit hash
  I misuse information_schema.sql_packages  at compile time to store 
patch information

  instances are in $pg_stuff_dir/pg_installations/pgsql.

So you'll have to outcomment a line here and there, and adapt paths, 
ports, and things like that.


It's a bit messy, I should have used perl from the beginning...



Considering it is all shell - pretty nice! I spent a bit of time today 
getting this working on a vanilla Ubuntu 16.04 cloud server. I found a 
few things that didn't work (suspect Erik has some default env variables 
set for ports and databases). These were sufficient to stop logical 
replication working for me at all - due to no dbname specified in the 
subscription connection.


Given I had to make some changes anyway, I moved all the config into one 
place (new file config.sh), made all the programs use /bin/bash as the 
interpreter (/bin/sh just does not work for these scripts on Ubuntu), 
added ports and databases as required, and fixed the need to mess too 
much with PATH (see attached diff).


So running in cloud land now...so far no errors - will update.

regards

Mark
diff -Naur test.orig/config.sh test/config.sh
--- test.orig/config.sh	1970-01-01 12:00:00.0 +1200
+++ test/config.sh	2017-05-28 15:21:33.261701918 +1200
@@ -0,0 +1,13 @@
+#!/bin/bash
+port1=6972
+project1=logical_replication
+port2=6973
+project2=logical_replication2
+pg_stuff_dir=$HOME/pg_stuff
+PATH1=$pg_stuff_dir/pg_installations/pgsql.$project1/bin:$PATH
+PATH2=$pg_stuff_dir/pg_installations/pgsql.$project2/bin:$PATH
+server_dir1=$pg_stuff_dir/pg_installations/pgsql.$project1
+server_dir2=$pg_stuff_dir/pg_installations/pgsql.$project2
+data_dir1=$server_dir1/data
+data_dir2=$server_dir2/data
+db=bench
diff -Naur test.orig/instances_fast.sh test/instances_fast.sh
--- test.orig/instances_fast.sh	2017-05-28 15:18:33.315780487 +1200
+++ test/instances_fast.sh	2017-05-28 15:19:02.511439749 +1200
@@ -1,17 +1,10 @@
-#!/bin/sh
+#!/bin/bash
 
 #assertions on  in $pg_stuff_dir/pg_installations/pgsql./bin
 #assertions off in $pg_stuff_dir/pg_installations/pgsql./bin.fast
 
-port1=6972 project1=logical_replication
-port2=6973 project2=logical_replication2
-pg_stuff_dir=$HOME/pg_stuff
-PATH1=$pg_stuff_dir/pg_installations/pgsql.$project1/bin.fast:$PATH
-PATH2=$pg_stuff_dir/pg_installations/pgsql.$project2/bin.fast:$PATH
-server_dir1=$pg_stuff_dir/pg_installations/pgsql.$project1
-server_dir2=$pg_stuff_dir/pg_installations/pgsql.$project2
-data_dir1=$server_dir1/data
-data_dir2=$server_dir2/data
+. config.sh
+
 options1="
 -c wal_level=logical
 -c max_replication_slots=10
@@ -36,6 +29,6 @@
 -c log_filename=logfile.${project2} 
 -c log_replication_commands=on "
 
-export PATH=$PATH1; PG=$(which postgres); $PG -D $data_dir1 -p $port1 ${options1} &
-export PATH=$PATH2; PG=$(which postgres); $PG -D $data_dir2 -p $port2 ${options2} &
+export PATH=$PATH1; export PG=$(which postgres); $PG -D $data_dir1 -p $port1 ${options1} &
+export PATH=$PATH2; export PG=$(which postgres); $PG -D $data_dir2 -p $port2 ${options2} &
 
diff -Naur test.orig/instances.sh test/instances.sh
--- test.orig/instances.sh	2017-05-28 15:18:33.291780768 +1200
+++ test/instances.sh	2017-05-28 15:19:02.511439749 +1200
@@ -1,17 +1,9 @@
-#!/bin/sh
+#!/bin/bash
 
 #assertions on  in $pg_stuff_dir/pg_installations/pgsql./bin
 #assertions off in $pg_stuff_dir/pg_installations/pgsql./bin.fast
 
-port1=6972 project1=logical_replication
-port2=6973 project2=logical_replication2
-pg_stuff_dir=$HOME/pg_stuff
-PATH1=$pg_stuff_dir/pg_installations/pgsql.$project1/bin:$PATH
-PATH2=$pg_stuff_dir/pg_installations/pgsql.$project2/bin:$PATH
-server_dir1=$pg_stuff_dir/pg_installations/pgsql.$project1
-server_dir2=$pg_stuff_dir/pg_installations/pgsql.$project2
-data_dir1=$server_dir1/data
-data_dir2=$server_dir2/data
+. config.sh
 
 options1="
 -c wal_level=logical
diff -Naur test.orig/pgbench_derail2.sh test/pgbench_derail2.sh
--- test.orig/pgbench_derail2.sh	2017-05-28 15:18:33.363779926 +1200
+++ test/pgbench_derail2.sh	2017-05-28 15:19:02.511439749 +1200
@@ -11,11 +11,13 @@
 #I 

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-27 Thread Mark Kirkwood

Sorry - I see you have done this already.

On 28/05/17 11:15, Mark Kirkwood wrote:
Interesting - might be good to see your test script too (so we can 
better understand how you are deciding if the runs are successful or 
not).








Re: [HACKERS] logical replication - still unstable after all these months

2017-05-27 Thread Mark Kirkwood
Interesting - might be good to see your test script too (so we can 
better understand how you are deciding if the runs are successful or not).



Also, any idea which rows are different? If you want something out of 
the box that will do that for you see DBIx::Compare.


regards

Mark


On 28/05/17 04:12, Erik Rijkers wrote:


ok, ok...

( The thing is, I am trying to pre-digest the output but it takes time )

I can do this now: attached is some output that belongs with this group 
of 100 1-minute runs:


-- out_20170525_1426.txt
100 -- pgbench -c 64 -j 8 -T 60 -P 12 -n   --  scale 25
 82 -- All is well.
 18 -- Not good.

That is the worst set of runs of what I showed earlier.

that is:  out_20170525_1426.txt  and
2x18 logfiles that the 18 failed runs produced.
Those logfiles have names like:
logrep.20170525_1426.1436.1.scale_25.clients_64.NOK.log
logrep.20170525_1426.1436.2.scale_25.clients_64.NOK.log

.1.=primary
.2.=replica

Please disregard the errors around pg_current_wal_location().  (They were 
caused by some code to dump some wal into zipfiles, which obviously 
stopped working after the function was removed/renamed.) There are also 
some unimportant errors from the test-harness where I call with the 
wrong port.  Not interesting, I don't think.









Re: [HACKERS] logical replication - still unstable after all these months

2017-05-26 Thread Mark Kirkwood

On 26/05/17 20:09, Erik Rijkers wrote:



The idea is simple enough:

startup instance1
startup instance2 (on same machine)
primary: init pgbench tables
primary: add primary key to pgbench_history
copy empty tables to replica by dump/restore
primary: start publication
replica: start subscription
primary: run 1-minute pgbench
wait till the 4 md5's of primary pgbench tables
  are the same as the 4 md5's of replica pgbench
  tables (this will need a time-out).
log 'ok' or 'not ok'
primary: clean up publication
replica: clean up subscription
shutdown primary
shutdown replica

repeat this whole thing 100 times
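
A minimal sketch of that outer loop, assuming the per-run logic lives in 
a driver script like Erik's pgbench_derail2.sh (parameters illustrative, 
and the per-run exit status is assumed to reflect success):

for i in $(seq 1 100); do
    # scale 25, 64 clients, 60s run, timestamp tag for the logs
    ./pgbench_derail2.sh 25 64 60 "$(date +%Y%m%d_%H%M)" \
        && echo "run $i: ok" || echo "run $i: NOK"
done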


I might have a look at scripting this up (especially if it keeps raining 
here)...


Some questions that might help me get it right:
- do you think we need to stop and start the instances every time?
- do we need to init pgbench each time?
- could we just drop the subscription and publication and truncate the 
replica tables instead?

- what scale pgbench are you running?
- how many clients for the 1 min pgbench run?
- are you starting the pgbench run while the copy_data jobs for the 
subscription are still running?

- how exactly are you calculating those md5's?

Cheers

Mark






Re: [HACKERS] logical replication - still unstable after all these months

2017-05-26 Thread Mark Kirkwood

On 26/05/17 20:09, Erik Rijkers wrote:


On 2017-05-26 09:40, Simon Riggs wrote:


If we can find out what the bug is with a repeatable test case we can 
fix it.


Could you provide more details? Thanks


I will, just need some time to clean things up a bit.


But what I would like is for someone else to repeat my 100x1-minute 
tests, taking as core that snippet I posted in my previous email.  I 
built bash-stuff around that core (to take md5's, shut-down/start-up 
the two instances between runs, write info to log-files, etc).  But it 
would be good if someone else made that separately because if that 
then does not fail, it would prove that my test-harness is at fault 
(and not logical replication).




Will do - what I had been doing was running pgbench, waiting until the 
row counts on the replica pgbench_history were the same as the primary, 
then summing the %balance and delta fields from the primary and replica 
dbs and comparing. So far all match up ok. However I'd been running 
longer time frames (5 minutes), so not the same number of repetitions as 
yet.
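
For a stricter check, a whole-table digest can be computed on each side 
and compared - a sketch, assuming the standard pgbench schema (the ORDER 
BY matters, or the digests are not comparable):

SELECT md5(string_agg(t::text, ',' ORDER BY aid))
FROM pgbench_accounts t;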


regards

Mark




Re: [HACKERS] Logical replication existing data copy

2017-03-24 Thread Mark Kirkwood


On 25/03/17 07:52, Peter Eisentraut wrote:

On 3/24/17 10:09, Petr Jelinek wrote:

On 24/03/17 15:05, Peter Eisentraut wrote:

On 3/23/17 19:32, Petr Jelinek wrote:

Yes, I also forgot to check if the table actually exists on the subscriber
when fetching them in CREATE SUBSCRIPTION (we have a check during
replication but not there).

I think for this we can probably just change the missing_ok argument of
RangeVarGetRelid() to false.

Unless we want the custom error message, in which case we need to change
AlterSubscription_refresh(), because right now it errors out because
missing_ok = false.


You are right, stupid me.

Committed this version.



Minor niggle:

bench=# DROP PUBLICATION pgbench;
DROP STATISTICS   <===

I'm guessing that notification is wrong.

regards

Mark







Re: [HACKERS] Logical replication existing data copy

2017-03-24 Thread Mark Kirkwood

On 24/03/17 12:32, Petr Jelinek wrote:


On 24/03/17 00:14, Mark Kirkwood wrote:

On 24/03/17 02:00, Peter Eisentraut wrote:

On 3/21/17 21:38, Peter Eisentraut wrote:

This patch is looking pretty good to me, modulo the failing pg_dump
tests.

Attached is a fixup patch.  I have mainly updated some comments and
variable naming for (my) clarity.  No functional changes.

Committed all that.


Testing now that this patch is in, I'm unable to create a subscription:

(master)

bench=# CREATE PUBLICATION pgbench
 FOR TABLE pgbench_accounts , pgbench_branches,
   pgbench_tellers, pgbench_history
 WITH (PUBLISH INSERT, PUBLISH UPDATE, PUBLISH DELETE);

(slave)

bench=# CREATE SUBSCRIPTION pgbench
 CONNECTION 'port=5447 user=postgres dbname=bench'
 PUBLICATION pgbench
 WITH (COPY DATA);
ERROR:  duplicate key value violates unique constraint
"pg_subscription_rel_srrelid_srsubid_index"
DETAIL:  Key (srrelid, srsubid)=(0, 16389) already exists.

This is a pair of freshly initdb'ed instances, the master has a size 100
pgbench schema.

I'm guessing this is a different bug from the segfault also reported


Yes, I also forgot to check if the table actually exists on the subscriber
when fetching them in CREATE SUBSCRIPTION (we have a check during
replication but not there).

Attached patches should fix both issues.




Yep, does seem to.

I note that (probably intentionally) specifying 'COPY DATA' does not 
'CREATE TABLES' for you...ok, I probably didn't read something 
somewhere...but anyway, after creating the tables it all seems to work. 
Nice.


However, one minor observation - as Michael Banck noted - the elapsed 
time for the slave to catch up after running:


$ pgbench -c8 -T600 bench

on the master was (subjectively) much longer than for physical streaming 
replication. Is this expected?


regards

Mark






Re: [HACKERS] Logical replication existing data copy

2017-03-23 Thread Mark Kirkwood

On 24/03/17 02:00, Peter Eisentraut wrote:

On 3/21/17 21:38, Peter Eisentraut wrote:

This patch is looking pretty good to me, modulo the failing pg_dump tests.

Attached is a fixup patch.  I have mainly updated some comments and
variable naming for (my) clarity.  No functional changes.

Committed all that.



Testing now that this patch is in, I'm unable to create a subscription:

(master)

bench=# CREATE PUBLICATION pgbench
FOR TABLE pgbench_accounts , pgbench_branches,
  pgbench_tellers, pgbench_history
WITH (PUBLISH INSERT, PUBLISH UPDATE, PUBLISH DELETE);

(slave)

bench=# CREATE SUBSCRIPTION pgbench
CONNECTION 'port=5447 user=postgres dbname=bench'
PUBLICATION pgbench
WITH (COPY DATA);
ERROR:  duplicate key value violates unique constraint 
"pg_subscription_rel_srrelid_srsubid_index"

DETAIL:  Key (srrelid, srsubid)=(0, 16389) already exists.

This is a pair of freshly initdb'ed instances, the master has a size 100 
pgbench schema.


I'm guessing this is a different bug from the segfault also reported


regards


Mark






Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-14 Thread Mark Kirkwood

On 15/03/17 06:29, Robert Haas wrote:



Great!  I've committed the latest version of the patch, with some
cosmetic changes.

It would be astonishing if there weren't a bug or two left, but I
think overall this is very solid work, and I think it's time to put
this out there and see how things go.



Awesome, great work!

Cheers

Mark




[HACKERS] Logical replication initial feedback

2017-03-10 Thread Mark Kirkwood

Hi,

Thought I'd take a look at how this works (checking out current master). 
You guys have managed to get this to the point where it is *ridiculously* 
easy to set up a basic replication scenario - awesome, I must say:


- change a few parameters on the master 
(max_wal_senders,max_replication_slots,wal_level)


- create a publication on the master

- pg_dump/restore relevant tables

- create a subscription on the slave


...and done. Very nice!
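
For anyone following along, the whole flow boils down to something like 
this (names and connection details illustrative):

-- master, postgresql.conf: wal_level = logical, plus non-zero
-- max_wal_senders and max_replication_slots; then on the master:
CREATE PUBLICATION mypub FOR TABLE t1, t2;

-- slave, after restoring the table definitions via pg_dump/restore:
CREATE SUBSCRIPTION mysub
    CONNECTION 'host=master-host dbname=mydb user=replicator'
    PUBLICATION mypub;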


best wishes

Mark





Re: [HACKERS] UNDO and in-place update

2016-11-22 Thread Mark Kirkwood

On 23/11/16 16:31, Tom Lane wrote:

Robert Haas  writes:

[ Let's invent Oracle-style UNDO logs ]


I dunno.  I remember being told years ago, by an ex-Oracle engineer,
that he thought our approach was better.  I don't recall all the details
of the conversation but I think his key point was basically this:


- Reading a page that has been recently modified gets significantly
more expensive; it is necessary to read the associated UNDO entries
and do a bunch of calculation that is significantly more complex than
what is required today.




Also ROLLBACK becomes vastly more expensive than COMMIT (I can recall 
many years ago when I used to be an Oracle DBA reading whole chapters of 
novels waiting for failed batch jobs to roll back).


However, I'd like to add that I agree this is worth looking at, as 
ideally it would be great to be able to choose between no-UNDO and UNDO 
on a table-by-table basis...


regards

Mark





Re: [HACKERS] WIP: About CMake v2

2016-11-16 Thread Mark Kirkwood
I see there are some patches for the putenv issue with Visual Studio 
2013 in progress on this list - it is probably best to work with the 
author to see if 2015 has any new issues, and to keep all changes for 
that *out* of the cmake patches.



regards


Mark


On 16/11/16 21:22, Yury Zhuravlev wrote:
I made this small wrapper specially for MSVC 2015 without Service Packs, 
because postgres macros were in conflict with MS internal functions. 
After some time and some updates for MSVC, Michael Paquier could not 
reproduce my problem, but I keep this patch to avoid problems in the 
future. I can check the old behavior again and revert all changes if 
needed, and of course I have plans to make a separate patch for these 
changes.

Thanks!




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: About CMake v2

2016-11-15 Thread Mark Kirkwood
Yeah, there seem to be a lot of these. Looking through them, almost all 
concern the addition of a piece of code to wrap putenv, e.g.:


--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -1350,7 +1350,7 @@ exec_command(const char *cmd,
char   *newval;

newval = psprintf("%s=%s", envvar, envval);
-   putenv(newval);
+   pg_putenv_proxy(newval);
success = true;

/*

Where pg_putenv_proxy either calls putenv or pgwin32_putenv (the latter 
on Windows, I'd guess). I wonder if this could have been avoided, since 
the original code handles this sort of thing. There are also some minor 
- and not immediately obvious - changes to a number of macros in various 
includes... If I'm feeling keen I'll experiment to see how far I can get 
without any source changes at all.



regards


Mark


On 09/11/16 08:37, Peter Eisentraut wrote:


There are a number of changes in .[ch] and .pl files that are unclear
and not explained.  Please explain them.  You can also submit separate
preliminary patches if you need to do some refactoring.  Ultimately, I
would expect this patch not to require C code changes.





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: About CMake v2

2016-11-15 Thread Mark Kirkwood
I had a bit of a play around to see if I could get this building properly 
again with v10 (i.e. master). I've attached a minimal *incremental* patch 
that needs to be applied after v1. This gets everything under the src 
tree building (added the new file_utils.c, reordered some link libs, 
and removed some duplicates).



I haven't made any changes with respect to it trying to detect and build 
everything. One added nit I see is that, unless I'm missing something, 
there appears to be no way to stop it trying to build all the 
contribs... so an option to enable/disable their build is needed.



To make it display what options there are I use:

$ mkdir build; cd build ; cmake .. -LH


And to do a build that works:

$ cmake .. -DCMAKE_INSTALL_PREFIX=/usr/local/pgsql/10 -DWITH_PERL=OFF 
-DWITH_OPENSSL=OFF -DWITH_PYTHON=OFF



With reference to Tom's comments, I'm thinking that these should all 
default to 'OFF', plus an additional variable WITH_CONTRIB (or similar) 
should default to OFF too.



regards


Mark



On 09/11/16 08:37, Peter Eisentraut wrote:

On 9/17/16 1:21 PM, Yury Zhuravlev wrote:

Now, I published the first version of the patch.

I tried this out.  Because of some file moves in initdb and
pg_basebackup, the build fails:

[ 74%] Linking C executable initdb
  Undefined symbols for architecture x86_64:
"_fsync_pgdata", referenced from:
_main in initdb.c.o
  ld: symbol(s) not found for architecture x86_64
  collect2: error: ld returned 1 exit status
  make[2]: *** [src/bin/initdb/CMakeFiles/initdb.dir/build.make:177:
src/bin/initdb/initdb] Error 1
  make[1]: *** [CMakeFiles/Makefile2:2893:
src/bin/initdb/CMakeFiles/initdb.dir/all] Error 2
  make: *** [Makefile:128: all] Error 2

Please submit an updated patch.

I suggest you use git format-patch to produce patches.  This is easier
to apply, especially when there are a lot of new files involved.  Also
use the git facilities to check for whitespace errors.

Please supply some documentation, such as

- what are the basic commands
- how to set include/library paths, choose configure options
- how to set CFLAGS
- how to see raw build commands
- what are the targets for all/world/check/docs etc.
- explain directory structure

I suggest for now you could put this into a README.cmake file in your
patch.  We don't need to commit it that way, but it would help in the
meantime.

When I run cmake without options, it seems to do opportunistic feature
checking.  For example, it turns on OpenSSL or Python support if it can
find it, otherwise it turns it off.  We need this to be deterministic.
Without options, choose the basic feature set, require all other
features to be turned on explicitly, fail if they can't be found.
Whatever the Autoconf-based build does now has been fairly deliberately
tuned, so there should be very little reason to deviate from that.

The Python check appears to be picking up pieces from two different
Python installations:

  -- Found PythonInterp: /usr/local/bin/python (found version "2.7.12")
  -- Found PythonLibs: /usr/lib/libpython2.7.dylib (found version "2.7.10")

The check results otherwise look OK, but I'm a bit confused about the
order.  It checks for some functions before all the header files are
checked for.  Is that intentional?

There are a number of changes in .[ch] and .pl files that are unclear
and not explained.  Please explain them.  You can also submit separate
preliminary patches if you need to do some refactoring.  Ultimately, I
would expect this patch not to require C code changes.



diff --git a/src/bin/pg_archivecleanup/CMakeLists.txt b/src/bin/pg_archivecleanup/CMakeLists.txt
index 7c79ac3..06d7be4 100644
--- a/src/bin/pg_archivecleanup/CMakeLists.txt
+++ b/src/bin/pg_archivecleanup/CMakeLists.txt
@@ -7,9 +7,8 @@ include_directories(BEFORE
 add_executable(pg_archivecleanup pg_archivecleanup.c)
 
 target_link_libraries(pg_archivecleanup
-	port
-#	pq
 	pgcommon
+	port
 	${M_LIB}
 )
 
diff --git a/src/bin/pg_basebackup/CMakeLists.txt b/src/bin/pg_basebackup/CMakeLists.txt
index a985e08..a9bdd66 100644
--- a/src/bin/pg_basebackup/CMakeLists.txt
+++ b/src/bin/pg_basebackup/CMakeLists.txt
@@ -12,9 +12,12 @@ endif()
 add_library(common_basebackup STATIC
 	receivelog.c
 	streamutil.c
+	walmethods.c
 )
 
 target_link_libraries(common_basebackup
+	pgfeutils
+	pgcommon
 	port
 	pq
 )
@@ -29,12 +32,8 @@ foreach(loop_var IN ITEMS ${basebackup_list})
 	add_executable(${loop_var} ${loop_var}.c)
 
 	target_link_libraries(${loop_var}
-		pgfeutils
-		port
-		pq
-		pgcommon
-		${M_LIB}
 		common_basebackup
+		${M_LIB}
 	)
 	CMAKE_SET_TARGET_FOLDER(${loop_var} bin)
 	if(ZLIB_FOUND)
diff --git a/src/common/CMakeLists.txt b/src/common/CMakeLists.txt
index 7a24165..9338482 100644
--- a/src/common/CMakeLists.txt
+++ b/src/common/CMakeLists.txt
@@ -18,6 +18,7 @@ set(pgcommon_srv_SRCS
 set(pgcommon_SRCS
 	${pgcommon_srv_SRCS}
 	fe_memutils.c
+	file_utils.c
 	restricted_token.c
 	${PROJECT_SOURCE_DIR}/src/include/parser/gram.h
 )


Re: [HACKERS] WIP: About CMake v2

2016-11-10 Thread Mark Kirkwood

On 11/11/16 08:15, Yury Zhuravlev wrote:


Craig Ringer wrote:

So personally I think it'd be fine if a cmake build defaulted to
finding and using what it could, but offered a --minimal mode or
whatever that gets us just core postgres + whatever we enable
explicitly. But our current behaviour is OK too.


To me it's the best way. But I'm not sure whether Tom Lane will accept this.




I just had a play with this patch - nice! (OK, it needs a fix so that 
the compile completes, as mentioned previously).


I would recommend making it behave as Tom suggested. *Then* add an 
--autodetect or similar option that makes it behave in the 'finding and 
using what it could' manner, as a 2nd patch.

regards

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash Indexes

2016-09-25 Thread Mark Kirkwood



On 25/09/16 18:18, Amit Kapila wrote:

On Sat, Sep 24, 2016 at 10:49 PM, Greg Stark  wrote:

On Thu, Sep 22, 2016 at 3:23 AM, Tom Lane  wrote:

But to kick the hash AM as such to the curb is to say
"sorry, there will never be O(1) index lookups in Postgres".

Well there's plenty of halfway solutions for that. We could move hash
indexes to contrib or even have them in core as experimental_hash or
unlogged_hash until the day they achieve their potential.

We definitely shouldn't discourage people from working on hash indexes


Okay, but to me it appears that naming it experimental_hash or
moving it to contrib could discourage people, or at the very least
people will be less motivated.  Thinking along those lines a year or so
back would have been a wise direction, but now, when there is already a
lot of work done for hash indexes (patches to make them wal-enabled,
more concurrent and performant, plus a pageinspect module) and still
more is in progress, that sounds like a step backward rather than
forward.



+1

I think so too - I've seen many email threads over the years on this 
list that essentially state "we need hash indexes wal logged to make 
progress with them"... and Amit et al have done this (more than this 
obviously - made 'em better too), and I'm astonished that folk are 
suggesting anything other than 'commit this great patch now!'...


regards

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash Indexes

2016-09-18 Thread Mark Kirkwood



On 17/09/16 06:38, Andres Freund wrote:

On 2016-09-16 09:12:22 -0700, Jeff Janes wrote:

On Thu, Sep 15, 2016 at 7:23 AM, Andres Freund  wrote:

One earlier question about this is whether that is actually a worthwhile
goal.  Are the speed and space benefits big enough in the general case?
Could those benefits not be achieved in a more maintainable manner by
adding a layer that uses a btree over hash(columns), and adds
appropriate rechecks after heap scans?

Note that I'm not saying that hash indexes are not worthwhile, I'm just
doubtful that question has been explored sufficiently.

I think that exploring it well requires good code.  If the code is good,
why not commit it?

Because getting there requires a lot of effort, debugging it afterwards
would take effort, and maintaining it would also takes a fair amount?
Adding code isn't free.

I'm rather unenthused about having a hash index implementation that's
mildly better in some corner cases, but otherwise doesn't have much
benefit. That'll mean we'll have to step up our user education a lot,
and we'll have to maintain something for little benefit.



While I see the point of what you are saying here, I recall that all previous 
discussions about hash indexes tended to go a bit like this:


- until WAL logging of hash indexes is written it is not worthwhile 
trying to make improvements to them

- WAL logging will be a lot of work, patches 1st please

Now someone has done that work, and we seem to be objecting that, because 
they have not (yet) been improved further, the patches are (maybe) not 
worthwhile. I think that is - essentially - somewhat unfair.


regards

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash Indexes

2016-09-16 Thread Mark Kirkwood

On 16/09/16 18:35, Amit Kapila wrote:


On Thu, Sep 15, 2016 at 7:53 PM, Andres Freund  wrote:

Hi,

On 2016-05-10 17:39:22 +0530, Amit Kapila wrote:

For making hash indexes usable in production systems, we need to improve
its concurrency and make them crash-safe by WAL logging them.

One earlier question about this is whether that is actually a worthwhile
goal.  Are the speed and space benefits big enough in the general case?


I think there will surely be speed benefits w.r.t reads for larger
indexes.  All our measurements till now have shown that there is a
benefit varying from 30~60% (for reads) with hash index as compare to
btree, and I think it could be even more if we further increase the
size of index.  On space front, I have not done any detailed study, so
it is not right to conclude anything, but it appears to me that if the
index is on char/varchar column where size of key is 10 or 20 bytes,
hash indexes should be beneficial as they store just hash-key.
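
(One quick way to see the space difference - index names illustrative:

  bench=# CREATE INDEX acc_filler_btree ON pgbench_accounts USING btree (filler);
  bench=# CREATE INDEX acc_filler_hash ON pgbench_accounts USING hash (filler);
  bench=# SELECT relname, pg_size_pretty(pg_relation_size(oid))
          FROM pg_class WHERE relname LIKE 'acc_filler%';

since the hash index stores only the hash code per entry, not the key itself.)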


Could those benefits not be achieved in a more maintainable manner by
adding a layer that uses a btree over hash(columns), and adds
appropriate rechecks after heap scans?


I don't think it can be faster for reads than using real hash index,
but surely one can have that as a workaround.


Note that I'm not saying that hash indexes are not worthwhile, I'm just
doubtful that question has been explored sufficiently.


I think theoretically hash indexes are faster than btree considering
logarithmic complexity (O(1) vs. O(logn)), also the results after
recent optimizations indicate that hash indexes are faster than btree
for equal to searches.  I am not saying after the recent set of
patches proposed for hash indexes they will be better in all kind of
cases.  It could be beneficial for cases where indexed columns are not
updated heavily.

I think one can definitely argue that we can make some optimizations in
btree and make them equivalent to or better than hash indexes, but I am
not sure if it is possible for all kinds of use-cases.




I think having the choice of a more equality-optimized index design is 
desirable. Now that they are WAL logged, hash indexes are first-class 
citizens, so to speak. I suspect that there are a lot of further speed 
optimizations that can be considered to tease out the best performance - 
now that the basics of reliability have been sorted. I think this 
patch set is important!
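
e.g. the kind of index and query this is aimed at (names illustrative):

  bench=# CREATE INDEX pgbench_accounts_bid_hash ON pgbench_accounts USING hash (bid);
  bench=# EXPLAIN SELECT count(*) FROM pgbench_accounts WHERE bid = 42;

where one would hope to see the hash index chosen for the equality search.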


regards

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] less expensive pg_buffercache on big shmem

2016-09-15 Thread Mark Kirkwood

On 02/09/16 15:19, Andres Freund wrote:


On 2016-09-02 08:31:42 +0530, Robert Haas wrote:

I wonder whether we ought to just switch from the consistent method to
the semiconsistent method and call it good.

+1. I think, before long, we're going to have to switch away from having
locks & partitions in the first place. So I don't see a problem relaxing
this. It's not like that consistency really buys you anything...  I'd
even consider not using any locks.



+1 as well. When I wrote the original module I copied the design of the 
pg_locks view - as it was safe and consistent. Now it is clear that the 
ability to look at the (semi-consistent) contents of the buffer cache is 
more important than having a theoretically correct point-in-time 
snapshot. Go for it :-)
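
For anyone following along, the typical sort of query against the view - 
which a semi-consistent scan serves just as well:

  bench=# CREATE EXTENSION IF NOT EXISTS pg_buffercache;
  bench=# SELECT c.relname, count(*) AS buffers
          FROM pg_buffercache b
          JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
          GROUP BY c.relname
          ORDER BY buffers DESC LIMIT 10;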


regards

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash Indexes

2016-09-12 Thread Mark Kirkwood

On 13/09/16 01:20, Jesper Pedersen wrote:

On 09/01/2016 11:55 PM, Amit Kapila wrote:

I have fixed all other issues you have raised.  Updated patch is
attached with this mail.



The following script hangs on idx_val creation - just with v5, WAL patch
not applied.


Are you sure it is actually hanging? I see 100% CPU for a few minutes 
but the index eventually completes OK for me (v5 patch applied to 
today's master).


Cheers

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-11 Thread Mark Kirkwood

On 11/09/16 19:16, Mark Kirkwood wrote:




On 11/09/16 17:01, Amit Kapila wrote:

...Do you think we can do some read-only
workload benchmarking using this server?  If yes, then probably you
can use concurrent hash index patch [1] and cache the metapage patch
[2] (I think Mithun needs to rebase his patch) to do so.



[1] - 
https://www.postgresql.org/message-id/caa4ek1j6b8o4pcepqrxnyblvbftonmjeem+qn0jzx31-obx...@mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/cad__ouhj29cebif_flge4t9vj_-cfxbwcxhjo+d_16txbem...@mail.gmail.com





I can do - are we checking for hangs/assertions or comparing 
patched vs unpatched performance (for the metapage patch)?





So, assuming the latter - testing performance with and without the 
metapage patch:


For my 1st runs:

- cpus 16, ram 16G
- size 100, clients 32

I'm seeing no difference in performance for a read-only (-S) pgbench 
workload (with everybody using hash indexes). I guess that's not so 
surprising, as the db fits in RAM (1.6G and we have 16G). So I'll retry 
with a bigger dataset (suspect size 2000 is needed).
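
i.e. something like the following, so the working set is several times 
RAM (sizes and client counts illustrative):

  $ pgbench -i -s 2000 bench
  $ pgbench -c 32 -j 32 -S -T 600 bench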


regards

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-11 Thread Mark Kirkwood



On 11/09/16 17:01, Amit Kapila wrote:

On Sun, Sep 11, 2016 at 4:10 AM, Mark Kirkwood
<mark.kirkw...@catalyst.net.nz> wrote:


performed several 10 hour runs on size 100 database using 32 and 64 clients.
For the last run I rebuilt with assertions enabled. No hangs or assertion
failures.


Thanks for verification.  Do you think we can do some read-only
workload benchmarking using this server?  If yes, then probably you
can use concurrent hash index patch [1] and cache the metapage patch
[2] (I think Mithun needs to rebase his patch) to do so.



[1] - 
https://www.postgresql.org/message-id/caa4ek1j6b8o4pcepqrxnyblvbftonmjeem+qn0jzx31-obx...@mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/cad__ouhj29cebif_flge4t9vj_-cfxbwcxhjo+d_16txbem...@mail.gmail.com




I can do - are we checking for hangs/assertions or comparing 
patched vs unpatched performance (for the metapage patch)?


regards

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-10 Thread Mark Kirkwood



On 09/09/16 14:50, Mark Kirkwood wrote:


Yeah, good suggestion about replacing (essentially) all the indexes 
with hash ones and testing. I did some short runs with this type of 
schema yesterday (actually to get a feel for whether hash performance vs 
btree was comparable - it does seem to be) - but probably longer ones 
with higher concurrency (as high as I can manage on a single socket i7 
anyway) is a good plan. If Ashutosh has access to seriously large 
numbers of cores then that is even better :-)





I managed to find a slightly bigger server (used our Openstack cloud to 
run a 16-CPU VM). With the schema modified as follows:


bench=# \d pgbench_accounts
   Table "public.pgbench_accounts"
  Column  | Type  | Modifiers
--+---+---
 aid  | integer   | not null
 bid  | integer   |
 abalance | integer   |
 filler   | character(84) |
Indexes:
"pgbench_accounts_pkey" hash (aid)

bench=# \d pgbench_branches
   Table "public.pgbench_branches"
  Column  | Type  | Modifiers
--+---+---
 bid  | integer   | not null
 bbalance | integer   |
 filler   | character(88) |
Indexes:
"pgbench_branches_pkey" hash (bid)

bench=# \d pgbench_tellers
Table "public.pgbench_tellers"
  Column  | Type  | Modifiers
--+---+---
 tid  | integer   | not null
 bid  | integer   |
 tbalance | integer   |
 filler   | character(84) |
Indexes:
"pgbench_tellers_pkey" hash (tid)

bench=# \d pgbench_history
  Table "public.pgbench_history"
 Column |Type | Modifiers
+-+---
 tid| integer |
 bid| integer |
 aid| integer |
 delta  | integer |
 mtime  | timestamp without time zone |
 filler | character(22)   |
Indexes:
"pgbench_history_pkey" hash (bid)

performed several 10 hour runs on size 100 database using 32 and 64 
clients. For the last run I rebuilt with assertions enabled. No hangs or 
assertion failures.


regards

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-08 Thread Mark Kirkwood

On 09/09/16 07:09, Jeff Janes wrote:

On Wed, Sep 7, 2016 at 3:29 AM, Ashutosh Sharma wrote:


> Thanks to Ashutosh Sharma for doing the testing of the patch and
> helping me in analyzing some of the above issues.

Hi All,

I would like to summarize the test-cases that i have executed for
validating WAL logging in hash index feature.

1) I have mainly ran the pgbench test with read-write workload at the
scale factor of 1000 and various client counts like 16, 64 and 128 for
time duration of 30 mins, 1 hr and 24 hrs. I have executed this test
on highly configured power2 machine with 128 cores and 512GB of RAM. I
ran the test-case both with and without the replication setup.

Please note that i have changed the schema of pgbench tables created
during initialisation phase.

The new schema of pgbench tables looks as shown below on both master
and standby:

postgres=# \d pgbench_accounts
   Table "public.pgbench_accounts"
  Column  | Type  | Modifiers
--+---+---
 aid  | integer   | not null
 bid  | integer   |
 abalance | integer   |
 filler   | character(84) |
Indexes:
"pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
"pgbench_accounts_bid" hash (bid)

postgres=# \d pgbench_history
  Table "public.pgbench_history"
 Column |Type | Modifiers
+-+---
 tid| integer |
 bid| integer |
 aid| integer |
 delta  | integer |
 mtime  | timestamp without time zone |
 filler | character(22)   |
Indexes:
"pgbench_history_bid" hash (bid)


Hi Ashutosh,

This schema will test the maintenance of hash indexes, but it will 
never use hash indexes for searching, so it limits the amount of test 
coverage you will get.  While searching shouldn't generate novel types 
of WAL records (that I know of), it will generate locking and timing 
issues that might uncover bugs (if there are any left to uncover, of 
course).


I would drop the primary key on pgbench_accounts and replace it with a 
hash index and test it that way (except I don't have a 128 core 
machine at my disposal, so really I am suggesting that you do this...)


The lack of primary key and the non-uniqueness of the hash index 
should not be an operational problem, because the built in pgbench 
runs never attempt to violate the constraints anyway.


In fact, I'd replace all of the indexes on the rest of the pgbench 
tables with hash indexes, too, just for additional testing.


I plan to do testing using my own testing harness after changing it to 
insert a lot of dummy tuples (ones with negative values in the 
pseudo-pk column, which are never queried by the core part of the 
harness) and deleting them at random intervals.  I think that none of 
pgbench's built in tests are likely to give the bucket splitting and 
squeezing code very much exercise.


Is there a way to gather statistics on how many of each type of WAL 
record are actually getting sent over the replication link?  The only 
way I can think of is to turn on wal archiving as well as replication, 
then using pg_xlogdump to gather the stats.
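
(For reference, pg_xlogdump does have a stats mode that can be pointed 
at a range of archived segments - the file names below are illustrative:

  $ pg_xlogdump --stats=record 000000010000000000000010 00000001000000000000001F

which breaks the totals down by resource manager and record type.)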


I've run my original test for a while now and have not seen any 
problems.  But I realized I forgot to compile with --enable-cassert, so 
I will have to redo it to make sure the assertion failures have been 
fixed.  In my original testing I did very rarely get a deadlock (or 
some kind of hang), and I haven't seen that again so far.  It was 
probably the same source as the one Mark observed, and so the same fix.


Cheers,

Jeff


Yeah, good suggestion about replacing (essentially) all the indexes with 
hash ones and testing. I did some short runs with this type of schema 
yesterday (actually to get a feel for whether hash performance vs btree 
was comparable - it does seem to be) - but probably longer ones with 
higher concurrency (as high as I can manage on a single socket i7 
anyway) is a good plan. If Ashutosh has access to seriously large 
numbers of cores then that is even better :-)
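
The schema change Jeff suggests amounts to something like this (reusing 
the pkey name for the hash index, as the later \d output shows):

  bench=# ALTER TABLE pgbench_accounts DROP CONSTRAINT pgbench_accounts_pkey;
  bench=# CREATE INDEX pgbench_accounts_pkey ON pgbench_accounts USING hash (aid);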


Cheers

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-07 Thread Mark Kirkwood

On 07/09/16 21:58, Amit Kapila wrote:


On Wed, Aug 24, 2016 at 10:32 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:

On Tue, Aug 23, 2016 at 10:05 PM, Amit Kapila <amit.kapil...@gmail.com>
wrote:

On Wed, Aug 24, 2016 at 2:37 AM, Jeff Janes <jeff.ja...@gmail.com> wrote:


After an intentionally created crash, I get an Assert triggering:

TRAP: FailedAssertion("!(((freep)[(bitmapbit)/32] &
(1<<((bitmapbit)%32", File: "hashovfl.c", Line: 553)

freep[0] is zero and bitmapbit is 16.


Here what is happening is that when it tries to clear the bitmapbit,
it expects it to be set.  Now, I think the reason for why it didn't
find the bit as set could be that after the new overflow page is added
and the bit corresponding to it is set, you might have crashed the
system and the replay would not have set the bit.  Then while freeing
the overflow page it can hit the Assert as mentioned by you.  I think
the problem here could be that I am using REGBUF_STANDARD to log the
bitmap page updates which seems to be causing the issue.  As bitmap
page doesn't follow the standard page layout, it would have omitted
the actual contents while taking full page image and then during
replay, it would not have set the bit, because page doesn't need REDO.
I think here the fix is to use REGBUF_NO_IMAGE as we use for vm
buffers.

If you can send me the detailed steps for how you have produced the
problem, then I can verify after fixing whether you are seeing the
same problem or something else.



The test is rather awkward, it might be easier to just have me test it.


Okay, I have fixed this issue as explained above.  Apart from that, I
have fixed another issue reported by Mark Kirkwood upthread and few
other issues found during internal testing by Ashutosh Sharma.

The locking issue reported by Mark and Ashutosh is that the patch
didn't maintain the locking order while adding overflow page as it
maintains in other write operations (lock the bucket pages first and
then metapage to perform the write operation).  I have added the
comments in _hash_addovflpage() to explain the locking order used in
modified patch.

During stress testing with pgbench using master-standby setup, we
found an issue which indicates that WAL replay machinery doesn't
expect completely zeroed pages (See explanation of RBM_NORMAL mode
atop XLogReadBufferExtended).  Previously before freeing the overflow
page we were zeroing it, now I have changed it to just initialize the
page such that the page will be empty.

Apart from above, I have added support for old snapshot threshold in
the hash index code.

Thanks to Ashutosh Sharma for doing the testing of the patch and
helping me in analyzing some of the above issues.

I forgot to mention in my initial mail that Robert and I had some
off-list discussions about the design of this patch, many thanks to
him for providing inputs.




Repeating my tests with these new patches applied points to the hang 
issue being solved. I tested several 10 minute runs (any of which was 
enough to elicit the hang previously). I'll do some longer ones, but 
looks good!


regards

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-08-24 Thread Mark Kirkwood

On 24/08/16 17:01, Mark Kirkwood wrote:


...actually I was wrong there, only 2 of them were the same. So I've 
attached a new log of bt's of them all.






And I can reproduce with only 1 session, figured that might be a helpful 
piece of the puzzle (trace attached).



$ pgbench -c 1 -T600 bench

bench=# SELECT pid, state, now()-xact_start AS wait, wait_event_type, wait_event, query
        FROM pg_stat_activity WHERE datname='bench' ORDER BY wait DESC;
  pid  | state  |      wait       | wait_event_type |   wait_event   | query
-------+--------+-----------------+-----------------+----------------+----------------------------------------------------
 17266 | active | 00:45:23.027555 | LWLockTranche   | buffer_content | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (352, 76, 1305123, -1604, CURRENT_TIMESTAMP);
(1 row)

$ gdb -p 17266 
(gdb) bt
#0  0x7f7b4d533387 in semop () at ../sysdeps/unix/syscall-template.S:84
#1  0x00660331 in PGSemaphoreLock (sema=sema@entry=0x7f7b4cad1690)
at pg_sema.c:387
#2  0x006bde9b in LWLockAcquire (lock=0x7f7b44583fe4, mode=LW_EXCLUSIVE)
at lwlock.c:1288
#3  0x0049a625 in _hash_getbuf_with_strategy 
(rel=rel@entry=0x7f7b4e0c1908, 
blkno=blkno@entry=12, access=access@entry=2, flags=flags@entry=1, 
bstrategy=bstrategy@entry=0x0) at hashpage.c:252
#4  0x00498040 in _hash_freeovflpage (rel=rel@entry=0x7f7b4e0c1908, 
bucketbuf=bucketbuf@entry=12796, ovflbuf=ovflbuf@entry=236, 
wbuf=wbuf@entry=13166, 
itups=itups@entry=0x7ffc693fc800, 
itup_offsets=itup_offsets@entry=0x13067d0, 
tups_size=0x7ffc693fd4c0, nitups=34, bstrategy=0x0) at hashovfl.c:517
#5  0x00499985 in _hash_squeezebucket (rel=rel@entry=0x7f7b4e0c1908, 
bucket=bucket@entry=7, bucket_blkno=bucket_blkno@entry=10, 
bucket_buf=bucket_buf@entry=12796, bstrategy=bstrategy@entry=0x0) at 
hashovfl.c:1010
#6  0x00496caf in hashbucketcleanup (rel=rel@entry=0x7f7b4e0c1908, 
bucket_buf=bucket_buf@entry=12796, bucket_blkno=bucket_blkno@entry=10, 
bstrategy=bstrategy@entry=0x0, maxbucket=22, highmask=31, lowmask=15, 
tuples_removed=0x0, num_index_tuples=0x0, bucket_has_garbage=1 '\001', 
delay=0 '\000', callback=0x0, callback_state=0x0) at hash.c:937
#7  0x0049b353 in _hash_expandtable (rel=rel@entry=0x7f7b4e0c1908, 
metabuf=metabuf@entry=6179) at hashpage.c:785
#8  0x00497bae in _hash_doinsert (rel=rel@entry=0x7f7b4e0c1908, 
itup=itup@entry=0x13060b8) at hashinsert.c:313
#9  0x0049617f in hashinsert (rel=0x7f7b4e0c1908, values=<optimized out>, 
    isnull=<optimized out>, ht_ctid=0x1305f7c, heapRel=<optimized out>, 
    checkUnique=<optimized out>) at hash.c:247
#10 0x005bf060 in ExecInsertIndexTuples (slot=slot@entry=0x1304a70, 
tupleid=tupleid@entry=0x1305f7c, estate=estate@entry=0x13043b0, 
noDupErr=noDupErr@entry=0 '\000', specConflict=specConflict@entry=0x0, 
arbiterIndexes=arbiterIndexes@entry=0x0) at execIndexing.c:388
#11 0x005dd011 in ExecInsert (canSetTag=1 '\001', estate=0x13043b0, 
onconflict=<optimized out>, arbiterIndexes=0x0, planSlot=0x1304a70, 
slot=0x1304a70, 
mtstate=0x1304600) at nodeModifyTable.c:481
#12 ExecModifyTable (node=node@entry=0x1304600) at nodeModifyTable.c:1496
#13 0x005c3948 in ExecProcNode (node=node@entry=0x1304600) at 
execProcnode.c:396
#14 0x005bfdbf in ExecutePlan (dest=0x12f4bd8, direction=<optimized out>, 
    numberTuples=0, sendTuples=<optimized out>, operation=CMD_INSERT, 
    use_parallel_mode=<optimized out>, planstate=0x1304600, estate=0x13043b0)
    at execMain.c:1567
#15 standard_ExecutorRun (queryDesc=0x1303fa0, direction=<optimized out>, count=0)
    at execMain.c:338
#16 0x006ce669 in ProcessQuery (plan=<optimized out>, 
sourceText=0x12ccf50 "INSERT INTO pgbench_history (tid, bid, aid, delta, 
mtime) VALUES (352, 76, 1305123, -1604, CURRENT_TIMESTAMP);", params=0x0, 
dest=0x12f4bd8, 
completionTag=0x7ffc69400c30 "") at pquery.c:187
#17 0x006ce89c in PortalRunMulti (portal=portal@entry=0x1271fe0, 
isTopLevel=isTopLevel@entry=1 '\001', 
setHoldSnapshot=setHoldSnapshot@entry=0 '\000', dest=dest@entry=0x12f4bd8, 
---Type <return> to continue, or q <return> to quit---
altdest=altdest@entry=0x12f4bd8, 
completionTag=completionTag@entry=0x7ffc69400c30 "") at pquery.c:1303
#18 0x006cf34e in PortalRun (portal=portal@entry=0x1271fe0, 
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 
'\001', 
dest=dest@entry=0x12f4bd8, altdest=altdest@entry=0x12f4bd8, 
completionTag=completionTag@entry=0x7ffc69400c30 "") at pquery.c:815
#19 0x006cc45b in exec_simple_query (
query_string=0x12ccf50 "INSERT INTO pgbench_history (tid, bid, aid, delta, 
mtime) VALUES (352, 76, 1305123, -1604, CURRENT_TIMESTAMP)

Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-08-23 Thread Mark Kirkwood

On 24/08/16 16:52, Mark Kirkwood wrote:

On 24/08/16 16:33, Amit Kapila wrote:

On Wed, Aug 24, 2016 at 9:53 AM, Mark Kirkwood
<mark.kirkw...@catalyst.net.nz> wrote:

On 24/08/16 15:36, Amit Kapila wrote:

On Wed, Aug 24, 2016 at 8:54 AM, Mark Kirkwood
<mark.kirkw...@catalyst.net.nz> wrote:
Can you get the call stacks?


For every stuck backend? (just double checking)...

Yeah.



Cool,

I managed to reproduce with a reduced workload of 4 backends, then 
noticed that the traces for 3 of 'em were all the same. So I've 
attached the 2 unique ones, plus noted what pg_stat_activity thought 
the wait event was, in case that is useful.


...actually I was wrong there, only 2 of them were the same. So I've 
attached a new log of bt's of them all.



$ pgbench -c 4 -T600 bench

postgres=# SELECT pid, state, now()-xact_start AS wait, wait_event_type, wait_event, query
           FROM pg_stat_activity WHERE datname='bench' ORDER BY wait DESC;
  pid  | state  |      wait       | wait_event_type |   wait_event   | query
-------+--------+-----------------+-----------------+----------------+----------------------------------------------------
 16549 | active | 00:06:00.252146 | LWLockTranche   | buffer_content | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (190, 24, 9513890, -2566, CURRENT_TIMESTAMP);
 16548 | active | 00:06:00.235003 | LWLockTranche   | buffer_content | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (445, 3, 5688076, 1021, CURRENT_TIMESTAMP);
 16547 | active | 00:06:00.218032 | LWLockTranche   | buffer_content | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (803, 97, 6871273, -35, CURRENT_TIMESTAMP);
 16546 | active | 00:06:00.192543 | Lock            | transactionid  | UPDATE pgbench_branches SET bbalance = bbalance + -2823 WHERE bid = 3;
(4 rows)

$ gdb -p 16546 
(gdb) bt
#0  0x7f51daa3a133 in __epoll_wait_nocancel () at 
../sysdeps/unix/syscall-template.S:84
#1  0x006abdb6 in WaitEventSetWaitBlock (nevents=1, 
occurred_events=0x7fffd197d3a0, cur_timeout=-1, set=0xdc9288)
at latch.c:987
#2  WaitEventSetWait (set=set@entry=0xdc9288, timeout=timeout@entry=-1, 
occurred_events=occurred_events@entry=0x7fffd197d3a0, 
nevents=nevents@entry=1) at latch.c:941
#3  0x006ac1ca in WaitLatchOrSocket (latch=0x7f51d9fd969c, 
wakeEvents=wakeEvents@entry=1, sock=sock@entry=-1, 
timeout=-1, timeout@entry=0) at latch.c:347
#4  0x006ac27d in WaitLatch (latch=<optimized out>, 
wakeEvents=wakeEvents@entry=1, timeout=timeout@entry=0)
at latch.c:302
#5  0x006bb5e3 in ProcSleep (locallock=locallock@entry=0xd3d690, 
lockMethodTable=lockMethodTable@entry=0x9149a0) at 
proc.c:1219
#6  0x006b639d in WaitOnLock (locallock=locallock@entry=0xd3d690, 
owner=owner@entry=0xdc4cb0) at lock.c:1703
#7  0x006b7a77 in LockAcquireExtended 
(locktag=locktag@entry=0x7fffd197d700, lockmode=lockmode@entry=5, 
sessionLock=sessionLock@entry=0 '\000', dontWait=dontWait@entry=0 '\000', 
reportMemoryError=reportMemoryError@entry=1 '\001')
at lock.c:998
#8  0x006b7d51 in LockAcquire (locktag=locktag@entry=0x7fffd197d700, 
lockmode=lockmode@entry=5, 
sessionLock=sessionLock@entry=0 '\000', dontWait=dontWait@entry=0 '\000') 
at lock.c:688
#9  0x006b58fe in XactLockTableWait (xid=xid@entry=7667, 
rel=rel@entry=0x7f51db5c6398, ctid=ctid@entry=0x7fffd197d844, 
oper=oper@entry=XLTW_Update) at lmgr.c:587
#10 0x004a5d29 in heap_update (relation=relation@entry=0x7f51db5c6398, 
otid=otid@entry=0x7fffd197d9e0, 
newtup=newtup@entry=0xd4b3d0, cid=2, crosscheck=0x0, wait=wait@entry=1 
'\001', hufd=0x7fffd197d8f0, lockmode=0x7fffd197d8ec)
at heapam.c:3755
#11 0x005dbeaa in ExecUpdate (tupleid=tupleid@entry=0x7fffd197d9e0, 
oldtuple=oldtuple@entry=0x0, 
slot=slot@entry=0xd4adb0, planSlot=planSlot@entry=0xdc8ea0, 
epqstate=epqstate@entry=0xdc76c8, estate=estate@entry=0xdc73d0, 
canSetTag=1 '\001') at nodeModifyTable.c:922
#12 0x005dc633 in ExecModifyTable (node=node@entry=0xdc7620) at 
nodeModifyTable.c:1501
#13 0x005c3948 in ExecProcNode (node=node@entry=0xdc7620) at 
execProcnode.c:396
#14 0x005bfdbf in ExecutePlan (dest=0xdc9e60, direction=<optimized out>, 
    numberTuples=0, sendTuples=<optimized out>, operation=CMD_UPDATE, 
    use_parallel_mode=<optimized out>, planstate=0xdc7620, estate=0xdc73d0)
    at execMain.c:1567
#15 standard_ExecutorRun (queryDesc=0xdc6fc0, direction=<optimized out>, 
    count=0) at execMain.c:338
#16 0x006ce669 in ProcessQuery (plan=<optimized out>, 
sourceText=0xd8ff70 "UPDATE pgbench_branches SET bbalance = bbalance + 
-2823 WHERE bid = 3;", params=0x0, dest=0xdc9e60, 
completionTag=0x7fffd197dfc0 "") at pquery.c:187
#17 0x006ce89c in PortalRunMulti (portal=portal@entry=0xd35000, 
isT

Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-08-23 Thread Mark Kirkwood

On 24/08/16 16:33, Amit Kapila wrote:

On Wed, Aug 24, 2016 at 9:53 AM, Mark Kirkwood
<mark.kirkw...@catalyst.net.nz> wrote:

On 24/08/16 15:36, Amit Kapila wrote:

On Wed, Aug 24, 2016 at 8:54 AM, Mark Kirkwood
<mark.kirkw...@catalyst.net.nz> wrote:
Can you get the call stacks?


For every stuck backend? (just double checking)...

Yeah.



Cool,

I managed to reproduce with a reduced workload of 4 backends, then 
noticed that the traces for 3 of 'em were all the same. So I've attached 
the 2 unique ones, plus noted what pg_stat_activity thought the wait 
event was, in case that is useful.


Cheers

Mark

$ pgbench -c 4 -T600 bench

postgres=# SELECT pid, state, now()-xact_start AS wait, wait_event_type, wait_event, query
           FROM pg_stat_activity WHERE datname='bench' ORDER BY wait DESC;
  pid  | state  |      wait       | wait_event_type |   wait_event   | query
-------+--------+-----------------+-----------------+----------------+----------------------------------------------------
 16549 | active | 00:06:00.252146 | LWLockTranche   | buffer_content | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (190, 24, 9513890, -2566, CURRENT_TIMESTAMP);
 16548 | active | 00:06:00.235003 | LWLockTranche   | buffer_content | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (445, 3, 5688076, 1021, CURRENT_TIMESTAMP);
 16547 | active | 00:06:00.218032 | LWLockTranche   | buffer_content | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (803, 97, 6871273, -35, CURRENT_TIMESTAMP);
 16546 | active | 00:06:00.192543 | Lock            | transactionid  | UPDATE pgbench_branches SET bbalance = bbalance + -2823 WHERE bid = 3;
(4 rows)


postgres@zmori:~$ gdb -p 16546
0x7f51daa3a133 in __epoll_wait_nocancel ()
at ../sysdeps/unix/syscall-template.S:84
84  ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  0x7f51daa3a133 in __epoll_wait_nocancel ()
at ../sysdeps/unix/syscall-template.S:84
#1  0x006abdb6 in WaitEventSetWaitBlock (nevents=1, 
occurred_events=0x7fffd197d3a0, cur_timeout=-1, set=0xdc9288)
at latch.c:987
#2  WaitEventSetWait (set=set@entry=0xdc9288, timeout=timeout@entry=-1, 
occurred_events=occurred_events@entry=0x7fffd197d3a0, 
nevents=nevents@entry=1) at latch.c:941
#3  0x006ac1ca in WaitLatchOrSocket (latch=0x7f51d9fd969c, 
wakeEvents=wakeEvents@entry=1, sock=sock@entry=-1, timeout=-1, 
timeout@entry=0) at latch.c:347
#4  0x006ac27d in WaitLatch (latch=<optimized out>, 
wakeEvents=wakeEvents@entry=1, timeout=timeout@entry=0) at latch.c:302
#5  0x006bb5e3 in ProcSleep (locallock=locallock@entry=0xd3d690, 
lockMethodTable=lockMethodTable@entry=0x9149a0)
at proc.c:1219
#6  0x006b639d in WaitOnLock (locallock=locallock@entry=0xd3d690, 
owner=owner@entry=0xdc4cb0) at lock.c:1703
#7  0x006b7a77 in LockAcquireExtended (
locktag=locktag@entry=0x7fffd197d700, lockmode=lockmode@entry=5, 
sessionLock=sessionLock@entry=0 '\000', dontWait=dontWait@entry=0 '\000', 
reportMemoryError=reportMemoryError@entry=1 '\001') at lock.c:998
#8  0x006b7d51 in LockAcquire (locktag=locktag@entry=0x7fffd197d700, 
lockmode=lockmode@entry=5, sessionLock=sessionLock@entry=0 '\000', 
dontWait=dontWait@entry=0 '\000') at lock.c:688
#9  0x006b58fe in XactLockTableWait (xid=xid@entry=7667, 
rel=rel@entry=0x7f51db5c6398, ctid=ctid@entry=0x7fffd197d844, 
oper=oper@entry=XLTW_Update) at lmgr.c:587
#10 0x004a5d29 in heap_update (relation=relation@entry=0x7f51db5c6398, 
otid=otid@entry=0x7fffd197d9e0, newtup=newtup@entry=0xd4b3d0, cid=2, 
crosscheck=0x0, wait=wait@entry=1 '\001', hufd=0x7fffd197d8f0, 
lockmode=0x7fffd197d8ec) at heapam.c:3755
#11 0x005dbeaa in ExecUpdate (tupleid=tupleid@entry=0x7fffd197d9e0, 
oldtuple=oldtuple@entry=0x0, slot=slot@entry=0xd4adb0, 
planSlot=planSlot@entry=0xdc8ea0, epqstate=epqstate@entry=0xdc76c8, 
estate=estate@entry=0xdc73d0, canSetTag=1 '\001') at nodeModifyTable.c:922
#12 0x005dc633 in ExecModifyTable (node=node@entry=0xdc7620)
at nodeModifyTable.c:1501
#13 0x005c3948 in ExecProcNode (node=node@entry=0xdc7620)
at execProcnode.c:396
#14 0x005bfdbf in ExecutePlan (dest=0xdc9e60, direction=<optimized out>, 
    numberTuples=0, sendTuples=<optimized out>, operation=CMD_UPDATE, 
    use_parallel_mode=<optimized out>, planstate=0xdc7620, estate=0xdc73d0)
    at execMain.c:1567
#15 standard_ExecutorRun (queryDesc=0xdc6fc0, direction=<optimized out>, 
    count=0) at execMain.c:338
#16 0x006ce669 in ProcessQuery (plan=<optimized out>, 
sourceText=0xd8ff70 "UPDATE pgbench_branches SET bbalance = bbalance + 
-2823 WHERE bid = 3;", params=0x0, dest=0xdc9e60, completionTag=0x7fffd197dfc0 
"")
---

Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-08-23 Thread Mark Kirkwood

On 24/08/16 15:36, Amit Kapila wrote:

On Wed, Aug 24, 2016 at 8:54 AM, Mark Kirkwood
<mark.kirkw...@catalyst.net.nz> wrote:

On 24/08/16 12:09, Mark Kirkwood wrote:

On 23/08/16 15:24, Amit Kapila wrote:


Thoughts?

Note - To use this patch, first apply latest version of concurrent
hash index patch [2].

[1] - https://commitfest.postgresql.org/10/647/
[2] -
https://www.postgresql.org/message-id/caa4ek1lkq_udism-z2dq6cuvjh3db5fnfnnezzbpsrjw0ha...@mail.gmail.com



Firstly - really nice! Patches applied easily etc to latest version 10
checkout.

I thought I'd test by initializing pgbench schema, adding a standby, then
adding some hash indexes and running pgbench:

Looking on the standby:

bench=# \d pgbench_accounts
Table "public.pgbench_accounts"
   Column  | Type  | Modifiers
--+---+---
  aid  | integer   | not null
  bid  | integer   |
  abalance | integer   |
  filler   | character(84) |
Indexes:
 "pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
 "pgbench_accounts_bid" hash (bid)<

bench=# \d pgbench_history
   Table "public.pgbench_history"
  Column |Type | Modifiers
+-+---
  tid| integer |
  bid| integer |
  aid| integer |
  delta  | integer |
  mtime  | timestamp without time zone |
  filler | character(22)   |
Indexes:
 "pgbench_history_bid" hash (bid) <=


they have been replicated there ok.

However I'm seeing a hang on the master after a while:

bench=# SELECT datname,application_name,state,now()-xact_start AS wait,query
        FROM pg_stat_activity ORDER BY wait DESC;
 datname | application_name | state  |      wait       | query
---------+------------------+--------+-----------------+----------------------------------------------------
         | walreceiver      | idle   |                 |
 bench   | pgbench          | active | 00:31:38.367467 | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (921, 38, 251973, -3868, CURRENT_TIMESTAMP);
 bench   | pgbench          | active | 00:31:38.215378 | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (280, 95, 3954814, 2091, CURRENT_TIMESTAMP);
 bench   | pgbench          | active | 00:31:35.991056 | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (447, 33, 8355237, 3438, CURRENT_TIMESTAMP);
 bench   | pgbench          | active | 00:31:35.619798 | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (134, 16, 4839994, -2443, CURRENT_TIMESTAMP);
 bench   | pgbench          | active | 00:31:35.544196 | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (37, 73, 9620119, 4053, CURRENT_TIMESTAMP);
 bench   | pgbench          | active | 00:31:35.334504 | UPDATE pgbench_branches SET bbalance = bbalance + -2954 WHERE bid = 33;
 bench   | pgbench          | active | 00:31:35.234112 | UPDATE pgbench_branches SET bbalance = bbalance + -713 WHERE bid = 38;
 bench   | pgbench          | active | 00:31:34.434676 | UPDATE pgbench_branches SET bbalance = bbalance + -921 WHERE bid = 33;
 bench   | psql             | active | 00:00:00        | SELECT datname,application_name,state,now()-xact_start AS wait,query FROM pg_stat_activity ORDER BY wait DESC;
(10 rows)

but no errors in the logs, any thoughts?

Can you get the call stacks?



For every stuck backend? (just double checking)...


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-08-23 Thread Mark Kirkwood

On 24/08/16 12:09, Mark Kirkwood wrote:

On 23/08/16 15:24, Amit Kapila wrote:


Thoughts?

Note - To use this patch, first apply latest version of concurrent
hash index patch [2].

[1] - https://commitfest.postgresql.org/10/647/
[2] - 
https://www.postgresql.org/message-id/caa4ek1lkq_udism-z2dq6cuvjh3db5fnfnnezzbpsrjw0ha...@mail.gmail.com





Firstly - really nice! Patches applied easily etc to latest version 10 
checkout.


I thought I'd test by initializing pgbench schema, adding a standby, 
then adding some hash indexes and running pgbench:


Looking on the standby:

bench=# \d pgbench_accounts
   Table "public.pgbench_accounts"
  Column  | Type  | Modifiers
--+---+---
 aid  | integer   | not null
 bid  | integer   |
 abalance | integer   |
 filler   | character(84) |
Indexes:
"pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
"pgbench_accounts_bid" hash (bid)<

bench=# \d pgbench_history
  Table "public.pgbench_history"
 Column |Type | Modifiers
+-+---
 tid| integer |
 bid| integer |
 aid| integer |
 delta  | integer |
 mtime  | timestamp without time zone |
 filler | character(22)   |
Indexes:
"pgbench_history_bid" hash (bid) <=


they have been replicated there ok.

However I'm seeing a hang on the master after a while:

bench=# SELECT datname,application_name,state,now()-xact_start AS wait,query
        FROM pg_stat_activity ORDER BY wait DESC;
 datname | application_name | state  |      wait       | query
---------+------------------+--------+-----------------+----------------------------------------------------
         | walreceiver      | idle   |                 |
 bench   | pgbench          | active | 00:31:38.367467 | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (921, 38, 251973, -3868, CURRENT_TIMESTAMP);
 bench   | pgbench          | active | 00:31:38.215378 | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (280, 95, 3954814, 2091, CURRENT_TIMESTAMP);
 bench   | pgbench          | active | 00:31:35.991056 | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (447, 33, 8355237, 3438, CURRENT_TIMESTAMP);
 bench   | pgbench          | active | 00:31:35.619798 | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (134, 16, 4839994, -2443, CURRENT_TIMESTAMP);
 bench   | pgbench          | active | 00:31:35.544196 | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (37, 73, 9620119, 4053, CURRENT_TIMESTAMP);
 bench   | pgbench          | active | 00:31:35.334504 | UPDATE pgbench_branches SET bbalance = bbalance + -2954 WHERE bid = 33;
 bench   | pgbench          | active | 00:31:35.234112 | UPDATE pgbench_branches SET bbalance = bbalance + -713 WHERE bid = 38;
 bench   | pgbench          | active | 00:31:34.434676 | UPDATE pgbench_branches SET bbalance = bbalance + -921 WHERE bid = 33;
 bench   | psql             | active | 00:00:00        | SELECT datname,application_name,state,now()-xact_start AS wait,query FROM pg_stat_activity ORDER BY wait DESC;
(10 rows)

but no errors in the logs, any thoughts?



FWIW, retesting with the WAL logging patch removed (i.e. leaving the 
concurrent hash one) works fine.



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-08-23 Thread Mark Kirkwood

On 23/08/16 15:24, Amit Kapila wrote:


Thoughts?

Note - To use this patch, first apply latest version of concurrent
hash index patch [2].

[1] - https://commitfest.postgresql.org/10/647/
[2] - 
https://www.postgresql.org/message-id/caa4ek1lkq_udism-z2dq6cuvjh3db5fnfnnezzbpsrjw0ha...@mail.gmail.com




Firstly - really nice! Patches applied easily etc to latest version 10 
checkout.


I thought I'd test by initializing pgbench schema, adding a standby, 
then adding some hash indexes and running pgbench:


Looking on the standby:

bench=# \d pgbench_accounts
   Table "public.pgbench_accounts"
  Column  | Type  | Modifiers
--+---+---
 aid  | integer   | not null
 bid  | integer   |
 abalance | integer   |
 filler   | character(84) |
Indexes:
"pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
"pgbench_accounts_bid" hash (bid)<

bench=# \d pgbench_history
  Table "public.pgbench_history"
 Column |Type | Modifiers
+-+---
 tid| integer |
 bid| integer |
 aid| integer |
 delta  | integer |
 mtime  | timestamp without time zone |
 filler | character(22)   |
Indexes:
"pgbench_history_bid" hash (bid) <=


they have been replicated there ok.

However I'm seeing a hang on the master after a while:

bench=# SELECT datname,application_name,state,now()-xact_start AS wait,query
        FROM pg_stat_activity ORDER BY wait DESC;
 datname | application_name | state  |      wait       | query
---------+------------------+--------+-----------------+----------------------------------------------------
         | walreceiver      | idle   |                 |
 bench   | pgbench          | active | 00:31:38.367467 | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (921, 38, 251973, -3868, CURRENT_TIMESTAMP);
 bench   | pgbench          | active | 00:31:38.215378 | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (280, 95, 3954814, 2091, CURRENT_TIMESTAMP);
 bench   | pgbench          | active | 00:31:35.991056 | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (447, 33, 8355237, 3438, CURRENT_TIMESTAMP);
 bench   | pgbench          | active | 00:31:35.619798 | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (134, 16, 4839994, -2443, CURRENT_TIMESTAMP);
 bench   | pgbench          | active | 00:31:35.544196 | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (37, 73, 9620119, 4053, CURRENT_TIMESTAMP);
 bench   | pgbench          | active | 00:31:35.334504 | UPDATE pgbench_branches SET bbalance = bbalance + -2954 WHERE bid = 33;
 bench   | pgbench          | active | 00:31:35.234112 | UPDATE pgbench_branches SET bbalance = bbalance + -713 WHERE bid = 38;
 bench   | pgbench          | active | 00:31:34.434676 | UPDATE pgbench_branches SET bbalance = bbalance + -921 WHERE bid = 33;
 bench   | psql             | active | 00:00:00        | SELECT datname,application_name,state,now()-xact_start AS wait,query FROM pg_stat_activity ORDER BY wait DESC;
(10 rows)

but no errors in the logs, any thoughts?

Cheers

Mark







--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Btree Index on PostgreSQL and Wiredtiger (MongoDB3.2)

2016-08-15 Thread Mark Kirkwood

On 13/08/16 05:44, Jeff Janes wrote:

On Fri, Aug 12, 2016 at 1:40 AM, Mark Kirkwood

However your index rebuild gets you from 5 to 3 GB - does that really help
performance significantly?

It can make a big difference, depending on how much RAM you have.



Yeah - I suspect this is the issue - loading up a similar type of schema 
with records with a primary key of the form 'user<x>' for (uniformly) 
randomly distributed x... (I was gonna use the Yahoo benchmark but 
it is so slow...). Also I'm using 10 million rows to avoid waiting a 
long time (that should be enough to show the point):


prefix=# \d prefix
            Table "public.prefix"
 Column |         Type          | Modifiers
--------+-----------------------+-----------
 uid    | character varying(30) | not null
 filler | character(255)        |
Indexes:
    "prefix_pkey" PRIMARY KEY, btree (uid)

Doing an uncached indexed read by forcing a buffer cache clear:

# echo 3 > /proc/sys/vm/drop_caches

prefix=# SELECT relfilenode, relname, reltuples, pg_relation_size(oid)/1024/1024 AS mb
         FROM pg_class WHERE relname LIKE 'prefix%';
 relfilenode |   relname   | reltuples | mb
-------------+-------------+-----------+-----
     6017817 | prefix      |     1e+07 | 422
     6017819 | prefix_pkey |     1e+07 | 391
(2 rows)

prefix=# EXPLAIN ANALYZE SELECT count(*) FROM prefix WHERE uid='user1';
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=8.46..8.46 rows=1 width=0) (actual time=3.408..3.408 rows=1 loops=1)
   ->  Index Only Scan using prefix_pkey on prefix  (cost=0.43..8.45 rows=1 width=0) (actual time=3.406..3.406 rows=0 loops=1)
         Index Cond: (uid = 'user1'::text)
         Heap Fetches: 0
 Planning time: 19.362 ms
 Execution time: 3.429 ms
(6 rows)

Repeating this after REINDEX:

# echo 3 > /proc/sys/vm/drop_caches

prefix=# SELECT relfilenode, relname, reltuples, pg_relation_size(oid)/1024/1024 AS mb
         FROM pg_class WHERE relname LIKE 'prefix%';
 relfilenode |   relname   | reltuples | mb
-------------+-------------+-----------+-----
     6017817 | prefix      |     1e+07 | 422
     6017819 | prefix_pkey |     1e+07 | 300
(2 rows)

prefix=# EXPLAIN ANALYZE SELECT count(*) FROM prefix WHERE uid='user1';
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=8.46..8.46 rows=1 width=0) (actual time=3.868..3.868 rows=1 loops=1)
   ->  Index Only Scan using prefix_pkey on prefix  (cost=0.43..8.45 rows=1 width=0) (actual time=3.866..3.866 rows=0 loops=1)
         Index Cond: (uid = 'user1'::text)
         Heap Fetches: 0
 Planning time: 19.366 ms
 Execution time: 3.889 ms
(6 rows)

So certainly not significantly *slower* with the physically bigger 
index here. This suggests that Jeff's analysis was spot on - likely that 
in the original report the larger index didn't fit in RAM.


Cheers

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Btree Index on PostgreSQL and Wiredtiger (MongoDB3.2)

2016-08-12 Thread Mark Kirkwood
After examining the benchmark design, I see we are probably not being 
helped by the repeated insertion of keys all of the form 'userxxx', 
leading to some page splitting.


However your index rebuild gets you from 5 to 3 GB - does that really 
help performance significantly?


regards

Mark

On 11/08/16 16:08, Kisung Kim wrote:

Thank you for your information.
Here is the result:

After insertions:

ycsb=# select * from pgstatindex('usertable_pkey');
 version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation
---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
       2 |          3 | 5488721920 |         44337 |           4464 |     665545 |           0 |             0 |               52 |                 11
(1 row)

After rebuild:


ycsb=# select * from pgstatindex('usertable_pkey');
 version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation
---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
       2 |          3 | 3154296832 |         41827 |           1899 |     383146 |           0 |             0 |            90.08 |                  0
(1 row)



It seems that rebuilding reduces the number of internal and leaf pages 
and makes the leaf pages more dense.








--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Why we lost Uber as a user

2016-08-02 Thread Mark Kirkwood

On 03/08/16 02:27, Robert Haas wrote:


Personally, I think that incremental surgery on our current heap
format to try to fix this is not going to get very far.  If you look
at the history of this, 8.3 was a huge release for timely cleanup of
dead tuples.  There was also significant progress in 8.4 as a result of
5da9da71c44f27ba48fdad08ef263bf70e43e689.   As far as I can recall, we
then made no progress at all in 9.0 - 9.4.  We made a very small
improvement in 9.5 with 94028691609f8e148bd4ce72c46163f018832a5b, but
that's pretty niche.  In 9.6, we have "snapshot too old", which I'd
argue is potentially a large improvement, but it was big and invasive
and will no doubt pose code maintenance hazards in the years to come;
also, many people won't be able to use it or won't realize that they
should use it.  I think it is likely that further incremental
improvements here will be quite hard to find, and the amount of effort
will be large relative to the amount of benefit.  I think we need a
new storage format where the bloat is cleanly separated from the data
rather than intermingled with it; every other major RDBMS works that
way.  Perhaps this is a case of "the grass is greener on the other
side of the fence", but I don't think so.


Yeah, I think this is a good summary of the state of play.

The only other new db development to use a non-overwriting design like 
ours that I know of was Jim Starkey's Falcon engine for (ironically) 
MySQL 6.0. Not sure if anyone is still progressing that at all now.


I do wonder if Uber could have successfully tamed dead tuple bloat with 
aggressive per-table autovacuum settings (and whether in fact they tried), 
but as I think Robert said earlier, it is pretty easy to come up with a 
heavily-update (or insert + delete) workload that makes for a pretty ugly 
bloat component even with really aggressive autovacuuming.
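
For reference, the kind of per-table settings I mean (table name and 
numbers purely illustrative):

  ALTER TABLE hot_table SET (autovacuum_vacuum_scale_factor = 0.0,
                             autovacuum_vacuum_threshold = 10000,
                             autovacuum_vacuum_cost_delay = 0);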


Cheers

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How can we expand PostgreSQL ecosystem?

2016-03-05 Thread Mark Kirkwood
On 06/03/16 18:29, MauMau wrote:
> As I said in the previous greeting mail, I'd like to discuss how to 
> expand PostgreSQL ecosystem.  Here, ecosystem means "interoperability" 
> -- the software products and cloud services which use/support 
> PostgreSQL.  If pgsql-advocacy or somewhere else is better for this 
> topic, just tell me so.

For cloud - in particular Openstack (which I am working with ATM), the
biggest thing would be:

- multi-master replication

or failing that:

- self managing single master failover (voting/quorum etc)

so that operators can essentially 'set and forget'. We currently use
Mysql+ Galera (multi master) and Mongodb (self managing single master)
and the convenience and simplicity is just so important (Openstack is a
huge complex collection of services - hand holding of any one service is
pretty much a non starter).

regards

Mark


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] planstats.sgml

2016-02-16 Thread Mark Kirkwood

On 16/02/16 20:01, Tatsuo Ishii wrote:

Tatsuo Ishii  writes:

While reading planstats.sgml, I encounted a sentence which I don't
understand.



 These numbers are current as of the last VACUUM or
 ANALYZE on the table.  The planner then fetches the
 actual current number of pages in the table (this is a cheap operation,
 not requiring a table scan).  If that is different from
 relpages then
 reltuples is scaled accordingly to
 arrive at a current number-of-rows estimate.  In this case the value of
 relpages is up-to-date so the rows estimate is
 the same as reltuples.



I don't understand the last sentence (In this case...). For me it
seems it is talking about the case when relpages is not different from
what the planner fetches from the table. If so, why "In this case"?


I think what it meant is "In the example above, ...".  Feel free to
change it if you think that is clearer.


Oh, I see. I'm going to change "In this case" to "In the example above".


Done.



Yeah, that is clearer.



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] planstats.sgml

2016-02-15 Thread Mark Kirkwood

On 16/02/16 12:46, David G. Johnston wrote:

On Mon, Feb 15, 2016 at 4:23 PM, Tatsuo Ishii wrote:

While reading planstats.sgml, I encounted a sentence which I don't
understand.

 These numbers are current as of the last VACUUM or
 ANALYZE on the table.  The planner then fetches the
 actual current number of pages in the table (this is a cheap
operation,
 not requiring a table scan).  If that is different from
 relpages then
 reltuples is scaled accordingly to
 arrive at a current number-of-rows estimate.  In this case the
value of
 relpages is up-to-date so the rows
estimate is
 the same as reltuples.

I don't understand the last sentence (In this case...). For me it
seems it is talking about the case when relpages is not different from
what the planner fetches from the table. If so, why "In this case"?
Isn't "In this case" referrers to the previous sentence (If that is
different from...)? Maybe "In this case" should be "Otherwise" or some
such?


The whole sentence is awkward but you are correct in your reading - and
"otherwise" would be a solid choice.



A long while ago when I wrote that, I was expressing the fact that *in 
the example* the numbers matched perfectly, but *in general* the planner 
would scale 'em. It still reads that way to me...but change away if you 
like :-)



Though it seems the whole thing could be simplified to:

These numbers are current as of the last VACUUM or ANALYZE on the
table.  To account for subsequent changes the planner obtains the actual
page count for the table and scales pg_class.reltuples by a factor of
"actual page count" over pg_class.relpages.

The "cheap operation" comment seems gratuitous though could still be
injected if desired.



Well, folk interested in performance like to keep an eye on whether 
these sorts of probes cost a lot...
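
(For illustration, the scaling described above amounts to the following
query - the table name "t" is hypothetical:)

SELECT reltuples
       * (pg_relation_size(oid) / current_setting('block_size')::bigint)
       / relpages AS row_estimate
FROM pg_class
WHERE relname = 't' AND relpages > 0;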


regards

Mark




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Mark Kirkwood

On 01/09/15 21:41, Bruce Momjian wrote:


Well, reworking our partitioning system is one of the things required
for sharding, so at least we will clean up one mess while we create
another.  ;-)

Seem my post to Josh Berkus just now --- I think if we don't use FDWs,
that sharding is such a limited use-case that we will not implement it
inside of Postgres.



I'm thinking that partitioning and sharding are two different things:

Partitioning is about reducing the amount of table data accessed and 
also perhaps easing admin activities (archiving/removing old stuff etc).


Sharding is about parallelism and redundancy...copies of stuff in 
different places and concurrent access by virtue of it being on 
different nodes!


Now *maybe* FDW is a good way to approach this, but really would be nice 
to see a more rigorous analysis (I note that like XC and XL, Greenplum 
looked at the existing mechanisms around at the time and ended up 
writing their own). Now I'm aware that things have moved on - but I 
think there needs to be a proper discussion about design and what we 
think distributed data/sharding etc should provide *before* grabbing 
hold of FDW as *the answer*!


Regards

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_xlog - pg_xjournal?

2015-06-02 Thread Mark Kirkwood

On 01/06/15 05:29, Joel Jacobson wrote:

While anyone who is familiar with postgres would never do something as
stupid as to delete pg_xlog,
according to Google, there appears to be a fair amount of end-users out
there having made the irrevocable mistake of deleting the precious
directory,
a decision made on the assumption that since it has *log* in the name
it must be unimportant
(http://stackoverflow.com/questions/12897429/what-does-pg-resetxlog-do-and-how-does-it-work).

If we could turn back time, would we have picked pg_xlog as the most
optimal name for this important directory, or would we have come up with
a more user-friendly name?

Personally, I have never had any problems with pg_xlog, but I realize
there are quite a few unlucky new users who end up in trouble.

My suggestion is to use pg_xjournal instead of pg_xlog when new
users create a new data directory using initdb, and allow for both
directories to exist (exclusive or, i.e. either one or the other, but
not both). That way we don't complicate the life for any existing users,
all their tools will continue to work who rely on pg_xlog to be named
pg_xlog, but only force new users to do a bit of googling when they
can't use whatever tool that can't find pg_xlog. When they find out it's
an important directory, they can simply create a symlink and their old
not yet updated tool will work again.

Thoughts?



+1

Strongly agree - I have had people on my dba course ask about deleting 
these pesky 'log' directories (obvious confusion/conflation with pg_log 
...)! A change of name would help reduce the temptation!
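
For what it's worth, the compatibility shim Joel describes is trivial
(a sketch, assuming the proposed new directory name):

$ cd $PGDATA
$ ln -s pg_xjournal pg_xlog    # old tools that expect pg_xlog keep working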


regards

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove fsync ON/OFF as a visible option?

2015-03-22 Thread Mark Kirkwood

On 22/03/15 08:14, Jaime Casanova wrote:


On Mar 21, 2015 2:00 AM, Mark Kirkwood mark.kirkw...@catalyst.net.nz
wrote:
 
  On 21/03/15 19:28, Jaime Casanova wrote:
 
  what about not removing it but not showing it in postgresql.conf? as a
  side note, i wonder why trace_sort is not in postgresql.conf...
  other option is to make it a compile setting, that way if you want to
  have it you need to compile and postgres' developers do that routinely
  anyway
 
 
  -1
 
  Personally I'm against hiding *any* settings. Choosing sensible
defaults - yes! Hiding them - that reeks of secret squirrel nonsense and
overpaid Oracle dbas that knew the undocumented settings for various
capabilities. I think/hope that no open source project will try to
emulate that meme!
 

That ship has already sailed.

http://www.postgresql.org/docs/9.4/static/runtime-config-developer.html



Not really - they are documented in the official doc repo (that was the 
point I was making I think), and +1 for adding or improving the 
documentation for some of the more dangerous ones!


While I'm against removing or hiding settings, I have no problem with 
shipping/generating a postgresql.conf that has *only* the non-default 
settings therein, as that requires people to look at the docs where (of 
course) we have some sensible discussion about how to set the rest of 'em.
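
Such a shipped file might then contain nothing more than a handful of
lines (values purely illustrative):

listen_addresses = '*'
shared_buffers = 2GB
max_connections = 200
logging_collector = on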


I note that Mysql ships a pretty minimal config file these days (5.5, 
5.6) on Ubuntu, and that seems to cause no particular problem.


regards

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove fsync ON/OFF as a visible option?

2015-03-21 Thread Mark Kirkwood

On 21/03/15 19:28, Jaime Casanova wrote:

On Fri, Mar 20, 2015 at 11:29 PM, Michael Paquier
michael.paqu...@gmail.com wrote:

On Sat, Mar 21, 2015 at 2:47 AM, Peter Geoghegan p...@heroku.com wrote:

On Fri, Mar 20, 2015 at 9:52 AM, Joshua D. Drake j...@commandprompt.com wrote:

There are just as many people that are running with scissors that are now
running (or attempting to run) our elephant in production. Does it make
sense to remove fsync (and possibly full_page_writes) from such a visible
place as postgresql.conf?


-1

Anyone turning off fsync without even for a moment considering the
consequences has only themselves to blame. I can't imagine why you'd
want to remove full_page_writes or make it less visible either, since
in principle it ought to be perfectly fine to turn it off in
production once its verified as safe.


-1 for its removal as well. It is still useful for developers to
emulate CPU-bounded loads...


I fought to remove fsync before so I understand JD's concerns. And yes,
I have seen fsync=off in the field too...

what about not removing it but not showing it in postgresql.conf? as a
side note, i wonder why trace_sort is not in postgresql.conf...
other option is to make it a compile setting, that way if you want to
have it you need to compile and postgres' developers do that routinely
anyway



-1

Personally I'm against hiding *any* settings. Choosing sensible defaults 
- yes! Hiding them - that reeks of secret squirrel nonsense and overpaid 
Oracle dbas that knew the undocumented settings for various 
capabilities. I think/hope that no open source project will try to 
emulate that meme!


Regards

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)

2015-02-25 Thread Mark Kirkwood

On 26/02/15 13:32, Simon Riggs wrote:

On 25 February 2015 at 23:30, Jim Nasby jim.na...@bluetreble.com wrote:


That said, I don't want to block this; I think it's useful. Though, perhaps
it would be better as an extension instead of in contrib? I don't think it
should be very version dependent?


The whole point of this is to get it into contrib.

It could have published it as an extension months ago.



Is this intended to replace pg_freespacemap? Not trying to be 
overprotective or anything :-) but is this new extension/module 
better/faster/stronger/more accurate etc? If so then excellent! Otherwise 
I was wondering if people either a) didn't know about pg_freespacemap or 
b) consider that it is crap and won't use it.


Cheers

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unable to build pg_rewind

2015-02-24 Thread Mark Kirkwood

On 25/02/15 12:47, Michael Paquier wrote:



On Wed, Feb 25, 2015 at 8:03 AM, Mark Kirkwood
mark.kirkw...@catalyst.net.nz wrote:

On 25/02/15 11:06, Ratay, Steve wrote:

I have checked out the pg_rewind code from
https://github.com/vmware/pg_rewind.git on the master branch and
am using PostgreSQL 9.4.1 source code to build against.  When I
try to compile pg_rewind, I am getting the following errors.
How can I resolve these problems?


It looks like master of pg_rewind expects to be built against
Postgres 9.5. Try checking out the REL9_4_STABLE branch.


  This branching model makes management of the code easier because
WAL-related APIs and WAL format can change a lot between major releases
(and also because in this case it would have made the code less readable
using blocks based on PG_VERSION_NUM for not that much result).



Absolutely - and you've used the same branch names as the postgresql git 
repo does - so really nice and easy to match things up!


Cheers

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unable to build pg_rewind

2015-02-24 Thread Mark Kirkwood

On 25/02/15 11:06, Ratay, Steve wrote:

I have checked out the pg_rewind code from 
https://github.com/vmware/pg_rewind.git on the master branch and am using 
PostgreSQL 9.4.1 source code to build against.  When I try to compile 
pg_rewind, I am getting the following errors.  How can I resolve these problems?


gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement 
-Wendif-labels -Wmissing-format-attribute -Wformat-security 
-fno-strict-aliasing -fwrapv -O2 -g -m64 -fmessage-length=0 -D_FORTIFY_SOURCE=2 
-fstack-protector -funwind-tables -fasynchronous-unwind-tables 
-I/var/tmp/vpagent/postgresql94-libs-9.4.1-build/usr/include/pgsql -I. -I./ 
-I/var/tmp/vpagent/postgresql94-libs-9.4.1-build/usr/include/pgsql/server 
-I/var/tmp/vpagent/postgresql94-libs-9.4.1-build/usr/include/pgsql/internal 
-DFRONTEND   -c -o parsexlog.o parsexlog.c

In file included from parsexlog.c:39:

/var/tmp/vpagent/postgresql94-libs-9.4.1-build/usr/include/pgsql/server/access/rmgrlist.h:28:61:
 error: macro PG_RMGR requires 7 arguments, but only 6 given

/var/tmp/vpagent/postgresql94-libs-9.4.1-build/usr/include/pgsql/server/access/rmgrlist.h:29:68:
 error: macro PG_RMGR requires 7 arguments, but only 6 given

In file included from parsexlog.c:39:

/var/tmp/vpagent/postgresql94-libs-9.4.1-build/usr/include/pgsql/server/access/rmgrlist.h:28:
 error: ‘PG_RMGR’ undeclared here (not in a function)

/var/tmp/vpagent/postgresql94-libs-9.4.1-build/usr/include/pgsql/server/access/rmgrlist.h:29:
 error: expected ‘}’ before ‘PG_RMGR’

/var/tmp/vpagent/postgresql94-libs-9.4.1-build/usr/include/pgsql/server/access/rmgrlist.h:30:64:
 error: macro PG_RMGR requires 7 arguments, but only 6 given

/var/tmp/vpagent/postgresql94-libs-9.4.1-build/usr/include/pgsql/server/access/rmgrlist.h:31:61:
 error: macro PG_RMGR requires 7 arguments, but only 6 given

/var/tmp/vpagent/postgresql94-libs-9.4.1-build/usr/include/pgsql/server/access/rmgrlist.h:32:68:
 error: macro PG_RMGR requires 7 arguments, but only 6 given

/var/tmp/vpagent/postgresql94-libs-9.4.1-build/usr/include/pgsql/server/access/rmgrlist.h:33:73:
 error: macro PG_RMGR requires 7 arguments, but only 6 given

/var/tmp/vpagent/postgresql94-libs-9.4.1-build/usr/include/pgsql/server/access/rmgrlist.h:34:81:
 error: macro PG_RMGR requires 7 arguments, but only 6 given

/var/tmp/vpagent/postgresql94-libs-9.4.1-build/usr/include/pgsql/server/access/rmgrlist.h:35:69:
 error: macro PG_RMGR requires 7 arguments, but only 6 given

/var/tmp/vpagent/postgresql94-libs-9.4.1-build/usr/include/pgsql/server/access/rmgrlist.h:36:73:
 error: macro PG_RMGR requires 7 arguments, but only 6 given

/var/tmp/vpagent/postgresql94-libs-9.4.1-build/usr/include/pgsql/server/access/rmgrlist.h:37:65:
 error: macro PG_RMGR requires 7 arguments, but only 6 given

/var/tmp/vpagent/postgresql94-libs-9.4.1-build/usr/include/pgsql/server/access/rmgrlist.h:38:61:
 error: macro PG_RMGR requires 7 arguments, but only 6 given

/var/tmp/vpagent/postgresql94-libs-9.4.1-build/usr/include/pgsql/server/access/rmgrlist.h:39:65:
 error: macro PG_RMGR requires 7 arguments, but only 6 given

/var/tmp/vpagent/postgresql94-libs-9.4.1-build/usr/include/pgsql/server/access/rmgrlist.h:40:61:
 error: macro PG_RMGR requires 7 arguments, but only 6 given

/var/tmp/vpagent/postgresql94-libs-9.4.1-build/usr/include/pgsql/server/access/rmgrlist.h:41:81:
 error: macro PG_RMGR requires 7 arguments, but only 6 given

/var/tmp/vpagent/postgresql94-libs-9.4.1-build/usr/include/pgsql/server/access/rmgrlist.h:42:87:
 error: macro PG_RMGR requires 7 arguments, but only 6 given

/var/tmp/vpagent/postgresql94-libs-9.4.1-build/usr/include/pgsql/server/access/rmgrlist.h:43:62:
 error: macro PG_RMGR requires 7 arguments, but only 6 given

/var/tmp/vpagent/postgresql94-libs-9.4.1-build/usr/include/pgsql/server/access/rmgrlist.h:44:87:
 error: macro PG_RMGR requires 7 arguments, but only 6 given

parsexlog.c: In function ‘findLastCheckpoint’:

parsexlog.c:198: warning: implicit declaration of function ‘XLogRecGetInfo’

parsexlog.c:200: warning: implicit declaration of function ‘XLogRecGetRmid’

parsexlog.c: In function ‘extractPageInfo’:

parsexlog.c:344: error: ‘XLR_SPECIAL_REL_UPDATE’ undeclared (first use in this 
function)

parsexlog.c:344: error: (Each undeclared identifier is reported only once

parsexlog.c:344: error: for each function it appears in.)

parsexlog.c:358: error: ‘XLogReaderState’ has no member named ‘max_block_id’

parsexlog.c:364: warning: implicit declaration of function ‘XLogRecGetBlockTag’

make: *** [parsexlog.o] Error 1




It looks like master of pg_rewind expects to be built against Postgres 
9.5. Try checking out the REL9_4_STABLE branch.
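
Something along these lines should do it (a sketch - it assumes the
pg_config for your 9.4 install is on the PATH, since the standalone
build goes via PGXS):

$ git clone https://github.com/vmware/pg_rewind.git
$ cd pg_rewind
$ git checkout REL9_4_STABLE
$ make USE_PGXS=1 install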


Cheers

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] List of table names of a DB

2015-01-08 Thread Mark Kirkwood

Actually, the code has moved to:

https://github.com/snaga/pqc

On 09/01/15 19:53, Mark Kirkwood wrote:

Also see:

https://code.google.com/p/pqc/

A project to implement a query cache using pgpool code, probably lots of
good ideas there.

Cheers

Mark

On 09/01/15 19:38, Tatsuo Ishii wrote:

Hi,

pgpool-II (pgpool.net) does exactly the same thing.

It receives SELECT queries from clients, 1) parses them to find table names,
and 2) gets the oids (unique identifiers in the PostgreSQL system
catalog) to recognize them. When the SELECT succeeds, it stores the
query result (cache) in either shared memory or memcached according to
the user's choice. For query cache invalidation, pgpool-II remembers all
oids related to the SELECTs which are the source of the query cache. If one
of the tables gets updated, pgpool-II invalidates all of the cached results
using the oid.

For #1, pgpool-II has a query parser copied from PostgreSQL.

pgpool-II is an open source project, so you could get some idea to
implement your own tool.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


Sorry, it's not about querying. I am implementing an invalidation
mechanism for Postgres Query Cache as part of my masters project. In
order to do this, I need to store details (like name) of each table the
query uses. In essence, I need to store the table names of the cached
queries.
Initially, I thought of writing code that could extract the table
names but later discovered that it is a gargantuan task, as I shall
have to include around 600 production rules, as was hinted in a
Stack Overflow post. Hence, I thought of getting hold of the
data structure used for storing table names of a DB, but I couldn't
get it.
Sorry for the long post, but do you know where this table
information of a DB gets stored? Or can you suggest a
mechanism (needn't be fully perfect) to extract the table names? I
went through the parser of postgres but it was confusing.
Thanks

Date: Fri, 9 Jan 2015 00:46:30 +1300
Subject: Re: [HACKERS] List of table names of a DB
From: dgrowle...@gmail.com
To: in.live...@live.in
CC: pgsql-hackers@postgresql.org

On 8 January 2015 at 22:41, Deepak S in.live...@live.in wrote:



Hello, I am unable to find the function which retrieves the 'list of
names of the tables' used in a DB.

You may find what you want in: select table_name from
information_schema.tables;
http://www.postgresql.org/docs/9.4/static/infoschema-tables.html

  Reason: I need a mechanism by which I can extract the names of the
tables used in a query which has been parsed. My plan is to check for
a match of each word in the query with a list of names of the tables
used in the current DB so that each hit confirms a 'table name' in
the query in most cases.

This sounds rather flaky.
Maybe it would be better to just EXPLAIN the query and see if you get
error code 42P01:

postgres=# \set VERBOSITY verbose
postgres=# explain select * from doesnotexist;
ERROR:  42P01: relation "doesnotexist" does not exist
LINE 1: explain select * from doesnotexist;

Unfortunately this won't help you much if you want to know all of the
tables that don't exist.
Also, just for the future, a question like this might be more suited
for the pgsql-gene...@postgresql.org list.
Regards
David Rowley









--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] List of table names of a DB

2015-01-08 Thread Mark Kirkwood

Also see:

https://code.google.com/p/pqc/

A project to implement a query cache using pgpool code, probably lots of 
good ideas there.


Cheers

Mark

On 09/01/15 19:38, Tatsuo Ishii wrote:

Hi,

pgpool-II (pgpool.net) does exactly the same thing.

It receives SELECT queries from clients, 1) parses them to find table names,
and 2) gets the oids (unique identifiers in the PostgreSQL system
catalog) to recognize them. When the SELECT succeeds, it stores the
query result (cache) in either shared memory or memcached according to
the user's choice. For query cache invalidation, pgpool-II remembers all
oids related to the SELECTs which are the source of the query cache. If one
of the tables gets updated, pgpool-II invalidates all of the cached results
using the oid.

For #1, pgpool-II has a query parser copied from PostgreSQL.

pgpool-II is an open source project, so you could get some idea to
implement your own tool.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


Sorry, it's not about querying. I am implementing an invalidation mechanism for 
Postgres Query Cache as part of my masters project. In order to do this, I need to 
store details (like name) of each table the query uses. In essence, I need to 
store the table names of the cached queries.
Initially, I thought of writing code that could extract the table names but 
later discovered that it is a gargantuan task, as I shall have to include around 
600 production rules, as was hinted in a Stack Overflow post. Hence, I 
thought of getting hold of the data structure used for storing table names of a 
DB, but I couldn't get it.
Sorry for the long post, but do you know where this table information of a DB 
gets stored? Or can you suggest a mechanism (needn't be fully perfect) to 
extract the table names? I went through the parser of postgres but it was 
confusing.
Thanks

Date: Fri, 9 Jan 2015 00:46:30 +1300
Subject: Re: [HACKERS] List of table names of a DB
From: dgrowle...@gmail.com
To: in.live...@live.in
CC: pgsql-hackers@postgresql.org

On 8 January 2015 at 22:41, Deepak S in.live...@live.in wrote:



Hello, I am unable to find the function which retrieves the 'list of names of 
the tables' used in a DB.

You may find what you want in: select table_name from information_schema.tables;
http://www.postgresql.org/docs/9.4/static/infoschema-tables.html

  Reason: I need a mechanism by which I can extract the names of the tables 
used in a query which has been parsed. My plan is to check for a match of each 
word in the query with a list of names of the tables used in the current DB so 
that each hit confirms a 'table name' in the query in most cases.

This sounds rather flaky.
Maybe it would be better to just EXPLAIN the query and see if you get
error code 42P01:

postgres=# \set VERBOSITY verbose
postgres=# explain select * from doesnotexist;
ERROR:  42P01: relation "doesnotexist" does not exist
LINE 1: explain select * from doesnotexist;

Unfortunately this won't help you much if you want to know all of the tables 
that don't exist.
Also, just for the future, a question like this might be more suited for the 
pgsql-gene...@postgresql.org list.
Regards
David Rowley








--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] mysql with postgres

2014-12-23 Thread Mark Kirkwood

On 23/12/14 22:36, Ravi Kiran wrote:

hi all,


Is postgres source code compatible with mysql database? If it is, could
someone give me some links so that I can do that.

I want to hack into the postgres source code, but as I am comfortable
with mysql, I want to use the mysql database not postgres.

any references would be fine.



I'm wondering if you are thinking that you can use Postgres as a Mysql 
storage engine? While Mysql does have pluggable storage 
engines...Postgres is not designed to be used in this way 
(that would be an interesting - but big and probably controversial - 
project to undertake).


So if you are more familiar with Mysql, why not hack Innodb?

Regards

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Commitfest problems

2014-12-19 Thread Mark Kirkwood

On 19/12/14 20:48, Andres Freund wrote:

On 2014-12-18 10:02:25 -0800, Joshua D. Drake wrote:

I think a lot of hackers forget exactly how tender their egos are. Now I say
this knowing that a lot of them will go, Oh give me a break but as someone
who employs hackers, deals with open source AND normal people :P every
single day, I can tell you without a single inch of sarcasm that petting
egos is one of the ways you get things done in the open source (and really
any male dominated) community.


To me that's a bit over the top stereotyping.



+1

Having been mentioned one or two times myself - it was an unexpected 
"wow - cool" rather than a grumpy/fragile "I must be noticed" thing. I 
think some folk have forgotten the underlying principle of the open 
source community - it is about freely giving - time or code etc. The 
"there must be something in it for me me me" meme is - well - the 
*other* world view.



However, doing crappy work (and let's not be shy about it, there is NOTHING
fun about reviewing someone else's code) needs to have incentive.


FWIW, I don't agree with this at all. Reviewing code can be quite
interesting - with the one constraint that the problem the patch solves
needs to be somewhat interesting. The latter is what I think gets many
of the more experienced reviewers - lots of the patches just solve stuff
they don't care about.



Yeah, and also it helps if the patch addresses an area that you at least 
know *something* about - otherwise it is really hard to review in any 
useful way.


Cheers

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Commitfest problems

2014-12-19 Thread Mark Kirkwood

On 20/12/14 11:22, Joshua D. Drake wrote:


On 12/19/2014 12:28 AM, Mark Kirkwood wrote:


To me that's a bit over the top stereotyping.



+1

Having been mentioned one or two times myself - it was an unexpected
"wow - cool" rather than a grumpy/fragile "I must be noticed" thing. I
think some folk have forgotten the underlying principle of the open
source community - it is about freely giving - time or code etc. The
"there must be something in it for me me me" meme is - well - the
*other* world view.


It was supposed to be over the top. That doesn't make it any less true.
Sure there are plenty of us that don't have any of the ego petting
issues. However, there are more of us in those of us that think we
don't, that really, really do.



Heh - the fact that even you are stating it is over the top clearly 
makes it less *generally* true. Sure, there are some/few(?) folk who see 
open source contributions purely as a CV enhancer, and perhaps a 
few in denial about their egos, but I don't see that as the common trend 
in this community (which is great).


regards

Mark





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vitesse DB call for testing

2014-10-17 Thread Mark Kirkwood

On 18/10/14 07:13, Josh Berkus wrote:

CK,

Before we go any further on this, how is Vitesse currently licensed?
last time we talked it was still proprietary.  If it's not being
open-sourced, we likely need to take discussion off this list.



+1

Guys, you need to 'fess up on the licensing!

Regards

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixed xloginsert_locks for 9.4

2014-10-03 Thread Mark Kirkwood

On 04/10/14 11:21, Andres Freund wrote:

On 2014-10-03 18:16:28 -0400, Bruce Momjian wrote:

On Sat, Oct  4, 2014 at 12:13:00AM +0200, Andres Freund wrote:

Do we really want to expose a setting a few of us _might_ ask customers
to change?


They also will try that themselves. Our customers aren't a horde of dumb
people. Some of them are willing to try things if they hit scalability
problems. And *lots* of people hit scalability problems with
postgres. In fact I've seen big users migrate away from postgres because
of them.

And it's not like this only affects absurd cases. Even a parallel
restore will benefit.


I disagree.  I just don't see the value in having such undefined
variables.


undefined variables? I'm not arguing that we don't need documentation
for it. Obviously we'd need that. I'm arguing against taking away
significant scalability possibilities from our users. My bet is that
it's more than 50% on a bigger machine.

I don't think we can offer absolutely accurate tuning advice, but I'm
sure we can give some guidance. Let me try.



+1

I think it is ok to document our reason for providing the new GUC - 
along with the fact that it is a new one and we need more field testing 
and benchmarks to provide comprehensive advice about how to set it - and 
recommend leaving it alone unless consulting with experts/this list etc.


Regards

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixed xloginsert_locks for 9.4

2014-10-03 Thread Mark Kirkwood

On 04/10/14 12:10, Bruce Momjian wrote:

On Sat, Oct  4, 2014 at 12:00:36PM +1300, Mark Kirkwood wrote:

I don't think we can offer absolutely accurate tuning advice, but I'm
sure we can give some guidance. Let me try.



+1

I think it is ok to document our reason for providing the new GUC -
along with the fact that it is a new one and we need more field
testing and benchmarks to provide comprehensive advice about how to
set it - and recommend leaving it alone unless consulting with
experts/this list etc.


I predict that such a setting will remain in postgresql.conf for years
with almost zero activity, as have other similar efforts.



Sure that *may* happen. In fact in my experience the vast majority of 
our current GUCs are never altered in the field - however when you run 
into a situation where a certain GUC solves your performance issue, then 
that seldom used GUC really gets some love.


So although I get your point about endless proliferation of 'em not being 
cost free, I'd like to plug the other side of the argument too - having 
the flexibility to adjust your Postgres installation to work well with a 
random platform with annoying quirks is the corresponding benefit.


In addition with the increasing use of cloud platforms - the situation 
above is likely to become *more* common (Postgres in Openstack using 
Ceph for volume storage is a case in point).


Cheers

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] “Core” function in Postgres

2014-09-23 Thread Mark Kirkwood

On 24/09/14 11:29, Mingzhe Li wrote:

Hi experts,

I want to know what's the core function used in Postgres server? I am
looking for something corresponding to main() in a simple C program. I
want to know the file path and the function name. I am using Postgres
9.3.5, however I assume the core function will be unchanged between
different revisions.



I suspect you want to start looking at
PostgresMain in src/backend/tcop/postgres.c

and ServerLoop in src/backend/postmaster/postmaster.c

Regards

Mark

P.s: FWIW I think this *is* the right list to ask this type of question...


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Postgres code for a query intermediate dataset

2014-09-15 Thread Mark Kirkwood

On 14/09/14 20:43, Mark Kirkwood wrote:

On 14/09/14 20:24, Atri Sharma wrote:


How do you plan to  do all that VACUUM does for this table then?

It seems to me that you are saying to VACUUM that it need not be
concerned with table 'A' and you are assuming ownership of all the tasks
performed by VACUUM for this table. Seems pretty broken to me, not to
mention the performance degradations.



I think the whole point of such a modification is that nothing is done
to such tables, as you want to see all the previous versions.

Clearly this is less performant for standard workloads...but we are
talking about non standard workloads surely...


To be fair with respect to what Atri is saying, I should have said 
something like:


Clearly this is *horribly* less performant for standard workloads...etc :-)

Also there is the good point he raised about transaction xid wrap, so 
some messing about with that part of VACUUM would be required too (it's 
the little complications that all add up)!



The TRIGGER based approach is clearly a lot simpler! However, as a 
project for understanding Postgres internals, the other approach is 
worthwhile.


Cheers

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Postgres code for a query intermediate dataset

2014-09-14 Thread Mark Kirkwood

On 14/09/14 05:36, Rohit Goyal wrote:

Hi All,

I want to work on the code of intermediate dataset of select and update
query.

For example.

Rohit's salary has been updated 4 times, so it has 4 different version
of salary.

I want to select  salary of person named Rohit. Now suppose , in
intermediate result, I found 4 different versions of the data. I want to
know the code portion which i need to look for working on all 4 versions
in dataset. :)




Hi Rohit,

Currently in Postgres, these intermediate versions all exist - however a 
given session can only see one of them. Also VACUUM is allowed to 
destroy versions that no other transactions can see.


So if I'm understanding you correctly, you would like to have some way 
for a session to see *all* these versions (and I guess preventing VACUUM 
from destroying them).


It is certainly possible (or used to be via snapshot manipulation, I 
haven't looked at that code in a while sorry) to enable a session to see 
all the old versions, and is quite a cool idea (Postgres used to have 
this ability in older versions - called Time Travel).


For pure practicality, this can be achieved without any code 
modifications using TRIGGERs and an extra table (as Gavin alludes to).


Do tell us a bit more about what you are wanting to do!

Cheers

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-14 Thread Mark Kirkwood

On 14/09/14 19:00, Amit Kapila wrote:

On Fri, Sep 12, 2014 at 11:09 PM, Gregory Smith
gregsmithpg...@gmail.com wrote:

  This looks like it's squashed one of the very fundamental buffer
   scaling issues though; well done Amit.

Thanks.

  I'll go back to my notes and try to recreate the pathological cases
  that plagued both the 8.3 BGW rewrite and the aborted 9.2 fsync
  spreading effort I did; get those running again and see how they
  do on this new approach.  I have a decent sized 24 core server
  that should be good enough for this job.  I'll see what I can do.

It will be really helpful if you can try out those cases.




And if you want 'em run on the 60 core beast, just let me know the 
details and I'll do some runs for you.


Cheers

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Postgres code for a query intermediate dataset

2014-09-14 Thread Mark Kirkwood

On 14/09/14 19:25, Atri Sharma wrote:



On Sunday, September 14, 2014, Mark Kirkwood
mark.kirkw...@catalyst.net.nz wrote:

On 14/09/14 05:36, Rohit Goyal wrote:

Hi All,

I want to work on the code of intermediate dataset of select and
update
query.

For example.

Rohit's salary has been updated 4 times, so it has 4 different
version
of salary.

I want to select  salary of person named Rohit. Now suppose , in
intermediate result, I found 4 different versions of the data. I
want to
know the code portion which i need to look for working on all 4
versions
in dataset. :)



Hi Rohit,

Currently in Postgres, these intermediate versions all exist -
however a given session can only see one of them. Also VACUUM is
allowed to destroy versions that no other transactions can see.

So if I'm understanding you correctly, you would like to have some
way for a session to see *all* these versions (and I guess
preventing VACUUM from destroying them).



Any modifications of that sort are bound to introduce lots of pain, not
to mention performance degradation and the added responsibility of
ensuring that dead tuples don't bloat up the system (prevent vacuum from
running at regular intervals and you can have a xid wraparound).

I just mentioned that in case you are planning to go in that direction.
If you only want the data, use the triggers as Gavin mentioned.



Obviously in the general case sure - but (as yet) we don't have much 
idea about Rohit's use case and workload. If retrieving past versions is 
the *primary* workload bias and high update concurrency is not required 
then this could well work better than a trigger based solution.


And it does not seem too onerous to have the ability to switch this on 
as required, viz:


ALTER TABLE table1 VERSIONING;

(or similar syntax) which makes VACUUM leave this table alone. It might 
make more sense to make such a concept apply to a TABLESPACE instead, 
mind you (i.e. things in here are for archive/versioning purposes)...


Clearly we'd need to see the code for any of this and evaluate if it is 
good or terrible, but I'm not seeing the idea as bad as stated.


Cheers

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Postgres code for a query intermediate dataset

2014-09-14 Thread Mark Kirkwood

On 14/09/14 20:24, Atri Sharma wrote:



On Sun, Sep 14, 2014 at 1:30 PM, Mark Kirkwood
mark.kirkw...@catalyst.net.nz wrote:

On 14/09/14 19:25, Atri Sharma wrote:



On Sunday, September 14, 2014, Mark Kirkwood
mark.kirkw...@catalyst.net.nz wrote:

 On 14/09/14 05:36, Rohit Goyal wrote:

 Hi All,

 I want to work on the code of intermediate dataset of
select and
 update
 query.

 For example.

 Rohit's salary has been updated 4 times, so it has 4
different
 version
 of salary.

 I want to select  salary of person named Rohit. Now
suppose , in
 intermediate result, I found 4 different versions of
the data. I
 want to
 know the code portion which i need to look for working
on all 4
 versions
 in dataset. :)



 Hi Rohit,

 Currently in Postgres, these intermediate versions all exist -
 however a given session can only see one of them. Also
VACUUM is
 allowed to destroy versions that no other transactions can see.

 So if I'm understanding you correctly, you would like to
have some
 way for a session to see *all* these versions (and I guess
 preventing VACUUM from destroying them).



Any modifications of that sort are bound to introduce lots of
pain, not
to mention performance degradation and the added responsibility of
ensuring that dead tuples don't bloat up the system (prevent
vacuum from
running at regular intervals and you can have a xid wraparound).

I just mentioned that in case you are planning to go in that
direction.
If you only want the data, use the triggers as Gavin mentioned.


Obviously in the general case sure - but (as yet) we don't have much
idea about Rohit's use case and workload. If retrieving past
versions is the *primary* workload bias and high update concurrency
is not required then this could well work better than a trigger
based solution.

And it does not seem too onerous to have the ability to switch this
on as required, viz:

ALTER TABLE table1 VERSIONING;

(or similar syntax) which makes VACUUM leave this table alone.


How do you plan to  do all that VACUUM does for this table then?

It seems to me that you are saying to VACUUM that it need not be
concerned with table 'A' and you are assuming ownership of all the tasks
performed by VACUUM for this table. Seems pretty broken to me, not to
mention the performance degradations.



I think the whole point of such a modification is that nothing is done 
to such tables, as you want to see all the previous versions.


Clearly this is less performant for standard workloads...but we are 
talking about non standard workloads surely...


Regards

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Postgres code for a query intermediate dataset

2014-09-14 Thread Mark Kirkwood

On 14/09/14 20:11, Rohit Goyal wrote:

Hi Mark,

On Sun, Sep 14, 2014 at 8:57 AM, Mark Kirkwood
mark.kirkw...@catalyst.net.nz wrote:



Currently in Postgres, these intermediate versions all exist -
however a given session can only see one of them. Also VACUUM is
allowed to destroy versions that no other transactions can see.

Exactly, one visible version is there per session. But I want to test
my algorithm, in which I myself want to work on all the intermediate
versions and find the correct one for the session.

So if I'm understanding you correctly, you would like to have some
way for a session to see *all* these versions (and I guess
preventing VACUUM from destroying them).

Yes, and I also want to know the portion of code where I can find all the
historical or intermediate versions.


Well, that's going to be a bit of a learning curve for you :-) - the 
concept to get familiar with is snapshots (see 
src/backend/access/heap/heapam.c to get started).





It is certainly possible (or used to be via snapshot manipulation, I
haven't looked at that code in a while sorry) to enable a session to
see all the old versions, and is quite a cool idea (Postgres used to
have this ability in older versions - called Time Travel).

For pure practicality, this can be achieved without any code
modifications using TRIGGERs and an extra table (as Gavin alludes to).

Can you explain a bit more about how to start working with triggers?


I'm not sure we have specific examples in the docs for what you want to 
do, but generally see 
http://www.postgresql.org/docs/9.4/static/plpgsql-trigger.html
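
A minimal sketch of the idea (all the names here are hypothetical): an
AFTER trigger files the outgoing row version into a history table on
every UPDATE or DELETE.

CREATE TABLE emp (
    name   text PRIMARY KEY,
    salary numeric
);

CREATE TABLE emp_history (
    name       text,
    salary     numeric,
    op         text,
    changed_at timestamptz DEFAULT now()
);

CREATE FUNCTION emp_versioning() RETURNS trigger AS $$
BEGIN
    -- OLD is the row version being replaced/removed; TG_OP is 'UPDATE' or 'DELETE'
    INSERT INTO emp_history (name, salary, op)
    VALUES (OLD.name, OLD.salary, TG_OP);
    RETURN NULL;  -- the return value of an AFTER row trigger is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER emp_versioning
    AFTER UPDATE OR DELETE ON emp
    FOR EACH ROW EXECUTE PROCEDURE emp_versioning();

All the previous versions of Rohit's salary are then simply:

SELECT salary, changed_at FROM emp_history
WHERE name = 'Rohit' ORDER BY changed_at;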


Cheers

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Postgres code for a query intermediate dataset

2014-09-14 Thread Mark Kirkwood

On 14/09/14 21:18, Rohit Goyal wrote:

Hi Mark & Atri, :)

Thanks for the reply. But I think I confused you. I am talking about access
using indexes. So, I assume that the B+ tree stores key-value pairs where
rohit is the key and all the versions are its value.

Another way to think of it is that I have a secondary index on emp. name and
there are 4 rohit entries in the DB. So now the B+ tree gives me 4 different
tuple pointers, one for each Rohit. I want to know the code portion for this,
where I can see all 4 tuple pointers before each one does I/O to fetch its
tuple.

Are the suggestions still valid?



Visibility rules mentioned earlier apply equally to tables and indexes 
(strictly speaking what happens is index tuples are checked against the 
relevant tables to see if your session can see them), so discussion of 
whether tuples are retrieved via index or table scans is not really 
relevant (i.e. query planning/optimization is separate from tuple 
visibility).


Cheers

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-10 Thread Mark Kirkwood

On 10/09/14 18:54, Amit Kapila wrote:

On Wed, Sep 10, 2014 at 5:46 AM, Mark Kirkwood
mark.kirkw...@catalyst.net.nz wrote:
 
  In terms of the effect of the patch - looks pretty similar to the
scale 2000 results for read-write, but read-only is a different and more
interesting story - unpatched 9.4 is noticeably impacted in the higher
client counts, whereas the patched version scales as well (or even
better perhaps) than in the scale 2000 case.

Yeah, that's what I was expecting, the benefit of this patch
will be more at higher client count when there is large data
and all the data can fit in RAM.

Many thanks for doing the performance test for patch.




No worries, this is looking like a patch we're going to apply to 9.4 to 
make the 60 core beast scale a bit better, so thanks very much for your 
work in this area.


If you would like more tests run at higher scales let me know (we have 
two of these machines at pre-production state currently so I can run 
benchmarks as reqd)!


Regards

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-09 Thread Mark Kirkwood

On 05/09/14 23:50, Amit Kapila wrote:

On Fri, Sep 5, 2014 at 8:42 AM, Mark Kirkwood
mark.kirkw...@catalyst.net.nz wrote:
 
  On 04/09/14 14:42, Amit Kapila wrote:
 
  On Thu, Sep 4, 2014 at 8:00 AM, Mark Kirkwood
mark.kirkw...@catalyst.net.nz
  wrote:
 
 
 
  Hi Amit,
 
  Results look pretty good. Does it help in the read-write case too?
 
 
  Last time I ran the tpc-b test of pgbench (results of which are
  posted earlier in this thread), there doesn't seem to be any major
  gain for that, however for cases where read is predominant, you
  might see better gains.
 
  I am again planing to take that data in next few days.
 
 
  FWIW below are some test results on the 60 core beast with this patch
applied to 9.4. I'll need to do more runs to iron out the variation,
  but it looks like the patch helps the standard (write heavy) pgbench
workload a little, and clearly helps the read only case.
 

Thanks for doing the test.  I think if possible you can take
the performance data with higher scale factor (4000) as it
seems your m/c has 1TB of RAM.  You might want to use
latest patch I have posted today.



Here's some fairly typical data from read-write and read-only runs at 
scale 4000 for 9.4 beta2 with and without the v7 patch (below). I'm not 
seeing much variation between repeated read-write runs with the same 
config (which is nice - sleep 30 and explicit checkpoint call between 
each one seem to help there).


Interestingly, I note anecdotally that (unpatched) 9.4 beta2 seems to be 
better at higher client counts than beta1 was...


In terms of the effect of the patch - looks pretty similar to the scale 
2000 results for read-write, but read-only is a different and more 
interesting story - unpatched 9.4 is noticeably impacted in the higher 
client counts, whereas the patched version scales as well (or even 
better perhaps) than in the scale 2000 case.


read write (600s)

Clients  | tps| tps (unpatched)
-++
  6  |   9395 |   9334
  12 |  16605 |  16525
  24 |  24634 |  24910
  48 |  32170 |  31275
  96 |  35675 |  36533
 192 |  35579 |  31137
 384 |  30528 |  28308


read only (300s)

Clients  | tps| tps (unpatched)
-++
  6  |  35743 |   35362
  12 | 111019 |  106579
  24 | 199746 |  160305
  48 | 327026 |  198407
  96 | 379184 |  171863
 192 | 356623 |  152224
 384 | 340878 |  128308


regards

Mark





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-04 Thread Mark Kirkwood

On 04/09/14 14:42, Amit Kapila wrote:

On Thu, Sep 4, 2014 at 8:00 AM, Mark Kirkwood mark.kirkw...@catalyst.net.nz
wrote:



Hi Amit,

Results look pretty good. Does it help in the read-write case too?


Last time I ran the tpc-b test of pgbench (results of which are
posted earlier in this thread), there doesn't seem to be any major
gain for that, however for cases where read is predominant, you
might see better gains.

I am again planing to take that data in next few days.



FWIW below are some test results on the 60 core beast with this patch 
applied to 9.4. I'll need to do more runs to iron out the variation, but 
it looks like the patch helps the standard (write heavy) pgbench 
workload a little, and clearly helps the read only case.



4x E7-4890 15 cores each.
1 TB ram
16x Toshiba PX02SS SATA SSD
4x Samsung NVMe XS1715 PCIe SSD

Ubuntu 14.04  (Linux 3.13)
Postgres 9.4 beta2
+ buffer eviction patch v5

Pgbench

scale 2000

Non default params:

max_connections = 400;
shared_buffers = 10GB;
maintenance_work_mem = 1GB;
effective_io_concurrency = 10;
wal_buffers = 256MB;
checkpoint_segments = 1920;
checkpoint_completion_target = 0.8;
ssl = 'off';
wal_sync_method = 'open_datasync';
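
(For reference, runs like these are driven with pgbench roughly as
follows - the exact flags here are my reconstruction, not a transcript:)

$ createdb bench
$ pgbench -i -s 2000 bench             # initialise at scale factor 2000
$ pgbench -c 96 -j 96 -T 600 bench     # read-write (tpc-b) run, 600s
$ pgbench -c 96 -j 96 -T 300 -S bench  # read-only (select-only) run, 300s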

read write

elapsed 600s

Clients  | tps   | tps (unpatched)
-+---+
  6  |  8279 |  8328
  12 | 16260 | 16381
  24 | 23639 | 23451
  48 | 31430 | 31004
  96 | 38516 | 34777
 192 | 33535 | 32443
 384 | 27978 | 25068
 384 | 30589 | 28798


read only

elapsed 300s

Clients  | tps| tps (unpatched)
-++
  6  |  57654 |  57255
  12 | 111361 | 112360
  24 | 220304 | 187967
  48 | 384567 | 230961
  96 | 380309 | 241947
 192 | 330865 | 214570
 384 | 315516 | 207548


Regards

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-03 Thread Mark Kirkwood

On 03/09/14 16:22, Amit Kapila wrote:

On Wed, Sep 3, 2014 at 9:45 AM, Amit Kapila amit.kapil...@gmail.com wrote:


On Thu, Aug 28, 2014 at 4:41 PM, Amit Kapila amit.kapil...@gmail.com

wrote:


I have yet to collect data under varying loads, however I have
collected performance data for 8GB shared buffers which shows
reasonably good performance and scalability.

I think the main part left for this patch is more data for various loads
which I will share in next few days, however I think patch is ready for
next round of review, so I will mark it as Needs Review.


I have collected more data with the patch.  I understand that you
have given more review comments due to which patch require
changes, however I think it will not effect the performance data
to a great extent and I have anyway taken the data, so sharing the
same.



Performance Data:
---

Configuration and Db Details
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Database Locale =C
checkpoint_segments=256
checkpoint_timeout=15min
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins

All the data is in tps and taken using pgbench read-only load


Common configuration remains same as above.


Forgot to mention that data is a median of 3 runs and attached
sheet contains data for individual runs.




Hi Amit,

Results look pretty good. Does it help in the read-write case too?

Cheers

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PL/pgSQL 2

2014-09-02 Thread Mark Kirkwood

On 02/09/14 21:25, Álvaro Hernández Tortosa wrote:


On 02/09/14 05:24, Craig Ringer wrote:

I couldn't disagree more.

If we were to implement anything, it'd be PL/PSM
(http://en.wikipedia.org/wiki/SQL/PSM). I'm sure it's as bizarre and
quirky as anything else the SQL committee has brought forth, but it's at
least a standard(ish) language.

 So we'd choose a bizarre and quirky language instead of anything
better just because it's standard. I'm sure current and prospective
users will surely prefer a bizarre and quirky language that is standard
approved, rather than a modern, comfortable, easy-to-use, that is not
embodied by the ISO. No doubt ^_^



Well there is the risk that by randomly adding new syntax to PL/pgSQL we 
turn it into a bizarre and quirky *non standard* language. Part of the 
attraction of PL/pgSQL is that it is Ada-like - if we break that too 
much then...well...that would be bad. So I think a careful balance is 
needed, to add new features that keep the spirit of the original language.


Regards

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PL/pgSQL 2

2014-09-01 Thread Mark Kirkwood

On 02/09/14 15:46, Craig Ringer wrote:


"was is exactly why we need a new language" and that "All the clumsy
stuff we cannot fix in plpgsql, can easily be fixed in plpgsql2, with
the most beautiful syntax we can come up with". But you haven't said HOW
you propose to fix this one case.


Unfortunately, there is likely to be a (large) variance of opinion 
concerning the details. In particular 'beautiful/elegant...'. Err - 
these things are mostly in the eye of the beholder. E.g.: I might want 
this new shiny syntax to be Lisp-like, as that is beautiful (heh, 
kidding - but you should get the idea).


Cheers

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Per table autovacuum vacuum cost limit behaviour strange

2014-08-28 Thread Mark Kirkwood

On 29/08/14 08:56, Alvaro Herrera wrote:

Robert Haas wrote:


I agree that you might not like that.  But you might not like having
the table vacuumed slower than the configured rate, either.  My
impression is that the time between vacuums isn't really all that
negotiable for some people.  I had one customer who had horrible bloat
issues on a table that was vacuumed every minute; when we changed the
configuration so that it was vacuumed every 15 seconds, those problems
went away.


Wow, that's extreme.  For that case you can set
autovacuum_vacuum_cost_limit to 0, which disables the whole thing and
lets vacuum run at full speed -- no throttling at all.  Would that
satisfy the concern?



Well no - you might have a whole lot of big tables that you want vacuum 
to not get too aggressive on, but a few small tables that are highly 
volatile. So you want *them* vacuumed really fast to prevent them 
becoming huge tables with only a few rows therein, but your system might 
not be able to handle *all* your tables being vacuumed at full speed.


This is a fairly common scenario for (several) web CMS systems that tend 
to want to have session and/or cache tables that are small and extremely 
volatile, plus the rest of the (real) data, which is bigger and vastly 
less volatile.


While there is a valid objection along the lines of 'don't use a 
database as a memcache', it does seem reasonable that Postgres 
should be able to cope with this type of workload.
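
For instance, the kind of per-table override in question might look like
this (a sketch - the table name is hypothetical, the values illustrative;
the options are the standard per-table autovacuum storage parameters):

ALTER TABLE sessions SET (
    autovacuum_vacuum_scale_factor = 0.0,  -- ignore relative table size...
    autovacuum_vacuum_threshold = 50       -- ...trigger on ~50 dead rows
);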


Cheers

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Per table autovacuum vacuum cost limit behaviour strange

2014-08-26 Thread Mark Kirkwood

On 27/08/14 10:27, Alvaro Herrera wrote:

Alvaro Herrera wrote:


So my proposal is a bit more complicated.  First we introduce the notion
of a single number, to enable sorting and computations: the delay
equivalent, which is the cost_limit divided by cost_delay.


Here's a patch that implements this idea.  As you see this is quite a
bit more complicated than Haribabu's proposal.

There are two holes in this:

1. if you ALTER DATABASE to change vacuum delay for a database, those
values are not considered in the global equiv delay.  I don't think this
is very important and anyway we haven't considered this very much, so
it's okay if we don't handle it.

2. If you have a fast worker that's only slightly faster than regular
workers, it will become slower in some cases.  This is explained in a
FIXME comment in the patch.

I don't really have any more time to invest in this, but I would like to
see it in 9.4.  Mark, would you test this?  Haribabu, how open are you
to fixing point (2) above?



Thanks Alvaro - I will take a look.

regards

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Per table autovacuum vacuum cost limit behaviour strange

2014-08-26 Thread Mark Kirkwood

On 27/08/14 10:27, Alvaro Herrera wrote:

Alvaro Herrera wrote:


So my proposal is a bit more complicated.  First we introduce the notion
of a single number, to enable sorting and computations: the delay
equivalent, which is the cost_limit divided by cost_delay.


Here's a patch that implements this idea.  As you see this is quite a
bit more complicated than Haribabu's proposal.

There are two holes in this:

1. if you ALTER DATABASE to change vacuum delay for a database, those
values are not considered in the global equiv delay.  I don't think this
is very important and anyway we haven't considered this very much, so
it's okay if we don't handle it.

2. If you have a fast worker that's only slightly faster than regular
workers, it will become slower in some cases.  This is explained in a
FIXME comment in the patch.

I don't really have any more time to invest in this, but I would like to
see it in 9.4.  Mark, would you test this?  Haribabu, how open are you
to fixing point (2) above?




I did some testing with this patch applied.

Minimally tweaking autovacuum (naptime of 5s) with a single table 
'cache0' created with a cost limit setting of 1, running:


$ pgbench -n -c8 -T300 -f volatile0.sql cache

and monitoring the size of the 'cache0' table showed a steady state of:


cache=# SELECT pg_relation_size('cache0')/(1024*1024) AS mb;
 mb
----
 85

So far so good. Adding another table 'cache1' similar to the previous 
but lacking any per table autovacuum settings, and running 2 pgbench 
sessions:


$ pgbench -n -c8 -T300 -f volatile0.sql cache
$ pgbench -n -c8 -T300 -f volatile1.sql cache


(volatile1.sql just uses table 'cache1' instead of 'cache0') shows after 
a few minutes:


cache=# SELECT relname,pg_relation_size(oid)/(1024*1024) AS mb
FROM pg_class WHERE relname like 'cache_';
 relname |  mb
---------+------
 cache0  |  664
 cache1  | 1900

So we are definitely seeing the 'fast' worker being slowed down. Also, 
the growth of 'cache1' was only a bit faster than 'cache0' - so the 
'slow' worker was getting a speed boost as well.


So looks like good progress, but yeah - point (2) is obviously rearing 
its head in this test.
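
(The attached volatile0.sql isn't reproduced in this archive; as a sketch
of the sort of workload it contains - table columns assumed - think along
the lines of:

\setrandom id 1 1000
UPDATE cache0 SET val = val + 1 WHERE id = :id;

i.e. a tight stream of updates over a small key range, generating dead
rows as fast as the clients can go.)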


Cheers

Mark




schema.sql
Description: application/sql


volatile0.sql
Description: application/sql

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Trove with PostgreSQL-XC

2014-08-20 Thread Mark Kirkwood

On 19/08/14 23:14, Vivek Singh Raghuwanshi wrote:


Hi All,

Please let me know whether it is possible to use OpenStack Trove with Postgres-XC,
with instances and bare metal (after the Juno release).
I know it is possible to use other options like MySQL or PostgreSQL, but
I am not sure about XC.


AFAIK [1], vanilla Postgres support for Trove is still at the review 
stage, and there is quite a bit to finish, and some stuff to fix.


I'd think that getting Trove to provision a complete XC VM (or more 
sensibly a set of VMs that run XC) is going to require a fair bit of extra 
work.


It looks like replication will be deployable via Trove for the supported 
flavours (MySQL, Mongo, Cassandra, Couch, Redis) - see 
https://wiki.openstack.org/wiki/Trove/Replication-And-Clustering . I'm 
not seeing any specific discussion of sharding, I notice.


Cheers

Mark

[1] I've been helping debug the Postgres Trove patch



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PostrgeSQL vs oracle doing 1 million sqrts am I doing it wrong?

2014-08-05 Thread Mark Kirkwood

On 05/08/14 17:56, Mark Kirkwood wrote:



Adding in the 'if' in the float8 case increases run time to 4s. So looks
like plpgsql might have a slightly higher cost for handling added
conditionals. Be interesting to dig a bit more and see what is taking
the time.



Thinking about this a bit more, I wonder if the 'big O' has added some 
optimizations in PL/SQL for trivial conditionals - i.e. you are adding:


IF (0 = 0) THEN

END IF;

...it may be going...'Ah yes, always true...so remove'!

So it might be interesting to try some (hopefully not so easily 
removable) non-trivial ones, like:



DO LANGUAGE plpgsql $$ DECLARE
i integer;
BEGIN
FOR i IN 1..1000000 LOOP
  IF (i%100 = 0) THEN
    NULL;
  END IF;
END LOOP;
END $$;

Now I guess there is the chance that PL/SQL might understand that NULL 
inside a loop means it can remove it...so you may need to experiment 
further. The point to take away here is that for interesting loops and 
conditions there may not be such a significant difference!
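
For example, giving the conditional a visible side effect should defeat
that sort of dead-code removal in either implementation - a sketch:

DO LANGUAGE plpgsql $$ DECLARE
i integer;
acc bigint := 0;
BEGIN
FOR i IN 1..1000000 LOOP
  IF (i%100 = 0) THEN
    acc := acc + 1;   -- side effect the optimizer cannot discard
  END IF;
END LOOP;
RAISE NOTICE 'acc = %', acc;  -- prints 10000
END $$;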


Regards

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PostrgeSQL vs oracle doing 1 million sqrts am I doing it wrong?

2014-08-04 Thread Mark Kirkwood

On 05/08/14 08:48, testman1316 wrote:

We are trying to get an idea of the raw performance of Oracle vs PostgreSQL.
We have extensive Oracle experience but are new to PostgreSQL. We are going
to run lots of queries with our data, etc. But first we wanted to see just
how they perform on basic kernel tasks, i.e. math and branching, since SQL is
built on that.

In AWS RDS we created two db.m3.2xlarge instances one with oracle
11.2.0.4.v1 license included, the other with PostgreSQL (9.3.3)

In both we ran code that did 1 million square roots (from 1 to 1 million). Then
did the same but within an If..Then statement.

The results were a bit troubling:

Oracle  4.8 seconds

PostgreSQL  21.803 seconds

adding an if statement:

Oracle  4.78 seconds

PostgreSQL  24.4 seconds

code Oracle square root

SET SERVEROUTPUT ON
SET TIMING ON

DECLARE
   n NUMBER := 0;
BEGIN
   FOR f IN 1..1000000
LOOP
 n := SQRT (f);
   END LOOP;
END;

PostgreSQL

DO LANGUAGE plpgsql $$ DECLARE n real;
DECLARE f integer;
BEGIN
FOR f IN 1..1000000 LOOP
n = SQRT (f);
END LOOP;
RAISE NOTICE 'Result = %',n;
END $$;

oracle adding if

SET SERVEROUTPUT ON
SET TIMING ON

DECLARE
   n NUMBER := 0;
BEGIN
   FOR f IN 1..1000000
LOOP
   if 0 = 0 then
 n := SQRT (f);
 end if;
   END LOOP;
END;

postgres adding if

DO LANGUAGE plpgsql $$ DECLARE n real;
DECLARE f integer;
BEGIN
FOR f IN 1..1000000 LOOP
if 0=0 then
n = SQRT (f);
end if;
END LOOP;
RAISE NOTICE 'Result = %',n;
END $$;

I used an anonymous block for PostgreSQL. I also did it as a function and
got identical results

CREATE OR REPLACE FUNCTION testpostgrescpu()
   RETURNS real AS
$BODY$
declare
  n real;
  f integer;

BEGIN
FOR f IN 1..1000000 LOOP
 n = SQRT (f);
END LOOP;


RETURN n;
END;
$BODY$
   LANGUAGE plpgsql VOLATILE
   COST 100;
ALTER FUNCTION testpostgrescpu()
   OWNER TO xxx;

Based on what we had heard of PostgreSQL and how it is comparable to Oracle
in many ways, we were taken aback by the results. Did we code PostgreSQL
incorrectly? What are we missing, or is this just the way it is?

Note: once we started running queries on the exact same data in Oracle and
PostgreSQL we saw a similar pattern: on basic queries little difference, but
as they started to get more and more complex, Oracle was around 3-5x faster.

Again, this was run on identical AWS RDS instances; we ran them many times
during the day on different days, and the results were always the same.








Looking at this guy:

DO LANGUAGE plpgsql $$ DECLARE n real;
DECLARE f integer;
BEGIN
FOR f IN 1..1000000 LOOP
  n = SQRT (f);
END LOOP;
RAISE NOTICE 'Result = %',n;
END $$;

Takes about 12s with Postgres 9.4 running on Ubuntu 14.04 hosted on 
real HW (Intel i7).


Changing n to be float8 rather than real, i.e:

DO LANGUAGE plpgsql $$ DECLARE n float8;
DECLARE f integer;
BEGIN
FOR f IN 1..1000000 LOOP
  n = SQRT (f);
END LOOP;
RAISE NOTICE 'Result = %',n;
END $$;

...time drops to about 2s (which I'm guessing would get it to about 
Oracle speed on your EC2 setup).


The moral of the story for this case is that mapping Oracle to Postgres 
datatypes can require some careful thought. Using 'native' types (like 
integer, float8 etc) will generally give vastly quicker performance.
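
For instance, sqrt() here returns double precision, so every assignment to
a real variable incurs an extra cast - which presumably accounts for much
of the difference. Easy to confirm from psql:

SELECT pg_typeof(sqrt(42));  -- double precision

Declaring n as float8 makes the assignment cast-free.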



Adding in the 'if' in the float8 case increases run time to 4s. So looks 
like plpgsql might have a slightly higher cost for handling added 
conditionals. Be interesting to dig a bit more and see what is taking 
the time.


Regards

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] parametric block size?

2014-07-26 Thread Mark Kirkwood

On 26/07/14 21:05, Andres Freund wrote:



More advanced features, but with much more impact on the code, would be to
be able to change the size at database/table level.


That'd be pretty horrible because the size of pages in shared_buffers
wouldn't be uniform anymore.




Possibly stopping at the tablespace level might be more straightforward. 
To avoid messing up the pages in shared buffers we'd perhaps need 
something like several shared buffer pools - each with either its own 
blocksize or associated with a (set of) tablespace(s).


Obviously this sort of thing has a pretty big architecture/code impact, 
probably better to consider a 1st iteration with it being initdb 
specifiable only (as that would still be very convenient)!
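
For context, today the block size is only selectable at build time, e.g.:

./configure --with-blocksize=16    # size in kB; the default is 8

...so even an initdb-time knob would be a real step up.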


Regards

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql.auto.conf and reload

2014-07-08 Thread Mark Kirkwood

On 09/07/14 05:13, Josh Berkus wrote:

On 07/06/2014 01:27 AM, Christoph Berg wrote:

Another option could be that during initdb all the uncommented settings are
written to the postgresql.auto.conf file rather than to postgresql.conf.
I think we can do this by changing code in initdb.c's setup_config().
This will ensure that unless the user is changing settings in a mixed way
(some by ALTER SYSTEM and some manually by editing postgresql.conf),
there will be no such problem.


There is no reasonable way for us to prevent issues for users who are
manually changing both pg.conf and pg.auto.conf.  Users should stick to
one or the other method of management (or, thirdly, using conf.d); if
they mix methods, we can't prevent confusion at restart time and we
shouldn't try.




Yes, but even well behaved users will see this type of error, because 
initdb uncomments certain values (ones that are dead certs for being 
changed via ALTER SYSTEM subsequently, like shared_buffers), and then - 
bang! - your next reload gets the 'your postgresql.conf contains errors' 
message.
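
The sequence that bites even a well behaved user looks roughly like this
(a sketch; the exact log wording varies):

ALTER SYSTEM SET shared_buffers = '1GB';  -- written to postgresql.auto.conf
SELECT pg_reload_conf();                  -- shared_buffers cannot change on
                                          -- reload, so the reload complains

...despite the user only ever having used the 'approved' ALTER SYSTEM route.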


Regards

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Wait free LW_SHARED acquisition

2014-07-01 Thread Mark Kirkwood

On 01/07/14 23:25, Heikki Linnakangas wrote:

On 07/01/2014 01:08 PM, Andres Freund wrote:

Hi,

Over at -performance Mark Kirkwood tested a recent version of this
(http://archives.postgresql.org/message-id/53B283F3.7020005%40catalyst.net.nz).
I thought it's interesting to add the numbers to this thread:


Test: pgbench
Options: scale 500
  read only
Os: Ubuntu 14.04
Pg: 9.3.4
Pg Options:
 max_connections = 200
 shared_buffers = 10GB
 maintenance_work_mem = 1GB
 effective_io_concurrency = 10
 wal_buffers = 32MB
 checkpoint_segments = 192
 checkpoint_completion_target = 0.8


Results

Clients | 9.3 tps 32 cores | 9.3 tps 60 cores
--------+------------------+------------------
6   |  70400   |  71028
12  |  98918   | 129140
24  | 230345   | 240631
48  | 324042   | 409510
96  | 346929   | 120464
192 | 312621   |  92663

So we have anti-scaling with 60 cores as we increase the client
connections. Ouch! A level of urgency led to trying out Andres's 'rwlock'
9.4 branch [1] - cherry-picking the last 5 commits into the 9.4 branch,
building a package from that, and retesting:

Clients | 9.4 tps 60 cores (rwlock)
--------+---------------------------
6   |  70189
12  | 128894
24  | 233542
48  | 422754
96  | 590796
192 | 630672


Now, this is a bit of a skewed comparison due to 9.4 vs. 9.3 but still
interesting.


It looks like the issue I reported here:

http://www.postgresql.org/message-id/5190e17b.9060...@vmware.com

fixed by this commit:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=b03d196be055450c7260749f17347c2d066b4254.


So, definitely need to compare plain 9.4 vs patched 9.4, not 9.3.



Here's plain 9.4 vs patched 9.4:

Clients | 9.4 tps 60 cores | 9.4 tps 60 cores (rwlock)
--------+------------------+---------------------------
6   |  69490   |  70189
12  | 128200   | 128894
24  | 232243   | 233542
48  | 417689   | 422754
96  | 464037   | 590796
192 | 418252   | 630672

It appears that plain 9.4 does not exhibit the dramatic anti-scaling 
that 9.3 showed, but there is still evidence of some contention at the 
higher client numbers, and we peak at the 96 client mark. The patched 
variant looks pretty much free from this, still scaling at 192 
connections (it might have been interesting to try more, but 
max_connections was set to 200)!
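
(An aside for anyone chasing similar contention on 9.3/9.4: the stock
monitoring view only exposes heavyweight-lock waits -

SELECT waiting, count(*) FROM pg_stat_activity GROUP BY 1;

- LWLock waits like these don't show up there, so perf or gdb sampling is
needed to actually see them.)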


Cheers

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers

2014-05-06 Thread Mark Kirkwood

On 07/05/14 17:35, Peter Geoghegan wrote:

On Tue, May 6, 2014 at 10:20 PM, Simon Riggs si...@2ndquadrant.com wrote:

On 6 May 2014 23:47, Josh Berkus j...@agliodbs.com wrote:


If you're going to make
an argument in favor of different tuning advice, then do it based on
something in which you actually believe, based on hard evidence.

The proposed default setting of 4x shared_buffers is unprincipled
*and* lacks hard evidence from you and everybody else.

+1. In my view, we probably should have set it to a much higher
absolute default value. The main problem with setting it to any
multiple of shared_buffers that I can see is that shared_buffers is a
very poor proxy for what effective_cache_size is supposed to
represent. In general, the folk wisdom around sizing shared_buffers
has past its sell-by date.



+1. ISTM the only sensible approach to auto-tune this requires us to 
have a plugin to detect how much RAM the system has (and then setting it 
to, say, 1/2 of that). I wonder if it might be worthwhile writing plugins for 
the handful of popular platforms. For the remainder maybe we could leave 
it defaulting to the current (small) value, and encourage volunteers to 
code the missing ones if they want something better.
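
As a rough worked example of the RAM/2 heuristic: on a 64GB machine such a
plugin would end up suggesting

effective_cache_size = 32GB    # i.e. RAM/2

instead of the one-size-fits-all default.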


Regards

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Per table autovacuum vacuum cost limit behaviour strange

2014-05-05 Thread Mark Kirkwood

On 05/05/14 15:22, Amit Kapila wrote:


Here what I could understand is that sum of cost_limit for all
autovacuum workers should never exceed the value of
autovacuum_vacuum_cost_limit which seems to be always the
case in current code but same is not true for proposed patch.



Right, but have a look at the 1st message in this thread - the current 
behavior (and to a large extent the above condition) means that setting 
cost limits per table does not work in any remotely intuitive way.

ISTM that the whole purpose of a per-table setting in this context is to 
override the behavior of autovacuum throttling - and currently this 
does not happen unless you get real brutal (i.e. setting the cost delay 
to zero in addition to setting the cost limit...making the whole cost limit 
a bit pointless).
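
For example (table name hypothetical, values illustrative):

ALTER TABLE naughty SET (
    autovacuum_vacuum_cost_limit = 10000,
    autovacuum_vacuum_cost_delay = 0   -- the brutal part: disables throttling
);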


regards

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Per table autovacuum vacuum cost limit behaviour strange

2014-05-05 Thread Mark Kirkwood

On 06/05/14 16:28, Amit Kapila wrote:

On Mon, May 5, 2014 at 11:57 AM, Mark Kirkwood
mark.kirkw...@catalyst.net.nz wrote:

On 05/05/14 15:22, Amit Kapila wrote:
Right, but have a look at the 1st message in this thread - the current
behavior (and to a large extent the above condition) means that setting
cost limits per table does not work in any remotely intuitive way.

ITSM that the whole purpose of a per table setting in this context is to
override the behavior of auto vacuum throttling - and currently this does
not happen unless you get real brutal (i.e setting the cost delay to zero in
addition to setting cost limit...making the whole cost limit a bit
pointless).

I think the meaning of the per-table setting is just that it overrides the default
value of autovacuum_vacuum_cost_limit for that table, and the rest of the
calculation or concept remains the same.  This is what the code currently does,
and the same is mentioned in the docs as far as I can understand.

As per the current behaviour, the per-table cost_limit is also adjusted based
on the setting of the GUC autovacuum_vacuum_cost_limit, and right now it
follows a simple principle: the total cost limit for all workers should be
equal to autovacuum_vacuum_cost_limit.  The code even has the below comment:

/*
* Adjust cost limit of each active worker to balance the total of cost
* limit to autovacuum_vacuum_cost_limit.
*/

Now, if you want to change the behaviour for the case where the user specifies
a per-table value which is more than autovacuum_vacuum_cost_limit (or otherwise),
then I think the new definition should be a bit more clear, and it is better
not to impact the current calculation for default values.

I could think of 2 ways to change this:

a. if the user has specified a cost_limit value for a table, then just use it
 rather than rebalancing based on the value of the system-wide GUC variable
 autovacuum_vacuum_cost_limit
b. another could be to restrict the per-table value to be no greater than the
 system-wide value?

The former is used for autovacuum parameters like scale_factor, and the
latter is used for parameters like freeze_max_age.

Thoughts?

Alvaro, do you think above options makes sense to solve this problem?


Yes indeed - the code is currently working differently from what one would 
expect. However the usual reason for handing knobs to the user for 
individual objects is so that special configurations can be applied to 
them. The current method of operation of the per-table knobs does not do 
this (not without clubbing 'em on the head).


The (ahem) sensible way that one would expect (perhaps even need) 
autovacuum throttling to work is:


- set sensible defaults for all the usual (well behaved) tables
- set a few really aggressive overrides for a handful of the naughty ones

Runaway free space bloat is one of the things that can really mangle a 
postgres system (I've been called in to rescue a few in my time)... 
there needs to be a way to control those few badly behaved tables ... 
without removing the usefulness of throttling the others.


Regards

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.4 Proposal: Initdb creates a single table

2014-04-23 Thread Mark Kirkwood
This seems like a much better idea - a single table related to nothing, 
on the other hand, is at best not very helpful (and, it could be argued, 
might contribute to teaching poor data design).


Regards

Mark

On 23/04/14 19:13, Pavel Stehule wrote:

Hello

If you are thinking about this direction, then store a small demo
project there.

I don't think an isolated table has significant value.





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


  1   2   3   4   5   6   >