[HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-09-02 Thread Noah Misch
My AIX buildfarm members have failed the BinInstallCheck step on and off since
inception.  It became more frequent when I added animals sungazer and tern
alongside the older hornet and mandrill.  The animals share a machine with
each other and with dozens of other developers.  I setpriority() the animals
to the lowest available priority, so they probably lose the CPU for long
periods.  Separately, this machine has slow filesystem metadata operations.
For example, git-new-workdir takes ~50s for a PostgreSQL tree.

The pg_rewind suite has failed a few times when crash recovery took longer
than the 60s pg_ctl default timeout.  Disabling fsync (commit 7d7a103) reduced
median crash recovery time by 75%, which may suffice.  If not, I'll be
inclined to add --timeout=900 to each pg_ctl invocation.


The pg_ctl suite has failed with "not ok 12 - second pg_ctl start succeeds".
You can reproduce that by adding "sleep 3;" between that test and the one
before it.  The timing dependency comes from the pg_ctl "slop" time:

/*
 * Make sanity checks.  If it's for a standalone backend
 * (negative PID), or the recorded start time is before
 * pg_ctl started, then either we are looking at the wrong
 * data directory, or this is a pre-existing pidfile that
 * hasn't (yet?) been overwritten by our child postmaster.
 * Allow 2 seconds slop for possible cross-process clock
 * skew.
 */
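
In other words (a simplified sketch, not the actual pg_ctl.c source; the
function and parameter names are made up for illustration), the check the
slop affects amounts to:

#include <time.h>

static int
pidfile_might_be_ours(long pidfile_pid, time_t pidfile_start_time,
					  time_t pg_ctl_start_time)
{
	if (pidfile_pid <= 0)
		return 0;				/* standalone backend, not our postmaster */
	if (pidfile_start_time < pg_ctl_start_time - 2)
		return 0;				/* pre-existing pidfile, not our child */
	return 1;					/* within the 2s slop: assumed to be ours */
}

So a second "pg_ctl start -w" issued within that 2-second window sees the
first postmaster's pidfile, assumes it belongs to its own (failed) child,
and reports success.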

The "second pg_ctl start succeeds" tested-for behavior is actually a minor bug
that we'd ideally fix as described in the last paragraph of the commit 3c485ca
log message:

All of this could be improved if we rewrote start_postmaster() so that it
could report the child postmaster's PID, so that we'd know a-priori the
correct PID to test with postmaster_is_alive().  That looks like a bit too
much change for so late in the 9.1 development cycle, unfortunately.

I recommend we invert the test expectation and, pending the ideal pg_ctl fix,
add the "sleep 3" to avoid falling within the time slop:

--- a/src/bin/pg_ctl/t/001_start_stop.pl
+++ b/src/bin/pg_ctl/t/001_start_stop.pl
@@ -35,6 +35,7 @@ close CONF;
 command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
 	'pg_ctl start -w');
-command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
-	'second pg_ctl start succeeds');
+sleep 3;	# bridge test_postmaster_connection() slop threshold
+command_fails([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
+	'second pg_ctl start fails');
 command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
 	'pg_ctl stop -w');


Alternately, I could just remove the test.

crake failed the same way, once:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2015-07-07%2016%3A35%3A06




Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Tatsuo Ishii
>> One lesson we got from XL is that we need a testing framework for the
>> cluster, so any cluster project should at least pass functional and
>> performance testing.
>>
> 
> +1. In early XC days, we focused a lot on adding newer features and
> supporting as many PG features as possible. That took its toll on the
> testing and QA. It was a mistake, though my feeling is that we tried to
> correct that to some extent with XL. We did a 9.5 merge, which of course
> was a big deal, but other than that, more time is being spent on
> improving stability and performance.

Agreed. Any cluster project needs a cluster testing framework.

The pgpool-II project runs a "build farm" which runs cluster regression
tests every day. The tests include several combinations of pgpool-II and
PostgreSQL versions using Docker. It still needs more tests, but even
with the limited test cases it is pretty useful for detecting bugs.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Pavan Deolasee
On Wed, Sep 2, 2015 at 9:04 PM, Oleg Bartunov  wrote:

>
>
>
> One lesson we got from XL is that we need a testing framework for the
> cluster, so any cluster project should at least pass functional and
> performance testing.
>

+1. In early XC days, we focused a lot on adding newer features and
supporting as many PG features as possible. That took its toll on the
testing and QA. It was a mistake, though my feeling is that we tried to
correct that to some extent with XL. We did a 9.5 merge, which of course
was a big deal, but other than that, more time is being spent on
improving stability and performance.

> XL was very easy to break and I'm wondering how many corner cases still
> exist.
>

Your team reported 2 or 3 major issues which I think we were able to fix
quite quickly. But if there are more such issues which your team has
recorded somewhere, I would request you to send them to the XL mailing
list. I would definitely want to look at them and address them.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-09-02 Thread Etsuro Fujita

On 2015/09/03 9:41, Robert Haas wrote:

That having been said, I don't entirely like Fujita-san's patch
either.  Much of the new code is called immediately adjacent to an FDW
callback which could pretty trivially do the same thing itself, if
needed.


Another idea about that code is to call it in, e.g., ExecProcNode, 
instead of calling ExecForeignScan there.  I think that might be 
much cleaner and would resolve the naming problem below.



And much of it is contingent on whether estate->es_epqTuple
!= NULL and scanrelid == 0, but perhaps it would be better to check
whether the subplan is actually present instead of checking whether we
think it should be present.


Agreed with this.


Also, the naming is a bit weird:
node->fs_subplan gets shoved into outerPlanState(), which seems like a
kludge.


And with this.  Proposals welcome.


I'm wondering if there's another approach.  If I understand correctly,
there are two reasons why the current situation is untenable.  The
first is that ForeignRecheck always returns true, but we could instead
call an FDW-supplied callback routine there.  The callback could be
optional, so that we just return true if there is none, which is nice
for already-existing FDWs that then don't need to do anything.


My question about this is: is the callback really needed?  If there are 
any FDWs that want to do the work *in their own way*, instead of just 
doing ExecProcNode to execute a local join execution plan in the case of 
a foreign join (or just doing ExecQual to check remote quals in the case 
of a foreign table), I'd agree with introducing the callback, but if not, 
I don't think that makes much sense.
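
For reference, what Robert describes might look roughly like this (a
sketch with an assumed callback name and signature, not a committed API):

static bool
ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
{
	FdwRoutine *fdwroutine = node->fdwroutine;

	/* Give the FDW a chance to recheck in its own way, if it wants to. */
	if (fdwroutine->RecheckForeignScan != NULL)
		return fdwroutine->RecheckForeignScan(node, slot);

	/* No callback provided: keep the historical behavior. */
	return true;
}

Existing FDWs that leave the pointer NULL would keep working unchanged,
which is the "optional" part of the proposal.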


Best regards,
Etsuro Fujita





Re: [HACKERS] Memory prefetching while sequentially fetching from SortTuple array, tuplestore

2015-09-02 Thread Peter Geoghegan
On Wed, Sep 2, 2015 at 9:43 PM, David Rowley
 wrote:
> Peter, would you be able to share the test case which you saw the speedup
> on. So far I've been unable to see much of an improvement.

The case I tested was an internal sort CREATE INDEX. I don't recall
the exact details, but testing showed it to be a fairly robust
speedup. It was not a very brief CREATE INDEX operation, nor a very
lengthy one. trace_sort output made it quite visible that there was
a significant saving after the sort was "performed", but before it was
"done". It wasn't hard to see an improvement on a variety of other
cases, although the Intel VTune tool made the difference particularly
obvious.

The only thing that definitely won't be helped is pass-by-value datum
sort cases. In case it matters, I used GCC 4.8.
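
Roughly, the shape of the change under discussion looks like this (a
sketch only: the struct fields and the look-ahead distance of 3 are
illustrative, and __builtin_prefetch is the GCC builtin):

#include <stddef.h>

typedef struct SortTupleSketch
{
	void	   *tuple;			/* out-of-line tuple; the pointer we chase */
	int			datum1;			/* leading key, stored inline */
} SortTupleSketch;

static void
walk_sorttuples(SortTupleSketch *tuples, size_t ntuples)
{
	for (size_t i = 0; i < ntuples; i++)
	{
		/* Hint the CPU to start fetching the tuple we'll need shortly. */
		if (i + 3 < ntuples)
			__builtin_prefetch(tuples[i + 3].tuple, 0, 3);

		/* ... hand tuples[i] back to the caller here ... */
		(void) tuples[i].tuple;
	}
}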

-- 
Peter Geoghegan




Re: [HACKERS] Memory prefetching while sequentially fetching from SortTuple array, tuplestore

2015-09-02 Thread David Rowley
On 3 September 2015 at 07:24, Peter Geoghegan  wrote:

> On Wed, Sep 2, 2015 at 7:12 AM, Andres Freund  wrote:
> > What worries me about adding explicit prefetching is that it's *highly*
> > platform and even micro-architecture dependent. Why is looking three
> > elements ahead the right value?
>
> Because that was the fastest value following testing on my laptop. You
> are absolutely right to point out that this isn't a good reason to go
> with the patch -- I share your concern. All I can say in defense of
> that is that other major system software does the same, without any
> regard for the underlying microarchitecture AFAICT. So, yes,
> certainly, more testing is required across a reasonable cross-section
> of platforms to justify the patch.
>

FWIW someone else found 3 to be good on the platform they tested on:
http://www.naftaliharris.com/blog/2x-speedup-with-one-line-of-code/

Peter, would you be able to share the test case which you saw the speedup
on. So far I've been unable to see much of an improvement.

Regards

David Rowley

--
 David Rowley   http://www.2ndQuadrant.com/

 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] WIP: Make timestamptz_out less slow.

2015-09-02 Thread David Rowley
On 3 September 2015 at 05:10, Andres Freund  wrote:

> On 2015-08-09 12:47:53 +1200, David Rowley wrote:
> > I took a bit of weekend time to finish this one off. Patch attached.
> >
> > A quick test shows a pretty good performance increase:
> >
> > create table ts (ts timestamp not null);
> > insert into ts select generate_series('2010-01-01 00:00:00', '2011-01-01
> > 00:00:00', '1 sec'::interval);
> > vacuum freeze ts;
> >
> > Master:
> > david=# copy ts to 'l:/ts.sql';
> > COPY 31536001
> > Time: 20444.468 ms
> >
> > Patched:
> > david=# copy ts to 'l:/ts.sql';
> > COPY 31536001
> > Time: 10947.097 ms
>
> Yes, that's pretty cool.
>
> > There's probably a bit more to squeeze out of this.
> > 1. EncodeDateTime() could return the length of the string to allow
> callers
> > to use pnstrdup() instead of pstrdup(), which would save the strlen()
> call.
> > 2. Have pg_int2str_zeropad() and pg_int2str() skip appending the NUL char
> > and leave this up to the calling function.
> > 3. Make something to replace the sprintf(str, " %.*s", MAXTZLEN, tzn);
> call.
> >
> > Perhaps 1 and 3 are worth looking into, but I think 2 is likely too error
> > prone for the small gain we'd get from it.
>

I see the logic there, but a few weekends ago I modified the patch to
implement number 2. This is perhaps one optimization that would be hard to
change later, since if we suddenly changed pg_int2str() to no longer append
the NUL, we might break any third-party code that had started using it. The
more I think about it, the more I think allowing those functions to skip
appending the NUL is OK. They're very useful functions for incrementally
building strings.

Attached is a patch to implement this, which performs as follows
postgres=# copy ts to 'l:/ts.sql';
COPY 31536001
Time: 10665.297 ms

However, this version of the patch does change many existing functions in
datetime.c so that they no longer append the NUL char... The good news is
that these are all static functions, so nothing else should be using them.

I'm happy to leave 1 and 3 for another rainy weekend one day.


> > Also I was not too sure on if pg_int2str() was too similar to pg_ltoa().
> > Perhaps pg_ltoa() should just be modified to return the end of string?
>
> I don't think the benefits are worth breaking pg_ltoa interface.
>
>
OK, let's leave pg_ltoa() alone.


> >  /*
> > - * Append sections and fractional seconds (if any) at *cp.
> > + * Append seconds and fractional seconds (if any) at *cp.
> >   * precision is the max number of fraction digits, fillzeros says to
> >   * pad to two integral-seconds digits.
> >   * Note that any sign is stripped from the input seconds values.
> > + * Note 'precision' must not be a negative number.
> >   */
> > -static void
> > +static char *
> >  AppendSeconds(char *cp, int sec, fsec_t fsec, int precision, bool fillzeros)
> >  {
> > +#ifdef HAVE_INT64_TIMESTAMP
>
> Wouldn't it be better to just normalize fsec to an integer in that case?
> Afaics that's the only remaining reason for the alternative path?
>
A special case would need to exist somewhere, unless we just modified
timestamp2tm() to multiply fsec by 1 million when HAVE_INT64_TIMESTAMP is
not defined, but that changes the function signature.


> > +/*
> > + * pg_int2str
> > + *   Converts 'value' into a decimal string representation of the number.
> > + *
> > + * Caller must ensure that 'str' points to enough memory to hold the result
> > + * (at least 12 bytes, counting a leading sign and trailing NUL).
> > + * Return value is a pointer to the new NUL terminated end of string.
> > + */
> > +char *
> > +pg_int2str(char *str, int32 value)
> > +{
> > + char *start;
> > + char *end;
> > +
> > + /*
> > +  * Handle negative numbers in a special way. We can't just append a '-'
> > +  * prefix and reverse the sign as on two's complement machines negative
> > +  * numbers can be 1 further from 0 than positive numbers, we do it this way
> > +  * so we properly handle the smallest possible value.
> > +  */
> > + if (value < 0)
> > + {
>
> We could alternatively handle this by special-casing INT_MIN, probably
> resulting in a bit less duplicated code.
>

Those special cases seem to do some weird stuff to the performance
characteristics of those functions. I find that my method of handling
negative numbers produces much faster code.

I experimented over the weekend with finding the fastest way to convert
an int to a string. I happened to be testing with int64s, but int32 will
likely behave the same.

This is the time taken to convert 30 into "30" 2 billion times.

The closest to the implementation I'm using in the patch is the
pg_int64tostr() one. The pg_int64tostr_memcpy() variant only performs
better with bigger numbers / longer strings.

david@ubuntu:~/C$ gcc5.2 tostring.c -o tostring -O2
david@ubuntu:~/C$ ./tostring
pg_int64tostr_memcpy in 13.653252 seconds
pg_int64tostr in 2.042616 seconds
pg_int64tostr_new in 2.056688 seconds
pg_lltoa in 13.604653 seconds
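
For reference, here is a self-contained sketch of the negative-side digit
accumulation described above (illustration only; the function name and
exact coding are not taken from the patch):

#include <stdint.h>

static char *
pg_int2str_sketch(char *str, int32_t value)
{
	char	   *start;
	char	   *end;
	int32_t		remainder = value;

	/*
	 * Work with negative values throughout: on two's complement machines
	 * INT32_MIN has no positive counterpart, so negating it would overflow.
	 */
	if (value < 0)
		*str++ = '-';
	else
		remainder = -remainder;

	start = str;
	do
	{
		*str++ = '0' - (remainder % 10);	/* remainder % 10 is 0..-9 */
		remainder /= 10;
	} while (remainder != 0);
	*str = '\0';
	end = str;

	/* Digits were emitted least-significant first; reverse them in place. */
	while (start < --str)
	{
		char		swap = *start;

		*start++ = *str;
		*str = swap;
	}
	return end;					/* points at the NUL, handy for appending */
}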

dav

Re: [HACKERS] GIN pending clean up is not interruptable

2015-09-02 Thread Fujii Masao
On Thu, Sep 3, 2015 at 2:15 AM, Andres Freund  wrote:
> On 2015-08-12 11:59:48 -0700, Jeff Janes wrote:
>> Attached patch does it that way.  There was also a free-standing
>> CHECK_FOR_INTERRUPTS() which had no reason that I could see not be a
>> vacuum_delay_point, so I changed that one as well.

-if (vac_delay)
-vacuum_delay_point();
+vacuum_delay_point();

If vac_delay is false, e.g., when ginInsertCleanup() is called by a normal
backend, vacuum_delay_point() should not be called. No?
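
One way to keep that distinction while still making the loop interruptible
for regular backends might be (just a sketch of the alternative, not the
proposed patch):

	if (vac_delay)
		vacuum_delay_point();	/* does CHECK_FOR_INTERRUPTS() itself */
	else
		CHECK_FOR_INTERRUPTS();	/* stay interruptible in normal backends */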

> I think we should backpatch this - any arguments against?

+1 for backpatch.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Amit Kapila
On Thu, Sep 3, 2015 at 8:28 AM, Bruce Momjian  wrote:
>
> On Wed, Sep  2, 2015 at 07:50:25PM -0700, Joshua Drake wrote:
> > >Can you explain why logical replication is better than binary
> > >replication for this use-case?
> > >
> >
> > Selectivity?
>
> I was assuming you would just create identical slaves to handle failure,
> rather than moving selected data around.
>

Yes, I also think so; otherwise, when a shard goes down and its replica
has to take the shard's place, it will take more time to make the replica
available, as it won't have all the data the original shard had.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Bruce Momjian
On Wed, Sep  2, 2015 at 09:03:25PM -0400, Robert Haas wrote:
> > Can you explain why logical replication is better than binary
> > replication for this use-case?
> 
> Uh, well, for the same reasons it is better in many other cases.
> Particularly, you probably don't want to replicate all the data on
> machine A to machine B, just some of it.
> 
> Typically, sharding solutions store multiple copies of each piece of
> data.  So let's say you have 4 machines.  You divide the data into 12
> chunks.  Each machine is the write-master for 3 of those chunks, but
> has secondary copies of 3 others.  So maybe things start out like
> this:
> 
> machine #1: master for chunks 1, 2, 3; also has copies of chunks 4, 7, 10
> machine #2: master for chunks 4, 5, 6; also has copies of chunks 1, 8, 11
> machine #3: master for chunks 7, 8, 9; also has copies of chunks 2, 5, 12
> machine #4: master for chunks 10, 11, 12; also has copies of chunks 3, 6, 9
> 
> If machine #1 is run over by a rabid triceratops, you can make machine
> #2 the master for chunk 1, machine #3 the master for chunk 2, and
> machine #4 the master for chunk 3.  The write load therefore remains
> evenly divided.  If you can only copy entire machines, you can't
> achieve that in this situation.

I see the advantage of this now.  My original idea was that each shard
would have its own standby for disaster recovery, but your approach
above, which I know is typical, allows the shards to back up each other.
You could say shard 2 is the backup for shard 1, but then if shard 1
goes bad, the entire workload of shard 1 goes to shard 2.  With the
above approach, the load of shard 1 is shared by all the shards.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Bruce Momjian
On Wed, Sep  2, 2015 at 07:50:25PM -0700, Joshua Drake wrote:
> >Can you explain why logical replication is better than binary
> >replication for this use-case?
> >
> 
> Selectivity?

I was assuming you would just create identical slaves to handle failure,
rather than moving selected data around.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Joshua D. Drake

On 09/02/2015 03:56 PM, Bruce Momjian wrote:

On Wed, Sep  2, 2015 at 02:41:46PM -0400, Robert Haas wrote:

4. Therefore, I think that we should instead use logical replication,
which might be either synchronous or asynchronous.  When you modify
one copy of the data, that change will then be replicated to all other
nodes.  If you are OK with eventual consistency, this replication can
be asynchronous, and nodes that are off-line will catch up when they
are on-line.  If you are not OK with that, then you must replicate
synchronously to every node before transaction commit; or at least you
must replicate synchronously to every node that is currently on-line.
This presents some challenges: logical decoding currently can't
replicate transactions that are still in process - replication starts
when the transaction commits.  Also, we don't have any way for
synchronous replication to wait for multiple nodes.  But in theory
those seem like limitations that can be lifted.  Also, the GTM needs
to be aware that this stuff is happening, or it will DTWT.  That too
seems like a problem that can be solved.


Can you explain why logical replication is better than binary
replication for this use-case?



Selectivity?

JD


--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.




Re: [HACKERS] Allow replication roles to use file access functions

2015-09-02 Thread Michael Paquier
On Thu, Sep 3, 2015 at 11:20 AM, Stephen Frost wrote:
> * Michael Paquier wrote:
>> 1) Use a differential backup to me, or the possibility to fetch a set
>> of data block diffs from a source node using an LSN and then re-apply
>> them on the target node. The major disadvantage of this approach is
>> actually performance: we would need to scan the source node
>> completely, so that's slow. And pg_rewind is fast because it only
>> scans the WAL records of the node to be rewound from the last
>> checkpoint before WAL forked.
>
> I don't follow this performance concern at all.  Why can't pg_rewind
> look through the WAL and find what it needs, and then request exactly
> that information?  Clearly, we would need to add to the existing
> replication protocol, but I don't see any reason to not consider that a
> perfectly reasonable approach for this.

See below; apparently I misunderstood what you meant.

>> 2) Request data blocks from the source node using the replication
>> protocol, that's what pg_read_binary_file actually does, we just don't
>> have the logic on replication protocol side, though we could.
>
> Right, we would need to modify the replication protocol to allow such
> requests, but that's not particularly difficult.

Check.

>> 3) Create a new set of functions similar to the existing file access
>> functions that are usable by the replication user, except that they
>> refuse to return file entries that match the existing filters in
>> basebackup.c. That is doable, with a routine in basebackup.c that
>> decides if a given file string can be read or not. Base backups could
>> use this routine as well.
>
> I don't particularly like this approach as it implies SQL access for the
> replication role.

Check.

>> > This is definitely a big part of
>> > the question, but I'd like to ask- what, exactly, does pg_rewind
>> > actually need access to? Is it only the WAL, or are heap and WAL files
>> > needed?
>>
>> Not only those; also the clog, configuration files, etc.
>
> Configuration files?  Perhaps you could elaborate?

Sure. Sorry for being unclear. It copies everything that is not a
relation file, so it is essentially a base backup without the relation files.

>> > Consider the discussion about delta backups, et al, using LSNs.  Perhaps
>> > the replication protocol should be extended to allow access to arbitrary
>> > WAL, querying what WAL is available, and access to the heap files,
>> > perhaps even specific pages in the heap files and relation forks,
>> > instead of giving pg_rewind access to these extremely general
>> > nearly-OS-user-level functions.
>>
>> The problem when using differential backups in this case is
>> performance, as mentioned above. We would need to scan the whole source
>> cluster, which may take time; the current approach of pg_rewind only
>> needs to scan WAL records to find the list of blocks modified, and
>> directly requests them from the source. I would expect pg_rewind to be
>> as quick as possible.
>
> I don't follow why the current approach of pg_rewind would have to
> change.  All I'm suggesting is that we have a different way, one which
> is much more restricted, for pg_rewind to request exactly the
> information it needs for efficient operation.

Ah, OK. I thought that you were referring to a protocol where the caller
sends a single LSN and gets back a differential backup, which requires
scanning all the relation files of the source cluster to find the data
blocks with an LSN newer than the one sent and then sending them back
to the caller.

I guess that what you are suggesting instead is an approach where the
caller sends something like the following through the replication
protocol, with a relation OID and a block list:
BLOCK_DIFF relation_oid BLOCK_LIST m,n,[o, ...]
which is close to what pg_read_binary_file does now for a superuser.
We would also need to extend BASE_BACKUP so that it does not include
relation files for this use case.

Regards,
-- 
Michael




Re: [HACKERS] pg_stat_statements query jumbling question

2015-09-02 Thread Satoshi Nagayasu
On 2015/09/01 14:39, Satoshi Nagayasu wrote:
> On 2015/09/01 14:01, Tom Lane wrote:
>> Satoshi Nagayasu  writes:
>>> On 2015/09/01 13:41, Peter Geoghegan wrote:
 If you want to use the queryId field directly, which I recall you
 mentioning before, then that's harder. There is simply no contract
 among extensions for "owning" a queryId. But when the fingerprinting
 code is moved into core, then I think at that point queryId may cease
 to be even a thing that pg_stat_statements theoretically has the right
 to write into. Rather, it just asks the core system to do the
 fingerprinting, and finds it within queryId. At the same time, other
 extensions may do the same, and don't need to care about each other.

 Does that work for you?
>>
>>> Yes. I think so.
>>
>>> I need some query fingerprint to determine query group. I want queryid
>>> to keep the same value when query strings are the same (except literal
>>> values).
>>
>> The problem I've got with this is the unquestioned assumption that every
>> application for query IDs will have exactly the same requirements for
>> what the ID should include or ignore.
> 
> I'm not confident about that either, but at least I think we will be able
> to collect the most common use cases as of today. (aka best guess. :)
> 
> And IMHO it would be ok to change the spec in a future release.

I know this still needs to be discussed, but I would like to submit
a patch for further discussion at the next CF, 2015-11.

Regards,
-- 
NAGAYASU Satoshi 
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 59b8a2e..2f6fb99 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -76,6 +76,8 @@
 #include "tcop/utility.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
+#include "utils/rel.h"
+#include "utils/relcache.h"
 
 PG_MODULE_MAGIC;
 
@@ -2286,6 +2288,7 @@ static void
 JumbleRangeTable(pgssJumbleState *jstate, List *rtable)
 {
ListCell   *lc;
+   Relation   rel;
 
foreach(lc, rtable)
{
@@ -2296,7 +2299,9 @@ JumbleRangeTable(pgssJumbleState *jstate, List *rtable)
switch (rte->rtekind)
{
case RTE_RELATION:
-   APP_JUMB(rte->relid);
+   rel = RelationIdGetRelation(rte->relid);
+   APP_JUMB_STRING(RelationGetRelationName(rel));
+   RelationClose(rel);
JumbleExpr(jstate, (Node *) rte->tablesample);
break;
case RTE_SUBQUERY:



Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Michael Paquier
On Thu, Sep 3, 2015 at 3:41 AM, Robert Haas wrote:
> 3. IIUC, Postgres-XC handles this problem by reducing at least
> volatile functions, maybe all functions, to constants.  Then it
> generates an SQL statement to be sent to the data node to make the
> appropriate change.  If there's more than one copy of the data, we
> send a separate copy of the SQL statement to every node.  I'm not sure
> exactly what happens if some of those nodes are not available, but I
> don't think it's anything good.  Fundamentally, this model doesn't
> allow for many good options in that case.

I don't recall that. Immutable functions are switched to constants in
the query sent to datanodes. Volatile and stable functions are
evaluated locally after fetching the results from the remote node. Not
that efficient for warehouse loads. My 2c.

> 4. Therefore, I think that we should instead use logical replication,
> which might be either synchronous or asynchronous.  When you modify
> one copy of the data, that change will then be replicated to all other
> nodes.  If you are OK with eventual consistency, this replication can
> be asynchronous, and nodes that are off-line will catch up when they
> are on-line.  If you are not OK with that, then you must replicate
> synchronously to every node before transaction commit; or at least you
> must replicate synchronously to every node that is currently on-line.
> This presents some challenges: logical decoding currently can't
> replicate transactions that are still in process - replication starts
> when the transaction commits.  Also, we don't have any way for
> synchronous replication to wait for multiple nodes.

That's something that the quorum synchronous replication patch would
address. Still, having the possibility to be synchronous across multiple
nodes does not seem to be something at the top of the list.

> Also, the GTM needs to be aware that this stuff is happening, or it will 
> DTWT.  That too seems like a problem that can be solved.

If I understood correctly, yes: with its centralized transaction
facility, each node is aware of the transaction status via the global
snapshot.
-- 
Michael




Re: [HACKERS] Allow replication roles to use file access functions

2015-09-02 Thread Stephen Frost
* Michael Paquier (michael.paqu...@gmail.com) wrote:
> On Thu, Sep 3, 2015 at 10:05 AM, Stephen Frost wrote:
> > * Michael Paquier (michael.paqu...@gmail.com) wrote:
> >> On Thu, Sep 3, 2015 at 4:52 AM, Alvaro Herrera wrote:
> > The replication role already has read-only access to everything
> > (nearly?) in the PGDATA directory.  The specific issue in this case, I
> > believe, is if the functions mentioned provide access beyond what the
> > replication role already has access to.
> 
> It seems to me that this difference of files is symbolized by what is
> filtered out in a base backup, like postmaster.opts or postmaster.pid.

Neither of those strikes me as particularly sensitive information.

> The matter here is: are we going to have in the future files that we
> do not want to include in a base backup that are superuser-only? Today
> I cannot recall we have any of that, we may have in the future though.

This also seems unlikely.

> >> I don't entirely understand why pg_rewind should be invoking any of these
> >> functions to begin with.  Isn't it working directly with the disk data,
> >> with both postmasters shut down?
> >
> > Certain information is needed from the new master to rewind the old
> > master.  What I would hope is that we'd provide a way for *just* that
> > information to be made available to the replication role rather than
> > completely general access through the functions mentioned.
> 
> The rewound node does not only need the diffs of blocks from a source,
> but also some more information like clog. There are actually three
> different approaches to solve that:

Ok, so CLOG information is also required, but that's not a particularly
large issue and the replication role has access to those files already,
no?  It's just inconvenient to access them using the current protocol,
if that's all you're interested in.

> 1) Use a differential backup to me, or the possibility to fetch a set
> of data block diffs from a source node using an LSN and then re-apply
> them on the target node. The major disadvantage of this approach is
> actually performance: we would need to scan the source node
> completely, so that's slow. And pg_rewind is fast because it only
> scans the WAL records of the node to be rewound from the last
> checkpoint before WAL forked.

I don't follow this performance concern at all.  Why can't pg_rewind
look through the WAL and find what it needs, and then request exactly
that information?  Clearly, we would need to add to the existing
replication protocol, but I don't see any reason to not consider that a
perfectly reasonable approach for this.

> 2) Request data blocks from the source node using the replication
> protocol, that's what pg_read_binary_file actually does, we just don't
> have the logic on replication protocol side, though we could.

Right, we would need to modify the replication protocol to allow such
requests, but that's not particularly difficult.

> 3) Create a new set of functions similar to the existing file access
> functions that are usable by the replication user, except that they
> refuse to return file entries that match the existing filters in
> basebackup.c. That is doable, with a routine in basebackup.c that
> decides if a given file string can be read or not. Base backups could
> use this routine as well.

I don't particularly like this approach as it implies SQL access for the
replication role.

> >> > Another problem with it is that access to the filesystem is about halfway
> >> > to being superuser anyway, even if it's only read-only access.  It would
> >> > certainly let you read things that one would not expect a replication
> >> > user to be able to read (like files unrelated to Postgres).
> >>
> >> Doesn't pg_read_file et al insist that the path is below the data
> >> directorY?
> >
> > I don't believe that's the case, actually..  I might be wrong though,
> > I'm not looking at the code right now.
> 
> The use of PGDATA is enforced

We know that there are a lot of potential risks around enforcing this or
similar requirements.  Right now, we don't have to worry about any
of those corner cases, but we would if we used this approach.  If,
instead, the replication protocol is modified to accept requests for the
specific information (CLOG ), we could be sure to only ever return
exactly that information from the current cluster, or throw an error.

> > This is definitely a big part of
> > the question, but I'd like to ask- what, exactly, does pg_rewind
> > actually need access to? Is it only the WAL, or are heap and WAL files
> > needed?
> 
> Not only those; also the clog, configuration files, etc.

Configuration files?  Perhaps you could elaborate?

> > Consider the discussion about delta backups, et al, using LSNs.  Perhaps
> > the replication protocol should be extended to allow access to arbitrary
> > WAL, querying what WAL is available, and access to the heap files,
> > perhaps even specific pages in the heap files and relation forks,
> > instead of 

Re: [HACKERS] Allow replication roles to use file access functions

2015-09-02 Thread Michael Paquier
On Thu, Sep 3, 2015 at 10:05 AM, Stephen Frost wrote:
> * Michael Paquier (michael.paqu...@gmail.com) wrote:
>> On Thu, Sep 3, 2015 at 4:52 AM, Alvaro Herrera wrote:
> The replication role already has read-only access to everything
> (nearly?) in the PGDATA directory.  The specific issue in this case, I
> believe, is if the functions mentioned provide access beyond what the
> replication role already has access to.

It seems to me that this difference between files is symbolized by what is
filtered out in a base backup, like postmaster.opts or postmaster.pid.
The question here is: are we going to have, in the future, superuser-only
files that we do not want to include in a base backup? Today I cannot
recall that we have any of those, though we may in the future.

>> I don't entirely understand why pg_rewind should be invoking any of these
>> functions to begin with.  Isn't it working directly with the disk data,
>> with both postmasters shut down?
>
> Certain information is needed from the new master to rewind the old
> master.  What I would hope is that we'd provide a way for *just* that
> information to be made available to the replication role rather than
> completely general access through the functions mentioned.

The rewound node does not only need the diffs of blocks from a source,
but also some more information like clog. There are actually three
different approaches to solve that:
1) Use a differential backup to me, or the possibility to fetch a set
of data block diffs from a source node using an LSN and then re-apply
them on the target node. The major disadvantage of this approach is
actually performance: we would need to scan the source node
completely, so that's slow. And pg_rewind is fast because it only
scans the WAL records of the node to be rewound from the last
checkpoint before WAL forked.
2) Request data blocks from the source node using the replication
protocol, that's what pg_read_binary_file actually does, we just don't
have the logic on replication protocol side, though we could.
3) Create a new set of functions similar to the existing file access
functions that are usable by the replication user, except that they
refuse to return file entries that match the existing filters in
basebackup.c. That is doable, with a routine in basebackup.c that
decides if a given file string can be read or not. Base backups could
use this routine as well.
I just came up with 3), which looks rather appealing to me.

>> > Another problem with it is that access to the filesystem is about halfway
>> > to being superuser anyway, even if it's only read-only access.  It would
>> > certainly let you read things that one would not expect a replication
>> > user to be able to read (like files unrelated to Postgres).
>>
>> Doesn't pg_read_file et al insist that the path is below the data
>> directorY?
>
> I don't believe that's the case, actually..  I might be wrong though,
> I'm not looking at the code right now.

The use of PGDATA is enforced:
=# select pg_read_file('/etc/passwd');
ERROR:  42501: absolute path not allowed
LOCATION:  convert_and_check_filename, genfile.c:73
If PGDATA includes soft links it is possible to point to those places,
though. That's not something anybody sane would do, but it can be done,
and I've seen worse:
$ cd $PGDATA && ln -s /etc etc
$ psql -c "select pg_read_file('etc/passwd')"
# Adult content, not an ERROR
From this perspective, allowing a replication user access to those
functions is dangerous: a superuser is the equivalent of an almighty
god on a server, but that is not the case for the replication user.
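
For illustration, the kind of confinement check involved looks roughly
like this (a simplified, self-contained sketch, not the genfile.c code):

#include <stdbool.h>
#include <string.h>

static bool
path_is_confined_sketch(const char *filename)
{
	size_t		len = strlen(filename);

	/* Absolute paths are rejected outright. */
	if (filename[0] == '/')
		return false;

	/* So are ".." components that would climb out of PGDATA. */
	if (strcmp(filename, "..") == 0 ||
		strncmp(filename, "../", 3) == 0 ||
		strstr(filename, "/../") != NULL ||
		(len >= 3 && strcmp(filename + len - 3, "/..") == 0))
		return false;

	/* A relative path that resolves through a symlink is NOT caught here. */
	return true;
}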

> This is definitely a big part of
> the question, but I'd like to ask- what, exactly, does pg_rewind
> actually need access to? Is it only the WAL, or are heap and WAL files
> needed?

Not only those; also the clog, configuration files, etc.

> Consider the discussion about delta backups, et al, using LSNs.  Perhaps
> the replication protocol should be extended to allow access to arbitrary
> WAL, querying what WAL is available, and access to the heap files,
> perhaps even specific pages in the heap files and relation forks,
> instead of giving pg_rewind access to these extremely general
> nearly-OS-user-level functions.

The problem when using differential backups in this case is
performance, as mentioned above. We would need to scan the whole source
cluster, which may take time; the current approach of pg_rewind only
needs to scan WAL records to find the list of blocks modified, and
directly requests them from the source. I would expect pg_rewind to be
as quick as possible.

> Further, for my two cents, I view any access to the system by a
> replication role to a non-replication-protocol communication with the
> server as being inappropriate.  To that point, I'd actually remove the
> ability for the replication roles to call pg_start/stop_backup as normal
> SQL calls since that implies the replication user has connected to a
> normal backend instead of talking the replicat

Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE

2015-09-02 Thread Jim Nasby

On 9/2/15 2:56 PM, Jim Nasby wrote:

On 9/2/15 2:17 PM, Alvaro Herrera wrote:

Michael Paquier wrote:


I haven't written yet a test case but I think that we could reproduce
that simply by having a relation referenced in the exception block of
a first function, calling a second function that itself raises an
exception, causing the referencing error.


Hm, so function 2 is called in the exception block of function 1?

Are you going to produce the test case for this?  Jim?


I had attempted one but the issue is that it's more than just an
exception block thing. If it was that simple then we'd still get the
crash without involving pgTap. So I suspect you need to have a named
cursor in the mix as well.

Let me make another attempt at something simpler.


After some time messing around with this I've been able to remove pgTap 
from the equation using the attached crash.sql (relevant snippet below).


This only works if you pass a refcursor into the function. It doesn't 
work if you do


OPEN have FOR EXECUTE $$SELECT * FROM tf.get( NULL::invoice, 'base' )$$;

inside the function instead. This code also produces different results; 
it outputs the error message before crashing and the stack trace shows 
the assert is now failing while trying to abort the top-level 
transaction. So it looks like the scenario is:


BEGIN;
DECLARE (open?) cursor that calls function with exception handler that 
calls function with exception handler that calls something that fails


The double set of exceptions seems to be critical; if instead of calling 
tf.get(invoice) (which recursively does tf.get(customer)) I define the 
cursor to call tf.get(customer) directly there's no assert.


I can poke at this more tomorrow, but it'd be very helpful if someone 
could explain this failure mode, as I'm basically stumbling in the dark 
on a test case right now.


CREATE OR REPLACE FUNCTION a(
have refcursor
) RETURNS void LANGUAGE plpgsql AS $body$
DECLARE
have_rec record;
BEGIN
FETCH have INTO have_rec;
END
$body$;
DECLARE h CURSOR FOR SELECT * FROM tf.get( NULL::invoice, 'base' );
SELECT a(
  'h'::refcursor
);


(lldb) bt
* thread #1: tid = 0x722ed8, 0x7fff92a5a866 
libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', 
stop reason = signal SIGABRT
  * frame #0: 0x7fff92a5a866 libsystem_kernel.dylib`__pthread_kill + 10
frame #1: 0x7fff9001b35c libsystem_pthread.dylib`pthread_kill + 92
frame #2: 0x7fff8cf89b1a libsystem_c.dylib`abort + 125
frame #3: 0x00010cdb4039 
postgres`ExceptionalCondition(conditionName=0x00010cf59cfd, 
errorType=0x00010cec392d, fileName=0x00010cf59045, lineNumber=1972) + 
137 at assert.c:54
frame #4: 0x00010cd9b332 
postgres`RelationClearRelation(relation=0x00011658f110, rebuild='\0') + 162 
at relcache.c:1970
frame #5: 0x00010cd9cc0f 
postgres`AtEOSubXact_cleanup(relation=0x00011658f110, isCommit='\0', 
mySubid=15, parentSubid=1) + 79 at relcache.c:2665
frame #6: 0x00010cd9cb92 
postgres`AtEOSubXact_RelationCache(isCommit='\0', mySubid=15, parentSubid=1) + 
242 at relcache.c:2633
frame #7: 0x00010c9b6e88 postgres`AbortSubTransaction + 440 at 
xact.c:4373
frame #8: 0x00010c9b7208 postgres`AbortCurrentTransaction + 312 at 
xact.c:2947
frame #9: 0x00010cc249f3 postgres`PostgresMain(argc=1, 
argv=0x7fa3f4800378, dbname=0x7fa3f4800200, 
username=0x7fa3f30031f8) + 1875 at postgres.c:3902
frame #10: 0x00010cb9da48 postgres`PostmasterMain [inlined] BackendRun 
+ 8024 at postmaster.c:4155
frame #11: 0x00010cb9da22 postgres`PostmasterMain [inlined] 
BackendStartup at postmaster.c:3829
frame #12: 0x00010cb9da22 postgres`PostmasterMain [inlined] ServerLoop 
at postmaster.c:1597
frame #13: 0x00010cb9da22 postgres`PostmasterMain(argc=, 
argv=) + 7986 at postmaster.c:1244
frame #14: 0x00010cb218cd postgres`main(argc=, 
argv=) + 1325 at main.c:228
frame #15: 0x7fff8e9a35fd libdyld.dylib`start + 1

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
BEGIN;
\i test/helpers/tap_setup.sql

CREATE EXTENSION test_factory;
SET search_path=tap;
\i test/helpers/create.sql

SELECT tf.register(
  'customer'
  , array[
row(
  'insert'
  , $$INSERT INTO customer VALUES (DEFAULT, 'first', 'last' ) RETURNING *$$
)::tf.test_set
, row(
  'function'
  , $$SELECT * FROM customer__add( 'func first', 'func last' )$$
)::tf.test_set
  ]
);
SELECT tf.register(
  'invoice'
  , array[
  row(
'insert'
, $$INSERT INTO invoice VALUES(
DEFAULT
, (tf.get( NULL::customer, 'insert' )).customer_id
, current_date
, current_date + 30
  ) RETURNING *$$
  )::tf.test_set
  , row(
'function'
, $$INSERT INTO invoice VALUES(
  

Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Greg Stark
That doesn't match any of the empirical tests I did at the time. I posted
graphs of the throughput for varying numbers of spindles with varying
amounts of prefetch. In every case more prefetching increases throughput,
up to N times the single-platter throughput, where N was the number of
spindles.

There can be a small speedup from overlapping CPU with I/O, but that's a
really small effect. At most that can hide a single request, and only very
rarely would the amount of CPU time be even a moderate fraction of the I/O
latency. The only case where they're comparable is when you're reading
sequentially, and then hopefully you wouldn't be using Postgres prefetching
at all, which is only really intended to help random I/O.
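
(For reference, the coupon-collector expectation referred to below is

	E[T] = n * H_n = n * (1 + 1/2 + ... + 1/n) ~= n * (ln n + 0.5772),

i.e. the expected number of random requests needed before every one of the
n spindles has been hit at least once.)
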
On 2 Sep 2015 23:38, "Andres Freund"  wrote:

> On 2015-09-02 19:49:13 +0100, Greg Stark wrote:
> > I can take the blame for this formula.
> >
> > It's called the "Coupon Collector Problem". If you get a random
> > coupon from a set of n possible coupons, how many random coupons would
> > you have to collect before you expect to have at least one of each?
>
> My point is that that's just the entirely wrong way to model
> prefetching. Prefetching can be massively beneficial even if you only
> have a single platter! Even if there were no queues on the hardware or
> OS level! Concurrency isn't the right way to look at prefetching.
>
> You need to prefetch so far ahead that you'll never block on reading
> heap pages - and that's only the case if processing the next N heap
> blocks takes longer than the prefetch of the (N+1)th page. That doesn't
> mean there continuously have to be N+1 prefetches in progress - in fact
> that actually often will only be the case for the first few, after that
> you hopefully are bottlenecked on CPU.
>
> If you additionally take into account hardware realities where you have
> multiple platters, multiple spindles, command queueing etc, that's even
> more true. A single rotation of a single platter with command queuing
> can often read several non consecutive blocks if they're on a similar
>
> - Andres
>


Re: [HACKERS] Improving test coverage of extensions with pg_dump

2015-09-02 Thread Michael Paquier
On Thu, Sep 3, 2015 at 12:38 AM, Andres Freund wrote:
> Isn't a full test with a separate initdb, create extension etc. a really
> heavyhanded way to test this? I mean that's a test where the setup takes
> up to 10s, whereas the actual runtime is in the millisecond range?
>
> Adding tests in this manner doesn't seem scalable to me.

How to include this kind of test has been at the heart of the discussion
since this patch showed up. I think we are now close to 5
different opinions from 4 different hackers on the matter; Riggs'
theorem applies nicely.

> I think we should rather add *one* test that does dump/restore over the
> normal regression test database. Something similar to the pg_upgrade
> tests. And then work at adding more content to the regression test
> database - potentially sourcing from src/test/modules.

If you are worrying about run time, pg_upgrade already does exactly
that. What would be the point of doubling the amount of time to do the
same thing in two different places?
-- 
Michael




Re: [HACKERS] Improving test coverage of extensions with pg_dump

2015-09-02 Thread Michael Paquier
On Thu, Sep 3, 2015 at 2:30 AM, Alvaro Herrera wrote:
> Andres Freund wrote:
>> Isn't a full test with a separate initdb, create extension etc. a really
>> heavyhanded way to test this? I mean that's a test where the setup takes
>> up to 10s, whereas the actual runtime is in the millisecond range?
>
> I spent some time looking over this patch yesterday, and that was one
> concern I had too.  I was thinking that if we turn this into a single
> module that can contain several extensions, or several pg_dump-related
> tests, and then test multiple features without requiring an initdb for
> each, it would be more palatable.

Yeah, sure. That's a statement you could generalize to all the TAP
tests, for example the pg_rewind tests.

>> Adding tests in this manner doesn't seem scalable to me.
>
> I was thinking of having this be renamed src/test/modules/extensions/
> and then the extension contained here would be renamed ext001_fk_tables
> or something like that; later we could add ext002_something for testing some
> other angle of extensions, not necessarily pg_dump related.

So, if I am following correctly, what you are suggesting is:
1) create src/test/modules/extensions/, and include fk_tables in it for now;
2) offer the possibility to run make check in this path, doing
complicated tests on a single instance, pg_dump on an instance being one
of them.
Note that for now the TAP test machinery does not allow sharing a
single cluster instance across multiple Perl scripts, and it seems to me
that's what you mean here (in my opinion that's out of scope for
this patch, and I don't think we should do it either; it would make
error handling a nightmare). We can, however, already have a single TAP
script doing all the tests on a single instance. So which one are you
suggesting?

The patch proposed here actually does the second, initializing an
instance and performing pg_dump on it. If we change it your way we
should just rename the test script to something like
001_test_extensions.pl and mention in it that new tests for extensions
should be included there. Still, I would imagine that each extension
is actually going to need slightly different test patterns to work or
to cover what it is intended to cover. See for example the set of
tests added in the CREATE EXTENSION CASCADE patch: those are not
adapted to run with the TAP machinery. I may of course be wrong, and we
may have other extensions in the future using that, but I am seeing
none around for now.

If you are worrying about run time, we could as well have make check
add an EXTRA_INSTALL pointing to src/test/modules/extensions/, with an
additional test, say extensions.sql, that creates data based on it.
Then we let pg_upgrade do the dump/restore checks. This way the run
time of make check-world is minimally impacted, and the code paths we
want tested are still exercised by the buildfarm. That's ugly, but it
won't impact performance.

>> I think we should rather add *one* test that does dump/restore over the
>> normal regression test database. Something similar to the pg_upgrade
>> tests. And then work at adding more content to the regression test
>> database - potentially sourcing from src/test/modules.
>
> That's another option, but we've had this idea for many years now and it
> hasn't materialized.  As I recall, Andrew Dunstan has a module that
> tests cross-version pg_upgrade and one thing he does is dump both and
> compare; the problem is that there are differences, so he keeps a count
> of how many lines he expects to differ between any two releases.  Or
> something like that.  While it's a good enough start, I don't think it's
> robust enough to be in core.  How would we do it?

That does not seem plausible to me for integration in core: you
would need to keep switching between branches in a given git tree if
the test is being done within a git repository, and that would simply
not be doable from a raw tarball, which stores only one given version.
-- 
Michael




Re: [HACKERS] Allow replication roles to use file access functions

2015-09-02 Thread Stephen Frost
Micahel,

* Michael Paquier (michael.paqu...@gmail.com) wrote:
> On Thu, Sep 3, 2015 at 4:52 AM, Alvaro Herrera  
> wrote:
> > Michael Paquier wrote:
> >> As of now, file access functions in genfile.c can only be used by
> >> superusers. This proposal is to relax those functions so as
> >> replication users can use them as well. Here are the functions aimed
> >> by this patch:
> >> - pg_stat_file
> >> - pg_read_binary_file
> >> - pg_read_file
> >> - pg_ls_dir
> >> The main argument for this change is that pg_rewind makes use of those
> >> functions, forcing users to use a superuser role when rewinding a
> >> node.
> >
> > Can this piggyback on Stephen Frost's proposed patch for reserved roles et 
> > al?
> 
> (Adding Stephen in the loop)

Thanks!

> I thought that this was rather independent on Stephen's default role
> patch because this is not based on a new role permission type.
> Stephen, do you think so, or should we merge the effort with your
> things?

It's not directly related to the default roles as it's adding
permissions to an existing role attribute control, but it's certainly
covering related territory and I'd like to make sure we're careful with
any increase in permissions for existing role attributes.

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Michael Paquier  writes:
> > On Thu, Sep 3, 2015 at 4:52 AM, Alvaro Herrera  
> > wrote:
> >> Michael Paquier wrote:
> >>> The main argument for this change is that pg_rewind makes use of those
> >>> functions, forcing users to use a superuser role when rewinding a
> >>> node.
> 
> >> Can this piggyback on Stephen Frost's proposed patch for reserved roles et 
> >> al?
> 
> > (Adding Stephen in the loop)
> > I thought that this was rather independent on Stephen's default role
> > patch because this is not based on a new role permission type.
> > Stephen, do you think so, or should we merge the effort with your
> > things?
> 
> Just on general principles, this seems like a pretty horrid idea.

Agreed- it's far broader than what is necessary, as I understand it.

> To me replication privilege means the ability to transfer data out of
> the master, not to cause arbitrary state changes on the master.

Agreed, but as Andres points out, this doesn't allow state changes on
the master.

> Therefore, "allow replication users to run pg_rewind" seems less like
> a feature and more like a security bug.  Am I missing something?

I believe Andres clarified this: it is about remastering an old master
to follow a *new* master, and at that point the old master needs
information from the new master to reset itself back to a point where
it can turn into a replica of the new master.

> Another problem with it is that access to the filesystem is about halfway
> to being superuser anyway, even if it's only read-only access.  It would
> certainly let you read things that one would not expect a replication
> user to be able to read (like files unrelated to Postgres).

The replication role already has read-only access to everything
(nearly?) in the PGDATA directory.  The specific issue in this case, I
believe, is if the functions mentioned provide access beyond what the
replication role already has access to.

> I don't entirely understand why pg_rewind should be invoking any of these
> functions to begin with.  Isn't it working directly with the disk data,
> with both postmasters shut down?

Certain information is needed from the new master to rewind the old
master.  What I would hope is that we'd provide a way for *just* that
information to be made available to the replication role rather than
completely general access through the functions mentioned.

* Andres Freund (and...@anarazel.de) wrote:
> On 2015-09-02 19:48:15 -0400, Tom Lane wrote:
> > Just on general principles, this seems like a pretty horrid idea.
> > To me replication privilege means the ability to transfer data out of
> > the master, not to cause arbitrary state changes on the master.
> 
> It's not about the permission to trigger pg_rewind on the master - it's
> about being able to run pg_rewind (as the necessary OS user) on the
> *standby* when the connection to the primary has only replication rather
> than superuser privs.

Understood.

> > Another problem with it is that access to the filesystem is about halfway
> > to being superuser anyway, even if it's only read-only access.  It would
> > certainly let you read things that one would not expect a replication
> > user to be able to read (like files unrelated to Postgres).
> 
> Doesn't pg_read_file et al insist that the path is below the data
> directory?

I don't believe that's the case, actually..  I might be wrong though,
I'm not looking at the code right now.  This is definitely a big part of
the question, but I'd like to ask- what, exactly, does pg_rewind
actually need access to?  Is it only the WAL, or are heap and WAL files
needed?  How much of that access could be reasonably needed as part of
the replication protocol for

Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Robert Haas
On Wed, Sep 2, 2015 at 6:56 PM, Bruce Momjian  wrote:
> On Wed, Sep  2, 2015 at 02:41:46PM -0400, Robert Haas wrote:
>> 4. Therefore, I think that we should instead use logical replication,
>> which might be either synchronous or asynchronous.  When you modify
>> one copy of the data, that change will then be replicated to all other
>> nodes.  If you are OK with eventual consistency, this replication can
>> be asynchronous, and nodes that are off-line will catch up when they
>> are on-line.  If you are not OK with that, then you must replicate
>> synchronously to every node before transaction commit; or at least you
>> must replicate synchronously to every node that is currently on-line.
>> This presents some challenges: logical decoding currently can't
>> replicate transactions that are still in process - replication starts
>> when the transaction commits.  Also, we don't have any way for
>> synchronous replication to wait for multiple nodes.  But in theory
>> those seem like limitations that can be lifted.  Also, the GTM needs
>> to be aware that this stuff is happening, or it will DTWT.  That too
>> seems like a problem that can be solved.
>
> Can you explain why logical replication is better than binary
> replication for this use-case?

Uh, well, for the same reasons it is better in many other cases.
Particularly, you probably don't want to replicate all the data on
machine A to machine B, just some of it.

Typically, sharding solutions store multiple copies of each piece of
data.  So let's say you have 4 machines.  You divide the data into 12
chunks.  Each machine is the write-master for 2 of those chunks, but
has secondary copies of 3 others.  So maybe things start out like
this:

machine #1: master for chunks 1, 2, 3; also has copies of chunks 4, 7, 10
machine #2: master for chunks 4, 5, 6; also has copies of chunks 1, 8, 11
machine #3: master for chunks 7, 8, 9; also has copies of chunks 2, 5, 12
machine #4: master for chunks 10, 11, 12; also has copies of chunks 3, 6, 9

If machine #1 is run over by a rabid triceratops, you can make machine
#2 the master for chunk 1, machine #3 the master for chunk 2, and
machine #4 the master for chunk 3.  The write load therefore remains
evenly divided.  If you can only copy entire machines, you can't
achieve that in this situation.
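
In code terms, the metadata you need is roughly a per-chunk map rather
than a per-node one -- a toy sketch, nothing to do with any actual
PostgreSQL structure:

    #define NCHUNKS     12

    typedef struct ChunkMap
    {
        int     master;         /* node that takes writes for this chunk */
        int     replicas[2];    /* nodes holding logically-replicated copies */
    } ChunkMap;

    static ChunkMap chunk_map[NCHUNKS];

    /*
     * When a node dies, promote a surviving replica of each of its chunks.
     * Because the copies are spread across different nodes, the write load
     * stays spread out instead of landing on a single standby.
     */
    static void
    fail_over(int dead_node)
    {
        int     c, r;

        for (c = 0; c < NCHUNKS; c++)
        {
            if (chunk_map[c].master != dead_node)
                continue;
            for (r = 0; r < 2; r++)
            {
                if (chunk_map[c].replicas[r] != dead_node)
                {
                    chunk_map[c].master = chunk_map[c].replicas[r];
                    break;
                }
            }
        }
    }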

I'm not saying that the above is exactly what we're going to end up
with, or even necessarily close.  But a big part of the point of
sharding is that not all the machines have the same data - otherwise
you are not write scaling.  But it will frequently be the case, for
various reasons, that they have *overlapping* sets of data.  Logical
replication can handle that; physical replication can't.

In Postgres-XC, all tables are either sharded (part of the table is
present on each node) or distributed (all of the table is present on
every node).  Clearly, there's no way to use physical replication in
that scenario except if you are OK with having two copies of every
node.  But that's not a very good solution.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] src/test/ssl broken on HEAD

2015-09-02 Thread Michael Paquier
On Thu, Sep 3, 2015 at 5:22 AM, Robert Haas wrote:
> Still, that's not a reason not commit this, so done.

Thanks.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench stats per script & other stuff

2015-09-02 Thread Robert Haas
On Wed, Sep 2, 2015 at 5:55 PM, Andres Freund  wrote:
> On 2015-09-02 14:36:51 -0400, Robert Haas wrote:
>> On Wed, Sep 2, 2015 at 2:20 PM, Fabien COELHO  wrote:
>> >> I'm wondering if percentages instead of weights would be a better
>> >> idea. That'd mean you'd be forced to be more careful when adding another
>> >> script (having to adjust the percentages of other scripts) but arguably
>> >> that's a good thing?
>> >
>> > If you use only percentages, then you have to check that the total is 100,
>> > probably use floats, and do something when the total is not 100; the
>> > checking would complicate the code and test people's mental calculus
>> > abilities. Not sure this is a good idea:-)
>>
>> I agree.  I don't see a reason to enforce that the total of the
>> weights must be 100.
>
> I'm slightly worried that using weights will be a bit confusing because
> adding another script will obviously reduce the frequency of already
> defined scripts. But it's probably not worth worrying about.

That sounds like a feature to me, not a bug.  I wouldn't worry.
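
For what it's worth, the selection logic is tiny either way -- something
like this (a sketch with made-up names, not the actual patch; plain
random() from stdlib used for brevity, modulo bias ignored):

    /*
     * Pick a script index with probability weights[i] / total_weight.
     * The weights are arbitrary positive integers; nothing requires
     * them to add up to 100.
     */
    static int
    choose_script(const int *weights, int nscripts, int total_weight)
    {
        int     w = random() % total_weight;    /* uniform in [0, total_weight) */
        int     i;

        for (i = 0; i < nscripts - 1; i++)
        {
            if (w < weights[i])
                break;
            w -= weights[i];
        }
        return i;
    }

With percentages you would additionally have to validate that the inputs
sum to 100, which is the extra code Fabien is objecting to.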

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] exposing pg_controldata and pg_config as functions

2015-09-02 Thread Robert Haas
On Wed, Sep 2, 2015 at 5:31 PM, Joe Conway  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
>
> On 09/02/2015 05:25 PM, Tom Lane wrote:
>> Robert Haas  writes:
>>> But I'm not sure I like the idea of adding a server dependency on
>>> the ability to exec pg_controldata.  That seems like it could be
>>> unreliable at best, and a security vulnerability at worst.
>>
>> I hadn't been paying attention --- the proposed patch actually
>> depends on exec'ing pg_controldata?  That's horrid!  There is no
>> expectation that that's installed.
>
> No it doesn't. I'm confused :-/

No, I'm confused.  Sorry.  Somehow I misread your patch.

Pay no attention to that man behind the curtain.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-09-02 Thread Robert Haas
On Wed, Aug 26, 2015 at 4:05 AM, Kouhei Kaigai  wrote:
>> On 2015/08/26 16:07, Kouhei Kaigai wrote:
>> I wrote:
>> >> Maybe I'm missing something, but why do we need such flexibility for
>> >> the columnar-stores?
>>
>> > Even if we enforce a new interface specification that is comfortable for RDBMS,
>> > we cannot guarantee it is also comfortable for other types of FDW drivers.
>>
>> Specifically, what kind of points about the patch are specific to RDBMS?
>>
>
>   *** 88,93  ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
>   --- 99,122 
> TupleTableSlot *
> ExecForeignScan(ForeignScanState *node)
> {
>   + EState *estate = node->ss.ps.state;
>   +
>   + if (estate->es_epqTuple != NULL)
>   + {
>   + /*
>   +  * We are inside an EvalPlanQual recheck.  If foreign join, get next
>   +  * tuple from subplan.
>   +  */
>   + Index   scanrelid = ((Scan *) node->ss.ps.plan)->scanrelid;
>   +
>   + if (scanrelid == 0)
>   + {
>   + PlanState  *outerPlan = outerPlanState(node);
>   +
>   + return ExecProcNode(outerPlan);
>   + }
>   + }
>   +
> return ExecScan((ScanState *) node,
> (ExecScanAccessMtd) ForeignNext,
> (ExecScanRecheckMtd) ForeignRecheck);
>
> It might not be specific to RDBMS, however, we cannot guarantee that all
> FDWs are comfortable running the alternative plan node on an EPQ recheck.
> This design does not allow FDW drivers to implement their own EPQ recheck,
> possibly more efficient than the built-in logic.

I'm not convinced that this problem is more than hypothetical.  EPQ
rechecks should be quite rare, so it shouldn't really matter if we
jump through a few extra hoops when they happen.  And really, are
those hoops all that expensive?  It's not as if ExecInitNode should be
doing any sort of expensive operation, or ExecEndScan either.  And
they will be able to tell if they're being called for an EPQ-recheck
by fishing out the estate, so if there's some processing that they
want to short-circuit for that case, they can.  So I'm not seeing the
problem.  Do you have any evidence that either the performance cost or
the code complexity cost is significant for PG-Strom or any other
extension?

That having been said, I don't entirely like Fujita-san's patch
either.  Much of the new code is called immediately adjacent to an FDW
callback which could pretty trivially do the same thing itself, if
needed.  And much of it is contingent on whether estate->es_epqTuple
!= NULL and scanrelid == 0, but perhaps it would be better to check
whether the subplan is actually present instead of checking whether we
think it should be present.  Also, the naming is a bit weird:
node->fs_subplan gets shoved into outerPlanState(), which seems like a
kludge.

I'm wondering if there's another approach.  If I understand correctly,
there are two reasons why the current situation is untenable.  The
first is that ForeignRecheck always returns true, but we could instead
call an FDW-supplied callback routine there.  The callback could be
optional, so that we just return true if there is none, which is nice
for already-existing FDWs that then don't need to do anything.  The
second is that ExecScanFetch needs scanrelid > 0 so that
estate->es_epqTupleSet[scanrelid - 1] isn't indexing off the beginning
of the array, and similarly estate->es_epqScanDone[scanrelid - 1] and
estate->es_epqTuple[scanrelid - 1].  But, waving my hands wildly, that
also seems like a solvable problem.  I mean, we're joining a non-empty
set of relations, so the entries in the EPQ-related arrays for those
RTIs are not getting used for anything, so we can use any of them for
the joinrel.  We need some way for this code to decide what RTI to
use, but that shouldn't be too hard to finagle.
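
For concreteness, the first part could look roughly like this (hypothetical
callback name and signature, just to illustrate the shape):

    static bool
    ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
    {
        FdwRoutine *fdwroutine = node->fdwroutine;

        /* Optional callback: FDWs that don't provide it keep today's behavior. */
        if (fdwroutine->RecheckForeignScan == NULL)
            return true;

        return fdwroutine->RecheckForeignScan(node, slot);
    }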

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Memory prefetching while sequentially fetching from SortTuple array, tuplestore

2015-09-02 Thread Andres Freund
On 2015-09-02 17:14:12 -0700, Peter Geoghegan wrote:
> On Wed, Sep 2, 2015 at 12:24 PM, Peter Geoghegan  wrote:
> > On Wed, Sep 2, 2015 at 7:12 AM, Andres Freund  wrote:
> >> On 2015-07-19 16:34:52 -0700, Peter Geoghegan wrote:
> >> Hm. Does a compiler test actually test anything reliably here? Won't this
> >> just throw a warning during compile time about an unknown function?
> >
> > I'll need to look into that.
> 
> My error here was not considering why the existing __builtin_bswap32()
> test worked alright with AC_COMPILE_IFELSE(), while my test needed
> AC_LINK_IFELSE(). I'll be sure to make a point of manually testing
> failure (not just success) when writing compiler tests in the future.
> 
> AC_LINK_IFELSE() appears to work fine following a straight
> substitution. When I get around to revising this patch in a while,
> that fix will be included.

Be sure to also test with optimizations enabled - often tests can get
optimized away if you don't force the result to be used (or volatile) :/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Andres Freund
On 2015-09-03 01:59:13 +0200, Tomas Vondra wrote:
> That's a bit surprising, especially considering that e_i_c=30 means ~100
> pages to prefetch if I'm doing the math right.
> 
> AFAIK queue depth for SATA drives generally is 32 (so prefetching 100 pages
> should not make a difference), 256 for SAS drives and ~1000 for most current
> RAID controllers.

I think the point is that after an initial buildup - we'll again only
prefetch pages in smaller increments because we already prefetched most
pages. The actual number of concurrent reads will then be determined by
how fast pages are processed.  A prefetch_target of 100 does *not*
mean 100 requests are continuously in flight. And even if it did, the
OS won't issue many more requests than the queue depth, so they'll just
sit in the OS queue.

> So instead of "How many blocks I need to prefetch to saturate the devices?"
> you're asking "How many blocks I need to prefetch to never actually wait for
> the I/O?"

Yes, pretty much.

> I do like this view, but I'm not really sure how could we determine the
> right value? It seems to be very dependent on hardware and workload.

Right.

> For spinning drives the speedup comes from optimizing random seeks to a more
> optimal path (thanks to NCQ/TCQ), and on SSDs thanks to using the parallel
> channels (and possibly faster access to the same block).

+ OS reordering & coalescing.

Don't forget that the OS processes the OS IO queues while the userland
process isn't scheduled - in a concurrent workload with more processes
than hardware threads that means that pending OS requests are being
issued while the query isn't actively being processed.

> I guess the best thing we could do at this level is simply keep the
> on-device queues fully saturated, no?

Well, being too aggressive can hurt throughput and latency of concurrent
processes, without being beneficial.

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow replication roles to use file access functions

2015-09-02 Thread Michael Paquier
On Thu, Sep 3, 2015 at 8:59 AM, Andres Freund  wrote:
> On 2015-09-02 19:48:15 -0400, Tom Lane wrote:
>> Just on general principles, this seems like a pretty horrid idea.
>> To me replication privilege means the ability to transfer data out of
>> the master, not to cause arbitrary state changes on the master.
>
> It's not about the permission to trigger pg_rewind on the master - it's
> about being able to run pg_rewind (as the necessary OS user) on the
> *standby* when the connection to the primary has only replication rather
> than superuser privs.

Yeah, I got poked about this limitation of pg_rewind some time ago
internally, actually: folks wanted to be able to manage their cluster
with only a replication role, and they were not really willing to have
a superuser used for such operations across the network.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Memory prefetching while sequentially fetching from SortTuple array, tuplestore

2015-09-02 Thread Peter Geoghegan
On Wed, Sep 2, 2015 at 12:24 PM, Peter Geoghegan  wrote:
> On Wed, Sep 2, 2015 at 7:12 AM, Andres Freund  wrote:
>> On 2015-07-19 16:34:52 -0700, Peter Geoghegan wrote:
>> Hm. Does a compiler test actually test anything reliably here? Won't this
>> just throw a warning during compile time about an unknown function?
>
> I'll need to look into that.

My error here was not considering why the existing __builtin_bswap32()
test worked alright with AC_COMPILE_IFELSE(), while my test needed
AC_LINK_IFELSE(). I'll be sure to make a point of manually testing
failure (not just success) when writing compiler tests in the future.

AC_LINK_IFELSE() appears to work fine following a straight
substitution. When I get around to revising this patch in a while,
that fix will be included.

Thanks
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow replication roles to use file access functions

2015-09-02 Thread Andres Freund
On 2015-09-02 19:48:15 -0400, Tom Lane wrote:
> Just on general principles, this seems like a pretty horrid idea.
> To me replication privilege means the ability to transfer data out of
> the master, not to cause arbitrary state changes on the master.

It's not about the permission to trigger pg_rewind on the master - it's
about being able to run pg_rewind (as the necessary OS user) on the
*standby* when the connection to the primary has only replication rather
than superuser privs.

> Another problem with it is that access to the filesystem is about halfway
> to being superuser anyway, even if it's only read-only access.  It would
> certainly let you read things that one would not expect a replication
> user to be able to read (like files unrelated to Postgres).

Doesn't pg_read_file et al insist that the path is below the data
directorY?

> I don't entirely understand why pg_rewind should be invoking any of these
> functions to begin with.  Isn't it working directly with the disk data,
> with both postmasters shut down?

There's two modes: If both data directories are on the same server both
need to be shut down. If, as it's pretty commonly the case, the "source"
db is on another system the source system must be started (as a
primary). In the second mode all the files that have changed are copied
over a libpq connection.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Tomas Vondra



On 09/03/2015 12:23 AM, Andres Freund wrote:
> On 2015-09-02 14:31:35 -0700, Josh Berkus wrote:
>> On 09/02/2015 02:25 PM, Tomas Vondra wrote:
>>> As I explained, spindles have very little to do with it - you need
>>> multiple I/O requests per device, to get the benefit. Sure, the DBAs
>>> should know how many spindles they have and should be able to determine
>>> optimal IO depth. But we actually say this in the docs:
>>
>> My experience with performance tuning is that values above 3 have no
>> real effect on how queries are executed.
>
> I saw pretty much the opposite - the benefits were seldom significant
> below 30 or so. Even on single disks.

That's a bit surprising, especially considering that e_i_c=30 means ~100
pages to prefetch if I'm doing the math right.

AFAIK queue depth for SATA drives generally is 32 (so prefetching 100
pages should not make a difference), 256 for SAS drives and ~1000 for
most current RAID controllers.

I'm not entirely surprised that values beyond 30 make a difference, but
I am surprised that you seldom saw significant improvements below this
value.

No doubt there are workloads like that, but I'd expect them to be quite
rare and not prevalent as you're claiming.

> Which actually isn't that surprising - to be actually beneficial
> (that is, turn an IO into a CPU bound workload) the prefetched buffer
> needs to actually have been read in by the time it's needed. In many
> queries processing a single heap page takes far less time than
> prefetching the data from storage, even if it's on good SSDs.
>
> Therefore what you actually need is a queue of prefetches for the
> next XX buffers so that between starting a prefetch and actually
> needing the buffer enough time has passed that the data is
> completely read in. And the point is that that's the case even for a
> single rotating disk!

So instead of "How many blocks I need to prefetch to saturate the
devices?" you're asking "How many blocks I need to prefetch to never
actually wait for the I/O?"

I do like this view, but I'm not really sure how could we determine the
right value? It seems to be very dependent on hardware and workload.

For spinning drives the speedup comes from optimizing random seeks to a
more optimal path (thanks to NCQ/TCQ), and on SSDs thanks to using the
parallel channels (and possibly faster access to the same block).

I guess the best thing we could do at this level is simply keep the
on-device queues fully saturated, no?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow replication roles to use file access functions

2015-09-02 Thread Tom Lane
Michael Paquier  writes:
> On Thu, Sep 3, 2015 at 4:52 AM, Alvaro Herrera  
> wrote:
>> Michael Paquier wrote:
>>> The main argument for this change is that pg_rewind makes use of those
>>> functions, forcing users to use a superuser role when rewinding a
>>> node.

>> Can this piggyback on Stephen Frost's proposed patch for reserved roles et 
>> al?

> (Adding Stephen in the loop)
> I thought that this was rather independent on Stephen's default role
> patch because this is not based on a new role permission type.
> Stephen, do you think so, or should we merge the effort with your
> things?

Just on general principles, this seems like a pretty horrid idea.
To me replication privilege means the ability to transfer data out of
the master, not to cause arbitrary state changes on the master.
Therefore, "allow replication users to run pg_rewind" seems less like
a feature and more like a security bug.  Am I missing something?

Another problem with it is that access to the filesystem is about halfway
to being superuser anyway, even if it's only read-only access.  It would
certainly let you read things that one would not expect a replication
user to be able to read (like files unrelated to Postgres).

I don't entirely understand why pg_rewind should be invoking any of these
functions to begin with.  Isn't it working directly with the disk data,
with both postmasters shut down?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Memory prefetching while sequentially fetching from SortTuple array, tuplestore

2015-09-02 Thread Peter Geoghegan
On Wed, Sep 2, 2015 at 4:26 PM, Andres Freund  wrote:
> I'm not following? Just write pg_read_prefetch(state->memtuples + 3 +
> readptr->current) and the corresponding version for tuplesort in the
> callsites?

Oh, I see. Maybe I'll do it that way when I pick this up in a little
while. I need to consider the "no tuple proper" (pass-by-value datum
sort) case within tuplesort.c, where stup.tuple is always NULL.
Currently, we discriminate against such SortTuples for various other
reasons (e.g. for memory accounting) by testing if the pointer is set,
so I suppose I copied that pattern here.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow replication roles to use file access functions

2015-09-02 Thread Michael Paquier
On Thu, Sep 3, 2015 at 4:52 AM, Alvaro Herrera  wrote:
> Michael Paquier wrote:
>> Hi all,
>>
>> As of now, file access functions in genfile.c can only be used by
>> superusers. This proposal is to relax those functions so as
>> replication users can use them as well. Here are the functions aimed
>> by this patch:
>> - pg_stat_file
>> - pg_read_binary_file
>> - pg_read_file
>> - pg_ls_dir
>> The main argument for this change is that pg_rewind makes use of those
>> functions, forcing users to use a superuser role when rewinding a
>> node.
>
> Can this piggyback on Stephen Frost's proposed patch for reserved roles et al?

(Adding Stephen in the loop)
I thought that this was rather independent on Stephen's default role
patch because this is not based on a new role permission type.
Stephen, do you think so, or should we merge the effort with your
things?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow replication roles to use file access functions

2015-09-02 Thread Michael Paquier
On Wed, Sep 2, 2015 at 11:32 PM, Fujii Masao wrote:
> Did you confirm that replication user can complete pg_rewind
> after this patch is applied?

Yes.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Memory prefetching while sequentially fetching from SortTuple array, tuplestore

2015-09-02 Thread Andres Freund
On 2015-09-02 16:23:13 -0700, Peter Geoghegan wrote:
> On Wed, Sep 2, 2015 at 4:12 PM, Andres Freund  wrote:
> >> Well, still needs to work for tuplestore, which does not have a SortTuple.
> >
> > Isn't it even more trivial there? It's just an array of void*'s? So
> > prefetch(state->memtuples + 3 + readptr->current)?
> 
> All I meant is that there couldn't be one centralized definition for
> both. I don't mind if you want to bake it into a macro for
> tuplestore.c and tuplesort.c.

I'm not following? Just write pg_read_prefetch(state->memtuples + 3 +
readptr->current) and the corresponding version for tuplesort in the
callsites?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Memory prefetching while sequentially fetching from SortTuple array, tuplestore

2015-09-02 Thread Peter Geoghegan
On Wed, Sep 2, 2015 at 4:12 PM, Andres Freund  wrote:
>> Well, still needs to work for tuplestore, which does not have a SortTuple.
>
> Isn't it even more trivial there? It's just an array of void*'s? So
> prefetch(state->memtuples + 3 + readptr->current)?

All I meant is that there couldn't be one centralized definition for
both. I don't mind if you want to bake it into a macro for
tuplestore.c and tuplesort.c.

>> Because of the way tuples are fetched across translation unit
>> boundaries in the cases addressed by the patch, it isn't hard to see
>> why the compiler does not do this automatically (prefetch instructions
>> added by the compiler are not common anyway, IIRC).
>
> Hardware prefetchers just have gotten to be rather good and obliterated
> most of the cases where it's beneficial.

Well, if hardware prefetchers are bright enough to do this perfectly
nowadays, then that implies a cost for useless prefetch instructions.
It might still be worth it even then, if the cost is very low for
these platforms. It's not as if they're required to respect the
prefetch hint in any way.

> I'd be interested to see a perf stat -ddd comparison to the patch
> with/without prefetches. It'll be interesting to see how the number of
> cache hits/misses and prefetches changes.
>
> Which microarchitecture did you test this on?

My laptop has an Intel Core i7-3520M, which is a mobile processor that
is a bit old. So, Ivy Bridge.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] src/test/ssl broken on HEAD

2015-09-02 Thread Andres Freund
On 2015-09-02 17:03:46 -0400, Robert Haas wrote:
> On Wed, Sep 2, 2015 at 4:50 PM, Andrew Dunstan  wrote:
> > Tell me what's needed and I'll look at creating a buildfarm test module for
> > it.
> 
> You run the tests via: make -C src/test/ssl check
> 
> But nota bene security caveats:
> 
> commit e39250c644ea7cd3904e4e24570db21a209cf97f
> Author: Heikki Linnakangas 
> Date:   Tue Dec 9 17:21:18 2014 +0200
> 
> Add a regression test suite for SSL support.
> 
> It's not run by the global "check" or "installcheck" targets, because the
> temporary installation it creates accepts TCP connections from any user
> on the same host, which is insecure.

We could just implement SSL over unix sockets. Obviously the
connection-encryption aspect isn't actually useful, but e.g. client
certs still make sense.  Besides, it would let us avoid concerns like the
above...

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-02 Thread Tomas Vondra

Hello KaiGai-san,

I've discovered a bug in the proposed patch - when resetting the hash 
table after the first batch, it does this:


memset(hashtable->buckets, 0, sizeof(nbuckets * sizeof(HashJoinTuple)));

The first 'sizeof' is bogus, so this only zeroes the first 8 bytes of 
the array (usually resulting in crashes due to invalid pointers).


I fixed it to

  memset(hashtable->buckets, 0, nbuckets * sizeof(HashJoinTuple));

Fixed patch attached (marked as v2).

kind regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
 src/backend/executor/nodeHash.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 906cb46..63d484c 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -388,7 +388,9 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	MemoryContextSwitchTo(hashtable->batchCxt);
 
 	hashtable->buckets = (HashJoinTuple *)
-		palloc0(nbuckets * sizeof(HashJoinTuple));
+		MemoryContextAllocHuge(hashtable->batchCxt,
+			   nbuckets * sizeof(HashJoinTuple));
+	memset(hashtable->buckets, 0, nbuckets * sizeof(HashJoinTuple));
 
 	/*
 	 * Set up for skew optimization, if possible and there's a need for more
@@ -654,7 +656,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 		hashtable->nbuckets = hashtable->nbuckets_optimal;
 		hashtable->log2_nbuckets = hashtable->log2_nbuckets_optimal;
 
-		hashtable->buckets = repalloc(hashtable->buckets,
+		hashtable->buckets = repalloc_huge(hashtable->buckets,
 sizeof(HashJoinTuple) * hashtable->nbuckets);
 	}
 
@@ -779,7 +781,7 @@ ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
 	 * chunks)
 	 */
 	hashtable->buckets =
-		(HashJoinTuple *) repalloc(hashtable->buckets,
+		(HashJoinTuple *) repalloc_huge(hashtable->buckets,
 hashtable->nbuckets * sizeof(HashJoinTuple));
 
 	memset(hashtable->buckets, 0, sizeof(void *) * hashtable->nbuckets);
@@ -1217,7 +1219,9 @@ ExecHashTableReset(HashJoinTable hashtable)
 
 	/* Reallocate and reinitialize the hash bucket headers. */
 	hashtable->buckets = (HashJoinTuple *)
-		palloc0(nbuckets * sizeof(HashJoinTuple));
+		MemoryContextAllocHuge(hashtable->batchCxt,
+			   nbuckets * sizeof(HashJoinTuple));
+	memset(hashtable->buckets, 0, nbuckets * sizeof(HashJoinTuple));
 
 	hashtable->spaceUsed = 0;
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pg_upgrade remote copy

2015-09-02 Thread Bruce Momjian
On Wed, Sep  2, 2015 at 04:50:32PM -0300, Alvaro Herrera wrote:
> Bruce Momjian wrote:
> > On Fri, Aug 28, 2015 at 10:34:38PM -0700, AI Rumman wrote:
> 
> > > In pg_upgrade, how about adding a feature to copy data directory over 
> > > network.
> > > That is, we can run pg_upgrade from our new host, where old host will be a
> > > remote machine.
> 
> > I think it is much simpler to just copy the old cluster to the remote
> > server and run pg_upgrade in --link mode on the remote server.
> 
> Couldn't it run pg_basebackup as a first step?

Yes.  I was thinking it would be simpler to just shut down the server
and copy it, but pg_basebackup would allow the server to be up during
the copy.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Bruce Momjian
On Wed, Sep  2, 2015 at 12:03:36PM -0700, Josh Berkus wrote:
> Well, there is a WIP patch for that, which IMHO would be much improved
> by having a concrete use-case like this one.  What nobody is working on
> -- and we've vetoed in the past -- is a way of automatically failing and
> removing from replication any node which repeatedly fails to sync, which
> would be a requirement for this model.
> 
> You'd also need a way to let the connection nodes know when a replica
> has fallen behind so that they can be taken out of
> load-balancing/sharding for read queries.  For the synchronous model,
> that would be "fallen behind at all"; for asynchronous it would be
> "fallen more than ### behind".

I think this gets back to the idea of running an administrative alert
command when we switch to using a different server for
synchronous_standby_names.  We can't just keep requiring external
tooling to identify things that the database knows easily and can send
an alert.  Removing failed nodes is also something we should do and
notify users about.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Memory prefetching while sequentially fetching from SortTuple array, tuplestore

2015-09-02 Thread Andres Freund
On 2015-09-02 16:02:00 -0700, Peter Geoghegan wrote:
> On Wed, Sep 2, 2015 at 3:13 PM, Andres Freund  wrote:
> > That's just a question of how to formulate this though?
> >
> > pg_rfetch(((char *) state->memtuples ) + 3 * sizeof(SortTuple) + 
> > offsetof(SortTuple, tuple))?
> >
> > For something heavily platform dependent like this that seems ok.
> 
> Well, still needs to work for tuplestore, which does not have a SortTuple.

Isn't it even more trivial there? It's just an array of void*'s? So
prefetch(state->memtuples + 3 + readptr->current)?

> Because of the way tuples are fetched across translation unit
> boundaries in the cases addressed by the patch, it isn't hard to see
> why the compiler does not do this automatically (prefetch instructions
> added by the compiler are not common anyway, IIRC).

Hardware prefetchers just have gotten to be rather good and obliterated
most of the cases where it's beneficial.

I'd be interested to see a perf stat -ddd comparison to the patch
with/without prefetches. It'll be interesting to see how the number of
cache hits/misses and prefetches changes.

Which microarchitecture did you test this on?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Memory prefetching while sequentially fetching from SortTuple array, tuplestore

2015-09-02 Thread Peter Geoghegan
On Wed, Sep 2, 2015 at 3:13 PM, Andres Freund  wrote:
> I'd be less brief in this case then, no need to be super short here.

Okay.

>> It started out that way, but Tom felt that it was better to have a
>> USE_MEM_PREFETCH because of the branch below...
>
> That doesn't mean we shouldn't still provide an empty definition.

Okay.

> That's just a question of how to formulate this though?
>
> pg_rfetch(((char *) state->memtuples ) + 3 * sizeof(SortTuple) + 
> offsetof(SortTuple, tuple))?
>
> For something heavily platform dependent like this that seems ok.

Well, still needs to work for tuplestore, which does not have a SortTuple.

>> Because that was the fastest value following testing on my laptop. You
>> are absolutely right to point out that this isn't a good reason to go
>> with the patch -- I share your concern. All I can say in defense of
>> that is that other major system software does the same, without any
>> regard for the underlying microarchitecture AFAICT.
>
> I know linux stripped out most prefetches at some point, even from x86
> specific code, because it showed that they aged very badly. I.e. they
> removed a bunch of them and stuff got faster, whereas they were
> beneficial on earlier architectures.

That is true, but IIRC that was specifically in relation to a commonly
used list data structure that had prefetches all over the place. That
was a pretty bad idea.

I think that explicit prefetching has extremely limited uses, too. The
only cases that I can imagine being helped are cases where there is
extremely predictable sequential access, but some pointer indirection.

Because of the way tuples are fetched across translation unit
boundaries in the cases addressed by the patch, it isn't hard to see
why the compiler does not do this automatically (prefetch instructions
added by the compiler are not common anyway, IIRC). The compiler has
no way of knowing that gettuple_common() is ultimately called from an
important inner loop, which could make all the difference, I suppose.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Bruce Momjian
On Wed, Sep  2, 2015 at 02:41:46PM -0400, Robert Haas wrote:
> 4. Therefore, I think that we should instead use logical replication,
> which might be either synchronous or asynchronous.  When you modify
> one copy of the data, that change will then be replicated to all other
> nodes.  If you are OK with eventual consistency, this replication can
> be asynchronous, and nodes that are off-line will catch up when they
> are on-line.  If you are not OK with that, then you must replicate
> synchronously to every node before transaction commit; or at least you
> must replicate synchronously to every node that is currently on-line.
> This presents some challenges: logical decoding currently can't
> replicate transactions that are still in process - replication starts
> when the transaction commits.  Also, we don't have any way for
> synchronous replication to wait for multiple nodes.  But in theory
> those seem like limitations that can be lifted.  Also, the GTM needs
> to be aware that this stuff is happening, or it will DTWT.  That too
> seems like a problem that can be solved.

Can you explain why logical replication is better than binary
replication for this use-case?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Andres Freund
On 2015-09-02 19:49:13 +0100, Greg Stark wrote:
> I can take the blame for this formula.
> 
> It's called the "Coupon Collector Problem". If you hit get a random
> coupon from a set of n possible coupons, how many random coupons would
> you have to collect before you expect to have at least one of each.
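
(For reference, the coupon-collector expectation presumably behind the
current prefetch-distance formula is, with n = effective_io_concurrency:

    E[pages to prefetch] = n * H(n) = n * (1 + 1/2 + ... + 1/n) ≈ n * (ln n + 0.577)

For n = 30 that works out to roughly 120 pages, which is the ballpark of
the "~100 pages" figure mentioned upthread.)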

My point is that that's just the entirely wrong way to model
prefetching. Prefetching can be massively beneficial even if you only
have a single platter! Even if there were no queues on the hardware or
OS level! Concurrency isn't the right way to look at prefetching.

You need to prefetch so far ahead that you'll never block on reading
heap pages - and that's only the case if processing the next N heap
blocks takes longer than the prefetch of the (N+1)th page. That doesn't
mean there continuously have to be N+1 prefetches in progress - in fact
that will often only be the case for the first few; after that
you hopefully are bottlenecked on CPU.

If you additionally take into account hardware realities where you have
multiple platters, multiple spindles, command queueing etc, that's even
more true. A single rotation of a single platter with command queueing
can often read several non-consecutive blocks if they're on a similar track.

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Andres Freund
On 2015-09-02 14:31:35 -0700, Josh Berkus wrote:
> On 09/02/2015 02:25 PM, Tomas Vondra wrote:
> > 
> > As I explained, spindles have very little to do with it - you need
> > multiple I/O requests per device, to get the benefit. Sure, the DBAs
> > should know how many spindles they have and should be able to determine
> > optimal IO depth. But we actually say this in the docs:
> 
> My experience with performance tuning is that values above 3 have no
> real effect on how queries are executed.

I saw pretty much the opposite - the benefits were seldom significant
below 30 or so. Even on single disks. Which actually isn't that
surprising - to be actually beneficial (that is, turn an IO into a CPU
bound workload) the prefetched buffer needs to actually have been read
in by the time it's needed. In many queries processing a single heap page
takes far less time than prefetching the data from storage, even if it's
on good SSDs.

Therefore what you actually need is a queue of prefetches for the next
XX buffers so that between starting a prefetch and actually needing the
buffer enough time has passed that the data is completely read in.  And
the point is that that's the case even for a single rotating disk!
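
In code form the pattern is basically this (hand-waving sketch, loosely
modeled on the bitmap heap scan prefetching; the helper names are made up):

    while ((blkno = next_block_to_process()) != InvalidBlockNumber)
    {
        /* keep the prefetch window prefetch_target blocks ahead */
        while (prefetch_blkno != InvalidBlockNumber &&
               prefetch_blkno < blkno + prefetch_target)
        {
            PrefetchBuffer(rel, MAIN_FORKNUM, prefetch_blkno);
            prefetch_blkno = next_block_to_prefetch();
        }

        /* CPU work on this block overlaps the I/O started above */
        process_block(rel, blkno);
    }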

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improving replay of XLOG_BTREE_VACUUM records

2015-09-02 Thread Andres Freund
On 2015-09-02 22:46:59 +0100, Greg Stark wrote:
> On Sun, Jul 26, 2015 at 1:46 PM, Andres Freund  wrote:
> > To me it sounds like this shouldn't go through the full ReadBuffer()
> > rigamarole. That code is already complex enough, and here it's really
> > not needed. I think it'll be much easier to review - and actually faster
> > in many cases to simply have something like
> >
> > bool
> > BufferInCache(Relation, ForkNumber, BlockNumber)
> > {
> > /* XXX: setup tag, hash, partition */
> >
> > LWLockAcquire(newPartitionLock, LW_SHARED);
> > buf_id = BufTableLookup(&newTag, newHash);
> > LWLockRelease(newPartitionLock);
> > return buf_id != -1;
> > }
> >
> > and then fall back to the normal ReadBuffer() when it's in cache.
> 
> 
> I guess I'm missing something but how is this API useful? As soon as
> it's returned, the result might be invalid since before you actually
> make use of the return value another process could come and read in
> and pin the page in question. Is this part of some interlock where you
> only have to know it was unpinned at some point in time since some
> other event?

Yes. We're talking about this block:
		for (blkno = xlrec->lastBlockVacuumed + 1; blkno < thisblkno; blkno++)
		{
			/*
			 * We use RBM_NORMAL_NO_LOG mode because it's not an error
			 * condition to see all-zero pages.  The original btvacuumpage
			 * scan would have skipped over all-zero pages, noting them in FSM
			 * but not bothering to initialize them just yet; so we mustn't
			 * throw an error here.  (We could skip acquiring the cleanup lock
			 * if PageIsNew, but it's probably not worth the cycles to test.)
			 *
			 * XXX we don't actually need to read the block, we just need to
			 * confirm it is unpinned. If we had a special call into the
			 * buffer manager we could optimise this so that if the block is
			 * not in shared_buffers we confirm it as unpinned.
			 */
			buffer = XLogReadBufferExtended(thisrnode, MAIN_FORKNUM, blkno,
											RBM_NORMAL_NO_LOG);
			if (BufferIsValid(buffer))
			{
				LockBufferForCleanup(buffer);
				UnlockReleaseBuffer(buffer);
			}
		}
	}


Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Memory prefetching while sequentially fetching from SortTuple array, tuplestore

2015-09-02 Thread Andres Freund
On 2015-09-02 12:24:33 -0700, Peter Geoghegan wrote:
> "Read fetch". One argument past to the intrinsic here specifies that
> the variable will be read only. I did things this way because I
> imagined that there would be very limited uses for the macro only. I
> probably cover almost all interesting cases for explicit memory
> prefetching already.

I'd be less brief in this case then, no need to be super short here.

> > I don't think you should specify the read/write and locality parameters
> > if we don't hand-pick them - right now you're saying the memory will
> > only be read and that it has no temporal locality.
> >
> > I think there should be a empty fallback definition even if the current
> > only caller ends up not needing it - not all callers will require it.
> 
> It started out that way, but Tom felt that it was better to have a
> USE_MEM_PREFETCH because of the branch below...

That doesn't mean we shouldn't still provide an empty definition.
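
Something like the following, i.e. keep USE_MEM_PREFETCH for the callers
that really need to branch, but give pg_rfetch() itself a no-op fallback
(a sketch of what I mean, using the names from the patch):

    #if defined(__GNUC__)
    #define USE_MEM_PREFETCH
    /* read-only prefetch hint, no expected temporal locality */
    #define pg_rfetch(addr)		__builtin_prefetch((addr), 0, 0)
    #else
    /* no-op fallback, so most call sites don't need their own #ifdef */
    #define pg_rfetch(addr)		((void) 0)
    #endif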

> > Hm. Why do we need a branch here? The target of prefetches don't need to
> > be valid addresses and adding a bunch of arithmetic and a branch for the
> > corner case doesn't sound worthwhile to me.
> 
> ...which is needed, because I'm still required to not dereference wild
> pointers. In other words, although pg_rfetch()/__builtin_prefetch()
> does not require that a valid pointer be passed, it is not okay to
> read past an array's bounds to read that pointer. The GCC docs are
> clear on this -- "Data prefetch does not generate faults if addr is
> invalid, but the address expression itself must be valid".

That's just a question of how to formulate this though?

pg_rfetch(((char *) state->memtuples ) + 3 * sizeof(SortTuple) + 
offsetof(SortTuple, tuple))?

For something heavily platform dependent like this that seems ok.

> Because that was the fastest value following testing on my laptop. You
> are absolutely right to point out that this isn't a good reason to go
> with the patch -- I share your concern. All I can say in defense of
> that is that other major system software does the same, without any
> regard for the underlying microarchitecture AFAICT.

I know linux stripped out most prefetches at some point, even from x86
specific code, because it showed that they aged very badly. I.e. they
removed a bunch of them and stuff got faster, whereas they were
beneficial on earlier architectures.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Jeff Janes
On Wed, Sep 2, 2015 at 2:31 PM, Josh Berkus  wrote:

> On 09/02/2015 02:25 PM, Tomas Vondra wrote:
> >
> > As I explained, spindles have very little to do with it - you need
> > multiple I/O requests per device, to get the benefit. Sure, the DBAs
> > should know how many spindles they have and should be able to determine
> > optimal IO depth. But we actually say this in the docs:
>
> My experience with performance tuning is that values above 3 have no
> real effect on how queries are executed.
>

Perhaps one reason is that the planner assumes it will get no benefit from
this setting, meaning it is somewhat unlikely to choose the types of plans
which would actually show a benefit from higher values.

Cheers,

Jeff


Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Merlin Moncure
On Wed, Sep 2, 2015 at 4:31 PM, Josh Berkus  wrote:
> On 09/02/2015 02:25 PM, Tomas Vondra wrote:
>>
>> As I explained, spindles have very little to do with it - you need
>> multiple I/O requests per device, to get the benefit. Sure, the DBAs
>> should know how many spindles they have and should be able to determine
>> optimal IO depth. But we actually say this in the docs:
>
> My experience with performance tuning is that values above 3 have no
> real effect on how queries are executed.

That's the exact opposite of my findings on intel S3500 (see:
http://www.postgresql.org/message-id/CAHyXU0yiVvfQAnR9cyH=HWh1WbLRsioe=mzRJTHwtr=2azs...@mail.gmail.com).

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench stats per script & other stuff

2015-09-02 Thread Andres Freund
On 2015-09-02 14:36:51 -0400, Robert Haas wrote:
> On Wed, Sep 2, 2015 at 2:20 PM, Fabien COELHO  wrote:
> >> I'm wondering if percentages instead of weights would be a better
> >> idea. That'd mean you'd be forced to be more careful when adding another
> >> script (having to adjust the percentages of other scripts) but arguably
> >> that's a good thing?
> >
> > If you use only percentages, then you have to check that the total is 100,
> > probably use floats, and do something when the total is not 100; the
> > checking would complicate the code and test people's mental calculus
> > abilities. Not sure this is a good idea:-)
> 
> I agree.  I don't see a reason to enforce that the total of the
> weights must be 100.

I'm slightly worried that using weights will be a bit confusing because
adding another script will obviously reduce the frequency of already
defined scripts. But it's probably not worth worrying about.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] exposing pg_controldata and pg_config as functions

2015-09-02 Thread Alvaro Herrera
Josh Berkus wrote:
> On 09/02/2015 02:34 PM, Alvaro Herrera wrote:

> > I think trying to duplicate the exact strings isn't too nice an
> > interface.
> 
> Well, for pg_controldata, no, but what else would you do for pg_config?

I was primarily looking at pg_controldata, so we agree there.

As for pg_config, I'm confused about its usefulness -- which of these
lines are useful in the SQL interface?  Anyway, I don't see anything
better than a couple of text columns for this case.

BINDIR = /home/alvherre/Code/pgsql/install/master/bin
DOCDIR = /home/alvherre/Code/pgsql/install/master/share/doc
HTMLDIR = /home/alvherre/Code/pgsql/install/master/share/doc
INCLUDEDIR = /home/alvherre/Code/pgsql/install/master/include
PKGINCLUDEDIR = /home/alvherre/Code/pgsql/install/master/include
INCLUDEDIR-SERVER = /home/alvherre/Code/pgsql/install/master/include/server
LIBDIR = /home/alvherre/Code/pgsql/install/master/lib
PKGLIBDIR = /home/alvherre/Code/pgsql/install/master/lib
LOCALEDIR = /home/alvherre/Code/pgsql/install/master/share/locale
MANDIR = /home/alvherre/Code/pgsql/install/master/share/man
SHAREDIR = /home/alvherre/Code/pgsql/install/master/share
SYSCONFDIR = /home/alvherre/Code/pgsql/install/master/etc
PGXS = /home/alvherre/Code/pgsql/install/master/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--enable-debug' '--enable-depend' '--enable-cassert' 
'--enable-nls' '--cache-file=/home/alvherre/tmp/pgconfig.master.cache' 
'--enable-thread-safety' '--with-python' '--with-perl' '--with-tcl' 
'--with-openssl' '--with-libxml' '--with-tclconfig=/usr/lib/tcl8.6' 
'--enable-tap-tests' '--prefix=/pgsql/install/master' '--with-pgport=55432'
CC = gcc
CPPFLAGS = -D_GNU_SOURCE -I/usr/include/libxml2
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith 
-Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute 
-Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g 
-O2
CFLAGS_SL = -fpic
LDFLAGS = -L../../../src/common -Wl,--as-needed 
-Wl,-rpath,'/pgsql/install/master/lib',--enable-new-dtags
LDFLAGS_EX = 
LDFLAGS_SL = 
LIBS = -lpgcommon -lpgport -lxml2 -lssl -lcrypto -lz -lreadline -lrt -lcrypt 
-ldl -lm 
VERSION = PostgreSQL 9.6devel


-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improving replay of XLOG_BTREE_VACUUM records

2015-09-02 Thread Greg Stark
On Sun, Jul 26, 2015 at 1:46 PM, Andres Freund  wrote:
> To me it sounds like this shouldn't go through the full ReadBuffer()
> rigamarole. That code is already complex enough, and here it's really
> not needed. I think it'll be much easier to review - and actually faster
> in many cases to simply have something like
>
> bool
> BufferInCache(Relation, ForkNumber, BlockNumber)
> {
> /* XXX: setup tag, hash, partition */
>
> LWLockAcquire(newPartitionLock, LW_SHARED);
> buf_id = BufTableLookup(&newTag, newHash);
> LWLockRelease(newPartitionLock);
> return buf_id != -1;
> }
>
> and then fall back to the normal ReadBuffer() when it's in cache.


I guess I'm missing something but how is this API useful? As soon as
it's returned, the result might be invalid since before you actually
make use of the return value another process could come and read in
and pin the page in question. Is this part of some interlock where you
only have to know it was unpinned at some point in time since some
other event?

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] exposing pg_controldata and pg_config as functions

2015-09-02 Thread Josh Berkus
On 09/02/2015 02:34 PM, Alvaro Herrera wrote:
> Tom Lane wrote:
>> Robert Haas  writes:
>>> But I'm not sure I like the idea of adding a server dependency on the
>>> ability to exec pg_controldata.  That seems like it could be
>>> unreliable at best, and a security vulnerability at worst.
>>
>> I hadn't been paying attention --- the proposed patch actually depends on
>> exec'ing pg_controldata?  That's horrid!  There is no expectation that
>> that's installed.
> 
> No, it doesn't.  For the pg_controldata output it processes the
> pg_control file directly, and for pg_config it relies on compile-time
> CPPFLAGS.
> 
> I think trying to duplicate the exact strings isn't too nice an
> interface.

Well, for pg_controldata, no, but what else would you do for pg_config?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Josh Berkus
On 09/02/2015 12:30 PM, Robert Haas wrote:
> On Wed, Sep 2, 2015 at 3:03 PM, Josh Berkus  wrote:
>>> 4. Therefore, I think that we should instead use logical replication,
>>> which might be either synchronous or asynchronous.  When you modify
>>> one copy of the data, that change will then be replicated to all other
>>> nodes.  If you are OK with eventual consistency, this replication can
>>> be asynchronous, and nodes that are off-line will catch up when they
>>> are on-line.  If you are not OK with that, then you must replicate
>>> synchronously to every node before transaction commit; or at least you
>>> must replicate synchronously to every node that is currently on-line.
>>> This presents some challenges: logical decoding currently can't
>>> replicate transactions that are still in process - replication starts
>>> when the transaction commits.  Also, we don't have any way for
>>> synchronous replication to wait for multiple nodes.
>>
>> Well, there is a WIP patch for that, which IMHO would be much improved
>> by having a concrete use-case like this one.  What nobody is working on
>> -- and we've vetoed in the past -- is a way of automatically failing and
>> removing from replication any node which repeatedly fails to sync, which
>> would be a requirement for this model.
> 
> Yep.  It's clear to me we need that in general, not just for sharding.
> To me, the key is to make sure there's a way for the cluster-ware to
> know about the state transitions.  Currently, when the synchronous
> standby changes, PostgreSQL doesn't tell anyone.  That's a problem.

There are many parts of our replication which are still effectively
unmonitorable. For example, there's still no way to tell from the
replica that it's lost contact with the master except by tailing the
log.  If we try to build bigger systems on top of these components,
we'll find that we need to add a lot of instrumentation.

> 
>> You'd also need a way to let the connection nodes know when a replica
>> has fallen behind so that they can be taken out of
>> load-balancing/sharding for read queries.  For the synchronous model,
>> that would be "fallen behind at all"; for asynchronous it would be
>> "fallen more than ### behind".
> 
> How is that different from the previous thing?  Just that we'd treat
> "lagging" as "down" beyond some threshold?  That doesn't seem like a
> mandatory feature.

It's a mandatory feature if you want to load-balance reads.  We have to
know which nodes not to send reads to because they are out of sync.

>> Yeah?  I'd assume that a GTM would be antithetical to two-stage copying.
> 
> I don't think so.  If transaction A writes data on X which is
> replicated to Y and then commits, a new snapshot which shows A as
> committed can't be used on Y until A's changes have been replicated
> there.  That could be enforced by having the commit of A wait for
> replication, or by having an attempt by a later transaction to use the
> snapshot on Y wait until replication completes, or some even more
> sophisticated strategy that considers whether the replication backlog
> touches the same data that the new transaction will read.  It's
> complicated, but it doesn't seem intractable.

I need to see this on a chalkboard to understand it.
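
As a rough chalkboard-in-code illustration of the two enforcement options
described in the quoted paragraph (an editor's toy model; the LSN values and
the idea of comparing a snapshot's required commit position against the
replica's replay position are illustrative assumptions, not a worked-out
design):

#include <stdbool.h>
#include <stdio.h>

typedef unsigned long LSN;

/* Toy model: A committed on node X at LSN 100; node Y has replayed to 42. */
static LSN	commit_lsn_of_A = 100;
static LSN	replay_lsn_on_Y = 42;

/*
 * Option 2 from the discussion: a snapshot that shows A as committed may
 * not be used on Y until Y has replayed past A's commit.
 */
static bool
snapshot_usable_on_Y(LSN snapshot_needs)
{
	return replay_lsn_on_Y >= snapshot_needs;
}

int
main(void)
{
	if (!snapshot_usable_on_Y(commit_lsn_of_A))
		printf("wait: Y replayed to %lu but the snapshot needs %lu\n",
			   replay_lsn_on_Y, commit_lsn_of_A);

	/*
	 * Option 1 would instead make A's commit on X block until
	 * replay_lsn_on_Y >= commit_lsn_of_A, so any snapshot taken after the
	 * commit is automatically safe to use on Y.
	 */
	return 0;
}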

>>  I'm not a big fan of a GTM at all, frankly; it makes clusters much
>> harder to set up, and becomes a SPOF.
> 
> I partially agree.  I think it's very important that the GTM is an
> optional feature of whatever we end up with, rather than an
> indispensable component.  People who don't want it shouldn't have to
> pay the price in performance and administrative complexity.  But at
> the same time, I think a lot of people will want it, because without
> it, the fact that sharding is in use is much less transparent to the
> application.

If it can be optional, then we're pretty close to covering most use
cases with one general infrastructure.  That would be nice.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] exposing pg_controldata and pg_config as functions

2015-09-02 Thread Alvaro Herrera
Tom Lane wrote:
> Robert Haas  writes:
> > But I'm not sure I like the idea of adding a server dependency on the
> > ability to exec pg_controldata.  That seems like it could be
> > unreliable at best, and a security vulnerability at worst.
> 
> I hadn't been paying attention --- the proposed patch actually depends on
> exec'ing pg_controldata?  That's horrid!  There is no expectation that
> that's installed.

No, it doesn't.  For the pg_controldata output it processes the
pg_control file directly, and for pg_config it relies on compile-time
CPPFLAGS.

I think trying to duplicate the exact strings isn't too nice an
interface.
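
As a concrete illustration of "processes the pg_control file directly", here
is a minimal editor's sketch of the general approach, not the patch's actual
code; the CRC and version checks are omitted, and it assumes it runs in a
backend whose working directory is the data directory:

#include "postgres.h"

#include <fcntl.h>
#include <unistd.h>

#include "access/xlog_internal.h"	/* XLOG_CONTROL_FILE */
#include "catalog/pg_control.h"		/* ControlFileData */
#include "storage/fd.h"

/* Read $PGDATA/global/pg_control into a caller-supplied struct. */
static void
read_control_file(ControlFileData *control)
{
	int			fd;

	fd = OpenTransientFile(XLOG_CONTROL_FILE, O_RDONLY | PG_BINARY, 0);
	if (fd < 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not open \"%s\": %m", XLOG_CONTROL_FILE)));

	if (read(fd, control, sizeof(ControlFileData)) != sizeof(ControlFileData))
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not read \"%s\": %m", XLOG_CONTROL_FILE)));

	CloseTransientFile(fd);

	/* A real implementation would also verify control->crc here. */
}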

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Josh Berkus
On 09/02/2015 02:25 PM, Tomas Vondra wrote:
> 
> As I explained, spindles have very little to do with it - you need
> multiple I/O requests per device, to get the benefit. Sure, the DBAs
> should know how many spindles they have and should be able to determine
> optimal IO depth. But we actually say this in the docs:

My experience with performance tuning is that values above 3 have no
real effect on how queries are executed.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] exposing pg_controldata and pg_config as functions

2015-09-02 Thread Joe Conway
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 09/02/2015 05:25 PM, Tom Lane wrote:
> Robert Haas  writes:
>> But I'm not sure I like the idea of adding a server dependency on
>> the ability to exec pg_controldata.  That seems like it could be 
>> unreliable at best, and a security vulnerability at worst.
> 
> I hadn't been paying attention --- the proposed patch actually
> depends on exec'ing pg_controldata?  That's horrid!  There is no
> expectation that that's installed.

No it doesn't. I'm confused :-/

Joe


- -- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.22 (GNU/Linux)

iQIcBAEBAgAGBQJV52qoAAoJEDfy90M199hltvwP/3MfUQfPPglhBuY1V3CTzHu9
kTw5tNuGI/244Yc11wLtV07W+3QWXzCNY/fL1JCW5ns51mTfZKfkskNWD0O9fIex
gvK4p/3Z+y344qsdDlzbzw0A/PU05UCq1UlgXCF6nyQJW6cZfaCckbEpZWVK/uV7
aYokIqnIiilWaPu224b6jBOukK7oizEjXevdFBafLbetLJHMx+9k8LMbpPdieAm/
RSk17+N77WQ2zTFHcdz8U1MYAbaokmv155s1vUFgrqOUJGc0r6K+vImKgxOjbbmg
pv2jf7vvUwwjUy7f2iPhWJAKfGCV1m9ovWaXsMYqcF55JwSzvP8B2htUtM4Lr1qF
SsWO7e36bLoH++yAGfKp7oZIhA9r6SR6cwEoCvso3immZ2zhOzbRcw4tI4pE9fhB
P/mEbKFF5BsGHjeslB8RrMQG68DxEwPkaafH4mc1QjKiXNfWPH9ci+pgfSLVphJq
gn+ZuPrReIFhQKyMchcvZVVWJd9Dt02D2lsIzUfBWGXwOTLFVikD6BC6siy5KWmy
xuEGLEfts9E7gPD3qPXxNuY7TCvb+L7R+1C9/M5diiV7rerMUocH/RqrPP6nXHTc
BdfJhzOfU+H+Kt0nbdE8Vjw3BOKT6nqT0kc+le+F/Q1h2XLB63KhaOkFzVW73Rfd
JRRqkyks+eVgEn2I4OKm
=OAms
-END PGP SIGNATURE-


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] exposing pg_controldata and pg_config as functions

2015-09-02 Thread Tom Lane
Robert Haas  writes:
> But I'm not sure I like the idea of adding a server dependency on the
> ability to exec pg_controldata.  That seems like it could be
> unreliable at best, and a security vulnerability at worst.

I hadn't been paying attention --- the proposed patch actually depends on
exec'ing pg_controldata?  That's horrid!  There is no expectation that
that's installed.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Tomas Vondra

Hi,

On 09/02/2015 08:49 PM, Greg Stark wrote:

On 2 Sep 2015 14:54, "Andres Freund"  wrote:




+ /*--
+  * The user-visible GUC parameter is the number of drives (spindles),
+  * which we need to translate to a number-of-pages-to-prefetch target.
+  * The target value is stashed in *extra and then assigned to the actual
+  * variable by assign_effective_io_concurrency.
+  *
+  * The expected number of prefetch pages needed to keep N drives busy is:
+  *
+  * drives |   I/O requests
+  * -------+----------------
+  *  1 |   1
+  *  2 |   2/1 + 2/2 = 3
+  *  3 |   3/1 + 3/2 + 3/3 = 5 1/2
+  *  4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
+  *  n |   n * H(n)


I know you just moved this code. But: I don't buy this formula. Like at
all. Doesn't queuing and reordering entirely invalidate the logic here?


I can take the blame for this formula.

It's called the "Coupon Collector Problem". If you hit get a random
coupon from a set of n possible coupons, how many random coupons would
you have to collect before you expect to have at least one of each.

This computation model assumes we have no information about which
spindle each block will hit. That's basically true for most bitmap
heap scans, because the point of a bitmap heap scan is to pick a
sparse set of blocks, and there's no reason the blocks being read will
have any regularity that causes them all to fall on the same spindles.
If in fact you're reading a fairly dense set, then a bitmap heap scan
is probably a waste of time and simply reading sequentially would be
exactly as fast or even faster.


There are different meanings of "busy". If I get the coupon collector 
problem right (after quickly skimming the wikipedia article today), it 
effectively makes sure that each "spindle" has at least 1 request in the 
queue. Which sucks in practice, because on spinning rust it makes 
queuing (TCQ/NCQ) totally inefficient, and on SSDs it only saturates one 
of the multiple channels.


On spinning drives, it's usually good to keep the iodepth >= 4. For
example, this 10k Seagate drive [1] can do ~450 random IOPS with
iodepth=16, while at iodepth=1 a 10k drive can only be expected to do
~150 IOPS. The other SAS drives behave quite similarly.


[1] 
http://www.storagereview.com/seagate_enterprise_performance_10k_hdd_savvio_10k6_review


On SSDs the good values usually start at 16, depending on the model (and 
controller), and size (large SSDs are basically multiple small ones 
glued together, thus have more channels).


This is why the numbers from coupon collector are way too low in many 
cases. (OTOH this is done per backend, so if there are multiple backends 
doing prefetching ...)




We talked about this quite a bit back then and there was no dispute
that the aim is to provide GUCs that mean something meaningful to the
DBA who can actually measure them. They know how many spindles they
have. They do not know what the optimal prefetch depth is and the only
way to determine it would be to experiment with Postgres. Worse, I


As I explained, spindles have very little to do with it - you need 
multiple I/O requests per device, to get the benefit. Sure, the DBAs 
should know how many spindles they have and should be able to determine 
optimal IO depth. But we actually say this in the docs:


 A good starting point for this setting is the number of separate
 drives comprising a RAID 0 stripe or RAID 1 mirror being used for
 the database. (For RAID 5 the parity drive should not be counted.)
 However, if the database is often busy with multiple queries
 issued in concurrent sessions, lower values may be sufficient to
 keep the disk array busy. A value higher than needed to keep the
 disks busy will only result in extra CPU overhead.

So we recommend number of drives as a good starting value, and then warn 
against increasing the value further.


Moreover, ISTM it's very unclear what value to use even if you know the 
number of devices and optimal iodepth. Setting (devices * iodepth) 
doesn't really make much sense, because that effectively computes


(devices*iodepth) * H(devices*iodepth)

which says "there are (devices*iodepth) devices, make sure there's at 
least one request for each of them", right? I guess we actually want


(devices*iodepth) * H(devices)

Sadly that means we'd have to introduce another GUC, because we need 
track both ndevices and iodepth.


There probably is a value X so that

 X * H(X) ~= (devices*iodepth) * H(devices)

but it's far from clear that's what we need (it surely is not in the docs).
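
To put numbers on that comparison, a small self-contained sketch (editor's
illustration in plain C; prefetch_pages_for() below is just the n * H(n)
formula from the quoted comment, and the devices/iodepth values are arbitrary
examples):

#include <stdio.h>

/* n * H(n): expected prefetch distance to keep n "spindles" busy,
 * per the coupon-collector argument discussed above. */
static double
prefetch_pages_for(int n)
{
	double		pages = 0.0;
	int			i;

	for (i = 1; i <= n; i++)
		pages += (double) n / i;
	return pages;
}

int
main(void)
{
	int			devices = 4;
	int			iodepth = 16;

	/* Setting the GUC to devices*iodepth yields (devices*iodepth) * H(devices*iodepth)... */
	printf("GUC = %d -> %.1f prefetch pages\n",
		   devices * iodepth, prefetch_pages_for(devices * iodepth));

	/* ...which is much larger than the (devices*iodepth) * H(devices) discussed above. */
	printf("(devices*iodepth) * H(devices) -> %.1f prefetch pages\n",
		   iodepth * prefetch_pages_for(devices));
	return 0;
}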



think the above formula works for essentially random I/O but for
more predictable I/O it might be possible to use a different formula.
But if we made the GUC something low level like "how many blocks to
prefetch" then we're left in the dark about how to handle that
different access patte

Re: [HACKERS] Improving replay of XLOG_BTREE_VACUUM records

2015-09-02 Thread Vladimir Borodin
On 25 Aug 2015, at 16:03, Michael Paquier wrote:

On Sun, Jul 26, 2015 at 9:46 PM, Andres Freund wrote:

On 2015-07-24 09:53:49 +0300, Heikki Linnakangas wrote:

To me it sounds like this shouldn't go through the full ReadBuffer()
rigamarole. That code is already complex enough, and here it's really
not needed. I think it'll be much easier to review - and actually faster
in many cases to simply have something like

bool
BufferInCache(Relation, ForkNumber, BlockNumber)
{
    /* XXX: setup tag, hash, partition */

    LWLockAcquire(newPartitionLock, LW_SHARED);
    buf_id = BufTableLookup(&newTag, newHash);
    LWLockRelease(newPartitionLock);
    return buf_id != -1;
}

and then fall back to the normal ReadBuffer() when it's in cache.

Yep, sounds good. Patch with implementation attached.

Patch marked as returned with feedback, as input from the author has
been awaited for some time now.

Sorry for the delay, I will link it to the current commitfest.

btree_vacuum_v3.patch
Description: Binary data
--
Michael

--
May the force be with you…
https://simply.name




Re: [HACKERS] exposing pg_controldata and pg_config as functions

2015-09-02 Thread Robert Haas
On Mon, Aug 31, 2015 at 9:47 PM, Peter Eisentraut  wrote:
> On 8/25/15 11:32 PM, Joe Conway wrote:
>> 1.) pg_controldata() function and pg_controldata view added
>
> I don't think dumping out whatever pg_controldata happens to print as a
> bunch of text fields is very sophisticated.  We have functionality to
> compute with WAL positions, for example, and they won't be of much use
> if this is going to be all text.
>
> Also, the GUC settings tracked in pg_control can already be viewed using
> normal mechanisms, so we don't need a second way to see them.
>
> The fact that some of this is stored in pg_control and some is not is
> really an implementation detail.  We should be thinking of ways to
> expose specific useful information in useful ways, not just dump out
> everything we can find.  Ultimately, I think we would like to move away
> from people parsing textual pg_controldata output, but this proposal is
> not moving very far in that direction.

The nice thing about dumping the information as text is that you can
return every value in the same universal format: text.  There's a lot
to like about that.

But I'm not sure I like the idea of adding a server dependency on the
ability to exec pg_controldata.  That seems like it could be
unreliable at best, and a security vulnerability at worst.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] src/test/ssl broken on HEAD

2015-09-02 Thread Alvaro Herrera
Andrew Dunstan wrote:
> 
> 
> On 09/02/2015 04:22 PM, Robert Haas wrote:

> >For so long as this test suite is not run by 'make check-world' or by
> >the buildfarm, it's likely to keep getting broken, and we're likely to
> >keep not noticing.  I realize that the decision to exclude this from
> >'make check-world' was deliberately made for security reasons, but
> >that doesn't take anything away from the maintenance headache.

> Tell me what's needed and I'll look at creating a buildfarm test module for
> it.

AFAICS it just requires running "make check" in src/test/ssl.  Maybe
this could be part of a generic module that runs "make check" in
specified subdirectories; see
http://www.postgresql.org/message-id/20150813181107.GC5232@alvherre.pgsql
for previous discussion about that.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] src/test/ssl broken on HEAD

2015-09-02 Thread Robert Haas
On Wed, Sep 2, 2015 at 4:50 PM, Andrew Dunstan  wrote:
> Tell me what's needed and I'll look at creating a buildfarm test module for
> it.

You run the tests via: make -C src/test/ssl check

But nota bene security caveats:

commit e39250c644ea7cd3904e4e24570db21a209cf97f
Author: Heikki Linnakangas 
Date:   Tue Dec 9 17:21:18 2014 +0200

Add a regression test suite for SSL support.

It's not run by the global "check" or "installcheck" targets, because the
temporary installation it creates accepts TCP connections from any user
on the same host, which is insecure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] psql tabcomplete - minor bugfix - tabcomplete for SET ROLE TO xxx

2015-09-02 Thread Pavel Stehule
Hi

2015-09-02 15:23 GMT+02:00 Andres Freund :

> Hi,
>
> On 2015-07-08 14:50:37 +0200, Pavel Stehule wrote:
> > - static const char *const my_list[] =
> > - {"DEFAULT", NULL};
> > + /* fallback for GUC settings */
> >
> > - COMPLETE_WITH_LIST(my_list);
> > + char *vartype = get_vartype(prev2_wd);
> > +
> > + if (strcmp(vartype, "enum") == 0)
> > + {
> > + char querybuf[1024];
> > +
> > + snprintf(querybuf, 1024, Query_for_enum,
> prev2_wd);
> > + COMPLETE_WITH_QUERY(querybuf);
> > + }
>
> Won't that mean that enum variables don't complete to default anymore?
>

No, it still does; DEFAULT is included:

#define Query_for_enum \
" SELECT name FROM ( "\
"   SELECT unnest(enumvals) AS name "\
"FROM pg_catalog.pg_settings "\
"   WHERE pg_catalog.lower(name)=pg_catalog.lower('%s') "\
"   UNION SELECT 'DEFAULT' ) ss "\

"  WHERE pg_catalog.substring(name,1,%%d)='%%s'"



>
> > +static char *
> > +get_vartype(const char *varname)
> > +{
> > + PQExpBufferData query_buffer;
> > + char*e_varname;
> > + PGresult *result;
> > + int string_length;
> > + static char resbuf[10];
> > +
> > + initPQExpBuffer(&query_buffer);
> > +
> > + string_length = strlen(varname);
> > + e_varname = pg_malloc(string_length * 2 + 1);
> > + PQescapeString(e_varname, varname, string_length);
>
> Independent of this patch, we really shouldn't do this in several places
> :(
>

fixed

>
> > + appendPQExpBuffer(&query_buffer,
> > + "SELECT vartype FROM pg_settings WHERE
> pg_catalog.lower(name) = pg_catalog.lower('%s')",
> > +  e_varname);
>
> Missing pg_catalog for pg_settings.
>

fixed

>
> Greetings,
>
> Andres Freund
>

I am sending new version

Regards

Pavel
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
new file mode 100644
index 816deda..61216e1
*** a/src/bin/psql/tab-complete.c
--- b/src/bin/psql/tab-complete.c
*** static const SchemaQuery Query_for_list_
*** 757,762 
--- 757,770 
  "   (SELECT polrelid FROM pg_catalog.pg_policy "\
  " WHERE pg_catalog.quote_ident(polname)='%s')"
  
+ #define Query_for_enum \
+ " SELECT name FROM ( "\
+ "   SELECT unnest(enumvals) AS name "\
+ "FROM pg_catalog.pg_settings "\
+ "   WHERE pg_catalog.lower(name)=pg_catalog.lower('%s') "\
+ "   UNION SELECT 'DEFAULT' ) ss "\
+ "  WHERE pg_catalog.substring(name,1,%%d)='%%s'"
+ 
  /*
   * This is a list of all "things" in Pgsql, which can show up after CREATE or
   * DROP; and there is also a query to get a list of them.
*** static PGresult *exec_query(const char *
*** 847,852 
--- 855,863 
  
  static void get_previous_words(int point, char **previous_words, int nwords);
  
+ static char *get_vartype(const char *varname);
+ static char *escape_string(const char *text);
+ 
  #ifdef NOT_USED
  static char *quote_file_name(char *text, int match_type, char *quote_pointer);
  static char *dequote_file_name(char *text, char quote_char);
*** psql_completion(const char *text, int st
*** 3604,3623 
  
  			COMPLETE_WITH_LIST(my_list);
  		}
- 		else if (pg_strcasecmp(prev2_wd, "IntervalStyle") == 0)
- 		{
- 			static const char *const my_list[] =
- 			{"postgres", "postgres_verbose", "sql_standard", "iso_8601", NULL};
- 
- 			COMPLETE_WITH_LIST(my_list);
- 		}
- 		else if (pg_strcasecmp(prev2_wd, "GEQO") == 0)
- 		{
- 			static const char *const my_list[] =
- 			{"ON", "OFF", "DEFAULT", NULL};
- 
- 			COMPLETE_WITH_LIST(my_list);
- 		}
  		else if (pg_strcasecmp(prev2_wd, "search_path") == 0)
  		{
  			COMPLETE_WITH_QUERY(Query_for_list_of_schemas
--- 3615,3620 
*** psql_completion(const char *text, int st
*** 3627,3636 
  		}
  		else
  		{
! 			static const char *const my_list[] =
! 			{"DEFAULT", NULL};
  
! 			COMPLETE_WITH_LIST(my_list);
  		}
  	}
  
--- 3624,3654 
  		}
  		else
  		{
! 			/* fallback for GUC settings */
  
! 			char *vartype = get_vartype(prev2_wd);
! 
! 			if (strcmp(vartype, "enum") == 0)
! 			{
! char querybuf[1024];
! 
! snprintf(querybuf, 1024, Query_for_enum, prev2_wd);
! COMPLETE_WITH_QUERY(querybuf);
! 			}
! 			else if (strcmp(vartype, "bool") == 0)
! 			{
! static const char *const my_list[] =
! {"ON", "OFF", "DEFAULT", NULL};
! 
! COMPLETE_WITH_LIST(my_list);
! 			}
! 			else
! 			{
! static const char *const my_list[] =
! {"DEFAULT", NULL};
! 
! COMPLETE_WITH_LIST(my_list);
! 			}
  		}
  	}
  
*** _complete_from_query(int is_schema_query
*** 4166,4195 
  		result = NULL;
  
  		/* Set up suitably-escaped copies of textual inputs */
! 		e_text = pg_malloc(string_length * 2 + 1);
! 		PQescapeString(e_text, text, string_length);
  
  		if (completion_

Re: [HACKERS] src/test/ssl broken on HEAD

2015-09-02 Thread Andrew Dunstan



On 09/02/2015 04:22 PM, Robert Haas wrote:

On Wed, Aug 26, 2015 at 3:37 AM, Michael Paquier
 wrote:

On Wed, Aug 26, 2015 at 4:35 PM, Michael Paquier
 wrote:

Only HEAD is impacted, and attached is a patch to fix the problem.

Actually this version is better, I forgot to update a comment.

For so long as this test suite is not run by 'make check-world' or by
the buildfarm, it's likely to keep getting broken, and we're likely to
keep not noticing.  I realize that the decision to exclude this from
'make check-world' was deliberately made for security reasons, but
that doesn't take anything away from the maintenance headache.

Still, that's not a reason not commit this, so done.




Tell me what's needed and I'll look at creating a buildfarm test module 
for it.


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ON CONFLICT DO UPDATE using EXCLUDED.column gives an error about mismatched types

2015-09-02 Thread Peter Geoghegan
On Wed, Sep 2, 2015 at 1:18 AM, Amit Langote
 wrote:
> Did you get around to making a patch for this?

I've worked on it inconsistently. I'll pick this up again soon. I may
take the opportunity to talk this over with Andres in person when we
meet at Postgres Open shortly.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] src/test/ssl broken on HEAD

2015-09-02 Thread Robert Haas
On Wed, Aug 26, 2015 at 3:37 AM, Michael Paquier
 wrote:
> On Wed, Aug 26, 2015 at 4:35 PM, Michael Paquier
>  wrote:
>> Only HEAD is impacted, and attached is a patch to fix the problem.
>
> Actually this version is better, I forgot to update a comment.

For so long as this test suite is not run by 'make check-world' or by
the buildfarm, it's likely to keep getting broken, and we're likely to
keep not noticing.  I realize that the decision to exclude this from
'make check-world' was deliberately made for security reasons, but
that doesn't take anything away from the maintenance headache.

Still, that's not a reason not commit this, so done.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Julien Rouhaud
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 02/09/2015 15:53, Andres Freund wrote:
> On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote:
> 
> You also didn't touch
>
> /*
>  * How many buffers PrefetchBuffer callers should try to stay ahead of
>  * their ReadBuffer calls by.  This is maintained by the assign hook for
>  * effective_io_concurrency.  Zero means "never prefetch".
>  */
> int		target_prefetch_pages = 0;
>
> which surely doesn't make sense anymore after these changes.
> 
> But do we even need that variable now?
> 

I thought this was related to the effective_io_concurrency GUC
(possibly overloaded by the per-tablespace setting), so I didn't make
any change on that.

I also just found an issue with my previous patch, the global
effective_io_concurrency GUC was ignored if the tablespace had a
specific seq_page_cost or random_page_cost setting, I just fixed that
in my local branch.


>> diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
>> index dc167f9..57008fc 100644
>> --- a/src/include/utils/guc.h
>> +++ b/src/include/utils/guc.h
>> @@ -26,6 +26,9 @@
>>  #define MAX_KILOBYTES	(INT_MAX / 1024)
>>  #endif
>>
>> +/* upper limit for effective_io_concurrency */
>> +#define MAX_IO_CONCURRENCY 1000
>> +
>>  /*
>>   * Automatic configuration file name for ALTER SYSTEM.
>>   * This file will be used to store values of configuration parameters
>> @@ -256,6 +259,8 @@ extern int	temp_file_limit;
>>
>>  extern int	num_temp_buffers;
>>
>> +extern int	effective_io_concurrency;
>> +
> 
> target_prefetch_pages is declared in bufmgr.h - that seems like a
> better place for these.
> 

I was rather sceptical about that too. I'll move these in bufmgr.h.


Regards.


- -- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.17 (GNU/Linux)

iQEcBAEBAgAGBQJV51pcAAoJELGaJ8vfEpOqIV0H/Rj1e/DtJS60X2mReWDyfooD
G3j0Ptblhy+brYIIxo9Bdp9hVeYFmEqlOJIht9T/3gjfkg5IMz+5bV2waEbAan/m
9uedR/RmS9sz2YpwGgpd21bfSt2LwB+UC448t3rq8KtuzwmXgSVVEflmDmN1qV3z
PseUFzS74HeIJWfxLRLGsJ5amN0hJ8bdolIfxdFR0FyFDv0tRv1DzppdMebVJmHs
uIdJOU49sIDHjcnsUcq67jkP+IfTUon+nnwvk5FYVVKdBX2ka1Q/1VAvTfmWo0oV
WZSlIjQdMUnlTX91zke0NdmsTnagIeRy1oISn/K1v+YmSqnsPqPAcZ6FFQhUMqI=
=4ofZ
-END PGP SIGNATURE-


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposing COPY .. WITH PERMISSIVE

2015-09-02 Thread Stefan Kaltenbrunner
On 09/02/2015 10:10 PM, dinesh kumar wrote:
> On Tue, Sep 1, 2015 at 10:58 PM, Stefan Kaltenbrunner
> mailto:ste...@kaltenbrunner.cc>> wrote:
> 
> On 07/25/2015 03:38 AM, dinesh kumar wrote:
> >
> >
> > On Fri, Jul 24, 2015 at 10:22 AM, Robert Haas  
> > >> wrote:
> >
> > On Thu, Jul 23, 2015 at 8:15 PM, dinesh kumar
> > mailto:dineshkuma...@gmail.com>
> >>
> wrote:
> > > On Thu, Jul 23, 2015 at 9:21 AM, Robert Haas
> > mailto:robertmh...@gmail.com>
> >> wrote:
> > >>
> > >> On Thu, Jul 23, 2015 at 12:19 PM, dinesh kumar
> > mailto:dineshkuma...@gmail.com>
> >>
> > >> wrote:
> > >> > Sorry for my  unclear description about the proposal.
> > >> >
> > >> > "WITH PERMISSIVE" is equal to our existing behavior. That
> is, chmod=644
> > >> > on
> > >> > the created files.
> > >> >
> > >> > If User don't specify "PERMISSIVE" as an option, then the
> chmod=600 on
> > >> > created files. In this way, we can restrict the other
> users from reading
> > >> > these files.
> > >>
> > >> There might be some benefit in allowing the user to choose the
> > >> permissions, but (1) I doubt we want to change the default
> behavior
> > >> and (2) providing only two options doesn't seem flexible
> enough.
> > >>
> > >
> > > Thanks for your inputs Robert.
> > >
> > > 1) IMO, we will keep the exiting behavior as it is.
> > >
> > > 2) As the actual proposal talks about the permissions of
> group/others. So,
> > > we can add few options as below to the WITH clause
> > >
> > > COPY
> > > ..
> > > ..
> > > WITH
> > > [
> > > NO
> > > (READ,WRITE)
> > > PERMISSION TO
> > > (GROUP,OTHERS)
> > > ]
> >
> > If we're going to do anything here, it should use COPY's
> > extensible-options syntax, I think.
> >
> >
> > Thanks Robert. Let me send a patch for this.
> 
> 
> how are you going to handle windows or unix ACLs here?
> Its permission model is quite different and more powerful than (non-acl
> based) unix in general, handling this in a flexible way might soon get
> very complicated and complex for limited gain...
> 
> 
> Hi Stefan,
> 
> I had the same questions too. But, I believe, our initdb works in these
> cases, after creating the data cluster. Isn't ?

Maybe - but having a fixed "default" is very different from baking a
classic unix permission concept of user/group/others into actual
DDL or into a COPY option. The proposed syntax might make some sense to
an admin used to a unix-style system, but it is likely utterly
incomprehensible to somebody who is used to (windows-style) ACLs.

I don't have a good answer on what else to do atm, but I don't think we
should embed traditional/historical unix permission models in our
grammar unless really, really needed...


Stefan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposing COPY .. WITH PERMISSIVE

2015-09-02 Thread dinesh kumar
On Tue, Sep 1, 2015 at 10:58 PM, Stefan Kaltenbrunner <
ste...@kaltenbrunner.cc> wrote:

> On 07/25/2015 03:38 AM, dinesh kumar wrote:
> >
> >
> > On Fri, Jul 24, 2015 at 10:22 AM, Robert Haas  > > wrote:
> >
> > On Thu, Jul 23, 2015 at 8:15 PM, dinesh kumar
> > mailto:dineshkuma...@gmail.com>> wrote:
> > > On Thu, Jul 23, 2015 at 9:21 AM, Robert Haas
> > mailto:robertmh...@gmail.com>> wrote:
> > >>
> > >> On Thu, Jul 23, 2015 at 12:19 PM, dinesh kumar
> > mailto:dineshkuma...@gmail.com>>
> > >> wrote:
> > >> > Sorry for my  unclear description about the proposal.
> > >> >
> > >> > "WITH PERMISSIVE" is equal to our existing behavior. That is,
> chmod=644
> > >> > on
> > >> > the created files.
> > >> >
> > >> > If User don't specify "PERMISSIVE" as an option, then the
> chmod=600 on
> > >> > created files. In this way, we can restrict the other users
> from reading
> > >> > these files.
> > >>
> > >> There might be some benefit in allowing the user to choose the
> > >> permissions, but (1) I doubt we want to change the default
> behavior
> > >> and (2) providing only two options doesn't seem flexible enough.
> > >>
> > >
> > > Thanks for your inputs Robert.
> > >
> > > 1) IMO, we will keep the exiting behavior as it is.
> > >
> > > 2) As the actual proposal talks about the permissions of
> group/others. So,
> > > we can add few options as below to the WITH clause
> > >
> > > COPY
> > > ..
> > > ..
> > > WITH
> > > [
> > > NO
> > > (READ,WRITE)
> > > PERMISSION TO
> > > (GROUP,OTHERS)
> > > ]
> >
> > If we're going to do anything here, it should use COPY's
> > extensible-options syntax, I think.
> >
> >
> > Thanks Robert. Let me send a patch for this.
>
>
> how are you going to handle windows or unix ACLs here?
> Its permission model is quite different and more powerful than (non-acl
> based) unix in general, handling this in a flexible way might soon get
> very complicated and complex for limited gain...
>
>
Hi Stefan,

I had the same questions too. But I believe our initdb handles these
cases after creating the data cluster, doesn't it?

Regards,
Dinesh
manojadinesh.blogspot.com

>
>
> Stefan
>


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Tomas Vondra



On 09/02/2015 08:27 PM, Robert Haas wrote:

On Wed, Sep 2, 2015 at 1:59 PM, Merlin Moncure 
wrote:


This strikes me as a bit of a conflict of interest with FDW which
seems to want to hide the fact that it's foreign; the FDW
implementation makes its own optimization decisions which might
make sense for single table queries but breaks down in the face of
joins.


+1 to these concerns


Well, I don't think that ALL of the logic should go into the FDW.


Then maybe we shouldn't call this "FDW-based sharding" (or "FDW 
approach" or whatever was used in this thread so far) because that kinda 
implies that the proposal is to build on FDW.


In my mind, FDW is a wonderful tool to integrate PostgreSQL with 
external data sources, and it's nicely shaped for this purpose, which 
implies the abstractions and assumptions in the code.


The truth however is that many current uses of the FDW API are actually 
using it for different purposes because there's no other way to do that, 
not because FDWs are the "right way". And this includes the attempts to 
build sharding on FDW, I think.


Situations like this result in "improvements" of the API that seem to 
improve the API for the second group, but make the life harder for the 
original FDW API audience by making the API needlessly complex. And I 
say "seem to improve" because the second group eventually runs into the 
fundamental abstractions and assumptions the API is based on anyway.


And based on the discussions at pgcon, I think this is the main reason 
why people cringe when they hear "FDW" and "sharding" in the same sentence.


I'm not opposed to reusing the FDW infrastructure, of course.

> In particular, in this example, parallel aggregate needs the same
> query rewrite, so the logic for that should live in core so that
> both parallel and distributed queries get the benefit.

I'm not sure the parallel query is a great example here - maybe I'm 
wrong but I think it's a fairly isolated piece of code, and we have 
pretty clear idea of the two use cases.


I'm sure it's non-trivial to design it well for both cases, but I think 
the questions for FWD/sharding will be much more about abstract concepts 
than particular technical solutions.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE

2015-09-02 Thread Jim Nasby

On 9/2/15 2:17 PM, Alvaro Herrera wrote:

Michael Paquier wrote:


I haven't written yet a test case but I think that we could reproduce
that simply by having a relation referenced in the exception block of
a first function, calling a second function that itself raises an
exception, causing the referencing error.


Hm, so function 2 is called in the exception block of function 1?

Are you going to produce the test case for this?  Jim?


I had attempted one but the issue is that it's more than just an 
exception block thing. If it was that simple then we'd still get the 
crash without involving pgTap. So I suspect you need to have a named 
cursor in the mix as well.


Let me make another attempt at something simpler.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow replication roles to use file access functions

2015-09-02 Thread Alvaro Herrera
Michael Paquier wrote:
> Hi all,
> 
> As of now, file access functions in genfile.c can only be used by
> superusers. This proposal is to relax those functions so as
> replication users can use them as well. Here are the functions aimed
> by this patch:
> - pg_stat_file
> - pg_read_binary_file
> - pg_read_file
> - pg_ls_dir
> The main argument for this change is that pg_rewind makes use of those
> functions, forcing users to use a superuser role when rewinding a
> node.

Can this piggyback on Stephen Frost's proposed patch for reserved roles et al?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pg_upgrade remote copy

2015-09-02 Thread Alvaro Herrera
Bruce Momjian wrote:
> On Fri, Aug 28, 2015 at 10:34:38PM -0700, AI Rumman wrote:

> > In pg_upgrade, how about adding a feature to copy data directory over 
> > network.
> > That is, we can run pg_upgrade from our new host, where old host will be a
> > remote machine.

> I think it is much simpler to just copy the old cluster to the remote
> server and run pg_upgrade in --link mode on the remote server.

Couldn't it run pg_basebackup as a first step?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] SQL function to report log message

2015-09-02 Thread dinesh kumar
On Mon, Aug 31, 2015 at 10:08 PM, Pavel Stehule 
wrote:

>
>
> 2015-09-01 6:59 GMT+02:00 Pavel Stehule :
>
>>
>>
>> 2015-08-31 20:43 GMT+02:00 dinesh kumar :
>>
>>> Hi,
>>>
>>> On Sat, Aug 29, 2015 at 4:22 PM, Pavel Stehule 
>>> wrote:
>>>
 Hi

 I am starting to work review of this patch

 2015-07-13 9:54 GMT+02:00 dinesh kumar :

> Hi All,
>
> Greetings for the day.
>
> Would like to discuss on below feature here.
>
> Feature:
> Having an SQL function, to write messages to log destination.
>
> Justification:
> As of now, we don't have an SQL function to write
> custom/application messages to log destination. We have "RAISE" clause
> which is controlled by
> log_ parameters. If we have an SQL function which works irrespective
> of log settings, that would be a good for many log parsers. What i mean 
> is,
> in DBA point of view, if we route all our native OS stats to log files in 
> a
> proper format, then we can have our log reporting tools to give most
> effective reports. Also, Applications can log their own messages to
> postgres log files, which can be monitored by DBAs too.
>
> Implementation:
> Implemented a new function "pg_report_log" which takes one
> argument as text, and returns void. I took, "LOG" prefix for all the
> reporting messages.I wasn't sure to go with new prefix for this, since
> these are normal LOG messages. Let me know, if i am wrong here.
>
> Here is the attached patch.
>

 This patch is not complex, but the implementation doesn't cover a
 "ereport" well.

 Although this functionality should be replaced by custom function in
 any PL (now or near future), I am not against to have this function in
 core. There are lot of companies with strong resistance against stored
 procedures - and sometimes this functionality can help with SQL debugging.

 Issues:

 1. Support only MESSAGE field in exception - I am expecting to support
 all fields: HINT, DETAIL, ...

>>>
>>> Added these functionalities too.
>>>
>>>
 2. Missing regress tests

>>>
>>> Adding here.
>>>
>>>
 3. the parsing ereport level should be public function shared with
 PLpgSQL and other PL

>>>
>>> Sorry Pavel. I am not getting your point here. Would you give me an
>>> example.
>>>
>>
>> The transformation: text -> error level is common task - and PLpgSQL it
>> does in pl_gram.y. My idea is to add new function to error utils named
>> "parse_error_level" and use it from PLpgSQL and from your code.
>>
>>
>>>
>>>
 4. should be hidestmt mandatory parameter?

>>>
>>> I changed this argument's default value as "true".
>>>
>>>
 5. the function declaration is strange

 postgres=# \sf pg_report_log (text, anyelement, boolean)
 CREATE OR REPLACE FUNCTION pg_catalog.pg_report_log(text, anyelement,
 boolean)
  RETURNS void
  LANGUAGE sql
  STABLE STRICT COST 1
 AS $function$SELECT pg_report_log($1::pg_catalog.text,
 $2::pg_catalog.text, $3::boolean)$function$

 Why polymorphic? It is useless on any modern release


>>> I took quote_ident(anyelement) as referral code, for implementing this.
>>> Could you guide me if I am doing wrong here.
>>>
>>
>> I was wrong - this is ok - it is emulation of force casting to text
>>
>>
>>>
>>>
 postgres=# \sf pg_report_log (text, text, boolean)
 CREATE OR REPLACE FUNCTION pg_catalog.pg_report_log(text, text, boolean)
  RETURNS void
  LANGUAGE internal
  IMMUTABLE STRICT
 AS $function$pg_report_log$function$

 Why stable, why immutable? This function should be volatile.

 Fixed these to volatile.
>>>
>>>
 6. using elog level enum as errcode is wrong idea - errcodes are
 defined in table
 http://www.postgresql.org/docs/9.4/static/errcodes-appendix.html

>>>
>>> You mean, if the elevel is 'ERROR', then we need to allow errcode. Let
>>> me know your inputs.
>>>
>>
>> I was blind, but the code was not good. Yes, error and higher needs error
>> code. From ANSI/SQL anything can has error code "00 is ok", "01 ..
>> warnings" ...
>>
>> There is more possibilities - look to PLpgSQL implementation - it can be
>> optional parameter - it default can use ERRCODE_RAISE_EXCEPTION
>>
>>
>>>
>>> Adding new patch, with the above fixes.
>>>
>>
> the code looks better
>
> my objections:
>
> 1. I prefer default values be NULL
>

Fixed it.


> 2. readability:
>   * parsing error level should be in alone cycle
>   * you don't need to use more ereport calls - one is good enough - look
> on implementation of stmt_raise in PLpgSQL
>

Sorry for my ignorance. I have tried to implement parse_error_level in
pl_gram.y, but was not able to do it. I was not able to parse the given
string with tokens and return the error levels. I looked for referral code,
but was not able to find any. Would you g
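
For what it's worth, here is an editor's sketch of the kind of shared helper
Pavel describes (a plain text-to-elevel lookup); the set of accepted names
and where such a function should live are assumptions, not something settled
in the thread:

#include "postgres.h"

/*
 * Sketch of a shared "level name -> elevel" lookup of the sort suggested
 * upthread.  Returns -1 for an unrecognized name.
 */
static int
parse_error_level(const char *name)
{
	if (pg_strcasecmp(name, "debug") == 0)
		return DEBUG1;
	if (pg_strcasecmp(name, "log") == 0)
		return LOG;
	if (pg_strcasecmp(name, "info") == 0)
		return INFO;
	if (pg_strcasecmp(name, "notice") == 0)
		return NOTICE;
	if (pg_strcasecmp(name, "warning") == 0)
		return WARNING;
	if (pg_strcasecmp(name, "exception") == 0 ||
		pg_strcasecmp(name, "error") == 0)
		return ERROR;
	return -1;
}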

Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Robert Haas
On Wed, Sep 2, 2015 at 3:03 PM, Josh Berkus  wrote:
>> 4. Therefore, I think that we should instead use logical replication,
>> which might be either synchronous or asynchronous.  When you modify
>> one copy of the data, that change will then be replicated to all other
>> nodes.  If you are OK with eventual consistency, this replication can
>> be asynchronous, and nodes that are off-line will catch up when they
>> are on-line.  If you are not OK with that, then you must replicate
>> synchronously to every node before transaction commit; or at least you
>> must replicate synchronously to every node that is currently on-line.
>> This presents some challenges: logical decoding currently can't
>> replicate transactions that are still in process - replication starts
>> when the transaction commits.  Also, we don't have any way for
>> synchronous replication to wait for multiple nodes.
>
> Well, there is a WIP patch for that, which IMHO would be much improved
> by having a concrete use-case like this one.  What nobody is working on
> -- and we've vetoed in the past -- is a way of automatically failing and
> removing from replication any node which repeatedly fails to sync, which
> would be a requirement for this model.

Yep.  It's clear to me we need that in general, not just for sharding.
To me, the key is to make sure there's a way for the cluster-ware to
know about the state transitions.  Currently, when the synchronous
standby changes, PostgreSQL doesn't tell anyone.  That's a problem.

> You'd also need a way to let the connection nodes know when a replica
> has fallen behind so that they can be taken out of
> load-balancing/sharding for read queries.  For the synchronous model,
> that would be "fallen behind at all"; for asynchronous it would be
> "fallen more than ### behind".

How is that different from the previous thing?  Just that we'd treat
"lagging" as "down" beyond some threshold?  That doesn't seem like a
mandatory feature.

>> But in theory
>> those seem like limitations that can be lifted.  Also, the GTM needs
>> to be aware that this stuff is happening, or it will DTWT.  That too
>> seems like a problem that can be solved.
>
> Yeah?  I'd assume that a GTM would be antithetical to two-stage copying.

I don't think so.  If transaction A writes data on X which is
replicated to Y and then commits, a new snapshot which shows A as
committed can't be used on Y until A's changes have been replicated
there.  That could be enforced by having the commit of A wait for
replication, or by having an attempt by a later transaction to use the
snapshot on Y wait until replication completes, or some even more
sophisticated strategy that considers whether the replication backlog
touches the same data that the new transaction will read.  It's
complicated, but it doesn't seem intractable.

>  I'm not a big fan of a GTM at all, frankly; it makes clusters much
> harder to set up, and becomes a SPOF.

I partially agree.  I think it's very important that the GTM is an
optional feature of whatever we end up with, rather than an
indispensable component.  People who don't want it shouldn't have to
pay the price in performance and administrative complexity.  But at
the same time, I think a lot of people will want it, because without
it, the fact that sharding is in use is much less transparent to the
application.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Memory prefetching while sequentially fetching from SortTuple array, tuplestore

2015-09-02 Thread Peter Geoghegan
On Wed, Sep 2, 2015 at 7:12 AM, Andres Freund  wrote:
> On 2015-07-19 16:34:52 -0700, Peter Geoghegan wrote:
> Hm. Does a compiler test actually test anything reliably here? Won't this
> just throw a warning at compile time about an unknown function?

I'll need to look into that.

>> +/*
>> + * Prefetch support -- Support memory prefetching hints on some platforms.
>> + *
>> + * pg_rfetch() is specialized for the case where an array is accessed
>> + * sequentially, and we can prefetch a pointer within the next element (or 
>> an
>> + * even later element) in order to hide memory latency.  This case involves
>> + * prefetching addresses with low temporal locality.  Note that it's rather
>> + * difficult to get any kind of speedup using pg_rfetch();  any use of the
>> + * intrinsic should be carefully tested.  Also note that it's okay to pass 
>> it
>> + * an invalid or NULL address, although it's best avoided.
>> + */
>> +#if defined(USE_MEM_PREFETCH)
>> +#define pg_rfetch(addr)  __builtin_prefetch((addr), 0, 0)
>> +#endif
>
> What is that name actually supposed to mean? 'rfetch' doesn't seem to be
> particularly clear - or is it a meant to be a wordplay combined with the
> p?

"Read fetch". One argument past to the intrinsic here specifies that
the variable will be read only. I did things this way because I
imagined that there would be very limited uses for the macro only. I
probably cover almost all interesting cases for explicit memory
prefetching already.

> I don't think you should specify the read/write and locality parameters
> if we don't hand-pick them - right now you're saying the memory will
> only be read and that it has no temporal locality.
>
> I think there should be a empty fallback definition even if the current
> only caller ends up not needing it - not all callers will require it.

It started out that way, but Tom felt that it was better to have a
USE_MEM_PREFETCH because of the branch below...

>> + /*
>> +  * Perform memory prefetch of tuple 
>> that's three places
>> +  * ahead of current (which is returned 
>> to caller).
>> +  * Testing shows that this 
>> significantly boosts the
>> +  * performance for TSS_INMEM "forward" 
>> callers by
>> +  * hiding memory latency behind their 
>> processing of
>> +  * returned tuples.
>> +  */
>> +#ifdef USE_MEM_PREFETCH
>> + if (readptr->current + 3 < 
>> state->memtupcount)
>> + 
>> pg_rfetch(state->memtuples[readptr->current + 3]);
>> +#endif
>
> Hm. Why do we need a branch here? The target of prefetches don't need to
> be valid addresses and adding a bunch of arithmetic and a branch for the
> corner case doesn't sound worthwhile to me.

...which is needed, because I'm still required to not dereference wild
pointers. In other words, although pg_rfetch()/__builtin_prefetch()
does not require that a valid pointer be passed, it is not okay to
read past an array's bounds to read that pointer. The GCC docs are
clear on this -- "Data prefetch does not generate faults if addr is
invalid, but the address expression itself must be valid". Also, for
cases that don't benefit (like datum sorts, which never have a "tuple
proper" -- the above code is for tuplestore, not tuplesort, so you
can't see this), it seemed faster to have the branch than to rely on
__builtin_prefetch() being forgiving about invalid/wild pointers. I
saw a regression without the branch for that case.
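
Putting the two points together - an empty fallback definition so call sites
need no #ifdef, plus the bounds guard on the address expression - here is a
minimal editor's sketch (illustrative only, not the posted patch):

#include <stddef.h>

/* Empty fallback: the hint simply compiles away where unsupported. */
#if defined(USE_MEM_PREFETCH)
#define pg_rfetch(addr)	__builtin_prefetch((addr), 0, 0)
#else
#define pg_rfetch(addr)	((void) 0)
#endif

/*
 * The guard keeps the address *expression* within the array bounds; the
 * prefetched pointer value itself may be invalid or NULL, per the GCC docs
 * quoted above.
 */
static void
walk(char **items, size_t nitems)
{
	size_t		i;

	for (i = 0; i < nitems; i++)
	{
		if (i + 3 < nitems)
			pg_rfetch(items[i + 3]);
		/* ... process items[i] here ... */
		(void) items[i];
	}
}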

> What worries me about adding explicit prefetching is that it's *highly*
> platform and even micro-architecture dependent. Why is looking three
> elements ahead the right value?

Because that was the fastest value following testing on my laptop. You
are absolutely right to point out that this isn't a good reason to go
with the patch -- I share your concern. All I can say in defense of
that is that other major system software does the same, without any
regard for the underlying microarchitecture AFAICT. So, yes,
certainly, more testing is required across a reasonable cross-section
of platforms to justify the patch. But right now, time spent in
_bt_buildadd() is massively reduced with the patch, which makes it
worthwhile in my mind. It's really very effective if you look at the
only part of (say) a CREATE INDEX that is affected one way or another
-- the final fetching of tuples.

BTW, I am very close to posting a patch that makes (say) CREATE INDEX
very fast for external sorts. That uses a memory pool to make final
on-the-fly merge memory access sequential per tape. That's what I'll
do rather than explicitly prefetching for external sorts. It makes a
huge difference on top of the external sort work already posted,

Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE

2015-09-02 Thread Alvaro Herrera
Michael Paquier wrote:

> I haven't written yet a test case but I think that we could reproduce
> that simply by having a relation referenced in the exception block of
> a first function, calling a second function that itself raises an
> exception, causing the referencing error.

Hm, so function 2 is called in the exception block of function 1?

Are you going to produce the test case for this?  Jim?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Julien Rouhaud
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi,

On 02/09/2015 18:06, Tomas Vondra wrote:
> Hi
> 
> On 09/02/2015 03:53 PM, Andres Freund wrote:
>> 
>> Hi,
>> 
>> On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote:
>>> I didn't know that the thread must exists on -hackers to be
>>> able to add a commitfest entry, so I transfer the thread here.
>> 
>> Please, in the future, also update the title of the thread to
>> something fitting.
>> 

Sorry for that.

>>> @@ -539,6 +541,9 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
>>>  {
>>>  	BitmapHeapScanState *scanstate;
>>>  	Relation	currentRelation;
>>> +#ifdef USE_PREFETCH
>>> +	int			new_io_concurrency;
>>> +#endif
>>>
>>>  	/* check for unsupported flags */
>>>  	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
>>> @@ -598,6 +603,25 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
>>>  	 */
>>>  	currentRelation = ExecOpenScanRelation(estate, node->scan.scanrelid, eflags);
>>>
>>> +#ifdef USE_PREFETCH
>>> +	/* check if the effective_io_concurrency has been overloaded for the
>>> +	 * tablespace storing the relation and compute the target_prefetch_pages,
>>> +	 * or just get the current target_prefetch_pages
>>> +	 */
>>> +	new_io_concurrency = get_tablespace_io_concurrency(
>>> +		currentRelation->rd_rel->reltablespace);
>>> +
>>> +
>>> +	scanstate->target_prefetch_pages = target_prefetch_pages;
>>> +
>>> +	if (new_io_concurrency != effective_io_concurrency)
>>> +	{
>>> +		double		prefetch_pages;
>>> +
>>> +		if (compute_io_concurrency(new_io_concurrency, &prefetch_pages))
>>> +			scanstate->target_prefetch_pages = rint(prefetch_pages);
>>> +	}
>>> +#endif
>> 
>> Maybe it's just me - but imo there should be as few USE_PREFETCH 
>> dependant places in the code as possible. It'll just be 0 when
>> not supported, that's fine?
> 
> It's not just you. Dealing with code with plenty of ifdefs is
> annoying - it compiles just fine most of the time, until you
> compile it with some rare configuration. Then it either starts
> producing strange warnings or the compilation fails entirely.
> 
> It might make a tiny difference on builds without prefetching
> support because of code size, but realistically I think it's
> noise.
> 
>> Especially changing the size of externally visible structs
>> depending ona configure detected ifdef seems wrong to me.
> 
> +100 to that
> 

I totally agree. I'll remove the ifdefs.

>> Nitpick: Wrong comment style (/* stands on its own line).

I did run pgindent before submitting patch, but apparently I picked
the wrong one. Already fixed in local branch.

>>> + /*--
>>> +  * The user-visible GUC parameter is the number of drives (spindles),
>>> +  * which we need to translate to a number-of-pages-to-prefetch target.
>>> +  * The target value is stashed in *extra and then assigned to the actual
>>> +  * variable by assign_effective_io_concurrency.
>>> +  *
>>> +  * The expected number of prefetch pages needed to keep N drives busy is:
>>> +  *
>>> +  * drives |   I/O requests
>>> +  * -------+----------------
>>> +  *  1 |   1
>>> +  *  2 |   2/1 + 2/2 = 3
>>> +  *  3 |   3/1 + 3/2 + 3/3 = 5 1/2
>>> +  *  4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
>>> +  *  n |   n * H(n)
>> 
>> I know you just moved this code. But: I don't buy this formula.
>> Like at all. Doesn't queuing and reordering entirely invalidate
>> the logic here?
> 
> Well, even the comment right next after the formula says that:
> 
> * Experimental results show that both of these formulas aren't *
> aggressiveenough, but we don't really have any better proposals.
> 
> That's the reason why users generally either use 0 or some rather
> high value (16 or 32 are the most common values see). The problem
> is that we don't really care about the number of spindles (and not
> just because SSDs don't have them at all), but about the target
> queue length per device. Spinning rust uses TCQ/NCQ to optimize the
> head movement, SSDs are parallel by nature (stacking multiple chips
> with separate channels).
> 
> I doubt we can really improve the formula, except maybe for saying
> "we want 16 requests per device" and multiplying the number by
> that. We don't really have the necessary introspection to determine
> better values (and it's not really possible, because the devices
> may be hidden behind a RAID controller (or a SAN). So we can't
> really do much.
> 
> Maybe the best thing we can do is just completely abandon the
> "number of spindles" idea, and just say "number of I/O requests to
> prefetch". Possibly with an explanation of how to estimate it
> (devices * queue length).
> 
>> I think that'd be a lot better.

+1 for that too.

If everone's ok with this change, I can submit a patch for that too.
Should I split that into two patches, and/or start a new thread?



- -- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.17 (GNU/Linux)

iQEcBAEBAgAG

Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Josh Berkus
On 09/02/2015 11:41 AM, Robert Haas wrote:
> On Wed, Sep 2, 2015 at 1:57 PM, Josh Berkus  wrote:
>> Even if it's only on paper, any new sharding design needs to address
>> these questions:
>>
>> 1. How do we ensure no/minimal data is lost if we lose a node?
>> 2. How do we replace a lost node (without taking the cluster down)?
>>2. a. how do we allow an out-of-sync node to "catch up"?
>> 3. How do we maintain metadata about good/bad nodes (and shard locations)?
>> 4. How do we add nodes to expand the cluster?
>>
>> There doesn't need to be code for all of the above from version 0.1, but
>> there needs to be a plan to tackle those problems.  Otherwise, we'll
>> just end up with another dead-end, not-useful-in-production technology.
> 
> This is a good point, and I think I agree with it.  Let me make a few
> observations:
> 
> 1. None of this stuff matters very much when the data is strictly
> read-only. 

Yep.

> 2. None of this stuff matters when you only have one copy of the data.
> Your system is low-availability, but you just don't care for whatever
> reason. 

Uh-huh.

> 3. IIUC, Postgres-XC handles this problem by reducing at least
> volatile functions, maybe all functions, to constants.  Then it
> generates an SQL statement to be sent to the data node to make the
> appropriate change.  If there's more than one copy of the data, we
> send a separate copy of the SQL statement to every node.  I'm not sure
> exactly what happens if some of those nodes are not available, but I
> don't think it's anything good.  Fundamentally, this model doesn't
> allow for many good options in that case.

pg_shard also sends the data to each node, and automatically notices
which nodes are not responding and takes them out of availability.
There isn't a "catch up" feature yet (AFAIK), or any attempt to reduce
volatile functions.

For that matter, last I worked on it Greenplum also did multiplexing via
the writing node (or via the data loader).  So this is a popular
approach; it has a number of drawbacks, though, of which volatile
functions are a major one.

> 4. Therefore, I think that we should instead use logical replication,
> which might be either synchronous or asynchronous.  When you modify
> one copy of the data, that change will then be replicated to all other
> nodes.  If you are OK with eventual consistency, this replication can
> be asynchronous, and nodes that are off-line will catch up when they
> are on-line.  If you are not OK with that, then you must replicate
> synchronously to every node before transaction commit; or at least you
> must replicate synchronously to every node that is currently on-line.
> This presents some challenges: logical decoding currently can't
> replicate transactions that are still in process - replication starts
> when the transaction commits.  Also, we don't have any way for
> synchronous replication to wait for multiple nodes.  

Well, there is a WIP patch for that, which IMHO would be much improved
by having a concrete use-case like this one.  What nobody is working on
-- and we've vetoed in the past -- is a way of automatically failing and
removing from replication any node which repeatedly fails to sync, which
would be a requirement for this model.

You'd also need a way to let the connection nodes know when a replica
has fallen behind so that they can be taken out of
load-balancing/sharding for read queries.  For the synchronous model,
that would be "fallen behind at all"; for asynchronous it would be
"fallen more than ### behind".

> But in theory
> those seem like limitations that can be lifted.  Also, the GTM needs
> to be aware that this stuff is happening, or it will DTWT.  That too
> seems like a problem that can be solved.

Yeah?  I'd assume that a GTM would be antithetical to two-stage copying.
 I'm not a big fan of a GTM at all, frankly; it makes clusters much
harder to set up, and becomes a SPOF.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-09-02 Thread Robert Haas
On Wed, Sep 2, 2015 at 4:52 AM, Shulgin, Oleksandr wrote:
> On Tue, Sep 1, 2015 at 8:12 PM, Andres Freund  wrote:
>>
>> On 2015-09-01 14:07:19 -0400, Robert Haas wrote:
>> > But I think it's quite wrong to assume that the infrastructure for
>> > this is available and usable everywhere, because in my experience,
>> > that's far from the case.
>>
>> Especially when the alternative is a rather short patch implementing an
>> otherwise widely available feature.
>
> But that won't actually help in the case described by Robert: if the master
> server A failed, the client has no idea if B or C would become the new
> master.

Sure it does.  You just need to ensure that whichever of those is the
new master accepts connections, and the other one doesn't.  There are
lots of ways to do this; e.g. give the machine a second IP that
accepts connections only when the machine is the designated master,
and have read-write clients connect to that IP, and read-only clients
connect to the machine's main IP.
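
(Purely for illustration -- the addresses and database name below are
invented -- the client side of that arrangement is just each class of
client using a different conninfo:)

#include <libpq-fe.h>

/* The floating "write" address exists only on whichever machine is
 * currently the designated master; the fixed per-machine address serves
 * read-only traffic. */
static PGconn *
connect_read_write(void)
{
    return PQconnectdb("host=10.0.0.10 dbname=app");    /* floating IP */
}

static PGconn *
connect_read_only(void)
{
    return PQconnectdb("host=10.0.0.11 dbname=app");    /* machine's own IP */
}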

Andres's point is the same as mine: we ought to accept this feature,
in some form, because it's really quite useful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FSM versus GIN pending list bloat

2015-09-02 Thread Robert Haas
On Mon, Aug 10, 2015 at 1:16 PM, Jeff Janes  wrote:
> I have a simple test case that inserts an array of 101 md5 digests into each
> row.  With 10_000 of these rows inserted into an already indexed table, I
> get 40MB for the table and 80MB for the index unpatched.  With the patch, I
> get 7.3 MB for the index.

Uh, wow.  Seems like we should do something about this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] SQL function to report log message

2015-09-02 Thread Robert Haas
On Tue, Sep 1, 2015 at 6:13 PM, Jim Nasby  wrote:
> My thought is that there's a fair amount of places where we do string
> comparison for not a great reason. Perhaps a better example is data that
> comes back from a trigger; AFTER/BEFORE, INSERT/UPDATE/..., which is more
> expensive to setup the variables for (strdup a fixed string, which means a
> palloc), and then comparisons are done as text varlena (iirc).
>
> Instead if this information came back as an ENUM the variable would be a
> simple int as would the comparison. We'd still have a raw string being
> parsed in the function body, but that would happen once during initial
> compilation and it would be replaced with an ENUM value.
>
> For RAISE, AFAIK we still end up converting the raw string into a varlena
> CONST, which means a palloc. If it was an ENUM it'd be converted to an int.
>
> If we're worried about the overhead of the enum machinery we could create a
> relcache for system enums, but I suspect that even without that it'd be a
> win over the string stuff. Especially since I bet most people run UTF8.

I agree with Pavel on this one: creating an extra type here is going
to cause more pain than it removes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Greg Stark
On 2 Sep 2015 14:54, "Andres Freund"  wrote:
>
>
> > + /*--
> > +  * The user-visible GUC parameter is the number of drives (spindles),
> > +  * which we need to translate to a number-of-pages-to-prefetch target.
> > +  * The target value is stashed in *extra and then assigned to the 
> > actual
> > +  * variable by assign_effective_io_concurrency.
> > +  *
> > +  * The expected number of prefetch pages needed to keep N drives busy 
> > is:
> > +  *
> > +  * drives |   I/O requests
> > +  * ---+
> > +  *  1 |   1
> > +  *  2 |   2/1 + 2/2 = 3
> > +  *  3 |   3/1 + 3/2 + 3/3 = 5 1/2
> > +  *  4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
> > +  *  n |   n * H(n)
>
> I know you just moved this code. But: I don't buy this formula. Like at
> all. Doesn't queuing and reordering entirely invalidate the logic here?

I can take the blame for this formula.

It's called the "Coupon Collector Problem": if each draw gets you a random
coupon from a set of n possible coupons, how many random coupons would
you have to collect before you expect to have at least one of each?

This computation model assumes we have no information about which
spindle each block will hit. That's basically true for most
bitmapheapscan cases, because the idea of bitmapheapscan is to
pick a sparse set of blocks, and there's no reason the blocks
being read will have any regularity that causes them all to fall on
the same spindles. If in fact you're reading a fairly dense set then
bitmapheapscan probably is a waste of time and simply reading
sequentially would be exactly as fast or even faster.
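
To make the arithmetic concrete, here is a minimal sketch of the
drives-to-prefetch-pages translation the formula implies (an illustrative
helper only, not the actual GUC assignment code):

#include <math.h>

/*
 * Sketch only: for n configured drives the expected prefetch distance is
 * n * H(n), i.e. the sum of n/i for i = 1..n, rounded to whole pages.
 */
static int
prefetch_pages_for_drives(int drives)
{
    double      target = 0.0;
    int         i;

    for (i = 1; i <= drives; i++)
        target += (double) drives / i;

    return (int) rint(target);  /* e.g. 4 drives -> rint(8.33) = 8 */
}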

We talked about this quite a bit back then and there was no dispute
that the aim is to provide GUCs that mean something meaningful to the
DBA who can actually measure them. They know how many spindles they
have. They do not know what the optimal prefetch depth is and the only
way to determine it would be to experiment with Postgres. Worse, I
think the above formula works for essentially random I/O but for more
predictable I/O it might be possible to use a different formula. But
if we made the GUC something low level like "how many blocks to
prefetch" then we're left in the dark about how to handle that
different access pattern.

I did speak to a dm developer and he suggested that the kernel could
help out with an API. He suggested something of the form "how many
blocks do I have to read before the end of the current device". I
wasn't sure exactly what we would do with something like that but it
would be better than just guessing how many I/O operations we need to
issue to keep all the spindles busy.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-09-02 Thread Robert Haas
On Tue, Sep 1, 2015 at 6:43 PM, and...@anarazel.de  wrote:
> Why a new tranche for each of these? And it can't be correct that each
> has the same base?

I complained about the same-base problem before.  Apparently, that got ignored.

> I don't really like the tranche model as in the patch right now. I'd
> rather have it in a way where we have one tranche for all the individual
> lwlocks, where the tranche points to an array of names alongside the
> tranche's name. And then for the others we just supply the tranche name,
> but leave the name array empty, in which case a name can be generated.

That's an interesting idea.
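
For illustration, the quoted scheme might boil down to a per-tranche
descriptor along these lines (a hypothetical layout, not the struct in
the patch):

/* Hypothetical: one entry per tranche; field names are illustrative. */
typedef struct
{
    const char  *tranche_name;  /* e.g. "individual" or "buffer_mapping" */
    const char **lock_names;    /* per-lock names, or NULL */
    int          nlocks;        /* when lock_names is NULL, names such as
                                 * "tranche_name N" can be generated */
} tranche_name_info;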

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Robert Haas
On Wed, Sep 2, 2015 at 1:57 PM, Josh Berkus  wrote:
> Even if it's only on paper, any new sharding design needs to address
> these questions:
>
> 1. How do we ensure no/minimal data is lost if we lose a node?
> 2. How do we replace a lost node (without taking the cluster down)?
>2. a. how do we allow an out-of-sync node to "catch up"?
> 3. How do we maintain metadata about good/bad nodes (and shard locations)?
> 4. How do we add nodes to expand the cluster?
>
> There doesn't need to be code for all of the above from version 0.1, but
> there needs to be a plan to tackle those problems.  Otherwise, we'll
> just end up with another dead-end, not-useful-in-production technology.

This is a good point, and I think I agree with it.  Let me make a few
observations:

1. None of this stuff matters very much when the data is strictly
read-only.  You don't lose any data because you made enough copies at
some point in the distant past to ensure that you wouldn't.  You
replace a lost node by taking a new copy.  Nodes never need to catch up
because there are no changes happening.  To bring up a new node,
you make a copy of an existing node (which doesn't change in the
meantime).  So most of these concerns are about how to handle writes.

2. None of this stuff matters when you only have one copy of the data.
Your system is low-availability, but you just don't care for whatever
reason.  The issue arises when you have multiple copies of the data,
and the data is being changed.  Now, you have to worry about the
copies getting out of sync with each other, especially when failures
happen.

3. IIUC, Postgres-XC handles this problem by reducing at least
volatile functions, maybe all functions, to constants.  Then it
generates an SQL statement to be sent to the data node to make the
appropriate change.  If there's more than one copy of the data, we
send a separate copy of the SQL statement to every node.  I'm not sure
exactly what happens if some of those nodes are not available, but I
don't think it's anything good.  Fundamentally, this model doesn't
allow for many good options in that case.

4. Therefore, I think that we should instead use logical replication,
which might be either synchronous or asynchronous.  When you modify
one copy of the data, that change will then be replicated to all other
nodes.  If you are OK with eventual consistency, this replication can
be asynchronous, and nodes that are off-line will catch up when they
are on-line.  If you are not OK with that, then you must replicate
synchronously to every node before transaction commit; or at least you
must replicate synchronously to every node that is currently on-line.
This presents some challenges: logical decoding currently can't
replicate transactions that are still in process - replication starts
when the transaction commits.  Also, we don't have any way for
synchronous replication to wait for multiple nodes.  But in theory
those seem like limitations that can be lifted.  Also, the GTM needs
to be aware that this stuff is happening, or it will DTWT.  That too
seems like a problem that can be solved.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench stats per script & other stuff

2015-09-02 Thread Robert Haas
On Wed, Sep 2, 2015 at 2:20 PM, Fabien COELHO  wrote:
>> I'm wondering if percentages instead of weights would be a better
>> idea. That'd mean you'd be forced to be more careful when adding another
>> script (having to adjust the percentages of other scripts) but arguably
>> that's a good thing?
>
> If you use only percentages, then you have to check that the total is 100,
> you probably have to use floats, you have to do something when the total is
> not 100, and the checking would complicate the code and test people's mental
> calculus abilities. Not sure this is a good idea :-)

I agree.  I don't see a reason to enforce that the total of the
weights must be 100.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Robert Haas
On Wed, Sep 2, 2015 at 1:59 PM, Merlin Moncure  wrote:
> On Tue, Sep 1, 2015 at 11:18 AM, Robert Haas  wrote:
>> It would be a bad idea to cling blindly to the FDW infrastructure if
>> it's fundamentally inadequate to do what we want.  On the other hand,
>> it would also be a bad idea to set about recreating it without a
>> really good reason, and - just to take one example - the fact that it
>> doesn't currently push down DML operations to the remote side is not a
>> really good reason to rewrite the whole thing.  On the contrary, it's
>> a reason to put some energy into the already-written patch which
>> implements that optimization.
>
> The problem with FDW for these purposes as I see it is that too much
> intelligence is relegated to the implementer of the API.  There needs
> to be a mechanism so that the planner can rewrite the remote query and
> then do some after-the-fact processing.  This is exactly what citus does;
> if you send out AVG(foo) it rewrites that to SUM(foo) and COUNT(foo)
> so that the aggregation can be properly weighted in the result.  To do
> this (distributed OLAP-type processing) right, the planner needs to
> *know* that this table is in fact distributed and also know that it
> can make SQL-compatible adjustments to the query.
>
> This strikes me as a bit of a conflict of interest with FDW, which
> seems to want to hide the fact that it's foreign; the FDW
> implementation makes its own optimization decisions, which might make
> sense for single-table queries but break down in the face of joins.

Well, I don't think that ALL of the logic should go into the FDW.  In
particular, in this example, parallel aggregate needs the same query
rewrite, so the logic for that should live in core so that both
parallel and distributed queries get the benefit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench stats per script & other stuff

2015-09-02 Thread Fabien COELHO


Hello Andres,


> Maybe add --builtin list to show them?


Yep, easy enough.


> [...]
> +Shorthand for -b simple-update@1.
> +Shorthand for -b select-only@1.


> I'm a bit inclined to remove these options.


Hm...

This is really backward compatibility, and people may find references to
these in blogs or elsewhere, so I think that it would make sense to be
upward compatible.

I would certainly be against adding any other of these options, though.


> +   Each script may be given a relative weight specified after a
> +   @ so as to change its drawing probability.
> +   The default weight is 1.


> I'm wondering if percentages instead of weights would be a better
> idea. That'd mean you'd be forced to be more careful when adding another
> script (having to adjust the percentages of other scripts) but arguably
> that's a good thing?


If you use only percentages, then you have to check that the total is 100,
you probably have to use floats, you have to do something when the total is
not 100, and the checking would complicate the code and test people's mental
calculus abilities. Not sure this is a good idea :-)


In the use case you outline, when adding a script, maybe you know that it 
runs "as much as" this other script, so you can pick up the same weight 
without bothering.


Also, when testing, there is an issue when you want to remove one script 
for a quick test, and that would mean changing all percentages on the 
command line...


So I would advise not to put such a constraint.
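
(For what it's worth, with integer weights the draw itself stays trivial;
a sketch, with rand() standing in for pgbench's own PRNG:)

#include <stdlib.h>

/* Pick index i with probability weight[i] / total_weight. */
static int
choose_script(const int *weight, int nscripts, int total_weight)
{
    int     r = rand() % total_weight;
    int     i;

    for (i = 0; i < nscripts; i++)
    {
        r -= weight[i];
        if (r < 0)
            return i;
    }
    return nscripts - 1;        /* defensive; not reached if totals agree */
}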


> +static SQLScript sql_script[MAX_SCRIPTS];
>
> +static struct {
> +   char *name;   /* very short name for -b ...*/
> +   char *desc;   /* short description */
> +   char *script; /* actual pgbench script */
> +} builtin_script[]


> Can't we put these in the same array?


I do not understand.


> +   printf("transaction type: %s\n",
> +  num_scripts == 1? sql_script[0].name: "multiple scripts");


> Seems like it'd be more useful to simply always list the scripts +
> weights here.


The detailed list is shown later, with the summary performance figure for
each script, so ISTM that it would be redundant? Maybe the transaction
type could be moved downwards, just before said list?


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Hooking at standard_join_search (Was: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual)

2015-09-02 Thread Tom Lane
I wrote:
> ... I imagine that we should allow FDWs to
> store some data within RelOptInfo structs that represent foreign joins
> belonging entirely to them, so that there'd be a handy place to keep that
> data till later.

Actually, if we do that (ie, provide a "void *fdw_state" field in join
RelOptInfos), then the FDW could use the nullness or not-nullness of
such a field to realize whether or not it had already considered this
join relation.  So I'm now thinking that the best API is to call the
FDW at the end of each make_join_rel call, whether it's the first one
for the joinrel or not.  If the FDW wants a call for each legal pair of
input sub-relations, it's got one.  If it only wants one call per joinrel,
it can just make sure to put something into fdw_state, and then on
subsequent calls for the same joinrel it can just exit immediately if
fdw_state is already non-null.  So we have both use-cases covered.
Also, by doing this at the end, the FDW can look at the "regular" (local
join execution) paths that were already generated, should it wish to.
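
To illustrate, the once-per-joinrel pattern could look roughly like this
in an FDW; the callback name and signature are hypothetical, and
fdw_state is the field proposed above, not something that exists today:

#include "postgres.h"
#include "nodes/relation.h"     /* PlannerInfo, RelOptInfo */

/* Hypothetical FDW-private struct; its contents are up to the FDW. */
typedef struct MyFdwJoinState
{
    bool        join_is_shippable;
    /* ... remembered join order, deparse information, etc. ... */
} MyFdwJoinState;

/*
 * Sketch of a callback invoked at the end of each make_join_rel().
 * A non-null fdw_state means this joinrel was already handled when an
 * earlier pair of input sub-relations was considered.
 */
static void
my_fdw_consider_join(PlannerInfo *root, RelOptInfo *joinrel,
                     RelOptInfo *outerrel, RelOptInfo *innerrel)
{
    if (joinrel->fdw_state != NULL)
        return;

    joinrel->fdw_state = palloc0(sizeof(MyFdwJoinState));
    /* ... inspect SpecialJoinInfo, look at the local paths already added,
     * and push suitable foreign-join paths onto the joinrel ... */
}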

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Merlin Moncure
On Tue, Sep 1, 2015 at 11:18 AM, Robert Haas  wrote:
> It would be a bad idea to cling blindly to the FDW infrastructure if
> it's fundamentally inadequate to do what we want.  On the other hand,
> it would also be a bad idea to set about recreating it without a
> really good reason, and - just to take one example - the fact that it
> doesn't currently push down DML operations to the remote side is not a
> really good reason to rewrite the whole thing.  On the contrary, it's
> a reason to put some energy into the already-written patch which
> implements that optimization.

The problem with FDW for these purposes as I see it is that too much
intelligence is relegated to the implementer of the API.  There needs
to be a mechanism so that the planner can rewrite the remote query and
then do some after-the-fact processing.  This is exactly what citus does;
if you send out AVG(foo) it rewrites that to SUM(foo) and COUNT(foo)
so that the aggregation can be properly weighted in the result.  To do
this (distributed OLAP-type processing) right, the planner needs to
*know* that this table is in fact distributed and also know that it
can make SQL-compatible adjustments to the query.

This strikes me as a bit of a conflict of interest with FDW, which
seems to want to hide the fact that it's foreign; the FDW
implementation makes its own optimization decisions, which might make
sense for single-table queries but break down in the face of joins.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Josh Berkus
On 09/01/2015 04:14 PM, Petr Jelinek wrote:
> On 2015-09-02 00:09, Josh Berkus wrote:
>> On 09/01/2015 02:29 PM, Tomas Vondra wrote:
>>> So while you may be right in single-DC deployments, with multi-DC
>>> deployments the situation is quite different - not only that the network
>>> bandwidth is not unlimited, but because latencies within DC may be a
>>> fraction of latencies between the locations (to the extent that the
>>> increase due to syncrep may be just noise). So the local replication may
>>> be actually way faster.
>>
>> I'm not seeing how the above is better using syncrep than using shard
>> copying?
> 
> Shard copying usually assumes that the origin node does the copy - the
> data has to go twice through the slow connection. With replication you
> can replicate locally over fast connection.

Ah, I was thinking of the case of having a single set of copies in the
remote DC, but of course that isn't going to be the case with a highly
redundant setup.

Basically this seems to be saying that, in an ideal setup, we'd have
some kind of synchronous per-shard replication.  We don't have that at
present (sync rep is whole-node, and BDR is asynchronous).  There's also
the question of how to deal with failures and taking bad nodes out of
circulation in such a setup, especially considering that the writes
could be coming from multiple other nodes.

>> Not really, the mechanism is different and the behavior is different.
>> One critical deficiency in using binary syncrep is that you can't do
>> round-robin redundancy at all; every redundant node has to be an exact
>> mirror of another node.  In a good HA distributed system, you want
>> multiple shards per node, and you want each shard to be replicated to a
>> different node, so that in the event of node failure you're not dumping
>> the full load on one other server.
>>
> 
> This assumes that we use binary replication, but we can reasonably use
> logical replication which can quite easily do filtering of what's
> replicated where.

Is there a way to do logical synchronous replication?  I didn't think
there was.

>>> IMHO the design has to address the multi-DC setups somehow. I think that
>>> many of the customers who are so concerned about scaling to many shards
>>> are also concerned about availability in case of DC outages, no?
>>
>> Certainly.  But users located in a single DC shouldn't pay the same
>> overhead as users who are geographically spread.
>>
> 
> Agreed, so we should support both ways, but I don't think it's necessary
> to support both ways in version 0.1. It's just important to not paint
> ourselves into a corner with design decisions that would make one of the
> ways impossible.

Exactly!

Let me explain why I'm so vocal on this point.  PostgresXC didn't deal
with the redundancy/node replacement at all until after version 1.0.
Then, when they tried to address it, they discovered that the code was
chock full of assumptions that "1 node == 1 shard", and breaking that
assumption would require a total refactor of the code (which never
happened).  I don't want to see a repeat of that mistake.

Even if it's only on paper, any new sharding design needs to address
these questions:

1. How do we ensure no/minimal data is lost if we lose a node?
2. How do we replace a lost node (without taking the cluster down)?
   2. a. how do we allow an out-of-sync node to "catch up"?
3. How do we maintain metadata about good/bad nodes (and shard locations)?
4. How do we add nodes to expand the cluster?

There doesn't need to be code for all of the above from version 0.1, but
there needs to be a plan to tackle those problems.  Otherwise, we'll
just end up with another dead-end, not-useful-in-production technology.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Hooking at standard_join_search (Was: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual)

2015-09-02 Thread Tom Lane
Robert Haas  writes:
> On Wed, Sep 2, 2015 at 10:30 AM, Tom Lane  wrote:
>> But if you have in mind that typical FDWs would actually create join paths
>> at that point, consider that
>> 
>> 1. The FDW would have to find all the combinations of its supplied
>> relations (unless you are only intending to generate one path for the
>> union of all such rels, which seems pretty narrow-minded from here).

> Well, if the remote end is another database server, presumably we can
> leave it to optimize the query, so why would we need more than one
> path?

If you have say 5 relations in the query, 3 of which are foreign, it might
make sense to join all 3 at the remote end, or maybe you should only join
2 of them remotely because it's better to then join to one of the local
rels before joining the last remote rel.  Even if you claim that that
would never make sense from a cost standpoint (a claim easily seen to be
silly), there might not be any legal way to join all 3 directly because of
join order constraints.

The larger point is that we can't expect the remote server to be fully
responsible for optimizing, because it will know nothing of what's being
done on our end.

> I can see that we need more than one path because of sort-order
> considerations, which would affect the query we ship to the remote
> side.  But I don't see the point of considering multiple join orders
> unless the remote end is dumber than our optimizer, which might be
> true in some cases, but not if the remote end is PostgreSQL.

(1) not all remote ends are Postgres, (2) the remote end doesn't have any
access to info about our end.

> So, the problem is that I don't think this entirely skirts the
> join_is_legal issues, which are a principal point of concern for me.
> Say this is a joinrel between (A B) and (C D E).  We need to generate
> an SQL query for (A B C D E).  We know that the outermost syntactic
> join can be (A B) to (C D E).  But how do we know which join orders
> are legal as among (C D E)?  Maybe there's a simple way to handle this
> that I'm not seeing.

Well, if the joins get built up in the way I think should happen, we'd
have already considered (C D E), and we could have recorded the legal join
orders within that at the time.  (I imagine that we should allow FDWs to
store some data within RelOptInfo structs that represent foreign joins
belonging entirely to them, so that there'd be a handy place to keep that
data till later.)  Or we could trawl through the paths associated with the
child joinrel, which will presumably include instances of every reasonable
sub-join combination.  Or the FDW could look at the SpecialJoinInfo data
and determine things for itself (or more likely, ask join_is_legal about
that).

In the case of postgres_fdw, I think the actual requirement will be to be
able to reconstruct a SQL query that correctly expresses the join; that
is, we need to send over something like "from c left join d on (...) full
join e on (...)", not just "from c, d, e", or we'll get totally bogus
estimates as well as bogus execution results.  Offhand I think that the
most likely way to build that text will be to examine the query's jointree
to see where c,d,e appear in it.  But in any case, that's a separate issue
and I fail to see how plopping the join search problem into the FDW's lap
would make it any easier.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improving test coverage of extensions with pg_dump

2015-09-02 Thread Alvaro Herrera
Andres Freund wrote:

> > As I recall, Andrew Dunstan has a module that
> > tests cross-version pg_upgrade and one thing he does is dump both and
> > compare; the problem is that there are differences, so he keeps a count
> > of how many lines he expects to differ between any two releases.
> 
> I'm not suggesting to do anything cross-release - that'd indeed be
> another ballpark.

Ah, you're right.

> Just that instead of a pg_dump test that tests some individual things we
> have one that tests the whole regression test output and then does a
> diff?

Sorry if I'm slow -- A diff against what?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improving test coverage of extensions with pg_dump

2015-09-02 Thread Andres Freund
On 2015-09-02 14:30:33 -0300, Alvaro Herrera wrote:
> I was thinking of having this be renamed src/test/modules/extensions/
> and then the extension contained here would be renamed ext001_fk_tables
> or something like that; later we could add ext002_something for testing
> some other angle of extensions, not necessarily pg_dump related.

The largest dataset we have for this is the current regression test
database; it seems a waste not to include it...

> That's another option, but we've had this idea for many years now and it
> hasn't materialized.

But that's just minimally more complex than what's currently done in
that test and in the pg_upgrade test?

> As I recall, Andrew Dunstan has a module that
> tests cross-version pg_upgrade and one thing he does is dump both and
> compare; the problem is that there are differences, so he keeps a count
> of how many lines he expects to differ between any two releases.

I'm not suggesting to do anything cross-release - that'd indeed be
another ballpark.

Just that instead of a pg_dump test that tests some individual things we
have one that tests the whole regression test output and then does a
diff?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improving test coverage of extensions with pg_dump

2015-09-02 Thread Alvaro Herrera
Andres Freund wrote:

> Isn't a full test with a separate initdb, create extension etc. a really
> heavy-handed way to test this? I mean that's a test where the setup takes
> up to 10s, whereas the actual runtime is in the millisecond range?

I spent some time looking over this patch yesterday, and that was one
concern I had too.  I was thinking that if we turn this into a single
module that can contain several extensions, or several pg_dump-related
tests, and then test multiple features without requiring an initdb for
each, it would be more palatable.

> Adding tests in this manner doesn't seem scalable to me.

I was thinking of having this be renamed src/test/modules/extensions/
and then the extension contained here would be renamed ext001_fk_tables
or something like that; later we could add ext002_something for testing
some other angle of extensions, not necessarily pg_dump related.

> I think we should rather add *one* test that does dump/restore over the
> normal regression test database. Something similar to the pg_upgrade
> tests. And then work at adding more content to the regression test
> database - potentially sourcing from src/test/modules.

That's another option, but we've had this idea for many years now and it
hasn't materialized.  As I recall, Andrew Dunstan has a module that
tests cross-version pg_upgrade and one thing he does is dump both and
compare; the problem is that there are differences, so he keeps a count
of how many lines he expects to differ between any two releases.  Or
something like that.  While it's a good enough start, I don't think it's
robust enough to be in core.  How would we do it?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench stats per script & other stuff

2015-09-02 Thread Andres Freund
On 2015-07-30 18:03:56 +0200, Fabien COELHO wrote:
> 
> >v6 is just a rebase after a bug fix by Andres Freund.
> >
> >Also a small question: The patch currently displays pgbench scripts
> >starting numbering at 0. Probably a little too geek... should start at 1?
> 
> v7 is a rebase after another small bug fix in pgbench.
> 
> -- 
> Fabien.

> diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
> index 2517a3a..99670d4 100644
> --- a/doc/src/sgml/ref/pgbench.sgml
> +++ b/doc/src/sgml/ref/pgbench.sgml
> @@ -261,6 +261,23 @@ pgbench  options  
> dbname
>  benchmarking arguments:
>  
>  
> + 
> +  -b scriptname[@weight]
> +  --builtin scriptname[@weight]
> +  
> +   
> +Add the specified builtin script to the list of executed scripts.
> +An optional integer weight after @ allows to adjust the
> +probability of drawing the test.
> +Available builtin scripts are: tpcb-like,
> +simple-update and select-only.
> +The provided scriptname needs only to be a prefix
> +of the builtin name, hence simp would be enough to select
> +simple-update.
> +   
> +  
> + 

Maybe add --builtin list to show them?

> @@ -404,10 +422,7 @@ pgbench  options  
> dbname
>--skip-some-updates
>
> 
> -Do not update pgbench_tellers and
> -pgbench_branches.
> -This will avoid update contention on these tables, but
> -it makes the test case even less like TPC-B.
> +Shorthand for -b simple-update@1.
> 
>
>   

> @@ -511,7 +526,7 @@ pgbench  options  
> dbname
>--select-only
>
> 
> -Perform select-only transactions instead of TPC-B-like test.
> +Shorthand for -b select-only@1.
> 
>
>   

I'm a bit inclined to remove these options.

>
> -   The default transaction script issues seven commands per transaction:
> +   Pgbench executes test scripts chosen randomly from a specified list.
> +   They include built-in scripts with -b and
> +   user-provided custom scripts with -f.
> +   Each script may be given a relative weight specified after a
> +   @ so as to change its drawing probability.
> +   The default weight is 1.
> + 

I'm wondering if percentages instead of weights would be a better
idea. That'd mean you'd be forced to be more careful when adding another
script (having to adjust the percentages of other scripts) but arguably
that's a good thing?

> +static SQLScript sql_script[MAX_SCRIPTS];
> +static struct {
> + char *name;   /* very short name for -b ...*/
> + char *desc;   /* short description */
> + char *script; /* actual pgbench script */
> +} builtin_script[]

Can't we put these in the same array?
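
For instance -- purely illustrative, and just one possible reading of the
suggestion -- builtins and -f scripts could share a single descriptor array:

typedef struct
{
    const char *name;       /* short name, used by -b and in reports */
    const char *desc;       /* one-line description, NULL for -f scripts */
    const char *commands;   /* the script text itself */
    int         weight;     /* relative selection weight */
} pgbench_script;

static pgbench_script scripts[MAX_SCRIPTS];     /* builtins first, then -f */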

> + printf("transaction type: %s\n",
> +num_scripts == 1? sql_script[0].name: "multiple scripts");

Seems like it'd be more useful to simply always list the scripts +
weights here.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


  1   2   >