[HACKERS] Many processes blocked at ProcArrayLock

2014-12-02 Thread Xiaoyulei
Test configuration:
Hardware:
4-socket Intel server, 60 cores / 120 hardware threads.
Memory: 512 GB
SSD: 2.4 TB

PG:
max_connections = 160   # (change requires restart)
shared_buffers = 32GB   
work_mem = 128MB
maintenance_work_mem = 32MB 
bgwriter_delay = 100ms  # 10-10000ms between rounds
bgwriter_lru_maxpages = 200 # 0-1000 max buffers written/round
bgwriter_lru_multiplier = 2.0   # 0-10.0 multiplier on buffers scanned/round
wal_level = minimal # minimal, archive, or hot_standby
wal_buffers = 256MB # min 32kB, -1 sets based on shared_buffers
autovacuum = off
checkpoint_timeout=60min
checkpoint_segments = 1000
archive_mode = off
synchronous_commit = off
fsync = off
full_page_writes = off  


We used TPC-C and pgbench to test PostgreSQL 9.4beta2 performance, and found
that tps/tpmC does not increase as the number of terminals increases. The
detailed information is in the attachment.

Many processes are blocked. I dumped the call stacks and found that they are
blocked on ProcArrayLock: about 60% of the processes are in
ProcArrayEndTransaction waiting for ProcArrayLock in EXCLUSIVE mode, and about
20% are in GetSnapshotData waiting for ProcArrayLock in SHARED mode. Other
locks, such as those taken by XLogFlush and WALInsertLock, are not under heavy
contention.
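
For readers less familiar with this code path, the contention pattern looks
roughly like this (a simplified sketch of the relevant 9.4 functions, not a
verbatim excerpt; bodies elided):

/* Sketch: every transaction commit clears its state under an exclusive lock */
void
ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
{
    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);  /* serializes all committers */
    /* clear the proc's xid/xmin, advance latestCompletedXid, ... */
    LWLockRelease(ProcArrayLock);
}

/* ... while every snapshot scans the whole proc array under a shared lock */
Snapshot
GetSnapshotData(Snapshot snapshot)
{
    LWLockAcquire(ProcArrayLock, LW_SHARED);     /* blocked by any committer */
    /* copy the XIDs of all in-progress transactions into the snapshot */
    LWLockRelease(ProcArrayLock);
    return snapshot;
}

With many concurrent clients, the exclusive acquisitions at commit time and
the shared acquisitions at snapshot time queue up on the same lock, which
matches the stack traces described above.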

Is there any way to solve this problem?




Re: [HACKERS] using Core Foundation locale functions

2014-12-02 Thread Peter Geoghegan
On Fri, Nov 28, 2014 at 8:43 AM, Peter Eisentraut pete...@gmx.net wrote:
 At the moment, this is probably just an experiment that shows where
 refactoring and better abstractions might be suitable if we want to
 support multiple locale libraries.  If we want to pursue ICU, I think
 this could be a useful third option.

FWIW, I think that the richer API that ICU provides for string
transformations could be handy in optimizing sorting using abbreviated
keys. For example, ICU will happily only produce parts of sort keys
(the equivalent of strxfrm() blobs) if that is all that is required
[1].

I think that ICU also allows clients to parse individual primary
weights in a principled way (primary weights tend to be isomorphic to
the Unicode code points in the original string). I think that this
will enable order-preserving compression of the type anticipated by
the Unicode collation algorithm [2]. That could be useful for certain
languages, like Russian, where the primary weight level usually
contains multi-byte code points with glibc's strxfrm() (this is
generally not true of languages that use the Latin alphabet, or of
East Asian languages).

Note that there is already naturally a form of what you might call
compression with strxfrm() [3]. This is very useful for abbreviated
keys.
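
As a rough illustration of that point, an abbreviated key can be built from
just the leading bytes of a strxfrm() blob and compared with memcmp(); here
is a minimal standalone sketch (the 8-byte width and the fall-back-to-strcoll()
convention are assumptions for the example, not ICU or PostgreSQL code):

#include <locale.h>
#include <stdio.h>
#include <string.h>

#define ABBREV_LEN 8

/* Abbreviated key: the first ABBREV_LEN bytes of the strxfrm() blob. */
static void
abbreviate(const char *s, unsigned char out[ABBREV_LEN])
{
    char    buf[1024];          /* assumed large enough for these examples */

    memset(buf, 0, sizeof(buf));
    strxfrm(buf, s, sizeof(buf));
    memcpy(out, buf, ABBREV_LEN);
}

int
main(void)
{
    unsigned char a[ABBREV_LEN], b[ABBREV_LEN];

    setlocale(LC_COLLATE, "");  /* use the environment's collation */
    abbreviate("apple", a);
    abbreviate("apricot", b);

    /* Cheap comparison; equality here only means "tie", and a real sort
     * would have to fall back to strcoll() on the original strings. */
    printf("abbreviated comparison: %d\n", memcmp(a, b, ABBREV_LEN));
    return 0;
}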

[1] http://userguide.icu-project.org/collation/architecture
[2] http://www.unicode.org/reports/tr10/#Run-length_Compression
[3] 
http://www.postgresql.org/message-id/cam3swztywe5j69tapvzf2cm7mhskke3uhhnk9gluqckkwqo...@mail.gmail.com
-- 
Peter Geoghegan




Re: [HACKERS] excessive amounts of consumed memory (RSS), triggering OOM killer

2014-12-02 Thread Tomas Vondra

On 2014-12-02 02:52, Tom Lane wrote:

Tomas Vondra t...@fuzzy.cz writes:

On 2.12.2014 01:33, Tom Lane wrote:

What I suspect you're looking at here is the detritus of creation of
a huge number of memory contexts. mcxt.c keeps its own state about
existing contexts in TopMemoryContext. So, if we posit that those
contexts weren't real small, there's certainly room to believe that
there was some major memory bloat going on recently.



Aha! MemoryContextCreate allocates the memory for the new context from
TopMemoryContext explicitly, so that it survives resets of the parent
context. Is that what you had in mind by keeping state about existing
contexts?


Right.


That'd probably explain the TopMemoryContext size, because array_agg()
creates a separate context for each group. So if you have 1M groups, you
have 1M contexts. Although I don't see how the size of those contexts
would matter?


Well, if they're each 6K, that's your 6GB right there.


Yeah, but this memory should be freed after the query finishes, no?


Maybe we could move this info (list of child contexts) to the parent
context somehow, so that it's freed when the context is destroyed?


We intentionally didn't do that, because in many cases it'd result in
parent contexts becoming nonempty when otherwise they'd never have
anything actually in them.  The idea was that such shell parent contexts
should be cheap, requiring only a control block in TopMemoryContext and
not an actual allocation arena.  This idea of a million separate child
contexts was never part of the design of course; we might need to rethink
whether that's a good idea.  Or maybe there need to be two different
policies about where to put child control blocks.


Maybe. For me, the 130MB is not really a big deal, because for this to
happen there really needs to be many child contexts at the same time,
consuming much more memory. With 6.5GB consumed in total, 130MB amounts
to ~2% which is negligible. Unless we can fix the RSS bloat.

Also, this explains the TopMemoryContext size, but not the RSS size (or
am I missing something)?


Very possibly you're left with islands that prevent reclaiming very
much of the peak RAM usage.  It'd be hard to be sure without some sort
of memory map, of course.


Yes, that's something I was thinking about too - I believe what happens
is that allocations of info in TopMemoryContext and the actual contexts
are interleaved, and at the end only the memory contexts are deleted.
The blocks allocated in TopMemoryContext are kept, creating the islands.

If that's the case, allocating the child context info within the parent
context would solve this, because these pieces would be reclaimed with
the rest of the parent memory.

But then again, there are probably other ways to create such islands
(e.g. allocating an additional block in a long-lived context while the
child contexts exist).
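
To make the pattern concrete, here is a minimal sketch using the 9.4
memory-context API (aggcontext/ngroups are assumed names and this is
illustrative only, not the actual array_agg() code) of the kind of loop
that leaves one context header in TopMemoryContext per group:

static void
accumulate_groups_sketch(MemoryContext aggcontext, int ngroups)
{
    int     i;

    for (i = 0; i < ngroups; i++)
    {
        /* The header (control block) for this context is allocated in
         * TopMemoryContext by MemoryContextCreate(); the context's own
         * data comes from blocks owned by the context itself. */
        MemoryContext groupcxt =
            AllocSetContextCreate(aggcontext,
                                  "per-group state",
                                  ALLOCSET_DEFAULT_MINSIZE,
                                  ALLOCSET_DEFAULT_INITSIZE,
                                  ALLOCSET_DEFAULT_MAXSIZE);

        /* ... accumulate this group's values in groupcxt ... */
        (void) groupcxt;
    }
}

Deleting each per-group context frees its blocks and its header, but by then
the headers in TopMemoryContext and the per-group data blocks are interleaved
in the process heap, which is how the "islands" described above can arise.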

regards
Tomas




[HACKERS] Serialization exception : Who else was involved?

2014-12-02 Thread Olivier MATROT
Hello,

 

I'm using PostgreSQL 9.2.8 on Windows from a .NET application using
Npgsql.

I'm working in the Radiology Information System field.

 

We have thousands of users against a big accounting database.

We're using the SERIALIZABLE isolation level to ensure data consistency.

 

Because of the large number of users, and probably because of the
database design, we're facing serialization exceptions, and we retry our
transactions.

So far so good.

 

I was wondering if there was a log level in PostgreSQL that could tell
me which query was the trigger of a doomed transaction.

The goal is to understand the failures to improve the database and
application designs.

 

I pushed the logs to the DEBUG5 level with no luck.

 

After carefully reviewing the documentation, it seems that there was
nothing.

So I downloaded the code and looked at it.

 

Serialization conflict detection is done in
src/backend/storage/lmgr/predicate.c, where transactions that are doomed
to fail are marked as such with the SXACT_FLAG_DOOMED flag.

 

I simply added elog(...) calls at the NOTICE level each time the flag
is set, compiled the code, and gave it a try.
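
For reference, the kind of change being described looks roughly like this
(a sketch, not the submitted patch; the victim's pid field name is an
assumption):

/* In src/backend/storage/lmgr/predicate.c, wherever a victim transaction
 * is marked as doomed: */
sxact->flags |= SXACT_FLAG_DOOMED;
ereport(NOTICE,     /* a production version would more likely use LOG */
        (errmsg("serializable conflict: backend %d dooms transaction of backend %d",
                MyProcPid, sxact->pid)));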

 

The results are amazing for me, because this simple modification allows
me to know which query is marking other running transactions to fail.

I'm pretty sure that in the production environment of our major
customers, there should be no more than a few transactions involved.

 

I would like to see this useful and simple addition in a future version
of PostgreSQL.

Is this in the spirit of what is done to ease the work of the developer?

Maybe the level I've chosen is not appropriate?

 

Please let me know what you think.

 

Kind Regards.

 

Olivier.

 



Re: [HACKERS] Role Attribute Bitmask Catalog Representation

2014-12-02 Thread Stephen Frost
Adam,

* Adam Brightwell (adam.brightw...@crunchydatasolutions.com) wrote:
 Ok.  Though, this would affect how CATUPDATE is handled.  Peter Eisentraut
 previously raised a question about whether superuser checks should be
 included with catupdate which led me to create the following post.
 
 http://www.postgresql.org/message-id/cakrt6cqovt2kiykg2gff7h9k8+jvu1149zlb0extkkk7taq...@mail.gmail.com
 
 Certainly, we could keep has_rolcatupdate for this case and put the
 superuser check in role_has_attribute, but it seems like it might be worth
 taking a look at whether a superuser can bypass catupdate or not.  Just a
 thought.

My recollection matches the documentation- rolcatupdate should be
required to update the catalogs.  The fact that rolcatupdate is set by
AlterUser to match rolsuper is an interesting point and one which we
might want to reconsider, but that's beyond the scope of this patch.

Ergo, I'd suggest keeping has_rolcatupdate, but have it do the check
itself directly instead of calling down into role_has_attribute().

There's an interesting flip side to that though, which is the question
of what to do with pg_roles and psql.  Based on the discussion this far,
it seems like we'd want to keep the distinction for pg_roles and psql
based on what bits have explicitly been set rather than what's actually
checked for.  As such, I'd have one other function-
check_has_attribute() which *doesn't* have the superuser allow-all check
and is what is used in pg_roles and by psql.  I'd expose both functions
at the SQL level.
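
In other words, a rough sketch of the two entry points being proposed
(RoleAttr and the internal lookup helper are assumed names from the patch
discussion, not existing server functions):

static bool rolattr_is_set(Oid roleid, RoleAttr attribute);  /* reads the bitmask */

/* For actual permission checks: superusers bypass everything. */
bool
role_has_attribute(Oid roleid, RoleAttr attribute)
{
    if (superuser_arg(roleid))
        return true;
    return rolattr_is_set(roleid, attribute);
}

/* For pg_roles and psql: report only what has explicitly been set. */
bool
check_has_attribute(Oid roleid, RoleAttr attribute)
{
    return rolattr_is_set(roleid, attribute);
}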

 Ok.  I had originally thought for this patch that I would try to minimize
 these types of changes, though perhaps this is that opportunity previously
 mentioned in refactoring those.  However, the catupdate question still
 remains.

It makes sense to me, at least, to include removing those individual
attribute functions in this patch.

 I have no reason for one over the other, though I did ask myself that
 question.  I did find it curious that in some cases there is has_X and
 then in others pg_has_X.  Perhaps I'm not looking in the right places,
 but I haven't found anything that helps to distinguish when one vs the
 other is appropriate (even if it is a general rule of thumb).

Given that we're changing things anyway, it seems to me that the pg_
prefix makes sense.

 Yes, we were, however the latter causes a syntax error with initdb. :-/

Ok, then just stuff the 255 back there and add a comment about why it's
required and mention that cute tricks to calculate the value won't work.

Thanks!

Stephen




Re: [HACKERS] excessive amounts of consumed memory (RSS), triggering OOM killer

2014-12-02 Thread Tomas Vondra
On 2 December 2014 at 10:59, Tomas Vondra wrote:
 On 2014-12-02 02:52, Tom Lane wrote:
 Tomas Vondra t...@fuzzy.cz writes:

 Also, this explains the TopMemoryContext size, but not the RSS size
 (or am I missing something)?

 Very possibly you're left with islands that prevent reclaiming very
 much of the peak RAM usage.  It'd be hard to be sure without some sort
 of memory map, of course.

 Yes, that's something I was thinking about too - I believe what happens
 is that allocations of info in TopMemoryContext and the actual contexts
 are interleaved, and at the end only the memory contexts are deleted.
 The blocks allocated in TopMemoryContexts are kept, creating the
 islands.

 If that's the case, allocating the child context info within the parent
 context would solve this, because these pieces would be reclaimed with
 the rest of the parent memory.

On second thought, I'm not sure this explains the continuous increase of
consumed memory. When the first iteration consumes 2,818g of memory, why
should the following iterations consume significantly more? The allocation
patterns should be (almost) exactly the same, reusing the already
allocated memory (either at the system or TopMemoryContext level).

regards
Tomas





[HACKERS] pg_stat_statement normalization fails due to temporary tables

2014-12-02 Thread Andres Freund
Hi,

pg_stat_statement's query normalization fails when temporary tables are
used in a query. That's because JumbleRangeTable() uses the relid in the
RTE to build the query fingerprint. I think in this case pgss from 9.1
actually would do better than 9.2+ as the hash lookup previously didn't
use the relid.

I don't really have a good idea about fixing this though. The best thing
that comes to mind is to simply use eref->aliasname for the
disambiguation...
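
For context, the relevant part of JumbleRangeTable() looks roughly like this
(a simplified sketch of the 9.2+ pg_stat_statements code, with the name-based
alternative shown for comparison; not a patch):

/* Current behaviour: the table's OID goes into the query fingerprint,
 * so each session's temp table of the same name jumbles differently. */
case RTE_RELATION:
    APP_JUMB(rte->relid);
    break;

/* The alternative floated above would be something like: */
case RTE_RELATION:
    APP_JUMB_STRING(rte->eref->aliasname);   /* name-based; loses schema info */
    break;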

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Turning recovery.conf into GUCs

2014-12-02 Thread Alex Shulgin

Michael Paquier michael.paqu...@gmail.com writes:

 On Tue, Dec 2, 2014 at 1:58 AM, Alex Shulgin a...@commandprompt.com wrote:
 Here's the patch rebased against current HEAD, that is including the
 recently committed action_at_recovery_target option.
 If this patch gets in, it gives a good argument to jump to 10.0 IMO.
 That's not a bad thing, only the cost of making recovery params as
 GUCs which is still a feature wanted.

 The default for the new GUC is 'pause', as in HEAD, and
 pause_at_recovery_target is removed completely in favor of it.
 Makes sense. Another idea that popped out was to rename this parameter
 as recovery_target_action as well, but that's not really something
 this patch should care about.

Indeed, but changing the name after the fact is straightforward.

 I've also taken the liberty to remove that part that errors out when
 finding $PGDATA/recovery.conf.
 I am not in favor of this part. It may be better to let the users know
 that their old configuration is not valid anymore with an error. This
 patch cuts in the flesh with a huge axe, let's be sure that users do
 not ignore the side pain effects, or recovery.conf would be simply
 ignored and users would not be aware of that.

Yeah, that is a good point.

I'd be in favor of a solution that works the same way as before the
patch, without the need for extra trigger files, etc., but that doesn't
seem to be nearly possible.  Whatever tricks we might employ will likely
be defeated by the fact that the oldschool user will fail to *include*
recovery.conf in the main conf file.

--
Alex




Re: [HACKERS] [v9.5] Custom Plan API

2014-12-02 Thread Robert Haas
On Tue, Nov 25, 2014 at 3:44 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 Today I had a talk with Hanada-san to clarify which parts can be common
 between them and how to implement it. We concluded that both features can
 share most of the infrastructure.
 Let me give an introduction to join replacement by foreign-/custom-scan below.

 The overall design is to inject a foreign-/custom-scan node in place of
 the built-in join logic (based on the estimated cost). From the viewpoint of
 the core backend, it looks like a sub-query scan that joins the relations
 internally.

 What we need to do is below:

 (1) Add a hook add_paths_to_joinrel()
 It gives extensions (including FDW drivers and custom-scan providers) chance
 to add alternative paths towards a particular join of relations, using
 ForeignScanPath or CustomScanPath, if it can run instead of the built-in ones.

 (2) Informs the core backend varno/varattno mapping
 One thing we need to pay attention to is that a foreign-/custom-scan node
 that runs instead of the built-in join node must return a mixture of values
 coming from both relations. When an FDW driver fetches a remote record (or a
 record computed by an external computing resource), the most reasonable way
 is to store it in ecxt_scantuple of the ExprContext, then perform projection
 with varnodes that reference this slot.
 It needs an infrastructure that tracks the relationship between the original
 varnode and the alternative varno/varattno. We think it should be mapped to
 INDEX_VAR and a virtual attribute number to reference ecxt_scantuple
 naturally, and this infrastructure is quite helpful for both ForeignScan and
 CustomScan.
 We'd like to add a List *fdw_varmap / *custom_varmap variable to both plan
 nodes. It contains the list of the original Var nodes that shall be mapped
 to positions according to the list index (e.g., the first varnode becomes
 varno=INDEX_VAR and varattno=1).

 (3) Reverse mapping on EXPLAIN
 For EXPLAIN support, the above varnodes on the pseudo relation scan need to
 be resolved. All we need to do is initialize dpns->inner_tlist in
 set_deparse_planstate() according to the above mapping.

 (4) case of scanrelid == 0
 To skip opening/closing (foreign) tables, we need a mark that tells the
 backend not to initialize the scan node according to the table definition,
 but according to the pseudo varnode list.
 As the earlier custom-scan patch does, scanrelid == 0 is a straightforward
 mark to show that the scan node is not tied to a particular real relation.
 So, we also need to add special-case handling around the foreign-/custom-scan
 code.

 We expect the above changes are small enough to implement basic join
 push-down functionality (one that does not involve external computation of
 complicated expression nodes), but valuable to support in v9.5.

 Please comment on the proposition above.

I don't really have any technical comments on this design right at the
moment, but I think it's an important area where PostgreSQL needs to
make some progress sooner rather than later, so I hope that we can get
something committed in time for 9.5.
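
For reference, the hook sketched in (1) above might end up looking something
like this (hypothetical names mirroring the parameters of the existing
add_paths_to_joinrel(); not a committed API):

typedef void (*set_join_pathlist_hook_type) (PlannerInfo *root,
                                             RelOptInfo *joinrel,
                                             RelOptInfo *outerrel,
                                             RelOptInfo *innerrel,
                                             JoinType jointype,
                                             SpecialJoinInfo *sjinfo,
                                             List *restrictlist);

extern set_join_pathlist_hook_type set_join_pathlist_hook;

A custom-scan provider or FDW would set the hook from its _PG_init() and,
when it can beat the built-in join paths on cost, add a CustomScanPath or
ForeignScanPath to joinrel via add_path().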

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pg_stat_statement normalization fails due to temporary tables

2014-12-02 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 pg_stat_statement's query normalization fails when temporary tables are
 used in a query. That's because JumbleRangeTable() uses the relid in the
 RTE to build the query fingerprint. I think in this case pgss from 9.1
 actually would do better than 9.2+ as the hash lookup previously didn't
 use the relid.

 I don't really have a good idea about fixing this though. The best thing
 that comes to mind is to simply use eref->aliasname for the
 disambiguation...

Hmm ... by "fails" I suppose you mean "doesn't treat two different
instances of the same temp table name as the same"?  I'm not sure
that's a bug.

If we go over to using the aliasname then "select x from foo a" and
"select x from bar a" would be treated as the same query, which
clearly *is* a bug.  More generally, I don't think that schema1.tab
and schema2.tab should be considered the same table for this purpose.
So it's hard to see how to do much better than using the OID.

regards, tom lane




Re: [HACKERS] pg_stat_statement normalization fails due to temporary tables

2014-12-02 Thread Andres Freund
On 2014-12-02 09:59:00 -0500, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  pg_stat_statement's query normalization fails when temporary tables are
  used in a query. That's because JumbleRangeTable() uses the relid in the
  RTE to build the query fingerprint. I think in this case pgss from 9.1
  actually would do better than 9.2+ as the hash lookup previously didn't
  use the relid.
 
  I don't really have a good idea about fixing this though. The best thing
  that comes to mind is to simply use eref->aliasname for the
  disambiguation...
 
 Hmm ... by "fails" I suppose you mean "doesn't treat two different
 instances of the same temp table name as the same"?  I'm not sure
 that's a bug.

Well, it did work < 9.2...

 If we go over to using the aliasname then "select x from foo a" and
 "select x from bar a" would be treated as the same query, which
 clearly *is* a bug.

Yep. The fully qualified name + alias would probably work - but IIRC
isn't available in the RTE without further syscache lookups...

  More generally, I don't think that schema1.tab
 and schema2.tab should be considered the same table for this purpose.
 So it's hard to see how to do much better than using the OID.

Well, from a practical point of view it's highly annoying if half of
pg_stat_statement suddenly consists of the same query, just issued
in a different session.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] test_shm_mq failing on mips*

2014-12-02 Thread Robert Haas
On Tue, Nov 25, 2014 at 10:42 AM, Christoph Berg c...@df7cb.de wrote:
 Re: To Robert Haas 2014-11-24 20141124200824.ga22...@msg.df7cb.de
  Does it fail every time when run on a machine where it fails sometimes?

 So far there's a consistent host-to-fail-or-not mapping, but most
 mips/mipsel build hosts have seen only one attempt so far that
 actually got far enough to run the shm_mq test.

 I got the build rescheduled on the same machine and it's hanging
 again.

 Atm I don't have access to the boxes where it was failing (the builds
 succeed on the mips(el) porter hosts available to Debian developers).
 I'll see if I can arrange access there and run a test.

 Julien Cristau was so kind to poke into the hanging processes. The
 build has been stuck now for about 4h, while that postgres backend has
 only consumed 4s of CPU time (according to plain ps). The currently
 executing query is:

 SELECT test_shm_mq_pipelined(16384, (select 
 string_agg(chr(32+(random()*95)::int), '') from generate_series(1,27)), 
 200, 3);

 (Waiting f, active, backend_start 6s older than xact_start, xact_start
 same as query_start, state_change 19µs newer, all 4h old)

I can't tell from this exactly what's going wrong.  Questions:

1. Are there any background worker processes running at the same time?
 If so, how many?  We'd expect to see 3.
2. Can we get a printout of the following variables in stack frame 3
(test_shm_mq_pipelined)?  send_count, loop_count, *outqh, *inqh,
inqh->mqh_queue[0], outqh->mqh_queue[0]
3. What does a backtrace of each background worker process look like?
If they are stuck inside copy_messages(), can you print *outqh, *inqh,
inqh->mqh_queue[0], outqh->mqh_queue[0] from that stack frame?

Sorry for the hassle; I just don't have a better idea how to debug this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Nitpicky doc corrections for BRIN functions of pageinspect

2014-12-02 Thread Alvaro Herrera
Peter Geoghegan wrote:
 On Wed, Nov 19, 2014 at 8:02 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
  Just a small thing I noticed while looking at pageinspect.sgml, the
  set of SQL examples related to BRIN indexes uses lower-case characters
  for reserved keywords. This has been introduced by 7516f52.
  Patch is attached.
 
 My nitpicky observation about pageinspect + BRIN was that the comments
 added to the pageinspect SQL file for the SQL-callable function
 brin_revmap_data() have a comment header style slightly inconsistent
 with the existing items.

Pushed.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] On partitioning

2014-12-02 Thread Robert Haas
On Tue, Nov 25, 2014 at 8:20 PM, Amit Langote
langote_amit...@lab.ntt.co.jp wrote:
 Before going too much further with this I'd mock up schemas for your
 proposed catalogs and a list of DDL operations to be supported, with
 the corresponding syntax, and float that here for comment.

More people should really comment on this.  This is a pretty big deal
if it goes forward, so it shouldn't be based on what one or two people
think.

 * Catalog schema:

 CREATE TABLE pg_catalog.pg_partitioned_rel
 (
    partrelid   oid        NOT NULL,
    partkind    oid        NOT NULL,
    partissub   bool       NOT NULL,
    partkey     int2vector NOT NULL, -- partitioning attributes
    partopclass oidvector,

    PRIMARY KEY (partrelid, partissub),
    FOREIGN KEY (partrelid)   REFERENCES pg_class (oid),
    FOREIGN KEY (partopclass) REFERENCES pg_opclass (oid)
 )
 WITHOUT OIDS ;

So, we're going to support exactly two levels of partitioning?
partitions with partissub=false and subpartitions with partissub=true?
 Why not support only one level of partitioning here but then let the
children have their own pg_partitioned_rel entries if they are
subpartitioned?  That seems like a cleaner design and lets us support
an arbitrary number of partitioning levels if we ever need them.

 CREATE TABLE pg_catalog.pg_partition_def
 (
partitionid  oid NOT NULL,
partitionparentrel   oid   NOT NULL,
partitionisoverflow bool  NOT NULL,
partitionvalues anyarray,

PRIMARY KEY (partitionid),
FOREIGN KEY (partitionid) REFERENCES pg_class(oid)
 )
 WITHOUT OIDS;

 ALTER TABLE pg_catalog.pg_class ADD COLUMN relispartitioned;

What is an overflow partition and why do we want that?

What are you going to do if the partitioning key has two columns of
different data types?

 * DDL syntax (no multi-column partitioning, sub-partitioning support as yet):

 -- create partitioned table and child partitions at once.
 CREATE TABLE parent (...)
 PARTITION BY [ RANGE | LIST ] (key_column) [ opclass ]
 [ (
  PARTITION child
{
VALUES LESS THAN { ... | MAXVALUE } -- for RANGE
  | VALUES [ IN ] ( { ... | DEFAULT } ) -- for LIST
}
[ WITH ( ... ) ] [ TABLESPACE tbs ]
  [, ...]
   ) ] ;

How are you going to dump and restore this, bearing in mind that you
have to preserve a bunch of OIDs across pg_upgrade?  What if somebody
wants to do pg_dump --table name_of_a_partition?

I actually think it will be much cleaner to declare the parent first
and then have separate CREATE TABLE statements that glue the children
in, like CREATE TABLE child PARTITION OF parent VALUES LESS THAN (1,
1).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] superuser() shortcuts

2014-12-02 Thread Robert Haas
On Wed, Nov 26, 2014 at 10:12 AM, Stephen Frost sfr...@snowman.net wrote:
 * Tom Lane (t...@sss.pgh.pa.us) wrote:
 In the context at hand, I think most of the messages in question are
 currently phrased like must be superuser to do X.  I'd be fine with
 changing that to permission denied to do X, but not to just
 permission denied.

 Apologies for the terseness of my (earlier) reply.  This is exactly what
 I'm suggesting.  What was in the patch was this change:

 ! ERROR:  must be superuser or replication role to use replication slots

 ---

 ! ERROR:  permission denied to use replication slots
 ! HINT:  You must be superuser or replication role to use replication slots.

Your proposed change takes two lines to convey the same amount of
information that we are currently conveying in one line.  How is that
better?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] 9.2 recovery/startup problems

2014-12-02 Thread Robert Haas
On Wed, Nov 26, 2014 at 7:13 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 If I do a pg_ctl stop -mf, then both files go away.  If I do a pg_ctl stop
 -mi, then neither goes away.  It is only with the /sbin/reboot that I get
 the fatal combination of _init being gone but the other still present.

Eh?  That sounds wonky.

I mean, reboot normally kills processes with SIGTERM or SIGKILL, in
which case I'd expect the outcome to match what you get with pg_ctl
stop -mf or pg_ctl stop -mi.  The only way I can see that you'd get a
different behavior is if you did a hard reboot (like echo b >
/proc/sysrq-trigger); if that changes things, then we might have a
missing-fsync bug.  How is that reboot managing to leave the main fork
behind while losing the init fork?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] memory explosion on planning complex query

2014-12-02 Thread Robert Haas
On Wed, Nov 26, 2014 at 7:24 PM, Andrew Dunstan and...@dunslane.net wrote:
 On 11/26/2014 05:00 PM, Andrew Dunstan wrote:
 Attached is some anonymized DDL for a fairly complex schema from a
 PostgreSQL Experts client. Also attached is an explain query that runs
 against the schema. The client's problem is that in trying to run the
 explain, Postgres simply runs out of memory. On my untuned 9.3 test rig,
 (Scientific Linux 6.4 with 24Gb of RAM and 24Gb of swap) vmstat clearly
 shows the explain chewing up about 7Gb of memory. When it's done the free
 memory jumps back to where it was. On a similar case on the clients test rig
 we saw memory use jump lots more.

 The client's question is whether this is not a bug. It certainly seems
 like it should be possible to plan a query without chewing up this much
 memory, or at least to be able to limit the amount of memory that can be
 grabbed during planning. Going from humming along happily to OOM conditions
 all through running explain somequery is not very friendly.


 Further data point - thanks to Andrew Gierth (a.k.a. RhodiumToad) for
 pointing this out. The query itself grabs about 600Mb to 700Mb to run,
 whereas the EXPLAIN takes vastly more - on my system 10 times more. Surely
 that's not supposed to happen?

Hmm.  So you can run the query but you can't EXPLAIN it?

That sounds like it could well be a bug, but I'm thinking you might
have to instrument palloc() to find out where all of that space is
being allocated to figure out why it's happening - or maybe connect
gdb to the server while the EXPLAIN is chewing up memory and pull some
backtraces to figure out what section of code it's stuck in.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] dblink_get_connections() result for no connections

2014-12-02 Thread Robert Haas
On Tue, Nov 25, 2014 at 3:49 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 While fooling around with the array_agg(anyarray) patch, I noted
 that dblink_get_connections() returns NULL if there are no active
 connections.  It seems like an empty array might be a saner
 definition --- thoughts?

I don't think it matters much.  I wouldn't break backward
compatibility to change it from this behavior to that one, and if we'd
picked the other one I wouldn't break compatibility to change it to
this behavior.  Anybody who cares can wrap a COALESCE() around it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] copy.c handling for RLS is insecure

2014-12-02 Thread Robert Haas
On Thu, Nov 27, 2014 at 2:03 AM, Stephen Frost sfr...@snowman.net wrote:
 Alright, I've done the change to use the RangeVar from CopyStmt, but
 also added a check wherein we verify that the relation's OID returned
 from the planned query is the same as the relation's OID that we did the
 RLS check on- if they're different, we throw an error.  Please let me
 know if there are any remaining concerns.

That's clearly an improvement, but I'm not sure it's water-tight.
What if the name that originally referenced a table ended up
referencing a view?  Then you could get
list_length(plan->relationOids) != 1.

(And, in that case, I also wonder if you could get
eval_const_expressions() to do evil things on your behalf while
planning.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] test_shm_mq failing on mips*

2014-12-02 Thread Christoph Berg
Re: Robert Haas 2014-12-02 
CA+Tgmoa9HE=1GObU7MKuGLDBjBNSNRW0bDE4P3JA=P=9mqg...@mail.gmail.com
 I can't tell from this exactly what's going wrong.  Questions:
 
 1. Are there any background worker processes running at the same time?
  If so, how many?  We'd expect to see 3.
 2. Can we get a printout of the following variables in stack frame 3
 (test_shm_mq_pipelined)?  send_count, loop_count, *outqh, *inqh,
 inqh->mqh_queue[0], outqh->mqh_queue[0]
 3. What does a backtrace of each background worker process look like?
 If they are stuck inside copy_messages(), can you print *outqh, *inqh,
 inqh->mqh_queue[0], outqh->mqh_queue[0] from that stack frame?
 
 Sorry for the hassle; I just don't have a better idea how to debug this.

No problem, I'm aware this isn't an easy target.

I didn't have access to the build env myself, I'm not sure if I can
find someone to retry the test there. I'll let you know when I have
more data, but that will take a while (if at all :-/).

Christoph
-- 
c...@df7cb.de | http://www.df7cb.de/




Re: [HACKERS] initdb: Improve error recovery.

2014-12-02 Thread Robert Haas
On Thu, Nov 27, 2014 at 7:28 AM, Mats Erik Andersson b...@gisladisker.se 
wrote:
 I would like to improve error recovery of initdb when the
 password file is empty. The present code declares "Error 0"
 as the cause of failure. It suffices to use ferror(), since
 fgets() returns NULL also at a premature EOF.

 In addition, a minor point: a line feed is needed in order
 to print the error message on a line of its own, which
 seems desirable.

 Best regards,
   Mats Erik Andersson
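
A minimal standalone sketch of the distinction being described (illustrative
only, not the submitted initdb patch):

#include <errno.h>
#include <stdio.h>
#include <string.h>

int
main(int argc, char **argv)
{
    char    line[1024];
    FILE   *fp;

    if (argc < 2)
    {
        fprintf(stderr, "usage: %s PASSWORD_FILE\n", argv[0]);
        return 1;
    }
    if ((fp = fopen(argv[1], "r")) == NULL)
    {
        fprintf(stderr, "could not open %s: %s\n", argv[1], strerror(errno));
        return 1;
    }
    if (fgets(line, sizeof(line), fp) == NULL)
    {
        /* fgets() returns NULL on both error and EOF; only ferror()
         * distinguishes a real read error from an empty file.  The leading
         * newline keeps the message on a line of its own. */
        if (ferror(fp))
            fprintf(stderr, "\ncould not read password from file: %s\n",
                    strerror(errno));
        else
            fprintf(stderr, "\npassword file is empty\n");
        return 1;
    }
    fclose(fp);
    return 0;
}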

Please add your patch here so that it doesn't get forgotten about:

https://commitfest.postgresql.org/action/commitfest_view/open

Also, note that the PostgreSQL project prefers for patches to be
attached rather than inline, as mailers have a tendency to mangle
them.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] superuser() shortcuts

2014-12-02 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
 On Wed, Nov 26, 2014 at 10:12 AM, Stephen Frost sfr...@snowman.net wrote:
  * Tom Lane (t...@sss.pgh.pa.us) wrote:
  In the context at hand, I think most of the messages in question are
  currently phrased like must be superuser to do X.  I'd be fine with
  changing that to permission denied to do X, but not to just
  permission denied.
 
  Apologies for the terseness of my (earlier) reply.  This is exactly what
  I'm suggesting.  What was in the patch was this change:
 
  ! ERROR:  must be superuser or replication role to use replication slots
 
  ---
 
  ! ERROR:  permission denied to use replication slots
  ! HINT:  You must be superuser or replication role to use replication slots.
 
 Your proposed change takes two lines to convey the same amount of
 information that we are currently conveying in one line.  How is that
 better?

It includes the actual error message, unlike what we have today, which is
guidance as to what's required to get past the permission-denied error.

In other words, I disagree that the same amount of information is being
conveyed.
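
As backend code, the two styles under discussion look roughly like this (a
sketch using the standard ereport() idiom):

/* current style: the requirement is the whole message */
ereport(ERROR,
        (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
         errmsg("must be superuser or replication role to use replication slots")));

/* proposed style: generic message plus a hint naming the requirement */
ereport(ERROR,
        (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
         errmsg("permission denied to use replication slots"),
         errhint("You must be superuser or replication role to use replication slots.")));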

Thanks,

Stephen




Re: [HACKERS] copy.c handling for RLS is insecure

2014-12-02 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
 On Thu, Nov 27, 2014 at 2:03 AM, Stephen Frost sfr...@snowman.net wrote:
  Alright, I've done the change to use the RangeVar from CopyStmt, but
  also added a check wherein we verify that the relation's OID returned
  from the planned query is the same as the relation's OID that we did the
  RLS check on- if they're different, we throw an error.  Please let me
  know if there are any remaining concerns.
 
 That's clearly an improvement, but I'm not sure it's water-tight.
 What if the name that originally referenced a table ended up
 referencing a view?  Then you could get
 list_length(plan->relationOids) != 1.

I'll test it out and see what happens.  Certainly a good question and
if there's an issue there then I'll get it addressed.

 (And, in that case, I also wonder if you could get
 eval_const_expressions() to do evil things on your behalf while
 planning.)

If it can be made to reference a view then there's an issue, as the view
might itself include a function call provided by the attacker.
I'm not sure that we have to really worry about anything more
complicated than that.

Clearly, if we found a relation originally then we need that same
relation with the same OID after the conversion to a query.
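
The kind of cross-check being discussed might look roughly like this (a
sketch; plan is assumed to be the PlannedStmt built for the rewritten COPY
query and relid the OID the RLS check was performed against):

if (list_length(plan->relationOids) != 1 ||
    linitial_oid(plan->relationOids) != relid)
    ereport(ERROR,
            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
             errmsg("relation referenced by COPY statement has changed")));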

Thanks,

Stephen




Re: [HACKERS] superuser() shortcuts

2014-12-02 Thread Robert Haas
On Tue, Dec 2, 2014 at 11:29 AM, Stephen Frost sfr...@snowman.net wrote:
 * Robert Haas (robertmh...@gmail.com) wrote:
 On Wed, Nov 26, 2014 at 10:12 AM, Stephen Frost sfr...@snowman.net wrote:
  * Tom Lane (t...@sss.pgh.pa.us) wrote:
  In the context at hand, I think most of the messages in question are
  currently phrased like must be superuser to do X.  I'd be fine with
  changing that to permission denied to do X, but not to just
  permission denied.
 
  Apologies for the terseness of my (earlier) reply.  This is exactly what
  I'm suggesting.  What was in the patch was this change:
 
  ! ERROR:  must be superuser or replication role to use replication slots
 
  ---
 
  ! ERROR:  permission denied to use replication slots
  ! HINT:  You must be superuser or replication role to use replication 
  slots.

 Your proposed change takes two lines to convey the same amount of
 information that we are currently conveying in one line.  How is that
 better?

 It includes the actual error message, unlike what we have today which is
 guidence as to what's required to get past the permission denied error.

 In other words, I disagree that the same amount of information is being
 conveyed.

So, you think that someone might see the message "must be superuser or
replication role to use replication slots" and fail to understand that
they have a permissions problem?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Add CREATE support to event triggers

2014-12-02 Thread Robert Haas
 Normally you would replicate between an older master and a newer
 replica, so this shouldn't be an issue.  I find it unlikely that we
 would de-support some syntax that works in an older version: it would
 break pg_dump, for one thing.

The most common way in which we break forward-compatibility is by
reserving additional keywords.  Logical replication can deal with this
by quoting all identifiers, so it's not a big issue.

It's not the only possible issue; it is certainly conceivable that we
could change something in the server and then try to change pg_dump to
compensate.  I think we wouldn't do that with anything very big, but
suppose spgist were supplanted by a new and better access method
tsigps.  Well, we might figure that there are few enough people
accustomed to the current syntax that we can get away with hacking the
pg_dump output, and yes, anyone with an existing dump from an older
system will have problems, but there won't be many of them, and they
can always adjust the dump by hand.  If we ever decided to do such a
thing, whatever syntax transformation we did in pg_dump would have to
be mimicked by any logical replication solution.

I think it's legitimate to say that this could be a problem, but I
think it probably won't be a problem very often, and I think when it
is a problem it probably won't be a very big problem, because we
already have good reasons not to break the ability to restore older
dumps on newer servers more often than absolutely necessary.

One of the somewhat disappointing things about this is that we're
talking about adding a lot of new deparsing code to the server that
probably, in some sense, duplicates pg_dump.  Perhaps in an ideal
world there would be a way to avoid that, but in the world we really
live in, there probably isn't.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [PATCH] HINT: pg_hba.conf changed since last config reload

2014-12-02 Thread Robert Haas
On Thu, Nov 27, 2014 at 8:49 AM, Bruce Momjian br...@momjian.us wrote:
 On Thu, Nov  6, 2014 at 05:46:42PM -0500, Peter Eisentraut wrote:
 Finally, the fact that a configuration change is in progress is
 privileged information.  Unprivileged users can deduce from the presence
 of this message that administrators are doing something, and possibly
 that they have done something wrong.

 I think it's fine to log a message in the server log if the pg_hba.conf
 file needs reloading.  But the client shouldn't know about this at all.

 Should we do this for postgresql.conf too?

It doesn't really make sense; or at least, the exact same thing
doesn't make any sense.  If an authentication attempt fails
unexpectedly, it might be because of a pg_hba.conf change that wasn't
reloaded, so it makes sense to try to tip off the DBA.  But it can't
really be because of a pending postgresql.conf change that hasn't been
reloaded, because those don't generally affect authentication.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] 9.2 recovery/startup problems

2014-12-02 Thread Jeff Janes
On Tue, Dec 2, 2014 at 7:41 AM, Robert Haas robertmh...@gmail.com wrote:

 On Wed, Nov 26, 2014 at 7:13 PM, Jeff Janes jeff.ja...@gmail.com wrote:
  If I do a pg_ctl stop -mf, then both files go away.  If I do a pg_ctl
 stop
  -mi, then neither goes away.  It is only with the /sbin/reboot that I get
  the fatal combination of _init being gone but the other still present.

 Eh?  That sounds wonky.

 I mean, reboot normally kills processes with SIGTERM or SIGKILL, in
 which case I'd expect the outcome to match what you get with pg_ctl
 stop -mf or pg_ctl stop -mi.  The only way I can see that you'd get a
 different behavior is if you did a hard reboot (like echo b 
 /proc/sysrq-trigger); if that changes things, then we might have a
 missing-fsync bug.  How is that reboot managing to leave the main fork
 behind while losing the init fork?


During abort processing after getting a SIGTERM, the back end truncates
59288 to zero size, and unlinks all the other files
(including 59288_init).  The actual removal of 59288 is left until the
checkpoint.  So if you SIGTERM the backend, then take down the server
uncleanly before the next checkpoint completes, you are left with just
59288.

Here is the strace:

open("base/16416/59288", O_RDWR)        = 8
ftruncate(8, 0)                         = 0
close(8)                                = 0
unlink("base/16416/59288.1")            = -1 ENOENT (No such file or directory)
unlink("base/16416/59288_fsm")          = -1 ENOENT (No such file or directory)
unlink("base/16416/59288_vm")           = -1 ENOENT (No such file or directory)
unlink("base/16416/59288_init")         = 0
unlink("base/16416/59288_init.1")       = -1 ENOENT (No such file or directory)

Cheers,

Jeff


Re: [HACKERS] why is PG_AUTOCONF_FILENAME in pg_config_manual.h?

2014-12-02 Thread Robert Haas
On Thu, Nov 27, 2014 at 12:56 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, Nov 27, 2014 at 11:58 PM, Peter Eisentraut pete...@gmx.net wrote:
 Surely that's not a value that we expect users to be able to edit.  Is
 pg_config_manual.h just abused as a place that's included everywhere?

 (I suggest utils/guc.h as a better place.)

 +1

+1

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] Report search_path value back to the client.

2014-12-02 Thread Alexander Kukushkin
Hi,

As of now postgres reports only the really important variables, but among
them there are also some not-so-important ones, like 'application_name'. The
only reason to report it was to help pgbouncer, and perhaps other connection
poolers. The change was introduced by commit
59ed94ad0c9f74a3f057f359316c845cedc4461e.

This makes me wonder why the 'search_path' value is not reported back to the
client. The use case is essentially the same as with 'application_name', but
a little more important.

Our databases provide different versions of stored procedures, which are
located in different schemas: 'api_version1', 'api_version2', 'api_version5',
etc. When an application establishes a connection to the database, it sets
search_path to, for example, api_version1. At the same time, a newer version
of the same application will set search_path to api_version2. Sometimes we
have hundreds of application instances that use different versions of the
stored procedures located in different schemas.

It's really crazy to keep so many (hundreds of) connections to the database,
and it would be much better to have something like pgbouncer in front of
postgres. Right now that's not possible, because pgbouncer is not aware of
search_path, and it's not really possible to fix pgbouncer because there is
no easy way to get the current value of search_path.

I would like to mark 'search_path' as GUC_REPORT:
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2904,7 +2904,7 @@ static struct config_string ConfigureNamesString[] =
{search_path, PGC_USERSET, CLIENT_CONN_STATEMENT,
gettext_noop(Sets the schema search order for
names that are not schema-qualified.),
NULL,
-   GUC_LIST_INPUT | GUC_LIST_QUOTE
+   GUC_LIST_INPUT | GUC_LIST_QUOTE | GUC_REPORT
},
namespace_search_path,
\$user\,public,
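
With GUC_REPORT set, the server sends a ParameterStatus message whenever
search_path changes, and a libpq-based pooler could pick it up the same way
it already picks up application_name; a small sketch (conn is an assumed
open PGconn):

const char *path = PQparameterStatus(conn, "search_path");

if (path != NULL)
{
    /* the pooler can remember this and replay SET search_path = ...
     * when the server connection is handed to a different client */
    printf("server reported search_path = %s\n", path);
}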

What do you think?

Regards,
--
Alexander Kukushkin


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-12-02 Thread Robert Haas
On Wed, Nov 26, 2014 at 11:00 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Wed, Nov 26, 2014 at 8:27 PM, Syed, Rahila rahila.s...@nttdata.com wrote:
 Don't we need to initialize doPageCompression  similar to doPageWrites in 
 InitXLOGAccess?
 Yep, you're right. I missed this code path.

 Also , in the earlier patches compression was set 'on' even when fpw GUC is 
 'off'. This was to facilitate compression of FPW which are forcibly written 
 even when fpw GUC is turned off.
  doPageCompression in this patch is set to true only if value of fpw GUC is 
 'compress'. I think its better to compress forcibly written full page writes.
 Meh? (stealing a famous quote).
 This is backward-incompatible in the fact that forcibly-written FPWs
 would be compressed all the time, even if FPW is set to off. The
 documentation of the previous patches also mentioned that images are
 compressed only if this parameter value is switched to compress.

If we have a separate GUC to determine whether to do compression of
full page writes, then it seems like that parameter ought to apply
regardless of WHY we are doing full page writes, which might be either
that full_page_writes=on in general, or that we've temporarily turned
them on for the duration of a full backup.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Testing DDL deparsing support

2014-12-02 Thread Robert Haas
On Thu, Nov 27, 2014 at 11:43 PM, Ian Barwick i...@2ndquadrant.com wrote:
 DDL deparsing is a feature that allows collection of DDL commands as they
 are
 executed in a server, in some flexible, complete, and fully-contained format
 that allows manipulation, storage, and transmission.  This feature has
 several
 use cases; the two best known ones are DDL replication and DDL auditing.

 We have come up with a design that uses a JSON structure to store commands.
 It is similar to the C sprintf() call in spirit: there is a base format
 string, which is a generic template for each command type, and contains
 placeholders that represent the variable parts of the command.  The values
 for
 the placeholders in each specific command are members of the JSON object.  A
 helper function is provided that expands the format string and replaces the
 placeholders with the values, and returns the SQL command as text.  This
 design lets the user operate on the JSON structure in either a read-only
 fashion (for example to block table creation if the names don't satisfy a
 certain condition), or by modifying it (for example, to change the schema
 name
 so that tables are created in different schemas when they are replicated to
 some remote server).

 This design is mostly accepted by the community.  The one sticking point is
 testing: how do we ensure that the JSON representation we have created
 correctly deparses back into a command that has the same effect as the
 original command.  This was expressed by Robert Haas:
 http://www.postgresql.org/message-id/CA+TgmoZ=vzrijmxlkqi_v0jg4k4leapmwusc6rwxs5mquxu...@mail.gmail.com

 The problem cannot be solved by a standard regression test module which runs
 a
 bunch of previously-defined commands and verifies the output.  We need not
 only check the output for the commands as they exist today, but also we need
 to ensure that this does not get broken as future patches modify the
 existing
 commands as well as create completely new commands.

 The challenge here is to create a new testing framework that ensures the DDL
 deparsing module will be maintained by future hackers as the DDL grammar is
 modified.

 What and How to Test
 

 Our goal should be that patch authors run make check-world in their
 patched
 copies and notice that the DDL deparse test is failing; they can then modify
 deparse_utility.c to add support for the new commands, which should in
 general
 be pretty straightforward.  This way, maintaining deparsing code would be
 part
 of new patches just like we require pg_dump support and documentation for
 new
 features.

 It would not work to require patch authors to add their new commands to a
 new
 pg_regress test file, because most would not be aware of the need, or they
 would just forget to do it, and patches would be submitted and possibly even
 committed without any realization of the breakage caused.

 There are two things we can rely on: standard regression tests, and pg_dump.

 Standard regression tests are helpful because patch authors include new test
 cases in the patches that stress their new options or commands.  This new
 test
 framework needs to be something that internally runs the regression tests
 and
 exercises DDL deparsing as the regression tests execute DDL.  That would
 mean
 that new commands and options would automatically be deparse-tested by our
 new
 framework as soon as patch authors add standard regress support.

 One thing is first-grade testing, that is ensure that the deparsed version
 of
 a DDL command can be executed at all, for if the deparsed version throws an
 error, it's immediately obvious that the deparse code is bogus.  This is
 because we know the original command did not throw an error: otherwise, the
 deparse code would not have run at all, because ddl_command_end triggers are
 only executed once the original command has completed execution.  So
 first-grade testing ensures that no trivial bugs are present.

 But there's second-grade verification as well: is the object produced by the
 deparsed version identical to the one produced by the original command?  One
 trivial but incomplete approach is to run the command, then save the
 deparsed
 version; run the deparsed version, and deparse that one too; compare both.
 The problem with this approach is that if the deparse code is omitting some
 clause (say it omits IN TABLESPACE in a CREATE TABLE command), then both
 deparsed versions would contain the same bug yet they would compare equal.
 Therefore this approach is not good enough.

 The best idea we have so far to attack second-grade testing is to trust
 pg_dump to expose differences: accumulate commands as they run in the
 regression database, then run the deparsed versions in a different database;
 then pg_dump both databases and compare the dumped outputs.

 Proof-of-concept
 

 We have now implemented this as a proof-of-concept; the code is available
 in the deparse branch at:

   

Re: [HACKERS] Using pg_rewind for differential backup

2014-12-02 Thread Robert Haas
On Fri, Nov 28, 2014 at 2:49 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 It also would be quite straightforward to write a separate tool to do just
 that. Would be better than conflating pg_rewind with this. You could use
 pg_rewind as the basis for it - it's under the same license as PostgreSQL.

If we had such a tool in core, would that completely solve the
differential backup problem, or would more be needed?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Fillfactor for GIN indexes

2014-12-02 Thread Robert Haas
On Fri, Nov 28, 2014 at 4:27 AM, Alexander Korotkov
aekorot...@gmail.com wrote:
 On Fri, Nov 21, 2014 at 8:12 AM, Michael Paquier michael.paqu...@gmail.com
 wrote:
 Please find attached a simple patch adding fillfactor as storage parameter
 for GIN indexes. The default value is the same as the one currently aka 100
 to have the pages completely packed when a GIN index is created.


 That's not true. Let us discuss it a little bit.
 In the btree access method, the index is created by sorting index tuples.
 Once they are sorted it can write a tree with any fullness of the pages.
 With completely filled pages you will get a flood of index page splits on
 table updates or inserts. In order to avoid that, the default fillfactor is
 90. Besides index creation, the btree access method tries to follow the
 given fillfactor when inserting ordered data: when the rightmost page
 splits, fillfactor percent of the data goes to the left page.
 A btree page's life cycle runs between being freshly created by a split
 (50% filled) and being fully packed (100% filled) and then split again.
 Thus we can roughly estimate the average fullness of a btree page to be
 75%. This "equilibrium" state is reached after a huge number of inserts
 into the btree index.

 Fillfactor was later spread to other access methods, for instance GiST. In
 GiST the tree is always constructed by page splits, so a freshly created GiST
 index is already in its equilibrium state. Moreover, GiST doesn't split pages
 in halves; the split ratio depends on the opclass. So the equilibrium fullness
 of a particular GiST index is even less than 75%. What does fillfactor do for
 GiST? It makes GiST pages split at fillfactor fullness during index build, so
 GiST ends up even less full. But why? You don't get a flood of splits on a
 fresh GiST index anyway, because it's in its equilibrium state. That's why I
 would rather delete fillfactor from GiST until we have some algorithm to
 construct GiST without page splits.

 What is going on with GIN? GIN is, basically, a two-level nested btree, so
 fillfactor should be applicable here. But GIN is not constructed like
 btree...
 GIN accumulates maintenance_work_mem of data and then inserts it into the
 tree. So a fresh GIN index is the result of inserting maintenance_work_mem
 buckets, and each bucket is sorted. On a small index (or with a large
 maintenance_work_mem) there would be only one bucket. GIN has two levels of
 btree: the entry tree and the posting tree. Currently the entry tree has no
 special logic to control fullness during index build. So with only one bucket
 the entry tree will be 50% filled. With many buckets the entry tree will be at
 its equilibrium state. In some specific cases (about two buckets) the entry
 tree can be almost fully packed. Insertion into the posting tree is always
 ordered during index build because the posting tree is a tree of TIDs. When a
 posting tree page splits it leaves the left page fully packed during index
 build and 75% packed during insertion (because it expects some sequential
 inserts).

 The idea of the GIN build algorithm is that the entry tree is expected to be
 relatively small and fit in cache, and insertions into posting trees are
 ordered. Thus this algorithm should be IO efficient and have low overhead.
 However, with large entry trees this algorithm can perform much worse.

 My summary is the following:
 1) In order to have fully correct support of fillfactor in GIN we need to
 rewrite the GIN build algorithm.
 2) Without rewriting the GIN build algorithm, not much can be done with the
 entry tree. However, you can implement some heuristics.
 3) You definitely need to touch the code that selects the split ratio in
 dataPlaceToPageLeaf (starting with if (!btree->isBuild)).
 4) GIN data pages are always compressed except in pg_upgraded indexes from
 pre-9.4. Take care about that in the following code.
   if (GinPageIsCompressed(page))
   freespace = GinDataLeafPageGetFreeSpace(page);
 + else if (btree->isBuild)
 + freespace = BLCKSZ * (100 - fillfactor) / 100;

This is a very interesting explanation; thanks for writing it up!

It does leave me wondering why anyone would want fillfactor for GIN at
all, even if they were willing to rewrite the build algorithm.
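
For reference, the interface under discussion is just the ordinary reloptions
syntax; assuming the patch follows the existing btree/GiST precedent, usage
would look like the sketch below (the table, column, and the value 90 are
made-up examples, and stock 9.4 GIN has no such option):

    CREATE INDEX idx_docs_tsv ON docs USING gin (tsv) WITH (fillfactor = 90);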

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] patch : Allow toast tables to be moved to a different tablespace

2014-12-02 Thread Robert Haas
On Fri, Nov 28, 2014 at 11:38 AM, Alex Shulgin a...@commandprompt.com wrote:
 Should I add this patch to the next CommitFest?

If you don't want it to get forgotten about, yes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Add shutdown_at_recovery_target option to recovery.conf

2014-12-02 Thread Robert Haas
On Fri, Nov 28, 2014 at 11:59 AM, Petr Jelinek p...@2ndquadrant.com wrote:
 I'm a bit late to the party, but wouldn't

 recovery_target_action = ...

 have been a better name for this? It'd be in line with the other
 recovery_target_* parameters, and also a bit shorter than the imho
 somewhat ugly action_at_recovery_target.

 FWIW, I too think that recovery_target_action is a better name.

 I agree.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Testing DDL deparsing support

2014-12-02 Thread Alvaro Herrera
Robert Haas wrote:
 On Thu, Nov 27, 2014 at 11:43 PM, Ian Barwick i...@2ndquadrant.com wrote:

  A simple schedule to demonstrate this is available; execute from the
  src/test/regress/ directory like this:
 
  ./pg_regress \
--temp-install=./tmp_check \
--top-builddir=../../.. \
--dlpath=. \
--schedule=./schedule_ddl_deparse_demo
 
 I haven't read the code, but this concept seems good to me.

Excellent, thanks.

 It has the unfortunate weakness that a difference could exist during
 the *middle* of the regression test run that is gone by the *end* of
 the run, but our existing pg_upgrade testing has the same weakness, so
 I guess we can view this as one more reason not to be too aggressive
 about having regression tests drop the unshared objects they create.

Agreed.  Not dropping objects also helps test pg_dump itself; the normal
procedure there is to run the regression tests, then pg_dump the regression
database.  Objects that are dropped never exercise their corresponding
pg_dump support code, which I think is a bad thing.  I think we should
institute a policy that regression tests must keep the objects they
create; maybe not all of them, but at least a sample large enough to
cover all interesting possibilities.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Turning recovery.conf into GUCs

2014-12-02 Thread Josh Berkus
On 12/02/2014 06:25 AM, Alex Shulgin wrote:
 I am not in favor of this part. It may be better to let the users know
  that their old configuration is not valid anymore with an error. This
  patch cuts in the flesh with a huge axe, let's be sure that users do
  not ignore the side pain effects, or recovery.conf would be simply
  ignored and users would not be aware of that.
 Yeah, that is good point.
 
 I'd be in favor of a solution that works the same way as before the
 patch, without the need for extra trigger files, etc., but that doesn't
 seem to be nearly possible.  

As previously discussed, there are ways to avoid having a trigger file
for replication.  However, it's hard to avoid having one for PITR
recovery; at least, I can't think of a method which isn't just as
awkward, and we might as well stick with the awkward method we know.

 Whatever tricks we might employ will likely
 be defeated by the fact that the oldschool user will fail to *include*
 recovery.conf in the main conf file.

Well, can we merge this patch and then fight out what to do for the
transitional users as a separate patch?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Review of GetUserId() Usage

2014-12-02 Thread Robert Haas
On Sat, Nov 29, 2014 at 3:00 PM, Stephen Frost sfr...@snowman.net wrote:
 * Stephen Frost (sfr...@snowman.net) wrote:
 * Stephen Frost (sfr...@snowman.net) wrote:
  Attached is a patch to address the pg_cancel/terminate_backend and the
  statistics info as discussed previously.  It sounds like we're coming to

 And I forgot the attachment, of course.  Apologies.

 Updated patch attached which also changes the error messages (which
 hadn't been updated in the prior versions and really should be).

 Barring objections, I plan to move forward with this one and get this
 relatively minor change wrapped up.  As mentioned in the commit, while
 this might be an arguably back-patchable change, the lack of field
 complaints and the fact that it changes existing behavior mean it should
 go only against master, imv.

This patch does a couple of different things:

1. It makes more of the crappy error message change that Andres and I
already objected to on the other thread.  Whether you disagree with
those objections or not, don't make an end-run around them by putting
more of the same stuff into patches on other threads.

2. It changes the functions in pgstatfuncs.c so that you can see the
relevant information not only for your own role, but also for roles of
which you are a member.  That seems fine, but do we need a
documentation change someplace?

3. It messes around with pg_signal_backend().  There are currently no
cases in which pg_signal_backend() throws an error, which is good,
because it lets you write queries against pg_stat_activity() that
don't fail halfway through, even if you are missing permissions on
some things.  This patch introduces such a case, which is bad.
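
For instance (an illustrative query, not one from the thread), the signalling
functions are often used in set-returning queries of this shape; any code path
that raises an error partway through makes the whole statement abort rather
than processing the remaining rows:

    SELECT pid, usename, pg_cancel_backend(pid)
    FROM pg_stat_activity
    WHERE state = 'idle in transaction'
      AND pid <> pg_backend_pid();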

I think it's unfathomable that you would consider anything in this
patch a back-patchable bug fix.  It's clearly a straight-up behavior
change... or more properly three different changes, only one of which
I agree with.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] How about a option to disable autovacuum cancellation on lock conflict?

2014-12-02 Thread Robert Haas
On Sat, Nov 29, 2014 at 11:46 PM, Jim Nasby jim.na...@bluetreble.com wrote:
 What do you mean by never succeed? Is it skipping a large number of pages?
 Might re-trying the locks within the same vacuum help, or are the user locks
 too persistent?

You are confused.  He's talking about the relation-level lock that
vacuum attempts to take before doing any work at all on a given table,
not the per-page cleanup locks that it takes while processing each
page.  If the relation-level lock can't be acquired, the whole table
is skipped.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Turning recovery.conf into GUCs

2014-12-02 Thread Alvaro Herrera
Josh Berkus wrote:
 On 12/02/2014 06:25 AM, Alex Shulgin wrote:

  Whatever tricks we might employ will likely
  be defeated by the fact that the oldschool user will fail to *include*
  recovery.conf in the main conf file.
 
 Well, can we merge this patch and then fight out what to do for the
 transitional users as a separate patch?

You seem to be saying "I don't have any good idea how to solve this
problem now, but I will magically have one once this is committed."  I'm
not sure that works very well.

In any case, the proposal upthread that we raise an error if
recovery.conf is found seems sensible enough.  Users will see it and
they will adjust their stuff -- it's a one-time thing.  It's not like
they switch a version forwards one week and back the following week.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] How about a option to disable autovacuum cancellation on lock conflict?

2014-12-02 Thread Alvaro Herrera
Robert Haas wrote:
 On Sat, Nov 29, 2014 at 11:46 PM, Jim Nasby jim.na...@bluetreble.com wrote:
  What do you mean by never succeed? Is it skipping a large number of pages?
  Might re-trying the locks within the same vacuum help, or are the user locks
  too persistent?
 
 You are confused.  He's talking about the relation-level lock that
 vacuum attempts to take before doing any work at all on a given table,
 not the per-page cleanup locks that it takes while processing each
 page.  If the relation-level lock can't be acquired, the whole table
 is skipped.

Almost there.  Autovacuum takes the relation-level lock, starts
processing.  Some time later, another process wants a lock that
conflicts with the one autovacuum has.  This is flagged by the deadlock
detector, and a signal is sent to autovacuum, which commits suicide.

If the table is large, the time window for this to happen is large also;
there might never be a time window large enough between two lock
acquisitions for one autovacuum run to complete in a table.  This
starves the table from vacuuming completely, until things are bad enough
that an emergency vacuum is forced.  By then, the bloat is disastrous.

I think it's that suicide that Andres wants to disable.
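
To make the sequence concrete, here is a sketch of the kind of interaction
being described (the table name is made up); any lock request that conflicts
with the SHARE UPDATE EXCLUSIVE lock autovacuum holds will trigger it:

    -- session 1: an autovacuum worker is busy processing big_table
    -- session 2:
    LOCK TABLE big_table IN SHARE MODE;  -- conflicts with SHARE UPDATE EXCLUSIVE,
                                         -- so the deadlock detector signals the
                                         -- worker and it cancels itself

An anti-wraparound vacuum is the existing exception: it does not cancel itself
in this situation.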

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Turning recovery.conf into GUCs

2014-12-02 Thread Josh Berkus
On 12/02/2014 10:31 AM, Alvaro Herrera wrote:
 Josh Berkus wrote:
 On 12/02/2014 06:25 AM, Alex Shulgin wrote:
 
 Whatever tricks we might employ will likely
 be defeated by the fact that the oldschool user will fail to *include*
 recovery.conf in the main conf file.

 Well, can we merge this patch and then fight out what to do for the
 transitional users as a separate patch?
 
 You seem to be saying I don't have any good idea how to solve this
 problem now, but I will magically have one once this is committed.  I'm
 not sure that works very well.

No, I'm saying this problem is easy to solve technically, but we have
intractable arguments on this list about the best way to solve it, even
though the bulk of the patch isn't in dispute.

 In any case, the proposal upthread that we raise an error if
 recovery.conf is found seems sensible enough.  Users will see it and
 they will adjust their stuff -- it's a one-time thing.  It's not like
 they switch a version forwards one week and back the following week.

I'm OK with that solution.  Apparently others aren't though.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] How about a option to disable autovacuum cancellation on lock conflict?

2014-12-02 Thread Josh Berkus
On 12/02/2014 10:35 AM, Alvaro Herrera wrote:
 If the table is large, the time window for this to happen is large also;
 there might never be a time window large enough between two lock
 acquisitions for one autovacuum run to complete in a table.  This
 starves the table from vacuuming completely, until things are bad enough
 that an emergency vacuum is forced.  By then, the bloat is disastrous.
 
 I think it's that suicide that Andres wants to disable.

A much better solution for this ... and one which would solve a *lot* of
other issues with vacuum and autovacuum ... would be to give vacuum a
way to track which blocks an incomplete vacuum had already visited.
This would be even more valuable for freeze.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] How about a option to disable autovacuum cancellation on lock conflict?

2014-12-02 Thread Andres Freund
On 2014-12-02 11:02:07 -0800, Josh Berkus wrote:
 On 12/02/2014 10:35 AM, Alvaro Herrera wrote:
  If the table is large, the time window for this to happen is large also;
  there might never be a time window large enough between two lock
  acquisitions for one autovacuum run to complete in a table.  This
  starves the table from vacuuming completely, until things are bad enough
  that an emergency vacuum is forced.  By then, the bloat is disastrous.
  
  I think it's that suicide that Andres wants to disable.

Correct.

 A much better solution for this ... and one which would solve a *lot* of
 other issues with vacuum and autovacuum ... would be to give vacuum a
 way to track which blocks an incomplete vacuum had already visited.
 This would be even more valuable for freeze.

That's pretty much a different problem. Yes, some more persistent state would
be helpful - although it'd need to be *much* more than which pages it
has visited - but you'd still be vulnerable to the same issue.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] How about a option to disable autovacuum cancellation on lock conflict?

2014-12-02 Thread Josh Berkus
On 12/02/2014 11:08 AM, Andres Freund wrote:
 On 2014-12-02 11:02:07 -0800, Josh Berkus wrote:
 On 12/02/2014 10:35 AM, Alvaro Herrera wrote:
 If the table is large, the time window for this to happen is large also;
 there might never be a time window large enough between two lock
 acquisitions for one autovacuum run to complete in a table.  This
 starves the table from vacuuming completely, until things are bad enough
 that an emergency vacuum is forced.  By then, the bloat is disastrous.

 I think it's that suicide that Andres wants to disable.
 
 Correct.
 
 A much better solution for this ... and one which would solve a *lot* of
 other issues with vacuum and autovacuum ... would be to give vacuum a
 way to track which blocks an incomplete vacuum had already visited.
 This would be even more valuable for freeze.
 
  That's pretty much a different problem. Yes, some more persistent state would
 be helpful - although it'd need to be *much* more than which pages it
 has visited - but you'd still be vulnerable to the same issue.

If we're trying to solve the problem that vacuums of large, high-update
tables never complete, it's solving the same problem.  And in a much
better way.

And yeah, doing a vacuum placeholder wouldn't be simple, but it's the
only solution I can think of that's worthwhile.  Just disabling the
"vacuum releases sharelock" behavior puts the user in the situation of
deciding between maintenance and uptime.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Removing INNER JOINs

2014-12-02 Thread Robert Haas
On Sun, Nov 30, 2014 at 12:51 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Bottom line, given all the restrictions on whether the optimization can
 happen, I have very little enthusiasm for the whole idea.  I do not think
 the benefit will be big enough to justify the amount of mess this will
 introduce.

This optimization applies to a tremendous number of real-world cases,
and we really need to have it.  This was a huge problem for me in my
previous life as a web developer.  The previous work that we did to
remove LEFT JOINs was an enormous help, but it's not enough; we need a
way to remove INNER JOINs as well.
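
As a concrete illustration of the kind of query in question (the schema and
names are invented here, not taken from the thread): with orders.customer_id
declared NOT NULL and referencing customers.id, the join below can neither
remove nor duplicate rows, so in principle the planner could drop it whenever
the referential integrity constraint is known to hold at execution time:

    SELECT o.id, o.total
    FROM orders o
    JOIN customers c ON c.id = o.customer_id;   -- no columns of c are used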

I thought that David's original approach of doing this in the planner
was a good one.  That fell down because of the possibility that
apparently-valid referential integrity constraints might not be valid
at execution time if the triggers were deferred.  But frankly, that
seems like an awfully nitpicky thing for this to fall down on.  Lots
of web applications are going to issue only SELECT statements that run
as single-statement transactions, and so that issue, so troubling
in theory, will never occur in practice.  That doesn't mean that we
don't need to account for it somehow to make the code safe, but any
argument that it abridges the use case significantly is, in my
opinion, not credible.

Anyway, David was undeterred by the rejection of that initial approach
and rearranged everything, based on suggestions from Andres and later
Simon, into the form it's reached now.  Kudos to him for his
persistence.  But your point that we might have chosen a whole
different plan if it had known that this join was cheaper is a good
one.  However, that takes us right back to square one, which is to do
this at plan time.  I happen to think that's probably better anyway,
but I fear we're just going around in circles here.  We can either do
it at plan time and find some way of handling the fact that there
might be deferred triggers that haven't fired yet; or we can do it at
execution time and live with the fact that we might have chosen a plan
that is not optimal, though still better than executing a
completely-unnecessary join.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] How about a option to disable autovacuum cancellation on lock conflict?

2014-12-02 Thread Andres Freund
On 2014-12-02 11:12:40 -0800, Josh Berkus wrote:
 On 12/02/2014 11:08 AM, Andres Freund wrote:
  On 2014-12-02 11:02:07 -0800, Josh Berkus wrote:
  On 12/02/2014 10:35 AM, Alvaro Herrera wrote:
  If the table is large, the time window for this to happen is large also;
  there might never be a time window large enough between two lock
  acquisitions for one autovacuum run to complete in a table.  This
  starves the table from vacuuming completely, until things are bad enough
  that an emergency vacuum is forced.  By then, the bloat is disastrous.
 
  I think it's that suicide that Andres wants to disable.
  
  Correct.
  
  A much better solution for this ... and one which would solve a *lot* of
  other issues with vacuum and autovacuum ... would be to give vacuum a
  way to track which blocks an incomplete vacuum had already visited.
  This would be even more valuable for freeze.
  
  That's pretty much a different problem. Yes, some more persistent state would
  be helpful - although it'd need to be *much* more than which pages it
  has visited - but you'd still be vulnerable to the same issue.
 
 If we're trying to solve the problem that vacuums of large, high-update
 tables never complete, it's solving the same problem.

Which isn't what I'm talking about.

The problem is that vacuum is cancelled if a conflicting lock request is
acquired. Plain updates don't do that. But there's workloads where you
need more heavyweight updates, and then it can easily happen.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Sequence Access Method WIP

2014-12-02 Thread Andres Freund
On 2014-11-24 13:16:24 +0200, Heikki Linnakangas wrote:
 To be clear: I don't think this API is very good for its stated purpose, for
 implementing global sequences for use in a cluster. For the reasons I've
 mentioned before.  I'd like to see two changes to this proposal:
 
 1. Make the AM implementation solely responsible for remembering the last
 value. (if it's a global or remote sequence, the current value might not be
 stored in the local server at all)

I think that reason isn't particularly good. The practical applicability
for such an implementation doesn't seem to be particularly large.

 2. Instead of the single amdata field, make it possible for the
 implementation to define any number of fields with any datatype in the
 tuple. That would make debugging, monitoring etc. easier.

My main problem with that approach is that it pretty much nails the door
shut for moving sequences into a catalog table instead of the current,
pretty insane, approach of a physical file per sequence. Currently, with
or without seqam, it'd not be all that hard to force it into a catalog,
taking care to force each tuple onto a separate page...


Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] How about a option to disable autovacuum cancellation on lock conflict?

2014-12-02 Thread Jeff Janes
On Tue, Dec 2, 2014 at 11:12 AM, Josh Berkus j...@agliodbs.com wrote:

 On 12/02/2014 11:08 AM, Andres Freund wrote:
  On 2014-12-02 11:02:07 -0800, Josh Berkus wrote:
  On 12/02/2014 10:35 AM, Alvaro Herrera wrote:
  If the table is large, the time window for this to happen is large
 also;
  there might never be a time window large enough between two lock
  acquisitions for one autovacuum run to complete in a table.  This
  starves the table from vacuuming completely, until things are bad
 enough
  that an emergency vacuum is forced.  By then, the bloat is disastrous.
 
  I think it's that suicide that Andres wants to disable.
 
  Correct.
 
  A much better solution for this ... and one which would solve a *lot* of
  other issues with vacuum and autovacuum ... would be to give vacuum a
  way to track which blocks an incomplete vacuum had already visited.
  This would be even more valuable for freeze.
 
  That's pretty much a different problem. Yes, some more persistent state would
  be helpful - although it'd need to be *much* more than which pages it
  has visited - but you'd still be vulnerable to the same issue.

 If we're trying to solve the problem that vacuums of large, high-update
 tables never complete, it's solving the same problem.  And in a much
 better way.

 And yeah, doing a vacuum placeholder wouldn't be simple, but it's the
 only solution I can think of that's worthwhile.  Just disabling the
 vacuum releases sharelock behavior puts the user in the situation of
 deciding between maintenance and uptime.


I think it would be more promising to work on downgrading lock strengths so
that fewer things conflict, and it would be not much more work than what
you propose.

What operations are people doing on a regular basis that take locks which
cancel vacuum?  create index?

Cheers,

Jeff


Re: [HACKERS] Getting references for OID

2014-12-02 Thread Borodin Vladimir

 Hello,
  
 How can I get all dependent objects for the OID of an object? For example, I have the OID 
 of a database, db_id, and I want to get all OIDs of the namespaces associated with this database. 
 Thank you!

Hello, Dmitry.

Actually, this list is for developers (as described here [0]). You should ask 
this question on pgsql-general@, for example, or you can do it on 
pgsql-ru-general@, in Russian.

And if I understood you right, you can use the oid2name tool, which is part of 
contrib. More info about it can be found here [1].

[0] http://www.postgresql.org/list/
[1] http://www.postgresql.org/docs/current/static/oid2name.html

  
 -- 
 Best regards, Dmitry Voronin
  


--
Vladimir






Re: [HACKERS] superuser() shortcuts

2014-12-02 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
 On Tue, Dec 2, 2014 at 11:29 AM, Stephen Frost sfr...@snowman.net wrote:
  It includes the actual error message, unlike what we have today which is
  guidance as to what's required to get past the permission denied error.
 
  In other words, I disagree that the same amount of information is being
  conveyed.
 
 So, you think that someone might see the message "must be superuser or
 replication role to use replication slots" and fail to understand that
 they have a permissions problem?

I think that the error message about a permissions problem should be
that there is a permissions problem, first and foremost.

We don't say 'You must have SELECT rights on relation X to select from
it.' when you try to do a SELECT that you're not allowed to.  We throw a
permissions denied error.  As pointed out up-thread, we do similar
things for other cases of permissions denied and even beyond permission
denied errors we tend to match up the errmsg() with the errcode(), which
makes a great deal of sense to me and is why I'm advocating that
position.

Thanks,

Stephen




Re: [HACKERS] How about a option to disable autovacuum cancellation on lock conflict?

2014-12-02 Thread Andres Freund
On 2014-12-02 11:23:31 -0800, Jeff Janes wrote:
 I think it would be more promising to work on downgrading lock strengths so
 that fewer things conflict, and it would be not much more work than what
 you propose.

I think you *massively* underestimate the effort required to to lower
lock levels. There's some very ugly corners you have to think about to
do so. Just look at how long it took to implement the lock level
reductions for ALTER TABLE - and those were the simpler cases.

 What operations are people doing on a regular basis that take locks
 which cancel vacuum?  create index?

Locking tables against modifications in this case.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Review of GetUserId() Usage

2014-12-02 Thread Stephen Frost
Robert,

* Robert Haas (robertmh...@gmail.com) wrote:
 On Sat, Nov 29, 2014 at 3:00 PM, Stephen Frost sfr...@snowman.net wrote:
  * Stephen Frost (sfr...@snowman.net) wrote:
  Updated patch attached which also changes the error messages (which
  hadn't been updated in the prior versions and really should be).
 
  Barring objections, I plan to move forward with this one and get this
  relatively minor change wrapped up.  As mentioned in the commit, while
  this might be an arguably back-patchable change, the lack of field
  complaints and the fact that it changes existing behavior mean it should
  go only against master, imv.
 
 This patch does a couple of different things:
 
 1. It makes more of the crappy error message change that Andres and I
 already objected to on the other thread.  Whether you disagree with
 those objections or not, don't make an end-run around them by putting
 more of the same stuff into patches on other threads.

The error message clearly needed to be updated either way or I wouldn't
have touched it.  I changed it to match what I feel is the prevalent and
certainly more commonly seen messaging from PG when it comes to
permissions errors, and drew attention to it by commenting on the fact
that I changed it.  Doing otherwise would have drawn similar criticism
(as it did upthread, by Peter or Alvaro, I believe..) that I wasn't
updating it to match the messaging which we should be using.

 2. It changes the functions in pgstatfuncs.c so that you can see the
 relevant information not only for your own role, but also for roles of
 which you are a member.  That seems fine, but do we need a
 documentation change someplace?

Yes, I've added the documentation changes to my branch, just hadn't
posted an update yet (travelling today).

 3. It messes around with pg_signal_backend().  There are currently no
 cases in which pg_signal_backend() throws an error, which is good,
 because it lets you write queries against pg_stat_activity() that
 don't fail halfway through, even if you are missing permissions on
 some things.  This patch introduces such a case, which is bad.

Good point, I'll move that check up into the other functions, which will
allow for a more descriptive error as well.

 I think it's unfathomable that you would consider anything in this
 patch a back-patchable bug fix.  It's clearly a straight-up behavior
 change... or more properly three different changes, only one of which
 I agree with.

I didn't think it was back-patchable and stated as much.  I anticipated
that argument and provided my thoughts on it.  I *do* think it's wrong
to be using GetUserId() in this case and it's only very slightly
mollified by being documented that way.

Thanks,

Stephen




Re: [HACKERS] Removing INNER JOINs

2014-12-02 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
 On Sun, Nov 30, 2014 at 12:51 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Bottom line, given all the restrictions on whether the optimization can
  happen, I have very little enthusiasm for the whole idea.  I do not think
  the benefit will be big enough to justify the amount of mess this will
  introduce.
 
 This optimization applies to a tremendous number of real-world cases,
 and we really need to have it.  This was a huge problem for me in my
 previous life as a web developer.  The previous work that we did to
 remove LEFT JOINs was an enormous help, but it's not enough; we need a
 way to remove INNER JOINs as well.

For my 2c, I'm completely with Robert on this one.  There are a lot of
cases this could help with, particularly things coming out of ORMs
(which, yes, might possibly be better written, but that's a different
issue).

 I thought that David's original approach of doing this in the planner
 was a good one.  That fell down because of the possibility that
 apparently-valid referential integrity constraints might not be valid
 at execution time if the triggers were deferred.  But frankly, that
 seems like an awfully nitpicky thing for this to fall down on.  Lots
 of web applications are going to issue only SELECT statements that run
 as as single-statement transactions, and so that issue, so troubling
 in theory, will never occur in practice.  That doesn't mean that we
 don't need to account for it somehow to make the code safe, but any
 argument that it abridges the use case significantly is, in my
 opinion, not credible.

Agreed with this also; deferred triggers are not commonplace in my
experience and when it *does* happen, ime at least, it's because you
have a long-running data load or similar where you're not going to
care one bit that large, complicated JOINs aren't as fast as they
might have been otherwise.

 Anyway, David was undeterred by the rejection of that initial approach
 and rearranged everything, based on suggestions from Andres and later
 Simon, into the form it's reached now.  Kudos to him for his
 persistance.  But your point that we might have chosen a whole
 different plan if it had known that this join was cheaper is a good
 one.  However, that takes us right back to square one, which is to do
 this at plan time.  I happen to think that's probably better anyway,
 but I fear we're just going around in circles here.  We can either do
 it at plan time and find some way of handling the fact that there
 might be deferred triggers that haven't fired yet; or we can do it at
 execution time and live with the fact that we might have chosen a plan
 that is not optimal, though still better than executing a
 completely-unnecessary join.

Right, we can't get it wrong in the face of deferred triggers either.
Have we considered only doing the optimization for read-only
transactions?  I'm not thrilled with that, but at least we'd get out
from under this deferred triggers concern.  Another way might be an
option to say use the optimization, but throw an error if you run
into a deferred trigger, or perhaps save both plans and use whichever
one we can when we get to execution time?  That could make planning
time go up too much to work, but perhaps it's worth testing..

Thanks,

Stephen




Re: [HACKERS] How about a option to disable autovacuum cancellation on lock conflict?

2014-12-02 Thread Jeff Janes
On Tue, Dec 2, 2014 at 11:41 AM, Andres Freund and...@2ndquadrant.com
wrote:

 On 2014-12-02 11:23:31 -0800, Jeff Janes wrote:
  I think it would be more promising to work on downgrading lock strengths
 so
  that fewer things conflict, and it would be not much more work than what
  you propose.

 I think you *massively* underestimate the effort required to to lower
 lock levels. There's some very ugly corners you have to think about to
 do so. Just look at how long it took to implement the lock level
 reductions for ALTER TABLE - and those were the simpler cases.


Or maybe I overestimate how hard it would be to make vacuum restartable.
You would have to save a massive amount of state (up to maintenance_work_mem
worth of TID list, plus the block you left off at in both the table and all of
the indexes of that table), and you would somehow have to validate that saved
state against any changes that might have occurred to the table or the indexes
while it was saved and you were not holding the lock, which seems like it
would be almost as full of corner cases as weakening the lock in the first
place.  Aren't they logically the same thing?  If we could drop the lock
and take it up again later, maybe the answer is not to save the state, but
just to pause the vacuum until the lock becomes free again, in effect
saving the state in situ.  That would allow autovac worker to be held
hostage to anyone taking a lock, though.

The only easy way to do it that I see is to have it only stop at the end of
a index-cleaning cycle, which probably takes too long to block for.  Or
record a restart point at the end of each index-cleaning cycle, and then
when it yields the lock it abandons all work since the last cycle end,
rather than since the beginning.  That would be better than what we have,
but seems like a far cry from actual restarting from any point.



  What operations are people doing on a regular basis that take locks
  which cancel vacuum?  create index?

 Locking tables against modifications in this case.


So in share mode, then?  I don't think there is any reason that there
can't be a lock mode that conflicts with ROW EXCLUSIVE but not SHARE
UPDATE EXCLUSIVE.  Basically something that conflicts with logical
changes, but not with physical changes.
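
To spell out the gap being pointed at (the table name is made up): today the
weakest mode that blocks writes also blocks vacuum, because SHARE conflicts
with the SHARE UPDATE EXCLUSIVE lock autovacuum takes:

    LOCK TABLE accounts IN SHARE MODE;                    -- blocks INSERT/UPDATE/DELETE,
                                                          -- but also conflicts with autovacuum
    LOCK TABLE accounts IN SHARE UPDATE EXCLUSIVE MODE;   -- doesn't block writes at all

The hypothetical new mode would sit between these two: conflicting with ROW
EXCLUSIVE but not with SHARE UPDATE EXCLUSIVE.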

Cheers,

Jeff


Re: [HACKERS] Review of GetUserId() Usage

2014-12-02 Thread Stephen Frost
* Stephen Frost (sfr...@snowman.net) wrote:
  3. It messes around with pg_signal_backend().  There are currently no
  cases in which pg_signal_backend() throws an error, which is good,
  because it lets you write queries against pg_stat_activity() that
  don't fail halfway through, even if you are missing permissions on
  some things.  This patch introduces such a case, which is bad.
 
 Good point, I'll move that check up into the other functions, which will
 allow for a more descriptive error as well.

Err, I'm missing something here, as pg_signal_backend() is a misc.c
static internal function?  How would you be calling it from a query
against pg_stat_activity()?

I'm fine making the change anyway, just curious..

Thanks,

Stephen




Re: [HACKERS] Review of GetUserId() Usage

2014-12-02 Thread Robert Haas
On Tue, Dec 2, 2014 at 2:50 PM, Stephen Frost sfr...@snowman.net wrote:
 1. It makes more of the crappy error message change that Andres and I
 already objected to on the other thread.  Whether you disagree with
 those objections or not, don't make an end-run around them by putting
 more of the same stuff into patches on other threads.

 The error message clearly needed to be updated either way or I wouldn't
 have touched it.  I changed it to match what I feel is the prevelant and
 certainly more commonly seen messaging from PG when it comes to
 permissions errors, and drew attention to it by commenting on the fact
 that I changed it.  Doing otherwise would have drawn similar criticism
 (as it did upthread, by Peter or Alvaro, I believe..) that I wasn't
 updating it to match the messaging which we should be using.

OK, I guess that's a fair point.

 I think it's unfathomable that you would consider anything in this
 patch a back-patchable bug fix.  It's clearly a straight-up behavior
 change... or more properly three different changes, only one of which
 I agree with.

 I didn't think it was back-patchable and stated as much.  I anticipated
 that argument and provided my thoughts on it.  I *do* think it's wrong
 to be using GetUserId() in this case and it's only very slightly
 mollified by being documented that way.

It's not wrong.  It's just different than what you happen to prefer.
It's fine to want to change things, but "not the way I would have done
it" is not the same as "arguably a bug".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] How about a option to disable autovacuum cancellation on lock conflict?

2014-12-02 Thread Andres Freund
On 2014-12-02 12:22:42 -0800, Jeff Janes wrote:
 Or maybe I overestimate how hard it would be to make vacuum
 restartable.

That's a massive project. Which is why I'm explicitly *not* suggesting
that. What I instead suggest is a separate threshold after which vacuum
isn't going to abort automatically after a lock conflict. So after that
threshold just behave like anti wraparound vacuum already does.

Maybe autovacuum_vacuum/analyze_force_threshold or similar. If set to
zero, the default, that behaviour is disabled. If set to a positive
value it's an absolute one, if negative it's a factor of the normal
autovacuum_vacuum/analyze_threshold.
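
Spelled out in postgresql.conf terms, the proposal would look roughly like this
(the parameter names and values are hypothetical, taken straight from the
sketch above; nothing like this exists yet):

    autovacuum_vacuum_force_threshold = 100000   # absolute threshold in tuples: past this,
                                                  # stop yielding to conflicting lock requests
    autovacuum_analyze_force_threshold = -2      # negative: 2x autovacuum_analyze_threshold
    # 0 (the default) keeps today's behaviour: autovacuum cancels itself on conflict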


Greetings,

Andres Freund




Re: [HACKERS] superuser() shortcuts

2014-12-02 Thread Robert Haas
On Tue, Dec 2, 2014 at 2:35 PM, Stephen Frost sfr...@snowman.net wrote:
 * Robert Haas (robertmh...@gmail.com) wrote:
 On Tue, Dec 2, 2014 at 11:29 AM, Stephen Frost sfr...@snowman.net wrote:
  It includes the actual error message, unlike what we have today which is
   guidance as to what's required to get past the permission denied error.
 
  In other words, I disagree that the same amount of information is being
  conveyed.

  So, you think that someone might see the message "must be superuser or
  replication role to use replication slots" and fail to understand that
 they have a permissions problem?

 I think that the error message about a permissions problem should be
 that there is a permissions problem, first and foremost.

I can't disagree with that, but it's not like the current message is:

ERROR: I ran into some kind of a problem but I'm not going to tell you
what it is for a long time OK I guess that's enough well actually you
see there is a problem and it happens to do have to do with
permissions and in particular that they are denied.

It's pretty clear that the current message is complaining about a
permissions problem, so the fact that it doesn't specifically include
the words permission and denied doesn't bother me.  Let me ask the
question again: what information do you think is conveyed by your
proposed rewrite that is not conveyed by the current message?

 We don't say 'You must have SELECT rights on relation X to select from
 it.' when you try to do a SELECT that you're not allowed to.  We throw a
 permissions denied error.  As pointed out up-thread, we do similar
 things for other cases of permissions denied and even beyond permission
 denied errors we tend to match up the errmsg() with the errcode(), which
 makes a great deal of sense to me and is why I'm advocating that
 position.

I don't think that trying to match up the errmsg() with the errcode()
is a useful exercise.  That's just restricting the ways that people
can write error messages in a way that is otherwise unnecessary.  The
errmsg() and errcode() have different purposes; the former is to
provide a concise, human-readable description of the problem, and the
latter is to group errors into categories so that related errors can
be handled programmatically.  They need not be, and in many existing
cases are not, even vaguely similar; and the fact that many of our
permission denied errors happen to use that phrasing is not a
sufficient justification for changing the rest unless the new phrasing
is a bona fide improvement.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Turning recovery.conf into GUCs

2014-12-02 Thread Andres Freund
On 2014-12-02 17:25:14 +0300, Alex Shulgin wrote:
 I'd be in favor of a solution that works the same way as before the
 patch, without the need for extra trigger files, etc., but that doesn't
 seem to be nearly possible.  Whatever tricks we might employ will likely
 be defeated by the fact that the oldschool user will fail to *include*
 recovery.conf in the main conf file.

I think removing the ability to define a trigger file is pretty much
unacceptable. It's not *too* bad to rewrite recovery.conf's contents
into actual valid postgresql.conf format, but it's harder to change an
existing complex failover setup that relies on the existence of such a
trigger.  I think aiming for removal of that is a sure way to prevent
the patch from getting in.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-12-02 Thread Robert Haas
On Tue, Nov 25, 2014 at 1:38 PM, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Nov 25, 2014 at 4:01 AM, Robert Haas robertmh...@gmail.com wrote:
 - This appears to needlessly reindent the comments for PG_CACHE_LINE_SIZE.

 Actually, the word "only" is removed (because PG_CACHE_LINE_SIZE has a
 new client now). So it isn't quite the same paragraph as before.

Oy, I missed that.

 - I really don't think we need a #define in pg_config_manual.h for
 this.  Please omit that.

 You'd prefer to not offer a way to disable abbreviation? Okay. I guess
 that makes sense - it should work well as a general optimization.

I'd prefer not to have a #define in pg_config_manual.h.  Only stuff
that we expect a reasonably decent number of users to need to change
should be in that file, and this is too marginal for that.  If anybody
other than the developers of the feature is disabling this, the whole
thing is going to be ripped back out.

 I'm not sure about that. I'd prefer to have tuplesort (and one or two
 other sites) set the "abbreviation is possible in principle" flag.
 Otherwise, sortsupport needs to assume that the leading attribute is
 going to be the abbreviation-applicable one, which might not always be
 true. Still, it's not as if I feel strongly about it.

When wouldn't that be true?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Selectivity estimation for inet operators

2014-12-02 Thread Emre Hasegeli
 I spent a fair chunk of the weekend hacking on this patch to make
 it more understandable and fix up a lot of what seemed to me pretty
 clear arithmetic errors in the upper layers of the patch.  However,
 I couldn't quite convince myself to commit it, because the business
 around estimation for partial histogram-bucket matches still doesn't
 make any sense to me.  Specifically this:

 /* Partial bucket match. */
 left_divider = inet_hist_match_divider(left, query, opr_codenum);
 right_divider = inet_hist_match_divider(right, query, opr_codenum);

 if (left_divider >= 0 || right_divider >= 0)
 match += 1.0 / pow(2.0, Max(left_divider, right_divider));

 Now unless I'm missing something pretty basic about the divider
 function, it returns larger numbers for inputs that are further away
 from each other (ie, have more not-in-common significant address bits).
 So the above calculation seems exactly backwards to me: if one endpoint
 of a bucket is close to the query, or even an exact match, and the
 other endpoint is further away, we completely ignore the close/exact
 match and assign a bucket match fraction based only on the further-away
 endpoint.  Isn't that exactly backwards?

You are right that partial bucket match calculation isn't fair on
some circumstances.

 I experimented with logic like this:

 if (left_divider >= 0 && right_divider >= 0)
 match += 1.0 / pow(2.0, Min(left_divider, right_divider));
 else if (left_divider >= 0 || right_divider >= 0)
 match += 1.0 / pow(2.0, Max(left_divider, right_divider));

 ie, consider the closer endpoint if both are valid.  But that didn't seem
 to work a whole lot better.  I think really we need to consider both
 endpoints not just one to the exclusion of the other.

I have tried many combinations like this.  Including arithmetic,
geometric, logarithmic mean functions.  I could not get good results
with any of them, so I left it in a basic form.

Max() works pretty well most of the time, because if the query matches
the bucket generally it is close to both of the endpoints.  By using
Max(), we are actually crediting the ones which are close to both
of the endpoints.  
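
A worked example with made-up numbers: taking the divider as roughly the count
of not-in-common significant bits (per the description above), if left_divider
= 2 and right_divider = 5, the code credits the bucket with
1 / 2^max(2, 5) = 1/32, about 0.03, rather than the 1/4 that the closer
endpoint alone would suggest.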

 I'm also not exactly convinced by the divider function itself,
 specifically about the decision to fail and return -1 if the masklen
 comparison comes out wrong.  This effectively causes the masklen to be
 the most significant part of the value (after the IP family), which seems
 totally wrong.  ISTM we ought to consider the number of leading bits in
 common as the primary indicator of how far apart a query and a
 histogram endpoint are.

The partial match calculation with Max() is especially unfair on
the buckets where more significant bits change.  For example 63/8 and
64/8.  Returning -1 instead of a high divider, forces it to use
the divider for the other endpoint.  We consider the number of leading
bits in common as the primary indicator, just for the other endpoint.

I have also experimented with the count of the common bits of
the endpoints of the bucket for better partial match calculation.
I could not find out a meaningful equation with it.

 Even if the above aspects of the code are really completely right, the
 comments fail to explain why.  I spent a lot of time on the comments,
 but so far as these points are concerned they still only explain what
 is being done and not why it's a useful calculation to make.

I couldn't write better comments because I don't have strong arguments
about it.  We can say that we don't try to make use of both of
the endpoints, because we don't know how to combine them.  We only use
the one with matching family and masklen, and when both of them match
we use the distant one to be on the safer side.

Thank you for looking at it.  Comments look much better now.




Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-12-02 Thread Robert Haas
On Tue, Nov 25, 2014 at 7:13 PM, Peter Geoghegan p...@heroku.com wrote:
 Alright, so let me summarize what I think are the next steps in
 working towards getting this patch committed. I should produce a new
 revision which:

 * Uses default costings.

 * Applies a generic final quality check that enforces a distance of no
 greater than 50% of the total string size. (The use of default
 costings removes any reason to continue to do this)

 * Work through Robert's suggestions on other aspects that need work
 [1], most of which I already agreed to.

Sounds good so far.

 What is unclear is whether or not I should continue to charge extra
 for non-matching user supplied alias (and, I think more broadly,
 consider multiple RTEs iff the user did use an alias) - Robert was
 skeptical, but didn't seem to have made his mind up. I still think I
 should cost things based on aliases, and consider multiple RTEs even
 when the user supplied an alias (the penalty should just be a distance
 of 1 and not 3, though, in light of other changes to the
 weighing/costing). If I don't hear anything in the next day or two,
 I'll more or less preserve aliases-related aspects of the patch.

Basically, the case in which I think it's helpful to issue a
suggestion here is when the user has used the table name rather than
the alias name.  I wonder if it's worth checking for that case
specifically, in lieu of what you've done here, and issuing a totally
different hint in that case (HINT: You must refer to this column
as prime_minister.id rather than cameron.id).
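
To make the scenario concrete (the names are invented, following the hint text
above): the table is aliased, and the query then refers to it by a name that is
no longer visible once the alias is applied:

    SELECT cameron.id
    FROM cameron AS prime_minister;
    -- the reference "cameron.id" is invalid here; the idea is to hint
    -- "prime_minister.id" in this specific case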

Another idea, which I think I like less well, is to check the
Levenshtein distance between the allowed alias and the entered alias
and, if that's within the half-the-shorter-length threshold, consider
possible matches from that RTE, charge the distance between the
correct alias and the entered alias as a penalty to each potential
column match.

What I think won't do is to look at a situation where the user has
entered automobile.id and suggest that maybe they meant student.iq, or
even student.id. The amount of difference between the names has got to
matter for the RTE names, just as it does for the column names.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] superuser() shortcuts

2014-12-02 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
 On Tue, Dec 2, 2014 at 2:35 PM, Stephen Frost sfr...@snowman.net wrote:
  * Robert Haas (robertmh...@gmail.com) wrote:
  On Tue, Dec 2, 2014 at 11:29 AM, Stephen Frost sfr...@snowman.net wrote:
   It includes the actual error message, unlike what we have today which is
    guidance as to what's required to get past the permission denied error.
  
   In other words, I disagree that the same amount of information is being
   conveyed.
 
   So, you think that someone might see the message "must be superuser or
   replication role to use replication slots" and fail to understand that
  they have a permissions problem?
 
  I think that the error message about a permissions problem should be
  that there is a permissions problem, first and foremost.
 
 I can't disagree with that, but it's not like the current message is:
 
 ERROR: I ran into some kind of a problem but I'm not going to tell you
 what it is for a long time OK I guess that's enough well actually you
 see there is a problem and it happens to do have to do with
 permissions and in particular that they are denied.

I agree that doing this would be rather overkill.

 It's pretty clear that the current message is complaining about a
 permissions problem, so the fact that it doesn't specifically include
 the words permission and denied doesn't bother me.  Let me ask the
 question again: what information do you think is conveyed by your
 proposed rewrite that is not conveyed by the current message?

I do like that it explicitly states that the problem is with permissions
and not some lack of capability in the software, and I like the
consistency of messaging.  (To the point where I was suggesting somewhere
along the way that we update the error messaging documentation to be
explicit that the errmsg and the errcode should make sense and match up -
NOT that we should use some simple mapping from one to the other, as we'd
then be reducing the information provided.  I've seen a number of cases
where people have the SQL error code also returned in their psql session;
I've never been a fan of that, as I much prefer words over error codes,
but I wonder if they've done it because, in some cases, it *isn't* clear
what the errcode is from the errmsg..)

  We don't say 'You must have SELECT rights on relation X to select from
  it.' when you try to do a SELECT that you're not allowed to.  We throw a
  permissions denied error.  As pointed out up-thread, we do similar
  things for other cases of permissions denied and even beyond permission
  denied errors we tend to match up the errmsg() with the errcode(), which
  makes a great deal of sense to me and is why I'm advocating that
  position.
 
 I don't think that trying to match up the errmsg() with the errcode()
 is a useful exercise.  That's just restricting the ways that people
 can write error messages in a way that is otherwise unnecessary.  The
 errmsg() and errcode() have different purposes; the former is to
 provide a concise, human-readable description of the problem, and the
 latter is to group errors into categories so that related errors can
 be handled programmatically.  They need not be, and in many existing
 cases are not, even vaguely similar; and the fact that many of our
 permission denied errors happen to use that phrasing is not a
 sufficient justification for changing the rest unless the new phrasing
 is a bona fide improvement.

It provides a consistency of messaging that I feel is an improvement.
That they aren't related in a number of existing cases strikes me as an
opportunity to improve rather than cases where we really know better.
Perhaps in some of those cases the problem is that there isn't a good
errcode, and maybe we should actually add an error code instead of
changing the message.  Perhaps there are cases where we just know better
or it's a very generic errcode without any real meaning.  I agree that
we want the messaging to be good, I'd like it to also be consistent with
itself and with the errcode.

I've not gone and tried to do a deep look at the errmsg vs errcode, and
that's likely too much to bite off at once anyway, but I have gone and
looked at the role attribute uses and the other permissions denied
errcode cases and we're not at all consistent today and we change from
release-to-release depending on changes to the permissions system (see:
TRUNCATE).  I specifically don't like that.

If we want to say that role-attribute-related error messages should be
"You need X to do Y", then I'll go make those changes, removing the
'permission denied' where those are used and be happier that we're at
least consistent, but it'll be annoying if we ever make those
role-attributes be GRANT'able rights (GRANT .. ON CLUSTER?)  as those
errmsg's will then change from you need X to permission denied to do
Y.  Having the errdetail change bothers me a lot less.

How would you feel about changing the various usages of
have_createrole_privilege() 

Re: [HACKERS] Selectivity estimation for inet operators

2014-12-02 Thread Emre Hasegeli
 Actually, there's a second large problem with this patch: blindly
 iterating through all combinations of MCV and histogram entries makes the
 runtime O(N^2) in the statistics target.  I made up some test data (by
 scanning my mail logs) and observed the following planning times, as
 reported by EXPLAIN ANALYZE:

 explain analyze select * from relays r1, relays r2 where r1.ip = r2.ip;
 explain analyze select * from relays r1, relays r2 where r1.ip  r2.ip;

  stats target    eqjoinsel    networkjoinsel
 
  100             0.27 ms      1.85 ms
  1000            2.56 ms      167.2 ms
  10000           56.6 ms      13987.1 ms

 I don't think it's necessary for network selectivity to be quite as
 fast as eqjoinsel, but I doubt we can tolerate 14 seconds planning
 time for a query that might need just milliseconds to execute :-(

 It seemed to me that it might be possible to reduce the runtime by
 exploiting knowledge about the ordering of the histograms, but
 I don't have time to pursue that right now.

I made it break the loop when we have passed the last possible match.  Patch
attached.  It reduces the runtime by almost 50% with large histograms.

We can also make it use only some elements of the MCV and histogram
for join estimation.  I have experimented with reducing the right and
the left hand side of the lists on the previous versions.  I remember
it was better to reduce only the left hand side.  I think it would be
enough to use log(n) elements of the right hand side MCV and histogram.
I can make the change, if you think selectivity estimation function
is the right place for this optimization.
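
A minimal standalone sketch of that log(n) sampling idea (the names and the
Datum placeholder are invented here; this is not part of the attached patch):

#include <math.h>

typedef unsigned long Datum;	/* stand-in for the real histogram element type */

/*
 * Copy roughly log2(nvalues) evenly spaced entries of a sorted histogram
 * into sampled[], always keeping both endpoints, so that the nested
 * join-estimation loop becomes O(n log n) rather than O(n^2).
 */
static int
sample_histogram(const Datum *values, int nvalues,
				 Datum *sampled, int max_sampled)
{
	int			target;
	int			i,
				n = 0;

	if (nvalues <= 2)
		target = nvalues;
	else
		target = (int) log2((double) nvalues) + 1;
	if (target > max_sampled)
		target = max_sampled;

	if (target < 2 || target >= nvalues)
	{
		/* too few entries to thin out: just copy what fits */
		for (i = 0; i < nvalues && i < max_sampled; i++)
			sampled[n++] = values[i];
		return n;
	}

	for (i = 0; i < target; i++)
		sampled[n++] = values[(int) ((double) i * (nvalues - 1) / (target - 1))];
	return n;
}

Only one side of the join's statistics would be thinned this way, per the
observation above.
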
diff --git a/src/backend/utils/adt/network_selfuncs.c b/src/backend/utils/adt/network_selfuncs.c
index f854847..16f39db 100644
--- a/src/backend/utils/adt/network_selfuncs.c
+++ b/src/backend/utils/adt/network_selfuncs.c
@@ -612,20 +612,23 @@ inet_hist_value_sel(Datum *values, int nvalues, Datum constvalue,
 		return 0.0;
 
 	query = DatumGetInetPP(constvalue);
 
 	/* left is the left boundary value of the current bucket ... */
 	left = DatumGetInetPP(values[0]);
 	left_order = inet_inclusion_cmp(left, query, opr_codenum);
 
 	for (i = 1; i < nvalues; i++)
 	{
+		if (left_order == 256)
+			break;
+
 		/* ... and right is the right boundary value */
 		right = DatumGetInetPP(values[i]);
 		right_order = inet_inclusion_cmp(right, query, opr_codenum);
 
 		if (left_order == 0 && right_order == 0)
 		{
 			/* The whole bucket matches, since both endpoints do. */
 			match += 1.0;
 		}
 		else if ((left_order <= 0 && right_order >= 0) ||
@@ -854,20 +857,23 @@ inet_opr_codenum(Oid operator)
 static int
 inet_inclusion_cmp(inet *left, inet *right, int opr_codenum)
 {
 	if (ip_family(left) == ip_family(right))
 	{
 		int			order;
 
 		order = bitncmp(ip_addr(left), ip_addr(right),
 		Min(ip_bits(left), ip_bits(right)));
 
+		if (order > 0)
+			return 256;
+
 		if (order != 0)
 			return order;
 
 		return inet_masklen_inclusion_cmp(left, right, opr_codenum);
 	}
 
 	return ip_family(left) - ip_family(right);
 }
 
 /*

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-12-02 Thread Peter Geoghegan
On Tue, Dec 2, 2014 at 1:00 PM, Robert Haas robertmh...@gmail.com wrote:
 I'd prefer not to have a #define in pg_config_manual.h.  Only stuff
 that we expect a reasonably decent number of users to need to change
 should be in that file, and this is too marginal for that.  If anybody
 other than the developers of the feature is disabling this, the whole
 thing is going to be ripped back out.

I agree. The patch either works well in general and it goes in, or it
doesn't and should be rejected. That doesn't mean that the standard
applied is that regressions are absolutely unacceptable, but the
standard shouldn't be too far off that.  I feel pretty confident that
we'll be able to meet that standard, though, because database luminary
Jim Gray recommended this technique in about 1995. Even if the details
of what I have here could stand to be tweaked, there is no getting
around the fact that a good sort routine needs to strongly consider
locality. That was apparent even in 1995, but now it's a very major
trend.
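
To make the locality argument concrete, a generic sketch (this is not the
patch's SortSupport interface, and it just packs leading bytes rather than
strxfrm() output):

#include <string.h>

typedef struct SortItem
{
	unsigned long abbrev;		/* leading key bytes, packed big-endian */
	const char *full;			/* pointer to the full, out-of-line key */
} SortItem;

static unsigned long
abbreviate(const char *s)
{
	unsigned long v = 0;
	int			i;

	for (i = 0; i < (int) sizeof(v); i++)
	{
		v <<= 8;
		if (*s)
			v |= (unsigned char) *s++;
	}
	return v;
}

static int
compare_items(const SortItem *a, const SortItem *b)
{
	if (a->abbrev < b->abbrev)
		return -1;
	if (a->abbrev > b->abbrev)
		return 1;
	/* abbreviated keys tied: only now pay for the cache-missing pointer chase */
	return strcmp(a->full, b->full);
}

Most comparisons resolve on the abbrev field sitting in the dense array of
SortItems, so the sort mostly stays within cache; the pointer is chased only
on ties, which is where the entropy and correlation concerns below come in.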

Incidentally, I think that an under-appreciated possible source of
regressions here is abbreviation of attributes that have a strong
physical/logical correlation. I could see a small regression for one
such case even though my cost model indicated that it should be very
profitable. On the other hand, on other occasions my cost model (i.e.
considering how good a proxy abbreviated key cardinality is for full
key cardinality) was quite pessimistic. Although, at least it was only
a small regression, even though the correlation was something like
0.95. And at least the sort will be very fast in any case.

You'll recall that Heikki's test case involved correlation like that,
even though it was mostly intended to make a point about the entropy
in abbreviated keys. Correlation was actually the most important
factor there. I think it might be generally true that it's the most
important factor, in practice more important even than capturing
sufficient entropy in the abbreviated key representation.

 I'm not sure about that. I'd prefer to have tuplesort (and one or two
 other sites) set the abbreviation is possible in principle flag.
 Otherwise, sortsupport needs to assume that the leading attribute is
 going to be the abbreviation-applicable one, which might not always be
 true. Still, it's not as if I feel strongly about it.

 When wouldn't that be true?

It just feels a bit wrong to me. There might be a future in which we
want to use the datum1 field for a non-leading attribute. For example,
when it is known ahead of time that there are low cardinality integers
in the leading key/attribute. Granted, that's pretty speculative, but
then it's not as if I'm insisting that it must be done that way. I
defer to you.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-12-02 Thread Bruce Momjian
On Fri, Nov 14, 2014 at 02:51:32PM +1300, David Rowley wrote:
 Likely for most aggregates, like count, sum, max, min, bit_and and bit_or the
 merge function would be the same as the transition function, as the state type
 is just the same as the input type. It would only be aggregates like avg(),
 stddev*(), bool_and() and bool_or() that would need a new merge function
 made... These would be no more complex than the transition functions... Which
 are just a few lines of code anyway.
 
 We'd simply just not run parallel query if any aggregates used in the query
 didn't have a merge function.
 
 When I mentioned this, I didn't mean to appear to be placing a road block.
 I was just bringing to the table the information that COUNT(*) + COUNT(*)
 works ok for merging COUNT(*)'s sub totals, but AVG(n) + AVG(n) does not.

Sorry, late reply, but, FYI, I don't think our percentile functions
can be parallelized in the same way:

test=> \daS *percent*
                                               List of aggregate functions
   Schema   |      Name       |  Result data type  |             Argument data types              |             Description
------------+-----------------+--------------------+----------------------------------------------+-------------------------------------
 pg_catalog | percent_rank    | double precision   | VARIADIC "any" ORDER BY VARIADIC "any"       | fractional rank of hypothetical row
 pg_catalog | percentile_cont | double precision   | double precision ORDER BY double precision   | continuous distribution percentile
 pg_catalog | percentile_cont | double precision[] | double precision[] ORDER BY double precision | multiple continuous percentiles
 pg_catalog | percentile_cont | interval           | double precision ORDER BY interval           | continuous distribution percentile
 pg_catalog | percentile_cont | interval[]         | double precision[] ORDER BY interval         | multiple continuous percentiles
 pg_catalog | percentile_disc | anyelement         | double precision ORDER BY anyelement         | discrete percentile
 pg_catalog | percentile_disc | anyarray           | double precision[] ORDER BY anyelement       | multiple discrete percentiles

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.2 recovery/startup problems

2014-12-02 Thread Robert Haas
On Tue, Dec 2, 2014 at 11:54 AM, Jeff Janes jeff.ja...@gmail.com wrote:
 During abort processing after getting a SIGTERM, the back end truncates
 59288 to zero size, and unlinks all the other files (including 59288_init).
 The actual removal of 59288 is left until the checkpoint.  So if you SIGTERM
 the backend, then take down the server uncleanly before the next checkpoint
 completes, you are left with just 59288.

 Here is the strace:

 open("base/16416/59288", O_RDWR)        = 8
 ftruncate(8, 0)                         = 0
 close(8)                                = 0
 unlink("base/16416/59288.1")            = -1 ENOENT (No such file or directory)
 unlink("base/16416/59288_fsm")          = -1 ENOENT (No such file or directory)
 unlink("base/16416/59288_vm")           = -1 ENOENT (No such file or directory)
 unlink("base/16416/59288_init")         = 0
 unlink("base/16416/59288_init.1")       = -1 ENOENT (No such file or directory)

Hmm, that's not good.

I guess we can either adopt your suggestion of adjusting
ResetUnloggedRelationsInDbspaceDir() to cope with the possibility that
the situation has changed during recovery, or else figure out how to
be more stringent about the order in which forks get removed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-12-02 Thread Robert Haas
On Tue, Dec 2, 2014 at 4:21 PM, Peter Geoghegan p...@heroku.com wrote:
 I'm not sure about that. I'd prefer to have tuplesort (and one or two
 other sites) set the abbreviation is possible in principle flag.
 Otherwise, sortsupport needs to assume that the leading attribute is
 going to be the abbreviation-applicable one, which might not always be
 true. Still, it's not as if I feel strongly about it.

 When wouldn't that be true?

 It just feels a bit wrong to me. There might be a future in which we
 want to use the datum1 field for a non-leading attribute. For example,
 when it is known ahead of time that there are low cardinality integers
 in the leading key/attribute. Granted, that's pretty speculative, but
 then it's not as if I'm insisting that it must be done that way. I
 defer to you.

Well, maybe you should make the updates we've agreed on and I can take
another look at it.  But I didn't think that I was proposing to change
anything about the level at which the decision about whether to
abbreviate or not was made; rather, I thought I was suggesting that we
pass that flag down to the code that initializes the sortsupport
object as an argument rather than through the sortsupport structure
itself.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning recovery.conf into GUCs

2014-12-02 Thread Alex Shulgin

Andres Freund and...@2ndquadrant.com writes:

 On 2014-12-02 17:25:14 +0300, Alex Shulgin wrote:
 I'd be in favor of a solution that works the same way as before the
 patch, without the need for extra trigger files, etc., but that doesn't
 seem to be nearly possible.  Whatever tricks we might employ will likely
 be defeated by the fact that the oldschool user will fail to *include*
 recovery.conf in the main conf file.

 I think removing the ability to define a trigger file is pretty much
 unacceptable. It's not *too* bad to rewrite recovery.conf's contents
 into actual valid postgresql.conf format, but it's harder to change an
 existing complex failover setup that relies on the existance of such a
 trigger.  I think aiming for removal of that is a sure way to prevent
 the patch from getting in.

To make it clear, I was talking not about trigger_file, but about
standby.enabled file which triggers the recovery/pitr/standby scenario
in the current form of the patch and stands as a replacement check
instead of the original check for the presence of recovery.conf.

--
Alex


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-12-02 Thread Peter Geoghegan
On Tue, Dec 2, 2014 at 2:07 PM, Robert Haas robertmh...@gmail.com wrote:
 Well, maybe you should make the updates we've agreed on and I can take
 another look at it.

Agreed.

 But I didn't think that I was proposing to change
 anything about the level at which the decision about whether to
 abbreviate or not was made; rather, I thought I was suggesting that we
 pass that flag down to the code that initializes the sortsupport
 object as an argument rather than through the sortsupport structure
 itself.

The flag I'm talking about concerns the *applicability* of
abbreviation, and not whether or not it will actually be used (maybe
the opclass lacks support, or decides not to for some platform
specific reason). Tuplesort has a contract with abbreviation +
sortsupport that considers whether or not the function pointer used to
abbreviate is set, which relates to whether or not abbreviation will
*actually* be used. Note that for non-abbreviation-applicable
attributes, btsortsupport_worker() never sets the function pointer
(nor, incidentally, does it set the other abbreviation related
function pointers in the struct).

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-12-02 Thread Robert Haas
On Tue, Dec 2, 2014 at 5:16 PM, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Dec 2, 2014 at 2:07 PM, Robert Haas robertmh...@gmail.com wrote:
 Well, maybe you should make the updates we've agreed on and I can take
 another look at it.

 Agreed.

 But I didn't think that I was proposing to change
 anything about the level at which the decision about whether to
 abbreviate or not was made; rather, I thought I was suggesting that we
 pass that flag down to the code that initializes the sortsupport
 object as an argument rather than through the sortsupport structure
 itself.

 The flag I'm talking about concerns the *applicability* of
 abbreviation, and not whether or not it will actually be used (maybe
 the opclass lacks support, or decides not to for some platform
 specific reason). Tuplesort has a contract with abbreviation +
 sortsupport that considers whether or not the function pointer used to
 abbreviate is set, which relates to whether or not abbreviation will
 *actually* be used. Note that for non-abbreviation-applicable
 attributes, btsortsupport_worker() never sets the function pointer
 (nor, incidentally, does it set the other abbreviation related
 function pointers in the struct).

Right, and what I'm saying is that maybe the applicability flag
shouldn't be stored in the SortSupport object, but passed down as an
argument.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-12-02 Thread Peter Geoghegan
On Tue, Dec 2, 2014 at 2:21 PM, Robert Haas robertmh...@gmail.com wrote:
 Right, and what I'm saying is that maybe the applicability flag
 shouldn't be stored in the SortSupport object, but passed down as an
 argument.

But then how does that information get to any given sortsupport
routine? That's the place that really needs to know if abbreviation is
useful. In general, they're only passed a SortSupport object. Are you
suggesting revising the signature required of SortSupport routines to
add that extra flag as an additional argument?

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-12-02 Thread Tom Lane
Peter Geoghegan p...@heroku.com writes:
 On Tue, Dec 2, 2014 at 2:21 PM, Robert Haas robertmh...@gmail.com wrote:
 Right, and what I'm saying is that maybe the applicability flag
 shouldn't be stored in the SortSupport object, but passed down as an
 argument.

 But then how does that information get to any given sortsupport
 routine? That's the place that really needs to know if abbreviation is
 useful. In general, they're only passed a SortSupport object. Are you
 suggesting revising the signature required of SortSupport routines to
 add that extra flag as an additional argument?

I think that is what he's suggesting, and I too am wondering why it's
a good idea.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nitpicky doc corrections for BRIN functions of pageinspect

2014-12-02 Thread Michael Paquier
On Wed, Dec 3, 2014 at 12:23 AM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Peter Geoghegan wrote:
 On Wed, Nov 19, 2014 at 8:02 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
  Just a small thing I noticed while looking at pageinspect.sgml, the
  set of SQL examples related to BRIN indexes uses lower-case characters
  for reserved keywords. This has been introduced by 7516f52.
  Patch is attached.

 My nitpicky observation about pageinspect + BRIN was that the comments
 added to the pageinspect SQL file for the SQL-callable function
 brin_revmap_data() have a comment header style slightly inconsistent
 with the existing items.

 Pushed.
Thanks.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-12-02 Thread Michael Paquier
On Wed, Dec 3, 2014 at 2:17 AM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Nov 26, 2014 at 11:00 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 On Wed, Nov 26, 2014 at 8:27 PM, Syed, Rahila rahila.s...@nttdata.com 
 wrote:
 Don't we need to initialize doPageCompression  similar to doPageWrites in 
 InitXLOGAccess?
 Yep, you're right. I missed this code path.

 Also , in the earlier patches compression was set 'on' even when fpw GUC is 
 'off'. This was to facilitate compression of FPW which are forcibly written 
 even when fpw GUC is turned off.
  doPageCompression in this patch is set to true only if value of fpw GUC is 
 'compress'. I think it's better to compress forcibly written full page 
 writes.
 Meh? (stealing a famous quote).
 This is backward-incompatible in the fact that forcibly-written FPWs
 would be compressed all the time, even if FPW is set to off. The
 documentation of the previous patches also mentioned that images are
 compressed only if this parameter value is switched to compress.

 If we have a separate GUC to determine whether to do compression of
 full page writes, then it seems like that parameter ought to apply
 regardless of WHY we are doing full page writes, which might be either
 that full_pages_writes=on in general, or that we've temporarily turned
 them on for the duration of a full backup.

In the latest versions of the patch, control of compression is done
within full_page_writes by assigning a new value 'compress'. Something
that I am scared of is that if we enforce compression when
full_page_writes is off for forcibly-written pages and if a bug shows
up in the compression/decompression algorithm at some point (that's
unlikely to happen as this has been used for years with toast but
let's say if), we may corrupt a lot of backups. Hence: why not simply
have a new GUC parameter to fully control it? First versions of the
patch did that, and ISTM that it is better than enforcing the use of a
new feature on our user base.

Now, something that has not been mentioned on this thread is to make
compression the default behavior in all cases, so that we wouldn't even
need a GUC parameter. We are usually conservative about changing
default behaviors so I don't really think that's the way to go. Just
mentioning the possibility.

Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How about a option to disable autovacuum cancellation on lock conflict?

2014-12-02 Thread Jim Nasby

On 12/2/14, 2:22 PM, Jeff Janes wrote:

Or maybe I overestimate how hard it would be to make vacuum restartable.  You 
would have to save a massive amount of state (up to a maintenance_work_mem tid 
list, plus the block you left off at in both the table and all of the indexes in 
that table), and you would somehow have to validate that saved state against any 
changes that might have occurred to the table or the indexes while it was saved 
and you were not holding the lock, which seems like it would be almost as full of 
corner cases as weakening the lock in the first place.  Aren't they logically 
the same thing?  If we could drop the lock and take it up again later, maybe 
the answer is not to save the state, but just to pause the vacuum until the 
lock becomes free again, in effect saving the state in situ.  That would allow 
the autovac worker to be held hostage by anyone taking a lock, though.


Yeah, rather than messing with any of that, I think it would make a lot more 
sense to split vacuum into smaller operations that don't require such a huge 
chunk of time.


The only easy way to do it that I see is to have it only stop at the end of an 
index-cleaning cycle, which probably takes too long to block for.  Or record a 
restart point at the end of each index-cleaning cycle, and then when it yields 
the lock it abandons all work since the last cycle end, rather than since the 
beginning.  That would be better than what we have, but seems like a far cry 
from actual restarting from any point.


Now that's not a bad idea. This would basically mean just saving a block number 
in pg_class after every intermediate index clean and then setting that back to 
zero when we're done with that relation, right?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [v9.5] Custom Plan API

2014-12-02 Thread Kouhei Kaigai
 -Original Message-
 From: Simon Riggs [mailto:si...@2ndquadrant.com]
 Sent: Thursday, November 27, 2014 8:48 PM
 To: Kaigai Kouhei(海外 浩平)
 Cc: Robert Haas; Thom Brown; Kohei KaiGai; Tom Lane; Alvaro Herrera; Shigeru
 Hanada; Stephen Frost; Andres Freund; PgHacker; Jim Mlodgenski; Peter
 Eisentraut
 Subject: Re: [HACKERS] [v9.5] Custom Plan API
 
 On 27 November 2014 at 10:33, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 
  The reason why the documentation portion was not yet committed is, sorry,
  the quality of the documentation from the standpoint of a native
  English speaker.
  Now, I'm writing up the documentation according to the latest code
  base; please wait for several days and help to improve it.
 
 Happy to help with that.
 
 Please post to the Wiki first so we can edit it communally.
 
Simon, 

I tried to describe how custom-scan provider interact with the core backend,
and expectations to individual callbacks here.

  https://wiki.postgresql.org/wiki/CustomScanInterface

I'd like to see which kind of description should be added, from a third
person's viewpoint.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Many processes blocked at ProcArrayLock

2014-12-02 Thread Michael Paquier
On Tue, Dec 2, 2014 at 5:07 PM, Xiaoyulei xiaoyu...@huawei.com wrote:
 Test configuration:
 Hardware:
 4P intel server, 60 core 120 hard thread.
 Memory:512G
 SSD:2.4T

 PG:
 max_connections = 160   # (change requires restart)
 shared_buffers = 32GB
 work_mem = 128MB
 maintenance_work_mem = 32MB
 bgwriter_delay = 100ms  # 10-1ms between rounds
 bgwriter_lru_maxpages = 200 # 0-1000 max buffers written/round
 bgwriter_lru_multiplier = 2.0   # 0-10.0 multipler on buffers 
 scanned/round
 wal_level = minimal # minimal, archive, or hot_standby
 wal_buffers = 256MB # min 32kB, -1 sets based on 
 shared_buffers
 autovacuum = off
 checkpoint_timeout=60min
 checkpoint_segments = 1000
 archive_mode = off
 synchronous_commit = off
 fsync = off
 full_page_writes = off


 We use tpcc and pgbench to test postgresql 9.4beat2 performance. And we found 
 the tps/tpmc could not increase with the terminal increase. The detail 
 information is in attachment.

 Many processes is blocked, I dump the call stack, and found these processes 
 is blocked at: ProcArrayLock. 60% processes is blocked in 
 ProcArrayEndTransaction with ProcArrayLock EXCLUSIVE, 20% is in 
 GetSnapshotData with ProcArrayLock SHARED. Others locks like XLogFlush and 
 WALInsertLock are not very heavy.

 Is there any way we solve this problem?
Providing complete backtraces showing in which code paths those
processes are blocked would help in better understanding what may be
going on.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-12-02 Thread Peter Geoghegan
On Tue, Dec 2, 2014 at 2:16 PM, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Dec 2, 2014 at 2:07 PM, Robert Haas robertmh...@gmail.com wrote:
 Well, maybe you should make the updates we've agreed on and I can take
 another look at it.

 Agreed.

Attached, revised patchset makes these updates. I continue to use the
sortsupport struct to convey that a given attribute of a given sort is
abbreviation-applicable (although the field is now a bool, not an
enum).

-- 
Peter Geoghegan
From 865a1abeb6bfb601b1ec605afb1e339c0e444e10 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan p...@heroku.com
Date: Sun, 9 Nov 2014 14:38:44 -0800
Subject: [PATCH 2/2] Estimate total number of rows to be sorted

Sortsupport opclasses now accept a row hint, indicating the estimated
number of rows to be sorted.  This gives opclasses a sense of proportion
about how far along the copying of tuples is when considering aborting
abbreviation.

Estimates come from various sources.  The text opclass now always avoids
aborting abbreviation if the total number of rows to be sorted is high
enough, without considering cardinality at all.
---
 src/backend/access/nbtree/nbtree.c |  5 ++-
 src/backend/access/nbtree/nbtsort.c| 14 +-
 src/backend/commands/cluster.c |  4 +-
 src/backend/executor/nodeAgg.c |  5 ++-
 src/backend/executor/nodeSort.c|  1 +
 src/backend/utils/adt/orderedsetaggs.c |  2 +-
 src/backend/utils/adt/varlena.c| 80 --
 src/backend/utils/sort/tuplesort.c | 14 --
 src/include/access/nbtree.h|  2 +-
 src/include/utils/sortsupport.h|  7 ++-
 src/include/utils/tuplesort.h  |  6 +--
 11 files changed, 121 insertions(+), 19 deletions(-)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index d881525..d26c60b 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -109,14 +109,15 @@ btbuild(PG_FUNCTION_ARGS)
 		elog(ERROR, index \%s\ already contains data,
 			 RelationGetRelationName(index));
 
-	buildstate.spool = _bt_spoolinit(heap, index, indexInfo->ii_Unique, false);
+	buildstate.spool = _bt_spoolinit(heap, index, indexInfo->ii_Unique,
+	 indexInfo->ii_Predicate != NIL, false);
 
 	/*
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.
 	 */
 	if (indexInfo->ii_Unique)
-		buildstate.spool2 = _bt_spoolinit(heap, index, false, true);
+		buildstate.spool2 = _bt_spoolinit(heap, index, false, true, true);
 
 	/* do the heap scan */
 	reltuples = IndexBuildHeapScan(heap, index, indexInfo, true,
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 593571b..473ac54 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -73,6 +73,7 @@
 #include "storage/smgr.h"
 #include "tcop/tcopprot.h"
 #include "utils/rel.h"
+#include "utils/selfuncs.h"
 #include "utils/sortsupport.h"
 #include "utils/tuplesort.h"
 
@@ -149,10 +150,13 @@ static void _bt_load(BTWriteState *wstate,
  * create and initialize a spool structure
  */
 BTSpool *
-_bt_spoolinit(Relation heap, Relation index, bool isunique, bool isdead)
+_bt_spoolinit(Relation heap, Relation index, bool isunique, bool ispartial,
+			  bool isdead)
 {
 	BTSpool    *btspool = (BTSpool *) palloc0(sizeof(BTSpool));
 	int			btKbytes;
+	double		estRows;
+	float4		relTuples;
 
 	btspool->heap = heap;
 	btspool->index = index;
@@ -165,10 +169,16 @@ _bt_spoolinit(Relation heap, Relation index, bool isunique, bool isdead)
 	 * unique index actually requires two BTSpool objects.  We expect that the
 	 * second one (for dead tuples) won't get very full, so we give it only
 	 * work_mem.
+	 *
+	 * Certain cases will always have a relTuples of 0, such as reindexing as
+	 * part of a CLUSTER operation, or when reindexing toast tables.  This is
+	 * interpreted as no estimate available.
 	 */
 	btKbytes = isdead ? work_mem : maintenance_work_mem;
+	relTuples = RelationGetForm(heap)->reltuples;
+	estRows =  relTuples * (isdead || ispartial ?  DEFAULT_INEQ_SEL : 1);
 	btspool->sortstate = tuplesort_begin_index_btree(heap, index, isunique,
-	 btKbytes, false);
+	 btKbytes, estRows, false);
 
 	return btspool;
 }
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index bc5f33f..8e5f536 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -890,7 +890,9 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	/* Set up sorting if wanted */
 	if (use_sort)
 		tuplesort = tuplesort_begin_cluster(oldTupDesc, OldIndex,
-			maintenance_work_mem, false);
+			maintenance_work_mem,
+			RelationGetForm(OldHeap)->reltuples,
+			false);
 	else
 		tuplesort = NULL;
 
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 89de755..95143c3 100644
--- 

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-12-02 Thread Peter Geoghegan
On Tue, Dec 2, 2014 at 5:28 PM, Peter Geoghegan p...@heroku.com wrote:
 Attached, revised patchset makes these updates.

Whoops. Missed some obsolete comments. Here is a third commit that
makes a further small modification to one comment.

-- 
Peter Geoghegan
From 8d1aba80f95e05742047cba5bd83d8f17aa5ef37 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan p...@heroku.com
Date: Tue, 2 Dec 2014 17:42:21 -0800
Subject: [PATCH 3/3] Alter comments to reflect current naming

---
 src/include/utils/sortsupport.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/include/utils/sortsupport.h b/src/include/utils/sortsupport.h
index 659233b..f7c73b3 100644
--- a/src/include/utils/sortsupport.h
+++ b/src/include/utils/sortsupport.h
@@ -127,9 +127,9 @@ typedef struct SortSupportData
 	 * Returning zero from the alternative comparator does not indicate
 	 * equality, as with a conventional support routine 1, though -- it
 	 * indicates that it wasn't possible to determine how the two abbreviated
-	 * values compared.  A proper comparison, using auth_comparator/
-	 * ApplySortComparatorFull() is therefore required.  In many cases this
-	 * results in most or all comparisons only using the cheap alternative
+	 * values compared.  A proper comparison, using abbrev_full_comparator/
+	 * ApplySortAbbrevFullComparator() is therefore required.  In many cases
+	 * this results in most or all comparisons only using the cheap alternative
 	 * comparison func, which is typically implemented as code that compiles to
 	 * just a few CPU instructions.  CPU cache miss penalties are expensive; to
 	 * get good overall performance, sort infrastructure must heavily weigh
-- 
1.9.1


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] About xmllint checking for the validity of postgres.xml in 9.5

2014-12-02 Thread Michael Paquier
Hi all,

Since commit 5d93ce2d, the output of xmllint is checked by passing
--valid to it. Isn't that a regression from what we were doing for
previous versions? For example, with 9.4 and older versions it is
possible to compile man pages even if the xml spec is not entirely
valid when using docbook 4.2.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] About xmllint checking for the validity of postgres.xml in 9.5

2014-12-02 Thread Michael Paquier
On Wed, Dec 3, 2014 at 12:09 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 Since commit 5d93ce2d, the output of xmllint is checked by passing
 --valid to it. Isn't that a regression from what we were doing for
 previous versions? For example, with 9.4 and older versions it is
 possible to compile man pages even if the xml spec is not entirely
 valid when using docbook 4.2.

Another thing coming to my mind is why don't we simply have a variable
to pass flags to xmllint similarly to xsltproc? Packagers would then
be free to pass the arguments they want. (Note that in some of the
environments where I build the docs postgres.xml is found as invalid,
making build fail for master only, not for older branches).

In any case, attached is a patch showing the idea, bringing more
flexibility in the build, default value being --valid --noout if the
flag is not passed by the caller.
Regards,
-- 
Michael
diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index 8bdd26c..99ce106 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -48,6 +48,10 @@ ifndef XMLLINT
 XMLLINT = $(missing) xmllint
 endif
 
+ifndef XMLLINTCFLAGS
+XMLLINTCFLAGS = --valid --noout
+endif
+
 ifndef XSLTPROC
 XSLTPROC = $(missing) xsltproc
 endif
@@ -82,7 +86,7 @@ override SPFLAGS += -wall -wno-unused-param -wno-empty -wfully-tagged
 man distprep-man: man-stamp
 
 man-stamp: stylesheet-man.xsl postgres.xml
-	$(XMLLINT) --noout --valid postgres.xml
+	$(XMLLINT) $(XMLLINTCFLAGS) postgres.xml
 	$(XSLTPROC) $(XSLTPROCFLAGS) $(XSLTPROC_MAN_FLAGS) $^
 	touch $@
 
@@ -259,13 +263,13 @@ endif
 xslthtml: xslthtml-stamp
 
 xslthtml-stamp: stylesheet.xsl postgres.xml
-	$(XMLLINT) --noout --valid postgres.xml
+	$(XMLLINT) $(XMLLINTCFLAGS) postgres.xml
 	$(XSLTPROC) $(XSLTPROCFLAGS) $(XSLTPROC_HTML_FLAGS) $^
 	cp $(srcdir)/stylesheet.css html/
 	touch $@
 
 htmlhelp: stylesheet-hh.xsl postgres.xml
-	$(XMLLINT) --noout --valid postgres.xml
+	$(XMLLINT) $(XMLLINTCFLAGS) postgres.xml
 	$(XSLTPROC) $(XSLTPROCFLAGS) $^
 
 %-A4.fo.tmp: stylesheet-fo.xsl %.xml
@@ -287,7 +291,7 @@ FOP = fop
 
 epub: postgres.epub
 postgres.epub: postgres.xml
-	$(XMLLINT) --noout --valid $<
+	$(XMLLINT) $(XMLLINTCFLAGS) $<
 	$(DBTOEPUB) $<
 
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-12-02 Thread Robert Haas
On Tue, Dec 2, 2014 at 7:16 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 In the latest versions of the patch, control of compression is done
 within full_page_writes by assigning a new value 'compress'. Something
 that I am scared of is that if we enforce compression when
 full_page_writes is off for forcibly-written pages and if a bug shows
 up in the compression/decompression algorithm at some point (that's
 unlikely to happen as this has been used for years with toast but
 let's say if), we may corrupt a lot of backups. Hence why not simply
 having a new GUC parameter to fully control it. First versions of the
 patch did that but ISTM that it is better than enforcing the use of a
 new feature for our user base.

That's a very valid concern.  But maybe it shows that
full_page_writes=compress is not the Right Way To Do It, because then
there's no way for the user to choose the behavior they want when
full_page_writes=off but yet a backup is in progress.  If we had a
separate GUC, we could know the user's actual intention, instead of
guessing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-12-02 Thread Michael Paquier
On Wed, Dec 3, 2014 at 12:35 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Dec 2, 2014 at 7:16 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 In the latest versions of the patch, control of compression is done
 within full_page_writes by assigning a new value 'compress'. Something
 that I am scared of is that if we enforce compression when
 full_page_writes is off for forcibly-written pages and if a bug shows
 up in the compression/decompression algorithm at some point (that's
 unlikely to happen as this has been used for years with toast but
 let's say if), we may corrupt a lot of backups. Hence why not simply
 having a new GUC parameter to fully control it. First versions of the
 patch did that but ISTM that it is better than enforcing the use of a
 new feature for our user base.

 That's a very valid concern.  But maybe it shows that
 full_page_writes=compress is not the Right Way To Do It, because then
 there's no way for the user to choose the behavior they want when
 full_page_writes=off but yet a backup is in progress.  If we had a
 separate GUC, we could know the user's actual intention, instead of
 guessing.
Note that implementing a separate parameter for this patch would not
be very complicated if the core portion does not change much. What
about the long name full_page_compression or the longer name
full_page_writes_compression?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-02 Thread Amit Langote

Hi Robert,

From: Robert Haas [mailto:robertmh...@gmail.com]
  * Catalog schema:
 
  CREATE TABLE pg_catalog.pg_partitioned_rel
  (
     partrelid    oid         NOT NULL,
     partkind     oid         NOT NULL,
     partissub    bool        NOT NULL,
     partkey      int2vector  NOT NULL, -- partitioning attributes
     partopclass  oidvector,

     PRIMARY KEY (partrelid, partissub),
     FOREIGN KEY (partrelid)   REFERENCES pg_class (oid),
     FOREIGN KEY (partopclass) REFERENCES pg_opclass (oid)
  )
  WITHOUT OIDS ;
 
 So, we're going to support exactly two levels of partitioning?
 partitions with partissub=false and subpartitions with partissub=true?
  Why not support only one level of partitioning here but then let the
 children have their own pg_partitioned_rel entries if they are
 subpartitioned?  That seems like a cleaner design and lets us support
 an arbitrary number of partitioning levels if we ever need them.
 

Yeah, that's what I thought at some point in favour of dropping partissub 
altogether. 

However, not that this design solves it, but there is one question - would we 
want to support defining both a partition key and a sub-partition key for a 
table in advance?  That is, without having defined a first-level partition yet; 
in that case, what level do we associate the sub-(sub-)partitioning key with, 
or, more to the point, where do we keep it?  One way is to replace partissub by 
partkeylevel, with level 0 being the topmost-level partitioning key and so on, 
while keeping partrelid equal to the pg_class.oid of the parent.  That brings 
us to the next question of managing hierarchies in pg_partition_def 
corresponding to partkeylevel in the definition of the topmost partitioned 
relation.  But I guess those are implementation details rather than 
representational, unless I am being too naïve.

  CREATE TABLE pg_catalog.pg_partition_def
  (
     partitionid          oid   NOT NULL,
     partitionparentrel   oid   NOT NULL,
     partitionisoverflow  bool  NOT NULL,
     partitionvalues      anyarray,

     PRIMARY KEY (partitionid),
     FOREIGN KEY (partitionid) REFERENCES pg_class(oid)
  )
  WITHOUT OIDS;
 
  ALTER TABLE pg_catalog.pg_class ADD COLUMN relispartitioned;
 
 What is an overflow partition and why do we want that?
 

That would be a default partition. That is, where the tuples that don't belong 
elsewhere (other defined partitions) go. VALUES clause of the definition for 
such a partition would look like:

(a range partition) ... VALUES LESS THAN MAXVALUE 
(a list partition) ... VALUES DEFAULT

There has been discussion about whether there should be such a place for 
tuples to go at all. That is, whether it should generate an error if a tuple 
can't go anywhere (or support auto-creating a new one, like in interval 
partitioning?)

 What are you going to do if the partitioning key has two columns of
 different data types?
 

Sorry, this totally eluded me. Perhaps, the 'values' needs some more thought. 
They are one of the most crucial elements of the scheme.

I wonder if your suggestion of pg_node_tree plays well here. This then could be 
a list of CONSTs or some such... And I am thinking it's a concern only for 
range partitions, no? (that is, a multicolumn partition key)

I think partkind switches the interpretation of the field as appropriate. Am I 
missing something? By the way, I had mentioned we could have two values fields 
each for range and list partition kind.

  * DDL syntax (no multi-column partitioning, sub-partitioning support as 
  yet):
 
  -- create partitioned table and child partitions at once.
  CREATE TABLE parent (...)
  PARTITION BY [ RANGE | LIST ] (key_column) [ opclass ]
  [ (
   PARTITION child
 {
 VALUES LESS THAN { ... | MAXVALUE } -- for RANGE
   | VALUES [ IN ] ( { ... | DEFAULT } ) -- for LIST
 }
 [ WITH ( ... ) ] [ TABLESPACE tbs ]
   [, ...]
) ] ;
 
 How are you going to dump and restore this, bearing in mind that you
 have to preserve a bunch of OIDs across pg_upgrade?  What if somebody
 wants to do pg_dump --table name_of_a_partition?
 

Assuming everything (including the partitioned relation and partitions at all 
levels) has got a pg_class entry of its own, would OIDs be a problem? Or what 
would be the nature of the problem, if indeed there is one?

If someone pg_dump's an individual partition as a table, we could let it be 
dumped as just a plain table. I am thinking we should be able to do that or 
should be doing just that (?)

 I actually think it will be much cleaner to declare the parent first
 and then have separate CREATE TABLE statements that glue the children
 in, like CREATE TABLE child PARTITION OF parent VALUES LESS THAN (1,
 1).
 

Oh, do you mean to do away with any syntax for defining partitions in 
CREATE TABLE parent?

By the way, do you mean the following:

CREATE TABLE child PARTITION OF parent VALUES LESS THAN (1, 1)


Re: [HACKERS] using Core Foundation locale functions

2014-12-02 Thread Noah Misch
On Fri, Nov 28, 2014 at 11:43:28AM -0500, Peter Eisentraut wrote:
 In light of the recent discussions about using ICU on OS X, I looked
 into the Core Foundation locale functions (Core Foundation = traditional
 Mac API in OS X, as opposed to the Unix/POSIX APIs).
 
 Attached is a proof of concept patch that just about works for the
 sorting aspects.  (The ctype aspects aren't there yet and will crash,
 but they could be done similarly.)  It passes an appropriately adjusted
 collate.linux.utf8 test, meaning that it does produce language-aware
 sort orders that are equivalent to what glibc produces.
 
 At the moment, this is probably just an experiment that shows where
 refactoring and better abstractions might be suitable if we want to
 support multiple locale libraries.  If we want to pursue ICU, I think
 this could be a useful third option.

Does this make the backend multi-threaded?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] using Core Foundation locale functions

2014-12-02 Thread Craig Ringer
On 12/02/2014 12:52 AM, David E. Wheeler wrote:
 Gotta say, I’m thrilled to see movement on this front, and especially pleased 
 to see how consensus seems to be building around an abstracted interface to 
 keep options open. This platform-specific example really highlights the need 
 for it (I had no idea that there was separate and more up-to-date collation 
 support in Core Foundation than in the UNIX layer of OS X).

It'd also potentially let us make use of Windows' native locale APIs,
which AFAIK receive considerably more love on that platform than their
POSIX back-country cousins.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2014-12-02 Thread Kouhei Kaigai
 On Tue, Nov 25, 2014 at 3:44 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
  Today, I had a talk with Hanada-san to clarify which can be a common
  portion of them and how to implement it. Then, we concluded both of
  features can be shared most of the infrastructure.
  Let me put an introduction of join replacement by foreign-/custom-scan
 below.
 
  Its overall design intends to inject foreign-/custom-scan node instead
  of the built-in join logic (based on the estimated cost). From the
  viewpoint of core backend, it looks like a sub-query scan that
  contains relations join internally.
 
  What we need to do is below:
 
  (1) Add a hook add_paths_to_joinrel()
  It gives extensions (including FDW drivers and custom-scan providers)
  chance to add alternative paths towards a particular join of
  relations, using ForeignScanPath or CustomScanPath, if it can run instead
 of the built-in ones.
 
  (2) Informs the core backend varno/varattno mapping One thing we need
  to pay attention is, foreign-/custom-scan node that performs instead
  of the built-in join node must return mixture of values come from both
  relations. In case when FDW driver fetch a remote record (also, fetch
  a record computed by external computing resource), the most reasonable
  way is to store it on ecxt_scantuple of ExprContext, then kicks
  projection with varnode that references this slot.
  It needs an infrastructure that tracks relationship between original
  varnode and the alternative varno/varattno. We thought, it shall be
  mapped to INDEX_VAR and a virtual attribute number to reference
  ecxt_scantuple naturally, and this infrastructure is quite helpful for
 both of ForegnScan/CustomScan.
  We'd like to add List *fdw_varmap/*custom_varmap variable to both of plan
 nodes.
  It contains list of the original Var node that shall be mapped on the
  position according to the list index. (e.g, the first varnode is
  varno=INDEX_VAR and
  varattno=1)
 
  (3) Reverse mapping on EXPLAIN
  For EXPLAIN support, above varnode on the pseudo relation scan needed
  to be solved. All we need to do is initialization of dpns-inner_tlist
  on
  set_deparse_planstate() according to the above mapping.
 
  (4) case of scanrelid == 0
  To skip open/close (foreign) tables, we need to have a mark to
  introduce the backend not to initialize the scan node according to
  table definition, but according to the pseudo varnodes list.
  As earlier custom-scan patch doing, scanrelid == 0 is a
  straightforward mark to show the scan node is not combined with a
 particular real relation.
  So, it also need to add special case handling around foreign-/custom-scan
 code.
 
  We expect above changes are enough small to implement basic join
  push-down functionality (that does not involves external computing of
  complicated expression node), but valuable to support in v9.5.
 
  Please comment on the proposition above.
 
 I don't really have any technical comments on this design right at the moment,
 but I think it's an important area where PostgreSQL needs to make some
 progress sooner rather than later, so I hope that we can get something
 committed in time for 9.5.
 
I tried to implement the interface portion, as attached.
Hanada-san may be developing postgres_fdw based on this interface
definition towards the next commit fest.

The overall design of this patch is identical to what I described above.
It intends to allow extensions (FDW drivers or custom-scan providers) to
replace a join by a foreign/custom-scan which internally returns the result
set of a join of relations computed externally. It looks like a relation scan
on a pseudo relation.

One thing we need to pay attention to is how setrefs.c fixes up varno/varattno,
unlike for a regular join structure. I found that IndexOnlyScan already has
similar infrastructure that redirects references of a varnode to a certain
column on ecxt_scantuple of ExprContext, using a pair of INDEX_VAR and an
alternative varattno.

This patch puts a new field, fdw_ps_tlist, on ForeignScan, and
custom_ps_tlist on CustomScan. It is the extension's role to set a pseudo-
scan target list (hence, ps_tlist) for the foreign/custom-scan that replaced
a join.
If it is not NIL, set_plan_refs() takes another strategy to fix them up.
It calls fix_upper_expr() to map varnodes of the expression list onto INDEX_VAR
according to the ps_tlist; the extension is then expected to put values/isnull
pairs on the scan state's ss_ScanTupleSlot according to the ps_tlist
constructed beforehand.

Regarding the primary hook to add an alternative foreign/custom-scan
path instead of the built-in join paths, I added the following hook in
add_paths_to_joinrel().

  /* Hook for plugins to get control in add_paths_to_joinrel() */
  typedef void (*set_join_pathlist_hook_type) (PlannerInfo *root,
   RelOptInfo *joinrel,
   RelOptInfo *outerrel,
   RelOptInfo *innerrel,

Re: [HACKERS] using Core Foundation locale functions

2014-12-02 Thread Peter Geoghegan
On Tue, Dec 2, 2014 at 10:07 PM, Craig Ringer cr...@2ndquadrant.com wrote:
 On 12/02/2014 12:52 AM, David E. Wheeler wrote:
 Gotta say, I’m thrilled to see movement on this front, and especially 
 pleased to see how consensus seems to be building around an abstracted 
 interface to keep options open. This platform-specific example really 
 highlights the need for it (I had no idea that there was separate and more 
 up-to-date collation support in Core Foundation than in the UNIX layer of OS 
 X).

 It'd also potentially let us make use of Windows' native locale APIs,
 which AFAIK receive considerably more love on that platform than their
 POSIX back-country cousins.

Not to mention the fact that a MultiByteToWideChar() call could be
saved, and sortsupport for text would just work on Windows.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-12-02 Thread Kouhei Kaigai
 On Fri, Nov 14, 2014 at 02:51:32PM +1300, David Rowley wrote:
  Likely for most aggregates, like count, sum, max, min, bit_and and
  bit_or the merge function would be the same as the transition
  function, as the state type is just the same as the input type. It
  would only be aggregates like avg(), stddev*(), bool_and() and
  bool_or() that would need a new merge function made... These would be
  no more complex than the transition functions... Which are just a few
 lines of code anyway.
 
  We'd simply just not run parallel query if any aggregates used in the
  query didn't have a merge function.
 
  When I mentioned this, I didn't mean to appear to be placing a road
 block. I was just bringing to the table the information that COUNT(*) +
  COUNT(*) works ok for merging COUNT(*)'s sub totals, but AVG(n) + AVG(n)
 does not.
 
 Sorry, late reply, but, FYI, I don't think our percentile functions can
 be parallelized in the same way:
 
   test=> \daS *percent*
                                                List of aggregate functions
    Schema   |      Name       |  Result data type  |             Argument data types              |             Description
  -----------+-----------------+--------------------+----------------------------------------------+-------------------------------------
  pg_catalog | percent_rank    | double precision   | VARIADIC "any" ORDER BY VARIADIC "any"       | fractional rank of hypothetical row
  pg_catalog | percentile_cont | double precision   | double precision ORDER BY double precision   | continuous distribution percentile
  pg_catalog | percentile_cont | double precision[] | double precision[] ORDER BY double precision | multiple continuous percentiles
  pg_catalog | percentile_cont | interval           | double precision ORDER BY interval           | continuous distribution percentile
  pg_catalog | percentile_cont | interval[]         | double precision[] ORDER BY interval         | multiple continuous percentiles
  pg_catalog | percentile_disc | anyelement         | double precision ORDER BY anyelement         | discrete percentile
  pg_catalog | percentile_disc | anyarray           | double precision[] ORDER BY anyelement       | multiple discrete percentiles
 
Yep, these seem to me to be the type of aggregate function that is not
obvious to split into multiple partitions.
I think it is still valuable even if we can only push down the subset of
aggregate functions that is well known to the core planner.
For example, we know count(*) = sum(nrows), and we also know avg(X) can
be rewritten as an enhanced avg() that takes both nrows and the partial
sum of X. We can utilize this knowledge to break down aggregate
functions.
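
As a hedged illustration of that break-down (the names and types below are
made up for this sketch, not proposed catalog entries), avg(X) could carry an
(nrows, sum) state that a per-worker transition function updates and a merge
function combines, with the division deferred to a single final function:

  #include <stdint.h>

  typedef struct AvgState
  {
      int64_t nrows;   /* partial count(*) */
      double  sum;     /* partial sum(X)   */
  } AvgState;

  /* transition: fold one input value into a worker's partial state */
  static void
  avg_accum(AvgState *state, double x)
  {
      state->nrows += 1;
      state->sum += x;
  }

  /* merge: combine two partial states; this is the piece avg() lacks today */
  static AvgState
  avg_merge(AvgState a, AvgState b)
  {
      AvgState r = { a.nrows + b.nrows, a.sum + b.sum };

      return r;
  }

  /* final: count(*) = sum(nrows); avg(X) = total sum / total nrows */
  static double
  avg_final(AvgState s)
  {
      return (s.nrows > 0) ? s.sum / (double) s.nrows : 0.0;
  }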

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] printing table in asciidoc with psql

2014-12-02 Thread Michael Paquier
On Mon, Nov 17, 2014 at 7:48 AM, Pavel Stehule pavel.steh...@gmail.com wrote:
 2014-11-07 22:37 GMT+01:00 Alvaro Herrera alvhe...@2ndquadrant.com:


 I did \o /tmp/tst, then
 \dS
 create table eh | oh ();
 \dS

 and then filtered the output file to HTML.  The CREATE TABLE tag ended
 up in the same line as the title of the following table.  I think
  there's a newline missing somewhere.


 The good news is that the | in the table name was processed correctly;
 but I noticed that the table title is not using the escaped print, so I
 would imagine that if I put a | in the title, things would go wrong.
 (I don't know how to put titles on tables other than editing the
 hardcoded titles for \ commands in psql).

  Another thing is that spaces around the | seem gratuitous and produce
 bad results.  I tried select * from pg_class which results in a very
 wide table, and then the HTML output contains some cells with the values
 in the second line; this makes all rows taller than they must be,
 because some cells use the first line and other cells in the same row
 use the second line for the text.  I hand-edited the asciidoc and
 removed the spaces around | which makes the result nicer.  (Maybe
 removing the trailing space is enough.)


  I see the trailing spaces, but I don't see the described effect. Please, can you
  send a more specific test case?

This formatting problem is trivial to reproduce:
=# create table foo ();

CREATE TABLE
Time: 9.826 ms
=# \d

.List of relations
[options=header,cols=l,l,
l,l,frame=none]
|
^l| Schema ^l| Name ^l| Type ^l| Owner
| public | foo | table | ioltas
|


(1 row)


I just tested this patch, and yes, I agree with Alvaro that it would be
good to minimize the extra spaces around the table separators '|'. Now
we need to be careful as well, and I think that we should only remove
the spaces on the right of the separators, as cell values controlling
for example spans would otherwise result in incorrect output, stuff
like that:
5 2.2+^.^
9 2+

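As a hedged C sketch of the cell-output rule being discussed (this is not the
patch's code, and print_asciidoc_cell is an invented name): nothing is printed
to the right of the separator, literal pipes inside values are escaped, and a
single space is kept to the left of the next separator so that span-like
values such as the ones above are not misparsed.

  #include <stdio.h>

  /*
   * Emit one asciidoc table cell: the value follows the '|' separator with
   * no intervening space, literal '|' characters are escaped as \|, and one
   * space is kept before the next separator so that a value ending in
   * something like "2+" is not misread as a span specifier.
   */
  static void
  print_asciidoc_cell(FILE *fout, const char *value)
  {
      const char *p;

      fputc('|', fout);
      for (p = value; *p; p++)
      {
          if (*p == '|')
              fputc('\\', fout);
          fputc(*p, fout);
      }
      fputc(' ', fout);
  }
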
Also, something a bit surprising is that this format always produces
one extra newline for each command, for example in the case of a DDL:
=# create table foo ();

CREATE TABLE
I think that this extra newline should be removed as well, no?

This patch has been marked as Waiting on Author for a couple of
weeks, and the problems mentioned before have not been completely
addressed, hence marking this patch as returned with feedback. It
would be nice to see progress for the next CF.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] printing table in asciidoc with psql

2014-12-02 Thread Michael Paquier
On Wed, Dec 3, 2014 at 3:52 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 This patch has been marked as Waiting on Author for a couple of
 weeks, and the problems mentioned before have not been completely
 addressed, hence marking this patch as returned with feedback. It
 would be nice to see progress for the next CF.
Btw, here is a resource showing table formatting in asciidoc:
http://asciidoc.org/newtables.html
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Wait free LW_SHARED acquisition - v0.2

2014-12-02 Thread Michael Paquier
On Tue, Nov 18, 2014 at 12:33 AM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Nov 17, 2014 at 10:31 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
 On 2014-11-17 10:21:04 -0500, Robert Haas wrote:
 Andres, where are we with this patch?

 1. You're going to commit it, but haven't gotten around to it yet.

 2. You're going to modify it some more and repost, but haven't gotten
 around to it yet.

 3. You're willing to see it modified if somebody else does the work,
 but are out of time to spend on it yourself.

 4. Something else?

 I'm working on it. Amit had found a hang on PPC that I couldn't
 reproduce on x86. Since then I've reproduced it and I think yesterday I
 found the problem. Unfortunately it always took a couple hours to
 trigger...

 I've also made some, in my opinion, cleanups to the patch since
 then. Those have the nice side effect of making the size of struct
  LWLock smaller, but that wasn't actually the intended effect.

 I'll repost once I've verified the problem is fixed and I've updated all
 commentary.

 The current problem is that I seem to have found a problem that's also
 reproducible with master :(. After a couple of hours a
 pgbench -h /tmp -p 5440 scale3000 -M prepared -P 5 -c 180 -j 60 -T 2 -S
 against a
 -c max_connections=200 -c shared_buffers=4GB
 cluster seems to hang on PPC. With all the backends waiting in buffer
 mapping locks. I'm now making sure it's really master and not my patch
 causing the problem - it's just not trivial with 180 processes involved.

 Ah, OK.  Thanks for the update.
Ping? This patch has been in a stale state for a couple of weeks and is
still marked as waiting on author for this CF.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-12-02 Thread Michael Paquier
On Wed, Dec 3, 2014 at 3:23 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 On Fri, Nov 14, 2014 at 02:51:32PM +1300, David Rowley wrote:
  Likely for most aggregates, like count, sum, max, min, bit_and and
  bit_or the merge function would be the same as the transition
  function, as the state type is just the same as the input type. It
  would only be aggregates like avg(), stddev*(), bool_and() and
  bool_or() that would need a new merge function made... These would be
  no more complex than the transition functions... Which are just a few
 lines of code anyway.
 
  We'd simply just not run parallel query if any aggregates used in the
  query didn't have a merge function.
 
  When I mentioned this, I didn't mean to appear to be placing a road
  block. I was just bringing to the table the information that COUNT(*) +
  COUNT(*) works ok for merging COUNT(*)'s sub totals, but AVG(n) + AVG(n)
 does not.

 Sorry, late reply, but, FYI, I don't think our percentile functions can
 be parallelized in the same way:

   test= \daS *percent*
                                              List of aggregate functions
   Schema   |      Name       |  Result data type  |              Argument data types              |             Description
------------+-----------------+--------------------+-----------------------------------------------+--------------------------------------
 pg_catalog | percent_rank    | double precision   | VARIADIC any ORDER BY VARIADIC any            | fractional rank of hypothetical row
 pg_catalog | percentile_cont | double precision   | double precision ORDER BY double precision    | continuous distribution percentile
 pg_catalog | percentile_cont | double precision[] | double precision[] ORDER BY double precision  | multiple continuous percentiles
 pg_catalog | percentile_cont | interval           | double precision ORDER BY interval            | continuous distribution percentile
 pg_catalog | percentile_cont | interval[]         | double precision[] ORDER BY interval          | multiple continuous percentiles
 pg_catalog | percentile_disc | anyelement         | double precision ORDER BY anyelement          | discrete percentile
 pg_catalog | percentile_disc | anyarray           | double precision[] ORDER BY anyelement        | multiple discrete percentiles

 Yep, these seem to me to be the type of aggregate function that is not
 obvious to split into multiple partitions.
 I think it is still valuable even if we can only push down the subset of
 aggregate functions that is well known to the core planner.
 For example, we know count(*) = sum(nrows), and we also know avg(X) can
 be rewritten as an enhanced avg() that takes both nrows and the partial
 sum of X. We can utilize this knowledge to break down aggregate
 functions.
Postgres-XC (Postgres-XL) implemented such parallel aggregate logic
some time ago, using a set of sub-functions and a finalization
function to do the work.
My 2c.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers