Re: [HACKERS] On partitioning

2014-12-06 Thread Amit Kapila
On Fri, Dec 5, 2014 at 10:03 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Dec 2, 2014 at 10:43 PM, Amit Langote
 langote_amit...@lab.ntt.co.jp wrote:

  I wonder if your suggestion of pg_node_tree plays well here. This then
could be a list of CONSTs or some such... And I am thinking it's a concern
only for range partitions, no? (that is, a multicolumn partition key)

 I guess you could list or hash partition on multiple columns, too.

How would you distinguish values in a list partition for multiple
columns?  I mean for a range partition, we can be sure there will
be exactly one value for each column, but for a list partition there
could be multiple values per column and the number is not fixed for
each partition, so I think it will not be easy to support a multicolumn
partition key for list partitions.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] On partitioning

2014-12-06 Thread Amit Kapila
On Fri, Dec 5, 2014 at 10:12 PM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Dec 5, 2014 at 2:18 AM, Amit Kapila amit.kapil...@gmail.com
wrote:
  Do we really need to support dml or pg_dump for individual partitions?

 I think we do.  It's quite reasonable for a DBA (or developer or
 whatever) to want to dump all the data that's in a single partition;
 for example, maybe they have the table partitioned, but also spread
 across several servers.  When the data on one machine grows too big,
 they want to dump that partition, move it to a new machine, and drop
 the partition from the old machine.  That needs to be easy and
 efficient.

 More generally, with inheritance, I've seen the ability to reference
 individual inheritance children be a real life-saver on any number of
 occasions.  Now, a new partitioning system that is not as clunky as
 constraint exclusion will hopefully be fast enough that people don't
 need to do it very often any more.  But I would be really cautious
 about removing the option.  That is the equivalent of installing a new
 fire suppression system and then boarding up the emergency exit.
 Yeah, you *hope* the new fire suppression system is good enough that
 nobody will ever need to go out that way any more.  But if you're
 wrong, people will die, so getting rid of it isn't prudent.  The
 stakes are not quite so high here, but the principle is the same.


Sure, I am not saying we shouldn't provide any way to take a dump of
an individual partition, just that it should not be at the level of an
independent table.  Maybe something like --table table_name
--partition partition_name.

In general, I think we should try to avoid exposing that partitions are
individual tables, as that might hinder future enhancements in that
area (for example, if someone finds a different and better way to
arrange the partition data, then due to the currently exposed syntax,
we might feel blocked).

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]

2014-12-06 Thread Amit Kapila
On Mon, Dec 1, 2014 at 12:18 PM, Dilip kumar dilip.ku...@huawei.com wrote:

 On 24 November 2014 11:29, Amit Kapila Wrote,


I have verified that all previous comments are addressed and
the new version is much better than the previous one.


 here we are setting each target once and doing for all the tables..


Hmm, theoretically I think the new behaviour could lead to more I/O in
certain cases as compared to the existing behaviour.  The reason for the
extra I/O is that in the new behaviour, while doing Analyze for a particular
table at different targets, Analyze of a different table happens in between,
so the pages in shared buffers or the OS cache for that table need to be
reloaded again for the new target, whereas currently all stages of Analyze
for a particular table are done in one go, which means that each stage of
Analyze can benefit from the pages of the table loaded by the previous
stage.  If you agree, then we should try to avoid this change in the new
behaviour.


 Please provide you opinion.

I have a few questions regarding the function GetIdleSlot()

+ static int
+ GetIdleSlot(ParallelSlot *pSlot, int max_slot, const char *dbname,
+             const char *progname, bool completedb)
{
..
+     /*
+      * Some of the slot are free, Process the results for slots whichever
+      * are free
+      */
+
+     do
+     {
+         SetCancelConn(pSlot[0].connection);
+
+         i = select_loop(maxFd, slotset);
+
+         ResetCancelConn();
+
+         if (i < 0)
+         {
+             /*
+              * This can only happen if user has sent the cancel request using
+              * Ctrl+C, Cancel is handled by 0th slot, so fetch the error result.
+              */
+
+             GetQueryResult(pSlot[0].connection, dbname, progname,
+                            completedb);
+             return NO_SLOT;
+         }
+
+         Assert(i != 0);
+
+         for (i = 0; i < max_slot; i++)
+         {
+             if (!FD_ISSET(pSlot[i].sock, slotset))
+                 continue;
+
+             PQconsumeInput(pSlot[i].connection);
+
+             if (PQisBusy(pSlot[i].connection))
+                 continue;
+
+             pSlot[i].isFree = true;
+
+             if (!GetQueryResult(pSlot[i].connection, dbname, progname,
+                                 completedb))
+                 return NO_SLOT;
+
+             if (firstFree < 0)
+                 firstFree = i;
+         }
+     } while (firstFree < 0);
}

I wanted to understand what exactly the above loop is doing.

a.
first of all, the comment on top of it says "Some of the slot
are free, ..."; if some slot is free, then why do you want
to process the results?  (Do you mean to say that *none* of
the slots is free?)

b.
IIUC, you have called the function select_loop(maxFd, slotset)
to check whether a socket descriptor is readable; if so, then why
does the do..while loop always check the same maxFd?  Don't you
want to check different socket descriptors?  I am not sure
if I am missing something here.

c.
After checking the socket descriptor for maxFd, why do you want
to run the below for loop over all the slots?
for (i = 0; i < max_slot; i++)

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel Seq Scan

2014-12-06 Thread Stephen Frost
* Amit Kapila (amit.kapil...@gmail.com) wrote:
 1. As the patch currently stands, it just shares the relevant
 data (like relid, target list, block range each worker should
 perform on etc.) to the worker and then worker receives that
 data and form the planned statement which it will execute and
 send the results back to master backend.  So the question
 here is do you think it is reasonable or should we try to form
 the complete plan for each worker and then share the same
 and may be other information as well like range table entries
 which are required.   My personal gut feeling in this matter
 is that for long term it might be better to form the complete
 plan of each worker in master and share the same, however
 I think the current way as done in patch (okay that needs
 some improvement) is also not bad and quite easier to implement.

For my 2c, I'd like to see it support exactly what the SeqScan node
supports and then also what Foreign Scan supports.  That would mean we'd
then be able to push filtering down to the workers which would be great.
Even better would be figuring out how to parallelize an Append node
(perhaps only possible when the nodes underneath are all SeqScan or
ForeignScan nodes) since that would allow us to then parallelize the
work across multiple tables and remote servers.

One of the big reasons why I was asking about performance data is that,
today, we can't easily split a single relation across multiple i/o
channels.  Sure, we can use RAID and get the i/o channel that the table
sits on faster than a single disk and possibly fast enough that a single
CPU can't keep up, but that's not quite the same.  The historical
recommendation for Hadoop nodes is around one CPU per drive (of course,
it'll depend on workload, etc, etc, but still) and while there's still a
lot of testing, etc, to be done before we can be sure about the 'right'
answer for PG (and it'll also vary based on workload, etc), that strikes
me as a pretty reasonable rule-of-thumb to go on.

Of course, I'm aware that this won't be as easy to implement..

 2. Next question related to above is what should be the
 output of ExplainPlan, as currently worker is responsible
 for forming its own plan, Explain Plan is not able to show
 the detailed plan for each worker, is that okay?

I'm not entirely following this.  How can the worker be responsible for
its own plan when the information passed to it (per the above
paragraph..) is pretty minimal?  In general, I don't think we need to
have specifics like "this worker is going to do exactly X" because we
will eventually need some communication to happen between the worker and
the master process where the worker can ask for more work because it's
finished what it was tasked with and the master will need to give it
another chunk of work to do.  I don't think we want exactly what each
worker process will do to be fully formed at the outset because, even
with the best information available, given concurrent load on the
system, it's not going to be perfect and we'll end up starving workers.
The plan, as formed by the master, should be more along the lines of
"this is what I'm gonna have my workers do" along w/ how many workers,
etc, and then it goes and does it.  Perhaps for an 'explain analyze' we
return information about what workers actually *did* what, but that's a
whole different discussion.
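
To illustrate the kind of dynamic handout I mean (workers asking for the
next chunk instead of having everything pre-assigned), here is a minimal
standalone sketch using a shared block counter; the chunk size, the names
and the use of plain threads are purely illustrative and this is not what
the patch does (compile with -pthread):

#include <pthread.h>
#include <stdio.h>

#define TOTAL_BLOCKS    10000
#define CHUNK_BLOCKS    128
#define NUM_WORKERS     4

static pthread_mutex_t chunk_lock = PTHREAD_MUTEX_INITIALIZER;
static int  next_block = 0;     /* shared "work dispenser" */

/* Hand out the next range of blocks; returns 0 when the relation is done. */
static int
claim_chunk(int *start, int *end)
{
    int     got = 0;

    pthread_mutex_lock(&chunk_lock);
    if (next_block < TOTAL_BLOCKS)
    {
        *start = next_block;
        next_block += CHUNK_BLOCKS;
        if (next_block > TOTAL_BLOCKS)
            next_block = TOTAL_BLOCKS;
        *end = next_block;
        got = 1;
    }
    pthread_mutex_unlock(&chunk_lock);
    return got;
}

static void *
worker(void *arg)
{
    int     start, end, chunks = 0;

    while (claim_chunk(&start, &end))
        chunks++;               /* a real worker would scan blocks [start, end) */

    printf("worker %ld processed %d chunks\n", (long) arg, chunks);
    return NULL;
}

int
main(void)
{
    pthread_t   tid[NUM_WORKERS];
    long        i;

    for (i = 0; i < NUM_WORKERS; i++)
        pthread_create(&tid[i], NULL, worker, (void *) i);
    for (i = 0; i < NUM_WORKERS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}

The same idea carries over to background workers with the counter kept in
shared memory; a fast worker simply ends up claiming more chunks.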

 3. Some places where optimizations are possible:
 - Currently after getting the tuple from heap, it is deformed by
 worker and sent via message queue to master backend, master
 backend then forms the tuple and send it to upper layer which
 before sending it to frontend again deforms it via slot_getallattrs(slot).

If this is done as I was proposing above, we might be able to avoid
this, but I don't know that it's a huge issue either way..  The bigger
issue is getting the filtering pushed down.

 - Master backend currently receives the data from multiple workers
 serially.  We can optimize in a way that it can check other queues,
 if there is no data in current queue.

Yes, this is pretty critical.  In fact, it's one of the recommendations
I made previously about how to change the Append node to parallelize
Foreign Scan node work.
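
A rough standalone sketch of the round-robin, non-blocking draining I have
in mind; the toy arrays stand in for the per-worker tuple queues and
try_receive() for a non-blocking queue read (e.g. shm_mq_receive() with
nowait set), so none of this is the actual patch:

#include <stdbool.h>
#include <stdio.h>

#define NUM_WORKERS 3

/* Toy per-worker queues; -1 marks "no more data right now". */
static int  queue[NUM_WORKERS][4] = {
    {10, 11, -1},
    {20, -1},
    {30, 31, 32, -1}
};
static int  pos[NUM_WORKERS];

/* Non-blocking read: true if a value was available, false otherwise. */
static bool
try_receive(int worker, int *value)
{
    if (queue[worker][pos[worker]] < 0)
        return false;
    *value = queue[worker][pos[worker]++];
    return true;
}

int
main(void)
{
    int     current = 0;
    int     empty_in_a_row = 0;
    int     value;

    /*
     * If the current queue has nothing, move on to the next one instead of
     * blocking on it; stop once every queue reports empty in a row (a real
     * implementation would wait on a latch at that point).
     */
    while (empty_in_a_row < NUM_WORKERS)
    {
        if (try_receive(current, &value))
        {
            printf("got %d from worker %d\n", value, current);
            empty_in_a_row = 0;
        }
        else
            empty_in_a_row++;
        current = (current + 1) % NUM_WORKERS;
    }
    return 0;
}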

 - Master backend is just responsible for coordination among workers
 It shares the required information to workers and then fetch the
 data processed by each worker, by using some more logic, we might
 be able to make master backend also fetch data from heap rather than
 doing just co-ordination among workers.

I don't think this is really necessary...

 I think in all above places we can do some optimisation, however
 we can do that later as well, unless they hit the performance badly for
 cases which people care most.

I agree that we can improve the performance through various
optimizations later, but it's important to get the general structure and
design right or we'll end up having to reimplement a lot of it.

 4. Should parallel_seqscan_degree value be 

Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-12-06 Thread Michael Paquier
On Sat, Dec 6, 2014 at 12:17 AM, Andres Freund and...@2ndquadrant.com
wrote:

 On 2014-12-06 00:10:11 +0900, Michael Paquier wrote:
  On Sat, Dec 6, 2014 at 12:06 AM, Michael Paquier 
 michael.paqu...@gmail.com
  wrote:
 
  
  
  
   On Fri, Dec 5, 2014 at 11:10 PM, Rahila Syed rahilasyed...@gmail.com
   wrote:
  
    I attempted a quick review and could not come up with much except this
  
   +    /*
   +     * Calculate the amount of FPI data in the record. Each backup block
   +     * takes up BLCKSZ bytes, minus the hole length.
   +     *
   +     * XXX: We peek into xlogreader's private decoded backup blocks for the
   +     * hole_length. It doesn't seem worth it to add an accessor macro for
   +     * this.
   +     */
   +    fpi_len = 0;
   +    for (block_id = 0; block_id <= record->max_block_id; block_id++)
   +    {
   +        if (XLogRecHasCompressedBlockImage(record, block_id))
   +            fpi_len += BLCKSZ - record->blocks[block_id].compress_len;
  
  
   IIUC, fpi_len in case of compressed block image should be
  
    fpi_len = record->blocks[block_id].compress_len;
  
   Yep, true. Patches need a rebase btw as Heikki fixed a commit related
 to
   the stats of pg_xlogdump.
  
 
  In any case, any opinions to switch this patch as Ready for committer?

 Needing a rebase is an obvious conflict to that... But I guess some wider
 looks afterwards won't hurt.


Here are rebased versions, as patches 1 and 2, and I am also switching
the patch to "Ready for Committer". The important point to consider
for this patch is the use of an additional 2 bytes as a uint16 in the block
information structure to save the length of a compressed block, which may
be compressed without its hole to achieve a double level of compression
(image compressed without its hole). We could instead use a simple flag on
one or two bits, for example a bit taken from hole_length, but then we would
need to always compress images with their hole included, which is more
expensive as the compression takes more time.

Robert wrote:
 What I would suggest is instrument the backend with getrusage() at
 startup and shutdown and have it print the difference in user time and
 system time.  Then, run tests for a fixed number of transactions and
 see how the total CPU usage for the run differs.
That's a nice idea, which is done in patch 3 as a simple hack calling
getrusage twice, at the beginning of PostgresMain and before proc_exit,
calculating the time difference and logging it for each process (I also
used log_line_prefix with %p).
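
For anyone who wants to try the same measurement outside the backend, the
technique boils down to the following standalone sketch (illustration only,
not patch 3 itself):

#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Convert a struct timeval to seconds as a double. */
static double
tv_seconds(struct timeval tv)
{
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}

int
main(void)
{
    struct rusage start, end;
    volatile long dummy = 0;
    long        i;

    getrusage(RUSAGE_SELF, &start);     /* at process start */

    for (i = 0; i < 100000000L; i++)    /* stand-in for the real workload */
        dummy += i;

    getrusage(RUSAGE_SELF, &end);       /* just before exiting */

    printf("user diff: %f, system diff: %f\n",
           tv_seconds(end.ru_utime) - tv_seconds(start.ru_utime),
           tv_seconds(end.ru_stime) - tv_seconds(start.ru_stime));
    return 0;
}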

Then I just did a small test with a load of a pgbench-scale-100 database on
fresh instances:
1) Compression = on:
Stop LSN: 0/487E49B8
getrusage: proc 11163: LOG:  user diff: 63.071127, system diff: 10.898386
pg_xlogdump: FPI size: 122296653 [90.52%]
2) Compression = off
Stop LSN: 0/4E54EB88
Result: proc 11648: LOG:  user diff: 43.855212, system diff: 7.857965
pg_xlogdump: FPI size: 204359192 [94.10%]
And the CPU consumption is showing quite some difference... I'd expect as
well pglz_compress to show up high in a perf profile for this case (don't
have the time to do that now, but a perf record -a -g would be fine I
guess).
Regards,
-- 
Michael
From 63264c7bc1fc22c647b73f67a41224f6b6ad3150 Mon Sep 17 00:00:00 2001
From: Michael Paquier michael@otacoo.com
Date: Tue, 25 Nov 2014 14:05:59 +0900
Subject: [PATCH 1/3] Move pg_lzcompress.c to src/common

Exposing compression and decompression APIs of pglz makes possible its
use by extensions and contrib modules. pglz_decompress contained a call
to elog to emit an error message in case of corrupted data. This function
is changed to return a status code to let its callers return an error
instead. Compression function is changed similarly to make the whole set
consistent.
---
 src/backend/access/heap/tuptoaster.c  |  11 +-
 src/backend/utils/adt/Makefile|   4 +-
 src/backend/utils/adt/pg_lzcompress.c | 779 -
 src/common/Makefile   |   3 +-
 src/common/pg_lzcompress.c| 784 ++
 src/include/utils/pg_lzcompress.h |  19 +-
 src/tools/msvc/Mkvcbuild.pm   |   3 +-
 7 files changed, 813 insertions(+), 790 deletions(-)
 delete mode 100644 src/backend/utils/adt/pg_lzcompress.c
 create mode 100644 src/common/pg_lzcompress.c

diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index ce44bbd..9af456f 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -142,7 +142,8 @@ heap_tuple_untoast_attr(struct varlena * attr)
 
 			attr = (struct varlena *) palloc(PGLZ_RAW_SIZE(tmp) + VARHDRSZ);
 			SET_VARSIZE(attr, PGLZ_RAW_SIZE(tmp) + VARHDRSZ);
-			pglz_decompress(tmp, VARDATA(attr));
+			if (pglz_decompress(tmp, VARDATA(attr)) != PGLZ_OK)
+				elog(ERROR, "compressed data is corrupted");
 			pfree(tmp);
 		}
 	}
@@ -167,7 +168,8 @@ heap_tuple_untoast_attr(struct varlena * attr)
 
 		attr = (struct 

Re: [HACKERS] superuser() shortcuts

2014-12-06 Thread Peter Eisentraut
On 12/4/14 3:32 PM, Stephen Frost wrote:
 On reflection, this seemed odd because of how the code was written but
 perhaps it was intentional after all.  In general, superuser should be
 able to bypass permissions restrictions and I don't see why this case
 should be different.

 In general, I don't think we want to allow giving away of objects by
 unprivileged users.  We don't allow that to be done for tables and I'm
 surprised to hear that it's possible to give domains away.

 Superuser should be able to bypass the restriction, BUT the object given
 away by the superuser to an unprivileged user should NOT be able to be
 further given away by that unprivileged user.

Clearly, this issue is a bit more complex than a simple code cleanup.
So I'm going to set this patch as returned with feedback.

My suggestion for moving forward would be to define a general security
policy for the ALTER OWNER cases, and then fix those properly.

The changes for integrating the superuser check into the replication
role check should perhaps be tackled as part of a general refactoring of
capability checks.




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Role Attribute Bitmask Catalog Representation

2014-12-06 Thread Stephen Frost
Adam,

* Adam Brightwell (adam.brightw...@crunchydatasolutions.com) wrote:
 Attached is an updated patch.

Awesome, thanks!

 diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
 *** pg_extension_ownercheck(Oid ext_oid, Oid
 *** 5051,5102 
   }
   
   /*
 !  * Check whether specified role has CREATEROLE privilege (or is a superuser)
*
 !  * Note: roles do not have owners per se; instead we use this test in
 !  * places where an ownership-like permissions test is needed for a role.
 !  * Be sure to apply it to the role trying to do the operation, not the
 !  * role being operated on!  Also note that this generally should not be
 !  * considered enough privilege if the target role is a superuser.
 !  * (We don't handle that consideration here because we want to give a
 !  * separate error message for such cases, so the caller has to deal with it.)
*/

The comment above is pretty big and I don't think we want to completely
remove it.  Can you add it as another 'Note' in the 'has_role_attribute'
comment and reword it accordingly?

 *** AlterRole(AlterRoleStmt *stmt)
 *** 508,513 
 --- 512,518 
   DefElem*dvalidUntil = NULL;
   DefElem*dbypassRLS = NULL;
   Oid roleid;
 + RoleAttr attributes;

Whitespace issue that should be fixed: attributes should line up with
roleid.

 --- 1405,1421 
  FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, 
 active, xmin, catalog_xmin, restart_lsn)
LEFT JOIN pg_database d ON ((l.datoid = d.oid)));
   pg_roles| SELECT pg_authid.rolname,
 ! pg_check_role_attribute(pg_authid.oid, 'SUPERUSER'::text) AS rolsuper,
 ! pg_check_role_attribute(pg_authid.oid, 'INHERIT'::text) AS rolinherit,
 ! pg_check_role_attribute(pg_authid.oid, 'CREATEROLE'::text) AS rolcreaterole,
 ! pg_check_role_attribute(pg_authid.oid, 'CREATEDB'::text) AS rolcreatedb,
 ! pg_check_role_attribute(pg_authid.oid, 'CATUPDATE'::text) AS rolcatupdate,
 ! pg_check_role_attribute(pg_authid.oid, 'CANLOGIN'::text) AS rolcanlogin,
 ! pg_check_role_attribute(pg_authid.oid, 'REPLICATION'::text) AS rolreplication,
 ! pg_check_role_attribute(pg_authid.oid, 'BYPASSRLS'::text) AS rolbypassrls,
   pg_authid.rolconnlimit,
   ''::text AS rolpassword,
   pg_authid.rolvaliduntil,
   s.setconfig AS rolconfig,
   pg_authid.oid
  FROM (pg_authid

It occurs to me that in this case (and a few others..), we're doing a
lot of extra work: each call to pg_check_role_attribute() is doing a
lookup on the oid just to get back to what the rolattr value on this row
is.  How about a function which takes rolattr and the text
representation of the attribute and returns whether the bit is set in
that rolattr value?  Then you'd pass pg_authid.rolattr into the function
calls above instead of the role's oid.
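
A rough sketch of the core logic I have in mind, written as plain
standalone C for illustration; RoleAttr, the attribute names and the bit
assignments below are placeholders, not necessarily what the patch defines:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint64_t RoleAttr;      /* placeholder for the patch's bitmask type */

/* Placeholder bit assignments; the real values come from the patch. */
#define ROLE_ATTR_SUPERUSER     (1 << 0)
#define ROLE_ATTR_INHERIT       (1 << 1)
#define ROLE_ATTR_CREATEROLE    (1 << 2)
#define ROLE_ATTR_CREATEDB      (1 << 3)

/*
 * Check a bit directly against a caller-supplied bitmask, so the view can
 * pass pg_authid.rolattr instead of doing a per-call lookup on the role's
 * oid.
 */
static bool
role_attr_is_set(RoleAttr attributes, const char *attribute)
{
    if (strcmp(attribute, "SUPERUSER") == 0)
        return (attributes & ROLE_ATTR_SUPERUSER) != 0;
    if (strcmp(attribute, "INHERIT") == 0)
        return (attributes & ROLE_ATTR_INHERIT) != 0;
    if (strcmp(attribute, "CREATEROLE") == 0)
        return (attributes & ROLE_ATTR_CREATEROLE) != 0;
    if (strcmp(attribute, "CREATEDB") == 0)
        return (attributes & ROLE_ATTR_CREATEDB) != 0;
    return false;               /* unknown attribute name */
}

int
main(void)
{
    RoleAttr    rolattr = ROLE_ATTR_INHERIT | ROLE_ATTR_CREATEDB;

    printf("CREATEDB: %d, SUPERUSER: %d\n",
           role_attr_is_set(rolattr, "CREATEDB"),
           role_attr_is_set(rolattr, "SUPERUSER"));
    return 0;
}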

I don't see any changes to the regression test files, were they
forgotten in the patch?  I would think that at least the view definition
changes would require updates to the regression tests, though perhaps
nothing else.

Overall, I'm pretty happy with the patch and would suggest moving on to
writing up the documentation changes to go along with the code changes.
I'll continue to play around with it but it all seems pretty clean to
me and will allow us to easily add the additional role attributes being
discussed.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] [v9.5] Custom Plan API

2014-12-06 Thread Simon Riggs
On 27 November 2014 at 20:48, Simon Riggs si...@2ndquadrant.com wrote:
 On 27 November 2014 at 10:33, Kouhei Kaigai kai...@ak.jp.nec.com wrote:

 The reason why documentation portion was not yet committed is, sorry, it
 is due to quality of documentation from the standpoint of native English
 speaker.
 Now, I'm writing up a documentation stuff according to the latest code base,
 please wait for several days and help to improve.

 Happy to help with that.

 Please post to the Wiki first so we can edit it communally.

I've corrected a spelling mistake, but it reads OK at the moment.


 The example contrib module was not committed and I am advised no longer
 works.

 May I submit the contrib/ctidscan module again for an example?

 Yes please. We have other contrib modules that exist as tests, so this
 seems reasonable to me.

I can't improve the docs without the example code. Is that available now?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] inherit support for foreign tables

2014-12-06 Thread Noah Misch
On Thu, Dec 04, 2014 at 10:00:14AM +0530, Ashutosh Bapat wrote:
 On Thu, Dec 4, 2014 at 9:05 AM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp 
 wrote:
  (2014/12/03 19:35), Ashutosh Bapat wrote:
  On Tue, Dec 2, 2014 at 8:29 AM, Etsuro Fujita
  fujita.ets...@lab.ntt.co.jp mailto:fujita.ets...@lab.ntt.co.jp wrote:
   IIUC, even the transactions over the local and the *single* remote
  server are not guaranteed to be executed atomically in the current
  form.  It is possible that the remote transaction succeeds and the
  local one fails, for example, resulting in data inconsistency
  between the local and the remote.
 
 
   IIUC, while committing transactions involving a single remote server,
  the steps taken are as follows
  1. the local changes are brought to PRE-COMMIT stage, which means that
  the transaction *will* succeed locally after successful completion of
  this phase,
  2. COMMIT message is sent to the foreign server
  3. If step two succeeds, local changes are committed and successful
  commit is conveyed to the client
  4. if step two fails, local changes are rolled back and abort status is
  conveyed to the client
  5. If step 1 itself fails, the remote changes are rolled back.
  This is as per one phase commit protocol which guarantees ACID for
  single foreign data source. So, the changes involving local and a single
  foreign server seem to be atomic and consistent.
 
 
  Really?  Maybe I'm missing something, but I don't think the current
  implementation for committing transactions has such a mechanism stated in
  step 1.  So, I think it's possible that the local transaction fails in
  step3 while the remote transaction succeeds, as mentioned above.
 
 
 PFA a script attached which shows this. You may want to check the code in
 pgfdw_xact_callback() for actions taken by postgres_fdw on various events.
 CommitTransaction() for how those events are generated. The code there
 complies with the sequence above.

While postgres_fdw delays remote commit to eliminate many causes for having
one server commit while another aborts, other causes remain.  The local
transaction could still fail due to WAL I/O problems or a system crash.  2PC
is the reliable answer, but that was explicitly out of scope for the initial
postgres_fdw write support.  Does this inheritance patch add any atomicity
problem that goes away when one breaks up the inheritance hierarchy and
UPDATEs each table separately?  If not, this limitation is okay.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Role Attribute Bitmask Catalog Representation

2014-12-06 Thread Adam Brightwell
Stephen,

The comment above is pretty big and I don't think we want to completely
 remove it.  Can you add it as another 'Note' in the 'has_role_attribute'
 comment and reword it accordingly?


Ok, agreed.  Will address.

Whitespace issue that should be fixed- attributes should line up with
 roleid.


Ok.  Will address.

It occurs to me that in this case (and a few others..), we're doing a
 lot of extra work- each call to pg_check_role_attribute() is doing a
 lookup on the oid just to get back to what the rolattr value on this row
 is.  How about a function which takes rolattr and the text
 representation of the attribute and returns if the bit is set for that
 rolattr value?  Then you'd pass pg_authid.rolattr into the function
 calls above instead of the role's oid.


Makes sense, I'll put that together.


 I don't see any changes to the regression test files, were they
 forgotten in the patch?  I would think that at least the view definition
 changes would require updates to the regression tests, though perhaps
 nothing else.


Hmmm... :-/ The regression tests that changed were in
'src/test/regress/expected/rules.out' and should be near the bottom of the
patch.


 Overall, I'm pretty happy with the patch and would suggest moving on to
 writing up the documentation changes to go along with the code changes.
 I'll continue to play around with it but it all seems pretty clean to
 me and will allow us to easily add the additional role attributes being
 discussed.


Sounds good.  I'll start on those changes next.

Thanks,
Adam


-- 
Adam Brightwell - adam.brightw...@crunchydatasolutions.com
Database Engineer - www.crunchydatasolutions.com


Re: [HACKERS] Role Attribute Bitmask Catalog Representation

2014-12-06 Thread Stephen Frost
* Adam Brightwell (adam.brightw...@crunchydatasolutions.com) wrote:
  I don't see any changes to the regression test files, were they
  forgotten in the patch?  I would think that at least the view definition
  changes would require updates to the regression tests, though perhaps
  nothing else.
 
 Hmmm... :-/ The regression tests that changed were in
 'src/test/regress/expected/rules.out' and should be near the bottom of the
 patch.

Hah, looked just like changes to the system_views, sorry for the
confusion. :)

  Overall, I'm pretty happy with the patch and would suggest moving on to
  writing up the documentation changes to go along with the code changes.
  I'll continue to play around with it but it all seems pretty clean to
  me and will allow us to easily add the additional role attributes being
  discussed.
 
 Sounds good.  I'll start on those changes next.

Great!

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] inherit support for foreign tables

2014-12-06 Thread David Fetter
On Thu, Dec 04, 2014 at 12:35:54PM +0900, Etsuro Fujita wrote:
 (2014/12/03 19:35), Ashutosh Bapat wrote:
 On Tue, Dec 2, 2014 at 8:29 AM, Etsuro Fujita
 fujita.ets...@lab.ntt.co.jp mailto:fujita.ets...@lab.ntt.co.jp wrote:
 
 This is not exactly extension of non-inheritance case. non-inheritance
 case doesn't show two remote SQLs under the same plan node. May be you
 can rename the label Remote SQL as Remote UPDATE/INSERT/DELETE (or
 something to that effect) for the DML command and the Foreign plan node
 should be renamed to Foreign access node or something to indicate that
 it does both the scan as well as DML. I am not keen about the actual
 terminology, but I think a reader of plan shouldn't get confused.
 
 We can leave this for committer's judgement.
 
 Thanks for the proposal!  I think that would be a good idea.  But I think
 there would be another idea.  An example will be shown below.  We show the
 update commands below the ModifyTable node, not above the corresponding
 ForeignScan nodes, so maybe less confusing.  If there are no objections of
 you and others, I'll update the patch this way.
 
 postgres=# explain verbose update parent set a = a * 2 where a = 5;
  QUERY PLAN
 -
  Update on public.parent  (cost=0.00..280.77 rows=25 width=10)
On public.ft1
  Remote SQL: UPDATE public.mytable_1 SET a = $2 WHERE ctid = $1
   ^^
It occurs to me that the command generated by the FDW might well not
be SQL at all, as is the case with file_fdw and anything else that
talks to a NoSQL engine.

Would it be reasonable to call this "Remote command" or something
similarly generic?

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] excessive amounts of consumed memory (RSS), triggering OOM killer

2014-12-06 Thread Tomas Vondra
On 2.12.2014 02:52, Tom Lane wrote:
 Tomas Vondra t...@fuzzy.cz writes:
 On 2.12.2014 01:33, Tom Lane wrote:
 What I suspect you're looking at here is the detritus of creation
 of a huge number of memory contexts. mcxt.c keeps its own state
 about existing contents in TopMemoryContext. So, if we posit that
 those contexts weren't real small, there's certainly room to
 believe that there was some major memory bloat going on recently.
 
 Aha! MemoryContextCreate allocates the memory for the new context
 from TopMemoryContext explicitly, so that it survives resets of the
 parent context. Is that what you had in mind by keeping state about
 existing contexts?
 
 Right.
 
 That'd probably explain the TopMemoryContext size, because
 array_agg() creates separate context for each group. So if you have
  1M groups, you have 1M contexts. Although I don't see how the size
 of those contexts would matter?
 
 Well, if they're each 6K, that's your 6GB right there.

FWIW, I tried to repeat the test with the array_agg() patch, submitted
to the next CF:

  http://www.postgresql.org/message-id/5479faea.3070...@fuzzy.cz

and the memory context stats after the first iteration look like this:

master (unpatched)
--
TopMemoryContext: 204878128 total in 25011 blocks; 204010192 free
(750034 chunks); 867936 used

patched with array_agg
--
TopMemoryContext: 78128 total in 11 blocks; 10144 free (34 chunks);
67984 used

Also, the RSS issue (which was the initial subject of this thread)
pretty much disappeared. So indeed, this seems to be caused by the
'islands' referenced by TopMemoryContext.

I wonder whether it'd be worth looking for places doing something
similar to array_agg(), potentially creating many contexts in parallel,
and maybe modifying them as in the patch.

regards
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [v9.5] Custom Plan API

2014-12-06 Thread Kouhei Kaigai
Simon,

  Yes please. We have other contrib modules that exist as tests, so this
  seems reasonable to me.
 
 I can't improve the docs without the example code. Is that available now?

Please wait for a few days. The ctidscan module is not adjusted for the
latest interface yet.

--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com


 -Original Message-
 From: Simon Riggs [mailto:si...@2ndquadrant.com]
 Sent: Sunday, December 07, 2014 12:37 AM
 To: Kaigai Kouhei(海外 浩平)
 Cc: Robert Haas; Thom Brown; Kohei KaiGai; Tom Lane; Alvaro Herrera; Shigeru
 Hanada; Stephen Frost; Andres Freund; PgHacker; Jim Mlodgenski; Peter
 Eisentraut
 Subject: Re: [HACKERS] [v9.5] Custom Plan API
 
 On 27 November 2014 at 20:48, Simon Riggs si...@2ndquadrant.com wrote:
  On 27 November 2014 at 10:33, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 
  The reason why documentation portion was not yet committed is, sorry,
  it is due to quality of documentation from the standpoint of native
  English speaker.
  Now, I'm writing up a documentation stuff according to the latest
  code base, please wait for several days and help to improve.
 
  Happy to help with that.
 
  Please post to the Wiki first so we can edit it communally.
 
  I've corrected a spelling mistake, but it reads OK at the moment.
 
 
  The example contrib module was not committed and I am advised no
  longer works.
 
  May I submit the contrib/ctidscan module again for an example?
 
  Yes please. We have other contrib modules that exist as tests, so this
  seems reasonable to me.
 
 I can't improve the docs without the example code. Is that available now?
 
 --
  Simon Riggs   http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] PATCH: adaptive ndistinct estimator v3 (WAS: Re: [PERFORM] Yet another abort-early plan disaster on 9.3)

2014-12-06 Thread Tomas Vondra
Hi!

This was initially posted to pgsql-performance in this thread:

  http://www.postgresql.org/message-id/5472416c.3080...@fuzzy.cz

but pgsql-hackers seems like a more appropriate place for further
discussion.

Anyways, attached is v3 of the patch implementing the adaptive ndistinct
estimator. Just like the previous version, the original estimate is the
one stored/used, and the alternative one is just printed, to make it
possible to compare the results.

Changes in this version:

1) implementing compute_minimal_stats

   - So far only the 'scalar' (more common) case was handled.

   - The algorithm requires more detailed input data; the MCV-based
 stats are insufficient, so the code hashes the values and then
 determines the f1, f2, ..., fN coefficients by sorting and
 walking the array of hashes.

2) handling wide values properly (now are counted into f1)

3) compensating for NULL values when calling optimize_estimate

   - The estimator has no notion of NULL values, so it's necessary to
 remove them both from the total number of rows, and sampled rows.

4) some minor fixes and refactorings


I also repeated the tests comparing the results to the current estimator
- full results are at the end of the post.

The one interesting case is the 'step skew' with statistics_target=10,
i.e. estimates based on mere 3000 rows. In that case, the adaptive
estimator significantly overestimates:

values   current    adaptive
--------------------------------
106   99 107
1068 6449190
1006  38 6449190
10006327   42441

I don't know why I didn't get these errors in the previous runs, because
when I repeat the tests with the old patches I get similar results with
a 'good' result from time to time. Apparently I had a lucky day back
then :-/

I've been messing with the code for a few hours, and I haven't found any
significant error in the implementation, so it seems that the estimator
does not perform terribly well for very small samples (in this case
3000 rows out of 10.000.000, i.e. ~0.03%).

However, I've been able to come up with a simple way to limit such
errors, because the number of distinct values is naturally bounded by

(totalrows / samplerows) * ndistinct

where ndistinct is the number of distinct values in the sample.  This
bound corresponds to the extreme case where every slice of samplerows
rows contributes a completely different set of ndistinct values.

BTW, this also fixes the issue reported by Jeff Janes on 21/11.
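
In code, the sanity check is nothing more than clamping the estimate to
that bound (standalone sketch with illustrative names, not the patch
itself):

#include <stdio.h>

/*
 * The sample cannot justify more distinct values than "every slice of
 * samplerows rows contributes a completely different set of values", so
 * cap the estimate at (totalrows / samplerows) * sample_ndistinct.
 */
static double
clamp_ndistinct(double estimate, double totalrows, double samplerows,
                double sample_ndistinct)
{
    double      upper_bound = (totalrows / samplerows) * sample_ndistinct;

    return (estimate > upper_bound) ? upper_bound : estimate;
}

int
main(void)
{
    /* e.g. a 3000-row sample of a 10M-row table with 100 distinct values seen */
    printf("%.0f\n", clamp_ndistinct(6449190.0, 10000000.0, 3000.0, 100.0));
    return 0;
}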

With this additional sanity check, the results look like this:

values   current    adaptive
--------------------------------
106   99 116
1068   23331
1006  38   96657
10006327   12400

Which is much better, but clearly still a bit on the high side.

So either the estimator really is a bit unstable for such small samples
(it tends to overestimate a bit in all the tests), or there's a bug in
the implementation - I'd be grateful if someone could peek at the code
and maybe compare it to the paper describing the estimator. I've spent a
fair amount of time analyzing it, but found nothing.

But maybe the estimator really is unstable for such small samples - in
that case we could probably use the current estimator as a fallback.
After all, this only happens when someone explicitly decreases the
statistics target to 10 - with the default statistics target it's damn
accurate.

kind regards
Tomas


statistics_target = 10
==

a) smooth skew, 101 values, different skew ('k')

   - defaults to the current estimator

b) smooth skew, 10.001 values, different skew ('k')

k    current  adaptive
-------------------------
1  10231 11259
2   6327  8543
3   4364  7707
4   3436  7052
5   2725  5868
6   2223  5071
7   1979  5011
8   1802  5017
9   1581  4546

c) step skew (different numbers of values)

values   current    adaptive
--------------------------------
106   99 107
1068 6449190
1006  38 6449190
10006327   42441

   patched:

values   current    adaptive
--------------------------------
106   99 116
1068   23331
1006  38   96657
10006327   12400


statistics_target = 100
===

a) smooth skew, 101 values, different skew ('k')

   - defaults to the current estimator

b) smooth skew, 10.001 values, different skew ('k')

k      current   adaptive
---------------------------
1        10011      10655
2         9641      10944
3         8837      10846
4         8315      10992
5         7654      10760
6         7162      10524
7         6650      10375
8         6268      10275
9         5871

[HACKERS] PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching

2014-12-06 Thread Tomas Vondra
Hi,

back when we were discussing the hashjoin patches (now committed),
Robert proposed that maybe it'd be a good idea to sometimes increase the
number of tuples per bucket instead of batching.

That is, while initially sizing the hash table - if the hash table with
enough buckets to satisfy the NTUP_PER_BUCKET load factor does not fit into
work_mem, try a bit higher load factor before starting to batch.

Attached patch is an initial attempt to implement this - it's a bit
rough on the edges, but hopefully enough to judge the benefit of this.

The default load factor is 1. The patch tries to degrade this to 2, 4 or
8 in an attempt to fit the hash table into work_mem. If it doesn't work, it
starts batching with the default load factor. If the batching is
required while the hashjoin is running, the load factor is switched back
to the default one (if we're batching, there's no point in keeping the
slower hash table).
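
The sizing idea boils down to something like the following standalone
sketch (the constants and the space accounting are illustrative only, not
the patch's actual cost model):

#include <stdio.h>

#define WORK_MEM            (32L * 1024L * 1024L)   /* bytes, illustrative */
#define TUPLE_SIZE          48L                     /* bytes per stored tuple */
#define BUCKET_SIZE         8L                      /* bytes per bucket pointer */
#define MAX_NTUP_PER_BUCKET 8

/*
 * Try load factors 1, 2, 4, 8: fewer buckets means less bucket overhead,
 * so a slightly overloaded table may still fit into work_mem.  If even a
 * load factor of 8 does not fit, fall back to batching at the default
 * load factor.
 */
int
main(void)
{
    long        ntuples = 650000L;
    int         ntup_per_bucket;

    for (ntup_per_bucket = 1; ntup_per_bucket <= MAX_NTUP_PER_BUCKET;
         ntup_per_bucket *= 2)
    {
        long        nbuckets = (ntuples + ntup_per_bucket - 1) / ntup_per_bucket;
        long        space = ntuples * TUPLE_SIZE + nbuckets * BUCKET_SIZE;

        if (space <= WORK_MEM)
        {
            printf("fits with %d tuple(s) per bucket (%ld bytes)\n",
                   ntup_per_bucket, space);
            return 0;
        }
    }
    printf("does not fit even at %d per bucket: batch instead\n",
           MAX_NTUP_PER_BUCKET);
    return 0;
}

With these toy numbers the table only fits once the load factor is degraded
to 4; the real patch additionally reverts to the default load factor as
soon as batching becomes necessary anyway.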

The patch also modifies explain output, to show the load factor.

The testing I've done so far is rather disappointing, though.


create table a as select i from generate_series(1,100) s(i);
create table b as select i from generate_series(1,100) s(i);

analyze a;
analyze b;

select a.i, b.i from a join b on (a.i = b.i);

work_mem   batches   tuples per bucket   duration
--------------------------------------------------
64 MB      1         1                     585 ms
46 MB      1         2                     639 ms
43 MB      1         4                     794 ms
40 MB      1         8                    1075 ms
39 MB      2         1                     623 ms

So, even increasing the load factor to 2 is slower than batching.

Of course, with other examples the results may be different. For example
with a much larger outer table:

create table a as select mod(i,100) i
from generate_series(1,1000) s(i);
analyze a;

work_mem   batches   tuples per bucket   duration
--------------------------------------------------
64 MB      1         1                    3904 ms
46 MB      1         2                    4692 ms
43 MB      1         4                    6499 ms
40 MB      1         8                    9264 ms
39 MB      2         1                    4054 ms

Same results. Of course, for huge outer tables it will make a
difference. For example on a ~8GB outer table (on a machine with 8GB of
RAM), the batching causes enough I/O to make it slower than ntup=2 (50s
vs. 80s, for example).

I'm not sure what's the best way forward - clearly, for cases that fit
into RAM (temp files into page cache), batching is faster. For tables
large enough to cause a lot of I/O, it may make a difference - but I'm
not sure how to identify these cases.

So ISTM implementing this requires a reliable way to identify which case
we're dealing with - if the outer table is large enough (prefer
increasing load factor) or not (prefer batching). Until then keeping the
current simple/predictable approach is probably better.

Of course, this also depends on additional variables (e.g. is the system
memory-stressed?).

regards
Tomas
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 332f04a..06e612a 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1936,26 +1936,33 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
+			ExplainPropertyLong("Hash Tuples Per Bucket", hashtable->ntup_per_bucket, es);
+			ExplainPropertyLong("Original Tuples Per Bucket",
+								hashtable->ntup_per_bucket, es);
+			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
 		else if (hashtable->nbatch_original != hashtable->nbatch ||
-				 hashtable->nbuckets_original != hashtable->nbuckets)
+				 hashtable->nbuckets_original != hashtable->nbuckets ||
+				 hashtable->ntup_per_bucket_original != hashtable->ntup_per_bucket)
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-							 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Tuples: %d (originally %d)  Memory Usage: %ldkB\n",
 							 hashtable->nbuckets,
 							 hashtable->nbuckets_original,
 							 hashtable->nbatch,
 							 hashtable->nbatch_original,
+							 hashtable->ntup_per_bucket,
+							 hashtable->ntup_per_bucket_original,
 							 spacePeakKb);
 		}
 		else
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-						   "Buckets: %d  Batches: %d  Memory Usage: %ldkB\n",
+						   "Buckets: %d  Batches: %d  Tuples: %d  Memory Usage: %ldkB\n",
 							 hashtable->nbuckets, hashtable->nbatch,
-							 spacePeakKb);
+							 hashtable->ntup_per_bucket,

Re: [HACKERS] Testing DDL deparsing support

2014-12-06 Thread Bruce Momjian
On Tue, Dec  2, 2014 at 03:13:07PM -0300, Alvaro Herrera wrote:
 Robert Haas wrote:
  On Thu, Nov 27, 2014 at 11:43 PM, Ian Barwick i...@2ndquadrant.com wrote:
 
   A simple schedule to demonstrate this is available; execute from the
   src/test/regress/ directory like this:
  
   ./pg_regress \
 --temp-install=./tmp_check \
 --top-builddir=../../.. \
 --dlpath=. \
 --schedule=./schedule_ddl_deparse_demo
  
  I haven't read the code, but this concept seems good to me.
 
 Excellent, thanks.
 
  It has the unfortunate weakness that a difference could exist during
  the *middle* of the regression test run that is gone by the *end* of
  the run, but our existing pg_upgrade testing has the same weakness, so
  I guess we can view this as one more reason not to be too aggressive
  about having regression tests drop the unshared objects they create.
 
 Agreed.  Not dropping objects also helps test pg_dump itself; the normal
 procedure there is run the regression tests, then pg_dump the regression
 database.  Objects that are dropped never exercise their corresponding
 pg_dump support code, which I think is a bad thing.  I think we should
 institute a policy that regression tests must keep the objects they
 create; maybe not all of them, but at least a sample large enough to
 cover all interesting possibilities.

This ensures that creation DDL is checked if it is used in the regression
database, but what about ALTER and DROP?  pg_dump doesn't issue those,
except in special cases like inheritance.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers