Re: [HACKERS] Inaccuracy in VACUUM's tuple count estimates

2014-06-08 Thread Amit Kapila
On Sat, Jun 7, 2014 at 1:28 AM, Andres Freund wrote:
>
> On 2014-06-06 15:44:25 -0400, Tom Lane wrote:
> > I figured it'd be easy enough to get a better estimate by adding another
> > counter to count just LIVE and INSERT_IN_PROGRESS tuples (thus effectively
> > assuming that in-progress inserts and deletes will both commit).  I did
> > that, and found that it helped Tim's test case not at all :-(.  A bit of
> > sleuthing revealed that HeapTupleSatisfiesVacuum actually returns
> > INSERT_IN_PROGRESS for any tuple whose xmin isn't committed, regardless of
> > whether the transaction has since marked it for deletion:
> >
> >         /*
> >          * It'd be possible to discern between INSERT/DELETE in progress
> >          * here by looking at xmax - but that doesn't seem beneficial for
> >          * the majority of callers and even detrimental for some.  We'd
> >          * rather have callers look at/wait for xmin than xmax. It's
> >          * always correct to return INSERT_IN_PROGRESS because that's
> >          * what's happening from the view of other backends.
> >          */
> >         return HEAPTUPLE_INSERT_IN_PROGRESS;
>
> That's only the case since a couple of days ago. I really wasn't sure
> whether to go that way or discern the two cases. That changed in the wake
> of:
> http://www.postgresql.org/message-id/20140530143150.GA11051@localhost

Won't this change impact the calculation of the number of live
rows for analyze?  (acquire_sample_rows() considers
HEAPTUPLE_DELETE_IN_PROGRESS tuples as live rows for tuples
updated by transactions other than the current transaction.)
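
For reference, a minimal sketch of the counting rule in question
(simplified and illustrative; not the actual analyze.c source):

    #include <stdio.h>

    typedef enum
    {
        TUPLE_LIVE,
        TUPLE_DEAD,
        TUPLE_DELETE_IN_PROGRESS
    } SampleStatus;

    /* Delete-in-progress tuples deleted by some *other* transaction are
     * counted as live, on the assumption that the deleter might abort. */
    static void
    count_tuple(SampleStatus status, int deleted_by_me,
                double *liverows, double *deadrows)
    {
        switch (status)
        {
            case TUPLE_LIVE:
                *liverows += 1;
                break;
            case TUPLE_DELETE_IN_PROGRESS:
                if (deleted_by_me)
                    *deadrows += 1;
                else
                    *liverows += 1;
                break;
            case TUPLE_DEAD:
                *deadrows += 1;
                break;
        }
    }

    int main(void)
    {
        double  live = 0, dead = 0;

        count_tuple(TUPLE_DELETE_IN_PROGRESS, 0, &live, &dead);
        printf("live=%.0f dead=%.0f\n", live, dead);
        return 0;
    }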


Even if we think that the estimates are okay, the below comment
in acquire_sample_rows() doesn't seem to suggest it.


/*
 * We count delete-in-progress rows as still live, using
 * the same reasoning given above; but we don't bother to
 * include them in the sample.
 *
..
*/


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] "RETURNING PRIMARY KEY" syntax extension

2014-06-08 Thread Ian Barwick



On 09/06/14 14:47, David G Johnston wrote:

Ian Barwick wrote

Hi,

The JDBC API provides the getGeneratedKeys() method as a way of retrieving
primary key values without the need to explicitly specify the primary key
column(s). This is a widely-used feature; however, the implementation has
significant performance drawbacks.

Currently this feature is implemented in the JDBC driver by appending
"RETURNING *" to the supplied statement. However, this means all columns of
affected rows will be returned to the client, which causes significant
performance problems, particularly on wide tables. To mitigate this, it
would be desirable to enable the JDBC driver to request only the primary
key value(s).


Seems like a good idea.



  ERROR:  Relation does not have any primary key(s)


"Relation does not have a primary key."
or
"Relation has no primary key." (preferred)

By definition it cannot have more than one, so it must have none.


Ah yes, amazing what a fresh pair of eyes does :). The plural is
the vestige of an earlier iteration which said something about
the relation not having any primary key column(s).

Will fix, thanks.

Regards

Ian Barwick


--
 Ian Barwick   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "RETURNING PRIMARY KEY" syntax extension

2014-06-08 Thread David G Johnston
David G Johnston wrote
> 
> Ian Barwick wrote
>> Hi,
>> 
>> The JDBC API provides the getGeneratedKeys() method as a way of
>> retrieving
>> primary key values without the need to explicitly specify the primary key
>> column(s). This is a widely-used feature; however, the implementation has
>> significant performance drawbacks.
>> 
>> Currently this feature is implemented in the JDBC driver by appending
>> "RETURNING *" to the supplied statement. However, this means all columns
>> of affected rows will be returned to the client, which causes significant
>> performance problems, particularly on wide tables. To mitigate this, it
>> would be desirable to enable the JDBC driver to request only the primary
>> key value(s).
> Seems like a good idea.
>>  ERROR:  Relation does not have any primary key(s)
> "Relation does not have a primary key."
> or
> "Relation has no primary key." (preferred)
> 
> By definition it cannot have more than one, so it must have none.
> 
> It could have multiple unique constraints but I do not believe they are
> considered if not tagged as primary.

Also,

I did see where you account for auto-updatable views, but what about complex
views with INSTEAD OF triggers?

These can still be the target of DML queries but are not guaranteed (though
they may be able) to return a well-defined primary key.  At worst, an
explicit error about the view itself, not the apparent lack of a primary
key, should be emitted.

David J.



--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/RETURNING-PRIMARY-KEY-syntax-extension-tp5806462p5806464.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "RETURNING PRIMARY KEY" syntax extension

2014-06-08 Thread David G Johnston
Ian Barwick wrote
> Hi,
> 
> The JDBC API provides the getGeneratedKeys() method as a way of retrieving
> primary key values without the need to explicitly specify the primary key
> column(s). This is a widely-used feature; however, the implementation has
> significant performance drawbacks.
> 
> Currently this feature is implemented in the JDBC driver by appending
> "RETURNING *" to the supplied statement. However, this means all columns of
> affected rows will be returned to the client, which causes significant
> performance problems, particularly on wide tables. To mitigate this, it
> would be desirable to enable the JDBC driver to request only the primary
> key value(s).

Seems like a good idea.


>  ERROR:  Relation does not have any primary key(s)

"Relation does not have a primary key."
or
"Relation has no primary key." (preferred)

By definition it cannot have more than one, so it must have none.

It could have multiple unique constraints but I do not believe they are
considered if not tagged as primary.

David J.





--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/RETURNING-PRIMARY-KEY-syntax-extension-tp5806462p5806463.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] "RETURNING PRIMARY KEY" syntax extension

2014-06-08 Thread Ian Barwick

Hi,

The JDBC API provides the getGeneratedKeys() method as a way of retrieving
primary key values without the need to explicitly specify the primary key
column(s). This is a widely-used feature; however, the implementation has
significant performance drawbacks.

Currently this feature is implemented in the JDBC driver by appending
"RETURNING *" to the supplied statement. However, this means all columns of
affected rows will be returned to the client, which causes significant
performance problems, particularly on wide tables. To mitigate this, it would
be desirable to enable the JDBC driver to request only the primary key value(s).

One possible solution would be to have the driver request the primary key for
a table, but this could cause a race condition where the primary key could 
change,
and even if it does not, it would entail extra overhead.

A more elegant and universal solution, which would allow the JDBC driver to
request the primary key in a single request, would be to extend the RETURNING
clause syntax with the option PRIMARY KEY. This resolves during parse
analysis into the columns of the primary key, which can be done unambiguously
because the table is already locked by that point and the primary key cannot 
change.

A patch is attached which implements this, and will be added to the next
commitfest. A separate patch will be submitted to the JDBC project.
Example usage is shown below.


Regards

Ian Barwick

/* -- */
postgres=# CREATE TABLE foo (id SERIAL PRIMARY KEY);
CREATE TABLE

postgres=# INSERT INTO foo VALUES(DEFAULT) RETURNING PRIMARY KEY;
 id
----
  1
(1 row)

INSERT 0 1

postgres=# CREATE TABLE bar (id1 INT NOT NULL, id2 INT NOT NULL, PRIMARY KEY(id1, id2));
CREATE TABLE
postgres=# INSERT INTO bar VALUES(1,2) RETURNING PRIMARY KEY;
 id1 | id2
-----+-----
   1 |   2
(1 row)

INSERT 0 1

postgres=# INSERT INTO bar VALUES(2,1),(2,2) RETURNING PRIMARY KEY;
 id1 | id2
-----+-----
   2 |   1
   2 |   2
(2 rows)

INSERT 0 2

postgres=# CREATE TABLE no_pkey (id SERIAL NOT NULL);
CREATE TABLE
postgres=# INSERT INTO no_pkey VALUES(DEFAULT) RETURNING id;
 id
----
  1
(1 row)

INSERT 0 1
postgres=# INSERT INTO no_pkey VALUES(DEFAULT) RETURNING PRIMARY KEY;
ERROR:  Relation does not have any primary key(s)

/* -- */

--
 Ian Barwick   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/doc/src/sgml/ref/delete.sgml b/doc/src/sgml/ref/delete.sgml
index 74ea907..45295d1 100644
--- a/doc/src/sgml/ref/delete.sgml
+++ b/doc/src/sgml/ref/delete.sgml
@@ -25,7 +25,7 @@ PostgreSQL documentation
 DELETE FROM [ ONLY ] table_name [ * ] [ [ AS ] alias ]
 [ USING using_list ]
 [ WHERE condition | WHERE CURRENT OF cursor_name ]
-[ RETURNING * | output_expression [ [ AS ] output_name ] [, ...] ]
+[ RETURNING * | output_expression [ [ AS ] output_name ] [, ...] | PRIMARY KEY ]
 
  
 
@@ -182,6 +182,17 @@ DELETE FROM [ ONLY ] table_name [ *
  
 

+
+   
+PRIMARY KEY
+
+ 
+  Returns the table's primary key column(s) after each row is deleted.
+  Cannot be combined with an output_expression.
+ 
+
+   
+
   
  
 
@@ -208,7 +219,9 @@ DELETE count
clause, the result will be similar to that of a SELECT
statement containing the columns and values defined in the
RETURNING list, computed over the row(s) deleted by the
-   command.
+   command. PRIMARY KEY can be specified to return the
+   primary key value(s) for each deleted row. An error will be raised
+   if the table does not have a primary key.
   
  
 
diff --git a/doc/src/sgml/ref/insert.sgml b/doc/src/sgml/ref/insert.sgml
index a3cccb9..9fbd859 100644
--- a/doc/src/sgml/ref/insert.sgml
+++ b/doc/src/sgml/ref/insert.sgml
@@ -24,7 +24,7 @@ PostgreSQL documentation
 [ WITH [ RECURSIVE ] with_query [, ...] ]
 INSERT INTO table_name [ ( column_name [, ...] ) ]
 { DEFAULT VALUES | VALUES ( { expression | DEFAULT } [, ...] ) [, ...] | query }
-[ RETURNING * | output_expression [ [ AS ] output_name ] [, ...] ]
+[ RETURNING * | output_expression [ [ AS ] output_name ] [, ...] | PRIMARY KEY ]
 
  
 
@@ -65,7 +65,9 @@ INSERT INTO table_name [ ( RETURNING list is identical to that of the output list
-   of SELECT.
+   of SELECT. Alternatively, PRIMARY KEY will
+   return the  primary key value(s) for each inserted row. An error will
+   be raised if the table does not have a primary key.
   
 
   
@@ -186,6 +188,17 @@ INSERT INTO table_name [ ( 
 

+
+   
+PRIMARY KEY
+
+ 
+  Returns the table's primary key column(s) after each row is inserted.
+  Cannot be combined with an output_expression.
+ 
+
+   
+
   
  
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml

Re: [HACKERS] Scaling shared buffer eviction

2014-06-08 Thread Amit Kapila
On Sun, Jun 8, 2014 at 7:21 PM, Kevin Grittner  wrote:
> Amit Kapila  wrote:
>
> > I have improved the patch by making the following changes:
> > a. Improved the bgwriter logic to log the xl_running_xacts info and
> >    removed the hibernate logic, as bgwriter will now work only when
> >    there is a scarcity of buffers in the free list. The basic idea is
> >    that when the number of buffers on the freelist drops below the low
> >    threshold, the allocating backend sets the latch; bgwriter wakes up
> >    and begins adding buffers to the freelist until it reaches the high
> >    threshold, and then goes back to sleep.
>
> The numbers from your benchmarks are very exciting, but the above
> concerns me.  My tuning of the bgwriter in production has generally
> *not* been aimed at keeping pages on the freelist, but toward
> preventing shared_buffers from accumulating a lot of dirty pages,
> which were leading to cascades of writes between caches and thus to
> write stalls.  By pushing dirty pages into the (*much* larger) OS
> cache, and letting write combining happen there, where the OS could
> pace based on the total number of dirty pages instead of having
> some hidden and appearing rather suddenly, latency spikes were
> avoided while not causing any noticeable increase in the number of
> OS writes to the RAID controller's cache.
>
> Essentially I was able to tune the bgwriter so that a dirty page
> was always pushed out to the OS cache within three seconds, which led
> to a healthy balance of writes between the checkpoint process and
> the bgwriter.

I think it would be better if bgwriter did its writes based on the
number of buffers that get dirtied, to achieve the balance of
writes.

> Backend processes related to user connections still
> performed about 30% of the writes, and this work shows promise
> toward bringing that down, which would be great; but please don't
> eliminate the ability to prevent write stalls in the process.

I agree that for some cases, as you explained, the current bgwriter
logic does satisfy the need.  However, there are other cases where it
doesn't actually help much; one of those cases I am trying to improve
(easing backend buffer allocations), and another may be when there is
constant write activity, for which I am not sure how much it really
helps.  Part of the reason for trying to make bgwriter respond mainly
to easing backend allocations is the previous discussion on the same;
refer to the link below:

However, if we want to retain the current property of bgwriter, we can
do so in one of the ways below:
a. Have separate processes for writing dirty buffers and for moving
buffers to the freelist.
b. In the current bgwriter, separate the two tasks based on the need.
The need can be decided based on whether bgwriter has been woken due
to a shortage of buffers on the free list or due to BgWriterDelay.

As populating the freelist and balancing writes by writing dirty
buffers are two separate responsibilities, I am not sure whether
having one process do both is a good idea.
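
For clarity, a minimal self-contained sketch of the low/high watermark
protocol described above (illustrative only; the names and numbers are
mine, not the patch's):

    #include <stdio.h>

    #define LOW_WATERMARK   64      /* backend wakes bgwriter below this */
    #define HIGH_WATERMARK  256     /* bgwriter refills up to this */

    static int freelist_len = HIGH_WATERMARK;

    /* A backend takes a buffer from the freelist; returns 1 if the
     * bgwriter's latch should be set. */
    static int alloc_buffer(void)
    {
        freelist_len--;
        return freelist_len < LOW_WATERMARK;
    }

    /* The bgwriter moves reusable buffers to the freelist until the
     * high watermark is reached, then sleeps again. */
    static void bgwriter_refill(void)
    {
        while (freelist_len < HIGH_WATERMARK)
            freelist_len++;
    }

    int main(void)
    {
        int wakeups = 0;

        for (int i = 0; i < 1000; i++)
            if (alloc_buffer())
            {
                wakeups++;
                bgwriter_refill();
            }
        printf("bgwriter woken %d times\n", wakeups);
        return 0;
    }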

I am planning to collect some more performance data, part of which will
cover write loads as well, but I am not sure whether that can show the
need you mentioned.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] wrapping in extended mode doesn't work well with default pager

2014-06-08 Thread Noah Misch
On Fri, May 23, 2014 at 10:10:23AM -0400, Alvaro Herrera wrote:
> Sergey Muraviov wrote:
> > I found some new bugs and fixed them.
> > And I had to make many changes.
> 
> This version fixes some bugs I had noticed in expanded mode too.  For
> instance, the original looked like this (five lines plus header):
> 
> -[ RECORD 49 
> ]-+-
>  pg_identify_object | (rule,,,"""_RETURN"" on 
> pg_catalog.pg_available_extension_versions") 
> 
>  pg_identify_object | 
> (view,pg_catalog,pg_available_extension_versions,pg_catalog.pg_availabl
> e.
> |._extension_versions)
>  
> 
> 
> whereas it's correctly only three lines plus header with this patch
> applied.

I had noticed a similar-looking behavior change with aligned format and
expanded display, so I gave psql-wrapped-expanded-fix-v4.patch a spin with
this test case:

\pset expanded on
\pset format aligned
select repeat('a', 2) union all select repeat('a', 500);

The patch did not restore 9.3 behavior for that one.  Starting with commit
6513633, the first line of letters is space-padded on the right to the width
of the second line of letters.  To illustrate, I have attached raw psql output
from both commit 6513633 and its predecessor.  Also note that
psql-wrapped-expanded-fix-v4.patch expands each [ RECORD x ] header from 509
bytes to 511 bytes; 509 is the longstanding width.

Thanks,
nm

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com
Expanded display (expanded) is on.
Output format (format) is aligned.
-[ RECORD 1 
]
 repeat | aa





   
-[ RECORD 2 
]
 repeat | 

 

Expanded display (expanded) is on.
Output format (format) is aligned.
-[ RECORD 1 
]
repeat | aa
-[ RECORD 2 
]
repeat | 
aaa

Re: [HACKERS] updated emacs configuration

2014-06-08 Thread Noah Misch
On Wed, Aug 07, 2013 at 07:57:53AM -0400, Peter Eisentraut wrote:
> On 7/2/13 8:42 PM, Peter Eisentraut wrote:
> > Updated files with changes:
> > 
> > - adjusted fill-column to 78, per Noah
> > - added c-file-style, per Andrew
> > - support both "postgresql" and "postgres" directory names
> > - use defun instead of lambda, per Dimitri
> > - put Perl configuration back into emacs.samples, for Tom
> > 
> > I also added configuration of c-auto-align-backslashes as well as label
> > and statement-case-open to c-offsets-alist.  With those changes, the
> > result of indent-region is now very very close to pgindent, with the
> > main exception of the end-of-line de-indenting that pgindent does, which
> > nobody likes anyway.
> 
> Did anyone have any outstanding concerns about this latest version?  I
> thought it looked ready to commit.

After upgrading to GNU Emacs 23.4.1 from a version predating directory-local
variables, I saw switch/case indentation go on the fritz.  My hooks were
issuing (c-set-style "postgresql"), but ".dir-locals.el" set it back to plain
"bsd" style.  The most-reasonable fix I found was to locally add c-file-style
to ignored-local-variables.  c-file-style is the only setting appearing in
both emacs.samples and .dir-locals.el with non-identical values, so it alone
calls for this treatment.  Behavior looks sane in both Emacs 21.4.1 and Emacs
24.3.1, the version extrema I have at hand.  Emacs 24.3.1 is fine without this
due to recent c-before-hack-hook changes, but this does no harm.  Shall I add
this workaround to emacs.samples?

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com
diff --git a/src/tools/editors/emacs.samples b/src/tools/editors/emacs.samples
index e469d55..a510afd 100644
--- a/src/tools/editors/emacs.samples
+++ b/src/tools/editors/emacs.samples
@@ -30,7 +30,12 @@
 (add-hook 'c-mode-hook
   (defun postgresql-c-mode-hook ()
 (when (string-match "/postgres\\(ql\\)?/" buffer-file-name)
-  (c-set-style "postgresql"))))
+  (c-set-style "postgresql")
+  ;; Don't override the style we just set with the style in
+  ;; `dir-locals-file'.  Emacs 23.4.1 needs this; it is obsolete,
+  ;; albeit harmless, by Emacs 24.3.1.
+  (set (make-local-variable 'ignored-local-variables)
+   (append '(c-file-style) ignored-local-variables)))))
 
 
 ;;; Perl files

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] backup_label revisited

2014-06-08 Thread Noah Misch
On Wed, Jun 04, 2014 at 06:17:29PM +0100, Greg Stark wrote:
> On Mon, Jun 2, 2014 at 2:10 PM, Fujii Masao  wrote:
> > What about the case where we restore the backup to another server and
> > start the recovery? In this case, ISTM inode can be the same. No?
> 
> Hm, I was about to comment that that seems very unlikely. Even on the
> same server if you delete the old database root and then unpack a
> backup it could possibly reuse the exact same inode again. But it's
> really not likely to happen.
> 
> However in the brave new world of filesystem snapshots there is the
> possibility someone has "restored" a backup by opening one of their
> snapshots read-write. In which case the backup-label would have the
> same inode number. That would still be fine if the snapshot is
> consistent but if they have tablespaces and those tablespaces were
> snapshotted separately then it wouldn't be ok.
>
> I think that's a fatal flaw unless anyone can see a way to distinguish
> a filesystem from a snapshot of the filesystem.

Implementations of the "dump" program likely have that property of preserving
inode metadata while not promising a consistent snapshot.  If we keep such
backup methods fully supported, I agree with your conclusion.

PostgreSQL can't, in general, distinguish an almost-snapshot from a master
that crashed during a backup.  But the administrator can track the difference.
If you use pg_start_backup(), your init.d script that gets control after a
crash can have 'rm -f "$PGDATA"/backup_label'.  Use a different script to
recover hot backups.  Perhaps what ought to change is the documentation and
contrib/start-scripts?  Maybe even add a --definitely-not-a-backup postmaster
option, so scripts need not hardcode knowledge of a semi-internal fact like
the name of the file to delete.

Thanks,
nm

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MinGW/Cygwin build snags

2014-06-08 Thread Noah Misch
On Sun, Jun 08, 2014 at 06:04:46PM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > Since src/makefiles/Makefile.win32 does not set BUILDING_DLL for src/port,
> > PGDLLIMPORT is set improperly for code to be linked directly into the 
> > backend.
> > Makefile.win32 does set BUILDING_DLL for src/common.  (Makefile.cygwin has 
> > the
> > same discrepancy, though I haven't checked whether it causes an actual build
> > failure there.  The MSVC build system sets BUILDING_DLL for both src/port 
> > and
> > src/common files.)  This affects any reference to a data symbol from 
> > src/port.
> > The fix is straightforward enough: cause Makefile.{win32,cygwin} to treat
> > src/port like src/common.
> 
> I wonder whether these cases shouldn't distinguish between the "frontend"
> and "backend" builds of src/port/ and src/common/.  In particular, it
> seems odd that we're getting this type of failure in the backend build.

Good question; they need not distinguish.  !BUILDING_DLL means that the code
being compiled will access backend symbols through dynamic linking to
postgres.exe.  Server modules, such as plpgsql, build with !BUILDING_DLL, and
normal backend code builds with BUILDING_DLL.  The setting is irrelevant for
"frontend"/non-server code, which does not link to backend symbols at all.

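To illustrate the mechanism, the relevant Windows-only logic expands
roughly like this (a simplified sketch; the exact definition lives in
the PostgreSQL headers):

    /* Under BUILDING_DLL, backend data symbols are exported from
     * postgres.exe; otherwise they are imported into the module
     * being built. */
    #ifdef BUILDING_DLL
    #define PGDLLIMPORT __declspec(dllexport)
    #else
    #define PGDLLIMPORT __declspec(dllimport)
    #endif
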
> > Second, src/template/{win32,cygwin} completely replaces LDFLAGS, so 
> > overriding
> > LDFLAGS on the "configure" command line is ineffective.  Those scripts 
> > should
> > instead add to the existing LDFLAGS, like other templates do for CPPFLAGS.
> > Several other templates completely override CFLAGS; that's undesirable for 
> > the
> > same reason.  I don't have ready access to those affected configurations, so
> > I'm leaving the CFLAGS damage alone.
> 
> +1 for doing something about CFLAGS while we're at it.

Scratch that; configure.in has logic to discard template script CFLAGS changes
if the user had specified CFLAGS.

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] NUMA packaging and patch

2014-06-08 Thread Kevin Grittner
I ran into a situation where a machine with 4 NUMA memory nodes and
40 cores had performance problems due to NUMA.  The problems were
worst right after they rebooted the OS and warmed the cache by
running a script of queries to read all tables.  These were all run
on a single connection.  As it turned out, the size of the database
was just over one-quarter of the size of RAM, and with default NUMA
policies both the OS cache for the database and the PostgreSQL
shared memory allocation were placed on a single NUMA segment, so
access to the CPU package managing that segment became a
bottleneck.  On top of that, processes which happened to run on the
CPU package which had all the cached data had to allocate memory
for local use on more distant memory because there was none left in
the more local memory.

Through normal operations, things eventually tended to shift around
and get better (after several hours of heavy use with substandard
performance).  I ran some benchmarks and found that even in
long-running tests, spreading these allocations among the memory
segments showed about a 2% benefit in a read-only load.  The
biggest difference I saw in a long-running read-write load was
about a 20% hit for unbalanced allocations, but I only saw that
once.  I talked to someone at PGCon who managed to engineer much
worse performance hits for an unbalanced load, although the
circumstances were fairly artificial.  Still, fixing this seems
like something worth doing if further benchmarks confirm benefits
at this level.

By default, the OS cache and buffers are allocated in the memory
node with the shortest "distance" from the CPU a process is running
on.  This is determined by the "cpuset" associated with the
process which reads or writes the disk page.  Typically a NUMA
machine starts with a single cpuset with a policy specifying this
behavior.  Fixing this aspect of things seems like an issue for
packagers, although we should probably document it for those
running from their own source builds.

To set an alternate policy for PostgreSQL, you first need to find
or create the location for cpuset specification, which uses a
filesystem in a way similar to the /proc directory.  On a machine
with more than one memory node, the appropriate filesystem is
probably already mounted, although different distributions use
different filesystem names and mount locations.  I will illustrate
the process on my Ubuntu machine.  Even though it has only one
memory node (and so, this makes no difference), I have it handy at
the moment to confirm the commands as I put them into the email.

# Sysadmin must create the root cpuset if not already done.  (On a
# system with NUMA memory, this will probably already be mounted.)
# Location and options can vary by distro.

sudo mkdir /dev/cpuset
sudo mount -t cpuset none /dev/cpuset

# Sysadmin must create a cpuset for postgres and configure
# resources.  This will normally be all cores and all RAM.  This is
# where we specify that this cpuset will spread pages among its
# memory nodes.

sudo mkdir /dev/cpuset/postgres
sudo /bin/bash -c "echo 0-3 >/dev/cpuset/postgres/cpus"
sudo /bin/bash -c "echo 0 >/dev/cpuset/postgres/mems"
sudo /bin/bash -c "echo 1 >/dev/cpuset/postgres/memory_spread_page"

# Sysadmin must grant permissions to the desired setting(s).
# This could be by user or group.

sudo chown postgres /dev/cpuset/postgres/tasks

# The pid of postmaster or an ancestor process must be written to
# the tasks "file" of the cpuset.  This can be a shell from which
# pg_ctl is run, at least for bash shells.  It could also be
# written by the postmaster itself, essentially as an extra pid
# file.  Possible snippet from a service script:

echo $$ >/dev/cpuset/postgres/tasks
pg_ctl start ...

Where the OS cache is larger than shared_buffers, the above is
probably more important than the attached patch, which causes the
main shared memory segment to be spread among all available memory
nodes.  This patch only compiles in the relevant code if configure
is run using the --with-libnuma option, in which case a dependency
on the numa library is created.  It is v3 to avoid confusion with
earlier versions I have shared with a few people off-list.  (The
only difference from v2 is fixing bitrot.)
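
For illustration, the core of such a code path with libnuma is roughly
the following (a hedged sketch; everything except the libnuma calls is
mine, not the patch's):

    #include <stddef.h>
    #include <stdlib.h>
    #include <numa.h>

    /* Interleave a mapped memory region across all NUMA nodes, ideally
     * before its pages are first touched.  Build with -lnuma.  In the
     * patch's context the region would be the main shared memory
     * segment. */
    static void
    spread_memory(void *addr, size_t size)
    {
        if (numa_available() == -1)
            return;             /* no NUMA support at run time */
        numa_interleave_memory(addr, size, numa_all_nodes_ptr);
    }

    int main(void)
    {
        size_t  size = 64 * 1024 * 1024;
        void   *region = malloc(size);

        if (region != NULL)
            spread_memory(region, size);
        free(region);
        return 0;
    }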

I'll add it to the next CF.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Companydiff --git a/configure b/configure
index ed1ff0a..79a0dea 100755
--- a/configure
+++ b/configure
@@ -702,6 +702,7 @@ EGREP
 GREP
 with_zlib
 with_system_tzdata
+with_libnuma
 with_libxslt
 with_libxml
 XML2_CONFIG
@@ -831,6 +832,7 @@ with_uuid
 with_ossp_uuid
 with_libxml
 with_libxslt
+with_libnuma
 with_system_tzdata
 with_zlib
 with_gnu_ld
@@ -1518,6 +1520,7 @@ Optional Packages:
   --with-ossp-uuidobsolete spelling of --with-uuid=ossp
   --with-libxml   build with XML support
   --with-libxslt  use XSLT support when building contrib/xml2
+  --with-libnuma  use libnuma f

Re: [HACKERS] MinGW/Cygwin build snags

2014-06-08 Thread Tom Lane
Noah Misch  writes:
> First, when I tried to add an Assert() call to a file in src/port, a MinGW-w64
> build failed like this:

> Creating library file: libpostgres.a
> ../../src/port/libpgport_srv.a(fls_srv.o): In function `fls':
> /home/nm/src/pg/mingw-postgresql/src/port/fls.c:63: undefined reference to 
> `__imp_assert_enabled'

FWIW, I think we had consensus to remove the assert_enabled variable
entirely.  Not that it wouldn't be good to fix this, but Assert per se
won't need it after we do that.

> Since src/makefiles/Makefile.win32 does not set BUILDING_DLL for src/port,
> PGDLLIMPORT is set improperly for code to be linked directly into the backend.
> Makefile.win32 does set BUILDING_DLL for src/common.  (Makefile.cygwin has the
> same discrepancy, though I haven't checked whether it causes an actual build
> failure there.  The MSVC build system sets BUILDING_DLL for both src/port and
> src/common files.)  This affects any reference to a data symbol from src/port.
> The fix is straightforward enough: cause Makefile.{win32,cygwin} to treat
> src/port like src/common.

I wonder whether these cases shouldn't distinguish between the "frontend"
and "backend" builds of src/port/ and src/common/.  In particular, it
seems odd that we're getting this type of failure in the backend build.

> Second, src/template/{win32,cygwin} completely replaces LDFLAGS, so overriding
> LDFLAGS on the "configure" command line is ineffective.  Those scripts should
> instead add to the existing LDFLAGS, like other templates do for CPPFLAGS.
> Several other templates completely override CFLAGS; that's undesirable for the
> same reason.  I don't have ready access to those affected configurations, so
> I'm leaving the CFLAGS damage alone.

+1 for doing something about CFLAGS while we're at it.

> Both of these changes fix bugs, but I plan not to back-patch.

Agreed; the lack of complaints to date suggests that we should leave
well enough alone in the back branches.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] MinGW/Cygwin build snags

2014-06-08 Thread Noah Misch
First, when I tried to add an Assert() call to a file in src/port, a MinGW-w64
build failed like this:

Creating library file: libpostgres.a
../../src/port/libpgport_srv.a(fls_srv.o): In function `fls':
/home/nm/src/pg/mingw-postgresql/src/port/fls.c:63: undefined reference to 
`__imp_assert_enabled'

Since src/makefiles/Makefile.win32 does not set BUILDING_DLL for src/port,
PGDLLIMPORT is set improperly for code to be linked directly into the backend.
Makefile.win32 does set BUILDING_DLL for src/common.  (Makefile.cygwin has the
same discrepancy, though I haven't checked whether it causes an actual build
failure there.  The MSVC build system sets BUILDING_DLL for both src/port and
src/common files.)  This affects any reference to a data symbol from src/port.
The fix is straightforward enough: cause Makefile.{win32,cygwin} to treat
src/port like src/common.


Second, src/template/{win32,cygwin} completely replaces LDFLAGS, so overriding
LDFLAGS on the "configure" command line is ineffective.  Those scripts should
instead add to the existing LDFLAGS, like other templates do for CPPFLAGS.
Several other templates completely override CFLAGS; that's undesirable for the
same reason.  I don't have ready access to those affected configurations, so
I'm leaving the CFLAGS damage alone.  A fix for the Makefile.win32 half of the
original problem appeared as part of a larger patch:
http://www.postgresql.org/message-id/flat/86sk8845pl@ds4.des.no


Both of these changes fix bugs, but I plan not to back-patch.  Builds that
work today won't see any change, and I found no other user reports.

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com
diff --git a/src/makefiles/Makefile.cygwin b/src/makefiles/Makefile.cygwin
index bd83e5f..bb2efed 100644
--- a/src/makefiles/Makefile.cygwin
+++ b/src/makefiles/Makefile.cygwin
@@ -28,6 +28,10 @@ ifneq (,$(findstring src/common,$(subdir)))
 override CPPFLAGS+= -DBUILDING_DLL
 endif
 
+ifneq (,$(findstring src/port,$(subdir)))
+override CPPFLAGS+= -DBUILDING_DLL
+endif
+
 ifneq (,$(findstring timezone,$(subdir)))
 override CPPFLAGS+= -DBUILDING_DLL
 endif
diff --git a/src/makefiles/Makefile.win32 b/src/makefiles/Makefile.win32
index b18621b..9cb84f2 100644
--- a/src/makefiles/Makefile.win32
+++ b/src/makefiles/Makefile.win32
@@ -27,6 +27,10 @@ ifneq (,$(findstring src/common,$(subdir)))
 override CPPFLAGS+= -DBUILDING_DLL
 endif
 
+ifneq (,$(findstring src/port,$(subdir)))
+override CPPFLAGS+= -DBUILDING_DLL
+endif
+
 ifneq (,$(findstring timezone,$(subdir)))
 override CPPFLAGS+= -DBUILDING_DLL
 endif
diff --git a/src/template/cygwin b/src/template/cygwin
index e484fe6..b6ea0de 100644
--- a/src/template/cygwin
+++ b/src/template/cygwin
@@ -6,4 +6,4 @@ SRCH_LIB="/usr/local/lib"
 # pg_toupper() etc. in both libpq and pgport
 # we'd prefer to use --disable-auto-import to match MSVC linking behavior,
 # but support for it in Cygwin is too haphazard
-LDFLAGS="-Wl,--allow-multiple-definition -Wl,--enable-auto-import"
+LDFLAGS="$LDFLAGS -Wl,--allow-multiple-definition -Wl,--enable-auto-import"
diff --git a/src/template/win32 b/src/template/win32
index dc5b77e..7da9719 100644
--- a/src/template/win32
+++ b/src/template/win32
@@ -3,4 +3,4 @@
 # --allow-multiple-definition is required to link pg_dump because it finds
 # pg_toupper() etc. in both libpq and pgport
 # --disable-auto-import is to ensure we get MSVC-like linking behavior
-LDFLAGS="-Wl,--allow-multiple-definition -Wl,--disable-auto-import"
+LDFLAGS="$LDFLAGS -Wl,--allow-multiple-definition -Wl,--disable-auto-import"

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] configure does not check for bison or flex

2014-06-08 Thread David G Johnston
Eric L wrote
> I am installing PostgreSQL from source on 64-bit Ubuntu 14.04. The
> ./configure script succeeds, but when I run make it fails with an
> error:
> 
> "ERROR: `flex' is missing on your system.  It is needed to create the file
> `bootscanner.c'.  You can either get flex from a GNU mirror site or
> download an official distribution of PostgreSQL, which contains
> pre-packaged flex output"
> I get this same error for "bison".  I can install it with "sudo apt-get
> install bison" and "sudo apt-get install flex".  However, shouldn't the
> ./configure script be checking for these first?
> Eric

Google:  postgresql bison

http://postgresql.1045698.n5.nabble.com/bison-flex-and-configure-td5789215.html

David J.




--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/configure-does-not-check-for-bison-or-flex-tp5806447p5806449.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] configure does not check for bison or flex

2014-06-08 Thread Eric L
I am installing PostgreSQL from source on 64-bit Ubuntu 14.04. The
./configure script succeeds, but when I run make it fails with an
error:

"ERROR: `flex' is missing on your system.  It is needed to create the file 
`bootscanner.c'.  You can either get flex from a GNU mirror site or download an 
official distribution of PostgreSQL, which contains pre-packaged flex output"
I get this same error for "bison".  I can install it with "sudo apt-get install
bison" and "sudo apt-get install flex".  However, shouldn't the ./configure
script be checking for these first?
Eric
  

Re: [HACKERS] Securing "make check" (CVE-2014-0067)

2014-06-08 Thread Noah Misch
On Sun, Mar 23, 2014 at 07:04:20PM -0400, Noah Misch wrote:
> On Thu, Mar 06, 2014 at 11:52:22PM -0500, Noah Misch wrote:
> > On Thu, Mar 06, 2014 at 12:44:34PM -0500, Tom Lane wrote:
> > > I'm inclined to suggest that we should put the socket under $CWD by
> > > default, but provide some way for the user to override that choice.
> > > If they want to put it in /tmp, it's on their head as to how secure
> > > that is.  On most modern platforms it'd be fine.
> > 
> > I am skeptical about the value of protecting systems with non-sticky /tmp,
> > but long $CWD isn't of great importance, either.  I'm fine with your
> > suggestion.
> > Though the $CWD or one of its parents could be world-writable, that would
> > typically mean an attacker could just replace the test cases directly.
> 
> Here's the patch.  The temporary data directory makes for a convenient socket
> directory; initdb already gives it mode 0700, and we have existing
> arrangements to purge it when finished.  One can override the socket directory
> by defining PG_REGRESS_SOCK_DIR in the environment.

Socket path length limitations thwarted that patch:
http://www.postgresql.org/message-id/flat/e1wtnv2-00047s...@gemulon.postgresql.org

Here's an update that places the socket in a temporary subdirectory of /tmp.
The first attached patch adds NetBSD mkdtemp() to libpgport.  The second,
principal, patch uses mkdtemp() to implement this design in pg_regress.  The
corresponding change to contrib/pg_upgrade/test.sh is based on the "configure"
script's arrangements for its temporary directory.
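
For anyone unfamiliar with mkdtemp(), here is a minimal standalone
sketch of the POSIX call (an illustration, not pg_regress's actual
code):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* mkdtemp() replaces the trailing XXXXXX in place and creates
         * the directory with mode 0700, so a socket placed inside it is
         * protected even where /tmp lacks the sticky bit. */
        char template[] = "/tmp/pg_regress-XXXXXX";

        if (mkdtemp(template) == NULL)
        {
            perror("mkdtemp");
            return 1;
        }
        printf("socket directory: %s\n", template);
        return 0;
    }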

NetBSD's mkdtemp() has assertions, and I initially mapped its assertion macro
to our Assert().  However, a bug in our MinGW build process causes build
failures if an Assert() call appears in libpgport.  I will post about that in
a new thread.  The affected assertions were uncompelling, so I dropped them.

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com
diff --git a/configure b/configure
index ed1ff0a..f8232db 100755
--- a/configure
+++ b/configure
@@ -11650,6 +11650,19 @@ esac
 
 fi
 
+ac_fn_c_check_func "$LINENO" "mkdtemp" "ac_cv_func_mkdtemp"
+if test "x$ac_cv_func_mkdtemp" = xyes; then :
+  $as_echo "#define HAVE_MKDTEMP 1" >>confdefs.h
+
+else
+  case " $LIBOBJS " in
+  *" mkdtemp.$ac_objext "* ) ;;
+  *) LIBOBJS="$LIBOBJS mkdtemp.$ac_objext"
+ ;;
+esac
+
+fi
+
 ac_fn_c_check_func "$LINENO" "random" "ac_cv_func_random"
 if test "x$ac_cv_func_random" = xyes; then :
   $as_echo "#define HAVE_RANDOM 1" >>confdefs.h
diff --git a/configure.in b/configure.in
index 80df1d7..c95e2cd 100644
--- a/configure.in
+++ b/configure.in
@@ -1357,7 +1357,7 @@ else
   AC_CHECK_FUNCS([fpclass fp_class fp_class_d class], [break])
 fi
 
-AC_REPLACE_FUNCS([crypt fls getopt getrusage inet_aton random rint srandom strerror strlcat strlcpy])
+AC_REPLACE_FUNCS([crypt fls getopt getrusage inet_aton mkdtemp random rint srandom strerror strlcat strlcpy])
 
 case $host_os in
 
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 5ff9e41..4fb7288 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -330,6 +330,9 @@
 /* Define to 1 if the system has the type `MINIDUMP_TYPE'. */
 #undef HAVE_MINIDUMP_TYPE
 
+/* Define to 1 if you have the `mkdtemp' function. */
+#undef HAVE_MKDTEMP
+
/* Define to 1 if you have the <netinet/in.h> header file. */
 #undef HAVE_NETINET_IN_H
 
diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32
index e6e3c8d..58777ca 100644
--- a/src/include/pg_config.h.win32
+++ b/src/include/pg_config.h.win32
@@ -249,6 +249,9 @@
 /* Define to 1 if the system has the type `MINIDUMP_TYPE'. */
 #define HAVE_MINIDUMP_TYPE 1
 
+/* Define to 1 if you have the `mkdtemp' function. */
+/* #undef HAVE_MKDTEMP */
+
/* Define to 1 if you have the <netinet/in.h> header file. */
 #define HAVE_NETINET_IN_H 1
 
diff --git a/src/include/port.h b/src/include/port.h
index c9226f3..3d97481 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -462,6 +462,9 @@ extern int  pg_check_dir(const char *dir);
 /* port/pgmkdirp.c */
 extern int pg_mkdir_p(char *path, int omode);
 
+/* port/mkdtemp.c */
+extern char *mkdtemp(char *path);
+
 /* port/pqsignal.c */
 typedef void (*pqsigfunc) (int signo);
 extern pqsigfunc pqsignal(int signo, pqsigfunc func);
diff --git a/src/port/mkdtemp.c b/src/port/mkdtemp.c
new file mode 100644
index 000..a5e991f
--- /dev/null
+++ b/src/port/mkdtemp.c
@@ -0,0 +1,293 @@
+/*-
+ *
+ * mkdtemp.c
+ *   create a mode-0700 temporary directory
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *   src/port/mkdtemp.c
+ *
+ * This code was taken from NetBSD to provide an implementation for platforms
+ * that lack it.  (Among compatibly-licensed implementations, the OpenBSD
+ * version better resists denial-of-service attacks.  However, it has a
+ * cryptograp

Re: [HACKERS] Scaling shared buffer eviction

2014-06-08 Thread Kevin Grittner
Amit Kapila  wrote:

> I have improved the patch by making the following changes:
> a. Improved the bgwriter logic to log the xl_running_xacts info and
>    removed the hibernate logic, as bgwriter will now work only when
>    there is a scarcity of buffers in the free list. The basic idea is
>    that when the number of buffers on the freelist drops below the low
>    threshold, the allocating backend sets the latch; bgwriter wakes up
>    and begins adding buffers to the freelist until it reaches the high
>    threshold, and then goes back to sleep.

The numbers from your benchmarks are very exciting, but the above
concerns me.  My tuning of the bgwriter in production has generally
*not* been aimed at keeping pages on the freelist, but toward
preventing shared_buffers from accumulating a lot of dirty pages,
which were leading to cascades of writes between caches and thus to
write stalls.  By pushing dirty pages into the (*much* larger) OS
cache, and letting write combining happen there, where the OS could
pace based on the total number of dirty pages instead of having
some hidden and appearing rather suddenly, latency spikes were
avoided while not causing any noticeable increase in the number of
OS writes to the RAID controller's cache.

Essentially I was able to tune the bgwriter so that a dirty page
was always pushed out to the OS cache within three seconds, which led
to a healthy balance of writes between the checkpoint process and
the bgwriter. Backend processes related to user connections still
performed about 30% of the writes, and this work shows promise
toward bringing that down, which would be great; but please don't
eliminate the ability to prevent write stalls in the process.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Allowing NOT IN to use ANTI joins

2014-06-08 Thread David Rowley
Currently pull_up_sublinks_qual_recurse only changes the plan for NOT
EXISTS queries and leaves NOT IN alone. The reason is that the values
returned by a subquery in the IN clause could contain NULLs.

A simple example of this (without a subquery) is:

select 1 where 3 not in (1, 2, null); returns 0 rows because 3 <> NULL is
unknown.

The attached patch allows an ANTI-join plan to be generated in cases like:

CREATE TABLE a (id INT PRIMARY KEY, b_id INT NOT NULL);
CREATE TABLE b (id INT NOT NULL);

SELECT * FROM a WHERE b_id NOT IN(SELECT id FROM b);

To generate a plan like:
   QUERY PLAN
-------------------------------------------------------------
 Hash Anti Join  (cost=64.00..137.13 rows=1070 width=8)
   Hash Cond: (a.b_id = b.id)
   ->  Seq Scan on a  (cost=0.00..31.40 rows=2140 width=8)
   ->  Hash  (cost=34.00..34.00 rows=2400 width=4)
 ->  Seq Scan on b  (cost=0.00..34.00 rows=2400 width=4)

But if we then do:
ALTER TABLE b ALTER COLUMN id DROP NOT NULL;

The plan will go back to the current behaviour of:

 QUERY PLAN
-------------------------------------------------------
 Seq Scan on a  (cost=40.00..76.75 rows=1070 width=8)
   Filter: (NOT (hashed SubPlan 1))
   SubPlan 1
 ->  Seq Scan on b  (cost=0.00..34.00 rows=2400 width=4)

Comments are welcome

Regards

David Rowley


not_in_anti_join_v0.4.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposing pg_hibernate

2014-06-08 Thread Amit Kapila
On Fri, Jun 6, 2014 at 5:31 PM, Gurjeet Singh  wrote:
> On Thu, Jun 5, 2014 at 11:32 PM, Amit Kapila wrote:

> > Buffer saver process itself can crash while saving or restoring
> > buffers.
>
> True. That may lead to a partial list of buffers being saved. And the
> code in the Reader process tries hard to read only valid data, and punts
> at the first sight of data that doesn't make sense or on an ERROR raised
> from a Postgres API call.

In spite of the Reader process trying hard, I think we should ensure by
some other means that the file saved by the Buffer Saver is valid (for
example, first write to a tmp file and then rename it, or something
similar).
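
A minimal sketch of that write-then-rename idea (hypothetical code, not
pg_hibernate's):

    #include <stdio.h>

    /* The save-file appears under its final name only once it is
     * complete, so a crash mid-write cannot leave a partial file that
     * the Reader would mistake for a valid one. */
    static int
    save_buffer_list(const char *final_path)
    {
        char    tmp_path[1024];
        FILE   *fp;

        snprintf(tmp_path, sizeof(tmp_path), "%s.tmp", final_path);
        fp = fopen(tmp_path, "wb");
        if (fp == NULL)
            return -1;
        /* ... write the buffer list and flush it here ... */
        if (fclose(fp) != 0)
            return -1;
        return rename(tmp_path, final_path);    /* atomic on POSIX */
    }

    int main(void)
    {
        return save_buffer_list("buffers.save") == 0 ? 0 : 1;
    }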

> > IIUC on shutdown request, postmaster will send signal to BG Saver
> > and BG Saver will save the buffers and then postmaster will send
> > signal to checkpointer to shutdown.  So before writing Checkpoint
> > record, BG Saver can crash (it might have saved half the buffers)
>
> Case handled as described above.
>
> > or may BG saver saves buffers, but checkpointer crashes (due to
> > power outage or any such thing).
>
> Checkpointer process' crash seems to be irrelevant to Postgres
> Hibernator's  workings.

Yeap, but if it crashes before writing the checkpoint record, it will
lead to recovery, which is the case we are considering.

> I think you are trying to argue the wording in my claim "save-files
> are created only on clean shutdowns; not on a crash or immediate
> shutdown", by implying that a crash may occur at any time during and
> after the BufferSaver processing. I agree the wording can be improved.

Not only the wording: in your above mail, Cases 2 and 1b would need to
load buffers and perform recovery as well, so we need to decide which
one to give preference.

So if you agree that we should take the recovery data into
consideration along with the saved-file data, then I think we have the
below options to consider:

1. Provide a way for the user to specify which data (recovery or
previously cached blocks) should be considered more important,
and then load buffers before or after recovery based on that
input.

2. Always load before recovery, and mention in the docs that users
can expect servers to take more time to start when they enable this
extension, along with its advantages.

3. Always load after recovery, and mention in the docs that enabling
this extension might mean the cached data is discarded by recovery or
by the first few operations done by the user.

4. Have the BufMgr module expose an API such that the buffer loader
will only consider buffers in the freelist when loading.

Based on the opinions of others, I think we can decide on one of
these, or on any better alternative.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com