Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Heikki Linnakangas

On 01.12.2010 03:35, Bruce Momjian wrote:

Heikki Linnakangas wrote:

Let's recap what happens when a VM bit is set: You set the
PD_ALL_VISIBLE flag on the heap page (assuming it's not set already, it
usually isn't), and then set the bit in the VM while keeping the heap
page locked.


What if we set PD_ALL_VISIBLE on the heap page, wait for a checkpoint to
happen so the heap page is guaranteed to be on disk, then on next read,
if PD_ALL_VISIBLE is set and the VM all-visible bit is not set, set the
VM bit.


Hmm, you'd somehow have to know if a checkpoint has happened since the 
flag was set. And it might be a long wait, which makes it less 
attractive for index-only scans.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Idle git question: how come so many "objects"?

2010-11-30 Thread Martijn van Oosterhout
On Wed, Dec 01, 2010 at 01:03:26AM -0500, Tom Lane wrote:
> So I just made a commit that touched four files in all six active
> branches, and I see:
> 
> $ git push   
> Counting objects: 172, done.
> Compressing objects: 100% (89/89), done.
> Writing objects: 100% (89/89), 17.07 KiB, done.
> Total 89 (delta 80), reused 0 (delta 0)
> To ssh://g...@gitmaster.postgresql.org/postgresql.git
>    35a3def..8a6eb2e  REL8_1_STABLE -> REL8_1_STABLE
>    cfb6ac6..b0e2092  REL8_2_STABLE -> REL8_2_STABLE
>    301a822..0d45e8c  REL8_3_STABLE -> REL8_3_STABLE
>    61f8618..6bd3753  REL8_4_STABLE -> REL8_4_STABLE
>    09425f8..0a85bb2  REL9_0_STABLE -> REL9_0_STABLE
>    c0b5fac..225f0aa  master -> master
> 
> Now I realize that in addition to the four files there's a "tree" object
> and a "commit" object, but that still only adds up to 36 objects that
> should be created in this transaction.  How does it get to 172?  And
> then where do the 89 and 80 numbers come from?

IIRC, each directory level also counts as an object. So if you change a
file at a/b/c/d you get 5 new objects: one blob for the file and four
tree objects for the directories (root, a, b and c).

What the delta is for I don't know. Perhaps some of the diffs were the
same between branches and these got merged?
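
That count is easy to verify in a throwaway repository (a sketch; it assumes git is on the PATH, and the identity settings are only for the demo):

```shell
# Count the objects created by one commit of a single file at a/b/c/d.
tmp=$(mktemp -d)
cd "$tmp"
git init -q demo
cd demo
mkdir -p a/b/c
echo hello > a/b/c/d
git add .
git -c user.name=demo -c user.email=demo@example.com commit -qm 'add a/b/c/d'
# Expect 6: 1 blob (the file) + 4 trees (root, a, b, c) + 1 commit.
git rev-list --objects HEAD | wc -l
```

With six branches and files nested several directories deep, the per-commit
object count multiplies quickly, which is plausibly how 36 becomes 172; delta
compression then reduces what actually gets written (the 89 and 80 figures).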

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first. 
>   - Charles de Gaulle


signature.asc
Description: Digital signature


[HACKERS] Improved JDBC driver part 2

2010-11-30 Thread Radosław Smogura
Hello,

Maybe you are interested in what I have done with the JDBC driver.

=== Original driver (Text mode) ===
* Memory *
1. Memory usage improvements when using result set input streams (no unneeded 
memory copies) - needs a few more touches for better performance.
2. Memory usage improvements for large data; it should be no problem to load a 
1GB bytea[] with only 300MB of memory (the "threshold" size is still hardcoded).

* JDBC 4 *
1. XML is now correctly transformed before being sent to the server - the 
previous version used plain text-file transformations, which is not enough.
2. In all modes (text/binary), XML is sent in binary mode, so the driver 
doesn't need to do any special transformation (does that require libxml?) 
unless character streams are used.
3. JDBC4 exception throwing.
4. XML objects are readable only once and can't be reused; an update from a 
result set (silently set to null on ResultSet.updateRow() - it shouldn't be 
silent) returns null until refreshRow(), but you can write to them after load.
5. The target XML behavior is streaming, so as not to repeat the problems 
with bytea.

* JDBC 4.1 *
1. Just started.

* Others *
1. A few additional test cases, and a few utilities for XML checking (plain 
string equality is too weak) - not great, but better.
2. Fixed a bug causing improper time(stamp) encoding for WITH TIME ZONE 
fields after changing the default time zone.

=== Binary mode ===
1. Reads for almost all data types, including arrays.
2. Writes for a few.
3. Much more restrictive checking when casting from one type to another.
4. Exceptions when casting from one type to an improper one.
5. ResultSet.getString() for XML still returns the XML - the spec prohibits 
this (X - base type conversion, x - possible conversion, no x - no conversion).
6. No getPrivileges for metadata - there is no binary output for ACLs!
7. Many, many tests passed.
8. Data reading is faster across the board (checked with a profiler against 
the original driver).

The driver is here: http://www.rsmogura.net/pgsql/pgjdbc_exp_20101130_C.tar.gz
It is currently JDK 6 compatible (it won't stay that way); the compressed 
patch is about 136kB gzipped.

Kind regards & have a nice day
--
Radosław Smogura
http://www.softperience.eu



[HACKERS] Idle git question: how come so many "objects"?

2010-11-30 Thread Tom Lane
So I just made a commit that touched four files in all six active
branches, and I see:

$ git push   
Counting objects: 172, done.
Compressing objects: 100% (89/89), done.
Writing objects: 100% (89/89), 17.07 KiB, done.
Total 89 (delta 80), reused 0 (delta 0)
To ssh://g...@gitmaster.postgresql.org/postgresql.git
   35a3def..8a6eb2e  REL8_1_STABLE -> REL8_1_STABLE
   cfb6ac6..b0e2092  REL8_2_STABLE -> REL8_2_STABLE
   301a822..0d45e8c  REL8_3_STABLE -> REL8_3_STABLE
   61f8618..6bd3753  REL8_4_STABLE -> REL8_4_STABLE
   09425f8..0a85bb2  REL9_0_STABLE -> REL9_0_STABLE
   c0b5fac..225f0aa  master -> master

Now I realize that in addition to the four files there's a "tree" object
and a "commit" object, but that still only adds up to 36 objects that
should be created in this transaction.  How does it get to 172?  And
then where do the 89 and 80 numbers come from?

regards, tom lane



Re: [HACKERS] SQL/MED - core functionality

2010-11-30 Thread Shigeru HANADA
On Wed, 1 Dec 2010 12:30:46 +0900
Hitoshi Harada  wrote:
> This is another topic, but it would be useful if CREATE FOREIGN TABLE
> could omit the column definitions, since the FDW usually knows what
> should be in them. I sometimes mistyped column names between remote and
> local, which resulted in failures at execution.

The SQL/MED standard includes "IMPORT FOREIGN SCHEMA schema FROM
SERVER server" syntax, which imports the definitions of the remote tables in
the specified schema into the local PostgreSQL; you can optionally restrict
the set of target tables with the "LIMIT TO table_list" option.

This syntax would make defining foreign tables easier, but it requires
enhancing both the FDW API and the core parser.

Regards,
--
Shigeru Hanada





Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4

2010-11-30 Thread Tom Lane
Josh Berkus  writes:
> On 11/30/10 7:09 PM, Tom Lane wrote:
>> Josh Berkus  writes:
>>> Apparently, testing for O_DIRECT at compile time isn't adequate.  Ideas?
>> 
>> We should wait for the outcome of the discussion about whether to change
>> the default wal_sync_method before worrying about this.

> Are we considering backporting that change?

> If so, this would be another argument in favor of changing the default.

Well, no, actually it's the same (only) argument.  We'd never consider
back-patching such a change if our hand weren't being forced by kernel
changes :-(

As things stand, though, I think the only thing that's really open for
discussion is how wide to make the scope of the default-change: should
we just do it across the board, or try to limit it to some subset of the
platforms where open_datasync is currently the default.  And that's a
decision that ought to be informed by some performance testing.

regards, tom lane



Re: [HACKERS] SQL/MED - core functionality

2010-11-30 Thread Hitoshi Harada
2010/12/1 Itagaki Takahiro :
> On Wed, Dec 1, 2010 at 12:30, Hitoshi Harada  wrote:
>> From a user's view, this is a very long way to get to the simplest
>> foreign table. I know it is based on the standard, but I really want a
>> shortcut. In particular, I don't understand why CREATE USER MAPPING FOR
>> current_user SERVER  is needed for the default use case. If you
>> forget CREATE USER MAPPING and do CREATE FOREIGN TABLE, it raises an
>> error. User mapping is useful if the local user and remote user need to
>> be mapped, but I imagine in most cases they are the same. In addition,
>> postgresql_fdw can tell the remote user from the conninfo string.
>
> How do you connect to the remote server when a password is required?
> I don't think we can pass passwords through to the remote server
> even if the same user is used on the local and remote servers.

Sure, that is limited to trust authentication only. If you need a more
complicated connection, use USER MAPPING. But it's not clear to me that
it should be required in every case.

Regards,


-- 
Hitoshi Harada



Re: [HACKERS] profiling connection overhead

2010-11-30 Thread Peter Eisentraut
On tis, 2010-11-30 at 15:49 -0500, Tom Lane wrote:
> Peter Eisentraut  writes:
> > On mån, 2010-11-29 at 13:10 -0500, Tom Lane wrote:
> >> Rolling in calloc in place of
> >> malloc/memset made no particular difference either, which says that
> >> Fedora 13's glibc does not have any optimization for that case as I'd
> >> hoped.
> 
> > glibc's calloc is either mmap of /dev/zero or malloc followed by memset.
> 
> Hmm.  I would have expected to see a difference then.  Do you know what
> conditions are needed to cause the mmap to be used?

Check out the mallopt(3) man page.  It contains a few tunable malloc
options that may be useful for your investigation.
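
As a sketch of what those knobs look like in practice (glibc-specific; the
variable names with trailing underscores are the environment equivalents of
the mallopt() options, per the man page), the tunables can even be set
per-process without recompiling:

```shell
# glibc reads these at startup; any child process inherits them.
# Allocations above the threshold are served from fresh mmap'd pages,
# which arrive zero-filled -- the case where calloc can skip its memset.
export MALLOC_MMAP_THRESHOLD_=65536
# How eagerly freed memory at the top of the heap is returned to the kernel.
export MALLOC_TRIM_THRESHOLD_=131072
# A server started from this shell (e.g. via pg_ctl) would pick both up.
env | grep '^MALLOC_'
```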




Re: [HACKERS] SQL/MED - core functionality

2010-11-30 Thread Itagaki Takahiro
On Wed, Dec 1, 2010 at 12:30, Hitoshi Harada  wrote:
> From a user's view, this is a very long way to get to the simplest
> foreign table. I know it is based on the standard, but I really want a
> shortcut. In particular, I don't understand why CREATE USER MAPPING FOR
> current_user SERVER  is needed for the default use case. If you
> forget CREATE USER MAPPING and do CREATE FOREIGN TABLE, it raises an
> error. User mapping is useful if the local user and remote user need to
> be mapped, but I imagine in most cases they are the same. In addition,
> postgresql_fdw can tell the remote user from the conninfo string.

How do you connect to the remote server when a password is required?
I don't think we can pass passwords through to the remote server
even if the same user is used on the local and remote servers.

However, SERVER and USER MAPPING might be useless for file_fdw, because
the SERVER is always the local file system and the USER is always the OS
user who started the postmaster. I'm not sure how we should treat cases
where those objects have nothing to configure.

-- 
Itagaki Takahiro



Re: [HACKERS] profiling connection overhead

2010-11-30 Thread Jeff Janes
On 11/28/10, Robert Haas  wrote:
>
> In a close race, I don't think we should get bogged down in
> micro-optimization here, both because micro-optimizations may not gain
> much and because what works well on one platform may not do much at
> all on another.  The more general issue here is what to do about our
> high backend startup costs.  Beyond trying to recycle backends for new
> connections, as I've previous proposed and with all the problems it
> entails,

Is there a particular discussion of that matter you could point me to?

> the only thing that looks promising here is to try to somehow
> cut down on the cost of populating the catcache and relcache, not that
> I have a very clear idea how to do that.  This has to be a soluble
> problem because other people have solved it.

Oracle's backend start-up time seems to be far higher than PG's.
Their main solution is fundamentally a built-in connection pooler with
some bells and whistles.  I'm not sure which "other people" you had in
mind--Oracle is generally the one that pops to my mind.

> To some degree we're a
> victim of our own flexible and extensible architecture here, but I
> find it pretty unsatisfying to just say, OK, well, we're slow.


What about "well, OK, we have PgBouncer"?  Does it have fixable
shortcomings that could be addressed to make this less of an issue?

Cheers,

Jeff



Re: [HACKERS] Spread checkpoint sync

2010-11-30 Thread Greg Smith

Jeff Janes wrote:

Have you tested out this "absorb during syncing phase" code without
the sleep between the syncs?
I.e. so that it is still a tight loop, but the loop alternates between
sync and absorb, with no intentional pause?
  


Yes; that's how it was developed.  It helped to have just the extra 
absorb work without the pauses, but that alone wasn't enough to really 
improve things on the server we ran into this problem badly on.



I ask because I don't have a mental model of how the pause can help.
Given that this dirty data has been hanging around for many minutes
already, what is a 3 second pause going to heal?
  


The difference is that once an fsync call is made, dirty data is much 
more likely to be forced out.  It's the one thing that bypasses all 
other ways the kernel might try to avoid writing the data--both the 
dirty ratio guidelines and the congestion control logic--and forces 
those writes to happen as soon as they can be scheduled.  If you graph 
the amount of data shown "Dirty:" by /proc/meminfo over time, once the 
sync calls start happening it's like a descending staircase pattern, 
dropping a little bit as each sync fires. 
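
For anyone who wants to watch that staircase themselves, a minimal sketch
(Linux-only; you would sample more often, and during an actual checkpoint):

```shell
# Print the kernel's dirty-page counter a few times, one second apart.
# During a checkpoint's sync phase each fsync should knock it down a step.
for i in 1 2 3; do
    grep '^Dirty:' /proc/meminfo
    sleep 1
done
```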


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books




Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4

2010-11-30 Thread Tom Lane
Andrew Dunstan  writes:
> On 11/30/2010 10:09 PM, Tom Lane wrote:
>> We should wait for the outcome of the discussion about whether to change
>> the default wal_sync_method before worrying about this.

> we've just had a significant PGX customer encounter this with the latest 
> Postgres on Redhat's freshly released flagship product. Presumably the 
> default wal_sync_method will only change prospectively.

I don't think so.  The fact that Linux is changing underneath us is a
compelling reason for back-patching a change here.  Our older branches
still have to be able to run on modern OS versions.  I'm also fairly
unclear on what you think a fix would look like if it's not effectively
a change in the default.

(Hint: this *will* be changing, one way or another, in Red Hat's version
of 8.4, since that's what RH is shipping in RHEL6.)

regards, tom lane



Re: [HACKERS] [GENERAL] column-level update privs + lock table

2010-11-30 Thread KaiGai Kohei
(2010/11/30 21:26), Simon Riggs wrote:
> On Mon, 2010-11-29 at 21:37 -0500, Josh Kupershmidt wrote:
> 
>> I still see little reason to make LOCK TABLE permissions different for
>> column-level vs. table-level UPDATE privileges
> 
> Agreed.
> 
> This is the crux of the debate. Why should this inconsistency be allowed
> to continue?
> 
> Are there covert channel issues here, KaiGai?
> 
The existing database privilege mechanism (and SELinux, etc.) is basically
not designed to handle covert channel attacks.
For example, if a user session with column-level UPDATE privilege updates
a certain column once per second depending on the contents of another
table X, a second session can probably learn the contents of table X by
iterating the LOCK command, without any SELECT permission on X.
That is a typical timing-channel attack, but it is not a problem that we
should try to tackle, is it?

Sorry, I don't have a credible idea to solve this inconsistency right now.

Thanks,
-- 
KaiGai Kohei 



[HACKERS] unlogged tables

2010-11-30 Thread Andy Colson

I have played around a little more, and think I found a problem.

If given enough time, an unlogged table makes it to disk, and a restart won't 
clear the data.  If I insert a bunch of stuff, commit, and quickly restart PG, 
the table is cleared.  If I let it sit for a while, it stays.

Based on that, I have a pgbench_accounts table (unlogged) that after a restart 
has data in it.


andy=# select aid, bid, abalance from pgbench_accounts where abalance = 3305;
   aid   | bid | abalance
---------+-----+----------
 3790226 |  38 |     3305
  274130 |   3 |     3305
 2169892 |  22 |     3305
  705321 |   8 |     3305
 4463145 |  45 |     3305

I dropped the index, added a new one, then restarted PG.  Now it seems the 
index is empty/unusable.

andy=# select aid, bid, abalance from pgbench_accounts where aid = 3790226;
 aid | bid | abalance
-----+-----+----------
(0 rows)


andy=# select pg_indexes_size('pgbench_accounts');
 pg_indexes_size
-----------------
           16384




Let's recreate it:

andy=# drop index bob;
DROP INDEX
Time: 13.829 ms
andy=# create index bob on pgbench_accounts(aid, bid);
CREATE INDEX
Time: 17215.859 ms
andy=# select aid, bid, abalance from pgbench_accounts where aid = 3790226;
   aid   | bid | abalance
---------+-----+----------
 3790226 |  38 |     3305
(1 row)

Time: 0.712 ms

andy=# select pg_indexes_size('pgbench_accounts');
 pg_indexes_size
-----------------
       179716096




I also did kill -9 on all the postgres* processes while they were busy 
inserting records, to try to corrupt the database, but I could not seem to.  
Setting fsync off also did not give me errors, but I assume that is because I 
was using unlogged tables, and since they were all getting cleared anyway, I 
never saw any.

With fsync off and normal tables, I got bad looking things in my logs and 
vacuum:

LOG:  unexpected pageaddr 1/AB1D6000 in log file 1, segment 187, offset 1925120
WARNING:  relation "access" page 28184 is uninitialized --- fixing
etc...


And last, I tried to update my git repo and see if the patches still apply. 
They do not.

There was much discussion on the syntax:

create unlogged table vs create temp xxx table vs something else.

There was much discussion on how persistent the tables should be.  And some on 
backups.

At this point, though, I find myself at an end, not sure what else to do until 
the dust settles.

Oh, also, I wanted to add:

There is \h help:  +1
but I can find no way to determine the "tempness"/"unloggedness" of a table 
via \d*.
The only way I found was "pg_dump -s".


I will attempt to link this to the website, and mark it as returned to author.

-Andy



Re: [HACKERS] SQL/MED - core functionality

2010-11-30 Thread Hitoshi Harada
2010/11/25 Shigeru HANADA :
> Hi hackers,
>
> Attached is a patch that adds core functionality of SQL/MED.  This
> patch provides:
>
>    "SQL/MED - file_fdw"       : FDW for server-side files (CSV, TEXT)
>    "SQL/MED - postgresql_fdw" : FDW for external PostgreSQL
>

I've tried SQL/MED with postgresql_fdw last night, and found myself
confused by the long setup procedure. The simplest way to use it, AFAIK, is:

1.CREATE FOREIGN DATA WRAPPER ... (or run install sql script)
2.CREATE SERVER ... FOREIGN DATA WRAPPER ...
3.CREATE USER MAPPING FOR ...
4.CREATE FOREIGN TABLE( ... )

From a user's view, this is a very long way to get to the simplest
foreign table. I know it is based on the standard, but I really want a
shortcut. In particular, I don't understand why CREATE USER MAPPING FOR
current_user SERVER  is needed for the default use case. If you
forget CREATE USER MAPPING and do CREATE FOREIGN TABLE, it raises an
error. User mapping is useful if the local user and remote user need to
be mapped, but I imagine in most cases they are the same. In addition,
postgresql_fdw can tell the remote user from the conninfo string.

This is another topic, but it would be useful if CREATE FOREIGN TABLE
could omit the column definitions, since the FDW usually knows what should
be in them. I sometimes mistyped column names between remote and local,
which resulted in failures at execution.

If these have been discussed before, I apologize for the noise.

Basically, I would love to see this in the next release. Good work!

Regards,



-- 
Hitoshi Harada



Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4

2010-11-30 Thread Andrew Dunstan



On 11/30/2010 10:09 PM, Tom Lane wrote:

Josh Berkus  writes:

Apparently, testing for O_DIRECT at compile time isn't adequate.  Ideas?

We should wait for the outcome of the discussion about whether to change
the default wal_sync_method before worrying about this.




Tom,

we've just had a significant PGX customer encounter this with the latest 
Postgres on Redhat's freshly released flagship product. Presumably the 
default wal_sync_method will only change prospectively. But this will 
feel to every user out there who encounters it like a bug in our code, 
and it needs attention. It was darn difficult to diagnose, and many 
people will just give up in disgust if they encounter it.


cheers

andrew



Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4

2010-11-30 Thread Josh Berkus
On 11/30/10 7:09 PM, Tom Lane wrote:
> Josh Berkus  writes:
>> Apparently, testing for O_DIRECT at compile time isn't adequate.  Ideas?
> 
> We should wait for the outcome of the discussion about whether to change
> the default wal_sync_method before worrying about this.

Are we considering backporting that change?

If so, this would be another argument in favor of changing the default.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4

2010-11-30 Thread Tom Lane
Josh Berkus  writes:
> Apparently, testing for O_DIRECT at compile time isn't adequate.  Ideas?

We should wait for the outcome of the discussion about whether to change
the default wal_sync_method before worrying about this.

regards, tom lane



Re: [HACKERS] Where are we on Standby Promotion?

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 9:03 PM, Bruce Momjian  wrote:
> Josh Berkus wrote:
>> Fujii, Simon, Greg, etc.:
>>
>> Has anyone submitted or committed a patch to make Standby Promotion*
>> easier, at this point?  We discussed it earlier in the dev cycle, but I
>> can't find anything which has actually been submitted.
>>
>> * that is, promotion to be a new master of other existing standbys
>
> Nope, I haven't seen anything except a request that someone do testing.

I thought we agreed Josh was in charge of writing the code for this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4

2010-11-30 Thread Josh Berkus
Hackers,

Some of you might already be aware that this combination produces a
fatal startup crash in PostgreSQL:

1. Create an Ext3 or Ext4 partition and mount it with data=journal on a
server with linux kernel 2.6.30 or later.
2. Initdb a PGDATA on that partition
3. Start PostgreSQL with the default config from that PGDATA

This was reported a ways back:
https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=567113

To explain: opening a file with O_DIRECT on an ext3 or ext4 partition
mounted with data=journal causes a crash.  However, recent Linux kernels
report support for O_DIRECT when we compile PostgreSQL, so we use it by
default.  This results in a "crash by default" situation on new Linuxes
if anyone sets data=journal.

We just encountered this again with another user.  With RHEL6 out now,
this seems likely to become a fairly common crash report.

Apparently, testing for O_DIRECT at compile time isn't adequate.  Ideas?

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



Re: [HACKERS] GiST insert algorithm rewrite

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 10:26 AM, Heikki Linnakangas
 wrote:
>> Does the current code cope with the corruption?
>
> It's not corruption, but "intended degradation". Yes, the current code copes
> with it, that's how GiST survives a crash. However, even with the current
> code, VACUUM will nag if it finds any invalid tuples with this message:
>
> ereport(NOTICE,
>        (errmsg("index \"%s\" needs VACUUM FULL or REINDEX to finish crash
> recovery",
>
> That's harmless, in the sense that all scans and inserts work fine, but
> scans might need to do more work than if the invalid tuple wasn't there.
>
> I don't think we need to go out of our way to support such degraded indexes
> in 9.1. If you see such notices in your logs, you should REINDEX anyway,
> before or after pg_upgrade. Let's just make sure that you get a reasonable
> error message in 9.1 if a scan or insert encounters such a tuple.

I just don't want to take a risk of giving people unexpected wrong
answers.  It's not clear to me whether that's a risk here or not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] GiST insert algorithm rewrite

2010-11-30 Thread Bruce Momjian
Heikki Linnakangas wrote:
> > Does the current code cope with the corruption?
> 
> It's not corruption, but "intended degradation". Yes, the current code 
> copes with it, that's how GiST survives a crash. However, even with the 
> current code, VACUUM will nag if it finds any invalid tuples with this 
> message:
> 
> ereport(NOTICE,
>   (errmsg("index \"%s\" needs VACUUM FULL or REINDEX to finish crash 
> recovery",
> 
> That's harmless, in the sense that all scans and inserts work fine, but 
> scans might need to do more work than if the invalid tuple wasn't there.
> 
> I don't think we need to go out of our way to support such degraded 
> indexes in 9.1. If you see such notices in your logs, you should REINDEX 
> anyway, before or after pg_upgrade. Let's just make sure that you get a 
> reasonable error message in 9.1 if a scan or insert encounters such a tuple.
> 
> There is a section on this in the docs, BTW: 
> http://www.postgresql.org/docs/9.0/static/gist-recovery.html

OK, administrators will be prompted during normal operation --- seems
there is nothing extra for pg_upgrade to do here.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Where are we on Standby Promotion?

2010-11-30 Thread Bruce Momjian
Josh Berkus wrote:
> Fujii, Simon, Greg, etc.:
> 
> Has anyone submitted or committed a patch to make Standby Promotion*
> easier, at this point?  We discussed it earlier in the dev cycle, but I
> can't find anything which has actually been submitted.
> 
> * that is, promotion to be a new master of other existing standbys

Nope, I haven't seen anything except a request that someone do testing.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Bruce Momjian
Daniel Loureiro wrote:
> > 3. This doesn't work tremendously well for inheritance trees, where
> > ModifyTable acts as sort of an implicit Append node.  You can't just
> > funnel all the tuples through one Sort or Limit node because they aren't
> > all the same rowtype.  (Limit might perhaps not care, but Sort will.)
> > But you can't have a separate Sort/Limit for each table either, because
> > that would give the wrong behavior.  Another problem with funneling all
> > the rows through one Sort/Limit is that ModifyTable did need to know
> > which table each row came from, so it can apply the modify to the right
> > table.
> 
> So I guess I chose the wrong hack to start with.
> 
> Just out of curiosity, why isn't the result of the WHERE filter (in
> SELECT/DELETE/UPDATE) put in memory, i.e. an array of ctids, like a
> buffer, and then processed by SELECT/DELETE/UPDATE at once?

Informix dbaccess would prompt a user for confirmation if it saw a
DELETE with no WHERE.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Bruce Momjian
Heikki Linnakangas wrote:
> On 30.11.2010 18:33, Tom Lane wrote:
> > Robert Haas  writes:
> >> Oh, but it's worse than that.  When you XLOG a WAL record for each of
> >> those pages, you're going to trigger full-page writes for all of them.
> >>   So now you've turned 1GB of data to write into 2+ GB of data to
> >> write.
> >
> > No, because only the first mod of each VM page would trigger a full page
> > write, at least assuming a reasonable ordering of the operations.
> 
> If you change the LSN on the heap pages, you have to write full page 
> images of those as well.
> 
> Let's recap what happens when a VM bit is set: You set the 
> PD_ALL_VISIBLE flag on the heap page (assuming it's not set already, it 
> usually isn't), and then set the bit in the VM while keeping the heap 
> page locked.

What if we set PD_ALL_VISIBLE on the heap page, wait for a checkpoint to
happen so the heap page is guaranteed to be on disk, then on next read,
if PD_ALL_VISIBLE is set and the VM all-visible bit is not set, set the
VM bit.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Spread checkpoint sync

2010-11-30 Thread Jeff Janes
On Sun, Nov 14, 2010 at 3:48 PM, Greg Smith  wrote:

...

> One change that turned out be necessary rather than optional--to get good
> performance from the system under tuning--was to make regular background
> writer activity, including fsync absorb checks, happen during these sync
> pauses.  The existing code ran the checkpoint sync work in a pretty tight
> loop, which as I alluded to in an earlier patch today can lead to the
> backends competing with the background writer to get their sync calls
> executed.  This squashes that problem if the background writer is setup
> properly.

Have you tested out this "absorb during syncing phase" code without
the sleep between the syncs?
I.e. so that it is still a tight loop, but the loop alternates between
sync and absorb, with no intentional pause?

I wonder if all the improvement you see might not be due entirely to
the absorb between syncs, and none or very little from
the sleep itself.

I ask because I don't have a mental model of how the pause can help.
Given that this dirty data has been hanging around for many minutes
already, what is a 3 second pause going to heal?

The healing power of clearing out the absorb queue seems much more obvious.

Cheers,

Jeff



Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Jeff Davis
On Tue, 2010-11-30 at 15:52 -0500, Robert Haas wrote:
> On Tue, Nov 30, 2010 at 2:45 PM, Jeff Davis  wrote:
> > On Tue, 2010-11-30 at 11:12 -0500, Robert Haas wrote:
> >>
> >> Could you possibly have ModifyTable -> Limit -> MergeAppend?
> >
> > Before MergeAppend knows which tuple to produce, it needs to see the
> > tuples (at least the first one from each of its children), meaning that
> > it needs to pull them through ModifyTable; and at that point it's
> > already too late.
> >
> 
> You seem to be imagining the MergeAppend node on top

Yes, I assumed that the tuples flowed in the direction of the arrows ;)

Now that I think about it, your representation makes some sense given
our EXPLAIN output.

Regards,
Jeff Davis




Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Alvaro Herrera
Excerpts from Daniel Loureiro's message of mar nov 30 15:04:17 -0300 2010:

> So I guess that I have chosen the wrong hack to start with.

So it seems :-D

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Tom Lane
Robert Haas  writes:
> You seem to be imagining the MergeAppend node on top, but I had it in
> the other order in my mind.  The ModifyTable node would be the
> outermost plan node, pulling from the Limit, which would deliver the
> first n table rows from the MergeAppend, which would be responsible for
> getting it from the various child tables.

That's just a variation of the Sort/Limit/ModifyTable approach.  It
doesn't fix the problem of how ModifyTable knows which table each row
came from, and it doesn't fix the problem of the rows not being all the
same rowtype.  (In fact it makes the latter worse, since now MergeAppend
has to be included in whatever kluge you invent to work around it.)

regards, tom lane



Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Dimitri Fontaine
Andres Freund  writes:
> On Tuesday 30 November 2010 20:24:52 Marko Tiikkaja wrote:
>> I don't buy the argument either; why would you put a LIMIT there and
>> delete one row by accident when you could put a BEGIN; in front and not
>> do any damage at all?
> Because the delete of the whole table may take awfully long?

Then you just C-c and that's your ROLLBACK. Been there, seen that (a
developer came to me sweating over maybe-lost data — his chance was that
forgetting the WHERE clause, it did take long enough for him to C-c by
reflex, the oops moment).

But more to the point, I don't see that we're this much on the policy
side of things rather than on the mechanism side. This feature has real
appealing usages (cheap work queues, anti-deadlock, huge data purges
with no production locking — you do that in little steps in a loop).

To summarize, people that are arguing against are saying they will not
themselves put time on the feature more than anything else, I think. I
don't see us refusing a good implementation on the grounds that misuse
is possible.

After all, advisory locks are session based, to name another great foot
gun. If you don't think it's big enough, think about web environments
and pgbouncer in transaction pooling mode. Loads of fun.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Tom Lane
Marko Tiikkaja  writes:
> While reading this thread, I thought of two things I think we could do 
> if this feature was implemented:

>   1. Sort large UPDATE/DELETEs so it is done in heap order

> This is actually a TODO item.  I imagine it would be possible to do 
> something like:
> DELETE FROM foo USING (...) ORDER BY ctid;
> with this patch to help this case.

Well, that's strictly an implementation detail; it is not a reason to
expose ORDER BY to the user, and even less of a reason to invent LIMIT.
It also hasn't got any of the problems we were discussing with
inheritance situations, since it'd be perfectly OK (in fact probably
desirable) to sort each table's rows separately.

>   2. Reducing deadlocks in big UPDATE/DELETEs

> One problem that sometimes occurs when doing multiple multi-row UPDATEs 
> or DELETEs concurrently is that the transactions end up working on the 
> same rows, but in a different order.  One could use an ORDER BY clause 
> to make sure the transactions don't deadlock.

That, on the other hand, seems like potentially a valid use-case.  Note
that the user-given order would have to override any internal attempt to
order by ctid for this to be usable.

I had thought of a slightly different application, which could be
summarized with this example:

UPDATE sometab SET somecol = nextval('seq') ORDER BY id;

with the expectation that somecol's values would then fall in the same
order as the id column.  Unfortunately, that won't actually *work*
reliably, the reason being that ORDER BY is applied after targetlist
computation.  I think enough people would get burnt this way that we'd
have popular demand to make ORDER BY work differently in UPDATE than it
does in SELECT, which seems rather ugly not only from the definitional
side but the implementation side.

(DELETE escapes this issue because it has no user-definable elements in
its targetlist, which is another way that DELETE is simpler here.)

regards, tom lane



Re: [HACKERS] KNNGIST next step: adjusting indexAM API

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 2:50 PM, Tom Lane  wrote:
> In the current KNNGIST patch, the indexable ORDER BY clauses are
> transmitted to the executor by cramming them in with the index qual
> conditions (the IndexScan plan node's indexqual list), from whence
> they become part of the ScanKey array passed to the index AM.
> Robert complained that this was an ingenious way to minimize the
> number of lines touched by the patch but utterly ugly from any other
> standpoint, and I quite agree.  An ORDER BY clause is a completely
> different thing from a WHERE qual, so mixing them together doesn't
> seem like a good idea.
>
> However, if we hold to that principle then we need to modify the indexAM
> API to pass the ordering operators separately.  This is no big deal as
> far as the built-in AMs are concerned, particularly because 3 of the 4
> need only assert that the additional list is empty.  The only reason it
> would be a problem is if there were third-party index AMs that would be
> affected to a larger degree; but I don't know of any.  Does anyone have
> an objection to that?

None here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Spread checkpoint sync

2010-11-30 Thread Josh Berkus

> Maybe, but it's hard to argue that the current implementation--just
> doing all of the sync calls as fast as possible, one after the other--is
> going to produce worst-case behavior in a lot of situations.  Given that
> it's not a huge amount of code to do better, I'd rather do some work in
> that direction, instead of presuming the kernel authors will ever make
> this go away.  Spreading the writes out as part of the checkpoint rework
> in 8.3 worked better than any kernel changes I've tested since then, and
> I'm not real optimistic about this getting resolved at the system level. 
> So long as the database changes aren't antagonistic toward kernel
> improvements, I'd prefer to have some options here that become effective
> as soon as the database code is done.

Besides, even if kernel/FS authors did improve things, the improvements
would not be available on production platforms for years.  And, for that
matter, while Linux and BSD are pretty responsive to our feedback,
Apple, Microsoft and Oracle are most definitely not.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 2:45 PM, Jeff Davis  wrote:
> On Tue, 2010-11-30 at 11:12 -0500, Robert Haas wrote:
>> > 3. This doesn't work tremendously well for inheritance trees, where
>> > ModifyTable acts as sort of an implicit Append node.  You can't just
>> > funnel all the tuples through one Sort or Limit node because they aren't
>> > all the same rowtype.  (Limit might perhaps not care, but Sort will.)
>> > But you can't have a separate Sort/Limit for each table either, because
>> > that would give the wrong behavior.  Another problem with funneling all
>> > the rows through one Sort/Limit is that ModifyTable did need to know
>> > which table each row came from, so it can apply the modify to the right
>> > table.
>>
>> Could you possibly have ModifyTable -> Limit -> MergeAppend?
>
> Before MergeAppend knows which tuple to produce, it needs to see the
> tuples (at least the first one from each of its children), meaning that
> it needs to pull them through ModifyTable; and at that point it's
> already too late.
>
> Also, assuming LIMIT K, MergeAppend will have N children, meaning N
> limits, meaning an effective limit of K*N rather than K.
>
> Can you be a little more specific about what you mean?

You seem to be imagining the MergeAppend node on top, but I had it in
the other order in my mind.  The ModifyTable node would be the
outermost plan node, pulling from the Limit, which would deliver the
first n table rows from the MergeAppend, which would be responsible for
getting it from the various child tables.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] profiling connection overhead

2010-11-30 Thread Tom Lane
Peter Eisentraut  writes:
> On mån, 2010-11-29 at 13:10 -0500, Tom Lane wrote:
>> Rolling in calloc in place of
>> malloc/memset made no particular difference either, which says that
>> Fedora 13's glibc does not have any optimization for that case as I'd
>> hoped.

> glibc's calloc is either mmap of /dev/zero or malloc followed by memset.

Hmm.  I would have expected to see a difference then.  Do you know what
conditions are needed to cause the mmap to be used?

regards, tom lane



Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Andrew Dunstan



On 11/30/2010 03:16 PM, Andres Freund wrote:
> On Tuesday 30 November 2010 20:24:52 Marko Tiikkaja wrote:
>>> On 11/30/2010 02:12 PM, Kevin Grittner wrote:
>>>> Daniel Loureiro wrote:
>>>>> to me the key its security - its a anti-DBA-with-lack-of-attention
>>>>> feature.
>>>> Well, it seems pretty weak to me for that purpose.  You still trash
>>>> data, and you don't have any immediate clue as to what.
>>> I agree, that argument is completely misconceived. If the DBA is paying
>>> enough attention to use LIMIT, s/he should be paying enough attention
>>> not to do damage in the first place. If that were the only argument in
>>> its favor I'd be completely against the feature.
>> I don't buy the argument either; why would you put a LIMIT there and
>> delete one row by accident when you could put a BEGIN; in front and not
>> do any damage at all?
> Because the delete of the whole table may take awfully long?




I don't see that that has anything to do with restricting damage. LIMIT 
might be useful for the reason you give, but not as any sort of 
protection against DBA carelessness. That's what the discussion above is 
about.


cheers

andrew



Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Alastair Turner
On Tue, Nov 30, 2010 at 9:24 PM, Marko Tiikkaja wrote:
>> On 11/30/2010 02:12 PM, Kevin Grittner wrote:
>>>
>>> Daniel Loureiro   wrote:
>>>
 to me the key its security - its a anti-DBA-with-lack-of-attention
 feature.
>>>
>>> Well, it seems pretty weak to me for that purpose.  You still trash
>>> data, and you don't have any immediate clue as to what.
>>
>> I agree, that argument is completely misconceived. If the DBA is paying
>> enough attention to use LIMIT, s/he should be paying enough attention
>> not to do damage in the first place. If that were the only argument in
>> its favor I'd be completely against the feature.
>
> I don't buy the argument either; why would you put a LIMIT there and delete
> one row by accident when you could put a BEGIN; in front and not do any
> damage at all?
>
It is valuable as a DBA carelessness/typo catcher only if it is
imposed by default (in line with Kevin's point), and only if it rolls
back rather than reduces the number of affected rows (as per Marko).

We have implemented a damage limitation solution similar to this with
triggers on an MSSQL database, and it has worked for the specific
environment it's in. The safety net is basically that the DBA has to
set an environment variable before a very large delete or update
operation. If the operation is recognised as being beyond the
threshold size the environment variable is checked - if it is set the
transaction passes and the variable is reset, if not the transaction
is rolled back.

It should be possible to implement something along these lines in
triggers, all that would be needed is a structure for defining the
(optional) limits on potentially destructive operations. More flexible
options or options based on the number of rows in a table will rapidly
increase the performance impact of the triggers - but may make them
more useful.

I'm not sure if there is a way to persist data (like a row count)
between per row triggers so that the operation could be aborted at the
limit rather than only once all the rows had been updated (potentially
a big performance gain).

Alastair "Bell" Turner

Technical Lead
^F5



Re: [HACKERS] Spread checkpoint sync

2010-11-30 Thread Greg Smith

Ron Mayer wrote:
> Might smoother checkpoints be better solved by talking
> to the OS vendors & virtual-memory-tuning-knob-authors
> to work with them on exposing the ideal knobs; rather than
> saying that our only tool is a hammer (fsync) so the problem
> must be handled as a nail.


Maybe, but it's hard to argue that the current implementation--just 
doing all of the sync calls as fast as possible, one after the other--is 
going to produce worst-case behavior in a lot of situations.  Given that 
it's not a huge amount of code to do better, I'd rather do some work in 
that direction, instead of presuming the kernel authors will ever make 
this go away.  Spreading the writes out as part of the checkpoint rework 
in 8.3 worked better than any kernel changes I've tested since then, and 
I'm not real optimistic about this getting resolved at the system level.  
So long as the database changes aren't antagonistic toward kernel 
improvements, I'd prefer to have some options here that become effective 
as soon as the database code is done.


I've attached an updated version of the initial sync spreading patch 
here, one that applies cleanly on top of HEAD and over top of the sync 
instrumentation patch too.  The conflict that made that hard before is 
gone now.


Having the pg_stat_bgwriter.buffers_backend_fsync patch available all 
the time now has made me reconsider how important one potential bit of 
refactoring here would be.  I managed to catch one of the situations 
where really popular relations were being heavily updated in a way that 
was competing with the checkpoint on my test system (which I can happily 
share the logs of), with the instrumentation patch applied but not the 
spread sync one:


LOG:  checkpoint starting: xlog
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 7747 of relation base/16424/16442
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 42688 of relation base/16424/16437
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 9723 of relation base/16424/16442
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 58117 of relation base/16424/16437
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 165128 of relation base/16424/16437
[330 of these total, all referring to the same two relations]

DEBUG:  checkpoint sync: number=1 file=base/16424/16448_fsm time=10132.83 msec
DEBUG:  checkpoint sync: number=2 file=base/16424/11645 time=0.001000 msec
DEBUG:  checkpoint sync: number=3 file=base/16424/16437 time=7.796000 msec
DEBUG:  checkpoint sync: number=4 file=base/16424/16448 time=4.679000 msec
DEBUG:  checkpoint sync: number=5 file=base/16424/11607 time=0.001000 msec
DEBUG:  checkpoint sync: number=6 file=base/16424/16437.1 time=3.101000 msec
DEBUG:  checkpoint sync: number=7 file=base/16424/16442 time=4.172000 msec
DEBUG:  checkpoint sync: number=8 file=base/16424/16428_vm time=0.001000 msec
DEBUG:  checkpoint sync: number=9 file=base/16424/16437_fsm time=0.001000 msec
DEBUG:  checkpoint sync: number=10 file=base/16424/16428 time=0.001000 msec
DEBUG:  checkpoint sync: number=11 file=base/16424/16425 time=0.00 msec
DEBUG:  checkpoint sync: number=12 file=base/16424/16437_vm time=0.001000 msec
DEBUG:  checkpoint sync: number=13 file=base/16424/16425_vm time=0.001000 msec
LOG:  checkpoint complete: wrote 3032 buffers (74.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=1.742 s, sync=10.153 s, total=37.654 s; sync files=13, longest=10.132 s, average=0.779 s


Note here how the checkpoint was hung on trying to get 16448_fsm written 
out, but the backends were issuing constant competing fsync calls to 
these other relations.  This is very similar to the production case this 
patch was written to address, which I hadn't been able to share a good 
example of yet.  That's essentially what it looks like, except with the 
contention going on for minutes instead of seconds.


One of the ideas Simon and I had been considering at one point was 
adding some better de-duplication logic to the fsync absorb code, which 
I'm reminded by the pattern here might be helpful independently of other 
improvements.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 620b197..501cab8 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -143,8 +143,8 @@ typedef struct
 
 static BgWriterShmemStruct *BgWriterShmem;
 
-/* interval for calling AbsorbFsyncRequests in CheckpointWriteDelay */
-#define WRITES_PER_ABSORB		1000
+/* Fraction of fsync absorb queue that needs to be filled before acting */
+#def

Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Andres Freund
On Tuesday 30 November 2010 20:24:52 Marko Tiikkaja wrote:
> > On 11/30/2010 02:12 PM, Kevin Grittner wrote:
> >> Daniel Loureiro   wrote:
> >>> to me the key its security - its a anti-DBA-with-lack-of-attention
> >>> feature.
> >> 
> >> Well, it seems pretty weak to me for that purpose.  You still trash
> >> data, and you don't have any immediate clue as to what.
> > 
> > I agree, that argument is completely misconceived. If the DBA is paying
> > enough attention to use LIMIT, s/he should be paying enough attention
> > not to do damage in the first place. If that were the only argument in
> > its favor I'd be completely against the feature.
> 
> I don't buy the argument either; why would you put a LIMIT there and
> delete one row by accident when you could put a BEGIN; in front and not
> do any damage at all?
Because the delete of the whole table may take awfully long?

Andres



Re: [HACKERS] Instrument checkpoint sync calls

2010-11-30 Thread Jeff Janes
On Tue, Nov 30, 2010 at 8:38 AM, Greg Smith  wrote:


Hi Greg,

Thanks for the update.



> This might be ready for some proper review now.  I know there's at least one
> blatant bug still in here I haven't found yet, related to how the averages
> are computed.

Yes, the blatant bug:

average_sync_time = CheckpointStats.ckpt_longest_sync /
CheckpointStats.ckpt_sync_rels;

That should clearly be ckpt_agg_sync_time, not ckpt_longest_sync.


> I saw this once:
>
> LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s)
> added, 0 removed, 1 recycled; write=0.000 s, sync=0.000 s, total=0.001 s;
> sync files=0, longest=0.000 s, average=-9223372036854775808.-2147483 s
>
> After an immediate checkpoint, so at least one path not quite right yet.

Not clear what the right thing to do here is.  I guess we should
special case the div by zero to yield zero for the average?

The patch is in unified diff rather than context diff.  Doesn't bother
me, but the wiki on doing a review says...

Cheers,

Jeff



[HACKERS] Where are we on Standby Promotion?

2010-11-30 Thread Josh Berkus
Fujii, Simon, Greg, etc.:

Has anyone submitted or committed a patch to make Standby Promotion*
easier, at this point?  We discussed it earlier in the dev cycle, but I
can't find anything which has actually been submitted.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

* that is, promotion to be a new master of other existing standbys



Re: [HACKERS] Another proposal for table synonyms

2010-11-30 Thread Josh Berkus
Alexey,

> Here is the proposal to add synonyms to PostgreSQL. Initial goal is to add 
> synonyms
> for relations (tables, views, sequences) and an infrastructure to allow 
> synonyms
> for other database objects in the future. 

Can you explain, for our benefit, the use case for this?  Specifically,
what can be done with synonyms which can't be done with search_path and
VIEWs?

I ask partly because I've migrated some Oracle databases to PostgreSQL,
and did not find replacing the functionality of synonyms to be at all
difficult.  Presumably you've run into a case which was difficult?

BTW, I have a specific use case for *column* synonyms which isn't
currently covered by our existing tools.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



Re: [HACKERS] Fix for seg picksplit function

2010-11-30 Thread Yeb Havinga

On 2010-11-16 09:57, Alexander Korotkov wrote:
> On Tue, Nov 16, 2010 at 3:07 AM, Robert Haas wrote:
>> The loop that begins here:
>>
>>    for (i = 0; i < maxoff; i++)
>>    {
>>    /* First half of segs goes to the left datum. */
>>    if (i < seed_2)
>>
>> ...looks like it should perhaps be broken into two separate loops.
>> That might also help tweak the logic in a way that eliminates this:
>>
>> seg.c: In function ‘gseg_picksplit’:
>> seg.c:327: warning: ‘datum_r’ may be used uninitialized in this function
>> seg.c:326: warning: ‘datum_l’ may be used uninitialized in this function
>
> I restored the original version of that loop.
>
>> But on a broader note, I'm not very certain the sorting algorithm is
>> sensible.  For example, suppose you have 10 segments that are exactly
>> '0' and 20 segments that are exactly '1'.  Maybe I'm misunderstanding,
>> but it seems like this will result in a 15/15 split when we almost
>> certainly want a 10/20 split.  I think there will be problems in more
>> complex cases as well.  The documentation says about the less-than and
>> greater-than operators that "These operators do not make a lot of
>> sense for any practical purpose but sorting."
>
> I think almost any split algorithm has corner cases where its results
> don't look very good. I think the way to understand the significance of
> these corner cases for real life is to perform sufficient testing on
> datasets that are close to real life. I don't feel able to propose
> enough test datasets, or to estimate their significance for real-life
> cases, on my own, and I need help in this field.

I think it is time to mark this patch ready for committer:

The unintuitive result thus far is that sorting outperforms the R-tree 
bounding-box style index, as Alexander has demonstrated with several 
different distributions on 20-11 (uniform, natural (is that a bell 
curve?), many distinct values).


My personal opinion is that I like the single loop for walking over the 
sort array (aka gbt_num_picksplit) more than the two different ones, but 
I'm in the minority here.


Two remarks on this patch also apply to other picksplit functions 
currently around:
1) the *first = *last = FirstOffsetNumber assignment, as noted by 
Alvaro, is not necessary anymore, and Oleg confirmed this is true 
since PostgreSQL > 7.x.
2) loops over something other than the entryvector had better not use 
FirstOffsetNumber/OffsetNumberNext, as indicated by Tom.


If this patch is committed, it might be an idea to change the other 
occurences as well.


regards,
Yeb Havinga



[HACKERS] KNNGIST next step: adjusting indexAM API

2010-11-30 Thread Tom Lane
In the current KNNGIST patch, the indexable ORDER BY clauses are
transmitted to the executor by cramming them in with the index qual
conditions (the IndexScan plan node's indexqual list), from whence
they become part of the ScanKey array passed to the index AM.
Robert complained that this was an ingenious way to minimize the
number of lines touched by the patch but utterly ugly from any other
standpoint, and I quite agree.  An ORDER BY clause is a completely
different thing from a WHERE qual, so mixing them together doesn't
seem like a good idea.

However, if we hold to that principle then we need to modify the indexAM
API to pass the ordering operators separately.  This is no big deal as
far as the built-in AMs are concerned, particularly because 3 of the 4
need only assert that the additional list is empty.  The only reason it
would be a problem is if there were third-party index AMs that would be
affected to a larger degree; but I don't know of any.  Does anyone have
an objection to that?

(Another thing that might be worth changing, as long as we have to touch
the beginscan and rescan APIs anyway, is to refactor the handling of
the initial set of scan keys.  It never made any sense to me for
RelationGetIndexScan to call index_rescan: that seems to accomplish
little except making it difficult for AM beginscan routines to do things
in a sane order.  I'm inclined to take that out and let the AM call
rescan internally if it wants to.)

Lastly, I'm pretty un-thrilled with the way that the KNNGIST patch
implements the interface to the opclass-specific hook functions.
Seems like it would be cleaner to leave the Consistent function alone
and invent a new, separate hook function for processing ORDER BY.
Is there a strong reason for having both things done in one call,
or was that just done as a byproduct of trying to cram all the data
into one ScanKey array?

regards, tom lane



Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Marko Tiikkaja
While reading this thread, I thought of two things I think we could do 
if this feature was implemented:


 1. Sort large UPDATE/DELETEs so it is done in heap order

This is actually a TODO item.  I imagine it would be possible to do 
something like:


DELETE FROM foo USING (...) ORDER BY ctid;

with this patch to help this case.

 2. Reducing deadlocks in big UPDATE/DELETEs

One problem that sometimes occurs when doing multiple multi-row UPDATEs 
or DELETEs concurrently is that the transactions end up working on the 
same rows, but in a different order.  One could use an ORDER BY clause 
to make sure the transactions don't deadlock.


Thoughts?


Regards,
Marko Tiikkaja



Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Jeff Davis
On Tue, 2010-11-30 at 11:12 -0500, Robert Haas wrote:
> > 3. This doesn't work tremendously well for inheritance trees, where
> > ModifyTable acts as sort of an implicit Append node.  You can't just
> > funnel all the tuples through one Sort or Limit node because they aren't
> > all the same rowtype.  (Limit might perhaps not care, but Sort will.)
> > But you can't have a separate Sort/Limit for each table either, because
> > that would give the wrong behavior.  Another problem with funneling all
> > the rows through one Sort/Limit is that ModifyTable did need to know
> > which table each row came from, so it can apply the modify to the right
> > table.
> 
> Could you possibly have ModifyTable -> Limit -> MergeAppend?

Before MergeAppend knows which tuple to produce, it needs to see the
tuples (at least the first one from each of its children), meaning that
it needs to pull them through ModifyTable; and at that point it's
already too late.

Also, assuming LIMIT K, MergeAppend will have N children, meaning N
limits, meaning an effective limit of K*N rather than K.
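The K*N effect can be sketched numerically (hypothetical sorted children
standing in for the inheritance partitions):

```python
import heapq
from itertools import islice

K, N = 3, 4
# Each child of the inheritance tree produces its rows in sorted order;
# a per-child LIMIT K still lets K*N rows through to the merge step,
# even though only K survive the final limit.
children = [sorted(range(i, 40, N))[:K] for i in range(N)]
rows_pulled = sum(len(c) for c in children)
final = list(islice(heapq.merge(*children), K))
print(rows_pulled, final)  # → 12 [0, 1, 2]
```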

Can you be a little more specific about what you mean?

Regards,
Jeff Davis




Re: [HACKERS] profiling connection overhead

2010-11-30 Thread Peter Eisentraut
On mån, 2010-11-29 at 13:10 -0500, Tom Lane wrote:
> Rolling in calloc in place of
> malloc/memset made no particular difference either, which says that
> Fedora 13's glibc does not have any optimization for that case as I'd
> hoped.

glibc's calloc is either mmap of /dev/zero or malloc followed by memset.




Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Marko Tiikkaja

On 11/30/2010 02:12 PM, Kevin Grittner wrote:

Daniel Loureiro   wrote:


to me the key its security - its a anti-DBA-with-lack-of-attention
feature.


Well, it seems pretty weak to me for that purpose.  You still trash
data, and you don't have any immediate clue as to what.


I agree, that argument is completely misconceived. If the DBA is paying
enough attention to use LIMIT, s/he should be paying enough attention
not to do damage in the first place. If that were the only argument in
its favor I'd be completely against the feature.


I don't buy the argument either; why would you put a LIMIT there and 
delete one row by accident when you could put a BEGIN; in front and not 
do any damage at all?



Regards,
Marko Tiikkaja



Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Andrew Dunstan



On 11/30/2010 02:12 PM, Kevin Grittner wrote:

Daniel Loureiro  wrote:


to me the key its security - its a anti-DBA-with-lack-of-attention
feature.


Well, it seems pretty weak to me for that purpose.  You still trash
data, and you don't have any immediate clue as to what.


I agree, that argument is completely misconceived. If the DBA is paying 
enough attention to use LIMIT, s/he should be paying enough attention 
not to do damage in the first place. If that were the only argument in 
its favor I'd be completely against the feature.


cheers

andrew



Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Kevin Grittner
Daniel Loureiro  wrote:
 
> to me the key its security - its a anti-DBA-with-lack-of-attention
> feature.
 
Well, it seems pretty weak to me for that purpose.  You still trash
data, and you don't have any immediate clue as to what.  If you
wanted protection from that you'd want more of an "assert limit"
that would fail if the affected row count was above what you
specified.
 
For me the best solution is to develop good habits.  I first type my
statement as "SELECT * FROM ..." and after reviewing the results
arrow up and replace "SELECT *" with "DELETE".  If there's enough
volatility or complexity to make that insufficient insurance, I
begin a transaction.  That way I can not only review row counts but
run queries against the modified data to confirm correct
modification before issuing a COMMIT (or ROLLBACK).
 
The batching of updates so that vacuums can make space available for
re-use is more compelling to me, but still pretty iffy, since the
work-arounds aren't that hard to find.
 
-Kevin



Re: [HACKERS] [GENERAL] column-level update privs + lock table

2010-11-30 Thread Tom Lane
Robert Haas  writes:
> Well, a user with full-table UPDATE privileges can trash the whole
> thing, so, from a security perspective, letting them lock is only
> allowing them to deny access to data they could have just as easily
> trashed.  A user with only single-column UPDATE privileges cannot
> trash the whole table, though, so you are allowing them to deny read
> access to data they may not themselves have permission either to read
> or to update.

> Admittedly, this seems a bit more rickety in light of your point that
> they can still lock all the rows... but then that only stops writes,
> not reads. I'm less convinced that I'm right about this than I was 3
> days ago.  But I'm still not convinced that the above argument is
> wrong, either.

I think what your complaint really boils down to is that LOCK TABLE
is granting excessive permissions to someone who only has table-level
UPDATE privilege.  If we were designing that from scratch today, I am
sure we'd have made it tighter, say that UPDATE alone wouldn't give you
more than RowExclusive lock.  However, given the lack of complaints
about this from the field, I can't get very excited about a
non-backward-compatible change to tighten LOCK's privilege checking.

I find the argument that column-level update should give weaker locking
privileges than full-table update to be pretty thin.  That isn't the
case at the row level; why should it be true at the table level?

However, the other side of the "lack of complaints" argument is that
few people seem to care about whether LOCK TABLE responds to column
level privileges, either.  AFAICS this patch is not in response to any
user request but just because Josh thought things were inconsistent.
Which they are, but at a deeper level than this.  If we apply this
patch, then we'll be expanding the set of cases where LOCK is granting
privilege too freely, and thus creating more not less
backwards-compatibility problem if we are ever to make it saner.

On the whole I agree with Robert --- let's just adjust the
documentation, and not enlarge privileges in a way we might regret
later.

regards, tom lane



Re: [HACKERS] Instrument checkpoint sync calls

2010-11-30 Thread Cédric Villemain
2010/11/30 Greg Smith :
> Jeff Janes wrote:
>>
>> For the individual file sync times emitted under debug1, it would be
>> very handy if the file being synced was identified, for example
>> "relation base/16384/16523". Rather than being numbered sequentially
>> within a given checkpoint.
>>
>
> I was numbering them sequentially so that it's straightforward to graph the
> sync times in an external analysis tool, but the relation data is helpful
> too.  New patch reflecting all upthread suggestions is attached.  The output
> looks like this now at DEBUG1:
>
> LOG:  checkpoint starting: xlog
> DEBUG:  checkpoint sync: number=1 file=base/16424/11645 time=11589.549000
> msec
> DEBUG:  checkpoint sync: number=2 file=base/16424/16438 time=16.148000 msec
> DEBUG:  checkpoint sync: number=3 file=base/16424/16437 time=53.53 msec
> DEBUG:  checkpoint sync: number=4 file=base/16424/16447 time=10.214000 msec
> DEBUG:  checkpoint sync: number=5 file=base/16424/11607 time=1.499000 msec
> DEBUG:  checkpoint sync: number=6 file=base/16424/16425_fsm time=2.921000
> msec
> DEBUG:  checkpoint sync: number=7 file=base/16424/16437.1 time=4.237000 msec
> DEBUG:  checkpoint sync: number=8 file=base/16424/16428_fsm time=1.654000
> msec
> DEBUG:  checkpoint sync: number=9 file=base/16424/16442 time=7.92 msec
> DEBUG:  checkpoint sync: number=10 file=base/16424/16428_vm time=2.613000
> msec
> DEBUG:  checkpoint sync: number=11 file=base/16424/11618 time=1.468000 msec
> DEBUG:  checkpoint sync: number=12 file=base/16424/16437_fsm time=2.638000
> msec
> DEBUG:  checkpoint sync: number=13 file=base/16424/16428 time=2.883000 msec
> DEBUG:  checkpoint sync: number=14 file=base/16424/16425 time=3.369000 msec
> DEBUG:  checkpoint sync: number=15 file=base/16424/16437_vm time=8.686000
> msec
> DEBUG:  checkpoint sync: number=16 file=base/16424/16425_vm time=5.984000
> msec
> LOG:  checkpoint complete: wrote 2074 buffers (50.6%); 0 transaction log
> file(s) added, 0 removed, 3 recycled; write=0.617 s, sync=11.715 s,
> total=22.167 s; sync files=16, longest=11.589 s, average=0.724 s
>
> I kept the units for the DEBUG level ones in msec because that's a better
> scale for the common really short syncs during good behavior.  But the
> summary info in seconds now appears at the end of the existing "checkpoint
> complete" message, so only one line to parse for those looking to analyze
> the gross checkpoint data.  That looks to work well enough for finding
> situations like the big ext3 spikes.  You can easily see one in this example
> by the fact that "longest=11.589 s" is almost the entirety of "sync=11.715
> s".  That's the really key thing there's currently no visibility into,
> that's made obvious with this patch.

wonderful.

>
> This might be ready for some proper review now.  I know there's at least one
> blatant bug still in here I haven't found yet, related to how the averages
> are computed.  I saw this once:
>
> LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s)
> added, 0 removed, 1 recycled; write=0.000 s, sync=0.000 s, total=0.001 s;
> sync files=0, longest=0.000 s, average=-9223372036854775808.-2147483 s
>
> After an immediate checkpoint, so at least one path not quite right yet.
>
> --
> Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
>



-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
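A sketch of post-processing those DEBUG lines (assuming exactly the format
quoted above), including a guard for the sync files=0 case behind the bogus
average in the "wrote 0 buffers" log line:

```python
import re

# Parse "checkpoint sync" DEBUG lines and report the slowest file --
# the ext3-spike case Greg describes.
LINE = re.compile(r"checkpoint sync: number=(\d+) file=(\S+) time=([\d.]+) msec")

log = """\
DEBUG:  checkpoint sync: number=1 file=base/16424/11645 time=11589.549000 msec
DEBUG:  checkpoint sync: number=2 file=base/16424/16438 time=16.148000 msec
DEBUG:  checkpoint sync: number=3 file=base/16424/16437 time=53.530000 msec
"""

syncs = [(m.group(2), float(m.group(3))) for m in LINE.finditer(log)]
total = sum(t for _, t in syncs)
# Guard files == 0 (an immediate checkpoint with nothing to sync) so the
# average stays sane instead of turning into garbage.
average = total / len(syncs) if syncs else 0.0
longest = max(syncs, key=lambda s: s[1]) if syncs else None
print(longest, round(average, 3))  # → ('base/16424/11645', 11589.549) 3886.409
```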



Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Daniel Loureiro
to me the key is security - it's an anti-DBA-with-lack-of-attention feature.
If I forget the WHERE clause, I will delete some valid tuples and mess up the
DB, but that is less bad than wiping out the whole table. A DBA who has never
forgotten a WHERE in a DELETE is not a DBA. Just kidding, but this happens
often enough.

Is there another option to implement this? Is it possible to do it with a
plugin/extension (in the Firefox browser style)?

Regards,
--
Daniel Loureiro
--

2010/11/30 Andrew Dunstan 

>
>
> On 11/30/2010 09:57 AM, Csaba Nagy wrote:
>
>
> So it is really an ideological thing and not lack of demand or
> implementation attempts... I for myself can't write working C code
> anyway, so I got my peace with the workaround - I wish you good luck
> arguing Tom :-)
>
>
>
>
> We need a convincing use case for it. So far the only one that's seemed at
> all convincing to me is the one about deleting in batches. But that might be
> enough.
>
> As for it being illogical, I don't think it's any more so than
>
> DELETE FROM foo WHERE random() < 0.1;
>
> and you can do that today.
>
> cheers
>
> andrew
>
>


Re: [HACKERS] [GENERAL] column-level update privs + lock table

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 7:26 AM, Simon Riggs  wrote:
> On Mon, 2010-11-29 at 21:37 -0500, Josh Kupershmidt wrote:
>
>> I still see little reason to make LOCK TABLE permissions different for
>> column-level vs. table-level UPDATE privileges
>
> Agreed.
>
> This is the crux of the debate. Why should this inconsistency be allowed
> to continue?

Well, a user with full-table UPDATE privileges can trash the whole
thing, so, from a security perspective, letting them lock is only
allowing them to deny access to data they could have just as easily
trashed.  A user with only single-column UPDATE privileges cannot
trash the whole table, though, so you are allowing them to deny read
access to data they may not themselves have permission either to read
or to update.

Admittedly, this seems a bit more rickety in light of your point that
they can still lock all the rows... but then that only stops writes,
not reads. I'm less convinced that I'm right about this than I was 3
days ago.  But I'm still not convinced that the above argument is
wrong, either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Daniel Loureiro
> 3. This doesn't work tremendously well for inheritance trees, where
> ModifyTable acts as sort of an implicit Append node.  You can't just
> funnel all the tuples through one Sort or Limit node because they aren't
> all the same rowtype.  (Limit might perhaps not care, but Sort will.)
> But you can't have a separate Sort/Limit for each table either, because
> that would give the wrong behavior.  Another problem with funneling all
> the rows through one Sort/Limit is that ModifyTable did need to know
> which table each row came from, so it can apply the modify to the right
> table.

So I guess I chose the wrong hack to start with.

Just out of curiosity, why isn't the result of the WHERE filter (in
SELECT/DELETE/UPDATE) put in memory, i.e. an array of ctids, like a buffer,
and then processed by SELECT/DELETE/UPDATE all at once?

Greets,
--
Daniel Loureiro



Re: [HACKERS] [GENERAL] column-level update privs + lock table

2010-11-30 Thread Simon Riggs
On Mon, 2010-11-29 at 21:37 -0500, Josh Kupershmidt wrote:

> I still see little reason to make LOCK TABLE permissions different for
> column-level vs. table-level UPDATE privileges 

Agreed.

This is the crux of the debate. Why should this inconsistency be allowed
to continue?

Are there covert channel issues here, KaiGai?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 




Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Heikki Linnakangas  writes:
> On 30.11.2010 19:22, Tom Lane wrote:
>> But having said that, I wonder whether we need a full-page image for
>> a WAL-logged action that is known to involve only setting a single bit
>> and updating LSN.

> You have to write a full-page image if you update the LSN, because 
> otherwise the next update that comes along will not write a full page image.

Um.  Drat.  I was thinking about the replay side, where I think it would
actually work --- but you're right, it would break the logic on the
generation side.  Unless you want to put in some kind of flag saying
"this was only a visibility bit update, any bigger update still needs
to write an FPI".

regards, tom lane



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 12:25 PM, Robert Haas  wrote:
> On Tue, Nov 30, 2010 at 12:22 PM, Tom Lane  wrote:
>> But having said that, I wonder whether we need a full-page image for
>> a WAL-logged action that is known to involve only setting a single bit
>> and updating LSN.  Would omitting the FPI be any more risky than what
>> happens now (ie, the page does get written back to disk at some point,
>> without any image from which it can be rewritten if the write fails...)
>
> That's pretty much exactly what Heikki proposed 35 minutes ago, and
> you objected 6 minutes later.  I still think it might work.

Oh, I see the difference now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 12:22 PM, Tom Lane  wrote:
> But having said that, I wonder whether we need a full-page image for
> a WAL-logged action that is known to involve only setting a single bit
> and updating LSN.  Would omitting the FPI be any more risky than what
> happens now (ie, the page does get written back to disk at some point,
> without any image from which it can be rewritten if the write fails...)

That's pretty much exactly what Heikki proposed 35 minutes ago, and
you objected 6 minutes later.  I still think it might work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Heikki Linnakangas

On 30.11.2010 19:22, Tom Lane wrote:

But having said that, I wonder whether we need a full-page image for
a WAL-logged action that is known to involve only setting a single bit
and updating LSN.  Would omitting the FPI be any more risky than what
happens now (ie, the page does get written back to disk at some point,
without any image from which it can be rewritten if the write fails...)


You have to write a full-page image if you update the LSN, because 
otherwise the next update that comes along will not write a full page image.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Robert Haas  writes:
> On Tue, Nov 30, 2010 at 12:10 PM, Tom Lane  wrote:
>> It's ridiculous to claim that that "doubles the cost of VACUUM".  In the
>> worst case, it will add 25% to the cost of setting an all-visible bit on
>> a page where there is no other work to do.  (You already are writing out
>> the heap page and the VM page,

> True.

>> plus a WAL image of the heap page, so a

> False.  That is exactly what we are NOT doing now and what we must
> find a way to avoid doing.

I do not accept that argument.  You can't make an omelette without
breaking eggs, and the cost of index-only scans is going to be that
it costs more to get the visibility bits set in the first place.

But having said that, I wonder whether we need a full-page image for
a WAL-logged action that is known to involve only setting a single bit
and updating LSN.  Would omitting the FPI be any more risky than what
happens now (ie, the page does get written back to disk at some point,
without any image from which it can be rewritten if the write fails...)

regards, tom lane



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 12:10 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> We're not going to double the cost of VACUUM to get index-only scans.
>> And that's exactly what will happen if you do full-page writes of
>> every heap page to set a single bit.
>
> It's ridiculous to claim that that "doubles the cost of VACUUM".  In the
> worst case, it will add 25% to the cost of setting an all-visible bit on
> a page where there is no other work to do.  (You already are writing out
> the heap page and the VM page,

True.

> plus a WAL image of the heap page, so a

False.  That is exactly what we are NOT doing now and what we must
find a way to avoid doing.

> WAL image of the VM page adds 25%.  But only if you did not set any
> other bits on the same VM page, which is probably not a real common
> case.)  Given that VACUUM has a lot of other cleanup besides visibility
> bit setting, I'm not convinced that this would even be noticeable.
>
> I think the burden is on people who are proposing complicated mechanisms
> to show that there's actually a strong need for them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Robert Haas  writes:
> We're not going to double the cost of VACUUM to get index-only scans.
> And that's exactly what will happen if you do full-page writes of
> every heap page to set a single bit.

It's ridiculous to claim that that "doubles the cost of VACUUM".  In the
worst case, it will add 25% to the cost of setting an all-visible bit on
a page where there is no other work to do.  (You already are writing out
the heap page and the VM page, plus a WAL image of the heap page, so a
WAL image of the VM page adds 25%.  But only if you did not set any
other bits on the same VM page, which is probably not a real common
case.)  Given that VACUUM has a lot of other cleanup besides visibility
bit setting, I'm not convinced that this would even be noticeable.

I think the burden is on people who are proposing complicated mechanisms
to show that there's actually a strong need for them.

regards, tom lane



Re: [HACKERS] Another proposal for table synonyms

2010-11-30 Thread Tom Lane
Alexey Klyukin  writes:
> On Nov 30, 2010, at 6:28 PM, Tom Lane wrote:
>> This is not going to work, at least not without making every type of
>> lookup consult pg_synonym too, which I think can be considered DOA
>> because of its performance impact on people who aren't even using the
>> feature.

> For those not using synonyms it would result in an extra syscache lookup for
> each schema from the search_path that doesn't contain the table with the
> specified name. If the table is specified with a FQN or contained in the first
> schema from the search_path, no extra lookup would occur. Is it considered a
> big impact?

Yes.  It'll be slow and it will render code that's already unreasonably
complicated into an unreadable morass.  We are not going there.

(Just to be clear, it's not the table search case I'm worried about;
it's operator/function lookup that I think this would be completely
unacceptable for.  And if you're only going to support table/view
synonyms then you might as well put them in pg_class.)

> I think that would almost guarantee that synonyms for other kinds of objects
> (e.g. databases; such synonyms were requested in the past) would never
> be added.

That's fine with me.

regards, tom lane



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 11:59 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Tue, Nov 30, 2010 at 11:40 AM, Tom Lane  wrote:
>>> Ouch.  That seems like it could shoot down all these proposals.  There
>>> definitely isn't any way to make VM crash-safe if there is no WAL-driven
>>> mechanism for setting the bits.
>
>> Heikki's intent method works fine, because the WAL record only clears
>> the visibility map bits on redo; it never sets them.
>
> Uh, no, because he also had that final WAL record that would set the
> bits.

Well, as already discussed upthread, that WAL record causes some other
problems, so make it Heikki's intent method, without the final WAL
record that breaks things.

>> We could actually allow the slave to set the visibility map bits based
>> on its own xmin horizon.
>
> Not in a crash-safe way, which is exactly the problem here.

Brilliant selective quoting.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 11:55 AM, Tom Lane  wrote:
> Heikki Linnakangas  writes:
>> Can we get away with not setting the LSN on the heap page, even though
>> we set the PD_ALL_VISIBLE flag? If we don't set the LSN, the heap page
>> can be flushed to disk before the WAL record, but I think that's fine
>> because it's OK to have the flag set in the heap page even if the VM bit
>> is not set.
>
> Why is that fine?  It's certainly not fine from the standpoint of
> someone wondering why his index-only scan performs so badly.
>
> I think all this hair-splitting about cases where it's okay to have one
> bit set and not the other is misguided.  To me, crash-safety of the VM
> means that its copy of the page-header bit is right.  Period.  Yes, it
> will cost something to ensure that; so what?  If we don't get more than
> enough compensating performance gain from index-only scans, the whole
> patch is going to end up reverted.

We're not going to double the cost of VACUUM to get index-only scans.
And that's exactly what will happen if you do full-page writes of
every heap page to set a single bit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 11:49 AM, Heikki Linnakangas
 wrote:
> On 30.11.2010 18:33, Tom Lane wrote:
>>
>> Robert Haas  writes:
>>>
>>> Oh, but it's worse than that.  When you XLOG a WAL record for each of
>>> those pages, you're going to trigger full-page writes for all of them.
>>>  So now you've turned 1GB of data to write into 2+ GB of data to
>>> write.
>>
>> No, because only the first mod of each VM page would trigger a full page
>> write, at least assuming a reasonable ordering of the operations.
>
> If you change the LSN on the heap pages, you have to write full page images
> of those as well.
>
> Let's recap what happens when a VM bit is set: You set the PD_ALL_VISIBLE
> flag on the heap page (assuming it's not set already, it usually isn't), and
> then set the bit in the VM while keeping the heap page locked.
>
> Can we get away with not setting the LSN on the heap page, even though we
> set the PD_ALL_VISIBLE flag? If we don't set the LSN, the heap page can be
> flushed to disk before the WAL record, but I think that's fine because it's
> OK to have the flag set in the heap page even if the VM bit is not set.

I don't immediately see why that wouldn't work.  As long as you bump
the LSN on the visibility map page, and also bump the LSN of the
visibility map page every time you clear a bit, I think you should be
OK.
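The invariant can be sketched with a toy model of the FPI rule (an
illustration, not PostgreSQL code): a record carries a full-page image iff
the page LSN predates the checkpoint's redo pointer, so leaving the heap
page's LSN alone keeps the next real update safe.

```python
# Toy model of full-page-image (FPI) decisions.
CHECKPOINT_REDO = 100

def update(page_lsn, new_lsn, bump_lsn):
    """Return (resulting page LSN, whether an FPI would be attached)."""
    fpi = page_lsn <= CHECKPOINT_REDO
    return (new_lsn if bump_lsn else page_lsn), fpi

# Proposed: set PD_ALL_VISIBLE without bumping the heap LSN.
lsn, _ = update(50, 101, bump_lsn=False)
_, next_fpi = update(lsn, 102, bump_lsn=True)
print(next_fpi)  # → True: the next real update still writes its FPI

# Broken variant: bump the LSN for the bit-set while skipping its FPI ...
lsn, _ = update(50, 101, bump_lsn=True)
_, next_fpi = update(lsn, 102, bump_lsn=True)
print(next_fpi)  # → False: the next update skips its FPI (torn-page hazard)
```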

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Robert Haas  writes:
> On Tue, Nov 30, 2010 at 11:40 AM, Tom Lane  wrote:
>> Ouch.  That seems like it could shoot down all these proposals.  There
>> definitely isn't any way to make VM crash-safe if there is no WAL-driven
>> mechanism for setting the bits.

> Heikki's intent method works fine, because the WAL record only clears
> the visibility map bits on redo; it never sets them.

Uh, no, because he also had that final WAL record that would set the
bits.

> We could actually allow the slave to set the visibility map bits based
> on its own xmin horizon.

Not in a crash-safe way, which is exactly the problem here.

regards, tom lane



Re: [HACKERS] Another proposal for table synonyms

2010-11-30 Thread Alexey Klyukin

On Nov 30, 2010, at 6:28 PM, Tom Lane wrote:

> Alexey Klyukin  writes:
>> To support addition of new database objects types that can be referenced by
>> synonyms a new system catalog, pg_synonym, is to be added, with an oid to
>> support comments on synonym, and the following schema:
> 
> This is not going to work, at least not without making every type of
> lookup consult pg_synonym too, which I think can be considered DOA
> because of its performance impact on people who aren't even using the
> feature.

For those not using synonyms it would result in an extra syscache lookup for
each schema from the search_path that doesn't contain the table with the
specified name. If the table is specified with a FQN or contained in the first
schema from the search_path, no extra lookup would occur. Is it considered a
big impact? The number of such lookups can be reduced if we traverse the
search_path for the tables first, and then look for the synonyms, although
that would change the lookup rules stated in this proposal.
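The lookup cost in question can be sketched with a toy resolver (hypothetical
catalogs, not the real syscache): per schema in the search_path, the table
catalog is probed first, then the synonym catalog.

```python
# Toy resolver: the synonym catalog adds one extra probe per schema that
# lacks a real table of the requested name.
search_path = ["app", "util", "public"]
tables = {("public", "foo")}
synonyms = {("util", "foo"): ("public", "foo")}

def resolve(name):
    probes = 0
    for schema in search_path:
        probes += 1                       # pg_class syscache probe
        if (schema, name) in tables:
            return (schema, name), probes
        probes += 1                       # extra pg_synonym probe
        if (schema, name) in synonyms:
            return synonyms[(schema, name)], probes
    return None, probes

print(resolve("foo"))  # → (('public', 'foo'), 4)
```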

>  It's also quite unclear how you prevent duplicate names
> if the synonyms are in their own catalog.  (And no, the part of your
> proposal that says you're not preventing that isn't acceptable from
> a usability standpoint.)

What's wrong with the usability of that feature? The fact that a table with
the same FQN as a synonym masks the latter can be clearly stated in the
documentation. Are you expecting lots of people to name a synonym exactly the
same as one of the database tables and then wonder why the table, and not the
synonym, gets accessed? As an alternative, a warning could be emitted during
table/synonym creation/renaming if a name clash occurs.

> 
> You could reasonably support synonyms for tables/views by storing them
> in pg_class with a new relkind.  This doesn't cover synonyms for other
> object types, but since the total world demand for such a feature is
> approximately zero, that's not really a problem.

I think that would almost guarantee that synonyms for other kinds of objects
(e.g. databases; such synonyms were requested in the past) would never
be added.


--
Alexey Klyukin  http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Heikki Linnakangas  writes:
> Can we get away with not setting the LSN on the heap page, even though 
> we set the PD_ALL_VISIBLE flag? If we don't set the LSN, the heap page 
> can be flushed to disk before the WAL record, but I think that's fine 
> because it's OK to have the flag set in the heap page even if the VM bit 
> is not set.

Why is that fine?  It's certainly not fine from the standpoint of
someone wondering why his index-only scan performs so badly.

I think all this hair-splitting about cases where it's okay to have one
bit set and not the other is misguided.  To me, crash-safety of the VM
means that its copy of the page-header bit is right.  Period.  Yes, it
will cost something to ensure that; so what?  If we don't get more than
enough compensating performance gain from index-only scans, the whole
patch is going to end up reverted.

regards, tom lane



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 11:40 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> That's definitely sucky, but in some ways it would be more complicated
>> if they did, because I don't think all-visible on the master implies
>> all-visible on the standby.
>
> Ouch.  That seems like it could shoot down all these proposals.  There
> definitely isn't any way to make VM crash-safe if there is no WAL-driven
> mechanism for setting the bits.

Heikki's intent method works fine, because the WAL record only clears
the visibility map bits on redo; it never sets them.

> I guess what we need is a way to delay the application of such a WAL
> record on the slave until it's safe, which means the record also has to
> carry some indication of the youngest XMIN on the page.

I'm unexcited about inventing more ways to delay XLOG application on
the standby.  We have enough of those already.

We could actually allow the slave to set the visibility map bits based
on its own xmin horizon.  The only problem is that you wouldn't be
able to write the intent XLOG records.  I suppose you could have a
separate file which is just used to store the intent records, and
designate a range of very-high numbered LSNs to mean blocks of the
intent file rather than a position in the regular WAL stream.  VACUUM
is so much fun on the master, let's have it on the standby too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Heikki Linnakangas

On 30.11.2010 18:33, Tom Lane wrote:
> Robert Haas  writes:
>> Oh, but it's worse than that.  When you XLOG a WAL record for each of
>> those pages, you're going to trigger full-page writes for all of them.
>> So now you've turned 1GB of data to write into 2+ GB of data to
>> write.
>
> No, because only the first mod of each VM page would trigger a full page
> write, at least assuming a reasonable ordering of the operations.

If you change the LSN on the heap pages, you have to write full page 
images of those as well.

Let's recap what happens when a VM bit is set: You set the 
PD_ALL_VISIBLE flag on the heap page (assuming it's not set already, it 
usually isn't), and then set the bit in the VM while keeping the heap 
page locked.

Can we get away with not setting the LSN on the heap page, even though 
we set the PD_ALL_VISIBLE flag? If we don't set the LSN, the heap page 
can be flushed to disk before the WAL record, but I think that's fine 
because it's OK to have the flag set in the heap page even if the VM bit 
is not set.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Heikki Linnakangas

On 30.11.2010 18:40, Tom Lane wrote:
> Robert Haas  writes:
>> That's definitely sucky, but in some ways it would be more complicated
>> if they did, because I don't think all-visible on the master implies
>> all-visible on the standby.
>
> Ouch.  That seems like it could shoot down all these proposals.  There
> definitely isn't any way to make VM crash-safe if there is no WAL-driven
> mechanism for setting the bits.

Note that this is only a problem for *hot* standby. After failover, all 
the tuples that were visible to everyone in the master are also visible 
to all new transactions in the standby.

We dealt with this in 9.0 already, with the "killed" flag in index 
tuples and the PD_ALL_VISIBLE flag in heap scans. We simply don't 
believe them in hot standby mode, and check visibility even if the flag 
is set.

> I guess what we need is a way to delay the application of such a WAL
> record on the slave until it's safe, which means the record also has to
> carry some indication of the youngest XMIN on the page.

Something like that would certainly be nice. With index-only scans, it 
can be a big disappointment if you can't do an index-only scan in hot 
standby.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 11:33 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> Oh, but it's worse than that.  When you XLOG a WAL record for each of
>> those pages, you're going to trigger full-page writes for all of them.
>>  So now you've turned 1GB of data to write into 2+ GB of data to
>> write.
>
> No, because only the first mod of each VM page would trigger a full page
> write, at least assuming a reasonable ordering of the operations.

I'm not worried about the full-page writes from updating the
visibility map - I'm worried about the full-page writes from updating
the heap.  It doesn't matter a whit if we fail to set a bit in the
visibility map. What matters is if we DO set the bit in the visibility
map but FAIL TO set the bit in the heap, because then a subsequent
update to the heap page won't check the visibility map and clear the
bit.  The *heap* updates are the ones that have to be guaranteed to
make it to disk.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Robert Haas  writes:
> That's definitely sucky, but in some ways it would be more complicated
> if they did, because I don't think all-visible on the master implies
> all-visible on the standby.

Ouch.  That seems like it could shoot down all these proposals.  There
definitely isn't any way to make VM crash-safe if there is no WAL-driven
mechanism for setting the bits.

I guess what we need is a way to delay the application of such a WAL
record on the slave until it's safe, which means the record also has to
carry some indication of the youngest XMIN on the page.

regards, tom lane



Re: [HACKERS] Instrument checkpoint sync calls

2010-11-30 Thread Greg Smith

Jeff Janes wrote:
> For the individual file sync times emitted under debug1, it would be
> very handy if the file being synced was identified, for example
> "relation base/16384/16523". Rather than being numbered sequentially
> within a given checkpoint.


I was numbering them sequentially so that it's straightforward to graph 
the sync times in an external analysis tool, but the relation data is 
helpful too.  New patch reflecting all upthread suggestions is 
attached.  The output looks like this now at DEBUG1:


LOG:  checkpoint starting: xlog
DEBUG:  checkpoint sync: number=1 file=base/16424/11645 time=11589.549000 msec
DEBUG:  checkpoint sync: number=2 file=base/16424/16438 time=16.148000 msec
DEBUG:  checkpoint sync: number=3 file=base/16424/16437 time=53.53 msec
DEBUG:  checkpoint sync: number=4 file=base/16424/16447 time=10.214000 msec
DEBUG:  checkpoint sync: number=5 file=base/16424/11607 time=1.499000 msec
DEBUG:  checkpoint sync: number=6 file=base/16424/16425_fsm time=2.921000 msec
DEBUG:  checkpoint sync: number=7 file=base/16424/16437.1 time=4.237000 msec
DEBUG:  checkpoint sync: number=8 file=base/16424/16428_fsm time=1.654000 msec
DEBUG:  checkpoint sync: number=9 file=base/16424/16442 time=7.92 msec
DEBUG:  checkpoint sync: number=10 file=base/16424/16428_vm time=2.613000 msec
DEBUG:  checkpoint sync: number=11 file=base/16424/11618 time=1.468000 msec
DEBUG:  checkpoint sync: number=12 file=base/16424/16437_fsm time=2.638000 msec
DEBUG:  checkpoint sync: number=13 file=base/16424/16428 time=2.883000 msec
DEBUG:  checkpoint sync: number=14 file=base/16424/16425 time=3.369000 msec
DEBUG:  checkpoint sync: number=15 file=base/16424/16437_vm time=8.686000 msec
DEBUG:  checkpoint sync: number=16 file=base/16424/16425_vm time=5.984000 msec
LOG:  checkpoint complete: wrote 2074 buffers (50.6%); 0 transaction log 
file(s) added, 0 removed, 3 recycled; write=0.617 s, sync=11.715 s, 
total=22.167 s; sync files=16, longest=11.589 s, average=0.724 s


I kept the units for the DEBUG level ones in msec because that's a 
better scale for the common really short syncs during good behavior.  
But the summary info in seconds now appears at the end of the existing 
"checkpoint complete" message, so only one line to parse for those 
looking to analyze the gross checkpoint data.  That looks to work well 
enough for finding situations like the big ext3 spikes.  You can easily 
see one in this example by the fact that "longest=11.589 s" is almost 
the entirety of "sync=11.715 s".  That's the key thing there's 
currently no visibility into, and this patch makes it obvious.


This might be ready for some proper review now.  I know there's at least 
one blatant bug still in here I haven't found yet, related to how the 
averages are computed.  I saw this once:


LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log 
file(s) added, 0 removed, 1 recycled; write=0.000 s, sync=0.000 s, 
total=0.001 s; sync files=0, longest=0.000 s, 
average=-9223372036854775808.-2147483 s


After an immediate checkpoint, so at least one path is not quite right yet.

--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD


diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ede6ceb..6f6eb3b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7063,10 +7063,15 @@ LogCheckpointEnd(bool restartpoint)
 {
 	long		write_secs,
 sync_secs,
-total_secs;
+total_secs,
+longest_secs,
+average_secs;
 	int			write_usecs,
 sync_usecs,
-total_usecs;
+total_usecs,
+longest_usecs,
+average_usecs;
+	double		average_sync_time;
 
 	CheckpointStats.ckpt_end_t = GetCurrentTimestamp();
 
@@ -7082,18 +7087,35 @@ LogCheckpointEnd(bool restartpoint)
 		CheckpointStats.ckpt_sync_end_t,
 		&sync_secs, &sync_usecs);
 
+	/*
+	 * Timing values returned from CheckpointStats are in milliseconds.
+	 * Convert to the second plus microsecond form that TimestampDifference
+	 * returns for homogeneous printing.
+	 */
+	longest_secs = (long) (CheckpointStats.ckpt_longest_sync / 1000);
+	longest_usecs = 1000 * (CheckpointStats.ckpt_longest_sync - longest_secs * 1000);
+
+	average_sync_time = CheckpointStats.ckpt_longest_sync / CheckpointStats.ckpt_sync_rels;
+	average_secs = (long) (average_sync_time / 1000);
+	average_usecs = 1000 * (average_sync_time - average_secs * 1000);
+
 	if (restartpoint)
 		elog(LOG, "restartpoint complete: wrote %d buffers (%.1f%%); "
-			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
+			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
+			 "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s",
 			 CheckpointStats.ckpt_bufs_written,
 			 (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
 			 write_secs, write_usecs / 1000,
 			 sync_secs, sync_usecs / 1000,
-			 total_secs, total_usecs / 1000);
+			 total_secs, total_usecs / 1000,
+			 Checkpo

Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Heikki Linnakangas  writes:
> On 30.11.2010 18:10, Tom Lane wrote:
>> I'm not convinced it works at all.  Consider write intent record,
>> checkpoint, set bit, crash before completing vacuum.  There will be
>> no second intent record at which you could clean up if things are
>> inconsistent.

> That's why you need to check the RedoRecPtr when you set the bit. If it 
> has changed, ie. a checkpoint has happened, the set bit step will write 
> a new intent record.

Oh, you explained the proposal poorly then.  I thought you meant recheck
and write another intent record just once, immediately before sending
the final xlog record.

It still seems rickety and not clearly correct, especially when you
start thinking about all the other constraints we have on xlog behavior
(eg, does this work while taking a base backup).

regards, tom lane



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Robert Haas  writes:
> Oh, but it's worse than that.  When you XLOG a WAL record for each of
> those pages, you're going to trigger full-page writes for all of them.
>  So now you've turned 1GB of data to write into 2+ GB of data to
> write.

No, because only the first mod of each VM page would trigger a full page
write, at least assuming a reasonable ordering of the operations.

regards, tom lane



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Heikki Linnakangas

On 30.11.2010 18:10, Tom Lane wrote:
> Heikki Linnakangas  writes:
>> Yeah, I'm not terribly excited about any of these schemes. The "intent"
>> record seems like the simplest one, but even that is so different
>> from the traditional WAL-logging we do that it makes me slightly nervous.
>
> I'm not convinced it works at all.  Consider write intent record,
> checkpoint, set bit, crash before completing vacuum.  There will be
> no second intent record at which you could clean up if things are
> inconsistent.

That's why you need to check the RedoRecPtr when you set the bit. If it 
has changed, i.e. a checkpoint has happened, the set-bit step will write 
a new intent record.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Another proposal for table synonyms

2010-11-30 Thread Tom Lane
Alexey Klyukin  writes:
> To support addition of new database objects types that can be referenced by
> synonyms a new system catalog, pg_synonym, is to be added, with an oid to
> support comments on synonym, and the following schema:

This is not going to work, at least not without making every type of
lookup consult pg_synonym too, which I think can be considered DOA
because of its performance impact on people who aren't even using the
feature.  It's also quite unclear how you prevent duplicate names
if the synonyms are in their own catalog.  (And no, the part of your
proposal that says you're not preventing that isn't acceptable from
a usability standpoint.)

You could reasonably support synonyms for tables/views by storing them
in pg_class with a new relkind.  This doesn't cover synonyms for other
object types, but since the total world demand for such a feature is
approximately zero, that's not really a problem.

regards, tom lane



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Heikki Linnakangas

On 30.11.2010 18:22, Robert Haas wrote:
> On Tue, Nov 30, 2010 at 11:16 AM, Tom Lane  wrote:
>> How much is "quite a lot"?  Do we have any real reason to think that
>> this solution is unacceptable performance-wise?
>
> Well, let's imagine a 1GB insert-only table.  It has 128K pages.  If
> you XLOG setting the bit on each page, you'll need to write 128K WAL
> records, each containing a 12-byte relfilenode and a 4-byte block
> offset, for a total of 16 bytes of WAL per page, thus 2MB of WAL.

Plus WAL headers; I think it's something like 32 or 40 bytes of WAL per 
page.

> But you did just dirty a gigabyte of data.

Good point.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 11:22 AM, Robert Haas  wrote:
> On Tue, Nov 30, 2010 at 11:16 AM, Tom Lane  wrote:
>> How much is "quite a lot"?  Do we have any real reason to think that
>> this solution is unacceptable performance-wise?
>
> Well, let's imagine a 1GB insert-only table.  It has 128K pages.  If
> you XLOG setting the bit on each page, you'll need to write 128K WAL
> records, each containing a 12-byte relfilenode and a 4-byte block
> offset, for a total of 16 bytes of WAL per page, thus 2MB of WAL.
>
> But you did just dirty a gigabyte of data.

Oh, but it's worse than that.  When you XLOG a WAL record for each of
those pages, you're going to trigger full-page writes for all of them.
 So now you've turned 1GB of data to write into 2+ GB of data to
write.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [GENERAL] column-level update privs + lock table

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 9:07 AM, Josh Kupershmidt  wrote:
> On Mon, Nov 29, 2010 at 10:06 PM, Robert Haas  wrote:
>> On Mon, Nov 29, 2010 at 9:37 PM, Josh Kupershmidt  wrote:
>>> I actually hadn't thought of that, for some reason.
>>>
>>> We used to similarly recommend that people handle TRUNCATE privileges
>>> with a security definer function. That doesn't mean GRANT TRUNCATE
>>> wasn't a sweet addition to 8.4.
>>
>> Hmm, glad you like it (I wrote that).  I'm just asking how far we
>> should go before we decide we catering to use cases that are too
>> narrow to warrant an extension of the permissions system.
>
> I am slightly opposed to adding GRANTs for LOCK TABLE, ANALYZE,
> VACUUM, etc. The GRANT help page is long enough already, and I doubt
> many users would use them, even though I might use GRANT LOCK TABLE
> myself.

You'd really probably want GRANT LOCK TABLE (SHARE), GRANT LOCK TABLE
(EXCLUSIVE), ...

It'd be sort of cool, but it doesn't seem worth the complexity.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 11:16 AM, Tom Lane  wrote:
> How much is "quite a lot"?  Do we have any real reason to think that
> this solution is unacceptable performance-wise?

Well, let's imagine a 1GB insert-only table.  It has 128K pages.  If
you XLOG setting the bit on each page, you'll need to write 128K WAL
records, each containing a 12-byte relfilenode and a 4-byte block
offset, for a total of 16 bytes of WAL per page, thus 2MB of WAL.

But you did just dirty a gigabyte of data.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Heikki Linnakangas  writes:
> The trivial solution to this is to WAL-log setting the visibility map 
> bit, like we WAL-log any other operation. Lock the heap page, lock the 
> visibility map page, write WAL-record, and release locks. That works, 
> but the problem is that it creates quite a lot of new WAL traffic.

How much is "quite a lot"?  Do we have any real reason to think that
this solution is unacceptable performance-wise?

I'd also suggest that if you want to prevent torn-page syndrome on VM
pages (and if you want to rely on their LSN values, you do) then you
probably don't have any choice anyway.  VM pages will have to adhere to
the same write-full-page-on-first-mod-after-checkpoint rule as any other
page.  I'd guess that this will swamp any savings from cutesy schemes
for reducing the number of WAL records.

> We could reduce the WAL traffic by simply updating multiple pages at a 
> time. Lock N pages, lock the visibility map page, write one WAL record, 
> and release locks.

I don't think that will work, because you have to hold the lock on a
page from the time you check that it's all-visible to the time you apply
the update.  The loss of concurrency against updates would be pretty
bad, and I think you'd be creating significant risk of deadlocks from
holding multiple buffer locks at once.

regards, tom lane



Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 11:04 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Tue, Nov 30, 2010 at 10:04 AM, Tom Lane  wrote:
>>> Unfortunately, the UPDATE
>>> case would be an order of magnitude harder (think inheritance trees
>>> where the children aren't all alike).
>
>> I don't understand why there's anything more to this than sticking a
>> Limit node either immediately above or immediately below the
>> ModifyTable node.
>
> 1. You need to support ORDER BY too, otherwise I *will* be on the
> warpath against this as a foot-gun with no redeeming social value.

Will you be wielding a Tom-ahawk?

> 2. So what you need is Sort underneath Limit underneath ModifyTable.
> Putting them above it would be quite the wrong semantics.

OK.

> 3. This doesn't work tremendously well for inheritance trees, where
> ModifyTable acts as sort of an implicit Append node.  You can't just
> funnel all the tuples through one Sort or Limit node because they aren't
> all the same rowtype.  (Limit might perhaps not care, but Sort will.)
> But you can't have a separate Sort/Limit for each table either, because
> that would give the wrong behavior.  Another problem with funneling all
> the rows through one Sort/Limit is that ModifyTable did need to know
> which table each row came from, so it can apply the modify to the right
> table.

Could you possibly have ModifyTable -> Limit -> MergeAppend?

> Or I guess you could try to persuade us that DELETE/UPDATE with ORDER BY
> or LIMIT doesn't need to support inherited target tables.  I wouldn't
> bet on that proposal flying either.

I've spent enough time worrying about the fact that tables with
inheritance children don't behave as nicely as those that don't to
have any interest in going in the other direction.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Heikki Linnakangas  writes:
> On 30.11.2010 17:38, Tom Lane wrote:
>> Wouldn't it be easier and more robust to just consider VM bit changes to
>> be part of the WAL-logged actions?  That would include updating LSNs on
>> VM pages and flushing VM pages to disk during checkpoint based on their
>> LSN values.  All of these other schemes seem too complicated and not
>> provably correct.

> The vm bit can be set once all the tuples on the page become visible to 
> everyone. There is no WAL-logged action at that point we could piggyback on.

So you start emitting a WAL entry for the act of setting the VM bit
(and I guess the page header hint bit would be included in that too).

> Yeah, I'm not terribly excited about any of these schemes. The "intent" 
> record seems like the simplest one, but even that is so different 
> from the traditional WAL-logging we do that it makes me slightly nervous.

I'm not convinced it works at all.  Consider write intent record,
checkpoint, set bit, crash before completing vacuum.  There will be
no second intent record at which you could clean up if things are
inconsistent.

regards, tom lane



[HACKERS] Another proposal for table synonyms

2010-11-30 Thread Alexey Klyukin
Hello,

Here is a proposal to add synonyms to PostgreSQL. The initial goal is to add
synonyms for relations (tables, views, sequences) and an infrastructure to
allow synonyms for other database objects in the future.

A thread with discussion of an old proposal by Jonah Harris is here: 
http://archives.postgresql.org/pgsql-hackers/2006-03/msg00519.php

-
Synonyms are database objects, which act as an alias for other objects. In
this proposal the synonyms for tables, views and sequences will be considered.

The new command, CREATE SYNONYM, defines a synonym. The syntax is:

CREATE SYNONYM synonym_name FOR object_type object_name
where
synonym_name: fully-qualified name (FQN) of the synonym
object_type: {TABLE | VIEW | SEQUENCE}. In the future, new object_types, such as
functions, can be added.
object_name:  FQN of a database table, view or sequence.

Another new command, DROP SYNONYM, is used for deleting an already existing
synonym without removing the object the synonym references. The syntax is:

DROP SYNONYM synonym_name
where synonym_name is a FQN of a synonym.

Comments will be supported on synonyms with the following command:
COMMENT ON SYNONYM synonym_name IS comment_text
where synonym_name is a FQN of a synonym.

To support the addition of new database object types that can be referenced
by synonyms, a new system catalog, pg_synonym, is to be added, with an OID to
support comments on synonyms, and the following schema:

synname       name  name of the synonym
synnamespace  oid   OID of the namespace that contains the synonym
synclassid    oid   OID of the system catalog that contains the referenced object
synobjid      oid   OID of the referenced object

When resolving a synonym name, the usual search_path lookup rules apply,
i.e. first, an object of the appropriate type is looked up in the schema,
then a synonym, and the process then iterates with the next schema from the
search_path. Note that a table synonym with the same FQN as an existing
table will be masked by that table.

To speed up synonym name resolution a new syscache, SYNNAMENSPCLASS
{synname, synnamespace, synclassid}, will be introduced. This cache will be
consulted if the query to the RELNAMENSP syscache returns no result, with
the DB object's catalog OID set to the pg_class OID.

For table and view synonyms, INSERT/UPDATE/DELETE/SELECT will be supported.
For sequences SELECT will be supported. The commands will translate synonyms
to the referenced database objects on the parser stage.

All types of synonyms will be supported as table arguments/return value types,
as well as actual values (i.e. currval/nextval will accept a sequence
synonym).

The following DDL will work transparently with table synonyms (and with
sequence and view synonyms where the corresponding command applies):
COPY, LOCK, TRUNCATE, EXPLAIN, EXPLAIN ANALYZE.

The following DDL commands will cause an error when called for table
(sequence, view) synonyms:
ALTER {TABLE|VIEW|SEQUENCE}, 
ANALYZE, 
CLUSTER, 
COMMENT ON {TABLE | VIEW | SEQUENCE} .. IS, 
DROP {TABLE | VIEW | SEQUENCE}, 
GRANT,
REVOKE,
VACUUM.
For these commands additional checks for synonyms will be introduced on a
per-command basis.

A dependency of the referenced object on a synonym will be added when adding a
new synonym to forbid removing a referenced object without removing the
synonym first (without using CASCADE). On DROP SYNONYM the related dependency
will be removed.

--
Alexey Klyukin  http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc




Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Csaba Nagy
Hi Robert,

On Tue, 2010-11-30 at 09:19 -0500, Robert Haas wrote:
> That's a very elegant hack, but not exactly obvious to a novice user
> or, say, me.  So I think it'd be nicer to have the obvious syntax
> work.

I fully agree - but you first have to convince the core hackers that this is
not just a foot-gun. This was discussed many times in the past; patches
were also offered (perhaps not complete ones, but proving that there is
an itch getting scratched):

http://archives.postgresql.org/pgsql-patches/2002-09/msg00255.php

The reaction:

http://archives.postgresql.org/pgsql-patches/2002-09/msg00256.php

There are other discussions too; if I remember correctly, Tom once
admitted that the core of implementing the feature would likely consist
of letting it work, as the infrastructure to do it is there but is
actively disabled. I can't find the mail now though.

So it is really an ideological thing and not a lack of demand or of
implementation attempts... I for myself can't write working C code
anyway, so I've made my peace with the workaround - I wish you good luck
arguing with Tom :-)

Cheers,
Csaba.





Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Tom Lane
Robert Haas  writes:
> On Tue, Nov 30, 2010 at 10:04 AM, Tom Lane  wrote:
>> Unfortunately, the UPDATE
>> case would be an order of magnitude harder (think inheritance trees
>> where the children aren't all alike).

> I don't understand why there's anything more to this than sticking a
> Limit node either immediately above or immediately below the
> ModifyTable node.

1. You need to support ORDER BY too, otherwise I *will* be on the
warpath against this as a foot-gun with no redeeming social value.

2. So what you need is Sort underneath Limit underneath ModifyTable.
Putting them above it would be quite the wrong semantics.

3. This doesn't work tremendously well for inheritance trees, where
ModifyTable acts as sort of an implicit Append node.  You can't just
funnel all the tuples through one Sort or Limit node because they aren't
all the same rowtype.  (Limit might perhaps not care, but Sort will.)
But you can't have a separate Sort/Limit for each table either, because
that would give the wrong behavior.  Another problem with funneling all
the rows through one Sort/Limit is that ModifyTable would need to know
which table each row came from, so it can apply the modification to the
right table.

I don't offhand see a solution other than integrating the responsibility
for limit-counting and sorting into ModifyTable itself, making it into
an unholy union of ModifyTable+Limit+MergeAppend (with the individual
inputs required to deliver sorted outputs separately).  That's
sufficiently ugly, and probably bad for performance in the normal case,
that I don't think it's going to be acceptable for such a marginal
feature.

Or I guess you could try to persuade us that DELETE/UPDATE with ORDER BY
or LIMIT doesn't need to support inherited target tables.  I wouldn't
bet on that proposal flying either.

regards, tom lane



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Heikki Linnakangas

Here's one more idea:

The trivial solution to this is to WAL-log setting the visibility map 
bit, like we WAL-log any other operation. Lock the heap page, lock the 
visibility map page, write WAL-record, and release locks. That works, 
but the problem is that it creates quite a lot of new WAL traffic.


We could reduce the WAL traffic by simply updating multiple pages at a 
time. Lock N pages, lock the visibility map page, write one WAL record, 
and release locks. If N=10, for example, we only need to WAL-log a 
couple of bytes per page, so the WAL volume should be acceptable. The 
downside is that you need to keep more pages locked at the same time, 
but maybe that's not too bad.


This wouldn't require anything special, which means fewer hard-to-debug 
visibility & recovery bugs.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 10:43 AM, Heikki Linnakangas
 wrote:
>> It seems like you'll need to hold some kind of lock between the time
>> you examine RedoRecPtr and the time you actually examine the bit.
>> WALInsertLock in shared mode, maybe?
>
> It's enough to hold an exclusive lock on the visibility map page. You have
> to set the bit first, and then check RedoRecPtr, and if it changed, write
> the XLOG record before releasing the lock. If RedoRecPtr changes any time
> before we check RedoRecPtr, we'll write the XLOG record so we're safe. If it
> changes after that, we're safe because the checkpoint will flush the updated
> heap page and visibility map page.

Brilliant.  I assume that we need to call GetRedoRecPtr() after taking
the exclusive lock on the page, though?

> Yeah, possibly. It also means that the set bits will not propagate to
> standby servers, though.

That's definitely sucky, but in some ways it would be more complicated
if they did, because I don't think all-visible on the master implies
all-visible on the standby.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Heikki Linnakangas

On 30.11.2010 17:38, Tom Lane wrote:

Heikki Linnakangas  writes:

On 30.11.2010 06:57, Robert Haas wrote:

I can't say I'm totally in love with any of these designs.  Anyone
else have any ideas, or any opinions about which one is best?



Well, the design I've been pondering goes like this:


Wouldn't it be easier and more robust to just consider VM bit changes to
be part of the WAL-logged actions?  That would include updating LSNs on
VM pages and flushing VM pages to disk during checkpoint based on their
LSN values.  All of these other schemes seem too complicated and not
provably correct.


The vm bit can be set once all the tuples on the page become visible to 
everyone. There is no WAL-logged action at that point we could piggyback on.


Clearing the bit is already handled like that - replay of heap 
insert/update/delete records clear the visibility map bit.



Of course, that'd mean doing the bit changes inside the critical
sections for the related actions, so it's not a trivial change
code-wise, but neither are these other ideas.


Yeah, I'm not terribly excited about any of these schemes. The "intent" 
record seems like the simplest one, but even that is sufficiently different 
from the traditional WAL-logging we do that it makes me slightly nervous.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 10:38 AM, Tom Lane  wrote:
> Heikki Linnakangas  writes:
>> On 30.11.2010 06:57, Robert Haas wrote:
>>> I can't say I'm totally in love with any of these designs.  Anyone
>>> else have any ideas, or any opinions about which one is best?
>
>> Well, the design I've been pondering goes like this:
>
> Wouldn't it be easier and more robust to just consider VM bit changes to
> be part of the WAL-logged actions?  That would include updating LSNs on
> VM pages and flushing VM pages to disk during checkpoint based on their
> LSN values.  All of these other schemes seem too complicated and not
> provably correct.

What WAL-logged actions?

The problem case is where a page has no tuples or line pointers that
need to be removed, and all we need to do is mark it all-visible.  We
don't currently WAL-log anything in that case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Heikki Linnakangas

On 30.11.2010 17:32, Robert Haas wrote:

On Tue, Nov 30, 2010 at 2:34 AM, Heikki Linnakangas
  wrote:

Some care is needed with checkpoints. Setting visibility map bits in step 2
is safe because crash recovery will replay the intent XLOG record and clear
any incorrectly set bits. But if a checkpoint has happened after the intent
XLOG record was written, that's not true. This can be avoided by checking
RedoRecPtr in step 2, and writing a new intent XLOG record if it has changed
since the last intent XLOG record was written.


It seems like you'll need to hold some kind of lock between the time
you examine RedoRecPtr and the time you actually examine the bit.
WALInsertLock in shared mode, maybe?


It's enough to hold an exclusive lock on the visibility map page. You 
have to set the bit first, and then check RedoRecPtr, and if it changed, 
write the XLOG record before releasing the lock. If RedoRecPtr changes 
any time before we check RedoRecPtr, we'll write the XLOG record so 
we're safe. If it changes after that, we're safe because the checkpoint 
will flush the updated heap page and visibility map page.



There's a small race condition in the way a visibility map bit is currently
cleared. When a heap page is updated, it is locked, the update is
WAL-logged, and the lock is released. The visibility map page is updated
only after that. If the final vacuum XLOG record is written just after
updating the heap page, but before the visibility map bit is cleared,
replaying the final XLOG record will set a bit that should not have been
set.


Well, if that final XLOG record isn't necessary for correctness
anyway, the obvious thing to do seems to be - don't write it.  Crashes
are not so common that loss of even a full hour's visibility map bits
in the event that we have one seems worth killing ourselves over.  And
not everybody sets checkpoint_timeout to an hour, and not all
checkpoints are triggered by checkpoint_timeout, and not all crashes
happen just before it expires.  Seems like we might be better off
writing that much less WAL.


Yeah, possibly. It also means that the set bits will not propagate to 
standby servers, though.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 10:04 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Mon, Nov 29, 2010 at 11:25 PM, Andrew Dunstan  wrote:
>>> I can't say I'd be excited by this feature. In quite a few years of writing
>>> SQL I don't recall ever wanting such a gadget.
>
>> It's something I've wanted periodically, though not badly enough to do
>> the work to make it happen.
>
> It would certainly look like nothing but a crude hack if the feature is
> only available for DELETE and not UPDATE.

I'm not sure this is true, given Andrew's comment that the bulk
deletion argument is the only one he finds compelling, but I'd surely
be in favor of supporting both.

> Unfortunately, the UPDATE
> case would be an order of magnitude harder (think inheritance trees
> where the children aren't all alike).

I don't understand why there's anything more to this than sticking a
Limit node either immediately above or immediately below the
ModifyTable node.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Heikki Linnakangas  writes:
> On 30.11.2010 06:57, Robert Haas wrote:
>> I can't say I'm totally in love with any of these designs.  Anyone
>> else have any ideas, or any opinions about which one is best?

> Well, the design I've been pondering goes like this:

Wouldn't it be easier and more robust to just consider VM bit changes to
be part of the WAL-logged actions?  That would include updating LSNs on
VM pages and flushing VM pages to disk during checkpoint based on their
LSN values.  All of these other schemes seem too complicated and not
provably correct.

Of course, that'd mean doing the bit changes inside the critical
sections for the related actions, so it's not a trivial change
code-wise, but neither are these other ideas.

regards, tom lane



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 2:34 AM, Heikki Linnakangas
 wrote:
> Some care is needed with checkpoints. Setting visibility map bits in step 2
> is safe because crash recovery will replay the intent XLOG record and clear
> any incorrectly set bits. But if a checkpoint has happened after the intent
> XLOG record was written, that's not true. This can be avoided by checking
> RedoRecPtr in step 2, and writing a new intent XLOG record if it has changed
> since the last intent XLOG record was written.

It seems like you'll need to hold some kind of lock between the time
you examine RedoRecPtr and the time you actually examine the bit.
WALInsertLock in shared mode, maybe?

> There's a small race condition in the way a visibility map bit is currently
> cleared. When a heap page is updated, it is locked, the update is
> WAL-logged, and the lock is released. The visibility map page is updated
> only after that. If the final vacuum XLOG record is written just after
> updating the heap page, but before the visibility map bit is cleared,
> replaying the final XLOG record will set a bit that should not have been
> set.

Well, if that final XLOG record isn't necessary for correctness
anyway, the obvious thing to do seems to be - don't write it.  Crashes
are not so common that loss of even a full hour's visibility map bits
in the event that we have one seems worth killing ourselves over.  And
not everybody sets checkpoint_timeout to an hour, and not all
checkpoints are triggered by checkpoint_timeout, and not all crashes
happen just before it expires.  Seems like we might be better off
writing that much less WAL.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] GiST insert algorithm rewrite

2010-11-30 Thread Heikki Linnakangas

On 30.11.2010 16:23, Robert Haas wrote:

On Tue, Nov 30, 2010 at 5:02 AM, Heikki Linnakangas
  wrote:

On 30.11.2010 11:55, Heikki Linnakangas wrote:


On 27.11.2010 21:31, Bruce Momjian wrote:


Heikki Linnakangas wrote:


There's no on-disk format changes, except for the additional flag in the
page headers, so this does not affect pg_upgrade. However, if there's
any "invalid" keys in the old index because of an incomplete insertion,
the new code will not understand that. So you should run vacuum to
ensure that there's no such invalid keys in the index before upgrading.
Vacuum will print a message in the log if it finds any, and you will
have to reindex. But that's what it suggests you to do anyway.


OK, pg_upgrade has code to report invalid gin and hash indexes because
of changes between PG 8.3 and 8.4. Is this something we would do for
9.0 to 9.1?


9.1. The problem that started this whole thing is there in older
versions, but given the lack of real-life reports and the scale of the
changes required it doesn't seem wise to backport.


Oh sorry, I read your question as "9.0 *or* 9.1".

Only GiST indexes that have any "invalid" tuples in them, as a result of a
crash, need to be reindexed. That's very rare in practice, so we shouldn't
invalidate all GiST indexes. I don't think there's any simple way to check
whether reindex is required, so I think we have to just document this.


It seems odd to say, the indexes are corrupted, but they're probably
not, so let's not worry about it.

I assume there's no way to make the new code cope with any
pre-existing corruption?

Does the current code cope with the corruption?


It's not corruption, but "intended degradation". Yes, the current code 
copes with it, that's how GiST survives a crash. However, even with the 
current code, VACUUM will nag if it finds any invalid tuples with this 
message:


ereport(NOTICE,
	(errmsg("index \"%s\" needs VACUUM FULL or REINDEX to finish crash recovery",


That's harmless, in the sense that all scans and inserts work fine, but 
scans might need to do more work than if the invalid tuple wasn't there.


I don't think we need to go out of our way to support such degraded 
indexes in 9.1. If you see such notices in your logs, you should REINDEX 
anyway, before or after pg_upgrade. Let's just make sure that you get a 
reasonable error message in 9.1 if a scan or insert encounters such a tuple.


There is a section on this in the docs, BTW: 
http://www.postgresql.org/docs/9.0/static/gist-recovery.html


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


