Re: [HACKERS] DOMAINs and CASTs

2012-04-24 Thread Jaime Casanova
On Tue, Nov 29, 2011 at 10:12 PM, Robert Haas  wrote:
> On Tue, Nov 29, 2011 at 11:11 AM, Jaime Casanova  
> wrote:
>> On Tue, Nov 29, 2011 at 10:42 AM, Tom Lane  wrote:
>>> Bruce Momjian  writes:
>>>> Tom Lane wrote:
>>>>> Robert Haas  writes:
>>>>>> Well, if we apply this, it has the possibility to break existing
>>>>>> dumps.
>>>
>>> BTW, it occurs to me that we could dodge that objection, with much less
>>> work than Robert suggests, if we just made the message be a WARNING not
>>> an ERROR.  I think that'd do just as well in terms of what the message
>>> could usefully accomplish, ie, steer people away from doing things that
>>> won't work.  Still not sure that it's worth doing though,
>>>
>>
>> i'm fine with a WARNING
>
> Me too; I suggested it before (so did you).
>

are we going to put this warning in this release?

-- 
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: 24x7 support and training



Re: [HACKERS] [PATCH] lock_timeout and common SIGALRM framework

2012-04-24 Thread Boszormenyi Zoltan

On 2012-04-23 15:08, Marc Cousin wrote:

On Mon, 2012-04-23 at 10:53 +0200, Boszormenyi Zoltan wrote:

On 2012-04-10 09:02, Boszormenyi Zoltan wrote:

On 2012-04-06 14:47, Cousin Marc wrote:

On 05/04/12 08:02, Boszormenyi Zoltan wrote:

On 2012-04-04 21:30, Alvaro Herrera wrote:

I think this patch is doing two things: first touching infrastructure
stuff and then adding lock_timeout on top of that.  Would it work to
split the patch in two pieces?


Sure. Attached is the split version.

Best regards,
Zoltán Böszörményi


Hi,

I've started looking at and testing both patches.

Technically speaking, I think the source looks much better than the
first version of lock timeout, and may help with adding other timeouts in the
future. I haven't tested it in depth though, because I encountered the
following problem:

While testing the patch, I found a way to crash PG. But what's weird is
that it also crashes with an unpatched git version.

Here is the way to reproduce it (I have done it with a pgbench schema):

- Set a small statement_timeout (just to save time during the tests)

Session1:
=#BEGIN;
=#lock TABLE pgbench_accounts ;

Session 2:
=#BEGIN;
=#lock TABLE pgbench_accounts ;
ERROR:  canceling statement due to statement timeout
=# lock TABLE pgbench_accounts ;

I'm using \set ON_ERROR_ROLLBACK INTERACTIVE by the way. It can also be
done with a rollback to savepoint of course.
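
A minimal sketch of the savepoint variant (reconstructed, not a verbatim
session):

Session 2:
=#BEGIN;
=#SAVEPOINT s1;
=#lock TABLE pgbench_accounts ;
ERROR:  canceling statement due to statement timeout
=#ROLLBACK TO SAVEPOINT s1;
=#lock TABLE pgbench_accounts ;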

Session 2 crashes with this: TRAP: FailedAssertion("!(locallock->holdsStrongLockCount == 0)", File: "lock.c", Line: 749).

It can also be done without a statement_timeout, and a control-C on the
second lock table.

I didn't touch anything but this. It occurs every time, when asserts are
enabled.

I tried it on 9.1.3, and I couldn't make it crash with the same sequence
of events. So maybe it's something introduced since then? Or is the assert
still valid?

Cheers


Attached are the new patches. I rebased them to current GIT and
they are expected to be applied after Robert Haas' patch in the
"bug in fast-path locking" thread.

Now it survives the above scenario.

Best regards,
Zoltán Böszörményi

New patch attached, rebased to today's GIT.

Best regards,
Zoltán Böszörményi


Ok, I've done what was missing from the review (from when I had a bug in
locking the other day), so here is the full review. By the way, this
patch doesn't belong to the current commitfest, but to the next one.


It was added to 2012-Next when I posted it; 2012-01 was already
closed for new additions.


Is the patch in context diff format?
Yes

Does it apply cleanly to the current git master?
Yes

Does it include reasonable tests, necessary doc patches, etc?
The new lock_timeout GUC is documented. There are regression tests.

Read what the patch is supposed to do, and consider:
Does the patch actually implement that?
Yes

Do we want that?
I do. Mostly for administrative jobs which could lock the whole
application. It would be much easier to run reindexes, vacuum full, etc.
without worrying about bringing the application down because of lock
contention.
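
As a sketch of that use case (my illustration, assuming the GUC behaves
as described in this thread):

=# SET lock_timeout = '2s';      -- give up if a lock isn't granted in 2s
=# VACUUM FULL pgbench_accounts; -- fails fast instead of queueing behind
                                 -- long-held application locks

statement_timeout can't express this, since it bounds total statement
runtime rather than time spent waiting for a lock.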

Do we already have it?
No.

Does it follow SQL spec, or the community-agreed behavior?
I don't know if there is a consensus on this new GUC. statement_timeout
is obviously not in the SQL spec.

Does it include pg_dump support (if applicable)?
Not applicable

Are there dangers?
Yes, as it rewrites all the timeout code. I feel it is much cleaner this
way, as there is a generic set of functions managing all SIGALRM code,
but it heavily touches this part.

Have all the bases been covered?
I tried all the SQL statements I could think of (select, insert, update,
delete, truncate, drop, create index, adding a constraint, lock).

I tried having statement_timeout, lock_timeout and deadlock_timeout at
very short and close or equal values. It worked too.

Rollback to savepoint while holding locks doesn't crash PostgreSQL anymore.

Other timeouts such as archive_timeout and checkpoint_timeout still
work.

Does the feature work as advertised?
Yes

Are there corner cases the author has failed to consider?
I didn't find any.

Are there any assertion failures or crashes?
No.

Does the patch slow down simple tests?
No

If it claims to improve performance, does it?
Not applicable

Does it slow down other things?
No

Does it follow the project coding guidelines?
I think so

Are there portability issues?
No, all the portable code (acquiring locks and manipulating SIGALRM) is
the same as before.

Will it work on Windows/BSD etc?
It should. I couldn't test it though.

Are the comments sufficient and accurate?
Yes

Does it do what it says, correctly?
Yes

Does it produce compiler warnings?
No

Can you make it crash?
Not anymore

Is everything done in a way that fits together coherently with other
features/modules?
Yes, I think so. The new way of handling SIGALRM seems more robust to
me.

Are there interdependencies that can cause problems?
I don't see any.

Regards,

Marc


Thanks for the review.

Best regards,
Zoltán B

Re: [HACKERS] B-tree page deletion boundary cases

2012-04-24 Thread Nikhil Sontakke
> > Was wondering if there's a similar bug which gets triggered while using
> > VACUUM FULL. See for instance this thread:
> >
> > http://postgresql.1045698.n5.nabble.com/index-corruption-in-PG-8-3-13-td4257589.html
> >
> > This issue has been reported on-off from time to time and in most cases
> > VACUUM or VACUUM FULL appears to be involved. We have usually attributed
> > it to hardware issues, and reindex has been recommended by default as a
> > solution/workaround.
>
> I do not perceive much similarity.  The bug I've raised can produce wrong
> query results transiently.  It might permit injecting a tuple into the
> wrong spot in the tree, yielding persistent wrong results.  It would not
> introduce tree-structural anomalies like sibling pointers directed at
> zeroed pages or internal pages in a 1-level tree.  Given the symptoms you
> reported, I share Robert's suspicion of WAL replay in your scenario.
>

Thanks for taking the time to analyze this, Noah.

Regards,
Nikhils


Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-24 Thread Sandro Santilli
On Tue, Apr 24, 2012 at 08:49:26AM +0200, Sandro Santilli wrote:
> On Mon, Apr 23, 2012 at 08:34:44PM +0300, Ants Aasma wrote:

> > SELECT (SELECT reservoir_sample(some_table, 50) AS samples
> >FROM some_table WHERE ctid =~ ANY (rnd_pgtids))
> > FROM random_pages('some_table', 50) AS rnd_pgtids;
> 
> But I don't understand the reservoir_sample call, what is it supposed to do?

Ok got it, that was probably to avoid:

 ERROR:  more than one row returned by a subquery used as an expression

But this also works nicely:

 SELECT * FROM lots_of_points
 WHERE ctid = ANY ( ARRAY[(SELECT random_tids('lots_of_points', 10))] );

and still uses tidscan.

The advanced TID operator would be for random_tids to only return pages rather
than full tids...
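
For concreteness, one possible shape of random_tids (a hypothetical
sketch, not code from this thread; a real version must tolerate tids
that point at dead or absent tuples):

 CREATE FUNCTION random_tids(tbl regclass, n int) RETURNS tid[]
 LANGUAGE sql AS $$
   -- pick $2 random block numbers and aim at line 1 of each block;
   -- 8192 assumes the default page size
   SELECT array_agg(format('(%s,1)',
            floor(random() * (pg_relation_size($1) / 8192)))::tid)
   FROM generate_series(1, $2);
 $$;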

--strk; 

  ,--o-. 
  |   __/  |Delivering high quality PostGIS 2.0 !
  |  / 2.0 |http://strk.keybit.net - http://vizzuality.com
  `-o--'




[HACKERS] PL/PGSQL bug in handling composite types

2012-04-24 Thread Boszormenyi Zoltan

Hi,

we have found a way to make PL/pgSQL throw an error for
a legitimate use case that works in plain SQL.

Minimal test case:

create table x1 (id serial primary key, d timestamptz);
create table x2 (id serial primary key, d timestamptz);
insert into x2 (d) values ('now');
create type mytype as (id bigint, d timestamptz);

Casting a set of values to "mytype" works in SQL:

=# select (max(id),min(d))::mytype from x2;
                 row
-------------------------------------
 (1,"2012-04-24 09:04:07.475315+02")
(1 row)

=# select (max(id),min(d))::mytype from x1;
 row
-----
 (,)
(1 row)

Even empty data in plain text cast to "mytype" works,
and individual fields can be tested for IS NULL:

=# select '(,)'::mytype;
 mytype
--------
 (,)
(1 row)

=# select ('(,)'::mytype).id is null;
 ?column?
----------
 t
(1 row)

Now, try this from PL/PGSQL:

create or replace function testfunc1() returns mytype as $$declare
  v_sql text;
  mt mytype;
begin
  v_sql := 'select (max(id),min(d))::mytype from x1';
  execute v_sql into mt;
  return mt;
end;$$ language plpgsql;

=# select testfunc1();
ERROR:  invalid input syntax for integer: "(,)"
CONTEXT:  PL/pgSQL function "testfunc1" line 6 at EXECUTE statement

The same error happens with table "x2" with data in it:

create or replace function testfunc2() returns mytype as $$declare
  v_sql text;
  mt mytype;
begin
  v_sql := 'select (max(id),min(d))::mytype from x2';
  execute v_sql into mt;
  return mt;
end;$$ language plpgsql;

=# select testfunc2();
ERROR:  invalid input syntax for integer: "(1,"2012-04-24 09:04:07.475315+02")"
CONTEXT:  PL/pgSQL function "testfunc2" line 6 at EXECUTE statement

The same also happens with non-dynamic queries:

create or replace function testfunc1a() returns mytype as $$declare
  mt mytype;
begin
  select (max(id),min(d))::mytype into mt from x1;
  return mt;
end;$$ language plpgsql;

=# select testfunc1a();
ERROR:  invalid input syntax for integer: "(,)"
CONTEXT:  PL/pgSQL function "testfunc1a" line 4 at SQL statement

create or replace function testfunc2a() returns mytype as $$declare
  mt mytype;
begin
  select (max(id),min(d))::mytype into mt from x2;
  return mt;
end;$$ language plpgsql;

=# select testfunc2a();
ERROR:  invalid input syntax for integer: "(1,"2012-04-24 09:04:07.475315+02")"
CONTEXT:  PL/pgSQL function "testfunc2a" line 4 at SQL statement

Best regards,
Zoltán Böszörményi

--
--
Zoltán Böszörményi
Cybertec Schönig&  Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
 http://www.postgresql.at/




Re: [HACKERS] PL/PGSQL bug in handling composite types

2012-04-24 Thread Boszormenyi Zoltan

On 2012-04-24 09:59, Boszormenyi Zoltan wrote:

[full test case from the original message quoted; trimmed]



Sorry, I didn't mention the version tested: 9.0.6 and 9.1.3.

Best regards,
Zoltán Böszörményi

--
--
Zoltán Böszörményi
Cybertec Schönig&  Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
 http://www.postgresql.at/




Re: [HACKERS] PL/PGSQL bug in handling composite types

2012-04-24 Thread Boszormenyi Zoltan

On 2012-04-24 10:01, Boszormenyi Zoltan wrote:

On 2012-04-24 09:59, Boszormenyi Zoltan wrote:

[full test case from the original message quoted; trimmed]



Sorry, I didn't mention the version tested: 9.0.6 and 9.1.3.


The same also happens with current 9.2 GIT.



Best regards,
Zoltán Böszörményi




--
--
Zoltán Böszörményi
Cybertec Schönig&  Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
 http://www.postgresql.at/




Re: [HACKERS] B-tree page deletion boundary cases

2012-04-24 Thread Simon Riggs
On Sat, Apr 21, 2012 at 5:52 PM, Noah Misch  wrote:

> As I mentioned[1] peripherally back in November, that algorithm has been
> insufficient since the introduction of non-XID-bearing transactions in
> PostgreSQL 8.3.  Such transactions do not restrain RecentXmin.  If no running
> transaction has an XID, RecentXmin == ReadNewTransactionId() and the page
> incorrectly becomes available for immediate reuse.

Good observation.

> The fix is to compare the stored XID to RecentGlobalXmin, not RecentXmin.  We
> already use RecentGlobalXmin when wal_level = hot_standby.  If no running
> transaction has an XID and all running transactions began since the last
> transaction that did bear an XID, RecentGlobalXmin == ReadNewTransactionId().
> Therefore, the correct test is btpo.xact < RecentGlobalXmin, not btpo.xact <=
> RecentGlobalXmin as we have today.  This also cleanly removes the need for the
> bit of code in _bt_getbuf() that decrements btpo.xact before sending it down
> for ResolveRecoveryConflictWithSnapshot().  I suggested[2] that decrement on
> an unprincipled basis; it was just masking the off-by-one of using "<=
> RecentGlobalXmin" instead of "< RecentGlobalXmin" in _bt_page_recyclable().

Looks like the right fix. I'll apply this to 9.0/9.1/HEAD.

> This change makes empty B-tree pages wait through two generations of running
> transactions before reuse, so some additional bloat will arise.

Could arise, in some circumstances. But that assumes VACUUMs are
fairly frequent and that they would be delayed/rendered less effective
by this, which I don't think will be the case.

I note that we don't take any account of the number of pages that may
be reused when we VACUUM, so when HOT avoids a VACUUM we may
accumulate pages for a longer period. Looks like there is more work to
be done yet in cleaning indexes.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] ECPG FETCH readahead

2012-04-24 Thread Michael Meskes
> OK, I will implement #2. Another question popped up: what to do
> with FETCH ALL? Either the current readahead window size can be used, or
> it can temporarily be bumped to, say, some tens of thousands. We may not
> know in advance how many records "all" is. This, although it lowers
> performance, saves memory.

I would say doing a large fetch in two or three batches won't cost too much in
terms of performance.
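
In plain SQL terms, the batching would look something like this
(hypothetical cursor name, sketch only):

FETCH FORWARD 10000 FROM mycur;   -- repeat until fewer than 10000 rows
FETCH FORWARD 10000 FROM mycur;   -- come back, instead of one FETCH ALL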

> Please, don't apply this patch yet. I discovered a rather big hole
> that can confuse the cursor position tracking if you do this:
> ...
> That will also need a new round of review. Sorry for that.

No problem, better to find it now instead of after release. 

Anyway, I moved the patch to 2012-next (I hope I did it correctly) so 2012-1
can be closed. Let's try to get this patch done in the next commit fest.

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Michael at BorussiaFan dot De, Meskes at (Debian|Postgresql) dot Org
Jabber: michael.meskes at googlemail dot com
VfL Borussia! Força Barça! Go SF 49ers! Use Debian GNU/Linux, PostgreSQL



Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-24 Thread Ants Aasma
On Tue, Apr 24, 2012 at 10:31 AM, Sandro Santilli  wrote:
> On Tue, Apr 24, 2012 at 08:49:26AM +0200, Sandro Santilli wrote:
>> On Mon, Apr 23, 2012 at 08:34:44PM +0300, Ants Aasma wrote:
>
>> > SELECT (SELECT reservoir_sample(some_table, 50) AS samples
>> >    FROM some_table WHERE ctid =~ ANY (rnd_pgtids))
>> > FROM random_pages('some_table', 50) AS rnd_pgtids;
>>
>> But I don't understand the reservoir_sample call, what is it supposed to do?
>
> Ok got it, that was probably to avoid:
>
>  ERROR:  more than one row returned by a subquery used as an expression

No, it's to avoid bias towards tuples on more sparsely populated
pages. See http://en.wikipedia.org/wiki/Reservoir_sampling or
http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/commands/analyze.c;h=ff271644e0f93ee99bfe9c1f536f3dd48455d8d2;hb=HEAD#l1027

> The advanced TID operator would be for random_tids to only return pages rather
> than full tids...

Exactly. But when mainly IO bound (i.e. sampling from a large table on
spinning rust), the overhead of probing with a TID scan, as opposed to
sequentially scanning the pages, should be small enough. When CPU bound,
I suspect that the function-call machinery overhead for
reservoir_sample is going to become a large issue, so a built-in
tablesample also has an edge there.


Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: [HACKERS] Foreign table scan estimates

2012-04-24 Thread Albe Laurenz
Tom Lane wrote:
>> While playing around with ANALYZE on foreign tables, I noticed
>> that the row count estimate for foreign scans is still
>> initialized to 1000 even if there are statistics for the
>> foreign table.  I think that this should be improved.

>> The attached patch illustrates my suggestion.

> I don't think this is appropriate; it will just waste cycles because
> the FDW will have to repeat the calculations after obtaining a real
> estimate of the foreign table size.  If we trusted pg_class.reltuples
> to be up to date, there might be some value in this.  But we don't
> trust that for regular tables (cf. plancat.c), and I don't see why
> we would do so for foreign tables.
> 
> I think on the whole it's better to abdicate responsibility here and
> require the FDW to do something in its GetForeignRelSize function.
> It's not like we'd be saving the FDW a lot of code in the (unlikely)
> case that this is exactly what it would do anyway.

I agree.

Yours,
Laurenz Albe



Re: [HACKERS] B-tree page deletion boundary cases

2012-04-24 Thread Noah Misch
On Tue, Apr 24, 2012 at 09:08:36AM +0100, Simon Riggs wrote:
> On Sat, Apr 21, 2012 at 5:52 PM, Noah Misch  wrote:
> > The fix is to compare the stored XID to RecentGlobalXmin, not RecentXmin.
> > We already use RecentGlobalXmin when wal_level = hot_standby.  If no
> > running transaction has an XID and all running transactions began since
> > the last transaction that did bear an XID, RecentGlobalXmin ==
> > ReadNewTransactionId().  Therefore, the correct test is btpo.xact <
> > RecentGlobalXmin, not btpo.xact <= RecentGlobalXmin as we have today.
> > This also cleanly removes the need for the bit of code in _bt_getbuf()
> > that decrements btpo.xact before sending it down for
> > ResolveRecoveryConflictWithSnapshot().  I suggested[2] that decrement on
> > an unprincipled basis; it was just masking the off-by-one of using "<=
> > RecentGlobalXmin" instead of "< RecentGlobalXmin" in _bt_page_recyclable().
> 
> Looks like the right fix. I'll apply this to 9.0/9.1/HEAD.

Thanks.  8.3 and 8.4 need it, too, albeit simplified due to some of the
affected code not existing yet.

> > This change makes empty B-tree pages wait through two generations of running
> > transactions before reuse, so some additional bloat will arise.
> 
> Could arise, in some circumstances. But that assumes VACUUMs are
> fairly frequent and that they would be delayed/rendered less effective
> by this, which I don't think will be the case.
> 
> I note that we don't take any account of the number of pages that may
> be reused when we VACUUM, so when HOT avoids a VACUUM we may
> accumulate pages for a longer period. Looks like there is more work to
> be done yet in cleaning indexes.

Agreed.  Taking one VACUUM visit to mark the page BTP_DELETED and another to
add it to the FSM is fairly pessimistic; perhaps we can do better.

nm



Re: [HACKERS] DOMAINs and CASTs

2012-04-24 Thread Robert Haas
On Tue, Apr 24, 2012 at 3:00 AM, Jaime Casanova  wrote:
> are we going to put this warning in this release?

Done.
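
For the archives, a sketch of what the new WARNING flags (illustration
only; the exact message text may differ):

=# create domain posint as int check (value > 0);
=# create cast (posint as text) with inout;
WARNING:  cast will be ignored because the source data type is a domain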

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Dump EXTENSION sequences too

2012-04-24 Thread Robert Haas
On Thu, Apr 19, 2012 at 6:42 AM, Gianni Ciolli
 wrote:
> currently an EXTENSION can mark some of its tables as "configuration
> tables" using pg_catalog.pg_extension_config_dump(), so that pg_dump
> "does the right thing".
>
> I think it would be useful to mark sequences too, but unfortunately it
> is not possible; hence, each time a dump is reloaded, all the
> sequences in the extension are reset to 1, causing all the related
> problems.
>
> Moreover, the error message that we get if we try to mark a sequence
> does not mention the requirement that the relation is a table. The
> "OID %u does not refer to a table" error message seems to be wrongly
> restricted to the case when get_rel_name can't find a relation.
>
> Is there any objection to the above proposal? I did a little search of
> the archives, but I couldn't find any related discussions; I apologise
> if I missed something.

Seems like a good idea to me.
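
For context, the existing table-only call looks like this inside an
extension script; the proposal would extend the same treatment to
sequences (the second call is hypothetical):

SELECT pg_catalog.pg_extension_config_dump('my_config_table', '');
-- proposed: preserve the sequence's state across dump/reload instead
-- of letting it be reset to 1
SELECT pg_catalog.pg_extension_config_dump('my_config_table_id_seq', '');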

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Postgres install program files folder

2012-04-24 Thread Robert Haas
On Wed, Apr 18, 2012 at 12:00 PM, Flavio Vodzinski
 wrote:
> Hello,
> In a Windows environment, is there a problem with installing Postgres in
> the Program Files folder? I have this doubt because Windows applies a
> locking system to that folder.

This question is off-topic for pgsql-hackers, which is for discussion
of PostgreSQL-related development.   Please try pgsql-general.

I'm not sure whether you are saying that you think there might be a
problem, or that you are actually having a problem.  The installers do
work in general; if you have a specific problem, please see
http://wiki.postgresql.org/wiki/Guide_to_reporting_problems

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Usage of planner_ctx

2012-04-24 Thread Tom Lane
Amit Kapila  writes:
> So what is the use of having PlannerInfo->planner_ctx, which only contains
> CurrentMemoryContext?

It might be clearer if you read up on the memory management in GEQO
planning mode.

regards, tom lane



Re: [HACKERS] Timsort performance, quicksort (was: Re: Memory usage during sorting)

2012-04-24 Thread Robert Haas
On Wed, Apr 18, 2012 at 9:31 PM, Peter Geoghegan  wrote:
> Thoughts?

Interesting work.  I thought about trying to code up timsort at one
point, but I've been running short of round tuits.

I did some quick tests of quicksort using half a million random
strings.  On my MacBook Pro, if the strings are in random order,
quicksort takes about 12 seconds.  If they are presorted, it takes
about 800 ms.  If they are in sorted order with an empty string
appended onto the end, it takes about 25 seconds.
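
For reference, inputs like those can be generated in SQL (a
reconstruction of the setup, not the actual test harness):

create table t_random as
  select md5(random()::text) as s from generate_series(1, 500000);
create table t_sorted as select s from t_random order by s;
create table t_almost as select s from t_sorted;
insert into t_almost values ('');  -- sorted except a small last element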

If I modify gen_qsort_tuple.pl to perform the "presorted" input check
only when n > 40, the time for the
presorted-except-the-last-element-is-small test drops from 25 seconds
to about 21.5 seconds, without apparently harming either of the other
two cases.  If I remove the check altogether, the time further drops
to about 13.5 seconds, the time to sort completely-ordered data rises
to about 10-10.5 seconds, and the time to sort randomly ordered data
still doesn't seem to change much.

Based on that, I'm inclined to propose rejiggering things so that the
presorted-input check runs only at the top level, and not during any
recursive steps.  The theory is that it won't cost much because if the
data is randomly ordered we won't do many comparisons before falling
out, and that seems to be true.  But the only point of doing it in the
recursive steps is to win when the data is partially ordered, and we
actually seem to be losing big-time in that case, perhaps because when
the data is partially ordered, the presort check will frequently
run through a significant part of the array - wasting cycles - but
fall out before it actually gets to the end.

Of course, even if we did that, it might not be as good as your
timsort numbers, but that doesn't mean we shouldn't do it...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] New sync commit mode remote_write

2012-04-24 Thread Robert Haas
On Fri, Apr 20, 2012 at 3:58 PM, Fujii Masao  wrote:
> On Sat, Apr 21, 2012 at 12:20 AM, Simon Riggs  wrote:
>> On Thu, Apr 19, 2012 at 7:50 PM, Robert Haas  wrote:
>>> On 4/19/12, Jeff Janes  wrote:
 The work around would be for the master to refuse to automatically
 restart after a crash, insisting on a fail-over instead (or a manual
 forcing of recovery)?
>>>
>>> I suppose that would work, but I think Simon's idea is better: don't
>>> let the slave replay the WAL until either (a) it's promoted or (b) the
>>> master finishes the fsync.   That boils down to adding some more
>>> handshaking to the replication protocol, I think.
>>
>> It would be 8 bytes on every data message sent to the standby.
>
> There seems to be another problem to solve. In current design of streaming
> replication, we cannot send any WAL records before writing them locally.
> Which would mess up the mode which makes a transaction wait for remote
> write but not local one. We should change walsender so that it can send
> WAL records before they are written, e.g., send from wal_buffers?

In theory, writing WAL should be quick, since it's only going out to
the OS cache, and flushing it should be the slow part, since that
involves waiting for the actual disk write to complete.  Some
instrumentation I shoved in here reveals that there actually are some
cases where the write can take a long time, when Linux starts to get
worried about the amount of dirty data in cache and punishes anyone
who tries to write anything, but I'm not sure whether that's common
enough to warrant a change here.

One thing that does seem to be a problem is using WALWriteLock to
cover both the WAL write and the WAL flush.  Suppose that we're
writing WAL very quickly, so that wal_buffers fills up.  We can't
continue writing WAL until some of what's in the buffer has been
*written*, but the WAL writer process will grab WALWriteLock, write
*and flush* a chunk of WAL, and everybody who wants to insert WAL has
to wait for both the write and the flush.  It's probably possible to
do better, here.  Streaming directly from wal_buffers would allow sync
rep to dodge this problem altogether, but it's a general performance
problem as well so it would be nice to have a general solution that
would improve latency and throughput across the board, if such a
solution is possible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] New sync commit mode remote_write

2012-04-24 Thread Jeff Janes
On Thu, Apr 19, 2012 at 11:50 AM, Robert Haas  wrote:
> On 4/19/12, Jeff Janes  wrote:
>> The work around would be for the master to refuse to automatically
>> restart after a crash, insisting on a fail-over instead (or a manual
>> forcing of recovery)?
>
> I suppose that would work, but I think Simon's idea is better: don't
> let the slave replay the WAL until either (a) it's promoted or (b) the
> master finishes the fsync.   That boils down to adding some more
> handshaking to the replication protocol, I think.

Alternative c) is that the master automatically recovers from a crash,
but doesn't replay that particular wal record because it doesn't find
it on disk, so the slave has to be instructed to throw it away.  (Or
perhaps the slave could feed the wal back to the master, so the master
could replay it?)

Cheers,

Jeff



Re: [HACKERS] Timsort performance, quicksort (was: Re: Memory usage during sorting)

2012-04-24 Thread Greg Stark
On Tue, Apr 24, 2012 at 4:17 PM, Robert Haas  wrote:
> Based on that, I'm inclined to propose rejiggering things so that the
> presorted-input check runs only at the top level, and not during any
> recursive steps.

Just a thought. What about running only every nth step. Maybe
something like every 4th step.

But actually I'm confused. This seems misguided to me. Quicksort
isn't stable, so even if you have a partially sorted data set, the
recursive partitions are going to be at best partially sorted after a
pivot. I haven't walked through it, but I suspect even your
all-but-one-sorted data set is not finding the data sorted in either
partition on the next iteration.

-- 
greg



Re: [HACKERS] Timsort performance, quicksort (was: Re: Memory usage during sorting)

2012-04-24 Thread Peter Geoghegan
On 24 April 2012 16:17, Robert Haas  wrote:
> If they are in sorted order with an empty string
> appended onto the end, it takes about 25 seconds.

That's exactly what I'd have expected, but was surprised to have not
found with my own test. Perhaps it was the same kind of fluke (i.e. a
re-creatable one - I'm quite confident that my benchmark was not
methodologically flawed, at least in execution).

> Based on that, I'm inclined to propose rejiggering things so that the
> presorted-input check runs only at the top level, and not during any
> recursive steps.  The theory is that it won't cost much because if the
> data is randomly ordered we won't do many comparisons before falling
> out, and that seems to be true.  But the only point of doing it in the
> recursive steps is to win when the data is partially ordered, and we
> actually seem to be losing big-time in that case, perhaps because when
> the data is partially ordered, the presort check will frequently to
> run through a significant part of the array - wasting cycles - but
> fall out before it actually gets to the end.

That makes sense to me, but obviously more data is needed here.

> Of course, even if we did that, it might not be as good as your
> timsort numbers, but that doesn't mean we shouldn't do it...

The frustrating thing about my timsort numbers, as I mentioned in
reply to Dimitri (he modified the subject a bit, so that might appear
to be a different thread to you), is that they appear to be almost
consistently winning when you consider the number of comparisons, but
in fact lose when you measure the duration of execution or TPS or
whatever. I expected a certain amount of this, but not enough to
entirely derail the case for replacing quicksort with timsort when
sorting a single key of text, which is obviously the really compelling
case for optimisation here. This situation is only going to be made
"worse" by the work you've done on SortSupport for text, which,
incidentally, I agree is worthwhile.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



[HACKERS] Welcome 2012 GSOC students

2012-04-24 Thread Josh Berkus
Hackers,

We've chosen the 5 GSOC projects for this year:

* JDBC Foreign Data Wrapper, by Atri, mentored by Merlin Moncure
* Document Collection Foreign Data Wrapper, by Zheng Yang (a returning
student), mentored by Satoshi Nagayasu
* Implementing TABLESAMPLE, by Qi, mentored by Stephen Frost
* Better Indexing for Ranges, by Alexander (returning), mentored by
Heikki Linnakangas
* xReader Streaming xLog Reader, by Aakash, mentored by Kevin Grittner

Congratulations to the selected students, and expect to see them here on
-hackers working on their projects.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: [HACKERS] [BUG] Checkpointer on hot standby runs without looking checkpoint_segments

2012-04-24 Thread Fujii Masao
On Tue, Apr 24, 2012 at 3:53 PM, Heikki Linnakangas
 wrote:
> On 23.04.2012 02:59, Fujii Masao wrote:
>>
>> On Thu, Apr 19, 2012 at 2:20 PM, Kyotaro HORIGUCHI
>>   wrote:
>>>
>>> Hello, this is new version of standby checkpoint_segments patch.
>>
>>
>> Thanks for the patch!
>
>
> This still makes catching up in standby mode slower, as you get many more
> restartpoints. The reason for ignoring checkpoint_segments during recovery
> was to avoid that. Maybe it's still better than what we have currently, I'm
> not sure, but at least it needs to be discussed.

I see your point. Agreed.

Another aspect of this problem is that if we ignore checkpoint_segments
during recovery, a restartpoint can take a long time, which prevents WAL
files from being removed from pg_xlog for a while. That might cause the
disk to fill up with WAL files. This trouble is unlikely to happen in 9.1
or before, because archived WAL files are always restored with a temporary
name. OTOH, in 9.2, the cascading replication patch changed the recovery
logic so that archived WAL files are restored with their correct names, so
the number of WAL files in pg_xlog keeps increasing until a restartpoint
removes them. The disk is more likely to fill up in 9.2.

To alleviate the above problem, at least we might have to change the
recovery logic so that archived WAL files are restored with a temporary
name if cascading replication is not enabled (i.e., !standby_mode ||
!hot_standby || max_wal_senders <= 0). Or we might have to remove each
restored WAL file after replaying it and before opening the next one,
without waiting for a restartpoint to remove the restored WAL files.
Thoughts?

Regards,

-- 
Fujii Masao



Re: [HACKERS] [BUG] Checkpointer on hot standby runs without looking checkpoint_segments

2012-04-24 Thread Fujii Masao
On Tue, Apr 24, 2012 at 3:47 PM, Kyotaro HORIGUCHI
 wrote:
> Hello,
>
>> >  - xlog.c: Make StandbyMode shared.
>> >
>> >  - checkpointer.c: Use IsStandbyMode() to check if postmaster is
>> >   under standby mode.
>>
>> IsStandbyMode() looks overkill to me. The standby mode flag is forcibly
>> turned off at the end of recovery, but its change doesn't need to be shared
>> to the checkpointer process, IOW, the shared flag doesn't need to change
>> since startup like XLogCtl->archiveCleanupCommand, I think. So we can
>> simplify the code to share the flag to the checkpointer. See the attached
>> patch (though not tested yet).
>
> Hmm. I understood that the aims of the spinlock, and of volatile-qualifying
> the pointer when reading shared memory, are respectively to secure memory
> consistency on SMPs with weak memory ordering, and to keep the compiler
> from over-optimizing accesses through a non-volatile pointer.
>
> You removed both of them in the patch.
>
> If we are allowed to be tolerant of a temporary lack of coherence in
> shared memory there, the spinlock could be removed. But the possibility
> of reading garbage by using XLogCtl itself to access standbyMode does
> not seem tolerable. What do you think about that?

I'm not sure if we really need to worry about that for such a shared
variable, one that doesn't change after being initialized at the start of
recovery. Anyway, if we really do need to worry about it, we also need to
protect the shared variables RecoveryTargetTLI and archiveCleanupCommand
with the spinlock, because they are in the same situation as standbyMode.

Regards,

-- 
Fujii Masao



Re: [HACKERS] Welcome 2012 GSOC students

2012-04-24 Thread Atri Sharma
On Tue, Apr 24, 2012 at 10:36 PM, Josh Berkus  wrote:
> Hackers,
>
> We've chosen the 5 GSOC projects for this year:
>
> * JDBC Foreign Data Wrapper, by Atri, mentored by Merlin Moncure
> * Document Collection Foreign Data Wrapper, by Zheng Yang (a returning
> student), mentored by Satoshi Nagayasu
> * Implementing TABLESAMPLE, by Qi, mentored by Stephen Frost
> * Better Indexing for Ranges, by Alexander (returning), mentored by
> Heikki Linnakangas
> * xReader Streaming xLog Reader, by Aakash, mentored by Kevin Grittner
>
> Congratulations to the selected students, and expect to see them here on
> -hackers working on their projects.
>
> --
> Josh Berkus
> PostgreSQL Experts Inc.
> http://pgexperts.com
>

Hi Josh,

Thanks a lot for welcoming me.

I would also like to thank the entire community and you for giving me
this wonderful opportunity. I shall definitely try my hardest to achieve
the goal and meet your expectations.

I shall be regularly posting updates on the progress of my project to
the -hackers list.

I will need the support of the entire community.

Thanks once again,

Atri


-- 
Regards,

Atri
l'apprenant



[HACKERS] remove dead ports?

2012-04-24 Thread Peter Eisentraut
I propose that we remove support for the following OS ports from our
source tree.  They are totally dead, definitely don't work, and/or
probably no one remembers what they even were.  The code just bit rots
and is in the way of future improvements.

* Dead/remove:

dgux
nextstep
sunos4
svr4
ultrix4
univel

* Dubious, but keep for now:

bsdi
irix
osf
sco

* No concern:

aix
cygwin
darwin
freebsd
hpux
linux
netbsd
openbsd
solaris
unixware
win32




Re: [HACKERS] remove dead ports?

2012-04-24 Thread Stefan Kaltenbrunner
On 04/24/2012 08:29 PM, Peter Eisentraut wrote:
> I propose that we remove support for the following OS ports from our
> source tree.  They are totally dead, definitely don't work, and/or
> probably no one remembers what they even were.  The code just bit rots
> and is in the way of future improvements.
> 
> * Dead/remove:
> 
> dgux
> nextstep
> sunos4
> svr4
> ultrix4
> univel

+1

> 
> * Dubious, but keep for now:
> 
> bsdi
> irix
> osf
> sco

I'm pretty sure I have seen at least semi-recent reports of users on at
least IRIX and SCO. As for the rest, I think Bruce recently stopped using
BSDi, so the only ever known user of that platform is gone...


Stefan



Re: [HACKERS] 9.3: summary of corruption detection / checksums / CRCs discussion

2012-04-24 Thread Robert Haas
On Sat, Apr 21, 2012 at 5:40 PM, Jeff Davis  wrote:
> * In addition to detecting random garbage, we also need to be able to
> detect zeroing of pages. Right now, a zero page is not considered
> corrupt, so that's a problem. We'll need to WAL-log table extension
> operations, and we'll need to mitigate the performance impact of doing
> so. I think we can do that by extending larger tables by many pages
> (say, 16 at a time) so we can amortize the cost of WAL and avoid
> contention.

I think that extending tables in larger chunks is probably a very good
idea for performance reasons completely apart from checksums.
However, I don't think that WAL-logging relation extension will allow
you to treat a zero page as an error condition unless you not only
WAL-log the operation but also *flush WAL* before performing the
actual table-extension operation.  Otherwise, we might crash after the
actual extension and before the WAL record hits the disk, and now
we're back to having a zero'd page in the file.  And the impact of
writing *and flushing* WAL for every extension seems likely to be more
than we're willing to pay.  If we extended 16 pages at a time, that
means waiting for 8 WAL fsyncs per MB of relation extension.  On my
MacBook Pro, which is admittedly a pessimal case for fsyncs, that
would work out to an extra half second of elapsed time per MB written,
so pgbench -i -s 100 would probably take about an extra 640 seconds.
If that's a correct analysis, that sounds pretty bad, because right
now it's taking 143 seconds.
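
(Spelling out that estimate from the numbers above:)

16 pages x 8 kB = 128 kB per extension chunk  =>  8 fsyncs per MB
0.5 s per MB / 8 fsyncs per MB  =>  ~60 ms per fsync
640 s / 0.5 s per MB  =>  ~1.3 GB, roughly what pgbench -i -s 100 writes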

> * Should we try to use existing space in header? It's very appealing to
> be able to avoid the upgrade work by using existing space in the header.
> There was a surprising amount of discussion about which space to use:
> pd_pagesize_version or pd_tli. There was also some discussion of using a
> few bits from various places.

It still seems to me that pd_tli is the obvious choice, since there is
no code anywhere in the system that relies on that field having any
particular value, so we can pretty much whack it around at will
without breaking anything.  AFAICS, the only reason to bump the page
format is if we want a longer checksum - 32 bits, say, instead of 16.
But I am inclined to think that 16 ought to be enough to detect the
vast majority of cases of corruption.

> * Table-level, or system level? Table-level would be appealing if there
> turns out to be a significant performance impact. But there are
> challenges during recovery, because no relcache is available. It seems
> like a relatively minor problem, because pages would indicate whether
> they have a checksum or not, but there are some details to be worked
> out.

I think the root of the issue here is that it's hard to turn checksums
on and off *online*.  If checksums were an initdb-time option, the
design here would be pretty simple: store the flag in the control
file.  And it wouldn't even be hard to allow the flag to be flipped
while the cluster is shut down: just write a utility to checksum and
rewrite all the blocks, fsync everything, and then flip the flag in
the control file and fsync that; also, force the user to rebuild all
their standbys.  This might not be very convenient, but it would be
comparatively simple to implement.

However, I've been assuming that we do want to handle turning
checksums on and off without shutting down the cluster, and that
definitely makes things more complicated.  In theory it could work
more or less the same way as in the off-line case: you launch some
command or function that visits every data block in the cluster and
rewrites them all, xlogging as it goes.  However, I'm a little
concerned that it won't be very usable in that form for some of the
same reasons that the off-line solution might not be very usable: you
essentially force the user to do a very large data-rewriting operation
all at once.  We already know that having a kicking off a big vacuum
in the background can impact the performance of the whole cluster, and
that's just one table; this would be the whole cluster all at once.

Moreover, if it's based on the existing vacuum code, which seems
generally desirable, it seems like it's going to have to be something
you run in one database at a time until you've hit them all.  In that
case, you're going to need some way to indicate which tables or
databases have been processed already and which haven't yet ... and if
you have that, then it's not a very big step to think that maybe we
ought to just control it on a per-table or per-database level to begin
with, since that would also have some of the collateral benefits you
mention.  The alternative is to have a new command, or a new form of
vacuum, that processes every table in every database regardless of
which DB you're connected to at the moment.  That's a lot of new,
fairly special-purpose code and it doesn't seem very convenient from a
management perspective, but it is probably simpler otherwise.

I'm not sure what the right decision is here.  I [...]

Re: [HACKERS] 9.3: summary of corruption detection / checksums / CRCs discussion

2012-04-24 Thread Robert Haas
On Sat, Apr 21, 2012 at 7:08 PM, Greg Stark  wrote:
> The earlier consensus was to move all the hint bits to a dedicated
> area and exclude them from the checksum. I think double-write buffers
> seem to have become more fashionable but a summary that doesn't
> describe the former is definitely incomplete.

I don't think we ever had any consensus that moving the hint bits
around was a good idea.  For one thing, they are only hints in one
direction.  It's OK to clear them by accident, but it's not OK to set
them by accident.  For two things, it's not exactly clear how we'd
rearrange the page to make this work at all: where are those hint bits
gonna go, if not in the tuple headers?  For three things, index pages
have hint-type changes that are not single-bit changes.

> That link points to the MVCC-safe truncate patch. I don't follow how
> optimizations in bulk loads are relevant to WAL-logging hint bit
> updates.

That patch actually has more than one optimization in it, I think, but
the basic idea is that if we could figure out a way to set
HEAP_XMIN_COMMITTED when loading data into a table created or
truncated within the same transaction, the need to set hint bits on
first scan of the table would be eliminated.  Writing the xmin as
FrozenTransactionId would save even more, though it introduces some
additional complexity.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] remove dead ports?

2012-04-24 Thread Robert Haas
On Tue, Apr 24, 2012 at 2:29 PM, Peter Eisentraut  wrote:
> I propose that we remove support for the following OS ports from our
> source tree.  They are totally dead, definitely don't work, and/or
> probably no one remembers what they even were.  The code just bit rots
> and is in the way of future improvements.

I have no position on whether those operating systems are dead enough
to warrant removing support, but on a related point, I would like it
if we could get rid of as many spinlock implementations as are
applicable only to platforms that are effectively defunct.  I'm
suspicious of s_lock.h's support for National Semiconductor 32K,
Renesas' M32R, Renesas' SuperH, UNIVEL, SINIX / Reliant UNIX,
Nextstep, and Sun3, all of which are either on your list above, or
stuff I've never heard of.  I have no problem keeping whatever people
are still using, but it would be nice to eliminate anything that's
actually dead for the reasons you state.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] New sync commit mode remote_write

2012-04-24 Thread Robert Haas
On Tue, Apr 24, 2012 at 12:21 PM, Jeff Janes  wrote:
> On Thu, Apr 19, 2012 at 11:50 AM, Robert Haas  wrote:
>> On 4/19/12, Jeff Janes  wrote:
>>> The work around would be for the master to refuse to automatically
>>> restart after a crash, insisting on a fail-over instead (or a manual
>>> forcing of recovery)?
>>
>> I suppose that would work, but I think Simon's idea is better: don't
>> let the slave replay the WAL until either (a) it's promoted or (b) the
>> master finishes the fsync.   That boils down to adding some more
>> handshaking to the replication protocol, I think.
>
> Alternative c) is that the master automatically recovers from a crash,
> but doesn't replay that particular wal record because it doesn't find
> it on disk, so the slave has to be instructed to throw it away.

Right.  Which kind of stinks.

> (Or
> perhaps the slave could feed the wal back to the master, so the master
> could replay it?)

Yes, that would be a very nice enhancement, I think.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Timsort performance, quicksort (was: Re: Memory usage during sorting)

2012-04-24 Thread Robert Haas
On Tue, Apr 24, 2012 at 12:30 PM, Greg Stark  wrote:
> On Tue, Apr 24, 2012 at 4:17 PM, Robert Haas  wrote:
>> Based on that, I'm inclined to propose rejiggering things so that the
>> presorted-input check runs only at the top level, and not during any
>> recursive steps.
>
> Just a thought. What about running only every nth step. Maybe
> something like every 4th step.

If there's actually some advantage to that, sure.  But so far, it
looks like less is more.

> But actually I'm confused. This seems to be misguided to me. Quicksort
> isn't stable so even if you have a partially sorted data set the
> recursive partitions are going to be at best partially sorted after a
> pivot.

Exactly.  That's why I think we should do it only at the topmost
level, before we've done any pivoting.  Doing it at any lower level is
apparently a waste of energy and counterproductive.

> I haven't walked through it but suspect even your
> all-but-one-sorted data set is not finding
> the data sorted in either partition on the next iteration.

I suspect that, too.  Actually, I'm a bit confused about why it's
picking such terrible pivots.  Our median-of-medians optimization
should be doing better than this, I would think.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Timsort performance, quicksort (was: Re: Memory usage during sorting)

2012-04-24 Thread Robert Haas
On Tue, Apr 24, 2012 at 12:51 PM, Peter Geoghegan  wrote:
> On 24 April 2012 16:17, Robert Haas  wrote:
>> If they are in sorted order with an empty string
>> appended onto the end, it takes about 25 seconds.
>
> That's exactly what I'd have expected, but was surprised not to
> find with my own test. Perhaps it was some kind of fluke (i.e. a
> re-creatable one - I'm quite confident that my benchmark was not
> methodologically flawed, at least in execution).

Oh, I read your results as showing something quite similar.

>> Based on that, I'm inclined to propose rejiggering things so that the
>> presorted-input check runs only at the top level, and not during any
>> recursive steps.  The theory is that it won't cost much because if the
>> data is randomly ordered we won't do many comparisons before falling
>> out, and that seems to be true.  But the only point of doing it in the
>> recursive steps is to win when the data is partially ordered, and we
>> actually seem to be losing big-time in that case, perhaps because when
>> the data is partially ordered, the presort check will frequently
>> run through a significant part of the array - wasting cycles - but
>> fall out before it actually gets to the end.
>
> That makes sense to me, but obviously more data is needed here.

What more data do you think is needed?  I've been suspicious of that
code since the first time I looked at it, and I'm now fairly well
convinced that it's full of suckitude.  Honestly, I'm not sure I could
manage to contrive a case where that code wins if I set out to do so.

>> Of course, even if we did that, it might not be as good as your
>> timsort numbers, but that doesn't mean we shouldn't do it...
>
> The frustrating thing about my timsort numbers, as I mentioned in
> reply to Dimitri (he modified the subject a bit, so that might appear
> to be a different thread to you), is that they appear to win almost
> consistently when you consider the number of comparisons, but
> in fact lose when you measure the duration of execution or TPS or
> whatever. I expected a certain amount of this, but not enough to
> entirely derail the case for replacing quicksort with timsort when
> sorting a single key of text, which is obviously the really compelling
> case for optimisation here. This situation is only going to be made
> "worse" by the work you've done on SortSupport for text, ...

That is quite baffling.  Have you profiled it at all?
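
For what it's worth, separating comparison counts from wall time is easy to
reproduce in a standalone harness with a counting comparator; a minimal
sketch:

#include <stdlib.h>
#include <string.h>

static long ncompares;      /* bumped on every comparator call */

static int
counting_strcmp(const void *a, const void *b)
{
    ncompares++;
    return strcmp(*(const char *const *) a, *(const char *const *) b);
}

/* usage: ncompares = 0; qsort(strs, n, sizeof(char *), counting_strcmp);
 * then compare ncompares against elapsed time for each algorithm. */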

> ...which,
> incidentally, I agree is worthwhile.

Cool, thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Timsort performance, quicksort (was: Re: Memory usage during sorting)

2012-04-24 Thread Robert Haas
On Tue, Apr 24, 2012 at 7:16 PM, Peter Geoghegan  wrote:
>>> That makes sense to me, but obviously more data is needed here.
>>
>> What more data do you think is needed?  I've been suspicious of that
>> code since the first time I looked at it, and I'm now fairly well
>> convinced that it's full of suckitude.  Honestly, I'm not sure I could
>> manage to contrive a case where that code wins if I set out to do so.
>
> Yeah, I thought that the rationale for introducing the pre-sorted
> check, as it appeared in a commit message, was a little weak. I don't
> know that I'd go as far as to say that it was full of suckitude. The
> worst thing about that code, to my mind, is that despite being highly
> performance-critical, it has exactly no comments beyond a brief
> description, and the names of variables are arcanely curt.

Well, what I don't like about it is that it doesn't work.  Having a
special case for pre-sorted input makes sense to me, but doing that
check recursively at every level is pointless unless it helps with
almost-sorted input, and doubling the runtime doesn't meet my
definition of helping.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] 9.3: summary of corruption detection / checksums / CRCs discussion

2012-04-24 Thread Greg Stark
On Tue, Apr 24, 2012 at 9:40 PM, Robert Haas  wrote:
>  For three things, index pages
> have hint-type changes that are not single-bit changes.

? Just how big are these? Part of the reason hint bit updates are safe
is that one bit definitely, absolutely has to lie entirely within one
page. You can't tear a page in the middle of a bit. In reality the
atomic write unit is much larger, probably 4k and almost certainly at
least 512 bytes. But the postgres block layout doesn't really offer
many guarantees about the location of anything relative to those
512-byte blocks, so probably anything larger than a word is unsafe to
update.
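
To make the single-bit argument concrete, here is a toy model; the field
and constant are modeled on the heap tuple header, not the real htup.h
definitions:

#include <stdint.h>

#define HINT_XMIN_COMMITTED 0x0100  /* modeled on HEAP_XMIN_COMMITTED */

typedef struct
{
    uint16_t    t_infomask; /* stand-in for the tuple header flags word */
} ToyTupleHeader;

/* A single aligned 16-bit store: a torn page write splits on sector
 * boundaries (>= 512 bytes), so this word, and hence the bit, always
 * lands wholly on one side of any tear. */
static void
set_hint(ToyTupleHeader *tup)
{
    tup->t_infomask |= HINT_XMIN_COMMITTED;
}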

The main problem with the approach was that we kept finding more hint
bits we had forgotten about. Once the coding idiom was established, it
seems it became a handy hammer for a lot of problems.
-- 
greg



Re: [HACKERS] remove dead ports?

2012-04-24 Thread Greg Stark
On Tue, Apr 24, 2012 at 9:49 PM, Robert Haas  wrote:
>  I'm
> suspicious of s_lock.h's support for National Semiconductor 32K,
> Renesas' M32R, Renesas' SuperH, UNIVEL, SINIX / Reliant UNIX,
> Nextstep, and Sun3

Were there ever multiprocessor Nextstep or Sun3 machines anyway?
And wouldn't someone on those OSes want spinlocks to sleep
immediately?

I did experiment a while back with getting a VAX emulator going to
build postgres on it, but even then I was running NetBSD.

-- 
greg



Re: [HACKERS] 9.3: summary of corruption detection / checksums / CRCs discussion

2012-04-24 Thread Josh Berkus
On 4/21/12 2:40 PM, Jeff Davis wrote:
> If we do use WAL for hint bit updates, that has an impact on Hot
> Standby, because HS can't write WAL. So, it would seem that HS could not
> set hint bits.

If we're WAL-logging hint bits, then the standby would be receiving
them, so it doesn't *need* to write them.

However, I suspect that WAL-logging hint bits would be prohibitively
expensive.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



[HACKERS] Temporary tables under hot standby

2012-04-24 Thread Noah Misch
A key barrier to migrations from trigger-based replication to WAL-based
replication is the lack of temporary tables under hot standby.  I'd like to
close that gap; the changes needed will also reduce the master-side cost of
temporary table usage.  Here is a high-level design for your advice and
comments.  Much of this builds on ideas from past postings noted below.

Our temporary tables are cataloged and filled like permanent tables.  This has
the major advantage of making most code operate on tables with minimal regard
for their relpersistence.  It also brings disadvantages:

1. High catalog turnover in rapid create/drop workloads.  Heavy temporary
   table users often need to REINDEX relation-oriented catalogs.  Hot standby
   cannot assign OIDs or modify system catalogs at all.
2. Consumption of permanent XIDs for DML on the table.  This increases COMMIT
   cost on the master and is a non-starter under hot standby.
3. autovacuum cannot VACUUM temporary tables, but their relfrozenxid values
   delay pg_clog truncation and can trigger a wraparound-prevention shutdown.
4. sinval traffic from every CREATE TEMP TABLE et al.
5. CREATE TEMP TABLE records AccessExclusiveLock in the global lock table.
6. We don't automatically drop temporary tables that existed at the point of a
   crash, because they look much like permanent tables.

To resolve points 2 and 3, let's change the XID values stored in temporary
tables from regular TransactionId to LocalTransactionId.  This entails an lxid
counterpart for clog.c.  Functions like GetCurrentTransactionId() and
HeapTupleSatisfiesVisibility() will take a Relation argument to identify the
XID type.  One open question is whether to add conditional logic to functions
like HeapTupleSatisfiesMVCC() or to have parallel implementations like
HeapTupleSatisfiesMVCCLocal().  I lean toward the latter, perhaps with the
help of some code generation.  I don't think a counterpart for pg_subtrans
will be necessary; the backend knows its own XID tree, and the
TRANSACTION_STATUS_SUB_COMMITTED interlock is superfluous with only one
backend as reader and writer.  I'm also thinking the local clog can live
strictly in memory; a session that retains a temporary table across 2B local
transactions can afford 512 MiB of RAM.  With this change, VACUUM can ignore
relfrozenxid of temporary tables when calculating a new datfrozenxid.  This
change can form an independent patch.
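
(The 512 MiB figure checks out: clog keeps two status bits per transaction,
so 2^31 local xids need 2^32 bits, which is 2^29 bytes = 512 MiB.)  As a
sketch of the parallel-implementation option, callers might dispatch on
relpersistence like the wrapper below; HeapTupleSatisfiesMVCCLocal() is the
proposed variant, not an existing function:

#include "postgres.h"
#include "utils/rel.h"
#include "utils/tqual.h"

/* proposed local-xid variant; does not exist yet */
extern bool HeapTupleSatisfiesMVCCLocal(HeapTupleHeader tuple,
                                        Snapshot snapshot, Buffer buffer);

bool
HeapTupleSatisfiesVisibilityRel(Relation rel, HeapTupleHeader tuple,
                                Snapshot snapshot, Buffer buffer)
{
    /* temp tables use local xids and the in-memory local clog */
    if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
        return HeapTupleSatisfiesMVCCLocal(tuple, snapshot, buffer);
    return HeapTupleSatisfiesMVCC(tuple, snapshot, buffer);
}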

I do not see a clean behind-the-scenes fix for points 1, 4 and 5.  We can
resolve those by adding a new variety of temporary table, one coincidentally
matching the SQL standard's notion of a temporary table.  The developer will
declare it once, after which all sessions observe it as an initially-empty
table whose contents remain local to the session.  Most relation catalog
entries, including all OIDs, are readily sharable among sessions.  The
exceptions are relpages, reltuples, relallvisible, relfrozenxid, and
pg_statistic rows.  I will handle the pg_class columns by introducing new
backend APIs abstracting them.  Those APIs will consult the relcache for
permanent tables and a local-memory hash for temporary tables.  For
statistics, add a new catalog pg_temp_statistic, an inheritance child of
pg_statistic and itself one of these new-variety temporary tables.
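
A sketch of one such accessor follows; LocalRelStats and
local_temp_relstats_lookup() are hypothetical names for the proposed
local-memory hash:

#include "postgres.h"
#include "utils/rel.h"

typedef struct LocalRelStats
{
    BlockNumber relpages;   /* session-local, never written to pg_class */
    double      reltuples;
} LocalRelStats;

extern LocalRelStats *local_temp_relstats_lookup(Oid relid);    /* hypothetical */

BlockNumber
RelationGetPagesEstimate(Relation rel)
{
    if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
    {
        LocalRelStats *stats = local_temp_relstats_lookup(RelationGetRelid(rel));

        return stats ? stats->relpages : 0;
    }
    return rel->rd_rel->relpages;   /* relcache copy of pg_class */
}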

Past discussions have raised the issue of interaction between commands like
ALTER TABLE and sessions using the new-variety temporary table.  As a first
cut, let's keep this simple and have ongoing use of the table block
operations that require AccessExclusiveLock.  Note that you can always just
make a new
temporary table with a different name to deploy a change quickly.  Implement
this with a heavyweight lock having a novel life cycle.  When a session first
takes an ordinary relation lock on the table, acquire the longer-term lock and
schedule it for release on transaction abort.  On TRUNCATE, schedule a release
on transaction commit.  Of course, also release the lock at session end.

For new-variety temporary tables, change file names from "relfilenode[_fork]"
to "relfilenode[_fork].pid.localnode".  During crash recovery, delete all files
conforming to that pattern for relfilenodes of known temporary tables.  This
also lets backends share pg_class.relfilenode.  The "localnode" field would
distinguish multiple generations of a table across VACUUM FULL, CLUSTER, and
TRUNCATE.  We could build on this strategy to safely resolve point 6 for the
existing kind of temporary table, but that entails enough other details to
probably not mix it into the same patch.
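
A sketch of the corresponding path construction (the format string is
illustrative; real code would omit the fork suffix for the main fork):

#include "postgres.h"
#include "catalog/catalog.h"    /* forkNames[] */
#include "miscadmin.h"          /* MyProcPid */

/* "relfilenode[_fork].pid.localnode", per the scheme above; crash
 * recovery can then match "*.*.*" under known temp relfilenodes. */
static void
temp_rel_file_path(char *path, size_t len, Oid dbNode, Oid relNode,
                   ForkNumber forkNum, uint32 localNode)
{
    snprintf(path, len, "base/%u/%u_%s.%d.%u",
             dbNode, relNode, forkNames[forkNum],
             (int) MyProcPid, localNode);
}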

A third patch will permit the following commands in read-only transactions,
where they will throw an error if the subject is not a temporary table:

INSERT
UPDATE
DELETE
COPY ... FROM
TRUNCATE
ANALYZE
VACUUM (including VACUUM FULL)
CLUSTER (without USING clause)
REINDEX
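
As a sketch, the check these commands might share looks like this; the
function name is hypothetical, and the test mirrors the relpersistence
dispatch used above:

#include "postgres.h"
#include "access/xact.h"        /* XactReadOnly */
#include "utils/rel.h"

static void
prevent_unless_temp_target(Relation rel, const char *cmdname)
{
    /* allow the command on temp tables even in a read-only transaction */
    if (XactReadOnly &&
        rel->rd_rel->relpersistence != RELPERSISTENCE_TEMP)
        ereport(ERROR,
                (errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
                 errmsg("cannot execute %s in a read-only transaction",
                        cmdname)));
}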

I considered whether to instead separate the set of commands allowed in a
read-only transaction from the set allowed under hot standby.  This proposal
is cl

Re: [HACKERS] remove dead ports?

2012-04-24 Thread Tom Lane
Robert Haas  writes:
> I have no position on whether those operating systems are dead enough
> to warrant removing support, but on a related point, I would like it
> if we could get rid of as many spinlock implementations as are
> applicable only to platforms that are effectively defunct.  I'm
> suspicious of s_lock.h's support for National Semiconductor 32K,
> Renesas' M32R, Renesas' SuperH, UNIVEL, SINIX / Reliant UNIX,
> Nextstep, and Sun3, all of which are either on your list above, or
> stuff I've never heard of.  I have no problem keeping whatever people
> are still using, but it would be nice to eliminate anything that's
> actually dead for the reasons you state.

The Renesas implementations were added pretty darn recently, so I think
there are users for those.  The others you mention seem dead to me.
On the other hand, exactly how much is it costing us to leave those
sections of s_lock.h in there?  It's not like we have any plans to
redefine the spinlock interfaces.

regards, tom lane



Re: [HACKERS] Welcome 2012 GSOC students

2012-04-24 Thread Qi Huang
Thanks for the hackers' support. The discussion on the mailing list is quite
encouraging, and I expect to enjoy the three months of fun with Postgres. I'm
still in the final exam period at my university, and will participate more
after the exams finish.
Thanks!

Sent from my Windows Phone

From: Josh Berkus
Sent: 25/4/2012 1:06 AM
To: PostgreSQL-development
Subject: [HACKERS] Welcome 2012 GSOC students

Hackers,

We've chosen the 5 GSOC projects for this year:

* JDBC Foreign Data Wrapper, by Atri, mentored by Merlin Moncure
* Document Collection Foreign Data Wrapper, by Zheng Yang (a returning
student), mentored by Satoshi Nagayasu
* Implementing TABLESAMPLE, by Qi, mentored by Stephen Frost
* Better Indexing for Ranges, by Alexander (returning), mentored by
Heikki Linnakangas
* xReader Streaming xLog Reader, by Aakash, mentored by Kevin Grittner

Congratulations to the selected students, and expect to see them here on
-hackers working on their projects.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
