[HACKERS] Warm-Standby using WAL archiving / Separate pg_restorelog application

2006-07-10 Thread Florian G. Pflug

Hi

I've now set up a warm-standby machine using wal archiving. The restore_command on the
warm-standby machine loops until the wal requested by postgres appears, instead of
returning a non-zero exit code. Additionally, restore_command checks for two special
flag files, abort and take_online. If take_online exists, it exits with code 1 in case
of a non-existent wal - this allows me to take the slave online if the master fails.

This method seems to work, but it is neither particularly fool-proof nor
administrator friendly. It's not possible, e.g., to reboot the slave without postgres
aborting the recovery, and therefore processing all wals generated since the last
backup all over again.

Monitoring this system is hard too, since there is no easy way to detect errors
while restoring a particular wal.

I think that all those problems could be solved if postgres provided a standalone
application that could restore one wal into a specified data-dir. It should be possible
to call this application repeatedly to restore wals as they are received from the master.
Since pg_restorelog would be called separately for every wal, it'd be easy to detect
errors while recovering a specific wal.

Do you think this idea is feasible? How hard would it be to turn the current
archived-wal-recovery code into a standalone executable? (That of course needs to be
called when postgres is _not_ running.)

greetings, Florian Pflug





Re: [HACKERS] Warm-Standby using WAL archiving / Separate pg_restorelog

2006-07-10 Thread Florian G. Pflug

Merlin Moncure wrote:

On 7/10/06, Florian G. Pflug [EMAIL PROTECTED] wrote:

This method seems to work, but it is neither particularly fool-proof nor
administrator friendly. It's not possible, e.g., to reboot the slave without postgres
aborting the recovery, and therefore processing all wals generated since the last
backup all over again.

Monitoring this system is hard too, since there is no easy way to detect errors
while restoring a particular wal.


what I would really like to see is to have the postmaster start up in
a special read only mode where it could auto-restore wal files placed
there by an external process but not generate any of its own.  This
would be a step towards a pitr based simple replication method.


I didn't dare to ask for being able to actually _access_ a wal-shipping
based slave (in read only mode) - from how I interpret the code, it's
a _long_ way to get that working. So I figured a stand-alone executable
that just recovers _one_ archived wal would at least remove the administrative
burden that my current solution brings. And it would be easy to monitor
the slave - much easier than with any automatic pickup of wals.

greetings, Florian Pflug



[HACKERS] Forcing wal rotation

2006-07-14 Thread Florian G. Pflug

Hi

For my warm-standby cluster I'm now saving the currently used wal using rsync,
to avoid losing data from a few hours (or days) ago, when there is little traffic
and thus the wal isn't rotated. For online backups, the problem is even worse, because
a backup might be unusable even hours after I called pg_stop_backup(), because the
wal segment needed to bring the backup to a consistent state might not have been
archived at that time.

I've now thought about how to fix that without doing that rather crude 
rsync-pg_xlog-hack.
I've read through the code, and learned that wal-segments are expected to have 
a specific size -
thus rotating them early is not that easy. But now I figured that another 
short-term solution
could probably be implemented easily.

I'd like to write a function that just fills the wal with nop records until the
currently used wal is full. Since I assume that there are already different kinds of
wal records, adding a NOP record that just takes up space, and does nothing else,
should be quite easy. And even if wals contained only one kind of record ("write this
page to that datafile"), I could just repeat the last record over and over again,
until the wal is filled, couldn't I?

Do you think that this is feasible? If so, I'd like to try it.

greetings, Florian Pflug




Re: [HACKERS] Forcing wal rotation

2006-07-14 Thread Florian G. Pflug

A.M. wrote:

On Fri, July 14, 2006 11:20 am, Florian G. Pflug wrote:

Hi


For my warm-standby cluster I'm now saving the currently used wal using
rsync, to avoid losing data from a few hours (or days) ago, when there is
little traffic and thus the wal isn't rotated. For online backups, the
problem is even worse, because a backup might be unusable even hours
after I called pg_stop_backup(), because the wal segment needed to bring
the backup to a consistent state might not have been archived at that
time.




How about an SQL-level function that calls the wal scripts? This would
also allow important transactions to push data to the standby server
regardless of the wal size.

That was the idea - providing pg_rotate_wal(), which would guarantee that
the wal is rotated at least once if called. Thinking further about this,
for a first proof of concept it'd be enough to write a C function
pg_current_walsegment(). pg_rotate_wal() could then be a plpgsql function
that e.g. creates a temporary table and fills it with data until the
return value of pg_current_walsegment() changes.
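
To make that proof of concept concrete, here is a rough plpgsql sketch. It assumes the
hypothetical pg_current_walsegment() C function described above (returning, say, the name
of the wal segment currently being written to); note that, as pointed out in the reply
below, temporary tables are not wal-logged, so an ordinary scratch table is used instead:

-- scratch table used purely to generate wal traffic
create table wal_filler (junk text);

create or replace function pg_rotate_wal() returns void as $body$
declare
  -- pg_current_walsegment() is hypothetical; see above
  v_start text := pg_current_walsegment();
begin
  while pg_current_walsegment() = v_start loop
    -- md5 output compresses poorly, so each batch really consumes wal space
    insert into wal_filler
      select md5(random()::text) from generate_series(1, 1000);
  end loop;
  -- keep the scratch table small; the truncation is wal-logged as well
  truncate wal_filler;
end;
$body$ language plpgsql;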



Re: [HACKERS] Forcing wal rotation

2006-07-14 Thread Florian G. Pflug

Martijn van Oosterhout wrote:

On Fri, Jul 14, 2006 at 05:36:58PM +0200, Florian G. Pflug wrote:

That was the idea - providing pg_rotate_wal(), which would guarantee that
the wal is rotated at least once if called. Thinking further about this,
for a first proof of concept it'd be enough to write a C function
pg_current_walsegment(). pg_rotate_wal() could then be a plpgsql function
that e.g. creates a temporary table and fills it with data until the
return value of pg_current_walsegment() changes.


Temporary tables don't get xlogged. It would probably be easier to hook
into the xlog machinery and create NOP records, like you originally
suggested.


From further source-code reading, I got the following implementation plan:
.) Add new entry to RmgrTable (Should I add it at the end, or use one of the 
reserved entries?)
.) Create nop_redo and nop_desc - for nop_redo an empty function should be 
sufficient
.) Create pg_rotate_wal, which calls
   XLogInsert(RM_NOP_ID, XLOG_NO_TRAN, rdata)
   with
 rdata.data = pointer to WAL_SEGMENT_SIZE zero bytes
 rdata.len = WAL_SEGMENT_SIZE
 rdata.buffer = InvalidBuffer
 rdata.next = NULL

Since I insert WAL_SEGMENT_SIZE bytes, I shouldn't even have to loop, because that record
has no chance of fitting into the current wal segment, right?

Am I overlooking something?

greetings, Florian Pflug




Re: [HACKERS] Forcing wal rotation

2006-07-14 Thread Florian G. Pflug

Simon Riggs wrote:

On Fri, 2006-07-14 at 12:09 -0400, Tom Lane wrote:

Florian G. Pflug [EMAIL PROTECTED] writes:

I've now thought about how to fix that without doing that rather crude 
rsync-pg_xlog-hack.
I've read through the code, and learned that wal-segments are expected to have 
a specific size -
thus rotating them early is not that easy.

Simon was working on a patch for this at the code sprint; I think it's
submitted to -patches already.  


Slightly different patch. I'm working on this one still.

Cool - what are the chances of this being included in 8.2?


Explicitly filling the segment as you
propose would be really bad for performance.


Yes, current approach I am taking is better than that.

Well, my proposal wasn't really a long-term solution - I was thinking about a quick fix
that I could implement for 8.1, basically to let my warm-standby setup feel less like
a house of cards, as someone put it ;-)
I didn't care too much about the performance hit - I don't expect the database I intend
to use it for to have much load; otherwise the wal segments are rotated quite often anyway.
But I agree that for a general solution, my approach is not really ideal ;-)

Since we just ported the application in question to 8.1, I'm not sure that we 
will switch
to 8.2 when it is released - so I'm still interested in finding a solution for 
8.1

Do you think I could backport your patch to 8.1 - or does it depend on some 
other new features
of 8.2?

greetings, Florian Pflug



Re: [HACKERS] UPDATE/DELETE XXX WHERE CURRENT OF cursor_name

2006-07-24 Thread Florian G. Pflug

Gavin Sherry wrote:

On Mon, 24 Jul 2006, Golden Liu wrote:


begin;
declare foo cursor for select * from bar for update;
fetch foo;
update bar set abc='def' where current of foo;
fetch foo;
delete from bar where current of foo;
commit;



No one has stepped up to do this for 8.2 so unfortunately you will most
likely not see this within the next year or so :-(.


Couldn't this be emulated by doing
begin;
declare foo cursor for select * from bar for update;
fetch foo into v_foo ;
update bar set abc='def' where ctid = v_foo.ctid;
fetch foo into v_foo ;
delete from bar where ctid = v_foo.ctid;
commit;
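
A runnable plpgsql variant of that sketch (table and function names are just for
illustration); note that ctid has to be selected explicitly, since plpgsql row
variables don't expose system columns:

create or replace function update_where_current_of_emulation() returns void as $body$
declare
  foo   cursor for select ctid, * from bar for update;
  v_foo record;
begin
  open foo;

  fetch foo into v_foo;
  update bar set abc = 'def' where ctid = v_foo.ctid;

  fetch foo into v_foo;
  delete from bar where ctid = v_foo.ctid;

  close foo;
end;
$body$ language plpgsql;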

Or could a concurrent vacuum run lead to the wrong
rows being updated/deleted?

greetings, Florian Pflug



Re: [HACKERS] UPDATE/DELETE XXX WHERE CURRENT OF cursor_name

2006-07-24 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug [EMAIL PROTECTED] writes:

Couldn't this be emulated by doing
begin;
declare foo cursor for select * from bar for update;
fetch foo into v_foo ;
update bar set abc='def' where ctid = v_foo.ctid;


That wouldn't follow the expected semantics if there's a concurrent
update, because the updated row would always fail the WHERE clause,
and thus the update would just silently not happen.  (I'm thinking
about READ COMMITTED mode of course --- in SERIALIZABLE you'd just get
the expected error.)  You'd have to find some way to pump the row's most
up-to-date version through the cursor's query plan, a la EvalPlanQual,
to see if it still met the cursor's WHERE condition.


How could there be a concurrent update of the _same_ row, when
I do select * from bar *for update*? Or are you talking about
concurrent updates to the same page that could somehow alter
the ctid of _another_ tuple?

greetings, Florian Pflug



Re: [HACKERS] UPDATE/DELETE XXX WHERE CURRENT OF cursor_name

2006-07-24 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug [EMAIL PROTECTED] writes:

How could there be a concurrent update of the _same_ row, when
I do select * from bar *for update*.


AFAICT the spec doesn't require one to have written FOR UPDATE
in order to use WHERE CURRENT OF.  (In effect, they expect FOR UPDATE
to be the default, which is certainly not a change we're going to
want to make to DECLARE CURSOR.)  If we did make that restriction
then we could probably skip the EvalPlanQual mess.


But if they expect for update to be the default, then essentially they
do require one to use a cursor with for-update semantics when
using where current of - or do they allow where current of even
for non-for-update cursors?

If one were to restrict an implementation of where current of to
for update, without hold cursors, the only non-trivial problem that
I can see is how to support more than one update of the same row.


Because as far as I can see, if you'd do
begin;
declare foo cursor for select * from bar for update;
fetch foo into v_foo ;
update bar set ... where ctid = v_foo.ctid ;
update bar set ... where ctid = v_foo.ctid ;
commit;

the second update would silently be ignored, since the first update creates a new
row version with a new ctid and the saved ctid no longer matches any visible row.
But since only updates happening in the same transaction would somehow need to be
tracked, this should be much easier to do than supporting
the non-for-update case.

greetings, Florian Pflug




Re: [HACKERS] [PATCHES] Patch for VS.Net 2005's strxfrm() bug

2006-07-26 Thread Florian G. Pflug

Bruce Momjian wrote:

Why is this better than:

  #if _MSC_VER == 1400


Surely this will not be true if _MSC_VER is undefined?

I experienced injustice and the reason of in OSX for it.


What was the problem with OSX?  Did it throw a warning if you did an
equality test on an undefined symbol?


The following #if evaluated to true on osx, although I'm pretty sure that
_MSC_VER isn't defined on osx ;-)
#if (_MSC_VER < 1300)
...
#endif

replacing it with
#ifdef WIN32
#if (_MSC_VER < 1300)
...
#endif
#endif

fixed the problem.

greetings, Florian Pflug





Re: [HACKERS] [PATCHES] Patch for VS.Net 2005's strxfrm() bug

2006-07-26 Thread Florian G. Pflug

[EMAIL PROTECTED] wrote:

Bruce Momjian wrote:

Why is this better than:

  #if _MSC_VER == 1400

Surely this will not be true if _MSC_VER is undefined?

I experienced injustice and the reason of in OSX for it.

What was the problem with OSX?  Did it throw a warning if you did an
equality test on an undefined symbol?

The following #if evaluated to true on osx, although I'm pretty sure that
_MSC_VER isn't defined on osx ;-)
#if (_MSC_VER < 1300)
...
#endif

replacing it with
#ifdef WIN32
#if (_MSC_VER < 1300)
...
#endif
#endif

fixed the problem.


No doubt, but that's quite a different test.

I mainly posted this to show what the offending ifdef in pgadmin3 looked like,
since someone referenced it, not as an argument against #if _MSC_VER >= 1400.

I guess _MSC_VER < 1300 gets interpreted as 0 < 1300 if _MSC_VER is undefined,
so _MSC_VER >= 1400 would actually work.

But it still surprised me a lot that _MSC_VER < 1300 evaluated to true if _MSC_VER
is undefined - maybe that's the _real_ reason why some people don't like the
tri-state logic in sql - it's because they get confused when trying to use the C
preprocessor ;-)

greetings, Florian Pflug




Re: [HACKERS] GUC with units, details

2006-07-27 Thread Florian G. Pflug

Tom Lane wrote:

Peter Eisentraut [EMAIL PROTECTED] writes:

Tom Lane wrote:

Peter's not said exactly how he plans to deal with
this, but I suppose it'll round off one way or the other ...


It'll get truncated by integer division.  I wouldn't mind if someone 
proposed a patch to create a warning or error in this case, but I 
wanted to keep the initial version simple.


I'd recommend against that.  Apple recently changed OS X so that
it rejects SHMMAX settings that aren't an exact multiple of
something-or-other, and I've found that to be a *serious* PITA.
Of course part of the problem is that there's no helpful message,
but it's still a big loss from a usability standpoint, and quite
unnecessary (every other Unix seems willing to round off...)

One thought is that maybe we should round up not down?  I'm having
a hard time making a specific case either way, though.


Rounding up would have the advantage that you could just specify 0
in the config file, and have postgres use the smallest value possible.

greetings, Florian Pflug




Re: [HACKERS] GUC with units, details

2006-07-27 Thread Florian G. Pflug

Peter Eisentraut wrote:

Florian G. Pflug wrote:

Rounding up would have the advantage that you could just specify 0
in the config file, and have postgres use the smallest value
possible.


In most algebras, dividing zero by something is still zero, so there'd 
be no need to round anything.


I guess I clicked the send button a little too fast. You're right, of course.

greetings, Florian Pflug





Re: [HACKERS] [PATCHES] extension for sql update

2006-07-27 Thread Florian G. Pflug

Tom Lane wrote:

Susanne Ebrecht [EMAIL PROTECTED] writes:

... We could provide the mixed update syntax and leave the
typed row value expression for the next release. Do you agree?


I don't really see the point --- the patch won't provide any new
functionality in anything like its current form, because you can
always just write the separate expressions in the simple one to
one way.  If we do offer the row-on-the-left syntax then people
will try to put sub-selects on the right, and won't get anything
beyond an unhelpful syntax error message.  So my vote would be
to leave it alone until we have a more complete implementation.


It has the advantage that inserts and updates look more alike.
If your sql statements are generated by code, then that removes
the need for a special case for updates.
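
For illustration, the row-on-the-left syntax under discussion lets generated statements
use the same column/value lists for both commands (table and column names are made up):

INSERT INTO accounts (owner, balance) VALUES ('alice', 100);
UPDATE accounts SET (owner, balance) = ('alice', 100) WHERE id = 1;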

greetings, Florian Pflug



Re: [HACKERS] [PATCHES] Forcing current WAL file to be archived

2006-08-01 Thread Florian G. Pflug

Albe Laurenz wrote:

Tim Allen wrote:

Patch included to implement xlog switching, using an xlog record
processing instruction and forcibly moving xlog pointers.

1. Happens automatically on pg_stop_backup()


Oh - so it will not be possible to do an online backup
_without_ forcing a WAL switch any more?
Well, previously, you would have always had to simulate a wal 
switch, by 
working out which is the current wal file and copying that. Otherwise 
your online backup wouldn't be complete.


What Simon is describing sounds like a big step forward from that 
situation. It should let me delete half the code in my pitr 
backup/failover scripts. Definitely a Good Thing.


Certainly a Good Thing, and it should be on by default.

But couldn't there be situations where you'd like to do an
online backup without a WAL switch? To avoid generating an
archive WAL every day on a database with few changes, e.g.?


But the online backup would be impossible to restore, if you don't
have enough wal archived to recover past the point where you called
pg_stop_backup().

So, doing a wal switch when pg_stop_backup() is called greatly reduces 
the risk of a user error that leads to broken backups.
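
For context, a minimal (8.1-era) online backup sequence looks roughly like this; without
a forced wal switch at pg_stop_backup(), the segment containing the stop position can
stay unarchived for hours on a quiet system:

SELECT pg_start_backup('nightly');   -- label is arbitrary
-- ... copy the data directory at the filesystem level (tar, rsync, ...) ...
SELECT pg_stop_backup();
-- the copy only becomes restorable once the wal segment containing the
-- pg_stop_backup() position has been archived; switching segments here
-- closes that window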


greetings, Florian Pflug






Re: [HACKERS] Constraint exclusion is not general enough

2006-08-07 Thread Florian G. Pflug

Tom Lane wrote:

Rod Taylor [EMAIL PROTECTED] writes:

A simple way of doing this might be to use a minimum cost number?


But you don't have any cost numbers until after you've done the plan.


Couldn't this work similarly to geqo_effort? The planner could
try planning the query using only cheap algorithms, and if
the cost exceeds a certain value, it'd restart and use
more sophisticated methods.

greetings, Florian Pflug



Re: [HACKERS] Constraint exclusion is not general enough

2006-08-08 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug [EMAIL PROTECTED] writes:

Tom Lane wrote:

But you don't have any cost numbers until after you've done the plan.



Couldn't this work similarly to geqo_effort? The planner could
try planning the query using only cheap algorithms, and if
the cost exceeds a certain value, it'd restart and use
more sophisticated methods.


AFAICS this would be a net loss on average.  Most of the time, the
constraint exclusion code doesn't win, and so throwing away all your
planning work to try it is going to be a loser most of the time.


On the other hand, if the consider-replanning threshold is high enough,
then that additional time really doesn't matter - if a query runs for minutes,
or even hours, a few wasted cycles during planning don't hurt.

greetings, Florian Pflug



[HACKERS] An Idea for planner hints

2006-08-08 Thread Florian G. Pflug

Hi

Since the discussion about how to force a specific plan has
come up, I thought I'd post an idea I had for this a while ago.
It's not really well thought out yet, but anyway.

When the topic of optimizer hints comes up, people often suggest
that there should be a way to force postgres to use a certain
index, or do joins in a certain order. AFAIK, this mimics what
oracle does - you can put comments into your query that specify
what index to use. This approach has two major drawbacks:
.) Plans that seem good now might not seem that good a few months
later - your data might have changed, and other execution plans
might fit better now.
.) You have to change all your queries to make use of features
in new postgres versions, like bitmap scans.

My experience with the postgres optimizer is that it usually performs
great - and if it doesn't, that always boiled down to two problems
(at least for me)
.) The query is autogenerated, and includes complex, and highly inter-
dependent where (or join) conditions. This leads to wrong estimates
of where selectivity, and thus to bad plans.
.) There are correlations between columns and/or tables that postgres
doesn't know about (and has no chance of knowing about). Again, this
leads to vastly wrong estimates of row counts, and to bad plans.

I think that those bad estimates of the selectivity of where-clauses
(or on-clauses for joins) are where postgres could use hints.

Imagine a query like "select ... from t1 join t2 on t1.t2_id = t2.id and <expr>".
Let's say that <expr> is true for only 1% of the rows in t2 -
but those are exactly the rows that have matching rows in t1.

Postgres would probably guess that this join will produce about 1/100
of the rows that t1 has - but I _know_ that it will produce 100 (!)
times more rows.

Now, I'd like to hand that information to postgres. I wouldn't want
to force any particular access method or join order, but rather I'd
just tell it "hey, this expression has selectivity 1 in this context,
not 0.01 as you might think".

Could that work?

greetings, Florian Pflug



Re: [HACKERS] An Idea for planner hints

2006-08-09 Thread Florian G. Pflug

Tom Lane wrote:

Martijn van Oosterhout kleptog@svana.org writes:

ISTM theat the easiest way would be to introduce a sort of predicate
like so:



SELECT * FROM foo, bar WHERE pg_selectivity(foo.a = bar.a, 0.1);


The one saving grace of Florian's proposal was that you could go hack
the statistics *without* changing your queries.  This throws that away
again.

I think for this to be really effective, you'd actually need both - a
query-independent way of specifying selectivities, and a way to influence
the estimates for a _single_ query.

Imagine a complex, autogenerated query which looks something like this
select ...
from t1
join t2 on ...
join t3 on ...
join t4 on ...
...
...
where
  <big, complicated expression derived from some user input>.

This big, complicated expression looks different for every query - and
currently, postgres often vastly overestimates the selectivity of this
expression. This leads to weird join orders, and generally very bad
performance. Of course, *I* don't know the selectivity of this
expression myself - but experience tells me that on average it's
something like 50%, and not 1% as postgres believes. So, in that case,
being able to write

select ... join ... where pg_selectivity(<expression>, 0.5)

would be a big win.


The thing I object to about the "I want to decorate my queries with
planner hints" mindset is that it's coming at it from the wrong
direction.  You should never be thinking in terms of "fix this one
query", because that just leads back into the same dead end that your
fix doesn't work tomorrow.  What you *should* be thinking about is "why
did the planner get this wrong, and how do I fix the generic problem?".
If you attack it that way then your fix is much more likely to work on
the next slightly-different query.


Fixing the generic problem is surely the best _if_ there is a fix for
the generic problem at all. But if your where-conditions involve fields
from 10 different tables, then IMHO there is no way to _ever_ guarantee
that postgres will get correct selectivity estimates. But since (at
least for me) overestimating selectivity hurts far more than
underestimating it, forcing postgres to just assume a certain
selectivity could help.

I'm not in any way saying that there should _only_ be selectivity
annotations inside the query - a query-independent mechanism would
be a very nice thing to have. But a query-independent mechanism
won't be sufficient in all cases IMHO.

greetings, Florian Pflug



Re: [HACKERS] An Idea for planner hints

2006-08-09 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug [EMAIL PROTECTED] writes:

Imagine a complex, autogenerated query which looks something like this
select ...
from t1
join t2 on ...
join t3 on ...
join t4 on ...
...
...
where
   <big, complicated expression derived from some user input>.

This big, complicated expression looks different for every query - and
currently, postgres often vastly overestimates the selectivity of this
expression.


This is a straw man.  There is no way that your application can throw in
a chosen-at-random selectivity value for a join condition that it
doesn't understand and have that be more likely to be right than the
planner's guess.


No, my application probably won't get it right, _but_
.) I can at least _choose_ what selectivity to use. My experience is
that a selectivity that is too small (meaning that postgres
underestimates the number of records resulting from a join or where)
is usually much worse than an overly large selectivity (meaning that
postgres expects more records than it actually finds). Forcing a
high selectivity (thus letting postgres expect a lot of records)
therefore should lead to better plans than letting postgres
underestimate the selectivity.

.) Often, my application (or I) *can* guess better than postgres. My
application, for example, executes the same set of about 100 queries
every day to build cache tables. Since I _know_ how many records the
query returned yesterday, I can use that value to get a *very*
good approximation of the selectivity. This is something my app
can do easily, while postgres would have a really hard time figuring
that out.

greetings, Florian Pflug



Re: [HACKERS] An Idea for planner hints

2006-08-17 Thread Florian G. Pflug

Peter Eisentraut wrote:

Arturo PĂ©rez wrote:

The DBA therefore pokes the
right information into
the planner's statistical tables (or, perhaps, a more human-
manageable one that gets
compiled into the planner's stats).


I think we're perfectly capable of producing a system that can collect 
the statistics.  We just don't want to collect every possible 
statistic, but just those that someone declared to be interesting 
beforehand.  There need not be any manual poking.  Just manual 
declaring.


But we need to work this from the other end anyway.  We need to 
determine first, what sort of statistics the planner could make use of.  
Then we can figure out the difficulties in collecting them.


I've been told that oracle has an interesting feature regarding
materialized views that gave me an idea how to declare what statistics
to gather. It seems as if oracle is able to figure out that it can
use a certain materialized view to speed up execution of a certain
query, even if the query doesn't use that view explicitly. So, e.g.
if you do

1) create materialized view v as select * from t1 join t2 on t1.t2_id = 
t2.id.

2) select * from t1 join t2 on t1.t2_id = t2.id join t3 on t3.t2_id = t2.id

then oracle seems to be able to use the already-joined tuples in v, and
only needs to join t3 to those, instead of having to rejoin t1 and t2.

That gave me the idea that something similar could be used to declare
what statistics to gather, in a very general way. Imagine that I could
do:

1) create statistics for select * from t1 join t2 on t1.t2_id = t2.id and
t1.flag = TRUE.
2) select * from t1 join t2 on t1.t2_id = t2.id and t1.flag = TRUE join t3 on
...  join t4 on ...

The command 1) would basically gather the same statistics for the result
of the query as it would gather for a normal table with the same signature.
When planning 2), postgres would recognize that it can use those
statistics (similar to how oracle recognizes that it can use a certain
materialized view), and would thus know the selectivity of that
particular join very accurately.

I think there might even be a way to do (1) without actually executing 
the (whole) query. If every access-method in the query plan could be

told to deliver only say 10% of the rows it would deliver normally,
but the rest of the plan was executed normally, then the result should
have the same statistical properties as the complete result would have.

greetings, Florian Pflug




Re: [HACKERS] [PATCHES] WIP archive_timeout patch

2006-08-17 Thread Florian G. Pflug

Tom Lane wrote:

Simon Riggs [EMAIL PROTECTED] writes:

Revised patch enclosed, now believed to be production ready. This
implements regular log switching using the archive_timeout GUC.



Further patch enclosed implementing these changes plus the record type
version of pg_xlogfile_name_offset()


Applied with minor changes --- it seemed better to me to put tracking of
the last xlog switch time directly into xlog.c, instead of having the
bgwriter code try to determine whether a switch had happened recently.

I noticed a minor annoyance while testing: when the system is completely
idle, you get a forced segment switch every checkpoint_timeout seconds,
even though there is nothing useful to log.  The checkpoint code is
smart enough not to do a checkpoint if nothing has happened since the
last one, and the xlog switch code is smart enough not to do a switch
if nothing has happened since the last one ... but they aren't talking
to each other and so each one's change looks like something happened
to the other one.  I'm not sure how much trouble it's worth taking to
prevent this scenario, though.  If you can't afford a WAL file switch
every five minutes, you probably shouldn't be using archive_timeout
anyway ...


Actually, this behaviour IMHO even has its advantages - if you can be
sure that at least one wal will be archived every 5 minutes, then it's
easy to monitor the replication - you can just watch the logfile of the
slave, and send a failure notice if no logfile is imported for more than
10 minutes or so.

Of course, for this to be useful, the documentation would have to tell
people about that behaviour, and it couldn't easily be changed in the next
release...

greetings, Florian Pflug



Re: [HACKERS] PostgreSQL on 64 bit Linux

2006-08-21 Thread Florian G. Pflug

[EMAIL PROTECTED] wrote:

This is what I mean by after thought. PostgreSQL is designed for
32-bit processors. Which is fine. I'm not complaining. The question
was whether there is an interest in pursuing 64-bit specific
optimizations. In the PostgreSQL code, a quick check points me only to
has long int 64 as a 64-bit source code #ifdef. Of the six places
that reference this, five of them actually slow down the code, as they
check for overflow of the 'long int' result beyond 4 bytes of
data. The sixth place is used to define the 64-bit type in use by
PostgreSQL, which I suspect is infrequently used.

I believe the answer is no. No or few 64-bit optimization possibilities
have been chased down, probably because some or many of these would:

1) require significant re-architecture

2) reduce the performance in a 32-bit world


Just out of interest - what areas in postgres do you think could be
improved (performance-wise) on 64-bit machines? The only area that
I can see is the int64 datatype - it's stored in palloc()'ed memory
on 32-bit machines AFAIK - I'm not sure if it uses the long long
datatype on 64-bit archs... But I can't imagine any other area that
could be tuned by making use of (native) 64-bit ints.

greetings, Florian Pflug



Re: [HACKERS] Patch: Tie stats options to autovacuum in postgresql.conf

2006-09-28 Thread Florian G. Pflug

Jim C. Nasby wrote:

On Thu, Sep 28, 2006 at 03:07:39PM -0700, David Wheeler wrote:

PostgreSQLers,

I just ran into an issue where a client thought that autovacuum was  
running but it wasn't. This is because it's not fatal when autovacuum  
is on but stats_start_collector and/or stats_row_level is off. I  
suspect that there's a reason that it's not fatal, so I thought that  
it might be useful to give folks just a little bit of help by telling  
them in postgresql.conf that they need to enable them in order for  
autovacuum to work.


+1. I was just at a client today that had run into this problem.

Actually, I'm in favor of refusing to start if autovac is on but the
proper stats settings aren't. I'd rather that then people ending up with
bloated databases and crappy performance.

I think that setting autovacuum to on should even force
stats_start_collector and stats_row_level to on - together with a warning if
they would otherwise be off.

Autovacuum being disabled by accident risks a much worse performance
penalty than having the statistics collector running by accident.
Additionally, the statistics collector can easily be turned off within
seconds even _if_ it was on accidentally, but if vacuuming was disabled
by accident, the user might have to run vacuum full - with all the
concurrency issues that this implies...
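
As a quick way to spot that misconfiguration, the relevant (8.1/8.2-era) settings
can be checked together:

-- autovacuum silently does nothing unless both stats settings are on
SELECT name, setting
FROM pg_settings
WHERE name IN ('autovacuum', 'stats_start_collector', 'stats_row_level');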

greetings, Florian Pflug



Re: [HACKERS] array_accum aggregate

2006-10-07 Thread Florian G. Pflug

Tom Lane wrote:

Stephen Frost [EMAIL PROTECTED] writes:

* Tom Lane ([EMAIL PROTECTED]) wrote:

It looks like it should work to have just one polymorphic aggregate
definition, eg, array_accum(anyelement) returns anyarray.



I was hoping to do that, but since it's an aggregate the ffunc format is
pre-defined to require accepting the 'internal state' and nothing else,
and to return 'anyelement' or 'anyarray' one of the inputs must be an
'anyelement' or 'anyarray', aiui.


Hmm ... I hadn't been thinking about what the state type would need to
be, but certainly bytea is a lie given what you're really doing.
We've run into this same problem in contrib/intagg: sometimes you'd like
to use a state data structure that isn't any regular SQL datatype, and
in particular isn't just a single blob of memory.  That's a problem from
nodeAgg's point of view because it expects to be responsible for copying
the state value from one context to another.  Don't have an immediate
idea for a solution ...


I used an int8 as state type for an implementation of a kind of
array_accum_unique aggregate. I know that it's an ugly hack, but
it has been running on a production machine for months now, without
a problem...

The memory is alloc'd from the aggcontext, btw.

Note that I only use this aggregate in one particular query -
so there might be problems with my approach that just don't
manifest in my particular situation. For example, the aggregate
is used only on a table that is never updated, and it is only
used in select queries. So there might be problems if the executor
decides that it has to restart a query...

greetings, Florian Pflug




Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-13 Thread Florian G. Pflug

Heikki Linnakangas wrote:

BTW, we haven't talked about how to acquire a snapshot in the slave.
 You'll somehow need to know which transactions have not yet
committed, but will in the future. In the master, we keep track of
in-progress transaction in the ProcArray, so I suppose we'll need to
do the same in the slave. Very similar to prepared transactions,
actually. I believe the Abort records, which are not actually needed
for normal operation, become critical here. The slave will need to
put an entry to ProcArray for any new XLogRecord.xl_xid it sees in
the WAL, and remove the entry at a Commit and Abort record. And clear
them all at a shutdown record.


For reference, here is how I solved the snapshot problem in my
Summer-of-Code project last year, which dealt exactly with executing
read-only queries on PITR slaves (But sadly never came out of alpha
stage due to both my and Simon's lack of time)

The main idea was to invert the meaning of the xid array in the snapshot
struct - instead of storing all the xid's between xmin and xmax that are
to be considered in-progress, the array contained all the xid's > xmin
that are to be considered completed.

The "current" read-only snapshot ("current" meaning the
corresponding state on the master at the time the last replayed wal
record was generated) was maintained in shared memory. Its xmin field
was continually updated with the (newly added) XLogRecord.xl_xmin
field, which contained the xid of the oldest running query on the
master, with a pruning step after each ReadOnlySnapshot.xmin update to
remove all entries < xmin from the xid array. If a commit was seen for
an xid, that xid was added to the ReadOnlySnapshot.xid array.

The advantage of this concept is that it handles snapshotting on the
slave without too much additional work for the master (the only change
is the addition of the xl_xmin field to XLogRecord). It especially
removes the need to track ShmemVariableCache->nextXid.

The downside is that the size of the read-only snapshot is theoretically
unbounded, which poses a bit of a problem if it's supposed to live
inside shared memory...

regards, Florian Pflug





Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-15 Thread Florian G. Pflug

Simon Riggs wrote:

On Sat, 2008-09-13 at 10:48 +0100, Florian G. Pflug wrote:


The main idea was to invert the meaning of the xid array in the snapshot
struct - instead of storing all the xid's between xmin and xmax that are
to be considered in-progress, the array contained all the xid's > xmin
that are to be considered completed.



The downside is that the size of the read-only snapshot is theoretically
unbounded, which poses a bit of a problem if it's supposed to live
inside shared memory...


Why do it inverted? That clearly has problems.


Because it solves the problem of spontaneously appearing XIDs in the
WAL. At least prior to 8.3 with virtual xids, a transaction might have
allocated its xid long before actually writing anything to disk, and
therefore long before this XID ever shows up in the WAL. And with a
non-inverted snapshot such an XID would be considered to be completed
by transactions on the slave... So, one either needs to periodically log
a snapshot on the master or log XID allocations, which both seem to cause
considerable additional load on the master. With an inverted snapshot,
it's sufficient to log the current RecentXmin - a value that is readily
available on the master, and therefore the cost amounts to just one
additional 4-byte field per xlog entry.


regards, Florian Pflug





Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-15 Thread Florian G. Pflug

Simon Riggs wrote:

On Sat, 2008-09-13 at 10:48 +0100, Florian G. Pflug wrote:


The "current" read-only snapshot ("current" meaning the
corresponding state on the master at the time the last replayed wal
record was generated) was maintained in shared memory. Its xmin field
was continually updated with the (newly added) XLogRecord.xl_xmin
field, which contained the xid of the oldest running query on the
master, with a pruning step after each ReadOnlySnapshot.xmin update to
remove all entries < xmin from the xid array. If a commit was seen for
an xid, that xid was added to the ReadOnlySnapshot.xid array.

The advantage of this concept is that it handles snapshotting on the
slave without too much additional work for the master (the only change
is the addition of the xl_xmin field to XLogRecord). It especially
removes the need to track ShmemVariableCache->nextXid.


Snapshots only need to know which transactions are currently running
during WAL apply. The standby can't remove any tuples itself, so it
doesn't need to know what the master's OldestXmin is. 


I used the logged xmin value to track the shared snapshot's xmin, which
in turn allowed me to prune the xid array, eliminating all xids < that xmin.


regards, Florian Pflug





Re: [HACKERS] Listen / Notify - what to do when the queue is full

2009-11-19 Thread Florian G. Pflug

Heikki Linnakangas wrote:

Joachim Wieland wrote:

On Thu, Nov 19, 2009 at 4:12 AM, Tom Lane t...@sss.pgh.pa.us wrote:
Yes, I have been thinking about that also. So what should happen
when you prepare a transaction that has sent a NOTIFY before?


From the user's point of view, nothing should happen at prepare.

At a quick glance, it doesn't seem hard to support 2PC. Messages should
be put to the queue at prepare, as just before normal commit, but the
backends won't see them until they see that the XID has committed.


Yeah, but if the server is restarted after the PREPARE but before the 
COMMIT, the notification will be lost, since all notification queue 
entries are lost upon restart with the slru design, no?


best regards,
Florian Pflug





Re: [HACKERS] Listen / Notify - what to do when the queue is full

2009-11-19 Thread Florian G. Pflug

Tom Lane wrote:

Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
A better approach is to do something similar to what we do now: at 
prepare, just store the notifications in the state file like we do 
already. In notify_twophase_postcommit(), copy the messages to the 
shared queue. Although it's the same approach we have now, it

becomes a lot cleaner with the patch, because we're not
piggybacking the messages on the backend-private queue of the
current transaction, but sending the messages directly on behalf of
the prepared transaction being committed.


This is still ignoring the complaint: you are creating a clear risk 
that COMMIT PREPARED will fail.


I'm not sure that it's really worth it, but one way this could be
made safe would be for PREPARE to reserve the required amount of
queue space, such that nobody else could use it during the window
from PREPARE to COMMIT PREPARED.


I'd see no problem with COMMIT PREPARED failing, as long as it was
possible to retry the COMMIT PREPARED at a later time. There surely are
other failure cases for COMMIT PREPARED too, like an IO error that
prevents the clog bit from being set, or a server crash half-way through
COMMIT PREPARED.

best regards,
Florian Pflug




Re: [HACKERS] Listen / Notify - what to do when the queue is full

2009-11-19 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug f...@phlo.org writes:

Tom Lane wrote:

This is still ignoring the complaint: you are creating a clear
risk that COMMIT PREPARED will fail.



I'd see no problem with COMMIT PREPARED failing, as long as it
was possible to retry the COMMIT PREPARED at a later time. There
surely are other failure cases for COMMIT PREPARED too, like an IO
error that prevents the clog bit from being set, or a server crash
half-way through COMMIT PREPARED.


Yes, there are failure cases that are outside our control.  That's no
 excuse for creating one that's within our control.


True. On the other hand, people might prefer having to deal with (very
unlikely) COMMIT PREPARED *transient* failures over not being able to
use NOTIFY together with 2PC at all. Especially since any credible
distributed transaction manager has to deal with COMMIT PREPARED
failures anyway.

Just my $0.02, though.

best regards,
Florian Pflug




[HACKERS] DEFAULT of domain ignored in plpgsql (8.4.1)

2009-11-19 Thread Florian G. Pflug

Hi

It seems that pl/pgsql ignores the DEFAULT value of domains for local
variables. With the following definitions in place

create domain myint as int default 0;
create or replace function myint() returns myint as $body$
declare
  v_result myint;
begin
  return v_result;
end;
$body$ language plpgsql immutable;

issuing
select myint();
returns NULL, not 0 on postgres 8.4.1

If the line
  v_result myint;
is changed to
  v_result myint default 0;
then 0 is returned as expected.

I've tried to create a patch, but didn't see how I'd convert the result
from get_typedefault() (a Node*, presumably the parse tree corresponding
to the default expression?) into a plan that I could store in a
PLpgSQL_expr. I guess I'd need something like SPI_prepare_plan that
takes a parse tree instead of a query string. Or am I on a completely
wrong track there?

While trying to cook up a patch I've also stumbled over what I perceive
as a bug relating to DOMAINS and column DEFAULTs. I'll write that up in
a second E-Mail to avoid confusion.

best regards,
Florian Pflug




[HACKERS] column DEFAULTs and prepared statements

2009-11-19 Thread Florian G. Pflug

Hi

While trying to come up with a patch to handle domain DEFAULTs in
plpgsql I've stumbled across the following behavior regarding domain
DEFAULTs and prepared statements.

session 1: create domain myint as int default 0 ;
session 1: create table mytable (i myint) ;
session 2: prepare ins as insert into mytable (i) values (default);
session 2: execute ins;
session 1: alter domain myint set default 1;
session 2: execute ins;

select * from mytable returns:
 i
---
 0
 0


while I'd have expected:
 i
---
 0
 1

After doing the same without using a domain
session 1: create table mytable (i int default 0) ;
session 2: prepare ins as insert into mytable (i) values (default);
session 2: execute ins;
session 1: alter table mytable alter column i set default 1;
session 2: execute ins;

select * from mytable returns:
 i
---
 0
 1

As far as I understand the code this happens because the dependency on
the domain (for the default value) is not recorded in the plan cache
entry. This would imply that the same error also occurs if the INSERT
happens from a pl/pgsql function instead of a manually prepared
statement, but I haven't tested that.

If someone gives me a general idea where to start, I could try to come
up with a patch

best regards,
Florian Pflug





Re: [HACKERS] DEFAULT of domain ignored in plpgsql (8.4.1)

2009-11-20 Thread Florian G. Pflug

Robert Haas wrote:

On Thu, Nov 19, 2009 at 9:06 PM, Florian G. Pflug f...@phlo.org wrote:

I've tried to create a patch, but didn't see how I'd convert the result
from get_typedefault() (A Node*, presumeably the parsetree corresponding
to the default expression?) into a plan that I could store in a
PLpgSQL_expr. I guess I'd need something like SPI_prepare_plan that
takes a parse tree instead of a query string. Or am I on a completely
wrong track there?

While trying to cook up a patch I've also stumbled over what I perceive
as a bug relating to DOMAINS and column DEFAULTs. I'll write that up in
a second E-Mail to avoid confusion.


I suggest adding this to the open CommitFest (2010-01) at
https://commitfest.postgresql.org/action/commitfest_view/open


Hm, but I don't (yet) have a patch to add...

best regards,
Florian Pflug





Re: [HACKERS] DEFAULT of domain ignored in plpgsql (8.4.1)

2009-11-20 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug f...@phlo.org writes:

It seems that pl/pgsql ignores the DEFAULT value of domains for
local variables.


The plpgsql documentation seems entirely clear on this:

The DEFAULT clause, if given, specifies the initial value assigned to
 the variable when the block is entered. If the DEFAULT clause is not
 given then the variable is initialized to the SQL null value.


Hm, must have missed that paragraph :-(. Sorry for that.

Would a patch that changes that have any chance of being accepted? Or is
the gain (not having to repeat the DEFAULT clause, and being able to
maintain it at one place instead of many) considered too small compared
to the risk of breaking existing code?

best regards,
Florian Pflug




Re: [HACKERS] DEFAULT of domain ignored in plpgsql (8.4.1)

2009-11-21 Thread Florian G. Pflug

Tom Lane wrote:

Josh Berkus j...@agliodbs.com writes:

(2) this change, while very useful, does change what had been a
simple rule (All variables are NULL unless specifically set
otherwise) into a conditional one (All variables are NULL unless
set otherwise OR unless they are declared as domain types with
defaults).  Do people feel that the new behavior would be
sufficiently intuitive to avoid user confusion?


I'm inclined to leave it alone.  It complicates the mental model, and
 frankly attaching defaults to domains was not one of the SQL
committee's better ideas anyway.  It's *fundamentally*
non-orthogonal.


I've always thought of domains as being a kind of subtype of their base
type. In this picture, DEFAULTs for domains correspond to overriding the
default constructor of the type (thinking C++ now), and seem like a
natural thing to have. But maybe that's more a C++ programmer's than a
database designer's point of view...

I've just checked how rowtypes behave, and while the "set to null unless
specifically set otherwise" rule kind of holds for them, their NULL
value seems to be special-cased enough to blur the line quite a bit:

-- assuming some composite type t, e.g.: create table t (a int, b int);
create or replace function myt() returns t as $body$
declare
  r t;
begin
  raise notice 'r: %, r is null: %', r, (r is null);
  return r;
end;
$body$ language plpgsql immutable;
select myt(), myt() is null;

gives:

NOTICE:  r: (,), r is null: t
NOTICE:  r: (,), r is null: t
 myt | ?column?
-+--
 (,) | f

Strange, I think... And at least half of an exception to the simple
"always null unless specifically set otherwise" rule.

It also seems that while domain DEFAULTs are ignored, the resulting
(null-initialized) variable is still checked against the domain's
constraints, including a potential NOT NULL constraint

create domain myint as int not null;
create or replace function myint() returns myint as $body$
declare
  i myint;
begin
  return i;
end;
$body$ language plpgsql immutable;

raises

ERROR:  domain myint does not allow null values
CONTEXT:  PL/pgSQL function myint line 3 during statement block local
variable initialization

This has the potential to cause some headache I think if you use domains
to prohibit NULL values because they make no semantic sense for your
application, and depend on DEFAULT to fill in some other value (like an
empty string or an empty array).

best regards,
Florian Pflug




Re: [HACKERS] DEFAULT of domain ignored in plpgsql (8.4.1)

2009-11-21 Thread Florian G. Pflug

Gurjeet Singh wrote:

On Sat, Nov 21, 2009 at 7:26 AM, Josh Berkus j...@agliodbs.com wrote:

However, there are some other issues to be resolved:

(1) what should be the interaction of DEFAULT parameters and domains
with defaults?


The function's DEFAULT parameter should take precedence over the
default of the domain.


I think Josh was pondering whether

create domain myint as int default 0;
create function f(i myint) ...;

should behave like

create function f(i myint default 0) ...;

and hence call f(0) if you do select f();, or instead
raise an error because no f with zero parameters is defined (as it does
now).

I'd say no, because "no default" should be treated the same as "default
null", so for consistency we'd then have to also support

create function g(i int) ...;
select g();

And of course throw an error if there was another function defined as
create function g() ...;

This way leads to madness...

If one really wanted to do that, there'd have to be an OPTIONAL clause
for function parameters that works like DEFAULT, but doesn't take a
default value and instead uses the type's default (NULL except for
domains with a DEFAULT clause). But I wouldn't go that far, personally...

best regards,
Florian Pflug






[HACKERS] ALTER TABLE, find_composite_type_dependencies and locking

2009-11-25 Thread Florian G. Pflug

Hi

I'm currently investigating how much work it'd be to implement arrays of
domains since I have a client who might be interested in sponsoring that
work.

The comments around the code handling ALTER DOMAIN ADD CONSTRAINT are
pretty clear about the lack of proper locking in that code - altering a
domain while simultaneously adding a column with that domain as a type
might result in inconsistencies between the data in that column and the
domain's constraints after both transactions committed.

I do, however, suspect that ALTER TABLE is plagued by similar problems.
Currently, during the rewrite phase of ALTER TABLE,
find_composite_type_dependencies is used to verify that the table's row
type (or any type directly or indirectly depending on that type) is not
used as a column's type anywhere in the database.

But since this code does not take any permanent locks on the visited
types, it seems that adding such a column concurrently is not prevented.
 If the original ALTER TABLE changed a column's type, data inserted into
the newly added column before the original ALTER TABLE committed will
have a type different from what the catalog says after the original
ALTER TABLE commits. Or at least so I think - I haven't yet tested that
theory...

I am aware that since a commit fest is currently running, now might not
be the best time to bring up this topic. Since I feared forgetting this
altogether, I decided to still post now, though. I figured people
still have the option to ignore this for now if they're busy with getting
those patches committed.

best regards,
Florian Pflug





Re: [HACKERS] ALTER TABLE, find_composite_type_dependencies and locking (Confirmed)

2009-11-25 Thread Florian G. Pflug

Florian G. Pflug wrote:
I do, however, suspect that ALTER TABLE is plagued by similar 
problems. Currently, during the rewrite phase of ALTER TABLE, 
find_composite_type_dependencies is used to verify that the table's 
row type (or any type directly or indirectly depending on that type) 
is not used as a column's type anywhere in the database.


But since this code does not take any permanent locks on the visited
 types, it seems that adding such a column concurrently is not 
prevented. If the original ALTER TABLE changed a column's type, data 
inserted into the newly added column before the original ALTER TABLE 
committed will have a type different from what the catalog says after

 the original ALTER TABLE commits. Or at least so I think - I haven't
 yet tested that theory...


I was able to confirm that this is an actual bug in 8.5. I did, however,
need to use an array-of-composite type. With only nested composite types
it seems that CheckAttributeType() protects against the race, because it
follows the dependency chain and opens each type's relation in
AccessShareLock mode. This blocks once the traversal hits the type which
is being altered, hence forcing the table creation to wait for the
concurrent alter table to complete.

Create two types in session 1
session 1: create table t1 (t1_i int);
session 1: create type t2 as (t2_t1 t1);

Warm the type cache in session 2
(A simple select array[row(row(-1))::t2] would probably suffice)
session 2: create table bug (bug_t2s t2[]);
session 2: insert into bug (bug_t2s) values (array[row(row(-1))::t2]);
session 2: select bug.bug_t2s[1].t2_t1.t1_i from bug;
[select correctly returns one row containing -1]
session 2: drop table bug;

Alter type of t1_i in session 1
session 1: alter table t1 alter column t1_i type varchar;
[Pause session 1 using gdb *right* after the call to
 find_composite_type_dependencies in ATRewriteTable
 returned]

Create the bug table in session 2, and insert record
session 2: create table bug (bug_t2s t2[]);
session 2: insert into bug (bug_t2s) values (array[row(row(-1))::t2]);
session 2: select bug.bug_t2s[1].t2_t1.t1_i from bug;
[select correctly returns one row containing -1]

Complete the alter table in session 1
[Resume session 1 using gdb]
session 1: select bug.bug_t2s[1].t2_t1.t1_i from bug;
[select returns bogus string. On my 8.5 debug+cassert build,
 its a long chain of \x7F\x7F\x7F\x...]

Don't have any good idea how to fix this yet. If CheckAttributeType()
really *does* offer sufficient protection in the non-array case,
extending that to the general case might work. But OTOH it might equally
well be that a more sophisticated race exists even in the non-array
case, and I simply didn't manage to trigger it...

best regards, Florian Pflug




Re: [HACKERS] File IO - why does PG do things in pages?

2009-11-26 Thread Florian G. Pflug

Dan Eloff wrote:
At the lower levels in PG, reading from the disk into cache, and 
writing from the cache to the disk is always done in pages.


Why does PG work this way? Is it any slower to write whole pages 
rather than just the region of the page that changed? Conversely, is 
it faster? From what I think I know of operating systems, reading 
should bring the whole page into the os buffers anyway, so reading
the whole page instead of just part of it isn't much more expensive. 
Perhaps writing works similarly?


First, data fetched from the disk is (except for data in temporary
tables, I believe) not stored in private memory of the backend process
doing the read(), but instead in a shared memory segment accessible by
all backend processes. This allows two different backend processes to
modify the data concurrently without stepping on each other's
toes. Note that immediately writing back any changes is *not* an option,
since WAL logging mandates that all changes go to the WAL *first*.
Hence, if you were to write out each changed tuple immediately, you'd
have to first write the changes to the WAL *and* fsync the WAL to
guarantee they hit the disk first.

Sharing the data between backend processes requires a fair amount of
infrastructure. You need a way to locate a given chunk of on-disk data
in the shared memory buffer cache, and be able to acquire and release
locks on those buffers to prevent two backends from wreaking havoc when
they try to update the same piece of information. Organizing data in
fixed-sized chunks (which is what pages are) helps with keeping the
complexity of that infrastructure manageable, and the overhead
reasonably low.

There are also things like tracking the free space in a data file, which
also gets easier if you only have to track it page-wise (Is there free
space on this page or not), instead of having to track arbitrary ranges
of free space.

Finally, since data writes happen in units of blocks (and not bytes),
you need to guarantee that you do your IO in some multiple of that unit
anyway, otherwise you'd have a very hard time guaranteeing data
consistency after a crash. Google for torn page writes, that should
give you more details about this problem.

Note, however, that a postgres page (usually 8K) is usually larger than
the filesystem's blocksize (usually 512b). So always reading in full
pages induces *some* IO overhead. Just not that much - especially since
the blocks comprising a page are extremely likely to be arranged
consecutively on disk, so there is no extra seeking involved.
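
As a concrete reference point, the page size a cluster was built with is
exposed as a read-only setting (8192 bytes unless changed at compile time):

show block_size;
-- or equivalently
select current_setting('block_size');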

These, at least, are what I believe to be the main reasons for doing
things in units of pages - hope this helps at least somewhat.

best regards,
Florian Pflug




Re: [HACKERS] Application name patch - v4

2009-11-29 Thread Florian G. Pflug

Tom Lane wrote:

: One possibility would be to make it possible to issue SETs that
behave : as if set in a startup packet - imho its an implementation
detail that : SET currently is used.

I think there's a good deal of merit in this, and it would't be hard
at all to implement, seeing that we already have SET LOCAL and SET
SESSION. We could add a third keyword, say SET DEFAULT, that would
have the behavior of setting the value in a fashion that would
persist across resets.  I'm not sure that DEFAULT is exactly le mot
juste here, but agreeing on a keyword would probably be the hardest
part of making it happen.


Hm, but without a way to prevent the users of a connection pool from
issuing SET DEFAULT, that leaves a connection pool with no way to
revert a connection to a known state.

How about SET CONNECTION, with an additional GUC called
connection_setup which can only be set to true, never back to false?
Once connection_setup is set to true, further SET CONNECTION attempts
would fail.

In a way, this mimics startup-packet SETs without actually doing things
in the startup packet.
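
To make that concrete, here's a purely hypothetical sketch of what a
pooler could then do (neither SET CONNECTION nor connection_setup exist
today - they're just the names suggested above):

-- done once by the pooler, right after connecting
SET CONNECTION application_name = 'mypool';
SET connection_setup = true;   -- one-way switch, can never be turned off again

-- later, issued by a pool client
SET CONNECTION application_name = 'whatever';  -- would fail, connection already set up
RESET ALL;                                     -- would fall back to the values set above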

best regards,
Florian Pflug





[HACKERS] Compiling HEAD with -Werror int 64-bit mode

2009-12-15 Thread Florian G. Pflug

Hi

HEAD fails to compile in 64-bit mode on Mac OS X 10.6 with gcc 4.2 and
-Werror.

What happens is that INT64_FORMAT gets defined as %ld (which is
correct - long and unsigned long are 64 bits wide on x86_64), but
the check for a working 64-bit int fails, causing INT64_IS_BUSTED to get
defined and int64 to actually become a 32-bit type. This in turn causes
warnings when these pseudo-int64s are passed to printf with format
specifier INT64_FORMAT, which get turned into errors by -Werror.

configure fails to recognize long as a working 64-bit type because the
does_int64_work configure test produces warnings due to a missing return
type for main() and a missing prototype for
does_int64_work(). (Again, those warnings are turned into errors by -Werror.)

I use the following envvar settings (when running ./configure) to force
64-bit mode and -Werror:
CC=gcc-4.2 CFLAGS="-arch x86_64 -Werror" LDFLAGS="-arch x86_64"

The following patch fixed the problem for me - though I haven't yet tried
it on any platform other than Mac OS X 10.6 with gcc 4.2 in 64-bit mode.

--
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 9ac2c30..c6bd523 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -35,7 +35,7 @@ AC_CACHE_CHECK([whether $1 is 64 bits], [Ac_cachevar],
 ac_int64 a = 2001;
 ac_int64 b = 4005;

-int does_int64_work()
+static int does_int64_work()
 {
   ac_int64 c,d;

@@ -49,8 +49,8 @@ int does_int64_work()
 return 0;
   return 1;
 }
-main() {
-  exit(! does_int64_work());
+int main() {
+  return(! does_int64_work());
 }],
 [Ac_cachevar=yes],
 [Ac_cachevar=no],
--

best regards,
Florian Pflug






Re: [HACKERS] Compiling HEAD with -Werror int 64-bit mode

2009-12-15 Thread Florian G. Pflug

On 15.12.09 16:02 , Tom Lane wrote:

Florian G. Pflugf...@phlo.org  writes:

configure fails to recognize long as a working 64-bit type
because the does_int64_work configure test produces warning due to
 a missing return value declaration for main() and a missing
prototype for does_int64_work(). (Aain, those warning are turned
into errors by -Werror).


autoconf's test programs tend to be sufficiently sloppy that I would
expect -Werror to break a whole lot of things, not just this. We can
possibly neaten up the particular test case but there are many tests
whose expansion we don't have much control over.


Yeah, I expected all hell to break loose - only to be pleasantly
surprised by this being the only issue I encountered. So I figured
fixing this might be worthwhile - even if this surely does not fix
-Werror builds on all platforms and/or compilers.

Alternatively - is there a way to use -Werror only for building the
actual sources, not the configure tests? I didn't find one, but my
autoconf-fu is pretty limited...

best regards,
Florian Pflug





Re: [HACKERS] Range types

2009-12-15 Thread Florian G. Pflug

On 15.12.09 15:52 , Tom Lane wrote:

to...@tuxteam.de writes:

(and as Andrew Dunstan pointed out off-list: I was wrong with my
bold assertion that one can squeeze infinitely many (arbitrary
length) strings between two given. This is not always the case).


Really?  If the string length is unbounded I think you were right.


One example is "a" and "aa" (assuming a is the minimal character in your
alphabet): any string strictly between them would have to start with a
and continue with a nonempty string that sorts before "a", which is
impossible. The general case is the strings A and Aaa...a I think -
it doesn't get any more exciting than this.

This *is* a bit surprising, since one usually assumes that the ordering
of strings and reals is fairly similar, since both are lexical.

But note that the mapping of strings into the reals this intuition is
based on (simply prefix the string with 0. and interpret it as a real,
or something similar if the alphabet isn't {0,1}) isn't one-to-one - the
strings 1, 10, 100, ... are all mapped to the *same* real number 0.1.

So for reals, the statement is reduced to the trivial fact that for
every x there is no y with x < y < x. Which is of course true..

best regards,
Florian Pflug





Re: [HACKERS] Compiling HEAD with -Werror int 64-bit mode

2009-12-15 Thread Florian G. Pflug

On 15.12.09 23:38 , Tom Lane wrote:

Peter Eisentrautpete...@gmx.net  writes:

So to summarize, this is just a bad idea.  Creating a less obscure
way to use -Werror might be worthwhile, though.


I suppose we could add --with-Werror but it seems pretty
specialized to me.  A more appropriate solution would allow the user
to provide flags that get added to CFLAGS only after we do all the
configure tests (implying that it's on the user's head that these
flags are right and don't break anything, but then again that's
pretty much true of up-front CFLAGS too).  And that basically
describes COPTS ... the only thing lacking is documentation.


For what it's worth, I agree. Though we might want to arrange for
configure to store the value of COPT somewhere, so that
  COPT=-Werror ./configure
  make
works - which it currently doesn't seem to do.

best regards,
Florian Pflug





[HACKERS] [WIP] Inspection of row types in pl/pgsq l and pl/sql

2009-12-20 Thread Florian G. Pflug

Hi

I've completed a (first) working version of a extension that allows
easier introspection of composite types from SQL and pl/PGSQL.

The original proposal and ensuing discussion can be found here:
http://archives.postgresql.org/pgsql-hackers/2009-11/msg00695.php

The extension can be found on:
http://github.com/fgp/pg_record_inspect

This is what the extension currently provides (all in schema
record_inspect).

* fieldinfo [composite type]
  Used by fieldinfos() to describe a record's fields.
  Contains the fields
    fieldname (name),
    fieldtype (regclass),
    fieldtypemod (varchar)

* fieldinfo[] fieldinfos(record)
  Returns an array of fieldinfos describing the record's fields

* anyelement fieldvalue(record, field name, defval anyelement,
  coerce boolean)
  Returns the value of the field "field", or defval should the value
  be null. If coerce is true, the value is coerced to defval's type
  if possible; otherwise an error is raised if the field's type and
  defval's type differ.

* anyelement fieldvalues(record, defval anyelement, coerce boolean)
  Returns an array containing the values of the record's fields. NULL
  values are replaced by defval. If coerce is false, only the
  fields with the same type as defval are considered. Otherwise, the
  fields' values are coerced if possible, or an error is raised if not.
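
To give a feel for how this is meant to be used, here is a quick sketch
based on the signatures above (not copied from the actual module, so
details may differ):

create table t (id int, name text);
insert into t values (1, 'one');

select record_inspect.fieldinfos(t.*) from t;
-- one fieldinfo per column, i.e. (id, ...) and (name, ...)

select record_inspect.fieldvalue(t.*, 'name', NULL::text, false) from t;
-- 'one'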

The most hacky part of the code is probably coerceDatum() - needed to
coerce a field's value to the requested output type. I wanted to avoid
creating and parsing an actual SQL statement for every cast, and instead
chose to use coerce_to_target_type() to create the expression trees
representing casts. I use the node type CoerceToDomainValue to inject the
source value into the cast plan upon execution - see makeCastPlan() and
execCastPlan() for details. If anyone has a better idea, please speak up

I personally would like to see this becoming a contrib module one day,
but that of course depends on how much interest there is in such a feature.

best regards,
Florian Pflug





Re: [HACKERS] Serializable implementation

2009-12-28 Thread Florian G. Pflug

On 28.12.09 18:54 , Kevin Grittner wrote:

To give some idea of the scope of development, Michael Cahill added
SSI to InnoDB by modifying 250 lines of code and adding 450 lines of
 code; however, InnoDB already had the S2PL option and the prototype
 implementation isn't as sophisticated as I feel is necessary for
real production use (particularly regarding the granularity of SIREAD
locks).  I'm assuming it would take more to reach real production
quality in PostgreSQL.  My SWAG would be to multiply by two or
three.


I believe the hard part of implementing true serializability is not the
actual SSI or S2PL algorithm, but rather the necessary predicate locking
strategy.

So I think checking how InnoDB tackles that, and how much of its code is
involved, might give a more realistic estimate of the effort required.

best regards,
Florian Pflug





Re: [HACKERS] GSoC - proposal - Materialized Views in PostgreSQL

2010-04-11 Thread Florian G. Pflug

On 11.04.10 20:47 , Robert Haas wrote:

On Sun, Apr 11, 2010 at 10:26 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

Robert Haas wrote:

2010/4/10 Andrew Dunstanand...@dunslane.net:

Heikki Linnakangas wrote:

1. Keep the materialized view up-to-date when the base tables
change. This can be further divided into many steps, you can
begin by supporting automatic updates only on very simple
views with e.g a single table and a where clause. Then extend
that to support joins, aggregates, subqueries etc. Keeping it
really limited, you could even require the user to write the
required triggers himself.

That last bit doesn't strike me as much of an advance. Isn't
the whole point of this to automate it? Creating greedy
materialized views is usually not terribly difficult now, but
you do have to write the triggers.


Yeah, I agree.


It doesn't accomplish anything interesting on its own. But if you
do the planner changes to automatically use the materialized view
to satisfy queries (item 2. in my previous email), it's useful.


But you can't do that with a snapshot view, only a continuous updated
one.


If continuous updates prove to be too hard initially, you could instead
update the view on select if it's outdated. Such a materialized view
would be a kind of inter-session cache for subselects.

The hard part would probably be to figure out how to decide whether the
view is outdated or not, and to deal with two concurrent transactions
trying to use an outdated view (and both trying to refresh it). What
makes the second problem hard is that you wouldn't want one of the
transactions to wait for the other to complete, because this is not how
SELECTs traditionally behave.
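
To sketch what I mean by updating on select (all names made up; the
statement-level trigger on the base table that would set the stale flag
again is omitted, and the concurrency problem described above is left
unsolved):

create table orders (customer_id int, amount numeric);
create table order_totals (customer_id int, total numeric);
create table order_totals_meta (stale boolean not null);
insert into order_totals_meta values (true);

create function order_totals_fresh() returns setof order_totals as $$
begin
  if (select stale from order_totals_meta) then
    delete from order_totals;
    insert into order_totals
      select customer_id, sum(amount) from orders group by customer_id;
    update order_totals_meta set stale = false;
  end if;
  return query select * from order_totals;
end $$ language plpgsql;

select * from order_totals_fresh();  -- refreshes first if the view is outdated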

best regards, Florian Pflug





Re: [HACKERS] Reduction in WAL for UPDATEs

2007-03-28 Thread Florian G. Pflug

Kenneth Marshall wrote:

We use DSPAM as one of our anti-spam options. Its UPDATE pattern is to
increment a spam counter or a not-spam counter while keeping the user and
token information the same. This would benefit from this optimization.
Currently we are forced to use MySQL with MyISM tables to support the
update load, although PostgreSQL 8.2 performance is right at the I/O
break-even point for switching databases. With HOT and more optimized
UPDATE I/O, 8.3 would give us enough I/O headroom to switch to PostgreSQL.


Interesting. I've switched from MySQL to PostgreSQL for dspam, because
of concurrency issues with MyISAM which caused bad performance.

I am eager to see how much HOT speeds up my setup, though ;-)

BTW, the COMMIT NOWAIT feature Simon Riggs proposed should provide
a huge speedup too, since dspam runs one transaction for each token
it has to update.

greetings, Florian Pflug

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [HACKERS] CREATE INDEX and HOT - revised design

2007-03-28 Thread Florian G. Pflug

Simon Riggs wrote:

On Wed, 2007-03-28 at 22:24 +0530, Pavan Deolasee wrote:

Just when I thought we have nailed down CREATE INDEX, I realized 
that there something more to worry. The problem is with the HOT-chains

created by our own transaction which is creating the index. We thought
it will be enough to index the tuple at the head-of-the-chain since
that 
would be the visible copy once the transaction commits. We thought

of keeping the index unavailable for queries in pre-existing
transactions
by setting a new xid attribute in pg_index. The question is what
value 
to assign to xid. I though we would assign ReadNewTransactionId().



If you are indexing a table that hasn't just been created by you, set
the xcreate field on pg_index at the *end* of the build using
ReadNewTransactionId(). Any xid less than that sees the index as
invalid. If you created the table in this transaction (i.e.
createSubId != 0) then set xcreate to creating xid.


Couldn't you store the creating transaction's xid in pg_index, and
let other transactions check that against their snapshot like they
would for any tuple's xmin or xmax? (With one exception - the creating
transaction would consider indices it built itself invalid, which
is not how things usually work for xmin/xmax).

This would mean that any transaction that believes that the creating
transaction has committed also considers the index to be valid.

greetings, Florian Pflug


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] CREATE INDEX and HOT - revised design

2007-03-28 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug [EMAIL PROTECTED] writes:

Couldn't you store the creating transaction's xid in pg_index, and
let other transaction check that against their snapshot like they
would for any tuple's xmin or xmax?


What snapshot?  I keep having to remind people that system catalog
operations are SnapshotNow.  In the particular context here, the
place where doing something else would break down is that the planner
has no idea when it makes a plan what snapshot(s) the plan might later
be used with.


Sorry - now that you say it, I remember that you've said that already
multiple times...

So the question is: why has this worked until now, while CREATE INDEX + HOT
just doesn't seem to fit into this scheme?

I think the answer is that all other DDL statements manage to ensure
that any database objects they create or modify are usable for everybody
else immediately after they are committed. This usually implies pretty
strong locking requirements - for example, I think that the core reason
why TRUNCATE needs an exclusive lock is precisely that guarantee it has
to make.

Maybe this could somehow be relaxed? Could, for example, the planner
be allowed to base some of its decisions on the SerializableSnapshot
that every transaction (even a read-only one) possesses? It seems that
this would prevent plans from living longer than a transaction,
but maybe plan invalidation could help here?

greetings, Florian Pflug


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] CREATE INDEX and HOT - revised design

2007-03-28 Thread Florian G. Pflug

Pavan Deolasee wrote:

On 3/28/07, Tom Lane [EMAIL PROTECTED] wrote:


Florian G. Pflug [EMAIL PROTECTED] writes:
 Couldn't you store the creating transaction's xid in pg_index, and
 let other transaction check that against their snapshot like they
 would for any tuple's xmin or xmax?

What snapshot?  I keep having to remind people that system catalog
operations are SnapshotNow.  In the particular context here, the
place where doing something else would break down is that the planner
has no idea when it makes a plan what snapshot(s) the plan might later
be used with.


Tom, please correct me if I am wrong. But ISTM that this idea might
work in this context. In get_relation_info(), we would check if xcreate
xid stored in pg_index for the index under consideration is seen
committed with respect to the snapshot at that point of time.
Even if the snapshot changes later and index becomes valid, we
might not replan and hence not use index. But that doesn't seem
like a big problem to me.


The problem is use cases like
PREPARE my_plan  ;
BEGIN;
EXECUTE my_plan  ;
COMMIT ;

Is that PREPARE even run inside a transaction? Even if it is, it
probably won't have created a snapshot...

I think allowing the use of some sort of snapshot from inside the planner
would allow some locking to be relaxed, but there seem to be a lot of
corner cases to consider :-(

OTOH, if you manage to make this work, a TRUNCATE that doesn't block
concurrent selects might become possible. This would for example
allow dropping and rebuilding subscriptions on a slony node while it
is in use.

greetings, Florian Pflug

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] CREATE INDEX and HOT - revised design

2007-03-29 Thread Florian G. Pflug

Pavan Deolasee wrote:

In this specific context, this particular case is easy to handle because
we are only concerned about the serializable transactions started before
CREATE INDEX commits. If PREPARE can see the new index, it
implies that the CI transaction is committed. So the transaction
starting after than can only see the tuple version that we have indexed.


Yes, but the non-index plan PREPARE generated will be used until the end
of the session, not only until the end of the transaction. Imagine that
it wasn't explicitly PREPAREd (where you might say this is acceptable),
but rather just a query inside a plpgsql function, maybe even called
from some app using connection pooling. This means that the non-index-
using plan might get used for quite a long time, which contradicts the
work Tom did on plan invalidation, I think.

Maybe Tom can comment on whether it's possible to use plan invalidation
to eventually get rid of a stale plan in this context?

greetings, Florian Pflug


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] CREATE INDEX and HOT - revised design

2007-03-29 Thread Florian G. Pflug

Pavan Deolasee wrote:

On 3/29/07, Florian G. Pflug [EMAIL PROTECTED] wrote:


Yes, but the non-index plan PREPARE generated will be used until the end
of the session, nut only until the end of the transaction.


Frankly I don't know this works, but are you sure that the plan will
be used until the end of the session ? Even if thats the case, it can
happen even today if we create a new index, but the existing sessions
will use the stale plan (assuming what you said is true)


I've checked that:

test=# prepare myplan as select * from test where id=1 ;
PREPARE

test=# explain execute myplan ;
 QUERY PLAN

 Seq Scan on test  (cost=0.00..22897.70 rows=5421 width=36)
   Filter: (id = 1)
(2 rows)

 Now I create an index in another session 

test=# explain select * from test where id=1 ;
  QUERY PLAN
--
 Bitmap Heap Scan on test  (cost=95.11..8248.45 rows=5000 width=36)
   Recheck Cond: (id = 1)
   ->  Bitmap Index Scan on idx  (cost=0.00..93.86 rows=5000 width=0)
 Index Cond: (id = 1)
(4 rows)

test=# explain execute myplan ; 
  QUERY PLAN


 Seq Scan on test  (cost=0.00..22897.70 rows=5421 width=36)
   Filter: (id = 1)
(2 rows)

!!! Index got used by the select ..  but not by execute myplan ... !!!

test=# prepare myplan2 as select * from test where id=1 ;
PREPARE
test=# explain execute myplan2 ;
   QUERY PLAN
-
 Index Scan using idx on test  (cost=0.00..8.38 rows=1 width=37)
   Index Cond: (id = 1)
(2 rows)

!!! A newly prepared plan of course uses the index !!!



So yes, plans get cached until the end of the session, and
yes, 8.2 won't notice index creation either ;-)

The open question is how CVS HEAD with plan invalidation behaves.
If it replans after the index-creating transaction commits, then
basing index validity on a snapshot will break this, because upon
replanning the index might not yet be usable, but later on it may very
well be (and the plan invalidation machinery won't realize that).

So this might not introduce a regression compared to 8.2, but it would
compared to a future 8.3 with plan invalidation...

Sorry for being so unclear in my previous emails - I had confused
myself ;-)

greetings, Florian Pflug


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] CREATE INDEX and HOT - revised design

2007-03-29 Thread Florian G. Pflug

Tom Lane wrote:

Simon Riggs [EMAIL PROTECTED] writes:

ISTM that the run-another-transaction-afterwards idea is the only one
that does everything I think we need. I really do wish we could put in a
wait, like CIC, but I just think it will break existing programs.


Actually, there's a showstopper objection to that: plain CREATE INDEX
has to be able to run within a larger transaction.  (To do otherwise
breaks pg_dump --single-transaction, just for starters.)  This means
it can *not* commit partway through.


I believe the original idea was to invent some kind of "on commit, run
this transaction" hook - similar to how files are deleted on commit,
I think. At least I understood the "run another transaction on commit"
idea that way...

greetings, Florian Pflug


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] CREATE INDEX and HOT - revised design

2007-03-29 Thread Florian G. Pflug

Simon Riggs wrote:

On Thu, 2007-03-29 at 17:27 -0400, Tom Lane wrote:

Simon Riggs [EMAIL PROTECTED] writes:

ISTM that the run-another-transaction-afterwards idea is the only one
that does everything I think we need. I really do wish we could put in a
wait, like CIC, but I just think it will break existing programs.

Actually, there's a showstopper objection to that: plain CREATE INDEX
has to be able to run within a larger transaction.  (To do otherwise
breaks pg_dump --single-transaction, just for starters.)  This means
it can *not* commit partway through.


The idea is to make note that the transaction has created an index
within a transaction block, so that after the top level transaction
commits we sneak in an extra hidden transaction to update the pg_index
tuple with the xcreate of the first transaction. 


The only other alternative is to forcibly throw a relcache inval event
in the same circumstances without running the additional transaction,
but the solution is mostly the same.


I think one alternative might be to store a list of xid's together with
a cached plan, and replan if the commit status (as perceived by the
transaction the plan will be executed in) of one of those xid's changes.

greetings, Florian Pflug


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] CREATE INDEX and HOT - revised design

2007-03-30 Thread Florian G. Pflug

Tom Lane wrote:

Pavan Deolasee [EMAIL PROTECTED] writes:

How about storing the snapshot which we used during planning in
CachedPlanSource, if at least one index was seen unusable because
its CREATE INDEX transaction was seen as in-progress ?


I'm getting tired of repeating this, but: the planner doesn't use a
snapshot.  System catalogs run on SnapshotNow.


But it would still do that - it would just compare the createxid of
the index against some snapshot, and the query would be replanned
if the cached result of this comparison differs from the one the
current snapshot yields.

It might well be that this won't work, because the planner is invoked
in situations where there is no active snapshot - I'm not sure if your 
comment refers to that case, or not.


greetings, Florian Pflug

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [HACKERS] CREATE INDEX and HOT - revised design

2007-03-30 Thread Florian G. Pflug

Tom Lane wrote:

Pavan Deolasee [EMAIL PROTECTED] writes:

What I am suggesting is to use ActiveSnapshot (actually
Florian's idea) to decide whether the transaction that created
index was still running when we started. Isn't it the case that
some snapshot will be active when we plan ?


I do not think you can assume that the plan won't be used later with
some older snapshot.  Consider recursive plpgsql functions for a
counterexample: the inner occurrence might be the first to arrive at
a given line of the function, hence the first to plan it, yet when we
return to the outer instance we might revert to an older snapshot.


So maybe we'd need to use the SerializableSnapshot created at the start
of each transaction for this check, and not the ActiveSnapshot? Could
that work?

What about doing
PREPARE myplan select ... ;
outside of a transaction? Will this be executed inside a transaction?
Is a query always planned upon its first execution, and not when
PREPARE is issued?

greetings, Florian Pflug



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] CREATE INDEX and HOT - revised design

2007-03-30 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug [EMAIL PROTECTED] writes:

Tom Lane wrote:

I do not think you can assume that the plan won't be used later with
some older snapshot.



So maybe we'd need to use the SerializableSnapshot created at the start
of each transaction for this check, and not the ActiveSnapshot? Could
that work?


That might work, but it doesn't seem to address the core objection:
there's no mechanism to cause the query to be replanned once the
snapshot is new enough, because no relcache inval will happen.  So
most likely existing backends will keep using old plans that don't
consider the index.


Pavan suggested storing the IndexSnapshot in the cached plan, and
comparing it to the then-current IndexSnapshot when the query is executed.

If those two snapshots differ, the query would be replanned.

My idea was to store, together with the cached plan, a list of xid's that
are assumed to be uncommitted according to the IndexSnapshot. The query
is replanned if, upon execution, the IndexSnapshot assumes that one of
these xid's is committed.

Those two ideas seem to be mostly equivalent, mine seems to be a bit
more fine-grained, but at the cost of more work upon each query execution.

greetings, Florian Pflug

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] CREATE INDEX and HOT - revised design

2007-03-30 Thread Florian G. Pflug

Pavan Deolasee wrote:

On 3/30/07, Florian G. Pflug [EMAIL PROTECTED] wrote:


My idea was to store a list of xid's together with the cached plan that
are assumed to be uncommitted accoring to the IndexSnapshot. The query
is replanned if upon execution the IndexSnapshot assumes that one of
these xid's is committed.



Actually, if we are using Serializable Snapshot then there is no chance
to replan the query before the transaction completes and the next
transaction to start in the session must see the index and hence
we must replan. So it would be enough just to associate a transaction
id with the cached plan. If this xid is set and our transaction id is
different than that, we replan.


I believe this is true for the CREATE INDEX scenario. However, comparing
either the snapshot or the result of xid checks seems like it might
be useful for other things besides CREATE INDEX. I'm specifically
thinking about TRUNCATE here - the CREATE INDEX + HOT problems sound
quite similar to the problems a non-exclusive-locking TRUNCATE would face.

greetings, Florian Pflug

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] Minor changes to Recovery related code

2007-03-30 Thread Florian G. Pflug

Simon Riggs wrote:

On Fri, 2007-03-30 at 16:34 -0400, Tom Lane wrote:

Simon Riggs [EMAIL PROTECTED] writes:

2. pg_stop_backup() should wait until all archive files are safely
archived before returning

Not sure I agree with that one.  If it fails, you can't tell whether the
action is done and it failed while waiting for the archiver, or if you
need to redo it.


There's a slight delay between pg_stop_backup() completing and the
archiver doing its stuff. Currently if somebody does a -m fast straight
after the pg_stop_backup() the backup may be unusable.

We need a way to plug that small hole.

I suggest that pg_stop_backup() polls once per second until
pg_xlog/archive_status/LOG.ready disappears, in which case it ends
successfully. If it does this for more than 60 seconds it ends
successfully but produces a WARNING.


I fear that ending successfully despite not having archived all WALs
will make this feature less worthwhile. If a DBA knows what he is
doing, he can code a perfectly safe backup script using 8.2 too.
He'll just have to check the current WAL position after pg_stop_backup()
(there is a function for that, right?), and wait until the corresponding
WAL segment has been archived.
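
For reference, that manual dance boils down to something like this
(quoting from memory, so the exact function names may be off):

select pg_xlogfile_name(pg_stop_backup());
-- returns the name of the WAL segment the backup depends on; the script
-- then polls the archive until a file with that name has appeared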

In reality, however, I fear that most people will just create a script
that does 'echo select pg_stop_backup() | psql' or something similar.
If they're a bit more careful, they will enable ON_ERROR_STOP and check
the return value of psql. I believe that those are the people who would
really benefit from a pg_stop_backup() that waits for archiving to complete.
But they probably won't check for WARNINGs.

Maybe doing it the other way round would be an option?
pg_stop_backup() could wait for the archiver forever, but
spit out a warning every 60 seconds or so ("WARNING: still waiting
for WAL segment ??? to be archived"). If someone really wants
a 60-second timeout, he can just use statement_timeout.
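
I.e. with the wait-forever variant, someone who wants the 60-second
behaviour could simply do:

set statement_timeout = 60000;  -- 60 seconds, in milliseconds
select pg_stop_backup();        -- cancelled if archiving hasn't finished in time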

Anyway, just my 0.02 eurocents - maybe I'm totally mistaken about the
PostgreSQL DBAs out there...

greetings, Florian Pflug

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


[HACKERS] Fate of pgsnmpd

2007-04-06 Thread Florian G. Pflug

Hi

Does anyone know if pgsnmpd is still actively developed?
The last version (0.1b1) is about 15 months old.

greetings, Florian Pflug

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] select ('{}'::text[])[1] returns NULL -- is it correct?

2007-04-10 Thread Florian G. Pflug

Nikolay Samokhvalov wrote:

On 4/10/07, Tom Lane [EMAIL PROTECTED] wrote:

Nikolay Samokhvalov [EMAIL PROTECTED] writes:
 I remember several cases when people (e.g. me :-) ) were spending some
 time trying to find an error in some pl/pgsql function and the reason
 lied in incorrect work with arrays (i.e. messages like index is out
 of bounds and index cannot be negative number would help, surely).

Well, if indexes *couldn't* be negative numbers then that might be
helpful, but they can.


Ooops :-) OK, my proposal is narrowing to very simple one: what about
triggering WARNINGs when user tries to access nonexistent element of
array?


Please don't ;-)
There are two sane options - return an error, or return NULL. Both are
sensible, and different programming languages make different choices.
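
For reference, the current behaviour (unchanged for a long time, if
memory serves):

select ('{}'::text[])[1];                -- NULL, no error
select ('[-2:-1]={x,y}'::text[])[-2];    -- 'x': negative subscripts are perfectly legal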

The only reason for a WARNING would be a long-term plan to change the
existing behaviour. But this will cause lots of pain, for no real gain,
because no matter which behaviour you pick, there are always situations
where the other would be more convenient.

Just look at the mess PHP has created by altering fundamental aspects
of the language (4.4 - 5.0).

greetings, Florian Pflug

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [HACKERS] [DOCS] uuid type not documented

2007-04-10 Thread Florian G. Pflug

Neil Conway wrote:

On Tue, 2007-04-10 at 18:28 +0200, Peter Eisentraut wrote:
The problem is that most of the standard methods are platform dependent, as 
they require MAC addresses or a good random source, for instance.


http://archives.postgresql.org/pgsql-patches/2007-01/msg00392.php

ISTM random() or similar sources is a sufficient PSRNG for the purposes
of UUID generation -- I can't see anything in the RFC that would
contradict that.


Maybe a short-term solution could be a UUID-generating function that
takes some kind of seed as a parameter. People not concerned about
collisions could just pass some random value, while others could use
the MAC address of the client or something similar.
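
Something along these lines, say - just a sketch, the function name is
made up, and since it doesn't set the RFC 4122 version/variant bits it's
not a proper UUID generator, merely a uuid-typed value derived from the
seed:

create function make_pseudo_uuid(text) returns uuid as $$
  select (substr(h, 1, 8)  || '-' || substr(h, 9, 4)  || '-' ||
          substr(h, 13, 4) || '-' || substr(h, 17, 4) || '-' ||
          substr(h, 21, 12))::uuid
  from (select md5($1 || clock_timestamp()::text || random()::text) as h) s;
$$ language sql volatile;

select make_pseudo_uuid('00:16:cb:a4:2d:7f');  -- seeded with a MAC address, say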

greetings, Florian Pflug


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


[HACKERS] What tools do people use to hack on PostgreSQL?

2007-04-12 Thread Florian G. Pflug

Hi

I'm very excited that my project for implementing read-only queries
on PITR slaves was accepted for GSoC, and I'm now trying to work
out what tools I'll use for that job.

I'd like to be able to create some sort of branches and tags for
my own work (only inside my local repository of course).

I've considered using git, but I couldn't make the cvs-git gateway
work - neither using the postgresql CVS repository directly, nor with
a private copy obtained with CVSup.

There is also svk, but I think I'd need an SVN repo that mirrors
the postgresql CVS for that to work. I think Joshua Drake created
one once, but I don't know if it is kept up-to-date.

What do you guys use for your development work?

greetings, Florian Pflug


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] What tools do people use to hack on PostgreSQL?

2007-04-12 Thread Florian G. Pflug

Joshua D. Drake wrote:

Alexey Klyukin wrote:

Alvaro Herrera wrote:

But if you have a checked out tree, does it work to do an update after
the tree has been regenerated?  As far as I know, the repo is generated
completely every few hours, so it wouldn't surprise me that the checked
out copy is not compatible with the new repo.

I admit I haven't tried.
  

I have tried and svn up worked without issues.


As a note we will be updating this to subversion 1.4 shortly so people 
can do svnsync too.


Do I read this correctly as "this repository will stay around for a while,
and isn't just an experiment that might be stopped tomorrow"? If so, I'll
try using it - and lots of thanks for providing it!

greetings, Florian Pflug


---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


[HACKERS] Build-Problem with pgc.c on OSX 10.4

2007-04-15 Thread Florian G. Pflug

Hi

When I try to build CVS HEAD on OSX 10.4, compiling
src/interfaces/ecpg/preproc/preproc.c fails with:
In file included from preproc.y:6951:
pgc.l:3:20: error: config.h: No such file or directory
In file included from pgc.l:28,
 from preproc.y:6951:
preproc.h:996: error: conflicting types for 'base_yylloc'
y.tab.c:18673: error: previous declaration of 'base_yylloc' was here
In file included from preproc.y:6951:
pgc.l:43: error: 'MAX_PARSE_BUFFER' undeclared here (not in a function)

If I delete pgc.c, it is rebuilt automatically, and then
preproc.c compiles just fine.

I'm using gcc 4.0.1, flex 2.5.4 and bison 2.3

greetings, Florian Pflug

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] Build-Problem with pgc.c on OSX 10.4

2007-04-15 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug [EMAIL PROTECTED] writes:

When I try to build CVS HEAD on OSX 10.4, compiling
src/interfaces/ecpg/preproc/preproc.c fails with:
...
If I delete pgc.c, it is rebuilt automatically, and then
preproc.c compiles just fine.
...
I'm using gcc 4.0.1, flex 2.5.4 and bison 2.3


Perhaps you changed bison versions and didn't force a rebuild?
Those line numbers don't seem to sync up with my copies of the
derived files.


I just realized that this file isn't even in the postgresql CVS
repo. But it _is_ part of the SVN mirror at
https://projects.commandprompt.com/public/pgsql/repo.

The version that shows up in the trunk of the SVN repo is
the revision 1.5 from CVS
(http://developer.postgresql.org/cvsweb.cgi/pgsql/src/interfaces/ecpg/preproc/Attic/pgc.c?rev=1.5;content-type=text%2Fplain;hideattic=0)

This is the same as
https://projects.commandprompt.com/public/pgsql/repo/trunk/pgsql/src/interfaces/ecpg/preproc/pgc.c
modulo the expansion of the $Header macro.

Seems to be a bug in the CVS-SVN conversion process...

greetings, Florian Pflug


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] Build-Problem with pgc.c on OSX 10.4

2007-04-16 Thread Florian G. Pflug

Alvaro Herrera wrote:

Ah, it seems the SVN repo just got its first user ;-)  Congratulations.
Ask Joshua to send you a Command Prompt tee shirt, maybe he is excited
enough.


I hope the fact that I use the SVN repo just to get the changes into
git doesn't reduce my chances of getting that t-shirt ;-)

greetings, Florian Pflug

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Build-Problem with pgc.c on OSX 10.4

2007-04-16 Thread Florian G. Pflug

Alvaro Herrera wrote:

Florian G. Pflug wrote:

Alvaro Herrera wrote:

Ah, it seems the SVN repo just got its first user ;-)  Congratulations.
Ask Joshua to send you a Command Prompt tee shirt, maybe he is excited
enough.

I hope the fact that I use the SVN repo just to get the changes into
git doesn't reduce my chances of getting that t-shirt ;-)


Hum, why don't you just use the CVS directly then?  That'd avoid this
sort of infelicities.


git-cvsimport didn't work for me - neither with the main CVS repo, nor
with a rsync'ed copy.
It complained about all sorts of problems - I don't have enough CVS know-how
to judge whether those were actual problems with the repo, or just deficiencies
of git-cvsimport. Plus, I didn't find a way to import the current version of
HEAD as one revision and only go incrementally from there. It always wanted
to mirror the whole history stored in the CVS into my git repo, which is
overkill.

For SVN, there is git-svn, which does just what I want - I started with some
revision a few days ago, and it just incrementally imports updates from there
into a special branch of my git repo, and doesn't care about what happened
before that revision.

It's all not perfect, but I think for me it works better than just doing my
changes in a CVS checkout.

greetings, Florian Pflug

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] What tools do people use to hack on PostgreSQL?

2007-04-16 Thread Florian G. Pflug

Joshua D. Drake wrote:

http://projects.commandprompt.com/public/pgsql/browser

or do the anonymous checkout with:

svn co http://projects.commandprompt.com/public/pgsql/repo/


But if you have a checked out tree, does it work to do an update after
the tree has been regenerated?  As far as I know, the repo is generated
completely every few hours, so it wouldn't surprise me that the checked
out copy is not compatible with the new repo.

I admit I haven't tried.


Not a clue. Anyone try it yet?


git-svn seems to work fine against the SVN repo, apart from the problem with
the files deleted in CVS which still show up in SVN.

It's only been running for about two days though...

greetings, Florian Pflug

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Hacking on PostgreSQL via GIT

2007-04-16 Thread Florian G. Pflug

Martin Langhoff wrote:

Hi Florian,

I am right now running an rsync of the Pg CVS repo to my work machine to
get a git import underway. I'm rather keen on seeing your cool PITR Pg
project go well and I have some git+cvs fu I can apply here (being one
of the git-cvsimport maintainers) ;-)

Cool - I'm new to git, so I really appreciate any help that I can get.


For the kind of work you'll be doing (writing patches that you'll want
to be rebasing onto the latest HEAD for merging later) git is probably
the best tool. That's what I use it for... tracking my experimental /
custom branches of projects that use CVS or SVN :-)

That's how I figured I'd work - though I don't yet understand what
the advantage of rebase is over merge.

Currently, I've setup a git repo that pulls in the changes from the SVN
repo, and pushed them to my main soc git repo. On that main repo I have
two branches, master and pgsql-head, and I call cg-merge pgsql-head
if I want to merge with CVS HEAD.


Initially, I'll post it on http://git.catalyst.net.nz/ and I can run a
daily import for you - once that's in place you can probably get a repo
with your work on http://repo.or.cz/

Having a git mirror of the pgsql CVS would be great.
BTW, I've just checked out repo.or.cz, and noticed that there is already a
git mirror of the pgsql CVS: http://repo.or.cz/w/PostgreSQL.git

greetings + thanks
Florian Pflug

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Hacking on PostgreSQL via GIT

2007-04-16 Thread Florian G. Pflug

Aidan Van Dyk wrote:

Martin Langhoff wrote:
Well, now that more than one of us are working with git on PostgreSQL...

I've had a repo conversion running for a while...  I've only got it to what
I consider stable last week:
http://repo.or.cz/w/PostgreSQL.git
git://repo.or.cz/PostgreSQL.git

Ah - that's what I just stumbled over ;-)


For those interested int he conversion process, I've used a slightly
modified version of fromcvs (A ruby cvs to git/Hg tool), and it runs on all
of pgsql in about 20 minutes.

I gave up on git-svn (because of both speed and my in-ablility to
easy filter out Keywords, etc) and git-cvsimport (because cvsps doesn't
seem to like pgsql's repo)

Yeah, git-cvsimport didn't work for me either...

I update the git repo daily, based on an anonymous rsync of the cvsroot. 
If the anon-rsync is updated much more frequently, and people think my git

conversion should match it, I have no problem having cron run it more than
daily.

Also - note that I give *no* guarentees of it's integrity, etc.

I've diffed a CVS checkout and a git checkout, and the are *almost*
identical.  Almost, because it seems like my git repository currently has 3
files that a cvs checkout doesn't:
 backend/parser/gram.c |12088 +++
 interfaces/ecpg/preproc/pgc.c | 2887 ++
 interfaces/ecpg/preproc/preproc.c |16988 ++

And at this point, I haven't been bothered to see where those files came
from (and where they dissapear) in CVS and why my import isn't picking that
up...  I could probably be pushed if others find this repo really useful,
but those files problematic...

That's interesting - the SVN mirror of the pgsql CVS at
http://projects.commandprompt.com/public/pgsql/browser
has exactly the same problem with those 3 files, as I found out the hard way ;-)

In the case of pgc.c, I've compared the revisions in CVS with the one in
SVN. SVN includes CVS version 1.5 of this file in trunk, which seems to
be the last version of that file in CVS HEAD. Interestingly,
http://developer.postgresql.org/cvsweb.cgi/pgsql/src/interfaces/ecpg/preproc/Attic/pgc.c
shows no trace of the file being deleted from HEAD either - it just shows
that it was removed from WIN32_DEV. But still a CVS checkout doesn't include 
that file...


Since 3 tools (cvsweb, git-cvsimport and whatever commandprompt uses to create
the SVN mirror) all come to the same conclusion regarding this file, I think
that this is caused by some corruption of the CVS repository - but I don't have
the cvs-fu to debug this...

greetings, Florian Pflug


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Re: IDENTITY/GENERATED v36 Re: [PATCHES] Final version of IDENTITY/GENERATED patch

2007-04-16 Thread Florian G. Pflug

Zoltan Boszormenyi wrote:

Tom Lane Ă­rta:

Zoltan Boszormenyi [EMAIL PROTECTED] writes:

Also, the current grammar is made to give a syntax error
if you say colname type GENERATED BY DEFAULT AS ( expr ).
But it makes the grammar unbalanced, and gives me:
bison -y -d  gram.y
conflicts: 2 shift/reduce


It's been quite a while since I last used bison, but as far as I
remember, you can tell it to write a rather detailed log of
its analysis of the grammar. That log should include more
detailed information about those conflicts - maybe that helps
to figure out their exact cause, and to find a workaround.

greetings, Florian Pflug


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Hacking on PostgreSQL via GIT

2007-04-16 Thread Florian G. Pflug

Tom Lane wrote:

Alvaro Herrera [EMAIL PROTECTED] writes:

These files are generated (from gram.y, pgc.l and preproc.y
respectievly) and are not present in the CVS repo, though I think they
have been at some point.



It's strange that other generated files (that have also been in the repo
in the past) like preproc.h are not showing up.


The weird thing about these files is that the CVS history shows commits
on HEAD later than the file removal commit.  I don't recall if Vadim
unintentionally re-added the files before making those commits ... but
if he did, you'd think it'd have taken another explicit removal to get
rid of them in HEAD.  More likely, there was some problem in his local
tree that allowed a cvs commit to think it should update the
repository with copies of the derived files he happened to have.

I think this is a corner case that CVS handles in a particular way and
the tools people are using to read the repository handle in a different
way.  Which would be a bug in those tools, since CVS's interpretation
must be right by definition.


The question is whether it'd be acceptable to manually remove that last commit
from the repository. I guess simply re-adding and then removing the files
again should do the trick, though it'd be cleaner to remove the
offending commit in the first place. Should postgres ever decide to switch
to another version control system (which I don't advocate), that'd be
one obstacle less to deal with...

Or is the risk of causing breakage too high?

greetings, Florian Pflug

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] RESET command seems pretty disjointed now

2007-04-16 Thread Florian G. Pflug

Tom Lane wrote:

Mark Kirkwood [EMAIL PROTECTED] writes:

Tom Lane wrote:

The current documentation for RESET exhibits a certain lack of, um,
intellectual cohesiveness:

Synopsis

RESET configuration_parameter
RESET ALL
RESET { PLANS | SESSION | TEMP | TEMPORARY }


Maybe DISCARD for the plans etc might be more intuitive than extending 
RESET?


DISCARD PLANS and DISCARD TEMP seem pretty reasonable, but DISCARD SESSION
sounds a bit odd --- it seems like it might mean disconnect, which of
course is exactly what we're trying to avoid.  But possibly we could
rename RESET SESSION as DISCARD ALL.

Leastwise I haven't got any better ideas.  Anyone have another proposal?


What about
RESET parameter
RESET { PLANS | TEMP | TEMPORARY }
RESET ALL { PARAMETERS | STATE }

RESET ALL would become an abbreviation of RESET ALL PARAMETERS (for backwards
compatibility), while RESET SESSION would be renamed to RESET ALL STATE.

greetings, Florian Pflug


---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] Hacking on PostgreSQL via GIT

2007-04-17 Thread Florian G. Pflug

Martin Langhoff wrote:

Aidan Van Dyk wrote:

And remember the warning I gave that my conversion is *not* a direct CVS
import - I intentionally *unexpand* all Keywords before stuffing them
into GIT so that merging and branching can ignore all the Keyword
conflicts... 


My import is unexpanding those as well to support rebasing and merging
better.

So - if you are committed to providing your gateway long term to
Florian, I'm happy to drop my gateway in favour of yours.


There seem to be people other than me who are interested in a git
mirror. Maybe we could declare one of those mirrors the
"official" one - I guess things would be easier if everybody
interested in using git used the same mirror...

What do you guys think?


(Florian, before basing your code on either you should get a checkout of
Aidan's and mine and check that the tips of the branches you are working
on match the cvs branches -- the cvsimport code is good but whereever
CVS is involved, there's a lot of interpretation at play, a sanity check
is always good).

I actually hoped that I could just take my current git repo, and rebase
my branch onto one of those two repos - or does rebase only work from
an ancestor to a descendant?

greetings, Florian Pflug

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


[HACKERS] [PATCH] A crash and subsequent recovery of the master can cause the slave to get out-of-sync

2007-04-19 Thread Florian G. Pflug

Hi

I believe I have discovered the following problem in pgsql 8.2 and HEAD,
concerning warm-standbys using WAL log shipping.

The problem is that after a crash, the master might complete incomplete
actions via rm_cleanup() - but since it won't WAL-log those changes,
the slave won't know about this. This will at least prevent the creation
of any further restart points on the slave (because safe_restartpoint()
will never return true again) - it might even cause data corruption,
if subsequent WAL records are interpreted wrongly by the slave because
it sees different data than the master did when it generated them.

Attached is a patch that lets RecoveryRestartPoint call all
rm_cleanup() methods and create a restart point whenever it encounters
a shutdown checkpoint in the WAL (because those are generated after
recovery). This ought not to cause any performance degradation, because
shutdown checkpoints occur very infrequently.

The patch is per discussion with Simon Riggs.

I've not yet had a chance to test this patch; I've only made sure
that it compiles. I'm sending it out now because I hope it
might make it into 8.2.4.

greetings, Florian Pflug
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6c67821..93c86a1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5060,10 +5060,13 @@ #endif
 		 * Perform a checkpoint to update all our recovery activity to disk.
 		 *
 		 * Note that we write a shutdown checkpoint rather than an on-line
-		 * one. This is not particularly critical, but since we may be
-		 * assigning a new TLI, using a shutdown checkpoint allows us to have
-		 * the rule that TLI only changes in shutdown checkpoints, which
-		 * allows some extra error checking in xlog_redo.
+		 * one. A slave will always create a restart point if it sees a
+		 * shutdown checkpoint, and will call all rm_cleanup() methods before
+		 * it does so. This guarantees that any actions taken by the master
+		 * in rm_cleanup will also be carried out on the slave.
+		 * Additionally, we may be assigning a new TLI, so using a shutdown
+		 * checkpoint allows us to have the rule that TLI only changes in shutdown
+		 * checkpoints, which allows some extra error checking in xlog_redo.
 		 */
 		CreateCheckPoint(true, true);
 
@@ -5672,35 +5675,56 @@ CheckPointGuts(XLogRecPtr checkPointRedo
  * restartpoint is needed or not.
  */
 static void
-RecoveryRestartPoint(const CheckPoint *checkPoint)
+RecoveryRestartPoint(const CheckPoint *checkPoint, const bool shutdownCheckpoint)
 {
 	int			elapsed_secs;
 	int			rmid;
 
 	/*
-	 * Do nothing if the elapsed time since the last restartpoint is less than
-	 * half of checkpoint_timeout.	(We use a value less than
-	 * checkpoint_timeout so that variations in the timing of checkpoints on
-	 * the master, or speed of transmission of WAL segments to a slave, won't
-	 * make the slave skip a restartpoint once it's synced with the master.)
-	 * Checking true elapsed time keeps us from doing restartpoints too often
-	 * while rapidly scanning large amounts of WAL.
+	 * If the checkpoint we saw in the wal was a shutdown checkpoint, it might
+	 * have been written after the recovery following a crash of the master.
+	 * In that case, the master will have completed any actions that were
+	 * incomplete when it crashed *during recovery*, and these completions
+	 * are therefore *not* logged in the wal.
+	 * To prevent getting out of sync, we follow what the master did, and
+	 * call the rm_cleanup() methods. To be on the safe side, we then create
+	 * a RestartPoint, regardless of the time elapsed. Note that asking
+	 * the resource managers if they have partial state would be redundant
+	 * after calling rm_cleanup().
 	 */
-	elapsed_secs = time(NULL) - ControlFile->time;
-	if (elapsed_secs < CheckPointTimeout / 2)
-		return;
+	if (shutdownCheckpoint) {
+		for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
+		{
+			if (RmgrTable[rmid].rm_cleanup != NULL)
+				RmgrTable[rmid].rm_cleanup();
+		}
+	}
+	else {
+		/*
+		 * Do nothing if the elapsed time since the last restartpoint is less than
+		 * half of checkpoint_timeout.	(We use a value less than
+		 * checkpoint_timeout so that variations in the timing of checkpoints on
+		 * the master, or speed of transmission of WAL segments to a slave, won't
+		 * make the slave skip a restartpoint once it's synced with the master.)
+		 * Checking true elapsed time keeps us from doing restartpoints too often
+		 * while rapidly scanning large amounts of WAL.
+		 */
+		elapsed_secs = time(NULL) - ControlFile->time;
+		if (elapsed_secs < CheckPointTimeout / 2)
+			return;
 
-	/*
-	 * Is it safe to checkpoint?  We must ask each of the resource managers
-	 * whether they have any partial state information that might prevent a
-	 * correct restart from this point.  If so, we skip this opportunity, but
-	 * return at the next checkpoint record for another try.
-	 */
 	for (rmid = 0; rmid <= 

Re: [HACKERS] [PATCH] A crash and subsequent recovery of the master can cause the slave to get out-of-sync

2007-04-23 Thread Florian G. Pflug

Simon Riggs wrote:

On Thu, 2007-04-19 at 22:37 +0200, Florian G. Pflug wrote:

The problem is that after a crash, the master might complete incomplete
actions via rm_cleanup() - but since it won't wal-log those changes,
the slave won't know about this. This will at least prevent the creation
of any further restart points on the slave (because safe_restartpoint()
will never return true again) - it might even cause data corruption,
if subsequent wal records are interpreted wrongly by the slave because
it sees different data than the master did when it generated them.


I agree the problem exists. It is somewhat rare because the idea is that
if the Primary crashes we would failover to the Standby, which would
mean that both Primary and Standby have executed rm_cleanup(), if
needed.

So in the case where the Primary fails and we choose *not* to failover,
there is a potential difficulty on the Standby.


It occurred to me today that this might plague 8.1 too. As you explain below,
the problem is not really connected to restartpoints - failing to create
them is merely a symptom of this. On 8.1, this might still lead to rm_cleanup()
being called much later (if you consider the wal position to be the time)
on the slave than on the master. I dunno if this really causes trouble - I
don't yet understand the btree code well enough to judge this.


The rationale for this fix could be described somewhat differently:

When we shutdown, we know for certain that safe_restartpoint() is true.
However, we don't know whether it is true because we successfully did a
clean shutdown, or because we crashed, recovered and then issued a
shutdown checkpoint following recovery. In the latter case we *must*
execute rm_cleanup() on the standby because it has been executed on the
primary. Not doing so at this point *might* be safe, but is not certain
to be safe. We don't *need* to log a restartpoint at this time, but it
seems sensible to do so.


While creating the patch, I've been thinking that it might be worthwhile
to note in the ShutdownCheckpoint that we just did recovery
(or to create a new checkpoint type RecoveryCheckpoint). This would allow
for more error checking, because then the slave could check that
safe_restartpoint() is true for all ShutdownCheckpoints that were not
written after recovery.


We need to check that rm_cleanup() routines don't assume that they will
only ever be called once or this will clearly fail. There is also no
need to call rm_cleanup() unless rm_safe_restartpoint() is false.


But a non-idempotent rm_cleanup() routine will cause trouble anyway,
if postgres crashes after having called rm_cleanup() but before creating
the ShutdownCheckpoint.
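
To illustrate what idempotent means here - roughly something like this sketch
(not the actual btree code; the list name is made up):

/* Pending multi-record actions collected during replay. */
static List *incomplete_actions = NIL;

static void
example_rm_cleanup(void)
{
    /*
     * Completing each pending action and removing it from the list makes a
     * second call a no-op: the list is then empty and the loop never runs.
     */
    while (incomplete_actions != NIL)
    {
        /* ... complete the multi-record action described by the first entry ... */
        incomplete_actions = list_delete_first(incomplete_actions);
    }
}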

greetings, Florian Pflug


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Hacking on PostgreSQL via GIT

2007-04-30 Thread Florian G. Pflug

Martin Langhoff wrote:

So - if you are committed to providing your gateway long term to
Florian, I'm happy to drop my gateway in favour of yours.



(Florian, before basing your code on either you should get a checkout of
Aidan's and mine and check that the tips of the branches you are working
on match the cvs branches -- the cvsimport code is good but whereever
CVS is involved, there's a lot of interpretation at play, a sanity check
is always good).


Sorry for responding so late - I was rather busy during the last 1 1/2 weeks
with university stuff, and had only very little time to spend on SoC.

I've tried to switch my repo to both git mirrors, but there seems to be 
something strange happening. The checkout pulls a _lot_ of objects
(a few hundred thousand), and then takes ages to unpack them all, bloating
my local repository (Just rm-ing my local repo takes a few minutes after
the checkout).

It seems as if git pulls all revisions of all files during the pull -
which it shouldn't do as far as I understand things - it should only
pull those objects referenced by some head, no?

The interesting thing is that exactly the same problem occurs with
both of your mirrors...

Any ideas? Or is this just how things are supposed to work?

greetings, Florian Pflug


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Hacking on PostgreSQL via GIT

2007-04-30 Thread Florian G. Pflug

Aidan Van Dyk wrote:

* Florian G. Pflug [EMAIL PROTECTED] [070430 08:58]:
 

It seems as if git pulls all revisions of all files during the pull -
which it shouldn't do as far as I understand things - it should only
pull those objects referenced by some head, no?


Git pulls full history to a common ancestor on the clone/pull.   So the
first pull on a repo *will* necessarily pull in the full object history.
So unless you have a recent common ancestor, it will pull lots.  Note
that because git uses crypto hashes to identify objects, my conversion
and Martin's probably do not have a recent common ancestor (because my
header munging probably doesn't match Martin's exactly).

Ah, OK - that explains things.


The interesting thing is that exactly the same problem occurs with
both of your mirrors...

Any ideas? Or is this just how things are supposed to work?


Until you have a local repository of it, you'll need to go through the
full pull/clone.  If you're really not interested in history you can
truncate history with the --depth option to git clone.  That will give
you a shallow repository, which you can use, develop, branch, etc in,
but won't give you all the history locally.

I'll retry with the --depth option - I'm doing development on my powerbook,
and OSX seems to cope badly with lots of little files - the initial unpacking
took hours - literally..


Also - what version of GIT are you using?  I *really* recommend using at
least 1.5 (1.5.2.X is current stable).  Please, do your self a favour,
and don't use 1.4.4.

I'm using 1.5.0  currently - it was the latest stable release when I began
to experiment with git.

greetings, Florian Pflug

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [HACKERS] Heap page diagnostic functions

2007-05-01 Thread Florian G. Pflug

Zdenek Kotala wrote:
I did not find forensics in translator and It mentions in Oxford 
vocabulary but explanation is not clear for me. I agree with Bruce It is 
not good name. What about short form of diagnostic diag?


Doesn't forensics basically mean to find the cause of something
*after* it happened, based on traces that the event left behind?
Like finding the culprit of a crime using, for example, the
fingerprints he left, or tracing the actions of an intruder
by analyzing logfiles or modified binaries?

In that case, it doesn't accurately describe those functions anyway
I think, because you call them from inside the database while it's
running, not from the outside after it was stopped or crashed.

Just the 2 eurocents of a non-native speaker...
Greetings, Florian Pflug



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] Concurrently updating an updatable view

2007-05-14 Thread Florian G. Pflug

Richard Huxton wrote:

Richard Huxton wrote:

Heikki Linnakangas wrote:
The problem is that the new tuple version is checked only against the 
condition in the update rule, id=OLD.id, but not the condition in the 
original UPDATE clause, dt='a'.


Yeah, that's confusing :(.


Bit more than just normal rule confusion I'd say. Try the following 
two statements in parallel (assuming you've just run the previous):


UPDATE test SET dt='c';
UPDATE test SET dt='x' FROM test t2 WHERE test.id=t2.id AND t2.dt='b';

This isn't a problem with the view mechanism - it's a problem with 
re-checking clauses involving subqueries or joins I'd guess.


I'm trying to decide if it's unexpected or just plain wrong, and I 
think I'd have to argue wrong.


Or perhaps I'd not argue that :-/

Well, src/backend/executor/README agrees with you that it's wrong..

Note a fundamental bogosity of this approach: if the relation containing
the original tuple is being used in a self-join, the other instance(s) of
the relation will be treated as still containing the original tuple, whereas
logical consistency would demand that the modified tuple appear in them too.
But we'd have to actually substitute the modified tuple for the original,
while still returning all the rest of the relation, to ensure consistent
answers.  Implementing this correctly is a task for future work.

This is really about MVCC in read committed mode, and the just right 
for simpler cases:
http://www.postgresql.org/docs/8.2/interactive/transaction-iso.html#XACT-READ-COMMITTED 

Clearly there needs to be a change to the sentence: Because of the 
above rule, it is possible for an updating command to see an 
inconsistent snapshot: it can see the effects of concurrent updating 
commands that affected the same rows it is trying to update


Not true if there's a subquery/join involved.

If the cited part of the README is correct, then all joins and subqueries
are fine, except if they refer to the table being updated.

I think there should be a big, fat warning that self-referential
updates have highly non-obvious behaviour in read-committed mode,
and should be avoided.

greetings, Florian Pflug



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Concurrently updating an updatable view

2007-05-15 Thread Florian G. Pflug

Richard Huxton wrote:

Hiroshi Inoue wrote:

Florian G. Pflug wrote:


I think there should be a big, fat warning that self-referential
updates have highly non-obvious behaviour in read-committed mode,
and should be avoided.


It seems pretty difficult for PostgreSQL rule system to avoid such
 kind of updates. I'm suspicious if UPDATABLE VIEWS can be implemented
 using the rule system.


Remember this affects all self-referential joins on an UPDATE (and 
DELETE?) not just views. It's just that a rule is more likely to produce 
that type of query.


Is there consensus on what the correct behaviour should be for
self-referential updates in read-committed mode? Does the SQL Spec
have anything to say about this?

greetings, Florian Pflug


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Not ready for 8.3

2007-05-19 Thread Florian G. Pflug

Andrew Dunstan wrote:
What would making a branch actually do for you? The only advantage I can 
see is that it will give you a way of checkpointing your files. As I 
remarked upthread, I occasionally use RCS for that. But mostly I don't 
actually bother. I don't see how you can do it reasonably off a local 
cvs mirror - rsync will just blow away any changes you have checked in 
next time you sync with the master.


I don't think we can make CVS behave like a distributed SCM system, and 
ability to create local branches seems to me one of the fundamental 
points of such systems. If that's what the demand is for, then we should 
look again at moving to something like Mercurial.


I think the great thing about DVCS systems is that not everybody
necessarily needs to use the *same* system. And it doesn't really
matter what the central repository runs on - there are gateways
from/to nearly everything available...

I currently use GIT for my SoC project, and it works quite well -
I can create an arbitrary number of local branches, and syncing
the currently active branch with CVS is achieved by just doing
cg-update pgsql-head.

greetings, Florian Pflug

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: Do we need a TODO? (was Re: [HACKERS] Concurrently updating an updatable view)

2007-06-01 Thread Florian G. Pflug

Simon Riggs wrote:

On Mon, 2007-05-28 at 19:56 -0400, Bruce Momjian wrote:

Added to TODO:

* Fix self-referential UPDATEs seeing inconsistent row versions in
  read-committed mode

  http://archives.postgresql.org/pgsql-hackers/2007-05/msg00507.php



I'm sorry guys but I don't agree this is a TODO item.


Maybe the suggested TODO has too narrow a focus, but I think
that *something* has to be done about this.


IMHO this follows documented behaviour, even if y'all are shocked.

Yes, but documented != sensible && documented != intuitive &&
documented != logical.


If you don't want the example cases to fail you can
- use SERIALIZABLE mode to throw an error if inconsistency is detected
- use SELECT FOR SHARE to lock the rows in the subselect
e.g.

UPDATE foo
SET pkcol = 'x'
WHERE pkcol IN (SELECT pkcol
                FROM foo
                FOR SHARE);

In the case of concurrent UPDATEs the second UPDATE will normally
perform the subSELECT then hang waiting to perform the UPDATE. If you
use FOR SHARE the query will hang on the subSELECT (i.e. slightly
earlier), which makes the second query return zero rows, as some of you
were expecting.


Sure, but with a similar argument you could question the whole
update-in-read-committed-mode logic. After all, you wouldn't need
that logic if you always obtained a share lock on the rows to be updated
*before* you started updating them.


Maybe we need a way of specifying that the non-UPDATE relation should be
locked FOR SHARE in a self-referencing UPDATE? Though that syntax could
seems to look pretty weird from here, so I'd say cover this situation in
a code example and be done.

Also, methinks we should have agreed behaviour before we make something
a TODO item. That would help us uncover this type of thing in more
detail, or at least force TODO to read investigate whether 


Ack. That's why I initially asked if there was consensus on what the
correct behaviour is.

greetings, Florian Pflug


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


[HACKERS] recovery_target_xid crashes on the master

2007-06-04 Thread Florian G. Pflug

Hi

I'm currently working on splitting StartupXLog into smaller
parts, because I need to reuse some of the parts for concurrent
wal recovery (for my GSoC project).

The function recoveryStopsHere in xlog.c checks if we should
stop recovery due to the values of recovery_target_xid and
recovery_target_time. For recovery_target_xid, we stop if
we see a commit or abort record for the given xid.

Now I wonder what happens if an (admittedly rather confused) DBA
uses an xid of a transaction that was aborted because of a
crash of the master as recovery_target_xid. The way I read the
code, postgres will just recover until it reaches the end of
the xlog in that case, because neither a COMMIT nor an ABORT
record for that xid exists in the WAL.
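
Roughly, the check in recoveryStopsHere() amounts to something like this
simplified sketch (names approximate):

/*
 * Recovery only stops when a commit or abort record carrying the target xid
 * is replayed. An xid that never committed or aborted - e.g. one cut short
 * by a crash of the master - never matches, so replay runs to the end of
 * the WAL.
 */
static bool
stops_at_xact_record(uint8 xact_info, TransactionId record_xid,
                     TransactionId target_xid)
{
    if (xact_info != XLOG_XACT_COMMIT && xact_info != XLOG_XACT_ABORT)
        return false;
    return TransactionIdEquals(record_xid, target_xid);
}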

I'm not sure if this is worth fixing - it seems like a rather
contrived corner case - but I thought I'd bring it up...

greetings, Florian Pflug



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] [RFC] GSoC Work on readonly queries done so far

2007-06-06 Thread Florian G. Pflug

Matthew T. O'Connor wrote:

Florian G. Pflug wrote:

Work done so far:
-
 .) Don't start autovacuum and bgwriter. 


Do table stats used by the planner get replicated on a PITR slave?  I 
assume so, but if not, you would need autovac to do analyzes.


Yes - everything that gets wal-logged on the master gets replicated to
the slave. In my design, it isn't possible to do analyze on the slave,
because all datafiles are strictly readonly (well, with the small
exception of hint-bit updates actually).

greetings, Florian Pflug



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] [RFC] GSoC Work on readonly queries done so far

2007-06-06 Thread Florian G. Pflug

Jeff Davis wrote:

On Wed, 2007-06-06 at 16:11 +0200, Florian G. Pflug wrote:

   .) Since the slaves needs to track an Snapshot in shared memory, it cannot
  resize that snapshot to accomodate however many concurrent transactions
  might have been running on the master. My current plan is to detect if
  that global snapshot overflows, and to lock out readonly queries on the
  slave (and therefore remove the need of keeping the snapshot current)
  until the number of active xids on the master has dropped below
  max_connections on the slave. A warning will be written to the postgres
  log that suggest that the DBA increases the max_connections value on
  the slave.


If we did lock the slave while waiting for transactions to complete on
the master, we'd need to document some stronger warnings against idle
transactions so that administrators could notice and correct the
problem.


It's not exactly locking until transactions complete on the master; it's locking
the slave until we reach a position in the wal with fewer
than max_connections concurrent transactions. But yes, I agree, this
will need to be documented.


Are you referring to the size of the xip array being a problem? Would it
help to tie the size of the xip array to max_connections? I understand
that max_connections might be greater on the master, but maybe something
similar? 


That's what I currently do - the xip array on the slave is sized to
hold max_connections entries (Actually, it's max_connections +
max_prepared_xacts I think). The problem occurs exactly if those
values are set too small on the slave - and since shared mem
objects are not resizeable, I don't see how the slave can handle
an xip overflow gracefully other than by not publishing the
information in shared memory as long as it doesn't fit there.
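
To make the sizing concrete, a minimal sketch (ReadOnlySnapshotData and the
function name are made up; add_size()/mul_size() and the globals are the
existing ones):

/* Snapshot published in shared memory by the replay process. */
typedef struct ReadOnlySnapshotData
{
    TransactionId   xmin;
    TransactionId   xmax;
    int             xcnt;       /* number of valid entries in xip[] */
    TransactionId   xip[1];     /* actually sized at shmem creation time */
} ReadOnlySnapshotData;

/*
 * Sized once at postmaster start - shared memory cannot grow later, so the
 * slave's own max_connections + max_prepared_xacts caps what fits.
 */
static Size
ReadOnlySnapshotShmemSize(void)
{
    return add_size(offsetof(ReadOnlySnapshotData, xip),
                    mul_size(sizeof(TransactionId),
                             add_size(MaxConnections, max_prepared_xacts)));
}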

On further thought - maybe locking out transactions isn't even
necessary - they would just continue to see the old global snapshot,
so time wouldn't advance for them until the number of concurrent
transactions decreases again.

greetings, Florian Pflug

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [HACKERS] [RFC] GSoC Work on readonly queries done so far

2007-06-07 Thread Florian G. Pflug

Simon Riggs wrote:

On Wed, 2007-06-06 at 16:11 +0200, Florian G. Pflug wrote:


.) Added a new GUC operational_mode, which can be set to either
readwrite or readonly. If it is set to readwrite (the default),
postgres behaves as usual. All the following changes are only
in effect if operational_mode is set to readonly.


Do we need this? We are already InArchiveRecovery.

If I understand you correctly, you suggest that readonly queries
are always allowed during archive recovery - so upon startup
postgres steps through these states:
  .) Initial recovery (Until we reach a consistent state)
  .) Allow readonly queries
  .) Finish recovery in the background (might mean recovering forever
 on a PITR slave)
  .) Allow readwrite queries

My plan was to have a global switch, which lets you choose between
  .) All queries are readonly (Until the next postmaster restart at least),
 but you get background replay
  .) No background replay, but once replay is done, readwrite queries
     can be executed (Just what PG does now).

The main reason why I invented that global switch operational_mode was
to remove the need to switch between readonly mode and readwrite mode
on the fly.
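
In other words, something like this sketch (the enum is illustrative; only the
GUC name operational_mode is from the mail above):

typedef enum OperationalMode
{
    OPMODE_READWRITE,   /* default: normal read-write operation, as today */
    OPMODE_READONLY     /* background replay runs; only readonly queries allowed */
} OperationalMode;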


.) Created a macro ASSUME_OPMODE_READWRITE that does elog(ERROR)
if postgre is not in readwrite mode. This macro protects the
following functions to make sure that no writes occur in
readonly mode.
  SimpleLruWritePage, SLruPhysicalWritePage
  EndPrepare, FinishPreparedTransaction
  XLogInsert, XLogWrite, ShutdownXLog
  CreateCheckpoint
  MarkBufferDirty.


These are Asserts?

The macro ASSUME_OPMODE_READWRITE just does
if (!OperationalModeReadWrite)
  elog(ERROR, ...)
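
Spelled out, a minimal sketch of that guard (the error message is made up):

#define ASSUME_OPMODE_READWRITE() \
    do { \
        if (!OperationalModeReadWrite) \
            elog(ERROR, "write operation attempted in readonly mode"); \
    } while (0)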


  .) Don't start autovacuum and bgwriter. Instead of bgwriter, bgreplay
 is started, and it takes over that role that bgwriter play in the
 shutdown process.


Autovacuum - understood.

What does bgreplay do? Why not just start bgwriter earlier and disable
some of its other functionality while InRecovery?

See above - it seemed simpler to clearly separate the two roles.


  .) Transactions are assigned a dummy xid ReadOnlyTransactionId, that
 is considered to be later than any other xid.


So you are bumping FirstNormalTransactionId up by one for this?

In fact I changed MaxTransactionId to 0xFFFFFFFE, and set
ReadOnlyTransactionId to 0xFFFFFFFF. Additionally, I changed
IsNormalTransactionId to test not only for >= FirstNormalTransactionId,
but also for <= MaxTransactionId.
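
As code, that amounts to roughly the following sketch (values as described
above; the stock definitions live in access/transam.h):

/* Sketch only - not the stock definitions. */
#define MaxTransactionId        ((TransactionId) 0xFFFFFFFE)
#define ReadOnlyTransactionId   ((TransactionId) 0xFFFFFFFF)   /* later than any real xid */

#define IsNormalTransactionId(xid) \
    ((xid) >= FirstNormalTransactionId && (xid) <= MaxTransactionId)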


You're assuming then that we will freeze replay while we run a query?

No. My plan is to first get to a point where replay freezes while
queries are running, and to then figure out a more intelligent way to do this.
I already have a few ideas on how to do this, but I want to complete the simple
version before I start with that work.



Otherwise doing this will mean the snapshot changes as a query executes.

Why? It's only the xid of the transaction, not its xmin and xmax, that is
set to ReadOnlyTransactionId.


  .) A global ReadOnlySnapshot is maintained in shared memory. This is
 copied into backend local memory by GetReadonlySnapshotData (which
 replaces GetSnapshotData in readonly mode).
   .) Crash recovery is not performed in readonly mode - instead, postgres
  PANICs, and tells the DBA to restart in readwrite mode. Archive
  recovery of course *will* be allowed, but I'm not that far yet.


This is the very heart of the matter. This isn't just a technical issue,
it goes to the heart of the use case for this feature. Can we recover
while running queries?

Yes. My comment applies only to crash recovery - i.e., recovery that happens
*without* a recovery.conf present, after a crash.
It only really matters if you do the following:
  .) Start pg in readwrite mode.
  .) Kill it / It crashes
  .) Restart in readonly mode.

The main difference between crash recovery and recovery from a filesystem-level
backup is the additional information that the backup label gives us in the
second case - more specifically, the minRecoveryLoc that we read from the
backup label. Only with that knowledge is recovering until we reach
a consistent state a well-defined operation. And readonly queries
can only be executed *after* we did this minimal recovery. So if there is
crash recovery to be done, the best we could do is to recover, and then start
in readonly mode. If this is *really* what the DBA wants, he can just start
in readwrite mode first, then cleanly shut PG down, and restart in readonly
mode.

If not, how much time will we spend in replay
mode v query mode? Will we be able to run long running queries *and*
maintain a reasonable time to recover? Is this a mechanism for providing
HA and additional query capacity, or is it just a mechanism for
additional query capacity only? Those are open questions to which I
don't have any answers yet myself.

My goal is to allow replay and queries to run concurrently, at least as
long as only inserts, updates

Re: [HACKERS] [RFC] GSoC Work on readonly queries done so far

2007-06-07 Thread Florian G. Pflug

Heikki Linnakangas wrote:

Florian G. Pflug wrote:

Jeff Davis wrote:

Are you referring to the size of the xip array being a problem? Would it
help to tie the size of the xip array to max_connections? I understand
that max_connections might be greater on the master, but maybe something
similar? 


Thats what I currently do - the xip array on the slave is sized to
hold max_connections entries (Actually, it's max_connections +
max_prepared_xacts I think). The problem occurs exactly if those
values are set too small on the slave - and since shared mem
objects are not resizeable, I don't see how the slave can handle
an xip overflow gracefully other than by not publishing the
information in shared memory as long as it doesn't fit there.


You could store the value of max_connections in the checkpoint xlog 
record, and read it from there in the slave. Though one could still 
change it on the master and restart without restarting the slave as well.


But AFAIK shmem allocation happens before recovery starts... Even if this
was solved, it would only be a partial solution since, as you note, the
master might be restarted while the slave keeps running. So I think it's
better not to add too much complexity, and just tell the DBA to increase
max_connections on the slave, together with a comment in the documentation
never to set max_connections smaller on the slave than on the master.

greetings, Florian Pflug


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] [RFC] GSoC Work on readonly queries done so far

2007-06-07 Thread Florian G. Pflug

Jeff Davis wrote:

On Wed, 2007-06-06 at 22:36 +0100, Simon Riggs wrote:

  .) Transactions are assigned a dummy xid ReadOnlyTransactionId, that
 is considered to be later than any other xid.

So you are bumping FirstNormalTransactionId up by one for this?

You're assuming then that we will freeze replay while we run a query?
Otherwise doing this will mean the snapshot changes as a query executes.


Is it possible to put a normal xmax for the snapshot?

It wouldn't be a real transaction on the slave, and also the master will
use that ID for a real transaction itself. However, I don't see a real
problem on the slave because it would only be used for the purpose of
the snapshot we need at that moment.


My plan is the following:
.) Initially, queries and recovery will run interleaved, but not concurrently.
   For that, an empty snapshot is sufficient, with
   xmin=xid=xmax=ReadOnlyTransactionId (a minimal sketch follows below).
.) Then, I'll work on running them concurrently. The replay process will publish
   a current snapshot in shared memory, using real xmin and xmax values
   it generates by maintaining a list of currently active (as in: running when
   the wal was written on the master) transactions. In that case, only xid
   is set to ReadOnlyTransactionId.
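
A minimal sketch of the empty snapshot from the first step, assuming the
8.2-era SnapshotData fields (the helper name is made up):

static void
InitEmptyReadOnlySnapshot(Snapshot snapshot)
{
    /* xmin == xmax and an empty xip array: no transaction is reported as in progress */
    snapshot->xmin = ReadOnlyTransactionId;
    snapshot->xmax = ReadOnlyTransactionId;
    snapshot->xcnt = 0;
    snapshot->curcid = FirstCommandId;
}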

greetings, Florian Pflug


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] Is this a feature?

2007-06-11 Thread Florian G. Pflug

Joshua D. Drake wrote:

Take the following:

INFO:  analyzing pg_catalog.pg_authid
INFO:  pg_authid: scanned 1 of 1 pages, containing 5 live rows and 0 
dead rows; 5 rows in sample, 5 estimated total rows


The above is completely redundant. Why not just say:

INFO:  pg_authid: scanned 1 of 1 pages, containing 5 live rows and 0 
dead rows; 5 rows in sample, 5 estimated total rows


If the first line is meant to be an indicator, then make the above line 
do this:


INFO: analyzing pg_catalog.pg_authid :

Don't add a new line, and when the next step of information comes up 
append it to the existing line to get:


INFO: analyzing pg_catalog.pg_authid: scanned 1 of 1 pages, containing 
5 live rows and 0 dead rows; 5 rows in sample, 5 estimated total rows


But then the line could only be pushed to the client *after* the analysis
of the table has finished, while with the current output you know what
postgres is currently doing, because you get analyzing ... *before*
the operation starts.

greetings, Florian Pflug



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Selecting a constant question

2007-06-12 Thread Florian G. Pflug

Dann Corbit wrote:

-Original Message-
From: Hannu Krosing [mailto:[EMAIL PROTECTED]

Since libpq function PQfsize returns -2 for all constant character
strings in SQL statements ... What is the proper procedure to determine
the length of a constant character column after query execution but
before fetching the first row of data?

Why not just get the first row and determine the width from it before
you actually use any of the data?


What if the second row is 1000x longer?


That's exactly the point. Consider
select mytext from mytable;

How can PostgreSQL possibly know the maximum length
of the returned values *before* it has scanned the
whole table?

greetings, Florian Pflug


---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] EXPLAIN omits schema?

2007-06-13 Thread Florian G. Pflug

Gregory Stark wrote:

Joshua D. Drake [EMAIL PROTECTED] writes:

I agree. XML seems like a fairly natural fit for this. Just as people should
not try to shoehorn everything into XML, neither should they try to shoehorn
everything into a relational format either.

Now all we need is an XML schema for it ;-)

Well I am not a big fan of XML but it certainly seems applicable in this
case.


I'm not a fan either so perhaps I'm biased, but this seems like a good example
of where it would be an *awful* idea.

Once you have an XML plan what can you do with it? All you can do is parse it
into constituent bits and display it. You cant do any sort of comparison
between plans, aggregate results, search for plans matching constraints, etc.

How would I, with XML output, do something like:

SELECT distinct node.relation 
  FROM plan_table 
 WHERE node.expected_rows > node.actual_rows*2;


or

SELECT node.type, average(node.ms/node.cost)
  FROM plan_table 
 GROUP BY node.type;


I believe that XQuery actually supports such queries. So if postgres
supported XQuery (or does it already? I honestly don't know), writing
such a query wouldn't be that hard I think. The execution probably
won't be super-efficient, but for query plans that seems OK.

greetings, Florian Pflug

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-13 Thread Florian G. Pflug

Heikki Linnakangas wrote:

Jim C. Nasby wrote:

On Thu, Jun 07, 2007 at 10:16:25AM -0400, Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:
Thinking about this whole idea a bit more, it occured to me that the 
current approach to write all, then fsync all is really a historical 
artifact of the fact that we used to use the system-wide sync call 
instead of fsyncs to flush the pages to disk. That might not be the 
best way to do things in the new load-distributed-checkpoint world.

How about interleaving the writes with the fsyncs?

I don't think it's a historical artifact at all: it's a valid reflection
of the fact that we don't know enough about disk layout to do low-level
I/O scheduling.  Issuing more fsyncs than necessary will do little
except guarantee a less-than-optimal scheduling of the writes.


If we extended relations by more than 8k at a time, we would know a lot
more about disk layout, at least on filesystems with a decent amount of
free space.


I doubt it makes that much difference. If there was a significant amount 
of fragmentation, we'd hear more complaints about seq scan performance.


OTOH, extending a relation that uses N pages by something like
min(ceil(N/1024), 1024)) pages might help some filesystems to
avoid fragmentation, and hardly introduce any waste (about 0.1%
in the worst case). So if it's not too hard to do it might
be worthwhile, even if it turns out that most filesystems deal
well with the current allocation pattern.
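
For concreteness, a small sketch of that growth rule (the function is
hypothetical; Min/Max are the usual macros):

/*
 * Grow a relation of nblocks pages by min(ceil(nblocks/1024), 1024) pages:
 * roughly 0.1% of the current size, capped at 1024 pages (8 MB with 8 kB blocks).
 */
static int
extension_chunk_size(int nblocks)
{
    int     chunk = (nblocks + 1023) / 1024;    /* ceil(nblocks / 1024) */

    return Max(1, Min(chunk, 1024));            /* extend by at least one page */
}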

greetings, Florian Pflug

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Worries about delayed-commit semantics

2007-06-22 Thread Florian G. Pflug

PFC wrote:

On Fri, 22 Jun 2007 16:43:00 +0200, Bruce Momjian [EMAIL PROTECTED] wrote:

Simon Riggs wrote:

On Fri, 2007-06-22 at 14:29 +0100, Gregory Stark wrote:
 Joshua D. Drake [EMAIL PROTECTED] writes:
  Tom Lane wrote:
  untrustworthy disk hardware, for instance.  I'd much rather use names
  derived from deferred commit or delayed commit or some such.
 
  Honestly, I prefer these names as well as it seems directly related versus
  transaction guarantee which sounds to be more like us saying, if we turn it 
off
  our transactions are bogus.

That was the intention..., but name change accepted.

 Hm, another possibility: synchronous_commit = off

Ooo, I like that. Any other takers?


Yea, I like that too but I am now realizing that we are not really
deferring or delaying the COMMIT command but rather the recovery of
the commit.  GUC as full_commit_recovery?


commit_waits_for_fsync =

force_yes: makes all commits hard
yes: commits are hard unless specified otherwise [default]
no: commits are soft unless specified otherwise [should 
replace fsync=off use case]
force_no: makes all commits soft (controller with write cache 
emulator)


I think you got the last line backwards - without the fsync() after
a commit, you can't be sure that the data made it into the controller
cache. To be safe you *always* need the fsync() - but it will probably
be much cheaper if your controller doesn't have to actually write to
the disks, but can cache in battery-backed ram instead. Therefore,
if you own such a controller, you probably don't need deferred commits.
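
To make that distinction concrete, a rough sketch of the commit-time decision
(XLogFlush() is the existing call; the pending-LSN bookkeeping is simplified
and locking is omitted):

static XLogRecPtr pending_async_commit_lsn = {0, 0};   /* flushed later in the background */

static void
flush_commit_record(XLogRecPtr commit_lsn, bool synchronous_commit)
{
    if (synchronous_commit)
        XLogFlush(commit_lsn);                  /* don't return before the record is on disk */
    else
        pending_async_commit_lsn = commit_lsn;  /* a later background flush + fsync covers it */
}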

BTW, I like synchronous_commit too - but maybe asynchronous_commit
would be even better, with inverted semantics of course.
Then you'd have asynchronous_commit = off as default.


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Worries about delayed-commit semantics

2007-06-22 Thread Florian G. Pflug

Richard Huxton wrote:

Bruce Momjian wrote:

Tom Lane wrote:

What's wrong with synchronous_commit?  It's accurate and simple.


That is fine too.


My concern would be that it can be read two ways:
1. When you commit, sync (something or other - unspecified)
2. Synchronise commits (to each other? to something else?)*

It's obvious to people on the -hackers list what we're talking about, 
but is it so clear to a newbie, perhaps non-English speaker?


* I can see people thinking this means something like commit_delay.


OTOH, the concept of synchronous vs. asynchronous (function) calls
should be pretty well-known among database programmers and administrators.
And (at least to me), this is really what this is about - the commit
happens asynchronously, at the convenience of the database, and not
the instant that I requested it.

greetings, Florian Pflug


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Still recommending daily vacuum...

2007-07-05 Thread Florian G. Pflug

Michael Paesold wrote:

Alvaro Herrera wrote:

So what you are proposing above amounts to setting scale factor = 0.05.
The threshold is unimportant -- in the case of a big table it matters
not if it's 0 or 1000, it will be almost irrelevant in calculations.  In
the case of small tables, then the table will be vacuumed in almost
every iteration if the threshold is 0, which is fine because the table
is small anyway.  So why not let the threshold be 0 and be done with it?


For very small tables, setting a threshold of 0 could mean a vacuum 
after every single row update (or every other row). I think that is just 
burning cycles. What about a threshold of 10 or 50, to have at least 
some sanity limit? Even though the cost of vacuum of a small table is 
low, it is still not free, IMHO, no?


A bit off-topic (because probably not realistic in an 8.3 timeframe) -
but maybe the threshold should be specified in terms of the expected number of
pages to be freed, instead of specifying a bias for the number of modified
rows as it is done now. Then 1 would probably be a reasonable default, because
a vacuum that won't free at least one page seems to be not really worth
the effort - it won't save any future IO bandwidth.

Just an idea I got while following this thread...
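
A standalone sketch of that idea, with made-up names, just to pin down the
arithmetic:

#include <stdbool.h>

/*
 * Vacuum is deemed worthwhile once the dead tuples are expected to free at
 * least page_threshold pages; 1 would be the suggested default.
 */
static bool
vacuum_frees_enough_pages(double dead_tuples, double tuples_per_page,
                          double page_threshold)
{
    double  expected_pages_freed = dead_tuples / tuples_per_page;

    return expected_pages_freed >= page_threshold;
}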

greetings, Florian Pflug

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [HACKERS] [PATCH] A crash and subsequent recovery of the master can cause the slave to get out-of-sync

2007-07-05 Thread Florian G. Pflug

Tom Lane wrote:

[ back to dealing with this patch, finally ]

Florian G. Pflug [EMAIL PROTECTED] writes:

While creating the patch, I've been thinking if it might be worthwile
to note that we just did recovery in the ShutdownCheckpoint
(or create a new checkpoint type RecoveryCheckpoint). This wouldl allow
for more error checking, because then the slave could check that
safe_restartpoint() is true for all ShutdownCheckpoints that were not
after recovering.


I concur that this is a good idea --- we should have a third checkpoint
record type that shows that a crash recovery occurred.  However, we can
probably only do that for 8.3 and beyond.  If we try to do it in
existing release branches then there's likelihood of trouble due to WAL
incompatibility between master and standby.  While we do advise people
to update their standbys first, I don't think it's worth risking such
problems just to add some more error checking.



Conclusion: we should apply Florian's patch as-is in 8.2, do something
morally equivalent in 8.1 and before, and invent a
CrashRecoveryCheckpoint record type in HEAD.


Sounds good.

Do you want me to code up such patches for 8.1 and 8.3 in the next few days,
or is someone else already working on it?

greetings, Florian Pflug



---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org

